Atkinson-Riani - Robust Diagnostic Regression Analysis
Advisors:
P. Bickel, P. Diggle, S. Fienberg, K. Krickeberg, I. Olkin, N. Wermuth, S. Zeger
Springer
Anthony Atkinson
Department of Statistics
London School of Economics
London WC2A 2AE, UK
[email protected]

Marco Riani
Dipartimento di Economia (Sezione di Statistica)
Università di Parma
43100 Parma, Italy
[email protected]
ISBN 978-1-4612-7027-0
Dla Basi
a Fabia
Preface
data. As our work on the forward search grows, we hope that the material
on the website will grow in a similar manner.
The first chapter of this book contains three examples of the use of the
forward search in regression. We show how single and multiple outliers
can be identified and their effect on parameter estimates determined. The
second chapter gives the theory of regression, including deletion diagnostics,
and describes the forward search and its properties.
Chapter Three returns to regression and analyzes four further examples.
In three of these a better model is obtained if the response is transformed,
perhaps by regression with the logarithm of the response, rather than with
the response itself. The transformation of a response to normality is the
subject of Chapter Four which includes both theory and examples of data
analysis. We use this chapter to illustrate the deleterious effect of outliers
on methods based on deletion of single observations.
Chapter Four ends with an example of transforming both sides of a
regression model. This is one example of the nonlinear models that are the
subject of Chapter Five. The sixth chapter is concerned with generalized
linear models. Our methods are thus extended to the analysis of data from
contingency tables and to binary data.
The theoretical material is complemented by exercises. We give references
to the statistical literature, but believe that our book is reasonably self-
contained. It should serve as a textbook for courses on applied regression
and generalized linear models, even if the emphasis in such courses is not
on the forward search.
This book is concerned with data in which the observations are inde-
pendent and in which the response is univariate. A companion volume,
coauthored with Andrea Cerioli and tentatively called Robust Diagnostic
Data Analysis, is under active preparation. This will cover topics in the
analysis of multivariate data including regression, transformations, princi-
pal components analysis, discriminant analysis, clustering and the analysis
of spatial data.
The writing of this book, and the research on which it is based, has been
both complicated and enriched by the fact that the authors are separated
by half of Europe. Our travel has been supported by the Italian Ministry
for Scientific Research, by the Staff Research Fund of the London School of
Economics and, also at the LSE, by STICERD (The Suntory and Toyota
International Centres for Economics and Related Disciplines). The develop-
ment of S-Plus functions was supported by Doug Martin of MathSoft Inc.
Kjell Konis helped greatly with the programming. We are grateful to our
numerous colleagues for their help in many ways. In England we especially
thank Dr Martin Knott at the London School of Economics, who has been
an unfailingly courteous source of help with both statistics and computing.
In Italy we thank Professor Sergio Zani of the University of Parma for his
insightful comments and continuing support and Dr Aldo Corbellini of the
same university who has devoted time, energy and skill to the creation of
our web site. Luigi Grossi and Fabrizio Laurini read the text with great
care and found some mistakes. We would like to be told about any others.
Anthony Atkinson's visits to Italy have been enriched by the warm hospi-
tality of Giuseppina and Luigi Riani. To all our gratitude and thanks.
Anthony Atkinson
[email protected]
www.lse.ac.uk/experts/

Marco Riani
[email protected]
stat.econ.unipr.it/riani
Contents

3 Regression 43
3.1 Hawkins' Data 43
3.2 Stack Loss Data 50
3.3 Salinity Data 62
3.4 Ozone Data 67
3.5 Exercises 73
3.6 Solutions 74

4 Transformations to Normality 81
4.1 Background 81
4.2 Transformations in Regression 82
4.2.1 Transformation of the Response 82
4.2.2 Graphics for Transformations 86
4.2.3 Transformation of an Explanatory Variable 87
4.3 Wool Data 88
4.4 Poison Data 95
4.5 Modified Poison Data 98
4.6 Doubly Modified Poison Data: An Example of Masking 101
4.7 Multiply Modified Poison Data: More Masking 104
4.7.1 A Diagnostic Analysis 104
4.7.2 A Forward Analysis 106
4.7.3 Other Graphics for Transformations 108
4.8 Ozone Data 110
4.9 Stack Loss Data 111
4.10 Mussels' Muscles: Transformation of the Response 116
4.11 Transforming Both Sides of a Model 121
4.12 Shortleaf Pine 124
4.13 Other Transformations and Further Reading 127
4.14 Exercises 128
4.15 Solutions 129
Bibliography 311
Tables of Data

A.1 Forbes' data on air pressure in the Alps and the boiling point of water 278
A.2 Multiple regression data showing the effect of masking 279
A.3 Wool data: number of cycles to failure of samples of worsted yarn in a 3³ experiment 281
A.4 Hawkins' data simulated to baffle data analysts 282
A.5 Brownlee's stack loss data on the oxidation of ammonia. The response is ten times the percentage of ammonia escaping up a stack, or chimney 285
A.6 Salinity data. Measurements on water in Pamlico Sound, North Carolina 286
A.7 Ozone data: ozone concentration at Upland, CA as a function of eight meteorological variables 287
A.8 Box and Cox poison data. Survival times in 10-hour units of animals in a 3 × 4 factorial experiment. Each cell in the table includes both the observation number and the response 289
A.9 Mussels data from Cook and Weisberg. The response is the mass of the edible portion of the mussel 290
A.10 Shortleaf pine. The response is the volume of the tree, x1 the girth and x2 the height 292
A.11 Radioactivity and the molar concentration of nifedipine 294
A.12 Enzyme kinetics data. The response is the initial velocity of the reaction 295
A.13 Calcium data. Calcium uptake of cells suspended in a solution of radioactive calcium 296
Figure 1.1. Forbes' data: scatter plot of 100 × log(pressure) against boiling point. There is a suggestion of one outlier.
x: boiling point, °F
y: 100 × log(pressure).
The data are plotted in Figure 1.1. A quick glance at the plot shows there is a strong linear relationship between log(pressure) and boiling point. A slightly longer glance reveals that one of the points lies slightly off the line. Linear regression of y on x yields a t value for the regression of 54.45, clear evidence of the significance of the relationship.
Two plots of the least squares residuals e are often used to check fitted models. Figure 1.2(left) shows a plot of residuals against fitted values ŷ. This clearly shows one outlier, observation 12. The normal plot of the studentized residuals, Figure 1.2(right), is an almost straight line from which the large residual for observation 12 is clearly distanced. It is clear that observation 12 is an outlier.
Now that observation 12 has been identified as different, two strategies can be followed. One is to delete it from the data and to refit to the remaining observations.
'"
'"o
.
o
.
-;-
•
•• •• • •
0
0
• •
• • •
135 140 145 -2 -1 o 2
Predicted values Quantiles of standard normal
Figure 1.2. Forbes' data: (left) least squares residuals e against predicted values ŷ, showing that observation 12 is an outlier; (right) normal plot of the studentized residuals, with 90% simulation envelope, confirming that observation 12 is indeed an outlier.
Figure 1.3. Forbes' data: parameter estimates from the forward search: (left) intercept and slope β̂₀ and β̂₁ (the values are virtually unaffected by the outlying observation 12); (right) the value of the estimate of σ² increases dramatically when observation 12 is included in the last step of the search.
Figure 1.3(left) is a plot of the values of the parameter estimates during the forward search.
The values are extremely stable, reflecting the closeness of all observa-
tions to the straight line. The introduction of observation 12 at the end of
the search causes virtually no change in the position of the line. However,
Figure 1.3(right) shows that introduction of observation 12 causes a huge increase in s², the residual mean square estimate of the error variance σ².
The information from these plots about observation 12 confirms and quan-
tifies that from the scatterplot of Figure 1.1: observation 12 is an outlier,
but the observation is at the centre of the data, so that its exclusion or in-
clusion has a small effect on the estimated parameters. The plots also show
that all other observations agree with the overall model. This is also the
conclusion from Figure 1.4 which shows the residuals during the forward
search. Throughout the search, all cases have small residuals, apart from
case 12 which is outlying from all fitted subsets. Even when it is included
in the last step of the search, its residual only decreases slightly.
Our analysis shows that Forbes' data have a simple structure - there
is one outlying observation, 12, that is not influential for the estimates
of the parameters of the linear model. Inclusion of this observation does however cause the estimate s² to increase from 0.0128 to 0.1436 with a corresponding decrease in the t statistic for regression from 180.73 to 54.45.
We now consider a much more complicated example for which the forward
search again illuminates the structure of the data.
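This deletion comparison is easy to reproduce numerically. The sketch below (Python with NumPy, our own illustration on synthetic stand-in data, since the Forbes measurements themselves are in Table A.1) fits a simple regression with and without one planted outlier and reports s² and the t statistic for the slope; all variable names are ours.

# Sketch: effect of deleting a single observation on s^2 and on the t
# statistic for the slope in simple regression. Synthetic stand-in data;
# the real Forbes measurements are in Table A.1.
import numpy as np

rng = np.random.default_rng(0)
n = 17
x = np.linspace(194, 212, n)                 # boiling point, degrees F
y = -42 + 0.9 * x + rng.normal(0, 0.1, n)    # 100 x log(pressure), near-linear
y[11] += 1.3                                 # plant an outlier at case 12

def fit(x, y):
    """Least squares fit; return s^2 and the t statistic for the slope."""
    X = np.column_stack([np.ones_like(x), x])
    beta, rss = np.linalg.lstsq(X, y, rcond=None)[:2]
    n, p = X.shape
    s2 = rss[0] / (n - p)
    se_slope = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return s2, beta[1] / se_slope

print("all cases:  s^2 = %.4f, t = %.2f" % fit(x, y))
keep = np.arange(n) != 11                    # drop case 12 and refit
print("without 12: s^2 = %.4f, t = %.2f" % fit(x[keep], y[keep]))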
Figure 1.4. Forbes' data: forward plot of least squares residuals scaled by the final estimate of σ. Observation 12 is an outlier during the whole of this stable forward search.
have been a point of high leverage. Patterns in the outliers from the linear
regression might also have indicated the need for a transformation of y or for
the inclusion of further terms, for example, a quadratic, in the linear model.
However the forward search provides a powerful method of understanding
the effect of groups of observations on inferences. In the following analysis
we focus on the presence of a group of outliers and their effect on the t
test of one parameter. This complicated structure is clearly revealed by the
forward search.
Table A.2 gives 60 observations on a response y with the values of three
explanatory variables. The scatterplot matrix of the data in Figure 1.5
shows y increasing with each of x1, x2 and x3. The plot of residuals against fitted values, Figure 1.6(left), shows no obvious pattern, unlike that of Figure 1.2. The largest residual is that of case 43. However the normal probability plot of Figure 1.6(right) shows that this residual lies within the simulation envelope. The finer detail of this plot hints at some structure,
but it is not clear what. There is thus no clear indication that the data are
not homogeneous and well behaved.
Evidence of the structure of the data is clearly shown in Figure 1.7,
the scaled squared residuals from the forward search. This fascinating plot
reveals the presence of six masked outliers. The left-hand end of the plot
gives the residuals from the least median of squares estimates found by
sampling 1,000 subsets of size p = 4. From the most extreme residual
downwards, the cases giving rise to the outliers are 9, 30, 31, 38, 47 and
21. When all the data are fitted the largest 10 residuals belong to, in order,
Figure 1.5. Multiple regression data: scatterplot matrix of response and three
variables
Figure 1.6. Multiple regression data: (left) least squares residuals e against fitted values ŷ; (right) normal QQ plot of studentized residuals.
Figure 1.7. Multiple regression data: forward plot of squared least squares residuals scaled by the final estimate of σ. Six masked outliers are evident in the earlier stages of the search, but the largest residual at the end of the search belongs to the nonoutlying observation 43.
cases 43, 51, 2, 47, 31, 9, 38, 29, 7 and 48. The first outlier to be included
in this list produces the fourth largest residual and only four outliers are
included at all.
The assessment of the importance of these outliers can be made by considering the behaviour of the parameter estimates and of the related t statistics. Apart from β₁ all remain positive with t values around 10 or greater during the course of the forward search. We therefore concentrate on the behaviour of t₁, the t statistic for β₁. The values for the last 20 steps of the forward search are plotted in Figure 1.8(left). The general downwards trend is typical of plots of t statistics from the forward search. It is caused by the increasing value of s², Figure 1.8(right), as observations with larger residuals are entered during the search. This figure also indicates the presence of some outliers by the unsmooth behaviour in the last three steps. If the data can be ordered in agreement with the model, the curve is typically monotonic.
An important feature in the interpretation of Figure 1.8(left) is the two upward jumps in the value of the statistic. The first results from the inclusion of observation 43 when m = 54, giving a t value of 2.25, evidence, significant at the 3% level, of a positive value for β₁. Thereafter the outliers enter the subset, with observation 43 leaving when m = 58, as two outliers enter. When m = 59 the value of the statistic has decreased to −1.93, close to evidence for a negative value of the parameter. Reintroduc-
Figure 1.8. Multiple regression data: (left) the t statistic for β₁ during the forward search and (right) the increase in the estimate of σ²; in both figures the jumps in the curves are caused by the inclusion of outliers.
The number of cycles to failure ranges from 90, for the shortest specimen
subject to the most severe conditions, to 3,636 for observation 19 which
comes from the longest specimen subjected to the mildest conditions. In
their analysis Box and Cox (1964) recommend that the data be fitted after
Figure 1.9. Wool data: (left) least squares residuals e against fitted values ŷ; (right) normal QQ plot of studentized residuals.
Figure 1.10. Wool data: forward plot of least squares residuals scaled by the final estimate of σ. The three largest residuals can be directly related to the levels of the factors.
"
~
0
0
~ u;>
0
0
u; U")
u; ~
~ 0 '"
~ '";"
0
"
(/)
U")
0
0
0
'";" 0
U")
0
C)I 0
10 15 20 25 5 10 15 20 25
Subset size m Subset size m
Figure 1.11. Wool data: (left) score test for transformation during the forward search and (right) the increasing value of the estimate s².
Figure 1.12. Wool data: (left) the multiple correlation coefficient R² during the forward search and (right) the values of the parameter estimates.
Other forward plots indicate the way in which the model changes as more cases are introduced. The forward plot of s² in Figure 1.11(right) increases dramatically towards the end, whereas that of R², Figure 1.12(left), decreases to around 0.8 for part of the search, with a final value of 0.729. Further evidence of a relationship that changes with the search is given by the forward plot of estimated coefficients in Figure 1.12(right). Initially the values are stable, but later they start to diverge.
These plots are to be contrasted with those from the forward search for the transformed data when the response is log y. The plot of residuals, Figure 1.13, suggests that perhaps cases 24 and 27 are outlying. But what effect do they have on inferences drawn from the data? Figure 1.14(left), the forward plot of the approximate score statistic for transformation, shows the logarithmic transformation as acceptable; the cases giving rise to large residuals, which enter at the end of the search, have no effect whatsoever on the value of the statistic. The plot of the parameter estimates in Figure 1.14(right) shows how stable the estimates of the parameters are during the forward search. The value of s², Figure 1.15(left), increases towards the end of the search as cases with larger residuals enter. The same pattern, in reverse, is shown by Figure 1.15(right) for R², which decreases in a smooth fashion as the later cases enter the subset. Despite the decrease, the value of R² is now 0.966 for all cases, a great increase from 0.729 for the untransformed data.
In one sense these last four plots are noninformative. If an interesting
diagnostic plot is one that reveals some unexpected or unexplained feature
of the data, these are boring. However they serve as prototypes of the plots
that we expect to see when model and data agree.
Figure 1.13. Transformed wool data: forward plot of least squares residuals for log y scaled by the final estimate of σ. Are observations 24 and 27 outlying?
Figure 1.14. Transformed wool data: (left) score test for transformation during
the forward search, showing that the log transformation is satisfactory and (right)
the extremely stable values of the parameter estimates
Figure 1.15. Transformed wool data: (left) the increasing value of the estimate s² during the forward search and (right) the smoothly decreasing value of the squared multiple correlation coefficient R².
• Does the linear model contain any irrelevant variables? There are
several standard methods for removing variables from models, usually
described as variable selection.
• Are the variables in the right form, or should they be transformed? In our analysis of Forbes' data we regressed log pressure on temperature. Brown (1993, p. 3) observes that the Clausius-Clapeyron equation indicates that the reciprocal of absolute temperature is linearly related to log pressure. Over the range of the data the two models are not easily distinguished. But the difference could become important if the model were to be used for extrapolation. Methods for the choice of a transformation are the subject of Chapter 4.
• Are there sufficient terms in the model, or are extra carriers needed,
for example, quadratic or interaction terms in the variables already
in the model?
The linear model is only part of a regression model. Even if a regression
model is appropriate, there are also a number of questions that need to be
answered about the errors.
• Do the errors have common variance? If not, weighted least squares
may be appropriate, for example, if the observations are averages of
varying numbers of readings.
• Are the errors approximately normal? If not, can they be transformed to approximate normality by the methods described in Chapter 4?
• Are the errors independent? If not, are time series methods appropriate?
If the errors are not normal but, for example, binomial, the regression model
will need replacing by some other member of the family of generalized linear
models. Then, in addition to the choice of the linear predictor, the choice
of a suitable link function also needs to be investigated and scrutinized.
This rich family of models forms the subject of Chapter 6.
Examples of many of these choices arise in successive chapters and we
show how the forward search provides information to guide model building.
Some references to standard procedures are given at the end of Chapter 2.
2
Regression and the Forward Search
The basic algebra of least squares is presented in the first section of the
chapter, followed by that for added variables, which is used in the con-
struction of some score tests for regression models, particularly that for
transformations in Chapter 4. Related results are needed for testing the
goodness of the link in a generalized linear model, Chapter 6. Several of
the quantities monitored during the forward search come from considering
the effect of deletion of an observation. Deletion diagnostics are described
in §2.3 of the chapter and, in §2.4, related to the mean shift outlier model.
Simulation envelopes are described in §2.5 and the forward search is de-
fined and discussed in §2.6. The chapter concludes with some suggestions
for further reading.
E(\epsilon_i \epsilon_j) = \sigma^2 \ (i = j), \qquad E(\epsilon_i \epsilon_j) = 0 \ (i \neq j),    (2.3)

conditions on only the first two moments of the εᵢ. We assume in the regression chapters that, in addition, the errors are normally distributed.
The least squares estimates β̂ of the parameters β minimize the sum of squares

S(\beta) = (y - X\beta)^T (y - X\beta)    (2.4)

and so satisfy the relationship

X^T (y - X\hat{\beta}) = 0,    (2.5)

which yields the normal equations

X^T X \hat{\beta} = X^T y.    (2.6)

The least squares estimates are therefore

\hat{\beta} = (X^T X)^{-1} X^T y,    (2.7)

a linear combination of the observations, which will be normally distributed if the observations are.
These estimates have been found by minimizing the sum of squares S(β). The minimized value is the residual sum of squares

S(\hat{\beta}) = (y - X\hat{\beta})^T (y - X\hat{\beta})
= y^T y - y^T X (X^T X)^{-1} X^T y
= y^T \{I_n - X (X^T X)^{-1} X^T\} y,    (2.8)

where I_n is the n × n identity matrix. When, as in this chapter, the dimension of the matrix is obvious we write I instead of I_n.
The vector of n predictions from the fitted model is

\hat{y} = X\hat{\beta} = X (X^T X)^{-1} X^T y = H y,    (2.9)
where the projection matrix H is called the hat matrix because it "puts the hats on y". It has an important role in the algebra of least squares. For example, let the ith residual be e_i = y_i − ŷ_i, so that the vector of residuals is

e = y - \hat{y} = y - X\hat{\beta} = (I - H) y.    (2.10)
The variance of the ith residual is

\mathrm{var}(e_i) = \sigma^2 (1 - h_i),    (2.13)

where h_i is the ith diagonal element of H, leading to the studentized residual

r_i = \frac{e_i}{s \sqrt{1 - h_i}}.    (2.14)
The quantity h_i also occurs in the variance of the fitted values. From (2.9),

\mathrm{var}(\hat{y}) = \sigma^2 H H^T = \sigma^2 H,    (2.15)

so that the variance of ŷ_i is σ²h_i. The value of h_i is called the leverage of the ith observation. Since

h_i = x_i^T (X^T X)^{-1} x_i \qquad (i = 1, \ldots, n),    (2.16)

it follows that

\sum_{i=1}^{n} h_i = \mathrm{tr}(H) = p,    (2.17)

so that the average value of h_i is p/n, with 0 ≤ h_i ≤ 1 (Exercise 2.3). A large value indicates high leverage. For such points the variance of ŷ_i will, from (2.15), be close to σ², indicating that the fit is mostly determined by the value of y_i. Likewise, from (2.13), the variance of the residual will be small (Exercise 2.8). The effect of this local fit can be that an extra term in a model may be included solely to give a good fit for y_i, with a small residual. Inspection of plots of least squares, or even studentized, residuals may not indicate how influential this observation is for the fitted model.
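These identities are a few lines of linear algebra. The following minimal sketch (Python with NumPy, our own illustration rather than anything from the book) computes (2.7), (2.9) and (2.10) on synthetic data and checks the leverage properties just stated.

# Sketch of (2.7)-(2.10): least squares estimates, hat matrix, residuals
# and leverages, with checks of sum(h_i) = p and 0 <= h_i <= 1.
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.5, n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations (2.6)-(2.7)
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix (2.9)
e = y - H @ y                                  # residuals (2.10)
h = np.diag(H)                                 # leverages h_i

assert np.isclose(h.sum(), p)                  # trace(H) = p, mean h_i = p/n
assert np.all((h >= 0) & (h <= 1))             # Exercise 2.3(a)
print("average leverage p/n =", p / n, " max leverage =", h.max().round(3))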
Calculation of the test statistic also requires the residual mean square s_w², the estimate of σ² from regression on X and w, given by (Exercise 2.6)

(n - p - 1) s_w^2 = y^T y - \hat{\beta}^T X^T y - \hat{\gamma} w^T y
= y^T A y - (y^T A w)^2 / (w^T A w).    (2.29)

The t statistic for testing that γ = 0 is then

t_w = \frac{\hat{\gamma}}{\sqrt{s_w^2 / (w^T A w)}}.    (2.30)

If w is the explanatory variable x_k, (2.30) is an alternative way of writing the usual t test (2.19). But the advantage of (2.30) is that the effect of individual observations on this statistic can be found by using the methods of deletion diagnostics, in which explicit formulae are found for the effect of deletion.
The difference between the observed and predicted values

y_i - \hat{y}_{(i)} = y_i - x_i^T \hat{\beta}_{(i)}

has variance

\sigma^2 \{1 + x_i^T (X_{(i)}^T X_{(i)})^{-1} x_i\}.    (2.35)
To estimate σ² we use the estimate s_{(i)}², which is also independent of y_i. The test for agreement of the observed and predicted values is

r_i^* = \frac{y_i - x_i^T \hat{\beta}_{(i)}}{s_{(i)} \sqrt{1 + x_i^T (X_{(i)}^T X_{(i)})^{-1} x_i}},    (2.36)

which, when the ith observation comes from the same population as the other observations, has a t distribution on (n − p − 1) degrees of freedom. The deletion results given above make it possible to simplify (2.36) to obtain (Exercise 2.7)

r_i^* = \frac{y_i - \hat{y}_i}{s_{(i)} \sqrt{1 - h_i}}.    (2.37)

We call r_i^* the deletion residual. Comparison with (2.14) shows that the deletion residual differs from the studentized residual only in the estimate of σ employed. This is enough to ensure that r_i^* has the unbounded t distribution as opposed to the scaled beta distribution of r_i (§2.1.2). In most applications the difference between the two residuals will be slight, the difference being most acute if there is a single outlier at a point of high leverage.
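Neither residual requires n separate fits: s_(i)² follows from the deletion identity (2.34), (n − p − 1)s_(i)² = (n − p)s² − e_i²/(1 − h_i). A minimal sketch (Python with NumPy; the function name is ours):

# Sketch: studentized residuals (2.14) and deletion residuals (2.37)
# from a single fit, using identity (2.34) for s_(i)^2.
import numpy as np

def residual_diagnostics(X, y):
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    s2 = e @ e / (n - p)                                  # residual mean square
    r = e / np.sqrt(s2 * (1 - h))                         # studentized (2.14)
    s2_i = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)  # deletion s^2 (2.34)
    r_star = e / np.sqrt(s2_i * (1 - h))                  # deletion (2.37)
    return r, r_star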
In the forward search we also consider the effect of adding observations. Given a parameter estimate β̂, a design matrix X and an estimate s² of σ², we add an observation y_i at x_i. Then the test for agreement between the observation and its prediction is

d_i = \frac{y_i - x_i^T \hat{\beta}}{s \sqrt{1 + x_i^T (X^T X)^{-1} x_i}},    (2.38)

which is the usual t test for the agreement of a new datapoint with a previous set of observations. Since, apart from replications, x_i is not a row of X, it may be that x_i^T (X^T X)^{-1} x_i is greater than one. The power of the test is improved by using s² as the estimate of error variance, rather than updating for the new observation.
Cook's distance provides a measure

D_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})^T X^T X (\hat{\beta}_{(i)} - \hat{\beta})}{p s^2}    (2.40)

for detecting influential observations. Large values of D_i indicate observations that are influential on joint inferences about all the linear parameters in the model. A suggestive alternative form for D_i is

D_i = \frac{(\hat{y}_{(i)} - \hat{y})^T (\hat{y}_{(i)} - \hat{y})}{p s^2}    (2.41)
    = \frac{1}{p}\, \frac{h_i}{1 - h_i}\, r_i^2,    (2.42)

with r_i the studentized residual defined in (2.14). This form for D_i shows that influence depends on h_i, so that appreciable outliers with low leverage can be expected to have little effect on the parameter estimates. Our analysis of Forbes' data supports this interpretation. That observation 12 is clearly an outlier is shown by all the residual plots. It is the last observation to be included in the forward search. Figure 1.3(left) shows that there is no detectable change in the slope of the regression line and little change in the value of the intercept when observation 12 is included. For these data h₁₂ = 0.0596 compared with an average value of 2/17 = 0.118.
If one of the observations is an outlier, the estimate s² will be too large, except when the outlying observation is deleted. Atkinson (1982) therefore suggested replacing s² by the deletion estimate s_{(i)}². In addition the square root of D_i was taken, to give a residual-like quantity, and the statistic scaled by the average leverage p/n. The resulting modified Cook statistic is

C_i = \left\{ \frac{n-p}{p}\, \frac{h_i\, e_i^2}{(1 - h_i)^2\, s_{(i)}^2} \right\}^{1/2}.    (2.43)
In the mean shift outlier model the least squares estimate of the shift is

\hat{\phi} = e_i / (1 - h_i).    (2.46)

If the parameter estimate in the mean shift outlier model is denoted β̂_d, it follows from (2.24) that

\hat{\beta}_d = (X^T X)^{-1} X^T y - (X^T X)^{-1} X^T d \hat{\phi},

so that, from (2.46),

\hat{\beta}_d = \hat{\beta} - (X^T X)^{-1} x_i e_i / (1 - h_i).    (2.47)

Comparison of (2.47) with (2.33) shows that β̂_d = β̂_(i), confirming the equivalence of deletion and a single mean shift outlier.
The expression for the change in residual sum of squares comes from (2.29). If the new estimate of σ² is s_d² we have immediately that

(n - p - 1) s_d^2 = y^T A y - (y^T A d)^2 / (d^T A d)
= (n - p) s^2 - e_i^2 / (1 - h_i),    (2.48)

which is (2.34).
The mean shift outlier model likewise provides a simple method of finding the effect of multiple deletion. We first need to extend the results on added variables in §2.2 to the addition of m variables, so that W is an n × m matrix and γ an m × 1 vector of parameters. We then apply these results to the mean shift outlier model

E(Y) = X\beta + D\phi,

with D a matrix that has a single one in each of its columns, which are otherwise zero, and m rows with one nonzero element. These m entries specify the observations that are to have individual parameters or, equivalently, are to be deleted.
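The equivalence is easy to check numerically: appending the dummy column d for case i and fitting by least squares reproduces the coefficients found by actually deleting case i, as in (2.47). A small sketch (Python with NumPy, synthetic data of our own):

# Sketch: the mean shift outlier model E(Y) = X beta + d phi gives the
# same beta as deleting case i, i.e. beta_d = beta_(i).
import numpy as np

rng = np.random.default_rng(2)
n, p, i = 15, 3, 6
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(0, 0.3, n)

d = np.zeros(n); d[i] = 1.0                    # mean shift dummy for case i
Xd = np.column_stack([X, d])
beta_d = np.linalg.lstsq(Xd, y, rcond=None)[0][:p]

keep = np.arange(n) != i                       # delete case i and refit
beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]

assert np.allclose(beta_d, beta_i)             # deletion = mean shift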
where e^2_{[k]}(b) is the kth ordered squared residual. In order to allow for estimation of the parameters of the linear model the median is taken as

med = [(n + p + 1)/2],    (2.50)

the integer part of (n + p + 1)/2.
The parameter estimate satisfying (2.49) has, asymptotically, a break-
down point of 50%. Thus, for large n, almost half the data can be outliers,
or come from some other model and LMS will still provide an unbiased
estimate of the regression line. This is the maximum breakdown that can
be tolerated. For a higher proportion of outliers there is no longer a model
that fits the majority of the data.
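In practice the minimization defining LMS is approximated by sampling elemental subsets of size p, as in Chapter 1 where 1,000 subsets of size p = 4 were used. A minimal sketch of that approximation (Python with NumPy; the function name is ours):

# Sketch: approximate LMS regression by random elemental subsets. Each
# subset of p cases is fitted exactly and scored by the squared residual
# of rank med = [(n + p + 1)/2], as in (2.49) and (2.50).
import numpy as np

def lms_fit(X, y, n_subsets=1000, seed=3):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    med = (n + p + 1) // 2                        # (2.50), 1-based rank
    best_crit, best_b = np.inf, None
    for _ in range(n_subsets):
        idx = rng.choice(n, size=p, replace=False)
        try:
            b = np.linalg.solve(X[idx], y[idx])   # exact fit to p cases
        except np.linalg.LinAlgError:
            continue                              # singular subset: skip
        crit = np.sort((y - X @ b) ** 2)[med - 1] # median ordered sq. residual
        if crit < best_crit:
            best_crit, best_b = crit, b
    return best_b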
The very robust behaviour of the LMS estimate is in stark contrast to that of the least squares estimate β̂ minimizing (2.4), which can be written as

S(b) = \sum_{i=1}^{n} e_i^2(b).    (2.51)
that is, both parameter estimates are unbiased estimators of the same quantity. The same property holds for the sequence of least squares estimates produced in the forward search. Therefore, in the absence of outliers, we expect both parameter estimates and residuals to remain sensibly constant during the forward search. We saw in the examples of Chapter 1 that this was so.
where z_{i_1}^T is the i_1th row of Z, for 1 ≤ i_1, ..., i_p ≤ n and i_j ≠ i_{j'}. Specifically, let S^{(p)} = {i_1, ..., i_p} and let e_{i,S^{(p)}} be the least squares residual for unit i given observations in S^{(p)}. We take as our initial subset the p-tuple S_*^{(p)} which satisfies

e^2_{[\mathrm{med}],S_*^{(p)}} = \min_{S^{(p)}} e^2_{[\mathrm{med}],S^{(p)}},    (2.52)

where e^2_{[k],S^{(p)}} denotes the kth ordered squared residual from the fit to S^{(p)} and med is defined in (2.50).
In most moves from m to m + 1 just one new unit joins the subset. It may also happen that two or more units join S_*^{(m)} as one or more leave. However our experience is that such an event is quite unusual, only occurring when the search includes one unit that belongs to a cluster of outliers. At the next step the remaining outliers in the cluster seem less outlying and so several may be included at once. Of course, several other units then have to leave the subset.
The search that we use avoids, in the first steps, the inclusion of outliers
and provides a natural ordering of the data according to the specified null
model. Note that in this approach we use a highly robust method and at
the same time least squares (that is, fully efficient) estimators. The zero
breakdown point of least squares estimators, in the context of the forward
search, does not turn out to be disadvantageous. The introduction of atyp-
ical (influential) observations is signaled by sharp changes in the curves
that monitor parameter estimates, t tests, or any other statistic at every
step. In this context, the robustness of the method does not derive from
the choice of a particular estimator with a high breakdown point, but from
the progressive inclusion of the units into a subset which, in the first steps,
is outlier free. As a bonus of the suggested procedure, the observations can
be naturally ordered according to the specified null model and it is possible
to know how many of them are compatible with a particular specification.
Furthermore, the suggested approach enables us to analyze the inferential
effect of the atypical units (outliers) on the results of statistical analyses.
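A compact rendering of the search follows (Python with NumPy; our own simplified sketch, which ignores refinements and assumes the initial subset, for instance the best elemental subset from the LMS sketch above, is supplied). At each step it refits by least squares, takes the m + 1 cases with smallest squared residuals as the next subset, and monitors s² together with the minimum deletion residual among cases outside the subset (compare (2.58)).

# Sketch of the forward search: grow the fitting subset from a robust
# start, ordering all cases by their squared residuals at each step.
import numpy as np

def forward_search(X, y, start):            # start: indices of p initial cases
    n, p = X.shape
    subset = np.asarray(start)
    s2_path, min_del_path = [], []
    for m in range(p, n):
        XtX_inv = np.linalg.inv(X[subset].T @ X[subset])
        b = XtX_inv @ X[subset].T @ y[subset]
        e = y - X @ b                       # residuals for ALL n cases
        if m > p:                           # sigma^2 estimable once m > p
            s2 = np.sum(e[subset] ** 2) / (m - p)
            s2_path.append(s2)
            out = np.setdiff1d(np.arange(n), subset)
            h_out = np.einsum('ij,jk,ik->i', X[out], XtX_inv, X[out])
            min_del_path.append(np.min(np.abs(e[out]) /
                                       np.sqrt(s2 * (1 + h_out))))
        subset = np.argsort(e ** 2)[:m + 1] # next subset: m + 1 best cases
    return np.array(s2_path), np.array(min_del_path)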
Remark 1: The method is not sensitive to the method used to select an initial subset, provided unmasked outliers are not included at the start. For example, the least median of squares criterion (2.49) for regression can be replaced by that of least trimmed squares (LTS). This criterion provides estimators with better properties than LMS estimators; they are found by minimizing the sum of the smallest h squared residuals

S_h(b) = \sum_{i=1}^{h} e^2_{[i]}(b),    (2.54)

for some h with [(n + p + 1)/2] ≤ h < n. The rate of convergence of LTS estimates is n^{-1/2} as opposed to n^{-1/3} for LMS. But, for datasets of the size
h_{i,S_*^{(m)}} = x_i^T \{X_{S_*^{(m)}}^T X_{S_*^{(m)}}\}^{-1} x_i, \qquad m = p, \ldots, n.    (2.56)

At the start of the search we have only p observations, each of which has leverage one. The leverages decrease thereafter. An example of such behaviour is in Figure 3.11, which shows a forward plot of the leverages for a four-parameter model for Brownlee's stack loss data.
The forward version of the modified Cook distance (2.43) can, from (2.56), be calculated as

C_{mi} = \left\{ \frac{m-p}{p}\, \frac{h_{i,S_*^{(m)}}}{(1 - h_{i,S_*^{(m)}})^2}\, \frac{e_{i,S_*^{(m)}}^2}{s_{S_*^{(m-1)}}^2} \right\}^{1/2}, \qquad m = p+1, \ldots, n-1,    (2.57)

and the minimum deletion residual among the cases not in the subset as

r^*_{[m+1]} = \min_{i \notin S_*^{(m)}} \frac{|e_{i,S_*^{(m)}}|}{s_{S_*^{(m)}} \sqrt{1 + h_{i,S_*^{(m)}}}}, \qquad m = p+1, \ldots, n-1.    (2.58)
Both indices run from p + 1 since this number of observations is the minimum allowing estimation of σ². If one or more atypical observations are present in the data, the plot of r*_[m+1] against m will show a peak in the step prior to the inclusion of the first outlier. On the other hand, the plot
that monitors r_[m] shows a sharp increase when the first outlier joins S_*^{(m)}. Both plots may show a subsequent decrease, due to the effect of masking, as further outliers enter the subset. Examples of these forward plots of residuals are in Figure 3.6, with a forward plot of the modified Cook distance in Figure 3.5.
2.8 Exercises
Exercise 2.1 Show that the matrix H = X(X^T X)^{-1} X^T is (a) symmetric and (b) idempotent (§2.1).
Exercise 2.2 Show that if a matrix H is idempotent, I − H is also idempotent (§2.1).
Exercise 2.3 Show that (§2.1):
(a) 0 ≤ h_i ≤ 1;
(b) −0.5 ≤ h_{ij} ≤ 0.5 for all j ≠ i, where h_i and h_{ij} are, respectively, the ith diagonal element and the ijth element of the hat matrix H.
Exercise 2.4 Show that in the case of simple regression of y on a constant term and a single explanatory variable x the ith diagonal element of H is equal to (§2.1):

h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{t=1}^{n}(x_t - \bar{x})^2}, \qquad i = 1, \ldots, n.
(b) h_i is nonincreasing in n;
(c) if m observations (i_1, ..., i_m) are equal,

h_{i_1(i_2, \ldots, i_m)} = \frac{h_{i_m}}{1 - (m-1) h_{i_m}}.    (2.61)
Exercise 2.8 Show that the quantity h_i/(1 − h_i), the ratio of the variance of the ith predicted value (var(ŷ_i) = σ²h_i) to the variance of the ith ordinary residual (var(e_i) = σ²(1 − h_i)), can be interpreted as the ratio of the part of ŷ_i due to y_i to the part due to the predicted value x_i^T β̂_(i); that is, show that (§2.3):
2.9 Solutions
Exercise 2.1
(a) H^T = \{X(X^T X)^{-1} X^T\}^T = X(X^T X)^{-1} X^T = H.
(b) HH = X(X^T X)^{-1} X^T X(X^T X)^{-1} X^T = X(X^T X)^{-1} X^T = H.
Exercise 2.2
(I − H)(I − H) = I + H² − 2H = I + H − 2H = I − H.
Exercise 2.3
(a) The ith diagonal element of H can be written as

h_i = \sum_{j=1}^{n} h_{ij}^2 = h_i^2 + \sum_{j \neq i} h_{ij}^2,

from which it follows that 0 ≤ h_i ≤ 1 for all i.
(b)

h_i = h_i^2 + h_{ij}^2 + \sum_{k \neq i,j} h_{ik}^2,    (2.63)

from which it follows that h_{ij}^2 ≤ h_i(1 − h_i). Since 0 ≤ h_i ≤ 1, it must be that −0.5 ≤ h_{ij} ≤ 0.5.
Exercise 2.4
In the case of simple regression, the matrix X has the structure

X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix},    (2.64)

so that

X^T X = \begin{pmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{pmatrix}, \qquad
(X^T X)^{-1} = \frac{1}{n \sum_{i=1}^n (x_i - \bar{x})^2} \begin{pmatrix} \sum_{i=1}^n x_i^2 & -\sum_{i=1}^n x_i \\ -\sum_{i=1}^n x_i & n \end{pmatrix},

and hence

h_i = (1 \;\; x_i)\,(X^T X)^{-1} \begin{pmatrix} 1 \\ x_i \end{pmatrix} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{t=1}^n (x_t - \bar{x})^2}.
Thus in simple regression hi will be large if Xi is far removed from the bulk
of other points in the data.
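This closed form is easily checked against the diagonal of the hat matrix; a quick numerical sketch (Python with NumPy):

# Check: hat-matrix diagonal equals 1/n + (x_i - xbar)^2 / S_xx in
# simple regression.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=12)
X = np.column_stack([np.ones_like(x), x])
h_hat = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
h_formula = 1 / len(x) + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
assert np.allclose(h_hat, h_formula)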
Exercise 2.5
(a) We start from identity (2.32). Given that the second term on the right-hand side of equation (2.66) is positive, we have that h_{j(i)} ≥ h_j.
(c) If the i_1th and the i_2th row of X are identical, equation (2.66) reduces to (2.67) and hence to (2.68). In order to prove that if m observations are equal h_{i_r} ≤ 1/m, r = 1, ..., m, it is enough to notice that equation (2.69) is monotonically increasing in [0, 1/m] and is exactly equal to 1 when h_{i_m} = 1/m.
Exercise 2.6

(n - p - 1) s_w^2 = (y - X\hat{\beta} - w\hat{\gamma})^T (y - X\hat{\beta} - w\hat{\gamma})
= y^T y - y^T X\hat{\beta} - y^T w\hat{\gamma} - \hat{\beta}^T X^T y + \hat{\beta}^T X^T X\hat{\beta} + \hat{\beta}^T X^T w\hat{\gamma} + \hat{\gamma} w^T X\hat{\beta} - \hat{\gamma} w^T y + \hat{\gamma}^2 w^T w
= y^T y - y^T X\hat{\beta} - y^T w\hat{\gamma} + \hat{\beta}^T(-X^T y + X^T X\hat{\beta} + X^T w\hat{\gamma}) + \hat{\gamma}(w^T X\hat{\beta} - w^T y + \hat{\gamma} w^T w),

where the term multiplying β̂^T is zero by equation (2.22) and the term multiplying γ̂ is zero by equation (2.23), so that

(n - p - 1) s_w^2 = y^T y - \hat{\beta}^T X^T y - \hat{\gamma} w^T y.

Now, using expressions (2.24) and (2.25),

(n - p - 1) s_w^2 = y^T y - \{y^T X (X^T X)^{-1} - \hat{\gamma} w^T X (X^T X)^{-1}\} X^T y - \frac{w^T A y}{w^T A w} w^T y
= y^T (I_n - H) y + \frac{w^T A y\, w^T H y - w^T A y\, w^T y}{w^T A w}
= y^T A y - \frac{w^T A y\, w^T (I_n - H) y}{w^T A w}
= y^T A y - \frac{(w^T A y)^2}{w^T A w}.
Exercise 2.7
From equation (2.33) we have:

y_i - x_i^T \hat{\beta}_{(i)} = y_i - \hat{y}_i + \frac{h_i}{1 - h_i} e_i = e_i + \frac{h_i}{1 - h_i} e_i = \frac{e_i}{1 - h_i}.

Using equation (2.32),

1 + x_i^T (X_{(i)}^T X_{(i)})^{-1} x_i = 1 + h_i + \frac{h_i^2}{1 - h_i} = \frac{1}{1 - h_i},

so that substitution in (2.36) gives

r_i^* = \frac{e_i/(1 - h_i)}{s_{(i)}/\sqrt{1 - h_i}} = \frac{y_i - \hat{y}_i}{s_{(i)} \sqrt{1 - h_i}}.
Exercise 2.8
Using equation (2.33) we can write:

\hat{y}_i = x_i^T \hat{\beta} = x_i^T \hat{\beta}_{(i)} + \frac{h_i}{1 - h_i} e_i = x_i^T \hat{\beta}_{(i)} + h_i (y_i - x_i^T \hat{\beta}_{(i)}) = (1 - h_i) x_i^T \hat{\beta}_{(i)} + h_i y_i.
Exercise 2.9
From equation (2.34) we can write:

(n - p - 1) s_{(i)}^2 = (n - p) s^2 - s^2 r_i^2.

Rearranging we obtain:

\frac{s}{s_{(i)}} = \sqrt{\frac{n - p - 1}{n - p - r_i^2}}.

Using this result and the identity in equation (2.61):

r_i^{*2} = \frac{(n - p - 1) r_i^2}{n - p - r_i^2}.

If we recall that r_i^* has a t distribution on (n − p − 1) degrees of freedom the result follows immediately.
Exercise 2.10
This result is obtained much more easily using the mean shift outlier model.
Exercise 2.11
We must show that the product of (A − UV^T) with the right-hand side of (2.31) gives the identity matrix:

(A - UV^T)\{A^{-1} + A^{-1} U (I_m - V^T A^{-1} U)^{-1} V^T A^{-1}\}
= I_p + U (I_m - V^T A^{-1} U)^{-1} V^T A^{-1} - U V^T A^{-1} - U V^T A^{-1} U (I_m - V^T A^{-1} U)^{-1} V^T A^{-1}
= I_p - U V^T A^{-1} + U (I_m - V^T A^{-1} U)(I_m - V^T A^{-1} U)^{-1} V^T A^{-1}
= I_p - U V^T A^{-1} + U V^T A^{-1} = I_p.
Exercise 2.12
We have:

\hat{\beta}_{(i)} = (X_{(i)}^T X_{(i)})^{-1} (X^T y - x_i y_i).

Using equation (2.32),

\hat{\beta}_{(i)} = \{(X^T X)^{-1} + (X^T X)^{-1} x_i x_i^T (X^T X)^{-1}/(1 - h_i)\} (X^T y - x_i y_i)
= \hat{\beta} + (X^T X)^{-1} x_i \frac{\hat{y}_i - (1 - h_i) y_i - h_i y_i}{1 - h_i}
= \hat{\beta} + (X^T X)^{-1} x_i \frac{\hat{y}_i - y_i + h_i y_i - h_i y_i}{1 - h_i}
= \hat{\beta} - (X^T X)^{-1} x_i \frac{e_i}{1 - h_i}.
Exercise 2.13
Let H_0: E(Y) = Xβ and H_1: E(Y) = Xβ + dφ. Under the normality assumption the F statistic for testing H_0 versus H_1 is

F = \frac{\{SS(e_0) - SS(e_1)\}/1}{SS(e_1)/(n - p - 1)},    (2.70)

where SS(e_j) is the residual sum of squares under the hypothesis H_j, j = {0, 1}. Using the identity in equation (2.34) we find that
3
Regression

In this chapter we exemplify some of the theory of Chapter 2 for four sets
of data. We start with some synthetic data that were designed to con-
tain masked outliers and so provide difficulties for least squares diagnostics
based on backwards deletion. We show that the data do indeed present
such problems, but that our procedure finds the hidden structure.
The analysis of such data sets is much clearer than that of real data
where, often, ambiguities remain even after the most careful analysis. Our
other examples in this chapter are of increasing complexity and of increas-
ing number of variables. One complication is the choice of a suitable linear
model and the relationship between a misspecified model and our diag-
nostics. A second complication is that the last three examples also involve
the transformation of the response, combined with the choice of the linear
model. We choose the transformation in an informal manner. Our more
structured examples of choice of a transformation are left to Chapter 4.
3.1 Hawkins' Data

Figure 3.1. Hawkins' data: scatter plot matrix. The only apparent structure involving the response is the relationship between y and x8.
Figure 3.2. Hawkins' data: normal plots of residuals. The least squares residuals
(left) seem to indicate six outliers and a nonnormal structure; there are 86 zero
LMS residuals (right)
The normal plot of least squares residuals in Figure 3.2(left) shows a curiously banded symmetrical pattern, with six apparent outliers. The data would seem not to be normal, but it is hard to know what interpretation to put on this structure. For some kinds of data such patterns indicate that the wrong class of models has been fitted. One of the generalized linear models with nonnormal errors described in Chapter 6 might be appropriate. Here we continue with regression and look at the normal plot of LMS residuals. Figure 3.2(right) shows (on counting) that 86 residuals are virtually zero, with three groups of almost symmetrical outliers from the model. Our forward search provides a transition between these two figures. More helpfully, it enables us to monitor changes in residuals and parameter estimates and their significance as the apparent outliers are included in the subset used for fitting.
Figure 3.3 is the forward plot of squared residuals, scaled as described in §2.6.4 by the final estimate of σ². This shows three groups of residuals, the fourth group, the 86 smallest, being so small as to lie on the y axis of the plot. From m = 87 onwards, the 24 cases with the next smallest residuals in Figure 3.2(right) enter the subset. The growth in the subset causes changes in the other two groups of residuals; in particular, the most extreme observations become less so. After m = 110, the second group of outliers begins to enter the subset and all residuals decrease. By the end of the process, the six largest outliers, cases 19, 21, 46, 73, 94 and 111 still form a distinct group, arguably more marked in Figure 3.3 than in Figure 3.2(left), which is a normal plot of the residuals when m = n. At the end of the search, the other groups of outliers are mixed together and masked.
The plot of residuals from the forward search reveals the structure of the data. It is however not clear how the groups of outlying observations change the fitted model. This is revealed by Figure 3.4(left), which shows how the estimated coefficients change during the forward search. The values are constant until m = 87, after which they mostly decline to zero, apart from
Figure 3.3. Hawkins' data: forward plot of scaled squared residuals. The three
groups of outliers are clearly shown, as is the effect of masking of some outliers
at the end of the search
the estimate of β₀, which oscillates wildly. Such changes in parameter estimates, very different from those for Figure 1.14(right) for the transformed wool data, are an indication of outliers or of a misspecified model.
The t statistics for the parameters are in Figure 3.4(right). Initially, when the error is close to zero, the statistics are very large and off the scale of the plot. As groups of observations with larger variance are introduced, the statistics decrease until, at the end of the search, there is only one significant term, that for regression on x8, which was suggested by the scatterplots of Figure 3.1.
Several other plots also serve to show that there are three groups of
outliers. Three are similar in appearance. Figure 3.5 shows the modified
Cook distances (2.57), which reflect the changes in parameter estimates as
the forward search progresses. The three peaks show the effect of the large
changes due to the initial inclusion of each group of observations. After
a few observations in the group have been included, further changes in
the parameter estimates become relatively unimportant and so the values
of the distances again become small. Figures 3.6(top) and (bottom) show similar patterns, but in plots of the residuals. Figure 3.6(top) shows the maximum studentized residual in the subset used for fitting (2.59). This
will be large when one or two outliers are included in the subset. Finally in
this group of three plots, Figure 3.6(bottom) shows the minimum deletion
residual at each stage (2.58), where the minimization is over those cases
not yet in the subset. The three peaks in the figure show the distance of the
nearest observation from the model that has been fitted so far. The first
peak is the largest because the variance of the first 86 cases is so small.
The declining shape of each peak is caused by the increase in s² as outliers
Figure 3.4. Hawkins' data: forward plots of (left) parameter estimates and (right) the t statistics. The outliers have an extreme effect on the parameter estimates.
Figure 3.5. Hawkins' data: forward plot of modified Cook's distance. The first
large peak is caused by the introduction of the first outlier when m = 87. The
other two peaks indicate the first inclusion of outliers from the other two groups.
The decay of the curves is due to masking
Figure 3.6. Hawkins' data: forward plot of (top) the maximum studentized residual in the subset used for fitting (2.59) and (bottom) the minimum deletion residual outside the subset (2.58). The effects of the three groups of outliers are evident.
are introduced during the search, which reduces the size of the deletion
residuals. At the end of the peaks there is nothing remarkable about the
values of the deletion residuals.
In this example the two plots of the residuals and that of the modified
Cook distances are very similar in structure. In other examples, not only
may the plot of the Cook distances be different from that of the residual
plots, but the two residual plots may also be distinct. These plots are one
way in which the forward search reveals the masked nature of the outliers.
Another is from forward residual plots such as Figure 3.3.
Another, very different, plot also serves to show the groups of outliers. Figure 3.7 gives the behaviour of the estimate of σ² during the forward search. Initially it is close to zero. Then the inclusion of each set of outliers causes an increase in the estimate. The resulting plot is virtually in the form of four line segments, one for each group of observations. The monotone form of the plot indicates that the observations have been correctly sorted by the forward search. Although, in this example, this plot produces no new information about the grouping of outliers, in general we find such plots unexpectedly helpful in determining the adequacy of models - jumps or regions of small increase of s² can usually be directly interpreted as evidence of a particular failure of the model. As we show, forward plots of R² can likewise be surprisingly helpful.
Our analysis has led to the detection of four separate groups in the data.
However division of the observations into these groups does not reveal any
new structure. Figure 3.8 is the scatterplot matrix for the 86 observations
forming the group with smallest variance. The structure is not markedly
Figure 3.7. Hawkins' data: forward plot of s². Initially the estimate is virtually zero. The four line segments comprising the plot correspond to the four groups of observations.
Figure 3.9. Hawkins' data: scatter plot of y against x8. The filled squares are the six largest outliers, the small crosses the 86 observations that first enter the forward search.
different from that of Figure 3.1. Finally, in Figure 3.9, we repeat one panel of the scatterplot matrix of Figure 3.1, with the six largest outliers highlighted. It can be seen that their effect on the relationship of y with x8 is likely to be mainly on the residual variance, rather than on the slope of the fitted line. The plot of the t statistics in Figure 3.4(right) quantifies this impression.
The clear nature of the outlier structure of these data is in sharp contrast
to that of the remaining examples of this chapter, where problems of which
terms to include in the linear model, and whether to transform the response,
are intertwined with those of the detection of outliers and of influential
observations.
3.2 Stack Loss Data
Figure 3.10. Stack loss data: first-order model, forward plot of scaled squared residuals. Observations 1, 3, 4 and 21 have large residuals for most of the search.
Figure 3.11. Stack loss data: first-order model, forward plot of leverage. As well
as observations 1, 2 and 21, observation 17 has high leverage
Figure 3.12. Stack loss data: first-order model, t values for the coefficients. From m = 7 onwards, that for β₃ is never significant.
Figure 3.13. Stack loss data: first-order model in x1 and x2, forward plot of leverage. The highest curve for most of the search is for observation 12.
Figure 3.14. Stack loss data: first-order model in x1 and x2, forward plot of R². The local maximum at m = 17 indicates disagreement between model and data.
Figure 3.15. Stack loss data: second-order model with terms in x1, x2, x1x2 and x1². Forward plot of scaled squared residuals. Observations 1, 2, 3 and 4 initially have large residuals, but there is appreciable masking at the end of the search. Observation 21 has a small residual throughout.
Figure 3.16. Stack loss data: second-order model, forward plot of leverages.
Observation 21 always has a leverage close to one
"
N
"
~
0
N
'"
1ii ')'
~
~
0
"
Cf) 't
'"
"?
10 15 20
Subset size m
Figure 3.17. Stack loss data: second-order model, score test for transformation.
The evidence for a transformation depends on only the last four observations to
enter
'":J
(ij
u
"in
u
~
0
')'
i~~;:::~i!jjj
13 I
"
(ij
I
I
"
Cf)
I
I
't I
I
I
I
'" 21-----------------/-',
----- "..._--...1
5 10 15 20
Subset size m
Figure 3.18. Stack loss data: second-order model with log y, forward plot of scaled residuals, which are large for observations 4 and 21.
Figure 3.19. Stack loss data: second-order model with log y, forward plot of t statistics. The bands define the 1% region. The evidence for the second-order terms (x1² and x1x2) comes from the last two observations to enter, 4 and 21.
Figure 3.20. Stack loss data: second-order model with log y, forward plot of
modified Cook statistic. Confirmation of the importance of observations 4 and 21
Figure 3.21. Stack loss data: first-order model in x1 and x2, with response log y. Forward plot of R². The local maximum at m = 18 is evidence that all is still not well.
'"
4 - - --- ---- ---- - ---- --- - - -- - - - - - - -- --- - --
'"::>
fti
'0
'0;
~ 0
~~~~~-~~~~-~~~~~-~~~~~~~::~-~~:
:::: - $ *--------- ~~. _..: . . ';"'4,:
----------------------
'0
.91 -- --
rl
C/l 1----------------_________ - ---
21___ _______
~/
-,,(
/...---
'l' 2
, ,..... /
./"
/
...... _ - - . " . /
5 10 15 20
Subset size m
Figure 3.22. Stack loss data: first-order model in x1 and x2, with response log y. Forward plot of scaled residuals. There is still some change towards the end of the search.
Figure 3.23. Stack loss data: first-order model and log y, forward plot of score test
for transformations. Deletion of observations 4 and 21 would lead to rejection of
the log transformation
Figure 3.24. Stack loss data: first-order model in x1 and x2, and √y, forward plot of the score test for transformation. The square root transformation is acceptable throughout the search.
Figure 3.25. Stack loss data: first-order model and √y, forward plot of the t tests for the parameters. All are now significantly different from zero.
Figure 3.26. Stack loss data: first-order model and √y, forward plot of R². The data and model now agree.
Figure 3.27. Stack loss data: first-order model and √y, forward plot of scaled residuals. This stable plot suggests that observations 4 and 21 may be outliers, but that they are not influential for the fitted model.
3.3 Salinity Data
Figure 3.28. Salinity data: forward plot of scaled residuals. Observation 16 has a
very large residual for most of the search
Figure 3.29. Salinity data: forward plot of leverages. Observation 16 is the last to enter the search, with a high leverage.
Figure 3.30. Salinity data: scatterplot matrix. The value of x3 for observation 16 seems suspect.
Figure 3.31. Corrected salinity data: forward plot of scaled residuals. Observation 16 is not evident.
Figure 3.32. Corrected salinity data: (left) forward plot of t statistics; β₂ is of marginal significance; (right) scatter plot of salinity y against lagged salinity, x1.
Figure 3.33. Corrected salinity data: model in x1 and x3: forward plot of scaled residuals.
If we drop x2 and refit we obtain a model with two terms that are highly significant at the end of the search: t₁ = 10.17 and t₃ = −5.30. The plot of forward residuals in Figure 3.33 is appreciably more stable than the plot of Figure 3.31. The three largest absolute residuals are for observations 9, 15 and 17. It is interesting to see where these observations are on a scatterplot of y against x1. As Figure 3.32(right) shows, they lie on the outside of a
band of observations with a clear relationship between the two variables.
Whether fitting is by least median of squares, as at the beginning of the
forward search, or by least squares, as at the end, they will always have
large residuals. The stability in the forward plot of residuals in Figure 3.33
confirms this interpretation.
Finally we consider transformation of the data. Since the response, salinity, is a nonnegative quantity, a transformation is physically plausible. For the three-variable model the approximate score test for transformation is -1.61. As Atkinson (1985, p. 122) shows, this value depends critically on observation 3. If it is deleted the value is -2.50, evidence of the need for a transformation. But if observation 5 is then deleted, the statistic becomes a nonsignificant -1.148. For the model with just x1 and x3 there is no evidence of the need for a transformation. For all observations the score statistic equals -1.465. When observation 3 is deleted it is -1.77. There are two reasons for not pursuing this analysis further.

One reason is that observations 3 and 5 are the two smallest, and so can be expected to be informative about transformations. The other, more important, reason is that the values of x1 are the values of y lagged by one period. A sensible approach is then to consider joint transformation of y and x1 with the same parameter. We give an example of a joint transformation of the response and an explanatory variable for data on mussels in §4.10.
Figure 3.35. Ozone data: (left) forward plot of modified Cook distances; (right) score test for transformation of the response.
Figure 3.36. Logged ozone data: studentized residuals against day of the year. There is a clear upward time trend.
surprising feature of the fitted model is that none of the t tests for the coefficients are significant at the end of the search, the most extreme value being -1.32, with an R² value of 0.430. One reason for this seemingly poor fit may, of course, be that there is no relationship between ozone concentration and the eight measured variables. Another may be that some of the variables are highly correlated, so that the coefficients are poorly determined, with large variances and correspondingly small t values. There is some evidence for this in the value of R², which is not approximately zero. Another, not exclusive, possibility is that there is some systematic misspecification of the model. In fact, the plot of the score statistic for transformation of the response, Figure 3.35(right), indicates that, after half the cases have been included in the forward search, there is evidence of the need for a transformation. The significance of the statistic grows steadily, there again being no evidence of any especially influential observations. There is thus at least one systematic failure of the model.
As we see in Chapter 4, an appropriate transformation is found by taking
log y as the response. We repeat the regression and check once more for
outliers and influential observations. The forward search still does not reveal
any strange features. The score statistic for the transformation lies within
the bounds ±2.58 throughout, although it tends to be in the lower half of
the region. However the data are in time order, so it is sensible to check
whether there is a misspecification due to the presence of a time trend.
Figure 3.36 shows the residuals against observation number. There is a clear
upward trend, so we include a term in observation number in the model.
This new term has a t value of 6.60 and R² = 0.696. The reduction in the
residual sum of squares caused by this new term increases the significance
of the t tests for the other terms. However, as the results in Table 3.1 show, only that for x5 is significant at the 5% level.

Table 3.1. Ozone data: t statistics of the coefficients and values of R² for the seven models

Model number      1       2       3       4       5       6       7
Response          y       y     log y   log y   log y   log y   log y
Constant       -0.08   -1.90   -2.64   -2.69   -2.80   -4.22   -4.32
Time              -      6.19    6.60    6.64    6.81    7.25    7.08
x1              1.29    0.68    1.03    0.98    1.00      -       -
x2             -0.77   -0.96   -1.72   -2.50   -2.77   -2.88   -3.45
x3              1.23   -0.74   -0.80   -0.71      -       -       -
x4             -0.90   -2.06   -1.69   -1.69   -1.57   -1.64      -
x5              0.06    1.80    2.80    2.96    3.10    5.01    5.06
x6              1.00    2.12    1.78    1.83    1.75    1.71    2.13
x7              0.80    0.26   -0.37      -       -       -       -
x8             -1.32   -1.30   -1.98   -2.05   -2.08   -2.05   -2.19
R²             0.430   0.632   0.696   0.696   0.693   0.689   0.678
Since the data appear well behaved, we now use a standard backwards model-building procedure to find a sensible linear model. Once we have found a model, we then check it using the forward search. Backwards elimination, using the values of the t statistics from refitting each time a variable is removed, leads to successive elimination of x7, x3, x1 and x4. The final model thus contains an intercept, the time trend, and x2, x5, x6 and x8, all terms being significant at the 5% level. The value of R² is 0.678, hardly decreased by the omission of four variables. The details of the t statistics are in Table 3.1.
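The backwards procedure is easy to automate. The sketch below is ours, not the authors' code: it assumes the candidate terms are held in a pandas DataFrame X with hypothetical column names (a constant 'const', the time trend and the xj), and uses ordinary least squares from statsmodels.

    import statsmodels.api as sm

    def backward_eliminate(X, y, t_crit=2.0, keep=("const",)):
        # Repeatedly refit by least squares and drop the candidate term
        # with the smallest absolute t statistic, until every remaining
        # term satisfies |t| >= t_crit (roughly the 5% level here).
        cols = list(X.columns)
        while True:
            fit = sm.OLS(y, X[cols]).fit()
            t = fit.tvalues.drop(list(keep))   # protected terms never leave
            weakest = t.abs().idxmin()
            if abs(t[weakest]) >= t_crit:
                return fit                     # all remaining terms significant
            cols.remove(weakest)

Applied to the terms of Table 3.1 this should reproduce the successive removal of x7, x3, x1 and x4.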
We now repeat the forward search to check that all cases support this
model. Figure 3.37 shows the forward plot of residuals, which is very stable,
with cases 65 and 31 showing the largest residuals for most of the search.
There is still no evidence of any high leverage points, Figure 3.38. This
result is to be expected; although it is possible that removal of carriers
can cause appreciable changes in leverage, as it did in our analysis of the
stack loss data, the requisite structure of the data is unlikely. The plot of
the score statistic, Figure 3.39(left), shows that all data agree with the log
transformation. The final plot, Figure 3.39(right), is of the t statistics for
those terms included in the final model. They show the typical decline in
absolute value as cases are introduced that are increasingly less close to
the model. But the curves are smooth, giving no sign of any exceptionally
influential cases. We also monitored a large number of other quantities,
such as the modified Cook's distance, normality and kurtosis plots, and the
Figure 3.37. Logged ozone data, final model (x2, x5, x6, x8 and time trend): forward plot of scaled residuals, showing appreciable stability.
Figure 3.38. Logged ozone data, final model (x2, x5, x6, x8 and time trend): forward plot of leverages. There are no observations with unduly high leverage.
Figure 3.39. Logged ozone data, final model (x2, x5, x6, x8 and time trend): (left) score test for transformation, the log transformation being acceptable; (right) forward plot of t statistics for the parameters, showing no effects of outliers (the central band is at ±2.58).
3.5 Exercises
Exercise 3.1 Find the least median of squares estimate of location for the
two samples reported in Table 3.2 (§3.1):
Sample 1 Sample 2
192 192
134 134
124 124
128 128
201 1201
120 120
186 186
204 1204
Exercise 3.2 Hawkins et al. (1984) give some synthetic regression data with n = 75 and three explanatory variables. The purpose is to illustrate the problems outliers at leverage points can cause for least squares. A robust analysis of the data is given by Rousseeuw and Leroy (1987, p. 93).
Regress y on the three explanatory variables using least squares. What do
you see from a normal plot of the residuals? Also try index plots of residuals
and leverages. What do you conclude? Now try other least squares methods
for model checking. What does a robust analysis add (§3.1)?
Exercise 3.3 What do you think the forward plot of leverages looks like
when a first-order model is fitted to the wool data (§3.1)?
Exercise 3.4 Estimate the parameters for Hawkins' data in §3.1 using
only observations 11, 15, 37, 51, 58, 70, 96, 109, 114, 97, which were
among the first to enter in our forward search. You should now be able to
divide the data into the groups shown in Figure 3.3. How does the QQ plot
of residuals change as each successive group is included in the fit? Comment
on the differences between Figure 3.3 and the forward plot of the residuals
in Figure 3.40 (§3.1).
Exercise 3.5 Using the mean shift outlier model, or otherwise, calculate
the F test for the hypothesis that observations 4 and 21 are outlying when
the stack loss data, using only variables 1 and 2, are analyzed with response
√y. Comment on the significance level of your test (§3.2).
Exercise 3.6 Table 3.3 gives demographic data about 49 countries taken
from Gunst and Mason (1980, p. 358). The variables are
Figure 3.40. Hawkins' data: forward plot of scaled residuals.
3.6 Solutions
Exercise 3.1
The least median of squares estimate of location is the midpoint of the
shortest half (Rousseeuw and Leroy (1987, p. 169)). The shortest half of a
Figure 3.41. Hawkins et al. (1984) data: (left) residuals against fitted values and (right) QQ plot of residuals.
sample of size n is the shortest of the intervals

    y_h - y_1,  y_{h+1} - y_2,  ...,  y_n - y_{n-h+1},

where h = [n/2] + 1 and y_1 ≤ y_2 ≤ ... ≤ y_n are the ordered observations.
In our first sample the intervals are 186 - 120 = 66, 192 - 124 = 68,
201 - 128 = 73 and 204 - 134 = 70. The shortest half is 66 and the least
median of squares estimate of location is 0.5(120 + 186) = 153. This is also
the least median of squares estimate of location of the second sample. Note
that the presence of the two outliers in the second sample (1201 and 1204)
does not affect the LMS estimate of location.
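The computation of the shortest half is easily checked numerically; the following sketch, with our own naming, implements the definition above.

    import numpy as np

    def lms_location(sample):
        # LMS estimate of location: the midpoint of the shortest half,
        # the shortest of the intervals y_{i+h-1} - y_i with h = [n/2] + 1.
        y = np.sort(np.asarray(sample, dtype=float))
        n = len(y)
        h = n // 2 + 1
        widths = y[h - 1:] - y[:n - h + 1]
        i = int(np.argmin(widths))
        return 0.5 * (y[i] + y[i + h - 1])

    print(lms_location([192, 134, 124, 128, 201, 120, 186, 204]))    # 153.0
    print(lms_location([192, 134, 124, 128, 1201, 120, 186, 1204]))  # 153.0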
Exercise 3.2
The QQ plot of studentized residuals in Figure 3.41(right) shows that there
are four outlying observations. The other panel, of residuals against fitted
values, again shows these four observations, as well as a further group of
10 outliers. The scatterplot matrix in Figure 3.42 clearly shows the three
groups of observations.
It seems that robust estimation adds little here to the identification of
outliers, although a forward search makes it possible to monitor the effect
of these outliers on, for example, t statistics.
Exercise 3.3
As Figure 3.43 shows, the factorial structure of the data becomes apparent
as the search progresses. At the end of the search the four leverage values
come from groups of points with respectively 0, 1, 2 and 3 nonzero coordi-
nates. The smallest leverage, 1/n, is for the centrepoint of the design. See Farrell et al. (1967) or Atkinson and Donev (1992, p. 129).
Figure 3.42. Hawkins et al. (1984) data: scatterplot matrix. The three groups of observations can be clearly identified.
Figure 3.43. Wool data: forward plot of leverages.
Exercise 3.4
The QQ plot of residuals from the fit using these 10 observations is similar
to that for the LMS residuals in Figure 3.2 (why?). As successive groups of
observations are introduced there is a gradual change to the least squares
residuals in the left-hand panel and the groups merge. The forward plot
of residuals in Figure 3.40 shows the four groups of residuals more clearly
than does the plot of squared residuals in Figure 3.3, which emphasizes the
largest residuals.
Exercise 3.5
The mean shift outlier model leads to the F test comparing residual sums
of squares with and without the two observations. Numerically, the F test
is
    {SS(0.5, 21) - SS(0.5, 19)}/2     (1.9666 - 0.5396)/2
    -----------------------------  =  -------------------  =  21.16,
            SS(0.5, 19)/16                 0.5396/16
where SS(0.5, 19) is the residual sum of squares when the square root of
the response is taken and observations 4 and 21 are removed. SS(0.5,21)
is the residual sum of squares using all the units.
The significance level of this result is very high: F2,16(0.001) = 10.97 and we observe 21.16. However the forward procedure has found two observa-
tions that give a large reduction in the residual sum of squares when they
are deleted. If the true value of the significance mattered, some allowance
should be made for the effect of selecting the observations, either by using
the Bonferroni inequality, for example, Cook and Prescott (1981), or by
simulation. However here the effect of the two observations is so large that
such refinements are not required.
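The arithmetic of the test is trivial to reproduce; the numbers below are those quoted above.

    # Mean shift outlier F test for observations 4 and 21: stack loss
    # data, response sqrt(y), comparing residual sums of squares with
    # and without the two suspect observations.
    ss_all, ss_del = 1.9666, 0.5396          # SS(0.5, 21) and SS(0.5, 19)
    F = ((ss_all - ss_del) / 2) / (ss_del / 16)
    print(round(F, 2))                       # 21.16 > F_{2,16}(0.001) = 10.97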
Exercise 3.6
Figure 3.44 gives three index plots of leverages. Observation 27 has the
highest leverage, but the pair of observations 17 and 39 have, together
with observation 20, the next highest leverages, almost 0.5. If either of the
pair of observations is deleted the other has a leverage close to one, as the
lower two plots of the figure show.
The reason for this behaviour is clear from the scatter plots of Fig-
ure 3.45. The two units are close together and remote from all others in
their values of x3 and x4. Their leverages therefore illustrate the theoret-
ical results of Exercise 2.5. The countries concerned are Hong Kong and
Singapore.
Table 3.4 shows the results of the backward selection of variables using t statistics. On this untransformed scale only x5 and x6 are included in the final model.
Exercise 3.7
In the Minitab output below the response C12 is logged ozone concentration, C11 is day and C2 to C9 correspond to x1 to x8. The models with 6 or
Figure 3.44. Demographic data: index plots of leverage. The top panel is for all the data. The other two panels show what happens when observations 17 and 39, respectively, are deleted.
Figure 3.45. Demographic data: three scatter diagrams showing why units 17 and 39 have high leverage.
Table 3.4. Demographic data: t statistics of the coefficients during backward elimination of variables

Model number       1         2         3         4         5
x1             -1.7322   -1.7307   -1.7499   -1.5825      -
x2              0.2727      -         -         -         -
x3             -0.4620   -0.5039   -0.7917      -         -
x4              0.2830    0.3159      -         -         -
x5              1.4877    1.7063    1.7122    2.1927    3.7345
x6              4.1474    4.1843    4.3011    4.3346    4.3923
R²              0.584     0.583     0.582     0.576     0.553
4.1 Background
Several analyses in this book have been improved by using a transformation
of the response, rather than the original response itself, in the analysis of
the data. For the introductory example of the wool data in Chapter 1, the
normal plot of residuals in Figure 1.9 is improved by working with log y
rather than y (Figure 4.2). The transformation improves the approximate
normality of the errors. The transformation also improves the homogeneity
of the errors. The plot of residuals against fitted values for the original data,
also given in Figure 1.9, showed the variance of the residuals increasing with
fitted value. The same plot for log y, given in Figure 4.2, shows no such
increase.
Two analyses in Chapter 3 also showed the advantages of transformation
of the response. For the stack loss data a simpler model was obtained when
the square root of y was used as the transformation. There was no need for
interactions - an additive model sufficed. We also saw advantages in work-
ing with the log of ozone concentration, rather than with the concentration
itself.
There are physical reasons why a transformation might be expected to
be helpful in these examples. In two of the sets of data, the response is a
concentration and in the third it is the number of cycles to failure. All are
nonnegative variables and so cannot be subject to additive errors of con-
stant variance. In this chapter we analyze such data using the parametric
family of power transformations introduced by Box and Cox (1964). A full
that is,

    z(λ) = Xβ + ε.    (4.2)
When λ = 1, there is no transformation: λ = 1/2 is the square root transformation, λ = 0 gives the log transformation and λ = -1 the reciprocal. These are the most widely used transformations, frequently supported by some empirical reasoning. For example, measurements of concentration often have a standard deviation proportional to the mean, so that the variance of the logged response is approximately constant (Exercise 4.1). For this form of transformation to be applicable, all observations need to be positive. For it to be possible to detect the need for a transformation the ratio of largest to smallest observation should not be too close to one. A similar requirement applies to the transformation of explanatory variables.

The purpose of the analysis is to find an estimate of λ for which the errors in the z(λ) (4.2) are, at least approximately, normally distributed with constant variance and for which a simple linear model adequately describes the data. This is achieved by finding the maximum likelihood estimate of λ, assuming a normal theory linear regression model.
Once a value of λ has been decided upon, the analysis is the same as that using the simple power transformation (4.4), with a standard normal theory likelihood for the response z(λ). For the power transformation (4.3) the Jacobian of the transformation is J = Π y_i^(λ-1), so that

    log J = (λ - 1) Σ log y_i = n(λ - 1) log ẏ,

where ẏ is the geometric mean of the observations.
The maximum likelihood estimates of the parameters are found in two stages. For fixed λ the likelihood (4.5) is maximized by the least squares estimates

    β̂(λ) = (X^T X)^(-1) X^T z(λ),

with the residual sum of squares of the z(λ)

    R(λ) = z(λ)^T (I - H) z(λ) = z(λ)^T A z(λ).    (4.6)

Division of (4.6) by n yields the maximum likelihood estimator of σ² as

    σ̂²(λ) = R(λ)/n.

Replacement of this estimate by the mean square estimate s²(λ), in which n is replaced by (n - p), does not affect the development that follows. For fixed λ we find the loglikelihood maximized over both β and σ² by substitution of β̂(λ) and s²(λ) into (4.5). If an additive constant is ignored this partially maximized, or profile, loglikelihood of the observations is

    Lmax(λ) = -(n/2) log{R(λ)/(n - p)},    (4.7)
so that λ̂ minimizes R(λ). To repeat what has already been stressed, it is important that R(λ) in (4.7) is the residual sum of squares of the z(λ), a normalized transformation with the physical dimension of y for any λ (Exercise 4.2). Comparisons of residual sums of squares of the simple power transformation y(λ) are misleading. Let

    S(λ) = y(λ)^T (I - H) y(λ)    (4.8)

be the residual sum of squares of the unnormalized y(λ). Suppose, for example, that the observations are of order 10³; the residual sum of squares S(1) will be of order 10⁶, whereas, when λ = -1, the reciprocal transformation, the observations and S(-1) will be of order 10⁻⁶. However relatively well the models for λ = 1 and λ = -1 explain the data, S(-1) will be very much smaller than S(1). Comparison of these two residual sums of squares will therefore indicate that the reciprocal transformation is to be preferred. This bias is avoided by the use of R(λ) in (4.7), since the magnitude of z(λ) does not depend on λ.
For inference about the transformation parameter λ, Box and Cox suggest likelihood ratio tests using (4.7), that is, the statistic

    TLR = 2{Lmax(λ̂) - Lmax(λ0)} = n log{R(λ0)/R(λ̂)}.    (4.9)
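These calculations are easily sketched in code. The functions below are minimal versions of our own, not the book's software: boxcox_z computes the normalized transformation, profile_loglik evaluates Lmax(λ) of (4.7) up to an additive constant, and lr_test computes TLR of (4.9), with λ̂ found by a simple grid search.

    import numpy as np

    def boxcox_z(y, lam):
        # Normalized Box-Cox transformation z(lambda); ydot is the
        # geometric mean of the observations.
        ydot = np.exp(np.mean(np.log(y)))
        if abs(lam) < 1e-8:
            return ydot * np.log(y)
        return (y**lam - 1.0) / (lam * ydot**(lam - 1.0))

    def profile_loglik(y, X, lam):
        # Lmax(lambda) of (4.7), ignoring additive constants.
        z = boxcox_z(y, lam)
        beta, *_ = np.linalg.lstsq(X, z, rcond=None)
        R = np.sum((z - X @ beta) ** 2)        # R(lambda) of (4.6)
        n, p = X.shape
        return -(n / 2.0) * np.log(R / (n - p))

    def lr_test(y, X, lam0, grid=np.linspace(-2, 2, 81)):
        # T_LR of (4.9) for H0: lambda = lam0.
        lmax = max(profile_loglik(y, X, lam) for lam in grid)
        return 2.0 * (lmax - profile_loglik(y, X, lam0))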
    Tp(λ) = -γ̂(λ) / √[ s²_w(λ) / {w(λ)^T A w(λ)} ].    (4.13)

The negative sign arises because in (4.12) γ = -(λ - λ0). The mean square estimate of σ² can, from (2.29), be written in the form

    s²_w(λ) = [ z(λ)^T A z(λ) - {z(λ)^T A w(λ)}² / w(λ)^T A w(λ) ] / (n - p - 1).
These formulae show how γ̂ is the coefficient for regression of the residuals of z(λ) on the residuals of w(λ), both being the residuals from regression on X. If, as is usually the case, X contains a constant, any constant in w(λ) can be disregarded in the construction of the residuals (Exercise 4.3). Under these conditions (4.11) gives

    γ̂ ≈ -(λ - λ0) = λ0 - λ.
If there is no regression, γ̂ ≈ 0 and the value of λ0 is acceptable. The constructed variable plot will then show no trend. If the value of λ0 is too high, the plot will show a positive trend. This is often seen at the beginning of an analysis when the hypothesis of no transformation is explored, the positive slope indicating that the data should be transformed. On the other hand, if the data are overtransformed, the plot will show a negative slope and a higher value of λ is indicated.
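The score test itself is easy to compute directly, as the t statistic of the constructed variable in the augmented regression. The sketch below is ours; it assumes X already contains a column of ones, uses the derivative of the normalized transformation as the constructed variable, and follows the sign convention of (4.13), so that positive values indicate that a value of λ above λ0 is needed.

    import numpy as np

    def score_test_Tp(y, X, lam0):
        # Approximate score test Tp(lam0): minus the t statistic for the
        # constructed variable w(lam0) in the regression of z(lam0) on X and w.
        ydot = np.exp(np.mean(np.log(y)))      # geometric mean
        if lam0 == 0:
            z = ydot * np.log(y)
            w = ydot * np.log(y) * (0.5 * np.log(y) - np.log(ydot))
        else:
            k = lam0 * ydot**(lam0 - 1.0)
            z = (y**lam0 - 1.0) / k
            w = (y**lam0 * np.log(y)) / k - z * (1.0 / lam0 + np.log(ydot))
        Xw = np.column_stack([X, w])
        beta, *_ = np.linalg.lstsq(Xw, z, rcond=None)
        res = z - Xw @ beta
        n, q = Xw.shape
        s2 = res @ res / (n - q)               # mean square estimate
        cov = s2 * np.linalg.inv(Xw.T @ Xw)
        return -beta[-1] / np.sqrt(cov[-1, -1])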
    t(y) = y^λ      (λ ≠ 0)
    t(y) = log y    (λ = 0),

which gives the same numerical results as the Box-Cox transformation in (4.1) or (4.3), although lacking the mathematical property of continuity at λ = 0.
Figure 4.1. Wool data: profile loglikelihood Lmax(λ) (4.7), showing the narrow 95% confidence interval for λ.
In (4.18) δ̂_k is the estimate from the regression (4.16) including the variable x_k^λ0. Further details are given by Atkinson (1985, §8.4).
Figure 4.2. Transformed wool data: residual plots for log y: (left) least squares residuals against fitted values; (right) normal QQ plot of studentized residuals.
Figure 4.3. Wool data: fan plot (forward plot of Tp(λ)) for five values of λ. The curve for λ = -1 is uppermost; log y is indicated.
Initially, apart from the very beginning when results may be unstable, there is no evidence against any transformation. When the subset size m = 15 (56% of the data), λ = 1 is rejected. The next rejections are λ = 0.5 at 67% and λ = -1 at 74%. The value of λ = 0 is supported not only by all the data, but also by our sequence of subsets. The observations added during the search depend on the transformation. In general, if the data require transformation and are not transformed, or are insufficiently transformed, large observations will appear as outliers. Conversely, if the data are overtransformed, small observations will appear as outliers. This is exactly what happens here. For λ = 1 and λ = 0.5, working back from m = 27, the last cases to enter the subset are 19, 20 and 21, which are the three largest observations. Conversely, for λ = -1 and λ = -0.5 case 9 is the last to enter, preceded by 8 and 7, which are the three smallest observations. Since the data are in standard order for a 3³ factorial, the patterns of these numbers indicate a systematic failure of the model. For the log transformation, which produces normal errors, there is no particular pattern to the order in which the observations enter the forward search.
Similar results are obtained if Tp(λ) is replaced by the signed square root of the likelihood ratio test (4.9). In the absence of outliers and highly influential observations, the fan plot of the score statistic evolves smoothly with m: there are no jumps or dramatic reorderings of the curves. More quantitatively, the method allows assessment of the proportion of the data supporting a particular transformation, information not available from other methods of analysis.
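A fan plot curve is produced by recomputing the score test as the forward search grows the fitting subset. The rough sketch below is our own code, reusing boxcox_z and score_test_Tp from the earlier sketches; subset0 holds the indices of an initial subset, for instance from a least median of squares fit to data transformed with λ0, and should contain a few more observations than there are parameters.

    import numpy as np

    def fan_plot_curve(X, y, lam0, subset0):
        # One curve of the fan plot: Tp(lam0) as the subset grows to n.
        n, p = X.shape
        subset = np.asarray(subset0)
        curve = []
        for m in range(len(subset), n + 1):
            # score_test_Tp is applied to the current subset, so the
            # geometric mean is that of the subset: a simplification
            curve.append((m, score_test_Tp(y[subset], X[subset], lam0)))
            if m == n:
                break
            # refit on the subset, then keep the m + 1 observations
            # with the smallest squared residuals
            z = boxcox_z(y, lam0)
            b, *_ = np.linalg.lstsq(X[subset], z[subset], rcond=None)
            subset = np.argsort((z - X @ b) ** 2)[:m + 1]
        return curve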
There are a number of details in the combination of the forward search
with statistics about transformation that need to be decided in any im-
Figure 4.4. Wool data: forward plot of Tp(λ) for five values of λ, one search on untransformed data.

Figure 4.5. Wool data: Lmax(λ) for five values of λ during the forward search. The uppermost curve is for λ = 0.
Figure 4.6. Wool data: forward plot of λ̂ with 95% likelihood intervals from two searches: (left) λ = 1; (right) λ = 0.
cal for any ordering. The second point is that the plot requires a numerical search for calculation of the maximum likelihood estimate λ̂ at each stage of the search, combined with two further numerical searches to find the ends of the confidence region. This procedure is computationally intensive when compared with calculation of the approximate score test Tp(λ), which only requires noniterative calculations at the null value. The third point is that the fan plot is appreciably easier to interpret than Figure 4.6. We accordingly use plots of score statistics, rather than of parameter estimates.
Finally, we supplement our discussion of the forward search by considering information available from the graphical methods described in §4.2.2. Figure 4.7 is the constructed variable plot when λ = 1. With its positive slope, the plot shows clear evidence of the need for a transformation, evidence which seems to be supported by all the data. The most influential points seem to be observations 20 and 19, which are the two largest observations, and 9, 8, 7 and 6, which are the four smallest. The sequential nature of these sets of numbers reflects that the data are from a factorial experiment and are presented in standard order. The constructed variable plot for λ = 0 is in Figure 4.8. There is no trend in the plot and the transformation seems entirely acceptable. The residuals from the six observations that were extreme in the previous plot now lie within the general cloud of points.

Although the constructed variable plots give the same indications as all other plots about the satisfactory nature of the log transformation, they do not supply direct evidence on the influence of the individual observations on the selection of this transformation. This is provided in Figure 4.9, which gives index plots of the deletion values of Tp(λ) for three values of λ. These are calculated by deleting each observation in turn and recalculating the statistic, rather than using the approximate formulae coming from the deletion results in §2.3. The plots for the three values of λ are on the same
Figure 4.7. Wool data: constructed variable plot for λ = 1. The clear slope in the plot indicates that a transformation is needed. The largest observations are 19 and 20; the labelled points in the centre of the plot have the four smallest values of y.
Figure 4.8. Wool data: constructed variable plot for λ = 0. The absence of trend indicates that the log transformation is satisfactory.
Figure 4.9. Wool data: index plots of deletion values of Tp(λ) with 95% intervals for λ = -0.5, 0 and 0.5. No evidence against log y.
vertical scale. They show that deletion of individual observations has little effect on the values of the statistics: λ = 0.5 and -0.5 are still firmly rejected. Likewise, individual deletions have virtually no effect on the value of Tp(λ0): all values remain close to zero.
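The brute force deletion calculation is immediate given score_test_Tp from the earlier sketch; the name deletion_Tp is ours.

    import numpy as np

    def deletion_Tp(X, y, lam0):
        # Tp(lam0) recomputed with each observation deleted in turn,
        # as in the index plots, rather than via approximate formulae.
        n = len(y)
        mask = np.ones(n, dtype=bool)
        out = np.empty(n)
        for i in range(n):
            mask[i] = False
            out[i] = score_test_Tp(y[mask], X[mask], lam0)
            mask[i] = True
        return out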
The last plot we consider in our analysis of the wool data is the inverse response plot in Figure 4.10. The plot of ŷ against y is given twice. In Figure 4.10(left) we have imposed the best fitting straight line, which clearly fits very badly. But in the right-hand panel the fitted curve is log y, which fits very well. The visual impression of this plot is once more to support the log transformation.
Figure 4.10. Wool data: inverse fitted value plots with fitted curves for (left) no transformation and (right) the log transformation.
Table 4.1. Poison data: last six observations to enter the five separate searches and the numbers of the six largest observations

 m    λ = -1   λ = -0.5   λ = 0   λ = 0.5   λ = 1    Six largest
 43     27        44        14       43       28         13
 44     28        37        28       28       43         15
 45     37        28        37       14       17         17
 46     44         8        17       17       14         42
 47     11        20        20       42       42         14
 48      8        42        42       20       20         20
p = 6, as did Box and Cox (1964) when finding the reciprocal transformation. The implication is that the model should be additive in death rate, not in time to death.

Our analysis is again based on five values of λ: -1, -0.5, 0, 0.5 and 1. The fan plot of the values of the approximate score statistic Tp(λ) for each search as the subset size m increases is given in Figure 4.11 and shows that the reciprocal transformation is acceptable, as is the inverse square root transformation (λ = -0.5). Table 4.1 gives the last six observations to enter each forward search. We first consider the ordering of the data achieved by these forward searches and then discuss Figure 4.11 in more detail.

In addition to the ordering of the data by the search, Table 4.1 also gives the numbers of the six largest observations. The table shows that, for λ = 0.5 and 1, observation 20, the largest observation, is the last to enter the set used for fitting. It is the last but one (m = 47) to enter for λ = 0 or -0.5 and is not in the last six for λ = -1. Similarly, the four largest observations are the last four to enter for λ = 1 and 0.5, but the number
Figure 4.11. Poison data: fan plot (forward plot of Tp(λ)) for five values of λ. The curve for λ = -1 is uppermost; both λ = -1 and λ = -0.5 are acceptable.
Figure 4.12. Modified poison data: fan plot (forward plot of Tp(λ)) for five values of λ. The curve for λ = -1 is uppermost; the effect of the outlier is evident in making λ = 0 appear acceptable at the end of the search.
Figure 4.13. Modified poison data: forward plot of Tp(λ) for five values of λ, one search on untransformed data. The outlier enters well before the end of the search.
now, of course, enters at the same position in all five calculations of Tp(λ). Because a small observation has been made smaller, the outlier has its greatest effect on the tests for λ = -1. But the effect of its introduction is clear for all five test statistics. Although this figure is helpful in the identification of an influential outlier, it is nothing like as useful as the fan plot of Figure 4.12 in understanding which is the correct transformation. When, as in Figure 4.12, the data are approximately correctly transformed, which they are for λ = -1, -0.5 and 0, observation 8 enters at the end of the search. As the value of λ becomes more remote from the correct value, so the outlier enters earlier in the search.
We now compare the clear information given by the fan plot with that which can be obtained from other graphical methods. Figure 4.14 gives constructed variable plots for three values of λ: -1, 0 and 1. For λ = 0 there is a clear indication of the importance of observation 8. There is a cloud of 26 points with an upward trend, and the remote point of observation 8, which is causing the estimate of slope to be near zero. Deletion of this observation can be expected to change the estimated transformation, although by how much cannot be determined from this plot. The plot for λ = -1 seems to show that there is evidence that the reciprocal transformation has overtransformed the data, although the effect of observation 8 is not clear. Likewise the panel for λ = 1 indicates that the data should be transformed. On this plot observation 8 seems rather less important. One conclusion from these plots is that it is helpful to look at a set of values of λ when using constructed variable plots, just as it is in the fan plot.
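The coordinates of a constructed variable plot are the residuals of z(λ0) and of w(λ0) after regression on X. A minimal sketch, with our own naming and the same constructed variable as in the score test sketched earlier:

    import numpy as np

    def constructed_variable_plot_xy(X, y, lam0):
        # Returns (residual constructed variable, residual response),
        # the x and y coordinates of the constructed variable plot.
        ydot = np.exp(np.mean(np.log(y)))
        if lam0 == 0:
            z = ydot * np.log(y)
            w = ydot * np.log(y) * (0.5 * np.log(y) - np.log(ydot))
        else:
            k = lam0 * ydot**(lam0 - 1.0)
            z = (y**lam0 - 1.0) / k
            w = (y**lam0 * np.log(y)) / k - z * (1.0 / lam0 + np.log(ydot))
        resid = lambda v: v - X @ np.linalg.lstsq(X, v, rcond=None)[0]
        return resid(w), resid(z)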
As a third graphical aid to choosing a transformation we give in Figure 4.15 Cook and Weisberg's inverse fitted value plot for four values of λ. The values of y and of the fitted values ŷ are the same in all four plots.
Figure 4.14. Modified poison data: constructed variable plots for three values of λ. The effect of observation 8 on the estimated transformation is clearest for λ = 0.
Figure 4.15. Modified poison data: inverse fitted value plots with fitted curves for four values of λ. Is λ = -0.5 best?
What differs is the fitted curve. Because the data consist of four observations at each factor combination, patterns of four identical values of ŷ are evident in the plot. These are more widely dispersed (in the horizontal direction, since this is an inverse plot) for larger values of ŷ. The difference in dispersion makes it rather difficult to judge the plots by eye: the lower values are best fitted, perhaps, by the reciprocal transformation and the higher values by the log. The value of λ = 0.5 is clearly an inadequate transformation: the fitted line is not sufficiently curved. The plots thus indicate, in a rather general way, what is an appropriate transformation, but they do not indicate the importance of observation 8. However, due to the replication, the variance for the group of observations including 8 can be seen to be rather out of line with the general relationship between mean and variance.
Figure 4.16. Doubly modified poison data: index plots of deletion values of Tp(λ) with 99% intervals for four values of λ; the log transformation is indicated.
λ        -1     -0.5      0     0.5      1
Tp(λ)  10.11    4.66    0.64  -3.06  -7.27
It seems clear that the data support the log transformation and that all other transformations are firmly rejected. To show how diagnostics based on the deletion of single observations fail to break the masking of the two outliers, we give in Figure 4.16 index plots of the deletion values of the Tp(λ), calculated directly from the data with each case deleted in turn. Also given on the panels, where possible, are lines at ±2.58, corresponding to 1% significance, assuming the statistics have a standard normal distribution. The four panels have also been plotted with the same vertical scale. For λ = -1 the statistics range from 7.22 to 10.7, so that the inverse transformation is firmly rejected. For λ = -0.5 the range is 2.54 to 4.88, evidence for rejection of this value. For the log transformation all values lie well within
Figure 4.17. Doubly modified poison data: fan plot (forward plot of Tp(λ)) for five values of λ. The curve for λ = -1 is uppermost; the effect of the two outliers is clear.
Figure 4.18. Doubly modified poison data: constructed variable plots for two values of λ (0 and 0.5). There seem to be two outliers, or are there three?
46, giving upward jumps to the score statistic in favour of this value of λ. For the remaining value of 0.5 one of the outliers is the last value to be included.

Although the single case deletion diagnostics of Figure 4.16 fail to reveal the two outliers, they are revealed by the constructed variable plots in much the same way as they were for the singly modified data in §4.5. Plots of the constructed variables for λ = 0 and 0.5 are given in Figure 4.18. For λ = 0 it seems clear that there are two outliers, observations 8 and 38, which will influence the choice of transformation. These two observations need to be deleted and the transformation parameter reestimated. The results for λ = 0.5 are less clear: deletion of observations 8 and 38 would indicate that the data need to be transformed to a lower value of λ, although there is no evidence whether the value of λ should be -0.5, -1, or some other value. But deletion of 14, 20 and 42, the three largest observations, would suggest a higher transformation, or perhaps no transformation at all.
The conclusions from the constructed variable plots are, as before, less sharp than those from the fan plot, which clearly reveals not only the masked outliers, but also their effect on the estimated transformation. Although the outliers were not revealed by the single case deletion methods exhibited in Figure 4.16, they could be found by looking at all 48 × 47/2 = 1,128 pairs of deletions. But if there were three outliers, 17,296 triples would have to be investigated, and even more if there were four outliers. The problem with this procedure is, perhaps, not the amount of computation but, rather, the difficulty in interpreting the mass of computer output. Use of the fan plot from the forward search reveals the outliers and their effect on inference in one analysis.
Table 4.2. Multiply modified poison data: the four modified observations

Observation    Original value    Modified value
     6              0.29              0.14
     9              0.22              0.08
    10              0.21              0.07
    11              0.18              0.06
Figure 4.19. Multiply modified poison data: index plots of deletion values of Tp(λ) for three values of λ; the value of 1/3 is indicated.
imation to the effect of deletion. Also given on the panels, where possible, are lines at ±1.96, corresponding to 5% significance, assuming the statistics have a standard normal distribution. The three panels have been plotted with the same vertical scale. For λ = 0 deletion of observation 11 reduces the value of the statistic, but it is still larger than 2, suggesting that the log transformation remains unlikely. For λ = 1/3 the largest effect is from the deletion of observation 20. Whether or not it is included, the third root transformation is supported by the data. It is however an unusual transformation, except for volumes. The plot for λ = 0.5 shows that if observations 20 or 42 are deleted the square root transformation is acceptable. If, as a result of this information, it were decided to move to λ = 1/2, this would take the analysis even further from the value appropriate to the majority of the data.
The next step in a standard analysis is to look at the distribution of residuals to see whether there is any evidence of outliers. Three QQ plots are exhibited for λ = 1/3. As well as that for the least squares residuals we show residuals from two robust fits when different sized subsets are used in fitting the model during the forward search. In all plots the residuals have been scaled by the overall estimate of σ² from the end of the forward search. This scaling is not important when looking for evidence of outliers, since the important feature is the shape of the plot. Figure 4.20(left panel) shows scaled residuals from an LMS fit to an elemental set, found by searching over 10,000 randomly selected subsets of size p = 6. The plot shows the typically long-tailed distribution which comes from a very robust fit. Figure 4.20(middle panel) shows what happens as we move along the forward search until a subset of size m = 27 is used for fitting. The plot still has a
Figure 4.20. Multiply modified poison data: normal QQ plots of scaled residuals at three points in the forward search.
long-tailed shape, from which it might be concluded that there were many outliers, or none. However, when all the data are fitted by least squares, as shown in Figure 4.20(right panel), there is no evidence of any particular outliers. The effect of individual deletions on the estimated transformation has already been investigated in Figure 4.19, so it is not necessary to consider again the importance of observations 11 and 20, which are the most extreme observations when m = n in Figure 4.20, as they also are when m = 6 and m = 27.

The conclusion from this analysis is that the transformation λ = 1/3 is reasonable. This traditional approach gives no indication of the effect of the four outliers. The example shows that, if data are analyzed on the wrong transformation scale, even the application of very robust methods such as LMS fails to highlight the outliers and influential observations.
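The elemental-set LMS fit used at the start of the search can be sketched as follows; lms_elemental is our own name, with the subset size p and the 10,000 random subsets as in the text.

    import numpy as np

    def lms_elemental(X, y, n_trials=10_000, rng=None):
        # Approximate LMS regression: fit exactly to random elemental
        # subsets of p observations and keep the coefficient vector
        # minimizing the median squared residual over all n cases.
        rng = np.random.default_rng(rng)
        n, p = X.shape
        best, best_med = None, np.inf
        for _ in range(n_trials):
            idx = rng.choice(n, size=p, replace=False)
            try:
                b = np.linalg.solve(X[idx], y[idx])
            except np.linalg.LinAlgError:
                continue                       # singular subset, skip it
            med = np.median((y - X @ b) ** 2)
            if med < best_med:
                best, best_med = b, med
        return best, best_med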
Figure 4.21. Multiply modified poison data: fan plot (forward plot of Tp(λ)) for five values of λ. The curve for λ = -1 is uppermost; the differing effects of the four modified observations are evident.
of the four contaminated observations, once more in the last four steps of the forward search, brings it above the upper threshold. The statistic for λ = 0.5 spends much more of the central part of the search outside the lower boundary. As we have seen, the final value of Tp(0.5) is -2.29. But for values of m between 22 and 37 the curve lies approximately on or below the boundary. The inclusion of units 9, 10 and 11 at m = 38, 39 and 40 increases the value of the score statistic from -2.65 to 1.89. From this step onwards the curve decreases monotonically, except at m = 43 when inclusion is of unit 6, the first modified unit to be included. It is interesting that, in this scale, the four contaminated observations are not extreme and so do not enter in the last steps of the forward search. But the forward plot enables us to detect their appreciable effect on the score statistic.
The indication of this plot is that one possible model for these data takes λ = -1 for the greater part of the data, with four outliers. To confirm this suggestion we look at the plot that monitors the scaled residuals during the forward search. This is shown, for λ = -1, in Figure 4.22. This plot beautifully indicates the structure of the data. On this scale there are the four outliers, observations 6, 9, 10 and 11, which enter in the last four steps of the forward search. Until this point the pattern of residuals remains remarkably constant, as we argued in §2.6.1 that it should. The pattern only changes appreciably in the last four or five steps, when the outliers and observation 8 are introduced.

The results of the forward search in Figures 4.21 and 4.22 clearly show the masked outliers and their effects, which were not revealed by the single case deletion methods exhibited in Figure 4.19 and the residual plots for λ = 1/3 of Figure 4.20. The comparison of these sets of figures exhibits the power of our method in the presence of influential masked multiple outliers.
Figure 4.22. Multiply modified poison data, λ = -1: forward plot of scaled residuals, clearly revealing the four modified observations.
Figure 4.23. Multiply modified poison data: inverse fitted value plots for two values of λ (1/3 and -1). The inverse transformation is quite unacceptable.
Figure 4.24. Multiply modified poison data: constructed variable plot for λ = 1/3.
Figure 4.25. Multiply modified poison data: constructed variable plots for λ = -0.5 and -1. Particularly for λ = -1, all observations seem to support transformation to a higher value.
Figure 4.26. Ozone data, final model: fan plot (forward plot of Tp(λ)) for five values of λ. The curve for λ = -1 is uppermost; log y is indicated.
tions would not lead to an improved analysis. On the contrary, they would
increase the standard errors of the estimated parameters. An additional
argument for the log transformation is that, when the data are correctly
transformed in the absence of outliers, the magnitude of the untransformed
observations is not reflected in the order in which they enter the search.
The fan plot for the full model for the ozone data is similar to that in
Figure 4.26 except that, in line with the results of Table 3.1, the values
of the statistics are smaller. We do not give the figure here, but instead
give two diagnostic plots for the transformed data in Figure 4.27. The left
panel is the constructed variable plot which shows no particular pattern,
although the two largest observations 53 and 71 are evident. However these
observations did not enter the plot at the end, so they agree with the
proposed transformation. The inverse fitted value plot is in the right-hand
panel. The vertical patterning visible is due to rounding of the values of the
response so that many cases have the same value of y. The fitted curve of
log y appears to pass well through most of the data. However, as with the
left panel, the two largest observations stand slightly apart. The largest,
observation 53, enters the subset in the forward search for the logged data
two observations from the end. We see from the fan plot, Figure 4.26, that
this observation causes a very small change in Tp(0) when it enters. This
is another example of the way in which the fan plot enables us to quantify
impressions from other plots.
Figure 4.27. Logged ozone data, final model: (left) constructed variable and (right) inverse fitted value plots for λ = 0.
Figure 4.28. Stack loss data, all three variables: fan plot (forward plot of Tp(λ)) for five values of λ. The curve for λ = -1 is uppermost; √y is indicated.
Table 4.3. Stack loss data: last five observations to enter the five separate searches
and the score statistics for transformation at each stage (linear model with
variables 1 and 2 only)
in the final stages. However they do not affect the conclusion that the square root transformation is acceptable. The log transformation is not acceptable when observations 4 and 21 are deleted, but is acceptable when all observations are included, a form of masking which may have misled Atkinson (1985). The transformation λ = -0.5 is not acceptable and the reciprocal transformation is clearly rejected.

An interesting feature of Table 4.3 is that for all except one value of λ many of the same observations appear as outliers. We have found a similar feature in other analyses of transformations with influential observations or outliers, where there appear to be just two different analyses, depending on the value of λ. A feature of these particular data is that observations 1, 2, 3 and 4 (although not 21) are the largest observations. The smallest observation is 16, with 15, 17 and 18 next smallest, followed by 19. These are the last observations to enter the forward search when λ = -1. We
Figure 4.29. Stack loss data, x1, x2 and √y: normal QQ plots of scaled residuals for (left) m = 12, (middle) m = 19 and (right) m = 21.
have already discussed this phenomenon in the context of the poison data and of the ozone data.

The results of Table 4.3 show that, whether or not observations 4 and 21 are treated as outliers, the square root transformation is accepted. We saw in the previous chapter that the evidence from the forward plot of residuals in Figure 3.27 showed that these observations are outliers and that, until m = 19, the plot of residuals is very stable. If it were important to test whether observations 4 and 21 were outliers on the square root scale, the mean shift outlier model could be used to provide an F test based on the fit for m = 19. Instead we look at diagnostic plots for m = 19 and m = 21 (Exercise 3.5). To justify this choice of values of m, we show in Figure 4.29 QQ plots for the residuals at three stages of the forward search for λ = 0.5. The left panel shows the scaled residuals for m = 12. As the forward plot of residuals showed, observations 4 and 21 appear outlying, as they do for m = 19 in the middle panel. The right panel shows the residuals when all observations are fitted and so is the plot for the residuals at the end of the forward search. Despite the clear pattern of residuals in the forward plot, in Figure 3.27, this plot is not so easy to interpret: the inclusion of the two potential outliers has caused some masking of their properties.

A feature of the QQ plots of Figure 4.29 is that the plots for m = 12 and m = 19 are very similar. This could be inferred from the forward plot of residuals in Figure 3.27, which also shows that we could expect the plots for m = 19 and m = 21 to be very different. Accordingly we look at the two diagnostic plots for transformations calculated for m = 19, that is without using observations 4 and 21 in the fit, and for m = 21. For m = 19 we use the estimated parameter values to calculate predictions and residuals for all 21 observations.

We begin with the constructed variable plot in Figure 4.30(right), for m = 21, which shows a horizontal scatter of points with two, those for observations 4 and 21, rather separate from the rest, but balancing each other in their effect on the regression. Figure 4.30(left) shows that, when the two observations are deleted, their predicted values move away from the
Figure 4.30. Stack loss data, x1, x2 and √y: constructed variable plots for (left) m = 19 and (right) m = 21.
Figure 4.31. Stack loss data, x1, x2 and √y: inverse fitted value plots for (left) m = 19 and (right) m = 21.
Figure 4.32. Mussels' muscles: scatterplot of the response M, muscle mass, against the explanatory variable S, shell mass. Deletion of the 10 observations marked + produces an approximately linear homoscedastic relationship between M and S for which there is no evidence of a transformation.
Figure 4.33. Mussels' muscles: score statistics for power transformations of M and S as the subset size m increases: the test for the same transformation of M and S, TMS(1); the test for transformation of M only, TM(1, 0.2); and the test for transformation of S only, TS(1, 0.2). Transformation of M is needed as well as that of S.
Table 4.4. Mussels' muscles: units included in the last 10 steps of the forward search for various null transformations (λ1, λ2)

Subset
size m    (1, 1)   (1, 0.2)   (1/3, 0.2)   (1/3, 1/3)   (0.2, 0.2)
  73        29        10          11           11            2
  74         8         8          10            2           44
  75        11     1, 29           2           10           10
  76        10         8          25           25           34
  77        39        23          34           34           21
  78         2        16          21           21           16
  79        34        34          16           16           25
  80        16        21          48           24           24
  81        21        24          24           48           48
  82        24         2           8            8            8
markably similar to each other, until close to the end of the search. For m = 75, 78 and 79 TM(1/3, 0.2) is below the lower boundary, while TS(1/3, 0.2) always lies inside. This divergence shows that the two constructed variables are responding to different aspects of the data. Although the transformation (1/3, 0.2) is supported by all the data, the plot shows that it would be rejected except for the last three observations added. These are, again working backwards, 8, 24 and 48, two of the three observations identified by Cook and Weisberg as influential. However our analysis is also informative about observations included not just at the end of the search. Figure 4.35 shows that from m = 67 onwards nearly every unit added is causing a decrease in the value of the score statistics. The three exceptions are shown by heavy lines in the figure corresponding to the inclusion, from largest m, of observations 8, 48 and 25, all of the three observations noted as influential by Cook and Weisberg. The effect of cases 25 and 48 is to bring the value of the score statistic back above the lower boundary. The inclusion of observation 8 forces the test statistics to be positive. Apart from these observations the statistic for transforming the response is responding to the correlation between M and S. If S is transformed with too low a value of λ2, a lower value of λ1 is indicated.
Finally we consider the third root transformation for both variables. This has the physically appealing property that both volumes have been converted to the dimension of length, which is that of the other three variables. Figure 4.36 shows the plot of TMS(1/3), which stays within the limits for the whole of the forward search, the increase in the statistic in the last two steps being caused by observations 8 and 48. As Figure 4.37 confirms, these are outlying observations, whatever reasonable transformation we take. The effect of observation 25 is no longer evident. Also given in
Figure 4.34. Mussels' muscles: scatterplot of transformed M against S^0.2.
Figure 4.35. Mussels' muscles: forward plots of the score statistics TM(1/3, 0.2) and TS(1/3, 0.2) as the subset size m increases.

Figure 4.36. Mussels' muscles: score statistics for the same power transformation of M and S as the subset size m increases: TMS(1/3) and TMS(0.2). The score test for the third root transformation lies throughout within the 99% limits.
the plot is TMS(0.2), which behaves very similarly, except that the effect of observations 8 and 48 is to cause the transformation to be rejected in favour of higher values of λ. Although the statistics in Figures 4.35 and 4.36 are calculated from different searches, Table 4.4 shows that the units included in the last 10 steps are virtually identical. The main difference is the order.

Our sequence of transformations has produced plots of increasing smoothness as better transformations are found. But the analysis of jumps, that is, of nonsmoothnesses, in the central part of the forward search, can highlight important cases: these are not outliers with respect to the transformation being used, but contain information about a suitable transformation. For example, case 48 is not an outlier if the data are untransformed, and so is not present in the last steps of the forward search. However its inclusion causes the increase in the value of TMS(1), visible in Figure 4.33, as m goes from 50 to 51.

Our analysis of this example shows how our forward method provides a link between the evidence for a transformation and the scatter plots of the data. As a result of this we are led to a physically meaningful transformation and the identification of two outliers, together with knowledge of their effect on the estimated transformations.
Figure 4.37. Mussels' muscles: scatterplot of transformed M against S^(1/3). Observations 8 and 48 are outlying.
E(Y) = η = x^T β,

with the normalized Box-Cox transformation (4.1), to obtain

    z(λ) = (y^λ - 1)/(λẏ^(λ-1)),   E{z(λ)} = (η^λ - 1)/(λẏ^(λ-1))    (λ ≠ 0)
    z(λ) = ẏ log y,                E{z(λ)} = ẏ log η                 (λ = 0),    (4.22)
where, as before, the geometric mean of the observations is written as ẏ.
The maximum likelihood estimator of λ is again the value minimizing R(λ),
the residual sum of squares of the z(λ): the transformation on the right-hand
side of (4.22) has no effect on the scale of the observations and so does not
enter the Jacobian of the transformation.
For fixed λ, estimation of the parameters of the linear predictor η in
(4.22) does not depend on whether the response is z(λ) or the nonnor-
malized y(λ) (4.3). Multiplication of both sides of (4.22) by λẏ^{λ−1} and
simplification leads to the model

y^λ = η^λ + λẏ^{λ−1}ε,   (4.23)

which, for the one-parameter model η = δv, becomes

y^λ = (δv)^λ + λẏ^{λ−1}ε.   (4.24)
Now let q = y^λ and u = v^λ. Model (4.24) then reduces to the simple form

q = δu + ε   (λ ≠ 0)
q = log y = log δ + u + ε   (λ = 0).   (4.25)

For general λ this model is regression through the origin and the residual
sum of squares R(λ) is found by dividing the residual sum of squares of q
by (λẏ^{λ−1})². For the log model there is no regression, only correction by a
constant.
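For illustration, a minimal Python sketch of this computation, assuming positive response and predictor vectors y and v; the function name tbs_rss is ours, not the book's:

    import numpy as np

    def tbs_rss(y, v, lam):
        # R(lambda) for the one-parameter transform-both-sides model (4.25):
        # regression through the origin of q = y^lambda on u = v^lambda
        ydot = np.exp(np.mean(np.log(y)))          # geometric mean of the y
        if lam != 0:
            q, u = y**lam, v**lam
            delta = np.sum(u * q) / np.sum(u * u)  # slope, no intercept
            return np.sum((q - delta * u)**2) / (lam * ydot**(lam - 1))**2
        # lambda = 0: no regression, only correction by a constant
        q, u = np.log(y), np.log(v)
        log_delta = np.mean(q - u)
        return ydot**2 * np.sum((q - u - log_delta)**2)

The maximum likelihood estimate λ̂ would then be found by evaluating tbs_rss over a grid of λ values and taking the minimum.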
Calculation of the score test for transformation requires the constructed
variable found by Taylor series expansion of (4.22) about λ₀. To find this
variable let

k(λ) = λẏ^{λ−1}.

Then, in (4.22),

z(λ) = (y^λ − 1)/k(λ)

and the derivative (4.11) from Taylor expansion of z(λ) is written

∂z(λ)/∂λ = {y^λ log y − (y^λ − 1)(1/λ + log ẏ)}/k(λ).
Likewise

∂/∂λ {(η^λ − 1)/k(λ)} = {η^λ log η − (η^λ − 1)(1/λ + log ẏ)}/k(λ).

The constructed variable for the transform both sides model (4.22) is found
as the difference of these two, since they occur on different sides of the
equation, and is

w_BS(λ) = y^λ log y − η^λ log η − (y^λ − η^λ)(1/λ + log ẏ).   (4.26)

In (4.26) the multiplicative constant k(λ) has been ignored since scaling
a regression variable does not affect the value of the t statistic for that
variable.
The general constructed variable (4.26) simplifies for the one-parameter
model (4.25), being written in terms of q = y^λ and δu = η^λ, provided λ ≠ 0.
Of course, δ is not known but is estimated by b, so that η^λ is replaced by
bu = q̂ to give the constructed variable

w_BS(λ) = (q log q − q̂ log q̂)/λ − (q − q̂)(1/λ + log ẏ).   (4.27)

When λ = 0 similar reasoning leads to the variable

w_BS(0) = (q² − q̂²)/2 − (q − q̂) log ẏ.
Evidence of regression between the residuals z*(λ) from a fitted model
in which both sides have been transformed and the residuals w*_BS(λ) is
evidence of the need for a different transformation. Atkinson (1994b) gives
examples of the use of this variable in a diagnostic analysis of data on tree
volumes, for which we give a forward analysis in the next section.
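As a hedged sketch of how such a score test might be computed for the one-parameter model, again in Python (tbs_score_test and its arguments are our names; this is the fixed-sample test, without the forward search machinery, and it assumes positive q and fitted q̂):

    import numpy as np

    def tbs_score_test(y, v, lam):
        # t statistic for the constructed variable w_BS of (4.27),
        # added to the regression through the origin of q on u
        ydot = np.exp(np.mean(np.log(y)))
        q, u = (np.log(y), np.log(v)) if lam == 0 else (y**lam, v**lam)
        delta = np.sum(u * q) / np.sum(u * u)
        qhat = delta * u                        # assumed positive when lam != 0
        if lam != 0:
            w = (q*np.log(q) - qhat*np.log(qhat))/lam \
                - (q - qhat)*(1/lam + np.log(ydot))
        else:
            w = (q**2 - qhat**2)/2 - (q - qhat)*np.log(ydot)
        X = np.column_stack([u, w])             # assumed of full column rank
        beta, rss, *_ = np.linalg.lstsq(X, q, rcond=None)
        s2 = rss[0] / (len(q) - X.shape[1])
        cov = s2 * np.linalg.inv(X.T @ X)
        return beta[1] / np.sqrt(cov[1, 1])     # t statistic for w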
Figure 4.38. Short leaf pine: fan plot of score statistics for transforming both sides
of the conical model. The logarithmic transformation is indicated. There are no
influential observations
The trees are arranged in the table from small to large, so that one
indication of a systematic failure of a model would be the presence of
anomalies relating to the smallest or largest observations. To investigate
transformations for these data we use the conical model (4.21) with six
transformations: the usual five values plus λ = 1/3, which had a special
interpretation in (4.19). Figure 4.38 is a fan plot of the score statistics
which, unlike the other plots in this chapter, uses the constructed variable
w_BS defined in (4.27). The forward search orders the residuals q − q̂ = y^λ − η̂^λ.
The plot shows that the log transformation is supported by all the data. All
other values are rejected, including 1/3, which has no special dimensional
significance when both sides are transformed. The smooth curves in the
plot do not reveal any highly influential observations.
The forward plot of residuals from the log transformation is Figure 4.39.
The pattern of residuals is very stable, with four slightly large residuals
throughout, the largest belonging to observation 53, which is the last to be
included in the forward search. The resulting model is of the form

log y − log(x₁²x₂) = δ + ε.

Our analysis shows no evidence of any departure from this model.
There is a long history of the use of such models in forest mensuration.
Spurr (1952) gives 1804 as the date of the first construction of a table
relating volume to diameter and height. The definitive description of the
logarithmic formula found here by statistical means is by Schumacher and
Hall (1933), who analyze data for nine species. Bruce and Schumacher
(1935) give, in part, an introduction to multiple regression for workers in
forestry based on equations for tree volume, especially the logarithmic one
found here. The book discusses in detail many of the difficulties that arise
in trying to establish such equations.
Figure 4.39. Short leaf pine: forward plot of the scaled residuals from the log-
arithmic model when both sides are transformed. A very stable pattern of
residuals
One difficulty is that trees change shape over their lifetimes. The trunk
of a young tree may be nearly conical, but a mature pine under certain con-
ditions is virtually cylindrical. The parameter δ in (4.24) will then change
with age and so with tree size. There is no evidence of any such drift
here: for the logarithmic transformation large and small observations en-
ter throughout the forward search. Only for untransformed data does the
largest tree enter last. Another difficulty arises in the measurement of the
volume of the trunk of each tree, which is often not a smooth geometric
shape but may be highly irregular, as are the trunks of many European
oaks. Even a conical trunk will have to be truncated as there will be a
minimum diameter for the provision of useful timber. Furthermore, how
should the trees for measurement be sampled?
These problems were also discussed by Spurr (1952) who was reduced to
the defeatist position that the problems can only be avoided by modelling
stands of single species trees all of the same age. Hakkila (1989) stresses
that there is more to trees than trunks, particularly if all woody material
is to be used for paper or fuel chips. Hakkila's plot (p. 16) of the dry mass
of Appalachian hardwood trees against the square of diameter at breast
height shows the need for the variance stabilizing effect of the logarithmic
transformation. The collection of papers edited by Ranneby (1982) contains
survey papers on forest biometry and on the errors in prediction arising
from estimated volume residuals.
Developments in statistical methodology for the models considered here
are presented by Fairley (1986) and Shih (1993), who discusses deletion
diagnostics for the transform both sides model.
4.14 Exercises
Exercise 4.1 Given a sample of observations for which

var(Yᵢ) ∝ {E(Yᵢ)}^{2α} = μ^{2α},

use a Taylor series expansion to find a variance stabilizing transformation
g(y) such that var{g(Yᵢ)} is approximately constant. What happens when
α = 1 (§4.2)?
Exercise 4.2 Find the Jacobian (4.4) for the power transformation (4.3).
The physical dimension of the sample mean is the same as that of an ob-
servation. Use a dimensional argument to justify comparison of R(λ) for
different λ (§4.2).
Exercise 4.3 Derive the expression for w(λ) (4.11). Explain why the nor-
mal equations of linear least squares lead to the simplification of w(λ) in
(4.14). Verify that z(0) is as given in (4.1) and find w(0) (§4.2).
Exercise 4.4 The folded power transformation is defined as

y(λ) = {y^λ − (1 − y)^λ}/λ,   0 ≤ y ≤ 1.   (4.28)

See what happens when λ → 0, obtain the normalized form and find the
constructed variable for the transformation when λ = 1 and 0.
For what kind of data would this transformation be suitable? What
happens for data near 0 or near 1 (§4.2)?
Exercise 4.5 Suggest a transformation for percentages and describe its
properties (§4.2).
Exercise 4.6 The fan plot of Figure 4.4 shows distinct related patterns
at m = 10 and m = 24. What kind of observation causes each of these
patterns (§4.3)?
Exercise 4.7 Analyze the wool data using a second-order model and the
"standard" five values of λ. For each λ obtain the QQ plot of the residuals,
the plot of residuals against fitted values and the constructed variable plot
for the transformation. What transformation is indicated?
How does the F test for the second-order terms change with λ (§4.3)?
Exercise 4.8 The poison data have four observations at each combination
of factors, so that an estimate of σ² can be calculated from the within cells
sum of squares. Use this estimate to calculate the lack of fit sum of squares
for the untransformed data. How does the F test for lack of fit vary with λ
(§4.3)?
Exercise 4.9 Table 3.3 gave some demographic data about 49 countries
taken from Gunst and Mason (1980, p. 358). In Exercise 3.1 you were asked
to find the most important explanatory variables for the demographic data.
Repeat your model building exercise with y^{−0.5} as the response. Compare
the answers to those you obtained earlier (§4.8).
Exercise 4.10 Figure 3.44 showed plots of leverages for the demographic
data and Figure 3.45 showed how the leverage points were generated by the
values of x₃ and x₄. Construct a leverage plot using only variables 1, 5 and
6. What are the units with the largest leverage? If the data are analyzed
with response y^{−0.5}, what is the effect of these units on R² and on the t
statistic for x₆ (§4.8)?
4.15 Solutions
Exercise 4.1
Using a first-order Taylor expansion about μ,

g(Yᵢ) ≈ g(μ) + (Yᵢ − μ)g′(μ).

Consequently

var{g(Yᵢ)} ≈ {g′(μ)}² var(Yᵢ) ≈ {g′(μ)}² μ^{2α}.

Now for var{g(Yᵢ)} to be approximately constant, g(Yᵢ) must be chosen so
that

g′(μ) ∝ μ^{−α}.

So that, on integration,

g(y) ∝ y^{1−α}   if α ≠ 1
g(y) ∝ log y   if α = 1,

since the constant does not matter. For example, if the standard deviation
of a variable is proportional to the mean (α = 1) a logarithmic transfor-
mation (the base is irrelevant) will give a constant variance. If the variance
is proportional to the mean (α = 1/2), the square root transformation will
give a constant variance and so on. Table 4.5 reports the transformation
required to stabilize the variance for different values of α.

Table 4.5. Transformations stabilizing the variance for different values of α

α      var(Y)   Transformation
0      k        y (none)
1/2    kμ       √y
1      kμ²      log y
3/2    kμ³      1/√y
2      kμ⁴      1/y
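A small simulation (ours, not the book's) makes the stabilization concrete for α = 1, where the standard deviation is proportional to the mean:

    import numpy as np

    rng = np.random.default_rng(0)
    for mu in (1.0, 10.0, 100.0):
        # sd proportional to the mean (alpha = 1); 0.1*mu keeps y positive
        y = rng.normal(mu, 0.1 * mu, size=100_000)
        # sd of y grows with mu, but sd of log(y) stays close to 0.1
        print(mu, y.std(), np.log(y).std())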
Exercise 4.2
The Jacobian of the transformation is the determinant of the matrix of
partial derivatives ∂yᵢ(λ)/∂yⱼ, which is diagonal with

∂yᵢ(λ)/∂yᵢ = yᵢ^{λ−1},   ∂yᵢ(λ)/∂yⱼ = 0   (i ≠ j).

So J = ∏_{i=1}^n yᵢ^{λ−1} = ẏ^{n(λ−1)}.
For linear models including a constant we can ignore the −1 in the nu-
merator of z(λ). We also ignore the λ in the denominator and consider
only the dimension of y^λ/ẏ^{λ−1}. The geometric mean has the same dimen-
sion as the arithmetic mean and as y, so the dimension of z(λ) is that of y.
The same is true for z(0) since changing the scale of measurement of the y
merely adds a constant to this z. Therefore the response in the regression
model has the dimension of y whatever the value of λ. Sums of squares can
therefore be directly compared.
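In code, the dimensional argument amounts to computing R(λ) from the normalized response z(λ); a minimal sketch (the name boxcox_rss is ours), assuming a positive response y and a design matrix X of full column rank that includes the constant:

    import numpy as np

    def boxcox_rss(y, X, lam):
        # residual sum of squares of the normalized Box-Cox response z(lambda);
        # normalization by the geometric mean makes R(lambda) comparable
        ydot = np.exp(np.mean(np.log(y)))
        if lam != 0:
            z = (y**lam - 1) / (lam * ydot**(lam - 1))
        else:
            z = ydot * np.log(y)
        rss = np.linalg.lstsq(X, z, rcond=None)[1]
        return rss[0]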
See also Bickel and Doksum (1981) and Box and Cox (1982).
Exercise 4.3
To verify the expression for z(0) requires the use of l'Hôpital's rule, which
we exemplify for the limit of w(λ), a similar, but more complicated, opera-
tion. To find w(0) we rewrite equation (4.29) in a form that allows the use
of l'Hôpital's rule. We obtain

dz(λ)/dλ = {λy^λ log y − (y^λ − 1) − λ log ẏ (y^λ − 1)}/(λ²ẏ^{λ−1}).

Application of l'Hôpital's rule yields

w(0) = lim_{λ→0} {log y (y^λ + λy^λ log y) − y^λ log y − log ẏ (y^λ − 1) − λy^λ log ẏ log y}/(2λẏ^{λ−1} + λ²ẏ^{λ−1} log ẏ).

Dividing the numerator and denominator by λ we obtain

w(0) = lim_{λ→0} {y^λ log² y − λ^{−1}(y^λ − 1) log ẏ − y^λ log ẏ log y}/(2ẏ^{λ−1} + λẏ^{λ−1} log ẏ).

Now letting λ → 0

w(0) = ẏ(0.5 log² y − log ẏ log y)
     = ẏ log y (0.5 log y − log ẏ).
Exercise 4.4
When λ → 0 applying l'Hôpital's rule shows that the folded power
transformation reduces to the logit transformation:

y(0) = log{y/(1 − y)}.

In order to obtain the normalized version we must divide (4.28) by J^{1/n}.
In this case the Jacobian is

J = ∏_{i=1}^n {yᵢ^{λ−1} + (1 − yᵢ)^{λ−1}}.

The normalized response variable is thus

z(λ) = {y^λ − (1 − y)^λ}/{λG(λ)}   (λ ≠ 0)
z(0) = G^{−1}(0) log{y/(1 − y)}   (λ = 0),

where G(λ) = J^{1/n} and, writing gᵢ = yᵢ^{λ−1} + (1 − yᵢ)^{λ−1},

Q = (1/n) Σ_{i=1}^n (1/gᵢ) ∂gᵢ/∂λ.

The constructed variables for λ = 1 and λ = 0 then follow, as in Exercise
4.3, by differentiation of z(λ) with respect to λ.
Exercise 4.5
One possibility is a "folded" transformation, similar to that in Exercise 4.4,
but now

y(λ) = {y^λ − (100 − y)^λ}/λ,   0 ≤ y ≤ 100.   (4.31)
Exercise 4.6
At m = 10 observation 24 enters, a relatively small observation that has
its largest effect on the curve for λ = −1. Conversely, when m = 24,
observation 22 enters, a large observation having the greatest effect on the
plot for λ = 1.
Exercise 4.7
The plots of residuals against fitted values in Figure 4.40 for λ = 1 and
Figure 4.41 for λ = 0 suggest the log transformation, although the QQ
plot of residuals for λ = 0 is less good than that in Figure 4.2 for the
first-order model. The constructed variable plots in Figure 4.42 not only
indicate rejection of λ = 1, but also suggest a value slightly below zero for
λ̂.
[Figure 4.40. Wool data, λ = 1: residuals against predicted values and QQ plot of residuals against quantiles of the standard normal.]
[Figure 4.41. Wool data, λ = 0: residuals against predicted values and QQ plot of residuals against quantiles of the standard normal.]
[Figure 4.42. Wool data: constructed variable plots for λ = 1 and λ = 0.]
Table 4.6. Poison data: F test for lack of fit for different values of λ

λ       F test
1       1.87
0.5     1.62
0       1.22
−0.5    0.92
−1      1.09
Exercise 4.8
Table 4.6 indicates that, for the poison data, the lack of fit is smallest
between λ = −0.5 and λ = −1. The maximum likelihood estimate is −0.75.
However the test has very low power, not rejecting λ = 1. More information
would be found by plotting the individual estimates of σ² from the 12 cells
and looking for patterns that change with λ.
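A sketch of the lack of fit computation with replicated cells (ours; y are the responses, cells the factor-combination labels, fitted the model fit and p the number of parameters):

    import numpy as np

    def lack_of_fit_F(y, cells, fitted, p):
        # pure error from within-cell variation; lack of fit by difference
        y, cells, fitted = map(np.asarray, (y, cells, fitted))
        groups = np.unique(cells)
        n, c = len(y), len(groups)
        sse = np.sum((y - fitted)**2)                       # residual SS
        sspe = sum(np.sum((y[cells == g] - y[cells == g].mean())**2)
                   for g in groups)                         # pure error SS
        return ((sse - sspe) / (c - p)) / (sspe / (n - c))  # F on (c-p, n-c) df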
Exercise 4.9
Table 4.7 shows that variable selection for the demographic data depends on
the transformation used. While there can be no universally best procedure,
it is advisable to find a good transformation before removing variables from
the model. Any model believed to be final should be checked as to whether
the transformation still holds.
Exercise 4.10
The units that have a leverage much bigger than the others are 20 and
46. If they are removed R² falls from 0.793 to 0.691 and t₆ = −1.7333
becomes nonsignificant.
[Table 4.7. Demographic data: explanatory variables included in models 1 to 4.]

5
Nonlinear Least Squares
In this chapter we extend our methods based on the forward search to re-
gression models that are nonlinear in the parameters. Estimation is still by
least squares although now iterative methods have to be used to find the
parameter values minimizing the residual sum of squares. Even with nor-
mally distributed errors, the parameter estimates are not exactly normally
distributed and contours of the sum of squares surfaces are not exactly
ellipsoidal. The consequent inferential problems are usually solved by lin-
earization of the model by Taylor series expansion, in effect ignoring the
nonlinear aspects of the problem. The next section gives an outline of this
material, booklength treatments of which are given by Bates and Watts
(1988) and by Seber and Wild (1989). Both books describe the use of
curvature measures to assess the effect of nonlinearity on approximate in-
ferences using the linearized model. Since we find it informative to monitor
measures of curvature during the forward search, we present a summary
of the theory in §5.1.2. Ratkowsky (1983) uses measures of curvature to
find parameter transformations that reduce curvature and so improve the
performance of nonlinear least squares fitting routines.
There follows some material more specific to the forward search. We
briefly touch on parameter estimation, since parameters have continually to
be updated during the forward search. We then outline differences between
the forward search for linear and nonlinear models. These differences are
not so much in the search itself as in the calculation of the quantities, such
as deletion residuals, which we monitor.
The examples begin with one in which inference is virtually indistinguish-
able from that for a linear model and move through a series of examples in
5.1 Background
5.1.1 Nonlinear Models
In this section we describe nonlinear regression models and give some
examples. We compare and contrast linear and nonlinear models.
The model for the ith of the n observations was written in (2.2) as

yᵢ = η(xᵢ, β) + εᵢ.   (5.1)

For the linear models of Chapter 2 we could then write

η(xᵢ, β) = xᵢᵀβ = β₀ + Σ_{j=1}^{p−1} βⱼxᵢⱼ,   (5.2)

models that are linear in the parameters β. So, for example,

η(xᵢ, β) = β₀ + β₁xᵢ + β₂xᵢ²

is a linear model, as is any polynomial in x. On the other hand, the
exponential model

η(xᵢ, β) = β₁e^{β₂xᵢ}   (5.3)

is nonlinear in β₂. If the errors εᵢ, with

E(εᵢεⱼ) = σ²   (i = j),   E(εᵢεⱼ) = 0   (i ≠ j),   (5.4)

are normally distributed and are also additive, as in (5.1), the maximum
likelihood estimates of the two parameters in the model (5.3) are also the
least squares estimates minimizing

S(β) = Σ_{i=1}^n {yᵢ − β₁e^{β₂xᵢ}}².   (5.5)
For linear models, differentiation of this expression yields the linear normal
equations (2.6), which can be solved explicitly to give the estimates β̂. But
for nonlinear models, differentiation leads to sets of nonlinear equations,
which require iterative numerical methods for their solution. As an example,
the nonlinear model (5.3) yields the pair of equations

Σ_{i=1}^n e^{β̂₂xᵢ}{yᵢ − β̂₁e^{β̂₂xᵢ}} = 0
Σ_{i=1}^n β̂₁xᵢe^{β̂₂xᵢ}{yᵢ − β̂₁e^{β̂₂xᵢ}} = 0.   (5.6)

The equations are thus linear in β̂₁, which occurs linearly in (5.3), but
nonlinear in β̂₂. Numerical solution of such equations is, in general, not
appreciably easier than minimization of the sum of squares function (5.5).
However it is not the difficulty in numerical calculation of least squares es-
timates that is the most important difference between linear and nonlinear
least squares. It is the lack of exact inferences even when the errors are
normally distributed.
For linear least squares the estimates β̂ (2.7) are linear functions of the
observations. If the errors are normally distributed, so are the estimates,
with the consequences of t tests for individual parameters, F tests for
groups of parameters and ellipsoidal confidence regions in the parameter
space of the form

(β − β̂)ᵀXᵀX(β − β̂) ≤ ps²F_{p,ν,1−α},   (5.7)

where s² is an estimate of σ² on ν degrees of freedom. The ellipsoidal shape
of these regions is a consequence of the ellipsoidal contours of the sum of
squares surface.
For nonlinear least squares, explicit formulae cannot usually be found
for the parameter estimates. The estimates are not linear combinations of
the observations, so that they will not be exactly normally distributed,
even if the observations are. Any distributional results for test statistics
will therefore only be asymptotic, increasing in accuracy as the number of
observations increases. In addition, the sum of squares contours may be far
from ellipsoidal. Such contours are often described as being banana shaped,
but some of the figures in Chapter 6 of Bates and Watts (1988) are even
worse than this, showing regions that extend to infinity in one direction.
To find confidence regions based on these contours is computationally com-
plicated and is not usually attempted. Instead inference is made using a
For the linear model ηᵢ = xᵢᵀβ, the partial derivatives fᵢ are equal to xᵢ
and the procedures of Chapter 2 are obtained (Exercise 5.1).
It is convenient to write the linearized model (5.8) in matrix form. If we
let

zᵢ⁰ = yᵢ − η(xᵢ, β⁰)   and   γ⁰ = β − β⁰,   (5.10)

the linearized model is

Z⁰ = F⁰γ⁰ + ε,   (5.11)

where F⁰ has ith row fᵢ⁰ᵀ and Z⁰ = (z₁⁰, ..., z_n⁰)ᵀ,
a vector of random variables. The superscripts in (5.11) emphasize the de-
pendence of the linearized extended design matrix F on the parameter value
used in the linearization. As we show, this dependence can produce unex-
pected results when the parameter estimate changes due to the introduction
of outliers in the forward search.
The linearized model (5.11) suggests the Gauss-Newton method for find-
ing the least squares parameter estimates β̂ by iteratively solving for the
least squares estimate in the linearized model and updating, giving the
iteration

γ̂ᵏ = (FᵏᵀFᵏ)^{−1}FᵏᵀZᵏ   and   βᵏ⁺¹ = βᵏ + γ̂ᵏ,   k = 0, 1, ....   (5.12)

Convergence occurs when γ̂ᵏ⁺¹ is below some tolerance. However, like other
numerical procedures based on Newton's method, (5.12) may diverge. A
brief description of some algorithms for nonlinear least squares is given in
the next section.
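A minimal sketch of the iteration (5.12) for the exponential model (5.3), in Python (gauss_newton is our name; a production routine would add the step control described in the next section):

    import numpy as np

    def gauss_newton(x, y, beta, tol=1e-8, max_iter=100):
        # iterate (5.12) for eta = b1 * exp(b2 * x)
        for _ in range(max_iter):
            b1, b2 = beta
            e = np.exp(b2 * x)
            z = y - b1 * e                                # working residuals
            F = np.column_stack([e, b1 * x * e])          # d eta/d b1, d eta/d b2
            gamma = np.linalg.lstsq(F, z, rcond=None)[0]  # LS step, linearized model
            beta = beta + gamma
            if np.linalg.norm(gamma) < tol:
                break
        return beta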
As for the linear model of previous chapters, the least squares parameter
estimate β̂ gives a vector of residuals e and an estimate s² of σ². We
denote the extended design matrix for the linearized model by F̂ so that
the asymptotic variance covariance matrix of β̂ from the linearized model
is

var(β̂) ≈ σ²(F̂ᵀF̂)^{−1},   (5.13)

estimated by

s²(F̂ᵀF̂)^{−1},   (5.14)

giving the approximate 100(1 − α)% confidence region

(β − β̂)ᵀF̂ᵀF̂(β − β̂) ≤ ps²F_{p,ν,1−α}.   (5.15)
There are two approximations involved in (5.15). The first is that the ellip-
soidal confidence regions that it generates may not follow closely the sum
of squares contours given by the likelihood region (5.7). The second is that
the content of the region may not be exactly 100(1 - a)%.
A geometrical interpretation of the difference between linear and non-
linear least squares is helpful in interpreting the results on curvature in
the next section. If the vector observation y is considered as a point in n-
dimensional space, a linear model with p explanatory variables defines the
expectation plane, a p-dimensional subspace of this n-dimensional space.
Least squares estimation finds the nearest point on this subspace to y. The
least squares estimate β̂ is therefore the foot of the perpendicular from y
to the expectation plane. Confidence regions for β are the locus of points
where the distance from y to the plane is constant. This distance is of
course greater than the perpendicular distance to β̂. Such loci are formed
by the intersection of a cone and the plane and are circular. As Figure 5.1
indicates, the lines of constant parameter values on the expectation plane
are in general neither perpendicular nor do they have the same scale in the
different parameters. The circular intersection of the cone and subspace
therefore becomes ellipsoidal in the space of the parameters.
For nonlinear models changing parameter values likewise generate an ex-
pectation surface, but this is no longer planar. The least squares estimate is
again the point on the expectation surface nearest to y. The linearized fitted
model from iteration of (5.11) is E(Z) = F̂γ, the linearization producing
a planar approximation to the expectation surface which is the tangent
plane to the surface at β̂. The approximate confidence region for β from
the linearization (5.15) is the locus of points formed by the intersection of
the cone of constant distances from ŷ on this tangent plane. The contours
of constant increase of sum of squares that form the likelihood-based confi-
dence region are more difficult to calculate. They consist of the intersection
of lines of constant length from y to the true expectation surface. In general
the resulting intersection will not lie in a linear subspace. It may well have
yᵢ = β₁e^{β₂xᵢ}εᵢ.   (5.17)

On taking logs we obtain

log yᵢ = log β₁ + β₂xᵢ + εᵢ′,   (5.18)

where εᵢ′ = log εᵢ.
5.1.2 Curvature
Before giving a mathematical definition of curvature for nonlinear models
we show examples of the expectation surface which was introduced at the
142 5. Nonlinear Least Squares
Figure 5.1. Linear model: portion of the expectation plane in the response space,
with lines generated by some values of the parameters β₁ and β₂

end of the previous section. The function η(x, β) when β varies forms a
p-dimensional expectation surface in the n-dimensional response space. If
η(x, β) = Xβ this surface is a linear subspace of the response space. For lin-
ear models the expectation surface is often called the "expectation plane."
For example, consider a linear model with just three cases and suppose
that the design matrix X is

X = ( 1  0.26
      1  1.78
      1  2.00 ).   (5.19)
curve and the unequal spacing of the values on the expectation surface. It
is interesting to analyze what happens when we reparameterize the model.
If we set τ = log₁₀ β, equation (5.20) can be rewritten as

η(xᵢ, τ) = 60 + 70e^{−xᵢ10^τ}.
Figure 5.3 shows the plot of the new expectation surface after the repa-
rameterization. The expectation curve is identical to that of Figure 5.2, but
now the spacing of the values of T is much more uniform in the centre of the
expectation surface. This simple example shows the different characteris-
tics of the two aspects of curvature: the curving of the expectation surface,
which does not depend on the parameterization used (this aspect is called
"intrinsic curvature"), and the second aspect, which reflects how equally
spaced values in the parameter space map to unequally spaced values in
the response space. This second aspect depends on the parameterization
used and therefore is called "parameter effects curvature." If intrinsic cur-
vature is high the model is highly nonlinear and the linear tangent plane
approximation is not appropriate. High parameter effects curvature, on the
other hand, can often be corrected by an appropriate reparameterization
of the model.
In the previous section we obtained linear approximation inference re-
gions from a first-order Taylor series approximation to the expectation
surface evaluated at β̂. Geometrically equation (5.15) assumes that around
β̂ we can replace the expectation surface by the tangent plane. This local
approximation is appropriate only if η(x, β) is fairly flat in that neighbour-
hood, which in turn is true only if in the region of interest straight, parallel,
equispaced lines in the parameter space map into nearly straight, parallel,
equispaced lines in the expectation surface. To determine how planar is
the expectation surface and how uniform the parameter lines are on the
tangent plane we can use second derivatives of the expectation function.
If η(β) and β are one-dimensional, then the first derivative of η(β) gives
the slope of the curve, while the second derivative gives the rate of change
of the curve, which is related to the idea of curvature. Intuitively, since in
linear models second- and higher-order derivatives are zero, it seems logical
to measure nonlinearity by investigating the second-order derivatives of the
expectation function.
More formally, if β is close to β̂ we have the quadratic approximation

η(xᵢ, β) − η(xᵢ, β̂) ≈ (β − β̂)ᵀf̂ᵢ + ½(β − β̂)ᵀF̈ᵢ(β − β̂),   (5.21)

where

f̂ᵢ = ∂η(xᵢ, β)/∂β |_{β=β̂}.

Likewise F̈ᵢ is a p × p symmetric matrix of second derivatives, with element
r, s for the ith observation defined as

f̈ᵢᵣₛ = ∂²η(xᵢ, β)/∂βᵣ∂βₛ,   r, s = 1, ..., p.

If we let bᵣ = βᵣ − β̂ᵣ, equation (5.21) can be rewritten as

η(xᵢ, β) − η(xᵢ, β̂) ≈ (β − β̂)ᵀf̂ᵢ + ½ Σ_{r=1}^p Σ_{s=1}^p bᵣbₛf̈ᵢᵣₛ.   (5.22)
Bates and Watts (1988) call the vectors f̂ᵣ "velocity vectors" because they
give the rate of change of η with respect to each parameter. As a conse-
quence, the vectors f̈ᵣₛ are called acceleration vectors because they give
the rate of change of the velocity vectors with respect to the parameters.
From the first-order Taylor series expansion used in the preceding section
we know that the velocity vectors form the tangent plane to the expecta-
tion surface at the point β̂. The validity of the tangent plane approximation
depends on the relative magnitude of the elements of the vectors f̈, which
contain the quadratic terms, to the velocity vectors, which contain the
linear terms. To assess this magnitude, the acceleration vectors can use-
fully be divided into two parts. One, f̈ᵀᵣₛ, lies in the tangent plane and is
informative about parameter-effects curvature. The other part of the accel-
eration vectors, f̈ᴺᵣₛ, is normal to the tangent plane. The division uses the
projectors

F̂F̂⁺ = F̂(F̂ᵀF̂)^{−1}F̂ᵀ   and   I_n − F̂F̂⁺

to split the n × 1 vectors f̈ᵣₛ that contain the quadratic terms into the two
orthogonal vectors

f̈ᵀᵣₛ = F̂F̂⁺f̈ᵣₛ   (5.24)
f̈ᴺᵣₛ = (I_n − F̂F̂⁺)f̈ᵣₛ,   (5.25)

with, of course,

f̈ᵣₛ = f̈ᵀᵣₛ + f̈ᴺᵣₛ.

The projection on the tangent plane is given by f̈ᵀᵣₛ, whereas f̈ᴺᵣₛ is normal
to the tangent plane.
The extent to which the acceleration vectors lie outside the tangent plane
measures the degree of deviation of the expectation surface from a plane
and therefore the nonplanarity of the expectation surface. In other words,
the vectors f̈ᴺᵣₛ measure the intrinsic nonlinearity of the expectation surface,
which is independent of the parameterization used. The projections of the
acceleration vectors in the tangent plane (f̈ᵀᵣₛ) measure the degree of non-
uniformity of the parameter lines on the tangent plane and so depend on
the parameterization used.
In order to evaluate the parameter effects curvature and the intrinsic
curvature we can use the ratios

‖Σ_{r=1}^p Σ_{s=1}^p f̈ᵀᵣₛbᵣbₛ‖ / ‖Σ_{r=1}^p f̂ᵣbᵣ‖²   (5.26)

‖Σ_{r=1}^p Σ_{s=1}^p f̈ᴺᵣₛbᵣbₛ‖ / ‖Σ_{r=1}^p f̂ᵣbᵣ‖²,   (5.27)
where by the symbol ‖z‖ we mean the Euclidean norm of z; that is, the
square root of the sum of squares of the elements of z: ‖z‖ = √(Σ_{i=1}^n zᵢ²).
If we want to measure the curvatures in a direction specified by some
vector h = (h₁, ..., h_p)ᵀ we can replace bᵣ by hᵣ (r = 1, 2, ..., p) in
equations (5.26) and (5.27). These curvatures can be standardized to be
comparable between different models and sets of data, using the dimen-
sions of the derivatives. Both f̂ᵣ and f̈ᵣₛ have the same dimension as the
response, so the numerators of the curvature measures are of dimension
response and the denominators of dimension (response)². The curvatures
are therefore measured in units of (response)^{−1} and may be made scale free
through multiplication by the factor s.
It is possible to show that the geometric interpretation of intrinsic cur-
vature is as the reciprocal of the radius of the hypersphere that best
approximates the expectation surface in the direction h. Given that the
sum of squares contour {y − η(X, β)}ᵀ{y − η(X, β)} bounding a nom-
inal 1 − α region in the tangent plane coordinates is a hypersphere of
radius √(ps²F_{p,ν,1−α}), multiplication of the curvatures in equations (5.26)
and (5.27) by the factor s√p gives values that can be compared with the
percentage points of 1/√F_{p,ν,1−α}.
Bates and Watts (1988, Ch. 7) suggest maximizing the two curvature
measures with respect to h, rescaling them by the factor s√p and then
comparing the obtained values with the percentage points of 1/√F_{p,ν,1−α}.
The maximizing values of h are found by numerical search as described by
Bates and Watts (1980, §2.5).
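The ratios (5.26) and (5.27) can be evaluated for a given direction h by finite differences; a hedged Python sketch (curvatures and its arguments are our names; a full implementation would maximize over h as Bates and Watts describe):

    import numpy as np

    def curvatures(eta, beta, h, eps=1e-5):
        # eta(beta) returns the n-vector of model values; beta, h are arrays
        p = len(beta)
        I = np.eye(p)
        # velocity vectors f_r by central differences (n x p matrix)
        f = np.column_stack([(eta(beta + eps*I[r]) - eta(beta - eps*I[r])) / (2*eps)
                             for r in range(p)])
        # directional acceleration sum_r sum_s f_rs h_r h_s
        acc = (eta(beta + eps*h) - 2*eta(beta) + eta(beta - eps*h)) / eps**2
        vel = f @ h                             # sum_r f_r h_r
        P = f @ np.linalg.solve(f.T @ f, f.T)   # projector onto the tangent plane
        acc_T = P @ acc                         # tangential (parameter effects) part
        acc_N = acc - acc_T                     # normal (intrinsic) part
        denom = vel @ vel
        return np.linalg.norm(acc_T) / denom, np.linalg.norm(acc_N) / denom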
During the forward search, in order to evaluate the degree of curvature
of the model we monitor the quantities

γ̂ᵀ_max = max_h s√p ‖Σ_{r=1}^p Σ_{s=1}^p f̈ᵀᵣₛhᵣhₛ‖ / ‖Σ_{r=1}^p f̂ᵣhᵣ‖²   (5.28)

and the corresponding intrinsic measure γ̂ᴺ_max, defined in the same way
with f̈ᴺᵣₛ.
If however S(βᵏ⁺¹) > S(βᵏ), return to βᵏ and repeat the step (5.33) with
νλₖ, ν²λₖ, etc. (ν > 1), until improvement occurs, which it must for λ
sufficiently large, unless a minimum has been reached. A value of two is
often used for ν.
A difficulty in the application of this algorithm is that a line search
involving αₖ can be included for any λₖ. A strategy that seems to work
well is to start with the full step, αₖ = 1. If this gives an increase in the sum
of squares, αₖ can be reduced and the sum of squares recalculated. If several
reductions in αₖ fail to bracket the minimum, λₖ should be increased and
the process repeated from the full step length. Successful searches should
lead to a decrease in the value of λ.
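A hedged sketch of the damped iteration just described (our names; eta and jac are user-supplied functions returning the model vector and the n × p derivative matrix F):

    import numpy as np

    def levenberg_marquardt(eta, jac, y, beta, damp=1e-3, nu=2.0, n_iter=50):
        S = np.sum((y - eta(beta))**2)
        for _ in range(n_iter):
            F, z = jac(beta), y - eta(beta)
            p = F.shape[1]
            while True:
                step = np.linalg.solve(F.T @ F + damp * np.eye(p), F.T @ z)
                S_new = np.sum((y - eta(beta + step))**2)
                if S_new < S:
                    break
                damp *= nu                 # failed step: increase the damping
                if damp > 1e12:            # no improvement: near a minimum
                    return beta
            beta, S = beta + step, S_new
            damp = max(damp / nu, 1e-12)   # successful step: decrease the damping
        return beta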
A general method like this can be combined with methods taking ad-
vantage of any special structure in the model. Sometimes it is possible to
partition the parameters into a group that occur linearly in the model and
a group that occur nonlinearly. Numerical search is then only necessary
in the lower-dimensional space of the nonlinear parameters. For example,
the model (5.3) is such that, for known values of β₂, the model is linear
in β₁. The parameter estimates can then be found by a numerical search
over values of β₂, the corresponding value of β₁ being found by solution of
a linear least squares problem.
Our numerical requirements fall into two parts. It is not always easy to
achieve convergence of numerical methods for the randomly selected subsets
of p observations, one of which provides the starting point for the forward
search. We attack this problem by brute force, using in succession the
numerical optimization algorithms provided by GAUSS until one of them
yields convergence. As a first algorithm we use steepest descent, followed
by several quasi-Newton algorithms and finishing with a form of conjugate
gradient algorithm.
Once the forward search is under way we order the observations from the
subset of size m using the linear approximation at the parameter estimate
β̂*_m. We then use this estimate as a starting point in finding β̂*_{m+1}, the
estimate for the subset of size m + 1.
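In outline, the search loop might look as follows (a sketch only; fit and predict are user-supplied callables, x and y are numpy arrays, and the names are ours):

    import numpy as np

    def forward_search_nls(x, y, initial, fit, predict):
        # order observations into the fit, warm-starting each nonlinear fit
        # at the estimate from the previous subset
        n = len(y)
        beta = fit(x[initial], y[initial], start=None)
        estimates = [(len(initial), beta)]
        for m in range(len(initial), n):
            res2 = (y - predict(x, beta))**2
            subset = np.argsort(res2)[:m + 1]             # m+1 closest observations
            beta = fit(x[subset], y[subset], start=beta)  # warm start
            estimates.append((m + 1, beta))
        return estimates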
where the total sum of squares is

S_T = Σ_{i=1}^n yᵢ².

A value near one indicates that a large proportion of the total sum of
squares has been explained by the nonlinear regression.
For the detection of outliers and influential observations we again monitor
residuals and leverages, but there are now some differences in the method of
calculation. A consequence of the dependence of the design matrix F̂_{S(m)} on
the parameter estimate β̂*_m is that some of the deletion results used in §2.6.5
in the derivation of forward deletion formulae now hold only approximately.
As before we denote by S(m) the subset of size m used for parameter
estimation in the forward search. The parameter estimate is β̂*_m and the
design matrix F̂_{S(m)}, with ith row f̂ᵢᵀ_{S(m)}. The leverage is then written

hᵢ,S(m) = f̂ᵢᵀ_{S(m)}(F̂ᵀ_{S(m)}F̂_{S(m)})^{−1}f̂ᵢ,S(m).   (5.36)
5.3 Radioactivity and Molar Concentration of Nifedipene

[Figure 5.4. Molar data: forward plot of scaled residuals against subset size m; the trace for observation 12 is labelled.]
Bates and Watts (1988, p. 306, 307) give data relating molar concentra-
tion of nifedipene (NIF) to radioactivity counts yᵢ in rat heart tissue tagged
with radioactive nitrendipene. They propose fitting the four-parameter
logistic model

η(xᵢ, β) = β₁ + β₂/[1 + exp{β₄(xᵢ − β₃)}],   (5.41)

where

xᵢ = log₁₀(NIF concentration).

We follow St Laurent and Cook (1993) and consider only the 16 observa-
tions for tissue sample 2. These data are in Table A.11. The data consist
of replicate observations at eight values of x. For observations 1 and 2 the
NIF concentration is zero, so that xᵢ = −∞. Again following St Laurent
and Cook we take the value as −27 in our calculations, a value also used
in our plot of the data in Figure 5.7. Then, provided β₄ > 0, (5.41) for
these observations becomes η(xᵢ, β) = β₁ + β₂. These two observations give
values of yᵢ around 5,000. The minimum value, observation 15, is 1,433.
Thus, although the data are counts, there should be no problem in treating
them as continuous observations.
Figure 5.4 shows a forward plot of the scaled residuals from an initial
subset consisting of observations 5, 7, 9 and 13. These residuals are remark-
ably stable, with little indication of any outliers. Observation 12 has the
largest residual and is the last to enter the forward search, but its residual
is hardly changed by its inclusion; this observation seems to agree with the
model fitted to the remaining data. The forward plot of leverages, which we
Figure 5.5. Molar data: forward plots of (left) scaled estimated beta coefficients
and (right) t statistics
do not give here, also fails to reveal any observations appreciably different
from the majority of the data.
The indication of the lack of several influential observations is supported
by the plot in Figure 5.5, which shows the parameter estimates and their
associated t statistics during the forward search. Given that the coefficients
have very different scales, we have divided each curve by its maximum
(minimum for β₃ because its values are negative). From Figure 5.5(left),
the estimates of β₁, β₂ and β₃ can be seen to be virtually constant. The
inclusion of observation 12 seems to have a nonnegligible effect only on
the estimate of β₄. The plots of the t statistics show the declining shape,
without appreciable jumps, that follows from the stability of the parameter
estimates coupled with the increase in the estimate s² during the forward
search. The effect of the inclusion of observation 12 is to halve the values
of the first three t statistics and to reduce t₄ from 3.25 to 2.2, a value no
longer significant at the 1% level.
Further information about the effect of observation 12 is given by Figure
5.6(left), which shows forward plots of the maximum studentized residual
amongst the observations in the subset and, in the right panel, the value
of s². Both curves show an increase when observation 12 is introduced.
The studentized residual achieves its maximum value and there is a large
increase in the value of s². These plots both support the conclusion from
the others that there is no complicated structure of influential observations
or outliers to be unravelled and that observation 12 is outlying.
Finally, in Figure 5.7 we show a plot of the data together with the fitted
model with and without observation 12. A slightly strange feature of these
data, shared with those from the other tissue samples given by Bates and
Watts, is that they show a slight initial rise. The large residual of observa-
tion 12 is not immediately apparent, because it falls in a part of the curve
that is decreasing rapidly. Including this observation has a noticeable effect
Figure 5.6. Molar data: forward plots of (left) the maximum studentized residual
in the subset and (right) s²
on the shape of the fitted model and also on the two measures of curva-
ture: the parameter effects curvature increases from 1.33 to 3.09 and the
intrinsic curvature from 0.68 to 0.95. Although, with just one explanatory
variable, the outlying nature of observation 12 is not hard to detect, our
forward procedure both provides evidence of the effect of this observation
on a variety of inferences and establishes that this remote observation is
the only one that has a significant effect on any inferences about the model.
Figure 5.7. Molar data: observed and fitted values with (continuous line) and
without (dashed line) observation 12. The inclusion of this observation reduces the
curvature of the fitted model. The two points plotted at abscissa −27 correspond
to log concentrations of −∞
[Figure 5.8. Kinetics data: forward plot of scaled residuals against subset size m; the trace for observation 14 is labelled.]
[Figure 5.9. Kinetics data: forward plot of leverages against subset size m; observation 19 is labelled.]
Figure 5.10. Kinetics data: forward plots of (left) maximum studentized residual
in the subset and (right) two values of R²; the continuous line is for the nonlinear
model (5.34)
observation, case 14, enters the subset is noticeable: there is then a slight
decrease when the last observation, case 5, enters. The plot reinforces the
separate nature of 5 and 14 which was shown in Figure 5.8. An interpreta-
tion of the importance of these observations comes from the forward plots
of the two values of R² shown in Figure 5.10(right). The upper value is that
appropriate for nonlinear least squares problems, calculated as in (5.34) us-
ing the total sum of squares, rather than using the corrected sum of squares
as the base. The figure shows how this value of R² declines at the end of
the search to a value of 0.995. Also given is the curve for the values ap-
propriate for a linear model, calculated using the corrected sum of squares.
This lower curve decreases to 0.983. Whichever value of R² is used, it is
clear that the last observations are not causing severe degradation of the
model.
Our final plot from the forward search is of t statistics in Figure 5.11.
The four parameters θ_I all have very similar values, between 6.5 and 8 at
the end of the search: the common parameter θ₀ is much more precisely
estimated. Overall the curve shows the gentle downward drift which we
associate with such plots. There is no evidence, at the end of the search or
elsewhere, of an appreciable effect of any outliers. The data seem again to
be well behaved and the parameters well estimated.
There are three further points about the analysis of these data. One is
that we have taken different values of the parameter θ_I for each inhibitor,
while using a common value θ₀. An alternative, potentially requiring fewer
parameters, is to use a value of θ_I that depends directly on the inhibitor
concentration. We leave the exploration of this idea to the exercises.
The second point is that the data could be analyzed by rearranging the
model to be linear. We conclude our discussion of this example by looking
at such a rearrangement, one instance of which was introduced at the end
[Figure 5.11. Kinetics data: forward plot of t statistics against subset size m.]
of §5.1.1 for a different model. To start we assume that there is not only a
different parameter θ_I for each level of inhibitor, but that the parameter
θ₀ also varies with I. Let the parameter for level I be θ₀(I). If the errors
can be ignored the model (5.42) becomes, for group I,

yᵢ = θ₀(I)xᵢ/(θ_I + xᵢ).   (5.43)

This can be rearranged to yield the model

1/yᵢ = 1/θ₀(I) + (θ_I/θ₀(I))(1/xᵢ).   (5.44)

Thus estimates of the two parameters at each inhibitor level can be found
by regression of 1/yᵢ on 1/xᵢ. If a common value θ₀ is assumed, linear
regression can again be used, but now involving all observations to give the
estimate of 1/θ₀.
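A minimal sketch of the within-group regression (our names), for positive x and y:

    import numpy as np

    def double_reciprocal_fit(x, y):
        # estimate theta0 and theta_I in y = theta0*x/(theta_I + x)
        # from the linear form (5.44): 1/y = 1/theta0 + (theta_I/theta0)/x
        A = np.column_stack([np.ones_like(x), 1.0 / x])
        (a, b), *_ = np.linalg.lstsq(A, 1.0 / y, rcond=None)
        return 1.0 / a, b / a   # theta0, theta_I

As the discussion below warns, small values of x become leverage points for this regression, so the nonlinear fit is generally preferable.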
However the parameters are estimated, a plot of 1/yᵢ against 1/xᵢ should
be an approximately straight line if the model holds. As the plot of Figure
5.12 shows, this does seem to be the case. However the plot shows the effect
of transformation in generating leverage points from low values of the sub-
strate concentration xᵢ. A particularly dangerous point is observation 16,
which is both outlying and a leverage point for the highest inhibitor level.
This observation would have an appreciable influence on the parameter es-
timates if least squares were used on the rearranged model. If the errors
are additive and of constant variance in the original form of the model
(5.42), better parameter estimates are obtained by the use of nonlinear
least squares. And, indeed, our analysis shows that observation 16 is not
Figure 5.12. Kinetics data: plots of observations and fitted least squares lines for
the model rearranged to be linear (eq. 5.44) for each inhibitor concentration I
(right-hand axis)
5.5 Calcium Uptake

[Figure 5.13. Calcium data: forward plot of scaled residuals against subset size m.]
Figure 5.14. Calcium data: forward plots of (left) estimated parameters and
(right) Cook's distance
Figure 5.15. Calcium data: forward plots of the two curvature measures: (top)
parameter effects curvature; (bottom) intrinsic curvature
Figure 5.16. Calcium data: observations and fitted curve: (left) m = 20; (right)
m = n; 99% confidence intervals calculated from equation (5.40)
Figure 5.17. Calcium data: forward plots of (left) maximum studentized residual
in the subset and (right) minimum deletion residual not in the subset used for
fitting
forward plots. The seven observations still to enter are numbered, as they
are in Figure 5.13. These clearly all lie above and away from the fitted
model, with the last three observations to enter, 22, 19 and 23 being most
extreme. The other half of the plot shows the fit at the end of the search
(m = 27). The fitted curve has moved up to accommodate the final group
of observations and is less near its horizontal asymptote at high times, as
well as being apparently concave, which it was not at small times for m =
20.
This gradual change in the shape of the fitted curve explains the patterns
seen in the earlier plots. The large residuals in the earlier part of Figure
5.13 are for the last observations to enter. The only one not to decrease
is for observation 4. As Figure 5.16 shows, this observation is the only
5.6 Nitrogen in Lakes

[Figure 5.18. Lakes data: scatterplot matrix of x₁ = nitrogen concentration, x₂ = water retention time and y = mean annual nitrogen concentration; observations 2, 10, 22 and 23 are labelled.]
The scatterplot matrix in Figure 5.18 shows that there may be a linear
relationship between y and Xl with two outlying observations 10 and 23.
The plot of y against X2 reveals in addition that observations 2 and 22 are
likely to be important.
The data were analyzed by Stromberg (1993) using the model

yᵢ = x₁ᵢ/(1 + β₁x₂ᵢ^{β₂}) + εᵢ.   (5.46)
The forward plot of residuals, Figure 5.19, clearly shows these two outliers.
In addition it shows that inclusion of observation 10 causes an apprecia-
ble reduction in the residual for observation 23. The two observations are
[Figure 5.19. Lakes data: forward plot of scaled residuals against subset size m; the traces for observations 10 and 23 are labelled.]
[Figure 5.20. Lakes data: forward plot of leverages against subset size m.]
Figure 5.21. Lakes data: forward plots of (left) parameter estimates and (right)
t statistics
Figure 5.22. Lakes data: forward plots of Cook's distance, R², maximum studen-
tized residual in the subset and minimum deletion residual among observations
not in the subset
The addition of observations has caused all elements of the matrix to de-
crease. This behaviour is explained by the effect of the parameter estimates
on F̂, the elements of which may change appreciably as the parameter
estimates change. So here the two outlying observations cause an appre-
ciable change in the parameter estimates, but less marked changes in the t
statistics.
Figure 5.23. Lakes data: fitted response surface when m = n − 2. Observations
10 and 23 are clearly outlying
Figure 5.24. Lakes data: fitted response surface when m = n. The surface is more
curved than it is in Figure 5.23
Other effects of these two observations are shown in Figure 5.22. Both
have very large Cook's distances and their inclusion causes R² to decrease to
0.696. The maximum studentized residual among observations in the subset
shows a pattern typical of a pair of outliers: there is one large value followed
by one that is slightly smaller, due to the masking effect of observation 10 on
observation 23. This effect is seen even more dramatically in the last panel
of Figure 5.22, where the minimum deletion residual among observations
not in the subset is more than four for observation 10 before it enters. The
value for observation 23 at the end of the search is nearer three.
It is clear that these outliers have a large effect on some aspects of the
model. This can also be seen by plotting the fitted surface of the model as
a function of Xl and X2. Figure 5.23 shows the surface at step n - 2 and
Figure 5.24 shows the same surface, but when all n observations are used
in fitting. The two outliers are clearly visible in the first figure as being
remote from the surface. However in Figure 5.24 observation 10 appears
close to the fitted surface, as would observation 23 if the fitted surface were
extended to the remote region of this observation.
Such three-dimensional plots are hard to interpret in two dimensions.
The structure is seen much more clearly if the plots are either rotated
or jittered, so that the eye fabricates the illusion of a three-dimensional
object. What is clear from these figures is that addition of the last two
observations has resulted in an appreciable increase in the curvature of the
fitted surface. This is reflected in the measures of curvature: the parameter
effects curvature increases from 0.24 to 0.42 and, more importantly for the
difference in the figures, the intrinsic curvature increases from 0.14 when
m = 27 to 0.35 when m = 29.
Figure 5.21 shows another effect of the two outliers. When m = 27 the t
statistic for β₂ is 1.85, increasing to 4.32 when the two outliers are included.
In a linear regression model the implication would be that x₂ should be
dropped from the model, but here the interpretation is less clear. Since
β₂ is the power to which x₂ is raised, the implication is that we could
consider a small value of β₂. In the Box-Cox transformation of Chapter 4
the value λ = 0 led to the log transformation. Accordingly we here try log
x₂. This value leads to so large an increase in the residual sum of squares
for the 27 observations that log x₂ has to be rejected as a variable. In fact,
this transformation led to difficulties in obtaining convergence for the least
squares estimation of the parameters from the 27 observations, perhaps
because there are some values of x₂ close to zero which become extreme
when logged.
There remain the two outliers themselves. They correspond to units for
which the two values of Xl are about 10 times the values of this variable
for the other units. It is therefore possible that they have been caused by
decimal points being written in the wrong place. We suggest in the Exercises
that the analysis be repeated with these two X values replaced by one-tenth
of their values. Although such manipulation of the data is one statistical
[Figure 5.25. Lakes data: forward plot of scaled residuals; the trace for observation 23 is labelled.]
method of perhaps producing outlier free data, the resulting data may not
represent the physical situation under study. It may be, for example, that
the lakes really are highly polluted. One way to resolve such questions is by
inspection of the records from which the data were transcribed. Another is
comparison with other recorded measurements on the same lakes.
Figure 5.26. Pentane data: forward plots of three parameter estimates and of t
statistics
The correlation matrix of the parameter estimates at the end of the search
is

 1
−0.805   1
−0.840   0.998   1
−0.790   0.998   0.995   1.

The estimates of β₂, β₃ and β₄ are clearly extremely highly correlated, as
Figure 5.26(left) shows, and so will have the very similar t values seen in
the figure.
For a linear model with these correlations it would be customary to drop
one or two of the variables as we did in the analysis of the ozone data in
Chapter 3. But, with nonlinear models such as this, it is not always as
obvious, just as it was not obvious in the previous section, how the model
should be simplified. The only variable that can be dropped on its own
is x₁. But, since the model comes from the mechanism of the reaction,
simplified nonlinear models should come from consideration of simplified
reaction mechanisms.
These data have been subject to several analyses in the statistical liter-
ature, both in the journal Technometrics and in the books of Bates and
Watts (1988) and of Seber and Wild (1989). Carr's original analysis rear-
ranged the model to be linear. After rearrangement the response is yᵢ^{−1}, for
which the variance is not constant. Box and Hill (1974) allow for this inho-
mogeneity by using weighted linear least squares for parameter estimation,
with the weights chosen to allow for the inhomogeneity of variance.
Pritchard et al. (1977) comment that heteroscedasticity may be intro-
duced inadvertently by the data analyst who transforms a nonlinear model
to a linear form, which is what has happened with Carr's analysis. They
perform two analyses using nonlinear least squares on the original data,
one using weighting to allow for heteroscedasticity and the other being
unweighted. They find no evidence of variance inhomogeneity when the
data are analyzed without weights. They also discuss rearrangement of the
model. If the errors in (5.47) are ignored the model may be written
(x₂ᵢ − x₃ᵢ/1.632)/yᵢ = 1/(β₁β₃) + {β₂/(β₁β₃)}x₁ᵢ + (1/β₁)x₂ᵢ + {β₄/(β₁β₃)}x₃ᵢ.   (5.48)
estimating the parameters. This is the only example of those in this chapter
in which we had trouble with convergence of the routines both for nonlinear
least squares and for calculation of the curvatures. Pritchard and Bacon
(1977) show how a design giving more precise estimates of the parame-
ters for the same number of points, 24, as in Carr's data, can be found by
sequential construction of a D-optimum design for the linearized model.
Such designs minimize the volume of the asymptotic confidence region for
the parameters (5.15). Pritchard et al. (1977) comment that Carr's original
design was a central composite design in the space of the process variables
X. Such a design would be good, although not optimum, for predicting
the behaviour of the response using a low-order polynomial model in the
process variables. However good design for the parameters of a nonlinear
model requires designs that are good in the space of the partial derivatives
F⁰. D-optimum designs for nonlinear models arising in chemical kinetics
are described in Atkinson and Donev (1992, Chapter 18).
5.9 Exercises
Exercise 5.1 Show for the linear model ηᵢ = xᵢᵀβ that the partial deriva-
tives fᵢ given by (5.9) are equal to xᵢ. What is the implication for least
squares estimation (§5.1)?
Exercise 5.2 Figure 5.1 shows a plot of the expectation plane. Add a data
point to this plot and sketch the position of the estimate β̂ and of a con-
fidence region for β. Show that, in general, the confidence region will be
elliptical in the parameter space (§5.1).
Exercise 5.3 Compute the angle between the two vectors (columns) of the
matrix X defined in equation (5.19). What are the implications of the non-
orthogonality of the two vectors for the parameter lines on the expectation
plane (§5.1)?
Exercise 5.4 Compute the Jacobian of the transformation from the pa-
rameter plane to the expectation surface when the matrix X is defined as in
equation (5.19). Give the general expression of the Jacobian for the multiple
linear regression model E(Y) = Xβ (§5.1).
Exercise 5.5 A certain chemical reaction can be described by the nonlinear
model:
(5.49)
Xl X2 Y Xl X2 Y Xl X2 Y
120 600 0.900 60 620 0.795 45 631 0.688
60 600 0.949 60 620 0.800 40 631 0.717
60 612 0.886 60 620 0.790 30 631 0.802
120 612 0.785 30 620 0.883 45 631 0.695
120 612 0.791 90 620 0.712 15 639 0.808
60 612 0.890 150 620 0.576 30 639 0.655
60 620 0.787 60 620 0.802 90 639 0.309
30 620 0.877 60 620 0.802 25 639 0.689
15 620 0.938 60 620 0.804 60 639 0.437
60 620 0.782 60 620 0.794 60 639 0.425
45 620 0.827 60 620 0.804 30 639 0.638
90 620 0.696 60 620 0.799 30 639 0.659
150 620 0.582 30 631 0.764
Exercise 5.8 In (5.34) a definition was given of the squared multiple cor-
relation coefficient for a nonlinear model. For the lakes data of §5.6 this has
the value 0.696, with the residual sum of squares being 43.392. Calculate
the customary value of R² for a linear model (2.20) for these data. Explain
your answer (§5.7).
Exercise 5.9 Find the maximum likelihood estimate of the Box-Cox transformation of the response for the data on the isomerization of n-pentane (§5.7).
Exercise 5.10 Repeat the sketch of Exercise 5.2 for the expectation surface
of a nonlinear model, for example, Figure 5.2. Include the tangent plane
approximation and sketch both the approximate confidence interval and the
interval given by a constant increase in the residual sum of squares.
Exercise 5.12 In §5.4 it was shown how the model for the kinetics data
could be rearranged to be linear. Find the distribution of errors in the
original model (5.42) for which the rearrangement is appropriate.
The BOD data used in Exercise 5.7 and plotted in Figure 5.28:

Time (days)   y
1             8.3
2            10.3
3            19.0
4            16.0
5            15.6
7            19.8
Use simulation to find the distribution of the estimates from linear least
squares in the rearranged model, if the errors in the original model are
additive and normal with constant variance.
Can you find other linearizations? What assumptions do they lead to
about the errors (Ruppert et al. 1989)?
Exercise 5.13 The data on calcium uptake analyzed in §5.5 were treated
as nonlinear regression data. However the data consist of three replicate
experiments at seven sets of experimental conditions. Seven estimates of
pure error are therefore available, unaffected by the lack of fit of the model,
although they will be affected by any outliers.
Is there any evidence that the error variance increases with increasing
time? If there were such evidence how would you analyze the data? How is
your answer affected by the existence of one negative observation?
Exercise 5.14 Repeat the analysis of the lakes data with the outlying values of x1 adjusted as suggested at the end of §5.6.
5.10 Solutions
Exercise 5.1
For a linear model $\partial\eta_i/\partial\beta_j = x_{ij}$. Since the derivative does not depend on the parameter values, iterative methods of parameter estimation are not required.
Exercise 5.2
See Bates and Watts (1988, page 19).
Exercise 5.3
The cosine of the angle (a) between the two vectors can be computed as
follows:
$\cos a = \dfrac{4.04}{\sqrt{3}\,\sqrt{7.236}} = 0.867,$
so that the angle between the two vectors is $a = \arccos(0.867) \approx 30°$. This means that the parameter lines on the expectation plane are not at right angles as they are on the tangent plane. As Figure 5.1 shows, unit squares on the parameter plane map to parallelograms on the expectation plane.
Exercise 5.4
The Jacobian of the transformation is geometrically equal to the area of the parallelogram that corresponds to a unit square on the parameter plane. From computational geometry the area is $\|v_1\|\,\|v_2\|\sin a = \{\det(X^T X)\}^{1/2}$, where $v_1$ and $v_2$ are the columns of X. The same expression, $J = \{\det(X^T X)\}^{1/2}$, gives the Jacobian for the multiple linear regression model E(Y) = Xβ.
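As a numerical check on Solutions 5.3 and 5.4, the sketch below (Python; the two-column matrix is an illustrative stand-in, since the X of equation (5.19) is not reproduced here) computes the cosine of the angle between the columns and confirms that the parallelogram area ‖v1‖‖v2‖ sin a equals {det(XᵀX)}^{1/2}.

```python
import numpy as np

# Illustrative stand-in for the two-column matrix X of (5.19).
X = np.array([[1.0, 1.0],
              [1.0, 1.5],
              [1.0, 1.54]])

v1, v2 = X[:, 0], X[:, 1]

# Cosine of the angle between the two columns (Solution 5.3).
cos_a = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
a = np.arccos(cos_a)
print(f"cos a = {cos_a:.3f}, angle = {np.degrees(a):.1f} degrees")

# Area of the parallelogram spanned by the columns (Solution 5.4):
# |v1| |v2| sin a, which equals sqrt(det(X^T X)).
area_geometric = np.linalg.norm(v1) * np.linalg.norm(v2) * np.sin(a)
area_determinant = np.sqrt(np.linalg.det(X.T @ X))
print(f"area: {area_geometric:.4f} vs sqrt(det): {area_determinant:.4f}")
```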
Exercise 5.5
The data do not seem to contain any outliers.
Exercise 5.6
Figure 5.27. Average absolute deletion residual for observations not in the subset for (left) calcium and (right) lakes data
Figure 5.27 shows: (1) that the curve on the right is always higher, since outliers are present; (2) the upward jump in the right panel when the outliers are included; and (3) a partial masking effect in the last step of the right panel.
Figure 5.28. BOD data: observations, fitted curve and 95% inference band
Exercise 5.7
The least squares estimates are $\hat\beta = (19.143, 0.5311)^T$ with $s^2 = 6.498$ on four degrees of freedom. The estimated response function and the 95% confidence band are plotted in Figure 5.28. It is interesting to notice that the band has zero width when t = 0, widens up to t ≈ 3, narrows around t = 4 and then widens again. Compare this plot with Figure 5.16 and with a plot for quadratic regression through the origin.
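These estimates can be reproduced numerically. The sketch below (a Python illustration, assuming the usual exponential-rise model η = β1{1 − exp(−β2 t)} for the BOD data listed with the exercises) fits the model by nonlinear least squares.

```python
import numpy as np
from scipy.optimize import curve_fit

# BOD data: time (days) and response, as listed with the exercises.
t = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 7.0])
y = np.array([8.3, 10.3, 19.0, 16.0, 15.6, 19.8])

# Exponential-rise model: eta = b1 * (1 - exp(-b2 * t)).
def eta(t, b1, b2):
    return b1 * (1.0 - np.exp(-b2 * t))

beta_hat, _ = curve_fit(eta, t, y, p0=[20.0, 0.5])
rss = np.sum((y - eta(t, *beta_hat)) ** 2)
s2 = rss / (len(y) - 2)          # 6 observations, 2 parameters: 4 df

print(beta_hat)  # approximately (19.143, 0.5311)
print(s2)        # approximately 6.498
```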
Exercise 5.8
If a constant is included, R² = 0.020. If the constant is not included, R² = −1.657. If the linear regression model does not contain a constant, R² is no longer forced to lie within the interval [0,1].
Exercise 5.9
λ̂ = 0.72.
6
Generalized Linear Models
6.1 Background
We give two examples of discrete data in which the distribution depends
on the levels of one or more factors.
allowing for x2 as a factor with three levels, rather than a single explanatory variable. The normal theory regression model (2.2) was written in terms of a linear predictor as
$y_i = \eta_i + \epsilon_i$, with $\eta_i = x_i^T\beta$. (6.1)
In the generalized linear model for Poisson data the mean of the Poisson distribution again depends on the linear predictor, but the errors are no longer additive.
At dose level $x_i$ the model is that the observations are binomially distributed with parameter $\theta_i$. Interest is in whether there is a relationship between the probability of success $\theta_i$ and the dose level - the data clearly show some relationship - and, if so, what is the form of that relationship. Here the definitions of "success" and "failure" are a matter of point of view: success for the beetle in surviving is a failure for the experimenter who administered the insecticide. A starting point is a model with linear predictor
$\eta(x_i) = \beta_0 + \beta_1 x_i$. (6.2)
By analogy with linear regression we could consider the model
$\theta_i = \eta(x_i) = \beta_0 + \beta_1 x_i$, (6.3)
but the $\theta_i$ are probabilities, so that it is necessary that $0 \le \theta_i \le 1$. Instead, for binomial data, we use models of the form
$\theta_i = \psi\{\eta(x_i)\}$, (6.4)
with the inverse link function ψ such that $0 \le \theta \le 1$. One family of func-
tions with this property are cumulative distribution functions of univariate
distributions. The plot of the proportions of successes $R_i/n_i$ in Figure 6.1 does indeed have such a form. If, for example, ψ is the cumulative normal distribution, probit analysis of binomial data results. Instead of the inverse link function ψ, the theory of generalized linear models is developed in terms of the equivalent link function $g = \psi^{-1}$.
Figure 6.1. Bliss's beetle data: proportion of deaths increasing with log dose
the linear predictor η and the mean μ are related by the link function
$g(\mu) = \eta$. (6.5)
For binomial data we require a link such that the mean lies between zero and one. A widely used link for binomial data that satisfies this property is the logistic link
$\log\{\mu/(1-\mu)\} = \eta = \beta^T x$, (6.6)
so that
$\mu = e^{\beta^T x}/(1 + e^{\beta^T x})$. (6.7)
Comparison with (6.4) shows that the inverse link ψ for the logistic link is the cumulative distribution function of the logistic distribution. Whatever the values of β and x, the probabilities θ lie between zero and one.
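A small numerical sketch (Python; illustrative values only) shows the logistic link and its inverse in action: whatever the value of the linear predictor, the inverse link returns a probability strictly inside (0, 1).

```python
import numpy as np

def logit(mu):
    """Logistic link g(mu) = log{mu / (1 - mu)}."""
    return np.log(mu / (1.0 - mu))

def inv_logit(eta):
    """Inverse link psi(eta) = e^eta / (1 + e^eta), always in (0, 1)."""
    return np.exp(eta) / (1.0 + np.exp(eta))

eta = np.array([-100.0, -2.0, 0.0, 2.0, 100.0])   # any real values
mu = inv_logit(eta)
print(mu)                                    # all strictly between 0 and 1
print(np.allclose(logit(mu[1:4]), eta[1:4])) # True: g is psi^{-1}
```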
For Poisson observations we require a link such that the mean μ cannot be less than zero. An often used link is the log link
$\log(\mu) = \eta = \beta^T x$, (6.8)
so that $\mu = e^{\beta^T x}$ cannot be negative. This model yields not only Poisson regression models but also log linear models for the analysis of contingency tables.
These models for binomial and Poisson data are both special cases of
generalized linear models, a family that generalizes the normal theory lin-
ear model of the earlier chapters on linear regression. The generalization
extends both the random and systematic parts of linear models:
Generalization
• Distribution. Members of the one-parameter exponential family; for regression models, the normal distribution.
Table 6.1. The most usual link functions: for the probit link Φ is the cumulative distribution function of the standard normal distribution
Arcsine $(0 \le \mu \le 1)$: $\eta = \sin^{-1}(2\mu - 1)$; inverse $\mu = 0.5(1 + \sin\eta)$ for $-\pi/2 \le \eta \le \pi/2$, $\mu = 1$ for $\eta \ge \pi/2$, $\mu = 0$ for $\eta \le -\pi/2$
Table 6.2. Names of the most widely used combinations of distribution and link function
$\log f(y;\theta,\phi) = \dfrac{y\,b(\theta) + c(\theta)}{\phi} + d(y,\phi), \qquad -\infty < y < \infty, \quad \phi > 0.$ (6.9)
Under the standard regularity conditions that allow the interchange of the order of differentiation and integration, the expectation
$E\left(\dfrac{\partial l}{\partial\theta}\right) = 0.$ (6.11)
These conditions are those under which the Cramér-Rao lower bound for the variance of maximum likelihood estimators holds. The most frequent violation occurs when the range of the observations depends upon θ, which is not the case for (6.9). Derivations of (6.11) and the result for second derivatives (6.14) are to be found in many textbooks, for example, Casella and Berger (1990, p. 309).
together with the relationship between first and second derivatives (Exercise 6.2)
$E\left(\dfrac{\partial^2 l}{\partial\theta^2}\right) = -E\left(\dfrac{\partial l}{\partial\theta}\right)^2.$ (6.14)
From (6.13), $c'(\theta) = -\mu\,b'(\theta)$, so that the derivative of (6.10) can be written
$\dfrac{\partial l}{\partial\theta} = \dfrac{b'(\theta)(y - \mu)}{\phi}.$ (6.15)
Then in (6.14)
$E\left(\dfrac{\partial l}{\partial\theta}\right)^2 = E\left\{\dfrac{b'(\theta)(Y - \mu)}{\phi}\right\}^2 = \left\{\dfrac{b'(\theta)}{\phi}\right\}^2 \mathrm{var}\,Y$ (6.16)
and
$E\left(\dfrac{\partial^2 l}{\partial\theta^2}\right) = \dfrac{\mu\,b''(\theta) + c''(\theta)}{\phi},$
so that
$\mathrm{var}\,Y = -\left\{\dfrac{\phi}{b'(\theta)}\right\}^2 \dfrac{\mu\,b''(\theta) + c''(\theta)}{\phi} = -\phi\,\dfrac{\mu\,b''(\theta) + c''(\theta)}{\{b'(\theta)\}^2}.$ (6.17)
The equality of the mean and the variance is the basis for a test of the Poisson assumption. The mean-variance relationship for all generalized linear models is obtained by rewriting (6.18). Let
$\dfrac{1}{b'(\theta)}\,\dfrac{\partial\mu}{\partial\theta} = V(\mu) = V,$ (6.19)
where $\Gamma(\alpha) = \int_0^\infty u^{\alpha-1} e^{-u}\,du$. In this form $E(Y) = \mu$ and $\mathrm{var}\,Y = \mu^2/\alpha$, in agreement with the results of Table 6.3. Derivation of the result for the inverse Gaussian distribution is left to the exercises.
At first glance, Table 6.3 seems to provide an extremely restricted fam-
ily of models for mean-variance relationships. A richer family is found by
specifying not the complete distribution, but just the relationship between
the mean and the variance, an extension of the second-order assumptions
for regression models of (2.3). The resulting quasilikelihood models are de-
scribed by Firth (1991) in a chapter that provides a more wide-ranging
introduction to generalized linear models than that given here.
A second departure from the variances listed in Table 6.3 is overdisper-
sion for Poisson and binomial data, also described by Firth among others,
in which the form of V(p,) seems correct, but the estimated dispersion
parameter is appreciably greater than one. This phenomenon can be gen-
erated by an extra source of variation in the data beyond that included in
the generalized linear model, perhaps arising from a compound distribu-
tion. For example, in Bliss's beetle data, the number of insects Ri dying
$L(\beta) = \sum_{i=1}^n \left\{ \dfrac{y_i b(\theta_i) + c(\theta_i)}{\phi} + d(y_i,\phi) \right\}.$ (6.24)
For the normal theory linear model the least squares estimates satisfy the normal equations $X^T X \hat\beta = X^T y$. The rth equation (r = 1, …, p) can be written as
$\sum_{s=1}^p \sum_{i=1}^n x_{ir} x_{is} \hat\beta_s = \sum_{i=1}^n x_{ir} y_i,$ (6.25)
since $(X^T X)_{rs} = \sum_{i=1}^n x_{ir} x_{is}$. For weighted least squares with weights $w_i$ the corresponding equations are
$\sum_{s=1}^p \sum_{i=1}^n w_i x_{ir} x_{is} \hat\beta_s = \sum_{i=1}^n w_i x_{ir} y_i.$ (6.26)
The maximum likelihood estimate β̂ satisfies
$\left.\dfrac{\partial L(\beta)}{\partial\beta}\right|_{\beta=\hat\beta} = 0.$
If we let the derivatives
$U(\beta) = \dfrac{\partial L(\beta)}{\partial\beta},$ the score function,
and
$\dfrac{\partial^2 L(\beta)}{\partial\beta\,\partial\beta^T} = -J(\beta),$ the observed information, (6.29)
the estimating equations become
$U(\hat\beta) = 0.$ (6.30)
The expected information is
$I(\beta) = E\{J(\beta)\} = -E\,\dfrac{\partial^2 L(\beta)}{\partial\beta\,\partial\beta^T}.$ (6.33)
Expanding the score by the chain rule, we write
$\dfrac{\partial L}{\partial\beta_j} = \dfrac{\partial L}{\partial\theta}\,\dfrac{d\theta}{d\mu}\,\dfrac{d\mu}{d\eta}\,\dfrac{\partial\eta}{\partial\beta_j}.$ (6.36)
The definition of the variance function in (6.19) yields
$\dfrac{d\mu}{d\theta} = V(\mu)\,b'(\theta) = V b'(\theta).$
Also $\partial\eta/\partial\beta_j = x_j$, so that, defining the weight W through
$W^{-1} = \left(\dfrac{d\eta}{d\mu}\right)^2 V,$ (6.37)
we obtain
$\dfrac{\partial L}{\partial\beta_j} = \sum_{i=1}^n \dfrac{w_i}{\phi}(y_i - \mu_i)\dfrac{d\eta_i}{d\mu_i}\,x_{ij}.$ (6.38)
If the subscript i is suppressed we write
$\dfrac{\partial L}{\partial\beta_j} = \sum \dfrac{W}{\phi}(y - \mu)\dfrac{d\eta}{d\mu}\,x_j.$ (6.39)
the estimate of the parameter β. For notational simplicity we take φ = 1, when the expected information matrix found from further differentiation of (6.39) is
$I_{rs} = \sum_{i=1}^n w_i x_{ir} x_{is}.$ (6.40)
Also
$\dfrac{\partial\mu}{\partial\beta_s} = \dfrac{d\mu}{d\eta}\,\dfrac{\partial\eta}{\partial\beta_s},$
whence the scoring iteration for the estimate at step k + 1 satisfies
$\sum_s I_{rs}\beta_s^{k+1} = U_r(\beta^k) + \sum_s I_{rs}\beta_s^k,$ (6.41)
with
$U_r(\beta^k) = \sum_{i=1}^n w_i x_{ir}(y_i - \hat\mu_i^k)\dfrac{d\eta_i}{d\mu_i}$ (6.42)
and
$\sum_s I_{rs}\beta_s^k = \sum_{i=1}^n w_i x_{ir}\hat\eta_i^k,$ (6.43)
the last equality following from the definition of the linear predictor, with $\hat\eta_i^k$ the estimated linear predictor at iteration k. To find the p equations for the parameter estimates we substitute for $U_r$ in (6.41) from (6.39) and obtain
$\sum_{s=1}^p \sum_{i=1}^n w_i x_{ir} x_{is}\beta_s^{k+1} = \sum_{i=1}^n w_i x_{ir} z_i.$ (6.44)
Comparison with (6.26) shows that this is weighted least squares with weights $w_i$, where
$W = V^{-1}\left(\dfrac{d\mu}{d\eta}\right)^2,$
and "working" response
$z_i = \hat\eta_i^k + (y_i - \hat\mu_i^k)\dfrac{d\eta_i}{d\mu_i}.$
Both the weights and the working response depend on the parameter estimate $\beta^k$. The iteration is started, where possible, by putting $\hat\mu = y$. For zero observations in Poisson or gamma models we put $\hat\mu = 0.1$. Similar adjustments for binomial starting values are given, for example, on page 117 of McCullagh and Nelder (1989). Usually about five iterations are required to obtain satisfactory parameter estimates.
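As an illustration of this iteration, the following sketch (Python with simulated data; not code from the book) implements iteratively reweighted least squares for a Poisson model with the log link, for which dη/dμ = 1/μ, V(μ) = μ and hence w = μ, so that the working response is z = η̂ + (y − μ̂)/μ̂.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # constant + one variable
beta_true = np.array([1.0, 0.5])
y = rng.poisson(np.exp(X @ beta_true))

# IRLS for a Poisson GLM with log link: w = mu, z = eta + (y - mu)/mu.
mu = np.where(y > 0, y, 0.1).astype(float)  # start at mu = y (0.1 for zeros)
eta = np.log(mu)
for _ in range(5):                     # about five iterations usually suffice
    w = mu                             # weights W = V^{-1} (dmu/deta)^2 = mu
    z = eta + (y - mu) / mu            # working response
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))  # (6.44)
    eta = X @ beta
    mu = np.exp(eta)

print(beta)   # close to beta_true
```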
6.5 Inference
6.5.1 The Deviance
In regression, differences in residual sums of squares are used to test the significance of one or more terms in the linear model. Suppose that the hypothesis is that a specified s of the elements of β are zero. The residual sum of squares under this hypothesis is accordingly $S(\hat\beta_{s0})$. If the hypothesis is true, the difference in residual sums of squares $S(\hat\beta_{s0}) - S(\hat\beta)$ is distributed as $\sigma^2\chi^2_s$. Usually σ² is estimated by s² and the scaled differences $\{S(\hat\beta_{s0}) - S(\hat\beta)\}/s^2$ are displayed as an analysis of variance table.
The generalization considered in this section is to the analysis of deviance in
which likelihood ratio tests are expressed as differences in scaled deviances.
Let the maximized loglikelihood of the observations for the linear predictor be $L(\hat\beta)$ and the loglikelihood under the hypothesis that, again, s of the elements of β are zero be $L(\hat\beta_{s0})$. Then, asymptotically, the loglikelihood ratio
$2\{L(\hat\beta) - L(\hat\beta_{s0})\} \sim \chi^2_s.$ (6.46)
For normal theory regression the result reduces to the distribution of the scaled difference in residual sums of squares $\{S(\hat\beta_{s0}) - S(\hat\beta)\}/\sigma^2$ and so is exact. For other distributions the approximation to the distribution of (6.46) improves as the number of observations increases. The distributional result (6.46) also holds for the more general hypothesis that s linear combinations of β are constrained to have specified values, the only difference being that $\hat\beta_{s0}$ is now the maximum likelihood estimate satisfying these constraints.
For a linear regression model the deviance $D(\hat\beta)$ reduces to the residual sum of squares $S(\hat\beta)$. To test the goodness of fit of the regression model when σ² is known, we can use the scaled sum of squares $S(\hat\beta)/\sigma^2$. Likewise for testing hypotheses about generalized linear models we use the scaled deviance $D_{sc}(\hat\beta)$ defined as
$D_{sc}(\hat\beta) = D(\hat\beta)/\phi = 2\{L(\hat\beta^{max}) - L(\hat\beta)\},$ (6.48)
which, from (6.47), is a likelihood ratio test. If the linear model contains p parameters, the goodness of fit test based on the scaled deviance compares $D_{sc}(\hat\beta)$ with the χ² distribution on n − p degrees of freedom. In general the distributional result is again asymptotic, a word requiring careful interpretation: for gamma or Poisson data we mean that n → ∞, while for binomial data we require that each $n_i \to \infty$. For binary data each $n_i = 1$ and the value of the deviance (Exercise 6.7) becomes completely uninformative about the goodness of fit of the model.
The scaled deviance is most useful not as an absolute measure of goodness of fit but for comparing nested models. In this case the reduction in scaled deviance $D_{sc}(\hat\beta_{s0}) - D_{sc}(\hat\beta)$ is asymptotically distributed as $\chi^2_s$. The χ² approximation is usually quite accurate for differences in scaled deviances even if it is inaccurate for the scaled deviances themselves. Likelihood ratio tests of parameters in the analysis of deviance depend upon differences in scaled deviances, rather than on differences in deviances. However, since these tests are commonly used for Poisson and binomial data where the scale parameter is one, the scaled and unscaled deviances are identical for these distributions. Perhaps as a result, the literature is not always clear as to whether the deviances being discussed are the scaled deviances $D_{sc}(\hat\beta)$ or the deviances $D(\hat\beta)$, which do not depend on φ. When we want to stress
that we are using the value of a Poisson or binomial deviance to indicate the
fit of a model, we sometimes refer to the residual deviance. We compare
this deviance with that from the null model (that is, one in which the
linear predictor only contains a constant). The difference between the null
deviance and the residual deviance is called the explained deviance. These
relationships are summarized in Table 6.4.
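To make the analysis of deviance concrete, the sketch below (Python with simulated Poisson data; purely illustrative) computes the residual deviance of a fitted model and of the null model, and tests a nested reduction by the difference in (scaled) deviances.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = rng.poisson(np.exp(0.5 + 0.8 * x1))          # x2 is irrelevant

X_full = sm.add_constant(np.column_stack([x1, x2]))
X_red = sm.add_constant(x1)

fit_full = sm.GLM(y, X_full, family=sm.families.Poisson()).fit()
fit_red = sm.GLM(y, X_red, family=sm.families.Poisson()).fit()

# For Poisson data phi = 1, so deviance and scaled deviance coincide.
drop = fit_red.deviance - fit_full.deviance      # ~ chi^2_1 if x2 not needed
print("deviance difference:", drop)
print("p-value:", stats.chi2.sf(drop, df=1))

# Explained deviance = null deviance - residual deviance.
print("explained:", fit_full.null_deviance - fit_full.deviance)
```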
To find an expression for the deviance for the exponential family model with loglikelihood (6.24), let the vector parameter β correspond to individual parameters $\theta_i$, with $\beta^{max}$ corresponding to parameters $\theta_i^{max}$. Then
Symbol                            Meaning
Null model                        Model which only contains a constant (one parameter)
Current model                     Model with p parameters
Saturated model                   Model for which $\hat\mu_i = y_i$, i = 1, …, n (model with n parameters)
$L(\hat\beta)$                    Loglikelihood of the current model
$L(\hat\beta^{null})$             Loglikelihood of the null model
$L(\beta^{max})$                  Loglikelihood of the saturated model
Likelihood ratio                  $2\{L(\hat\beta) - L(\hat\beta_{s0})\}$
Deviance (of the current model)   $2\phi\{L(\beta^{max}) - L(\hat\beta)\}$
Deviance of the null model        $2\phi\{L(\beta^{max}) - L(\hat\beta^{null})\}$
from (6.47)
$D(\hat\beta) = 2\sum_{i=1}^n \{y_i b(\theta_i^{max}) + c(\theta_i^{max}) - y_i b(\hat\theta_i) - c(\hat\theta_i)\},$ (6.49)
which is a function neither of $d(y_i,\phi)$ nor, more importantly, of φ. We leave it as an exercise to show (Exercise 6.5) that this reduces to the residual sum of squares for the regression model. The deviances for other distributions will be derived when we come to analyze data from each family.
The dispersion parameter φ can be estimated by the moment estimator based on the generalized form of Pearson's chi-squared statistic,
$X^2 = \sum_{i=1}^n \dfrac{(y_i - \hat\mu_i)^2}{V(\hat\mu_i)},$ (6.51)
in the notation of this chapter. Since, for the gamma distribution, $V(\mu) = \mu^2$, we can rewrite (6.50) as
$\hat\phi = \dfrac{X^2}{n-p} = \dfrac{1}{n-p}\sum_{i=1}^n \left(\dfrac{y_i - \hat\mu_i}{\hat\mu_i}\right)^2.$ (6.52)
In the forward search we monitor both the deviance $D(\hat\beta)$ and the dispersion estimate φ̂.
For linear regression the variance of the parameter estimates (2.18) was $\mathrm{var}\,\hat\beta = \sigma^2(X^T X)^{-1}$, with the estimated standard error of the rth element of β̂ being
$\text{estimated s.e.}(\hat\beta_r) = (\hat\phi\,v_{rr})^{1/2},$ (6.53)
where now $v_{rr}$ is the rth diagonal element of $(X^T W X)^{-1}$. This formula applies for the gamma distribution. For the Poisson and binomial distributions we calculate t statistics and confidence intervals using the theoretical value of one for the dispersion parameter.
$H = W^{1/2} X (X^T W X)^{-1} X^T W^{1/2}.$ (6.54)
Since W is a diagonal matrix of nonnegative weights, $W^{1/2}$ is found as the elementwise square root of W.
6.6.2 Residuals
Three residuals can be defined by analogy with least squares, for which
they are all identical. We use two of them.
Pearson Residuals
The simple definition of least squares residuals is that in (2.10) as $e_i = y_i - \hat y_i$, the difference between what is observed and what is predicted. This definition for generalized linear models leads, when allowance is made for the dependence of the variance of Y on the mean, to the Pearson residual
$r_{Pi} = \dfrac{y_i - \hat\mu_i}{\sqrt{V(\hat\mu_i)}},$ (6.55)
where, as in (6.20), $\mathrm{var}\,Y = \phi V(\hat\mu)$. The name for the residual arises since
$\sum_{i=1}^n r_{Pi}^2 = X^2$
for the Poisson distribution, for which φ = 1. Here, as in (6.51), X² is the observed value of the generalized form of Pearson's chi-squared goodness of fit statistic with the appropriate variance function.
The Pearson residual can be studentized, as the least squares residual was in (2.14), and is
$r'_{Pi} = \dfrac{y_i - \hat\mu_i}{\{\hat\phi\,V(\hat\mu_i)(1 - h_i)\}^{1/2}},$ (6.56)
where $h_i$ is the ith diagonal element of the matrix H defined in equation (6.54).
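The sketch below (Python, for the Poisson case, where V(μ) = μ, φ = 1 and the iterative weights for the log link are w = μ; illustrative only) computes the hat matrix (6.54), the leverages and both forms of Pearson residual from a fitted mean vector.

```python
import numpy as np

def glm_leverage_and_residuals(X, y, mu, phi=1.0):
    """Leverages from H = W^{1/2} X (X'WX)^{-1} X' W^{1/2} and Pearson
    residuals for a Poisson fit (V(mu) = mu, weights w = mu, log link)."""
    w = mu                                   # iterative weights
    V = mu                                   # Poisson variance function
    WX = np.sqrt(w)[:, None] * X
    H = WX @ np.linalg.solve(X.T @ (w[:, None] * X), WX.T)
    h = np.diag(H)                           # leverages
    r_p = (y - mu) / np.sqrt(V)              # Pearson residual (6.55)
    r_p_stud = (y - mu) / np.sqrt(phi * V * (1.0 - h))   # studentized (6.56)
    return h, r_p, r_p_stud

# Tiny illustration with an arbitrary fitted mean.
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([1.0, 0.0, 2.0, 4.0, 7.0])
mu = np.exp(0.1 + 0.4 * np.arange(5.0))
h, r_p, r_s = glm_leverage_and_residuals(X, y, mu)
print(np.round(h, 3), np.round(r_p, 2), np.round(r_s, 2))
```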
Deviance Residuals
In regression the residual sum of squares is the sum of squares of the least squares residuals, $S(\hat\beta) = \sum e_i^2$. The deviance, which generalizes the residual sum of squares, is likewise the sum of n quantities, so we can write formally
$D(\hat\beta) = \sum_{i=1}^n d_i^2.$ (6.57)
The deviance residual is then $r_{Di} = \mathrm{sign}(y_i - \hat\mu_i)\,d_i$. Normal plots of such residuals are less useful for generalized linear models for discrete data than they are for normal data.
Deletion Residuals
A third residual can be defined by the effect of deletion. For the regression model the exact change in the residual sum of squares when the ith observation is deleted is given by (2.34) as $e_i^2/(1 - h_i)$. For generalized linear models Williams (1987) shows that a one-step approximation to the change in deviance on deletion yields a deletion residual that is a linear combination of one-step approximations to the effect of deletion on the Pearson and deviance residuals (McCullagh and Nelder 1989, p. 398):
$r_i^* = \mathrm{sign}(y_i - \hat\mu_i)\{(1 - h_i)\,r_{Di}'^2 + h_i\,r_{Pi}'^2\}^{1/2},$ (6.61)
a function of the Pearson residual and other quantities all known from a single fit of the model.
Suppose that the link used in fitting the data is g(μ) when the true link is $g^*(\mu) = \eta$. Let $h(\eta) = g\{g^{*-1}(\eta)\}$. Then
$g(\mu) = g\{g^{*-1}(\eta)\} = h(\eta).$
If the fitted link is correct, $h(\eta) = \eta$. Otherwise h(η) will be nonlinear in η. So we need to test whether g(μ) is a linear function of η. Taylor series expansion around zero yields
$g(\mu) = h(\eta) = h(0) + h'(0)\,\eta + h''(0)\,\eta^2/2 + \cdots \approx a + b\,x^T\beta + \gamma\eta^2,$ (6.62)
where a, b and γ are scalars. Since β is to be estimated, (6.62) becomes
$g(\mu) = x^T\beta^* + \gamma\hat\eta^2,$ (6.63)
provided that the fitted model contains a constant. The test of the goodness of the link then reduces to testing whether in (6.63) γ = 0.
The test statistic is calculated in two stages. In the first the model is fitted with link g(μ), yielding an estimated linear predictor η̂ with iterative weights W and estimated dispersion parameter φ̂. Following the prescription in (6.63) the linear predictor is extended to include the variable η̂². The model is then refitted to give a t test for γ. However, the refitting is without iteration, so that the parameters of the linear predictor are reestimated with the weights W found previously. Likewise the dispersion estimate is not adjusted for inclusion of the extra explanatory variable. As we show in Figure 6.2 and in several other figures, monitoring the resulting t test is sometimes a powerful method of detecting inadequacies in the model.
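A sketch of this two-stage calculation (Python, Poisson log-link case with simulated data; an illustration of the recipe rather than the book's code) fits the model, extends the design matrix with η̂², and refits by a single weighted least squares step with the weights held fixed.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 150
x = rng.normal(size=n)
X = sm.add_constant(x)
y = rng.poisson(np.exp(0.3 + 0.7 * x))

# Stage 1: fit with the working link and keep eta-hat and the weights W.
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
eta = X @ fit.params
mu = np.exp(eta)
w = mu                                   # iterative weights for the log link

# Stage 2: extend the predictor with eta^2 and refit WITHOUT iteration,
# i.e. one weighted least squares step on the working response.
Z = np.column_stack([X, eta ** 2])
z = eta + (y - mu) / mu                  # working response
ZWZ = Z.T @ (w[:, None] * Z)
coef = np.linalg.solve(ZWZ, Z.T @ (w * z))
cov = np.linalg.inv(ZWZ)                 # phi = 1 for Poisson data
t_gamma = coef[-1] / np.sqrt(cov[-1, -1])
print("goodness of link t statistic:", t_gamma)
```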
The subset of dimension m + 1 is found by selecting the m + 1 units with the smallest values of the $d_i^2$, units being chosen by ordering all n deviance components $d^2_{i,S_*^{(m)}}$. The search starts by randomly selecting subsets of size p and chooses the one for which the median deviance component is smallest.
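A schematic of the search (Python, Poisson case with simulated data; fit_poisson is a minimal stand-in fitter, not the book's software) shows the initialization by median deviance component and the growth of the subset one unit at a time.

```python
import numpy as np

def fit_poisson(X, y, iters=8):
    """Minimal Poisson IRLS (log link); returns the parameter estimate."""
    mu = np.where(y > 0, y, 0.1).astype(float)
    eta = np.log(mu)
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        w = mu
        z = eta + (y - mu) / mu
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        eta = np.clip(X @ beta, -20, 20)   # guard against wild early steps
        mu = np.exp(eta)
    return beta

def dev_components(y, mu):
    """Squared Poisson deviance components d_i^2."""
    term = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    return 2.0 * (term - (y - mu))

rng = np.random.default_rng(3)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(0.5 + 0.6 * X[:, 1]))

# Start: among random size-p subsets, keep the one whose fit gives the
# smallest median deviance component over all n observations.
best_med, subset = np.inf, None
for _ in range(200):
    s = rng.choice(n, size=p, replace=False)
    try:
        beta = fit_poisson(X[s], y[s])
    except np.linalg.LinAlgError:
        continue
    d2 = dev_components(y, np.exp(np.clip(X @ beta, -20, 20)))
    if np.median(d2) < best_med:
        best_med, subset = np.median(d2), s

# Grow the subset: refit on the current subset, then keep the m + 1 units
# with the smallest deviance components.
for m in range(p, n):
    beta = fit_poisson(X[subset], y[subset])
    d2 = dev_components(y, np.exp(np.clip(X @ beta, -20, 20)))
    subset = np.argsort(d2)[: m + 1]
print("units entering last:", subset[-3:])
```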
As it was for linear and nonlinear regression models, so also for generalized linear models is it informative during the search to look at the evolution of leverages $h_{i,S_*^{(m)}}$, parameter estimates $\hat\beta_{S_*^{(m)}}$, Cook's distances $D_m$, and t statistics. For the forward search the leverage (6.54) for unit i, with $i \in S_*^{(m)}$, is the ith diagonal element of the matrix
$W_{S_*^{(m)}}^{1/2}\, X_{S_*^{(m)}} \left(X_{S_*^{(m)}}^T W_{S_*^{(m)}} X_{S_*^{(m)}}\right)^{-1} X_{S_*^{(m)}}^T\, W_{S_*^{(m)}}^{1/2}.$ (6.64)
The approximate modified Cook distance for unit i is
$C_{mi} = \left\{ \dfrac{m-p}{p}\, \dfrac{h_{i,S_*^{(m)}}}{(1 - h_{i,S_*^{(m)}})^2}\, \dfrac{d_{i,S_*^{(m)}}^2}{\hat\phi_{S_*^{(m-1)}}} \right\}^{1/2}.$ (6.66)
Factor 1: Policy holder's age (PA) with eight levels: 17-20, 21-24, 25-29,
30-34, 35-39, 40-49, 50-59, 60+;
Factor 2: Car (vehicle) group (VG) with four levels: A, B, C, D;
Factor 3: Car (vehicle) age (VA) with four levels: 0-3, 4-7, 8-9, 10+.
The response is the average claim. The numbers of claims $m_{ijk}$ are also given in Table A.16.
The data are thus in the form of the results of an 8 × 4 × 4 factorial, but there are five cells for which there are no observations. The total num-
ber of observations is therefore 123. We parameterize the factors by using
indicator variables for each level except the first, the parameterization em-
ployed by McCullagh and Nelder. Like them we also fit a first-order model
in the factors - there is no evidence of any need for interaction terms.
Since the responses are average claims, we weight the data by the number of observations $m_{ijk}$ forming each average.
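In software the weighting is a one-line matter; the sketch below (Python with statsmodels and simulated stand-in data, since Table A.16 is not reproduced here) passes the claim counts m_ijk as var_weights in a gamma fit with the canonical reciprocal link.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 123
# Stand-in design: a constant plus indicator variables for factor levels.
X = sm.add_constant(rng.integers(0, 2, size=(n, 5)).astype(float))
m = rng.integers(1, 40, size=n).astype(float)   # claim counts per cell
mu = 1.0 / (0.005 + 0.002 * X[:, 1])            # reciprocal-link means
# Each response is an average of m gamma-distributed claims.
y = rng.gamma(shape=2.0 * m, scale=mu / (2.0 * m))

# Gamma family: the canonical reciprocal link is the statsmodels default.
fit = sm.GLM(y, X, family=sm.families.Gamma(), var_weights=m).fit()
print(fit.params)
print("dispersion from Pearson X^2:", fit.pearson_chi2 / fit.df_resid)
```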
To start, we explore the Box-Cox family of links for the five λ values −1, −0.5, 0, 0.5 and 1, calculating for each the goodness of link test introduced in §6.6.4. Figure 6.2 is the resulting equivalent of the fan plot for transformations, but is instead a series of forward plots for the goodness of link test. The test statistics are well behaved throughout the search, as the figure shows: both λ = −1 and λ = −0.5 seem completely acceptable, a conclusion in agreement with that from Figure 11.1 on page 377 of McCullagh and Nelder (1989). When λ = −1 the final value of the goodness of link test is 0.37 and the maximum absolute value during the search is 1.63. In the remaining part of the analysis we stay with the canonical (reciprocal) link.
Figure 6.3 shows the forward plot of the deviance residuals from a search
using 50,000 subsets to find the least median of deviances fit. This well-
behaved plot shows that, for all of the search, the most distant observation
is 18, which is the one identified by McCullagh and Nelder from the fit
to all the data as having the largest residual. Several residuals decrease in
magnitude at the end of the search but there is no evidence of any masking.
We next consider the forward plot of the leverage in Figure 6.4. There is
no information here about observations of exceptional leverage, although
there are two points to be made about the consequences of the factorial
structure of the data. The first is that, for normal regression, the leverage
Figure 6.2. Car insurance data: goodness of link tests for five values of λ
Figure 6.3. Car insurance data: forward plot of deviance residuals
Figure 6.4. Car insurance data: forward plot of leverages
structure depends solely on the values of the factors for those observa-
tions in the subset. But, for a generalized linear model, the leverage (6.54)
depends also on the observations through the parameter estimates and
weights. Effects similar to those for nonlinear least squares in Figure 5.20
are therefore a possibility. There are no such dramatic changes in Figure 6.4,
although the plot does show the effect of the factorial structure in the near
vertical decreases in leverage, caused by the introduction of some factorial
points. This structure is absent from leverage plots where the explanatory
variables have a more random structure.
The next plot shows the deviance and the estimate of the dispersion parameter during the search. Since we are searching using deviance residuals, the plot of the deviance in Figure 6.5(left) is smoother than that in Figure 6.5(right), which shows the evolution of φ̂ estimated from the value of Pearson's X² (6.52). The data appear correctly ordered, with none of the jumps and decreases in deviance associated with masked outliers.
The next pair of plots, both in Figure 6.6, show the parameter estimates, which are stable throughout the search, and the t statistics. These parameter estimates on single degrees of freedom have been plotted to show the factor to which each belongs. The values of the β̂ do not indicate any influential observations. The values of the t statistics decrease in magnitude during the forward search, as do those for normal data, due to the increasing value of the estimate of the dispersion parameter as the search progresses. The only new feature is the occasional upward jumps in the t statistics. Since, as we have already seen, the parameter estimates and the value of φ̂ behave smoothly, these jumps relate to the changes in leverage that are shown in Figure 6.4, which result from the factorial structure of the data.
Figure 6.5. Car insurance data: (left) deviance and (right) dispersion parameter φ̂ estimated from Pearson's X²
Figure 6.6. Car insurance data: forward plots of (left) parameter estimates and (right) t statistics. The upward jumps in the t statistics result from the factorial structure of the data
Figure 6.7. Car insurance data: (left) approximate modified Cook distance and
(right) maximum deviance residual during the forward search
Table 6.5. Car insurance data: the last five stages of the forward search, showing the increase in deviance
m | Observation number i | y | Dispersion φ̂ | Deviance | Deviance difference
The final plot, Figure 6.7, again stresses the well-behaved nature of the data. Given in the left panel are approximate modified Cook distances cal-
culated using (6.66). This shows that the observations entering at the end of
the search cause slightly more change in the parameter estimates than those
entering earlier. The parameter estimates themselves in Figure 6.6(left)
show how slight is this effect. Figure 6.7(right) gives the maximum abso-
lute deviance residual in the subset. This shows once more how the more
extreme observations enter towards the end of the search. There are no
surprises.
Finally we consider the deviance in the last few stages of the search.
Table 6.5 lists the last five observations to enter the search, together with
the deviance and the estimated dispersion parameter. As the last column of
the table shows, there is a steady upward trend in the increase in deviance
as each observation is added. The final observation to be included is 18,
which was revealed as the most extreme in the forward plot of the residuals.
If this observation is not included, the estimated dispersion parameter is
1.066, close to one for the exponential distribution.
Our analysis thus shows that the data are well fitted by a model close to the exponential and that there are no influential observations, although one observation is somewhat remote. However, it has no significant effect on inferences. A final comment is that our parameter estimates at the end of the search agree with those of McCullagh and Nelder (1989, p. 299) except that the estimate for level 2 of the vehicle age factor should be 366 not 336, an easily generated misprint.
6.9 Dielectric Breakdown Strength
Since the response is nonnegative with many small values around one and a maximum of 18.5, some form of model other than linear regression is likely to be needed. The original analysis used the logged response together with a nonlinear model in temperature. Here we follow the suggestion of G. Smyth of the University of Queensland (www.maths.uq.edu.au/~gks/data/general/dialectr.html) and explore generalized linear models, using the gamma distribution. Some previous analyses have treated the two continuous explanatory variables as factors. We instead treat them as variables and try to find a model with few parameters. The data are in the form of an 8 × 4 factorial with four observations per cell. As we did in the analysis of the Box and Cox poison data in §4.4, we ignore the presence of the replicate observations. Here these could provide a test of the goodness of fit of the models using the residual deviance from a saturated two-way model with interactions to estimate the dispersion parameter regardless of the linear model.
We begin with plots of the data. Since there is a factorial structure we
replace the scatterplot matrix with scatter plots using symbols to show
the levels of the factor not plotted. Figure 6.8 is a plot of y (strength) against time x1. The responses for the highest temperature (represented
Figure 6.8. Dielectric data: strength against time
by triangles, △) are by far the lowest readings at high time and lie away from the rest of the observations. Because the plot is very congested for low values of time we show in Figure 6.9 the plot of y against log x1. The patterns for the four temperatures as a function of time are revealed as rather different: for the lowest temperature, represented by squares (□), the line of points is virtually horizontal, showing little effect, on breakdown, of time at this level of temperature. For the next higher level, plotted as circles (○), there is a slight and roughly linear downward trend. For the third level (+) the response is dropping rapidly towards the end, whereas for the highest temperature (△) an asymptote seems to have been reached around response values of one.
Figure 6.10 shows two plots against the other factor, temperature. The increasing spread to the right in Figure 6.10(left) indicates that we need to include some interaction in the model. We repeat this plot in Figure 6.10(right) with the readings for different times systematically separated by increasing the temperature readings by 3° for each increase in x1. This plot shows how the groups of observations for high temperatures and high times (located in the lower right-hand corner of the graph) lie away from the rest of the data. It is not clear from the plots whether a simpler structure and better model will be obtained by using x1 as a variable, or its logarithm. We use log x1.
The interaction structure visible in the plots suggests that it may be hard
to find a satisfactory linear predictor for the data. To try to see what kind
of model might be satisfactory we start by fitting a linear predictor with linear terms in log x1 and x2 using a gamma model with the reciprocal link.
Figure 6.9. Dielectric data: strength against log(time). Temperature levels: 180 (□), 225 (○), 250 (+), 275 (△); unit groups 61-64, 77-80, 93-96, 109-112 and 125-128 are labelled
Figure 6.10. Dielectric data: strength against temperature, (left) as recorded and (right) with the readings for different times separated. Time levels: 1, 2, 4, 8, 16, 32, 48, 64
Figure 6.11. Dielectric data, reciprocal link: forward plot of deviance residuals
Figure 6.11 shows the forward plot of the deviance residuals and Figure 6.12
the forward plot of the score statistics for the goodness of link test. Neither
is satisfactory: the plot of residuals shows several remote groups of residuals
throughout the forward search and the final value of the score statistic,
using the Box-Cox link, is -8.26.
We first try to find a satisfactory link and then consider the linear pre-
dictor. The strategy is similar to that in Chapter 4 where a satisfactory
transformation was found first, before checking the linear model. Table 6.6
gives the value of the goodness of link statistic for five values of A calcu-
lated using the Box-Cox link. This gives the same numerical values as the
Table 6.6. Dielectric breakdown strength: goodness of link tests for some values of λ using the Box-Cox link

λ      Link test
2      −0.55
1      −8.82
0.5    −9.09
0      −8.58
−0.5   −8.33
−1     −8.26
Figure 6.12. Dielectric data, reciprocal link: forward plot of the goodness of link test
power link, except for the sign of the statistic, which is informative about the direction of departure.
The table suggests that we should consider a link with λ = 2. However, the forward plot of this goodness of link statistic in Figure 6.13 shows that, although the value of the statistic may be acceptable for all the data, it is not so earlier on, having a maximum absolute value of 5.58 when m = 115. The forward plot of deviance residuals, Figure 6.14, shows that around this extreme value there is a rapid change in the values of some residuals (units 125 to 128). We do not print the forward plot of leverages, but it also helps to explain Figure 6.13: with this link some observations of very high leverage enter at the end of the search and affect the value of the link test. We therefore reject λ = 2 and follow our second course of action, which is to try to build a better linear predictor. For this we return to our results for λ = −1, the reciprocal link often being found to be satisfactory for gamma data.
The discussion of the scatterplots such as Figure 6.9 suggested that interactions and second-order terms would be needed. A full second-order model in log x1 and x2, including an interaction term on one degree of freedom, has a deviance of 15.15 as opposed to 23.64 for the first-order model. Although this is an appreciable reduction for three extra degrees of freedom, the forward plot of residuals is similar to Figure 6.11, still showing groups of negative outliers. We try to accommodate these outliers by fitting dummy variables for the individual groups, as sketched below. The hope is that we can explain a few groups individually and that then the rest of the observations will be fitted by the second-order model.
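A minimal illustration of this device (Python; the group indices and variable values are hypothetical stand-ins) appends a dummy column to the design matrix so that the flagged group receives its own parameter.

```python
import numpy as np

# Suppose logx1 and x2 are the explanatory variables and `group` indexes
# the suspect units (made-up values, purely for illustration).
n = 12
rng = np.random.default_rng(5)
logx1, x2 = rng.normal(size=n), rng.normal(size=n)
group = np.array([9, 10, 11])                 # units to receive a dummy

# Second-order model columns plus one dummy variable for the group.
d = np.zeros(n)
d[group] = 1.0
X = np.column_stack([np.ones(n), logx1, x2,
                     logx1 ** 2, x2 ** 2, logx1 * x2, d])
print(X[group][:, -1])   # the dummy is 1 only for the flagged units
```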
Figure 6.13. Dielectric data, Box-Cox link with λ = 2: goodness of link test
Figure 6.14. Dielectric data, Box-Cox link with λ = 2: forward plot of deviance residuals
Figure 6.15. Dielectric data, reciprocal link with second-order model and two dummy variables: forward plot of deviance residuals
Figure 6.16. Dielectric data, Box-Cox link with second-order model and three dummy variables: goodness of link test for five values of λ
The smallest of the nonsignificant t values at the end of the search is for $x_2^2$. We therefore exclude this variable from the linear predictor and rerun the forward search for an eight-variable predictor. All variables are now significant, so we present a full analysis of this final model.
The final model has a linear predictor including three dummy vari-
ables and a full second-order model with interaction in temperature and
log(time), except for the quadratic term in temperature. Figure 6.18 is the
forward plot of the residuals, which is in general well behaved throughout.
The highest residual is for unit 125, which is the largest observation at
the highest level of both explanatory variables and one of the highest in
its group of eight observations. The last observation to enter the search
is 111. In some other searches that we performed, this observation entered earlier, left the subset, and entered again at the end of the search, giving slightly different plots. The most negative residuals in
Figure 6.18 reflect the replicated structure of the data and come from the
smallest of four observations in particular cells of the factorial.
The forward plot of the leverages, Figure 6.19, shows horizontal lines
of high leverage that arise from the dummy variables, for which the
coefficients are determined by only a few observations, either four or
eight. The parameter estimates are in Figure 6.20(left) and are stable
during most of the search, trending slightly towards the end. The t statis-
tics in Figure 6.20(right) confirm that all variables are now significant.
Figure 6.21(left) shows the forward estimate of the deviance and, on the
right, the estimated dispersion parameter. Both show the smooth upward
trend associated with data correctly ordered by the forward search. The
Figure 6.17. Dielectric data, Box-Cox link with λ = 0.5 and a second-order model including three dummy variables: forward plot of t statistics
Figure 6.18. Dielectric data, Box-Cox link with λ = 0.5 and the reduced second-order model including three dummy variables: forward plot of deviance residuals
Figure 6.19. Dielectric data, Box-Cox link with λ = 0.5 and the reduced second-order model including three dummy variables: forward plot of leverage, showing the effect of the replicated factorial structure
Figure 6.20. Dielectric data, Box-Cox link with λ = 0.5 and the reduced second-order model including three dummy variables: forward plots of (left) estimated coefficients and (right) their t statistics
Figure 6.21. Dielectric data, Box-Cox link with λ = 0.5 and the reduced second-order model including three dummy variables: forward plots of (left) the deviance and (right) the scale estimate φ̂
goodness of link test for this reduced model is in Figure 6.22. The plot lies
within the 5% limits throughout. The most noticeable feature towards the
end of the search is the effect of observation 111, causing an upward jump
when it is introduced at the end of the search. However, in general, compar-
ison with Figure 6.16 shows that dropping one term from the predictor has
not affected the link test. As a pendant to this analysis, Figure 6.23 shows
that observation 111, entering at the end of the search, has a large residual
in the forward plot of the maximum absolute residual in the subset.
In fact, observation 111 comes from the group of observations from the
next to longest time and highest temperature. It is therefore included in
the same group of eight observations as 125 to 128 as Figure 6.24 shows.
The dummy variable for this group of eight should therefore perhaps be
split from that for the group for the longest time and the forward search
repeated. We do not do this here, but instead consider what our analysis
has achieved.
The forward search revealed the groups of observations that do not agree
with the model fitted to the rest of the data. The search also showed the
effect of these observations on the selected link and linear predictor. The
result is a model in which the five groups of observations for the highest
temperature (61-64, 77-80, 93-96, 109-112 and 125-128) and log(time)
greater than 2 (Figure 6.24) are modelled separately from the rest of the
data. The implication is that simple linear models are not adequate to
describe these data.
There are many other possible models. A simple alternative to that ex-
plored here is to work with time rather than its logarithm. Another is to
consider normal theory regression: our final dispersion estimate is 0.012,
implying an index for the gamma distribution around 80; the distribution
is thus close to normal. Another possibility is to consider a nonlinear model
Figure 6.22. Dielectric data, Box-Cox link with λ = 0.5 and the reduced second-order model including three dummy variables: forward plot of goodness of link test
Figure 6.23. Dielectric data, Box-Cox link with λ = 0.5 and the reduced second-order model including three dummy variables: forward plot of the maximum absolute deviance residual in the subset; observation 111 enters last
Figure 6.24. Dielectric data: dummy variables. In this version of Figure 6.9 the observations modelled using the three dummy variables are shown with filled symbols
in time to fit the exponential decay indicated in Figure 6.8, with the rate a
function of temperature. Such a generalized nonlinear model is outside the
models described in this book, although similar diagnostic methods based
on the forward search would apply.
6.10 Poisson Models
The Poisson distribution
$f(y;\mu) = \dfrac{e^{-\mu}\mu^y}{y!}, \qquad \mu > 0, \quad y = 0, 1, \ldots,$
is used to model counts with no specified upper bound. The loglikelihood of a single observation (6.22) is
$l(\mu; y) = y\log\mu - \mu - \log y!.$ (6.22)
Then the deviance for the sample is found, as in §6.7, by differencing and summing twice the difference over the sample to be
$D(\hat\beta) = 2\sum_{i=1}^n \{y_i \log(y_i/\hat\mu_i) - (y_i - \hat\mu_i)\}.$ (6.71)
The second term in this deviance, $\sum(y_i - \hat\mu_i)$, is identically zero for linear predictors including a constant. The deviance will be small for models with good agreement between $y_i$ and $\hat\mu_i$. Since the scale parameter φ = 1, the value of the deviance can be used to provide a measure of the adequacy of the model, tested by comparison with a chi-squared distribution on n − p degrees of freedom. If the estimated means μ̂ are "small," the distribution of the deviance may need to be checked by simulation.
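Such a check is easy to sketch (Python; simulated design and small means, purely illustrative): simulate many Poisson samples from the fitted model, recompute the deviance for each, and compare its distribution with χ² on n − p degrees of freedom.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(6)
n = 40
X = sm.add_constant(rng.normal(size=n))
mu_small = np.exp(-1.0 + 0.3 * X[:, 1])      # means well below one

devs = []
for _ in range(2000):
    y = rng.poisson(mu_small)
    fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    devs.append(fit.deviance)
devs = np.array(devs)

# Compare the simulated 95% point with the chi-squared approximation.
print("simulated 95% point:", np.quantile(devs, 0.95))
print("chi2_{n-p} 95% point:", stats.chi2.ppf(0.95, df=n - 2))
```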
We examine two examples of Poisson data: the data on train accidents,
in which we have one factor and one continuous explanatory variable, and
data on cellular differentiation, which are the results of a 4 × 4 factorial experiment with quantitative factors. We do not specifically consider con-
tingency tables, that is, Poisson data with qualitative factors. References
to the large literature on this subject are given at the end of the chapter.
The most straightforward analysis of contingency tables is concerned with discovering whether there is a relationship between the factors. If there is no relationship, an additive model is appropriate, with cell means $\hat\mu_i$ estimated from the product of the marginal distributions of observations over each factor. Pearson's chi-squared goodness of fit test compares the predictions from this model with those from the saturated model, giving the test statistic
$X^2 = \sum_{i=1}^n \dfrac{(y_i - \hat\mu_i)^2}{\hat\mu_i} = \sum_{i=1}^n \dfrac{(O_i - E_i)^2}{E_i}.$ (6.72)
In the last expression in (6.72) the $O_i$ are the observed $y_i$ and $E_i$ are the expected values. Pearson's statistic provides an alternative to the use of the deviance, which is sometimes called G² in the literature on contingency tables. Both statistics can be used for overall testing of models where the $\hat\mu_i$ come from fitting more complicated models than the product of the
estimated marginal distributions. We leave it to Exercise 6.6 to establish
the relationship between the two statistics.
A distinction between the Poisson model and the other generalized linear models examined in this chapter is that the link function is not often in question. In our examples we use the (canonical) log link $\log(\mu) = \eta$.
6.11 British Train Accidents
A difference from the preceding analyses of gamma data is that the dispersion parameter does not have to be estimated, since it is one for Poisson data. The deviance therefore provides a test of the goodness of fit of the model. We use Poisson models with a log link, so that $\mu = \exp\eta(x)$. However,
the distribution of the number of accidents will also depend on the amount of traffic on the railway system, measured by billions of train kilometres and also given in Table A.18. For year t let the value be $m_t$. The mean number of accidents will be proportional to $m_t$, so the model for year t becomes
$\mu = m_t \exp\eta(x) = \exp\{\log m_t + \eta(x)\}.$ (6.73)
The dependence of accidents on traffic is thus modelled by adding to the linear predictor a term $\log m_t$ with known coefficient one. Such a term with a known coefficient is called an offset (McCullagh and Nelder 1989, p. 423).
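In software an offset is simply passed alongside the design matrix; the sketch below (Python with statsmodels, with simulated traffic figures standing in for Table A.18) fits a Poisson model with log m_t entering the linear predictor with coefficient one.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 30
m_t = rng.uniform(0.3, 0.5, size=n)          # train-km (billions), stand-in
year = np.arange(n, dtype=float)
X = sm.add_constant(year)
mu = m_t * np.exp(1.0 - 0.02 * year)         # accidents proportional to m_t
y = rng.poisson(mu)

# The offset log(m_t) enters the linear predictor with known coefficient one.
fit = sm.GLM(y, X, family=sm.families.Poisson(),
             offset=np.log(m_t)).fit()
print(fit.params)     # estimates of the constant and the year effect
```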
A complication which we ignore in our analysis is that the data do not
include figures on accidents in which there were no deaths. From a practi-
cal point of view, there are difficulties in defining such accidents. From a
statistical point of view we need to explore the effect of ignoring such zero
truncation in the estimation of Poisson parameters.
We first fitted a model in which the linear predictor included the offset
and the year as well as a factor with three levels for the type of rolling
stock. To demonstrate the difference between residual plots for continuous
data and those for discrete data we give in Figure 6.25 a plot of deviance
residuals against fitted value when the three largest observations (units 13,
23 and 63) are excluded.
This plot shows appreciable structure, which is unrelated to whether the
model fits well. First, the fitted values fall into three groups corresponding
successively to nonpassenger (goods) trains, post-Mark 1 passenger trains
and Mark 1 trains. The fitted values for these groups are slightly smeared by
the effect of calendar year. Also visible in the plot are a series of decreasing
curves: the lowest is for the residuals for all observations which are one,
the one above for all observations equal to two, and so on. Such banding is
typical of residual plots for discrete data with few different observed values.
We can expect that this structure may also affect some forward plots of
residuals.
The forward analysis of all the data showed that time was not significant,
so we removed the variable and used only the single factor for train type.
The deviance residuals from the forward search are shown in Figure 6.26.
The three large residuals for units 63, 13 and 23 are for observation values
of 49, 35 and 13, the three largest, the others all being 10 or less. These
are the last three observations to enter the forward search. A less obvious
feature of the plot is the slight curvature starting around m = 29 which
corresponds to the successive inclusion of 10 units of Mark 1 stock for all of
which there was one death. After this the order of entry of the data reflects
the size of the response.
Figure 6.25. Train data: residuals against fitted values showing the pattern induced by the discrete values of the response. Numbers correspond to the value of the response. The vertical lines divide the observations into three nonoverlapping groups. The higher risk from Mark 1 stock is evident
Figure 6.26. Train data: forward plot of deviance residuals. The residuals from the three largest observations are evident
Figure 6.27. Train data: goodness of link test, again showing the effect of the largest observations
The model seems inadequate for all the data and the inadequacy shows in the plots from the forward search. Figure 6.27, for example, is the forward plot of the goodness of link test. Here the effect of the inclusion of the last three observations to enter the search (23, 13 and 63) is highly evident. Inclusion of unit 23 causes the value of the test to become slightly negative, while units 13 and 63 cause an upward jump and bring the value of the statistic much beyond the 1% limits. As a final set of plots showing the inadequacy of the model, Figure 6.28 presents the deviance and the estimated dispersion parameter. Both are smooth curves, showing that the data are well ordered by the search, although the values are very large towards the end of the search. Since, if the Poisson model holds, φ equals one, we have an absolute test of the adequacy of the model by comparing the deviance with the χ² distribution on m − 3 degrees of freedom. The deviance is below the 95% point of this distribution until m = n − 2 = 65, when at 102.8 it exceeds the 99.9% point of $\chi^2_{62}$. Although this distribution
is asymptotic and may not hold exactly for small numbers of counts, there
is no doubt that the Poisson model is inadequate. This is not surprising
since, for example, we do not have data on accidents in which there are no
fatalities, so that a zero-truncated distribution is required. A more plausible
model would be a compound Poisson process, in which accidents happen
in a Poisson process, but the number killed, given that there has been an
accident , has a zero-truncated distribution. It would then be of interest
to determine the relationship between the factors of both the number of
accidents and their severity.
Figure 6.28. Train data: (left) deviance and (right) dispersion parameter φ̂. The largest observations cause significant lack of fit
6.12 Cellular Differentiation Data
x1: Dose of tumor necrosis factor (U/ml) with four levels (0, 1, 10 and 100), TNF;
x2: Dose of Interferon-γ (U/ml) with four levels (0, 4, 20 and 100), IFN;
y: Number of cells exhibiting differentiation.
Figure 6.29. Cellular data: forward plot of deviance residuals. Observation 16 has a large negative residual
Figure 6.30. Cellular data: (left) deviance and (right) modified Cook distance. Introduction of observation 16 at the end of the search has a dramatic effect on both statistics
Figure 6.31. Cellular data: number of cells exhibiting differentiation against (left) log(dose of TNF + 1) and (right) log(dose of IFN + 1), with symbols showing the levels of the other factor
Figure 6.30 shows the deviance and the modified Cook distance. Both show the dramatic effect of the introduction of observation 16: the deviance, for example, increases from 4.39 to 23.03. Thus, without observation 16, testing the deviance against a chi-squared distribution on eight degrees of freedom shows that the model fits well and there is no evidence of any interaction. With observation 16 included the model does not fit adequately, so any evidence for interaction depends on this observation alone. Another piece of evidence that observation 16 is influential for the model is that the goodness of link test changes from −0.59 to −3.26 when the observation is introduced at the end of the search. Otherwise the log link is in agreement with the data.
To try to understand this effect, we show scatterplots of a transformation of the data in Figure 6.31. For both variables the lowest dose levels are 0 and the highest 100. It is usually found, as it is for Bliss's beetle data introduced in §6.1.2, that simple models are obtained by working with log dose. If, as here, one of the dose levels is zero, it is customary to add one to the dose before taking logs, thus working with variables $w_j = \log(1 + x_j)$, j = 1, 2. These are the variables used in the scatterplots of Figure 6.31, which show the values of y increasing with both explanatory variables. In the absence of interaction the changes in response with the levels of the factor in the plot should be the same for all levels of the second factor represented by symbols. The additive structure therefore results in moving the dose response relationships up or down for the levels of the second factor. Changes in shape indicate interactions. Compared with plots for normal data, those here for Poisson data are made more difficult to interpret by the dependence of the variance on the magnitude of the observations. Such an additive structure appears to be present in Figure 6.31, except that the left panel indicates that the highest observation, 16, may be too low for additivity. We have already seen that this is the unit with a large
Figure 6.32. Cellular data, observation 16 deleted: (left) deviance, showing lack of fit throughout the search, and (right) t statistics; the interaction term is not needed
negative residual until it enters at the last step of the forward search,
another indication that the value is too low. We therefore consider the
effect of arbitrarily adding 100 to the observation. Since the observation
enters in the last step of the forward search, nothing changes until the end
of the search, when the residual deviance becomes 6.08 rather than 23.03.
This new value is in agreement with the model without interaction and
with the rest of the data.
We now try to find a simpler model by regression on the "log" doses $w_1$ and $w_2$. For the reasons given above, we leave aside observation 16. Inspection of Figure 6.31 does not inspire much hope that it will be possible to find a simple regression model for the data. Although the responses increase with the variables, the increase is generally of a "dog-legged" form which requires a cubic model, rather than a linear or quadratic one. The main exception is the upper set of responses in Figure 6.31(left) which, with the suspect observation 16, form a nice straight line.
Our numerical results confirm our initial pessimism. The final deviance
for a model for 15 observations with first-order terms and their interac-
tion is 26.1, so that there is evidence of lack of fit of this simple model.
The forward plot of the deviance, Figure 6.32(left), shows that the lack of
fit extends through a large part of the data: there are no outstandingly
large contributions to the deviance, just a steady increase in value as each
observation is added. The plot of t statistics in Figure 6.32(right) is very
different in form from those for normal and gamma models in which es-
timation of the dispersion parameter caused the plot to contract as the
search progressed. Here, since φ is taken as one throughout the search, the
t statistics either remain approximately constant or increase with m. This
particular plot shows great stability during the search - again there are no
influential observations. There is also no evidence of any interaction.
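This fit is easy to reproduce. The following is a minimal R/S-Plus-style sketch; the data frame cell and its column names y, tnf and ifn are our assumptions, not code from the book.

    # Poisson fit to the cellular data on the "log" dose scale
    cell$w1 <- log(1 + cell$tnf)          # w_j = log(1 + x_j)
    cell$w2 <- log(1 + cell$ifn)
    # leave aside observation 16, as in the text
    fit <- glm(y ~ w1 * w2, family = poisson, data = cell, subset = -16)
    deviance(fit)     # 26.1 in the text for the 15 remaining observations
    summary(fit)      # t (z) statistics; the dispersion phi is taken as one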
The binomial distribution for the number of successes r in n trials is

f(r; θ) = {n! / (r!(n − r)!)} θ^r (1 − θ)^{n−r},   r = 0, 1, ..., n.

As for the Poisson distribution, here the dispersion parameter is also equal
to one.
It is convenient to rewrite the distribution in terms of the variable
y = r/n with E(Y) = θ and var(Y) = θ(1 − θ)/n,
when the loglikelihood for a single observation becomes

l(θ; y) = ny log θ + n(1 − y) log(1 − θ) + d(n, y).

With estimated probabilities θ̂ from parameter estimates β̂, the loglikeli-
hood for a single observation is

l(β̂; y) = ny log θ̂ + n(1 − y) log(1 − θ̂) + d(n, y)

and that for the saturated model is

l(β_max; y) = ny log y + n(1 − y) log(1 − y) + d(n, y).

Then the deviance for the sample is found, as in §6.7, by differencing and
summing over the sample to be

D(β̂) = 2 Σ_{i=1}^n [ n_i y_i log(y_i/θ̂_i) + n_i(1 − y_i) log{(1 − y_i)/(1 − θ̂_i)} ].   (6.74)
The deviance will be small for models with good agreement between y_i
and θ̂_i. Since the dispersion parameter φ = 1, the value of the deviance
can be used to provide a measure of the adequacy of the model, tested by
comparison with a chi-squared distribution on n − p degrees of freedom
provided the n_i are "large". If the numbers n_i in the groups are "small",
the distribution of the deviance may need to be checked by simulation. For
the limiting case of binary data, when all n_i = 1, discussed in §6.17.1, this
deviance is uninformative about the fit of the model (Exercise 6.7). Intu-
itively, for the saturated model for ungrouped binary responses (n_i = 1),
as n increases, the number of estimated parameters tends to infinity, since
one parameter is fitted to each observation. On the contrary, for binomial
observations only one parameter is fitted to each proportion R_i/n_i in the
saturated model. Thus, when n_i → ∞ the number of fitted parameters
remains constant.
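As a numerical check on (6.74), the deviance can be computed directly from the fitted probabilities and compared with the value reported by the fitting software. A minimal R/S-Plus-style sketch, with counts r out of n and a fitted binomial model object fit (all names ours):

    y     <- r / n
    theta <- fitted(fit)                   # fitted probabilities theta-hat
    # terms with y_i = 0 or y_i = 1 contribute zero: define 0*log(0) = 0
    xlogx <- function(a, b) ifelse(a == 0, 0, a * log(a / b))
    D <- 2 * sum(n * xlogx(y, theta) + n * xlogx(1 - y, 1 - theta))
    all.equal(D, deviance(fit))            # should agree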
We now compare several link functions for binomial data. Table 6.1
included four links often used for modelling binomial data. Figure 6.33 shows
how they vary in the relationship they give between linear predictor and
probability. In all panels, the continuous line represents the logit link.
For the symmetrical links (probit and arcsine) the curves have been
rescaled to agree not only when the probability equals 0.5, but also for
probabilities 0.1 and 0.9. The first panel shows how very close the
probit and logit links are. Chambers and Cox (1967) calculate that several
thousand observations would be needed in order to tell these two apart.
The fourth panel emphasizes the short tails of the arcsine link: outside a
certain range the probabilities are identically zero or one.
Figure 6.33. Comparison of links for binomial data to the logit link, represented
by a continuous line. Symmetrical links have been scaled to agree with the logit
at probabilities of 0.1 and 0.9
The other two panels of Figure 6.33 show two asymmetric links. The
complementary log log link was defined in Table 6.1 as g(μ) = log{−log(1 −
μ)}. Applying this link to 1 − y gives the log log link, for which g(μ) =
log{−log(μ)}. We find both links useful in the analysis of data. Because
of the similarity of the probit and logit links it does not, usually, matter
which is fitted. The logit link has the advantage of easier interpretation
through interpretation of the logit as a log of odds.
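Standard fitting functions usually provide the complementary log log but not the log log link. As the definition above shows, the latter is obtained by applying the complementary log log link to 1 − y, that is, by interchanging successes and failures. A sketch with counts r out of n and explanatory variable x (the variable names are ours):

    fit.cll <- glm(cbind(r, n - r) ~ x, family = binomial(link = "cloglog"))
    # log log link: interchange successes and failures, then use cloglog
    fit.ll  <- glm(cbind(n - r, r) ~ x, family = binomial(link = "cloglog"))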
Figure 6.34. Bliss's beetle data: absolute values of deviance residuals as the subset
size increases: (top) logit, (middle) probit and (bottom) complementary log log
links
from the rest of the data: they are badly predicted by models in which
they are not included. On the other hand, the residuals from the forward
search with the complementary log log link show no such behaviour; all
residuals are smaller than two throughout, and relatively constant. Since
the scale parameter is not estimated, it is possible to make such absolute
comparisons of the residuals across different models, even if they come from
different link families.
Figure 6.35 shows a forward plot of the goodness of link test, the order
of introduction of the observations being different for the three links. For
the logit and probit links these plots show evidence of lack of fit at the
5% level, which is indicated by the statistic going outside the bounds in
the plot. Although it is inclusion of the last two observations that causes
the values of the statistic to become significant, it is clear from the steady
upward trend of the plots that lack of fit is due to all observations. The
plot for the complementary log log link shows no evidence of any departure
from this model. This plot also shows that unit 5, which is the one with
the biggest residual for the complementary log log link and the last to be
included in this forward search, has no effect on the t value for the goodness
of link test.
This analysis shows that, of the three links considered, only the comple-
mentary log log link is satisfactory. The plots of fitted values in Figure 6.36
relate this finding to individual observations. The upper pair of plots show
Figure 6.35. Bliss's beetle data: forward plot of the goodness of link test. Only
the complementary log log link is satisfactory
the fitted dose response curve for the logistic model, both at the beginning
and at the end of the forward search. When m = 2, observations 1 and 2
are badly fitted by this symmetrical link. At the end of the search these
two lower observations are better fitted, but observations 7 and 8 are now
less well fitted. The complementary log log link, the fitted dose response
curves for which are shown in the lower two panels, is not symmetrical
and can fit both the higher and lower dose levels for these data. The sym-
metrical probit link gives fitted curves very similar to those for the logistic
model. We now consider another relatively simple example in which there
are advantages in using an asymmetrical link.
There are two types of preparation, with nine levels of dose for the stan-
dard preparation and five levels for the test preparation. Since there are
Figure 6.36. Bliss's beetle data: actual and fitted values showing that the fit of
the complementary log log link does not change appreciably with m
between 30 and 40 mice in each group, the value of the deviance should be
a useful overall measure of the goodness of fit of the models.
We start by considering the three models analyzed by Lindsey using
log(dose). The first is a logistic model in the two variables, for which the
residual deviance is 8.79. The second is a model with the complementary
log log link and, again, both variables, for which the deviance is a slightly
larger 12.87. Because the link is not symmetrical, it matters whether we
take the proportion of mice with convulsions as the response or, as we do
for the third model, the proportion with no convulsions. For this log log
model the residual deviance is 4.688. There are 14 − 3 = 11 residual degrees
of freedom, so it seems that all models fit adequately, with the
log log link fitting slightly better than the others. We now use the forward
search to elucidate the reasons for this ordering of the models.
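A sketch of the three fits, assuming a data frame mice with counts r of mice with convulsions out of n, the log dose ldose and a factor prep for the preparation (the names are ours, not Lindsey's):

    f1 <- glm(cbind(r, n - r) ~ ldose + prep, binomial(link = "logit"), data = mice)
    f2 <- glm(cbind(r, n - r) ~ ldose + prep, binomial(link = "cloglog"), data = mice)
    # log log model: the proportion with no convulsions as the response
    f3 <- glm(cbind(n - r, r) ~ ldose + prep, binomial(link = "cloglog"), data = mice)
    c(deviance(f1), deviance(f2), deviance(f3))   # 8.79, 12.87 and 4.688 in the text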
Figure 6.37 shows forward plots of the deviance residuals for the logistic
model. For this, as for the other two models, the plot is very stable. For the
logistic model observation one consistently has the most extreme residual.
This observation is that with the lowest dose level, so low that there are
no convulsions. It is the last to enter the forward search. In the analysis of
Bliss's data on beetles we saw that it was observations at extreme values
of the explanatory variables that carried information about the correctness
of the link. For the complementary log log link, the forward residual plot
Figure 6.37. Mice: deviance residuals from the logistic model
(Figure 6.38) shows that both observations 1 and 10 have relatively large
absolute deviance residuals. Observation 10 is that with the lowest dose
level for the group receiving the test preparation and is the second to last
to enter the forward search, observation 1 again being last. It seems that
this new link has accentuated the failings of the model at low dose levels.
Figure 6.39 shows the plot for the log log link. There are no observations
that consistently have large residuals, although the last two to enter the
search are again observations 10 and 1.
Figure 6.40 shows the plot of the goodness of link tests for the models
during the three forward searches for the three links. At the end of the
searches the values are -1.353, -2.324 and 0.247. There is thus some ev-
idence that the complementary log log model has an unsatisfactory link.
The power of this test on one degree of freedom for a specific departure is
to be contrasted with the value of 12.87 for the residual deviance, for which
the significance level is 30.2%.
The plot shows that some of the evidence for lack of fit comes from
observations 10 and 1, the last two to enter the search. A similar pattern,
but less pronounced, can be seen in the last step of the plot for the logistic
link, when observation 1 enters at the end of the search. The plot for the
log log model if anything shows an increase in support for the link when
the last two observations enter.
A final plot is Figure 6.41 which shows the growth of the deviance during
the forward search. All are smooth curves, showing that the data have
been correctly ordered. But those for the logistic link and, especially, for
Figure 6.38. Mice: deviance residuals from the complementary log log model
Figure 6.39. Mice: deviance residuals from the log log model
Figure 6.40. Mice data: forward plots of the goodness of link tests for the three
models
the complementary log log link, show marked increases at the end of the
search, due to observation 1 or to observations 10 and 1.
Other plots, such as those of the t statistics for the parameters, are not
shown here. They are smooth and well behaved, showing no evidence of
observations influential for the parameters of the linear predictor. However
the stable form of the forward plots of residuals, with persistent extreme
values, may be evidence of systematic departures from the model. As is to
be expected on general statistical grounds, the specific goodness of link test
on one degree of freedom provides a more powerful test for link departures
than does the general test using the residual deviance on several degrees
of freedom. Finally, for this example, by considering the anomalous obser-
vations in the context of the data, extra information has been gained: here
that low dose levels are responsible for the departures from some of the
links.
Figure 6.41. Mice data: forward plot of the deviances for the three models
binomial model with logit link to a cubic in standardized rainfall: that is,

log{θ/(1 − θ)} = β0 + β1 z + β2 z² + β3 z³,

where z denotes the standardized value of x. This fit has some strange
features: the cubic term in the model (β3) is significant at the 1% level,
with a t value of 3.35. The linear term (β1) is also significant at the 1%
level, with the quadratic term (β2) significant at the 5%; the constant
term (β0) is not significant. However the relationship does not explain all
the variation in the data: the residual deviance is 62.63 on 30 degrees of
freedom, significant evidence of lack of fit at a level of 0.043%, if asymptotic
theory is an adequate guide.
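The cubic fit is easily reproduced; a sketch assuming vectors pos (numbers of subjects testing positive), m (numbers of subjects sampled in each of the 34 cities) and rain (rainfall), with names of our choosing:

    z   <- drop(scale(rain))               # standardized rainfall
    fit <- glm(cbind(pos, m - pos) ~ z + I(z^2) + I(z^3), family = binomial)
    summary(fit)     # cubic term significant, t value 3.35 in the text
    deviance(fit)    # 62.63 on 30 degrees of freedom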
Figure 6.42 is the plot of residuals from the forward search. This figure
clearly shows that four units (34, 14, 19 and 23) have very large negative
residuals until m = 30. But some of these signs change when m = n = 34:
units 23 and 34 have positive deviance residuals of 1.39 and 0.13, whereas
unit 19 has a small negative deviance residual of -0.37. In the upper part
of Figure 6.42 we can detect three units (30, 27 and 29) that show deviance
residuals always above 3 in the central part of the forward search. However,
in the last step of the forward search unit 29 has a deviance residual that
is equal to only 0.22. We may thus expect some problems for backwards
methods due to masking. These are shown in the next part of this section.
Finally, in the plot of Figure 6.42 we can see two units (7 and 21) whose
negative residuals (less than -2) remain virtually constant in all steps of
the forward search.
Figure 6.42. Toxoplasmosis data with logistic link: deviance residuals as the subset
size increases in the forward search
To interpret the remaining plots we give in Table 6.7 the order in which
the forward search causes the observations to enter the fit. Also given is
the estimate of the dispersion parameter φ. Of course, for binomial data,
we hope for a value around one.
Table 6.7. Last steps of the forward search - subset size, observation introduced
and estimate of the dispersion parameter φ

m       34     33     32     31     30     29     28     27     26
Obs.    14     34     19     23     30     21     27     29      7
φ̂(m)  1.94   1.73   1.76   1.72   1.64   1.33   1.05   0.92   0.76
We notice that the observations entering at the end of the search are
precisely those identified as different by the forward plot of residuals.
The plot of the goodness of link test, in Figure 6.43(left), shows that
when observations 14, 34 and 19 are excluded, the statistic is almost sig-
nificant at the 5% level, having a value of -1.92. Adding the last three
observations causes the plot to move in quite a different direction, the final
value being 1.58. The plot of the deviance in Figure 6.43(right), shows that
the forward search has ordered the data up to m = 30, but that the last
four observations seem to be outliers - the smooth shape of the curve is lost.
The influence of the last four observations is shown by the
plot of the Cook statistic in Figure 6.44(left). The implication of the peak at
m = 31 is that addition of observation 23 causes a significant change in the
'" 0
<D
C\I
0
"'
$
-'"
~
~
Q)
'"
..,.
0
0
15 c
en 0 .s:'" 0
en
Q)
C
"0
Q)
Cl '"
0 ';" 0
0 C\I
(!)
~ ~
'? 0
5 10 15 20 25 30 35 5 10 15 20 25 30 35
Subset size m Subset size m
Figure 6.43. Toxoplasmosis data: (left) goodness of link test and (right) residual
deviance, showing effect of the last four observations to enter
Figure 6.44. Toxoplasmosis data: (left) Cook's distance showing the effect of
including observation 23 when m = 31 and (right) the t statistics for the individual
parameters
parameter values. The values of the statistic for larger m are small because
the introduction of the remaining observations reinforces the change in
the parameter values signalled by the Cook statistic. These changes are
most easily seen by looking at the plot of the individual t statistics in
Figure 6.44(right). The statistics for the linear and cubic terms remain
sensibly unchanged for most of the search. But those for the other two
terms change sign and become less significant in the last five steps of the
search.
These results have a straightforward interpretation if we go back to the
data as plotted in Figure 6.45. The solid line shows the cubic fit using all
data; the line with short dashes shows a cubic fit without observations 23,
19, 14 and 34 (the last four in the forward search). These four observations
form a group with the highest rainfall and are clearly all influencing the
Figure 6.45. Toxoplasmosis data: proportion testing positive versus rainfall for 34
cities in El Salvador (logit link). o=observed proportion. Solid line: fitted cubic
using all the observations (m = 34); short dashes: fitted cubic when m = 30; long
dashes: fitted cubic when m = 29
shape of the cubic curve in the same way, lessening the curvature at the
second point of inflection. The first of the four to be included is 23. Once
it has been included the other points do not greatly change the shape of
the curve, which explains the values of the Cook statistic in Figure 6.44.
When all are included observation 34 is virtually on the fitted curve. But
when m = n - 4 this observation has a deviance residual of -12.9. This
dramatic change can be seen in Figure 6.42. The last observation to be
considered is 30, which enters immediately before this group of four. The
effect resulting from its additional deletion, shown by the curve with long
dashes in Figure 6.45, is to reduce the curvature of the fitted cubic model.
It may seem surprising that observations 5 and 10 do not have a similar
effect, but they are for 2 and 10 subjects, whereas observation 30 is from
75.
Deletion of these five observations has other beneficial effects. The resid-
ual deviance is 36.42 on 25 degrees of freedom, still perhaps some evidence
of lack of fit, if asymptotic theory is a good guide, but a decided improve-
ment on the previous value. Deletion of one further observation gives a
value of 1.05 for φ̂(28), removing any evidence of that overdispersion which
caused Firth (1991) to wonder whether the model was appropriate. Of
course, to remove observations solely to achieve a small deviance is not
likely to lead to a correct model for the data. But our results show how
many aspects of model building and criticism come together once the obser-
vations have been ordered by the forward search. As one further example,
the t statistics for the parameters in Figure 6.44(right) are reasonably stable
up to m = 29.
Table 6.8. Toxoplasmosis data: the last stages of the forward search, with residual
deviances, for the logistic and complementary log log links

                     Logistic                  Complementary log log
Subset      Observation   Residual      Observation   Residual
Size m      Entering      Deviance      Entering      Deviance
26           7            18.97         21            22.67
27          29            23.55         24            26.57
28          27            27.75         27            34.25
29          21            36.42         20            43.80
30          30            45.96         13            46.23
31          23            50.18         23            50.63
32          19            53.03         19            53.06
33          34            54.20         34            54.26
34          14            62.63         14            62.43
Figure 6.46. Toxoplasmosis data with complementary log log link: (left) goodness
of link test and (right) Cook's distance showing the effect of including
observation 20 when m = 29 and then 23 when m = 31
Figure 6.47. Toxoplasmosis data with complementary log log link: proportion
testing positive versus rainfall for 34 cities in El Salvador. o = observed proportion.
Solid line: fitted cubic using all the observations (m = 34); short dashes: fitted
cubic when m = 30; long dashes: fitted cubic when m = 28
Figure 6.48. Toxoplasmosis data, logistic link: normal plot of deletion residuals
with simulation envelope
of our analysis is that the observations form a clear subset when the data
are appropriately plotted.
Figure 6.49. Toxoplasmosis data, logistic link: curves of the leverage in each step
of the forward search. The leverage for the unit that joins the subset in the last
step is denoted by a filled square
However, the presence of one unit of the group causes the second unit to
enter with reduced leverage. Thus, as this plot clearly shows, units 23
and 19 are included with leverages equal to 0.86 and 0.39, respectively.
In the final step, however, their leverages are equal to only 0.12 and 0.08.
Unit 34 comes in with a leverage equal to 0.77, much bigger than that for
observation 14 (the last to enter). This explains why the curves for these
two units in Figure 6.42 cross in step n-3 = 31. These comments show that
an analysis of leverage at the last step can be highly misleading if there is a
cluster of outliers. The results also agree with those for the Cook distances
in Figure 6.44, where the inclusion of observation 23 has the largest effect.
Figure 6.50. Vasoconstriction data: scatter plot showing regions of zero response
(o) and unit response (•). Observations 4 and 18 are surrounded by zero responses
and observation 24 is on the boundary of the two response classes. Observations
17 and 32 are well within the correct regions for their response
and then discuss features that make the analysis different from our earlier
analyses of binomial data.
The vasoconstriction data are in Table A.23. There are n = 39 readings
on the occurrence (20 times) or nonoccurrence of vasoconstriction and two
explanatory variables, the volume (x1) and the rate (x2) of air inspired,
that is, breathed in. The data, from Finney (1947), were used by Pregibon
(1981) to illustrate his diagnostics for logistic regression. Other analyses are
given, for example, by Aitkin et al. (1989, pp. 168-179), and by Fahrmeir
and Tutz (1994, p. 128).
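Pregibon's model is a logistic regression in the logs of the two explanatory variables. A sketch assuming a data frame vaso with columns y, volume and rate (the names are ours):

    fit <- glm(y ~ log(volume) + log(rate), family = binomial, data = vaso)
    # the two decreasing bands of Figure 6.51
    plot(fitted(fit), residuals(fit, type = "deviance"))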
Figure 6.50 is a scatterplot of the data against the explanatory variables
which also shows the two classes of response. Observations 4 and 18 are in
an area of otherwise zero responses. Pregibon found that these two observa-
tions were not well fitted by a linear predictor in the logs of the explanatory
variables. They do indeed stand out in the plot of the deviance residuals
against fitted values in Figure 6.51. However what is most striking about
the figure is its structure: it consists of two decreasing bands of points,
the upper being the residuals for observations with value one, the lower
for those equal to zero. This plot is more extreme than Figure 6.25 for the
Poisson distributed train data in which there were also bands for each value
of the discrete response. Both are quite distinct from plots of residuals for
normal data.
It is interesting to identify the observations with extreme residuals in
Figure 6.51 with their positions in the scatter plot of Figure 6.50. Obser-
vations 4 and 18 stand out clearly in both figures. In addition, observation
24 is the zero response with highest expected value and is on the edge
Figure 6.51. Vasoconstriction data: deviance residuals against fitted values, that
is, estimated probabilities π̂. The plot is quite unlike those from normal theory
regression. Inspection of Figure 6.50 shows that observations 4 and 18, with
response 1, are in a region of zero response
This requires modification both of the initial subset and of the progress of
the search.
We modify the initial subset so that it is constrained to include at least
one observation of each type. A perfect fit to one kind of observation is
thus avoided at the beginning of the search. To maintain a balance of
both kinds of response during the search we balance the ratio of zeroes
and ones in the subset so that it is as close as possible to the ratio in the
complete set of n observations. Given a subset of size m we fit the model and
then separately order the observations with zero response and those with
response equal to one. From these two lists of observations we then take the
m_0 smallest squared residuals from those with zero response and the m_1
smallest squared residuals from those with unit response, such that m_0 + m_1 = m + 1 and
the ratio m_0/m_1 is as close as possible to the ratio n_0/n_1 in the whole set of
observations, where n_0 + n_1 = n. In the vasoconstriction data the numbers
of zero and one responses are as equal as they can be for n = 39. After the
initial stages the forward search, therefore, alternately adds observations
with zero and unit responses. The quantities we monitor are those described
in §6.6.5 with one exception. Given that in binary data y = 0 or y = 1, we
found it useful to monitor separately the maximum residual for the units
whose response is equal to one and the maximum for those whose response
is equal to zero; both maxima were recorded at every step.
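A sketch of this balancing rule: given the squared residuals d2 from the current fit, the binary responses y and the current subset size m, it returns the indices of the m + 1 units to keep. All names are ours, and the sketch assumes enough units of each type are available.

    balanced.subset <- function(d2, y, m) {
      n0 <- sum(y == 0); n1 <- sum(y == 1)
      m1 <- max(1, round((m + 1) * n1 / (n0 + n1)))  # at least one of each type
      m0 <- (m + 1) - m1
      # smallest squared residuals within each response class
      i0 <- which(y == 0)[order(d2[y == 0])[seq_len(m0)]]
      i1 <- which(y == 1)[order(d2[y == 1])[seq_len(m1)]]
      c(i0, i1)
    }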
It is clear from the figure that the slope of the fitted model between the
two groups of observations can be arbitrarily large without significantly
affecting the fit of the model. The phenomenon is not restricted to binary
data, but becomes increasingly unlikely for binomial data as the numbers
of observations per group n_i increase. For such perfect fits McCullagh and
Nelder (1989, p. 117) comment that although the parameter estimates go to
infinity, the estimated probabilities are well behaved and calculation of the
deviance converges. They do not mention the t statistics for the parameter
estimates in the linear predictor, which are shown by Hauck and Donner
(1977) to approach zero as the fit improves, even though the parameter
estimates themselves go to infinity. Venables and Ripley (1997, p. 237)
describe this property of the t statistics as "little-known." It is however
related to the asymptotic properties of the Wald test for the difference
β̂_k − β_k.
In multiple regression we stated in §2.1.3 that the t test for β_k = 0 was
the signed square root of the F test from the difference in the residual sum
of squares when β_k is and is not included in the model. The standard large
sample result is the asymptotic equivalence of the likelihood ratio test for
a parameter and the square of the Wald test, which calculates the ratio
of β̂_k to its standard error. This is the t test that we have used in this
chapter for, for example, our goodness of link test. In some cases this t test
for individual parameters can be very sensitive to the parameterization of
the model, when the likelihood ratio and Wald tests can give very different
results.
Figure 6.54. Vasoconstriction data: plot of deviances showing perfect fit. The
residual deviances are zero for most of the forward search: (left) balanced search
and (right) unbalanced search
from zero as the perfect fit is destroyed. An unexpected result is the high
correlation of the statistics for different variables.
In the next section we consider the rate of convergence to zero of the
statistics as the parameter estimates go to infinity. The key is an analysis
of the limiting behaviour of the weights in the iterative fitting of generalized
linear models. In particular we analyse the arcsine link and show that the
t statistics for this link converge more slowly to zero than they do for the
logistic and complementary log log links. We argue that the new link should
also produce larger t statistics near the perfect fit, which statistics will
therefore be more in agreement with the deviance. They should also have
a reduced correlation. Analysis of our examples confirms and quantifies
the extent of improvement that can be obtained using the arcsine link as
opposed to the logistic or complementary log log. The arcsine link also
seems to give better agreement between t statistics and deviances in an
example in which the fit is far from perfect.
The behaviour of these weights, especially for large values of |β_k|, is central
to our analysis. In the vasoconstriction data the perfect fit was obtained as
a linear combination of the parameters went to infinity. In such cases we
assume that a linear transformation of the carriers x_ij has been made such
that the diverging linear combination corresponds to a single parameter
β_k. The t statistic for β_k can then be written as

t_k = Σ_{i=1}^n a_ki z_i / √{(X^T W X)^{-1}_{kk}},   (6.79)

where a_ki is the ith element of the kth row of the matrix (X^T W X)^{-1} X^T W
and (X^T W X)^{-1}_{kk} is element (k, k) of the matrix (X^T W X)^{-1}. In a situation
of perfect fit the elements a_ki remain bounded and the rate of convergence
to infinity of the denominator simply depends on the weights. Substitution
in equation (6.79) of the expression for z_i found in (6.45), followed by
consideration of only those terms which determine the rate of convergence
of the t statistic, shows that this rate is governed by the denominator

√[ Σ_{i=1}^n π_i(1 − π_i) (dη_i/dπ_i)² ].   (6.80)

Using the appropriate weights for each link (given in Table 6.9) we can
analyze the rate of convergence to zero of the t statistics for the different
links.
The results are reported in Table 6.10. Several points arise:
1. The rates of convergence for the t statistics for the logit and the
arcsine links do not depend on whether η → +∞ or η → −∞ (since
the links are symmetrical). However, for the complementary log log
link we have two different rates of convergence;
3. The t statistics from the arcsine link tend to zero at a slower rate than
those based on the other two links. This implies that, in a situation
close to a perfect fit, the t statistics associated with the arcsine link
are to be trusted more than those based on the logit or complementary
log log links. We present some numerical evidence that this is so.
Table 6.10. Rate of convergence of t statistics for different links (η → ∞)
Other aspects of our analysis of the weights also have practical impli-
cations. The rate of convergence 1/√n of the t statistics associated with
the arcsine link simply depends on the existence of the threshold and not
on the particular characteristics of this link. This follows since if π = 1 or
π = 0 when |η| > γ (for some threshold γ < ∞), dπ/dη = 0 for |η| > γ and
the denominator of equation (6.80) goes to ∞ with speed √n.
Table 6.11. Vasoconstriction data. Logistic link: t tests and deviances for models
fitted to the original data and with observations 4 and 18 changed from 1 to 0

                 Log(Volume)   Log(Rate)
                 t1            t2           Deviance Explained
Original Data    2.78          2.48         24.81
Modified Data    1.70          1.80         46.47
In this chapter among all possible links with a threshold we have cho-
sen the arcsine link due to the structure of the weights. As emphasized by
Pregibon (1981, page 712), the weights in fitting generalized linear models
(6.37) are not fixed, as in weighted least squares regression, but are deter-
mined by the fit. Table 6.9 shows that in a situation in which no unit is
fitted perfectly the weight given by the arcsine link to each unit is constant
and equal to one. As equation (6.27) clearly shows, the matrix of weights
affects all variables in the columns of X in the same way. The presence of
a few dominant weights will therefore tend to cause correlation among the
t statistics for the β_k. By using the arcsine link, with its constant value of
the weights away from a perfect fit, we expect to reduce this effect. There
is thus an advantage to be expected from this link even far from a perfect
fit. A final point about the weights for the arcsine link in Table 6.9 is that
they are similar to those which occur in robust estimation of location us-
ing redescending M estimators described, for example, by Hoaglin et al.
(1983).
Figure 6.55. Vasoconstriction data: the effect of perfect fit on the model with a
logistic link; (left) t statistics; (right) deviance residuals in the last four steps of
the forward search
deviance. This is more nearly so for the original data. A subsidiary point
is that, for each fit, the two t statistics have very similar values.
In these calculations, as in all others on binary data, we have standard-
ized the explanatory variables to have zero mean and unit variance. In
addition the constant term in the model is taken as 1/√n. These scalings
are not important for t values, other than that of the intercept, but are
important when we come to compare parameter estimates.
Figure 6.55(left) shows a plot of the values of the t statistics during the
forward search through the vasoconstriction data. As with other plots of t
statistics for binary data, we have added 95% confidence intervals. Because
we have balanced the search the last three observations to enter are 18,
24 and 4. The plot shows that without these observations the t values are
effectively zero: the data give a perfect fit, the parameter estimates can
be arbitrarily large and the probabilities are estimated as being close to
zero or one. The actual values of the parameter estimates depend upon the
convergence criterion used. We have used a value of 10^-8 for the change in
the deviance. The actual value of the deviance for the fitted model, which
theoretically is zero, increases from this value of 10^-8 to 10^-5 during this
forward search and the values of the t statistics increase slightly, as shown.
For a numerically perfect fit they would be zero.
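The dependence on the convergence criterion can be seen directly by refitting with a tighter tolerance. A sketch, again using our assumed data frame vaso, with the last three observations to enter removed so that the remaining data are perfectly fitted:

    fit <- glm(y ~ log(volume) + log(rate), family = binomial, data = vaso,
               subset = -c(4, 18, 24),
               control = glm.control(epsilon = 1e-8, maxit = 50))
    coef(fit)   # arbitrarily large; the values depend on epsilon, not on the data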
Figure 6.55(right) shows the deviance residuals for the last four steps of
the forward search. When m = 36 there is a perfect fit and all residuals of
observations in this subset are zero; only those outside (4, 18 and 24) are
nonzero, with large values. As soon as the perfect fit is destroyed, that is,
when m = 37, the deviance increases and there are many nonzero residuals.
In fact the major change in the pattern of residuals occurs at this point, the
increasing number of appreciably negative residuals explaining the increase
in deviance in Figure 6.54(left).
Figure 6.56 shows how the numerical values of the parameter estimates,
in the presence of perfect fit, depend on the value of the convergence
criterion.
Figure 6.56. Vasoconstriction data: estimated coefficients (intercept, log(volume)
and log(rate)) during the forward search
Table 6.12. Vasoconstriction data. Arcsine link: t tests and explained deviances
for models fitted to the original data and with observations 4 and 18 changed
from 1 to 0
There are only 26 nonzero responses, for those who had a "coronary
incident" in the last 10 years, and six explanatory variables. The data are
thus far from balanced. Christensen concludes that a model with the three
variables x1, x4 and x6 is adequate to describe the data. The values of
the three t statistics are 2.54, 1.82 and 2.12, with the model explaining a
deviance of 19.1. There is thus not the large discrepancy between t statistics
and deviance that was present in the vasoconstriction data.
Figure 6.57 is a scatter plot of the data showing the two classes of re-
sponse. The predominance of zero responses is evident. Even if we can see
a slight positive connection between the incidence of heart attack and the
three variables, the plot shows that the two responses are intermingled.
Figure 6.57. Chapman data: scatterplot matrix of the data (including x4 =
cholesterol and x6 = weight) showing zero (o) and unit (•) responses
π̂ = m1/m,

where m1 is the number of unit responses in the subset of size m. The resid-
ual deviance for this model is easily calculated from the general expression
"'iiic 8
Q)
~
C.
x
Q)
0
~
:>Q) 0
Cl
0 50 100 150 200
Subset size m
~
- -. .
(\J
b]
.2 -------------------------~{i.
'ii) C}I
~
-
-XO
------ X1
~
---- X4
--- X6
0
~
,
0 50 100 150 200
Subset size m
Figure 6.58. Chapman data: forward search with the logit link, showing the effect
of perfect fit up to m = 139; (top) explained deviance; (bottom) t statistics
for the binomial deviance (6.74). Since all n_i = 1 and all y_i are zero or one,
we find that

D(β̂) = 2(m log m − m1 log m1 − m0 log m0).   (6.81)
The deviance therefore just counts the number of zero and unit responses
in the subset. But, with a balanced search, these are determined by the
requirement of balance, independently of any other function of the data.
The resulting pattern in Figure 6.58(top) slopes up gently for the addition
of units with zero response, with a series of larger steps upward when
unit responses are included. Once the perfect fit has been broken, the plot
begins to decline as the nonzero residual deviance for the fitted model is
subtracted. The steady downward slope means that there are no highly
influential observations or outliers causing large increases in this residual
deviance. The small jumps in the explained deviance are again caused by
the inclusion of observations with a unit response. The difference between
t tests and deviance is greatest when the data are near a perfect fit. The
values of t1 and of t6 are highly correlated, being virtually indistinguishable
over the range of m for which the fit is not perfect.
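Equation (6.81) is a function of the counts alone, so the residual deviance of this constant model is a one-line computation; a minimal sketch (assuming m0 and m1 are both positive):

    # residual deviance (6.81) of the model fitting only the mean of binary data
    dev.counts <- function(m0, m1) {
      m <- m0 + m1
      2 * (m * log(m) - m1 * log(m1) - m0 * log(m0))
    }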
Straightforward use of the arcsine link with Chapman's data leads to a
model with one less variable than the three used above and by Christensen
with the logistic link. For all six potential variables, three of the six t
statistics when fitting the arcsine link have absolute values less than one.
Backwards elimination using these values one variable at a time, in a similar
way to that summarized in Table 3.1, leads to the three-variable linear
predictor used above but with t values t1 = 2.86, t4 = 1.35 and t6 = 1.84,
which are more dispersed than the values for the logistic link given above
(2.54, 1.82 and 2.12).
Figure 6.59. Chapman data: forward plot of deviance residuals in the last stages
of the search using an arcsine link and a two-variable model
Figure 6.60. Chapman data without observation 86: forward plot of deviance
residuals in the last stages of the search using an arcsine link and a two-variable
model
Figure 6.61. Chapman data: t statistics using the arcsine link and the
three-variable model. Compare with Figure 6.58
Table 6.13. Chapman data. Values of t statistics and deviances explained from
simulations that move towards a perfect fit as ψ increases
to the comparison of two links. As our theory predicts, the t values from
the arcsine link decrease less rapidly with increasing ψ than those from
the logistic link. They are also larger, and so more nearly correspond to
the values of the deviances. In making this comparison it is the sum of
squares of the t statistics that needs to be compared against the explained
deviances.
6.22 Exercises
Exercise 6.1 Show that the normal, gamma, Poisson, inverse Gaussian
and binomial distributions can be written as in equation (6.9) and find for
each distribution b(θ), c(θ), φ and d(y, φ) (§6.2).
The density function of the inverse Gaussian distribution is:

f(y; μ, σ²) = (2πσ²y³)^{-1/2} exp{ −(y − μ)² / (2μ²σ²y) },   y > 0.

Exercise 6.2 Show that, under the usual regularity conditions,

E(dl/dθ) = 0   and   E(d²l/dθ²) = −E{(dl/dθ)²}   (§6.2).
Exercise 6.3 Starting from equation (6.17), show that var Y can be written
as in equation (6.18) (§6.3).
Exercise 6.4 Given data on n proportions, y_i = R_i/n_i, i = 1, ..., n, sup-
pose that the response probability for the ith observation (π_i) is a random
variable with mean θ_i and variance φθ_i(1 − θ_i), where φ is an unknown scale
parameter.
(a) What is the effect on E(R_i) and var(R_i) of the assumption of random
variability in the response probabilities?
(b) Is it possible with binary data to estimate the overdispersion
parameter φ (§6.3)?
Exercise 6.5 Show that the deviance for normal data is equal to the
residual sum of squares (§6.5).
Exercise 6.6 For Poisson observations show the asymptotic equivalence
between the deviance and Pearson's chi-squared statistic (eq. (6.51); §6.10).
Exercise 6.7 The deviance cannot be used as a summary measure of the
goodness of fit of a model for binary data. Prove this statement for the
logistic model (§6.13).
Exercise 6.8 Suppose you have fitted a linear logistic model with two co-
variates Xl and X2. Holding Xl fixed, what is the effect of a unit change
in X2 on the following scales: (a) log odds, (b) odds and (c) the probability
scale (§6.13)?
Exercise 6.9 In dose response models it may be important to know the
estimated dose that produces a specified value of the response probability.
The dose which is expected to produce a response in 50% of the exposed
subjects is usually called the median effective dose or ED50. When the
three models below were fitted, the parameter estimates were:

           β0        β1
logit    -60.77     34.30
probit   -34.97     19.75
cloglog  -39.60     22.05

Given that the explanatory variable was log10(dose), find the LD50 and
LD90 for the above three models (§6.13).
Exercise 6.10 Suppose that R ∼ B(m, π) and that m is large. Show that
the random variable

Z = arcsin(√(R/m))

has approximate mean arcsin(√π) and variance 1/(4m).
6.23 Solutions
Exercise 6.1
(a) Normal distribution

f(y; μ, σ²) = (1/√(2πσ²)) exp{ −(1/2)((y − μ)/σ)² }

log f(y; μ, σ²) = (−y²/2 + yμ − μ²/2)/σ² − (1/2) log 2πσ²
              = (yμ − μ²/2)/σ² − (1/2) log 2πσ² − y²/(2σ²),

so that b(θ) = μ, c(θ) = −μ²/2, φ = σ² and d(φ, y) = −y²/(2σ²) − (1/2) log 2πσ².
(b) Gamma distribution: proceeding in the same way gives

b(θ) = −1/μ,   c(θ) = −log μ.

(d) Inverse Gaussian distribution: here

b(θ) = −1/(2μ²),   c(θ) = 1/μ,   d(φ, y) = −1/(2yσ²) − (1/2) log 2πσ²y³.
(e) Binomial distribution

f(y; μ) = {n!/(y!(n − y)!)} μ^y (1 − μ)^{n−y}

log f(y; μ) = y log{μ/(1 − μ)} + n log(1 − μ) + log{n!/(y!(n − y)!)},
so that

b(θ) = log{μ/(1 − μ)},   c(θ) = n log(1 − μ),   φ = 1,   d(φ, y) = log{n!/(y!(n − y)!)}.
Exercise 6.2
The standard conditions assume the exchangeability of integration and
differentiation. Then, for the first identity,

E(dl/dθ) = ∫ (dl/dθ) f dy = ∫ (df/dθ) dy = (d/dθ) ∫ f dy = 0.

Differentiating this identity once more under the integral sign gives the
second:

E(d²l/dθ²) = −E{(dl/dθ)²}.
Exercise 6.3
From equation (6.13)
var Y =
Exercise 6.4
(a) Given a particular value of π_i, the observed number of successes for the
ith unit (R_i) has a binomial distribution with mean and variance:

E(R_i|π_i) = n_i π_i   and   var(R_i|π_i) = n_i π_i(1 − π_i).

Now using standard results from conditional probability theory:

E(R_i) = E{E(R_i|π_i)} = E(n_i π_i) = n_i E(π_i) = n_i θ_i
var(R_i) = E{var(R_i|π_i)} + var{E(R_i|π_i)}.

We obtain:

E{var(R_i|π_i)} = E{n_i π_i(1 − π_i)}
               = n_i {E π_i − var π_i − (E π_i)²}
               = n_i {θ_i − φθ_i(1 − θ_i) − θ_i²}
               = n_i θ_i(1 − θ_i)(1 − φ)

and

var{E(R_i|π_i)} = var(n_i π_i) = n_i² φ θ_i(1 − θ_i).

We therefore find that:

var(R_i) = n_i θ_i(1 − θ_i){1 + (n_i − 1)φ}.   (6.84)

We conclude that if there is variation in the response probabilities (that is,
if φ > 0) the variance of R_i exceeds the variance under binomial sampling
by a factor of {1 + (n_i − 1)φ}. In other words, variation among the response
probabilities leads to overdispersion; that is, to a variance of the observed
number of successes which is greater than it would have been if the response
probabilities did not vary at random.
(b) With (ungrouped) binary data n_i = 1 for all values of i. Thus the
expression for the variance in equation (6.84) reduces to θ_i(1 − θ_i), which is
exactly the variance of a binary response variable. This implies that binary
data provide no information about the overdispersion parameter φ.
Exercise 6.5
For normal data the loglikelihood of a single observation can be written

l(μ; y) = −(y − μ)²/(2σ²) − (1/2) log 2πσ².

For the saturated model μ̂_i = y_i, so that twice the difference in
loglikelihoods, multiplied by φ = σ², gives

D(β̂) = 2σ² Σ_{i=1}^n {l(y_i; y_i) − l(μ̂_i; y_i)} = Σ_{i=1}^n (y_i − μ̂_i)².

We conclude that with normal data the deviance is identical to the residual
sum of squares.
Exercise 6.6
Let (y − μ)/μ = ε so that

y − μ = με,   y = μ(1 + ε)   and   y/μ = 1 + ε.

Now consider the squared deviance residual for unit i as a function of ε (for
simplicity we drop the subscript i):

d² = 2{y log(y/μ) − (y − μ)} = 2{μ(1 + ε) log(1 + ε) − με}.

Expanding log(1 + ε) = ε − ε²/2 + ... yields

d² ≈ 2μ{(1 + ε)(ε − ε²/2) − ε} ≈ με² = (y − μ)²/μ.

Thus we can state that asymptotically

D(β̂) = 2 Σ_{i=1}^n {y_i log(y_i/μ̂_i) − (y_i − μ̂_i)} ≈ X² = Σ_{i=1}^n (y_i − μ̂_i)²/μ̂_i.
Exercise 6.7
The expression for the deviance of binary data can be obtained by putting
n_i = 1 in equation (6.74); that is,

D(β̂) = 2 Σ_{i=1}^n [ y_i log(y_i/θ̂_i) + (1 − y_i) log{(1 − y_i)/(1 − θ̂_i)} ].

Remembering that y_i = 0 or y_i = 1 we have y_i log y_i = (1 − y_i) log(1 − y_i) =
0. D(β̂) can be rewritten as

D(β̂) = −2 Σ_{i=1}^n { y_i log θ̂_i + (1 − y_i) log(1 − θ̂_i) }
      = −2 Σ_{i=1}^n [ y_i log{θ̂_i/(1 − θ̂_i)} + log(1 − θ̂_i) ].

In matrix notation we can write:

D(β̂) = −2 { y^T η̂ + Σ_{i=1}^n log(1 − θ̂_i) },

where y = (y_1, ..., y_n)^T and η̂ = Xβ̂ = (η̂_1, ..., η̂_n)^T. Now, in the case of
linear logistic models, w_i = θ_i(1 − θ_i), φ = 1 and ∂η_i/∂θ_i = 1/{θ_i(1 − θ_i)}
so that equation (6.38) becomes

∂L/∂β_j = Σ_{i=1}^n θ_i(1 − θ_i)(y_i − θ_i) [1/{θ_i(1 − θ_i)}] x_ij
        = Σ_{i=1}^n (y_i − θ_i) x_ij,   j = 1, ..., p.

Setting these derivatives to zero at the maximum likelihood estimate gives
X^T y = X^T θ̂, so that y^T η̂ = y^T Xβ̂ = θ̂^T Xβ̂ = θ̂^T η̂ and

D(β̂) = −2 { θ̂^T η̂ + Σ_{i=1}^n log(1 − θ̂_i) }.

This expression shows that the deviance depends on the binary observations
y_i only through the fitted probabilities θ̂_i and so it cannot tell us anything
about the agreement between the observations and their fitted probabilities.
In other words: given β̂, D(β̂) has a conditionally degenerate distribution
and cannot be used to evaluate the goodness of fit of a model. The result
is also true for the other links.
Exercise 6.8
Given that:

log{θ/(1 − θ)} = β0 + β1 x1 + β2 x2,   (6.86)

holding x1 fixed, the effect of a unit change in x2 is to increase the log odds
by an amount β2.
In terms of the odds, equation (6.86) can be rewritten as

θ/(1 − θ) = exp(β0 + β1 x1 + β2 x2),   (6.87)

so that a unit change in x2 multiplies the odds by exp(β2).
On the probability scale, ∂θ/∂x2 = β2 θ(1 − θ). Thus, given that the
maximum of θ(1 − θ) is obtained when θ = 0.5, we
can state that a small change in x2 measured on the probability scale has
a larger effect on θ if θ is near 0.5 than if θ is near 0 or 1.
Exercise 6.9
We start with the logit model. Given that log{0.5/(1 − 0.5)} = 0, the dose
for which π = 0.5 (ED50) satisfies the equation

β̂0 + β̂1 ED50 = 0,

so that ED50 = −β̂0/β̂1. If log_e(dose) rather than dose is used as the
explanatory variable, ED50 is estimated by exp(−β̂0/β̂1).
Similarly, the ED90 must be estimated from the equation

log{0.9/(1 − 0.9)} = β̂0 + β̂1 ED90,
so that

ED90 = (2.1972 − β̂0)/β̂1.
Estimates of the ED50 and ED90 can be obtained similarly under a probit
or cloglog model. When log_e(dose) is used as an explanatory variable, for
the probit model we obtain:

ED50 = exp(−β̂0/β̂1),   ED90 = exp{(1.2816 − β̂0)/β̂1},

and for the cloglog model:

ED50 = exp{(−0.3665 − β̂0)/β̂1},   ED90 = exp{(0.8340 − β̂0)/β̂1}.
If, as in the exercise, logarithms of dose are taken to base 10, exp(·) in the
above expressions needs to be replaced by 10^(·). Making this adjustment
and using the estimated values of the parameters reported in the text of
the exercise, we obtain the following table.
           ED50     ED90
logit     59.12    68.51
probit    58.97    68.47
cloglog   60.16    68.19
The models agree more closely in the upper tail than they do in the centre
of the distribution.
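The table is reproduced by the following sketch, which applies the formulas above with 10^(·) in place of exp(·):

    ed <- function(b0, b1, q) 10^((q - b0) / b1)
    b  <- rbind(logit   = c(-60.77, 34.30),
                probit  = c(-34.97, 19.75),
                cloglog = c(-39.60, 22.05))
    q50 <- c(0, 0, log(log(2)))                   # logit, probit, cloglog
    q90 <- c(log(9), qnorm(0.9), log(-log(0.1)))
    cbind(ED50 = ed(b[, 1], b[, 2], q50), ED90 = ed(b[, 1], b[, 2], q90))
    # reproduces the ED50/ED90 values in the table above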
Exercise 6.10
Expanding arcsin(√(R/m)) in a Taylor series around π up to second order:

Z = arcsin(√(R/m))
  ≈ arcsin(√π) + [1/{2√(π(1 − π))}] (R/m − π)
    − [(1 − 2π)/{8(π(1 − π))^{3/2}}] (R/m − π)².

Taking the expectation and the variance of both sides the result
immediately follows.
Figure 6.62. Vasoconstriction data: scatter plot showing zero (o) and unit (•)
responses. Without units 4, 18, 29, 31 and 39 the two regions are completely
separated
Exercise 6.11
As Figure 6.62 shows, without units 4, 18, 29, 31 and 39 there is a line in
the space that completely separates the two groups of observations. This
ceases when unit 39 (the first of the five to enter) is included.
Appendix A
Data
Table A.I. Forbes' data on air pressure in the Alps and the boiling point of water
Table A.3. Wool data: number of cycles to failure of samples of worsted yarn in
a 3³ experiment
Table A.4. Hawkins' data

Observation   x1   x2   x3   x4   x5   x6   x7   x8   y
Number
1 -15 -10 -14 -8 2 -4 -10 59 8.88
2 9 0 8 -8 18 8 -18 74 12.18
3 -3 4 10 0 16 -14 6 49 5.75
4 -19 6 12 -16 8 -6 4 95 11.75
5 -3 0 6 4 -8 22 -16 57 10.52
6 11 -32 -38 10 -16 -2 10 97 10.57
7 11 2 0 18 -18 12 4 27 1.70
8 -11 32 38 -10 16 2 -10 62 5.31
9 -3 -2 -16 -12 -6 -8 -10 56 8.51
10 9 14 30 12 6 12 0 60 1.21
11 -3 -6 -2 4 -8 -6 -6 43 3.36
12 -9 12 12 -12 26 -8 -8 53 8.26
13 5 -24 -36 -2 -6 4 -4 72 10.14
14 -11 16 8 -14 8 -10 10 67 -0.58
15 -3 8 -4 -16 18 -16 2 24 7.10
16 7 0 18 6 -2 8 4 61 -0.63
17 9 18 16 -4 8 10 -4 68 5.87
18 11 -6 4 10 16 2 2 7 -0.25
19 -1 12 -4 -6 2 -4 -14 10 -9.45
20 -7 16 12 -2 10 4 -24 58 8.93
21 1 -12 4 6 -2 4 14 76 18.88
22 -3 -20 -10 16 -18 12 8 69 4.01
23 -11 -14 -20 2 -26 -12 22 78 8.48
24 13 -2 4 20 0 14 -14 6 -0.16
25 -21 12 10 0 0 6 -6 43 7.26
26 -1 6 8 -8 -10 -16 18 49 1.69
27 1 8 20 -6 8 14 -10 2 -4.46
28 -1 8 10 10 0 -2 -10 49 3.36
29 5 -10 -14 18 -18 8 14 67 7.53
30 7 4 4 -10 0 6 0 68 3.91
31 3 16 24 0 16 -10 -4 77 6.73
32 15 10 14 8 -2 4 10 1 -2.91
33 5 -28 -22 14 -8 6 0 97 8.80
34 -5 -10 -2 -6 8 -18 10 1 1.80
35 -13 -2 10 -4 -2 -12 18 7 -2.40
36 7 -16 -12 2 -10 -4 24 94 6.25
37 -7 0 -18 -6 2 -8 -4 89 15.60
38 -1 -20 -20 2 2 12 -14 28 1.06
39 -3 12 6 8 -18 -4 8 92 9.99
40 -9 0 -8 8 -18 -8 18 94 2.10
41 -3 -16 -24 0 -16 10 4 7 1.63
42 -9 -14 -30 -12 -6 -12 0 11 5.84
43 7 -14 -10 20 0 10 -4 1 -2.30
44 7 6 6 8 10 20 -28 1 1.42
45 -5 -6 -16 -22 10 -20 6 93 2.67
46 7 -10 -24 4 2 8 -8 38 -6.93
47 -3 10 18 0 16 14 -4 16 0.75
48 -15 8 -6 -4 -8 -2 4 96 14.31
49 -5 8 6 -2 -2 -16 24 23 2.93
50 3 -8 4 16 -18 16 -2 68 2.06
51 3 2 16 12 6 8 10 89 5.97
52 -11 -2 0 -18 18 -12 -4 88 9.78
53 11 -2 -10 -6 18 0 -2 73 10.20
54 -15 4 8 12 -10 0 8 80 8.90
55 -5 10 14 -18 18 -8 -14 84 7.55
56 3 -4 -10 0 -16 14 -6 80 7.11
57 5 4 -6 6 -8 -10 0 98 12.60
58 -9 -18 -16 4 -8 -10 4 19 2.80
59 5 10 2 6 -8 18 -10 79 5.88
60 -11 12 22 2 6 -8 14 21 3.38
61 -9 2 0 -8 2 0 -20 94 7.10
62 -3 24 26 -12 26 -4 -18 69 4.43
63 11 -16 -8 14 -8 10 -10 31 9.47
64 17 -20 -24 10 -16 2 0 59 4.92
65 -1 14 18 10 0 26 -20 31 2.44
66 -15 24 24 0 0 10 -16 29 2.03
67 13 2 -10 4 2 12 -18 73 10.35
68 3 -10 -18 0 -16 -14 4 48 5.65
69 -17 -6 -18 -10 -16 -6 8 81 2.02
70 9 -16 -22 -12 10 -4 2 25 3.45
71 1 -6 -8 8 10 16 -18 58 8.94
72 3 22 32 0 16 18 -14 25 9.69
73 13 -4 2 2 -10 0 14 24 13.81
74 -7 2 4 10 0 22 -10 44 2.66
75 3 0 -6 -4 8 -22 16 83 2.55
76 9 -2 0 8 -2 0 20 49 5.61
77 17 10 4 -6 18 4 -12 33 3.21
78 13 -18 -26 16 -8 2 6 6 3.41
79 15 -24 -24 0 0 -10 16 22 3.95
80 1 6 -2 -22 10 -16 -4 14 2.28
81 -7 -4 -4 10 0 -6 0 78 10.65
82 -9 20 8 -4 -8 2 -6 28 5.70
83 -17 10 12 -6 -8 6 -12 82 7.35
84 -9 -12 -8 4 -8 18 -6 75 6.69
85 21 -12 -10 0 0 -6 6 90 6.01
86 9 -12 -12 12 -26 8 8 40 1.01
87 -13 2 -4 -20 0 -14 14 94 10.14
88 1 2 12 -6 8 -14 0 6 -2.33
89 3 20 10 -16 18 -12 -8 12 4.05
90 23 2 2 6 8 -2 2 1 -0.90
91 -1 -8 -20 6 -8 -14 10 61 10.72
92 -3 -22 -32 0 -16 -18 14 30 -2.72
93 11 14 20 -2 26 12 -22 2 -0.52
94 -7 10 24 -4 -2 -8 8 53 16.00
95 -13 18 26 -16 8 -2 -6 23 -0.55
96 -1 -6 2 22 -10 16 4 57 4.77
97 -5 28 22 -14 8 -6 0 14 2.27
98 -9 16 22 12 -10 4 -2 91 8.13
99 5 2 6 -2 26 8 -12 95 7.36
100 19 -6 -12 16 -8 6 -4 67 4.71
101 7 -2 -4 -10 0 -22 10 9 2.93
102 -1 -2 -12 6 -8 14 0 5 3.42
103 -23 -2 -2 -6 -8 2 -2 58 6.78
104 15 -8 6 4 8 2 -4 97 4.97
105 -7 -6 -6 -8 -10 -20 28 18 0.47
106 17 6 18 10 16 6 -8 8 7.64
107 1 -8 -10 -10 0 2 10 23 4.90
108 11 -12 -22 -2 -6 8 -14 87 6.91
109 1 -14 -18 -10 0 -26 20 58 6.46
110 5 -8 -6 2 2 16 -24 76 6.94
111 -13 4 -2 -2 10 0 -14 9 -8.69
112 -17 -10 -4 6 -18 -4 12 89 11.03
113 -5 -4 6 -6 8 10 0 70 4.18
114 -11 2 10 6 -18 0 2 81 5.16
115 -7 14 10 -20 0 -10 4 82 8.70
116 -5 24 36 2 6 -4 4 98 6.83
117 9 12 8 -4 8 -18 6 25 3.27
118 17 -10 -12 6 8 -6 12 9 1.71
119 3 -12 -6 -8 18 4 -8 86 7.78
120 15 -4 -8 -12 10 0 -8 11 0.20
121 -17 20 24 -10 16 -2 0 59 6.86
122 1 20 20 -2 -2 -12 14 91 12.06
123 3 6 2 -4 8 6 6 62 7.10
124 -5 -2 -6 2 -26 -8 12 91 11.21
125 9 -20 -8 4 8 -2 6 87 5.79
126 5 6 16 22 -10 20 -6 92 15.30
127 -11 6 -4 -10 -16 -2 -2 64 7.33
128 3 -24 -26 12 -26 4 18 53 7.76
Table A.5. Brownlee's stack loss data on the oxidation of ammonia. The response
is ten times the percentage of ammonia escaping up a stack, or chimney
Observation   Air    Cooling Water Inlet   Acid            Stack
Number        Flow   Temperature           Concentration   Loss
              x1     x2                    x3               y
1 80 27 89 42
2 80 27 88 37
3 75 25 90 37
4 62 24 87 28
5 62 22 87 18
6 62 23 87 18
7 62 24 93 19
8 62 24 93 20
9 58 23 87 15
10 58 18 80 14
11 58 18 89 14
12 58 17 88 13
13 58 18 82 11
14 58 19 93 12
15 50 18 89 8
16 50 18 86 7
17 50 19 72 8
18 50 19 79 8
19 50 20 80 9
20 56 20 82 15
21 70 20 91 15
Table A.6. Salinity data

Observation   Lagged     Trend   Water   Salinity
Number        Salinity           Flow
              x1         x2      x3      y
1 8.2 4 23.005 7.6
2 7.6 5 23.873 7.7
3 4.6 0 26.417 4.3
4 4.3 1 24.868 5.9
5 5.9 2 29.895 5.0
6 5.0 3 24.200 6.5
7 6.5 4 23.215 8.3
8 8.3 5 21.862 8.2
9 10.1 0 22.274 13.2
10 13.2 1 23.830 12.6
11 12.6 2 25.144 10.4
12 10.4 3 22.430 10.8
13 10.8 4 21.785 13.1
14 13.1 5 22.380 12.3
15 13.3 0 23.927 10.4
16 10.4 1 33.443 10.5
17 10.5 2 24.859 7.7
18 7.7 3 22.686 9.5
19 10.0 0 21.789 12.0
20 12.0 1 22.041 12.6
21 12.1 4 21.033 13.6
22 13.6 5 21.005 14.1
23 15.0 0 25.865 13.5
24 13.5 1 26.290 11.5
25 11.5 2 22.932 12.0
26 12.0 3 21.313 13.0
27 13.0 4 20.769 14.1
28 14.1 5 21.393 15.1
Table A.7. Ozone data

Observation                                                     Ozone
Number        x1   x2   x3   x4   x5   x6   x7   x8   Concentration (ppm)
1 40 2693 -25 250 5710 28 47.66 4 3
2 45 590 -24 100 5700 37 55.04 3 5
3 54 1450 25 60 5760 51 57.02 3 5
4 35 1568 15 60 5720 69 53.78 4 6
5 45 2631 -33 100 5790 19 54.14 6 4
6 55 554 -28 250 5790 25 64.76 3 4
7 41 2083 23 120 5700 73 52.52 3 6
8 44 2654 -2 120 5700 59 48.38 3 7
9 54 5000 -19 120 5770 27 48.56 8 4
10 51 111 9 150 5720 44 63.14 3 6
11 51 492 -44 40 5760 33 64.58 6 5
12 54 5000 -44 200 5780 19 56.30 6 4
13 58 1249 -53 250 5830 19 75.74 3 4
14 61 5000 -67 200 5870 19 65.48 2 7
15 64 5000 -40 200 5840 19 63.32 5 5
16 67 639 1 150 5780 59 66.02 4 9
17 52 393 -68 10 5680 73 69.80 5 4
18 54 5000 -66 140 5720 19 54.68 4 3
19 54 5000 -58 250 5760 19 51.98 3 4
20 58 5000 -26 200 5730 26 51.98 4 4
21 69 3044 18 150 5700 59 52.88 5 5
22 51 3641 23 140 5650 70 47.66 5 6
23 53 111 -10 50 5680 64 59.54 3 9
24 59 597 -52 70 5820 19 70.52 5 6
25 64 1791 -15 150 5810 19 64.76 5 6
26 63 793 -15 120 5790 28 65.84 3 11
27 63 531 -38 40 5800 32 75.92 2 10
28 62 419 -29 120 5820 19 75.74 5 7
29 63 816 -7 6 5770 76 66.20 8 12
30 54 3651 62 30 5670 69 49.10 3 9
31 36 5000 70 100 5590 76 37.94 3 2
32 31 5000 28 200 5410 64 32.36 6 3
33 30 1341 18 60 5350 62 45.86 7 3
34 36 5000 0 350 5480 72 38.66 9 2
35 42 3799 -18 250 5600 76 45.86 7 3
36 37 5000 32 350 5490 72 38.12 11 3
37 41 5000 -1 300 5560 72 37.58 10 4
38 46 5000 -30 300 5700 32 45.86 3 6
39 51 5000 -8 300 5680 50 45.50 5 8
40 55 2398 21 200 5700 86 53.78 4 6
41 41 5000 51 100 5650 61 36.32 5 4
42 41 4281 42 250 5610 62 41.36 5 3
43 49 1161 27 200 5730 66 52.88 5 7
44 45 2778 2 200 5770 68 55.76 5 11
45 55 442 26 40 5770 82 58.28 3 13
46 41 5000 -30 300 5690 21 42.26 8 6
47 45 5000 -53 300 5700 19 43.88 3 5
48 51 5000 -43 300 5730 19 49.10 11 4
49 53 5000 7 300 5690 19 49.10 7 4
50 50 5000 24 300 5640 68 42.08 5 6
51 60 1341 19 150 5720 63 59.18 6 10
52 54 1318 2 150 5740 54 64.58 3 15
53 53 885 -4 80 5740 47 67.10 3 23
54 53 360 3 40 5740 56 67.10 3 17
55 44 3497 73 40 5670 61 49.46 7 7
56 40 5000 73 80 5550 74 40.10 10 2
57 30 5000 44 300 5470 46 29.30 7 3
58 25 5000 39 200 5320 45 27.50 11 3
59 40 5000 -12 140 5530 43 33.62 3 4
60 45 5000 -2 140 5600 21 39.02 3 6
61 51 5000 30 140 5660 57 42.08 7 7
62 48 3608 24 100 5580 42 39.38 5 7
63 45 5000 38 140 5510 50 32.90 5 6
64 47 5000 56 200 5530 61 35.60 5 3
65 43 5000 66 120 5620 61 34.34 9 2
66 49 613 -27 300 5690 60 59.72 0 8
67 56 334 -9 300 5760 31 64.40 4 12
68 53 567 13 150 5740 66 61.88 3 12
69 61 488 -20 2 5780 53 64.94 5 16
70 63 531 -15 50 5790 42 71.06 2 9
71 70 508 7 70 5760 60 66.56 3 24
72 57 1571 68 17 5700 82 56.30 4 13
73 35 721 28 140 5680 57 55.40 4 8
74 52 505 -49 140 5720 21 67.28 5 10
75 59 377 -27 300 5720 19 73.22 5 8
76 67 442 -9 200 5730 32 75.74 4 9
77 57 902 54 250 5710 77 60.44 5 10
78 42 1381 4 60 5720 71 56.30 4 14
79 55 5000 -16 100 5710 19 50.00 3 9
80 40 5000 38 150 5600 45 46.94 6 11
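Because the book uses these data to illustrate transformation of the response, a minimal R sketch of the Box-Cox profile loglikelihood follows; the data frame ozone, with columns x1, ..., x8 and y, is an assumption.

    library(MASS)                   # provides boxcox()
    fit <- lm(y ~ ., data = ozone)  # first-order model in all eight variables
    boxcox(fit)                     # profile loglikelihood for the power lambda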
Table A.8. Box and Cox poison data. Survival times in 10-hour units of animals
in a 3 x 4 factorial experiment. Each row gives the observation numbers and the
corresponding responses

Treatment  Poison  Observations   Survival Times
A          I        1  2  3  4    0.31 0.45 0.46 0.43
A          II       5  6  7  8    0.36 0.29 0.40 0.23
A          III      9 10 11 12    0.22 0.21 0.18 0.23
B          I       13 14 15 16    0.82 1.10 0.88 0.72
B          II      17 18 19 20    0.92 0.61 0.49 1.24
B          III     21 22 23 24    0.30 0.37 0.38 0.29
C          I       25 26 27 28    0.43 0.45 0.63 0.76
C          II      29 30 31 32    0.44 0.35 0.31 0.40
C          III     33 34 35 36    0.23 0.25 0.24 0.22
D          I       37 38 39 40    0.45 0.71 0.66 0.62
D          II      41 42 43 44    0.56 1.02 0.71 0.38
D          III     45 46 47 48    0.30 0.36 0.31 0.33
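A minimal R sketch for these data, assuming a data frame poisons with columns time, poison and treat. Box and Cox chose the reciprocal transformation (lambda = -1), so that the response becomes a rate of dying.

    fit <- lm(time ~ poison + treat, data = poisons)  # two-way layout
    library(MASS)
    boxcox(fit)                                       # lambda close to -1
    fit1 <- lm(I(1/time) ~ poison + treat, data = poisons)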
Table A.9. Mussels data from Cook and Weisberg. The response is the mass of
the edible portion of the mussel
Table A.10. Short leaf pine. The response is the volume of the tree, X1 the girth
and X2 the height
Number X1 X2 Y
1 4.6 33 2.2
2 4.4 38 2.0
3 5.0 40 3.0
4 5.1 49 4.3
5 5.1 37 3.0
6 5.2 41 2.9
7 5.2 41 3.5
8 5.5 39 3.4
9 5.5 50 5.0
10 5.6 69 7.2
11 5.9 58 6.4
12 5.9 50 5.6
13 7.5 45 7.7
14 7.6 51 10.3
15 7.6 49 8.0
16 7.8 59 12.1
17 8.0 56 11.1
18 8.1 86 16.8
19 8.4 59 13.6
20 8.6 78 16.6
21 8.9 93 20.2
22 9.1 65 17.0
23 9.2 67 17.7
24 9.3 76 19.4
25 9.3 64 17.1
26 9.8 71 23.9
27 9.9 72 22.0
28 9.9 79 23.1
29 9.9 69 22.6
30 10.1 71 22.0
31 10.2 80 27.0
32 10.2 82 27.0
33 10.3 81 27.4
34 10.4 75 25.2
35 10.6 75 25.5
36 11.0 71 25.8
37 11.1 81 32.8
38 11.2 91 35.4
39 11.5 66 26.0
40 11.7 65 29.0
41 12.0 72 30.2
42 12.2 66 28.2
43 12.2 72 32.4
44 12.5 90 41.3
45 12.9 88 45.2
46 13.0 63 31.5
47 13.1 69 37.8
48 13.1 65 31.6
49 13.4 73 43.1
50 13.8 69 36.5
51 13.8 77 43.3
52 14.3 64 41.3
53 14.3 77 58.9
54 14.6 91 65.6
55 14.8 90 59.3
56 14.9 68 41.4
57 15.1 96 61.5
58 15.2 91 66.7
59 15.2 97 68.2
60 15.3 95 73.2
61 15.4 89 65.9
62 15.7 73 55.5
63 15.9 99 73.6
64 16.0 90 65.9
65 16.8 90 71.4
66 17.8 91 80.2
67 18.3 96 93.8
68 18.3 100 97.9
69 19.4 94 107.0
70 23.4 104 163.5
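A minimal R sketch, assuming a data frame pine with the columns above. Treating the trunk as roughly conical suggests that volume varies as girth squared times height, that is, a linear model in the logarithms with slopes near 2 and 1.

    fit <- lm(log(y) ~ log(x1) + log(x2), data = pine)
    summary(fit)  # slopes near 2 (girth) and 1 (height) support the cone model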
1 (0) 4403
2 (0) 5042
3 -11 5259
4 -11 5598
5 -10 4868
6 -10 4796
7 -9 3931
8 -9 4503
9 -8 2588
10 -8 3089
11 -7 2084
12 -7 3665
13 -6 2149
14 -6 2216
15 -5 1433
16 -5 1926
(0) indicates NIF concentration = 0.
Table A.12. Enzyme kinetics data. The response is the initial velocity of the
reaction
Number NIN TW TN
X1 X2 Y
1 5.548 0.137 2.590
2 4.896 2.499 3.770
3 1.964 0.419 1.270
4 3.586 1.699 1.445
5 3.824 0.605 3.290
6 3.111 0.677 0.930
7 3.607 0.159 1.600
8 3.557 1.699 1.250
9 2.989 0.340 3.450
10 18.053 2.899 1.096
11 3.773 0.082 1.745
12 1.253 0.425 1.060
13 2.094 0.444 0.890
14 2.726 0.225 2.755
15 1.758 0.241 1.515
16 5.011 0.099 4.770
17 2.455 0.644 2.220
18 0.913 0.266 0.590
19 0.890 0.351 0.530
20 2.468 0.027 1.910
21 4.168 0.030 4.010
22 4.810 3.400 1.745
23 34.319 1.499 1.965
24 1.531 0.351 2.555
25 1.481 0.082 0.770
26 2.239 0.518 0.720
27 4.204 0.471 1.730
28 3.463 0.036 2.860
29 1.727 0.721 0.760
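For an enzyme kinetics response the standard model is the Michaelis-Menten curve, fitted by nonlinear least squares; the sketch below uses the hypothetical names kinetics, conc and vel and arbitrary starting values.

    # vel: initial velocity; conc: substrate concentration (hypothetical names)
    fit <- nls(vel ~ Vm * conc / (K + conc), data = kinetics,
               start = list(Vm = 4, K = 1))
    summary(fit)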
Table A.16. Car insurance data from McCullagh and Nelder. The response is the
average claim, in £. Also given are observation number and m, the number of
claims in each category
Number X1 X2 Y Number X1 X2 Y
1 1 180 15.0 44 4 250 13.5
2 1 180 17.0 45 4 275 10.0
3 1 180 15.5 46 4 275 11.5
4 1 180 16.5 47 4 275 11.0
5 1 225 15.5 48 4 275 9.5
6 1 225 15.0 49 8 180 15.0
7 1 225 16.0 50 8 180 15.0
8 1 225 14.5 51 8 180 15.5
9 1 250 15.0 52 8 180 16.0
10 1 250 14.5 53 8 225 13.0
11 1 250 12.5 54 8 225 10.5
12 1 250 11.0 55 8 225 13.5
13 1 275 14.0 56 8 225 14.0
14 1 275 13.0 57 8 250 12.5
15 1 275 14.0 58 8 250 12.0
16 1 275 11.5 59 8 250 11.5
17 2 180 14.0 60 8 250 11.5
18 2 180 16.0 61 8 275 6.5
19 2 180 13.0 62 8 275 5.5
20 2 180 13.5 63 8 275 6.0
21 2 225 13.0 64 8 275 6.0
22 2 225 13.5 65 16 180 18.5
23 2 225 12.5 66 16 180 17.0
24 2 225 12.5 67 16 180 15.3
25 2 250 12.5 68 16 180 16.0
26 2 250 12.0 69 16 225 13.0
27 2 250 11.5 70 16 225 14.0
28 2 250 12.0 71 16 225 12.5
29 2 275 13.0 72 16 225 11.0
30 2 275 11.5 73 16 250 12.0
31 2 275 13.0 74 16 250 12.0
32 2 275 12.5 75 16 250 11.5
33 4 180 13.5 76 16 250 12.0
34 4 180 17.5 77 16 275 6.0
35 4 180 17.5 78 16 275 6.0
36 4 180 13.5 79 16 275 5.0
37 4 225 12.5 80 16 275 5.5
38 4 225 12.5 81 32 180 12.5
39 4 225 15.0 82 32 180 13.0
40 4 225 13.0 83 32 180 16.0
41 4 250 12.0 84 32 180 12.0
42 4 250 13.0 85 32 225 11.0
43 4 250 12.0 86 32 225 9.5
87 32 225 11.0 108 48 250 7.9
88 32 225 11.0 109 48 275 1.2
89 32 250 11.0 110 48 275 1.5
90 32 250 10.0 111 48 275 1.0
91 32 250 10.5 112 48 275 1.5
92 32 250 10.5 113 64 180 13.0
93 32 275 2.7 114 64 180 12.5
94 32 275 2.7 115 64 180 16.5
95 32 275 2.5 116 64 180 16.0
96 32 275 2.4 117 64 225 11.0
97 48 180 13.0 118 64 225 11.5
98 48 180 13.5 119 64 225 10.5
99 48 180 16.5 120 64 225 10.0
100 48 180 13.6 121 64 250 7.2
101 48 225 11.5 122 64 250 7.5
102 48 225 10.5 123 64 250 6.7
103 48 225 13.5 124 64 250 7.6
104 48 225 12.0 125 64 275 1.5
105 48 250 7.0 126 64 275 1.0
106 48 250 6.9 127 64 275 1.2
107 48 250 8.8 128 64 275 1.2
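A minimal R sketch following the gamma generalized linear model with reciprocal link that McCullagh and Nelder use for claim data; the data frame insurance, and the treatment of x1 and x2 as factors, are assumptions, with m the number of claims mentioned in the caption.

    fit <- glm(y ~ factor(x1) + factor(x2),
               family = Gamma(link = "inverse"),
               weights = m, data = insurance)  # m: claims per category
    summary(fit)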
Number X1 X2 Y
1 0 0 11
2 0 4 18
3 0 20 20
4 0 100 39
5 1 0 22
6 1 4 38
7 1 20 52
8 1 100 69
9 10 0 31
10 10 4 68
11 10 20 69
12 10 100 128
13 100 0 102
14 100 4 171
15 100 20 180
16 100 100 193
X1: Dose of TNF (U/ml).
X2: Dose of IFN (U/ml).
y: Number of cells differentiating.
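With a count as response, a Poisson log-linear model is a natural starting point; a minimal R sketch, assuming a data frame cells with columns tnf, ifn and y.

    fit <- glm(y ~ factor(tnf) + factor(ifn),
               family = poisson, data = cells)
    summary(fit)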
Number Dose Killed Total
1 49.09 6 59
2 52.99 13 60
3 56.91 18 62
4 60.84 28 56
5 64.76 52 63
6 68.69 53 59
7 72.61 61 62
8 76.54 60 60
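Taking the columns as labelled above (the labels are inferred from the layout), the classical analysis is a binomial regression of the proportion killed on dose; a minimal R sketch with the assumed frame name beetles.

    fit <- glm(cbind(killed, total - killed) ~ dose,
               family = binomial, data = beetles)          # logit link
    fit2 <- update(fit, family = binomial(link = "probit"))
    summary(fit)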
Table A.21. Number of mice with convulsions after treatment with insulin
Number Dose Preparation Convulsions Total
1 3.4 0 0 33
2 5.2 0 5 32
3 7.0 0 11 38
4 8.5 0 14 37
5 10.5 0 18 40
6 13.0 0 21 37
7 18.0 0 23 31
8 21.0 0 30 37
9 28.0 0 27 30
10 6.5 1 2 40
11 10.0 1 10 30
12 14.0 1 18 40
13 21.5 1 21 35
14 29.0 1 27 37
Preparation: 0 = Standard, 1 = Test.
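A minimal R sketch comparing the two preparations on the log-dose scale; the data frame mice and its column names are assumptions.

    # y of m mice show convulsions at each dose and preparation
    fit <- glm(cbind(y, m - y) ~ log(dose) + prep,
               family = binomial, data = mice)
    summary(fit)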
y: 0 = nonoccurrence; 1 = occurrence.
Number X1 X4 X6 Y Number X1 X4 X6 Y
1 44 254 190 0 44 56 428 171 1
2 35 240 216 0 45 53 334 166 0
3 41 279 178 0 46 47 278 121 0
4 31 284 149 0 47 30 264 178 0
5 61 315 182 1 48 64 243 171 1
6 61 250 185 0 49 31 348 181 0
7 44 298 161 0 50 35 290 162 0
8 58 384 175 0 51 65 370 153 1
9 52 310 144 0 52 43 363 164 0
10 52 337 130 0 53 53 343 159 0
11 52 367 162 0 54 58 305 152 1
12 40 273 175 0 55 67 365 190 1
13 49 273 155 0 56 53 307 200 0
14 34 314 156 0 57 42 243 147 0
15 37 243 151 0 58 43 266 125 0
16 63 341 168 0 59 52 341 163 0
17 28 245 185 0 60 68 268 138 0
18 40 302 225 0 61 64 261 108 0
19 51 302 247 1 62 46 378 142 0
20 33 386 146 0 63 41 279 212 0
21 37 312 170 1 64 58 416 188 0
22 33 302 161 0 65 50 261 145 0
23 41 394 167 0 66 45 332 144 0
24 38 358 198 0 67 59 337 158 0
25 52 336 162 0 68 56 365 154 0
26 31 251 150 0 69 59 292 148 0
27 44 322 196 1 70 47 304 155 0
28 31 281 130 0 71 43 341 154 0
29 40 336 166 1 72 37 317 184 0
30 36 314 178 0 73 27 296 140 0
31 42 383 187 0 74 44 390 167 0
32 28 360 148 0 75 41 274 138 0
33 40 369 180 0 76 33 355 169 0
34 40 333 172 0 77 29 225 186 0
35 35 253 141 0 78 24 218 131 0
36 32 268 176 0 79 36 298 160 0
37 31 257 154 0 80 23 178 142 0
38 52 474 145 0 81 47 341 218 1
39 45 391 159 1 82 26 274 147 0
40 39 248 181 0 83 45 285 161 0
41 40 520 169 1 84 41 259 245 0
42 48 285 160 1 85 55 266 167 0
43 29 352 149 0 86 34 214 139 1
87 51 267 150 0 130 51 286 134 0
88 58 256 175 0 131 37 260 188 0
89 51 273 123 0 132 28 252 149 0
90 35 348 174 0 133 44 336 175 0
91 34 322 192 0 134 35 216 126 0
92 26 267 140 0 135 41 208 165 0
93 25 270 195 0 136 29 352 160 0
94 44 280 144 0 137 46 346 155 0
95 57 320 193 0 138 55 259 140 0
96 67 320 134 0 139 32 290 181 0
97 59 330 144 0 140 40 239 178 0
98 62 274 179 0 141 61 333 141 0
99 40 269 111 0 142 29 173 143 0
100 52 269 164 0 143 52 253 139 0
101 28 135 168 0 144 25 156 136 0
102 34 403 175 0 145 27 156 150 0
103 43 294 173 0 146 27 208 185 0
104 38 312 158 0 147 53 218 185 0
105 45 311 154 0 148 42 172 161 0
106 26 222 214 0 149 64 357 180 0
107 35 302 176 0 150 27 178 198 0
108 51 269 262 0 151 55 283 128 1
109 55 311 181 0 152 33 275 177 0
110 45 286 143 0 153 58 187 224 0
111 69 370 185 1 154 51 282 160 0
112 58 403 140 0 155 37 282 181 0
113 64 244 187 0 156 47 254 136 0
114 70 353 163 0 157 49 273 245 0
115 27 252 164 0 158 46 328 187 0
116 53 453 170 0 159 40 244 161 1
117 28 260 150 0 160 26 277 190 0
118 29 269 141 0 161 28 195 180 0
119 23 235 135 0 162 23 206 165 0
120 40 264 135 0 163 52 327 147 0
121 53 420 141 0 164 42 246 146 0
122 25 235 148 0 165 27 203 182 0
123 63 420 160 1 166 29 185 187 0
124 48 277 180 1 167 43 224 128 0
125 36 319 157 0 168 34 246 140 0
126 28 386 189 1 169 40 227 163 0
127 57 353 166 0 170 28 229 144 0
128 39 344 175 0 171 30 214 150 0
129 52 210 172 1 172 34 206 137 0
Number X1 X4 X6 Y
173 26 173 141 0
174 34 248 141 0
175 35 222 190 0
176 34 230 167 0
177 45 219 159 0
178 47 239 157 0
179 54 258 170 0
180 30 190 132 0
181 29 252 155 0
182 48 253 178 0
183 37 172 168 0
184 43 320 159 1
185 31 166 160 0
186 48 266 165 0
187 34 176 194 0
188 42 271 191 1
189 49 295 198 0
190 50 271 212 1
191 42 259 147 0
192 50 178 173 1
193 60 317 206 0
194 27 192 190 0
195 29 187 181 0
196 29 238 143 0
197 49 283 149 0
198 49 264 166 0
199 50 264 176 0
200 31 193 141 0
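A minimal R sketch of a logistic regression for this binary response, assuming a data frame chapman with the columns tabulated above.

    fit <- glm(y ~ x1 + x4 + x6, family = binomial, data = chapman)
    summary(fit)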
Author Index

Seber, G. A. F., 136, 148, 172, 317
Selwyn, M. R., 27, 314
Sherman, J., 35, 317
Shih, J.-Q., 126, 317
Simonoff, J. S., 30, 33, 315
Smith, H., 174, 314
Smith, R. L., 127, 312
Snell, E. J., 187, 265, 315
Spurr, S. H., 125, 126, 317
Srinivasan, R., 174, 317
St Laurent, R. T., 152, 173, 318
Stefanski, L. A., 265, 318
Stromberg, A., 154, 164, 165, 173, 318
Stromberg, A. J., 173, 318
Swallow, W. H., 35, 315
Tidwell, P. W., 87, 312
Tukey, J. W., 256, 315
Tutz, G., 226, 230, 247, 265, 314
van Zomeren, B. C., 29, 317
Venables, W. N., 251, 265, 318
Væth, M., 252, 318
Walbran, A., 76, 314
Wang, P. C., 21, 313
Watts, D. G., 136, 138, 143, 145, 147, 151, 171, 172, 176, 312
Weinberg, C. R., 226, 230, 316
Weisberg, S., 2, 18, 28, 35, 67, 80, 87, 116, 313, 314, 318
Welsch, R. E., 35, 312
Wild, C. J., 136, 148, 172, 317
Williams, D. A., 200, 318
Witmer, J. A., 147, 314
Woodbury, M., 35, 318
Woodruff, D. L., 30, 266, 318
Subject Index