Bootstrap Methods and Their Application
Editorial Board:
R. Gill (Utrecht)
B.D. Ripley (Oxford)
S. Ross (Berkeley)
M. Stein (Chicago)
D. Williams (Bath)
A. C. Davison
Professor of Statistics, Department of Mathematics,
Swiss Federal Institute of Technology, Lausanne

D. V. Hinkley
Professor of Statistics, Department of Statistics and Applied Probability,
University of California, Santa Barbara

CAMBRIDGE UNIVERSITY PRESS
PUBLISHED BY THE PRESS SYNDICATE OF THE UNIVERSITY OF CAMBRIDGE
The Pitt Building, Trumpington Street, Cambridge CB2 1RP, United Kingdom

CAMBRIDGE UNIVERSITY PRESS
The Edinburgh Building, Cambridge CB2 2RU, United Kingdom
40 West 20th Street, New York, NY 10011-4211, USA
10 Stamford Road, Oakleigh, Melbourne 3166, Australia
Contents

Preface

1 Introduction
2 The Basic Bootstraps
3 Further Ideas
3.1 Introduction
3.2 Several Samples
3.3 Semiparametric Models
3.4 Smooth Estimates of F
3.5 Censoring
3.6 Missing Data
3.7 Finite Population Sampling
3.8 Hierarchical Data
3.9 Bootstrapping the Bootstrap
4 Tests
4.1 Introduction
4.2 Resampling for Parametric Tests
4.3 Nonparametric Permutation Tests
4.4 Nonparametric Bootstrap Tests
4.5 Adjusted P-values
4.6 Estimating Properties of Tests
4.7 Bibliographic Notes
4.8 Problems
4.9 Practicals
The publication in 1979 of Bradley Efron’s first article on bootstrap methods was a
major event in Statistics, at once synthesizing some of the earlier resampling ideas
and establishing a new framework for simulation-based statistical analysis. The idea
of replacing complicated and often inaccurate approximations to biases, variances,
and other measures of uncertainty by computer simulations caught the imagination
of both theoretical researchers and users of statistical methods. Theoreticians
sharpened their pencils and set about establishing mathematical conditions under
which the idea could work. Once they had overcome their initial skepticism, applied
workers sat down at their terminals and began to amass empirical evidence that
the bootstrap often did work better than traditional methods. The early trickle of
papers quickly became a torrent, with new additions to the literature appearing
every month, and it was hard to see when would be a good moment to try to chart
the waters. Then the organizers of COMPSTAT '92 invited us to present a course
on the topic, and shortly afterwards we began to write this book.
We decided to try to write a balanced account of resampling methods, to include
basic aspects of the theory which underpinned the methods, and to show as many
applications as we could in order to illustrate the full potential of the methods —
warts and all. We quickly realized that in order for us and others to understand
and use the bootstrap, we would need suitable software, and producing it led us
further towards a practically oriented treatment. Our view was cemented by two
further developments: the appearance of two excellent books, one by Peter Hall
on the asymptotic theory and the other on basic methods by Bradley Efron and
Robert Tibshirani; and the chance to give further courses that included practicals.
Our experience has been that hands-on computing is essential in coming to grips
with resampling ideas, so we have included practicals in this book, as well as more
theoretical problems.
As the book expanded, we realized that a fully comprehensive treatment was
beyond us, and that certain topics could be given only a cursory treatment because
too little is known about them. So it is that the reader will find only brief accounts
of bootstrap methods for hierarchical data, missing data problems, model selection,
robust estimation, nonparametric regression, and complex data. But we do try to
point the more ambitious reader in the right direction.
No project of this size is produced in a vacuum. The majority of work on
the book was completed while we were at the University of Oxford, and we are
very grateful to colleagues and students there, who have helped shape our work
in various ways. The experience of trying to teach these methods in Oxford and
elsewhere — at the Université de Toulouse I, Université de Neuchâtel, Università
degli Studi di Padova, Queensland University of Technology, Universidade de
São Paulo, and University of Umeå — has been vital, and we are grateful to
participants in these courses for prompting us to think more deeply about the
material. Readers will be grateful to these people also, for unwittingly debugging
some of the problems and practicals. We are also grateful to the organizers of
COMPSTAT ’92 and CLAPEM V for inviting us to give short courses on our
work.
While writing this book we have asked many people for access to data, copies
of their programs, papers or reprints; some have then been rewarded by our
bombarding them with questions, to which the answers have invariably been
courteous and informative. We cannot name all those who have helped in this
way, but D. R. Brillinger, P. Hall, M. P. Jones, B. D. Ripley, H. O’R. Sternberg and
G. A. Young have been especially generous. S. Hutchinson and B. D. Ripley have
helped considerably with computing matters.
We are grateful to the mostly anonymous reviewers who commented on an early
draft of the book, and to R. Gatto and G. A. Young, who later read various parts
in detail. At Cambridge University Press, A. Woollatt and D. Tranah have helped
greatly in producing the final version, and their patience has been commendable.
We are particularly indebted to two people. V. Ventura read large portions of the
book, and helped with various aspects of the computation. A. J. Canty has turned
our version of the bootstrap library functions into reliable working code, checked
the book for mistakes, and has made numerous suggestions that have improved it
enormously. Both of them have contributed greatly — though of course we take
responsibility for any errors that remain in the book. We hope that readers will
tell us about them, and we will do our best to correct any future versions of the
book; see its WWW page, at URL
http://dmawww.epfl.ch/davison.mosaic/BMA/
The book could not have been completed without grants from the UK Engineering
and Physical Sciences Research Council, which in addition to providing funding
for equipment and research assistantships, supported the work of A. C. Davison
through the award of an Advanced Research Fellowship. We also acknowledge
support from the US National Science Foundation.
We must also mention the Friday evening sustenance provided at the Eagle and
Child, the Lamb and Flag, and the Royal Oak. The projects of many authors have
flourished in these amiable establishments.
Finally, we thank our families, friends and colleagues for their patience while
this project absorbed our time and energy. Particular thanks are due to Claire
Cullen Davison for keeping the Davison family going during the writing of this
book.
A. C. Davison and D. V. Hinkley
Lausanne and Santa Barbara
May 1997
1
Introduction
Figure 1.1, together with the observed total reports to the end of 1992. How good are these predictions?
It would be tedious but possible to put pen to paper and estimate the prediction uncertainty through calculations based on the Poisson model. But in fact the data are much more variable than that model would suggest, and by failing to take this into account we would believe that the predictions are more accurate than they really are. Furthermore, a better approach would be to use a semiparametric model to smooth out the evident variability of the increase in diagnoses from quarter to quarter; the corresponding prediction is the dotted line in Figure 1.1. Analytical calculations for this model would be very unpleasant, and a more flexible line of attack is needed. While more than one approach is possible, the one that we shall develop based on computer simulation is both flexible and straightforward.
which the variability of the quantities of interest can be assessed without long-winded and error-prone analytical calculation. Because this approach involves repeating the original data analysis procedure with many replicate sets of data, these are sometimes called computer-intensive methods. Another name for them is bootstrap methods, because to use the data to generate more data seems analogous to a trick used by the fictional Baron Munchausen, who when he found himself at the bottom of a lake got out by pulling himself up by his bootstraps. In the simplest nonparametric problems we do literally sample from the data, and a common initial reaction is that this is a fraud. In fact it is not. It turns out that a wide range of statistical problems can be tackled this way, liberating the investigator from the need to oversimplify complex problems. The approach can also be applied in simple problems, to check the adequacy of standard measures of uncertainty, to relax assumptions, and to give quick approximate solutions. An example of this is random sampling to estimate the permutation distribution of a nonparametric test statistic.
It is of course true that in many applications we can be fairly confident in a particular parametric model and the standard analysis based on that model. Even so, it can still be helpful to see what can be inferred without particular parametric model assumptions. This is in the spirit of robustness of validity of the statistical analysis performed. Nonparametric bootstrap analysis allows us to do this.
Examples
Bootstrap methods can be applied both when there is a well-defined probability model for data and when there is not. In our initial development of the methods we shall make frequent use of two simple examples, one of each type, to illustrate the main points.
The dotted line in the left panel of Figure 1.2 is the CDF

F_{\hat\mu}(y) = \begin{cases} 0, & y \le 0, \\ 1 - \exp(-y/\hat\mu), & y > 0, \end{cases}

for the fitted exponential distribution with mean \hat\mu set equal to the sample average, \bar y = 108.083. The solid line on the same plot is the nonparametric equivalent, the empirical distribution function (EDF) for the data, which places equal probabilities n^{-1} = 0.083 at each sample value. Comparison of the two curves suggests that the exponential model fits reasonably well. An alternative view of this is shown in the right panel of the figure, which is an exponential
Figure 1.2 Summary displays for the air-conditioning data. The left panel shows the EDF for the data (solid) and the CDF of a fitted exponential distribution (dots). The right panel shows a plot of the ordered failure times against exponential quantiles, with the fitted exponential model shown as the dotted line.
quantile plot: the ordered failure times y_{(1)} \le \cdots \le y_{(n)} are plotted against the standard exponential quantiles -\log\{1 - j/(n+1)\}, j = 1, ..., n.
Although these plots suggest reasonable agreement with the exponential model, the sample is rather too small to have much confidence in this. In the data source the more general gamma model with mean \mu and index \kappa is used; its density is

f_{\mu,\kappa}(y) = \frac{1}{\Gamma(\kappa)} \left(\frac{\kappa}{\mu}\right)^{\kappa} y^{\kappa-1} \exp(-\kappa y/\mu), \qquad y > 0, \quad \mu, \kappa > 0. \qquad (1.1)

For our sample the estimated index is \hat\kappa = 0.71, which does not differ significantly (P = 0.29) from the value \kappa = 1 that corresponds to the exponential model. Our reason for mentioning this will become apparent in Chapter 2.
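These numbers are easy to reproduce. The book's bundled software is for S-Plus; the minimal sketch below is in R, a close dialect of S. It assumes the twelve failure times of Example 1.1, whose average is 108.083, and uses the fact that for fixed kappa the maximum likelihood estimate of mu in (1.1) is the sample average, so only kappa need be optimised.

  # Fit the gamma model (1.1) by maximum likelihood.
  y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)
  loglik <- function(kappa, y)
    sum(dgamma(y, shape = kappa, rate = kappa/mean(y), log = TRUE))
  opt <- optimise(loglik, c(0.1, 5), y = y, maximum = TRUE)
  opt$maximum                            # estimated index, about 0.71
  # Likelihood ratio test of kappa = 1 (the exponential model):
  LR <- 2 * (opt$objective - loglik(1, y))
  1 - pchisq(LR, df = 1)                 # P approximately 0.29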
Basic properties of the estimator T = \bar Y for \mu are easy to obtain theoretically under the exponential model. For example, it is easy to show that T is unbiased and has variance \mu^2/n. Approximate confidence intervals for \mu can be calculated using these properties in conjunction with a normal approximation for the distribution of T, although this does not work very well: we can tell this because \bar Y/\mu has an exact gamma distribution, which leads to exact confidence limits. Things are more complicated under the more general gamma model, because the index \kappa is only estimated, and so in a traditional approach we would use approximations — such as a normal approximation for the distribution of T, or a chi-squared approximation for the log likelihood ratio statistic.
The parametric simulation methods of Section 2.2 can be used alongside these approximations, to diagnose problems with them, or to replace them entirely. ■
Example 1.2 (City population data) Table 1.3 reports n = 49 data pairs, each corresponding to a city in the United States of America, the pair being the 1920 and 1930 populations of the city, which we denote by u and x. The data are plotted in Figure 1.3. Interest here is in the ratio of means, because this would enable us to estimate the total population of the USA in 1930 from the 1920 figure. If the cities form a random sample with (U, X) denoting the pair of population values for a randomly selected city, then the total 1930 population is the product of the total 1920 population and the ratio of expectations \theta = E(X)/E(U). This ratio is the parameter of interest.
In this case there is no obvious parametric model for the joint distribution of (U, X), so it is natural to estimate \theta by its empirical analogue, T = \bar X/\bar U, the ratio of sample averages. We are then concerned with the uncertainty in T. If we had a plausible parametric model — for example, that the pair (U, X) has a bivariate lognormal distribution — then theoretical calculations like those in Example 1.1 would lead to bias and variance estimates for use in a normal approximation, which in turn would provide approximate confidence intervals for \theta. Without such a model we must use nonparametric analysis. It is still possible to estimate the bias and variance of T, as we shall see, and this makes normal approximation still feasible, as well as more complex approaches to setting confidence intervals. ■
Table 1.3 Populations in thousands of n = 49 large US cities in 1920 (u) and in 1930 (x) (Cochran, 1977, p. 152).

    u    x      u    x      u    x
  138  143     76   80     67   67
   93  104    381  464    120  115
   61   69    387  459    172  183
  179  260     78  106     66   86
   48   75     60   57     46   65
   37   63    507  634    121  113
   29   50     50   64     44   58
   23   48     77   89     64   63
   30  111     64   77     56  142
    2   50     40   60     40   64
   38   52    136  139    116  130
   46   53    243  291     87  105
   71   79    256  288     43   61
   25   57     94   85     43   50
  298  317     36   46    161  232
   74   93     45   53     36   54
   50   58
Figure 1.3 Populations of 49 large United States cities (in 1000s) in 1920 and 1930.
the S language to sets of data. The practicals are intended to reinforce the ideas in each chapter, to supplement the more theoretical problems, and to give examples on which readers can base analyses of their own data.
It would be possible to give different sorts of course based on this book. One would be a "theoretical" course based on the problems and another an "applied" course based on the practicals; we prefer to blend the two.
Although a library of routines for use with the statistical package S-Plus is bundled with it, most of the book can be read without reference to particular software packages. Apart from the practicals, the exception to this is Chapter 11, which is a short introduction to the main resampling routines, arranged roughly in the order with which the corresponding ideas appear in earlier chapters. Readers intending to use the bundled routines will find it useful to work through the relevant sections of Chapter 11 before attempting the practicals.
Notation
Although we believe that our notation is largely standard, there are not enough letters in the English and Greek alphabets for us to be entirely consistent. Greek letters such as \theta, \beta and \nu generally denote parameters or other unknowns, while \alpha is used for error rates in connection with significance tests and confidence sets. English letters X, Y, Z, and so forth are used for random variables, which take values x, y, z. Thus the estimator T has observed value t, which may be an estimate of the unknown parameter \theta. The letter V is used for a variance estimate, and the letter p for a probability, except for regression models, where p is the number of covariates. Script letters such as \mathcal{S} are used to denote sets.
Probability, expectation, variance and covariance are denoted Pr(\cdot), E(\cdot), var(\cdot) and cov(\cdot, \cdot), while the joint cumulant of Y_1, Y_1Y_2 and Y_3 is denoted cum(Y_1, Y_1Y_2, Y_3). We use I{A} to denote the indicator random variable, which takes values one if the event A is true and zero otherwise. A related function is the Heaviside function

H(u) = \begin{cases} 0, & u < 0, \\ 1, & u \ge 0. \end{cases}

We use #{A} to denote the number of elements in the set A, and #{A_r} for the number of events A_r that occur in a sequence A_1, A_2, .... We use \doteq to mean "is approximately equal to", usually corresponding to asymptotic equivalence as sample sizes tend to infinity; \sim to mean "is distributed as" or "is distributed according to"; a dotted \sim to mean "is distributed approximately as"; \sim with "iid" written above it to mean "is a sample of independent identically distributed random variables from"; while \equiv has its usual meaning of "is equivalent to".
2
The Basic Bootstraps

2.1 Introduction
In this chapter we discuss techniques which are applicable to a single, homogeneous sample of data, denoted by y_1, ..., y_n. The sample values are thought of as the outcomes of independent and identically distributed random variables Y_1, ..., Y_n whose probability density function (PDF) and cumulative distribution function (CDF) we shall denote by f and F, respectively. The sample is to be used to make inferences about a population characteristic, generically denoted by \theta, using a statistic T whose value in the sample is t. We assume for the moment that the choice of T has been made and that it is an estimate for \theta, which we take to be a scalar.
Our attention is focused on questions concerning the probability distribution of T. For example, what are its bias, its standard error, or its quantiles? What are likely values under a certain null hypothesis of interest? How do we calculate confidence limits for \theta using T?
There are two situations to distinguish, the parametric and the nonparametric. When there is a particular mathematical model, with adjustable constants or parameters \psi that fully determine f, such a model is called parametric and statistical methods based on this model are parametric methods. In this case the parameter of interest \theta is a component of or function of \psi. When no such mathematical model is used, the statistical analysis is nonparametric, and uses only the fact that the random variables Y_j are independent and identically distributed. Even if there is a plausible parametric model, a nonparametric analysis can still be useful to assess the robustness of conclusions drawn from a parametric analysis.
An important role is played in nonparametric analysis by the empirical distribution, which puts equal probabilities n^{-1} at each sample value y_j. The corresponding estimate of F is the empirical distribution function (EDF) \hat F,
\hat F(y) = \frac{1}{n} \sum_{j=1}^{n} H(y - y_j), \qquad (2.1)
where H(u) is the unit step function which jumps from 0 to 1 at u = 0. Notice that the values of the EDF are fixed (0, 1/n, 2/n, ..., 1), so the EDF is equivalent to its points of increase, the ordered values y_{(1)} \le \cdots \le y_{(n)} of the data. An example of the EDF was shown in the left panel of Figure 1.2.
When there are repeat values in the sample, as would often occur with discrete data, the EDF assigns probabilities proportional to the sample frequencies at each distinct observed value y. The formal definition (2.1) still applies.
The EDF plays the role of fitted model when no mathematical form is assumed for F, analogous to a parametric CDF with parameters replaced by their estimates.
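In R the EDF of (2.1) is available directly; a minimal sketch, again assuming the air-conditioning failure times of Example 1.1:

  y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)
  Fhat <- ecdf(y)      # the EDF (2.1), a right-continuous step function
  Fhat(100)            # proportion of sample values <= 100, here 9/12
  plot(Fhat)           # each jump has height 1/12, cf. Figure 1.2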
both the parameter and its estimate, but we shall use t(\cdot) to represent the function, and t to represent the estimate of \theta based on the observed data y_1, ..., y_n.
Example 2.1 (Average) The sample average, \bar y, estimates the population mean

\mu = \int y \, dF(y),

and substitution of the EDF \hat F for F gives the statistical function t(\hat F) = n^{-1} \sum_{j=1}^{n} y_j = \bar y.
Example 2.2 (City population data) For the problem outlined in Example 1.2, the parameter of interest is the ratio of means \theta = E(X)/E(U). In this case F is the bivariate CDF of Y = (U, X), and the bivariate EDF \hat F puts probability n^{-1} at each of the data pairs (u_j, x_j). The statistical function version of \theta simply uses the definition of mean for both numerator and denominator, so that

\theta = t(F) = \frac{\int x \, dF(u,x)}{\int u \, dF(u,x)}, \qquad t = t(\hat F) = \frac{\int x \, d\hat F(u,x)}{\int u \, d\hat F(u,x)} = \frac{\bar x}{\bar u}.
require special treatment is nonparametric density estimation, which we discuss in Example 5.13.)
The representation \theta = t(F) defines the parameter and its estimator T in a robust way, without any assumption about F, other than that \theta exists. This guarantees that T estimates the right thing, no matter what F is. Thus the sample average \bar y is the only statistic that is generally valid as an estimate of the population mean \mu: only if Y is symmetrically distributed about \mu will statistics such as trimmed averages also estimate \mu. This property, which guarantees that the correct characteristic of the underlying distribution is estimated, whatever that distribution is, is sometimes called robustness of specification.
2.1.2 Objectives
Much of statistical theory is devoted to calculating approximate distributions for particular statistics T, on which to base inferences about their estimands \theta. Suppose, for example, that we want to calculate a (1 - 2\alpha) confidence interval for \theta. It may be possible to show that T is approximately normal with mean \theta + \beta and variance v; here \beta is the bias of T. If \beta and v are both known, then we can write

\Pr(T \le t \mid F) \doteq \Phi\left(\frac{t - \theta - \beta}{v^{1/2}}\right), \qquad (2.3)

where \Phi(\cdot) is the standard normal integral and \doteq means "is approximately equal to". If the \alpha quantile of the standard normal distribution is z_\alpha = \Phi^{-1}(\alpha), then an approximate (1 - 2\alpha) confidence interval for \theta has limits

t - \beta - v^{1/2} z_{1-\alpha}, \qquad t - \beta - v^{1/2} z_\alpha, \qquad (2.4)

as follows from (2.5).
Example 2.3 (Air-conditioning data) Under the exponential model for the data in Example 1.1, the mean failure time \mu is estimated by the average T = \bar Y, which has a gamma distribution with mean \mu and shape parameter \kappa = n. Therefore the bias and variance of T are b(F) = 0 and v(F) = \mu^2/n, and these are estimated by 0 and \bar y^2/n. Since n = 12, \bar y = 108.083 and z_{0.025} = -1.96, a 95% confidence interval for \mu based on the normal approximation (2.3) is \bar y \pm 1.96 n^{-1/2} \bar y = (46.93, 169.24). ■
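The interval is elementary to compute; a sketch in R, with the failure times of Example 1.1 assumed as before:

  y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)
  n <- length(y)
  mean(y) + c(-1, 1) * 1.96 * mean(y)/sqrt(n)   # (46.93, 169.24)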
Estimates such as those in (2.6) are bootstrap estimates. Here they have been used in conjunction with a normal approximation, which sometimes will be adequate. However, the bootstrap approach of substituting estimates can be applied more ambitiously to improve upon the normal approximation and other first-order theoretical approximations. The elaboration of the bootstrap approach is the purpose of this book.
Note that the estimated bias of \bar Y is zero, being the difference between E^*(\bar Y^*) and the value \hat\mu = \bar y for the mean of the fitted distribution. These moments were used to calculate an approximate normal confidence interval in Example 2.3.
If, however, we wished to calculate the bias and variance of T = \log \bar Y under the fitted model, i.e. E^*(\log \bar Y^*) - \log \bar y and var^*(\log \bar Y^*), exact calculation is more difficult. The delta method of Section 2.7.1 would give approximate values -(2n)^{-1} and n^{-1}. But more accurate approximations can be obtained using simulated samples of Y^*s.
Similar results and comments would apply if instead we chose to use the more general gamma model (1.1) for this example. Then Y^* would be a gamma random variable with mean \bar y and index \hat\kappa. ■
B = b(\hat F) = E(T \mid \hat F) - t = E^*(T^*) - t,
Note that in the simulation t is the parameter value for the model, so that T^* - t is the simulation analogue of T - \theta. The corresponding estimator of the variance of T is

V_R = \frac{1}{R-1} \sum_{r=1}^{R} (t_r^* - \bar t^*)^2, \qquad (2.8)

where \bar t^* = R^{-1} \sum_r t_r^*.
Under the exponential model the simulation variances of these estimates are

\text{var}^*(B_R) = \frac{t^2}{nR}, \qquad \text{var}^*(V_R) \doteq \frac{t^4}{n^2}\left(\frac{2}{R-1} + \frac{6}{nR}\right),

and we can use these to say how large R should be in order that the simulated values have a specified accuracy. For example, the coefficients of variation of V_R at R = 100 and 1000 are respectively 0.16 and 0.05. However, for a complicated problem where simulation was really necessary, such calculations could not be done, and general rules are needed to suggest how large R should be. These are discussed in Section 2.5.2. ■
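The coefficient of variation implied by the variance formula above is {2/(R - 1) + 6/(nR)}^{1/2}, which is easily tabulated; a sketch in R (the function name is illustrative):

  cv.VR <- function(R, n) sqrt(2/(R - 1) + 6/(n * R))
  cv.VR(c(100, 1000), n = 12)   # approximately 0.16 and 0.05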
The simulated values t_1^*, ..., t_R^* also estimate the distribution of T - \theta, through the EDF of the values t_r^* - t,

\hat G_R(u) = \frac{\#\{t_r^* - t \le u\}}{R} = \frac{1}{R} \sum_{r=1}^{R} I\{t_r^* - t \le u\}.
The illustration used here is very simple, but essentially the same methods can be used in arbitrarily complicated parametric problems. For example, distributions of likelihood ratio statistics can be approximated when large-sample approximations are inaccurate or fail entirely. In Chapters 4 and 5 respectively we show how parametric bootstrap methods can be used to calculate significance tests and confidence sets.
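A parametric simulation for the air-conditioning example takes only a few lines; the sketch below (data assumed as in Example 1.1) computes the bias and variance estimates discussed above and quantiles of T^* - t.

  y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)
  n <- length(y); t0 <- mean(y); R <- 999
  tstar <- replicate(R, mean(rexp(n, rate = 1/t0)))  # R parametric resamples
  mean(tstar) - t0                       # bias estimate B_R, near zero
  var(tstar)                             # variance estimate V_R of (2.8)
  quantile(tstar - t0, c(0.025, 0.975))  # estimated quantiles of T - theta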
It is sometimes useful to be able to look at the density of T, for example to see if it is multimodal, skewed, or otherwise differs appreciably from normality. A rough idea of the density g(u) of U = T - \theta, say, can be had from a histogram of the values of t_r^* - t. A somewhat better picture is offered by a kernel density estimate,
[Figure: ordered values of t* from the parametric simulation plotted against exact gamma quantiles (two panels).]
\hat g_h(u) = \frac{1}{Rh} \sum_{r=1}^{R} w\left(\frac{u - (t_r^* - t)}{h}\right),
where w is a symmetric PDF with zero mean and h is a positive bandwidth that determines the smoothness of \hat g_h. The estimate \hat g_h is non-negative and has unit integral. It is insensitive to the choice of w(\cdot), for which we use the standard normal density. The choice of h is more important. The key is to produce a smooth result, while not flattening out significant modes. If the choice of h is quite large, as it may be if R \le 100, then one should rescale the density
Figure 2.3 Histograms of t* values based on R = 99 (left) and R = 999 (right) simulations from the fitted exponential model for the air-conditioning data.
estimate to make its mean and variance agree with the estimated mean b_R and variance v_R of T - \theta; see Problem 3.8.
As a general rule, good estimates of density require at least R = 1000: density estimation is usually harder than probability or quantile estimation.
Note that the same methods of estimating density, distribution function and quantiles can be applied to any transformation of T. We shall discuss this further in Section 2.5.
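In R a kernel estimate with a normal kernel is provided by density(); a sketch, applied to parametric replicates of the average for the air-conditioning data assumed as before:

  y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)
  tstar <- replicate(999, mean(rexp(length(y), rate = 1/mean(y))))
  u <- tstar - mean(y)                     # replicates of t* - t
  plot(density(u))                         # gaussian kernel, default bandwidth h
  lines(density(u, adjust = 2), lty = 2)   # larger h: smoother, flatter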
Example 2.7 (Average) In the case of the average, exact moments under sampling from the EDF are easily found. For example,

E^*(\bar Y^*) = E^*(Y^*) = \frac{1}{n} \sum_{j=1}^{n} y_j = \bar y,

and similarly

\text{var}^*(\bar Y^*) = \frac{1}{n}\text{var}^*(Y^*) = \frac{1}{n}E^*\{Y^* - E^*(Y^*)\}^2 = \frac{1}{n} \times \frac{1}{n}\sum_{j=1}^{n}(y_j - \bar y)^2 = \frac{n-1}{n} \times \frac{s^2}{n},

where s^2 is the usual unbiased sample variance. Apart from the factor (n - 1)/n, this is the usual result for the estimated variance of \bar Y. ■
Other simple statistics such as the sample variance and sample median are also easy to handle (Problems 2.3, 2.4).
To apply simulation with the EDF is very straightforward. Because the EDF puts equal probabilities on the original data values y_1, ..., y_n, each Y^* is independently sampled at random from those data values. Therefore the simulated sample Y_1^*, ..., Y_n^* is a random sample taken with replacement from the data. This simplicity is special to the case of a homogeneous sample, but many extensions are straightforward. This resampling procedure is called the nonparametric bootstrap.
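In R a single nonparametric resample is one line; a minimal sketch with the data of Example 1.1:

  y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)
  ystar <- sample(y, replace = TRUE)   # size n drawn with replacement
  mean(ystar)                          # one nonparametric replicate t*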
Example 2.8 (City population data) Here we look at the ratio estimate for the problem described in Example 1.2. For convenience we consider a subset of the data in Table 1.3, comprising the first ten pairs. This is an application with no obvious parametric model, so nonparametric simulation makes good sense. Table 2.1 shows the data and the first simulated sample, which has been drawn by randomly selecting subscript j* from the set {1, ..., n} with equal probability and taking (u_j^*, x_j^*) = (u_{j*}, x_{j*}). In this sample j* = 1 never occurs
Ratio estimates t_r^* for the first nine nonparametric resamples:

  Replicate r    1      2      3      4      5      6      7      8      9
  t_r^*        1.466  1.761  1.951  1.542  1.371  1.686  1.378  1.420  1.660
oN
of size n — 10, R = 999
simulations. Note the
skewness of both t* and
I ■
in
o
1 ll
o
o
J llll.-_ q _ n .llll
under the best-fitting gamma model with index \hat\kappa = 0.71. The agreement in the second panel is strikingly good. On reflection this is natural, because the EDF is closer to the larger gamma model than to the exponential model. ■
Under nonparametric resampling, T^* and related quantities will have discrete distributions, even though they may be approximating continuous distributions. This makes results somewhat "fuzzy" compared to their parametric counterparts.
Example 2.10 (Air-conditioning data) For the nonparametric simulation discussed in the previous example, the right panels of Figure 2.9 show the scatter plots of sample standard deviation versus sample average for R = 99 and R = 999 simulated datasets. Corresponding plots for the exponential simulation are shown in the left panels. The qualitative feature to be read from any one of these plots is that data standard deviation is proportional to data average. The discreteness of the nonparametric model (the EDF) adds noise whose peculiar banded structure is evident at R = 999, although the qualitative structure is still apparent. ■
We shall refer to the limits (2.10) as the basic bootstrap confidence limits. Their accuracy depends upon R, of course, and one would typically take R > 1000 to be safe. But accuracy also depends upon the extent to which the distribution of T^* - t agrees with that of T - \theta. Complete agreement will occur if T - \theta has a distribution not depending on any unknowns. This special property is enjoyed by quantities called pivots, which we discuss in more detail in Section 2.5.1.
If, as is usually the case, the distribution of T - \theta does depend on unknowns, then we can try alternative expressions contrasting T and \theta, such as differences of transformed quantities, or studentized comparisons. For the latter, we define the studentized version of T - \theta as

Z = \frac{T - \theta}{V^{1/2}}, \qquad (2.11)

where V is an estimate of var(T | F): we give a fairly general form for V in Section 2.7.2. The idea is to mimic the Student-t statistic, which has this form, and which eliminates the unknown standard deviation when making inference about a normal mean. Throughout this book we shall use Z to denote a studentized statistic.
Recall that the Student-t (1 - 2\alpha) confidence interval for a normal mean \mu has limits

\bar y - v^{1/2} t_{n-1}(1-\alpha), \qquad \bar y - v^{1/2} t_{n-1}(\alpha),

where v is the estimated variance of the mean and t_{n-1}(\alpha), t_{n-1}(1 - \alpha) are quantiles of the Student-t distribution with n - 1 degrees of freedom, the distribution of the pivot Z. More generally, when Z is defined by (2.11), the (1 - 2\alpha) confidence interval limits for \theta have the analogous form

t - v^{1/2} z_{1-\alpha}, \qquad t - v^{1/2} z_\alpha,
where z_p denotes the p quantile of Z. One simple approximation, which can often be justified for large sample size n, is to take Z as being N(0, 1). The result would be no different in practical terms from using a normal approximation for T - \theta, and we know that this is often inadequate. It is more accurate to estimate the quantiles of Z from replicates of the studentized bootstrap statistic, Z^* = (T^* - t)/V^{*1/2}, where T^* and V^* are based on a simulated random sample Y_1^*, ..., Y_n^*.
If the model is parametric, the Y_j^* are generated from the fitted parametric distribution, and if the model is nonparametric, they are generated from the EDF \hat F, as outlined in Section 2.3. In either case we use the (R + 1)\alpha-th order statistic of the simulated values z_1^*, ..., z_R^*, namely z^*_{((R+1)\alpha)}, to estimate z_\alpha. Then the studentized bootstrap confidence interval for \theta has limits

t - v^{1/2} z^*_{((R+1)(1-\alpha))}, \qquad t - v^{1/2} z^*_{((R+1)\alpha)}. \qquad (2.12)
Example 2.11 (Air-conditioning data) Under the exponential model for the data of Example 1.1, we have T = \bar Y, and since var(T | F_\mu) = \mu^2/n, we would take V = \bar Y^2/n. This gives

Z = (T - \mu)/V^{1/2} = n^{1/2}(1 - \mu/\bar Y),
Example 2.12 (City population data) For the sample of n = 10 pairs analysed in Example 2.8, our estimate of the ratio \theta is t = \bar x/\bar u = 1.52. The 0.025 and 0.975 quantiles of the 999 values of t^* are 1.236 and 2.059, so the 95% basic bootstrap confidence interval (2.10) for \theta is (0.981, 1.804).
To apply the studentized interval, we use the delta method approximation to the variance of T, which is (Problem 2.9)

v_L = n^{-2} \sum_{j=1}^{n} (x_j - t u_j)^2 / \bar u^2,

and base confidence intervals for \theta on (T - \theta)/V_L^{1/2}, using simulated values of z^* = (t^* - t)/v_L^{*1/2}. The simulated values in the right panel of Figure 2.5 show that the density of the studentized bootstrap statistic Z^* is not close to normal. The 0.025 and 0.975 quantiles of the 499 simulated z^* values are -3.063 and 1.447, and since v_L = 0.0325, an approximate 95% equitailed confidence interval based on (2.12) is (1.260, 2.072). This is quite different from the interval above.
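A sketch of the calculation in R, with the ten city pairs assumed as before and v_L computed from the formula above; the precise limits will vary with the simulation.

  u <- c(138, 93, 61, 179, 48, 37, 29, 23, 30, 2)
  x <- c(143, 104, 69, 260, 75, 63, 50, 48, 111, 50)
  vL <- function(u, x) {                  # delta method variance of the ratio
    n <- length(u); t <- mean(x)/mean(u)
    sum((x - t * u)^2)/(n^2 * mean(u)^2)
  }
  t0 <- mean(x)/mean(u); v0 <- vL(u, x)
  zstar <- replicate(999, {
    j <- sample(10, replace = TRUE)
    (mean(x[j])/mean(u[j]) - t0)/sqrt(vL(u[j], x[j]))
  })
  t0 - sqrt(v0) * quantile(zstar, c(0.975, 0.025))   # (lower, upper) as in (2.12)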
The usefulness of these confidence intervals will depend on how well \hat F
where h^{-1}(\cdot) is the inverse transformation. So h^{-1}\{h(T) - a_\alpha\} is an upper (1 - \alpha) confidence limit for \theta.
Parametric problems
In parametric problems F = F_\psi and \hat F = F_{\hat\psi} have the same form, differing only in parameter values. The notion of a pivot is quite simple here, meaning constant behaviour under all values of the model parameters. More formally, we define a pivot as a function Q = q(T, \theta) whose distribution does not depend on the value of \psi: for all q, Pr(Q \le q \mid \psi) is independent of \psi. (In general Q may also depend on other statistics, as when Q is the studentized form of T.) One can check, sometimes theoretically and always empirically, whether a particular quantity Q is exactly or nearly pivotal, by examining its behaviour under the model form with varying parameter values. For example, in the context of Example 1.1 we could simultaneously examine properties of T - \theta, \log T - \log\theta and the studentized version of the former, by simulation under several exponential models close to the fitted model. This might result in plots of variance or selected quantiles versus parameter values, from which we could diagnose the nonpivotal behaviour of T - \theta and the pivotal behaviour of \log T - \log\theta.
A special role for transformation h(T) arises because sometimes it is relatively easy to choose h(\cdot) so that the variance of h(T) is approximately or exactly independent of \theta, and this stability is the primary feature of stability of distribution. Suppose that T has variance v(\theta). Then provided the function h(\cdot) is well behaved at \theta, Taylor series expansion as described in Section 2.7.1 leads to

\text{var}\{h(T)\} \doteq \{\dot h(\theta)\}^2 v(\theta),

where \dot h(\theta) is the first derivative dh(\theta)/d\theta. This in turn implies that the variance is made approximately constant (equal to 1) if

h(t) = \int^{t} \{v(u)\}^{-1/2} \, du.
in conjunction with (2.13) will typically give more accurate confidence limits than would be obtained using direct approximations of quantiles for T - \theta. If such use of the transformation is appropriate, it will sometimes be clear from theoretical considerations, as in the exponential case. Otherwise the transformation would have to be identified from a scatter plot of simulation-estimated variance of T versus \theta for a range of values of \theta.
Example 2.13 (Air-conditioning data) Figure 2.10 shows a log-log plot of the empirical variances of t^* = \bar y^* based on R = 50 simulations for each of a range of values of \theta. That is, for each value of \theta we generate R values t_r^* corresponding to samples y_1^*, ..., y_n^* from the exponential distribution with mean \theta, and then plot \log\{(R - 1)^{-1} \sum(t_r^* - \bar t^*)^2\} against \log\theta. The linearity and slope of the plot confirm that var(T | F) \propto \theta^2, where \theta = E(T | F). ■
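A sketch of this diagnostic in R; the grid of theta values is illustrative.

  n <- 12; R <- 50
  theta <- c(50, 60, 70, 90, 120, 160, 200)
  logv <- sapply(theta, function(th)
    log(var(replicate(R, mean(rexp(n, rate = 1/th))))))
  plot(log(theta), logv)   # roughly linear, slope 2: var(T) behaves as theta^2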
Nonparametric problems
In nonparametric problems the situation is more complicated. It is now unlikely (but not strictly impossible) that any quantity can be exactly pivotal. Also we cannot simulate data from a distribution with the same form as F, because that form is unknown. However, we can simulate data from distributions near to and similar to \hat F, and this may be enough since \hat F is near F. A rough idea of what is possible can be had from Example 2.10. In the right-hand panels of Figure 2.9 we plotted sample standard deviation versus sample average for a series of nonparametrically resampled datasets. If the EDFs of those datasets are thought of as models near both \hat F and F, then although the pattern is obscured by the banding, the plots suggest that the true model has standard deviation proportional to its mean — which is indeed the case for the most
likely true model. There are conceptual difficulties with this argument, but there is little question that the implication drawn is correct, namely that \log\bar Y will have approximately the same variance under sampling from both \hat F and F.
A more thorough discussion of these ideas for nonparametric problems will be given in Section 3.9.2.
A major focus of research on resampling methods has been the reduction of statistical error. This is reflected particularly in the development of accurate confidence limit methods, which are described in Chapter 5. In general it is best to remove as much of the statistical error as possible in the choice of procedure. However, it is possible to reduce statistical error by a bootstrap technique described in Section 3.9.1.
B_R = R^{-1} \sum_{r=1}^{R} \bar Y_r^* - \bar Y. \qquad (2.15)
The variance of B_R can be decomposed as

\text{var}(B_R) = \text{var}_Y\left\{E^*\left(R^{-1}\sum_r \bar Y_r^* - \bar Y\right)\right\} + E_Y\left\{\text{var}^*\left(R^{-1}\sum_r \bar Y_r^* - \bar Y\right)\right\},

where E_Y(\cdot) and var_Y(\cdot) denote the mean and variance taken with respect to the joint distribution of Y_1, ..., Y_n. From (2.15) this gives

\text{var}(B_R) = \text{var}_Y(0) + E_Y\left(\frac{\hat\sigma^2}{nR}\right) = \frac{\sigma^2}{n} \times \frac{n-1}{nR}, \qquad (2.16)

where \hat\sigma^2 = n^{-1}\sum_j(y_j - \bar y)^2.
This result does not depend on normality of the data. A similar expression holds for any smooth statistic T with a linear approximation (Section 2.7.2), except for an O(n^{-2}) term.
Next consider the variance estimator V_R = (R - 1)^{-1} \sum_r (\bar Y_r^* - \bar Y^*)^2, where \bar Y^* = R^{-1} \sum_r \bar Y_r^*. The mean and variance of V_R across all possible simulations, conditional on the data, can be calculated in the same way, and the unconditional variance is again a sum of two components,

\text{var}(V_R) = \text{var}_Y\{E^*(V_R)\} + E_Y\{\text{var}^*(V_R)\}. \qquad (2.17)
The first term on the right of (2.17) is due to data variation, the second to simulation variation.
A similar calculation applies to quantile estimates. The simulation variance of the estimated p quantile \hat a_{p,R} of T - \theta is approximately

\text{var}^*(\hat a_{p,R}) = \frac{p(1-p)}{R\,g^2(a_p)} = \frac{2\pi p(1-p)\,\sigma^2 \exp(z_p^2)}{nR}, \qquad (2.18)

where g(\cdot) denotes the density of T - \theta and the second expression applies under the normal approximation. Adding the contribution from data variation gives

\text{var}(\hat a_{p,R}) \doteq \frac{\sigma^2}{n^2}\left\{\frac{z_p^2}{2} + \frac{2\pi n p(1-p)\exp(z_p^2)}{R}\right\}. \qquad (2.19)
So to make the variance inflation factor 10% for the 0.025 quantile, for example, we would need R = 40n. Equation (2.19) may not be useful in the centre of the distribution, where d(p) is very large because z_p is small.
Example 2.14 (Air-conditioning data) To see how well this discussion applies in practice, we look briefly at results for the data in Example 1.1. The statistic of interest is T = \log\bar Y, which estimates \theta = \log\mu. The true model for Y is taken to be the gamma distribution with index \kappa = 0.71 and mean \mu = 108.083; these are the data estimates. Effects due to simulation error are approximated
Table 2.3 Components of variance (x 10^{-3}) in bootstrap estimation of the p quantile of log Ybar - log mu, due to data variation and simulation variation, based on nonparametric simulation applied to the data of Example 1.1.

  Source                 Type          p = 0.01   0.99   0.05   0.95   0.10   0.90
  Data                   actual            31.0    6.9   14.0    3.6    8.3    2.2
                         theoretical       26.6   26.6   13.3   13.3    8.1    8.1
  Simulation, R = 100    actual            53.6    9.4    8.5    3.2    3.8    2.6
                         theoretical       32.9   32.9   10.5   10.5    6.9    6.9
  Simulation, R = 500    actual             4.3    2.4    2.0    0.6    1.2    0.4
                         theoretical        6.6    6.6    2.1    2.1    1.4    1.4
  Simulation, R = 1000   actual             2.2    0.8    1.5    0.1    0.8    0.2
                         theoretical        3.3    3.3    1.0    1.0    0.7    0.7
by taking sets of R simulations from one long nonparametric simulation of 9999 datasets. Table 2.3 shows the actual components of variation due to simulation and data variation, together with the theoretical components in (2.19), for estimates of quantiles of \log\bar Y - \log\mu. On the whole the theory gives a fairly accurate prediction of performance. ■
when and how a bootstrap calculation might fail, and ideally how it should be amended to yield useful answers. This topic of bootstrap diagnostics is discussed more fully in Section 3.10.
A second question is: under what idealized conditions will a resampling procedure produce results that are in some sense mathematically correct? Answers to questions of this sort involve an asymptotic framework in which the sample size n \to \infty. Although such asymptotics are ultimately intended to guide practical work, they often act only as a backstop, by removing from consideration procedures that do not have appropriate large-sample properties, and are usually not subtle enough to discriminate among competing procedures according to their finite-sample characteristics. Nevertheless it is essential to appreciate when a naive application of the bootstrap will fail.
To put the theoretical basis for the bootstrap in simple terms, suppose that we have a random sample y_1, ..., y_n, or equivalently its EDF \hat F, from which we wish to estimate properties of a standardized quantity Q = q(Y_1, ..., Y_n; F). For example, we might take Q = n^{1/2}(\bar Y - \mu). The bootstrap estimate of the distribution function G_{F,n} of Q is

G_{\hat F,n}(q) = \Pr\{Q(Y_1^*, \ldots, Y_n^*; \hat F) \le q \mid \hat F\}, \qquad (2.21)

where in this case Q(Y_1^*, ..., Y_n^*; \hat F) = n^{1/2}(\bar Y^* - \bar y). In order for G_{\hat F,n} to approach G_{F,n} as n \to \infty, three conditions must hold. Suppose that the true distribution F is surrounded by a neighbourhood \mathcal{N} in a suitable space of distributions, and that as n \to \infty, \hat F eventually falls into \mathcal{N} with probability one. Then the conditions are:

for all integrable functions h(\cdot). Under these conditions the bootstrap is consistent, meaning that for any q and \varepsilon > 0, \Pr\{|G_{\hat F,n}(q) - G_{F,\infty}(q)| > \varepsilon\} \to 0 as n \to \infty.
The first condition ensures that there is a limit for G_{\hat F,n} to converge to, and would be needed even in the happy situation where \hat F equalled F for every n \ge n', for some n'. Now as n increases, \hat F changes, so the second and third conditions are needed to ensure that G_{\hat F,n} approaches G_{F,\infty} along every possible sequence of \hat Fs. If any one of these conditions fails, the bootstrap can fail.
Asymptotic accuracy
Consistency is a weak property, for example guaranteeing only that the true probability coverage of a nominal (1 - 2\alpha) confidence interval is 1 - 2\alpha + o_p(1). (Here and below we say X_n = O_p(n^d) when \Pr(n^{-d}|X_n| > \varepsilon) \to p for some constant p as n \to \infty, and X_n = o_p(n^d) when \Pr(n^{-d}|X_n| > \varepsilon) \to 0 as n \to \infty, for any \varepsilon > 0.) Standard normal approximation methods are consistent in this sense. Once consistency is established, meaning that the resampling method is "valid", we need to know whether the method is "good" relative to other possible methods. This involves looking at the rate of convergence to nominal properties. For example, does the coverage of the confidence interval deviate from (1 - 2\alpha) by O_p(n^{-1/2}) or by O_p(n^{-1})? Some insight into this can be obtained by expansion methods, as we now outline. More detailed calculations are made in Section 5.4.
Suppose that the problem is one where the limiting distribution of Q is standard normal, and where an Edgeworth expansion applies. Then the distribution of Q can be written in the form

\Pr(Q \le q \mid F) = \Phi(q) + n^{-1/2} a(q)\phi(q) + O(n^{-1}), \qquad (2.22)

where \Phi(\cdot) and \phi(\cdot) are the CDF and PDF of the standard normal distribution, and a(\cdot) is an even quadratic polynomial. For a wide range of problems it can be shown that the corresponding approximation for the bootstrap version of Q is

\Pr(Q^* \le q \mid \hat F) = \Phi(q) + n^{-1/2}\hat a(q)\phi(q) + O_p(n^{-1}), \qquad (2.23)
where \hat a(\cdot) is obtained by replacing unknowns in a(\cdot) by estimates. Now typically \hat a(q) = a(q) + O_p(n^{-1/2}), so

\Pr(Q^* \le q \mid \hat F) - \Pr(Q \le q \mid F) = O_p(n^{-1}). \qquad (2.24)

Thus the estimated distribution for Q differs from the true distribution by a term that is O_p(n^{-1}), provided that Q is constructed in such a way that it is asymptotically pivotal. A similar argument will typically hold when Q has a different limiting distribution, provided it does not depend on unknowns.
Suppose that we choose not to standardize Q, so that its limiting distribution is normal with variance v. An Edgeworth expansion still applies, now with form

\Pr(Q \le q \mid F) = \Phi(q/v^{1/2}) + n^{-1/2} a(q/v^{1/2})\phi(q/v^{1/2}) + O(n^{-1}), \qquad (2.25)

and the corresponding bootstrap approximation replaces v and a(\cdot) by their estimates,

\Pr(Q^* \le q \mid \hat F) = \Phi(q/\hat v^{1/2}) + n^{-1/2}\hat a(q/\hat v^{1/2})\phi(q/\hat v^{1/2}) + O_p(n^{-1}). \qquad (2.26)

Since \hat v - v is typically O_p(n^{-1/2}), in this case only

\Pr(Q^* \le q \mid \hat F) - \Pr(Q \le q \mid F) = O_p(n^{-1/2}), \qquad (2.27)

because the leading terms on the right-hand sides of (2.25) and (2.26) are different.
The difference between (2.24) and (2.27) explains our insistence on working with approximate pivots whenever possible: use of a pivot will mean that a bootstrap distribution function is an order of magnitude closer to its target. It also gives a cogent theoretical motivation for using the bootstrap to set confidence intervals, as we now outline.
We can obtain the \alpha quantile of the distribution of Q by inverting (2.22), giving the Cornish-Fisher expansion

q_\alpha = z_\alpha + n^{-1/2} a''(z_\alpha) + O(n^{-1}),

where z_\alpha is the \alpha quantile of the standard normal distribution, and a''(\cdot) is a further polynomial. The corresponding bootstrap quantile has the property that \hat q^*_\alpha - q_\alpha = O_p(n^{-1}). For simplicity take Q = (T - \theta)/V^{1/2}, where V estimates the variance of T. Then an exact one-sided confidence interval for \theta based on Q would be I_\alpha = [T - V^{1/2} q_\alpha, \infty), and this contains the true \theta with probability \alpha. The corresponding bootstrap interval is \hat I_\alpha = [T - V^{1/2} \hat q^*_\alpha, \infty), where \hat q^*_\alpha is the \alpha quantile of the distribution of Q^* — which would often be estimated by simulation, as we have seen. Since \hat q^*_\alpha - q_\alpha = O_p(n^{-1}), we have

\Pr(\theta \in I_\alpha) = \alpha, \qquad \Pr(\theta \in \hat I_\alpha) = \alpha + O(n^{-1}),
so that the actual probability that \hat I_\alpha contains \theta differs from the nominal probability by only O(n^{-1}). In contrast, intervals based on inverting (2.25) will contain \theta with probability \alpha + O(n^{-1/2}). Such an interval is in principle no more accurate than using the interval [T - V^{1/2} z_\alpha, \infty) obtained by assuming that the distribution of Q is standard normal. Thus one-sided confidence intervals based on quantiles of Q^* have an asymptotic advantage over the use of a normal approximation. Similar comments apply to two-sided intervals.
The practical usefulness of such results will depend on the numerical value of the difference (2.24) at the values of q of interest, and it will always be wise to try to decrease this statistical error, as outlined in Section 2.5.1.
The results above based on Edgeworth expansions apply to many common statistics: smooth functions of sample moments, such as means, variances, and higher moments, eigenvalues and eigenvectors of covariance matrices; smooth functions of solutions to smooth estimating equations, such as most maximum likelihood estimators, estimators in linear and generalized linear models, and some robust estimators; and to many statistics calculated from time series.
distribution of the sample median Y^*_{(m+1)}, for odd sample size n = 2m + 1, is

\Pr\{Y^*_{(m+1)} = y_{(k)}\} = \sum_{j=0}^{m} \binom{n}{j}\left\{p_{k-1}^{\,j}(1-p_{k-1})^{n-j} - p_k^{\,j}(1-p_k)^{n-j}\right\} \qquad (2.28)

for k = 1, ..., n, where p_k = k/n; simulation is not needed in this case. The
moments of this bootstrap distribution, including its mean and variance, converge to the correct values as n increases. However, the convergence can be very slow. To illustrate this, Table 2.4 compares the average bootstrap variance with the empirical variance of the median for data samples of sizes n = 11 and 21 from the standard normal distribution, the Student-t distribution with three degrees of freedom, and the Cauchy distribution; also shown are the theoretical variance approximations, which are incalculable when the true distribution F is unknown. We see that the bootstrap variance can be very poor for n = 11 when distributions are long-tailed. The value 1.4 x 10^4 for average bootstrap variance with Cauchy data is not a mistake: the bootstrap variance exceeds 100 for about 1% of datasets; for some samples the bootstrap variance is huge. The situation stabilizes when n reaches 40 or more.
The gross discreteness of Y^*_{(m+1)} could also affect the simple confidence limit method described in Section 2.4. But provided the inequalities used to justify (2.10) are taken to be \le and \ge rather than < and >, the method works well. For example, for Cauchy samples of size n = 11 the coverage of the 90% basic bootstrap confidence interval (2.10) is 90.8% in 1000 samples; see Problem 2.4. We suggest adopting the same practice for all problems where t^* is supported on a small number of values. ■
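Formula (2.28) is conveniently evaluated with binomial tail probabilities, since the inner sums are \Pr\{Bin(n, p) \le m\}; a sketch in R (the function name is illustrative, and no tied observations are assumed):

  median.boot <- function(y) {       # exact resampling distribution (2.28)
    y <- sort(y); n <- length(y); m <- (n - 1)/2   # n odd, n = 2m + 1
    k <- 1:n
    prob <- pbinom(m, n, (k - 1)/n) - pbinom(m, n, k/n)
    data.frame(value = y, prob = prob)
  }
  d <- median.boot(rt(11, df = 3))   # n = 11 sample from Student-t, 3 df
  sum(d$prob)                        # 1: the probabilities are complete
  sum(d$value^2 * d$prob) - sum(d$value * d$prob)^2   # bootstrap variance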
The statistic T will certainly behave wildly under resampling when t(F) does not exist, as happens for the mean when F is a Cauchy distribution. Quite naturally, over repeated samples the bootstrap will produce silly and useless results in such cases. There are two points to make here. First, if data are taken from a real population, then such mathematical difficulties cannot arise. Secondly, the standard approaches to data analysis include careful screening of data for outliers, nonnormality, and so forth, which leads either to deletion of disruptive data elements or to sensible and reliable choices of estimators
Under certain circumstances the resampling methods we have described will work, but in general it would be unwise to assume this without careful thought. Alternative methods will be described in Section 3.6.
Dependent data
In general the nonparametric resampling method that we have described will not work for dependent data. This can be illustrated quite easily in the case where the data form one realization of a correlated time series. For example, consider the sample average \bar y, and suppose that the data come from a stationary series \{Y_j\} whose marginal variance is \sigma^2 = var(Y_j) and whose autocorrelations are \rho_h = corr(Y_j, Y_{j+h}) for h = 1, 2, .... In Example 2.7 we showed that the nonparametric bootstrap estimate of the variance of \bar Y is approximately s^2/n, and for large n this will approach \sigma^2/n. But the actual variance of \bar Y is

\text{var}(\bar Y) = \frac{\sigma^2}{n}\left\{1 + 2\sum_{h=1}^{n-1}\left(1 - \frac{h}{n}\right)\rho_h\right\}.

The sum here would often differ considerably from one, and then the bootstrap estimate of variance would be badly wrong.
Similar problems arise with other forms of dependent data. The essence of the problem is that simple bootstrap sampling imposes mutual independence on the Y_j, effectively assuming that their joint CDF is F(y_1) x \cdots x F(y_n) and thus sampling from its estimate \hat F(y_1^*) x \cdots x \hat F(y_n^*). This is incorrect for dependent data. The difficulty is that there is no obvious way to estimate a general joint density for Y_1, ..., Y_n given one realization. We shall explore this important subject further in Chapter 8.
Weakly dependent data occur in the altogether different context of finite population sampling. Here the basic nonparametric resampling methods work reasonably well. More will be said about this in Section 3.7.
Dirty data
What if simulated resampling is used when there are outliers in the data? There is no substitute for careful data scrutiny in this or any other statistical context, and if obvious outliers are found, they should be removed or corrected. When there is a fitted parametric model, it provides a benchmark for plots of residuals and the panoply of statistical diagnostics, and this helps to detect poor model fit. When there is no parametric model, F is estimated by the EDF, and the benchmark is swept away because the data and the model are one and the same. It is then vital to look closely at the simulation output, in order to see whether the conclusions depend crucially on particular observations. We return to this question of sensitivity analysis in Section 3.10.
Two formal expressions are U = \zeta + o_p(1) and U = \zeta + n^{-1/2}\sigma(\zeta)Z + O_p(n^{-1}), where Z is a N(0, 1) variable. (In some cases the O_p(n^{-1}) remainder term in the second expression would be o_p(n^{-1/2}), but this would not affect the principal result of the delta method below.) The first of these corresponds to a statement of the consistency property of U, and the second amplifies this to state both the rate of convergence and the normal approximation in an alternative form.
Now consider T = g(U), where g(\cdot) is a smooth function. We shall see below that provided \dot g(\zeta) \ne 0,

T \;\dot\sim\; N(\theta, n^{-1}\{\dot g(\zeta)\}^2\sigma^2(\zeta)),

where \theta = g(\zeta). A first step is the simple expansion

g(U) = g(\zeta + o_p(1)) = g(\zeta) + o_p(1).

From the latter, we can see that the normal approximation for U implies that

g(U) = g(\zeta) + n^{-1/2}\sigma(\zeta)\,\dot g(\zeta) Z + O_p(n^{-1}).

Nothing has yet been said about the bias of T, which would usually be hidden in the O_p(n^{-1}) term. If we take the larger expansion (2.30), ignore the remainder term, and take expectations, we obtain

E(T) \doteq \theta + \frac{1}{2n}\,\ddot g(\zeta)\,\sigma^2(\zeta).
These results extend quite easily to the case of vector U and vector T, as outlined in Problem 2.9. The extension includes the case where U is the set of observed frequencies f_1, ..., f_m when Y is discrete with probabilities \pi_1, ..., \pi_m on m possible values, for which there is an analogue of (2.31).
For a statistical function t(\cdot) the analogous expansion is

t(G) = t(F) + \int L_t(y; F)\, dG(y), \qquad (2.33)
with H_y(u) = H(u - y) the Heaviside or unit step function jumping from 0 to 1 at u = y. In this form the derivative satisfies \int L_t(y; F)\, dF(y) = 0, as seen on setting G = F in (2.33). Often the function L_t(y) = L_t(y; F) is called the influence function of T and its empirical approximation l(y) = L_t(y; \hat F) is called the empirical influence function. The particular values l_j = l(y_j) are called the empirical influence values.
because \int L_t(y; F)\, dF(y) = 0, and correspondingly

\int L_t(y; \hat F)\, d\hat F(y) = n^{-1}\sum_j l_j = 0.

In practice the empirical influence function can be approximated by numerical differentiation,

l(y) \doteq \varepsilon^{-1}\left[t\{(1-\varepsilon)\hat F + \varepsilon H_y\} - t\right], \qquad (2.37)

with a small value of \varepsilon such as (100n)^{-1}. The same method can be used for the empirical influence values l_j = L_t(y_j; \hat F). Alternative approximations to the empirical influence values l_j, which are all that are needed in (2.36), are described in the following sections.
is the unbiased sample variance of the y_j. This differs by the factor (n - 1)/n from the more usual nonparametric variance estimate for \bar y. ■
When t(F) = g\{\beta_1(F), ..., \beta_k(F)\} is a smooth function of simpler statistical functions, the chain rule gives

L_t(y; F) = \sum_{i=1}^{k} \frac{\partial g}{\partial \beta_i}\, L_{\beta_i}(y; F). \qquad (2.38)

This can also be used to find the influence function for a transformed statistic, given the influence function for the statistic itself.
Example 2.18 (Correlation) The sample correlation is the sample version of the product moment correlation, which for the pair Y = (U, X) can be defined in terms of the moments \mu_{rs} = E(U^r X^s) by

\rho = \frac{\mu_{11} - \mu_{10}\mu_{01}}{\{(\mu_{20} - \mu_{10}^2)(\mu_{02} - \mu_{01}^2)\}^{1/2}}.

For the p quantile q_p = q_p(F) = F^{-1}(p), the influence function is

L_{q_p}(y; F) = \frac{p - H(q_p - y)}{f(q_p)}.
Evidently this has m ean zero.
The approximate variance of q_p(\hat F) is

v_L(F) = n^{-1}\int L_{q_p}^2(y; F)\, dF(y) = \frac{p(1-p)}{n f^2(q_p)},

the empirical version of which requires an estimate of f(q_p). But since nonparametric density estimates converge much more slowly than estimates of means, variances, and so forth, estimation of variance for quantile estimates is harder and requires much larger samples. ■
Example 2.20 (City population data) For the ratio estimate t = \bar x/\bar u, calculations in Problem 2.16 lead to empirical influence values l_j = (x_j - t u_j)/\bar u. Numerical values for the city population data of size 10 are given in Table 2.5; the regression estimates are discussed in Example 2.23. The variance estimate is v_L = n^{-2}\sum_j l_j^2 = 0.0325.
The l_j are plotted in the left panel of Figure 2.11. Values of y_j = (u_j, x_j) close to the line x = tu have little effect on the ratio t. Changing the data by giving more weight to those y_j with negative influence values, for which (u_j, x_j) lies below the line, would result in smaller values of t than that actually observed, and conversely. We discuss the right panels in Example 2.23. ■
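For this example the influence values and v_L can be computed directly; a minimal sketch using the city data from the boot library is:

library(boot)
n <- nrow(city)
t0 <- sum(city$x)/sum(city$u)              # ratio estimate t = xbar/ubar
lj <- (city$x - t0 * city$u)/mean(city$u)  # l_j = (x_j - t u_j)/ubar
vL <- sum(lj^2)/n^2                        # nonparametric delta method variance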
For a statistic t determined by an estimating function c(y; θ), through ∫ c(y; θ) dF(y) = 0, the influence function is

L_t(y; F) = c(y, θ) / E{−ċ(Y, θ)},

where ċ = ∂c/∂θ. The corresponding empirical influence values are therefore

l_j = n c(y_j, t) / { −Σ_k ċ(y_k, t) },

and the nonparametric delta method variance estimate is

v_L = Σ_j {c(y_j, t)}² / { Σ_j ċ(y_j, t) }².

A simple illustration is Example 2.20, where t is determined by the estimating function c(y, θ) = x − θu.
For some purposes it is useful to go beyond the first derivative term in the expansion of t(F̂) and obtain the quadratic approximation

t(G) ≈ t(F) + ∫ L_t(y; F) dG(y) + ½ ∫∫ Q_t(x, y; F) dG(x) dG(y),  (2.41)

where Q_t(x, y; F) is a second derivative of t analogous to L_t(y; F).

A related set of approximations to the empirical influence values is provided by the jackknife, which uses the case-deletion values

l_jack,j = (n − 1)(t − t_{−j}),  (2.42)

where t_{−j} is the estimate calculated with y_j omitted from the data. In effect this corresponds to numerical approximation (2.37) using ε = −(n − 1)^{-1}; see Problem 2.18.
Example 2.21 (Average) For the sample average t = ȳ, the case-deletion values are t_{−j} = (nȳ − y_j)/(n − 1), and so l_jack,j = y_j − ȳ. This is the same as the empirical influence value because t is linear. The variance approximation in (2.43) reduces to {n(n − 1)}^{-1} Σ(y_j − ȳ)² because b_jack = 0; the denominator n − 1 in the formula for v_jack was chosen to ensure that this happens. ■
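The jackknife values (2.42) require only repeated evaluation of the statistic with one case left out; a minimal sketch (the function name is ours) is:

# Case-deletion values and jackknife influence values (2.42).
ljack <- function(data, t.fun) {
  n <- nrow(data)
  t0 <- t.fun(data)
  tmj <- sapply(1:n, function(j) t.fun(data[-j, , drop = FALSE]))
  (n - 1) * (t0 - tmj)
}
# e.g. ljack(city, function(d) sum(d$x)/sum(d$u))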
t* ≈ t + n^{-1} Σ_{j=1}^{n} f_j* l_j,  (2.44)

say, where f_j* is the number of times that y_j appears in the bootstrap sample, that is the number of times that y_k* equals y_j, for j = 1, ..., n. The linear approximation (2.44) will be used several times in future chapters.

Under the nonparametric bootstrap the joint distribution of the f_j* is multinomial (Problem 2.19). It is easy to see that var*(T*) ≈ n^{-2} Σ l_j² = v_L, showing
that the bootstrap estimate of variance should be similar to the nonparametric delta method approximation.
Example 2.22 (City population data) The right panels of Figure 2.11 show how 999 resampled values of t* depend on n^{-1} f_j* for four values of j, for the data with n = 10. The lines with slope l_j summarize fairly well how t* depends on f_j*, but the correspondence is not ideal.

A different way to see this is to plot t* against the corresponding t_L*. Figure 2.12 shows this for 499 replicates. The line shows where the values for an exactly linear statistic would fall. The linear approximation is poor for n = 10, but it is more accurate for the full dataset, where n = 49. In Section 3.10 we outline how such plots may be used to find a suitable scale on which to set confidence limits. ■
Expression (2.44) suggests a way to approximate the l_j s using the results of a bootstrap simulation. Suppose that we have simulated R samples from F̂ as described in Section 2.3. Define f_rj* to be the frequency with which the data value y_j occurs in the rth bootstrap sample. Then (2.44) implies that

t_r* ≈ t + n^{-1} Σ_{j=1}^{n} f_rj* l_j,  r = 1, ..., R.

So the vector l̂ = (l̂_1, ..., l̂_n) of approximate values of the l_j is obtained with the least-squares regression formula

l̂ = (F*ᵀF*)^{-1} F*ᵀ d*,  (2.45)

where F* is the R × (n − 1) matrix with (r, j) element n^{-1} f_rj* (one column is omitted because the frequencies sum to n, the remaining value being fixed by Σ l̂_j = 0), and the rth row of the R × 1 vector d* is t_r* − t̄*. In fact (2.45) is related to an alternative, orthogonal expansion of T in which the "remainder" term is uncorrelated with the "linear" piece.
The several different versions of influence produce different estimates of var(T). In general v_L is an underestimate, whereas use of the jackknife values or the regression estimates of the l s will typically produce an overestimate. We illustrate this in Section 2.7.5.
Example 2.23 (City population data) For the previous example of the ratio estimator, Table 2.5 gives regression estimates of empirical influence values, obtained from R = 1000 samples. The exact estimate v_L for var(T) is 0.036, compared to the value 0.043 obtained from the regression estimates. The bootstrap variance is 0.042. For n = 49 the corresponding values are 0.00119, 0.00125 and 0.00125.

Our experience is that R must be in the hundreds to give a good regression approximation to the empirical influence values. ■
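In the boot library the regression estimates can be obtained from bootstrap output with empinf; a short sketch for the ratio example is:

library(boot)
city.fun <- function(d, i) sum(d$x[i])/sum(d$u[i])
city.boot <- boot(city, city.fun, R = 1000)
f <- boot.array(city.boot)                # R x n matrix of frequencies f*_{rj}
l.reg <- empinf(city.boot, type = "reg")  # regression estimates of the l_j
var.linear(l.reg)                         # corresponding v_L = n^{-2} sum l_j^2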
For studentized bootstrap calculations, each resample requires its own variance estimate. One possibility is

v_{L,r}* = n^{-2} Σ_{j=1}^{n} l_rj*²,  (2.48)

where the l_rj* are the empirical influence values computed by treating the rth resample as the data. A simpler approximation is

v_{L,r}* ≈ n^{-2} Σ_{j=1}^{n} f_rj* ( l_j − n^{-1} Σ_k f_rk* l_k )²,  (2.49)

which is exact for a linear statistic. In effect this uses the usual formula, with l_j replaced by L_t(y_rj*; F̂) − n^{-1} Σ_k L_t(y_rk*; F̂) in the rth resample. However, the right-hand side of (2.49) can badly underestimate v_{L,r}* if the statistic is not close to linear. An improved approximation is outlined in Problem 2.20.
Example 2.24 (City population data) Figure 2.13 compares the variance approximations for n = 10. The top left panel shows v*, computed by a further level of resampling with M = 50 samples, plotted against the values v_L* of (2.48), for R = 200 bootstrap samples. The top right panel shows the values of the approximate variance on the right of (2.49), also plotted against v_L*. The lower panels show Q-Q plots of the corresponding z* values, with (t* − t)/v_L*^{1/2} on the horizontal axis. Plainly v_L* underestimates v*, though not so severely as to have a big effect on the studentized bootstrap statistic. But the right of (2.49) underestimates v_L* to an extent that greatly changes the distribution of the corresponding studentized bootstrap statistics.
The right-hand panels of the corresponding plots for the full data show more nearly linear relationships, so it appears that (2.49) is a better approximation at sample size n = 49. In practice the sample size cannot be increased, and it is necessary to seek a transformation of t to attain approximate linearity. The transformation outlined in Example 3.25 greatly increases the accuracy of (2.49), even with n = 10. ■
We briefly review three such methods here. The first two are in principle superior to resampling for certain applications, although their competitive merits in practice are largely untested. The third method provides an alternative to the nonparametric delta method for variance approximation.
For subsets S of size n − d one considers quantities of the form

z_S† = (n − d)^a (t_S − t),  (2.50)

where t_S denotes the estimate computed from subset S. Hence confidence intervals for μ can be determined. In practice one would take a random selection of R such subsets, and attach equal probability (R + 1)^{-1} to the R + 1 intervals defined by the R z† values. It is unclear how efficient this method is, and to what extent it can be generalized to other estimation problems.
R^{-1} Σ_{r=1}^{R} [ ¼ Σ_{i=1}^{m} w_i² (y_i1 − y_i2)² + ¼ Σ_{i=1}^{m} Σ_{j≠i} c_ri† c_rj† w_i w_j (y_i1 − y_i2)(y_j1 − y_j2) ]

equals

¼ Σ_{i=1}^{m} w_i² (y_i1 − y_i2)².

For this to hold for all data values we must have R^{-1} Σ_r c_ri† c_rj† = 0 for all i ≠ j.
This is a standard problem arising in factorial design, and is solved by what are known as Plackett-Burman designs. If the rth half-sample coefficients c_ri† form the rth row of the R × m matrix C†, and if every observation occurs in exactly ½R half-samples, then C†ᵀC† = R I_{m×m}. In general the ith column of C† can be expressed as (c_1i, ..., c_{R−1,i}, −1)ᵀ, with the first R − 1 elements obtained by i − 1 cyclic shifts of c_11, ..., c_{R−1,1}. For example, one solution for m = 7 with R = 8 is
C† =
( +1 +1 −1 −1 +1 −1 +1 )
( +1 +1 +1 −1 −1 +1 −1 )
( −1 +1 +1 +1 −1 −1 +1 )
( +1 −1 +1 +1 +1 −1 −1 )
( −1 +1 −1 +1 +1 +1 −1 )
( −1 −1 +1 −1 +1 +1 +1 )
( +1 −1 −1 +1 −1 +1 +1 )
( −1 −1 −1 −1 −1 −1 −1 )
This solution requires that R be the first multiple of 4 greater than or equal to m. The half-sample designs for m = 4, 5, 6, 7 are given by the first m columns of this C† matrix.

In practice it would be common to double the half-sampling design by adding its complement −C†, which adds further balance.
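The design above is easily constructed and checked by computer; in the sketch below the columns are built by cyclic shifts of the first column, as described, and the balance condition C†ᵀC† = R I is verified:

# Plackett-Burman half-sampling design for m = 7, R = 8.
g <- c(1, 1, -1, 1, -1, -1, 1)   # first seven entries of column 1
C <- sapply(0:6, function(s) c(g[((seq_len(7) - s - 1) %% 7) + 1], -1))
crossprod(C)    # equals 8 * diag(7), so the design is balanced
colSums(C == 1) # each observation falls in R/2 = 4 half-samples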
It is fairly clear that the half-sampling method extends to stratum sample sizes k larger than 2. The basic idea can be seen clearly for linear statistics of the form

t = μ + Σ_{i=1}^{m} k^{-1} Σ_{j=1}^{k} a(y_ij) = μ + Σ_{i=1}^{m} Σ_{j=1}^{k} a_ij,

say. Suppose that in the rth subsample we take one observation from each stratum, as specified by the zero-one indicators c_rij†. Then the corresponding subsample values t_r† can be calculated, and from them the usual estimate of var(T). The choice of c† values corresponds to selection of a fractional factorial design, with only main effects to be calculated, and this is solved by a Plackett-Burman design. Once the subsampling design is obtained, the estimate of var(T) is a formula in the subsample values t_r†. The same formula works for any statistic that is approximately linear.

The same principles apply for unequal stratum sizes, although then the solution is more complicated and makes use of orthogonal arrays.
statistic for confidence intervals and significance tests, and makes the connection to Edgeworth expansions for smooth statistics. The empirical choice of scale for resampling calculations is discussed by Chapman and Hinkley (1986) and Tibshirani (1988).

Hall (1986) analyses the effect of discreteness on confidence intervals. Efron (1987) discusses the numbers of simulations needed for bias and quantile estimation, while Diaconis and Holmes (1994) describe how simulation can be avoided completely by complete enumeration of bootstrap samples; see also the bibliographic notes for Chapter 9.

Bickel and Freedman (1981) were among the first to discuss the conditions under which the bootstrap is consistent. Their work was followed by Bretagnolle (1983) and others, and there is a growing theoretical literature on modifications to ensure that the bootstrap is consistent for different classes of awkward statistics. The main modifications are smoothing of the data (Section 3.4), which can improve matters for nonsmooth statistics such as quantiles (De Angelis and Young, 1992), subsampling (Politis and Romano, 1994b), and reweighting (Barbe and Bertail, 1995). Hall (1992a) is a key reference to Edgeworth expansion theory for the bootstrap, while Mammen (1992) describes simulations intended to help show when the bootstrap works, and gives theoretical results for various situations. Shao and Tu (1995) give an extensive theoretical overview of the bootstrap and jackknife.

Athreya (1987) has shown that the bootstrap can fail for long-tailed distributions. Some other examples of failure are discussed by Bickel, Götze and van Zwet (1996).

The use of linear approximations and influence functions in the context of robust statistical inference is discussed by Hampel et al. (1986). Fernholtz (1983) describes the expansion theory that underlies the use of these approximation methods. An alternative and orthogonal expansion, similar to that used in Section 2.7.4, is discussed by Efron and Stein (1981) and Efron (1982). Tail-specific approximations are described by Hesterberg (1995a).

The use of multiple-deletion jackknife methods is discussed by Hinkley (1977), Shao and Wu (1989), Wu (1990), and Politis and Romano (1994b), the last with numerous theoretical examples. The method based on all non-empty subsamples is due to Hartigan (1969), and is nicely put into context in Chapter 9 of Efron (1982). Half-sample methods for survey sampling were developed by McCarthy (1969) and extended by Wu (1991). The relevant factorial designs for half-sampling were developed by Plackett and Burman (1946).
2.10 Problems
1 Let F̂ denote the EDF (2.1). Show that E{F̂(y)} = F(y) and that var{F̂(y)} = F(y){1 − F(y)}/n. Hence deduce that provided 0 < F(y) < 1, F̂(y) has a limiting normal distribution for large n, and that Pr(|F̂(y) − F(y)| < ε) → 1 as n → ∞ for any positive ε. (In fact the much stronger property sup_{−∞<y<∞} |F̂(y) − F(y)| → 0 holds with probability one.)
(Section 2.1)
2 Suppose that Y_1, ..., Y_n are independent exponential with mean μ; their average is Ȳ = n^{-1} Σ Y_j.
(a) Show that Ȳ has the gamma density (1.1) with κ = n, so its mean and variance are μ and μ²/n.
(b) Show that log Ȳ is approximately normal with mean log μ and variance n^{-1}.
(c) Compare the normal approximations for Ȳ and for log Ȳ in calculating 95% confidence intervals for μ. Use the exact confidence interval based on (a) as the baseline for the comparison, which can be illustrated with the data of Example 1.1.
(Sections 2.1, 2.5.1)
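A sketch for part (c), comparing the two normal approximations with the exact gamma-based interval (the function is ours; the air-conditioning data of Example 1.1 give a possible illustration):

# 95% intervals for mu from exponential data: normal approximations for
# Ybar and log Ybar, and the exact interval from n * Ybar / mu ~ gamma(n).
ci.compare <- function(y, conf = 0.95) {
  n <- length(y); ybar <- mean(y)
  z <- qnorm(1 - (1 - conf)/2)
  rbind(normal  = ybar + c(-1, 1) * z * ybar/sqrt(n),
        lognorm = ybar * exp(c(-1, 1) * z/sqrt(n)),
        exact   = n * ybar / qgamma(c(1 - (1 - conf)/2, (1 - conf)/2), n))
}
# e.g. ci.compare(aircondit$hours)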
where m_4 = n^{-1} Σ (Y_j − Ȳ)^4.
(Section 2.3; Appendix A)
This specifies the exact resampling density (2.28) of the sample median. (The result can be used to prove that the bootstrap estimate of var(T) is consistent as n → ∞.)
(c) Use the resampling distribution to show that for n = 11 the 0.05 and 0.95 quantiles of the resampled median are y_(3) and y_(9), and apply (2.10) to deduce that the basic bootstrap 90% confidence interval for the population median θ is (2y_(6) − y_(9), 2y_(6) − y_(3)).
(d) Examine the coverage of the confidence interval in (c) for samples from normal and Cauchy distributions.
(Sections 2.3, 2.4; Efron, 1979, 1982)
Show that the simulation variance of the normal-approximation quantile estimate t̄* − t + z_p V_R^{1/2} is approximately

R^{-1} v { 1 + z_p κ_3/v^{3/2} + ¼ z_p² (2 + κ_4/v²) },

where κ_3 and κ_4 are the third and fourth cumulants of T* under bootstrap resampling. If T is asymptotically normal, κ_3/v^{3/2} = O(n^{-1/2}) and κ_4/v² = O(n^{-1}). Compare this variance to that of the bootstrap quantile estimate t*_((R+1)p) − t in the special case T = Ȳ.
(Sections 2.2.1, 2.5.2; Appendix A)
8 Suppose that estimator T has expectation equal to θ(1 + γ), so that the bias is θγ. The bias factor γ can be estimated by C = E*(T*)/T − 1. Show that in the case of the variance estimate T = n^{-1} Σ (Y_j − Ȳ)², C is exactly equal to γ. If C were approximated from R resamples, what would be the simulation variance of the approximation?
(Section 2.5)
9 Suppose that the random variables U = (U_1, ..., U_m) have means ζ_1, ..., ζ_m and covariances cov(U_k, U_l) = n^{-1} ω_kl(ζ), and that T_1 = g_1(U), ..., T_q = g_q(U). Show that

cov(T_i, T_j) ≈ n^{-1} Σ_{k,l} (∂g_i/∂ζ_k)(∂g_j/∂ζ_l) ω_kl(ζ),

and hence that

v_L = (nū)^{-2} Σ_{j=1}^{n} (x_j − t u_j)²

is a variance estimate for t = x̄/ū, based on independent pairs (u_1, x_1), ..., (u_n, x_n).
(Section 2.7.1)
10 (a) Show that the influence function for a linear statistic t(F) = ∫ a(x) dF(x) is a(y) − t(F). Hence obtain the influence functions for a sample moment μ̂_r = ∫ x^r dF̂(x), for the variance μ_2(F) − {μ_1(F)}², and for the correlation coefficient (Example 2.18).
(b) Show that the influence function for {t(F) − θ}/v(F)^{1/2} evaluated at θ = t(F) is v(F)^{-1/2} L_t(y; F). Hence obtain the empirical influence values l_j for the studentized quantity {t(F̂) − t(F)}/v_L(F̂)^{1/2}, and show that they have the properties Σ l_j = 0 and n^{-2} Σ l_j² = 1.
(Section 2.7.2; Hinkley and Wei, 1984)
(a) Suppose that θ = t(F) is defined by the estimating equation ∫ u(y; θ) dF(y) = 0, that is

∫ u{y; t(F)} dF(y) = 0.

Replace F by (1 − ε)F + εH_y, and differentiate with respect to ε to show that the influence function for t(·) is

L_t(y; F) = u{y; t(F)} / [ −∫ u̇{x; t(F)} dF(x) ],

where u̇(x; θ) = ∂u(x; θ)/∂θ. Hence show that with θ̂ = t(F̂) the jth empirical influence value is

l_j = u(y_j; θ̂) / { −n^{-1} Σ_{k=1}^{n} u̇(y_k; θ̂) }.

(b) Let ψ̂ be the maximum likelihood estimator of the (possibly vector) parameter of a regular parametric model f_ψ(y) based on a random sample y_1, ..., y_n. Show that the jth empirical influence value for ψ̂ at y_j may be written as n Î^{-1} s_j, where Î is the total observed information and s_j the score vector for the jth case, both evaluated at ψ̂.
The α-trimmed mean is

t(F) = (1 − 2α)^{-1} ∫_α^{1−α} q_p(F) dp,

computed at the EDF F̂. Express t(F̂) in terms of order statistics, assuming that nα is an integer. How would you extend this to deal with non-integer values of nα?
Suppose that F is a distribution symmetric about its mean, μ. By rewriting t(F) as

t(F) = (1 − 2α)^{-1} ∫_{q_α(F)}^{q_{1−α}(F)} u dF(u),

where q_α(F) is the α quantile of F, use the result of Example 2.19 to show that the influence function of t(F) is

L_t(y; F) = {q_α(F) − μ}(1 − 2α)^{-1},  y < q_α(F),
           (y − μ)(1 − 2α)^{-1},        q_α(F) ≤ y ≤ q_{1−α}(F),
           {q_{1−α}(F) − μ}(1 − 2α)^{-1},  q_{1−α}(F) < y.

Evaluate this at F = F̂.
(Section 2.7.2)
In the weighted representation t(p) of the statistic, with p̂ = (n^{-1}, ..., n^{-1}), the empirical influence values are

l_j = (d/dε) t{(1 − ε)p̂ + ε 1_j} |_{ε=0},

where 1_j is the vector with 1 in the jth position and 0 elsewhere. Hence or otherwise show that

l_j = ṫ_j(p̂) − n^{-1} Σ_{k=1}^{n} ṫ_k(p̂),

where ṫ_j denotes ∂t/∂p_j. Similarly, the second derivatives are

q_ij = (∂²/∂ε_1 ∂ε_2) t{(1 − ε_1 − ε_2)p̂ + ε_1 1_i + ε_2 1_j} |_{ε_1=ε_2=0}.

Hence deduce that

q_ij = ẗ_ij(p̂) − n^{-1} Σ_{k=1}^{n} ẗ_ik(p̂) − n^{-1} Σ_{k=1}^{n} ẗ_jk(p̂) + n^{-2} Σ_{k,l=1}^{n} ẗ_kl(p̂).

(Section 2.7.2)
For the sample median with n = 2m, show that the jackknife variance estimate is

v_jack = ¼ (n − 1) ( y_(m+1) − y_(m) )²,

and hence that

n v_jack → ( ½ χ_2² )² / { 4 f(μ)² }

in distribution as n → ∞. This confirms that the jackknife variance estimate is not consistent.
(Section 2.7.3)
Show that in (b) and (c) the squared distance (dF̂ − dF̂_ε)ᵀ(dF̂ − dF̂_ε) from F̂ to F̂_ε = (1 − ε)F̂ + εH_{y_j} is of order O(n^{-2}), but that if F̂* is generated by bootstrap sampling, E*{ (dF̂* − dF̂)ᵀ(dF̂* − dF̂) } = O(n^{-1}). Hence discuss the results you would expect from the butcher knife, which uses ε = n^{-1/2}. How would you calculate it?
(Section 2.7.3; Efron, 1982; Hesterberg, 1995a)
19 The cumulant generating function of the bootstrap frequencies is

K(ξ) = n log{ n^{-1} Σ_{j=1}^{n} exp(ξ_j) },

where ξ = (ξ_1, ..., ξ_n).
(a) Show that the first four cumulants of the f_j* are

E*(f_i*) = 1,
cov*(f_i*, f_j*) = δ_ij − n^{-1},
cum*(f_i*, f_j*, f_k*) = n^{-2}{ n²δ_ijk − nδ_jk[3] + 2 },
cum*(f_i*, f_j*, f_k*, f_l*) = n^{-3}{ n³δ_ijkl − n²(δ_ij δ_kl[3] + δ_jkl[4]) + 2nδ_kl[6] − 6 },

where δ_ij = 1 when i = j and zero otherwise, and so on, and δ_jk[3] = δ_ij + δ_ik + δ_jk, and so forth.
(b) Now consider t_Q* = t + n^{-1} Σ f_j* l_j + ½ n^{-2} ΣΣ f_j* f_k* q_jk. Show that E*(t_Q*) = t + ½ n^{-2} Σ q_jj and that t_Q* has variance

n^{-2} Σ_j l_j² + n^{-3} { Σ_j l_j q_jj − ¼ n^{-2} (Σ_j q_jj)² + ½ n^{-1} Σ_j Σ_k q_jk² }.  (2.51)

(Section 2.7.2; Appendix A; Davison, Hinkley and Schechtman, 1986; McCullagh, 1987)
20 Show that the difference between the second derivative Q_t(x, y) and the first derivative of L_t(x) is equal to L_t(y). Hence show that the empirical influence value can be written as

l_j = L_t(y_j) + n^{-1} Σ_{k=1}^{n} { Q_t(y_j, y_k) − L_t(y_k) }.

Use the resampling version of this result to discuss the accuracy of approximation (2.49) for v_L*.
(Sections 2.7.2, 2.7.5)
2.11 Practicals
1 Consider parametric simulation to estimate the distribution of the ratio when a bivariate lognormal distribution is fitted to the data in Table 2.1:
m1 <- mean(log(city$u)); m2 <- mean(log(city$x))
s1 <- sqrt(var(log(city$u))); s2 <- sqrt(var(log(city$x)))
rho <- cor(log(city))[1,2]
city.mle <- c(m1, m2, s1, s2, rho)
Are histograms of t* and z* similar to those for nonparametric simulation, shown in Figure 2.5? Use (2.10) and (2.12) to give 95% confidence intervals for the true ratio under this model.
2 For the carbon monoxide transfer data co.transfer:
attach(co.transfer)
plot(0.5*(entry+week), week-entry)
t.test(week-entry)
Compare the variance of the bootstrap estimate t* with the estimated variance of t, in co.boot$t0[2]. Compare normal-based and studentized bootstrap 95% confidence intervals.
To display the bootstrap output:
split.screen(c(1,2))
screen(1); split.screen(c(2,1))
screen(3); qqnorm(co.boot$t[,1], ylab="t*", pch=".")
abline(co.boot$t0[1], sqrt(co.boot$t0[2]), lty=2)
screen(2)
plot(co.boot$t[,1], sqrt(co.boot$t[,2]), xlab="t*", ylab="SE*", pch=".")
screen(4); z <- (co.boot$t[,1] - co.boot$t0[1])/sqrt(co.boot$t[,2])
qqnorm(z); abline(0, 1, lty=2)
What is going on here? Is the normal interval useful? What difference does dropping the simulation outliers make to the studentized bootstrap confidence interval?
(Sections 2.3, 2.4; Hand et al., 1994, p. 228)
3 For bootstrap output cd4.boot for the correlation of the cd4 data:
t0 <- cd4.boot$t0[1]
tstar <- cd4.boot$t[,1]
vL <- cd4.boot$t[,2]
zstar <- (tstar-t0)/sqrt(vL)
fisher <- function(r) 0.5*log((1+r)/(1-r))
split.screen(c(1,2))
screen(1); plot(tstar, vL)
screen(2); plot(fisher(tstar), vL/(1-tstar^2)^2)
What are these on the correlation scale? How do they compare to intervals obtained without the transformation?
If there are simulation outliers, delete them and recalculate the intervals.
(Sections 2.3, 2.4, 2.5; DiCiccio and Efron, 1996)
4 How many simulations are required for quantile estimation? To get some idea, we make four replicate plots with 39, 99, 399 and 999 simulations.
split.screen(c(4,4))
quantiles <- matrix(NA, 16, 4)
n <- c(39, 99, 399, 999)
p <- c(0.025, 0.05, 0.95, 0.975)
for (i in 1:4)
{ y <- rnorm(999)
  for (j in 1:4) {
    quantiles[(j-1)*4+i,] <- quantile(y[1:n[j]], probs=p)
    screen((i-1)*4+j)
    qqnorm(y[1:n[j]], ylab="y", main=paste("R =", n[j]))
    abline(h=quantile(y[1:n[j]], p), lty=2) } }
Repeat the loop a few times. How large a simulation is required to get reasonable estimates of the 0.05 and 0.95 quantiles? Of the 0.025 and 0.975 quantiles?
(Section 2.5.2)
close.screen(all=T); plot(tstar, linear.approx(cd4.boot, L.reg))
Find the correlation between t* and its linear approximation. Make the corresponding plots for the other empirical influence values. Are the plots better on the transformed scale?
(Section 2.7)
3
Further Ideas
3.1 Introduction
In the previous chapter we laid out the basic elements of resampling or bootstrap methods, in the context of the analysis of a single homogeneous sample of data. This chapter deals with how those ideas are extended to some more complex situations, and then turns to uses for variations and elaborations of simple bootstrap schemes.

In Section 3.2 we describe how to construct resampling algorithms for several independent samples, and then in Section 3.3 we discuss briefly the use of partial modelling, either qualitative or semiparametric, a topic explored more fully in the later chapters on regression models (Chapters 6 and 7). Section 3.4 examines when it is worthwhile to modify the statistic by using a smoothed empirical distribution function. In Sections 3.5 and 3.6 we turn to situations where data are censored or missing and therefore are incomplete. One relatively simple situation where the standard bootstrap must be modified to succeed is finite population sampling, which we consider in Section 3.7. In Section 3.8 we deal with simple situations of hierarchical variation. Section 3.9 is an account of nested bootstrapping, where we outline how to overcome some of the shortcomings of a single bootstrap calculation by a further level of simulation. Section 3.10 describes bootstrap diagnostics, which are concerned with the assessment of sensitivity of resampling analysis to individual observations, as well as the use of bootstrap output to suggest modifications to the calculations. Finally, Section 3.11 describes the use of nested bootstrapping in selecting an estimator from the data.
3.2 Several Samples
Since each of the k populations is separate, nonparametric simulation from their respective EDFs F̂_1, ..., F̂_k leads to datasets with the same layout as the original data. This differs slightly from the delta method variance approximation, which we describe in Section 3.2.1.

A simulated value of T would be t* = ȳ_1* − ȳ_2*, where ȳ_1* is the average of n_1 observations generated with equal probability from the first sample, y_11, ..., y_1n₁, and ȳ_2* is the average of n_2 observations generated with equal probability from the second sample, y_21, ..., y_2n₂.
Example 3.2 (Gravity data) Between May 1934 and July 1935, a series of experiments to determine the acceleration due to gravity, g, was performed at the National Bureau of Standards in Washington DC. The experiments, made with a reversible pendulum, led to eight successive series of measurements. The data are given in Table 3.1. Figure 3.1 suggests that the variance decreases from one series to the next, that there is a possible change in location, and that mild outliers may be present.

The measurements for the later series seem more reliable, and although we would wish to estimate g from all the data, it seems inappropriate to pool the series. We suppose that each of the series is taken from a separate population, F_1, ..., F_8, but that each population has mean g; for a check on this see Example 4.14. Then the appropriate form of estimator is a weighted combination

T = [ Σ_{i=1}^{8} μ(F̂_i)/σ²(F̂_i) ] / [ Σ_{i=1}^{8} 1/σ²(F̂_i) ],

where F̂_i is the EDF of the ith series, μ(F̂_i) is an estimate of g from F̂_i, and
σ²(F̂_i) is an estimated variance for μ(F̂_i). The estimated variance of T is

v = { Σ_{i=1}^{8} 1/σ²(F̂_i) }^{-1}.

If the data were thought to be normally distributed with mean g but different variances, we would take μ(F̂_i) = ȳ_i and σ²(F̂_i) to be the average of the ith series and its estimated variance. The resulting estimator T is then an empirical version of the optimal weighted average. For our data t = 78.54 with standard error v^{1/2} = 0.59.
Figure 3.2 shows summary plots for R = 999 nonparametric simulations from this model. The top panels show normal plots for the replicates t* and for the corresponding studentized bootstrap statistics z* = (t* − t)/v*^{1/2}. Both are more dispersed than normal. There is one large negative value of z*, and the lower panels show why: on the left we see that the v* for the smallest value of t* is very small, which inflates the corresponding z*. We would certainly omit this value on the grounds that it is a simulation outlier.

The average and variance of the t* are 78.51 and 0.371, so the bias estimate for t is 78.51 − 78.54 = −0.03, and a 95% confidence interval for g based on a normal approximation is (77.37, 79.76). The 0.025 × (R + 1) and 0.975 × (R + 1) order statistics of the z* are −3.03 and 2.50, so the 95% studentized bootstrap confidence interval for g is (77.07, 80.32), slightly wider than that based on the normal approximation, as the top right panel of Figure 3.2 would suggest.
Apart from the resampling algorithm, this mimics exactly the studentized bootstrap procedure described in Section 2.4. ■
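The stratified resampling used here is conveniently done with the strata argument of the boot library; the sketch below (our coding of the weighted estimator, as an illustration) returns t* and v* for each resample:

library(boot)
grav.fun <- function(data, i) {
  d <- data[i, ]
  m <- tapply(d$g, d$series, mean)                          # series averages
  v <- tapply(d$g, d$series, function(y) var(y)/length(y))  # their variances
  c(sum(m/v)/sum(1/v), 1/sum(1/v))                          # t and v
}
grav.boot <- boot(gravity, grav.fun, R = 999, strata = gravity$series)
z <- (grav.boot$t[, 1] - grav.boot$t0[1])/sqrt(grav.boot$t[, 2])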
L_{t,i}(y; F) = ∂ t{F_1, ..., (1 − ε)F_i + εH_y, ..., F_k} / ∂ε |_{ε=0},  (3.2)
and for brevity we write F = (F_1, ..., F_k). As in the single sample case, the influence functions have mean zero, E{L_{t,i}(Y; F)} = 0 for each i. Then the immediate consequence of (3.1) is the nonparametric delta method approximation

T − θ ≈ N(0, v_L),

for large n_1, ..., n_k, where the variance approximation v_L is given by the variance of the second term on the right-hand side of (3.1), that is

v_L = Σ_{i=1}^{k} n_i^{-1} var{ L_{t,i}(Y; F) | F_i }.  (3.3)

By analogy with the single sample case, empirical influence values are obtained by substituting the EDFs F̂ = (F̂_1, ..., F̂_k) for the CDFs F in (3.2) to give

l_ij = L_{t,i}(y_ij; F̂).
just as in Example 2.17. Similarly L_{t,2}(y; F) = −(y − μ_2). In this case the linear approximation (3.1) is exact. The variance approximation formula (3.3) gives

v_L = n_1^{-1} var(Y_1) + n_2^{-1} var(Y_2),

whose empirical version is

v_L = n_1^{-2} Σ_{j=1}^{n_1} (y_1j − ȳ_1)² + n_2^{-2} Σ_{j=1}^{n_2} (y_2j − ȳ_2)².
As usual this differs slightly from the unbiased variance approximation. Note that if we could assume that the two population variances were equal, then it would be appropriate to replace v_L by the pooled version

(n_1^{-1} + n_2^{-1}) (n_1 + n_2)^{-1} Σ_{i=1}^{2} Σ_{j=1}^{n_i} (y_ij − ȳ_i)².
The various comments made about calculation in Section 2.7 apply here with obvious modifications. Thus the empirical influence values can be approximated accurately by numerical differentiation, which here means

l_ij ≈ [ t{F̂_1, ..., (1 − ε)F̂_i + εH_{y_ij}, ..., F̂_k} − t ] / ε

for small ε. We can also use the generalization of (2.44), namely

t* ≈ t + Σ_{i=1}^{k} n_i^{-1} Σ_{j=1}^{n_i} f_ij* l_ij,  (3.5)

or the case-deletion values l_jack,ij = (n_i − 1)(t − t_{−ij}), where t_{−ij} is the estimate obtained by omitting the jth case in the ith sample. Then

v_jack = Σ_{i=1}^{k} {n_i(n_i − 1)}^{-1} Σ_{j=1}^{n_i} ( l_jack,ij − l̄_jack,i )².
One can also generalize the discussion of bias approximation in Section 2.7.3. However, the extension of the quadratic approximation (2.41) is not straightforward, because there are "cross-population" terms.

The same approximation (3.1) could be used even when the samples, and hence the F̂_i s, are correlated. But this would have to be taken into account in (3.3), which as stated assumes mutual independence of the samples. In general it would be safer to incorporate dependence through the use of appropriate multivariate EDFs.
3.3 Semiparametric Models
Y_ij = μ_i + σ_i ε_ij,

where the ε_ij are sampled from a common distribution with CDF F_0, say. The normal distribution is a parametric model of this form. The form can be checked to some extent by plotting standardized residuals such as e_ij = (y_ij − μ̂_i)/σ̂_i, for appropriate estimates μ̂_i and σ̂_i, to verify homogeneity across samples. The common F_0 will be estimated by the pooled EDF of all the e_ij s, or better by the EDF of the standardized residuals e_ij/(1 − n_i^{-1})^{1/2}. The resampling algorithm will then be

Y_ij* = μ̂_i + σ̂_i ε_ij*,

where the ε_ij* s are randomly sampled from the EDF, i.e. randomly sampled with replacement from the standardized e_ij s; see Problem 3.1.
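A minimal sketch of this algorithm for k samples held in a list (our coding; for brevity it uses simple moment estimates and omits the (1 − n_i^{-1})^{1/2} rescaling of the residuals):

# Semiparametric resampling for y_ij = mu_i + sigma_i * e_ij, pooling
# standardized residuals across the k samples.
semi.boot <- function(ylist) {
  mu <- sapply(ylist, mean)
  sig <- sapply(ylist, sd)       # simple estimates of mu_i, sigma_i
  e <- unlist(mapply(function(y, m, s) (y - m)/s,
                     ylist, mu, sig, SIMPLIFY = FALSE))
  lapply(seq_along(ylist), function(i)
    mu[i] + sig[i] * sample(e, length(ylist[[i]]), replace = TRUE))
}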
In another context, with positive data such as lifetimes, it might be appropriate to think of distributions as differing only by multiplicative effects, i.e. Y_ij = μ_i ε_ij, where the ε_ij are randomly sampled from some baseline distribution with unit mean. The exponential distribution is a parametric model of this form. The principle here would be essentially the same: estimate the ε_ij by residuals such as e_ij = y_ij/μ̂_i, then define Y_ij* = μ̂_i ε_ij*, with the ε_ij* randomly sampled with replacement from the e_ij s.
Similar ideas apply in regression situations. The parametric part of the model concerns the systematic relationship between the response y and explanatory variables x, e.g. through the mean, and the nonparametric part concerns the random variation. We consider this in detail in Chapters 6 and 7.

Resampling plans such as those just outlined will give more accurate answers when their assumptions about the relationships between the F_i are correct, but they are not robust to failure of these assumptions. Some pooling of information across samples may be essential in order to avoid difficulties when the samples are small, but otherwise it is usually unnecessary.

If we widen the meaning of semiparametric to include any partial modelling, then features less tangible than parameters come into play. The following two examples illustrate this.

In both of these examples the resulting estimate will be more efficient than the EDF. This may be less important than producing a model which satisfies the practical assumptions and makes intuitive sense.
One approach is to replace the EDF by a kernel-smoothed version with density

f̂_h(y) = (nh)^{-1} Σ_{j=1}^{n} w{ (y − y_j)/h },  (3.6)

where w(·) is a continuous and symmetric PDF with mean zero and unit variance, and do calculations or simulations based on the corresponding CDF F̂_h, rather than on the EDF F̂. This corresponds to simulation by setting

Y_j* = y_{I_j} + h ε_j,

where the I_j are independent and uniformly distributed on the integers 1, ..., n and the ε_j are a random sample from w(·), independent of the I_j. This is the smoothed bootstrap. Note that h = 0 recovers the EDF.
The variance of an observation generated from (3.6) is n^{-1} Σ(y_j − ȳ)² + h², and it may be preferable for the samples to have the same variance as for the unsmoothed bootstrap. This is implemented via the shrunk smoothed bootstrap, under which h smooths between F̂ and a model in which data are generated from density w(·) centred at the mean and rescaled to have the variance of F̂; see Problem 3.8.
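Both versions are simple to simulate from; a minimal sketch with a normal kernel (our functions, following one common form of the shrinking) is:

# Smoothed bootstrap: y* = y_I + h * eps, I uniform on 1..n (normal kernel).
rsmooth <- function(m, y, h)
  y[sample(length(y), m, replace = TRUE)] + h * rnorm(m)
# Shrunk smoothed bootstrap: rescale so the variance matches that of the EDF.
rshrunk <- function(m, y, h) {
  v <- var(y) * (length(y) - 1)/length(y)   # variance of the EDF
  mean(y) + (rsmooth(m, y, h) - mean(y))/sqrt(1 + h^2/v)
}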
Having decided which smoothed bootstrap is to be used, we estimate the required property of F, a(F), by a(F̂_h) rather than a(F̂). So if T is an estimator of θ = t(F), and we intend to estimate a(F) = var(T | F) by simulation, we would obtain values t_1*, ..., t_R* calculated from samples generated from F̂_h, and then estimate a(F) by (R − 1)^{-1} Σ (t_r* − t̄*)². Notice that it is a(F), not t(F), that is estimated using smoothing.
To see when a(F̂_h) is better than a(F̂), suppose that a(F) has linear approximation (2.35). Then

a(F̂_h) − a(F) = n^{-1} Σ_{j=1}^{n} ∫ L_a(Y_j + hε; F) w(ε) dε + ···
             = n^{-1} Σ_{j=1}^{n} L_a(Y_j; F) + ½ h² n^{-1} Σ_{j=1}^{n} L_a''(Y_j; F) + ···
Example 3.8 (Sample median) Suppose that t(F̂) is the sample median, and that we wish to estimate its variance a(F). In Example 2.16 we saw that the discreteness of the median posed problems for the ordinary, unsmoothed, bootstrap. Does smoothing improve matters?

Under regularity conditions on F and h, detailed calculations show that the mean squared error of n a(F̂_h) decreases faster than the O(n^{-1/2}) rate attained in the unsmoothed case. Thus there are advantages to smoothing here, at least in large samples. Similar results hold for other quantiles.
Table 3.3 shows results of simulation experiments where 1000 samples were taken from the exponential and t distributions. For each sample smoothed and shrunk smoothed bootstraps were performed with R = 200 and several values of h. Unlike in Table 3.2, the advantage due to smoothing increases with n, and the shrunk smoothed bootstrap improves on the smoothed bootstrap, particularly at larger values of h.

As predicted by the theory, as n increases the root mean squared error decreases more rapidly for smoothed than for unsmoothed bootstraps; it decreases fastest for shrunk smoothing. For the t data the root mean squared error is not much reduced. For the exponential data smoothing was performed on the log scale, leading to reduction in root mean squared error by a factor two or so. Too large a value of h can lead to large increases in root mean squared error, but choice of h is less critical for shrunk smoothing. Overall, a small amount of shrunk smoothing seems worthwhile here, provided the data are well-behaved. But similar experiments with Cauchy data gave very poor results made worse by smoothing, so one must be sure that the data are not pathological. Furthermore, the gains in precision are not large enough to be critical, at least for these sample sizes. ■
The discussion above begs the important question of how to choose the smoothing parameter for use with a particular dataset. One possibility is to treat the problem as one of choosing among possible estimators a(F̂_h) and use the nested bootstrap, as in Example 3.26. However, the use of an estimated h is not sure to give improvement. When the rate of decrease of the optimal value of h is known, another possibility is to use subsampling, as in Example 8.6.
3.5 Censoring
3.5.1 Censored data
Censoring is present when data contain a lower or upper bound for an observation rather than the value itself. Such data often arise in medical and industrial reliability studies. In the medical context, the variable of interest might represent the time to death of a patient from a specific disease, with an indicator of whether the time recorded is exact or a lower bound due to the patient being lost to follow-up or to death from other causes.

The commonest form of censoring is right-censoring, in which case the value observed is Y = min(Y°, C), where C is a censoring value, and Y° is a non-negative failure time, which is known only if Y° < C. The data themselves are pairs (Y, D), where D is a censoring indicator, which equals one if Y° is observed and equals zero if C is observed. Interest is usually focused on the distribution F° of Y°, which is obscured if there is censoring.

The survivor function and the cumulative hazard function are central to the study of survival data. The survivor function corresponding to F°(y) is Pr(Y° > y) = 1 − F°(y), and the cumulative hazard function is A°(y) = −log{1 − F°(y)}. The cumulative hazard function may be written as ∫_0^y dA°(u), where for continuous y the hazard function dA°(y)/dy measures the instantaneous rate of failure at time y, conditional on survival to that point. A constant hazard λ leads to an exponential distribution of failure times with survivor and cumulative hazard functions exp(−λy) and λy; departures from these simple forms are often of interest.

The simplest model for censoring is random censorship, under which C is a random variable with distribution function G, independent of Y°. In this case the observed variable Y has survivor function

Pr(Y > y) = {1 − F°(y)}{1 − G(y)}.

Other forms of censoring also arise, and these are often more realistic for applications.
Suppose that the data available are a homogeneous random sample (y_1, d_1), ..., (y_n, d_n), and that censoring occurs at random. Let y_1 < ··· < y_n, so there are no tied observations. A standard estimate of the failure-time survivor function, the product-limit or Kaplan-Meier estimate, may then be written as

1 − F̂°(y) = Π_{j: y_j ≤ y} { (n − j) / (n − j + 1) }^{d_j}.  (3.9)

If there is no censoring, all the d_j equal one, and F̂°(y) reduces to the EDF of y_1, ..., y_n (Problem 3.9). The product-limit estimate changes only at successive failures, by an amount that depends on the number of censored observations between them. Ties between censored and uncensored data are resolved by assuming that censoring happens instantaneously after a failure might have occurred; the estimate is unaffected by other ties. A standard error for 1 − F̂°(y) is given by Greenwood's formula,

[ {1 − F̂°(y)}² Σ_{j: y_j ≤ y} d_j / {(n − j)(n − j + 1)} ]^{1/2}.  (3.10)
The cumulative hazard function may be estimated by the Nelson-Aalen estimate

Â°(y) = Σ_{j=1}^{n} H(y − y_j) d_j / (n − j + 1),  (3.12)

where H(u) is the Heaviside function, which equals zero if u < 0 and equals one otherwise. Since y_1 < ··· < y_n, the increase in Â° at y_j is dÂ°(y_j) = d_j/(n − j + 1). The interpretation of (3.12) is that at each failure the hazard function is estimated by the number observed to fail, divided by the number of individuals at risk (i.e. available to fail) immediately before that time. In large samples the increments of Â°, the dÂ°(y_j), are approximately independent binomial variables with denominators (n + 1 − j) and probabilities d_j/(n − j + 1). The product-limit estimate may be expressed as

1 − F̂°(y) = Π_{j: y_j ≤ y} { 1 − dÂ°(y_j) }.
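Both (3.9) and (3.12) are one-line calculations once the data are ordered; the following sketch (our function, assuming distinct failure times as in the text) returns the product-limit and Nelson-Aalen estimates:

# Product-limit (3.9) and Nelson-Aalen (3.12) estimates.
surv.est <- function(y, d) {
  o <- order(y); y <- y[o]; d <- d[o]
  n <- length(y)
  atrisk <- n:1                 # n - j + 1 individuals at risk at y_j
  dA <- d/atrisk                # hazard increments dA(y_j)
  list(time = y, surv = cumprod(1 - dA), cumhaz = cumsum(dA))
}
# e.g. for Group 1 of the AML data (boot library):
# surv.est(aml$time[aml$group == 1], aml$cens[aml$group == 1])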
Example 3.9 (AML data) Table 3.4 contains data from a clinical trial conducted at Stanford University to assess the efficacy of maintenance chemotherapy for the remission of acute myelogenous leukaemia (AML). After reaching a state of remission through treatment by chemotherapy, patients were divided randomly into two groups, one receiving maintenance chemotherapy and the other not. The objective of the study was to see if maintenance chemotherapy lengthened the time of remission, that is the time until the symptoms recur. The data in the table were gathered for preliminary analysis before the study ended.
The left panel of Figure 3.3 shows the estimated survivor functions for the times of remission. A plus on one of the lines indicates a censored observation. There is some suggestion that maintenance prolongs the time to remission, but the samples are small and the evidence is not overwhelming. The right panel shows the estimated survivor functions for the censoring times. Only one observation in the non-maintained group is censored, but the censoring distributions seem similar for both groups.

The estimated probabilities that remission will last beyond 20 weeks are respectively 0.71 and 0.59 for the groups, with standard errors from (3.10) both equal to 0.14. ■
Conditional bootstrap
A second sampling scheme starts from the premise that since the censoring variable C is unrelated to Y°, knowledge of the quantities c_1, ..., c_n alone would tell us nothing about F°. They would in effect be ancillary statistics. This suggests that simulations should be conditional on the pattern of censorship, so far as practicable. To allow for the censoring pattern, we argue that although the only values of c_j known exactly are those with d_j = 0, the observed values of the remaining observations are lower bounds for the censoring variables, because c_j > y_j when d_j = 1. This suggests the following algorithm.
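The idea is to draw failure times from F̂° while holding the censoring times at c_j where d_j = 0, and drawing them from Ĝ conditional on C > y_j where d_j = 1. A minimal sketch of such a conditional resampling step, with the two samplers rF0 and rGcond left abstract (hypothetical, user-supplied functions drawing from F̂° and from Ĝ given lower bounds), is:

# One conditional bootstrap resample: rF0(n) draws n failure times from
# the product-limit estimate of F0; rGcond(y) draws censoring times from
# Ghat conditional on C > y (assumed vectorized).
cond.boot <- function(y, d, rF0, rGcond) {
  n <- length(y)
  y0 <- rF0(n)                        # simulated failure times
  cc <- ifelse(d == 0, y, rGcond(y))  # censoring times: fixed where observed
  list(y = pmin(y0, cc), d = as.numeric(y0 <= cc))
}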
Figure 3.3: Product-limit survivor function estimates for two groups of patients with AML, one receiving maintenance chemotherapy (solid) and the other not (dots). The left panel shows estimates for the time to remission, and the right panel shows the estimates for the time to censoring. In the left panel, + indicates times of censored observations; in the right panel + indicates times of uncensored observations. (Horizontal axes: time in weeks.)
Weird bootstrap
The sampling plans outlined above mimic how the data are thought to arise, by generating individual failure and censoring times. When interest is focused on the survival or hazard functions, a third and quite different approach uses direct simulation from the Nelson-Aalen estimate (3.12) of the cumulative hazard. The idea is to treat the numbers of failures at each observed failure time as independent binomial variables with denominators equal to the numbers of individuals at risk, and means equal to the numbers that actually failed. Thus when y_1 < ··· < y_n, we take the simulated number to fail at time y_j, N_j*, to be binomial with denominator n − j + 1 and probability of failure d_j/(n − j + 1). A simulated Nelson-Aalen estimate is then

Â°*(y) = Σ_{j=1}^{n} H(y − y_j) N_j* / (n − j + 1),  (3.14)

which can be used to estimate the uncertainty of the original estimate Â°(y). In this weird bootstrap the failures at different times are unrelated, the number at risk does not depend on previous failures, there are no individuals whose simulated failure times underlie Â°*(y), and no explicit assumption is made about the censoring mechanism. Indeed, under this scheme the censored individuals are held fixed, but the number of failures is a sum of binomial variables (Problem 3.10).
The simulated survivor function corresponding to (3.14) is obtained by substituting the simulated increments dÂ°*(y_j) = N_j*/(n − j + 1) into the product-limit form of the survivor function.
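A weird bootstrap replicate is therefore no more than a set of independent binomial draws; a minimal sketch (our function, again assuming distinct ordered times) is:

# One weird bootstrap replicate of the Nelson-Aalen estimate (3.14).
weird.boot <- function(y, d) {
  o <- order(y); y <- y[o]; d <- d[o]
  atrisk <- length(y):1                         # numbers at risk
  Nstar <- rbinom(length(y), atrisk, d/atrisk)  # failures at each time
  list(time = y, cumhaz = cumsum(Nstar/atrisk),
       surv = cumprod(1 - Nstar/atrisk))        # product-limit form
}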
Example 3.10 (AML data) Figure 3.3 suggests that the censoring distributions for both groups of data in Table 3.4 are similar, but that the survival distributions themselves are not. To compare the resampling schemes described above, we consider estimates of two parameters, the probability of remission beyond 20 weeks and the median survival time, both for Group 1. These estimates are 1 − F̂°(20) = 0.71 and inf{t : F̂°(t) ≥ ½} = 31.

Table 3.5 compares results from 499 simulations using the ordinary, conditional, and weird bootstraps. For the survival probabilities, the ordinary and conditional bootstraps give similar results, and both standard errors are similar to that from Greenwood's formula; the weird bootstrap probabilities are significantly higher and are less variable. The schemes give infinite estimates of the median 21, 19, and 2 times respectively. The weird bootstrap results for the median are less variable than the others.
The last columns of the table show the numbers of samples in which the smallest censored observation appears 0, 1, 2, and 3 or more times. Under the conditional scheme the observation appears more often than under the ordinary bootstrap, and under the weird bootstrap it occurs once in each resample.

Figure 3.4 compares the distributions of the difference of median survival times between the two groups, under the three schemes. Results for the conditional and ordinary bootstraps are similar, but the weird bootstrap again gives results that are less variable than the others.

This set of data gives an extreme test of methods for censored data, because quantiles of the product-limit estimate are very discrete.

The weird bootstrap also gave results less variable than the other schemes for a larger set of data. In general it seems that case resampling and conditional resampling give quite similar and reliable results, both differing from the weird bootstrap. ■
Parametric problems
For parametric problems the situation is relatively straightforward, at least in principle. First, in defining estimators there is a general framework within which complete-data MLE methods can be applied using the iterative EM algorithm, which essentially works by estimating missing values. (The EM or expectation-maximization algorithm is widely used in incomplete-data problems.) Formulae exist for computing approximate standard errors of estimators, but simulation will often be required to obtain accurate answers. One extra component that must be specified is the mechanism which takes complete data y° into observed data y, i.e. f(y | y°). The methodology is simplest when data are missing at random.

The corresponding Bayesian methodology is also relatively straightforward in principle, and numerous general algorithms exist for using complete-data forms of posterior distribution. Such algorithms, although they involve simulation, are somewhat removed from the general context of bootstrap methods and will not be discussed here.
Nonparametric problems
Nonparametric analysis is somewhat more complicated, in part because of the difficulty of defining appropriate estimators. The following artificial example illustrates some of the key ideas.
Example 3.11 (Mean with missing data) Suppose that responses y° had been obtained from n randomly chosen individuals, but that m randomly selected values were then lost. So the observed data are

y_1, ..., y_n = y°_1, ..., y°_{n−m}, NA, ..., NA.

To estimate the population mean μ we should of course use the average response ȳ = (n − m)^{-1} Σ y_j, whose variance we would estimate by

v = (n − m)^{-2} Σ_{j=1}^{n−m} (y_j − ȳ)².

Assuming that we discard all resamples with m* = n (all data missing), the bootstrap variance will overestimate var(T) by a factor which ranges from 15% for n = 10, m = 5 to 4% for n = 30, m = 15.
In the second approach, the first step was to fix the data so that the complete-data estimation formula μ̂ = n^{-1} Σ_{j=1}^{n} y_j° for t could be used. Then we attempted to simulate data according to the two steps in the original data-generation process. Unfortunately the EDF of y°_1, ..., y°_{n−m}, ŷ°_{n−m+1}, ..., ŷ°_n is an underdispersed estimate of the true CDF F. Even though the estimate t is not affected in this particularly simple problem, the bootstrap distribution certainly is: in particular the bootstrap variance is too small.
Both approaches can be repaired. In the first, we can stratify the sampling with complete and incomplete data as strata. In the second approach, we can add variability to the estimates of missing values. This device, called multiple imputation, is discussed further below.
This example suggests two lessons. First, if the complete-data estimator can be modified to work for incomplete data, then resampling cases will work reasonably well provided the proportion of missing data is small: stratified resampling would reduce variation in the amount of missingness. Secondly, the complete-data estimator and full simulation of data observation (including the data-loss step) cannot be based on single imputation estimation of missing values, but may work if we use multiple imputation appropriately.

One further point concerns the data-loss mechanism, which in the example we assumed to be completely random. If data loss is dependent upon the response value y, then resampling cases should still be valid: this is somewhat similar to the censored-data problem. But the other approach via multiple imputation will become complicated because of the difficulty of defining appropriate multiple imputations.
Example 3.12 (Bivariate missing data) A more realistic example concerns the estimation of bivariate correlation when some cases are incomplete. Suppose that Y is bivariate with components U and X. The parameter of interest is θ = corr(U, X). A random sample of n cases is taken, such that m cases have x missing, but no cases have both u and x missing or just u missing. If it is safe to assume that X has a linear regression on U, then we can use fitted regression to make single imputations of missing values. That is, we estimate each missing x_j by

x̂_j = x̄ + b(u_j − ū),

where x̄, ū and b are the averages and the slope of linear regression of x on u from the n − m complete pairs.

It is easy to see that it would be wrong to substitute these single imputations in the usual formula for sample correlation. The result would be biased away from zero if b ≠ 0. Only if we can modify the sample correlation formula to remove this effect will it be sensible to use simple resampling of cases.

The other strategy is to begin with multiple imputation to obtain a suitable bivariate F̂, next estimate θ with the usual sample correlation t(F̂), and then resample appropriately. Multiple imputation uses the regression residuals from the complete pairs,

e_j = x_j − x̂_j = x_j − {x̄ + b(u_j − ū)}.
quantile of the standard normal distribution. Such confidence intervals are a factor (1 − f)^{1/2} shorter than for sampling with replacement.

The lack of independence affects possible resampling plans, as is seen by applying the ordinary bootstrap to Ȳ. Suppose that Y_1*, ..., Y_n* is a random sample taken with replacement from y_1, ..., y_n. Their average Ȳ* has variance var*(Ȳ*) = n^{-2} Σ (y_j − ȳ)², and this has expected value n^{-2}(n − 1)γ over possible samples y_1, ..., y_n. This only matches the second line of (3.15) if f = n^{-1}. Thus for the larger values of f generally met in practice, ordinary bootstrap standard errors for ȳ are too large and the confidence intervals for θ are systematically too wide. ■
the possible without-replacement samples from 𝒴*, and the corresponding bootstrap value is T* = t(Y_1*, ..., Y_n*).

If N/n is not an integer, we write N = kn + l, where 0 < l < n, and form 𝒴* by taking k copies of y_1, ..., y_n and adding to them a sample of size l taken without replacement from y_1, ..., y_n. Bootstrap samples are formed as when N = kn, but a different 𝒴* is used for each. We call this the population bootstrap. Under a superpopulation model, the members of the population 𝒴 are themselves a random sample from an underlying distribution, 𝒫. The nonparametric maximum likelihood estimate of 𝒫 is the EDF of the sample, which suggests the following resampling plan.
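A minimal sketch of the population bootstrap just described, for a vector-valued sample and a scalar statistic t.fun (our coding), is:

# Population bootstrap: build a fake population of size N = kn + l, then
# resample n without replacement from it, R times.
pop.boot <- function(y, N, R, t.fun) {
  n <- length(y); k <- N %/% n; l <- N %% n
  replicate(R, {
    pop <- c(rep(y, k), sample(y, l))  # k copies plus l extra cases
    t.fun(sample(pop, n))              # without-replacement resample
  })
}
# e.g. pop.boot(city$x, N = 49, R = 999, mean)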
As one would expect, this gives results similar to the population bootstrap. Under the population bootstrap,

var*(Ȳ*) = [ N(n − 1) / {(N − 1)n} ] × n^{-1}(1 − f) c,

and this is the correct formula apart from the first factor on the right, which is typically close to one. Under the superpopulation bootstrap a straightforward calculation establishes that the mean variance of Ȳ* is (n − 1)/n × (1 − f) n^{-1} c (Problem 3.12).

These sampling schemes make almost the right allowance for the sampling fraction, at least for the average.

For the mirror-match scheme we suppose that n = km for integer m, and write Ȳ* = n^{-1} Σ_{i=1}^{k} Σ_{j=1}^{m} Y_ij*, where (Y_i1*, ..., Y_im*) is the ith without-replacement resample, independent of the other without-replacement resamples. Then we can use (3.15) to establish that var*(Ȳ*) = (km)^{-1}(1 − m/n) c. Because our assumptions imply that f = m/n, this is an unbiased estimate of var(Ȳ), but it would be biased if m ≠ nf. ■
Example 3.15 (City population data) For a numerical assessment of the schemes outlined above, we consider again the data in Example 1.2, on 1920 and 1930 populations (in thousands) of N = 49 US cities. Table 2.1 contains populations y_j = (u_j, x_j) for a sample of n = 10 cities taken without replacement from the 49, and we use them to estimate the mean 1930 population θ = N^{-1} Σ x_j for the 49 cities.

Two standard estimators of θ are the ratio and regression estimators. The ratio estimate and its estimated variance are given by

t = Ū x̄/ū,  v = (1 − f) Ū²/ū² × {n(n − 1)}^{-1} Σ_{j=1}^{n} (x_j − u_j x̄/ū)²,  (3.16)

where Ū = N^{-1} Σ_{j=1}^{N} u_j is the known population average of u.
Figure 3.6: Population bootstrap results for the regression estimator based on the city data with n = 10. The left panel shows values of t*_reg and v*^{1/2} for resamples in which case 4 appears at least once (dots), and in which case 4 does not appear and case 9 appears zero times (0), once (1), or more times (+); the dotted line shows t_reg. The right panel shows the sample and the regression lines fitted to the data with case 4 (dashes) and without it (dots); the vertical line shows the value ū at which θ is estimated.
To compare the performances of the various methods in setting confidence intervals, we conducted a numerical experiment in which 1000 samples of size n = 10 were taken without replacement from the population of size N = 49. For each sample we calculated 90% confidence intervals [L, U] for θ using R = 999 bootstrap samples. Table 3.8 contains the empirical values of Pr(θ < L), Pr(θ < U), and Pr(L < θ < U). The normal intervals are short and their coverages are much too small, while the modified intervals with n′ = 2 have the opposite problem. Coverages for the modified sample size with n′ = 11 and for the population and superpopulation bootstrap are close to their nominal levels, though their endpoints seem to be slightly too far left. The 80% and 95% intervals and those for the regression estimator have similar properties. In line with other studies in the literature, we conclude that the population and superpopulation bootstraps are the best of those considered here. ■
Stratified sampling
In most applications the population is divided into k strata, the ith of which contains N_i individuals from which a sample of size n_i is taken without replacement, independent of other strata. The ith sampling fraction is f_i = n_i/N_i and the proportion of the population in the ith stratum is w_i = N_i/N, where N = N_1 + ··· + N_k. The estimate of θ and its standard error are found by combining quantities from each stratum.

Two different setups can be envisaged for mathematical discussion. In the first, the "small-k" case, there is a small number of large strata: the asymptotic regime takes k fixed and n_i, N_i → ∞ with f_i → π_i, where 0 < π_i < 1.
Apart from there being k strata, the same ideas and results will apply as above, with the chosen resampling scheme applied separately in each stratum. The second setup, the "large-k" case, is where there are many small strata; in mathematical terms we suppose that k → ∞ but that N_i and n_i are bounded. This situation is more complicated, because biases from each stratum can combine in such a way that a bootstrap fails completely.

Example 3.16 (Average) Suppose that the population 𝒴 comprises k strata, and that the jth item in the ith stratum is labelled 𝒴_ij; the average for that stratum is θ_i. Then the population average is θ = Σ w_i θ_i, which is estimated by T = Σ w_i Ȳ_i, where Ȳ_i is the average of the sample Y_i1, ..., Y_in_i from the ith stratum. The variance of T is

v = Σ_{i=1}^{k} w_i² (1 − f_i) × n_i^{-1} N_i^{-1} Σ_{j=1}^{N_i} (𝒴_ij − θ_i)²,  (3.18)

and the mean of its sample analogue (3.20) is obtained by replacing the last term on the right by (N_i − 1)^{-1} Σ_j (𝒴_ij − θ_i)². If k is fixed and N_i → ∞ while f_i → π_i, (3.20) will converge to v, but this will not be the case if n_i, N_i are bounded and k → ∞. The bootstrap bias estimate also may fail for the same reason (Problem 3.12). ■
For setting confidence intervals using the studentized bootstrap the key issue is not the performance of bias and variance estimates, but the extent to which the distribution of the resampled quantity Z* = (T* − t)/V*^{1/2} matches that of Z = (T − θ)/V^{1/2}. Detailed calculations show that when the population and superpopulation bootstraps are used, Z and Z* have the same limiting distribution under both asymptotic regimes, and that under the fixed-k setup the approximation is better than that using the other resampling plans.
Example 3.17 (Stratified ratio) For empirical comparison of the more promising of these finite population resampling schemes with stratified data, we generated a population with N pairs (u, x) divided into strata of sizes N_1, ..., N_k; the parameter of interest is

θ = N^{−1} Σ_{i=1}^k Σ_{j=1}^{N_i} x_ij,

where x_ij is the value of x for the jth element of stratum i.
We took independent samples (u_ij, x_ij) of sizes n_i without replacement from the ith stratum, and used these to form the ratio estimate of θ and its estimated variance, given by

t = Σ_{i=1}^k w_i ū_i t_i,    v = Σ_{i=1}^k w_i^2 (1 − f_i) {n_i(n_i − 1)}^{−1} Σ_{j=1}^{n_i} (x_ij − t_i u_ij)^2,

where

t_i = Σ_{j=1}^{n_i} x_ij / Σ_{j=1}^{n_i} u_ij,    ū_i = N_i^{−1} Σ_{j=1}^{N_i} u_ij;
these extend (3.16) to stratified sampling. We used bootstrap resamples with R = 199 to compute studentized bootstrap confidence intervals for θ based on 1000 different samples from simulated datasets. Table 3.9 shows the empirical coverages of these confidence intervals in three situations: a "large-k" case with k = 20, N_i = 18 and n_i = 6; a "small-k" case with k = 5, N_i = 72 and n_i = 24; and a "small-k" case with k = 3, N_i = 18 and n_i = 6. The modified sampling method used sampling with replacement, giving samples of size n′ = 7 when n_i = 6 and size n′ = 34 when n_i = 24, while the corresponding values of m for the mirror-match method were 3 and 8. Throughout f_i = 1/3.
In all three cases the coverages for normal, population and modified sample size intervals are close to nominal, while the mirror-match method does poorly. The superpopulation method also does poorly, perhaps because it was applied to separate strata rather than used to construct a new population to be stratified at each replicate. Similar results were obtained for nominal 80% and 95% confidence limits. Overall the population bootstrap and modified sample size methods do best in this limited comparison, and coverage is not improved by using the more complicated mirror-match method. ■
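A sketch of one simple variant of the population bootstrap, applied within a single stratum (the random choice of the remainder below is an assumption for illustration, not the exact randomization of Section 3.7):

pop.boot <- function(y, N)
{ n <- length(y)
  pop <- rep(y, N %/% n)                     # replicate the sample to build a population...
  pop <- c(pop, sample(y, N - length(pop)))  # ...topping up at random to size N
  sample(pop, n) }                           # then resample n without replacement

Within each stratum this would be applied with that stratum's own N_i, and the stratum quantities recombined as above.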
where the x_i are randomly sampled from F_X and independently the z_ij are randomly sampled from F_Z, with E(Z) = 0 to force uniqueness of the model. Thus there is homogeneity of variation in Z between groups, and the structure is additive. The feature of this model that complicates resampling is the correlation between observations within a group.

For data having this nested structure, one might be interested in parameters of F_X or F_Z or some combination of both. For example, when testing for presence of variation in X the usual statistic of interest is the ratio of between-group and within-group sums of squares.
How should one resample nonparametrically for such a data structure? There are two simple strategies, for both of which the first stage is to randomly sample groups with replacement. At the second stage we randomly sample within the groups selected at the first stage, either without replacement (Strategy 1) or with replacement (Strategy 2). Note that Strategy 1 keeps selected groups intact.

To see which strategy is likely to work better, we look at the second moments of the resampled data y*_ij to see how well they match (3.22). Consider selecting y*_{i1}, ..., y*_{ib}. At the first stage we select a random integer I* from {1, 2, ..., a}. At the second stage, we select random integers J*_1, ..., J*_b from {1, 2, ..., b}, either without replacement (Strategy 1) or with replacement (Strategy 2): the sampling without replacement is equivalent to keeping the I*th group intact.
Under both strategies

E*(y*_ij | I* = i′) = ȳ_{i′·},

so the conditional means match. The strategies differ, however, in the joint second moments of the y*_ij: the implied values of E{var*(y*_ij)} and of E{cov*(y*_ij, y*_ik)} match the moment structure (3.22) more closely under Strategy 1 than under Strategy 2.
On balance, therefore, Strategy 1 more closely mimics the variation properties of the data, and so is the preferable strategy. Resampling should work well so long as a is moderately large, say at least 10, just as resampling homogeneous data works well if n is moderately large. Of course both strategies would work well if both a and b were very large, but this is rarely the case. An application of these results is given in Example 6.9.
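The two strategies are easily coded. In the sketch below (the a × b data matrix ymat is hypothetical), replace.within = FALSE gives Strategy 1, since sampling b values without replacement from a group of size b merely permutes the intact group:

resample.groups <- function(ymat, replace.within = FALSE)
{ a <- nrow(ymat); b <- ncol(ymat)
  i <- sample(a, a, replace = TRUE)          # first stage: groups with replacement
  sel <- ymat[i, , drop = FALSE]
  t(apply(sel, 1, function(g) sample(g, b, replace = replace.within))) }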
The preceding discussion would apply to balanced data structures, but not to more complex situations, for which a more general approach is required. A direct, model-based approach would involve resampling from suitable estimates of the two (or more) data distributions, generalizing the resampling from F̂ in Chapter 2. Here we outline how this might work for the data structure (3.21).
Estimates of the two CDFs F_X and F_Z can be formed by first estimating the xs and zs, and then using their EDFs. A naive version of this, which parallels standard linear model theory, is to define

x̂_i = ȳ_{i·},    ẑ_ij = y_ij − ȳ_{i·},    i = 1, ..., a,  j = 1, ..., b.   (3.25)

Straightforward calculations (Problem 3.17) show that this approach gives y*_ij the same second-moment properties as Strategy 2 earlier, shown in (3.23) and (3.24), which are not satisfactory. Somewhat predictably, Strategy 1 is mimicked by choosing z*_{i1}, ..., z*_{ib} randomly with replacement from one group of residuals ẑ_{k1}, ..., ẑ_{kb} — either a randomly selected group or the group corresponding to x*_i (Problem 3.17).
What has gone wrong here is that the estimates x̂_i in (3.25) have excess variation, namely E(a^{−1}SS_B) = σ_x^2 + b^{−1}σ_z^2, relative to σ_x^2, where SS_B = Σ_i (ȳ_{i·} − ȳ_{··})^2 is the between-group sum of squares. The estimates ẑ_ij defined in (3.25) will be satisfactory provided b is reasonably large, although in principle they should be standardized to

z̃_ij = ẑ_ij / (1 − b^{−1})^{1/2}.   (3.26)

The excess variation in x̂_i can be corrected by using the shrinkage estimate

x̃_i = c ȳ_{··} + (1 − c) ȳ_{i·},

where c is given by

(1 − c)^2 = 1 − SS_W / {b(b − 1) SS_B},

with SS_W = Σ_i Σ_j (y_ij − ȳ_{i·})^2 the within-group sum of squares.
where F̂* denotes either the EDF of the bootstrap sample Y*_1, ..., Y*_n drawn from F̂, or the parametric model fitted to that sample. Thus the calculation applies to both parametric and nonparametric situations. There is both random variation and systematic bias in B in general: it is the bias with which we are concerned here.
As with T itself, so with B: the bias can be estimated using the bootstrap. If we write γ = c(F) = E(B | F) − b(F), then the simple bootstrap estimate according to the general principle laid out in Chapter 2 is C = c(F̂). From the definition of c(F) this implies

C = E*(B* | F̂) − B,

the bootstrap estimate of the bias of B. To see just what C involves, we use the definition of B in (3.27) to obtain

C = E*{E**(T** | F̂*) − T* | F̂} − {E*(T* | F̂) − T}.
Here F̂** denotes the EDF of a sample drawn from F̂*, or from the parametric model fitted to that sample; T** is the estimate computed with that sample; and E** denotes expectation over the distribution of that sample conditional on F̂*. There are two levels of bootstrapping in this procedure, which is therefore called a nested or double bootstrap. The adjusted estimate of bias is

B_adj = B − C.
Since typically bias is of order n^{−1}, the adjustment C is typically of order n^{−2}. The following example gives a simple illustration of the adjustment, in which

B − C = −n^{−1} T − n^{−2} T.
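A minimal nonparametric sketch of the nested calculation, with tfun a hypothetical function computing t from a sample:

double.boot.bias <- function(y, tfun, R = 200, M = 200)
{ t0 <- tfun(y)
  B <- mean(replicate(R, tfun(sample(y, replace = TRUE)))) - t0   # B = E*(T*) - t
  Bstar <- replicate(R, { ystar <- sample(y, replace = TRUE)
    mean(replicate(M, tfun(sample(ystar, replace = TRUE)))) - tfun(ystar) })
  C <- mean(Bstar) - B                       # C = E*(B*) - B
  c(B = B, B.adj = B - C) }                  # B_adj = B - C

# e.g. for the biased variance estimate:
# double.boot.bias(y, function(x) mean((x - mean(x))^2))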
More generally, suppose that the quantity of interest β = b(F) is defined by an estimating equation

E{h(F, F̂; β) | F} = 0,   (3.31)

whose bootstrap analogue

E*{h(F̂, F̂*; β̂) | F̂} = 0

defines the estimate β̂. In general β̂ does not satisfy (3.31) exactly, but rather

E{h(F, F̂; β̂) | F} = e(F) n^{−a}.   (3.32)

To correct for this bias we introduce the ideal perturbation γ = c_n(F), which modifies b(F) to b(F, γ) in order to achieve

E[h{F, F̂; b(F, γ)} | F] = 0.   (3.33)

What we want to see is the effect of substituting β̂_adj for β̂ in (3.32). First we approximate the solution to (3.33). Taylor expansion about γ = 0, together with (3.32), gives

e(F) n^{−a} + γ d_n(F) + ··· = 0,   (3.34)

where

d_n(F) = ∂E[h{F, F̂; b(F, γ)} | F]/∂γ |_{γ=0},

so that approximately

γ = c_n(F) = −r(F) n^{−a},

with r(F) = e(F)/d_n(F). This, together with the corresponding approximation for γ̂ = c_n(F̂), gives

γ̂ − γ = −n^{−a}{r(F̂) − r(F)} = −n^{−a−1/2} X_n,   (3.35)

say, where X_n is O_p(1).
We can now assess the effect of the adjustment from β̂ to β̂_adj. Define the conditional quantity

k_H(X_n) = ∂E[h{F, F̂; b(F, γ)} | X_n, F]/∂γ |_{γ=0},

which is O_p(1). (Note that if the next term in expansion (3.34) were O(n^{−a−c}), the order of the right-hand side of (3.35) would change accordingly; in almost all cases this leads to the same conclusion.) Then taking expectations in (3.35) we deduce that, because of (3.34), the bias remaining after adjustment is of smaller order than the original bias e(F)n^{−a}.

In the parametric case the dependence of properties of T on the parameter ψ can be explored directly: for each of a set of values ψ_1, ..., ψ_K we simulate R datasets from the fitted model with ψ = ψ_k, and compute the corresponding estimates t*_{1k}, ..., t*_{Rk}. For example, the variance of T at ψ_k is estimated by

v(ψ_k) = (R − 1)^{−1} Σ_{r=1}^R (t*_{rk} − t̄*_k)^2,   (3.37)
where t̄*_k = R^{−1} Σ_{r=1}^R t*_{rk}. Plots of v(ψ_k) against components of ψ_k can then be used to see how var(T) depends on ψ. Example 2.13 shows an application of this. The same simulation results can also be used to approximate other properties, such as the bias or quantiles of T, or the variance of transformed T.
As described here the number of simulated datasets will be RK, but in fact this number can be reduced considerably, as we shall show in Section 9.4.4. The simulation can be bypassed completely if we estimate v(ψ_k) by a delta-method variance approximation v_L(ψ_k), based on the variance of the influence function under the parametric model. However, this will often be impossible.
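For a concrete parametric sketch, take T to be the average of an exponential sample, for which var(T) = ψ^2/n exactly; the grid and simulation sizes below are illustrative:

n <- 20; R <- 100
psi <- seq(0.5, 2, length = 11)
v <- sapply(psi, function(p)
  var(replicate(R, mean(rexp(n, rate = 1/p)))))   # v(psi_k) from R datasets at each psi_k
plot(psi, v); lines(psi, psi^2/n)                 # compare with the exact variance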
In the nonparametric case there appears to be a major obstacle to performing calculations analogous to (3.37), namely the unavailability of models corresponding to a series of parameter values ψ_1, ..., ψ_K. But this obstacle can be overcome by treating the first-level bootstrap samples themselves as models, each with its own parameter value t*_r.
Example 3.20 (City population data) Figure 3.7 shows the results of the double bootstrap procedure outlined above, for the ratio estimator applied to the data in Table 2.1, with n = 10. The left panel shows the bias b* estimated using M = 50 second-level bootstrap samples from each of R = 999 first-level bootstrap samples. The right panel shows the corresponding standard errors v*^{1/2}_r. The lines from applying a locally weighted robust smoother confirm the clear increase with the ratio in each panel.

The implication of Figure 3.7 is that the bias and variance of the ratio are not stable with n = 10. Confidence intervals for the true ratio θ based on normal approximations to the distribution of T − θ will therefore be poor, as will basic bootstrap confidence intervals, and those based on related quantities such as the studentized bootstrap are suspect. A reasonable interpretation of the right panel is that var(T) ∝ θ^2, so that log T should be more stable. ■
The left panel of Figure 3.8 contains a scatter plot of v*_L versus t* from R = 999 nonparametric simulations: the dotted line is the approximate normal-theory relationship var(T) = n^{−1}(1 − θ^2)^2. The plot correctly shows strong instability of variance. The right panel shows the corresponding plot for bootstrapping the transformed estimate ½ log{(1 + t)/(1 − t)}, whose variance is approximately n^{−1}: here v*_L is computed as in Example 2.18. The plot correctly suggests quite stable variance. ■
As presented here the selection of parameter values ψ* is completely random, and R would need to be moderately large (at least 50) to get a reasonable spread of values of ψ*. The total number of samples, RM + R, will then be very large. It is, however, possible to improve upon the algorithm; see Section 9.4.4. Another important problem is the roughness of the variance estimates, apparent in both of the preceding examples. This is due not just to the size of M, but also to the noise in the EDFs F̂* being used as models.
Frequency smoothing

One major difference between the parametric and nonparametric cases is that the parametric models vary smoothly with parameter values. A simple way to inject such smoothness into the nonparametric "models" F̂* is to smooth them. For simplicity we consider the one-sample case.

Let w(·) be a symmetric density with mean zero and unit variance, and consider the smoothed frequencies

f_j(θ, ε) ∝ Σ_{r=1}^R w{(t*_r − θ)/ε} f*_{rj},    j = 1, ..., n.   (3.39)

Here ε > 0 is a smoothing parameter that determines the effective range of values of t* over which the frequencies are smoothed. As is common with kernel smoothing, the value of ε is more important than the choice of w(·), which we take to be the standard normal density. Numerical experimentation suggests that close to θ = t, values of ε in the range 0.2v^{1/2} to 1.0v^{1/2} are suitable, where v is an estimated variance for t. We choose the constant of proportionality in (3.39) to ensure that Σ_j f_j(θ, ε) = n. For a given ε, the relative frequencies n^{−1} f_j(θ, ε) determine a distribution F̂*_θ, for which the parameter value is θ* = t(F̂*_θ); in general θ* is not equal to θ, although it is usually very close.
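A sketch of (3.39) using the boot library, whose boot.array function returns the frequency array f*_rj; here boot.out is assumed to be a nonparametric bootstrap object whose first statistic is t:

library(boot)
freq.smooth <- function(theta, boot.out, eps)
{ f <- boot.array(boot.out)               # R x n matrix of frequencies f*_rj
  tstar <- boot.out$t[, 1]
  w <- dnorm((tstar - theta)/eps)         # standard normal kernel
  fj <- colSums(w * f)                    # smoothed frequencies, unnormalized
  ncol(f) * fj/sum(fj) }                  # scaled so that the f_j(theta, eps) sum to n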
Example 3.22 (City population data) In continuation of Example 3.20, the top panels of Figure 3.9 show the frequencies f*_rj for four samples with values of t* very close to 1.6. The variation in the f*_rj leads to the variability in both b* and v* that shows so clearly in Figure 3.7.

The lower panels show the smoothed frequencies (3.39) for distributions F̂*_θ with θ = 1.2, 1.52, 1.6, 1.9 and ε = 0.2v^{1/2}. The corresponding values of the ratio are θ* = 1.23, 1.51, 1.59, and 1.89. The observations with the smallest empirical influence values are more heavily weighted when θ is less than the original value of the statistic, t = 1.52, and conversely. The third panel, for θ = 1.6, results from averaging frequencies including those shown in the upper panels, and the distribution is much smoother than those. The results are not very sensitive to the value of ε, although the tilting of the frequencies is less marked for larger ε.
The smoothed frequencies can be used to assess how the bias and variance of T vary with θ, and in particular to estimate a variance function v(θ), from which a variance-stabilizing transformation can be obtained as

h(t) = ∫^t {v(u)}^{−1/2} du.   (3.40)
In general, but especially for small R_1, it will be better to fit a smooth curve to values of log v*_r, in part to avoid negative estimates of v(θ). Provided that a suitable smoothing method is used, inclusion of the extreme values t*_{(1)} and t*_{(R)} in the set for which the v*_r are estimated implies that all the transformed values h(t*) can be calculated. The transformed estimator h(T) should have approximately unit variance.

Any of the common smoothers can be used to obtain v(·), and simple integration algorithms can be used for the integral (3.40). If the nested bootstrap is used only to obtain the variances of R_1 of the t*_r, the total number of bootstrap samples required is R + MR_1. Values of R_1 and M in the ranges 50–100 and 25–50 will usually be adequate, so if R = 1000 the overall number of bootstrap samples required will be 2250–6000. If variance estimates for all the t*_r are available, for example nonparametric delta method estimates, then the delta method shows that approximate standard errors for the h(t*_r) will be v*^{1/2}_r/{v(t*_r)}^{1/2}; a plot of these against t*_r will provide a check on the adequacy of the transformation.
The same procedure can be applied with second-level resampling done from smoothed frequencies, as in Example 3.22.
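The whole procedure can be sketched compactly. In the following, tfun is a hypothetical estimator, the smoothing uses a spline on the log scale, and the integral (3.40) is evaluated numerically:

var.stab <- function(y, tfun, R = 999, R1 = 50, M = 25)
{ samp <- replicate(R, sample(y, replace = TRUE))      # first-level samples (n x R)
  tstar <- apply(samp, 2, tfun)
  use <- order(tstar)[round(seq(1, R, length = R1))]   # R1 replicates, including extremes
  vstar <- sapply(use, function(r)
    var(replicate(M, tfun(sample(samp[, r], replace = TRUE)))))
  sm <- smooth.spline(tstar[use], log(vstar))          # smooth log v* against t*
  vfun <- function(u) exp(predict(sm, u)$y)
  sapply(tstar, function(t0)                           # h(t*) = integral of v(u)^(-1/2)
    integrate(function(u) vfun(u)^(-0.5), min(tstar), t0)$value) }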
Example 3.23 (City population data) For the city population data of Example 2.8 the parameter of interest is the ratio θ, which is estimated by t = x̄/ū. Figure 3.7 shows that the variance of T depends strongly on θ. We used the procedure outlined above to estimate a transformation based on R = 999 bootstrap samples, with R_1 = 50 and M = 25. The transformation is shown in the left panel of Figure 3.11: the right panel shows the standard errors v*^{1/2}/{v(t*)}^{1/2} of the h(t*). The transformation has been largely successful in stabilizing the variance.

In this case the variances v_{Lr} based on the linear approximation are readily calculated, and the transformation could have been estimated from them rather than from the nested bootstrap. ■
Figure 3.11 Variance-stabilization for the city population ratio. The left panel shows the empirical transformation h(·), and the right panel shows the standard errors v*^{1/2}/{v(t*)}^{1/2} of the h(t*), with a smooth curve.
to compare outliers, for example. In this situation we must focus on the effect of individual observations on bootstrap calculations, to answer questions such as "would the confidence interval differ greatly if this point were removed?", or "what happens to the significance level when this observation is deleted?"
Nonparametric case
Once a nonparam etric resam pling calculation has been perform ed, a basic
question is how it w ould have been different if an observation, yj, say, had
been absent from the original data. F or exam ple, it m ight be wise to check
w hether or n o t a suspicious case has affected the quantiles used in a confidence
interval calculation. T he obvious way to assess this is to do a fu rth er sim ulation
from the rem aining observations, b u t this can be avoided. This is because a
resam ple in which y; does n o t ap p ear can be th o u g ht o f as a random sample
from the d a ta w ith yj excluded. Expressed formally, if J* is sam pled uniform ly
from { l ,...,n } , then the conditional distribution o f J ' given th at J* =/= j
is the sam e as the distribution o f /*, where /* is sam pled uniform ly from
{ 1 ,... , j — \ , j + 1 ,...,« } . T he probability th a t is n o t included in a boo tstrap
sample is (1 — n-1 )" = e ~ \ so the num b er o f sim ulations R - j th a t do not
include yj is roughly equal to R e ~l = 0.368R.
So we can measure the effect of y_j on the calculations by comparing the full simulation with the subset of t*_1, ..., t*_R obtained from bootstrap samples in which y_j does not occur. In terms of the frequencies f*_{rj}, which count the number of times y_j appears in the rth simulation, we simply restrict attention to replicates with f*_{rj} = 0. For example, the effect of y_j on the bias estimate B can be assessed by recomputing the bias estimate from just those replicates.
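With the boot library no new simulation is needed: the frequency array identifies the replicates omitting each case. A sketch for a chosen case j, with boot.out a nonparametric bootstrap object (the function jack.after.boot(boot.out) produces the corresponding plot):

library(boot)
f <- boot.array(boot.out)                 # frequencies f*_rj
tstar <- boot.out$t[, 1]; t0 <- boot.out$t0[1]
omit <- tstar[f[, j] == 0]                # roughly 0.368*R replicates without y_j
c(all = mean(tstar) - t0,                 # bias estimate from the full simulation
  without.j = mean(omit) - t0)            # crude comparison; strictly the second
                                          # should be centred at t_{-j}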
Table 3.10 Measurements on the head breadth and length of the first two adult sons in 25 families (Frets, 1921).

      First son    Second son         First son    Second son
      Len   Brea   Len   Brea         Len   Brea   Len   Brea
 1    191   155    179   145     14   190   159    195   157
 2    195   149    201   152     15   188   151    187   158
 3    181   148    185   149     16   163   137    161   130
 4    183   153    188   149     17   195   155    183   158
 5    176   144    171   142     18   186   153    173   148
 6    208   157    192   152     19   181   145    182   146
 7    189   150    190   149     20   175   140    165   137
 8    197   159    189   152     21   192   154    185   152
 9    188   152    197   159     22   174   143    178   147
10    192   150    187   151     23   176   139    176   143
11    179   158    186   148     24   197   167    200   158
12    183   147    174   147     25   190   163    187   150
13    174   150    185   152
Example 3.24 (Frets' heads) Table 3.10 contains data on the head breadth and length of the first two adult sons in 25 families.

The correlations among the log measurements are given below the diagonal in Table 3.11. The values above the diagonal are the partial correlations. For example, the value 0.13 in the second row is the correlation between the log head breadth of the first son, b_1, and the log head length of the second son, l_2, after allowing for the other variables. In effect, this is the correlation between the residuals from separate regressions of b_1 and l_2 on the other two variables. The correlations are all large, but four of the partial correlations are small, which suggests the simple interpretation that each of the four pairs of measurements for first and second sons is independent conditionally on the values of the other two measurements.
We focus on the partial correlation t = 0.13 between log b_1 and log l_2. The top panel of Figure 3.12 shows a jackknife-after-bootstrap plot for t, based on 999 bootstrap samples. The points at the left-hand end show the empirical 0.05, 0.1, 0.16, 0.5, 0.84, 0.9, and 0.95 quantiles of the values of t* − t̄*_{−2} for the 368 bootstrap samples in which case 2 was not selected; t̄*_{−2} is the average of t* for those samples. The dotted lines are the corresponding quantiles for all 999 values of t* − t. The distribution is clearly much more peaked when case 2 is left out. The panel also contains the corresponding quantiles when other cases are excluded. The horizontal axis shows the empirical influence values for t: clearly putting more weight on case 2 sharply decreases the value of t.

The lower left panel of the figure shows that case 2 lies somewhat away from the rest, and the plot of residuals for the regressions of log b_1 and log l_2 on (log b_2, log l_1) in the lower right panel accounts for the jackknife-after-bootstrap results. Case 2 seems outlying relative to the others: deleting it will clearly increase t substantially. The overall average and standard deviation of the t* are 0.14 and 0.23, changing to 0.34 and 0.17 when case 2 is excluded. The evidence against zero partial correlation depends heavily on case 2. ■
Parametric case

In the parametric case different calculations are needed, because random samples from a case-deletion model are not simply an unweighted subset of the original bootstrap samples. Nevertheless, those original bootstrap samples can still be used if we make use of the following identity relating expectations under two different parameter values:

E{h(Y) | ψ′} = E[h(Y) f(Y | ψ′)/f(Y | ψ) | ψ].   (3.42)
Suppose that the full-data estimate (e.g. maximum likelihood estimate) of the model parameter is ψ̂, and that when case j is deleted the corresponding estimate is ψ̂_{−j}. The idea is to use (3.42) with ψ̂ and ψ̂_{−j} in place of ψ and ψ′
respectively. For example, the bias estimate with case j deleted is approximated by

B_{−j} = R^{−1} Σ_{r=1}^R t*_r f(y*_r | ψ̂_{−j}) / f(y*_r | ψ̂) − t_{−j},

where the samples y*_r are drawn from the full-data fitted model, that is with parameter value ψ̂. Similar weighted calculations apply to other features of the bootstrap distribution.
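A sketch of the reweighting for an exponential model with T the sample mean; here ymat is assumed to hold the R bootstrap samples (as rows) drawn from the fitted model with mean psi.hat, and all names are illustrative:

psi.hat <- mean(y)                         # full-data MLE
psi.del <- mean(y[-j])                     # MLE with case j deleted
loglik <- function(x, mu) sum(dexp(x, rate = 1/mu, log = TRUE))
w <- apply(ymat, 1, function(ys)           # ratios f(y* | psi.del)/f(y* | psi.hat)
  exp(loglik(ys, psi.del) - loglik(ys, psi.hat)))
tstar <- apply(ymat, 1, mean)
mean(w * tstar) - psi.del                  # reweighted approximation to B_{-j}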
3.10.2 Linearity

Statistical analysis is simplified when the statistic of interest T is close to linear. In this case the variance approximation v_L will be an accurate estimate of the bootstrap variance var(T* | F̂), and saddlepoint methods (Section 9.5) can be applied to obtain accurate estimates of the distribution of t*, without recourse to simulation. A linear statistic is not necessarily close to normally distributed, as Example 2.3 illustrates. Nor does linearity guarantee that T is directly related to a pivot and therefore useful in finding confidence intervals. On the other hand, experience from other areas in statistics suggests that these three properties will often occur together.

This suggests that we aim to find a transformation h(·) such that h(T) is well described by the linear approximation that corresponds to (2.35) or (3.1). For simplicity we focus on the single-sample case here. The shape of h(·) would be revealed by a plot of h(t) against t, but of course this is not available because h(·) is unknown. However, using Taylor approximation and (2.44) we do have

h(t*) ≈ h(t*_L) = h(t) + ḣ(t) n^{−1} Σ_{j=1}^n f*_j l_j = h(t) + ḣ(t)(t*_L − t),

where ḣ(t) = dh(t)/dt, so a plot of t* against t*_L should reveal the shape of h(·).
Example 3.25 (City population data) The top left panel of Figure 3.13 shows t*_L plotted against t* for 499 bootstrap replicates of the ratio t = x̄/ū for the data in Table 2.1. The plot is highly nonlinear, and the logarithmic transformation, or one even more extreme, seems appropriate. Note that the plot has shape similar to that for the empirical variance-stabilizing transformation in Figure 3.11.

For a parametric transformation, we try a Box–Cox transformation, h(t) = (t^λ − 1)/λ, with the value of λ estimated by maximizing the log likelihood for the regression of the h(t*_r) on the t*_{Lr}. This strongly suggests that we use λ = −2, for which the fitted curve is shown as the solid line on the plot. This is close to the result for a smoothing spline, shown as the dotted line. The top right panel shows the linear approximation for h(t*), i.e. h(t) + ḣ(t) n^{−1} Σ_{j=1}^n f*_j l_j, plotted against h(t*). This plot is close to the line with unit gradient, and confirms the results of the analysis of transformations.
The lower panels show related plots for the studentized bootstrap statistics on the original scale and on the new scale,

z* = (t* − t)/v*_L^{1/2},    z*_h = {h(t*) − h(t)}/{ḣ(t) v*_L^{1/2}},

where v*_L = n^{−2} Σ f*_j l_j^2. The left panel shows that, like t*, z* is far from linear. The lower right panel shows that the distribution of z*_h is fairly close to standard normal, though there are some outlying values. The distribution of z* is far from normal, as shown by the right panel of Figure 2.5. It seems that, here, the transformation that gives approximate linearity of t* also
makes the corresponding studentized bootstrap statistic roughly normal. The transformation based on the smoothing spline would give similar results. ■
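The diagnostic plot is easily produced with the boot library: empinf gives the empirical influence values l_j and boot.array the frequencies, from which the t*_L are assembled (boot.out is assumed to be a nonparametric bootstrap object):

library(boot)
L <- empinf(boot.out)                      # empirical influence values l_j
f <- boot.array(boot.out)
tL <- boot.out$t0[1] + f %*% L / ncol(f)   # linear approximations t*_L
plot(tL, boot.out$t[, 1], xlab = "tL*", ylab = "t*")  # curvature suggests transforming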
For most simple estimators we can use the nonparametric delta method variance estimates. But in general, and for more complicated problems, we use the bootstrap to implement this procedure. Thus we generate R bootstrap samples, compute the estimates t*_r(1), ..., t*_r(K) for each sample, and then choose t to be that t(i) for which the bootstrap estimate of variance

v(i) = (R − 1)^{−1} Σ_{r=1}^R {t*_r(i) − t̄*(i)}^2

is smallest. The selection itself induces a bias in the chosen estimator; whether or not this is serious depends on the context. However, the bias can be adjusted for by bootstrapping the whole procedure, as follows.
Let y*_1, ..., y*_n be one of the R simulated samples. Suppose that we apply the procedure for choosing among T(1), ..., T(K) to this bootstrap sample. That is, we generate M samples with equal probability from y*_1, ..., y*_n, and calculate the estimates t**_m(1), ..., t**_m(K) for the mth such sample. Then choose the estimator t* with the smallest estimated variance

v*(i) = (M − 1)^{−1} Σ_{m=1}^M {t**_m(i) − t̄**(i)}^2.

Doing this for each of the R samples y*_1, ..., y*_n gives t*_1, ..., t*_R, and the empirical distribution of the t* − t values approximates the distribution of T − θ. For example, v = (R − 1)^{−1} Σ (t*_r − t̄*)^2 estimates the variance of T, and by accounting for the selection bias should be more accurate than v(i).
There are two byproducts of this double bootstrap procedure. One is information on how well-determined is the choice of estimator, if this is of interest, simply by examining the relative frequency with which each estimator is chosen. Secondly, the bias of v(i) can be approximated: on the log scale the bias is estimated by R^{−1} Σ log v*_r − log v, where v*_r is the smallest value of the v*_r(i)s in the rth bootstrap sample.
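A sketch of the whole selection procedure applied to each bootstrap sample, with the K candidate estimators supplied as a list of functions (the candidates shown in the usage line are illustrative):

select.boot <- function(y, tfuns, R = 200, M = 100)
{ pick <- function(x) {                    # apply the selection rule to a sample x
    vs <- sapply(tfuns, function(tf)
      var(replicate(M, tf(sample(x, replace = TRUE)))))
    i <- which.min(vs)
    c(tfuns[[i]](x), i) }
  out <- replicate(R, pick(sample(y, replace = TRUE)))
  list(tstar = out[1, ], how.often = table(out[2, ])) }  # t*_r and choice frequencies

# e.g. select.boot(y, list(mean, function(x) mean(x, trim = 0.25)))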
The candidate estimators here are the trimmed averages

t(k) = (n − 2k)^{−1} Σ_{j=k+1}^{n−k} y_{(j)},

which are averages after dropping the k smallest and k largest order statistics y_{(j)}. The usual average and sample median correspond respectively to k = 0 and k = ½(n − 1). The left panel of Figure 3.14 plots the trimmed averages against k. The mild downward trend in the plot suggests slight asymmetry of the data distribution. Our aim is to use the bootstrap to choose among the trimmed averages.

The trimmed averages will all be unbiased if the underlying data distribution is symmetric, and estimator variance will then be a sensible criterion on which to base choice. The bootstrap procedure must build in the assumed symmetry,
Figure 3.14 Trimmed averages and their estimated variances and mean squared errors for the pooled gravity data, based on R = 1000 bootstrap samples, using the ordinary bootstrap (•) and the symmetric bootstrap (○).
and this can be done (cf. Example 3.4) by simulating samples from a symmetrized version of F̂ such as

F̂_sym(y) = ½{F̂(y) + 1 − F̂(2μ̂ − y − 0)},

where μ̂ is the estimated centre of symmetry; the corresponding estimated mean squared errors are

mse(i) = R^{−1} Σ_{r=1}^R {t*_r(i) − ȳ}^2.

We generated R = 1000 samples y*_1, ..., y*_n from F̂_sym. To each of these samples we then applied the original symmetric bootstrap procedure, generating M = 100 samples of size n = 81 from the symmetrized EDF of y*_1, ..., y*_n, choosing t* to be that one of the 11 trimmed averages with smallest value of v*(i). The variance v of t*_1, ..., t*_R equals 0.356, which is 10% larger than the original minimum variance. If we use this variance with a normal approximation to calculate a 95% confidence interval centred on t, the interval is [77.16, 79.50]. This is very similar to the intervals obtained in Example 3.2.
The frequencies with which the different trimming proportions are chosen are:

k          12  16  20  24  28  32  36  40
Frequency   1  25  54  96 109 131 498  86

Thus when symmetry of the underlying distribution is assumed, a fairly heavy degree of trimming seems desirable for these data, and the value k = 36 actually chosen seems reasonably well-determined. ■
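Sampling from the symmetrized EDF amounts to resampling and then reflecting each value about the centre with probability one-half; a minimal sketch, with mu the estimated centre of symmetry:

rsym <- function(n, y, mu = mean(y))
{ ystar <- sample(y, n, replace = TRUE)
  s <- sample(c(-1, 1), n, replace = TRUE)   # random reflections about mu
  mu + s*(ystar - mu) }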
Efron (1979, 1982) suggested and studied empirically the use of smooth versions of the EDF, but the first systematic investigation of smoothed bootstraps was by Silverman and Young (1987). They studied the circumstances in which smoothing is beneficial for statistics for which there is a linear approximation. Hall, DiCiccio and Romano (1989) show that when the quantity of interest depends on a local property of the underlying CDF, as do quantiles, smoothing can give worthwhile theoretical reductions in the size of the mean squared error. Similar ideas apply to more complex situations such as L_1 regression (De Angelis, Hall and Young, 1993); see however the discussion in Section 6.5. De Angelis and Young (1992) give a useful review of bootstrap smoothing, and discuss the empirical choice of how much smoothing to apply. See also Wang (1995). Romano (1988) describes a problem — estimation of the mode of a density — where the estimator is undefined unless the EDF is smoothed; see also Silverman (1981). In a spatial data problem, Kendall and Kendall (1980) used a form of bootstrap that jitters the observed data, in order to keep the rough configuration of points constant over the simulations; this amounts to sampling without replacement when applying the smoothed bootstrap. Young (1990) concludes that although this approach can outperform the unsmoothed bootstrap, it does not perform so well as the smoothed bootstrap described in Section 3.4.
General discussions of survival data can be found in the books by Cox and Oakes (1984) and Kalbfleisch and Prentice (1980), while Fleming and Harrington (1991) and Andersen et al. (1993) give more mathematical accounts. The product-limit estimator was derived by Kaplan and Meier (1958): it and variants are widely used in practice.

Efron (1981a) proposed the first bootstrap methods for survival data, and discussed the relation between traditional and bootstrap standard errors for the product-limit estimator. Akritas (1986) compared variance estimates for the median survival time from Efron's sampling scheme and a different approach of Reid (1981), and concluded that Efron's scheme is superior. The conditional method outlined in Section 3.5 was suggested by Hjort (1985), and subsequently studied by Kim (1990), who concluded that it estimates the conditional variance of the product-limit estimator somewhat better than does resampling cases. Doss and Gill (1992) and Burr and Doss (1993) give weak convergence results leading to confidence bands for quantiles of the survival time distribution. The asymptotic behaviour of parametric and nonparametric bootstrap schemes for censored data is described by Hjort (1992), while Andersen et al. (1993) discuss theoretical aspects of the weird bootstrap.

The general approach to missing-data problems via the EM algorithm is discussed by Dempster, Laird and Rubin (1977). Bayesian methods using multiple imputation and data augmentation are described by Tanner and Wong (1987)
3.13 Problems

1 In a two-sample problem, with data y_ij, j = 1, ..., n_i, i = 1, 2, giving sample averages ȳ_i and variances v_i, describe models for which it would be appropriate to resample the following quantities:
(a) e_ij = y_ij − ȳ_i,
(b) e_ij = (y_ij − ȳ_i)/(1 + n_i^{−1})^{1/2},
(c) e_ij = (y_ij − ȳ_i)/{v_i(1 + n_i^{−1})}^{1/2},
(d) e_ij = ±(y_ij − ȳ_i)/{v_i(1 + n_i^{−1})}^{1/2}, where the signs are allocated with equal probabilities,
(e) e_ij = y_ij/ȳ_i.
In each case say how a simulated dataset would be constructed.
What difficulties, if any, would arise from replacing ȳ_i and v_i by more robust estimates of location and scale?
(Sections 3.2, 3.3)
4 Spherical data y_1, ..., y_n are points on the sphere of unit radius. Suppose that it is assumed that these data come from a distribution that is symmetric about the unknown mean direction μ. In light of the symmetry assumption, what would be an appropriate resampling algorithm for simulating data y*_1, ..., y*_n?
(Section 3.3; Ducharme et al., 1985)
6 The empirical influence values can be calculated more directly as follows. Consider only distributions supported on the data values, with probabilities p_i = (p_{i1}, ..., p_{in_i}) on the values in the ith sample for i = 1, ..., k. Then write T = t(p̂_1, ..., p̂_k), so that t = t(p^0_1, ..., p^0_k) with p^0_i = (n_i^{−1}, ..., n_i^{−1}). Show that the empirical influence value l_ij corresponding to the jth case in sample i is given by

l_ij = ∂/∂ε t{p^0_1, ..., (1 − ε)p^0_i + ε1_j, ..., p^0_k} |_{ε=0},

where 1_j is the vector with 1 in the jth position and zeroes elsewhere.
(Section 3.2.1)
7 Following on from the previous problem, re-express t(p_1, ..., p_k) as a function u(π) of a single probability vector π = (π_{11}, ..., π_{1n_1}, ..., π_{kn_k}). For example, for the ratio of means of two independent samples, t = ȳ_2/ȳ_1. Express the empirical influence values for u(·) in terms of the vectors 1_ij, where 1_ij is the vector with 1 in the (n_{i−1} + j)th position, with n_0 = 0, and zeroes elsewhere. One consequence of this is that v_L = n^{−2} Σ_{i,j} l̃^2_ij.
Apply these calculations to the ratio t = ȳ_2/ȳ_1.
(Section 3.2.1)
Show that the kernel-smoothed distribution with density

g̃_h(x) = (nhb)^{−1} Σ_{j=1}^n w{(x − a − bx_j)/(hb)}

will have the same first two moments as the EDF if a = (1 − b)x̄ and b = {1 + nh^2/Σ(x_j − x̄)^2}^{−1/2}. What algorithm simulates from this smoothed EDF?
(d) Discuss the special problems that arise from using g̃_h(x) when the range of x is [0, ∞) rather than (−∞, ∞).
(e) Extend the algorithms in (b) and (c) to multivariate x.
(Section 3.4; Silverman and Young, 1987; Wand and Jones, 1995)
9 Consider resampling cases from censored data (y_1, d_1), ..., (y_n, d_n), where y_1 < ··· < y_n. Let f*_j denote the number of times that (y_j, d_j) occurs in an ordinary bootstrap sample, and let S*_j = f*_1 + ··· + f*_j.
(a) Show that when there is no censoring, the product-limit estimate puts mass n^{−1} on each observed failure y_1 < ··· < y_n, so that F̂^0 = F̂.
(b) Show that if B(m, p) denotes the binomial distribution with index m and probability p, then

var{1 − F̂^{0*}(y)} = {1 − F̂^0(y)}^2 Σ_{j: y_j ≤ y} d_j/(n − j + 1)^2.

This equals the variance from Greenwood's formula, (3.10), apart from replacement of (n − j + 1)^2 by (n − j)(n − j + 1).
(Section 3.5; Efron, 1981a; Cox and Oakes, 1984, Section 4.3)
10 Consider the weird bootstrap applied to a homogeneous sample of censored data, (y_1, d_1), ..., (y_n, d_n), in which y_1 < ··· < y_n. Let dÂ^{0*}(y_j) = N*_j/(n − j + 1), where the N*_j are independent binomial variables with denominators n − j + 1 and probabilities d_j/(n − j + 1).
(a) Show that the total number of failures under this resampling scheme is distributed as a sum of independent binomial observations.
(b) Show that
12 (a) Establish (3.15), and show that the sample variance c is an unbiased estimate of γ.
(b) Now suppose that N = kn for some integer k. Show that under the population bootstrap,
(c) In the context of Example 3.16, suppose that the parameter of interest is a nonlinear function of θ, say η = g(θ), which is estimated by g(T). Use the delta method to show that the bias of g(T) is roughly ½g″(θ)var(T), and that the bootstrap bias estimate is roughly ½g″(t)var*(T*). Under what conditions on n and N does the bootstrap bias estimate converge to the true bias?
(Section 3.7; Bickel and Freedman, 1984; Booth, Butler and Hall, 1994)
13 To model the superpopulation bootstrap, suppose that the original data are y_1, ..., y_n and that 𝒴* contains M_j copies of y_j, j = 1, ..., n; the joint distribution of the M_j is multinomial with probabilities n^{−1} and denominator N. If Y*_1, ..., Y*_n are sampled without replacement from 𝒴* and if Ȳ* = n^{−1} Σ Y*_j, show that

where k′ = [k] is the integer part of k. Show that if the mirror-match algorithm is applied for an average Ȳ* with this distribution for X*, var**(Ȳ**) = (1 − m/n)c/(mk). Show also that under mirror-match resampling, with the simplifying assumption that randomization is not required because k is an integer,

E*(C*) = c{1 − (k − 1)/(k(n − 1))},

where C* is the sample variance of the Y*_j.
What implications are there for variance estimation for more complex statistics?
(Section 3.7; Sitter, 1992)
15 Suppose that n is a large even integer and that N = 5n/2, and that instead of applying the population bootstrap we choose a population from which to resample according to

Having selected 𝒴* we take a sample Y*_1, ..., Y*_n from it without replacement and calculate Z* = (Ȳ* − ȳ){(1 − f′)n^{−1}c}^{−1/2}. Show that if f′ = n/N the approximate distribution of Z* is the normal mixture (2/5)N(0, 5/6) + (3/5)N(0, 10/9), but that if f′ = n/#{𝒴*} the approximate distribution of Z* is N(0, 1). Check that in the first case E*(Z*) = 0 and var*(Z*) = 1. (#{A} is the number of elements in the set A.)
Comment on the implications for the use of randomization in finite population resampling.
(Section 3.7; Bickel and Freedman, 1984; Presnell and Booth, 1994)
16 Suppose that we have data y_1, ..., y_n, and that the bootstrap sample is taken to be

Y*_j = ȳ + d(y_{I*_j} − ȳ),    j = 1, ..., n,
18 For the model of Problem 3.17, define estimates of the x_i s and z_ij s by

Show that the EDFs of the x̂_i and ẑ_ij have first two moments which are unbiased for the corresponding moments of the Xs and Zs if

(Section 3.8)
19 Consider the double bootstrap procedure for adjusting the estimated bias of T, as described in Section 3.9, when T is the average Ȳ. Obtain the variance of the simulation error for the adjusted bias estimate B − C.
(a) If F_κ denotes the gamma distribution with index κ and unit mean, show that t_p(F_κ) = κ(1 − 2p)^{−1}{F_{κ+1}(y_{κ,1−p}) − F_{κ+1}(y_{κ,p})}, where y_{κ,p} is the p quantile of F_κ. Hence evaluate t_p(F_κ) for κ = 1, 2, 5, 10 and p = 0, 0.1, 0.2, 0.3, 0.4, 0.5.
(b) Suppose that the parameter of interest, θ = Σ_{i=1}^k c_i t_p(F_{i,κ_i}), depends on several gamma distributions F_{i,κ_i}. Let F̂_i denote the EDF of a sample of size n_i from F_{i,κ_i}. Under what circumstances is T = Σ_{i=1}^k c_i t_p(F̂_i) (i) unbiased, (ii) nearly unbiased, as an estimate of θ? Test your conclusions by a small simulation experiment.
(Section 3.11)
3.14 Practicals

1 To perform the analysis for the gravity data outlined in Example 3.2:

grav.fun <- function(data, i)
{ d <- data[i, ]                        # resampled rows
  m <- tapply(d$g, d$series, mean)
  v <- tapply(d$g, d$series, var)
  n <- table(d$series)
  c(sum(m*n/v)/sum(n/v), 1/sum(n/v)) }
grav.boot <- boot(gravity, grav.fun, R=200, strata=gravity$series)

Plot the estimate and its variance. Is the simulation well-behaved? How normal are the bootstrapped estimates and studentized bootstrap statistics?
Now for a semiparametric analysis, as suggested in Section 3.3:

attach(gravity)
n <- table(series)
m <- rep(tapply(g, series, mean), n)
s <- rep(sqrt(tapply(g, series, var)), n)
res <- (g - m)/s                        # standardized residuals
qqnorm(res); abline(0, 1, lty=2)
grav <- data.frame(m, s, series, res)
grav.fun <- function(data, i)
{ e <- data$res[i]
  y <- data$m + data$s*e                # reconstruct responses from residuals
  m <- tapply(y, data$series, mean)
  v <- tapply(y, data$series, var)
  n <- table(data$series)
  c(sum(m*n/v)/sum(n/v), 1/sum(n/v))
}
grav1.boot <- boot(grav, grav.fun, R=200)

Do the residuals res for the different series look similar? Compare the values of t* and v* for the two sampling schemes. Compare also 80% confidence intervals for g.
(Section 3.2)
2 Dataframe channing contains data on the survival of 97 men and 365 women in a retirement home in California. The variables are sex, ages in months at which individuals entered and left the home, the time in months they spent there, and a censoring indicator (0/1 denoting censored due to leaving the home/died there). For details see Hyde (1980). We compare the variability of the survival probabilities at 75 and 85 years (900 and 1020 months), and of the estimated 0.75 and 0.5 quantiles of the survival distribution.
3 To study the performance of censored data resampling schemes when the censoring pattern is fixed, we perform a small simulation study. We apply a fixed censoring pattern to samples of size 50 from the unit exponential distribution, and for each sample we calculate t = (t_1, t_2), where t_1 is the maximum likelihood estimate of the distribution mean and t_2 is the number of censored observations. We apply each bootstrap scheme to the sample, and record the mean and standard deviation of t from the bootstrap simulation. (This is quite time-consuming: take nreps and R as big as you dare.)

  res <- c(apply(ord.boot$t, 2, mean),
           apply(con.boot$t, 2, mean),
           apply(wei.boot$t, 2, mean),
           sqrt(apply(ord.boot$t, 2, var)),
           sqrt(apply(con.boot$t, 2, var)),
           sqrt(apply(wei.boot$t, 2, var)))
  results <- rbind(results, res) }

The estimated bias and standard deviation of t_1, and the bootstrap bias estimates, are

mean(results[, 1]) - 1
sqrt(var(results[, 1]))
bias.o <- results[, 3] - results[, 1]
bias.c <- results[, 5] - results[, 1]
bias.w <- results[, 7] - results[, 1]

How do they compare? What about the estimated standard deviations? How do the numbers of censored observations vary under the schemes?
(Section 3.5; Efron, 1981a; Burr, 1994)
4 The tau particle is a heavy electron-like particle which decays into various collections of other charged particles shortly after its production. The decay usually involves one charged particle, in which case it can happen in a number of modes, the main four of which are labelled ρ, π, e, and μ. It takes a major research project to measure the rate of occurrence of single-particle decay, decay_1, or any of its component rates decay_ρ, decay_π, decay_e, and decay_μ, and just one of these can be measured in any one experiment. Thus the data in dataframe tau on decay rates for 60 experiments represent several years of work. Here we use them to estimate and form a confidence interval for the parameter

θ = decay_1 − decay_ρ − decay_π − decay_e − decay_μ.

Suppose that we had thought of using the 0, 12.5, 25, 37.5 and 50% trimmed averages to estimate the difference. To calculate these and to obtain bootstrap confidence intervals for the estimates of θ:

Now suppose that we want to choose the estimator from the data, by taking the trimmed average with smallest variance. For the original data this is the 25% trimmed average, so the estimate is 16.87. Its variance can be estimated by a double bootstrap, which we can implement as follows:

To see what degrees of trimming give the smallest variances, and to calculate the corresponding estimates and obtain their variance:

i <- matrix(1:5, 5, tau.boot2$R)
i <- i[t(tau.boot2$t[, 6:10] == apply(tau.boot2$t[, 6:10], 1, min))]
table(i)
t.best <- tau.boot2$t[cbind(1:tau.boot2$R, i)]
var(t.best)
(a) The value of the correlation is t = 0.83. Will it increase or decrease if observation 7 is deleted from the sample? (Be careful.) What is the effect on t of deleting observation 6?
(b) What happens to the bootstrap distribution of t* − t when observation 8 is deleted from the sample? What about observation 6?
(c) Show that the probability that neither observation 5 nor observation 6 is in a bootstrap sample is (1 − 2/n)^n ≈ 0.11. Now suppose that observation 5 is deleted, and calculate the probability that observation 6 is not in a bootstrap sample. Does this explain what happens in (b)?
6 Suppose that we are interested in the largest eigenvalue of the covariance matrix of the baseline and one-year CD4 counts in cd4; see Practical 2.3. To calculate this and its approximate variance using the nonparametric delta method (Problem 2.14), and to bootstrap it:

split.screen(c(1, 2))
screen(1); split.screen(c(2, 1))
screen(3)
plot(cd4.boot$t[, 1], cd4.boot$t[, 2], xlab="t*", ylab="vL*", pch=".")
screen(4)
plot(cd4[, 1], cd4[, 2], type="n", xlab="baseline",
     ylab="one year", xlim=c(1, 7), ylim=c(1, 7))
text(cd4[, 1], cd4[, 2], c(1:20), cex=0.7)
screen(2); jack.after.boot(cd4.boot, useJ=F, stinf=F)

What is going on here?
(Section 3.10.1; Canty, Davison and Hinkley, 1996)
4
Tests
4.1 Introduction

Many statistical applications involve significance tests to assess the plausibility of scientific hypotheses. Resampling methods are not new to significance testing, since randomization tests and permutation tests have long been used to provide nonparametric tests. Also Monte Carlo tests, which use simulated datasets, are quite commonly used in certain areas of application. In this chapter we describe how resampling methods can be used to produce significance tests, in both parametric and nonparametric settings. The range of ideas is somewhat wider than the direct bootstrap approach introduced in the preceding two chapters. To begin with, we summarize some of the key ideas of significance testing.

The simplest situation involves a simple null hypothesis H_0 which completely specifies the probability distribution of the data. Thus, if we are dealing with a single sample y_1, ..., y_n from a population with CDF F, then H_0 specifies that F = F_0, where F_0 contains no unknown parameters. An example would be "exponential with mean 1". The more usual situation in practice is that H_0 is a composite null hypothesis, which means that some aspects of F are not determined and remain unknown when H_0 is true. An example would be "normal with mean 1", the variance of the normal distribution being unspecified.
P-values

A statistical test is based on a test statistic T which measures the discrepancy between the data and the null hypothesis. In general discussion we shall follow the convention that large values of T are evidence against H_0. Suppose for the moment that this null hypothesis is simple. If the observed value of the test statistic is denoted by t, then the level of evidence against H_0 is measured by the P-value

p = Pr(T ≥ t | H_0),   (4.1)

and if H_0 is true then, for continuous T, P is uniformly distributed on (0, 1), that is

Pr(P ≤ p | H_0) = p.   (4.2)

This yields the error rate interpretation of the P-value, namely that if the observed test statistic were regarded as just decisive against H_0, then this is equivalent to following a procedure which rejects H_0 with error rate p. The same is not exactly true if T is discrete, and for this reason modifications to (4.1) are sometimes suggested for discrete data problems: we shall not worry about the distinction here.

It is important in applications to give a clear idea of the degree of discrepancy between data and null hypothesis, if not giving the P-value itself then at least indicating how it compares to several levels, say p = 0.10, 0.05, 0.01, rather than just testing at the 0.05 level.
In the parametric setting the test statistic is often based on the likelihood

L(θ) = f_{Y_1,...,Y_n}(y_1, ..., y_n | θ),

with additional parameters introduced to represent departure from the original model. We would then test those additional parameters. Otherwise general-purpose goodness-of-fit tests will be used, for example chi-squared tests.
In the nonparametric setting, no particular forms are specified for the distributions. Then the appropriate choice of T is less clear, but it should be based on at least a qualitative notion of what is of concern should H_0 not be true. Usually T would be based on a statistical function s(F) that reflects the characteristic of physical interest and for which the null hypothesis specifies a value. For example, suppose that we wish to test the null hypothesis H_0 that X and Y are independent, given the random sample (X_1, Y_1), ..., (X_n, Y_n). The correlation s(F) = corr(X, Y) = ρ is a convenient measure of dependence, and ρ = 0 under H_0. If the alternative hypothesis is positive dependence, then a natural test statistic is T = s(F̂), the raw sample correlation; if the alternative hypothesis is just "dependence", then the two-sided test statistic T = s^2(F̂) could be used.
Conditional tests

In most parametric problems and all nonparametric problems, the null hypothesis H_0 is composite, that is it leaves some parameters unknown and therefore does not completely specify F. Therefore P-value (4.1) is not generally well-defined, because Pr(T ≥ t | F) may depend upon which F satisfying H_0 is taken. There are two clean solutions to this difficulty. One is to choose T carefully so that its distribution is the same for all F satisfying H_0: examples include the Student-t test for a normal mean with unknown variance, and rank tests for nonparametric problems. The second and more widely applicable solution is to eliminate the parameters which remain unknown when H_0 is true by conditioning on the sufficient statistic under H_0. If this sufficient statistic is denoted by S, then we define the conditional P-value by

p = Pr(T ≥ t | S = s, H_0).   (4.4)

Familiar examples include the Fisher exact test for a 2 × 2 table and the Student-t test mentioned earlier. Other examples will be given in the next two sections.

A less satisfactory approach, which can nevertheless give good approximations, is to estimate F by a CDF F̂_0 which satisfies H_0 and then calculate

p = Pr(T ≥ t | F̂_0).   (4.5)

Typically this value will not satisfy (4.2) exactly, but will deviate by an amount which may be practically negligible.
Pivot tests

When the null hypothesis concerns a particular parameter value, the equivalence between significance tests and confidence sets can be used. In terms of the studentized quantity Z = (T − θ)/V^{1/2} and its null hypothesis value z_0 = (t − θ_0)/v^{1/2}, the test rejects for large z_0, and therefore

p = Pr(Z ≥ z_0 | F̂).   (4.6)
statistic there is a simple approximation to the null distribution, and modifications to improve the approximation in moderate-sized samples. The likelihood ratio method appears limited to parametric problems, but as we shall see in Chapter 10 it is possible to define analogues in the nonparametric case.

With all of the P-value calculations introduced thus far, simple approximations for p exist in many cases by appealing to limiting results as n increases. Part of the purpose of this chapter is to provide resampling alternatives to such approximations when they either fail to give appropriate accuracy or do not exist at all. Section 4.2 discusses ways in which resampling and simulation can help with parametric tests, starting with exact Monte Carlo tests. Section 4.3 briefly reviews permutation and randomization tests. This leads on to the wider topic of nonparametric bootstrap tests in Section 4.4. Section 4.5 describes a simple method for improving P-values when these are biased. Most of the examples in this chapter involve relatively simple applications. Chapters 6 and beyond contain more substantial applications.
Table 4.1 n = 50 counts of balsam-fir seedlings in five-feet-square quadrats.

0 1 2 3 4 3 4 2 2 1
0 2 0 2 4 2 3 3 4 2
1 1 1 1 4 1 5 2 2 3
4 1 2 5 2 0 3 2 1 1
3 1 4 3 1 0 0 2 7 0
It seems intuitively clear that the sensitivity of the Monte Carlo test increases with R. We shall discuss this issue later, but for now we note that it is advisable to take R to be at least 99.

There are two important aspects of the Monte Carlo test which make it widely useful. The first is that we only need to be able to simulate data under the null hypothesis, this being relatively simple even in some very complicated problems, such as those involving spatial processes (Chapter 8). Secondly, t*_1, ..., t*_R do not need to be independent outcomes: the method remains valid so long as they are exchangeable outcomes, which is to say that the joint density of T, T*_1, ..., T*_R under H_0 is invariant under permutation of its arguments. This allows us to apply Monte Carlo tests to quite complicated problems, as we see next.
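In code, the Monte Carlo P-value (4.11) is a one-liner once a null simulator is available. The sketch below uses the dispersion index of counts as an illustrative statistic, redistributing the counts uniformly given their total so that the test is exact under a Poisson null; none of the names are tied to a particular example:

mc.test <- function(y, tfun, rnull, R = 999)
{ tstar <- replicate(R, tfun(rnull(y)))
  (1 + sum(tstar >= tfun(y)))/(R + 1) }    # p = {1 + #(t* >= t)}/(R + 1)

# e.g. dispersion index, simulating tables conditional on the total:
# mc.test(counts, function(x) (length(x) - 1)*var(x)/mean(x),
#         function(x) c(rmultinom(1, sum(x), rep(1, length(x)))))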
One quite general strategy is to set up a Markov chain whose equilibrium distribution is the null distribution of the data, and to generate each y* by an independent simulation of N steps with the same initial state x. If the Markov chain has equilibrium distribution equal to the null hypothesis distribution of Y = (Y_1, ..., Y_n), then y and the R replicates of y* are exchangeable outcomes under H_0 and (4.11) applies.

Suppose that under H_0 the data have joint density f_0(y) for y ∈ 𝒮, where both f_0 and 𝒮 are conditioned on the sufficient statistic s if we are dealing with a conditional test. For simplicity suppose that 𝒮 has |𝒮| elements, which we now regard as possible states labelled 1, 2, ..., |𝒮| of a Markov chain {Z_t, t = ..., −1, 0, 1, ...} in discrete time. Consider the data y to be one realization of Z_N. We then have to fix an appropriate value or state for Z_0, and with this initial state simulate the R independent values of Z_N, which are the R values of Y*. The Markov chain is defined so that f_0 is the equilibrium distribution, which can be enforced by appropriate choice of the one-step forward transition probability matrix Q, say, with elements q(u, u′) = Pr(Z_{t+1} = u′ | Z_t = u). The first part of the simulation runs the chain backwards N steps from Z_N = y, using the reversed transition probabilities

Pr(Z_t = u | Z_{t+1} = u′) = f_0(u) q(u, u′)/f_0(u′).   (4.12)

Let the final state, the realized value of Z_0, be x. Note that if H_0 is true, so that y was indeed sampled from f_0, then Pr(Z_0 = x) = f_0(x). In the second part of the simulation, which we repeat independently R times, we simulate N forward steps of the Markov chain, starting in state x and ending up in state y* = (y*_1, ..., y*_n). Since under H_0 the chain starts in equilibrium,
Pr(Y* = y′ | H_0) = Pr(Z_N = y′) = f_0(y′).

Further, the joint density of the data and the simulated series is

f(y, y*_1, ..., y*_R | H_0) = f_0(y) Σ_x Pr(Z_0 = x | Z_N = y) ∏_{r=1}^R Pr(Z_N = y*_r | Z_0 = x),

using the independence of the replicate simulations from x. But by the definition of the first part of the simulation, where (4.12) applies,

Pr(Z_0 = x | Z_N = y) = f_0(x) Pr(Z_N = y | Z_0 = x)/f_0(y),

and so the joint density is symmetric in y, y*_1, ..., y*_R: the outcomes are exchangeable under H_0, as required. The t*_r are nevertheless correlated with each other and with t, because the simulations share the starting value x. This correlation depends upon the particular construction of Q, and reduces to zero at a rate which depends upon Q as N increases. While the correlation does not affect the validity of the P-value calculation, it does affect the power of the test: the higher the correlation, the lower the power.
Example 4.3 (Logistic regression) We return to the problem of Example 4.1, which provides a very simple if artificial illustration. The data y are a binary sequence of length n with s ones, and calculations are to be conditional on Σ Y_j = s. Recall that direct Monte Carlo simulation is possible, since all (n choose s) possible data sequences are equally likely under the null hypothesis of constant probability of a unit response.

One simple Markov chain has one-step transitions which select a pair of subscripts i, j at random, and switch y_i and y_j. Clearly the chain is irreducible, since one can progress from any one binary sequence with s ones to any other. All ratios of null probabilities f_0(u)/f_0(u′) are equal to one, since all binary sequences with s ones are equally probable. Therefore if we run the Metropolis algorithm, all switches are accepted. But note that this Markov chain, while simple to implement, is inefficient and will require a large number of steps to induce approximate independence of the t*s. The most effective Markov chain would have one-step transitions which are random permutations, and for this only one step would be required. ■
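The inefficient switch chain of this example is nevertheless trivial to code; a sketch (N steps; every proposal is accepted because the null probabilities are equal):

switch.chain <- function(y, N)
{ for (s in 1:N) {
    ij <- sample(length(y), 2)     # choose a pair of positions at random
    y[ij] <- y[rev(ij)]            # switch them; always accepted under H0
  }
  y }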
Example 4.4 (AML data) For data such as those in Example 3.9, consider testing the null hypothesis of proportional hazard functions. Denote the failure times by z_1 < z_2 < ··· < z_n, assuming no ties for the moment, and define r_ij to be the number in group i who were at risk just prior to z_j. Further, let y_j be 0 or 1 according as the failure at z_j is in group 1 or 2, and denote the hazard function at time z for group i by h_i(z). Then
z      5    5    8    8    9    12   13   18   23   23   27   30   31   33   34   43   45   48
r_1j   11   11   11   11   11   10   10   8    7    7    6    5    5    4    4    3    3    2
r_2j   12   11   10   9    8    8    7    6    6    5    5    4    3    3    2    2    1    0
a_j  11/12   1  11/10 11/9 11/8 10/8 10/7 8/6  7/6  7/5  6/5  5/4  5/3  4/3   2   3/2   3    ∞
y      1    1    1    1    0    1    0    0    1    0    1    1    0    1    0    1    1    0

where a_j = r_1j/r_2j.
Note that because a_18 = ∞, y_18 must be 0 whatever the value of θ, and so this final response is uninformative. We therefore drop y_18 from the analysis. Having done this, we see that under H_0 the sufficient statistic for the common hazard ratio θ is S = Σ Y_j, whose observed value is s = 11.
Whatever the test statistic T, the exact conditional P-value (4.4) must be approximated. Direct simulation appears impossible, but a simple Markov chain simulation is possible. First, the state space of the chain is 𝒴 = {x = (x_1, ..., x_n) : Σ x_j = s}, that is all permutations of y_1, ..., y_n. For any two vectors x and x̃ in the state space, the ratio of null conditional joint probabilities is

p(x | s, θ) / p(x̃ | s, θ) = ∏_j a_j^{x̃_j − x_j}.

We take the carrier Markov chain to have one-step transitions which are random permutations: this guarantees fast movement over the state space. A step which moves from x to x̃ is then accepted with probability min{1, ∏_j a_j^{x_j − x̃_j}}. By symmetry the reverse chain is defined in exactly the same way.
The test statistic must be chosen to match the particular alternative hypothesis thought relevant. Here we suppose that the alternative is a monotone ratio of hazards, for which T = Σ_j Y_j log(z_j) seems to be a reasonable choice. The Markov chain simulation is applied with N = 100 steps back to give the initial state x and 100 steps forward to state y*, the latter repeated R = 99 times. Of the resulting t* values, 48 are less than or equal to the observed value t = 17.75, so the P-value is (1 + 48)/(1 + 99) = 0.49. Thus there appears to be no evidence against the proportional hazards model.

The average acceptance probability in the Metropolis algorithm is approximately 0.7, and results for N = 10 and N = 1000 appear indistinguishable from those for N = 100. This indicates unusually fast convergence for applications of the Markov chain method. ■
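A sketch of this sampler in S-style code may help (names are ours; a is the vector of ratios a_j with the uninformative final case dropped):

metropolis.step <- function(x, a)
{ xnew <- sample(x)                 # random permutation proposal
  ratio <- prod(a^(x - xnew))       # p(xnew | s) / p(x | s)
  if (runif(1) < min(1, ratio)) xnew else x }
run.chain <- function(x, a, N)
{ for (i in 1:N) x <- metropolis.step(x, a)
  x }

Because the proposal distribution is symmetric, run.chain can be used for both the backward and the forward parts of the simulation.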
the null model F̂_0 and use (4.5) to compute the P-value, i.e. p = Pr(T ≥ t | F̂_0). For example, for the parametric model where we are testing H_0 : ψ = ψ_0 with λ a nuisance parameter, F̂_0 would be the CDF of f(y | ψ_0, λ̂_0) with λ̂_0 the maximum likelihood estimator (MLE) of the nuisance parameter when ψ is fixed equal to ψ_0. Calculation of the P-value by (4.5) is referred to as a bootstrap test.

If (4.5) cannot be computed exactly, or if there is no satisfactory approximation (normal or otherwise), then we proceed by simulation. That is, R independent replicate samples y*_1, ..., y*_R are drawn from F̂_0, and for the rth such sample the test statistic value t*_r is calculated. Then the significance probability (4.5) will be approximated by

p_boot = (1 + #{t*_r ≥ t}) / (R + 1).     (4.13)

Ordinarily one would use a simple proportion here, but we have chosen to make the definition match that for the Monte Carlo test in (4.11).
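In outline, the computation of (4.13) is as follows (a generic sketch; rfit, which draws one sample from the fitted null model, and the other names are ours):

boot.pvalue <- function(y, tfun, rfit, R = 999)
{ t.obs <- tfun(y)
  tstar <- replicate(R, tfun(rfit(y)))   # R samples from the fitted null model
  (1 + sum(tstar >= t.obs)) / (R + 1) }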
null distribution of T, but this is often quite unreliable except for very large n. The parametric bootstrap provides a more reliable and simple option.

The parametric bootstrap works as follows. We generate R samples of size n by random sampling from the fitted null model f_0(y | η̂). For each sample we calculate estimates η̂* and ζ̂* by maximizing the simulated log likelihoods under the two models

f_0(y | η) = (κ/μ)^κ y^{κ−1} exp(−κy/μ) / Γ(κ),   f_1(y | ζ) = (βy)^{−1} φ{(log y − α)/β},   y > 0,

where η = (κ, μ) and ζ = (α, β). The MLE κ̂ satisfies log κ̂ − h(κ̂) = log ȳ − ℓ̄, with h(κ) = d log Γ(κ)/dκ, the digamma function, and ℓ̄ and s²_{log y} denoting the average and sample variance of the log y_j. The MLEs of the mean and variance of the normal distribution for log Y are α̂ = ℓ̄ = 3.829 and β̂² = (n − 1)s²_{log y}/n = 2.339. The test statistic (4.14) is the average log likelihood ratio

t = n^{−1} Σ_{j=1}^n log{ f_1(y_j | ζ̂) / f_0(y_j | η̂) },
whose value for the data is t = −0.465. The left panel of Figure 4.2 shows a histogram of R = 999 values of t* under sampling from the fitted gamma model: of these, 619 are greater than t and so p = 0.62.

Note that the histogram has a fairly non-normal shape in this case, suggesting that a normal approximation will not be very accurate. This is true also for the (rather complicated) studentized version Z of T: the right panel of Figure 4.2 shows the normal plot of bootstrap values z*. The observed value of z is 0.4954, for which the bootstrap P-value is 0.34, somewhat smaller than that computed for t, but not changing the conclusion that there is no evidence to change from a gamma to a lognormal model for these data. There are good general reasons to studentize test statistics; see Section 4.4.1.

It should perhaps be mentioned that significance tests of this kind are not always helpful in distinguishing between models, in the sense that we could find evidence against either, both or neither of them. This is especially true with small samples such as we have here. In this case the reverse test shows no evidence against the lognormal model. ■
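The following sketch indicates how the gamma fit and the test statistic could be coded (it assumes that t is the average log likelihood ratio, as above; the function names are ours):

fit.kappa <- function(y)       # gamma shape MLE: log(k) - digamma(k) = g
{ g <- log(mean(y)) - mean(log(y))
  uniroot(function(k) log(k) - digamma(k) - g, c(0.01, 1000))$root }
                               # search interval may need adjusting
lr.stat <- function(y)         # average log likelihood ratio, lognormal vs gamma
{ n <- length(y)
  kap <- fit.kappa(y); mu <- mean(y)
  a <- mean(log(y)); b <- sqrt(mean((log(y) - a)^2))
  l.gam <- sum(dgamma(y, shape = kap, rate = kap/mu, log = TRUE))
  l.lnorm <- sum(dnorm(log(y), a, b, log = TRUE) - log(y))
  (l.lnorm - l.gam)/n }
# e.g. kap <- fit.kappa(y); mu <- mean(y)
# tstar <- replicate(999, lr.stat(rgamma(length(y), kap, kap/mu)))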
Example 4.6 (Normal plot) Consider the data in Table 3.1, and suppose that we wish to assess normality graphically, by comparing the ordered studentized data values t(a) with the corresponding simulated quantities t*(a), a ∈ 𝒜.

Under the null hypothesis, T(a), T*_1(a), ..., T*_R(a) are independent and identically distributed for any fixed a, so that (4.9) applies with T = T(a). That is, for each a the lower and upper critical values t*_{(k)}(a) and t*_{(R+1−k)}(a), with k = p(R + 1), are obtained from the R simulated plots. If t(a) exceeds the upper value, or falls below the lower value, then the corresponding one-sided P-value is at most p; the two-sided test which rejects H_0 if t(a) falls outside the interval [t*_{(k)}(a), t*_{(R+1−k)}(a)] has level equal to 2p. The set of all upper and lower critical values defines the test envelope

𝒮^{1−2p} = {[t*_{(k)}(a), t*_{(R+1−k)}(a)] : a ∈ 𝒜}.     (4.16)

Excursions of t(a) outside 𝒮^{1−2p} are regarded as evidence against H_0, and this simultaneous comparison across all values of a is what is usually meant by the graphical test.
Example 4.7 (Normal plot, continued) For the normal plot of the previous example, suppose we set p = 0.05. The smallest simulation size that works is R = 19, and then we take k = 1 in (4.16). The test envelope will therefore be the lines connecting the maxima and the minima. Because we are plotting studentized sample values, which eliminates mean and variance parameters, the simulation can be done with the N(0, 1) distribution. Each simulated sample y*_1, ..., y*_n is studentized to give z*_i = (y*_i − ȳ*)/s*, i = 1, ..., 13, whose ordered values are then plotted against the same normal quantiles a_i = Φ^{−1}{i/(n + 1)}. The left panel of Figure 4.4 shows a set of R = 19 normal plots (plotted as connecting dashed lines) and their envelope (solid curves) for studentized values of simulated samples of n = 13 N(0, 1) data. The right panel shows the envelope of these plots together with the original data plot. Note that one of the inner points falls just outside the envelope: this might be taken as mild evidence against normality of the data, but such an interpretation may be premature, in light of the discussion below. ■
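The envelope computation itself is elementary; a sketch for the case R = 19, k = 1 (names are ours):

n <- 13; R <- 19
a <- qnorm((1:n)/(n + 1))                      # normal plotting positions
tstar <- replicate(R, sort(scale(rnorm(n))))   # R studentized ordered samples
env.lo <- apply(tstar, 1, min)                 # lower envelope (k = 1)
env.hi <- apply(tstar, 1, max)                 # upper envelope
# plot(a, sort(scale(y))); lines(a, env.lo); lines(a, env.hi)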
samples can be generated from any null model F_0. When unknown model parameters cannot be eliminated, we would simulate from F̂_0: then (4.15) will be approximately true provided n is not too small.

There are two aspects of the graphical test which need careful thought, namely the choice of R and the interpretation of the resulting plot. It seems clear from earlier discussion that for p = 0.05, say, R = 19 is too small: the test envelope is too random. R = 99 would seem to be a more sensible choice, provided this is not computationally difficult. But we should consider how formal the interpretation of the graph is to be. As it stands the notional one-sided significance levels p hold pointwise, and certainly the chance that the envelope captures an entire plot will be far less than 1 − 2p. So it would not make sense to infer evidence against the null model if one arbitrarily placed point falls outside the envelope, as happened in Example 4.7. In fact in that example the chance is about 0.5 that some point will fall outside the simulation envelope, in contrast to the pointwise chance 0.1.
For some purposes it will be useful to know the overall error rate, i.e. the chance of a point falling outside the envelope, or even to control this rate. While this is difficult to do exactly, there is a simple empirical approach which works satisfactorily. Given the R simulated plots which were used to calculate the test envelope, we can simulate the graphical test by comparing {t*_r(a), a ∈ 𝒜} to the envelope 𝒮^{1−2p}_{−r} that is obtained from the other R − 1 simulated plots. If we repeat this simulated test for r = 1, ..., R, then we obtain a resample estimate of the overall two-sided error rate,

R^{−1} #{r : t*_r(a) exits 𝒮^{1−2p}_{−r} for some a}.     (4.17)

This is easy to calculate, since {t*_r(a), a ∈ 𝒜} exits 𝒮^{1−2p}_{−r} if and only if the rank of t*_r(a) among t*_1(a), ..., t*_R(a) is at most k or at least R + 1 − k for at least one value of a, where as before k = p(R + 1). Thus if the R plots are represented by an R × N array, we first compute columnwise ranks. Then we calculate the proportion of rows in which either the minimum rank is less than or equal to k, or the maximum rank is greater than or equal to R + 1 − k, or both. The corresponding one-sided error rates are estimated in the obvious way.
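A sketch of this rank computation (tstar is the R × N array of simulated plot ordinates; the function name is ours):

overall.error <- function(tstar, k)
{ R <- nrow(tstar)
  rk <- apply(tstar, 2, rank)      # columnwise ranks, an R x N array
  exits <- apply(rk, 1, function(r) min(r) <= k || max(r) >= R + 1 - k)
  mean(exits) }                    # proportion of plots leaving the envelope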
Example 4.8 (Normal plot, continued) For the normal plot of Example 4.6, an overall two-sided error rate of approximately 0.1 requires R = 199. Figure 4.5 shows a graphical test plot for R = 199 with outer envelope corresponding to overall two-sided error rate 0.1 and inner envelope corresponding to pointwise two-sided error rate 0.1; the empirical error rate (4.17) for the outer envelope is 0.10. ■
4.2.5 Choice of R

In any simulation-based test, relatively few samples could be used if it quickly became clear that p was so large as not to be regarded as evidence against H_0. For example, if the event t* ≥ t occurred 50 times in the first 100 samples, then it is reasonably certain that p will exceed 0.25, say, for much larger R, so there is little point in simulating further. On the other hand, if we observed t* ≥ t only five times, then it would be worth sampling further to determine the level of significance more accurately.
One effect of not computing p exactly is to weaken the power of the test, essentially because the critical region of a fixed-level test has been randomly displaced. The effect can be quantified approximately as follows. Consider testing at level α, which is to say reject H_0 if p ≤ α. If the integer k is chosen equal to (R + 1)α, then the test rejects H_0 when t*_{(R+1−k)} < t. For the alternative hypothesis H_A, the power of the test is π_R(α, H_A) = Pr{T*_{(R+1−k)} < T | H_A}, the ideal power averaged over the distribution of the random critical value. After change of variable and some rearrangement of the integral, the loss of power is bounded approximately by

π_∞(α, H_A) × α^{(R+1)α} (1 − α)^{(R+1)(1−α)} Γ(R + 1) / [(R + 1)α Γ{(R + 1)α} Γ{(R + 1)(1 − α)}],

where π_∞(α, H_A) denotes the power when p is computed exactly. Numerical values of this approximate bound, which decreases in proportion to (R + 1)^{−1/2}, indicate how large R must be for the loss of power to be negligible.
used in (4.4) is then a uniform distribution over a set of permutations of the data structure. In evaluating p, we can use the fact that all marginal sample moments are fixed under such permutations, so the test statistic may be replaced by any monotone equivalent; with R random permutations the P-value is approximated by

p = (1 + #{t*_r ≥ t}) / (R + 1).

The following example illustrates this.
Example 4.10 (Correlation test, ctd) For the dataset shown in Figure 4.6, the test of Example 4.9 was implemented by simulation, that is by generating random permutations of the x-values, with R = 999. Figure 4.7 is a histogram of the correlation values. The unshaded part corresponds to the 4 t* values which are greater than the observed correlation t = 0.509: the P-value is p = (1 + 4)/(1 + 999) = 0.005. ■
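The whole test is a few lines of code; a sketch (names are ours):

perm.cor.test <- function(x, y, R = 999)
{ t.obs <- cor(x, y)
  tstar <- replicate(R, cor(sample(x), y))   # random permutations of x
  (1 + sum(tstar >= t.obs)) / (R + 1) }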
Example 4.11 (Comparison of two means) Suppose that the null hypothesis of equal means corresponds to distributions of the location form F_1(y) = G(y − μ_1), F_2(y) = G(y − μ_2), or of the scale form

F_1(y) = G(y/μ_1),   F_2(y) = G(y/μ_2),

for some unknown G. Then the null hypothesis implies a common CDF F for the two populations. In this case, the null hypothesis sufficient statistic s is the set of order statistics for the pooled sample, and resamples are permutations of the pooled data: the first n_1 components of a permutation will give the first sample and the last n_2 components will give the second sample. Further, when H_0 is true all such permutations are equally likely, and there are (n_1 + n_2)! of them. Therefore the exact P-value is the proportion of these permutations for which t* ≥ t. As in the previous example, this exact probability would usually be approximated by taking R random permutations of the type described, and applying (4.11). ■
A somewhat more complicated two-sample test problem is provided by the following example.

Example 4.12 (AML data) Figure 3.3 shows the product-limit estimates of the survivor function for times to remission of two groups of patients with acute myelogenous leukaemia (AML), with one of the groups receiving maintenance chemotherapy. Does this treatment make a difference to survival?

A common test for comparison of estimated survivor functions is based on the log-rank statistic, which compares the actual number of failures in group 1 with its expected value at each time a failure is observed, under the null hypothesis that the survival distributions of the two groups are equal. To be more explicit, suppose that we pool the two groups and obtain ordered failure times y_1 < ... < y_m, with m < n if there is censoring. Let f_1j and r_1j be the number of failures and the number at risk of failure in group 1 at time y_j, and similarly for group 2. Then the log-rank statistic is

T = Σ_{j=1}^m (f_1j − m_1j) / {Σ_{j=1}^m v_1j}^{1/2},

where

m_1j = (f_1j + f_2j) r_1j / (r_1j + r_2j),
v_1j = (f_1j + f_2j) r_1j r_2j (r_1j + r_2j − f_1j − f_2j) / {(r_1j + r_2j)² (r_1j + r_2j − 1)}

are the conditional mean and variance of the number in group 1 to fail at time y_j, given the values of f_1j + f_2j, r_1j and r_2j. For the AML data t = 1.84. Is this evidence that chemotherapy lengthens survival times?

For a suitable null distribution we simply treat the observations in the rows of Table 3.4 as a single group and permute them, effectively randomly allocating group labels to the observations. For each of R permutations, we recalculate t, obtaining t*_1, ..., t*_R. Figure 4.8 shows the t*_r plotted against order statistics from the N(0, 1) distribution, which is the asymptotic null distribution of T. The asymptotic P-value is 0.033, in reasonable agreement with the P-value 26/(999 + 1) = 0.026 from the permutation test. ■
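A self-contained sketch of the statistic and the permutation test (names are ours; it assumes failure indicators cens with 1 denoting failure, and no tied failure times between groups):

logrank.stat <- function(time, cens, group)
{ zs <- sort(unique(time[cens == 1]))   # distinct failure times
  num <- den <- 0
  for (z in zs) {
    r1 <- sum(time >= z & group == 1); r2 <- sum(time >= z & group == 2)
    f1 <- sum(time == z & cens == 1 & group == 1)
    f2 <- sum(time == z & cens == 1 & group == 2)
    f <- f1 + f2; r <- r1 + r2
    num <- num + f1 - f * r1/r
    if (r > 1) den <- den + f * r1 * r2 * (r - f) / (r^2 * (r - 1)) }
  num / sqrt(den) }
perm.logrank <- function(time, cens, group, R = 999)
{ t.obs <- logrank.stat(time, cens, group)
  tstar <- replicate(R, logrank.stat(time, cens, sample(group)))
  (1 + sum(tstar >= t.obs)) / (R + 1) }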
4.4 Nonparametric Bootstrap Tests

As in the parametric case, when the null distribution of T must be approximated by simulation from a fitted null model F̂_0, the P-value is computed as

p = (1 + #{t*_r ≥ t}) / (R + 1).
Example 4.13 (Comparison of two means, continued) Consider the last two series of measurements in Example 3.1, which are reproduced here labelled samples 1 and 2:

sample 1:  82 79 81 79 77 79 79 78 79 82 76 73 64
sample 2:  84 86 85 82 77 76 77 80 83 81 78 78 78

The question is: do we gain or lose anything by assuming that the two distributions have the same shape? ■
The particular null fitted model used in the previous example was suggested in part by the permutation test, and is clearly not the only possibility. Indeed, a more reasonable null model in the context would be one which allowed different variances for the two populations sampled: an analogous model is used in Example 4.14 below. So in general there can be many candidates for null model in the nonparametric case, each corresponding to different restrictions imposed in addition to H_0. One must judge which is most appropriate on the basis of what makes sense in the practical context.
Example 4.14 (Gravity data) For the eight series of gravity measurements of Table 3.1, consider testing equality of the series means while allowing their variances to differ. The null model estimates the variance of series i by

σ̂²_{i0} = (n_i − 1)s_i²/n_i + (ȳ_i − μ̂_0)²,

where μ̂_0 is the pooled weighted mean under H_0, and data are simulated as

y*_ij = μ̂_0 + σ̂_{i0} e*_ij,
[Figure 4.10. Resampling results for comparison of the means of the eight series of gravity data. Left panel: histogram of R = 999 values of t* under nonparametric resampling from the null model with pooled studentized residuals; the unshaded area to the right of the observed value t = 21.275 gives p = 0.029. Right panel: ordered t* values versus chi-squared quantiles; the dotted line is the theoretical chi-squared approximation.]
with the e*_ij randomly sampled from the pooled studentized residuals {e_ij, i = 1, ..., 8, j = 1, ..., n_i}. For each such simulated dataset we calculate sample averages and variances, then weights, the pooled mean, and finally t*.

Table 4.3 contains a summary of the null model fit, from which we calculate μ̂_0 = 78.6 and t = 21.275.

A set of R = 999 bootstrap samples gave the histogram of t* values in the left panel of Figure 4.10. Only 29 values exceed t = 21.275, so p = 0.030. The right panel of the figure plots ordered t* values against quantiles of the χ²_7 approximation, which is off by a factor of about 1.24 and gives the distorted P-value 0.0034. A normal-error parametric bootstrap gives results very similar to the nonparametric bootstrap. ■
Example 4.15 (Ratio test) Suppose that, as in Example 1.2, each observation y is a pair (u, x), and that we are interested in the ratio of means θ = E(X)/E(U). In particular suppose that we wish to test the null hypothesis H_0 : θ = θ_0. This problem could arise in a variety of contexts, and the context would help to determine the relevant null model. For example, we might have a paired-comparison experiment where the multiplicative effect θ is to be tested. Here θ_0 would be 1, and the marginal distributions of U and X should be the same under H_0. One natural null model F̂_0 would then be the symmetrized EDF, i.e. the EDF of the expanded data (u_1, x_1), ..., (u_n, x_n), (x_1, u_1), ..., (x_n, u_n). ■
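A sketch of this test for θ_0 = 1 (names are ours; we use a two-sided comparison on the log scale, one of several reasonable choices):

ratio.test <- function(u, x, R = 999)
{ t.obs <- mean(x)/mean(u)
  uu <- c(u, x); xx <- c(x, u)     # symmetrized data: (u,x) and (x,u) pairs
  n <- length(u)
  tstar <- replicate(R,
  { i <- sample(2 * n, n, replace = TRUE)
    mean(xx[i])/mean(uu[i]) })
  (1 + sum(abs(log(tstar)) >= abs(log(t.obs)))) / (R + 1) }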
More generally, a nonparametric null model can be fitted by choosing probabilities p_ij on the observed data values to minimize a measure of discrepancy with the EDFs, subject to the constraints imposed by H_0. Two natural measures are

d_1(F̂, F̂_0) = Σ_i Σ_j n_i^{−1} log{1/(n_i p_ij)}     (4.22)

and

d_2(F̂, F̂_0) = Σ_i Σ_j p_ij log(n_i p_ij).     (4.23)

Both are minimized by the set of EDFs when no constraints are imposed. The second measure has the advantage of automatically providing non-negative solutions. The following example illustrates the method and some of its implications.
Example 4.16 (Comparison of two means, continued) For the two samples of Example 4.13, the null hypothesis of equal means imposes the constraint Σ_j y_1j p_1j = Σ_j y_2j p_2j on the fitted probabilities. Setting derivatives of (4.23) with respect to the p_ij equal to zero gives the equations

p_1j,0 = exp(λy_1j) / Σ_{k=1}^{n_1} exp(λy_1k),   p_2j,0 = exp(−λy_2j) / Σ_{k=1}^{n_2} exp(−λy_2k).     (4.25)

The specific value of λ is uniquely determined by the null hypothesis constraint, which becomes

Σ_j y_1j exp(λy_1j) / Σ_k exp(λy_1k) = Σ_j y_2j exp(−λy_2j) / Σ_k exp(−λy_2k),

whose solution must be determined numerically. Distributions of the form (4.25) are usually called exponential tilts of the EDFs.

For our data λ = 0.130. The resulting null model probabilities are shown in the left panel of Figure 4.11. The right panel will be discussed later.
Having determined these null probabilities, the bootstrap test algorithm is as follows:

• For r = 1, ..., R, draw y*_11, ..., y*_1n_1 by sampling from y_11, ..., y_1n_1 with probabilities p_1j,0, and y*_21, ..., y*_2n_2 by sampling from y_21, ..., y_2n_2 with probabilities p_2j,0; then compute t*_r = ȳ*_2 − ȳ*_1.
• Calculate

p = (1 + #{t*_r ≥ t}) / (R + 1).
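A sketch of the tilt fit and test (names are ours; the root-finding interval for λ is ad hoc and may need widening, and the probabilities are computed from centred data to avoid numerical overflow):

tilt.lambda <- function(y1, y2, interval = c(-2, 2))
{ mdiff <- function(lam)
  { p1 <- exp(lam * (y1 - mean(y1))); p1 <- p1/sum(p1)
    p2 <- exp(-lam * (y2 - mean(y2))); p2 <- p2/sum(p2)
    sum(y1 * p1) - sum(y2 * p2) }      # zero when the tilted means are equal
  uniroot(mdiff, interval)$root }
tilt.test <- function(y1, y2, R = 999)
{ lam <- tilt.lambda(y1, y2)
  p1 <- exp(lam * (y1 - mean(y1))); p1 <- p1/sum(p1)
  p2 <- exp(-lam * (y2 - mean(y2))); p2 <- p2/sum(p2)
  t.obs <- mean(y2) - mean(y1)
  tstar <- replicate(R,
    mean(sample(y2, replace = TRUE, prob = p2)) -
    mean(sample(y1, replace = TRUE, prob = p1)))
  (1 + sum(tstar >= t.obs)) / (R + 1) }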
Numerical results for R = 999 are given in Table 4.4 in the line labelled "exponential tilt, t". Results for other resampling tests are also given for comparison: z refers to a studentized version of t, "MLE" refers to use of constrained maximum likelihood (see Problem 4.8), and "null variances" refers to the semiparametric method of Example 4.14. Clearly the choice of null model can have a strong effect on the P-value, as one might expect. The studentized test statistics z are discussed in Section 4.4.1. ■
The method as illustrated here has strong similarity to the use of empirical likelihood methods, as described in Chapter 10. In practice it seems wise to check the null model produced by the method, since resulting P-values are generally sensitive to the model. Thus, in the previous example, we should look at Figure 4.11 to see if it makes practical sense. The smoothed versions of the null distributions in the right panel, which are obtained by kernel smoothing, are perhaps easier to interpret. One might well judge in this case that the two null distributions are more different than seems plausible. Despite this reservation about this example, the general method is a valuable tool to have in case of need.

There are, of course, situations where even this quite general approach will not work. Nevertheless the basic idea behind the approach can still be applied, as the following examples show.
Example 4.17 (Test of unimodality) One check of unimodality of a distribution is based on the kernel density estimate

f̂(y; h) = (nh)^{−1} Σ_{j=1}^n φ{(y − y_j)/h},     (4.26)

where φ is the standard normal density. It is possible to show that the number of modes of f̂ decreases as h increases. So one way to test unimodality is to see if an unusually large h is needed to make f̂ unimodal. This suggests that we take as test statistic

t = min{h : f̂(y; h) has exactly one mode}.

A natural candidate for the null sampling distribution is f̂(y; t), since this is the least smoothed version of the EDF which satisfies the null hypothesis of unimodality. By the convolution property of f̂, random sample values from f̂(y; t) are given by

y*_j = y_{I_j} + t ε_j,     (4.27)

where the I_j are random integers from {1, 2, ..., n} and the ε_j are independent N(0, 1) variates.
Example 4.18 (Tuna density estimate) One method for estimating the abundance of a species in a region is to traverse a straight line of length L through the region, and to record the perpendicular distances from the line to positions where there are sightings. If there are n independent sightings and their (unsigned) distances y_1, ..., y_n are presumed to have PDF f(y), y ≥ 0, the abundance density can be estimated by n f̂(0)/(2L), where f̂(0) is an estimate of the density at distance y = 0. The PDF f(y) is proportional to a detection function that is assumed to decline monotonically with increasing distance, with non-monotonic decline suggesting that the assumptions that underlie line transect sampling must be questioned.

Table 4.5 gives data from an aerial survey of schools of Southern Bluefin Tuna in the Great Australian Bight. Figure 4.12 shows a histogram of the data. The figure also shows kernel density estimates

f̂(y; h) = (nh)^{−1} Σ_{j=1}^n [φ{(y − y_j)/h} + φ{(y + y_j)/h}],   y ≥ 0,     (4.28)

with h = 0.75, 1.5125, and 3. This seemingly unusual density estimate is used because the probability of detection, and hence the distribution of signed distances, should be symmetric about the transect. The estimate is obtained by first calculating the EDF of the reflected distances ±y_1, ..., ±y_n, then applying the kernel smoother, and finally folding the result about the origin.
Although the estimated density falls monotonically for h greater than 1.5125, the estimate for smaller values suggests non-monotonic decline. Since we consider f̂(y; h) for positive values of y only, we are interested in whether the underlying density falls monotonically or not. We take the smallest h such that f̂(y; h) is unimodal to be the value of our test statistic t. This corresponds to monotonic decline of f̂(y; h) for y > 0, giving no modes for y > 0. The observed value of the test statistic is t = 1.5125, and we are interested in the significance probability Pr(T* ≥ t | F̂_0), for data arising from F̂_0, an estimate of F that satisfies the null hypothesis of monotone decline. Here F̂_0 is the folded version of f̂(y; t), samples from which are given by

y*_j = | ± y_{I_j} + t ε_j |,   j = 1, ..., n,

where the signs ± are assigned randomly, the I_j are random integers from {1, 2, ..., n}, and the ε_j are independent N(0, 1) variates; cf. (4.27). The kernel density estimate based on the y* is f̂*(y; h). We now calculate the test statistic as outlined in the previous example, and repeat the process R = 999 times to obtain an approximate significance probability. We restrict the hunt for modes to 0 < y < 10, because it does not seem sensible to use so small a smoothing parameter in the density tails.
When the simulations were performed for these data, the frequencies of the number of modes of f̂*(y; t) for 0 < y < 10 were as follows:

Modes      0    1    2   3
Frequency  536  411  50  2

Like the fitted null distribution, a replicate where the full f̂*(y; t) is unimodal will have no modes for y > 0. If we assume that the event t* = t is impossible, bootstrap datasets with no modes have t* < t, so the significance probability is (411 + 50 + 2 + 1)/(999 + 1) = 0.464. There is no evidence against monotonic decline, giving no cause to doubt the assumptions underlying line transect methods. ■
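A sketch of the main ingredients (names are ours; R's density function uses the kernel standard deviation as its bandwidth, which we identify with h, and uniroot is used to locate the jump of the integer-valued mode count):

n.modes <- function(y, h, from = 0, to = 10)
{ d <- density(c(y, -y), bw = h, from = from, to = to)  # reflected data
  sum(diff(sign(diff(d$y))) == -2) }                    # count local maxima
crit.h <- function(y, lo = 0.1, hi = 10)                # smallest unimodal h
  uniroot(function(h) n.modes(y, h) - 0.5, c(lo, hi))$root
rnull <- function(y, t)                                 # sample from folded f(y; t)
{ n <- length(y)
  abs(sample(c(-1, 1), n, replace = TRUE) * sample(y, n, replace = TRUE)
      + t * rnorm(n)) }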
which we can approximate by simulation without having to decide on a null model F_0. The usual choice for v would be the nonparametric delta method estimate v_L of Section 2.7.2. The theoretical support for the use of Z is given in Section 5.4; in certain cases it will be advantageous to studentize a transformed estimate (Sections 5.2.2 and 5.7). In practice it would be appropriate to check on whether or not Z is approximately pivotal, using techniques described in Section 3.10.

Applications of this method are described in Sections 6.2.5 and 6.3.2. The modifications for the other one-sided alternative and for the two-sided alternative are simply p = Pr*(Z* ≤ z_0 | F̂) and p = Pr*(Z*² ≥ z_0² | F̂).

Example 4.19 (Comparison of two means, continued) For the comparison of the means of the two samples of Example 4.13, the observed value of the studentized pivot is

z_0 = (ȳ_2 − ȳ_1) / (s_2²/n_2 + s_1²/n_1)^{1/2},
and R replicates

z* = {ȳ*_2 − ȳ*_1 − (ȳ_2 − ȳ_1)} / (s*_2²/n_2 + s*_1²/n_1)^{1/2}

are generated, with each simulated dataset containing n_1 values sampled with replacement from sample 1 and n_2 values sampled with replacement from sample 2.
In R = 999 simulations we found 14 values in excess of z_0 = 1.768, so the P-value is 0.015. This is entered in Table 4.4 in the row labelled "(pivot)". ■
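A sketch of this studentized test (names are ours):

pivot.test <- function(y1, y2, R = 999)
{ se <- function(a, b) sqrt(var(b)/length(b) + var(a)/length(a))
  z0 <- (mean(y2) - mean(y1)) / se(y1, y2)
  zstar <- replicate(R,
  { s1 <- sample(y1, replace = TRUE); s2 <- sample(y2, replace = TRUE)
    (mean(s2) - mean(s1) - (mean(y2) - mean(y1))) / se(s1, s2) })
  (1 + sum(zstar >= z0)) / (R + 1) }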
When θ is a vector, the corresponding quantity is

Q = (T − θ)ᵀ V^{−1} (T − θ),

with observed value q_0 = (t − θ_0)ᵀ v^{−1} (t − θ_0). A related approach uses a variance estimate computed under the null model, that is

Z_0 = (T − θ_0) / V_0^{1/2},     (4.30)

or

Q_0 = (T − θ_0)ᵀ V_0^{−1} (T − θ_0)

for the vector case, where V_0 is an estimated variance under the null model. If Z_0 is used the bootstrap P-value will simply be

p = (1 + #{z*_0,r ≥ z_0}) / (R + 1),     (4.31)

with the obvious changes for a test based on Q_0. Even though the statistic is not pivotal, its use is likely to reduce the effects of nuisance parameters, and to give a P-value that is more nearly uniformly distributed under the null hypothesis than that calculated from T alone.
Example 4.20 (Comparison of two means, continued) In Table 4.4 all the entries for z, except for the row labelled "(pivot)", were obtained using (4.30) with t = ȳ_2 − ȳ_1 and v_0 depending on the null model. For example, for the null models discussed in Example 4.16,

v_0 = Σ_{i=1}^2 n_i^{−1} Σ_{j=1}^{n_i} p_ij,0 (y_ij − ξ_i,0)²,

where ξ_i,0 = Σ_{j=1}^{n_i} y_ij p_ij,0. For the two samples in question, under the exponential tilt null model both means equal 79.17 and v_0 = 1.195, the latter differing considerably from the variance estimate 2.59 used in the pivot method (Example 4.19).

The associated P-values computed from (4.31) are shown in Table 4.4 for all null models. These P-values are less dependent upon the particular model than those obtained with t unstudentized. ■
Adaptive tests

Conditioning occurs in a somewhat different way in the adaptive choice of test statistic. Suppose that we have possible test statistics T_1, ..., T_k for which efficiency measures can be defined and estimated by ê_1, ..., ê_k: for example, if the T_i are alternative estimators for scalar parameter θ and H_0 concerns θ, then ê_i might be the reciprocal of the estimated variance of T_i. The idea of the adaptive test is to use that T_i which is estimated to be most efficient for the observed data, and to condition on this fact.

We first partition the set 𝒴 of all possible null model resamples y* into 𝒴_1, ..., 𝒴_k such that

𝒴_i = {(y*_1, ..., y*_n) : ê*_i = max_{1≤j≤k} ê*_j},

and the adaptive P-value is then computed using only those resamples which fall in the same subset as the observed data. For an example of this, see Problem 4.13. In the case of exact tests, such as permutation tests, the adaptive test is also exact.
A general way to correct for the bias in a bootstrap P-value is to compute the adjusted value

p_adj = Pr*(P* ≤ p | F̂_0),

where p is the observed P-value defined above, and each P* is itself computed as a simulation P-value with, say, M inner samples. This requires bootstrapping the algorithm for computing P-values, another instance of increasing the accuracy of a bootstrap method by bootstrapping it, an idea introduced in Section 3.9.
T he problem can be explained theoretically in either o f two ways, perturbing
the critical value o f t for a fixed nom inal erro r rate a, or adjusting for the bias
in the P-value. We take the second approach, and since we are dealing with
statistical erro r rath er th an sim ulation erro r (Section 2.5), we ignore the latter.
The P-value com puted for the d ata is w ritten po{F), where the function po(')
depends on the m ethod used to obtain Fo from F. W hen the null hypothesis
is true, suppose th a t the p articu lar null distribution Fo obtains. T hen the null
distrib u tio n function for the P-value is
G ( u , F o) = P t { Po ( F ) < u \ F o } , (4.33)
which w ith u = a is the true error rate corresponding to nom inal erro r rate a.
N ow (4.33) implies th at
and so G{po(F),Fo) would be the ideal adjusted P-value, having actual error
rate equal to the nom inal erro r rate. N ext notice th a t by substituting Fo for Fo
in (4.33) we can estim ate G{u,Fo) by
Example 4.21 (Exponential means) Suppose that x_1, ..., x_m and y_1, ..., y_n are two exponential samples, and that equality of their means is tested using the ratio t = x̄/ȳ, with large values critical. The bootstrap P-value from the fitted null model, a common exponential mean, can be written

p = Pr{ G_m (G_m + G_n)^{−1} ≥ mu (mu + n)^{−1} },     (4.35)

where u = x̄/ȳ, and G_m and G_n are independent gamma random variables with indices m and n respectively and unit scale parameters.

The bootstrap P-value (4.35) does not have a uniform distribution under the null hypothesis, so P = p does not correspond to error rate p. This is fully corrected using the adjustment (4.34). To see this, write (4.35) as p = h(u), so that p_0(F̂*) equals

Pr**(T** ≥ t* | F̂*_0) = h(U*),

where U* = X̄*/Ȳ*. Since h(·) is decreasing, it follows that

p_adj = Pr*{h(U*) ≤ h(u) | x, y} = Pr*(U* ≥ u | x, y) = Pr(F_{2m,2n} ≥ u),
which is the P-value of the exact test. Therefore p_adj is exactly uniform and the adjustment is perfectly successful. ■

In the previous example, the same result for p_adj would be achieved if the bootstrap distribution of T were replaced by a normal approximation. This might suggest that bootstrap calculation of p could be replaced by a rough theoretical approximation, thus removing one level of bootstrap sampling from calculation of p_adj. Unfortunately this is not always true, as is clear from the fact that if an approximate null distribution of T is used which does not depend upon F̂ at all, then p_adj is just the ordinary bootstrap P-value.
In most applications it will be necessary to use simulation to approximate the adjusted P-value (4.34). Suppose that we have drawn R resamples from the null model F̂_0, with corresponding test statistic values t*_1, ..., t*_R. The rth resample has EDF F̂*_r (possibly a vector of EDFs), to which we fit the null model F̂*_r,0. Resampling M times from F̂*_r,0 gives samples from which we calculate t**_r1, ..., t**_rM. Then the Monte Carlo approximation for the adjusted P-value is

p_adj = (1 + #{p*_r ≤ p}) / (R + 1),     (4.36)

where p*_r = (1 + #{t**_rm ≥ t*_r})/(M + 1).
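A sketch of the nested simulation for (4.36) (names are ours; rfit0 draws one sample from the null model fitted to its argument):

adjusted.pvalue <- function(y, tfun, rfit0, R = 199, M = 99)
{ t.obs <- tfun(y)
  tstar <- pstar <- numeric(R)
  for (r in 1:R) {
    ystar <- rfit0(y)                         # outer sample from fitted null model
    tstar[r] <- tfun(ystar)
    tss <- replicate(M, tfun(rfit0(ystar)))   # inner level: null model refitted to ystar
    pstar[r] <- (1 + sum(tss >= tstar[r])) / (M + 1) }
  p <- (1 + sum(tstar >= t.obs)) / (R + 1)
  c(p = p, p.adj = (1 + sum(pstar <= p)) / (R + 1)) }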
Example 4.22 (Two-way table) Table 4.6 contains a set of observed multinomial counts, for which we wish to test the null hypothesis of row-column independence, or additive loglinear model.

If the count in row i and column j is y_ij, then the null fitted values are μ̂_ij,0 = y_{i+} y_{+j} / y_{++}, where y_{i+} = Σ_j y_ij and so forth. The log likelihood ratio test statistic is

t = 2 Σ_i Σ_j y_ij log(y_ij / μ̂_ij,0).
To calculate the simulation mean squared error, we begin with equation (4.37), which we rewrite in the form

p_adj = (1 + Σ_{r=1}^R I{p*_r ≤ p}) / (R + 1),

where I{A} is the indicator function of the event A and p*_r = (1 + Σ_{m=1}^M I{t**_rm ≥ t*_r})/(M + 1). In order to simplify the calculations, we suppose that, as M → ∞, p*_r → u_r such that the u_r are a random sample from the uniform distribution on [0, 1]. In this case there is no need to adjust the bootstrap P-value, so p_adj = p. Under this assumption (M + 1)p*_r is almost a Binom(M, u_r) random variable, so that equation (4.36) can be approximated by

p_adj = (1 + Σ_{r=1}^R X_r) / (R + 1),

where X_r = I{Binom(M, u_r) ≤ (M + 1)p}. We can now calculate the simulation mean and variance of p_adj by using the fact that, given u_r, X_r is a Bernoulli variable with success probability Pr{Binom(M, u_r) ≤ (M + 1)p}. A simple aggregate measure of simulation error is the mean squared error of p_adj relative to p.
Estimation of power

As regards collection of data, in simple problems of the kind under discussion in this chapter, the statistical contribution lies in recommendation of sample sizes via considerations of test power. If it is proposed to use test statistic T, and if the particular alternative H_A to the null hypothesis H_0 is of primary interest, then the power of the test at level p is Pr(T > t_p | H_A), where t_p is the level-p critical value; this can be estimated by simulating data from a model fitted under H_A, as the following example illustrates.
Example 4.23 (Maize height data) The EDFs plotted in the left panel of Figure 4.14 are for heights of maize plants growing in two adjacent rows, and differing only in a pollen sterility factor. The two samples can be modelled approximately by a semiparametric model with an unspecified baseline distribution F and one median-shift parameter δ. For analysis of such data it is proposed to test H_0 : δ = 0 using the Wilcoxon test. Whether or not there are enough data can be assessed by estimating the power of this test, which does depend upon F.

Denote the observations in sample i by y_ij, j = 1, ..., n_i. The underlying distributions are assumed to have the forms F(y) and F(y − δ), where δ is estimated by the difference in sample medians δ̂. To estimate F we subtract δ̂ from the second sample to give ỹ_2j = y_2j − δ̂. Then F̂ is the pooled EDF of the y_1j's and ỹ_2j's. For these data n_1 = n_2 = 12 and δ̂ = −4.5. The right panel of Figure 4.14 plots EDFs of the y_1j's and ỹ_2j's.

The next step is to simulate data for selected values of δ and selected sample sizes N_1 and N_2 as follows. For group 1, sample data y*_11, ..., y*_1N_1 from F̂(y), i.e. randomly with replacement from

y_11, ..., y_1n_1, ỹ_21, ..., ỹ_2n_2;

and for group 2, sample data y*_21, ..., y*_2N_2 from F̂(y − δ), i.e. randomly with replacement from

y_11 + δ, ..., y_1n_1 + δ, ỹ_21 + δ, ..., ỹ_2n_2 + δ.

Then calculate test statistic t*. With R repetitions of this, the power of the test at level p is the proportion of times that t* > t_p, where t_p is the critical value of the Wilcoxon test for specified N_1 and N_2.

In this particular case, the simulations show that the Wilcoxon test at level p = 0.01 has power 0.26 for δ = δ̂ and the observed sample sizes. Additional simulations of this sort, for other choices of N_1 and N_2, indicate what sample sizes would be needed to achieve a desired power.
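A sketch of the power estimation (names are ours; for simplicity it compares the Wilcoxon P-value with the level rather than computing the critical value t_p explicitly, and ties in the resamples force a normal approximation in wilcox.test):

wilcoxon.power <- function(y1, y2, delta, N1, N2, R = 999, level = 0.01)
{ dhat <- median(y2) - median(y1)
  pool <- c(y1, y2 - dhat)          # pooled estimate of the baseline F
  rejects <- replicate(R,
  { g1 <- sample(pool, N1, replace = TRUE)
    g2 <- sample(pool, N2, replace = TRUE) + delta
    wilcox.test(g2, g1)$p.value <= level })
  mean(rejects) }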
If the proposed test uses the pivot method of Section 4.4.1, then calculations of sample size can be done more simply. For example, for a scalar θ consider a two-sided test of H_0 : θ = θ_0 with level 2α based on the pivot Z. The power function can then be written directly in terms of the distribution of Z.
Sequential tests

Similar sorts of calculations can be done for sequential tests, where one important criterion is terminal sample size. In this context simulation can also be used to assess the likely eventual sample size, given data y_1, ..., y_n at an interim stage of a test, with a specified protocol for termination. This can be done by simulating data continuation y*_{n+1}, y*_{n+2}, ... up to termination, by sampling from fitted models or EDFs, as appropriate. From repetitions of this simulation one obtains an approximate distribution for terminal sample size N.
4.8 Problems
1 For the dispersion test of Example 4.2, y_1, ..., y_n are hypothetically sampled from a Poisson distribution. In the Monte Carlo test we simulate samples from the conditional distribution of Y_1, ..., Y_n given Σ Y_j = s, with s = Σ y_j. If the exact multinomial simulation were not available, a Markov chain method could be used. Construct a Markov chain Monte Carlo algorithm based on one-step transitions from (u_1, ..., u_n) to (v_1, ..., v_n) which involve only adding and subtracting 1 from two randomly selected us. (Note that zero counts must not be reduced.)
Such an algorithm might be slow. Suggest a faster alternative.
(Section 4.2)
2 Suppose that X_1, ..., X_n are continuous and have the same marginal CDF F, although they are not independent. Let I be a random integer between 1 and n. Show that rank(X_I) has a uniform distribution on {1, 2, ..., n}.
Explain how to apply this result to obtain an exact Monte Carlo test using one realization of a suitable Markov chain.
(Section 4.2.2; Besag and Clifford, 1989)
3 Suppose that we have an m × m contingency table with entries y_ij which are counts.
(a) Consider the null hypothesis of row-column independence. Show that the sufficient statistic s_0 under this hypothesis is the set of row and column marginal totals. To assess the significance of the likelihood ratio test statistic conditional on these totals, a Markov chain Monte Carlo simulation is used. Develop a Metropolis-type algorithm using one-step transitions which modify the contents of a randomly selected tetrad y_ik, y_il, y_jk, y_jl, where i ≠ j, k ≠ l.
(b) Now consider the null hypothesis of quasi-symmetry, which implies that in the loglinear model for mean cell counts, log E(Y_ij) = μ + α_i + β_j + γ_ij, the interaction parameters satisfy γ_ij = γ_ji for all i, j. Show that the sufficient statistic s_0 under this hypothesis is the set of totals y_ij + y_ji, i ≠ j, together with the row and column totals and the diagonal entries. Again a conditional test is to be applied. Develop a Metropolis-type algorithm for Markov chain Monte Carlo simulation using one-step transitions which involve pairs of symmetrically placed tetrads.
(Section 4.2.2; Smith et al., 1996)
5 (a) Consider the following rule for choosing the number of simulations in a Monte Carlo test. Choose k, and generate simulations t*_1, t*_2, ..., t*_l until the first l for which k of the t* exceed the observed value t; then declare P-value p = (k + 1)/(l + 1). Let the random variables corresponding to l and p be L and P. Show that

Pr{P ≤ (k + 1)/(l + 1)} = Pr(L ≥ l) = k/l,   l = k, k + 1, ...,

and deduce that L has infinite mean. Show that P has the distribution of a U(0, 1) random variable rounded to the nearest achievable significance level 1, k/(k + 1), k/(k + 2), ..., and deduce that the test is exact.
(b) Consider instead stopping immediately if k of the t* exceed t at any l ≤ R, and anyway stopping when l = R, at which point m values exceed t. Show that this rule gives achievable significance levels

p = (k + 1)/(l + 1), m = k;   p = (m + 1)/(R + 1), m < k.
6 Suppose that n subjects are allocated randomly to each of two treatments, A and B. In fact each subject falls in one of two relevant groups, such as gender, and the treatment allocation frequencies differ between groups. The response y_ij for the jth subject in the ith group is modelled as y_ij = γ_i + τ_{k(i,j)} + ε_ij, where τ_A and τ_B are treatment effects and k(i, j) is A or B according to which treatment was allocated to the subject. Our interest is in testing H_0 : τ_A = τ_B with alternative that τ_A < τ_B, and the test statistic chosen is

T = Σ_{i,j : k(i,j)=B} e_ij − Σ_{i,j : k(i,j)=A} e_ij,

where e_ij is the residual from regression of the ys on the group indicators.
(a) Describe how to calculate a permutation P-value for the observed value t using the method described above Example 4.12.
(b) A different calculation of the P-value is possible which conditions on the observed covariates, i.e. on the treatment allocation frequencies in the two groups. The idea is to first eliminate the group effects by reducing the data to differences d_ij = y_ij − y_{i,j+1}, and then to note that the joint probability of these differences under H_0 is constant under permutations of data within groups. That is, the minimal sufficient statistic s_0 under H_0 is the set of differences Y_{(i,j)} − Y_{(i,j+1)}, where Y_{(i,1)} ≤ Y_{(i,2)} ≤ ... are the ordered values within the ith group. Show carefully how to calculate the P-value for t conditional on s_0.
(c) Apply the unconditional and conditional permutation tests to the following data:

     Group 1   Group 2
A    3 5 4     4 1
B    0         2 1
7 In a matched-pair experiment with differences d_1, ..., d_n, the null hypothesis of no treatment effect implies symmetry about zero, so that a randomization test can be based on

D_j = S_j d_j,   j = 1, ..., n,

where the S_j are independent and equally likely to be +1 and −1. What would be the corresponding nonparametric bootstrap sampling model F̂_0? Would the resulting bootstrap P-value differ much from the randomization P-value?
See Practical 4.4 to apply the randomization and bootstrap tests to the following data, which are differences of measurements in eighths of an inch on cross- and self-fertilized plants grown in the same pot (taken from R. A. Fisher's famous discussion of Darwin's experiment):

49 −67 8 16 6 23 28 41 14 29 56 24 75 60 −48
8 For the two-sample problem of Example 4.16, consider fitting the null model by maximum likelihood. Show that the solution probabilities are given by

p_1j,0 = 1/{n_1(α + λ y_1j)},   p_2j,0 = 1/{n_2(β − λ y_2j)},

where α, β and λ are the solutions to the equations Σ_j p_1j,0 = 1, Σ_j p_2j,0 = 1, and Σ_j y_1j p_1j,0 = Σ_j y_2j p_2j,0. Under what conditions does this solution not exist, or give negative probabilities? Compare this null model with the one used in Example 4.16.
9 Suppose that the null model probabilities are instead chosen to minimize the distance measure

d(p, q) = Σ_j q_j log q_j − Σ_j q_j log p_j.

Verify that for small values of λ the resulting p_j's are approximately the same as those obtained by the MLE method.
(Section 4.4; Efron, 1981b)
10 Suppose that we wish to test the reduced-rank model H_0 : g(θ) = 0, where g(·) is a p_1-dimensional reduction of the p-dimensional θ. For the studentized pivot method we take Q = {g(T) − g(θ)}ᵀ v_g^{−1} {g(T) − g(θ)}, with data test value q_0 = g(t)ᵀ v_g^{−1} g(t), where v_g estimates var{g(T)}. Use the nonparametric delta method to show that v_g = ġ(t) v_L ġ(t)ᵀ, where ġ(θ) = ∂g(θ)/∂θᵀ.
Show how the method can be applied to test equality of p means given p independent samples, assuming equal population variances.
(Section 4.4.1)
11 In a parametric situation, suppose that an exact test is available with test statistic U, that S is sufficient under the null hypothesis, but that a parametric bootstrap test is carried out using T rather than U. Will the adjusted P-value p_adj always produce the exact test?
(Section 4.5)
12 In calculating the mean squared error for the simulation approximation to the adjusted P-value, it might be more reasonable to assume that the P-values u_r follow a Beta distribution with parameters a and b which are close to, but not equal to, one. Show how the simulation mean and variance of p_adj change in this case, where as before X_r = I{Binom(M, u_r) ≤ (M + 1)p}. Use this result to investigate numerically the choice of M.
(Section 4.5)
13 For the matched-pair experiment of Problem 4.7, suppose that we choose between the two test statistics t_1 = d̄ and t_2 = (n − 2m)^{−1} Σ_{j=m+1}^{n−m} d_{(j)}, for some m in the range 2, ..., [¼n], on the basis of their estimated variances v_1 and v_2, where

v_1 = Σ_j (d_j − t_1)² / n²,

v_2 = {Σ_{j=m+1}^{n−m} (d_{(j)} − t_2)² + m(d_{(m+1)} − t_2)² + m(d_{(n−m)} − t_2)²} / {n(n − 2m)}.
4.9 Practicals
1 The data in dataframe dogs are from a pharmacological experiment. The two variables are cardiac oxygen consumption (MVO) and left ventricular pressure (LVP), for n = 7 dogs.
Apply a bootstrap test for the hypothesis of zero correlation between MVO and LVP. Use R = 499 simulations.
(Sections 4.3, 4.4)
3 For a graphical test of suitability of the exponential model for the data in Table 1.2, we generate data from the exponential distribution, and plot an envelope.

v2 <- (sum((d-t2)^2) + m*(min(d)-t2)^2 + m*(max(d)-t2)^2)/(n*(n-2*m))
c(t1, v1, t2, v2) }
darwin.ad <- boot(darwin$y, darwin.f, R=999, sim="parametric",
                  ran.gen=darwin.gen, mle=nrow(darwin))
darwin.ad$t0
i <- c(1:999)[darwin.ad$t[,2] > darwin.ad$t[,4]]
(1 + sum(darwin.ad$t[i,3] > darwin.ad$t0[3]))/(1 + length(i))

Is a different result obtained with the adaptive version of the bootstrap test?
(Sections 4.3, 4.4)
h <- 1.5
hist(paulsen$y, probability=T, breaks=c(0:30))
lines(density(paulsen$y, width=4*h, from=0, to=30))
peak.test <- function(y, h)
{ dens <- density(y, width=4*h, n=100)
  sum(peaks(dens$y[(dens$x>=0) & (dens$x<=20)])) }
peak.test(paulsen$y, h)

Check that h = 1.87 is the smallest value giving just one peak. For bootstrap analysis, samples may be generated from the fitted null density as in Example 4.17.
6 For the cd4 data of Practicals 2.3 and 3.6, test the hypothesis that the distribution of CD4 counts after one year is the same as the baseline distribution. Test also whether the treatment affects the counts for each individual. Discuss your conclusions.
5 Confidence Intervals

5.1 Introduction
The assessment of uncertainty about parameter values is made using confidence intervals or regions. Section 2.4 gave a brief introduction to the ways in which resampling can be applied to the calculation of confidence limits. In this chapter we undertake a more thorough discussion of such methods, including more sophisticated ideas that are potentially more accurate than those mentioned previously.

Confidence region methods all focus on the same target properties. The first is that a confidence region with specified coverage probability γ should be a set C_γ(y) of parameter values which depends only upon the data y and which satisfies

Pr{θ ∈ C_γ(Y)} = γ.     (5.1)
not serious for scalar θ, which is the major focus in this chapter, because in most applications the confidence region will be a single interval.

A confidence interval will be defined by limits θ̂_{α_1} and θ̂_{1−α_2}, such that for any α

Pr(θ < θ̂_α) = α.

The coverage of the interval [θ̂_{α_1}, θ̂_{1−α_2}] is γ = 1 − (α_1 + α_2), and α_1 and α_2 are respectively the left- and right-tail error probabilities. For some applications only one limit is required, either a lower confidence limit θ̂_α or an upper confidence limit θ̂_{1−α}, these both having coverage 1 − α. If a closed interval is required, then in principle we can choose α_1 and α_2 freely, so long as they sum to the overall error probability 2α. The simplest way to do this, which we adopt for general discussion, is to set α_1 = α_2 = α. Then the interval is equi-tailed with coverage probability 1 − 2α. In particular applications, however, one might well want to choose α_1 and α_2 to give approximately the shortest interval: this would be analogous to having the likelihood property mentioned earlier.
A single confidence region cannot give an adequate summary of the uncertainty about θ, so in practice one should give regions for three or four confidence levels between 0.50 and 0.99, say, together with the point estimate for θ. One benefit from this is that any asymmetry in the uncertainty about θ will be fairly clear.
So far we have assumed that a confidence region can be found to satisfy (5.1) exactly, but this is not possible except in a few special parametric models. The methods developed in this chapter are based on approximate probability calculations, and therefore involve a discrepancy between the nominal or target coverage, and the actual coverage probability.

In Section 5.2 we review briefly the standard approximate methods for parametric and nonparametric models, including the basic bootstrap methods already described in Section 2.4. More sophisticated methods, based on what is known as the percentile method, are the subject of Section 5.3. Section 5.4 compares the various methods from a theoretical viewpoint, using asymptotic expansions, and introduces the ABC method as an alternative to simulation methods. The use of significance tests to obtain confidence limits is outlined in Section 5.5. A nested bootstrap algorithm is introduced in Section 5.6. Empirical comparisons between methods are made in Section 5.7.

Confidence regions for vector parameters are described in Section 5.8. The possibility of conditional confidence regions is explored in Section 5.9 through discussion of two examples. Prediction intervals are discussed briefly in Section 5.10.

The discussion in this chapter is about how to use the results of bootstrap simulation algorithms to obtain confidence regions, irrespective of what the resampling algorithm is. The presentation supposes for the most part that we
are in the simple situation of Chapter 2, where we have a single, complete homogeneous sample. Most of the methods described can be applied to more complex data structures, provided that appropriate resampling algorithms are used, but for most sorts of highly dependent data the theoretical properties of the methods are largely unknown.
where as usual z_{1−α} = Φ^{−1}(1 − α). If T is a maximum likelihood estimator, then the approximate variance v can be computed directly from the log likelihood function ℓ(θ). If there are no nuisance parameters, then we can use the reciprocal of either the observed Fisher information, v = −1/ℓ̈(θ̂), or the estimated expected Fisher information v = 1/i(θ̂), where i(θ) = E{−ℓ̈(θ)} = var{ℓ̇(θ)}; here ℓ̇(θ) = ∂ℓ(θ)/∂θ and ℓ̈(θ) = ∂²ℓ(θ)/∂θ∂θᵀ. The former is usually preferable. When there are nuisance parameters, we use the relevant element of the inverse of either −ℓ̈(θ̂) or i(θ̂). More generally, if T is given by an estimating equation, then v can be calculated by the delta method; see Section 2.7.2. Equation (5.4) is the standard form for normal approximation confidence limits, although it is sometimes augmented by a bias correction which is based on the third derivative of the log likelihood function.
When a small bootstrap simulation has been used to estimate the bias and variance of T by b_R and v_R, the corresponding limits are

θ̂_α, θ̂_{1−α} = t − b_R ∓ v_R^{1/2} z_{1−α}.     (5.5)

Usually R is chosen so that (R + 1)α is an integer for conventional values of α. But if for some reason (R + 1)α is not an integer, then interpolation can be used. A simple method that works well for approximately normal estimators is linear interpolation on the normal quantile scale. For example, if we are trying to apply (5.6) and the integer part of (R + 1)α is k, then we define

t*_{(α)} = t*_{(k)} + [Φ^{−1}(α) − Φ^{−1}{k/(R + 1)}] / [Φ^{−1}{(k + 1)/(R + 1)} − Φ^{−1}{k/(R + 1)}] × (t*_{(k+1)} − t*_{(k)}),   k = [(R + 1)α].     (5.8)

The same interpolation can be applied to the z*'s. Clearly such interpolations fail if k = 0, R or R + 1.
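A sketch of this interpolation (the function name is ours; it assumes 1 ≤ k < R):

quantile.interp <- function(tstar, alpha)
{ R <- length(tstar); ts <- sort(tstar)
  k <- floor((R + 1) * alpha)
  w <- (qnorm(alpha) - qnorm(k/(R + 1))) /
       (qnorm((k + 1)/(R + 1)) - qnorm(k/(R + 1)))
  ts[k] + w * (ts[k + 1] - ts[k]) }   # fails if k = 0 or k = R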
Parameter transformation

The normal approximation method may fail to work well because it is being applied on the wrong scale, in which case it should help to apply the approximation on an appropriately transformed scale. Skewness in the distribution of T is often associated with var(T) varying with θ. For this reason the accuracy of the normal approximation is often improved by transforming the parameter scale to stabilize the variance of the estimator, especially if the transformed scale is the whole real line. The accuracy of the basic bootstrap confidence limits (5.6) will also tend to be improved by use of such a transformation.

Suppose that we make a monotone increasing transformation of the parameter scale from θ to η = h(θ), and then transform t correspondingly to u = h(t). Any confidence limit method can be applied for η, and untransforming the results will give confidence limits for θ. For example, consider applying the normal approximation limits (5.4) for η. By the delta method (Section 2.7.1) the variance approximation v for T transforms to

v_h = ḣ(t)² v,

say. Then the confidence limits for η are h(t) ∓ v_h^{1/2} z_{1−α}, which transform back to the limits

θ̂_α, θ̂_{1−α} = h^{−1}{h(t) ∓ v_h^{1/2} z_{1−α}}.     (5.9)

One empirical way to choose h, which sometimes works, is to make normal Q-Q plots of h(t*) for candidate transformations.
It is important to stress that the use of transformation can improve the basic bootstrap method considerably. Nevertheless it may still be beneficial to use the studentized method, after transformation. Indeed there is strong empirical evidence that the studentized method is improved by working on a scale with stable approximate variance. The studentized transformed estimator is

Z_h = {h(T) − h(θ)} / {|ḣ(T)| V^{1/2}}.

Given R values of the bootstrap quantity z* = {h(t*) − h(t)} / {|ḣ(t*)| v*^{1/2}}, the analogue of (5.10) is obtained by applying the studentized limits (5.7) on the h scale and transforming back; this gives (5.11).
For parametric models an alternative is the likelihood ratio statistic w(θ) = 2{ℓ(θ̂) − ℓ(θ)}: the set of θ for which w(θ) ≤ c_{1,p}, where c_{1,p} is the p quantile of the χ²_1 distribution, is a confidence region. This confidence region need not be a single interval, although usually it will be, and the left- and right-tail errors need not be even approximately equal. Separate lower and upper confidence limits can be defined using the signed root of the likelihood ratio statistic

z(θ) = sgn(θ̂ − θ) √w(θ),

where sgn(u) = u/|u| is the sign function; z(θ) is approximately N(0, 1). The resulting confidence limits are defined implicitly by setting z(θ) equal to appropriate standard normal quantiles.
In most applications the accuracy will be very good, provided the model is correct, but it may nevertheless be sensible to consider replacing the theoretical quantiles by bootstrap approximations. Whether or not this is worthwhile can be judged from a chi-squared Q-Q plot of simulated values of

w*(θ̂) = 2{ℓ*(θ̂*) − ℓ*(θ̂)},

where ℓ* is the log likelihood for a set of data simulated using θ̂, for which the MLE is θ̂*.
These contrast sharply with the exact limits 65.9 and 209.2.

Transformation to the variance-stabilizing logarithmic scale does improve the normal approximation. Application of (2.14) with v(μ) = n^{−1}μ² gives h(t) = log(t), if we drop the multiplier n^{1/2}, and the approximate variance transforms to n^{−1}. The 95% confidence interval limits given by (5.9) are then 108.083 × exp(∓1.96/√12), i.e. 61.4 and 190.3. While a considerable improvement, the results are still not very close to the exact solution. A partial explanation for this is that there is a bias in log(T) and the variance approximation is no longer equal to the exact variance. Use of bootstrap estimates for the bias and variance of log(T), with R = 999, gives limits 58.1 and 228.8.

For the basic bootstrap confidence limits we use R = 999 simulations under the fitted exponential model, samples of size n = 12 being generated from the exponential distribution with mean 108.083; see Example 2.6. The relevant ordered values of ȳ* are the (999 + 1) × 0.025th and (999 + 1) × 0.975th, i.e. the 25th and 975th, which in our simulation were 53.3 and 176.4. The 95% confidence limits obtained from (5.6) are therefore 2 × 108.083 − 176.4 = 39.8 and 2 × 108.083 − 53.3 = 162.9. These are no better than the normal approximation limits. However, application of the same method on the logarithmic scale gives much better results:
using the same ordered values of ȳ* in (5.10) we obtain the limits 108.083²/176.4 = 66.2 and 108.083²/53.3 = 219.2. In fact these are simulation approximations to the exact limits, which are based on the exact gamma distribution of Ȳ/μ. The same results are obtained using the studentized bootstrap limits (5.7) in this case, because z = n^{1/2}(ȳ − μ)/ȳ is a monotone function of log(ȳ) − log(μ) = log(ȳ/μ). Equation (5.11) also gives these results.

Note that if we had used R = 99, then the bootstrap confidence limits would have required interpolation, because (99 + 1) × 0.025 = 2.5, which is not an integer: the application of (5.8) would interpolate between the second and third ordered values, with k = 2. ■
Normal approximation

The simplest method is again to use a normal approximation, now with a nonparametric estimate of variance such as that provided by the nonparametric delta method described in Section 2.7.2. If l_j represents the empirical influence value for the jth case y_j, then the approximate variance is v_L = n^{−2} Σ l²_j, so the nonparametric analogue of (5.4) for the limits of a 1 − 2α confidence interval for θ is

t ∓ v_L^{1/2} z_{1−α}.     (5.14)

Section 2.7 outlines various ways of calculating or approximating the influence values.

If a small nonparametric bootstrap has been run to produce bias and
variance estimates b_R and v_R, as described in Section 2.3, then the corresponding approximate 1 − 2α confidence interval is

t − b_R ∓ v_R^{1/2} z_{1−α}.     (5.15)

The studentized bootstrap limits (5.7) can also be computed nonparametrically, where now z* = (t* − t)/v*^{1/2}. Note that the influence values must be recomputed for each bootstrap sample, because in expanded notation l_j = l(y_j; F̂) depends upon the EDF of the sample. Therefore

v*_L = n^{−2} Σ_{j=1}^n l²(y*_j; F̂*),

but this is unreliable unless t is approximately linear; see Section 2.7.5 and Problem 2.20.

As in the parametric case, one might consider making a bias adjustment in the numerator of z, for example based on the empirical second derivatives of t. However, this rarely seems effective, and in any event an approximate adjustment is implicitly made in the bootstrap distribution of Z*.
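For concreteness, minimal sketches of the basic and studentized limits (5.6) and (5.7) (names are ours; both assume that (R + 1)α is an integer):

boot.ci.basic <- function(t, tstar, alpha = 0.025)
{ R <- length(tstar); ts <- sort(tstar)
  c(2*t - ts[(R + 1)*(1 - alpha)], 2*t - ts[(R + 1)*alpha]) }
boot.ci.stud <- function(t, v, tstar, vstar, alpha = 0.025)
{ z <- sort((tstar - t)/sqrt(vstar)); R <- length(z)
  c(t - sqrt(v)*z[(R + 1)*(1 - alpha)], t - sqrt(v)*z[(R + 1)*alpha]) }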
This, as with most of the numerical results here, is very similar to what is obtained under parametric analysis with the best-fitting gamma model; see Example 2.9.

For the basic bootstrap method with R = 999 simulated datasets, the 25th and 975th ordered values of ȳ* are 43.92 and 192.08, so the limits of the 95% confidence interval are 2 × 108.083 − 192.08 = 24.1 and 2 × 108.083 − 43.92 = 172.2. This is not obviously a poor result, unless compared with results for the gamma model (likelihood ratio limits 57 and 243), but the corresponding 99% interval has lower limit −27.3, which is clearly very bad! The studentized bootstrap fares better: the 25th and 975th ordered values of z* are −5.21 and 1.66, so that application of (5.7) gives a 95% interval that is markedly wider and asymmetric to the right.

But are these last results adequate, and how can we tell? The first part of this question we can answer both by comparison with the gamma model results, and by applying methods on the logarithmic scale, which we know is appropriate here. The basic bootstrap method gives 95% limits 66.2 and 218.8 when the log scale is used. So it would appear that the studentized bootstrap method limits are too wide here, but otherwise are adequate. If the studentized bootstrap method is applied in conjunction with the logarithmic transformation, the limits become 50.5 and 346.9.
How would we know in practice th a t the logarithm ic transform ation o f T is
appropriate, o th er th an from experience w ith sim ilar d ata ? O ne way to answ er
this is to p lo t v’L versus t*, as a surrogate for a “v arian ce-p aram eter” plot,
as suggested in Section 3.9.2. F or this p articu lar dataset, the equivalent plot
o f stan d ard errors vL is shown in the left panel o f Figure 5.1 and strongly
suggests th a t variance is approxim ately p ro p o rtio n al to squared param eter, as it
is under the param etric model. F rom this we w ould deduce, using (2.14), th at
the logarithm ic tran sfo rm atio n should approxim ately stabilize the variance.
T he right panel o f the figure, which gives the corresponding plot for log-
transform ed estim ates, shows th a t the tran sfo rm atio n is quite successful. ■
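The diagnostic used in this example is easy to reproduce; a minimal sketch, assuming the air-conditioning data are in a data frame aircondit and that stat.fun returns (t, v_L) as above:

air.boot <- boot(aircondit, stat.fun, R = 999, stype = "w")
par(mfrow = c(1, 2))
plot(air.boot$t[,1], sqrt(air.boot$t[,2]),       # standard errors versus t*
     xlab = "t*", ylab = "sqrt(vL*)")
plot(log(air.boot$t[,1]), sqrt(air.boot$t[,2])/air.boot$t[,1],
     xlab = "log t*", ylab = "SE of log t*")     # delta method on log scale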
Parameter transformation

For suitably smooth statistics, the consistency of the studentized bootstrap method is essentially guaranteed by the consistency of the variance estimate V. In principle the method is more accurate than the basic bootstrap method,
Figure 5.1 Air-conditioning data: nonparametric delta method standard errors for t = ȳ (left panel) and for log t (right panel) in R = 999 nonparametric bootstrap samples.

z* = {h(t*) − h(t)} / {ḣ(t*) v*^(1/2)}.
Example 5.3 (City population data) For the data of Example 2.8, with ratio θ estimated by t = x̄/ū, we discussed empirical choice of transformation in Example 3.23. Application of the empirical transformation illustrated in Figure 3.11 with the studentized bootstrap limits (5.17) leads to the 95% interval [1.23, 2.25]. This is similar to the 95% interval based on h(t*) − h(t), [1.27, 2.21], while the studentized bootstrap interval on the original scale is [1.12, 1.88]. The effect of the transformation is to make the interval more like those from the percentile methods described in the following section.

To compare the studentized methods, we took 500 samples of size 10 without replacement from the full city population data in Table 1.3. Then for each sample we calculated 90% studentized bootstrap intervals on the original scale, and on the transformed scale with and without using the transformed standard error; this last interval is the basic bootstrap interval on the transformed scale. The coverages were respectively 90.4, 88.2, and 86.4%, to be compared to the ideal 90%. The first two are not significantly different, but the last is rather smaller, suggesting that it can be worthwhile to studentize on the transformed scale, when this is possible. The drawback is that studentized intervals that use the transformed scale tend to be longer than on the original scale, and their lengths are more variable. ■
5.2.3 Choice of R

What has been said about simulation size in earlier chapters, especially in Section 4.2.5, applies here. In particular, if confidence levels 0.95 and 0.99 are to be used, then it is advisable to have R = 999 or more, if practically feasible. Problem 5.5 outlines some relevant theoretical calculations.
t*_((R+1)α),   t*_((R+1)(1−α)).   (5.18)
In fact this is an improved normal approximation, after applying the (unknown) normalizing transformation which eliminates the leading term in a skewness approximation. The usual factor n^(−1) has been taken out of the variance by scaling h(·) appropriately, so that both a and w will typically be of order n^(−1/2). The use of a and w is analogous to the use of Bartlett correction factors in likelihood inference for parametric models.
The essence of the method is to calculate confidence limits for φ and then transform these back to the θ scale using the bootstrap distribution of T. To begin with, suppose that a and w are known, and write

U = φ + (1 + aφ)(Z − w).

Then the α confidence limit for φ is

φ̂_α = φ̂ + (1 + aφ̂)(w + z_α) / {1 − a(w + z_α)},

and

Pr*(φ̂* ≤ φ̂_α) = Φ( w + (w + z_α)/{1 − a(w + z_α)} ),

which is known. Therefore the α confidence limit for θ is

θ̂_α = Ĝ^(−1)[ Φ( w + (w + z_α)/{1 − a(w + z_α)} ) ],   (5.20)

whose simulation approximation is

θ̂_α = t*_((R+1)α̃),   α̃ = Φ( w + (w + z_α)/{1 − a(w + z_α)} ).   (5.21)
These limits are usually referred to as BCa confidence limits. Note that they share the transformation invariance property of percentile confidence limits. The use of Ĝ overcomes lack of knowledge of the transformation h. The values of a and w are unknown, of course, but they can be easily estimated. For w we can use the initial normal approximation (5.19) for U to write

Ĝ(t) = Pr*(φ̂* ≤ φ̂) = Φ(w),

so that

w = Φ^(−1){Ĝ(t)}.   (5.22)
The value of a can be determined informally using (5.19). Thus if ℓ(φ) denotes the log likelihood defined by (5.19), with derivative ℓ̇(φ), then it is easy to show that

E{ℓ̇(φ)³} / var{ℓ̇(φ)}^(3/2) = 6a,

ignoring terms of order n^(−1). But the ratio on the left of this equation is invariant under parameter transformation. So we transform back from φ to θ and deduce that, still ignoring terms of order n^(−1),

a = (1/6) E*{ℓ̇*(θ̂)³} / var*{ℓ̇*(θ̂)}^(3/2),   (5.23)

where ℓ̇* is the derivative of the log likelihood for a set of data simulated from the fitted model. More generally a is one-sixth the standardized skewness of the linear approximation to T.
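In the nonparametric case, then, estimates of a and w and the adjusted level α̃ of (5.21) can be computed directly from bootstrap replicates t.star and empirical influence values l; a minimal sketch (the name bca.limit is ours):

bca.limit <- function(t0, t.star, l, alpha = 0.025)
{ R <- length(t.star)
  w <- qnorm(sum(t.star < t0)/(R + 1))       # bias correction (5.22)
  a <- sum(l^3)/(6 * sum(l^2)^1.5)           # one-sixth standardized skewness
  za <- qnorm(alpha)
  atilde <- pnorm(w + (w + za)/(1 - a * (w + za)))
  sort(t.star)[floor((R + 1) * atilde)] }    # adjusted percentile limit (5.21)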
One potential problem with the BCa method is that if α̃ in (5.21) is much closer to 0 or 1 than α, then (R + 1)α̃ could be less than 1 or greater than R, so that even with interpolation the relevant quantile cannot be calculated. If this happens, and if R cannot be increased, then it would be appropriate to quote the extreme value of t* and the implied value of α. For example, if (R + 1)α̃ > R, then the upper confidence limit t*_(R) would be given with implied right-tail error α₂ equal to one minus the solution to α̃ = R/(R + 1).
ℓ̇(μ) = nȳ/μ² − n/μ.

The second and third moments of ℓ̇*(μ̂) are nμ̂^(−2) and 2nμ̂^(−3), so by (5.23)

a = (1/3) n^(−1/2) = 0.0962.

With R = 999 and w = 0.0878, the implied right-tail error α₂ of the extreme upper limit solves

999/1000 = Φ( 0.0878 + (0.0878 + z_(1−α₂))/{1 − 0.0962(0.0878 + z_(1−α₂))} ),

namely α₂ = 0.0125. ■
ℓ_LF(ξ) = ℓ(ψ̂ + ξδ),

akin to the profile log likelihood for θ. The MLE of ξ is ξ̂ = 0.

The bias-corrected percentile method is now applied to the least-favourable family. Equations (5.21) and (5.22) still apply. The only change in the calculations is to the skewness correction factor a, which becomes

a = (1/6) E*{ℓ̇*_LF(0)³} / var*{ℓ̇*_LF(0)}^(3/2).   (5.24)

In this expression the parameter estimates ψ̂ are regarded as fixed, and the moments are calculated under the fitted model.

A somewhat simpler expression for a can be obtained by noting that ℓ̇_LF(0) is proportional to the influence function for t. The result in Problem 2.12 shows that

L_t(y_j; F_ψ) = n i₁^(−1)(ψ) ℓ̇(ψ, y_j),
where i₁^(−1)(ψ) is the first row of the inverse of i(ψ) and ℓ̇(ψ, y_j) is the contribution to ℓ̇(ψ) from the jth case. We can then rewrite (5.24) as

a = (1/6) E*(L*³) / var*(L*)^(3/2),   (5.25)

where

L* = n i₁^(−1)(ψ̂) ℓ̇(ψ̂, Y*)

and Y* follows the fitted distribution with parameter value ψ̂. As before, to first order a is one-sixth the estimated standardized skewness of the linear approximation to t. In the form given, (5.25) will apply also to nonhomogeneous data.

The BCa method can be extended to any smooth function of the original model parameters ψ; see Problem 5.7.
The information matrix is diagonal, so that the least-favourable family is the original gamma family with κ fixed at κ̂ = 0.7065. It follows quite easily that

ℓ̇_LF(0) ∝ Ȳ − μ̂,

and so a is one-sixth of the skewness of the sample average under the fitted gamma model, that is a = (1/3)(nκ̂)^(−1/2) = 0.1145. The same result is obtained somewhat more easily via (5.25), since we know that the influence function for the mean is L_t(y; F) = y − μ.

The numerical values of a and w for these data are 0.1145 and 0.1372 respectively, the latter from R = 999 simulated samples. Using these we compute the adjusted percentile bootstrap confidence limits as in Table 5.2.
Just how flexible is the BCa method? The following example presents a difficult challenge for all bootstrap methods, and illustrates how well the studentized bootstrap and BCa methods can compensate for weaknesses in the more primitive methods.
The coverages of these limits are calculated using the exact distribution of T. For example, for the basic bootstrap confidence limit

Pr(Y* = y_j) = p_j = exp(η l_j) / Σ_(k=1)^n exp(η l_k).   (5.26)

The parameter of interest θ is a monotone function of η with inverse η(θ), say; η̇(θ) denotes the first derivative dη(θ)/dθ. The MLE of η is η̂ = η(t) = 0, which corresponds to the EDF F̂ being the nonparametric MLE of the sampling distribution F.
The bias correction factor w is calculated as before from (5.22), but using nonparametric bootstrap simulation to obtain values of t*. The skewness correction a is given by the empirical analogue of (5.23), where now the empirical influence values l_j take the place of the log likelihood derivatives,

a = (1/6) (Σ_j l_j³) / (Σ_j l_j²)^(3/2),   (5.27)

with corresponding variance approximation

v_L = n^(−2) Σ_j l_j².   (5.29)
see Problem 3.7. This can be helpful in writing an all-purpose algorithm for the BCa method; see also the discussion of the ABC method in the next section. An example is given at the end of the next section.
and

K^(−1)(α) = z_α + n^(−1/2) { m₁ − (1/6)m₃ − ((1/2)m₁ − (1/6)m₃) z_α² }.   (5.34)
This will also be second-order accurate if it agrees with (5.35), which requires that to order n^(−1/2),
Then calculations of the first and third moments of T − θ from the quadratic approximation show that

The results in (5.40) and (5.41) imply the identity for a in (5.37), after noting that the definitions of a in (5.23), (5.25) and (5.27) used in the adjusted percentile method are obtained by substituting estimates for moments of the influence function. The identity for w in (5.37) is confirmed by noting that the original definition w = Φ^(−1){Ĝ(t)} approximates Φ^(−1){G(θ)}, which by applying (5.30) with u = 0 agrees with (5.37).
Basic and percentile methods

Similar calculations show that the basic bootstrap and percentile confidence limits are only first-order accurate. However, they are both superior to the normal approximation limits, in the sense that equi-tailed confidence intervals are second-order accurate. For example, consider the 1 − 2α basic bootstrap confidence interval with limits

which, after expanding in Taylor series and dropping n^(−1) terms, and then noting that z_α² = z_(1−α)² and φ(z_α) = φ(z_(1−α)), turns out to equal

here V has been approximated by v in the definition of m₁, and we have used z_(1−α) = −z_α.
The constants a, b and c in (5.42) are defined by (5.39), in which the expectations will be estimated. Special forms of the ABC method correspond to special-case estimates of these expectations. In all cases we take v to be v_L.
Parametric case

If the estimate t is a smooth function of sample moments, as is the case for an exponential family, then the constants in (5.39) are easy to estimate. With a temporary change of notation, suppose that t = t(s) where s = n^(−1) Σ s(y_j) has p components, and define μ = E(S), so that θ = t(μ). Then, writing ṫ = ∂t(s)/∂s and ẗ = ∂²t(s)/∂s∂sᵀ,

L_t(Y_j) = ṫ(μ)ᵀ{s(Y_j) − μ},   Q_t(Y_j, Y_k) = {s(Y_j) − μ}ᵀ ẗ(μ) {s(Y_k) − μ}.   (5.43)

Estimates for a, b and c can therefore be calculated using estimates for the first three moments of s(Y).
For the particular case where the distribution of S has the exponential family PDF

f_S(s; η) = exp{ηᵀs − ξ(η)},

the calculations can be simplified. First, define Σ(η) = var(S) = ξ̈(η), where ξ̈(η) = ∂²ξ(η)/∂η∂ηᵀ; then

v_L = ṫ(s)ᵀ Σ̂ ṫ(s),

with Σ̂ the value of Σ(η) under the fitted model. Substitution from (5.43) in (5.39), and estimation of the expectations, gives estimated constants which can be expressed simply as

c = (1/(2 v_L^(1/2))) ∂²t(s + εk)/∂ε² |_(ε=0),   (5.44)

where k = Σ̂ ṫ(s)/v_L^(1/2).
The confidence limit (5.42) can also be approximated by an evaluation of the statistic t, analogous to the BCa confidence limit (5.20). This follows by equating (5.42) with the right-hand side of the approximation t(s + v_L^(1/2)ε) ≈ t(s) + v_L^(1/2) εᵀ ṫ(s), with appropriate choice of ε. The result is

t̃_α = t( s + v_L^(1/2) z̃_α k / (1 − a z̃_α)² ),   (5.45)

where

z̃_α = w + z_α = a + c − b v_L^(−1/2) + z_α.

In this form the ABC confidence limit is an explicit approximation to the BCa confidence limit.
If the several derivatives in (5.44) are calculated by numerical differencing, then only 4p + 4 evaluations of t are necessary, plus one for every confidence limit calculated in the final step (5.45). Algorithms also exist for exact numerical calculation of derivatives.
Nonparametric case: single sample

If the estimate t is again a smooth function of sample moments, t = t(s), then (5.43) still applies, and substitution of empirical moments leads to

l_j = ∂t{(1 − ε)p̂ + ε1_j}/∂ε |_(ε=0),   (5.47)

and

q_jj = ∂²t{(1 − ε)p̂ + ε1_j}/∂ε² |_(ε=0),   (5.48)

where 1_j is the vector with 1 in the jth position and 0 elsewhere. Let us set t_j(p) = ∂t(p)/∂p_j and t_jk(p) = ∂²t(p)/∂p_j∂p_k; see Section 2.7.2 and Problem 2.16. Then alternative forms for the vector l and the full matrix q are

where J = 11ᵀ and 1 is a vector of ones. For each derivative the first form is convenient for approximation by numerical differencing, while the second form is often easier for theoretical calculation.
Estimates for a and b can be calculated directly as empirical versions of their definitions in (5.39), while for c it is simplest to use the analogue of the representation in (5.44). The resulting estimates are

a = (1/6) (Σ_j l_j³) / (Σ_j l_j²)^(3/2),   b = (1/(2n²)) Σ_j q_jj,   (5.49)

c = (1/(2 v_L^(1/2))) ∂²t(p̂ + εk̂)/∂ε² |_(ε=0),   k̂ = l/(n² v_L^(1/2)),

and the corresponding ABC confidence limit is

θ̃_α = t( p̂ + z̃_α k̂ / (1 − a z̃_α)² ).   (5.50)
If the several derivatives are calculated by numerical differencing, then the number of evaluations of t(p) needed is only 2n + 2, plus one for each confidence limit and the original value t. Note that the probability vector argument in (5.50) is not constrained to be proper, or even positive, so that it is possible for ABC confidence limits to be undefined.
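The differencing step can be sketched as follows for a statistic written as a function t.p of the probability vector p (our notation); one central and one second difference per case give the l_j and q_jj of (5.47) and (5.48):

abc.derivs <- function(t.p, n, eps = 0.001)
{ p0 <- rep(1, n)/n
  t0 <- t.p(p0)
  l <- q <- numeric(n)
  for (j in 1:n) {
    p1 <- (1 - eps) * p0; p1[j] <- p1[j] + eps
    p2 <- (1 + eps) * p0; p2[j] <- p2[j] - eps
    t1 <- t.p(p1); t2 <- t.p(p2)
    l[j] <- (t1 - t2)/(2 * eps)           # central difference for (5.47)
    q[j] <- (t1 - 2 * t0 + t2)/eps^2 }    # second difference for (5.48)
  list(t0 = t0, l = l, q = q) }           # 2n + 1 evaluations of t(p) in all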
Numerical comparisons between the adjusted percentile confidence limits and ABC limits are shown in Table 5.5. The ABC method appears to give reasonable approximations, except for the 99% interval under the gamma model.
π = (π₁₁, …, π₁ₙ₁, π₂₁, …, π₂ₙ₂, …),

p_ij = π_ij / Σ_(k=1)^(n_i) π_ik.   (5.51)

The set of EDFs is equivalent to π̂ = (c, …, c) for any positive c, and the observed value of the estimate is t = u(π̂). This artificial representation leads to expressions such as (5.29), in which the definition of l̃_ij is obtained by applying (5.47) to u(π). (Note that the real influence values l_ij and second derivatives q_ij,ij derived from t(p₁, …, p_k) should not be used.) That this method produces correct results is quite easy to verify using the several-sample extension of the quadratic approximation (5.38); see Section 3.2.1 and Problem 3.7.
Example 5.10 (Air-conditioning data failure ratio) The data of Example 1.1 form one of several samples corresponding to different aircraft. The previous sample (n₁ = 12) and a second sample (n₂ = 24) are given in Table 5.6:

Second aircraft:
3  5  5  13  14  15  22  22  23  30  36  39
44  46  50  72  79  88  97  102  139  188  197  210

Suppose that we want to estimate the ratio of failure rates for the two aircraft, and give confidence intervals for this ratio.

To set notation, let the mean failure times be μ₁ and μ₂ for the first and second aircraft, with θ = μ₂/μ₁ the parameter of interest. The corresponding sample means are ȳ₁ = 108.083 and ȳ₂ = 64.125, so the estimate for θ is t = ȳ₂/ȳ₁ = 0.593.
The empirical influence values are (Problem 3.5)

l₁ⱼ = −(n t/(n₁ ȳ₁)) (y₁ⱼ − ȳ₁)

and

l₂ⱼ = (n t/(n₂ ȳ₂)) (y₂ⱼ − ȳ₂),

where n = n₁ + n₂. This leads to formulae in agreement with (5.29), which gives the values of a and v_L already calculated. It remains to calculate b and c.
For b, application of (5.48) gives

q̃_(1j,1j) = 2t { n²(y₁ⱼ − ȳ₁)²/(n₁² ȳ₁²) + n n₂ (y₁ⱼ − ȳ₁)/(n₁² ȳ₁) }

and

q̃_(2j,2j) = −2 n n₁ (y₂ⱼ − ȳ₂)/(n₂² ȳ₁),

so by (5.49) we have

b = ȳ₂ Σ_j (y₁ⱼ − ȳ₁)² / (n₁² ȳ₁³),

whose value is b = 0.0720. (The bootstrap estimates b and v are respectively 0.104 and 0.1125.) Finally, for c we apply the second form in (5.49) to u(π), that is

c = (1/2) n^(−4) v_L^(−3/2) l̃ᵀ ü(π̂) l̃,
Pr{(Y₁, …, Yₙ) ∈ R_α(θ₀) | θ₀} = α,

is a 1 − α confidence region for θ. The shape of the region will be determined by the form of the test, including the alternative hypothesis for which the test is designed. In particular, an interval would usually be obtained if the alternative is two-sided, H_A : θ ≠ θ₀; an upper limit if H_A : θ < θ₀; and a lower limit if H_A : θ > θ₀.
Example 5.11 (Hazard ratio) For the AML data in Example 3.9, also analysed in Example 4.4, assume that the ratio of hazard functions h₂(z)/h₁(z) for the two groups is a constant θ. As before, let r_ij be the number in group i who were at risk just prior to the jth failure time z_j, and let y_j be 0 or 1 according as the failure at z_j is in group 1 or 2. Then a suitable statistic for testing H₀ : θ = θ₀ is

this is the score test statistic in the Cox proportional hazards model. Large values of t(θ₀) are evidence that θ > θ₀.

There are several possible resampling schemes that could be used here, including those described in Section 3.5 but modified to fix the constant hazard ratio θ₀. Here we use the simpler conditional model of Example 4.4, which holds fixed the survival and censoring times. Then for any fixed θ₀ the simulated values y₁*, …, yₙ* are generated by
r*₁ⱼ = max{ 0, r₁₁ − Σ_(k=1)^(j−1) (1 − y*ₖ) − c₁ⱼ },   r*₂ⱼ = max{ 0, r₂₁ − Σ_(k=1)^(j−1) y*ₖ − c₂ⱼ }.
In a more systematic development of the method, we must allow for a nuisance parameter λ, say, which also governs the data distribution but is not constrained by H₀. Then both R_α(θ) and C_(1−α)(Y₁, …, Yₙ) must depend upon λ to make the inversion method work exactly. Under the bootstrap approach λ is replaced by an estimate.
where T*(u) follows the distribution under ψ = (u, λ̂). This requires application of an interpolation method such as the one illustrated in the previous example.

The simplest test statistic is the point estimate T of θ, and then T(θ₀) = T. The method will tend to be more accurate if the test statistic is the studentized estimate. That is, if var(T) = σ²(θ, λ), then we take Z = (T − θ₀)/σ(θ₀, λ̂); for further details see Problem 5.11. The same remark would apply to score statistics, such as that in the previous example, where studentization would involve the observed or expected Fisher information.

Note that for the particular alternative hypothesis used to derive an upper limit, it would be standard practice to define the P-value as Pr{T(θ₀) ≤ t(θ₀) | F̂₀}, for example if T(θ₀) were an estimator for θ or its studentized form. Equivalently one can retain the general definition and solve p(θ₀) = 1 − α for an upper limit.

In principle these methods can be applied to both parametric and semiparametric problems, but not to completely nonparametric problems.
see Problem 5.12. A more ambitious application is bootstrap adjustment of the basic bootstrap confidence limit, which we develop here.

First we recall the full notations for the quantities involved in the basic bootstrap confidence interval method. The "ideal" upper 1 − α confidence limit is t(F̂) − a_α(F), where

We could try to eliminate the bias by adding a correction to a_α(F̂), but a more successful approach is to adjust the subscript α. That is, we replace a_α(F̂) by a_(q(α))(F̂) and estimate what the adjusted value q(α) should be. This is in the same spirit as the BCa method.
Ideally we want q(α) to satisfy

The solution q(α) will depend upon F, i.e. q(α) = q(α, F). Because F is unknown, we estimate q(α) by q̂(α) = q(α, F̂). This means that we obtain q̂(α) by solving the bootstrap version of (5.53), namely

This looks intimidating, but from the definition of a_α(F̂) we see that (5.54) can be rewritten as

The same method of adjustment can be applied to any bootstrap confidence limit method, including the percentile method (Problem 5.13) and the studentized bootstrap method (Problem 5.14).
To verify that the nested bootstrap reduces the order of coverage error made by the original bootstrap confidence limit, we can apply the general discussion of Section 3.9.1. In general we find that coverage 1 − α + O(n^(−a)) is corrected to 1 − α + O(n^(−a−1/2)) for one-sided confidence limits, whether a = 1/2 or 1. However, for equi-tailed confidence intervals coverage 1 − 2α + O(n^(−1)) is corrected to 1 − 2α + O(n^(−2)); see Problem 5.15.

Before discussing how to solve equation (5.55) using simulated samples, we look at a simple illustrative example where the solution can be found theoretically.
Example 5.12 (Exponential mean) Consider the parametric problem of exponential data with unknown mean μ. The data estimate for μ is t = ȳ, and F̂ is

2ȳ − ȳ c₂ₙ,α/(2n),

where Pr(χ²₂ₙ ≤ c₂ₙ,α) = α. To evaluate the left-hand side of (5.55), for the inner probability we have

which exceeds q if and only if 2n(2 − ȳ/ȳ*) > c₂ₙ,q. Therefore the outer probability on the left-hand side of (5.55) is

with q = q(α). Setting the probability on the right-hand side of (5.56) equal to 1 − α, we deduce that

2n / {2 − c₂ₙ,q(α)/(2n)} = c₂ₙ,α.

Using q(α) in place of α in the basic bootstrap confidence limit gives the adjusted upper 1 − α confidence limit 2nȳ/c₂ₙ,α, which has exact coverage 1 − α. So in this case the double bootstrap adjustment is perfect.
Figure 5.4 shows the actual coverages of nominal 1 − α bootstrap upper confidence limits when n = 10. There are quite large discrepancies for both basic and percentile methods, which are completely removed using the double bootstrap adjustment; see Problem 5.13. ■
This will be approximated by drawing M samples from F̂*, calculating the estimator values t**ₘ for m = 1, …, M and computing the estimate

(1/R) Σ_(r=1)^R I{ u_(M,r) > q̂(α) } = 1 − α,

which is to say that q̂(α) is the α quantile of the u_(M,r). The simplest way to obtain q̂(α) is to order the values u_(M,r) into u_(M,(1)) ≤ ⋯ ≤ u_(M,(R)) and then set q̂(α) = u_(M,((R+1)α)). What this amounts to is that the (R + 1)αth ordered value is read off from a Q-Q plot of the u_(M,r) against quantiles of the U(0, 1) distribution, and that ordered value is then used to give the required quantile of the t* − t. We illustrate this in the next example.
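A sketch of the whole calculation for the basic limit, in the spirit of nested.corr in the practicals (the names are ours; stat.fun(data, w) returns the estimate first):

nested.basic <- function(data, stat.fun, R = 999, M = 249)
{ n <- nrow(data)
  t0 <- stat.fun(data, rep(1, n)/n)[1]
  t.star <- u <- numeric(R)
  for (r in 1:R) {
    w <- tabulate(sample(n, n, replace = T), n)/n      # resample weights
    t.star[r] <- stat.fun(data, w)[1]
    i <- rep(1:n, round(n * w))
    tt <- boot(data[i,], stat.fun, R = M, stype = "w")$t[,1]
    u[r] <- sum(tt - t.star[r] <= t.star[r] - t0)/(M + 1) }  # u_{M,r}
  list(t.star = t.star, u = u) }   # q.hat(alpha) = sort(u)[(R+1)*alpha]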
The total number of samples involved in this calculation is RM. Since we always think of simulating as many as 1000 samples to approximate probabilities, here this would suggest as many as 10⁶ samples overall. The calculations of Section 4.5 would suggest something a bit smaller, say M = 249 to be safe, but this is still rather impractical. However, there are ways of greatly reducing the overall number of simulations, two of which are described in Chapter 9.
Example 5.13 (Kernel density estimate) Bootstrap confidence intervals for the value of a density raise some awkward issues, which we now discuss, before outlining the use of the nested bootstrap in this context.

The standard kernel estimate of the PDF f(y) given a random sample y₁, …, yₙ is

where w(·) is a symmetric density with mean zero and unit variance, and h is the bandwidth. One source of difficulty is that if we consider the estimator to be t(F̂), as we usually do, then t(F) = h^(−1) ∫ w{h^(−1)(y − x)} f(x) dx is being estimated, not f(y). The mean and variance of f̂(y; h) are approximately

chosen in such a way that nh → ∞, and this makes both bias and variance tend to zero as n increases. The density estimate then has the form tₙ(F̂), such that tₙ(F) → t(F) = f(y).
Because the variance in (5.57) is approximately proportional to the mean, it makes sense to work with the square root of the estimate. That is, we take T = {f̂(y; h)}^(1/2) as estimator of θ = {f(y)}^(1/2). By the delta method of Section 2.7.1 we have from (5.57) that the approximate mean and variance of T are

has mean exactly equal to f̂(y; h); the approximate variance is the same as in (5.57) except that f̂(y; h) replaces f(y). It follows that T* = {f̂*(y; h)}^(1/2) has approximate mean and variance

Z = [{f̂(y; h)}^(1/2) − {f(y)}^(1/2)] / {(1/4)(nh)^(−1)K}^(1/2),   Z* = [{f̂*(y; h)}^(1/2) − {f̂(y; h)}^(1/2)] / {(1/4)(nh)^(−1)K}^(1/2),
where both ε and ε* are N(0, 1). This means that quantiles of Z cannot be well approximated by quantiles of Z*, no matter how large is n. The same thing happens for the untransformed density estimate.

There are several ways in which we can try to overcome this problem. One of the simplest is to change h to be of order n^(−1/3), when calculations similar to those above show that Z ≈ ε and Z* ≈ ε*. Figure 5.5 illustrates the effect. Here we estimate the density at y = 0 for samples from the N(0, 1) distribution, with w(·) the standard normal density. The first two panels show box plots of 500 values of z and z* when h = n^(−1/5), which is near-optimal for estimation in this case, for several values of n; the values of z* are obtained by resampling from one dataset. The last two panels correspond to h = n^(−1/3). The figure confirms the key points of the theory sketched above: that Z is biased away from zero when h = n^(−1/5), but not when h = n^(−1/3); and that the distributions of Z and Z* are quite stable and similar when h = n^(−1/3).

Under resampling from F̂, the studentized bootstrap applied to {f̂(y; h)}^(1/2) should be consistent if h ∝ n^(−1/3). From a practical point of view this means considerable undersmoothing in the density estimate, relative to standard practice for estimation. A bias in Z of order n^(−1/3) or worse will remain, and this suggests a possibly useful role for the double bootstrap.
For a numerical example of nested bootstrapping in this context we revisit Example 4.18, where we discussed the use of a kernel density estimate in estimating species abundance. The estimated PDF is

f̂(y; h) = (nh)^(−1) Σ_(j=1)^n φ{(y − y_j)/h},

where φ(·) is the standard normal density, and the value of interest is f̂(0; h), which is used to estimate f(0). In light of the previous discussion, we base
In Example 9.14 we describe how saddlepoint methods can greatly reduce the time taken to perform the double bootstrap in this problem. It might be possible to avoid the difficulties caused by the bias of the kernel estimate by using a clever resampling scheme, but it would be more complicated than the direct approach described above. ■
divided by 100. The normal approximation method uses the delta method variance approximation. The results suggest that the studentized method gives the best results, provided the log scale is used. Otherwise, the studentized method and the percentile, BCa and ABC methods are comparable but only really satisfactory at the larger sample sizes.

Figure 5.7 shows box plots of the lengths of 1000 confidence intervals for both sample sizes. The most pronounced feature for n₁ = n₂ = 10 is the long, sometimes very long, lengths for the two studentized methods, which helps to account for their good error rates. This feature is far less prominent at the larger sample sizes. It is noticeable that the normal, percentile, BCa and ABC intervals are short compared to the exact ones, and that taking logs improves the basic intervals. Similar comments apply when n₁ = n₂ = 25, but with less force.
Q = (T − θ)ᵀ V^(−1) (T − θ),   (5.60)

{θ : (T − θ)ᵀ V^(−1) (T − θ) ≤ a_(1−α)}.   (5.61)
Q* = (T* − t)ᵀ V*^(−1) (T* − t),

which will be calculated for each of R simulated samples. If we denote the ordered bootstrap values by q*₍₁₎ ≤ ⋯ ≤ q*₍R₎, then the 1 − α bootstrap confidence region is the set

As in the scalar case, a common and useful choice for v is the delta method variance estimate v_L.

The same method can be applied on any scales which are monotone transformations of the original parameter scales. For example, if h(θ) has ith component h_i(θ_i), say, and if D̂ is the diagonal matrix with elements ∂h_i/∂θ_i evaluated at θ = t, then we can apply (5.62) with the revised definition

A particular choice for h(·) would often be based on diagnostic plots of components of t* and v*, the objectives being to attain approximate normality and approximately stable variance for each component.

This method will be subject to the same potential defects as the studentized bootstrap method of Section 5.2. There is no vector analogue of the adjusted percentile methods, but the nested bootstrap method can be applied.
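A sketch of the region calculation, assuming arrays t.star and v.star of bootstrap estimates and variance matrices (the names are ours):

region.quantile <- function(t0, t.star, v.star, alpha = 0.05)
{ R <- nrow(t.star)
  q <- numeric(R)
  for (r in 1:R) {                 # q*_r = (t*_r - t)' v*_r^{-1} (t*_r - t)
    d <- t.star[r,] - t0
    q[r] <- sum(d * solve(v.star[,,r], d)) }
  sort(q)[floor((R + 1) * (1 - alpha))] }
# theta lies in the region if (t - theta)' v^{-1} (t - theta) <= this quantile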
from which we calculate the maximum likelihood estimators T = (μ̂, κ̂). The numerical values are μ̂ = 108.083 and κ̂ = 0.7065. A straightforward calculation shows that the delta method variance approximation, equal to the inverse of the expected information matrix as in Section 5.2, is
The standard likelihood ratio 1 − α confidence region is the set of values of (μ, κ) for which

where c₂,₁₋α is the 1 − α quantile of the χ²₂ distribution. The top left panel of Figure 5.8 shows the 0.50, 0.95 and 0.99 confidence regions obtained in this way. The top right panel is the same, except that c₂,₁₋α is replaced by a bootstrap estimate obtained from R = 999 samples simulated from the fitted gamma model. This second region is somewhat larger than, but of course has the same shape as, the first.
From the bootstrap simulation we have estimators t* = (μ̂*, κ̂*) from each sample, from which we calculate the corresponding variance approximations using (5.64), and hence the quadratic forms q* = (t* − t)ᵀ v*^(−1) (t* − t). We then apply (5.62) to obtain the studentized bootstrap confidence regions shown in the bottom left panel of Figure 5.8. This is clearly nothing like the likelihood-based confidence regions above, partly because it fails completely to take account of the mild skewness in the distribution of μ̂* and the heavy skewness in the distribution of κ̂*. These features are clear in the histogram plots of Figure 5.9.

Logarithmic transformation of both μ and κ improves matters considerably: the bottom right panel of Figure 5.8 comes from applying the studentized bootstrap method after dual logarithmic transformation. Nevertheless, the solution is not completely satisfactory, in that the region is too wide on the κ axis and slightly narrow on the μ axis. This could be predicted to some extent by plotting v_L* versus t*, which shows that the log transformation of κ is not quite strong enough. Perhaps more important is that there is a substantial bias in κ̂: the bootstrap bias estimate is 0.18.

One lesson from this example is that where a likelihood is available and usable, it should be used, with parametric simulation to check on, and if necessary replace, standard approximations for quantiles of the log likelihood ratio statistic. ■
Figure 5.9 Histograms of μ̂* and κ̂* from R = 999 bootstrap samples from the gamma model with μ̂ = 108.083 and κ̂ = 0.7065, fitted to the air-conditioning data.
In order to set a confidence region for the mean polar axis, or equivalently (θ, φ), we let

b(θ, φ) = (cos θ cos φ, cos θ sin φ, −sin θ)ᵀ,   c(θ, φ) = (−sin φ, cos φ, 0)ᵀ

denote the unit vectors orthogonal to a(θ, φ). The sample values of these vectors are â, b̂ and ĉ, and the sample eigenvalues are λ̂₁ ≤ λ̂₂ ≤ λ̂₃. Let Â denote the 2 × 3 matrix (b̂, ĉ)ᵀ and B̂ the 2 × 2 matrix with (j, k)th element

{(λ̂₃ − λ̂_j)(λ̂₃ − λ̂_k)}^(−1) n^(−1) Σ (b̂ᵀy_i)(ĉᵀy_i)(âᵀy_i)².
Example 5.16 (City population data) For the ratio estimation problem of Example 1.2, the statistic d = ū would often be regarded as ancillary. The reason rests in part on the notion of a model for linear regression of x on u with variation proportional to u. The left panel of Figure 5.12 shows the scatter plot of t* versus d* for the R = 999 nonparametric bootstrap samples used earlier. The observed value of d is 103.1. The middle and right panels of the figure show trends in the conditional mean and variance, E*(T* | d*) and var*(T* | d*), these being approximated by crude local averaging in the scatter plot on the left.

The calculation of confidence limits for the ratio θ = E(X)/E(U) is to be made conditional on d* = d, the observed mean of u. Suppose, for example, that we want to apply the basic bootstrap method. Then we need to approximate the conditional quantiles a_p(d) of T − θ given D = d for p = α and 1 − α, and
Figure 5.12 City population data, n = 49: ratio estimates t* versus d*, and conditional means and variances of t* given d*. R = 999 nonparametric samples.
and the simplest way to use our simulated samples to approximate this is to use only those samples for which d* is "near" d. For example, we could take the R_d = 99 samples whose d* values are closest to d and approximate a_p(d) by the 100pth ordered value of t* in those samples.
Certainly stratification of the simulation results by intervals of d* values shows quite strong conditional effects, as evidenced in Figure 5.12. The difficulty is that R_d = 99 samples is not enough to obtain good estimates of conditional quantiles, and certainly not to distinguish between unconditional quantiles and the conditional quantiles given d* = d, which is near the mean. Only with an increase of R to 9999, and using strata of R_d = 499 samples, does a clear picture emerge. Figure 5.13 shows plots of conditional quantile estimates from this larger simulation.
How different are the conditional and unconditional distributions? Table 5.10 shows bootstrap estimates of the cumulative conditional probabilities Pr(T ≤ a_p | D = d), where a_p is the unconditional p quantile, for several values of p. Each estimate is the proportion of times in R_d = 499 samples that t* is less than or equal to the unconditional quantile estimate t*₍₁₀₀₀₀ₚ₎. The comparison suggests that conditioning does not have a large effect in this case.
A more efficient use of bootstrap samples, which takes advantage of the smoothness of quantiles as a function of d, is to estimate quantiles for interval strata of R_d samples and then for each level p to fit a smooth curve. For example, if the kth such stratum gives quantile estimates â_(p,k) and average value d̄_k for d*, then we can fit a smoothing spline to the points (d̄_k, â_(p,k)) for each p and interpolate the required value a_p(d) at the observed d. Figure 5.14 illustrates this for R = 9999 and non-overlapping strata of size R_d = 199, with p = 0.025 and 0.975. Note that interpolation is only needed at the centre of the curve. Use of non-overlapping intervals seems to give the best results. ■
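The stratum-and-spline construction can be sketched as follows, with bootstrap values t.star and d.star (the names are ours):

cond.quantile <- function(t.star, d.star, d0, p = 0.975, Rd = 199)
{ o <- order(d.star)
  t.o <- t.star[o]; d.o <- d.star[o]
  k <- floor(length(t.star)/Rd)                     # non-overlapping strata
  dbar <- ahat <- numeric(k)
  for (i in 1:k) {
    j <- ((i - 1) * Rd + 1):(i * Rd)
    dbar[i] <- mean(d.o[j])                         # average d* in stratum
    ahat[i] <- sort(t.o[j])[floor((Rd + 1) * p)] }  # stratum quantile estimate
  predict(smooth.spline(dbar, ahat), d0)$y }        # interpolate at observed d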
Figure 5.15 Annual discharge of the River Nile at Aswan, plotted against year (1871 to 1970).
Example 5.17 (Nile data) The data plotted in Figure 5.15 are annual discharges y of the River Nile at Aswan from 1871 to 1970. Interest lies in the year 1870 + θ in which the mean discharge drops from μ₁ = 1100 to μ₂ = 870; these mean values are estimated, but it is reasonable to ignore this fact and we shall do so.

The least squares estimate of the integer θ maximizes

S(θ) = Σ_(j=1)^θ { y_j − (1/2)(μ₁ + μ₂) }.
4.62   92   84   93   93   95   97   87   93   76
4.87   91   91   91   95   89   92   92   95   76
5.12   92   96  100   95   86   97  100   97   81
5.49   97   96   89   98   96   95   97   96   85
6.06   94  100  100  100   97   96   95   95   86
6.94   93  100  100  100  100  100  100  100  100
namely

is smooth in b*, c*. We fitted a logistic regression to the proportions in the 201 non-empty cells of the complete version of Table 5.11, the result being
5.10 Prediction

Closely related to confidence regions for parameters are confidence regions for future outcomes of the response Y, more usually called prediction regions. Applications are typically in more complicated contexts involving regression models (Chapters 6 and 7) and time series models (Chapter 8), so here we give only a brief discussion of the main ideas.

In the simplest situation we are concerned with prediction of one future response Y_(n+1) given observations y₁, …, yₙ from a distribution F. The ideal upper γ prediction limit is the γ quantile of F, which we denote by a_γ(F). The simplest approach to calculating a prediction limit is the plug-in approach, that is substituting the estimate F̂ for F to give â_γ = a_γ(F̂). But this is clearly biased in the optimistic direction, because it does not allow for the uncertainty in F̂. Resampling is used to correct for, or remove, this bias.
Parametric case

Suppose first that we have a fully parametric model, F = F_θ, say. Then the prediction limit a_γ(F̂) can be expressed more directly as a_γ(θ̂). The true coverage of this limit over repetitions of both data and predictand will not generally be γ, but rather

Pr{Y_(n+1) ≤ a_γ(θ̂) | θ} = h(γ).   (5.66)

Once h(γ) has been calculated, the adjusted γ prediction limit is taken to be â_γ = a_(g(γ))(θ̂), where

h{g(γ)} = γ.
Example 5.18 (Normal prediction limit) Suppose that Y₁, …, Y_(n+1) are independently sampled from the N(μ, σ²) distribution, where μ and σ are unknown, and that we wish to predict Y_(n+1) having observed y₁, …, yₙ. The plug-in method gives the basic γ prediction limit

â_γ = ȳₙ + sₙ Φ^(−1)(γ),

where ȳₙ is the average of y₁, …, yₙ.

where Z_(n−1) has the Student-t distribution with n − 1 degrees of freedom. This leads directly to the Student-t prediction limit
The preceding example suggests a more direct method for special cases involving means, which makes use of a point prediction ŷ_(n+1) and the distribution of prediction error Y_(n+1) − ŷ_(n+1): resampling can be used to estimate this distribution directly. This method will be applied to linear regression models in Section 6.3.3.
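For a single-sample mean the direct method amounts to the following sketch (the name pred.limit is ours):

pred.limit <- function(y, gamma = 0.95, R = 999)
{ n <- length(y)
  delta <- numeric(R)
  for (r in 1:R) {
    ystar <- sample(y, n, replace = T)        # re-estimate the point prediction
    ynew <- sample(y, 1)                      # simulated future observation
    delta[r] <- ynew - mean(ystar) }          # prediction error
  mean(y) + sort(delta)[floor((R + 1) * gamma)] }   # upper gamma limit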
Figure 5.16 Adjustment function h(γ) for prediction with sample size n = 10 from N(μ, σ²), with quadratic logistic fit (solid), and line giving h(γ) = γ (dots); horizontal axis the logit of γ.
Nonparametric case

Now consider the nonparametric context, where F̂ is the EDF of a single sample. The calculations outlined for the parametric case apply here also. First, if r/n ≤ γ < (r + 1)/n then the plug-in prediction limit is a_γ(F̂) = y₍ᵣ₎; equivalently, a_γ(F̂) = y₍[nγ]₎, where [·] means integer part. Straightforward calculation shows that

Pr(Y_(n+1) ≤ y₍ᵣ₎) = r/(n + 1),
(1994) discuss the numbers of samples required when the nested bootstrap is used to calibrate a confidence interval.

Conditional methods have received little attention in the literature. Example 5.17 is taken from Hinkley and Schechtman (1987). Booth, Hall and Wood (1992) describe kernel methods for estimating the conditional distribution of a bootstrap statistic.

Confidence regions for vector parameters are almost untouched in the literature. There are no general analogues of adjusted percentile methods. Hall (1987) discusses likelihood-based shapes for confidence regions.

Geisser (1993) surveys several approaches to calculating prediction intervals, including resampling methods such as cross-validation.

References to confidence interval and prediction interval methods for regression models are given in the notes for Chapters 6 and 7; see also Chapter 8 for time series.
5.12 Problems

1  Suppose that we have a random sample y₁, …, yₙ from a distribution F whose mean μ is unknown but whose variance is known and equal to σ². Discuss possible nonparametric resampling methods for obtaining confidence intervals for μ, including the following: (i) use z = √n(ȳ − μ)/σ and resample from the EDF; (ii) use z = √n(ȳ − μ)/s and resample from the EDF, where s² is the usual sample variance of y₁, …, yₙ; (iii) as in (ii) but replace the EDF of the data by the EDF of the values ȳ + σ(y_j − ȳ)/s; (iv) as in (ii) but replace the EDF by a distribution on the data values whose mean and variance are ȳ and σ².
4  The gamma model (1.1) with mean μ and index κ can be applied to the data of Example 1.1. For this model, show that the profile log likelihood for μ is based on the constrained MLE κ̂_μ, which satisfies

n log(κ/μ) + n + Σ log y_j − Σ y_j/μ − n ψ(κ) = 0,

where ψ is the digamma function.
where p = p(F̂) = Pr*(Z* ≤ z | F̂). Let P be the random variable corresponding to p(F̂), with CDF G(·). Hence show that the unconditional probability is

Note that Pr(P ≤ α) = Pr{θ ∈ [T − V^(1/2)Z*_α, ∞)}, where Z*_α is the α quantile of the distribution of Z*, conditional on Y₁, …, Yₙ.

(b) Suppose that it is reasonable to approximate the distribution of P by the beta distribution with density u^(a−1)(1 − u)^(b−1)/B(a, b), 0 < u < 1; note that a, b → 1 as n → ∞. For some representative values of R, α, a and b, compare the coverage error of I₁ with that of the interval [T − V^(1/2)Z*_α, ∞).

(Section 5.2.3; Hall, 1986)
(a) When estimates (i) are used, and the y_j are independent N(μ, σ²) variables, show that an exact (1 − 2α) confidence interval for θ has endpoints

coverage (1 − 2α) is

Pr{ (n − 1)²/c_(n−1,1−α) ≤ C ≤ (n − 1)²/c_(n−1,α) },

where C has the χ²ₙ₋₁ distribution and c_(n−1,p) denotes its p quantile. Give also the coverages of the basic bootstrap confidence intervals based on θ̂ and log θ̂.

Calculate these coverages for n = 25, 50, 75 and α = 0.05, 0.025, and 0.005. Which of these intervals is preferable?

(c) See Practical 5.4, in which we take √5 = 2.236.

(Section 5.3.1)
7  Suppose that we have a parametric model with parameter vector ψ, and that θ = h(ψ) is the parameter of interest. The adjusted percentile (BCa) method is found by applying the scalar parameter method to the least-favourable family, for which the log likelihood ℓ(ψ) is replaced by ℓ_LF(ξ) = ℓ(ψ̂ + ξδ), with δ = i^(−1)(ψ̂)ḣ(ψ̂), where ḣ(·) is the vector of partial derivatives. Equations (5.21), (5.22) and (5.24) still apply.

Show in detail how to apply this extension of the BCa method to the problem of calculating confidence intervals for the ratio θ = μ₂/μ₁ of the means of two exponential distributions, given independent samples from those distributions. Use a numerical example (such as Example 5.10) to compare the BCa method to the exact method, which is based on the fact that θ̂/θ has an F distribution.

(Sections 5.3.2, 5.4.2; Efron, 1987)
8  For the ratio of independent means in Example 5.10, show that the matrix of second derivatives ü(π) has elements

ü_(1i,1j) = (n²t/(n₁² ȳ₁)) { 2(y₁ᵢ − ȳ₁)(y₁ⱼ − ȳ₁)/ȳ₁ + (y₁ᵢ − ȳ₁) + (y₁ⱼ − ȳ₁) },

ü_(1i,2j) = −(n²/(n₁n₂ ȳ₁²)) {(y₁ᵢ − ȳ₁)(y₂ⱼ − ȳ₂)},

and

ü_(2i,2j) = −(n²/(n₂² ȳ₁)) { (y₂ᵢ − ȳ₂) + (y₂ⱼ − ȳ₂) }.

Use these results to check the value of the constant c used in the ABC method in that example.
For the data of Example 1.2 we are interested in the ratio of means θ = E(X)/E(U). Define μ = (E(U), E(X))ᵀ and write θ = t(μ), which is estimated by t = t(s) with s = (ū, x̄)ᵀ. Show that

ṫ(μ) = ( −μ₂/μ₁² , 1/μ₁ )ᵀ,   ẗ(μ) = [ 2μ₂/μ₁³, −1/μ₁² ; −1/μ₁², 0 ].

From Problem 2.16 we have l_j = e_j/ū with e_j = x_j − t u_j. Derive expressions for the constants a, b and c in the nonparametric ABC method, and note that b = c v_L^(1/2).
θ̃_α = { x̄ + d_α Σ x_j e_j/(n² v_L^(1/2) ū) } / { ū + d_α Σ u_j e_j/(n² v_L^(1/2) ū) }.
The bootstrap confidence limit is û₁₋α = u₁₋α(t, λ̂). Show that if λ̂ is a consistent estimator for λ, in the sense that λ̂ = λ + o_p(1) as n → ∞, then the method is consistent in the sense that Pr(θ ≤ û₁₋α) = 1 − α + o(1). Further show that under certain conditions the coverage differs from 1 − α by O(n^(−1)).

(Section 5.5; Kabaila, 1993a; Carpenter, 1996)
The adjusted 1 − α upper confidence limit is then the 1 − q̂(α) quantile of T*. In the parametric bootstrap analysis for a single exponential mean, show that the percentile method gives upper 1 − α limit ȳ c₂ₙ,₁₋α/(2n), where c₂ₙ,p is the p quantile of the χ²₂ₙ distribution. Verify that the bootstrap adjustment of this limit gives the exact upper 1 − α limit 2nȳ/c₂ₙ,α.

(Section 5.6; Beran, 1987; Hinkley and Shi, 1989)
15  For an equi-tailed (1 − 2α) confidence interval, the ideal endpoints are t + β with values of β solving (3.31) with

Suppose that the bootstrap solutions are denoted by β̂_α and β̂₁₋α, and that in the language of Section 3.9.1 the adjustments b(F̂, γ) are β̂_(α+γ₁) and β̂_(1−α+γ₂). Show how to estimate γ₁ and γ₂, and verify that these adjustments modify coverage 1 − 2α + O(n^(−1)) to 1 − 2α + O(n^(−2)).

(Sections 3.9.1, 5.6; Hall and Martin, 1988)
G(„ I d ) ,
£ f= i W{h-'(d;-d)}
where w( ) is a density symmetric about zero and h is an adjustable bandwidth.
Investigate the bias and variance o f this estimate in the case where ( T , D ) is ap
proximately bivariate normal and w( ) = <p(-). Show that h = R ~ i/2 is a reasonable
choice.
(Section 5.9; Booth, Hall and W ood, 1992)
5.13 Practicals

1  Suppose that we wish to calculate a 90% confidence interval for the correlation θ between the two counts in the columns of cd4; see Practical 2.3. To obtain confidence intervals for θ under nonparametric resampling, working on the transformed scale

ζ = (1/2) log{(1 + θ)/(1 − θ)} :
2  Suppose that we wish to calculate a 90% confidence interval for the largest eigenvalue θ of the covariance matrix of the two counts in the columns of cd4; see Practicals 2.3 and 5.1. To obtain confidence intervals for θ under nonparametric resampling, using the empirical influence values to calculate vL:

eigen.fun <- function(d, w = rep(1, nrow(d))/nrow(d))
{ w <- w/sum(w)
  n <- nrow(d)
  m <- crossprod(w, d)                   # weighted column means
  m2 <- sweep(d, 2, m)                   # centred data
  v <- crossprod(diag(sqrt(w)) %*% m2)   # weighted covariance matrix
  eig <- eigen(v, symmetric=T)
  stat <- eig$values[1]                  # largest eigenvalue
  e <- eig$vectors[,1]                   # corresponding eigenvector
  i <- rep(1:n, round(n*w))
  ds <- sweep(d[i,], 2, m)
  l <- (ds %*% e)^2 - stat               # empirical influence values
  c(stat, sum(l^2)/n^2) }                # statistic and variance estimate vL
par(mfrow=c(2,2))
tsplot(capability$y, ylim=c(5,6))
abline(h=5.79, lty=2); abline(h=5.49, lty=2)
qqnorm(capability$y)
acf(capability$y)
acf(capability$y, type="partial")

To find nonparametric confidence limits for η using the estimates given by (ii) in Problem 5.6:
Following on from Practical 2.3, we use a double bootstrap with M = 249 to adjust the studentized bootstrap interval for a correlation coefficient applied to the cd4 data.

nested.corr <- function(data, w, t0, M)
{ n <- nrow(data)
  i <- rep(1:n, round(n*w))
  t <- corr.fun(data, w)
  z <- (t[1]-t0)/sqrt(t[2])
  nested.boot <- boot(data[i,], corr.fun, R=M, stype="w")
  z.nested <- (nested.boot$t[,1]-t[1])/sqrt(nested.boot$t[,2])
  c(z, sum(z.nested<z)/(M+1)) }
cd4.boot <- boot(cd4, nested.corr, R=9, stype="w",
                 t0=corr(cd4), M=249)
To get some idea how long you will have to wait if you set R = 999 you can time the call to boot using unix.time or dos.time: beware of time and memory problems. It may be best to run a batch job, with contents as above but with the last three lines repeated eight further times.

cd4.nested contains a nested simulation we did earlier. To compare the actual and nominal coverage levels:

par(pty="s")
qqplot((1:cd4.nested$R)/(1+cd4.nested$R), cd4.nested$t[,2],
       xlab="nominal coverage", ylab="estimated coverage", pch=".")
lines(c(0,1), c(0,1))
How close to nominal is the estimated coverage? To read off the original and corrected 95% confidence intervals:

q <- c(0.975, 0.025)
q.adj <- quantile(cd4.nested$t[,2], q)
t0 <- corr.fun(cd4)
z <- sort(cd4.nested$t[,1])
t0[1] - sqrt(t0[2])*z[floor((1+cd4.nested$R)*q)]
t0[1] - sqrt(t0[2])*z[floor((1+cd4.nested$R)*q.adj)]

Does the correction have much effect? Compare this interval with the corresponding ABC interval.

(Section 5.6)
6
Linear Regression
6.1 Introduction
One of the most important and frequent types of statistical analysis is regression analysis, in which we study the effects of explanatory variables or covariates on a response variable. In this chapter we are concerned with linear regression, in which the mean of the random response Y observed at value x = (x₁, …, x_p)ᵀ of the explanatory variable vector is

E(Y | x) = μ(x) = xᵀβ.

The model is completed by specifying the nature of random variation, which for independent responses amounts to specifying the form of the variance var(Y | x). For a full parametric analysis we would also have to specify the distribution of Y, be it normal, Poisson or whatever. Without this, the model is semiparametric.
For linear regression with normal random errors having constant variance, the least squares theory of regression estimation and inference provides clean, exact methods for analysis. But for generalizations to non-normal errors and non-constant variance, exact methods rarely exist, and we are faced with approximate methods based on linear approximations to estimators and central limit theorems. So, just as in the simpler context of Chapters 2-5, resampling methods have the potential to provide more accurate analysis.

We begin our discussion in Section 6.2 with simple least squares linear regression, where in ideal conditions resampling essentially reproduces the exact theoretical analysis, but also offers the potential to deal with non-ideal circumstances such as non-constant variance. Section 6.3 covers the extension to multiple explanatory variables. The related topics of aggregate prediction error and of variable selection based on predictive ability are discussed in Section 6.4. Robust methods of regression are examined briefly in Section 6.5.
where the ε_j are uncorrelated with zero means and equal variances σ². This constancy of variance, or homoscedasticity, seems roughly right for the example data. We refer to the data (x_j, y_j) as the jth case.

In general the values x_j might be controlled (by design), randomly sampled, or merely observed as in the example. But we analyse the data as if the x_j were fixed, because the amount of information about β = (β₀, β₁)ᵀ depends upon their observed values.
The simplest analysis of data under (6.1) is by the ordinary least squares method, on which we concentrate here. The least squares estimates for β are

β̂₁ = Σ(x_j − x̄)(y_j − ȳ) / Σ(x_j − x̄)²,   β̂₀ = ȳ − β̂₁x̄,   (6.2)

with residuals

e_j = y_j − μ̂_j,   (6.3)

where

μ̂_j = β̂₀ + β̂₁ x_j   (6.4)

are the fitted values, or estimated mean values, for the response at the observed x values.
The basic properties of the parameter estimates β̂₀, β̂₁, which are easily obtained under model (6.1), are that they are unbiased,

E(β̂₀) = β₀,   E(β̂₁) = β₁,   (6.5)

and

var(β̂₀) = σ²(1/n + x̄²/SSₓ),   var(β̂₁) = σ²/SSₓ,   (6.6)

where SSₓ = Σ(x_j − x̄)². The estimates are normally distributed and optimal if the errors ε_j are normally distributed; they are often approximately normal for other error distributions, but they are not robust to gross non-normality of errors or to outlying response values.

The raw residuals e_j are important for various aspects of model checking, and potentially for resampling methods since they estimate the random errors ε_j, so it is useful to summarize their properties also. Under (6.1),
e_j = ε_j − Σ_(k=1)^n h_jk ε_k,   (6.7)

where h_jk is the (j, k)th element of the hat matrix and h_j = h_jj is the jth leverage. Since e_j has mean zero and variance σ²(1 − h_j), this suggests use of the modified residuals

r_j = e_j / (1 − h_j)^(1/2).   (6.9)
We shall refer to these as modified residuals, to distinguish them from standardized residuals, which are in addition divided by the sample standard deviation. (Standardized residuals are called studentized residuals by some authors.) A normal Q-Q plot of the r_j will reveal obvious outliers, or clear non-normality of the random errors, although the latter may be obscured somewhat because of the averaging property of (6.7).
A simpler modification of residuals is to use 1 − h̄ = 1 − 2n^(−1) instead of individual leverages 1 − h_j, where h̄ is the average leverage; this will have a very similar effect only if the leverages h_j are fairly homogeneous. This simpler modification implies multiplication of all raw residuals by (1 − 2n^(−1))^(−1/2): the average will equal zero automatically because Σ e_j = 0.
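In code the modified residuals are immediate from a least squares fit; a sketch, assuming the (log-scale) mammals data are in a data frame with columns x and y:

fit <- lm(y ~ x, data = mammals)
e <- residuals(fit)                     # raw residuals e_j
h <- lm.influence(fit)$hat              # leverages h_j
r <- e/sqrt(1 - h)                      # modified residuals (6.9)
r.simple <- e/sqrt(1 - 2/length(e))     # simpler version using average leverage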
If (6.1) holds with homoscedastic random errors ε_j and if those random errors are normally distributed, or if the dataset is large, then standard distributional results will be adequate for drawing inferences with the least squares estimates. But if the errors are very non-normal or heteroscedastic, meaning that their variances are unequal, then those standard results may not be reliable and a resampling method may offer genuine improvement. In Sections 6.2.3 and 6.2.4 we describe two quite different resampling methods, the second of which is robust to failure of the model assumptions.

If strong non-normality or heteroscedasticity (which can be difficult to distinguish) appear to be present, then robust regression estimates may be considered in place of least squares estimates. These will be discussed in Section 6.5.
First formulation

The first possibility is that the pairs are randomly sampled from a bivariate distribution F for (X, Y). Then linear regression refers to linearity of the conditional mean of Y given X = x, that is

E(Y | X = x) = μ_y + (σ_xy/σ_x²)(x − μ_x),

with μ_x = E(X), μ_y = E(Y), σ_x² = var(X) and σ_xy = cov(X, Y). This conditional mean corresponds to the mean in (6.1), with

β₀ = μ_y − β₁μ_x,   β₁ = σ_xy/σ_x².   (6.11)

L_β{(x, y); F} = ( 1 − μ_x(x − μ_x)/σ_x² ; (x − μ_x)/σ_x² ) (y − β₀ − β₁x).   (6.12)

The empirical influence values as defined in Section 2.7.2 are therefore

l_j = ( 1 − n x̄(x_j − x̄)/SSₓ ; n(x_j − x̄)/SSₓ ) e_j.   (6.13)

The nonparametric delta method variance approximation (2.36) applied to β̂₁ gives

v_L = Σ(x_j − x̄)² e_j² / SSₓ².   (6.14)
Second formulation

The second possibility is that at any value of x, responses Y_x can be sampled from a distribution F_x(y) whose mean and variance are μ(x) and σ²(x), such that μ(x) = β₀ + β₁x. Evidently β₀ = μ(0), and the slope parameter β₁ is a linear contrast of mean values μ(x₁), μ(x₂), …, namely

β₁ = Σ(x_j − x̄) μ(x_j) / SSₓ.

In principle several responses could be obtained at each x_j. Simple linear regression with homoscedastic errors, with which we are initially concerned, corresponds to σ(x) = σ and

F_x(y) = G{y − μ(x)}.   (6.15)

The influence function for the least squares estimator is again given by (6.12), but with μ_x and σ_x² respectively replaced by x̄ and n^(−1)Σ(x_j − x̄)². Empirical influence values are still given by (6.13). The analogue of linear approximations (2.35) and (3.1) is β̂ ≈ β + n^(−1)Σ L_t{(x_j, y_j); F}, with variance n^(−2) Σ_(j=1)^n var[L_t{(x_j, Y_j); F}]. If the assumed homoscedasticity of errors is used to evaluate this, with the constant variance σ² estimated by n^(−1)Σ e_j², then the delta method variance approximation for β̂₁, for example, is

v_L = Σ e_j² / (n SSₓ);

strictly speaking this is a semiparametric approximation. This differs by a factor of (n − 2)/n from the standard estimate, which is given by (6.6) with residual mean square s² in place of σ².
The standard analysis for linear regression as outlined in Section 6.2.1 is the same for both situations, provided the random errors ε_j have equal variances, as would usually be judged from plots of the residuals.
Y j = p . j + ep j = (6.16)
a. a ,
E (* )-* > 2 “ f t + SS, '
. y^(x; — x)2var*(£;) , -
v ar (Pi) = -----------^ -------- J- = n ^ ( r , - - r f / S S x.
The latter will be approxim ately equal to the usual estim ate s2/ S S x, because
n_1 Y;(rj ~ r ) 2 = (n ~ 2)~' e] = s2- 1° fact if the individual hj are replaced by
their average h, then the m eans an d variances o f Pq and p \ are given exactly
by (6.5) an d (6.6) w ith the estim ates Pq, P i an d s2 substituted for param eter
values. T he advantage o f resam pling is im proved quantile estim ation when
norm al-theory distributions o f the estim ators Pq, P i , S 2 are n o t accurate.
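As a concrete illustration of the model-based scheme just described, here is a minimal sketch in Python (the function and variable names are ours, not from the text): it fits the least squares line, forms centred modified residuals, and regenerates responses from the fitted values.

```python
import numpy as np

def modelbased_boot(x, y, R=999, rng=np.random.default_rng(1)):
    n = len(x)
    X = np.column_stack([np.ones(n), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    h = np.sum(X * (X @ np.linalg.inv(X.T @ X)), axis=1)  # leverages h_j
    r = e / np.sqrt(1 - h)                                # modified residuals
    r = r - r.mean()                                      # centre them
    boot = np.empty((R, 2))
    for i in range(R):
        ystar = X @ beta + rng.choice(r, size=n, replace=True)
        boot[i] = np.linalg.lstsq(X, ystar, rcond=None)[0]
    return beta, boot

# bootstrap standard errors of (beta0, beta1): boot.std(axis=0, ddof=1)
```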
Example 6.1 (Mammals) For the data plotted in the right panel of Figure 6.1, the simple linear regression model seems appropriate. Standard analysis suggests that errors are approximately normal, although there is a small suspicion of heteroscedasticity: see Figure 6.2. The parameter estimates are β̂₀ = 2.135 and β̂₁ = 0.752.
From R = 499 bootstrap simulations according to the algorithm above, the estimated standard errors of intercept and slope are respectively 0.0958 and 0.0273, compared to the theoretical values 0.0960 and 0.0285. The empirical distributions of bootstrap estimates are almost perfectly normal, as they are for the studentized estimates. The estimated 0.05 and 0.95 quantiles for the studentized slope estimate

z* = (β̂₁* − β̂₁) / SE(β̂₁*),

where SE(β̂₁*) is the standard error for β̂₁* obtained from (6.6), are z*(25) = −1.640 and z*(475) = 1.589, compared to the standard normal quantiles ±1.645. So, as expected for a moderately large "clean" dataset, the resampling results agree closely with those obtained from standard methods. ■
Zero intercept
In some applications the intercept β₀ will not be included in (6.1). This affects the estimation of β₁ and σ² in obvious ways, but the resampling algorithm will also differ. First, the leverage values are different, namely h_j = x_j² / Σ_k x_k².
There are two important differences between this second bootstrap method and the previous one using a parametric model and simulated errors. First, with the second method we make no assumption about variance homogeneity; indeed we do not even assume that the conditional mean of Y given X = x is linear. This offers the advantage of potential robustness to heteroscedasticity, and the disadvantage of inefficiency if the constant-variance model is correct. Secondly, the simulated samples have different designs, because the values x₁*, …, x_n* are randomly sampled. The design fixes the information content of a sample, and in principle our inference should be specific to the information in our data. The variation in x₁*, …, x_n* will cause some variation in information, but fortunately this is often unimportant in moderately large datasets; see, however, Examples 6.4 and 6.6.
Note that in general the resampling distribution of a coefficient estimate will not have mean equal to the data estimate, contrary to the unbiasedness property that the estimate in fact possesses. However, the difference is usually negligible.
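The case-resampling scheme itself is simple; the following sketch (ours, in Python with numpy, names illustrative) draws whole cases with replacement and refits the line each time.

```python
import numpy as np

def case_boot(x, y, R=999, rng=np.random.default_rng(1)):
    n = len(x)
    X = np.column_stack([np.ones(n), x])
    boot = np.empty((R, 2))
    for i in range(R):
        idx = rng.integers(0, n, size=n)   # resample cases with replacement
        boot[i] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    return boot

# estimated bias and standard error of the slope:
#   boot[:, 1].mean() - beta1_hat,  boot[:, 1].std(ddof=1)
```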
Example 6.2 (Mammals) For the data of Example 6.1, a bootstrap simulation was run by resampling cases with R = 999. Table 6.1 shows the bias and standard error results for both intercept and slope. The estimated biases are very small. The striking feature of the results is that the standard error for the slope is considerably smaller than in the previous bootstrap simulation, which agreed with standard theory. The last column of the table gives robust versions of the standard errors, which are calculated by estimating the variance of ε_j to be r_j². For example, the robust estimate of the variance of β̂₁ is

SS_x⁻² Σ_j (x_j − x̄)² r_j².  (6.17)

This corresponds to the delta method variance approximation (6.14), except that r_j is used in preference to e_j. As we might have expected from previous discussion, the bootstrap gives an approximation to the robust standard error.
Figure 6.3 shows normal Q-Q plots of the bootstrap estimates β̂₀* and β̂₁*. For the slope parameter the right panel shows lines corresponding to normal distributions with the usual and the robust standard errors. The distribution of β̂₁* is close to normal, with variance much closer to the robust form (6.17) than to the usual form (6.6). ■
One disadvantage of the robust standard error is its inefficiency relative to the usual standard error when the latter is correct. A fairly straightforward calculation (Problem 6.6) gives the efficiency, which is approximately 40% for the slope parameter in the previous example. Thus the effective degrees of freedom for the robust standard error is approximately 0.40 times 62, or 25.
[Figure 6.3 Normal plots for bootstrapped estimates of intercept (left) and slope (right) for linear regression fit to logarithms of mammal data, with R = 999 samples obtained by resampling cases. The dotted lines give approximate normal distributions based on the usual formulae (6.5) and (6.6), while the dashed line shows the normal distribution for the slope using the robust variance estimate (6.17). Horizontal axes: quantiles of standard normal.]
The same loss of efficiency would apply approximately to bootstrap results for resampling cases.
where perm{ } denotes a permutation. Because all permutations are equally likely, we have

p = #{permutations such that T* ≥ t} / n!,

as in (4.20). In the present context we can take T = β̂₁, for which p is the same as if we used the sample Pearson correlation coefficient, but the same method applies for any appropriate slope estimator. In practice the test is performed by generating samples (x₁*, y₁*), …, (x_n*, y_n*) such that x_j* = x_j and (y₁*, …, y_n*) is a random permutation of (y₁, …, y_n), and fitting the least squares slope estimate β̂₁*. If this is done R times, then the one-sided P-value for alternative H_A : β₁ > 0 is

p = (#{β̂₁* ≥ β̂₁} + 1) / (R + 1).

It is easy to show that studentizing the slope estimate would not affect this test; see Problem 6.4. The test is exact in the sense that the P-value has a uniform distribution under H₀, as explained in Section 4.1; note that this uniform distribution holds conditional on the x values, which is the relevant property here.
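A minimal sketch of this permutation test in Python (names ours): the slope is recomputed for R random permutations of y, and the P-value counts the permuted slopes at least as large as the observed one.

```python
import numpy as np

def perm_test_slope(x, y, R=999, rng=np.random.default_rng(1)):
    xc = x - x.mean()
    def slope(yy):
        return xc @ (yy - yy.mean()) / (xc @ xc)  # least squares slope
    t = slope(y)
    t_star = np.array([slope(rng.permutation(y)) for _ in range(R)])
    return (np.sum(t_star >= t) + 1) / (R + 1)    # one-sided, H_A: beta1 > 0
```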
First bootstrap test
A bootstrap test whose result will usually differ negligibly from that of the permutation test is obtained by taking the null model as the pair of marginal EDFs of x and y, so that the x*s are randomly sampled with replacement from the x_j's, and independently the y*s are randomly sampled from the y_j's. Again β̂₁* is the slope fitted to the simulated data, and the formula for p is the same. As with the permutation test, the null hypothesis being tested is stronger than just zero slope.
The permutation method and its bootstrap look-alike apply equally well to any slope estimate, not just the least squares estimate.
Second bootstrap test
The next bootstrap test is based explicitly on the linear model structure with homoscedastic errors, and applies the general approach of Section 4.4. The null model is the null mean fit and the EDF of residuals from that fit. We calculate the P-value for the slope estimate under sampling from this fitted model. That is, data are simulated by

x_j* = x_j,  y_j* = μ̂_j0 + ε_j0*,

where μ̂_j0 = ȳ and the ε_j0* are sampled with replacement from the null model residuals e_j0 = y_j − ȳ, j = 1, …, n. The least squares slope β̂₁* is calculated from the simulated data. After R repetitions of the simulation, the P-value is calculated as before.
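A sketch of this second bootstrap test in Python (names ours): responses are regenerated from the null fit ȳ plus resampled null residuals, with x fixed.

```python
import numpy as np

def null_boot_test(x, y, R=999, rng=np.random.default_rng(1)):
    xc = x - x.mean()
    ssx = xc @ xc
    t = xc @ y / ssx                       # observed least squares slope
    e0 = y - y.mean()                      # null model residuals
    t_star = np.empty(R)
    for r in range(R):
        ystar = y.mean() + rng.choice(e0, size=len(y), replace=True)
        t_star[r] = xc @ ystar / ssx
    return (np.sum(t_star >= t) + 1) / (R + 1)
```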
This second bootstrap test differs from the first bootstrap test only in that the values of explanatory variables x are fixed at the data values for every case. Note that if residuals were sampled without replacement, this test would duplicate the exact permutation test, which suggests that this bootstrap test will be nearly exact.
The test could be modified by standardizing the residuals before sampling from them, which here would mean adjusting for the constant null model leverage n⁻¹. This would affect the P-value slightly for the test as described, but not if the test statistic were changed to the studentized slope estimate. It therefore seems wise to studentize regression test statistics in general, if model-based simulation is used; see the discussion of bootstrap pivot tests below.
p = Pr(Z ≥ z₀ | β₁ = 0, β₀, σ) ≐ Pr*(Z* ≥ z₀ | β̂₁, β̂₀, σ̂),

where Z* = (β̂₁* − β̂₁)/S* is computed from a sample simulated according to Algorithm 6.1, which uses the fit from the full model as in (6.16). So, applying the bootstrap as described in Section 6.2.3, we calculate the bootstrap P-value from the results of R simulated samples as

p = (#{z_r* ≥ z₀} + 1) / (R + 1),  (6.18)

where z₀ = β̂₁/s₁.
The relation of this method to confidence limits is that if the lower 1 − α confidence limit for β₁ is above zero, then p < α. Similar interpretations apply with upper confidence limits and confidence intervals.
[Figure: panels plotted against x; the caption cites Simonoff and Tsai (1994).]
The same method can be used with case resampling. If this were done as a precaution against error heteroscedasticity, then it would be appropriate to replace s₁ with the robust standard error defined as the square root of (6.17). If we wish to test a non-zero value β₁,₀ for the slope, then in (6.18) we simply replace β̂₁/s₁ by z₀ = (β̂₁ − β₁,₀)/s₁, or equivalently compare the lower confidence limit to β₁,₀.
With all of these tests there are simple modifications if a different alternative hypothesis is appropriate. For example, if the alternative is H_A : β₁ < 0, then the inequalities "≥" used in defining p are replaced by "≤", and the two-sided P-value is twice the smaller of the two one-sided P-values.
On balance there seems little to choose among the various tests described. The permutation test and its bootstrap look-alike are equally suited to statistics other than least squares estimates. The bootstrap pivot test with case resampling is the only one designed to test slope without assuming constant error variance under the null hypothesis. But one would usually expect similar results from all the tests.
The extensions to multiple linear regression are discussed in Section 6.3.2.
which happens 233 times. Therefore the bootstrap P-value is 0.234. In fact the use of the robust standard error makes little difference here: using the ordinary standard error gives P-value 0.252. Comparison of the ordinary t-statistic to the standard normal table gives P-value 0.28. ■
r_j = (y_j − μ̂_j) / {V(x_j)(1 − h_j)}^{1/2}  or  (y_j − μ̂_j) / {V(μ̂_j)(1 − h_j)}^{1/2},

depending on whether the variance function is expressed in terms of x_j or of the mean; the heteroscedastic model is

Y_j = β₀ + β₁ x_j + V_j^{1/2} δ_j.  (6.20)

The resampling algorithm is then as follows.

1. For j = 1, …, n,
   (a) set x_j* = x_j;
   (b) randomly sample δ_j* from r₁ − r̄, …, r_n − r̄; then
   (c) set y_j* = β̂₀ + β̂₁ x_j + V_j^{1/2} δ_j*, where V_j is V(x_j) or V(μ̂_j) as appropriate.
2. Fit linear regression by ordinary least squares to data (x₁, y₁*), …, (x_n, y_n*), giving estimates β̂₀*, β̂₁*, s*².
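A sketch of this algorithm in Python (ours; the variance function V is a user-supplied assumption, e.g. an estimated V(x)):

```python
import numpy as np

def hetero_boot(x, y, V, R=999, rng=np.random.default_rng(1)):
    """V: callable giving the assumed variance function V(x)."""
    n = len(x)
    X = np.column_stack([np.ones(n), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    mu = X @ beta
    h = np.sum(X * (X @ np.linalg.inv(X.T @ X)), axis=1)   # leverages
    r = (y - mu) / np.sqrt(V(x) * (1 - h))                 # standardized residuals
    d = r - r.mean()
    boot = np.empty((R, 2))
    for i in range(R):
        ystar = mu + np.sqrt(V(x)) * rng.choice(d, size=n, replace=True)
        boot[i] = np.linalg.lstsq(X, ystar, rcond=None)[0]
    return beta, boot
```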
For weighted least squares with weights w_j, corresponding to var(Y_j) = κ/w_j, the estimates are

β̂₁ = Σ_j w_j (x_j − x̄_w) y_j / Σ_j w_j (x_j − x̄_w)²,  β̂₀ = ȳ_w − β̂₁ x̄_w,

where x̄_w = Σ w_j x_j / Σ w_j and ȳ_w = Σ w_j y_j / Σ w_j, with leverage values

h_j = w_j / Σ_i w_i + w_j (x_j − x̄_w)² / Σ_i w_i (x_i − x̄_w)²,

and

var(β̂₁) = κ / Σ_j w_j (x_j − x̄_w)².  (6.21)
Example 6.4 (Returns data) As mentioned in Example 6.3, the data in Figure 6.4 show an increase in error variance with market return, x. Table 6.2 compares the bootstrap variances of the parameter estimates from ordinary least squares for case resampling and the wild bootstrap, with R = 999. The estimated variance of β̂₁ from resampling cases is larger than for the wild bootstrap, and for the full data it makes little difference when the modified residuals are used.
Case 22 has high leverage, and its exclusion increases the variances of both estimates. The wild bootstrap is again less variable than bootstrapping cases, with the wild bootstrap of modified residuals intermediate between them.
We mentioned earlier that the design will vary when resampling cases. The left panel of Figure 6.6 shows the simulated slope estimates plotted against the sums of squares Σ(x_j* − x̄*)², for 200 bootstrap samples. The plotting character distinguishes the number of times case 22 occurs in the resamples: we return to this below. The variability of β̂₁* decreases sharply as the sum of squares increases. Now usually we would treat the sum of squares as fixed in the analysis, and this suggests that we should calculate the variance of β̂₁* from those bootstrap samples for which Σ(x_j* − x̄*)² is close to the original value Σ(x_j − x̄)², shown by the dotted vertical line. If we take the subset between the dashed lines, the estimated variance is closer to that for the wild bootstrap, as shown by the values in Table 6.2 and by the Q-Q plot in the right panel of Figure 6.6. This is also true when case 22 is excluded.
The main reason for the large variability of Σ(x_j* − x̄*)² is that case 22 has high leverage, as its position at the bottom left of Figure 6.4 shows. Figure 6.6 shows that it has a substantial effect on the precision of the slope estimate: the most variable estimates are those where case 22 does not occur, and the least variable those where it occurs two or more times. ■
The multiple linear regression model is

Y_j = β₀ x_j0 + β₁ x_j1 + ⋯ + β_p x_jp + ε_j,  (6.22)

where for models with an intercept x_j0 = 1. In the more convenient vector form the model is

Y_j = x_jᵀβ + ε_j,

with x_jᵀ = (x_j0, x_j1, …, x_jp). The combined matrix representation for all responses Yᵀ = (Y₁, …, Y_n) is

y = Xβ + ε.  (6.23)

The least squares estimates are

β̂ = (XᵀX)⁻¹ Xᵀ y,

with residuals e_j = y_j − x_jᵀβ̂ and empirical influence values

l_j = n (XᵀX)⁻¹ x_j e_j;  (6.25)

see Problem 6.1. These generalize equations (6.13) and (6.14). The variance approximation is improved by using the modified residuals

r_j = e_j / (1 − h_j)^{1/2}

in place of the e_j, and then v_L generalizes (6.17).
Bootstrap algorithms generalize those in Sections 6.2.3 and 6.2.4. That is, model-based resampling generates data according to

Y_j* = x_jᵀβ̂ + ε_j*,

where the ε_j* are randomly sampled from the modified residuals r₁, …, r_n, or their centred counterparts r_j − r̄. Case resampling operates by randomly resampling cases from the data. Pros and cons of the two methods are the same as before, provided p is small relative to n and the design is far from being singular. The situation where p is large requires special attention.
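The generalization to several covariates is immediate in code; the following Python sketch (ours, assuming X already contains a column of ones) implements the model-based scheme with modified residuals.

```python
import numpy as np

def mlr_modelbased_boot(X, y, R=999, rng=np.random.default_rng(1)):
    n, q = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    h = np.sum(X * (X @ np.linalg.inv(X.T @ X)), axis=1)   # leverages h_j
    r = (y - X @ beta) / np.sqrt(1 - h)                    # modified residuals
    r = r - r.mean()
    boot = np.empty((R, q))
    for i in range(R):
        ystar = X @ beta + rng.choice(r, size=n, replace=True)
        boot[i] = np.linalg.lstsq(X, ystar, rcond=None)[0]
    return beta, boot
```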
Large p
Difficulty can arise with both model-based resampling and case resampling if p is very large relative to n. The following theoretical example illustrates an extreme version of the problem.
Suppose that n = 2p and that

X = ⎛ 1 0 ⋯ 0 ⎞
    ⎜ 1 0 ⋯ 0 ⎟
    ⎜ 0 1 ⋯ 0 ⎟
    ⎜ 0 1 ⋯ 0 ⎟
    ⎜ ⋮ ⋮   ⋮ ⎟
    ⎜ 0 0 ⋯ 1 ⎟
    ⎝ 0 0 ⋯ 1 ⎠,

so that each column contains exactly two ones. For this model

β̂_i = ½(y_{2i−1} + y_{2i}),  i = 1, …, p.
The implication for more general designs is that difficulties will arise with combinations cᵀβ̂ where c is in the subspace spanned by those eigenvectors of XᵀX corresponding to small eigenvalues. First, model-based resampling will give adequate results for standard error calculations, but bootstrap distributions may not improve on normal approximations in calculating confidence limits for the β_i's, or for prediction. Secondly, unconstrained case resampling can produce resampled designs that are even more nearly singular than the original, so that some coefficient estimates become very unstable.
Example 6.6 (Cement data) The data in Table 6.3 are classic in the regression literature as an example of near-collinearity. The four covariates are percentages of constituents which sum to nearly 100: the smallest eigenvalue of XᵀX is ℓ₁ = 0.0012, corresponding to eigenvector (−1, 0.01, 0.01, 0.01, 0.01).
Theoretical and bootstrap standard errors for coefficients are given in Table 6.4. For error resampling the results agree closely with theory, as expected. The bootstrap distributions of β̂_i* are very normal-looking: the hat matrix H is such that modified residuals r_j would look normal even for very skewed errors ε_j.
Case resampling gives much higher standard errors for coefficients, and the bootstrap distributions are visibly skewed with several outliers. Figure 6.7 shows scatter plots of two bootstrap coefficients versus the smallest eigenvalue ℓ₁* of X*ᵀX*; plots for the other two coefficients are very similar. The variability of β̂_i* increases substantially for small values of ℓ₁*, whose reciprocal ranges from ⅓ to 100 times the reciprocal of ℓ₁. Taking only those bootstrap samples which give the middle 500 values of ℓ₁* (which are between 0.0005 and 0.0012)
gives more reasonable standard errors, as seen in the penultimate row of Table 6.4. The last row, corresponding to dropping the smallest 200 values of ℓ₁*, gives very similar results. ■
For weighted least squares the coefficient estimates are

β̂ = (XᵀWX)⁻¹ Xᵀ W y,  (6.27)

the fitted values are μ̂ = Xβ̂, and the residual vector is e = (I − H)y, where now the hat matrix H is defined by

H = X (XᵀWX)⁻¹ Xᵀ W,  (6.28)

whose diagonal elements are the leverage values h_j. (Note that H is not symmetric in general. Some authors prefer to work with the symmetric matrix X′(X′ᵀX′)⁻¹X′ᵀ, where X′ = W^{1/2}X.) The residual vector e has variance var(e) = κ(I − H)W⁻¹, whose jth diagonal element is κ(1 − h_j)w_j⁻¹. So the modified residual is now

r_j = e_j / {w_j⁻¹(1 − h_j)}^{1/2}.  (6.29)
Model-based resampling then uses

y_j* = x_jᵀβ̂ + w_j^{−1/2} ε_j*,

with the ε_j* randomly sampled from the centred modified residuals.
For tests concerning a subset of coefficients, the model is written as

Y = Xβ + ε = X₀α + X₁γ + ε,

and the null hypothesis is H₀ : γ = 0.
We shall also need the residuals from this fit, which are e₀ = (I − H₀)y with H₀ = X₀(X₀ᵀX₀)⁻¹X₀ᵀ. The test statistic T will be based on the least squares estimate γ̂ for γ in the full model, which can be expressed as

γ̂ = (X₁.₀ᵀX₁.₀)⁻¹ X₁.₀ᵀ e₀,

where X₁.₀ = (I − H₀)X₁. Under the null model, data are simulated by

y* = μ̂₀ + ε₀*,

where μ̂₀ = H₀y and the components of the simulated error vector ε₀* are sampled without (permutation) or with (bootstrap) replacement from the n residuals in e₀. Note that this makes use of the assumed homoscedasticity of errors. Each case keeps its original covariate values, which is to say that X* = X. With the simulated data we regress y* on X to calculate γ̂* and hence the simulated test statistic t*, as described below. When this is repeated R times, the bootstrap P-value is

p = (#{t_r* ≥ t} + 1) / (R + 1).
The permutation version of the test is not exact when nuisance covariates X₀ are present, but empirical evidence suggests that it is close to exact.
Scalar γ
What should t be? For testing a single component, so that γ is a scalar, suppose that the alternative hypothesis is one-sided, say H_A : γ > 0. Then we could take t to be γ̂ itself, or possibly a studentized form such as z₀ = γ̂/v₀^{1/2}, where v₀ is an appropriate estimate of the variance of γ̂. If we compute the standard error using the null model residual sum of squares, then

v₀ = (n − q)⁻¹ e₀ᵀe₀ (X₁.₀ᵀX₁.₀)⁻¹,

where q is the rank of X₀. The same formula is applied to every simulated sample to get v₀* and hence z* = γ̂*/v₀*^{1/2}.
When there are no nuisance covariates X₀, v₀* = v₀ in the permutation test, and studentizing has no effect; the same is true if the non-null standard error is used. Empirical evidence suggests that this is approximately true when X₀ is present; see the example below. Studentizing is necessary if modified residuals are used, with standardization based on the null model hat matrix.
An alternative bootstrap test can be developed in terms of a pivot, as described for single-variable regression in Section 6.2.5. Here the idea is to treat Z = (γ̂ − γ)/V^{1/2} as a pivot, with V^{1/2} an appropriate standard error. Bootstrap simulation under the full fitted model then produces the R replicates of z* which we use to calculate the P-value. To elaborate, we first fit the full model μ̂ = Xβ̂ by least squares and calculate the residuals e = y − μ̂. Still assuming homoscedasticity, the standard error for γ̂ is calculated using the residual mean square; a simple formula is

v = (n − p − 1)⁻¹ eᵀe (X₁.₀ᵀX₁.₀)⁻¹.

Data are then simulated by

y* = Xβ̂ + ε*,  X* = X,

where the n errors in ε* are sampled independently with replacement from the residuals e or modified versions of these. The full regression of y* on X is then fitted, from which we obtain γ̂* and its estimated variance v*, these being used to calculate z* = (γ̂* − γ̂)/v*^{1/2}. From R repeats of this simulation we then have the one-sided P-value

p = (#{z_r* ≥ z₀} + 1) / (R + 1).
Example 6.7 (Rock data) The data in Table 6.5 are measurements on four cross-sections of each of 12 oil-bearing rocks, taken from two sites. The aim is to predict permeability from the other three measurements, which result from a complex image-analysis procedure. In all regression models we use the logarithm of permeability as response y. The question we focus on here is whether the coefficient of shape is significant in a multiple linear regression on all three variables.
The problem is nonstandard in that there are four replicates of the explanatory variables for each response value. If we fit a linear regression to all 48 cases treating them as independent, strong correlation among the four residuals for each core sample is evident: see Figure 6.8, in which the residuals have unit variance.
Under a plausible model which accounts for this, which we discuss in Example 6.9, the appropriate linear regression for testing purposes uses core averages of the explanatory variables. Thus if we represent the data as responses y_j and replicate vectors of the explanatory variables x_jk, k = 1, 2, 3, 4, then the model for our analysis is

y_j = x̄_jᵀβ + ε_j,

where the ε_j are independent. A summary of the least squares regression
[Figure 6.8: standardized residuals plotted against core number.]
Figure 6.9 shows results from both the null model resampling method and the full model pivot resampling method, in both cases using resampling of errors. The observed value of z is z₀ = 0.73, for which the one-sided P-value is 0.234 under the first method, and 0.239 under the second method. Thus shape should not be included in the linear regression, assuming that its effect would be linear. Note that R = 99 simulations would have been sufficient here. ■
Vector γ
For testing several components simultaneously, we take the test statistic to be the quadratic form

T = γ̂ᵀ (X₁.₀ᵀX₁.₀) γ̂,

or equivalently the difference in residual sums of squares for the null and full model least squares fits. This can be standardized to

{(n − q)/q} × (RSS₀ − RSS)/RSS₀,

where RSS₀ and RSS denote residual sums of squares under the null model and full model respectively.
We can apply the pivot method with full model simulation here also, using Z = (γ̂ − γ)ᵀ(X₁.₀ᵀX₁.₀)(γ̂ − γ)/S² with S² the residual mean square. The test statistic value is z₀ = γ̂ᵀ(X₁.₀ᵀX₁.₀)γ̂/s², for which the P-value is given by

p = (#{z_r* ≥ z₀} + 1) / (R + 1).
This would be equivalent to rejecting H₀ at level α if the 1 − α confidence set for γ does not include the point γ = 0. Again, case resampling would provide protection against heteroscedasticity: z would then require a robust standard error.
6.3.3 Prediction
A fitted linear regression is often used for prediction of a new individual response Y₊ when the explanatory variable vector is equal to x₊. Then we shall want to supplement our predicted value by a prediction interval. Confidence limits for the mean response can be found using the same resampling as is used to get confidence limits for individual coefficients, but limits for the response Y₊ itself (usually called prediction limits) require additional resampling to simulate the variation of Y₊ about x₊ᵀβ̂.
The idea is to approximate the distribution of the prediction error

δ = Ŷ₊ − Y₊ = x₊ᵀβ̂ − (x₊ᵀβ + ε₊)

by the distribution of

δ* = x₊ᵀβ̂* − (x₊ᵀβ̂ + ε₊*),  (6.30)

where ε₊* is sampled from the estimated error distribution and β̂* is a simulated vector of estimates from the model-based resampling algorithm. This assumes homoscedasticity of random error. Unconditional properties of the prediction error correspond to averaging over the distributions of both ε₊ and the estimates β̂, which we do in the simulation by repeating (6.30) for each set of values of β̂*. Having obtained the modified residuals from the data fit, the algorithm to generate R sets each with M predictions is as follows.
We pool the δ*s, and denote their ordered values by δ*(1) ≤ ⋯ ≤ δ*(RM). The bootstrap prediction limits are

ŷ₊ − δ*((RM+1)(1−α)),  ŷ₊ − δ*((RM+1)α),  (6.31)

where ŷ₊ = x₊ᵀβ̂. This is analogous to the basic bootstrap method for confidence intervals (Section 5.2).
A somewhat better approach which mimics the standard normal-theory analysis is to work with the studentized prediction error

Z = (Ŷ₊ − Y₊)/S,

where S is the square root of the residual mean square for the linear regression. The corresponding simulated values are z*_rm = δ*_rm / s_r*, with s_r* calculated in step 2 of Algorithm 6.4. The α and (1 − α) quantiles of Z are estimated by z*((RM+1)α) and z*((RM+1)(1−α)) respectively, where z*(1) ≤ ⋯ ≤ z*(RM) are the ordered values of all RM z*s. Then the studentized bootstrap prediction interval for Y₊ is

(ŷ₊ − s z*((RM+1)(1−α)),  ŷ₊ − s z*((RM+1)α)).  (6.32)

(It is unnecessary to standardize also by the square root of 1 + x₊ᵀ(XᵀX)⁻¹x₊, which would make the variance of Z close to 1, unless bootstrap results for different x₊ are pooled.)
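The basic limits (6.31) are easy to simulate; the following Python sketch (ours, with M = 1 and ε₊* drawn from the centred modified residuals as an assumed estimate of the error distribution) returns one such interval.

```python
import numpy as np

def boot_pred_limits(X, y, xplus, R=999, alpha=0.025,
                     rng=np.random.default_rng(1)):
    n = X.shape[0]
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    h = np.sum(X * (X @ np.linalg.inv(X.T @ X)), axis=1)
    r = (y - X @ beta) / np.sqrt(1 - h)       # modified residuals
    r = r - r.mean()
    yplus = xplus @ beta
    delta = np.empty(R)                       # M = 1 prediction per resample
    for i in range(R):
        ystar = X @ beta + rng.choice(r, size=n, replace=True)
        bstar = np.linalg.lstsq(X, ystar, rcond=None)[0]
        eps_plus = rng.choice(r, size=1)[0]   # simulated new error
        delta[i] = xplus @ bstar - (yplus + eps_plus)   # as in (6.30)
    d = np.sort(delta)
    k_lo = int((R + 1) * alpha) - 1
    k_hi = int((R + 1) * (1 - alpha)) - 1
    return yplus - d[k_hi], yplus - d[k_lo]   # basic limits, cf. (6.31)
```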
Example 6.8 (Nuclear power stations) Table 6.7 contains data on the cost of 32 light water reactors. The cost (in dollars ×10⁻⁶, adjusted to a 1976 base) is the response of interest, and the other quantities in the table are explanatory variables; they are described in detail in the data source.
We take log(cost) as the working response y, and fit a linear model with covariates PT, CT, NE, date, log(capacity) and log(N). The dummy variable PT indicates six plants for which there were partial turnkey guarantees, and it is possible that some subsidies may be hidden in their costs.
Suppose that we wish to obtain 95% prediction intervals for the cost of a station like case 32 above, except that its value for date is 73.00. The predicted value of log(cost) from the regression is x₊ᵀβ̂ = 6.72, and the square root of the residual mean square is s = 0.159. With α = 0.025 and a simulation with R = 999 and M = 1, (RM + 1)α = 25 and (RM + 1)(1 − α) = 975. The values of δ*(25) and δ*(975) are −0.539 and 0.551, so the 95% limits (6.31) are 6.18 and 7.27, which are slightly wider than the normal-theory limits of 6.25 and 7.19. For the limits (6.32) we get z*(25) = −3.680 and z*(975) = 3.512, so the limits for log(cost) are 6.13 and 7.28. The corresponding prediction interval for cost is [exp(6.13), exp(7.28)] = [459.4, 1451].
The usual caveats apply about extrapolating a trend outside the range of the data, and we should use these intervals with great caution. ■
The next example involves an unusual data structure, where there is hierarchical variation in the covariates.
Example 6.9 (Rock data) For the data discussed in Example 6.7, one objective is to see how well one can predict permeability from a single replicate of the three image-based measurements, as opposed to the four replicates obtained in the study. The previous analysis suggested that variable shape did not contribute usefully to a linear regression relationship for the logarithm of permeability, and this is confirmed by cross-validation analysis of prediction errors (Section 6.4.1). So here we concentrate on predicting permeability from the linear regression of y = log(permeability) on area and peri.
In Example 6.7 we commented on the strong intra-core correlation among the explanatory variables, and that must be taken into account here if we are to correctly analyse prediction of core permeability from single measurements of area and peri. One way to do this is to think of the four replicate values of u = (area, peri)ᵀ as unbiased estimates of an underlying core variable ξ, on which y has a linear regression. Then the data are modelled by

y_j = α + ξ_jᵀγ + η_j,  u_jk = ξ_j + δ_jk,  (6.33)

for j = 1, …, 12 and k = 1, …, K, where the η_j and δ_jk are uncorrelated errors with zero means, and for our data K = 4.
Under normality assumptions on the errors and the ξ_j, the linear regression of y_j on u_j1, …, u_jK depends only on the core average ū_j = K⁻¹ Σ_{k=1}^K u_jk. The regression coefficients depend strongly on K. For prediction from a single measurement u₊ we need the model with K = 1, and for resampling analysis we shall need the model with K = 4. These two versions of the observation regression model we write as

y_j = α^(K) + ū_jᵀγ^(K) + ε_j^(K),  (6.34)

for K = 1 and 4; the parameters α and γ in (6.33) correspond to α^(∞) and γ^(∞). Fortunately it turns out that both observation models can be fitted easily: for K = 4 we regress the y_j's on the core averages ū_j; and for K = 1 we fit linear regression with all 48 individual cases as tabled, ignoring the intra-core correlation among the ε_jk's, i.e. pretending that y_j occurs four times independently. Table 6.8 shows the coefficients for both fits, and compares them to corresponding estimates based on exact normal-theory analysis.
Suppose, then, that we want to predict the new response y₊ given a single set of measurements u₊. If we define x₊ᵀ = (1, u₊ᵀ), then the point prediction is Ŷ₊ = x₊ᵀβ̂^(1), where β̂^(1) are the coefficients in the fit of model (6.34) with K = 1, shown in the first row of Table 6.8. The EDF of the 48 modified residuals from this fit estimates the marginal distribution of the ε^(1) in (6.34), and hence of the error ε₊ in

Y₊ = x₊ᵀβ^(1) + ε₊.

The prediction error is

δ = Ŷ₊ − Y₊ = x₊ᵀβ̂^(1) − x₊ᵀβ^(1) − ε₊.  (6.35)

To resample the hierarchical covariate structure we set

u_jk* = ū_J + d_Jk,

where d_jk = u_jk − ū_j and J is randomly sampled from {1, 2, …, 12}. Our justification for this, in terms of retaining intra-core correlation, is given by the discussion in Section 3.8. It is potentially important to build the variation of u into the analysis. Since ū_j* = ū_J, the resampled responses are defined by

y_j* = α̂^(4) + ū_Jᵀγ̂^(4) + ε_j^(4)*,

where the ε_j^(4)* are randomly sampled from the 12 mean-adjusted, modified residuals r_j^(4) − r̄^(4) from the regression of the y_j's on the ū_j's. The estimates β̂^(1)* are now obtained by fitting the regression to the 48 simulated cases (u_jk*, y_j*), k = 1, …, 4 and j = 1, …, 12.
Figure 6.10 shows typical normal plots for prediction error y₊ − ŷ₊, these for x₊ = (1, 4000, 1000) and x₊ = (1, 10000, 4000), which are near the edge of the observed space, from R = 999 resamples and M = 1. The skewness of prediction error is quite noticeable. The resampling standard deviations for prediction errors are 0.91 and 0.93, somewhat larger than the theoretical standard deviations 0.88 and 0.87 obtained by treating the 48 cases as independent.
To calculate 95% intervals we set α = 0.025, so that (RM + 1)α = 25 and (RM + 1)(1 − α) = 975. The simulation values δ*(25) and δ*(975) are −1.63 and 1.93 at x₊ = (1, 4000, 1000), and −1.57 and 2.19 at x₊ = (1, 10000, 4000). The corresponding point predictions are 6.19 and 4.42, so 95% prediction intervals are (4.26, 7.82) at x₊ = (1, 4000, 1000) and (2.23, 5.99) at x₊ = (1, 10000, 4000). These intervals differ markedly from those based on normal theory treating all 48 cases as independent, those being (4.44, 7.94) and (2.68, 6.17). Much of the difference is due to the skewness of the resampling distribution of prediction error. ■
The aggregate prediction error for the least squares predictor is

D = n⁻¹ Σ_{j=1}^n E(Y₊j − x_jᵀβ̂)²,

in which β̂ is fixed and the expectation is over Y₊j = x_jᵀβ + ε₊j. We cannot calculate D exactly, because the model parameters are unknown, so we must settle for an estimate, which in reality is an estimate of Δ = E(D), the average over all possible samples of size n. Our objective is to estimate D or Δ as accurately as possible.
As stated the problem is quite simple, at least under the ideal conditions where the linear model is correct and the error variance is constant, for then

D = n⁻¹ Σ_j var(Y₊j) + n⁻¹ Σ_j (x_jᵀβ − x_jᵀβ̂)²
  = σ² + n⁻¹ (β̂ − β)ᵀ XᵀX (β̂ − β),  (6.36)

whose expectation is

Δ = σ²(1 + qn⁻¹),  (6.37)

where X is the n × q matrix with rows x₁ᵀ, …, x_nᵀ, and q = p + 1 if there are p covariate terms and an intercept in the model. A natural estimate is then

Δ̂ = s²(1 + qn⁻¹).  (6.38)
However, this estimate is very specialized, in two ways. First, it assumes that the linear model is correct and that the error variance is constant, both unlikely to be exactly true in practice. Secondly, the estimate applies only to least squares prediction and the squared error measure of accuracy, whereas in practice we need to be able to deal with other measures of accuracy and other prediction rules, such as robust linear regression (Section 6.5) and linear classification, where y is binary (Section 7.2). There are no simple analogues of (6.38) to cover these situations, but resampling methods can be applied to all of them.
In order that our discussion apply as broadly as possible, we shall use general notation in which prediction error is measured by c(y₊, ŷ₊), typically an increasing function of |y₊ − ŷ₊|, and the prediction rule is ŷ₊ = μ(x₊, F̂), where the EDF F̂ represents the observed data. Usually μ(x₊, F̂) is an estimate of the mean response at x₊, a function of x₊ᵀβ̂ with β̂ an estimate of β, and the form of this prediction rule is closely tied to the form of c(y₊, ŷ₊). We suppose that the data are sampled from distribution F, from which the cases to be predicted are also sampled. This implies that we are considering x₊ values similar to data values x₁, …, x_n. Prediction accuracy is measured by the aggregate prediction error D(F̂, F), the expected cost of predicting a new case drawn from F using the rule fitted to data F̂, and by

Δ = Δ(F) = E{D(F̂, F)},  (6.40)

the average prediction accuracy over all possible datasets of size n sampled from F.
The most direct approach to estimation of Δ is to apply the bootstrap substitution principle, that is substituting the EDF F̂ for F in (6.40). However, there are other widely used resampling methods which also merit consideration, in part because they are easy to use, and in fact the best approach involves a combination of methods.
Apparent error
The simplest way to estimate D or Δ is to take the average prediction error when the prediction rule is applied to the same data that were used to fit it. This gives the apparent error, sometimes called the resubstitution error,

Δ̂_app = D(F̂, F̂) = n⁻¹ Σ_{j=1}^n c{y_j, μ(x_j, F̂)}.  (6.41)

This is not the same as the bootstrap estimate Δ(F̂), which we discuss later. It is intuitively clear that Δ̂_app will tend to underestimate Δ, because the latter refers to prediction of new responses. The underestimation can be easily checked for least squares prediction with squared error, when Δ̂_app = n⁻¹RSS, the average squared residual. If the model is correct with homoscedastic random errors, then Δ̂_app has expectation σ²(1 − qn⁻¹), whereas from (6.37) we know that Δ = σ²(1 + qn⁻¹).
The difference between the true error and apparent error is the excess error, D(F̂, F) − D(F̂, F̂), whose mean is the expected excess error,

e(F) = E{D(F̂, F) − D(F̂, F̂)}.  (6.42)
Cross-validation
The apparent error is downwardly biased because it averages errors of predictions for cases at zero distance from the data used to fit the prediction rule. Cross-validation estimates of aggregate error avoid this bias by separating the data used to form the prediction rule and the data used to assess the rule. The general paradigm is to split the dataset into a training set {(x_j, y_j) : j ∈ S_t} and a separate assessment set {(x_j, y_j) : j ∈ S_a}, represented by F̂_t and F̂_a, say. The linear regression predictor is fitted to F̂_t and used to predict the responses y_j for j ∈ S_a, giving the estimate

n_a⁻¹ Σ_{j∈S_a} c{y_j, μ(x_j, F̂_t)},  (6.43)

with n_a the size of S_a. There are several variations on this estimate, depending on the size of the training set, the manner of splitting the dataset, and the number of such splits.
The version of cross-validation that seems to come closest to actual use of our predictor is leave-one-out cross-validation. Here training sets of size n − 1 are taken, and all such sets are used, so we measure how well the prediction rule does when the value of each response is predicted from the rest of the data. If F̂_{−j} represents the n − 1 observations {(x_k, y_k), k ≠ j}, and if μ(x_j, F̂_{−j}) denotes the value predicted for y_j by the rule based on F̂_{−j}, then the cross-validation estimate of prediction error is

Δ̂_CV = n⁻¹ Σ_{j=1}^n c{y_j, μ(x_j, F̂_{−j})},  (6.44)

which is the average error when each observation is predicted from the rest of the sample.
In general (6.44) requires n fits of the model, but for least squares linear regression only one fit is required if we use the case-deletion result (Problem 6.2)

β̂ − β̂_{−j} = (XᵀX)⁻¹ x_j (y_j − x_jᵀβ̂) / (1 − h_j),

where as usual h_j is the leverage for the jth case. For squared error in particular we then have

Δ̂_CV = n⁻¹ Σ_{j=1}^n {e_j / (1 − h_j)}².  (6.45)
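The shortcut (6.45) needs only leverages and residuals from the single full fit; a one-function Python sketch (ours):

```python
import numpy as np

def loo_cv_mse(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    h = np.sum(X * (X @ np.linalg.inv(X.T @ X)), axis=1)   # leverages h_j
    e = y - X @ beta                                       # raw residuals
    return np.mean((e / (1 - h))**2)                       # cf. (6.45)
```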
From the nature of Δ̂_CV one would guess that this estimate has only a small bias, and this is so: assuming an expansion of the form Δ(F) = a₀ + a₁n⁻¹ + a₂n⁻² + ⋯, one can verify from (6.44) that E(Δ̂_CV) = a₀ + a₁(n − 1)⁻¹ + ⋯, which differs from Δ by terms of order n⁻², unlike the expectation of the apparent error, which differs by terms of order n⁻¹.
K-fold cross-validation
In general there is no reason that training sets should be of size n − 1. For certain methods of estimation the number n of fits required for Δ̂_CV could itself be a difficulty, although not for least squares, as we have seen in (6.45). There is also the possibility that the small perturbations in the fitted model when single observations are left out make Δ̂_CV too variable, if fitted values μ(x, F̂) do not depend smoothly on F̂ or if c(y₊, ŷ₊) is not continuous. These considerations suggest leaving out groups of m cases at a time, with training sets of size n_t = n − m.
In principle there are n!/{m!(n − m)!} possible splits, possibly an extremely large number, but it should be adequate to take R in the range 100 to 1000. It would be in the spirit of resampling to make the splits at random. However, consideration should be given to balancing the splits in some way; for example, it would seem desirable that each case should occur with equal frequency over the R assessment sets; see Section 9.2. Depending on the value of n_t = n − m and the number p of explanatory variables, one might also need some form of balance to ensure that the model can always be fitted.
There is an efficient version of group cross-validation that does involve just one prediction of each response. We begin by splitting the data into K disjoint sets of nearly equal size, with the corresponding sets of case subscripts denoted by C₁, …, C_K, say. These K sets define R = K different splits into training and assessment sets, with S_{a,k} = C_k the kth assessment set and the remainder of the data S_{t,k} = ⋃_{i≠k} C_i the kth training set. For each such split we apply (6.43), and then average these estimates. The result is the K-fold cross-validation estimate of prediction error

Δ̂_CV,K = n⁻¹ Σ_{k=1}^K Σ_{j∈C_k} c{y_j, μ(x_j, F̂_{−k})},

where F̂_{−k} denotes the data with the kth group omitted, for k = 1, …, K. Let p_k denote the proportion of the data falling in the kth group; if n/K = m is an integer, then all groups are of size m and p_k = 1/K. The adjusted cross-validation estimate of aggregate prediction error is

Δ̂_ACV,K = Δ̂_CV,K + Δ̂_app − Σ_{k=1}^K p_k D(F̂, F̂_{−k}),

where D(F̂, F̂_{−k}) = n⁻¹ Σ_{j=1}^n c{y_j, μ(x_j, F̂_{−k})}. This has smaller bias than Δ̂_CV,K and is almost as simple to calculate, because it requires no additional fits of the model. For a comparison between Δ̂_CV,K and Δ̂_ACV,K in a simple situation, see Problem 6.12.
The following algorithm summarizes the calculation of Δ̂_ACV,K when the split into groups is made at random.
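A sketch of the calculation in Python (ours, for least squares with squared error, and following the adjusted estimate displayed above; the helper names are illustrative):

```python
import numpy as np

def kfold_cv(X, y, K=10, rng=np.random.default_rng(1)):
    n = len(y)
    groups = np.array_split(rng.permutation(n), K)   # random groups C_1..C_K
    def fit_predict(train, Xnew):
        b = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
        return Xnew @ b
    app = np.mean((y - fit_predict(np.arange(n), X))**2)   # apparent error
    cv = adj = 0.0
    for g in groups:
        train = np.setdiff1d(np.arange(n), g)
        cv += np.sum((y[g] - fit_predict(train, X[g]))**2) / n
        adj += (len(g) / n) * np.mean((y - fit_predict(train, X))**2)
    return cv, cv + app - adj     # K-fold CV and adjusted version
```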
Bootstrap estimates
A direct application of the bootstrap principle to Δ(F) gives the estimate

Δ̂ = Δ(F̂) = E*{D(F̂*, F̂)},

where F̂* denotes a simulated sample (x₁*, y₁*), …, (x_n*, y_n*) taken from the data by case resampling. Usually simulation is required to approximate this estimate, as follows. For r = 1, …, R we randomly resample cases from the data to obtain the sample (x_r1*, y_r1*), …, (x_rn*, y_rn*), which we represent by F̂_r*, and to this sample we fit the prediction rule and calculate its predictions μ(x_j, F̂_r*) of the data responses y_j for j = 1, …, n. The aggregate prediction error estimate is then calculated as

R⁻¹ Σ_{r=1}^R n⁻¹ Σ_{j=1}^n c{y_j, μ(x_j, F̂_r*)}.  (6.49)
The substitution principle applied to the expected excess error (6.42) gives e(F̂) = E*{D(F̂*, F̂) − D(F̂*, F̂*)}, and its approximation from the simulations described in the previous paragraph defines the bootstrap estimate of expected excess error

ê_B = R⁻¹ Σ_{r=1}^R [ n⁻¹ Σ_{j=1}^n c{y_j, μ(x_j, F̂_r*)} − n⁻¹ Σ_{j=1}^n c{y_rj*, μ(x_rj*, F̂_r*)} ].

That is, for the rth bootstrap sample we construct the prediction rule μ(x, F̂_r*), then calculate the average difference between the prediction errors when this rule is applied first to the original data and secondly to the bootstrap sample itself, and finally average across bootstrap samples. We refer to the resulting estimate of aggregate prediction error, Δ̂_B = ê_B + Δ̂_app, as the bootstrap estimate of prediction error, given by

Δ̂_B = n⁻¹ Σ_{j=1}^n R⁻¹ Σ_{r=1}^R c{y_j, μ(x_j, F̂_r*)} + [ Δ̂_app − (nR)⁻¹ Σ_{r=1}^R Σ_{j=1}^n c{y_rj*, μ(x_rj*, F̂_r*)} ].  (6.51)

Note that the first term of (6.51), which is also the simple bootstrap estimate (6.49), is expressed as the average of the contributions R⁻¹ Σ_{r=1}^R c{y_j, μ(x_j, F̂_r*)} that each original observation makes to the estimate of aggregate prediction error. These contributions are of interest in their own right, most importantly in assessing how the performance of the prediction rule changes with values of the explanatory variables. This is illustrated in Example 6.10 below.
The first term of (6.51) can be split into two components, according to whether or not case j appears in the resample:

n⁻¹ Σ_{j=1}^n R⁻¹ Σ_{r: j out} c{y_j, μ(x_j, F̂_r*)}  (6.52)

and

n⁻¹ Σ_{j=1}^n R⁻¹ Σ_{r: j in} c{y_j, μ(x_j, F̂_r*)},  (6.53)

where R_{−j} is the number of the R bootstrap samples F̂_r* in which (x_j, y_j) does not appear. In (6.52) y_j is always predicted using data from which (x_j, y_j) is excluded, which is analogous to cross-validation, whereas (6.53) is similar to an apparent error calculation because y_j is always predicted using data that contain (x_j, y_j).
Now R_{−j}/R is approximately equal to the constant e⁻¹ = 0.368, so (6.52) is approximately proportional to

Δ̂_BCV = n⁻¹ Σ_{j=1}^n R_{−j}⁻¹ Σ_{r: j out} c{y_j, μ(x_j, F̂_r*)},  (6.54)
sometimes called the leave-one-out bootstrap estimate of prediction error. The notation refers to the fact that Δ̂_BCV can be viewed as a bootstrap smoothing of the cross-validation estimate Δ̂_CV. To see this, consider replacing the term c{y_j, μ(x_j, F̂_{−j})} in (6.44) by the expectation E*_{−j}[c{y_j, μ(x_j, F̂*)}], where E*_{−j} refers to the expectation over bootstrap samples F̂* of size n drawn from F̂_{−j}. The estimate (6.54) is a simulation approximation of this expectation, because of the result noted in Section 3.10.1 that the R_{−j} bootstrap samples in which case j does not appear are equivalent to random samples drawn from F̂_{−j}.
The smoothing in (6.54) may effect a considerable reduction in variance, compared to Δ̂_CV, especially if c(y₊, ŷ₊) is not continuous. But there will also be a tendency toward positive bias. This is because the typical bootstrap sample from which predictions are made in (6.54) includes only about (1 − e⁻¹)n = 0.632n distinct data values, and the bias of cross-validation estimates increases as the size of the training set decreases.
What we have so far is that the bootstrap estimate of aggregate prediction error essentially involves a weighted combination of Δ̂_BCV and an apparent error estimate. Such a combination should have good variance properties, but may suffer from bias. However, if we change the weights in the combination it may be possible to reduce or remove this bias. This suggests that we consider the hybrid estimate

Δ̂_w = w Δ̂_BCV + (1 − w) Δ̂_app;  (6.55)

in particular, the choice w = 0.632 gives the estimate Δ̂_0.632 used below.
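A sketch of Δ̂_BCV and the hybrid estimate with w = 0.632 in Python (ours, for least squares with squared error):

```python
import numpy as np

def boot_632(X, y, R=200, rng=np.random.default_rng(1)):
    n = len(y)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    app = np.mean((y - X @ beta)**2)                 # apparent error
    err = [[] for _ in range(n)]                     # out-of-sample errors per case
    for _ in range(R):
        idx = rng.integers(0, n, size=n)             # case resample
        b = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        for j in np.setdiff1d(np.arange(n), idx):    # cases left out
            err[j].append((y[j] - X[j] @ b)**2)
    bcv = np.mean([np.mean(e) for e in err if e])    # leave-one-out bootstrap (6.54)
    return 0.632 * bcv + 0.368 * app                 # hybrid estimate (6.55)
```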
Example 6.10 (Nuclear power stations) Consider predicting the cost of a new power station based on the data of Example 6.8. We base our prediction on the linear regression model described there, so we have μ(x_j, F̂) = x_jᵀβ̂, where β̂ is the least squares estimate for a model with six covariates. The estimated error variance is s² = 0.6337/25 = 0.0253 with 25 degrees of freedom. The downwardly biased apparent error estimate is Δ̂_app = 0.6337/32 = 0.020, whereas the idealized estimate (6.38) is 0.025 × (1 + 7/32) = 0.031. In this situation the prediction error for a particular station seems most useful, but before we turn to individual stations, we discuss the overall estimates, which are given in Table 6.9.
Those estimates show the pattern we would anticipate from the general discussion.
[Figure 6.11 Components of prediction error for nuclear power data based on 200 bootstrap simulations. The top panel shows the values of y_j − μ(x_j, F̂*), plotted against case number. The lower left panel shows the average error for each case, plotted against the residuals. The lower right panel shows the ratio of the model-based to the bootstrap prediction standard errors.]
The top panel of Figure 6.11 shows the values of y_j − μ(x_j, F̂_r*) for r = 1, …, R, plotted against case number j. The variability of the average error corresponds to the variation of individual observations about their predicted values, while the variance within each group reflects parameter estimation uncertainty. A striking feature is the small prediction error for the last six power plants, whose variances and means are both small. The lower left panel shows the average values of y_j − μ(x_j, F̂_r*) over the 200 simulations, plotted against the raw residuals. They agree closely, as we should expect with a well-fitting model. The lower right panel shows the ratio of the model-based prediction standard error to the bootstrap prediction standard error. It confirms that the model-based calculation described in Example 6.8 overestimates the predictive standard error for the last six plants, which have the partial turnkey guarantee. The estimated bootstrap prediction error for these plants is 0.003, while it is 0.032 for the rest. The last six cases fall into three groups determined by the values of the explanatory variables: in effect they are replicated.
It might be preferable to plot y_j − μ(x_j, F̂_r*) only for those bootstrap samples which exclude the jth case, and then mean prediction error would better be compared to jackknifed residuals y_j − x_jᵀβ̂_{−j}. For these data the plots are very similar to those we have shown. ■
Example 6.11 (Times on delivery suite) For a more systematic comparison of prediction error estimates in linear regression, we use data provided by E. Burns on the times taken by 1187 women to give birth at the John Radcliffe Hospital in Oxford. An appropriate linear model has response the log time spent on delivery suite and dummy explanatory variables indicating the type of labour, the use of electronic fetal monitoring, the use of an intravenous drip, the reported length of labour before arriving at the hospital and whether or not the labour is the woman's first; seven parameters are estimated in all.
We took 200 samples of size n = 50 at random from the full data. For each of these samples we fitted the model described above, and then calculated cross-validation estimates of prediction error Δ̂_CV,K with K = 50, 10, 5 and 2 groups, the corresponding adjusted cross-validation estimates Δ̂_ACV,K, the bootstrap estimate Δ̂_B, and the hybrid estimate Δ̂_0.632. We took R = 200 for the bootstrap calculations.
The results of this experiment are summarized in terms of estimates of the expected excess error in Table 6.10. The average apparent error and excess error were 15.7 × 10⁻² and 5.2 × 10⁻², the latter taken to be e(F) as defined in (6.42). The table shows averages and standard deviations of the differences between estimates Δ̂ and Δ̂_app. The cross-validation estimate with K = 50, the bootstrap and the 0.632 estimate have similar properties, while other choices of K give estimates that are more variable; the half-sample estimate
Δ̂_CV,2 is worst. Results for cross-validation with 10 and 5 groups are almost identical. ■
We now consider variable selection when the prediction rule is least squares and the measure of error is average squared error. It would be a simple matter to use other prediction rules and other measures of prediction accuracy.
First we define some notation. We denote an arbitrary candidate model by M, which is one of the 2^p possible linear models. Whenever M is used as a subscript, it refers to elements of that model. Thus the n × p_M design matrix X_M contains those p_M columns of the full design matrix X that correspond to covariates included in M; the jth row of X_M is x_{M,j}ᵀ; the least squares estimates for regression coefficients in M are β̂_M; and H_M is the hat matrix X_M(X_MᵀX_M)⁻¹X_Mᵀ that defines fitted values μ̂_M = H_M y under model M. The total number of regression coefficients in M is q_M = p_M + 1, assuming that an intercept term is always included.
Now consider prediction of single responses y₊ at each of the original design points x₁, …, x_n. The average squared prediction error using model M is

n⁻¹ Σ_{j=1}^n (y₊j − x_{M,j}ᵀβ̂_M)²,

and its expectation under model (6.22), conditional on the data, is the aggregate prediction error

D(M) = σ² + n⁻¹ Σ_{j=1}^n (μ_j − x_{M,j}ᵀβ̂_M)²,

where μᵀ = (μ₁, …, μ_n) is the vector of mean responses for the true multiple regression model. Taking expectation over the data distribution we obtain

Δ(M) = σ²(1 + q_M n⁻¹) + n⁻¹ μᵀ(I − H_M)μ,

where μᵀ(I − H_M)μ is zero only if model M is correct. The quantities D(M) and Δ(M) generalize D and Δ defined in (6.36) and (6.37).
In principle the best model would be the one that minimizes D(M), but since the model parameters are unknown we must settle for minimizing a good estimate of D(M) or Δ(M). Several resampling methods for estimating Δ were discussed in the previous subsection, so the natural approach would be to choose a good method and apply it to all possible models. However, accurate estimation of Δ(M) is not itself important: what is important is to accurately estimate the signs of differences among the Δ(M), so that we can identify which of the Δ(M)s is smallest.
Of the methods considered earlier, the apparent error estimate Δ̂_app(M) = n⁻¹RSS_M was poor. Its use here is immediately ruled out when we observe that it always decreases when covariates are added to a model, so minimization always leads to the full model.
Cross-validation
One good estimate, when used with squared error, is the leave-one-out cross-validation estimate. In the present notation this is

Δ̂_CV(M) = n⁻¹ Σ_{j=1}^n (y_j − ŷ_{M,j})² / (1 − h_{M,j})²,  (6.57)

where ŷ_{M,j} is the fitted value for model M based on all the data and h_{M,j} is the leverage for case j in model M. The bias of Δ̂_CV(M) is small, but that is not enough to make it a good basis for selecting M: minimization of the leave-one-out estimate does not select the true model consistently. Consistency can be achieved by leaving out groups of size m much larger than one. With R splits into training sets S_{t,r} and assessment sets S_{a,r} of size m, the cross-validation estimate is

Δ̂_CV(M) = R⁻¹ Σ_{r=1}^R m⁻¹ Σ_{j∈S_{a,r}} {y_j − ŷ_{M,j}(S_{t,r})}²,

where ŷ_{M,j}(S_{t,r}) = x_{M,j}ᵀβ̂_M(S_{t,r}) and β̂_M(S_{t,r}) are the least squares estimates for coefficients in M fitted to the rth training set. Note that the same R splits into training and assessment sets are used for all models. It can be shown that, provided m is chosen so that n − m → ∞ and m/n → 1 as n → ∞, minimization of Δ̂_CV(M) will give consistent selection of the true model as n → ∞ and R → ∞.
Bootstrap methods
Corresponding results can be obtained for bootstrap resampling methods. The bootstrap estimate of aggregate prediction error (6.51) becomes

Δ̂_B(M) = Δ̂_app(M) + ê_B(M),  (6.59)

where the second term on the right-hand side is an estimate of the expected excess error defined in (6.42). The resampling scheme can be either case resampling or error resampling, with x*_{M,jr} = x_{M,j} for the latter.
It turns out that minimization of Δ̂_B(M) behaves much like minimization of the leave-one-out cross-validation estimate, and does not lead to a consistent choice of true model as n → ∞. However, there is a modification of Δ̂_B(M), analogous to that made for the cross-validation procedure, which does produce a consistent model selection procedure. The modification is to make simulated datasets be of size n − m rather than n, such that m/n → 1 and n − m → ∞ as n → ∞. Also, we replace the estimate (6.59) by the simpler bootstrap estimate

Δ̂_B(M) = R⁻¹ Σ_{r=1}^R n⁻¹ Σ_{j=1}^n (y_j − x_{M,j}ᵀβ̂*_{M,r})²,  (6.60)
which is a generalization of (6.49). (The previous doubts about this simple estimate are less relevant for small n − m.) If case resampling is used, then n − m cases are randomly selected from the full set of n. If model-based resampling is used, the model being M with assumed homoscedasticity of errors, then X_M* is a random selection of n − m rows from X_M and the n − m errors ε_j* are randomly sampled from the n mean-corrected modified residuals r_{M,j} − r̄_M for model M.
Bearing in mind the general advice that the number of simulated datasets should be at least R = 100 for estimating second moments, we should use at least that many here. The same R bootstrap resamples are used for each model M, as with the cross-validation procedure.
One major practical difficulty that is shared by the consistent cross-validation and bootstrap procedures is that fitting all candidate models to small subsets of data is not always possible. What empirical evidence there is concerning good choices for m/n suggests that this ratio should be about ¾. If so, then in many applications some of the R subsets will have singular designs X_M* for big models, unless subsets are balanced by appropriate stratification on covariates in the resampling procedure.
Example 6.12 (Nuclear power stations) In Examples 6.8 and 6.10 our analyses focused on a linear regression model that includes six of the p = 10 covariates available. Three of these covariates (date, log(cap) and NE) are highly significant.
[Figure: horizontal axes show number of covariates.]
Selection errors, such as failure to include x₂ in the selected model, occurred quite frequently even when using training sets of size 20. This degradation of variable selection procedures when coefficients are smaller than two standard errors is reputed to be typical. ■
The theory used to justify the consistent cross-validation and bootstrap procedures may depend heavily on the assumptions that the dimension of the true model is small compared to the number of cases, and that the non-zero regression coefficients are all large relative to their standard errors. It is possible that leave-one-out cross-validation may work well in certain situations where model dimension is comparable to the number of cases. This would be important, in light of the very clear difficulties of using small training sets with typical applications, such as Example 6.12. Evidently further work, both theoretical and empirical, is necessary to find broadly applicable variable selection methods.
Isolated outliers can often be detected using the jackknife-after-bootstrap plots of Section 3.10.1 or similarly informative diagnostic plots, but such plots can fail to show the occurrence of multiple outliers.
For datasets with possibly multiple outliers, diagnosis is aided by initial use of a fitting method that is highly resistant to the effects of outliers. One preferred resistant method is least trimmed squares, which minimizes the sum of the m smallest squared deviations,

Σ_{j=1}^m e²_{(j)}(β),  (6.61)

where e²_{(1)}(β) ≤ ⋯ ≤ e²_{(n)}(β) are the ordered values of the squared deviations {y_j − x_jᵀβ}² and m is roughly n/2.
Example 6.14 (Survival proportions) The data in Table 6.11 and the left panel of Figure 6.14 are survival percentages for rats at a succession of doses of radiation, with two or three replicates at each dose. The theoretical relationship between survival rate and dose is exponential, so linear regression applies with y the logarithm of the survival percentage and x the dose. The right panel of Figure 6.14 plots these variables. There is a clear outlier, case 13, at x = 1410. The least squares estimate of slope is −59 × 10⁻⁴ using all the data, changing to −78 × 10⁻⁴ with standard error 5.4 × 10⁻⁴ when case 13 is omitted. The least trimmed squares estimate of slope is −69 × 10⁻⁴.
From the scatter plot it appears that heteroscedasticity may be present, so we resample cases. The effect of the outlier on the resample least squares estimates is illustrated in Figure 6.15, which plots R = 200 bootstrap least squares slopes β̂₁* against the corresponding values of Σ(x_j* − x̄*)², differentiated by the frequency with which case 13 appears in the resample. There are three distinct groups of bootstrapped slopes, with the lowest corresponding to resamples in which case 13 does not occur and the highest to samples where it occurs twice or more. A jackknife-after-bootstrap plot would clearly reveal the effect of case 13.
The resampling standard error of $\hat\beta_1^*$ is $15.3 \times 10^{-4}$, but only $7.6 \times 10^{-4}$ for samples without case 13. The corresponding resampling standard errors of the least trimmed squares slope are $20.5 \times 10^{-4}$ and $18.0 \times 10^{-4}$, showing both the resistance and inefficiency of the least trimmed squares method. ■

[Figure 6.15: bootstrap least squares slopes plotted against the sum of squares $\sum(x_j^* - \bar x^*)^2$, differentiated by the frequency of case 13 (appears zero, one or more times), for case resampling with R = 200 from the survival data.]
Robust methods

We suppose now that outliers have been isolated by diagnostic plots and set aside from further analysis. The problem now is whether or not that analysis should use least squares estimation: if there is evidence of a long-tailed error distribution, then we should downweight large deviations $y_j - x_j^T\beta$ by using a robust method. Two main options for this are now described.

One approach is to minimize not sums of squared deviations but sums of absolute values of deviations, $\sum|y_j - x_j^T\beta|$, so giving less weight to those cases with the largest errors. This is the $L_1$ method, which generalizes — and has efficiency comparable to — the sample median estimate of a population mean. There is no simple expression for the approximate variance of $L_1$ estimators.
More efficient is M-estimation, which is analogous to maximum likelihood estimation. Here the coefficient estimates $\hat\beta$ for a multiple linear regression solve the estimating equation

$$ \sum_{j=1}^{n} x_j\,\psi\!\left(\frac{y_j - x_j^T\beta}{s}\right) = 0, \qquad (6.62) $$

where $\psi(z)$ is a bounded replacement for z, and s is either the solution to a simultaneous estimating equation, or is fixed in advance. We choose the latter, taking s to be the median absolute deviation (divided by 0.6745) of residuals from the least trimmed squares regression fit. The solution to (6.62) is obtained by iterative weighted least squares, for which least trimmed squares estimates are good starting values.
With a careful choice of $\psi(\cdot)$, M-estimates should have smaller standard errors than least squares estimates for long-tailed distributions of random errors $\varepsilon$, yet have comparable standard errors should those errors be homoscedastic normal. One standard choice is $\psi(z) = z\min(1, c/|z|)$, Huber's winsorizing function, for which the coefficient estimates have approximate efficiency 95% relative to least squares estimates for homoscedastic normal errors when c = 1.345.
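In practice such fits are available in standard software. The following R sketch uses rlm() from the MASS package with Huber's $\psi$ and c = 1.345, applied to the salinity data of Example 6.16 below (an assumption about the data layout; note that rlm re-estimates a MAD-based scale at each iteration rather than fixing it exactly as described above).

library(boot); library(MASS)
sdat <- salinity[-16, ]                   # set case 16 aside
fit.m <- rlm(sal ~ lag + trend + dis, data = sdat,
             psi = psi.huber, k = 1.345,  # Huber's winsorizing function
             scale.est = "MAD",           # MAD-based scale estimate
             init = "lts")                # least trimmed squares start
summary(fit.m)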
For large sample sizes M-estimates $\hat\beta$ are approximately normal in distribution, with approximate variance

$$ \operatorname{var}(\hat\beta) = \sigma^2\,\frac{E\{\psi^2(\varepsilon/\sigma)\}}{[E\{\dot\psi(\varepsilon/\sigma)\}]^2}\,(X^TX)^{-1} \qquad (6.63) $$

under homoscedasticity, where $\dot\psi(u)$ is the derivative $d\psi(u)/du$. A more robust, empirical variance estimate is provided by the nonparametric delta method. First, the empirical influence values are, analogous to (6.25),

$$ l_j = \kappa\,n\,(X^TX)^{-1}x_j\,\psi(e_j/s), $$

so that

$$ v_L = n^{-2}\sum_{j=1}^{n} l_j l_j^T = \kappa^2 (X^TX)^{-1}X^T D X (X^TX)^{-1}, \qquad (6.64) $$

where $D = \operatorname{diag}\{\psi^2(e_1/s), \ldots, \psi^2(e_n/s)\}$; this generalizes (6.17).
Resampling

As with least squares estimation, so with robust estimates we have two simple choices for resampling: case resampling, or model-based resampling. Depending on which robust method is used, the resampling algorithm may need to be modified from the simple form that it takes for least squares estimation.

The $L_1$ estimates will behave like the sample median under either resampling scheme, so that the distribution of $\hat\beta^* - \hat\beta$ can be very discrete, and close to that of $\hat\beta - \beta$ only for very large samples. Use of the smooth bootstrap (Section 3.4) will improve accuracy. No simple studentization is possible for $L_1$ estimates.

For M-estimates case resampling should be satisfactory except for small datasets, especially those with unreplicated design points. The advantage of case resampling is simplicity. For model-based resampling, some modifications are required to the algorithm used to resample least squares estimation in Section 6.3. First, the leverage correction of raw residuals gives $r_j = e_j/(1 - h_j)^{1/2}$. Simulated errors are randomly sampled from the uncentred $r_1, \ldots, r_n$.
The scale estimate $s^*$ is obtained by the same method as s, but from the resample data. Studentization of $\hat\beta^* - \hat\beta$ is possible, using the resample analogue of the delta method variance (6.64) or more simply just using $s^*$.
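A minimal R sketch of this model-based scheme for the Huber M-estimate, continuing the salinity setup above; the leverage correction uses the hat values of the linear fit, and the scale is re-estimated within each resample by rlm itself.

library(boot); library(MASS)
sdat <- salinity[-16, ]
fit <- rlm(sal ~ lag + trend + dis, data = sdat, psi = psi.huber)
h <- hat(model.matrix(fit))                # leverages
r <- residuals(fit)/sqrt(1 - h)            # modified residuals, uncentred
beta.star <- replicate(99, {
  sdat$y.star <- fitted(fit) + sample(r, replace = TRUE)
  coef(rlm(y.star ~ lag + trend + dis, data = sdat, psi = psi.huber))
})
apply(beta.star, 1, sd)                    # resampling standard errors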
Example 6.16 (Salinity data) In our previous look at the salinity data in Example 6.15, we identified case 16 as a clear outlier. We now set that case aside and re-analyse the linear regression with all three covariates. One objective is to determine whether or not the trend variable should be included in the model: the initial, incorrect least squares analysis suggested not.

A normal Q-Q plot of the modified residuals from the new least squares fit suggests somewhat long tails for the error distribution, so that robust methods may be worthwhile. We fit the model by four methods: least squares, Huber M-estimate (with c = 1.345), $L_1$ and least trimmed squares. Coefficient estimates are fairly similar under all methods, except for trend, whose coefficients are −0.17, −0.22, −0.18 and −0.08.
For further analysis we apply case resampling with R = 99. Figure 6.17 illustrates the results for estimates of the coefficient of trend. The dotted lines on the top two panels correspond to the theoretical normal approximations: evidently the standard variance approximation — based on (6.63) — for the Huber estimate is too low. Note also the relatively large resampling variance for the least trimmed squares estimate, part of which may be due to unconverged estimates: two resampling outliers have been trimmed from this plot.

To assess the significance of trend we apply the studentized pivot method of Section 6.3.2 with both least squares and M-estimates, studentizing by the theoretical standard error in each case. The corresponding values of z are −1.25 and −1.80, with respectively 23 and 12 smaller values of $z^*$ out of 99. So there appears to be little evidence of the need to include trend.
If we checked diagnostic plots for any of the four regression fits, a question might be raised about whether or not case 5 should be included in the analysis. An alternative view of this is provided by jackknife-after-bootstrap plots (Section 3.10.1) of the four fits: such plots correspond to case-deletion resampling. As an illustration, Figure 6.18 shows the jackknife-after-bootstrap plot for the coefficient of trend in the M-estimation fit. This shows clearly that case 5 has an appreciable effect on the resampling distribution, and that its omission would give tighter confidence limits on the coefficient. It also raises
[Figure 6.18: jackknife-after-bootstrap plot for the coefficient of trend in the M-estimation fit; the horizontal axis shows standardized jackknife values.]
6.7 Problems
1 Show that for a multivariate distribution with mean vector $\mu$ and variance matrix $\Omega$, the influence functions for the sample mean and variance are respectively $y - \mu$ and $(y - \mu)(y - \mu)^T - \Omega$. Hence show that for the linear regression model derived as the conditional expectation $E(y \mid X = x)$ of a multivariate CDF F, the empirical influence function values for linear regression parameters are

$$ l(x_j, y_j) = n(X^TX)^{-1}x_j e_j. $$
2 For homogeneous data as in Chapter 2, the empirical influence values for an estimator can be approximated using case-deletion values. Use the matrix identity

$$ (X^TX - x_jx_j^T)^{-1} = (X^TX)^{-1} + \frac{(X^TX)^{-1}x_jx_j^T(X^TX)^{-1}}{1 - x_j^T(X^TX)^{-1}x_j} $$

to show that in the linear regression model with least squares fitting,

$$ \hat\beta - \hat\beta_{-j} = (X^TX)^{-1}x_j\,\frac{e_j}{1 - h_j}. $$

Compare this to the corresponding empirical influence value in Problem 6.1, and obtain the jackknife estimates of the bias and variance of $\hat\beta$.
(Sections 2.7.3, 6.2.2, 6.4)
3 For the linear regression model $y_j = x_j\beta + \varepsilon_j$, with no intercept, show that the least squares estimate of $\beta$ is $\hat\beta = \sum x_jy_j/\sum x_j^2$. Define residuals by $e_j = y_j - x_j\hat\beta$. If the resampling model is $y_j^* = x_j\hat\beta + \varepsilon_j^*$, with $\varepsilon^*$ randomly sampled from the $e_j$, show that the resample estimate $\hat\beta^*$ has mean and variance respectively $\hat\beta + \bar e\sum x_j/\sum x_j^2$ and $n^{-1}\sum(e_j - \bar e)^2/\sum x_j^2$.
4 The usual estimated variance of the least squares slope estimate $\hat\beta_1$ in simple linear regression can be written

$$ v = \frac{\sum(y_j - \bar y)^2 - \hat\beta_1^2\sum(x_j - \bar x)^2}{(n - 2)\sum(x_j - \bar x)^2}. $$

Hence show that in the permutation test for zero slope, the R values of $\hat\beta_1^*$ are in the same order as those of $\hat\beta_1^*/v^{*1/2}$, and that $\hat\beta_1^* \ge \hat\beta_1$ is equivalent to $\hat\beta_1^*/v^{*1/2} \ge \hat\beta_1/v^{1/2}$. This confirms that the P-value of the permutation test is unaffected by studentizing.
(Section 6.2.5)
with $u_j = x_j(y_j - x_j^T\hat\beta)$, $j = 1, \ldots, n$. Show that under this proposal $\hat\beta^*$ has mean $\hat\beta$ and variance equal to the robust variance estimate (6.26). Examine, theoretically or through numerical examples, to what extent the skewness of $\hat\beta^*$ matches the skewness of $\hat\beta$.
(Section 6.3.1; Hu and Zidek, 1995)
For the linear regression model $y = X\beta + \varepsilon$, the improved version of the robust estimate of variance for the least squares estimates $\hat\beta$ is $v_{\text{rob}}$, defined in terms of the jth modified residual $r_j$. If the errors have equal variances, then the usual variance estimate

$$ v = s^2(X^TX)^{-1} $$

would be appropriate and $v_{\text{rob}}$ could be quite inefficient. To quantify this, examine the case where the random errors $\varepsilon_j$ are independent $N(0, \sigma^2)$. Show first that

$$ E(r_j^2) = \sigma^2, \qquad j = 1, \ldots, n. $$

Hence show that the efficiency of the ith diagonal element of $v_{\text{rob}}$ relative to the ith diagonal element of v, as measured by the ratio of their variances, is

$$ \frac{b_{ii}^2}{(n - p)\,g_i^T Q\, g_i}, $$

where $b_{ii}$ is the ith diagonal element of $(X^TX)^{-1}$, $g_i^T = (d_{i1}^2, \ldots, d_{in}^2)$ with $D = (X^TX)^{-1}X^T$, $h_{jk}$ is the (j, k)th element of the hat matrix H with $h_{jj} = h_j$, and Q has elements $(1 - h_{jk})^2/\{(1 - h_j)(1 - h_k)\}$.
Calculate this relative efficiency for a numerical example.
(Sections 6.2.4, 6.2.6, 6.3.1; Hinkley and Wang, 1991)
The statistical function $\beta(F)$ for M-estimation is defined by the estimating equation

$$ \int x\,\psi\!\left\{\frac{y - x^T\beta(F)}{\sigma(F)}\right\}\,dF(x, y) = 0, $$

where $\sigma(F)$ is typically a robust scale parameter. Assume that the model contains an intercept, so that the covariate vector x includes the dummy variable 1. Use the technique of Problem 2.12 to show that the influence function for $\beta(F)$ is

$$ L_\beta(x, y) = \left\{\int x x^T\dot\psi(\varepsilon)\,dF(x, y)\right\}^{-1}\sigma\,x\,\psi(\varepsilon), $$

where $\dot\psi(u)$ is $d\psi(u)/du$, X is the usual covariate matrix and $\kappa = E\{\dot\psi(\varepsilon)\}$. Use the empirical version of this to verify the variance approximation (6.64).
(b) Consider constrained case resampling in which each of the m samples must be represented at least once. Show that the probability that there are r resample cases from the ith sample is
10 For the one-way model of Problem 6.9 with two observations per group, suppose that $\theta = \beta_2 - \beta_1$. Note that the least squares estimator of $\theta$ satisfies

$$ \hat\theta = \theta + \tfrac12(\varepsilon_3 + \varepsilon_4 - \varepsilon_1 - \varepsilon_2). $$

Suppose that we use model-based resampling with the assumption of error homoscedasticity. Show that the resample estimate can be expressed as

$$ \hat\theta^* = \hat\theta + \tfrac12(\varepsilon_3^* + \varepsilon_4^* - \varepsilon_1^* - \varepsilon_2^*), $$

where the $\varepsilon^*$ are randomly sampled from the 2m modified residuals $\pm\frac{1}{\sqrt2}(y_{2i} - y_{2i-1})$, $i = 1, \ldots, m$. Use this representation to calculate the first four resampling moments of $\hat\theta^* - \hat\theta$. Compare the results with the first four moments of $\hat\theta - \theta$, and comment.
(Section 6.3)
11 Suppose that a $2^{-r}$ fraction of a $2^8$ factorial experiment is run, where $1 \le r \le 4$. Under what circumstances would a bootstrap analysis based on case resampling be reliable?
(Section 6.3)
$$ E(\hat\Delta_{CV,K}) = \sigma^2\{1 + n^{-1} + n^{-1}(K - 1)^{-1}\}. $$

(c) Extend the calculations in (b) to show that the adjusted estimate can be written

$$ \hat\Delta_{ACV,K} = \hat\Delta_{CV,K} - K^{-1}(K - 1)^{-2}\sum_{k=1}^{K}(\bar y_k - \bar y)^2, $$

and use this to show that $E(\hat\Delta_{ACV,K}) \doteq \Delta$.
(Section 6.4; Burman, 1989)
$$ \hat\Delta_{BCV} = n^{-1}\sum_{j=1}^{n} E_*\{(y_j - x_j^T\hat\beta_{-j}^*)^2\}, $$

where $\hat\beta_{-j}^*$ is the least squares estimate of $\beta$ from a bootstrap sample with the jth case excluded and $E_*$ denotes expectation over such samples. To calculate the mean of $\hat\Delta_{BCV}$, use the substitution

$$ E(Y_j - x_j^T\hat\beta_{-j})^2 = \sigma^2\{1 + q(n - 1)^{-1}\}. $$

These results combine to show that $E(\hat\Delta_{BCV}) = \sigma^2(1 + 2qn^{-1}) + O(n^{-2})$, which leads to the choice $w = \frac23$ for the estimate $\hat\Delta_w = w\hat\Delta_{BCV} + (1 - w)\hat\Delta_{app}$.
(Section 6.4; Hall, 1995)
6.8 Practicals
1 Dataset catsM contains a set of data on the heart weights and body weights of 97 male cats. We investigate the dependence of heart weight (g) on body weight (kg). To see the data, fit a straight-line regression and do diagnostic plots:

catsM
plot(catsM$Bwt, catsM$Hwt, xlim=c(0,4), ylim=c(0,24))
cats.lm <- glm(Hwt ~ Bwt, data=catsM)
summary(cats.lm)
cats.diag <- glm.diag.plots(cats.lm, ret=T)
The summary suggests that the line passes through the origin, but we cannot rely on normal-theory results here, because the residuals seem skewed, and their variance possibly increases with the mean. Let us assess the stability of the fitted regression.
For case resampling:
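A plausible version of the case-resampling call that creates cats.boot1 (the statistic function name and R are our own choices):

cats.fit <- function(data, i) coef(glm(Hwt ~ Bwt, data = data[i, ]))
cats.boot1 <- boot(catsM, cats.fit, R = 499)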
plot(cats.boot1, jack=T)
plot(cats.boot1, index=2, jack=T)

to see a summary and plots for the bootstrapped intercepts and slopes. How normal do they seem? Is the model-based standard error from the original fit accurate? To what extent do the results depend on any single observation? We can calculate the estimated standard error by the nonparametric delta method by
df <- as.numeric(unlist(data.anova[1]))
res.dev <- as.numeric(unlist(data.anova[4]))
res.df <- as.numeric(unlist(data.anova[3]))
(dev[4]/df[4])/(res.dev[4]/res.df[4])

poison.fun(poisons)
anova(glm(time ~ poison*treat, data=poisons), test="F")

To apply resampling analysis, using as the null model that with main effects:
4 For an example of prediction, we consider using the nuclear power station data to predict the cost of new stations like cases 27-32, except that their value for date is 73. We choose to make the prediction using the model with all covariates. To fit that model, and to make the 'new' station:

+ct+bw+log(cum.n)+pt, data=d)
predict(d.glm, d.p) - (d.p$fit + d$res[i.p]) }
nuclear.boot.pred <- boot(nuke, nuke.pred, R=199, m=1, d.p=nuke.p)
5 Consider predicting the log brain weight of a mammal from its log body weight, using squared error cost. The data are in dataframe mammals. For an initial model, apparent error and ordinary cross-validation estimates of aggregate prediction error:
6 The data of Examples 6.15 and 6.16 are in dataframe salinity. For the linear regression model with all three covariates, consider the effect of discharge dis and the influence of case 16 on estimating this. Resample the least squares, $L_1$ and least trimmed squares estimates, and then look at the jackknife-after-bootstrap plots:

What conclusions do you draw from these plots about (a) the shapes of the distributions of the estimates, (b) comparisons between the estimation methods, and (c) the effects of case 16?

One possible explanation for case 16 being an outlier with respect to the multiple linear regression model used previously is that a quadratic effect in discharge should be added to the model. We can test for this using the pivot method with least squares estimates and case resampling:
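A sketch of what the code might look like (function and object names are ours):

sal.fun <- function(data, i) {
  fit <- summary(lm(sal ~ lag + trend + dis + I(dis^2), data = data[i, ]))
  fit$coefficients["I(dis^2)", 1:2]   # estimate and its standard error
}
sal.boot <- boot(salinity, sal.fun, R = 999)
z0 <- sal.boot$t0[1]/sal.boot$t0[2]   # observed studentized statistic
z.star <- (sal.boot$t[, 1] - sal.boot$t0[1])/sal.boot$t[, 2]
(1 + sum(z.star >= z0))/(999 + 1)     # bootstrap P-value for quadratic term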
7.1 Introduction
In Chapter 6 we showed how the basic bootstrap methods of earlier chapters extend to linear regression. The broad aim of this chapter is to extend the discussion further, to various forms of nonlinear regression models — especially generalized linear models and survival models — and to nonparametric regression, where the form of the mean response is not fully specified.

A particular feature of linear regression is the possibility of error-based resampling, when responses are expressible as means plus homoscedastic errors. This is particularly useful when our objective is prediction. For generalized linear models, especially for discrete data, responses cannot be described in terms of additive errors. Section 7.2 describes ways of generalizing error-based resampling for such models. The corresponding development for survival data is given in Section 7.3. Section 7.4 looks briefly at nonlinear regression with additive error, mainly to illustrate the useful contribution that resampling methods can make to analysis of such models. There is often a need to estimate the potential accuracy of predictions based on regression models, and Section 6.4 contained a general discussion of resampling methods for this. In Section 7.5 we focus on one type of application, the estimation of misclassification rates when a binary response y corresponds to a classification.

Not all relationships between a response y and covariates x can be readily modelled in terms of a parametric mean function of known form. At least for exploratory purposes it is useful to have flexible nonparametric curve-fitting methods, and there is now a wide variety of these. In Section 7.6 we examine briefly how resampling can be used in conjunction with some of these nonparametric regression methods.
$$ \operatorname{var}(Y) = \phi\,V(\mu), $$

where $V(\cdot)$ is the known variance function and $\phi$ is the dispersion parameter, which may be unknown. This includes the important cases of binomial, Poisson, and gamma distributions in addition to the normal distribution. Secondly, the linear mean structure is generalized to

$$ g(\mu) = \eta = x^T\beta $$

for a known link function $g(\cdot)$.
Example 7.1 (Leukaemia data) Table 7.1 contains data on the survival times in weeks of two groups of acute leukaemia victims, as a function of their white blood cell counts.

A simple model is that within each group survival time Y is exponential with mean $\mu = \exp(\beta_0 + \beta_1 x)$, where $x = \log_{10}(\text{white blood cell count})$. Thus the link function is logarithmic. The intercept is different for each group, but the slope is assumed common, so the full model for the jth response in group i is

$$ E(Y_{ij}) = \mu_{ij}, \qquad \log(\mu_{ij}) = \beta_{0i} + \beta_1 x_{ij}, \qquad \operatorname{var}(Y_{ij}) = V(\mu_{ij}) = \mu_{ij}^2. $$

The fitted means $\hat\mu$ and the data are shown in the left panel of Figure 7.1. The mean survival times for group 2 are shorter than those for group 1 at the same white blood cell count.
Under this model the ratios $Y/\mu$ are exponentially distributed with unit mean, and hence the Q-Q plot of $y_{ij}/\hat\mu_{ij}$ against exponential quantiles in the right panel of Figure 7.1 would ideally be a straight line. Systematic curvature might indicate that we should use a gamma density with index $\nu$,

$$ f(y \mid \mu, \nu) = \frac{y^{\nu-1}\nu^{\nu}}{\mu^{\nu}\Gamma(\nu)}\exp\!\left(-\frac{\nu y}{\mu}\right), \qquad y > 0, \quad \mu, \nu > 0. $$

In this case $\operatorname{var}(Y) = \mu^2/\nu$, so the dispersion parameter is taken to be $\kappa = 1/\nu$ and $c_j = 1$. In fact the exponential model seems to fit adequately. ■
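A fit of this kind is easily reproduced; the sketch below uses the leuk data from the MASS package (white blood cell count wbc, group factor ag, and survival time) as a stand-in for the data of Table 7.1.

library(MASS)
leuk.glm <- glm(time ~ ag + log10(wbc),
                family = Gamma(link = "log"), data = leuk)
summary(leuk.glm)                 # dispersion estimated as in (7.6)
summary(leuk.glm, dispersion = 1) # exponential model: dispersion fixed at 1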
where $\dot g(\mu) = d\eta/d\mu$ is the derivative of the link function. Because the dispersion parameters are taken to have the form $\kappa c_j$, the estimate $\hat\beta$ does not depend on $\kappa$. Note that although the estimates are derived as maximum likelihood estimates, their values depend only upon the regression relationship as expressed by the assumed variance function and the link function and choice of covariates.

The usual method for solving (7.2) is iterative weighted least squares, in which at each iteration the adjusted responses $z_j = \eta_j + (y_j - \mu_j)\dot g(\mu_j)$ are regressed on the $x_j$ with weights $w_j$ given by

$$ w_j^{-1} = c_j\,V(\mu_j)\,\dot g^2(\mu_j); \qquad (7.3) $$

all these quantities are evaluated at the current values of the estimates. The weighted least squares equation (6.27) applies at each iteration, with y replaced by the adjusted dependent variable z. The approximate variance matrix for $\hat\beta$ is

$$ \operatorname{var}(\hat\beta) = \kappa(X^TWX)^{-1}, \qquad (7.4) $$

and $\kappa$, if unknown, is estimated by

$$ \hat\kappa = \frac{1}{n - p}\sum_{j=1}^{n}\frac{(y_j - \hat\mu_j)^2}{c_j\,V(\hat\mu_j)}. \qquad (7.6) $$
Significance tests

Individual coefficients $\beta_r$ can be tested using studentized estimates, with standard errors estimated using (7.4), with $\kappa$ replaced by the estimate $\hat\kappa$ if necessary. The null distributions of these studentized estimates will be approximately standard normal, but the accuracy of this approximation can be open to question. Allowance for estimation of $\kappa$ can be made by using the t distribution with
$$ Q = (D_0 - D)/\kappa. \qquad (7.8) $$
Residuals

Residuals and other regression diagnostics for linear models may be extended to generalized linear models. The general form of residuals will be a suitably standardized version of $d(y, \hat\mu)$, where $d(Y, \mu)$ matches some notion of random error.

The simplest way to define residuals is to mimic the earlier definitions for linear models, and to take the set of standardized differences, the Pearson residuals, $(y_j - \hat\mu_j)/\{c_j\hat\kappa V(\hat\mu_j)\}^{1/2}$. Leverage adjustment of these to compensate for estimation of $\beta$ involves $h_j$, the jth diagonal element of the hat matrix H in (7.5), and yields standardized Pearson residuals $r_{Pj}$. The corresponding standardized residuals on the linear predictor scale are

$$ r_{Lj} = \frac{g(y_j) - \hat\eta_j}{\{c_j\hat\kappa\,\dot g^2(\hat\mu_j)V(\hat\mu_j)(1 - h_j)\}^{1/2}}. \qquad (7.10) $$

For discrete data this definition must be altered if $g(y_j)$ is infinite, as for
where $d_j = d(y_j, \hat\mu_j)$ is the signed square root of the scaled deviance contribution due to the jth case, the sign being that of $y_j - \hat\mu_j$. The deviance residual is $d_j$. Definition (7.7) implies that the standardized deviance residuals

$$ r_{Dj} = \frac{d_j}{(1 - h_j)^{1/2}} \qquad (7.11) $$

are more commonly used than the unadjusted $d_j$.

For the linear regression model of Section 6.3, $r_{Dj}$ is proportional to the modified residual (6.9). For other models the $r_{Dj}$ can be seriously biased, but once bias-corrected they are typically closer to standard normal than are the $r_{Pj}$ or $r_{Lj}$.

One general point to note about all of these residuals is that they are scaled, implicitly or explicitly, unlike the modified residuals of Chapter 6.
Quasilikelihood estimation

As we have noted before, only the link and variance functions must be specified in order to find estimates $\hat\beta$ and approximate standard errors. So although (7.2) and (7.6) arise from a parametric model, they are more generally applicable — just as least squares results are applicable beyond the normal-theory linear model. When no response distribution is assumed, the estimates $\hat\beta$ are referred to as quasilikelihood estimates, and there is an associated theory for such estimates, although this is not of concern here. The most common application is to data with a response in the form of counts or proportions, which are often found to be overdispersed relative to the Poisson or binomial distributions. One approach to modelling such data is to use the variance function appropriate to binomial or Poisson data, but to allow the dispersion parameter $\kappa$ to be a free parameter, estimated by (7.6). This estimate is then used in calculating standard errors for $\hat\beta$ and residuals, as indicated above.
Resampling errors

The simplest approach mimics the linear model sampling scheme but allows for the different response variances, just as in Section 6.2.6. So we define simulated responses by

$$ y_j^* = \hat\mu_j + \{c_j\hat\kappa\,V(\hat\mu_j)\}^{1/2}\varepsilon_j^*, \qquad j = 1, \ldots, n, \qquad (7.12) $$

with $\varepsilon_1^*, \ldots, \varepsilon_n^*$ a bootstrap sample from the standardized Pearson residuals. A second approach works on the linear predictor scale, setting

$$ y_j^* = g^{-1}\!\left[\hat\eta_j + \{c_j\hat\kappa\,\dot g^2(\hat\mu_j)V(\hat\mu_j)\}^{1/2}\varepsilon_j^*\right], \qquad j = 1, \ldots, n, \qquad (7.13) $$

where $g^{-1}(\cdot)$ is the inverse link function and $\varepsilon_1^*, \ldots, \varepsilon_n^*$ is a bootstrap sample from the residuals $r_{L1}, \ldots, r_{Ln}$ defined at (7.10). Here the residuals should not be mean-adjusted unless $g(\cdot)$ is the identity link, in which case $r_{Lj} = r_{Pj}$ and the two schemes (7.12) and (7.13) are the same. (In these first two resampling schemes the scale factor $\kappa^{-1/2}$ can be omitted, provided it is omitted both from the residual definition and from the definition of $y_j^*$.)

A third approach is to use the deviance residuals as surrogate errors. If the deviance residual $d_j$ is written as $d(y_j, \hat\mu_j)$, then imagine that corresponding random errors $\varepsilon_j$ are defined by $\varepsilon_j = d(y_j, \mu_j)$. The distribution of these $\varepsilon_j$
This also gives the method of Section 6.2.3 for linear models, except for the mean adjustment of residuals.

None of these three methods is perfect. One obvious drawback is that they can all give negative or non-integer values of $y^*$ when the original data are non-negative integer counts. A simple fix for discrete responses is to round the value of $y_j^*$ from (7.12), (7.13), or (7.14) to the nearest appropriate value. For count data this is a non-negative integer, and if the response is a proportion with denominator m, it is a number in the set $0, 1/m, 2/m, \ldots, 1$. However, rounding can appreciably increase the proportion of extreme values of $y^*$ for a case whose fitted value is near the end of its range.

A similar difficulty can occur when responses are positive with $V(\mu) = \mu^2$, as in Example 7.1. The Pearson residuals are $\kappa^{-1/2}(y_j - \hat\mu_j)/\hat\mu_j$, all necessarily greater than $-\kappa^{-1/2}$. But the standardized versions $r_{Pj}$ are not so constrained, so that the result $y_j^* = \hat\mu_j(1 + \hat\kappa^{1/2}\varepsilon_j^*)$ from applying (7.12) can be negative. The obvious fix is to truncate $y_j^*$ at zero, but this may distort the distribution of $y^*$, and so is not generally recommended.
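The following R sketch shows scheme (7.12) with the rounding fix on simulated Poisson data; all names are our own, and the leverage adjustment of the residuals is omitted for brevity.

set.seed(1)
x <- 1:20
y <- rpois(20, exp(1 + 0.1 * x))
fit0 <- glm(y ~ x, family = poisson)
mu <- fitted(fit0)
rP <- (y - mu)/sqrt(mu)            # Pearson residuals: kappa = 1, V(mu) = mu
y.star <- mu + sqrt(mu) * sample(rP, replace = TRUE)   # scheme (7.12)
y.star <- pmax(0, round(y.star))   # round to non-negative integers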
Example 7.2 (Leukaemia data) For the data introduced in Example 7.1 the parametric model is gamma, with log likelihood contributions implied by the density given there, and the regression is additive on the logarithmic scale, $\log(\mu_{ij}) = \beta_{0i} + \beta_1 x_{ij}$. The deviance for the fitted model is D = 40.32 with 30 degrees of freedom, and equation (7.6) gives $\hat\kappa$ = 1.09. The deviance residuals are calculated with $\kappa$ set equal to $\hat\kappa$.

The ratios $y_{ij}/\hat\mu_{ij}$ would be approximately a sample from the standard exponential distribution if in fact $\kappa$ = 1, and the right-hand panel of Figure 7.1 suggests that this is a reasonable assumption.
Our basic parametric model for these data sets $\kappa$ = 1 and puts $Y = \mu\varepsilon$, where $\varepsilon$ has an exponential distribution with unit mean. Hence the parametric bootstrap involves simulating exponential data from the fitted model, that is setting $y^* = \hat\mu\varepsilon^*$, where $\varepsilon^*$ is standard exponential. A slightly more cautious
Table 7.3. Empirical coverages (%) for four parameters based on applying various resampling schemes with R = 199 to 1000 samples of size 15 generated from various models. Target coverage is 90%. The first two sets of results are for an exponential model fitted to exponential and lognormal data, and the second two are for a Poisson model fitted to Poisson and negative binomial data. See text for details.

                  Cases               r_L or r_P            r_D
             β0  β1  ψ1  ψ2      β0  β1  ψ1  ψ2      β0  β1  ψ1  ψ2
Standard     85  86  89  85      85  86  89  86      85  86  90  86
Normal       88  89  92  90      88  89  90  89      87  89  90  89
Percentile   85  87  83  89      86  89  86  89      86  88  86  89
BCa          84  86  82  86      86  88  83  88      86  88  83  88
Basic        86  88  87  84      86  89  86  83      85  89  87  83
Student      89  89  86  81      92  92  89  84      92  92  89  84

Standard     79  79  82  81      79  78  82  82      79  78  82  82
Normal       81  81  84  85      81  80  84  84      82  80  84  84
Percentile   80  84  73  85      80  82  77  83      80  81  76  82
BCa          78  83  72  81      80  80  74  79      79  81  74  80
Basic        78  78  82  78      81  80  83  80      80  81  84  80
Student      84  85  82  74      90  88  84  79      90  88  84  79

Standard     90  90  91  90      89  90  92  90      89  91  92  91
Normal       88  88  88  88      87  86  88  88      87  93  97  93
Percentile   87  87  85  86      89  88  88  88      90  94  97  91
BCa          86  86  82  86      88  87  85  87      88  94  96  91
Basic        87  87  85  87      87  87  88  88      86  92  97  92
Student      95  90  80  92      90  89  89  89      90  93  92  91

Standard     69  64  59  70      69  63  59  69      67  64  60  71
Normal       87  84  86  90      88  84  84  89      87  89  92  94
Percentile   85  86  84  86      90  86  82  88      90  91  93  91
BCa          85  85  80  85      88  83  77  86      87  89  88  89
Basic        86  84  83  85      88  84  83  87      87  89  91  91
Student      93  87  82  87      89  89  85  85      89  93  90  85
The third experiment used the same design matrix as the first two, but linear predictor $\eta = \beta_0 + \beta_1 x$, with $\beta_0 = \beta_1 = 2$ and Poisson responses with mean $\mu = \exp(\eta)$. The fourth experiment used the same means as the third, but had negative binomial responses with variance function $\mu + \mu^2/10$. The bootstrap schemes for these two experiments were case resampling and model-based resampling using (7.12) and (7.14).

Table 7.3 shows that while all the methods tend to undercover, the standard method can be disastrously bad when the random part of the fitted model is incorrect, as in the second and fourth experiments. The studentized method generally does better than the basic method, but the BCa method does not improve on the percentile intervals. Thus here a more sophisticated method does not necessarily lead to better coverage, unlike in Section 5.7, and in particular there seems to be no reason to use the BCa method. Use of the studentized interval on another scale might improve its performance for the ratio $\psi_2$, for which the simpler methods seem best. As far as the resampling schemes are concerned, there seems to be little to choose between the model-based schemes, which improve slightly on bootstrapping cases, even when the fitted variance function is incorrect.
We now consider an important caveat to these general comments.
Inhomogeneous residuals

For some types of data the standardized Pearson residuals may be very inhomogeneous. If y is Poisson with mean $\mu$, for example, the distribution of $(y - \mu)/\mu^{1/2}$ is strongly positively skewed when $\mu < 1$, but it becomes increasingly symmetric as $\mu$ increases. Thus when a set of data contains both large and small counts, it is unwise to treat the $r_P$ as exchangeable. One possibility for such data is to apply (7.12) but with fitted values stratified by the estimated skewness of their residuals.
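Continuing the earlier Poisson sketch, stratification by the estimated skewness $\mu^{-1/2}$ might look like this; the cutpoints are illustrative only.

strata <- cut(mu^(-1/2), c(0, 0.6, 1, Inf))  # estimated skewness strata
y.star <- y
for (s in levels(strata)) {
  j <- which(strata == s)
  e <- rP[j][sample.int(length(j), replace = TRUE)]
  y.star[j] <- mu[j] + sqrt(mu[j]) * e       # resample within stratum
}
y.star <- pmax(0, round(y.star))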
Interest focuses on the varieties with small values of $\hat\beta_j$, which are likely to be the most resistant to the disease.

For an adequate fit, the deviance would be roughly distributed according to a chi-squared distribution on its residual degrees of freedom; in fact it is 1142.8. This indicates severe overdispersion relative to the model.
The left panel of Figure 7.3 shows estimated variety effects for block 1. Varieties 1 and 3 are least resistant to the disease, while variety 31 is most resistant. The right panel shows the residuals plotted against linear predictors. The skewness of the $r_P$ drops as $\hat\eta$ increases.

Parametric simulation involves generating binomial observations from the fitted model. This greatly overstates the precision of conclusions, because this model clearly does not reflect the variability of the data. We could instead use the beta-binomial distribution. Suppose that, conditional on $\pi$, a response is binomial with denominator m and probability $\pi$, but instead of being fixed, $\pi$ is taken to have a beta distribution. The resulting response has unconditional mean and variance depending on $\bar\pi = E(\pi)$ and on a parameter $\phi > 0$ that controls the degree of overdispersion. Parametric simulation from this model is discussed in Problem 7.5.

Two variance functions for overdispersed binomial data are $V_1(\pi) = \phi\pi(1 - \pi)$, with $\phi > 1$, and $V_2(\pi) = \pi(1 - \pi)\{1 + (m - 1)\phi\}$, with $\phi > 0$. The first of these gives common overdispersion for all the observations, while the second allows proportionately greater spread when m is larger. We use the first, for which $\hat\phi$ = 8.3, and perform nonparametric simulation using (7.12). The simulated responses are rounded to the nearest integer in $0, 1, \ldots, m$.
The left panel of Figure 7.4 shows box plots of the ratio of deviance to degrees of freedom for 200 simulations from the binomial model, the beta-binomial model, for nonparametric simulation by (7.12), and for (7.12) but with residuals stratified into groups for the fifteen varieties with the smallest values of $\hat\mu_j$, the middle fifteen values of $\hat\mu_j$, and the fifteen largest values of $\hat\mu_j$. The dotted line shows the observed ratio. The binomial results are clearly quite inappropriate, those for the beta-binomial and unstratified simulation are better, and those for the stratified simulation are best.
To explain this, we return to the right panel of Figure 7.3. This shows that the residuals are not homogeneous: residuals for observations with small values of $\hat\eta$ are more positively skewed than those for larger values. This reflects the varying skewness of binomial data, which must be taken into account in the resampling scheme.

The right panel of Figure 7.4 shows the estimated variety effects for the 200 simulations from the stratified simulation. Varieties 1 and 3 are much less resistant than the others, but variety 31 is not much more resistant than 11, 18, and 23; other varieties are close behind. As might be expected, results for the binomial simulation are much less variable. The unstratified resampling scheme gives large negative estimated variety effects, due to inappropriately large negative residuals being used. To explain this, consider the right panel of Figure 7.3. In effect the unstratified scheme allows residuals from the right half of the panel to be sampled and placed at its left-hand end, leading to negative simulated responses that are rounded up to zero: the varieties for which this happens seem spuriously resistant.

Finer stratification of the residuals seems unnecessary for this application. ■
7.2.4 Prediction

In Section 6.3.3 we showed how to use resampling methods to obtain prediction intervals based on a linear regression fit. The same idea can be applied here, with a predicted response of the form

$$ \hat\mu_+ = g^{-1}(x_+^T\hat\beta). $$

In principle any of the resampling methods in Section 7.2.3 could be used. In practice the homoscedasticity is important, and should be checked.

where the summation is over the cells of row j for which $y_{jk}$ was unobserved; this is step 2. Note that $y_{+j}^*$ is equivalent to the results of steps 3(a) and 3(b) with M = 1.
We take $\delta(y, \mu) = (y - \mu)/\mu^{1/2}$, corresponding to Pearson residuals for the Poisson distribution. This means that step 3(c) involves setting

$$ d_{+j}^* = \frac{y_{+j}^* - \hat\mu_{+j}^*}{\hat\mu_{+j}^{*1/2}}. $$

We repeat this R times, to obtain values $d_{+j(1)}^* \le \cdots \le d_{+j(R)}^*$ for each j. The final step is to obtain the bootstrap upper and lower limits $\hat y_{+j,\alpha}$, $\hat y_{+j,1-\alpha}$ for $y_{+j}$, by solving the equations

$$ \frac{y_{+j,\alpha} - \hat\mu_{+j}}{\hat\mu_{+j}^{1/2}} = d_{+j((R+1)\alpha)}^*, \qquad \frac{y_{+j,1-\alpha} - \hat\mu_{+j}}{\hat\mu_{+j}^{1/2}} = d_{+j((R+1)(1-\alpha))}^*. $$
[Figure 7.5: results from the fit of a Poisson two-way layout to the AIDS data. The left panel shows predicted diagnoses (solid), together with the actual totals to the end of 1992 (+), over 1984-1992. The right panel shows standardized Pearson residuals plotted against estimated skewness, $\hat\mu^{-1/2}$; the vertical lines are at skewness 0.6 and 1.]
are much less dispersed than the original data, for which the ratio is 716.5/413. The negative binomial simulation gives more appropriate results, which seem rather similar to those for nonparametric simulation without stratification. When stratification is used, the results mimic the overdispersion much better.

The pointwise 95% prediction intervals for the numbers of AIDS diagnoses are shown in the right panel of Figure 7.6. The intervals for simulation from the fitted Poisson model are considerably narrower than the intervals from resampling residuals, both of which are similar. The intervals for the last quarters of 1990, 1991, and 1992 are given in Table 7.5.

There is little change if intervals are based on the deviance residual formula for the Poisson distribution, $\delta(y, \mu) = \pm[2\{y\log(y/\mu) + \mu - y\}]^{1/2}$.

A serious drawback with this analysis is that predictions from the two-way layout model are very sensitive to the last few rows of the table, to the extent that the estimate for the last row is determined entirely by the bottom left cell. Some sort of temporal smoothing is preferable, and we reconsider these data in Example 7.12. ■
$$ \ell_{\text{prof}}(\theta) = \max_{\beta}\,\ell(\theta, \beta); $$

in the figure we renormalize the log likelihood to have maximum zero. Under the standard large-sample likelihood asymptotics outlined in Section 5.2.1, the approximate distribution of the likelihood ratio statistic

$$ W(\theta) = 2\{\ell_{\text{prof}}(\hat\theta) - \ell_{\text{prof}}(\theta)\} $$

is chi-squared with one degree of freedom.

The simulated values of W have mean 1.12, and their 0.95 quantile is $w_{(950)}^*$ = 4.09, to be compared with $c_{1,0.95}$ = 3.84. This gives as bootstrap calibrated 95% confidence interval the set of $\theta$ such that $\ell_{\text{prof}}(\theta) \ge \ell_{\text{prof}}(\hat\theta) - \frac12 \times 4.09$, that is [19.62, 36.12], which is slightly wider than the standard interval.
Table 7.7 compares the bias estimates and standard errors for the model parameters using the parametric bootstrap described above and standard first-order likelihood theory, under which the estimated biases are zero, and the variance estimates are obtained as the diagonal elements of the inverse observed information matrix $(-\ddot\ell)^{-1}$ evaluated at the MLEs, where $\ddot\ell$ is the matrix of second derivatives of $\ell$ with respect to $\theta$ and $\beta$. The estimated biases are small but significantly different from zero. The largest differences between the standard theory and the bootstrap results are for $\hat\beta_0$ and $\hat\beta_1$, for which the biases are of order 2-3%. The threshold parameter $x_0$ is well determined; the standard 95% confidence interval based on its asymptotic normal distribution is [4.701, 4.815], whereas the normal interval with estimated bias and variance is [4.703, 4.820].
A model-based nonparametric bootstrap may be performed by using residuals of the form $e = (y/\hat\mu)^{\hat\kappa}$, three of which are censored, then resampling errors $\varepsilon^*$ from their product-limit estimate, and then making uncensored bootstrap observations $\hat\mu\,\varepsilon^{*1/\hat\kappa}$. The observations with x = 5 are then modified as outlined above, and the model refitted to the resulting data. The product-limit estimate for the residuals is very close to the survivor function of the standard exponential distribution, so we expect this to give results similar to the parametric simulation, and this is what we see in Table 7.7.

For censoring at a pre-determined time c, the simulation algorithms would work as described above, except that values of $y^*$ greater than c would be replaced by c and the corresponding censoring indicators $d^*$ set equal to zero. The number of censored observations in each simulated dataset would then be random; see Practical 7.3.
Plots show that the simulated MLEs are close to normally distributed: in this case standard likelihood theory works well enough to give good confidence intervals for the parameters. The benefit of parametric simulation is that the bootstrap estimates give empirical evidence that the standard theory can
a non-decreasing function that jumps at $y_j$ by an amount $\hat h(y_j)$ given by (7.18). The corresponding estimate of the baseline survivor function is

$$ 1 - \hat F^0(y) = \prod_{j:\,y_j \le y}\{1 - \hat h(y_j)\}, \qquad (7.19) $$

which generalizes the product-limit estimate (3.9), although other estimators also exist. Whichever of them is used, the proportional hazards assumption implies that

$$ \{1 - \hat F^0(y)\}^{\exp(x_j^T\hat\beta)} $$

will be the estimated survivor function for an individual with covariate values $x_j$.
Under the random censorship model, the survivor function of the censoring distribution G is given by (3.11).

The bootstrap methods for censored data outlined in Section 3.5 extend straightforwardly to this setting. For example, if the censoring distribution is independent of the covariates, we generate a single sample under the conditional sampling plan according to the following algorithm.

1 Generate $Y_j^{0*}$ from the estimated failure time survivor function $\{1 - \hat F^0(y)\}^{\exp(x_j^T\hat\beta)}$.
2 If $d_j = 0$, set $C_j^* = y_j$, and if $d_j = 1$, generate $C_j^*$ from the conditional censoring distribution given that $C_j^* > y_j$, namely $\{\hat G(y) - \hat G(y_j)\}/\{1 - \hat G(y_j)\}$.
3 Set $Y_j^* = \min(Y_j^{0*}, C_j^*)$, with $D_j^* = 1$ if $Y_j^* = Y_j^{0*}$ and zero otherwise.
Under the more general model where the distribution G of C also depends upon the covariates and a proportional hazards assumption is appropriate for G, the estimated censoring survivor function when the covariate is x is

$$ 1 - \hat G(y; \hat\gamma, x) = \{1 - \hat G^0(y)\}^{\exp(x^T\hat\gamma)}, $$

where $\hat G^0(y)$ is the estimated baseline censoring distribution given by the analogues of (7.18) and (7.19), in which $1 - d_j$ and $\hat\gamma$ replace $d_j$ and $\hat\beta$. Under model-based resampling, a bootstrap dataset is then obtained as follows.

1 Generate $Y_j^{0*}$ from the estimated failure time survivor function $\{1 - \hat F^0(y)\}^{\exp(x_j^T\hat\beta)}$, and independently generate $C_j^*$ from the estimated censoring survivor function $\{1 - \hat G^0(y)\}^{\exp(x_j^T\hat\gamma)}$.
2 Set $Y_j^* = \min(Y_j^{0*}, C_j^*)$, with $D_j^* = 1$ if $Y_j^* = Y_j^{0*}$ and zero otherwise.
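A sketch of this second algorithm on simulated data, using coxph and survfit from the survival package; every name below is ours, the baseline survivor curves are taken at covariate value zero, and draws from the step survivor functions use their jump probabilities.

library(survival)
set.seed(2)
n <- 50; x <- runif(n)
yf <- rexp(n, exp(x)); yc <- rexp(n, 0.5)      # true failure/censoring times
y <- pmin(yf, yc); d <- as.numeric(yf <= yc)
fit.f <- coxph(Surv(y, d) ~ x)                 # hazards for failures
fit.c <- coxph(Surv(y, 1 - d) ~ x)             # and for censoring times
s.f <- survfit(fit.f, newdata = data.frame(x = 0))
s.c <- survfit(fit.c, newdata = data.frame(x = 0))
draw <- function(sf, p) {                      # one draw from {S0(y)}^p
  pr <- -diff(c(1, sf$surv^p))                 # jump probabilities
  tt <- c(sf$time, max(sf$time))
  pr <- c(pr, 1 - sum(pr))                     # leftover mass at the end
  tt[sample.int(length(tt), 1, prob = pr)]
}
y0 <- sapply(exp(coef(fit.f) * x), draw, sf = s.f)
c0 <- sapply(exp(coef(fit.c) * x), draw, sf = s.c)
y.star <- pmin(y0, c0); d.star <- as.numeric(y0 <= c0)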
The sharp increase in risk for small thicknesses is clearly a genuine effect, while beyond 3 mm the confidence interval for the linear predictor is roughly [0, 1], with thickness having little or no effect.

Results from model-based resampling using the fitted model and applying Algorithm 7.3, and from conditional resampling using Algorithm 7.2, are also shown; they are very similar to the results from resampling cases. In view of the discussion in Section 3.5, we did not apply the weird bootstrap.

The right panels of Figure 7.9 show how the estimated 0.2 quantile of the survival distribution, $y_{0.2} = \min\{y : \hat F_1(y; \hat\beta, x) \ge 0.2\}$, depends on tumour thickness. There is an initial sharp decrease from 3000 days to about 750 days as tumour thickness increases from 0-3 mm, but the estimate is roughly constant from then on. The individual estimates are highly variable, but the degree of uncertainty mirrors roughly that in the left panels. Once again results for the three resampling schemes are very similar.

Unlike the previous example, where resampling and standard likelihood methods led to similar conclusions, this example shows the usefulness of resampling when standard approaches would be difficult or impossible to apply. ■
with $\mu(x, \beta)$ nonlinear in the parameter $\beta$, which may be vector or scalar. The linear algebra associated with least squares estimates for linear regression no longer applies exactly. However, least squares theory can be developed by linear approximation, and the least squares estimate $\hat\beta$ can often be computed accurately by iterative linear fitting.

The linear approximation to (7.20), obtained by Taylor series expansion about $\beta'$, gives

$$ y_j - \mu(x_j, \beta') \doteq u_j^T(\beta - \beta') + \varepsilon_j, \qquad (7.21) $$

where

$$ u_j = \left.\frac{\partial\mu(x_j, \beta)}{\partial\beta}\right|_{\beta = \beta'}. $$

This defines an iteration that starts at $\beta'$ using a linear regression least squares fit, and at the final iteration $\beta' = \hat\beta$. At that stage the left-hand side of (7.21) is simply the residual $e_j = y_j - \mu(x_j, \hat\beta)$. Approximate leverage values and other diagnostics are obtained from the linear approximation, that is using the definitions in previous sections but with the $u_j$ evaluated at $\beta' = \hat\beta$ as the values of explanatory variable vectors. This use of the linear approximation can give misleading results, depending upon the "intrinsic curvature" of the regression surface. In particular, the residuals will no longer have zero expectation in general, and standardized residuals $r_j$ will no longer have constant variance under homoscedasticity of true errors.
The usual normal approximation for the distribution of $\hat\beta$ is also based on the linear approximation. For the approximate variance, (6.24) applies with X replaced by $U = (u_1, \ldots, u_n)^T$ evaluated at $\hat\beta$. So with $s^2$ equal to the residual mean square, we have

$$ \hat\beta - \beta \sim N\!\left(0, s^2(U^TU)^{-1}\right). \qquad (7.22) $$
Example 7.7 (Calcium uptake data) The data plotted in Figure 7.10 show the calcium uptake of cells, y, as a function of time x after being suspended in a solution of radioactive calcium. Also shown is the fitted curve

$$ \mu(x, \beta) = \beta_0\{1 - \exp(-\beta_1 x)\}. $$

The least squares estimates are $\hat\beta_0$ = 4.31 and $\hat\beta_1$ = 0.209, and the estimate of $\sigma$ is 0.55 with 25 degrees of freedom. The standard errors for $\hat\beta_0$ and $\hat\beta_1$ based on (7.22) are 0.30 and 0.039.
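This fit and a case-resampling bootstrap of it are easy to reproduce; a sketch using the calcium dataframe in the boot package (assumed to have columns time and cal):

library(boot)
cal.fit <- nls(cal ~ b0 * (1 - exp(-b1 * time)), data = calcium,
               start = c(b0 = 4, b1 = 0.2))
coef(cal.fit)                                 # roughly (4.31, 0.209)
cal.fun <- function(d, i)
  coef(nls(cal ~ b0 * (1 - exp(-b1 * time)), data = d[i, ],
           start = c(b0 = 4, b1 = 0.2)))
cal.boot <- boot(calcium, cal.fun, R = 999)   # nls may occasionally fail
                                              # to converge in a resample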
The right panel of Figure 7.10 shows that homogeneity of variance is slightly questionable here, so we resample cases by stratified sampling. Estimated biases and standard errors for $\hat\beta_0$ and $\hat\beta_1$ based on 999 bootstrap replicates are given in Table 7.8. The main point to notice is the appreciable difference between theoretical and bootstrap standard errors for $\hat\beta_0$.

Figure 7.11 illustrates the results. Note the non-elliptical pattern of variation and the non-normality: the z-statistics are also quite non-normal. In this case the bootstrap should give better results for confidence intervals than normal approximations, especially for $\hat\beta_0$. The bottom right panel suggests that the parameter estimates are closer to normal on logarithmic scales.

Results for model-based resampling assuming homoscedastic errors are fairly similar, although the standard error for $\hat\beta_0$ is then 0.32. The effects of nonlinearity are negligible in this case: for example, the maximum absolute bias of residuals is about $0.012\sigma$.

Suppose that we want confidence limits on some aspect of the curve, such as the "proportion of maximum" $\pi = 1 - \exp(-\beta_1 x)$. Ordinarily one might
approach this by applying the delta method together with the bivariate normal approximation for least squares estimates, but the bootstrap can deal with this using only the simulated parameter estimates. So consider the times x = 1, 5, 15, at which the estimates $\hat\pi = 1 - \exp(-\hat\beta_1 x)$ are 0.188, 0.647 and 0.956 respectively. The top panel of Figure 7.12 shows bootstrap distributions of $\hat\pi^* = 1 - \exp(-\hat\beta_1^* x)$: note the strong non-normality at x = 15.

The constraint that $\pi$ must lie in the interval (0, 1) means that it is unwise to construct basic or studentized confidence intervals for $\pi$ itself. For example, the basic bootstrap 95% interval for $\pi$ at x = 15 is [0.922, 1.025]. The solution is to do all the calculations on the logit scale, as shown in the lower panel of Figure 7.12, and untransform the limits obtained at the end. That is, we obtain limits for $\log\{\pi/(1 - \pi)\}$ and apply the inverse transformation to them.
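Continuing the calcium sketch, the basic bootstrap limits on the logit scale might be computed as follows and then untransformed:

logit <- function(p) log(p/(1 - p))
expit <- function(z) 1/(1 + exp(-z))
pi.hat <- 1 - exp(-coef(cal.fit)["b1"] * 15)
pi.star <- 1 - exp(-cal.boot$t[, 2] * 15)     # resampled proportions
z <- logit(pi.star)
lims <- 2 * logit(pi.hat) - quantile(z, c(0.975, 0.025))  # basic method
expit(lims)                                   # back on the (0,1) scale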
[Figure 7.12: calcium uptake data; bootstrap histograms for the estimated proportion of maximum $\pi = 1 - \exp(-\beta_1 x)$ at x = 1, 5 and 15, based on R = 999 resamples of cases, on the original and logit scales.]
Prediction error for a binary response is measured here by the misclassification cost

$$ c(y, \hat\mu) = \begin{cases} 1 & \text{if } y = 0 \text{ and } \hat\mu \ge \tfrac12, \text{ or } y = 1 \text{ and } \hat\mu < \tfrac12, \\ 0 & \text{otherwise.} \end{cases} \qquad (7.23) $$
Example 7.8 (Urine data) For an example of the estimation of misclassification error, we take binary data on the presence of calcium oxalate crystals in 79 samples of urine. Explanatory variables are specific gravity (the density of urine relative to water), pH, osmolarity (mOsm), conductivity (mMho), urea concentration (millimoles per litre), and calcium concentration (millimoles per litre). After dropping two incomplete cases, 77 remain.

Consider how well the presence of crystals can be predicted from the explanatory variables. Analysis of deviance for binary logistic regression suggests the model which includes the p = 4 covariates specific gravity, conductivity, log calcium concentration, and log urea concentration, and we base our predictions on this model. The simplest estimate of the expected aggregate prediction error $\Delta$ is the average number of misclassifications, $\hat\Delta_{app} = n^{-1}\sum c(y_j, \hat\mu_j)$, with $c(\cdot, \cdot)$ given by (7.23); it would be equivalent to use instead the cost $c(y, \hat\mu) = 1$ if $|y - \hat\mu| > \frac12$ and zero otherwise.
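The apparent error is simple to compute; a sketch assuming the urine dataframe of the boot package, with the four covariates taken to be gravity, cond, log(calc) and log(urea):

library(boot)
u <- na.omit(urine)                      # drop the two incomplete cases
u.glm <- glm(r ~ gravity + cond + log(calc) + log(urea),
             family = binomial, data = u)
mean(abs(u$r - fitted(u.glm)) > 0.5)     # apparent misclassification rate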
[Figure 7.13: components of the 0.632 estimate of prediction error, $y_j - \hat\mu(x_j; \hat F^*)$, for the urine data based on 200 bootstrap simulations, with cases ordered by residual. Values within the dotted lines make no contribution to prediction error. The components from cases 54 and 66 are the rightmost and the fourth from rightmost sets of errors shown; the components from case 27 are leftmost.]
In this case $\hat\Delta_{app} = 20.8 \times 10^{-2}$. Other estimates of aggregate prediction error are given in Table 7.9. For the bootstrap and 0.632 estimates, we used R = 200 bootstrap resamples.

The discontinuous nature of prediction error gives more variable results than for the examples with squared error in Section 6.4.1. In particular the results for K-fold cross-validation now depend more critically on which observations fall into the groups. For example, the average and standard deviation of $\hat\Delta_{CV,K}$ for 40 repeats were $23.0 \times 10^{-2}$ and $2.0 \times 10^{-2}$. However, the broad pattern is similar to that in Table 6.9.

Figure 7.13 shows box plots of the quantities $y_j - \hat\mu(x_j; \hat F^*)$ that contribute to the 0.632 estimate of prediction error, plotted against case j ordered by the residual; only three values of j are labelled. There are about 74 contributions at each value of j. Only values outwith the horizontal dotted lines contribute to prediction error. The pattern is broadly what we would expect: observations with residuals close to zero are generally well predicted, and make little contribution to prediction error. More extreme residuals contribute most to prediction error. Note cases 66 and 54, which are always misclassified; their standardized Pearson residuals are 2.13 and 2.54. The figure suggests that case
$$ Y = \mu(x) + \varepsilon, $$

where $\mu(x)$ has completely unknown form but would be assumed continuous in many applications, and $\varepsilon$ is a random error with zero mean. A typical application is illustrated by the scatter plot in Figure 7.14. Here no simple parametric regression curve seems appropriate, so it makes sense to fit a smooth curve (which we do later in Example 7.10) with as few restrictions as possible.
Often nonparametric regression is used as an exploratory tool, either directly by producing a curve estimate for visual interpretation, or indirectly by providing a comparison with some tentative parametric model fit via a significance test. In some applications the rather different objective of prediction will be of interest. Whatever the application, the complicated nature of nonparametric regression methods makes it unlikely that probability distributions for statistics of interest can be evaluated theoretically, and so resampling methods will play a prominent role.

It is not possible here to describe all of the nonparametric regression methods that are now available, and in any event many of them do not yet have fully developed companion resampling methods. We shall limit ourselves to a brief discussion of some of the main methods, and to applications in generalized additive models, where nonparametric regression is used to extend the generalized linear models of Section 7.2.
X > ;w { (x ~ */)/*>}
£(*) = (7.24)
E » {(x -x # )} ’
w ith w(-) a sym m etric density function and b an adjustable “ban d w id th ” con
stan t th a t determ ines how widely the averaging is done. This estim ate is similar
in m any ways to the kernel density estim ate discussed in Exam ple 5.13, and as
there the choice o f b depends upon a trade-off betw een bias and variability o f
the e stim a te : sm all b gives sm all bias and large variance, whereas large b has
the opposite effects. Ideally b would vary w ith x, to reflect large changes in the
derivative o f /i(x) and heteroscedasticity, b o th evident in Figure 7.14.
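The estimate (7.24) with a normal kernel is available in R as ksmooth; a sketch on simulated data (names are ours):

set.seed(3)
x <- sort(runif(100, 0, 10))
y <- sin(x) + rnorm(100, sd = 0.3)
fit <- ksmooth(x, y, kernel = "normal", bandwidth = 1)  # bandwidth plays b
plot(x, y); lines(fit$x, fit$y)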
Modifications to the estimate (7.24) are needed at the ends of the x range, to avoid the inherent bias when there is little or no data on one side of x. In many ways more satisfactory are the local regression methods, where a local linear or quadratic curve is fitted using weights $w\{(x - x_j)/b\}$ as above, and then $\hat\mu(x)$ is taken to be the fitted value at x. Implementations of this idea include the lowess method, which also incorporates trimming of outliers. Again the choice of b is critical.

A different approach is to define a curve in terms of basis functions, such as powers of x which define polynomials. The fitted model is then a linear combination of basis functions, with coefficients determined by least squares regression. Which basis to use depends on the application, but polynomials are
A cubic smoothing spline estimate minimizes the penalized sum of squares

$$ \sum_j\{y_j - \mu(x_j)\}^2 + \lambda\int\{\mu''(x)\}^2\,dx; $$

weighted sums of squares can be used if necessary. In most software implementations the spline fit can be determined either by specifying the degrees of freedom of the fitted curve, or by applying cross-validation (Section 6.4.1).

A spline fit will generally be biased, unless the underlying curve is in fact a cubic. That such bias is nearly always present for nonparametric curve fits can create difficulties. The other general feature that makes interpretation difficult is the occurrence of spurious bumps and bends in the curve estimates, as we shall see in Example 7.10.
Resampling methods

Two types of applications of nonparametric curves are use in checking a parametric curve, and use in setting confidence limits for $\mu(x)$ or prediction limits for $Y = \mu(x) + \varepsilon$ at some values of x. The first type is quite straightforward, because data would be simulated from the fitted parametric model: Example 7.11 illustrates this. Here we look briefly at confidence limits and prediction limits, where the nonparametric curve is the only "model".

The basic difficulty for resampling here is similar to that with density estimation, illustrated in Example 5.13, namely bias. Suppose that we want to calculate a confidence interval for $\mu(x)$ at one or more values of x. Case resampling cannot be used with standard recommendations for nonparametric regression, because the resampling bias of $\hat\mu^*(x)$ will be smaller than that of $\hat\mu(x)$. This could probably be corrected, as with density estimation, by using a larger bandwidth or equivalent tuning constant. But simpler, at least in principle, is to apply the idea of model-based resampling discussed in Chapter 6.

The naive extension of model-based resampling would generate responses $y_j^* = \hat\mu(x_j) + \varepsilon_j^*$, where $\hat\mu(x_j)$ is the fitted value from some nonparametric regression method, and $\varepsilon_j^*$ is sampled from appropriately modified versions of the residuals $y_j - \hat\mu(x_j)$. Unfortunately the inherent bias of most nonparametric regression methods distorts both the fitted values and the residuals, and thence biases the resampling scheme. One recommended strategy is to use as simulation model a curve that is oversmoothed relative to the usual estimate. For definiteness, suppose that we are using a kernel method or a local smoothing method with tuning constant b, and that we use cross-validation to determine the best value of b. Then for the simulation model we use the corresponding curve with, say, 2b as the tuning constant. To try to eliminate bias from the simulation errors $\varepsilon_j^*$, we use residuals from an undersmoothed curve, say with tuning constant b/2. As with linear regression, it is appropriate to use modified residuals, where leverage is taken into account as in (6.9). This is possible for most nonparametric regression methods, since they are linear. Detailed asymptotic theory shows that something along these lines is necessary to make resampling work, but there is no clear guidance as to precise relative values for the tuning constants.
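A sketch of this strategy with a spline smoother, continuing the simulated data above; the degrees of freedom stand in for the tuning constants b, 2b and b/2 (more degrees of freedom means a rougher fit), and the leverage adjustment of the residuals is omitted for brevity.

fit.cv <- smooth.spline(x, y)                          # cross-validated fit
df.cv <- fit.cv$df
fit.over <- smooth.spline(x, y, df = max(2, df.cv/2))  # oversmoothed
fit.under <- smooth.spline(x, y, df = 2 * df.cv)       # undersmoothed
res <- y - predict(fit.under, x)$y
res <- res - mean(res)
y.star <- predict(fit.over, x)$y + sample(res, replace = TRUE)
fit.star <- smooth.spline(x, y.star, df = df.cv)       # one bootstrap curve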
fit. Note how the confidence limits are centred on the convex side of the fitted curve in order to account for its bias; this is most evident at x = 20. ■

The following example illustrates the use of nonparametric curve fits in model-checking.
Example 7.11 (Leukaemia data) For the data in Example 7.1, we originally fitted a generalized linear model with gamma variance function and linear predictor group + x with logarithmic link, where group is a factor with two levels. The fitted mean function for that model is shown as two solid curves in Figure 7.16, the upper curve corresponding to Group 1. Here we consider
Figure 7.16: generalized linear model fits (solid) and generalized additive model fits (dashed) for the leukaemia data of Example 7.1.
whether or not the effect of x is linear. To do this, we compare the original fit to that of the generalized additive model in which x is replaced by s(x), which is a smoothing spline with three degrees of freedom. The link and variance functions are unchanged. The fitted mean function for this model is shown as dashed curves in the figure.
Is the smooth curve a significantly better fit? To answer this we use the test statistic Q defined in (7.8), where here D corresponds to the residual deviance for the generalized additive model, κ is the dispersion for that model, and D₀ is the residual deviance for the smaller generalized linear model. For these data D₀ = 40.32 with 30 degrees of freedom, κ = 0.725, and D = 30.75 with 27 degrees of freedom, so that q = (40.32 − 30.75)/0.725 = 13.2. The standard approximation for the null distribution of Q is chi-squared with degrees of freedom equal to the difference in model dimensions, here p − p₀ = 3, so the approximate P-value is 0.004. Alternatively, to allow for estimation of the dispersion, (p − p₀)⁻¹q is compared to the F distribution with denominator degrees of freedom n − p − 1, here 27, and this gives approximate P-value 0.012. It looks as though there is strong evidence against the simpler, loglinear model. However, the accuracies of the approximations used here are somewhat questionable, so it makes sense to apply the resampling analysis.
To calculate a bootstrap P-value corresponding to q = 13.2, we simulate the distribution of Q under the fitted null model, that is the original generalized linear model fit, but with nonparametric resampling. The particular resampling scheme we choose here uses the linear predictor residuals r_{Lj} defined in (7.10), one advantage of which is that positive simulated responses are guaranteed. The residuals in this case are

r_{Lj} = {log(y_j) − log(μ̂_{0j})} / {κ̂₀^{1/2} (1 − h_{0j})^{1/2}},
Figure 7.17: chi-squared Q-Q plot of standardized deviance differences q* for comparing generalized linear and generalized additive model fits to the leukaemia data. The lines show the theoretical χ²₃ approximation (dashes) and the F approximation (dots). Resampling uses Pearson residuals on the linear predictor scale, with R = 999.
where h_{0j}, μ̂_{0j} and κ̂₀ are the leverage, fitted value and dispersion estimate for the null (generalized linear) model. These residuals appear quite homogeneous, so no stratification is used. Thus step 2 of Algorithm 7.4 consists of sampling ε*₁,...,ε*ₙ randomly with replacement from r_{L1},...,r_{Ln} (without mean correction), and then generating responses y*_j = μ̂_{0j} exp(κ̂₀^{1/2} ε*_j) for j = 1,...,n. Applying this algorithm with R = 999 for our data gives the P-value 0.035, larger than the theoretical approximations, but still suggesting that the linear term in x is not sufficient. The bootstrap null distribution of q* deviates markedly from the standard χ²₃ approximation, as the Q-Q plot in Figure 7.17 shows. The F approximation is also inaccurate.
A jackknife-after-bootstrap plot reveals that quantiles of q* are moderately sensitive to case 2, but without this case the P-value is virtually unchanged. Very similar results are obtained under parametric resampling with the exponential model, as might be expected from the original data analysis. ■
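A minimal sketch of this resampling scheme, assuming the null gamma-family fit with log link is available as null.glm with response vector y, and a hypothetical helper q.stat() that refits both competing models to a new response and returns the standardized deviance difference; these names are illustrative only.

mu0    <- fitted(null.glm)
h0     <- hatvalues(null.glm)                  # leverages, null model
kappa0 <- summary(null.glm)$dispersion         # dispersion estimate
rL     <- (log(y) - log(mu0))/sqrt(kappa0*(1 - h0))
q.star <- replicate(999, {
    eps <- sample(rL, length(rL), replace = TRUE)   # no mean correction
    q.stat(mu0*exp(sqrt(kappa0)*eps)) })            # refit models, get q*
(1 + sum(q.star >= 13.2))/(1 + 999)                 # bootstrap P-value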
Our next example illustrates the use of semiparametric regression in prediction.

We take α(j) to be a locally quadratic lowess smooth with bandwidth 0.5.
Figure 7.18: generalized additive model prediction of UK AIDS diagnoses. The left panel shows the fitted curve with bandwidth 0.5 (smooth solid line), the predicted diagnoses from this fit (jagged dashed line), and the fitted curves with bandwidths 0.7 (dots) and 0.3 (dashes), together with the observed totals (+). The right panel shows the predicted quarterly diagnoses for 1989–92 (central solid line), and pointwise 95% prediction limits from the Poisson bootstrap (solid), negative binomial bootstrap (dashes), and nonparametric bootstrap without (dots) and with (dot-dash) stratification.

The delay distribution is so sharply peaked here that although we could take a smooth function in the delay time, it is equally parsimonious to take 15 separate parameters β_k. We use the same variance function as in Example 7.4,
which assumes that the observed counts y_jk are overdispersed Poisson with means μ_jk, and we fit the model as a generalized additive model. The residual deviance is 751.7 on 444.2 degrees of freedom, increased from 716.5 and 413 in the previous fit. The curve shown in the left panel of Figure 7.18 fits well, and is much more plausible as a model for underlying trend than the curve in Figure 7.5. The panel also shows the predicted values from this curve, which of course are heavily affected by the observed diagnoses in Table 7.4.
As mentioned above, in resampling from fitted curves it is important to take residuals from an undersmoothed curve, in order to avoid bias, and to add them to an oversmoothed curve. We take Pearson residuals (y − μ̂)/μ̂^{1/2} from a similar curve with bandwidth 0.3, and add them to a curve with bandwidth 0.7. These fits have deviances 745.3 on 439.2 degrees of freedom and 754.1 on 446.1 degrees of freedom. Both of these curves are shown in Figure 7.18. Leverage adjustment is awkward for generalized additive models, but the large number of degrees of freedom here makes such adjustments unnecessary. We modify resampling scheme (7.12), and repeat the calculations as for Algorithm 7.1 applied to Example 7.4, with R = 999.
Table 7.11 shows the resulting prediction intervals for the last quarters of 1990, 1991, and 1992. The intervals for 1992 are substantially shorter than those in Table 7.5, because of the different model. The generalized additive model is based on an underlying smooth trend in diagnoses, so predictions for the last few rows of the table depend less critically on the values observed in those rows. This contrasts with the Poisson two-way layout model, for which the predictions depend completely on single rows of the table and are much more variable. Compare the slight forecast drop in Figure 7.6 with the predicted increase in Figure 7.18.

The dotted lines in Figure 7.18 show pointwise 95% prediction bands for the AIDS diagnoses. The prediction intervals for the negative binomial and nonparametric schemes are similar, although the effect of stratification is smaller. Stratification has no effect on the deviances. The negative binomial deviances are typically about 90 larger than those generated under the nonparametric scheme.

The plausibility of the smooth underlying curve and its usefulness for prediction is of course central to the approach outlined here. ■
Example 7.13 (Downs syndrome) Table 7.12 contains a set of data on incidence of Downs syndrome babies for mothers in various age ranges. Mean age is the approximate mean age of the m mothers whose babies included y babies with Downs syndrome. These data are plotted on the logistic scale in Figure 7.19, together with a generalized additive spline fit as an exploratory aid in modelling the incidence rate.
What we notice about the curve is that it decreases with age for young mothers, contrary to intuition and expert belief. A similar phenomenon occurs for other datasets. We want to see if this dip is real, as opposed to a statistical artefact. So a null model is required under which the rate of occurrence is increasing with age. Linear logistic regression is clearly inappropriate, and most other standard models give non-increasing rates. The approach taken is isotonic regression, in which the rates are fitted nonparametrically subject to their increasing with age. Further, in order to make the null model a special
Mean age x   37.5  38.5  39.5  40.5  41.5  42.4  43.5  44.5  45.5  47.0
m            5780  4834  3961  2952  2276  1589  1018   596   327   249
y              17    15    30    31    33    20    16    22    11     7
case of the general model, the latter is taken to be an arbitrary convex curve for the logit of incidence rate.

If the incidence rate at age x_i is π(x_i) with logit{π(x_i)} = η(x_i) = η_i, say, for i = 1,...,k, then the binomial log likelihood is

∑_{i=1}^{k} [ y_i η_i − m_i log{1 + exp(η_i)} ].

A convex model is one in which

η_i ≤ {(x_{i+1} − x_i)/(x_{i+1} − x_{i−1})} η_{i−1} + {(x_i − x_{i−1})/(x_{i+1} − x_{i−1})} η_{i+1},   i = 2,...,k − 1.

The general model fit will maximize the binomial log likelihood subject to these constraints, giving estimates η̂₁,...,η̂_k. The null model satisfies the constraints η_i ≤ η_{i+1} for i = 1,...,k − 1, which are equivalent to the previous convexity
constraints plus the single constraint η₁ ≤ η₂. The null fit essentially pools adjacent age groups for which the general estimates η̂_i violate the monotonicity of the null model. If the null estimates are denoted by η̂_{0,i}, then we take as our test statistic the deviance difference t = 2{ℓ(η̂) − ℓ(η̂₀)}, where ℓ denotes the binomial log likelihood above.
The difficulty now is that the standard chi-squared approximation for deviance differences does not apply, essentially because there is not a fixed value for the degrees of freedom. There is a complicated large-sample approximation which may well not be reliable. So a parametric bootstrap is used to calculate the P-value. This requires simulation from the binomial model with sample sizes m_i, covariate values x_i, and logits η̂_{0,i}.
Figure 7.20 shows the convex and isotone regression fits, which clearly differ for ages below 30. The deviance difference for these fits is t = 5.873. Simulation of R = 999 binomial datasets from the isotone model gave 33 values of t* in excess of 5.873, so the P-value is 0.034 and we conclude that the dip in incidence rate may be real. (Further analysis with additional data does not support this conclusion.) Figure 7.21 is a histogram of the t* values.

It is possible that the null distribution of T is unstable with respect to parameter values, in which case the nested bootstrap procedure of Section 4.5 should be used, possibly in conjunction with the recycling method of Section 9.4.4 to accelerate the computation. ■
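A minimal sketch of the parametric bootstrap just described, assuming hypothetical fitters fit.convex() and fit.isotone() that maximize the binomial log likelihood under the convexity and monotonicity constraints and return the maximized log likelihood and fitted logits; m, x and y are the data of Table 7.12.

dev.diff <- function(y, m, x)                   # deviance difference t
    2*(fit.convex(y, m, x)$loglik - fit.isotone(y, m, x)$loglik)
t.obs <- dev.diff(y, m, x)                      # 5.873 for these data
p0    <- plogis(fit.isotone(y, m, x)$eta)       # null incidence rates
t.star <- replicate(999, {
    y.star <- rbinom(length(m), size = m, prob = p0)
    dev.diff(y.star, m, x) })
(1 + sum(t.star >= t.obs))/(1 + 999)            # gives about 0.034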
A summary of much of the theory for resampling in nonlinear and nonparametric regression is given in Chapter 8 of Shao and Tu (1995).
7.8 Problems
1 The estimator β̂ in a generalized linear model may be defined as the solution to the theoretical counterpart of (7.2), namely

∫ x {y − μ(x; β)} / [ g′{μ(x; β)} V{μ(x; β)} ] dF(x, y) = 0,

evaluated at the fitted model, where W is the diagonal matrix with elements given by (7.3).
Hence show that the approximate variance matrix for β̂* under case resampling in a generalized linear model is

κ(X^T W X)^{−1} X^T W S X (X^T W X)^{−1},

where S = diag(r²_{P1},...,r²_{Pn}) with the r_{Pj} standardized Pearson residuals (7.9).
Show that for the linear model this yields the modified version of the robust variance matrix (6.26).
(Section 7.2.2; Moulton and Zeger, 1991)
2 For the gamma model of Examples 7.1 and 7.2, verify that var(Y) = κμ² and that the log likelihood contribution from a single observation is

−κ^{−1}{log(μ) + y/μ}.

Show that the unstandardized Pearson and deviance residuals are respectively κ^{−1/2}(z − 1) and sign(z − 1)[2κ^{−1}{z − 1 − log(z)}]^{1/2}, where z = y/μ̂. If the regression is loglinear, meaning that the log link is used, verify that the unstandardized linear predictor residuals are simply κ^{−1/2} log(z).
What are the possible ranges of the standardized residuals r_P, r_L and r_D? Calculate these for the model fitted in Example 7.2.
If the deviance residual is expressed as d(y, μ), check that d(y, μ) = d(z, 1). Hence show that the resampling scheme based on standardized deviance residuals can be expressed as y*_j = μ̂_j z*_j, where z*_j is defined by d(z*_j, 1) = ε*_j with ε*_j randomly sampled from r_{D1},...,r_{Dn}. What further simplification can be made?
(Sections 7.2.2, 7.2.3)
3 The figure below shows the fit to data pairs (x₁, y₁),...,(xₙ, yₙ) of a binary logistic model.
(a) Under case resampling, show that the maximum likelihood estimate for a bootstrap sample is infinite with probability close to e^{−2}. What effect has this on the different types of bootstrap confidence intervals for β₁?
(b) Bias-corrected maximum likelihood estimates are obtained by modifying response values (0, 1) to (h_j/2, 1 − h_j/2), where h_j is the jth leverage for the model fit to the original data. Do infinite parameter estimates arise when bootstrapping cases from the modified data?
(Section 7.2.3; Firth, 1993; Moulton and Zeger, 1991)
4 Investigate whether resampling schemes given by (7.12), (7.13), and (7.14) yield
Algorithm 6.1 for bootstrapping the linear model.
6 For generalized linear models the analogue of the case-deletion result in Problem 6.2 is

β̂_{−j} = β̂ − (X^T W X)^{−1} w_j^{1/2} κ^{1/2} x_j r_{Pj} / (1 − h_j).

(a) Use this to show that when the jth case is deleted the predicted value for y_j is
(b) Use (a) to give an approximation for the leave-one-out cross-validation estimate of prediction error for a binary logistic regression with cost (7.23).
(Sections 6.4.1, 7.2.2)
7.9 Practicals
1 Dataframe remission contains data from Freeman (1987) concerning a measure of cancer activity, the LI values, for 27 cancer patients, of whom 9 went into remission. Remission is indicated by the binary variable r = 1. Consider testing the hypothesis that the LI values do not affect the probability of remission. First, fit a binary logistic model to the data, plot them, and perform a permutation test:

attach(remission)
plot(LI + 0.03*rnorm(27), r, pch = 1, xlab = "LI, jittered", xlim = c(0, 2.5))
rem.glm <- glm(r ~ LI, binomial, data = remission)
summary(rem.glm)
x <- seq(0.4, 2.0, 0.02)
eta <- cbind(rep(1, 81), x) %*% coefficients(rem.glm)
lines(x, inv.logit(eta), lty = 2)
rem.perm <- function(data, i)
{ d <- data
  d$LI <- d$LI[i]
  d.glm <- glm(r ~ LI, binomial, data = d)
  coefficients(d.glm) }
rem.boot <- boot(remission, rem.perm, R = 199, sim = "permutation")
qqnorm(rem.boot$t[, 2], ylab = "Coefficient of LI", ylim = c(-3, 3))
abline(h = rem.boot$t0[2], lty = 2)
Compare this significance level with that from using a normal approximation for the coefficient of LI in the fitted model.
Construct bootstrap tests of the hypothesis by extending the methods outlined in Section 6.2.5.
(Freeman, 1987; Hall and Wilson, 1991)
2 Dataframe breslow contains data from Breslow (1985) on death rates from heart disease among British male doctors. A standard model is that the numbers of deaths y have a Poisson distribution with mean nλ, where n is the number of person-years and λ is the death rate. The focus of interest is how death rate depends on two explanatory variables, a factor representing the age group and an indicator of smoking status, x. Two competing models, respectively multiplicative and additive in the smoking effect, are fitted as follows:
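A minimal sketch of two such fits, assuming that breslow has columns y (deaths), n (person-years), age (a factor) and a 0/1 smoking indicator smoke; the column names, and the use of an identity-link Poisson fit for the additive model, are assumptions made for the sketch.

breslow$ns <- breslow$n * breslow$smoke         # smoking effect variable
fit.mult <- glm(y ~ age + smoke + offset(log(n)),
                poisson(link = "log"), data = breslow)
fit.add  <- glm(y ~ n:age + ns - 1,             # identity link: no offset,
                poisson(link = "identity"),     # may need starting values
                data = breslow)
c(deviance(fit.add), deviance(fit.mult))        # about 7.43 and 12.13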
Here ns is a variable for the effect of smoking, constructed to allow for the difficulty in applying an offset in fitting the additive model. The deviances of the fitted models are D_add = 7.43 and D_mult = 12.13. Although it appears that the additive model is the better fit, these models are not nested, so a chi-squared approximation cannot be applied to the difference of deviances. For bootstrap

What does this tell you about the relative fit of the models?
A different strategy would be to use parametric simulation, simulating not from the fitted models, but from the model with separate Poisson distributions for each of the original data. Discuss critically this approach.
(Section 7.2; Example 4.5; Wahrendorf, Becher and Brown, 1987; Hall and Wilson, 1991)
The MLEs for the original data can be obtained by setting hirose.start <- c(6, 2, 4, 1, 1) (obtained by introspection), and then iterating the following lines a few times.
New data are generated by
Try this with a larger value of R — but don't hold your breath.
For a full likelihood analysis for the parameter θ, the log likelihood must be maximized over β₁,...,β₄ for a given value of θ. A little thought shows that the necessary code is

beta0 <- function(theta, mle)
{ x49 <- -log(4.9 - (5 - exp(mle[4])))
  x <- -log(4.9)
  log(theta*10^3) - mle[1]*x49 - lgamma(1 + exp(-mle[2] - mle[3]*x)) }
hirose.lik2 <- function(mle, data, theta)
{ x0 <- 5 - exp(mle[4])
  lambda <- exp(beta0(theta, mle) + mle[1]*(-log(data$volt - x0)))
  beta <- exp(mle[2] + mle[3]*(-log(data$volt)))
  z <- (data$time/lambda)^beta
  sum(z - data$cens*log(beta*z/data$time)) }
hirose.fun2 <- function(data, start, theta)
{ d <- nlminb(start, hirose.lik2, data = data, theta = theta)
  conv <- (d$message == "RELATIVE FUNCTION CONVERGENCE")
  c(conv, d$objective, d$parameters) }
hirose.f <- function(data, i, start, theta)
  c(hirose.fun(data, i, start),
    hirose.fun2(data[i, ], start[-1], theta))
R <- hirose.boot$R
i <- c(1:R)[(hirose.boot$t[, 1] == 1) & (hirose.boot$t[, 8] == 1)]
w <- 2*(hirose.boot$t[i, 9] - hirose.boot$t[i, 2])
qqplot(qchisq(c(1:length(w))/(1 + length(w)), 1), w)
abline(0, 1, lty = 2)

Again, try this with a larger R.
Can you see how the code would be modified for nonparametric simulation?
(Section 7.3; Hirose, 1993)
attach(nodal)
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
nodal.glm <- glm(r ~ stage + xray + acid, binomial, data = nodal)
nodal.diag <- glm.diag(nodal.glm)
app.err <- cost(r, fitted(nodal.glm))
cv.err <- cv.glm(nodal, nodal.glm, cost, K = 53)$delta
cv.11.err <- cv.glm(nodal, nodal.glm, cost, K = 11)$delta
For resampling-based estimates and plot for 0.632 errors:
Here we have used resampled standardized Pearson residuals for the null model, obtained by cloth.diag$rp.
How significant is the observed drop in deviance under this resampling scheme?
(Section 7.6.2; Bissell, 1972; Firth, Glosup and Hinkley, 1991)
6 The data nitrofen are taken from a test of the toxicity of the herbicide nitrofen on the zooplankton Ceriodaphnia dubia, an important species that forms the basis of freshwater food chains for the higher invertebrates and for fish and birds. The standard test measures the survival and reproductive output of 10 juvenile C. dubia in each of four concentrations of the herbicide, together with a control in which the herbicide is not present. During the 7-day period of the test each of the original individuals produces three broods of offspring, but for illustration we analyse the total offspring.
A previous model for the data is that at concentration x the total offspring y for each individual is Poisson distributed with mean exp(β₀ + β₁x + β₂x²). The fit of this model to the data suggests that low doses of nitrofen augment reproduction, but that higher doses inhibit it.
One thing required from analysis is an estimate of the concentration x₅₀ of nitrofen at which the mean brood size is halved, together with a 95% confidence interval for x₅₀. A second issue is posed by the surprising finding from a previous analysis that brood sizes are slightly larger at low doses of herbicide than at high or zero doses: is this true?
A wide variety of nonparametric curves could be fitted to the data, though care is needed because there are only five distinct values of x. The data do not look Poisson, but we use models with Poisson errors and the log link function to ensure that fitted values and predictions are positive. To compare the fits of the generalized linear model described above and a robustified generalized additive model with Poisson errors:

Do the values of x*₅₀ look normal? What is the bias estimate for x₅₀ using the two models?
To perform a bootstrap test of whether the peak is a genuine effect, we simulate from a model satisfying the null hypothesis of no peak to see if the observed value of a suitable test statistic t, say, is unusual. This involves fitting a model with no peak, and then simulating from it. We read fitted values μ̂₀(x) from the robust generalized additive model fit, but with 2.2 df (chosen by eye as the smallest for which the curve is flat through the first two levels of concentration). We then generate bootstrap responses by setting y* = μ̂₀(x) + ε*, where the ε* are chosen randomly from the modified residuals at that x. We take as test statistic the difference between the highest fitted value and the fitted value at x = 0.

nitro.test <- fitted(gam(total ~ s(conc, df = 2.2), robust(poisson),
                     data = nitrofen))
f <- predict(nitro.glm, nitro, "response")
nitro.orig <- max(f) - f[1]
res <- (nitrofen$total - nitro.test)/sqrt(1 - 0.1)
nitro1 <- data.frame(nitrofen, res = res, fit = nitro.test)
nitro1.fun <- function(data, i, nitro)
{ assign("d", data, frame = 1)
  d$total <- round(d$fit + d$res[i])
  d.fit <- glm(total ~ conc + conc^2, poisson, data = d)
  f <- predict(d.fit, nitro, "response")
  max(f) - f[1] }
nitro1.boot <- boot(nitro1, nitro1.fun, R = 99,
                    strata = rep(1:5, rep(10, 5)), nitro = nitro)
(1 + sum(nitro1.boot$t > nitro.orig))/(1 + nitro1.boot$R)

Do your conclusions change if other smooth curves are fitted?
(Section 7.6.2; Bailer and Oris, 1994)
8
Complex Dependence
8.1 Introduction
In previous chapters our models have involved variables independent at some level, and we have been able to identify independent components that can be simulated. Where a model can be fitted and residuals of some sort identified, the same ideas can be applied in the more complex problems discussed in this chapter. Where that model is parametric, parametric simulation can in principle be used to obtain resamples, though Markov chain Monte Carlo techniques may be needed in practice. But in nonparametric situations the dependence may be so complex, or our knowledge of it so limited, that neither of these approaches is feasible. Of course some assumption of repeatedness within the data is essential, or it is impossible to proceed. But the repeatability may not be at the level of individual observations, but of groups of them, and there is typically dependence between as well as within groups. This leads to the idea of constructing bootstrap data by taking blocks of some sort from the original observations. The area is in rapid development, so we avoid a detailed mathematical exposition, and merely sketch key aspects of the main ideas. In Section 8.2 we describe some of the resampling schemes proposed for time series. Section 8.3 outlines some ideas useful in resampling point processes.
and not on their absolute position in the series. A weaker assumption used in data analysis is that the joint second moments of observations depend only on their relative positions; such a series is said to be second-order or weakly stationary.

Time domain

There are two basic types of summary quantities for stationary time series. The first, in the time domain, rests on the joint moments of the observations. Let {Y_j} be a second-order stationary time series, with zero mean and autocovariance function γ_j. That is, E(Y_j) = 0 and cov(Y_k, Y_{k+j}) = γ_j for all k and j; the variance of Y_j is γ₀. Then the autocorrelation function of the series is ρ_j = γ_j/γ₀, for j = 0, ±1, ..., which measures the correlation between observations at lag j apart; of course −1 ≤ ρ_j ≤ 1, ρ₀ = 1, and ρ_j = ρ_{−j}. An uncorrelated series would have ρ_j = 0, and if the data were normally distributed this would imply that the observations were independent.
For example, the stationary moving average process of order one, or MA(1) model, has

Y_j = ε_j + βε_{j−1},   j = ..., −1, 0, 1, ...,   (8.1)

The autocorrelation function for this process is ρ_j = α^{|j|} for j = ±1, ±2 and so forth, so large α gives high correlation between successive observations. The autocorrelation function decreases rapidly for both models (8.1) and (8.2).

A close relative of the autocorrelation function is the partial autocorrelation function, defined via the covariance between Y_k and Y_{k+j} after adjusting for the intervening observations. The partial autocorrelations for the MA(1) model are
Frequency domain

The second approach to time series is based on the frequency domain. The spectrum of a stationary series with autocovariances γ_j is

g(ω) = ∑_{j=−∞}^{∞} γ_j cos(jω).

This summarizes the values of all the autocorrelations of {Y_j}. A white noise process has the flat spectrum g(ω) = γ₀, while a sharp peak in g(ω) corresponds to a strong periodic component in the series. For example, the spectrum for a stationary AR(1) model is g(ω) = σ²{1 − 2α cos(ω) + α²}^{−1}.
The empirical Fourier transform plays a key role in data analysis in the frequency domain. The treatment is simplified if we relabel the series as y₀,...,y_{n−1}, and suppose that n = 2n_P + 1 is odd. Let ζ = e^{2πi/n} be the nth complex root of unity, so ζⁿ = 1. Then the empirical Fourier transform of the data is the set of n complex-valued quantities

ỹ_k = ∑_{j=0}^{n−1} ζ^{jk} y_j,   k = 0,...,n−1;

note that

n^{−1} ∑_{k=0}^{n−1} ζ^{−jk} ỹ_k = y_j,   j = 0,...,n−1,

so this inverse Fourier transform retrieves the data. Now define the Fourier frequencies ω_k = 2πk/n, for k = 1,...,n_P. The sample analogue of the spectrum at ω_k is the periodogram,

I(ω_k) = n^{−1} |ỹ_k|²,   k = 1,...,n_P.

When the data are white noise, the ordinates of the normalized cumulative periodogram ∑_{j=1}^{k} I(ω_j) / ∑_{j=1}^{n_P} I(ω_j), k = 1,...,n_P − 1, have roughly the same joint distribution as the order statistics of n_P − 1 uniform random variables.
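In R, for example, these quantities can be computed with the fast Fourier transform; this sketch assumes a numeric series y of odd length (fft() uses the conjugate sign convention, which leaves |ỹ_k|² unchanged).

n  <- length(y)
np <- (n - 1)/2
ytilde  <- fft(y)                               # empirical Fourier transform
omega   <- 2*pi*(1:np)/n                        # Fourier frequencies
I.omega <- Mod(ytilde[2:(np + 1)])^2/n          # periodogram ordinates
C <- cumsum(I.omega)/sum(I.omega)               # cumulative periodogram
# fft(ytilde, inverse = TRUE)/n recovers y, as in the text.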
Example 8.1 (Rio Negro data) The data for our first time series example are monthly averages of the daily stages — heights — of the Rio Negro, 18 km upstream at Manaus, from 1903 to 1992, made available to us by Professors H. O'Reilly Sternberg and D. R. Brillinger of the University of California at Berkeley. Because of the tiny slope of the water surface and the lower courses of its flatland affluents, these data may be regarded as a reasonable approximation of the water level in the Amazon River at the confluence of the
Figure 8.1: deseasonalized monthly average stage (metres) of the Rio Negro at Manaus, 1903–1992 (Sternberg, 1995).
two rivers. To remove the strong seasonal component, we subtract the average value for each month, giving the series of length n = 1080 shown in Figure 8.1.

For an initial example, we take the first ten years of observations. The top panels of Figure 8.2 show the correlogram and partial correlogram for this shorter series, with horizontal lines showing approximate 95% confidence limits for correlations from a white noise series. The shape of the correlogram and the cut-off in the partial correlogram suggest that a low-order autoregressive model will fit the data, which are quite highly correlated. The lower left panel of the figure shows the periodogram of the series, which displays the usual high variability associated with single periodogram ordinates. The lower right panel shows the cumulative periodogram, which lies well outside its overall 95% confidence band and clearly does not correspond to a white noise series.

An AR(2) model fitted to the shorter series gives α̂₁ = 1.14 and α̂₂ = −0.31, both with standard error 0.062, and estimated innovation variance 0.598. The left panel of Figure 8.3 shows a normal probability plot of the standardized residuals from this model, and the right panel shows the cumulative periodogram of the residual series. The residuals seem close to Gaussian white noise. ■
residuals into the fitted model. The residuals are typically recentred to have the same mean as the innovations of the model. About the simplest situation is when the AR(1) model (8.2) is fitted to an observed series y₁,...,yₙ, giving estimated autoregressive coefficient α̂ and estimated innovations

e_j = y_j − α̂ y_{j−1},   j = 2,...,n;

resampled innovations ε*₀,...,ε*ₙ are then drawn from the centred residuals, and the new series is generated by setting y*₀ = ε*₀ and

y*_j = α̂ y*_{j−1} + ε*_j,   j = 1,...,n;   (8.6)

of course we must have |α̂| < 1. In fact the series so generated is not stationary, and it is better to start the series in equilibrium, or to generate a longer series of innovations and start (8.6) at j = −k, where the 'burn-in' period −k,...,0 is chosen large enough to ensure that the observations y*₁,...,y*ₙ are essentially stationary; the values y*_{−k},...,y*₀ are discarded.
Thus model-based resampling for time series is based on applying the defining equation(s) of the series to innovations resampled from residuals. This procedure is simple to apply, and leads to good theoretical behaviour for estimates based on such data when the model is correct. For example, studentized bootstrap confidence intervals for the autoregressive coefficients α_k in an AR(p) process enjoy the good asymptotic properties discussed in Section 5.4.1, provided that the model fitted is chosen correctly. Just as there, confidence intervals based on transformed statistics may be better in practice.
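A minimal sketch of AR(1) model-based resampling with a burn-in, assuming a mean-corrected numeric series y; the least squares estimate of α, the burn-in length k and the other names are illustrative.

n <- length(y)
alpha.hat <- sum(y[-1]*y[-n])/sum(y[-n]^2)      # LS estimate of alpha
e <- y[-1] - alpha.hat*y[-n]                    # estimated innovations
e <- e - mean(e)                                # recentred residuals
k <- 100                                        # burn-in length
gen.series <- function() {
    eps <- sample(e, n + k, replace = TRUE)
    y.star <- filter(eps, alpha.hat, method = "recursive")   # apply (8.6)
    as.numeric(y.star[(k + 1):(k + n)]) }       # discard burn-in values
y.boot <- replicate(999, gen.series())          # R = 999 resampled series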
Example 8.2 (Wool prices) The Australian Wool Corporation monitors prices weekly when wool markets are held, and sets a minimum price just before each week's markets open. This reflects the overall price of wool for that week, but the prices actually paid can vary considerably relative to the minimum. The left panel of Figure 8.4 shows a plot of log(price paid/minimum price) for those weeks when markets were held from July 1976 to June 1984. The series does not seem stationary, having some of the characteristics of a random walk, as well as a possible overall trend.
If the log ratio in week j follows a random walk, we have Y_j = Y_{j−1} + ε_j, where the ε_j are white noise; a non-zero mean for the innovations ε_j will lead to drift in y_j. The right panel of Figure 8.4 shows the differenced series, e_j = y_j − y_{j−1}, which appears stationary apart from a change in the innovation variance at about the 100th week. In our analysis we drop the first 100 observations, leaving a differenced series of length 208.

An alternative to the random walk model is the AR(1) model

Y_j = αY_{j−1} + ε_j;   (8.7)

this gives the random walk when α = 1. If the innovations have mean zero and α is close to but less than one, (8.7) gives stationary data, though subject to the climbs and falls seen in the left panel of Figure 8.4. The implications for forecasting depend on the value of α, since the variance of a forecast is only asymptotically bounded when |α| < 1. We test the unit root hypothesis that the data are a random walk, or equivalently that α = 1, as follows.
Our test is based on the ordinary least squares estimate of α in the regression Y_j = γ + αY_{j−1} + ε_j for j = 2,...,n, using test statistic T = (1 − α̂)/S, where S is the standard error for α̂ calculated using the usual formula for a straight-line regression model. Large values of T are evidence against the random walk hypothesis, with or without drift. The observed value of T is t = 1.19. The distribution of T is far from the usual standard normal, however, because of the regression of each observation on its predecessor.

Under the hypothesis that α = 1 we simulate new time series Y*₁,...,Y*ₙ by generating a bootstrap sample ε*₂,...,ε*ₙ from the differences e₂,...,eₙ and then setting Y*₁ = y₁, Y*₂ = Y*₁ + ε*₂, and Y*_j = Y*_{j−1} + ε*_j for subsequent j. This is (8.6) applied with the null hypothesis value α = 1. The value of T* is then obtained from the regression of Y*_j on Y*_{j−1} for j = 2,...,n. The left panel
of Figure 8.5 shows the empirical distribution of T* in 199 simulations. The distribution is close to normal with mean 1.17 and variance 0.88. The observed significance level for t is (97 + 1)/(199 + 1) = 0.49: there is no evidence against the random walk hypothesis.
The right panel of Figure 8.5 shows the values of t* plotted against the inverse sum of squares for the regressor y*_{j−1}. In a conventional regression, inference is usually conditional on this sum of squares, which determines the precision of the estimate. The dotted line shows the observed sum of squares. If the conditional distribution of t* is thought to be appropriate here, the distribution of values of t* close to the dotted line shows that the conditional significance level is even higher; there is no evidence against the random walk conditionally or unconditionally. ■
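A minimal sketch of this unit root test, assuming the retained log ratios are in a numeric vector y; t.stat() and the other names are illustrative.

t.stat <- function(y) {
    n <- length(y)
    fit <- lm(y[-1] ~ y[-n])                    # Y_j on Y_{j-1}, with intercept
    (1 - coef(fit)[2])/coef(summary(fit))[2, 2] }
t.obs <- t.stat(y)                              # 1.19 for the wool prices
e <- diff(y)                                    # differences e_2, ..., e_n
t.star <- replicate(199, {
    eps <- sample(e, length(e), replace = TRUE)
    t.stat(y[1] + cumsum(c(0, eps))) })         # random walk from Y_1
(1 + sum(t.star >= t.obs))/(1 + 199)            # observed significance level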
Models are commonly fitted in order to predict future values of a time series, but as in other settings, it can be difficult to allow for the various sources of uncertainty that affect the predictions. The next example shows how bootstrap methods can give some idea of the relative contributions from innovations, estimation error, and model error.
Example 8.3 (Sunspot numbers) Figure 8.6 shows the much-analysed annual sunspot numbers y₁,...,y₂₈₉ from 1700–1988. The data show a strong cycle with a period of about 11 years, and some hint of non-reversibility, which shows up as a lack of symmetry in the peaks. We use values from 1930–1979 to predict the numbers of sunspots over the next few years, based on fitting
errors of this are given in Table 8.1, based on R = 999 bootstrap series. The orders of the fitted models were

Order   1  2  3  4    5    6    7   8   9  10  11  12
#       3  2  5  7  126  100  273  85  22  18  83  23  72
so the AR(9) model is chosen in only 8% of cases, and most of the models selected are less complicated. The fifth and sixth rows of Table 8.1 give the estimated standard errors of the y*_j − ŷ*_j using the 83 simulated series for which the selected model was AR(9) and using all the series, based on the 999 replications. There is about a 10–15% increase in standard error due to parameter estimation, and the standard errors for the AR(9) models are mostly smaller.
Prediction errors should take account of the values of y_j immediately prior to the forecast period, since presumably these are relevant to the predictions actually made. Predictions that follow on from the observed data can be obtained by using innovations sampled at random except for the period j = n − k + 1,...,n, where we use the residuals actually observed. Taking k = n yields the original series, in which case the only variability in the y*_j is due to the innovations in the forecast period; the standard errors of the predictions will then be close to the nominal standard error. However, if k is small relative to n, the differences y*_j − ŷ*_j will largely reflect the variability due to the use of estimated parameters, although the y*_j will follow on from y_n. The conditional standard errors in Table 8.1, based on k = 9, are about 10% larger than the unconditional ones, and substantially larger than the nominal standard errors.

The distributions of the y*_j − ŷ*_j appear close to normal with zero means, and a summary of variation in terms of standard errors seems appropriate. There will clearly be difficulties with normal-based prediction intervals in 1985 and 1986, when the lower limits of 95% intervals for y are negative, and it might be better to give one-sided intervals for these years. It would be better to use a studentized version of y*_j − ŷ*_j if an appropriate standard error were readily available.
When bootstrap series are generated from the AR(9) model fitted to the data from 1700–1979, the orders of the fitted models are

Order   5    9  10  11  12  13  14  15  16  17  18  19
#       1  765  88  57  28  21  11  11   5   1   4  25
The major drawback with model-based resampling is that in practice not only the parameters of a model, but also its structure, must be identified from the data. If the chosen structure is incorrect, the resampled series will be generated from a wrong model, and hence they will not have the same statistical properties as the original data. This suggests that some allowance be made for model selection, as in Section 3.11, but it is unclear how to do this without some assumptions about the dependence structure of the process, as in the previous example. Of course this difficulty is less critical when the model selected is strongly indicated by subject-matter considerations or is well-supported by extensive data.
are long enough, enough of the original dependence will be preserved in the resampled series that statistics t* calculated from {y*_j} will have approximately the same distribution as values t calculated from replicates of the original series. Clearly this approximation will be best if the dependence is weak and the blocks are as long as possible, thereby preserving the dependence more faithfully. On the other hand, the distinct values of t* must be as numerous as possible to provide a good estimate of the distribution of T, and this points towards short blocks. Theoretical work outlined below suggests that a compromise in which the block length l is of order n^γ for some γ in the interval (0, 1) balances these two conflicting needs. In this case both the block length l and the number of blocks b = n/l tend to infinity as n → ∞, though different values of γ are appropriate for different types of statistic t.
There are several variants on this resampling plan. One is to let the original blocks overlap, in our example giving the n − l + 1 = 9 blocks z₁ = (y₁,...,y₄), z₂ = (y₂,...,y₅), z₃ = (y₃,...,y₆), and so forth up to z₉ = (y₉,...,y₁₂). This incurs end effects, as the first and last l − 1 of the original observations appear in fewer blocks than the rest. Such effects can be removed by wrapping the data around a circle, in our example adding the blocks z₁₀ = (y₁₀, y₁₁, y₁₂, y₁), z₁₁ = (y₁₁, y₁₂, y₁, y₂), and z₁₂ = (y₁₂, y₁, y₂, y₃). This ensures that each of the original observations has an equal chance of appearing in a simulated series. End correction by wrapping also removes the minor problem with the non-overlapping scheme that the last block is shorter than the rest if n/l is not an integer.
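A minimal sketch of the block bootstrap with wrapping, assuming a numeric series y; blocks of length l are started at positions drawn with replacement from the circularly extended series.

block.boot <- function(y, l) {
    n  <- length(y)
    yc <- c(y, y[1:(l - 1)])                    # wrap data around a circle
    starts <- sample(1:n, ceiling(n/l), replace = TRUE)
    idx <- as.vector(outer(0:(l - 1), starts, "+"))   # indices block by block
    yc[idx][1:n] }                              # truncate to original length
y.star <- block.boot(y, l = 24)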
Post-blackening

The most important difficulty with resampling schemes based on blocks is that they generate series that are less dependent than the original data. In some circumstances this can lead to catastrophically bad resampling approximations, as we shall see in Example 8.4. It is clearly inappropriate to take blocks of length l = 1 when resampling dependent data, for the resampled series is then white noise, but the "whitening" can remain substantial for small and moderate values of l. This suggests a strategy intermediate between model-based and block resampling. The idea is to "pre-whiten" the series by fitting a model that is intended to remove much of the dependence between the original observations. A series of innovations is then generated by block resampling of residuals from the fitted model, and the innovation series is then "post-blackened" by applying the estimated model to the resampled innovations. Thus if an AR(1) model is used to pre-whiten the original data, new series are generated by applying (8.6) but with the innovation series {ε*_j} sampled not independently but in blocks taken from the centred residual series e₂ − ē,...,eₙ − ē.
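A minimal sketch of post-blackening with an AR(1) pre-whitener, reusing the block.boot() sketch above; the least squares estimate of α and the other names are illustrative.

post.blacken <- function(y, l) {
    n <- length(y)
    alpha.hat <- sum(y[-1]*y[-n])/sum(y[-n]^2)
    e <- y[-1] - alpha.hat*y[-n]                # residuals from AR(1) fit
    eps <- block.boot(e - mean(e), l)           # innovations in blocks
    filter(eps, alpha.hat, method = "recursive") }   # post-blacken via (8.6)
y.star <- post.blacken(y, l = 10)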
Blocks of blocks

A different approach to removing the whitening effect of block resampling is to resample blocks of blocks. Suppose that the focus of interest is a statistic T which estimates θ and depends only on blocks of m successive observations. An example is the lag k autocovariance (n − k)^{−1} ∑_{j=1}^{n−k} (y_j − ȳ)(y_{j+k} − ȳ), for which m = k + 1. Then unless l ≫ m the distribution of T* − t is typically a poor approximation to that of T − θ, because a substantial proportion of the pairs (Y*_j, Y*_{j+k}) in a resampled series will lie across a join between blocks, and will therefore be independent. To implement resampling blocks of blocks we define a new m-variate process {Y′_j} for which Y′_j = (Y_j,...,Y_{j+m−1}), rewrite T so that it involves averages of the Y′_j, and resample blocks of the new "data" y′₁,...,y′_{n−m+1}, each of the observations of which is a block of the original data.
For the lag 1 autocovariance, for example, we set y′_j = (y′_{1j}, y′_{2j}) = (y_j, y_{j+1}) for j = 1,...,n − 1, and write t = (n − 1)^{−1} ∑_j (y′_{1j} − ȳ′₁.)(y′_{2j} − ȳ′₂.). The key point is that t should not compare observations adjacent in each row. With n = 12 and l = 4 a bootstrap replicate might be

y₅ y₆ y₇ y₈   y₂ y₃ y₄ y₅   y₇ y₈ y₉ y₁₀
y₆ y₇ y₈ y₉   y₃ y₄ y₅ y₆   y₈ y₉ y₁₀ y₁₁
Since a bootstrap version of t based on this series will only contain products of (centred) adjacent observations of the original data, the whitening due to resampling blocks will be reduced, though not entirely removed.
This approach leads to a shorter series being resampled, but this is unimportant relative to the gain from avoiding whitening.
Stationary bootstrap

A further but less important difficulty with these block schemes is that the artificial series generated by them are not stationary, because the joint distribution of resampled observations close to a join between blocks differs from that in the centre of a block. This can be overcome by taking blocks of random length. The stationary bootstrap takes blocks whose lengths L are geometrically distributed, with density

Pr(L = j) = (1 − p)^{j−1} p,   j = 1, 2, ....

This yields resampled series that are stationary with mean block length l = p^{−1}. Properties of this scheme are explored in Problems 8.1 and 8.2.
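A minimal sketch of the stationary bootstrap, assuming a numeric series y and mean block length l = 1/p, with blocks wrapped around the circle.

stat.boot <- function(y, l) {
    n <- length(y)
    idx <- integer(n)
    idx[1] <- sample(n, 1)
    for (j in 2:n)                              # new block with probability p = 1/l,
        idx[j] <- if (runif(1) < 1/l) sample(n, 1) else idx[j - 1] %% n + 1
    y[idx] }                                    # otherwise continue the old block
y.star <- stat.boot(y, l = 24)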
Example 8.4 (Rio Negro data) To illustrate these resampling schemes we consider the shorter series of river stages, of length 120, with its average subtracted. Figure 8.7 shows the original series, followed by three bootstrap series generated by model-based sampling from the AR(2) model. The next three panels show series generated using the block bootstrap with length l = 24 and no wrapping. There are some sharp jumps at the ends of contiguous blocks in the resampled series. The bottom panels show series generated using the same blocks applied to the residuals, and then post-blackened using the AR(2) model. The jumps from using the block bootstrap are largely removed by post-blackening.
For a more systematic comparison of the methods, we generated 200 bootstrap replicates under different resampling plans. For each plan we calculated the standard error SE of the average ȳ* of the resampled series, and the average of the first three autocorrelation coefficients. The more dependent

The stationary bootstrap is used with end correction. The results are similar to those for the block bootstrap, except that the varying block length preserves slightly more of the original correlation structure; this is noticeable at l = 2. Results for the post-blackened method with AR(2) and AR(3) models are similar to those for the corresponding model-based schemes. The results for the post-blackened AR(1) scheme are intermediate between AR(1) and AR(2) model-based resampling, reflecting the fact that the AR(1) model underfits the data, and hence structure remains in the residuals. Longer blocks have little effect for the AR(2) and AR(3) models, but they bring results for the AR(1) model more into line with those for the others. ■
Example 8.5 (Sunspot numbers) To assess the success of the block and post-blackened schemes in preserving nonlinearity, we applied them to the sunspot data, using l = 10. We saw in Example 8.3 that although the best autoregressive model for the transformed data is AR(9), the series is nonlinear. This nonlinearity must remain in the residuals, which are almost a linear transformation of the series. Figure 8.8 shows probability plots of the nonlinearity statistic T from Example 8.3, with m = 20, for the block and post-blackened bootstraps with l = 10. The results for model-based resampling of residuals are not shown but lie on the diagonal line, so it is clear that both schemes preserve some of the nonlinearity in the data, which must derive from lags up to 10. Curiously the post-blackened scheme seems to preserve more.

Table 8.1 gives the predictive standard errors for the years 1980–1988 when the simple block resampling scheme with l = 10 is applied to the data for 1930–1979. Once data for 1930–1988 have been generated, the procedure outlined in Example 8.3 is used to select, fit, and predict from an autoregressive model. Owing to the joins between blocks, the standard errors are much larger than for the other schemes, including the post-blackened one with l = 10, which gives results similar to but somewhat more variable than the model-based bootstraps. Unadorned block resampling seems inappropriate for assessing prediction error, as one would expect. ■
Figure 8.8: distributions of the nonlinearity statistic for block resampling schemes applied to the sunspot data. The left panel shows R = 999 replicates of a test statistic for nonlinearity, based on detecting nonlinearity at up to 20 lags, for the block bootstrap with l = 10. The right panel shows the corresponding plot for the post-blackened bootstrap using the AR(9) model. (Axes: quantiles of the F distribution.)
l̂ = k × (n/m)^{1/(c+2)}   (8.10)

as the optimum block length for a series of length n, and calculate k(n, l). This procedure eliminates the constant of proportionality. We can check on the adequacy of l̂ by repeating the procedure with initial value l = l̂, iterating if necessary.
Example 8.6 (Rio Negro data) There is concern that river heights at Manaus may be increasing due to deforestation, so we test for trend in the river series, a ten-year running average of which is shown in the left panel of Figure 8.9. There may be an upward trend, but it is hard to say whether the effect is real. To proceed, we suppose that the data consist of a stationary time series to which has been added a monotonic trend. Our test statistic is T = ∑_{j=1}^{n} a_j Y_j, where the coefficients
Figure: bootstrap estimates of the variance of T against block length, for the whole series (R = 199).
n var(Ȳ) = γ₀ + 2 ∑_{j=1}^{n−1} (1 − j/n) γ_j → ∑_{j=−∞}^{∞} γ_j = ξ,   (8.12)

and

v = var*{h(Ȳ*)} ≐ h′(Ȳ)² var*(Ȳ*),

if l → ∞ and l/n → 0 as n → ∞. To calculate approximations for the mean squared errors of b and v requires more careful calculations and involves the variance of ∑(S_i − S̄)², where the S_i are block sums. This is messy in general, but the essential points remain under the simplifying assumptions that {Y_j} is an m-dependent normal process. In this case γ_{m+1} = γ_{m+2} = ··· = 0, and the third and higher cumulants of the
For normal data,

so

In this case the leading term of the expansion for b is the product of h′(Ȳ) and the right-hand side of (8.15), so the bootstrap bias estimate for Ȳ as an estimator of θ = μ is non-zero, which is clearly misleading since E(Ȳ) = μ. With overlapping blocks, the properties of the bootstrap bias estimator depend on E*(Ȳ*) − Ȳ, and it turns out that its variance is an order of magnitude larger than for non-overlapping blocks. This difficulty can be removed by wrapping y₁,...,yₙ around a circle and using n blocks, in which case E*(Ȳ*) = Ȳ, or by re-centring the bootstrap bias estimate to b̃ = E*{h(Ȳ*)} − h{E*(Ȳ*)}. In either case (8.13) and (8.14) apply. One asymptotic benefit of using overlapping
blocks when the re-centred estimator is used is that var(b̃) and var(v) are reduced by a factor 2/3, though in practice the reduction may not be visible for small n.

The corresponding argument for tail probabilities involves Edgeworth expansions and is considerably more intricate than that sketched above.

Apart from smoothness conditions on h(·), the key requirement for the above argument to work is that τ and ξ be finite, and that the autocovariances decrease sharply enough for the various terms neglected to be negligible. This is the case if γ_j ~ α^j for sufficiently large j and some α with |α| < 1, as is the case for stationary finite ARMA processes. However, if for large j we find that γ_j ~ j^{−δ}, where ½ < δ < 1, then ξ and τ are not finite and the argument will fail. In this case g(ω) ~ ω^{δ−1} for small ω, so long-range dependence of this sort is characterized by a pole in the spectrum at the origin, where ξ = g(0) is the value of the spectrum. The data counterpart of this is a sharp increase in periodogram ordinates at small values of ω. Thus a careful examination of the periodogram near the origin and of the long-range correlation structure is essential before applying the block bootstrap to data.
Step 3 guarantees that Ỹ*_k has complex conjugate Ỹ*_{n−k}, and therefore that the bootstrap series Y*₀,...,Y*_{n−1} is real. An alternative to step 2 is to resample the U_k from the observed phases.

The bootstrap series always has average ȳ, which implies that phase scrambling should be applied only to statistics that are invariant to location changes of the original series; in fact it is useful only for linear contrasts of the y_j, as we shall see below. It is straightforward to see that
Y*_j = ȳ + n^{−1} ∑_{l=0}^{n−1} ∑_{k=0}^{n−1} (y_l − ȳ) cos{2πk(l − j)/n + U_k},   j = 0,...,n−1,   (8.16)
from which it follows that the bootstrap data are stationary, with covariances equal to the circular covariances of the original series, and that all their odd joint cumulants equal zero (Problem 8.4). This representation also makes it clear that the resampled series will be essentially linear with normal margins.
The difference between phase scrambling and model-based resampling can be deduced from Algorithm 8.1. Under phase scrambling,

E*(|Ỹ*_k|²) = |ỹ_k|²,   var*(|Ỹ*_k|²) = ½|ỹ_k|⁴.
Clearly these resampling schemes will give different results unless the quantities of interest depend only on the means of the |ỹ_k|², i.e. are essentially quadratic in the data. Since the quantity of interest must also be location-invariant, this restricts the domain of phase scrambling to such tasks as estimating the variances of linear contrasts in the data.
Example 8.7 (Rio Negro data) We assess empirical properties of phase scrambling using the first 120 months of the Rio Negro data, which we saw previously were well-fitted by an AR(2) model with normal errors. Note that our statistic of interest, T = ∑ a_j Y_j, has the necessary structure for phase scrambling not automatically to fail.

Figure 8.11 shows three phase scrambled datasets, which look similar to the AR(2) series in the second row of Figure 8.7.

The top panels of Figure 8.12 show the empirical Fourier transform for the original data and for one resample. Phase scrambling seems to have shrunk the moduli of the series towards zero, giving a resampled series with lower overall variability. The lower left panel shows smoothed periodograms for the original data and for 9 phase scrambled resamples, while the right panel shows corresponding results for simulation from the fitted AR(2) model. The results are quite different, and show that data generated by phase scrambling are less variable than those generated from the fitted model.

Resampling with 999 series generated from the fitted AR(2) model and by phase scrambling, the distribution of T* is close to normal under both schemes but it is less variable under phase scrambling; the estimated variances are 27.4 and 20.2. These are similar to the estimates of about 27.5 and 22.5 obtained using the block and stationary bootstraps.
Before applying phase scrambling to the full series, we must check that it shows no sign of nonlinearity or of long-range dependence, and that it is plausibly close to a linear series with normal errors. With m = 20 the nonlinearity statistic described in Example 8.3 takes value 0.015, and no value for m ≤ 30 is greater than 0.84: this gives no evidence that the series is nonlinear. Moreover the periodogram shows no signs of a pole as ω → 0+, so long-range dependence seems to be absent. An AR(8) model fits the series well, but the residuals have heavier tails than the normal distribution, with kurtosis 1.2. The variance of T* under phase scrambling is about 51, which
again is similar to the estimates from the block resampling schemes. Although this estimate may be untrustworthy, on the face of things it casts no doubt on the earlier conclusion that the evidence for trend is weak. ■
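For comparison with the scheme of Algorithm 8.1, here is a minimal sketch of the simplest variant of phase scrambling, in which the empirical phases are replaced by uniform random phases with conjugate symmetry enforced so that the new series is real; it assumes a numeric series y of odd length.

phase.scramble <- function(y) {
    n  <- length(y)
    np <- (n - 1)/2
    yt <- fft(y - mean(y))                      # empirical Fourier transform
    U  <- runif(np, 0, 2*pi)                    # new phases, k = 1, ..., np
    yt[2:(np + 1)] <- yt[2:(np + 1)]*exp(1i*U)
    yt[(np + 2):n] <- Conj(rev(yt[2:(np + 1)])) # enforce conjugate symmetry
    mean(y) + Re(fft(yt, inverse = TRUE))/n }
y.star <- phase.scramble(y)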
T = (2π/n) ∑_k a_k I_k,

where I_k = I(ω_k), a_k = a(ω_k), and ω_k is the kth Fourier frequency. For a linear process

Y_j = ∑_{l=−∞}^{∞} b_l ε_{j−l},

we have approximately

E(T) ≐ ∫ a(ω) g(ω) dω,
var(T) ≐ n^{−1} [ 2π ∫ a²(ω) g²(ω) dω + κ₄ {∫ a(ω) g(ω) dω}² ].
Example 8.8 (Spectral density estimation) Suppose that our goal is inference for the spectral density g(η) at some η in the interval (0, π), and let our estimate of g(η) be

T = (2π)/(nh) ∑_k K{(η − ω_k)/h} I_k,

where K(·) is a symmetric PDF with mean zero and unit variance and h is a positive smoothing parameter. Then

Since we must have h → 0 as n → ∞ in order to remove the bias of T, the second term in the variance is asymptotically negligible relative to the first term, as is necessary for the resampling scheme outlined above to work with a time series for which κ₄ ≠ 0. Comparison of the variance and bias terms implies that the asymptotic form of the relative mean squared error for estimation of g(η) is minimized by taking h ∝ n^{−1/5}. However, there are two difficulties in using resampling to make inference about g(η) from T.

The first difficulty is analogous to that seen in Example 5.13, and appears on comparing T and its bootstrap analogue

T* = (2π)/(nh) ∑_k K{(η − ω_k)/h} I*_k.

We suppose that I*_k is generated using a kernel estimate ĝ(ω_k) with smoothing parameter h. The standardized versions of T and T* are

Z = (nhc)^{1/2} {T − g(η)}/g(η),   Z* = (nhc)^{1/2} {T* − ĝ(η)}/ĝ(η).
Figure 8.13: comparison of distributions of Z and Z* for time series of length 257. The left panel shows a normal plot of 1000 values of Z. The right panel compares the distributions of Z and Z*.
Example 8.9 (Caveolae) The upper left panel of Figure 8.14 shows the positions of n = 138 caveolae in a 500 unit square region, originally a 2.65 μm square of muscle fibre. The upper right panel shows a realization of a binomial process, for which n points were placed at random in the same region; this is an homogeneous Poisson process conditioned to have 138 events. The data seem to have fewer almost-coincident points than the simulation, but it is hard to be sure.
Spatial dependence is often summarized by K-functions. Suppose that the process is orderly and isotropic, i.e. multiple coincident events are precluded and joint probabilities are invariant under rotation as well as translation. Then a useful summary of spatial dependence is Ripley's K-function,
The lower left panel of Figure 8.14 shows the estimate Ẑ(t) of Z(t). The dashed lines are pointwise 95% confidence bands from R = 999 realizations of the binomial process, and the dotted lines are overall bands with level about 92%, obtained by using the method outlined after (4.17) with k = 2. Relative to a Poisson process there is a significant deficiency of pairs of points lying close together, which confirms our previous impression.

The lower right panel of the figure shows the corresponding results for simulations from the Strauss process, a parametric model of interaction that can inhibit patterns in which pairs lie close together. This models the local behaviour of the data better than the stationary Poisson process. ■
Figure 8.15 Neurophysiological point process. The rows of the left panel show 100 replicates of the interval surrounding the times at which a human subject was given a stimulus; each point represents the time at which the firing of a neuron was observed. The right panels show a histogram and kernel intensity estimate (×10⁻² ms⁻¹) from superposing the events on the left, which are shown by the rug in the lower right panel.
The right panels of Figure 8.15 show a histogram of the superposed data and a rescaled kernel estimate of the intensity λ(y) in units of 10⁻² ms⁻¹,

λ̂(y; h) = 100 × (Nh)⁻¹ Σ_{j=1}^{n} w{(y − y_j)/h},
where w(·) is a symmetric density with mean zero and unit variance; we use the standard normal density with bandwidth h = 7.5 ms. Over the observation period this estimate integrates to 100n/N. The estimated intensity is highly variable and it is unclear which of its features are spurious. We can try to construct a confidence region for λ(y) at a set 𝒴 of y values of interest, but the same problems arise as in Examples 5.13 and 8.8.
Once again the key difficulty is bias: λ̂(y; h) estimates not λ(y) but ∫ w(u)λ(y − hu) du. For large N and small h this means that λ̂(y; h) has approximate variance (Nh)⁻¹cλ(y), where c = ∫ w²(u) du. As in Example 5.13, the delta method (Section 2.7.1) implies that λ̂^{1/2}(y; h) has approximately constant variance ¼c(Nh)⁻¹. We choose to work with the standardized quantities

Z(y; h) = {λ̂^{1/2}(y; h) − λ^{1/2}(y)} / {¼c(Nh)⁻¹}^{1/2},   y ∈ 𝒴.
inappropriate if the estimator of interest presupposed that events could not coincide, as did the K-function of Example 8.9.

For all of these resampling schemes the bootstrap estimators λ̂*(y; h) are unbiased for λ̂(y; h). The natural resampling analogue of Z is

Z*(y; h) = {λ̂*^{1/2}(y; h) − λ̂^{1/2}(y; h)} / {¼c(Nh)⁻¹}^{1/2},

whose mean and variance closely match those of Z.
Whatever resampling scheme is employed, simulated values of Z* will be used to estimate the quantiles z_{L,α}(h) and z_{U,α}(h) in (8.19). If R realizations are generated, then we take z_{L,α}(h) and z_{U,α}(h) to be respectively the (R + 1)α-th ordered values of

min_{y∈𝒴} z*(y; h),   max_{y∈𝒴} z*(y; h).
The upper panel of Figure 8.16 shows overall 95% confidence bands for λ(y; 5), using three of the sampling schemes described above. In each case R = 999, and z_{L,0.025}(5) and z_{U,0.025}(5) are estimated by the empirical 0.025 and 0.975 quantiles of the R replicates of min{z*(y; 5), y = −250, −248, ..., 250} and max{z*(y; 5), y = −250, −248, ..., 250}. Results from resampling intervals and events are almost indistinguishable, while generating data from a fitted intensity gives slightly smoother results. In order to avoid problems at the boundaries, the set 𝒴 is taken to be (−230, 230). The experimental setup implies that the intensity should be about 1 × 10⁻² ms⁻¹, the only significant departure from which is in the range 0–130 ms, where there is strong evidence that the stimulus affects the firing rate.
The lower panel of the figure shows z*_{0.025}(5), z*_{0.975}(5), and the bootstrap bias estimate for λ̂(y), for resampling intervals and for generating data from a fitted intensity function, with h = 7.5 ms. The quantile processes suggest that the variance-stabilizing transformation has worked well, but the double smoothing effect of the latter scheme shows in the bias. The behaviour of the quantile process when y = 50 ms, where there are no firings, suggests that a variable bandwidth smoother might be better. ■
Essentially the same ideas can be applied when the data are a single realization of an inhomogeneous Poisson process (Problem 8.8).
where the suspicion is that λ(y) decreases with distance from the origin. Since the disease is rare, the number of cases at y will be well approximated by a Poisson variable with mean λ(y)μ(y), where μ(y) is the population density of susceptible persons at y. The null hypothesis is that λ(y) = λ₀, i.e. that y has no effect on the intensity of cases, other than through μ(y). A crucial difficulty is that μ(y) is unknown and will be hard to estimate from the data available.

One approach to testing for constancy of λ(y) is to compare the point pattern for 𝒟 to that of another disease 𝒟′. This disease is chosen to have the same population of susceptible individuals as 𝒟, but its incidence is assumed to be unrelated to emissions from the site and to incidence of 𝒟, and so it arises with constant but unknown rate λ′ per person-year. If 𝒟′ is also rare, it will be reasonable to suppose that the number of cases of 𝒟′ at y has a Poisson distribution with mean λ′μ(y). Hence the conditional probability of a case of 𝒟 at y given that there is a case of 𝒟 or 𝒟′ at y is π(y) = λ(y)/{λ′ + λ(y)}. If the disease locations are indicated by y_j, and d_j is zero or one according as the case at y_j has 𝒟′ or 𝒟, the likelihood is

∏_j π(y_j)^{d_j} {1 − π(y_j)}^{1−d_j}.

If a suitable form for λ(y) is assumed we can obtain the likelihood ratio or perhaps another statistic T to test the hypothesis that π(y) is constant. This is a test of proportional hazards for 𝒟 and 𝒟′, but unlike in Example 4.4 the alternative is specified, at least weakly.
When λ(y) = λ₀ an approximation to the null distribution of T can be obtained by permuting the labels on cases at different locations. That is, we perform R random reallocations of the labels 𝒟 and 𝒟′ to the y_j, recompute T for each such reallocation, and see whether the observed value of t is extreme relative to the simulated values t*₁, ..., t*_R. ■
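In outline this permutation scheme is easy to code. The sketch below assumes a hypothetical user-supplied function calc.t(d, y) that evaluates the chosen statistic from the 0–1 labels d and the locations y.

perm.test <- function(d, y, calc.t, R = 999)
{  t.obs <- calc.t(d, y)
   t.star <- replicate(R, calc.t(sample(d), y))   # reallocate labels at random
   list(t = t.obs, t.star = t.star,
        p = (1 + sum(t.star >= t.obs))/(R + 1))   # one-sided significance level
}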
Example 8.12 (Brambles) The upper left panel of Figure 8.17 shows the locations of 103 newly emergent and 97 one-year-old bramble canes in a 4.5 m square plot. It seems plausible that these two types of event are related, but how should this be tested? Events of both types are clustered, so a Poisson null hypothesis is not appropriate, nor is it reasonable to permute the labels attached to events, as in the previous example.

Let us denote the locations of the two types of event by y₁, ..., yₙ and y′₁, ..., y′_{n′}. Suppose that a statistic T = t(y₁, ..., yₙ, y′₁, ..., y′_{n′}) is available that tests for association between the event types. If the extent of the observation region were infinite, we might construct a null distribution for T by applying random translations to events of one type. Thus we would generate values T* = t(y₁ + U*, ..., yₙ + U*, y′₁, ..., y′_{n′}), where U* is a randomly chosen location in the plane. This sampling scheme has the desirable property of fixing the structure of each component pattern, while breaking the dependence between the two.
A bivariate analogue of Ripley's K-function is

K₁₂(t) = λ₂⁻¹ E(number of type 2 events within distance t of a typical type 1 event),

where λ₂ is the overall intensity of type 2 events. Suppose that there are n₁, n₂ events of types 1 and 2 in an observation region A of area |A|, that u_{ij} is the distance from the ith type 1 event to the jth type 2 event, that w_i(u) is the proportion of the circumference of the circle that is centred at the ith type 1 event and has radius u that lies within A, and let I(·) denote the indicator of the event {u_{ij} ≤ t}. Then the sample version of this bivariate K-function is

K̂₁₂(t) = |A|(n₁n₂)⁻¹ Σ_{i=1}^{n₁} Σ_{j=1}^{n₂} I(u_{ij} ≤ t)/w_i(u_{ij}).
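A direct sketch of this estimator in S follows; the edge-correction weights w_i(u) depend on the geometry of A, so they are supplied here by a hypothetical function edge.w(i, u).

K12.hat <- function(t, x1, y1, x2, y2, area, edge.w)
{  n1 <- length(x1); n2 <- length(x2)
   total <- 0
   for (i in 1:n1) {
      u <- sqrt((x1[i] - x2)^2 + (y1[i] - y2)^2)   # distances u_ij
      total <- total + sum((u <= t)/edge.w(i, u))
   }
   area*total/(n1*n2)
}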
8.3.4 Tiles

Little is known about resampling spatial processes when there is no parametric model. One nonparametric approach that has been investigated starts from a partition of the observation region ℛ into disjoint tiles 𝒜₁, ..., 𝒜ₙ of equal size and shape. If we abuse notation by identifying each tile with the pattern it contains, we can write the original value of the statistic as T = t(𝒜₁, ..., 𝒜ₙ). The idea is to create a resampled pattern by taking a random sample of tiles 𝒜*₁, ..., 𝒜*ₙ from 𝒜₁, ..., 𝒜ₙ, with corresponding bootstrap statistic T* = t(𝒜*₁, ..., 𝒜*ₙ). The hope is that if dependence is relatively short-range, taking large tiles will preserve enough dependence to make the properties of T* close to those of T. If this is to work, the size of the tile must be chosen to trade off preserving dependence, which requires a few large tiles, and getting a good estimate of the distribution of T, which requires many tiles.
This idea is analogous to block resampling in time series, and is capable of similar variations. For example, rather than choosing the 𝒜*ⱼ independently from the fixed tiles 𝒜₁, ..., 𝒜ₙ, we may resample moving tiles, taking the 𝒜*ⱼ to be regions of the given size and shape placed at random within ℛ, with toroidal wrapping used at the boundary.
Figure 8.18 Tile resampling for the caveolae data. The left panel shows the data, with dotted lines indicating nine square tiles taken using the moving scheme with toroidal wrapping. The right panel shows the resampled point pattern.
Example 8.13 (Caveolae) Figure 8.18 illustrates tile resampling for the data of Example 8.9. The left panel shows the original caveolae data, with the dotted lines showing nine square tiles taken using the moving scheme with toroidal wrapping. The right panel shows the resampled pattern obtained when the tiles are laid side-by-side. For example, the centre top and middle right tiles were respectively taken from the top left and bottom right of the original data. Along the tile edges, events seem to lie closer together than in the left panel; this is analogous to the whitening that occurs in blockwise resampling of time series. No analogue of the post-blackened bootstrap springs to mind, however.
For a numerical evaluation of tile resampling, we experimented with estimating the variance θ of the number of events in an observation region ℛ of side 200 units, using data generated from three random processes. In each case we generated 8800 events in a square of side 4000, then estimated θ from 2000 squares of side 200 taken at random. For each of 100 random squares of side 200 we calculated the empirical mean squared error for estimation of θ using bootstraps of size R, for both fixed and moving tiles. Data were generated from a spatial Poisson process (θ = 23.4), from the Strauss process that gave the results in the bottom right panel of Figure 8.14 (θ = 17.5), and from a sequential spatial inhibition process, which places points sequentially at random but not within 15 units of an existing event (θ = 15.6).
Model-based resampling for time series was discussed by Freedman (1984), Freedman and Peters (1984a,b), Swanepoel and van Wyk (1986) and Efron and Tibshirani (1986), among others. Li and Maddala (1996) survey much of the related time domain literature, which has a somewhat theoretical emphasis; their account stresses econometric applications. For a more applied account of parametric bootstrapping in time series, see Tsay (1992). Bootstrap prediction in time series is discussed by Kabaila (1993b), while the bootstrapping of state-space models is described by Stoffer and Wall (1991). The use of model-based resampling for order selection in autoregressive processes is discussed by Chen et al. (1993).

Block resampling for time series was introduced by Carlstein (1986). In an important paper, Künsch (1989) discussed overlapping blocks in time series, although in spatial data the proposal of block resampling in Hall (1985) predates both. Liu and Singh (1992a) also discuss the properties of block resampling schemes. Politis and Romano (1994a) introduced the stationary bootstrap, and in a series of papers (Politis and Romano, 1993, 1994b) have discussed theoretical aspects of more general block resampling schemes. See also Bühlmann and Künsch (1995) and Lahiri (1995). The method for block length choice outlined in Section 8.2.3 is due to Hall, Horowitz and Jing (1995); see also Hall and Horowitz (1993). Bootstrap tests for unit roots in autoregressive models are discussed by Ferretti and Romo (1996). Hall and Jing (1996) describe a block resampling approach in which the construction of new series is replaced by Richardson extrapolation.

Bose (1988) showed that model-based resampling for autoregressive processes has good asymptotic higher-order properties for a wide class of statistics. Lahiri (1991) and Götze and Künsch (1996) show that the same is true for block resampling, but Davison and Hall (1993) point out that unfortunately, and unlike when the data are independent, this depends crucially on the variance estimate used.

Forms of phase scrambling have been suggested independently by several authors (Nordgaard, 1990; Theiler et al., 1992), and Braun and Kulperger (1995, 1997) have studied its properties. Hartigan (1990) describes a method for variance estimation in Gaussian series that involves similar ideas but needs no randomization; see Problem 8.5.

Frequency domain resampling has been discussed by Franke and Härdle (1992), who make a strong analogy with bootstrap methods for nonparametric regression. It has been further studied by Janas (1993) and Dahlhaus and Janas (1996), on which our account is based.

Our discussion of the Rio Negro data is based on Brillinger (1988, 1989), which should be consulted for statistical details, while Sternberg (1987, 1995) gives accounts of the data and background to the problem.

Models based on point processes have a long history and varied provenance.
8.5 Problems

1 Suppose that y₁, ..., yₙ is an observed time series, and let z_{il} denote the block of length l starting at y_i, where we set y_i = y_{1+((i−1) mod n)}, so that the series is extended periodically. Also let I₁, I₂, ... be a stream of random numbers uniform on the integers 1, ..., n, and let L₁, L₂, ... be a stream of random numbers having the geometric distribution Pr(L = l) = p(1 − p)^{l−1}, l = 1, 2, .... The algorithm to generate a single stationary bootstrap replicate is (Algorithm 8.3):

• Set Y*₁ = y_{I₁}.
• For i = 2, ..., n, let Y*ᵢ = y_{Iᵢ} with probability p, and let Y*ᵢ = y_{j+1} with probability 1 − p, where y_j = Y*ᵢ₋₁.
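A direct S transcription of Algorithm 8.3 might look as follows (a sketch; the tsboot function used in the practicals implements the same scheme with sim = "geom"):

stat.boot <- function(y, p)
{  n <- length(y)
   j <- sample(n, 1)
   ystar <- numeric(n)
   ystar[1] <- y[j]
   for (i in 2:n) {
      if (runif(1) < p) j <- sample(n, 1)    # start a new block at a random index
      else j <- 1 + (j %% n)                 # continue the block, wrapping y_{n+1} = y_1
      ystar[i] <- y[j]
   }
   ystar
}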
2 Suppose that y₁, ..., yₙ is part of a realization of a stationary process with autocovariances γ_j. Show that

n var(Ȳ) = γ₀ + 2 Σ_{j=1}^{n−1} (1 − j/n) γ_j,

and that this approaches c = γ₀ + 2 Σ_{j=1}^{∞} γ_j if Σ j|γ_j| is finite.
Show that under the stationary bootstrap, conditional on the data,

n var*(Ȳ*) = c₀ + 2 Σ_{j=1}^{n−1} (1 − j/n)(1 − p)^j c_j,

where c₀, c₁, ... are the empirical circular autocovariances defined in Problem 8.1.
(Section 8.2.3; Politis and Romano, 1994a)
3 (a) Using the setup described on pages 405–408, show that Σ_j (S_j − S̄)² has mean v_{ii} − b⁻¹v_{ij}, and a variance expressible in terms of the joint cumulants v_{ij} = cov(S_i, S_j), v_{ijk} = cum(S_i, S_j, S_k), and so forth, of the S_j; summation is understood over each index.
(b) For an m-dependent normal process, show that provided l > m,

v_{ij} = l⁻¹c₀⁽ˡ⁾ if i = j,   l⁻²c₁⁽ˡ⁾ if |i − j| = 1,   0 otherwise,

and show that l⁻¹c₀⁽ˡ⁾ and c₁⁽ˡ⁾ converge to finite limits as l→∞. Hence establish (8.13) and (8.14).
(Section 8.2.3; Appendix A; Hall, Horowitz and Jing, 1995)
Writing θ = I_{ag}/I_g for the limit of the ratio statistic T, we have

n^{1/2}(T − θ) ≈ θ(X_a − X_g),

say. Use (8.18) to show that X_a and X_g have means zero and that

var(X_a) = 2π I_{aagg} I_{ag}⁻² + ½κ₄,   var(X_g) = 2π I_{gg} I_g⁻² + ½κ₄,
cov(X_g, X_a) = 2π I_{agg} I_{ag}⁻¹ I_g⁻¹ + ½κ₄,

where I_{aagg} = ∫ a²(ω)g²(ω) dω, and so forth. Hence show that to first order the mean and variance of T do not involve κ₄, and deduce that periodogram resampling may be applied to ratio statistics.
Use simulation to see how well periodogram resampling performs in estimating the distribution of a suitable version of the sample estimate of the lag j autocorrelation,

ρ_j = ∫ e^{−iωj} g(ω) dω / ∫ g(ω) dω.

(Section 8.2.5; Janas, 1993; Dahlhaus and Janas, 1996)
7 Let

λ̂(y; h) = h⁻¹ Σ_{j=1}^{n} w{(y − y_j)/h}

denote a kernel estimate of λ(y), based on a kernel w(·) that is a PDF. Explain why the following two algorithms for generating bootstrap data from the estimated intensity are (almost) equivalent.
(Section 8.3.2)
8 Consider an inhomogeneous Poisson process of intensity λ(y) = Nμ(y), where μ(y) is fixed and smooth, observed for 0 ≤ y ≤ 1. A kernel intensity estimate based on events at y₁, ..., yₙ is

λ̂(y; h) = h⁻¹ Σ_{j=1}^{n} w{(y − y_j)/h}.

(a) Show that λ̂(y; h) has mean ∫ w(u)λ(y − hu) du; you may need the facts that the number of events n has a Poisson distribution with mean Λ = ∫₀¹ λ(u) du, and that conditional on there being n observed events, their times are independent random variables with PDF λ(y)/Λ. Hence show that the asymptotic mean squared error of λ̂(y; h) is minimized when h ∝ N^{−1/5}. Use the delta method to show that the approximate mean and variance of λ̂^{1/2}(y; h) are

λ^{1/2}(y) + ½λ^{−1/2}(y){½h²λ″(y) − ¼κh⁻¹},   ¼κh⁻¹,

where κ = ∫ w²(u) du.
(b) Now suppose that resamples are formed by taking n observations at random from y₁, ..., yₙ. Show that the bootstrapped intensity estimate

λ̂*(y; h) = h⁻¹ Σ_{j=1}^{n} w{(y − y*_j)/h}

has mean E*{λ̂*(y; h)} = λ̂(y; h), and that the same is true when there are n′ resampled events, provided that E*(n′) = n.
For a third resampling scheme, let n′ have a Poisson distribution with mean n, and generate n′ events independently from density λ̂(y; h)/∫₀¹ λ̂(u; h) du. Show that under this scheme

E*{λ̂*(y; h)} = ∫ w(u)λ̂(y − hu; h) du.
(c) By comparing the asymptotic distributions of
Consider resampling tiles when the observation region ℛ is a square, the data are generated by a stationary planar Poisson process of intensity λ, and the quantity of interest is θ = var(Y), where Y is the number of events in ℛ.
Suppose that ℛ is split into n fixed tiles of equal size and shape, which are then resampled according to the usual bootstrap. Show that the bootstrap estimate of θ is T = Σ_j (y_j − ȳ)², where y_j is the number of events in the jth tile. Use the fact that var(T) = (n − 1)²{κ₄/n + 2κ₂²/(n − 1)}, where κ_r is the rth cumulant of Y_j, to show that the mean squared error of T is

(μ/n²){μ + (n − 1)(2μ + n − 1)},

where μ = λ|ℛ|. Sketch this when μ > 1, μ = 1, and μ < 1, and explain in qualitative terms its behaviour when μ > 1.
Extend the discussion to moving tiles.
(Section 8.3)
8.6 Practicals
1 Dataframe lynx contains the Canadian lynx data, to the logarithm of which we fit the autoregressive model that minimizes AIC:

ts.plot(log(lynx))
lynx.ar <- ar(log(lynx))
lynx.ar$order

The best model is AR(11). How well determined is this, and what is the variance of the series average? We bootstrap to see, using lynx.fun (given below), which calculates the order of the fitted autoregressive model, the series average, and saves the series itself.
Here are results for fixed-block bootstraps with block length l = 20:
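A sketch of suitable code, following the tsboot examples in the boot library (details such as order.max = 25 are assumptions):

library(boot)
lynx.fun <- function(tsb)
{  ar.fit <- ar(tsb, order.max = 25)         # refit AR model to the replicate series
   c(ar.fit$order, mean(tsb), tsb)           # order, series average, and the series
}
lynx.1 <- tsboot(log(lynx), lynx.fun, R = 99, l = 20, sim = "fixed")
table(lynx.1$t[, 1])                         # how well determined is the order?
var(lynx.1$t[, 2])                           # variance of the series average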
To obtain similar results for the stationary bootstrap with mean block length l = 20:
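A sketch, assuming lynx.fun as above:

lynx.2 <- tsboot(log(lynx), lynx.fun, R = 99, l = 20, sim = "geom")
table(lynx.2$t[, 1])
var(lynx.2$t[, 2])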
See if the results look different from those above. Do the simulated series using
blocks look like the original? Compare the estimated variances under the two
resampling schemes. Try different block lengths, and see how the variances of the
series average change.
For model-based resampling we need to store results from the original model:
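A sketch of what needs to be stored and of a suitable generator, again following the tsboot documentation (the names lynx.model and lynx.sim are assumptions):

lynx.ar <- ar(log(lynx))
lynx.model <- list(order = c(lynx.ar$order, 0, 0), ar = lynx.ar$ar)
lynx.res <- lynx.ar$resid[!is.na(lynx.ar$resid)]
lynx.res <- lynx.res - mean(lynx.res)        # centred residuals for resampling
lynx.sim <- function(res, n.sim, ran.args)
{  rg <- function(n, res) sample(res, n, replace = TRUE)
   mean(ran.args$ts) + ts(arima.sim(model = ran.args$model, n = n.sim,
                                    rand.gen = rg, res = as.vector(res)))
}
lynx.3 <- tsboot(lynx.res, lynx.fun, R = 99, sim = "model", n.sim = 114,
                 orig.t = FALSE, ran.gen = lynx.sim,
                 ran.args = list(ts = log(lynx), model = lynx.model))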
Compare these results with those above, and try the post-blackened bootstrap with sim="geom".
(Sections 8.2.2, 8.2.3)
2 Dataframe beaver contains body temperature and activity data, to which we fit a linear regression model in which the errors η_j form an AR(1) process, and the ε_j are independent identically distributed errors with mean zero and variance σ². Having fitted this model, estimated the parameters α, β₀, β₁, σ² and calculated the residuals e₂, ..., eₙ (e₁ cannot be calculated), we generate bootstrap series by the recipe in (8.22), where the error series {η*_j} is formed by taking a white noise series {ε*_j} at random from the set {σ̂(e₂ − ē), ..., σ̂(eₙ − ē)} and then applying the second part of (8.22).
To fit the original model and to generate a new series:

fit <- function(data)
{  X <- cbind(rep(1,100), data$activ)
   para <- list(X=X, data=data)
   assign("para", para, frame=1)
   d <- arima.mle(x=para$data$temp, model=list(ar=c(0.8)), xreg=para$X)
   res <- arima.diag(d, plot=F, std.resid=T)$std.resid
   res <- res[!is.na(res)]
   list(paras=c(d$model$ar, d$reg.coef, sqrt(d$sigma2)),
        res=res-mean(res), fit=X %*% d$reg.coef) }
beaver.args <- fit(beaver)
white.noise <- function(n.sim, ts) sample(ts, size=n.sim, replace=T)
beaver.gen <- function(ts, n.sim, ran.args)
{  tsb <- ran.args$res
   fit <- ran.args$fit
   coeff <- ran.args$paras
   ts$temp <- fit + coeff[4]*arima.sim(model=list(ar=coeff[1]),
                    n=n.sim, rand.gen=white.noise, ts=tsb)
   ts }
new.beaver <- beaver.gen(beaver, 100, beaver.args)
Now we are able to generate data, we can bootstrap and see the results of beaver.boot as follows:

beaver.fun <- function(ts) fit(ts)$paras
beaver.boot <- tsboot(beaver, beaver.fun, R=99, sim="model",
                      n.sim=100, ran.gen=beaver.gen, ran.args=beaver.args)
names(beaver.boot)
beaver.boot$t0
beaver.boot$t[1:10,]

showing the original value of beaver.fun and its value for the first 10 replicate series. Are the estimated mean temperatures for the R = 99 simulations normal? Use boot.ci to obtain normal and basic bootstrap confidence intervals for the resting and active temperatures.
In this analysis we have assumed that the linear model with AR(1) errors is appropriate. How would you proceed if it were not?
(Section 8.2; Reynolds, 1994)
3 Consider scrambling the phases of the sunspot data. To see the original data, two replicates generated using ordinary phase scrambling, and two phase-scrambled series whose marginal distribution is the same as that of the original data:

sunspot.fun <- function(ts) ts
sunspot.1 <- tsboot(sunspot, sunspot.fun, R=2, sim="scramble")
.Random.seed <- sunspot.1$seed
sunspot.2 <- tsboot(sunspot, sunspot.fun, R=2, sim="scramble", norm=F)
split.screen(c(3,2))
yl <- c(-50,200)
screen(1); ts.plot(sunspot, ylim=yl); abline(h=0, lty=2)
screen(3); tsplot(sunspot.1$t[1,], ylim=yl); abline(h=0, lty=2)
screen(4); tsplot(sunspot.1$t[2,], ylim=yl); abline(h=0, lty=2)
screen(5); tsplot(sunspot.2$t[1,], ylim=yl); abline(h=0, lty=2)
screen(6); tsplot(sunspot.2$t[2,], ylim=yl); abline(h=0, lty=2)

What features of the original data are preserved by the two algorithms? (You may find it helpful to experiment with different shapes for the figures.)
(Section 8.2.4; Problem 8.4; Theiler et al., 1992)
Try other choices of bandwidth h, noting that the estimate for the period (1851 + 4h, 1962 − 4h) does not have edge effects. Do you think that the drop from about three accidents per year before 1900 to about one thereafter is spurious? What about the peaks at around 1910 and 1940?
For an equi-tailed 90% bootstrap confidence band for the intensity, we take h = 5 and R = 199 (a larger R will give more reliable results):
9 Improved Calculation

9.1 Introduction
A few of the statistical questions in earlier chapters have been amenable to analytical calculation. However, most of our problems have been too complicated for exact solutions, and samples have been too small for theoretical large-sample approximations to be trustworthy. In such cases simulation has provided approximate answers through Monte Carlo estimates of bias, variance, quantiles, probabilities, and so forth. Throughout we have supposed that the simulation size is limited only by our impatience for reliable results.

Simulation of independent bootstrap samples and their use as described in previous chapters is usually easily programmed and implemented. If it takes up to a few hours to calculate enough values of the statistic of interest, T, ordinary simulation of this sort will be an efficient use of a researcher's time. But sometimes T is very costly to compute, or sampling is only a single component in a larger procedure, as in a double bootstrap, or the procedure will be repeated many times with different sets of data. Then it may pay to invest in methods of calculation that reduce the number of simulations needed to obtain a given precision, or equivalently increase the accuracy of an estimate based on a given simulation size. This chapter is devoted to such methods.

No lunch is free. The techniques that give the biggest potential variance reductions are usually the hardest to implement. Others yield less spectacular gains, but are more easily implemented. Thoughtless use of any of them may make matters worse, so it is essential to ensure that use of a variance reduction technique will save the investigator's time, which is much more valuable than computer time.
Most of our bootstrap estimates depend on averages. For example, in testing a null hypothesis (Chapter 4) we want to calculate the significance probability p = Pr*(T* ≥ t | F̂₀), where t is the observed value of test statistic T and
the fitted model F̂₀ is an estimate of F under the null hypothesis. The simple Monte Carlo estimate of p is R⁻¹ Σ_r I{T*_r ≥ t}, where I is the indicator function and the T*_r are based on R independent samples generated from F̂₀. The variance of this estimate is cR⁻¹, where c = p(1 − p). Nothing can generally be done about the factor R⁻¹, but the constant c can be reduced if we use a more sophisticated Monte Carlo technique. Most of this chapter concerns such techniques. Section 9.2 describes methods for balancing the simulation in order to make it more like a full enumeration of all possible samples, and in Section 9.3 we describe methods based on the use of control variates. Section 9.4 describes methods based on importance sampling. In Section 9.5 we discuss one important method of theoretical approximation, the saddlepoint method, which eliminates the need for simulation.
9.2 Balanced Bootstraps

The ideal bias estimate based on resampling from F̂ is

B = ∫ t(y*₁, ..., y*ₙ) g(y*₁, ..., y*ₙ) dy*₁ ⋯ dy*ₙ − t,   (9.1)

the average of t over all possible samples, minus the observed value. This sum over all possible samples need involve only (2n − 1)!/{n!(n − 1)!} calculations of t*, since the symmetry of t(·) with respect to the sample can be used, but even so the complete enumeration of values t* that (9.1) requires will usually be impracticable unless n is very small. So it is that, especially in nonparametric problems, we usually approximate the average in (9.1) by the average of R randomly chosen elements of 𝒮, and so approximate B by B_R = R⁻¹ Σ_r T*_r − t.

This calculation with a random subset of 𝒮 has a major defect: the values y₁, ..., yₙ typically do not occur with equal frequency in that subset. This is illustrated in Table 9.1, which reproduces Table 2.2 but adds (penultimate row) the aggregate frequencies for the data values; the final row is explained later. In the even simpler case of the sample average t = ȳ we can see clearly
Table 9.1 R = 9 resamples for the city population data, chosen by ordinary bootstrap sampling from F̂.

j         1    2    3    4    5    6    7    8    9   10
u       138   93   61  179   48   37   29   23   30    2
x       143  104   69  260   75   63   50   48  111   50

(Rows for samples 1–9 give the bootstrap frequencies of each data value; the corresponding statistic values are t*₁ = 1.466, t*₂ = 1.761, t*₃ = 1.951, t*₄ = 1.542, t*₅ = 1.371, t*₆ = 1.686, t*₇ = 1.378, t*₈ = 1.420, t*₉ = 1.660.)

Aggregate   9    8   11    5   13    8    8    7   11   10
RnF̄*        9    8   11    5   13    8    8    7   11   10
that the unequal frequencies completely account for the fact that B_R differs from the correct value B = 0. The corresponding phenomenon for parametric bootstrapping is that the aggregated EDF of the R samples is not as close to the CDF of the fitted parametric model as it is to the same model with different parameter values.

There are two ways to deal with this difficulty. First, we can try to change the simulation to remove the defect; and secondly we can try to adjust the results of the existing simulation.
Table 9.2 R = 9 balanced resamples for the city population data; each data value occurs exactly 9 times in aggregate.

(Rows for samples 1–9 give the bootstrap frequencies; the corresponding statistic values are t*₁ = 1.632, t*₂ = 1.823, t*₃ = 1.334, t*₄ = 1.317, t*₅ = 1.531, t*₆ = 1.344, t*₇ = 1.730, t*₈ = 1.424, t*₉ = 1.678.)

Aggregate   9    9    9    9    9    9    9    9    9    9
Example 9.1 (City population data) Consider estimating the bias of the ratio estimate t = x̄/ū for the data in the second and third rows of Table 9.1. Table 9.2 shows the results for a balanced bootstrap with R = 9: each data value occurs exactly 9 times overall.

To see how well the balanced bootstrap works, we apply it with the more realistic number R = 49. The bias estimate is B_R = T̄* − t = R⁻¹ Σ_r T*_r − t, and its variance over 100 replicates of the ordinary resampling scheme is 7.25 × 10⁻⁴. The corresponding figure for the balanced bootstrap is 9.31 × 10⁻⁵, so the balanced scheme is about 72.5/9.31 = 7.8 times more efficient for bias estimation. ■
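In the boot library balanced simulation is requested through the sim argument, so the comparison just described can be sketched as follows (ratio is the weighted form of t = x̄/ū):

library(boot)
ratio <- function(d, w) sum(d$x*w)/sum(d$u*w)
city.bal <- boot(city, ratio, R = 49, stype = "w", sim = "balanced")
mean(city.bal$t) - city.bal$t0               # balanced bias estimate B_R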
The efficiency gain from balancing can be measured by the variance ratio

var_{ord}(B_R) / var_{bal}(B_R),

where for this comparison the subscripts denote the sampling scheme under which B_R was calculated.
Table 9.3 Approximate efficiency gains when balancing schemes with R = 49 are applied in estimating biases for estimates of the nonlinear regression model applied to the calcium uptake data, based on 100 repetitions of the bootstrap.

             Cases                 Stratified             Residuals
         Balanced  Adjusted    Balanced  Adjusted    Balanced  Adjusted
β₀          8.9       6.9        141       108         1.2       0.6
β₁         13.1       8.9         63        49         1.4       0.6
σ²         11.1       9.1       18.7      18.0        15.3      13.5
So far we have focused on the application to bias estimation, for which balance typically gives a big improvement. The same is not generally true for estimating higher moments or quantiles. For instance, in the previous example the balanced bootstrap has efficiency less than one for calculation of the variance estimate V_R.

The balanced bootstrap extends quite easily to more complicated sampling situations. If the data consist of several independent samples, as in Section 3.2, balanced simulation can be applied separately to each. Some other extensions are straightforward.
Example 9.2 (Calcium uptake data) To investigate the improvement in bias estimation for the parameters of the nonlinear regression model fitted to the data of Example 7.7, we calculated 100 replicates of the estimated biases based on 49 bootstrap samples. The resulting efficiencies are given in Table 9.3 for different resampling schemes; the results labelled "Adjusted" are discussed in Example 9.3. For stratified resampling the data are stratified by the covariate value, so there are nine strata each with three observations. The efficiency gains under stratified resampling are very large, and those under case resampling are worthwhile. The gains when resampling residuals are not worthwhile, except for σ². ■
The ordinary bias estimate based on R resamples is

B_R = R⁻¹ Σ_{r=1}^{R} t(F̂*_r) − t(F̂),   (9.2)

where as usual F̂*_r denotes the EDF corresponding to the rth row of the array. Let F̄* denote the average of these EDFs, that is

F̄* = R⁻¹(F̂*₁ + ⋯ + F̂*_R).

The adjusted estimate replaces t(F̂) in (9.2) by t(F̄*):

B_{R,adj} = R⁻¹ Σ_{r=1}^{R} t(F̂*_r) − t(F̄*).   (9.3)

This is sometimes called the re-centred bias estimate. In addition to the usual bootstrap values t(F̂*_r), its calculation requires only F̄* and t(F̄*). Note that for the adjustment to work, t(·) must be in a functional form, i.e. be defined independently of sample size n. For example, a variance must be calculated with divisor n rather than n − 1.

The corresponding calculation for a parametric bootstrap is similar. In effect the adjustment compares the simulated estimates T*_r to the parameter value θ̄* = t(F̄*) obtained by fitting the model to data with EDF F̄* rather than F̂.
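For the nonparametric case the adjustment is easily computed from the frequency array of the simulation, as in this sketch (ratio as in the balanced-bootstrap sketch above, but now with ordinary sampling; boot.array gives the frequencies f*_{rj}):

city.boot <- boot(city, ratio, R = 49, stype = "w")
f <- boot.array(city.boot)                   # R x n array of frequencies f*_{rj}
w.bar <- colMeans(f)/nrow(city)              # weights representing Fbar*
B.adj <- mean(city.boot$t) - ratio(city, w.bar)   # re-centred bias estimate (9.3)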
Example 9.3 (Calcium uptake data) Table 9.3 shows the efficiency gains from using B_{R,adj} in the nonparametric resampling experiment described in Example 9.2. The gains are broadly similar to those for balanced resampling, but smaller.

For parametric sampling the quantities F̂*_r in (9.3) represent sets of data generated by parametric simulation from the fitted model, and the average F̄* is the dataset of size Rn obtained by concatenating the simulated samples. Here the simplest parametric simulation is to generate data y*_j = μ̂_j + ε*_j, where the μ̂_j are the fitted values from Example 7.7 and the ε*_j are independent N(0, 0.55²) variables. In 100 replicates of this bootstrap with R = 49, the efficiency gains for estimating the biases of β₀, β₁, and σ were 24.7, 42.5, and 20.7; the effect of the adjustment is much more marked for the parametric than for the nonparametric bootstraps. ■
The same adjustment does not apply to the variance approximation V_R, higher moments or quantiles. Rather the linear approximation is used as a conventional control variate, as described in Section 9.3.
which is the basis of the variance approximation v_L; equation (9.5) is simply a recasting of (2.44).

In terms of the frequencies f*_j with which the y_j appear in the bootstrap sample and the empirical influence values l_j = l(y_j; F̂) and q_{jk} = q(y_j, y_k; F̂), the quadratic approximation (9.4) is

T* ≈ t + n⁻¹ Σ_j f*_j l_j + ½n⁻² Σ_j Σ_k f*_j f*_k q_{jk}.
In the ordinary resampling scheme, the rows of frequencies (f*_{r1}, ..., f*_{rn}) are independent samples from the multinomial distribution with denominator n and probability vector (n⁻¹, ..., n⁻¹). This is the case in Table 9.1. In this situation the first and second joint moments of the frequencies are

E*(f*_{rj}) = 1,   var*(f*_{rj}) = 1 − n⁻¹,   cov*(f*_{rj}, f*_{rk}) = −n⁻¹,   j ≠ k.

Substituting these moments into the quadratic approximation leads to expressions (9.7)–(9.10) for the mean and variance of B_R under the ordinary and balanced schemes. The mean is almost the same under both schemes, but the leading term of the variance in (9.10) is smaller than in (9.8), because the term in (9.7) involving the l_j is held equal to zero by the balance constraints Σ_r f*_{rj} = R. First-order balance ensures that the linear term in the expansion for B_R is held equal to its value of zero for the complete enumeration.
Post-simulation balance is closely related to the balanced bootstrap. It is straightforward to see that the quadratic nonparametric delta method approximation of B_{R,adj} in (9.3) equals

(2n²)⁻¹ Σ_{j=1}^{n} Σ_{k=1}^{n} { R⁻¹ Σ_{r=1}^{R} f*_{rj} f*_{rk} − (R⁻¹ Σ_{r=1}^{R} f*_{rj})(R⁻¹ Σ_{r=1}^{R} f*_{rk}) } q_{jk}.   (9.11)
Figure 9.1 Efficiency gains over the ordinary bias estimate due to balancing and post-simulation adjustment. The right panel shows the gains for the balanced estimate, as a function of the correlation between the statistic and its linear approximation; the solid line shows the theoretical relation. See text for details.
Like the balanced bootstrap estimate of bias, there are no linear terms in this expression. Re-centring has forced those terms to equal their population values of zero.

When the statistic T does not possess an expansion like (9.4), balancing may not help. In any case the correlation between the statistic and its linear approximation is important: if the correlation is low because the quadratic component of (9.4) is appreciable, then it may not be useful to reduce variation in the linear component. A rough approximation is that var*(B_R) is reduced by a factor equal to 1 minus the square of the correlation between T* and T*_L (Problem 9.5).
Example 9.4 (Normal eigenvalues) For a numerical comparison of the efficiency gains in bias estimation from balanced resampling and post-simulation adjustment, we performed Monte Carlo experiments as follows. We generated n variates from the multivariate normal density with dimension 5 and identity covariance matrix, and took t to be the five eigenvalues of the sample covariance matrix. For each sample we used a large bootstrap to estimate the linear approximation t*_L for each of the eigenvalues and then calculated the correlation c between t* and t*_L. We then estimated the gains in efficiency for balanced and adjusted estimates of bias calculated using the bootstrap with R = 39, using variances estimated from 100 independent bootstrap simulations.

Figure 9.1 shows the gains in efficiency for each of the 5 eigenvalues, for 50 sets of data with n = 15 and 50 sets with n = 25; there are 500 points in each panel. The left panel compares the efficiency gains for the balanced and adjusted schemes. Balanced sampling gives better gains than post-sample adjustment, but the difference is smaller at larger gains. The right panel shows the efficiency gains for the balanced scheme plotted against the correlation c. The solid line is the theoretical curve (1 − c²)⁻¹. Knowledge of c would enable the efficiency gain to be predicted quite accurately, at least for c > 0.8. The potential improvement from balancing is not guaranteed to be worthwhile when c < 0.7. The corresponding plot for the adjusted estimates suggests that c must be at least 0.85 for a useful efficiency gain. ■
This example suggests the following strategy when a good estimate of bias is required: perform a small standard unbalanced bootstrap, and use it to estimate the correlation between the statistic and its linear approximation. If that correlation exceeds about 0.7, it may be worthwhile to perform a balanced simulation, but otherwise it will not. If the correlation exceeds 0.85, post-simulation adjustment will usually be worthwhile, but otherwise it will not.
be written as T* = T*_L + D*, the leading terms of which are known. Only terms involving D* need to be approximated by simulation. Given simulations T*₁, ..., T*_R with corresponding linear approximations T*_{L,1}, ..., T*_{L,R} and differences D*_r = T*_r − T*_{L,r}, the mean and variance of T* are estimated by

t + D̄*,   v_{R,con} = v_L + 2R⁻¹ Σ_{r=1}^{R} (T*_{L,r} − T̄*_L)(D*_r − D̄*) + R⁻¹ Σ_{r=1}^{R} (D*_r − D̄*)²,   (9.12)

where T̄*_L = R⁻¹ Σ_r T*_{L,r} and D̄* = R⁻¹ Σ_r D*_r. Use of these and related approximations requires the calculation of the T*_{L,r} as well as the T*_r.
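A sketch of (9.12), with the linear approximations obtained from the empirical influence values and the frequency array (city.boot and ratio as in the sketches above; divisors R rather than R − 1 are an inessential detail):

L <- empinf(data = city, statistic = ratio, stype = "w")   # influence values l_j
n <- nrow(city)
f <- boot.array(city.boot)
tL <- as.vector(city.boot$t0 + f %*% L/n)    # linear approximations T*_{L,r}
d <- as.vector(city.boot$t) - tL             # differences D*_r
vL <- sum(L^2)/n^2
v.con <- vL + 2*cov(tL, d) + var(d)          # variance estimate v_{R,con}
t.con <- city.boot$t0 + mean(d)              # mean estimate t + Dbar*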
The estimated bias of T* based on (9.12) is B_{R,con} = D̄*. This is closely related to the estimate obtained under balanced simulation and to the re-centred bias estimate B_{R,adj}. Like them, it ensures that the linear component of the bias estimate equals its population value, zero. Detailed calculation shows that all three approaches achieve the same variance reduction for the bias estimate in large samples. However, the variance estimate in (9.12) based on linear approximation is less variable than the estimated variances obtained under the other approaches, because its leading term is not random.
Example 9.5 (City population data) To see how effective control methods are in reducing the variability of a variance estimate, we consider the ratio statistic for the city population data in Table 2.1, with n = 10. For 100 bootstrap simulations with R = 50, we calculated the usual variance estimate v_R = (R − 1)⁻¹ Σ (t*_r − t̄*)² and the estimate v_{R,con} from (9.12). The estimated gain in efficiency calculated from the 100 simulations is 1.92, which though worthwhile is not large. The correlation between t* and t*_L is 0.94.

For the larger set of data in Table 1.3, with n = 49, we repeated the experiment with R = 100. Here the gain in efficiency is 7.5, and the correlation is 0.99.

Figure 9.2 shows scatter plots of the estimated variances in these experiments. For both sample sizes the values of v_{R,con} are more concentrated than the values of v_R, though the main effect of control is to increase underestimates of the true variances. ■
Example 9.6 (Frets heads) The data of Example 3.24 are a sample of n = 25 cases, each consisting of 4 measurements. We consider the efficiency gains from using v_{R,con} to estimate the bootstrap variances of the eigenvalues of their covariance matrix. The correlations between the eigenvalues and their linear approximations are 0.98, 0.89, 0.85 and 0.74, and the gains in efficiency estimated from 100 replicate bootstraps of size R = 39 are 2.3, 1.6, 0.95 and
1.3. The four left panels of Figure 9.3 show plots of the values of v_{R,con} against the values of v_R. No strong pattern is discernible.

To get a more systematic idea of the effectiveness of control methods in this setting, we repeated the experiment outlined in Example 9.4 and compared the usual and control estimates of the variances of the five eigenvalues. The results for the five eigenvalues and n = 15 and 25 are shown in Figure 9.3. Gains in efficiency are not guaranteed unless the correlation between the statistic and its linear approximation is 0.80 or more, and they are not large unless the correlation is close to one. The line y = (1 − x⁴)⁻¹ summarizes the efficiency gain well, though we have not attempted to justify this. ■
Quantiles

Control methods may also be applied to quantiles. Suppose that we have the simulated values t*₁, ..., t*_R of a statistic, and that the corresponding control variates and differences are available. We now sort the differences by the values of the control variates. For example, if our control variate is a linear approximation, with R = 4 and t*_{L,2} < t*_{L,1} < t*_{L,4} < t*_{L,3}, we put the differences in order d*₂, d*₁, d*₄, d*₃. The procedure now is to replace the p quantile of the linear approximation by a theoretical approximation, t_p, for p = 1/(R + 1), ..., R/(R + 1), thereby replacing t*_{(r)} with t*_{C,r} = t_{r/(R+1)} + d*_{[r]}, where [r] indexes the sample whose linear approximation has rank r. In our example we would obtain t*_{C,1} = t_{0.2} + d*₂, t*_{C,2} = t_{0.4} + d*₁, t*_{C,3} = t_{0.6} + d*₄, and t*_{C,4} = t_{0.8} + d*₃. We now estimate the p quantile of the distribution of T* by t*_{C,((R+1)p)}, i.e. the ((R + 1)p)th ordered value of t*_{C,1}, ..., t*_{C,R}. If the control variate is highly correlated with T*, the bulk of the variability in the estimated quantiles will have been removed by using the theoretical approximation.
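In outline the calculation is as follows (a sketch; quant.L is a hypothetical function returning the theoretical quantiles t_p of the control variate, in practice from a saddlepoint approximation as described below):

control.quant <- function(t, tL, quant.L)
{  R <- length(t)
   p <- (1:R)/(R + 1)
   tC <- quant.L(p) + (t - tL)[order(tL)]    # t_p plus differences ordered by control variate
   sort(tC)                                  # quantile estimates read off as order statistics
}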
One desirable property of the control quantile estimates is that, unlike most other variance reduction methods, their accuracy improves with increasing n as well as R.

There are various ways to calculate the quantiles of the control variate. The preferred approach is to calculate the entire distribution of the control variate by saddlepoint approximation (Section 9.5), and to read off the required quantiles t_p. This is better than other methods, such as Cornish–Fisher expansion, because it guarantees that the quantiles of the control variate will increase with p.
Example 9.7 (Returns data) To assess the usefulness of the control method just described, we consider setting studentized bootstrap confidence intervals for the rate of return in Example 6.3. We use case resampling to estimate quantiles of T* = (β̂*₁ − β̂₁)/S*, where β̂₁ is the estimate of the regression slope, and S² is the robust estimated variance of β̂₁ based on the linear approximation to β̂₁.

For a single bootstrap simulation we calculated three estimates of the quantiles of T*: the usual estimates, the order statistics t*_{(1)} < ⋯ < t*_{(R)}; the control estimates taking the control variate to be the linear approximation to T* based on exact empirical influence values; and the control estimates obtained using the linear approximation with empirical influence values estimated by regression on the frequency array for the same bootstrap. In each case the quantiles of the control variate were obtained by saddlepoint approximation, as outlined in Example 9.13 below. We used R = 999 and repeated the experiment 50 times in order to estimate the variance of the quantile estimates.
9.4 Importance Resampling

Many of the calculations in this book are of the form

μ = ∫ m(y*) dG(y*)

for some function m(·), where y* is abbreviated notation for a simulated data set. In expression (9.1), for example, m(y*) = t(y*), and the distribution G for y* = (y*₁, ..., y*ₙ) puts mass n⁻ⁿ on each element of the set 𝒮 = {y₁, ..., yₙ}ⁿ.
The idea of importance sampling is to simulate not from G but from another distribution H, under which

μ = ∫ m(y*) dG(y*) = ∫ m(y*) w(y*) dH(y*),   (9.14)

where w(y*) = dG(y*)/dH(y*) is known as the importance sampling weight. Given simulations y*₁, ..., y*_R from H, the raw importance sampling estimate of μ is μ̂_{H,raw} = R⁻¹ Σ_r m(y*_r)w(y*_r). The estimate μ̂_{H,raw} has mean μ by virtue of (9.14), so is unbiased, and has variance

var(μ̂_{H,raw}) = R⁻¹ ∫ {m(y*)w(y*) − μ}² dH(y*).
Clearly the best choice is the one for which m(y*)w(y*) = μ, because then μ̂_{H,raw} has zero variance, but this is not usable because μ is unknown. In general it is hard to choose H, but sometimes the choice is straightforward, as we now outline.
Tilted distributions

A potentially important application is calculation of tail probabilities such as μ = Pr*(T* ≤ t₀ | F̂), and the corresponding quantiles of T*. For probabilities m(y*) is taken to be the indicator function I{t(y*) ≤ t₀}, and if y*₁, ..., y*ₙ is a single random sample from the EDF F̂ then dG(y*) = n⁻ⁿ. Any admissible nonparametric choice for H is a multinomial distribution with probability p_j on y_j, for j = 1, ..., n. Then

dH(y*) = ∏_j p_j^{f*_j},

where f*_j is the number of times that y_j appears in y*. A simple choice of the p_j is

p_j ∝ exp(λl_j),   j = 1, ..., n,   (9.18)

where the l_j are the usual empirical influence values for t. The result of Problem 9.10 shows that under this distribution T* is approximately N(t + λnv_L, v_L), so the appropriate choice for λ in (9.18) is approximately λ = (t₀ − t)/(nv_L), again provided t₀ < t; in some cases it is possible to choose λ to make T* have mean exactly t₀. The choice of probabilities given by (9.18) is called an exponential tilting of the original values n⁻¹. This idea is also used in Sections 4.4, 5.3, and 10.2.2.
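A sketch of the tilting calculation follows; the boot library's exp.tilt function performs a similar computation, and the bracketing interval for uniroot is an arbitrary assumption that may need widening.

tilt.probs <- function(L, t, t0)
{  # choose lambda so that sum p_j l_j = t0 - t, with p_j proportional to exp(lambda l_j)
   eq <- function(lam) sum(L*exp(lam*L))/sum(exp(lam*L)) - (t0 - t)
   lam <- uniroot(eq, c(-20, 20))$root
   p <- exp(lam*L)
   p/sum(p)                                  # tilted probabilities (9.18)
}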
Table 9.4 shows approximate values of the efficiency R⁻¹π(1 − π)/var(μ̂_{H,raw}) of near-optimal importance resampling for various values of the tail probability π. The values were calculated using normal approximations for the distributions of T* under sampling from G and from H.
Quantiles

To see how quantiles are estimated, suppose that we want to estimate the α quantile of the distribution of T*, and that T* is approximately N(t, v_L) under G = F̂. Then we take a tilted distribution for H such that T* is approximately N(t + z_α v_L^{1/2}, v_L). For the situation we have been discussing, the exponential tilted distribution (9.18) will be near-optimal with λ = z_α/(nv_L^{1/2}), and in large samples this will be superior to G = F̂ for any α ≠ ½. So suppose that we have used importance resampling from this tilted distribution to obtain values t*_{(1)} ≤ ⋯ ≤ t*_{(R)} with corresponding weights w*_{(1)}, ..., w*_{(R)}. Then for α < ½ the raw quantile estimate is t*_{(M)}, where

(R + 1)⁻¹ Σ_{r=1}^{M} w*_{(r)} ≤ α < (R + 1)⁻¹ Σ_{r=1}^{M+1} w*_{(r)};   (9.19)

see Problem 9.9. When there is no importance sampling we have w*_r ≡ 1, and the estimate equals the usual t*_{((R+1)α)}.
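A sketch of (9.19); the boot library's imp.quantile function implements a more careful version.

imp.quant <- function(t, w, alpha)
{  R <- length(t)
   ord <- order(t)
   cum <- cumsum(w[ord])/(R + 1)             # cumulative weight sums / (R + 1)
   M <- sum(cum <= alpha)                    # largest M satisfying (9.19)
   t[ord][max(M, 1)]                         # falls back to the smallest value if M = 0
}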
The variation in w(y*) and its implications are illustrated in the following example.
Example 9.8 (Gravity data) For the two-sample comparison of means, the studentized test statistic is

z = {ȳ₂ − ȳ₁ − (μ₂ − μ₁)}/(s₂²/n₂ + s₁²/n₁)^{1/2},

with empirical influence values

l_{1j} = −(y_{1j} − ȳ₁)/(s₂²/n₂ + s₁²/n₁)^{1/2},   l_{2j} = (y_{2j} − ȳ₂)/(s₂²/n₂ + s₁²/n₁)^{1/2}.
Figure 9.5 The solid points in the left panel are the pairs (z*, w*) for 99 importance resamples; the hollow points are the pairs (z*, w*) for 99 ordinary resamples. The right panel compares the survivor function Pr*(Z* ≥ z) estimated from 50000 ordinary bootstrap resamples (heavy solid) with estimates of it based on the 99 ordinary bootstrap samples (dashes) and the 99 importance resamples (solid). The vertical dotted lines show z₀.
The tilt λ is chosen to satisfy

Σ_{j=1}^{n₁} l_{1j} exp(λl_{1j}/n₁) / Σ_{j=1}^{n₁} exp(λl_{1j}/n₁) + Σ_{j=1}^{n₂} l_{2j} exp(λl_{2j}/n₂) / Σ_{j=1}^{n₂} exp(λl_{2j}/n₂) = z₀.

The solid points in the left panel of Figure 9.5 show the weights w* plotted against the bootstrap values z* for the importance resamples. These values of z* are shifted to the right relative to the hollow points, which show the values of z* and w* (all equal to 1) for 99 ordinary resamples. The values of w* for the importance re-weighting vary over several orders of magnitude, with the largest values when z* ≪ z₀. But only those for z* ≥ z₀ contribute to μ̂_{H,raw}.
How well does this single importance resampling distribution work for estimating all values of the survivor function Pr*(Z* ≥ z)? The heavy solid line in the right panel shows the "true" survivor function of Z* estimated from 50000 ordinary bootstrap simulations. The lighter solid line is the importance sampling estimate

R⁻¹ Σ_{r=1}^{R} w*_r I{z*_r ≥ z}

with R = 99, and the dotted line is the estimate based on 99 ordinary bootstrap samples from the null distribution. The importance resampling estimate follows the "true" survivor function accurately close to z₀ but does poorly for negative z*. The usual estimate does best near z* = 0 but poorly in the tail region of interest; the estimated significance probability is p̂ = 0. While the usual estimate decreases by R⁻¹ at each z*, the weighted estimate decreases by much smaller jumps close to z₀; the raw importance sampling tail probability estimate is μ̂_{H,raw} = 0.015, which is very close to the true value. The weighted survivor function estimate has large jumps in its left tail, where the estimate is unreliable.

In 50 repetitions of this experiment the ordinary and raw importance resampling tail probability estimates had variances 2.09 × 10⁻⁴ and 2.63 × 10⁻⁵. For a tail probability of 0.015 this efficiency gain of about 8 is smaller than would be predicted from Table 9.4, the reason being that the distribution of z* is rather skewed and the normal approximation to it is poor. ■
In general there are several ways to obtain tilted distributions. We can use exponential tilting with exact empirical influence values, if these are readily available. Or we can estimate the influence values by regression using R₀ initial ordinary bootstrap resamples, as described in Section 2.7.4. Another way of using an initial set of bootstrap samples is to derive weighted smooth distributions as in (3.39): illustrations of this are given later in Examples 9.9 and 9.11.
A simple alternative to the raw estimate is the ratio estimate

μ̂_{H,rat} = Σ_{r=1}^{R} w(Y*_r) m(Y*_r) / Σ_{r=1}^{R} w(Y*_r).

To some extent this controls the effect of very large fluctuations in the weights. In practice it is better to treat the weight as a control variate or covariate. Since our aim in choosing H is to concentrate sampling where m(·) is largest, the values of m(Y*_r)w(Y*_r) and w(Y*_r) should be correlated. If so, and if the average weight w̄* differs from its expected value of one under simulation from H, then the estimate μ̂_{H,raw} probably differs from its expected value μ. This motivates the covariance adjustment made in the importance resampling regression estimate

μ̂_{H,reg} = μ̂_{H,raw} − b(w̄* − 1),   (9.23)

where b is the estimated slope of the linear regression of the m(Y*_r)w(Y*_r) on the w(Y*_r).
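A sketch computing the raw, ratio, and regression estimates from simulated values m_r = m(Y*_r) and weights w_r (the boot library's imp.moments and imp.prob make similar adjustments):

imp.est <- function(m, w)
{  raw <- mean(m*w)
   rat <- sum(m*w)/sum(w)
   b <- coef(lm(I(m*w) ~ w))[2]              # slope of regression of m*w on w
   c(raw = raw, ratio = rat,
     regression = raw - b*(mean(w) - 1))     # covariance adjustment (9.23)
}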
Defensive mixtures

A second improvement aims to prevent the weight w(y*) from varying wildly. Suppose that H is a mixture of distributions, πH₁ + (1 − π)H₂, where 0 < π < 1. The distributions H₁ and H₂ are chosen so that the corresponding probabilities are not both small simultaneously. Then the weights

dG(y*)/{π dH₁(y*) + (1 − π) dH₂(y*)}

will vary less, because even if dH₁(y*) is very small, dH₂(y*) will keep the denominator away from zero, and vice versa. This choice of H is known as a defensive mixture distribution, and it should do particularly well if many estimates, with different m(y*), are to be calculated. The mixture is applied by stratified sampling, that is by generating exactly πR observations from H₁ and the rest from H₂, and using μ̂_{H,reg} as usual.

The components of the mixture H should be chosen to ensure that the relevant range of values of t* is well covered, but beyond this the detailed choice is not critical. For example, if we are interested in quantiles of T* for probabilities between α and 1 − α, then it would be sensible to target H₁ at the α quantile and H₂ at the 1 − α quantile, most simply by the exponential tilting method described earlier. As a further precaution we might add a third component to the mixture, such as G, to ensure stable performance in the middle of the distribution. In general the mixture could have many components, but careful choice of two or three will usually be adequate. Always the application of the mixture should be by stratified sampling, to reduce variation.
Example 9.9 (Gravity data) To illustrate the above ideas, we again consider the hypothesis testing problem of Example 9.8. The left panel of Figure 9.6 shows 20 estimates of the survivor function of Z* from ordinary bootstrap resampling with R = 299. The right panel shows 20 estimates of the survivor function using the regression estimate μ̂_{H,reg} after simulations with a defensive mixture distribution. This mixture has three components which are G (the two EDFs), and two pairs of exponential tilted distributions targeted at the 0.025 and 0.975 quantiles of Z*. From our earlier discussion these distributions are given by (9.21) with λ = ±2/v_L^{1/2}; we shall denote the first pair of distributions by probabilities p_{1j} and p_{2j}, and the second by probabilities q_{1j} and q_{2j}. The first component G was used for R₁ = 99 samples, the second component (the ps) for R₂ = 100 and the third component (the qs) for R₃ = 100: the mixture proportions were therefore π_j = R_j/(R₁ + R₂ + R₃), for j = 1, 2, 3.

The results suggest that the regression estimate is best for tail quantities, that the raw estimate is best for quantiles, that results for estimating quantiles are insensitive to the precise mixture used, and that theoretical gains may not be realized in practice unless a single tail quantity is to be estimated. This is in line with other studies.
this does not take account of the fact that sampling is without replacement.

Figure 9.7 shows the theoretical large-sample efficiencies of balanced resampling, importance resampling, and balanced importance resampling for estimating the quantiles of a normal statistic. Ordinary balance gives maximum efficiency of 2.76 at the centre of the distribution, while importance resampling works well in the lower tail but badly in the centre and upper tail of the distribution. Balanced importance resampling dominates both.
Example 9.10 (Returns data) In order to assess how well these ideas might work in practice, we again consider setting studentized bootstrap confidence intervals for the slope in the returns example. We performed an experiment like that of Example 9.7, but with the R = 999 bootstrap samples generated by balanced resampling, importance resampling, and balanced importance resampling.

Table 9.6 shows the mean squared error for the ordinary bootstrap divided by the mean squared errors of the quantile estimates for these methods, using 50 replicate simulations from each scheme. This slightly different "efficiency" takes into account any bias from using the improved methods of simulation, though in fact the contribution to mean squared error from bias is small. The "true" quantiles are estimated from an ordinary bootstrap of size 100000.

The first two lines of the table show the efficiency gains due to using the control method when the linear approximation is used as a control variate, with empirical influence values calculated exactly and estimated by regression from the same bootstrap simulation. The results differ little. The next two rows show the gains due to balanced sampling, both without and with the control
462 9 • Improved Calculation
Balance                1.0  1.2  1.5  1.4  3.1  2.9  1.7  1.4  0.6
  with control         1.4  1.8  3.0  2.8  4.4  4.7  2.5  2.2  1.5
Importance   H1        7.8  3.7  3.6  1.8  0.4  3.5  2.3  3.1  5.5
             H2        4.6  2.9  3.5  1.1  0.1  2.6  3.1  4.3  5.2
             H3        3.6  3.7  2.0  1.7  0.5  2.4  2.2  2.6  3.6
             H4        4.3  2.6  2.5  1.8  0.9  1.6  1.6  2.2  2.3
             H5        2.6  2.1  0.7  0.3  0.4  0.5  0.6  1.6  2.1
Balanced     H1        5.0  5.7  4.1  1.9  0.5  2.6  2.2  6.3  4.5
importance   H2        4.2  3.4  2.4  1.8  0.2  2.0  3.6  4.2  3.9
             H3        5.2  4.2  3.8  1.8  0.9  3.0  2.4  4.0  4.0
             H4        4.3  3.3  3.4  2.2  2.1  2.7  3.7  3.3  4.3
             H5        3.2  2.8  1.0  0.4  0.9  0.9  1.4  2.1  2.1
The next five lines show the gains due to different versions of importance resampling, in each case using a defensive mixture distribution and the raw quantile estimate. In practice it is unusual to perform a bootstrap simulation with the aim of setting a single confidence interval, and the choice of importance sampling distribution H must balance various potentially conflicting requirements. Our choices were designed to reflect this. We first suppose that the empirical influence values l_j for t are known and can be used for exponential tilting of the linear approximation t*_L to t*. The first defensive mixture, H1, uses 499 simulations from a distribution tilted to the α quantile of t*_L and 500 simulations from a distribution tilted to the 1 − α quantile of t*_L, for α = 0.05. The second mixture is like this but with α = 0.025.
The third, fourth and fifth distributions are the sort that might be used in practice with a complicated statistic. We first performed an ordinary bootstrap of size R0, which we used to estimate first the empirical influence values l_j by regression and then the tilt values λ for the 0.05 and 0.95 quantiles. We then performed a further bootstrap of size (R − R0)/2 using each set of tilted probabilities, giving a total of R simulations from three different distributions, one centred and two tilted in opposite directions. We took R0 = 199 and R0 = 499, giving H3 and H4. For H5 we took R0 = 499, but estimated the tilted distributions by frequency smoothing (Section 3.9.2) with bandwidth ε = 0.5v^{1/2} at the 0.05 and 0.95 quantiles of t*, where v^{1/2} is the standard error of t estimated from the ordinary bootstrap.
Balance generally improves importance resampling, which is not sensitive to the mixture distribution used. The effect of estimating the empirical influence values is not marked, while frequency smoothing does not perform so well as exponential tilting. Importance resampling estimates of the central quantiles are poor, even when the simulation is balanced. Overall, any of schemes H1–H4 leads to appreciably more accurate estimates of the quantiles usually of interest. ■
E{m(Y*) | G_k} = ∫ m(y) dG_k(y) = ∫ m(y) {dG_k(y)/dH(y)} dH(y) = E{m(Y*)w_k(Y*) | H},
say, where w_k(y) = dG_k(y)/dH(y). We can therefore estimate all K values using one set of samples y*_1,...,y*_N simulated from H, with estimates
μ̂_k = N^{-1} Σ_{i=1}^N m(y*_i) w_k(y*_i).   (9.24)
Both N and the choice of H depend upon the use being made of the estimates and the form of m(·).
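For multinomial resampling the weights take an explicit form, because the combinatorial factors cancel in dG_k/dH. A small sketch (ours, not the book's code; recycle.est is a hypothetical name) is:

# A sketch of (9.24) for multinomial sampling: dG_k/dH for a sample with
# frequencies f reduces to prod((pG/pH)^f). freq is an N x n matrix of
# frequencies drawn under probabilities pH, m holds the N values m(y*),
# and pG, pH are the probability vectors defining G_k and H.
recycle.est <- function(m, freq, pG, pH)
{ w <- apply(freq, 1, function(f) prod((pG/pH)^f))
  mean(m * w) }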
Example 9.11 (City population data) Consider again estimating the bias and variance functions for the ratio θ = t(F) of the city population data with n = 10. In Example 3.22 we estimated b(F) = E(T | F) − t(F) and v(F) = var(T | F) for a range of values of θ = t(F), using a first-level bootstrap to calculate values of t* for 999 bootstrap samples F̂*, and then doing a second-level bootstrap to estimate b(F̂*) and v(F̂*) for each of those samples. Here the second level of resampling is avoided by using importance re-weighting. At the same time, we retain the smoothing introduced in Example 3.22.
Rather than take each G_k to be one of the bootstrap EDFs F̂*, we obtain a smooth curve by using smooth distributions F̂*_θ with probabilities p_j(θ) as defined by (3.39). Recall that the parameter value of F̂*_θ is t(F̂*_θ) = θ*, say, which will differ slightly from θ. For H we take F̂, the EDF of the original data, on the grounds that it has the correct support and covers the range of values for y* well: it is not necessarily a good choice. Then we have weights
w*_r(θ) = Π_{j=1}^n {n p_j(θ)}^{f*_{rj}},
say, where as usual f*_{rj} is the frequency with which y_j occurs in the rth bootstrap sample. We should emphasize that the samples y* drawn from H here replace second-level bootstrap samples.
Consider the bias estimate. The weighted sum R^{-1} Σ_r (t*_r − θ*) w*_r(θ) is an unbiased estimate of the bias E**(T** | F̂*_θ) − θ*, and we can plot this estimate to see how the bias varies as a function of θ* or θ. However, the weighted sum can behave badly if a few of the w*_r(θ) are very large, and it is better to use the ratio and regression estimates (9.22) and (9.23).
The top left panel of Figure 9.8 shows raw, ratio, and regression estimates of the bias, based on a single set of R = 999 simulations, with the curve obtained from the double bootstrap calculation used in Figure 3.7. For example, the ratio estimate of bias for a particular value of θ is Σ_r (t*_r − θ*) w*_r(θ) / Σ_r w*_r(θ), and this is plotted as a function of θ*. The raw and ratio estimates are rather poor, but the regression estimate agrees fairly well with the double bootstrap curve. The panel also shows the estimated bias from a defensive mixture with 499 ordinary samples mixed with 250 samples tilted to each of the 0.025 and 0.975 quantiles; this is the best estimate of those we consider. The panels below show 20 replicates of these estimated biases. These confirm the impression from the panel above: with ordinary resampling the regression estimator is best, but it is better to use the mixture distribution.
The top right panel shows the corresponding estimates for the standard deviation function.
[Figure 9.8. Raw, ratio, and regression estimates of the bias and standard deviation functions, with the double bootstrap curve and the estimate based on a defensive mixture distribution (light solid). The lower panels show 20 replicates of raw, ratio, and regression estimates from ordinary sampling, and of the estimate from a defensive mixture (clockwise from upper left) for the panels above.]
The results for the raw estimate suggest that recycling can give very variable results, and it must be used with care, as the next example vividly illustrates.
Example 9.12 (Bias adjustment) Consider the problem of adjusting the bootstrap estimate of bias of T, discussed in Section 3.9. The adjustment C in equation (3.30) is (RM)^{-1} Σ_{r=1}^R Σ_{m=1}^M (t**_{rm} − t*_r), which uses M samples from each of the R models F̂*_r fitted to samples from F̂. The recycling method replaces each average M^{-1} Σ_{m=1}^M (t**_{rm} − t*_r) by a weighted average of the form (9.24), so that C is estimated by
R^{-1} Σ_{r=1}^R N^{-1} Σ_{i=1}^N (t**_i − t*_r) w_r(y**_{i1},...,y**_{in}),   (9.25)
where t**_i is the value of T for the ith sample y**_{i1},...,y**_{in} drawn from the distribution H, and w_r = dF̂*_r/dH. If we applied recycling only to the first term of C, which estimates E**(T**), then a different, and as it turns out inferior, estimate would be obtained for C.
The support of H must include all R first-level bootstrap samples, so as in the previous example a natural choice is H = F̂, the model fitted to (or the EDF of) the original sample. However, this can give highly unstable results, as one might predict from the leftmost panel in the second row of Figure 9.8. This can be illustrated by considering the case of the parametric model Y ~ N(θ, 1), with estimate T = Ȳ. Here the terms being summed in (9.25) have infinite variance; see Problem 9.15. The difficulty arises from the choice H = F̂, and can be avoided by taking H to be a mixture as described in Section 9.4.2, with at least three components. ■
Instability due to the choice H = F̂ does not occur with all applications of recycling. Indeed applications to bootstrap likelihood (Chapter 10) work well with this choice.
K'(ξ̂) = u,   (9.27)
and is therefore a function of u. Here K' and K'' are respectively the first and second derivatives of K with respect to ξ. A simple approximation to the CDF of U, Pr(U ≤ u), is
Φ(w) + φ(w)(1/w − 1/v),   (9.29)
for values of u such that |w| ≤ c for some positive c; the error in the CDF approximation rises to O(n^{-1}) when u is such that |w| ≤ cn^{1/2}. A key feature is that the error is relative, so that the ratio of the true density of U to its saddlepoint approximation is bounded over the likely range of u. A consequence is that unlike other analytic approximations to densities and tail probabilities, (9.26), (9.28) and (9.29) are very accurate far into the tails of the density of U. If there is doubt about the accuracy of (9.28) and (9.29), G_s may be calculated by numerical integration of g_s.
The more complex formulae that are used for conditional and marginal density and distribution functions are given in Sections 9.5.2 and 9.5.3.
Application to resampling
In the context of resampling, suppose that we are interested in the distribution of the average of a sample from y_1,...,y_n, where y_j is sampled with probability p_j, j = 1,...,n. Often, but not always, p_j = n^{-1}. We can write the average as U* = n^{-1} Σ_j f*_j y_j, where as usual (f*_1,...,f*_n) has a joint multinomial distribution with denominator n. Then U* has cumulant-generating function
K(ξ) = n log{ Σ_{j=1}^n p_j exp(ξ a_j) },   (9.30)
where a_j = y_j/n. The function (9.30) can be used in (9.26) and (9.28) to give non-random approximations to the PDF and CDF of U*. Unlike most of the methods described in this book, the error in saddlepoint approximations arises not from simulation variability, but from deterministic numerical error in using g_s and G_s rather than the exact density and distribution function.
In principle, of course, a nonparametric bootstrap statistic is discrete and so the density does not exist, but as we saw in Section 2.3.2, U* typically has so many possible values that we can think of it as continuous away from the extreme tails of its distribution. Continuity corrections can sometimes be applied, but they make little difference in bootstrap applications.
When it is necessary to approximate the entire distribution of U*, we calculate the values of G_s(u) for m values of u equally spaced between min a_j and max a_j and use a spline smoother to interpolate between the corresponding values of Φ^{-1}{G_s(u)}. Quantiles and cumulative probabilities for U* can be read off the fitted curve. Experience suggests that m = 50 is usually ample.
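In R this scheme might be sketched as follows (our code, not the book's; Gs stands for any routine returning the saddlepoint CDF approximation at u):

# A sketch: m saddlepoint CDF values on a grid, transformed to the normal
# scale, splined, and used to read off quantiles of U*.
spline.quantiles <- function(Gs, a, m=50, p=c(0.025,0.975))
{ u <- seq(min(a), max(a), length=m)
  z <- qnorm(sapply(u, Gs))          # Phi^{-1}{Gs(u)}
  fit <- smooth.spline(z, u)         # u as a smooth function of z
  predict(fit, qnorm(p))$y }         # quantile estimates for U*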
Example 9.13 (Linear approximation) A simple application of these ideas is to the linear approximation t*_L for a bootstrap statistic t*, as was used in Example 9.7. We write T*_L = t + n^{-1} Σ_j f*_j l_j, where as usual f*_j is the frequency of the jth case in the bootstrap sample and l_j is the jth empirical influence value. The cumulant-generating function of T*_L − t is (9.30) with a_j = l_j/n, and the saddlepoint equation K'(ξ̂) = u then determines ξ̂.
For a numerical example, we take the variance t = n^{-1} Σ (y_j − ȳ)² for exponential samples of sizes 10 and 15; the empirical influence values are l_j = (y_j − ȳ)² − t. Figure 9.9 compares the saddlepoint approximations to the PDFs of t*_L with the histogram from bootstrap calculations with R = 49999. The saddlepoint approximation accurately reflects the skewed lower tail of the bootstrap distribution, whereas a normal approximation would not do so. However, the saddlepoint approximation does not pick up the multimodality of the density for n = 10, which arises for the same reason as in the right panels of Figure 2.9: the bulk of the variability of T*_L is due to a few observations with large values of |l_j|, while those for which |l_j| is small merely add noise.
The figure suggests that with so small a sample the CDF approximation will be more useful. This is borne out by Table 9.7, which compares the simulation quantiles and quantiles obtained by fitting a spline to 50 saddlepoint CDF values.
In more complex applications the empirical influence values l_j would usually be estimated by numerical differentiation or by regression, as outlined in Sections 2.7.2, 2.7.4 and 3.2.1. ■
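The calculation is easy to sketch with the saddle function of the boot library used in the practicals below; the data and evaluation point here are illustrative, not those of the example:

# A sketch using boot's saddle(): saddlepoint approximation for the linear
# approximation T_L* - t = sum f*_j l_j / n of the variance statistic.
library(boot)
y <- rexp(10); n <- length(y)
t0 <- mean((y - mean(y))^2)       # t = n^{-1} sum (y_j - ybar)^2
L <- (y - mean(y))^2 - t0         # empirical influence values l_j
vL <- sum(L^2)/n^2                # delta-method variance of T_L*
sp <- saddle(A=L/n, u=sqrt(vL))   # evaluate one standard error above t
sp$spa                            # (PDF, CDF) approximations at that point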
Example 9.14 (Tuna density estimate) We return to the double bootstrap used in Example 5.13 to calibrate confidence intervals based on a kernel density estimate. This involved estimating the probabilities
Pr**(T** ≤ 2t* − t | F̂*),   (9.31)
where t is the variance-stabilized estimate of the quantity of interest. The double bootstrap version of t can be written as t** = (Σ_j f**_j a_j)^{1/2}, where a_j = (nh)^{-1}{φ(−y_j/h) + φ(y_j/h)} and f**_j is the frequency with which y_j appears in a second-level bootstrap sample. Conditional on a first-level bootstrap sample F̂* with frequencies f*_1,...,f*_n, the (f**_1,...,f**_n) are jointly multinomial with mean vector (f*_1,...,f*_n) and denominator n.
Now if 2t* − t < 0, the probability (9.31) equals zero, because T is positive. If 2t* − t > 0, the event T** ≤ 2t* − t is equivalent to Σ_j f**_j a_j ≤ (2t* − t)². Thus conditional on F̂*, if 2t* − t > 0, we can obtain a saddlepoint approximation to (9.31) by applying (9.28) and (9.30) with u = (2t* − t)², the a_j given above, and p_j = f*_j/n.
Including programming, it took about ten minutes to calculate 3000 values of (9.31) by saddlepoint approximation; direct simulation with 250 samples at the second level took about four hours on the same workstation. ■
Estimating functions
One simple extension of the basic approximations is to statistics determined by monotonic estimating functions. Suppose that the value of a scalar bootstrap statistic T* based on sampling from y_1,...,y_n is the solution to the estimating equation
U*(t) = Σ_{j=1}^n a(t; y_j) f*_j = 0,   (9.32)
where for each y the function a(θ; y) is decreasing in θ. Then T* ≤ t if and only if U*(t) ≤ 0. Hence Pr*(T* ≤ t) may be estimated by G_s(0) applied with cumulant-generating function (9.30) in which a_j = a(t; y_j). A saddlepoint approximation to the density of T* at t is then g_s(0) multiplied by the Jacobian factor
| Σ_j {∂a(t; y_j)/∂t} exp{ξ̂ a(t; y_j)} / Σ_j exp{ξ̂ a(t; y_j)} |.   (9.33)
Example 9.15 (Maize data) Problem 4.7 contains data from a paired comparison experiment performed by Darwin on the growth of maize plants. The data are reduced to n = 15 differences y_1,...,y_n between the heights (in eighths of an inch) of cross-fertilized and self-fertilized plants. When two large negative values are excluded, the differences have average ȳ = 33 and look close to normal, but when those values are included the average drops to 20.9.
When data may have been contaminated by outliers, robust M-estimates are useful. If we assume that Y = θ + σε, where the distribution of ε is symmetric about zero but may have long tails, an estimate of location θ can be found by solving the equation
Σ_{j=1}^n ψ{(y_j − θ)/σ} = 0,   (9.34)
where, for example, ψ is the Huber function
ψ(ε) = ε if |ε| ≤ c, and c sign(ε) otherwise.   (9.35)
With c = ∞ this gives ψ(ε) = ε and leads to the normal-theory estimate θ̂ = ȳ, but a smaller choice of c will give better behaviour when there are outliers. With c = 1.345 and σ fixed at the median absolute deviation s of the data, we obtain θ̂ = 26.45. How variable is this? We can get some idea by looking at replicates of θ̂ based on bootstrap samples y*_1,...,y*_n. A bootstrap value θ̂* solves
Σ_{j=1}^n ψ{(y_j − θ)/s} f*_j = 0.
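Since ψ{(y − θ)/s} is decreasing in θ, the CDF of θ̂* follows from the estimating function method above; a sketch (ours, not the book's code, using boot's saddle) is:

# A sketch: Pr*(thetahat* <= theta) ~ Gs(0) with a_j = psi{(y_j - theta)/s};
# the approximation is degenerate at theta = thetahat itself.
library(boot)
huber.psi <- function(e, c=1.345) pmax(-c, pmin(c, e))
theta.cdf <- function(theta, y, s)
  saddle(A=huber.psi((y - theta)/s), u=0)$spa[2]
# e.g. theta.cdf(30, y, mad(y))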
[Figure 9.10. Comparison of the saddlepoint approximation to the PDF of a robust M-estimate applied to the maize data (solid), with results from a bootstrap simulation with R = 50000. The heavy curve is the saddlepoint approximation to the PDF of the average. The left panel shows results from resampling the data, and the right shows results from a symmetrized bootstrap.]
The right panel of Figure 9.10 compares saddlepoint and Monte Carlo approximations to the PDF of θ̂* under this symmetrized resampling scheme; the PDF of the average is shown also. All are symmetric about θ̂.
One difficulty here is that we might prefer to approximate the PDF of θ̂* when s is replaced by its bootstrap version s*, and this cannot be done in the current framework. More fundamentally, the distribution of interest will often be for a quantity such as a studentized form of θ̂* derived from θ̂*, s*, and perhaps other statistics, necessitating the more sophisticated approximations outlined in Section 9.5.3. ■
where ξ̂_20 satisfies the (q − q1) × 1 system of equations ∂K(0, ξ_2)/∂ξ_2 = u_2, and K''_22 is the (q − q1) × (q − q1) corner of K'' corresponding to U_2.
Division of (9.36) by (9.37) gives a double saddlepoint approximation to the conditional density of U_1 at u_1 given that U_2 = u_2. When U_1 is scalar, i.e. q1 = 1, the approximate conditional CDF is again (9.28), but with
w = sgn(ξ̂_1) [ 2{ξ̂ᵀu − K(ξ̂)} − 2{ξ̂_20ᵀu_2 − K(0, ξ̂_20)} ]^{1/2},  v = ξ̂_1 { |K''(ξ̂)| / |K''_22(0, ξ̂_20)| }^{1/2}.
In this situation it is of course more direct to use the estimating function method with a(t; y_j) = x_j − t z_j and the simpler approximations (9.28) and (9.33). Then the Jacobian term in (9.33) is | Σ z_j exp{ξ̂(x_j − t z_j)} / Σ exp{ξ̂(x_j − t z_j)} |.
Another application is to conditional distributions for T*. Suppose that the population pairs are related by x_j = z_j θ + z_j^{1/2} ε_j, where the ε_j are a random sample from a distribution with mean zero. Then conditional on the z_j, the ratio Σ x_j / Σ z_j has variance proportional to (Σ z_j)^{-1}. In some circumstances we might want to obtain an approximation to the conditional distribution of T* given that Σ Z*_j = Σ z_j. In this case we can use the approach outlined in the previous paragraph, but with two conditioning variables: we take the W_j to be independent Poisson variables with equal means, and set
U_1 = Σ_j (x_j − t z_j) W_j,  U_2 = ( Σ_j z_j W_j, Σ_j W_j )ᵀ,  u_2 = ( Σ_j z_j, n )ᵀ,
corresponding to a_j = (x_j − t z_j, z_j, 1)ᵀ.
A third application is to approximating the distribution of the ratio when a sample of size m = 10 is taken without replacement from the n = 49 data pairs. Again T* ≤ t is equivalent to the event Σ_j (x_j − t z_j) W_j ≤ 0, but now W_j indicates that (z_j, x_j) is included in the m cities chosen; we want to impose the condition Σ W_j = m. We take the W_j to be binary variables with equal success probabilities 0 < π < 1, giving K_j(ξ) = log(1 − π + π e^ξ), with π any value. We then apply the double saddlepoint approximation with
U_1 = Σ_j (x_j − t z_j) W_j,  U_2 = Σ_j W_j,  u_2 = m.
Table 9.8 compares the quantiles of these saddlepoint distributions with Monte Carlo approximations based on 100000 samples. The general agreement is excellent in each case. ■
Example 9.17 (Correlation coefficient) In Example 4.9 we applied a permutation test to the sample correlation t between variables x and z based on pairs (x_1, z_1),...,(x_n, z_n). For this statistic and test, the event T ≥ t is equivalent to Σ_j x_j z_{ζ(j)} ≥ Σ_j x_j z_j, where ζ(·) is a permutation of the integers 1,...,n.
An alternative formulation is as follows. Let W_{ij}, i, j = 1,...,n, denote independent binary variables with equal success probabilities 0 < π < 1, for any π. Then consider the distribution of U_1 = Σ_{i,j} x_i z_j W_{ij} conditional on U_2 = (Σ_i W_{i1},...,Σ_i W_{in}, Σ_j W_{1j},...,Σ_j W_{n−1,j})ᵀ = u_2, where u_2 is a vector of ones of length 2n − 1. Notice that the condition Σ_j W_{nj} = 1 is entailed by the other conditions and so is redundant. Each value of x_i and each value of z_j appears precisely once in the sum U_1, with equal probabilities, and hence the conditional distribution of U_1 given U_2 = u_2 is equivalent to the permutation distribution of T. Here m = n², q = 2n, and q1 = 1.
Our limited numerical experience suggests that in this example the saddlepoint approximation can be inaccurate if the large number of constraints results in a conditional distribution on only a few values. ■
where κ_i is the ith cumulant of T*. The exact cumulants are usually unavailable, so we replace them with the cumulants of the cubic approximation to T* given by
T* ≈ t + n^{-1} Σ_{j=1}^n f*_j l_j + (1/2) n^{-2} Σ_{j,k=1}^n f*_j f*_k q_{jk} + (1/6) n^{-3} Σ_{j,k,m=1}^n f*_j f*_k f*_m c_{jkm},
where t is the original value of the statistic, and the l_j, q_{jk} and c_{jkm} are the empirical linear, quadratic and cubic influence values; see also (9.6). To the order required the approximate cumulants κ_{c,1},...,κ_{c,4} can be expressed in terms of these influence values, and the standardized variable is
Z*_c = (T* − κ_{c,1}) / κ_{c,2}^{1/2},
whose CDF is approximated by a series (9.39) in the standardized cumulants.
Differentiation of (9.39) gives an approximate density for Z*_c and hence for T*. However, experience suggests that the saddlepoint approximations (9.28) and (9.29) are usually preferable if they can be obtained, primarily because (9.39) results in less accurate tail probability estimates: its error is absolute rather than relative. Further drawbacks are that (9.39) need not increase with z, and that the density approximation may become negative.
Derivation of the influence values that contribute to κ_{c,1},...,κ_{c,4} can be tedious.
and so forth.
To obtain the linear, quadratic and cubic influence values for w(G, F) at G = F, we replace G(y) with
(1 − ε_1 − ε_2 − ε_3)F(y) + ε_1 H(y − y_1) + ε_2 H(y − y_2) + ε_3 H(y − y_3),
where H(x) is the Heaviside function, jumping from 0 to 1 at x = 0, differentiate with respect to ε_1, ε_2 and ε_3, and set ε_1 = ε_2 = ε_3 = 0. The empirical influence values for W at F̂ are then obtained by replacing F with F̂. In terms of the influence values for t and v the result of this calculation is
L_w(y_1) = v^{-1/2} L_t(y_1),
Q_w(y_1, y_2) = v^{-1/2} Q_t(y_1, y_2) − (1/2) v^{-3/2} L_t(y_1) L_v(y_2) [2],
C_w(y_1, y_2, y_3) = v^{-1/2} C_t(y_1, y_2, y_3) − (1/2) v^{-3/2} {Q_t(y_1, y_2) L_v(y_3) + Q_v(y_1, y_2) L_t(y_3)} [3] + (3/4) v^{-5/2} L_t(y_1) L_v(y_2) L_v(y_3) [3],
where [k] after a term indicates that it should be summed over the permutations of its y's that give the k distinct quantities in the sum. Thus, for example, L_t(y_1)L_v(y_2)[2] = L_t(y_1)L_v(y_2) + L_t(y_2)L_v(y_1). The influence values for z involve linear, quadratic, and cubic influence values for t, and linear and quadratic influence values for v.
The simplest example is the average t(F) = ∫ x dF(x) = ȳ of a sample of values y_1,...,y_n from F. Then L_t(y_i) = y_i − ȳ and Q_t(y_i, y_j) = C_t(y_i, y_j, y_k) = 0, and the expressions above simplify greatly.
Integration approach
Another approach involves extending the estimating function approximation to the multivariate case, and then approximating the marginal distribution of the statistic of interest. To see how, suppose that the quantity T of interest is a scalar, and that T and S = (S_1,...,S_{q−1})ᵀ are determined by a q × 1 estimating function
U(t, s) = Σ_{j=1}^n a(t, s_1,...,s_{q−1}; Y_j).
Then the bootstrap quantities T* and S* are the solutions of the equations
U*(t, s) = Σ_{j=1}^n a_j(t, s) f*_j = 0,   (9.40)
where a_j(t, s) = a(t, s; y_j). Write K(ξ; t, s) = n log[ Σ_{j=1}^n p_j exp{ξᵀ a_j(t, s)} ] for the cumulant-generating function of U*(t, s), with corresponding tilted probabilities
p_j(t, s; ξ) = p_j exp{ξᵀ a_j(t, s)} / Σ_{k=1}^n p_k exp{ξᵀ a_k(t, s)},
and let A(t, s) denote −K(ξ̂; t, s) evaluated at the saddlepoint. The marginal density of T* at t is then approximated by
J(t, s; ξ̂) (2π)^{-1/2} |K''(ξ̂; t, s)|^{-1/2} |∂²A(t, s)/∂s ∂sᵀ|^{-1/2} exp{−A(t, s)},   (9.43)
evaluated at s = ŝ_t, where ξ̂ and ŝ_t satisfy
∂K(ξ; t, s)/∂ξ = n Σ_{j=1}^n p_j(t, s; ξ) a_j(t, s) = 0,  ∂K(ξ; t, s)/∂s = n Σ_{j=1}^n p_j(t, s; ξ) {∂a_j(t, s)/∂s}ᵀ ξ = 0.   (9.44)
These can be solved using packaged routines, with starting values given by noting that when t equals its sample value t_0, say, s equals its sample value and ξ = 0.
The second derivatives of A needed to calculate (9.43) may be expressed in terms of derivatives of K; for example,
∂²K(ξ; t, s)/∂ξ ∂ξᵀ = n [ Σ_{j=1}^n p_j(t, s; ξ) a_j(t, s) a_j(t, s)ᵀ − {Σ_j p_j(t, s; ξ) a_j(t, s)}{Σ_j p_j(t, s; ξ) a_j(t, s)}ᵀ ],   (9.46)
together with analogous expressions (9.47)–(9.51) involving the first two derivatives of the a_j(t, s) with respect to s.
Approximate quantiles of T* can be obtained in the way described just before Example 9.13.
The expressions above look forbidding, but their implementation is relatively straightforward. The key point to note is that they depend only on the quantities a_j(t, s), their first derivatives with respect to t, and their first two derivatives with respect to s. Once these have been programmed, they can be input to a generic routine to perform the saddlepoint approximations. Difficulties that sometimes arise with numerical overflow due to large exponents can usually be circumvented by rescaling the data to zero mean and unit variance, which has no effect on statistics that are invariant to location and scale.
Example 9.19 (Maize data) To illustrate these ideas we consider the bootstrap variance and studentized average for the maize data. Both these statistics are location-invariant, so without loss of generality we replace y_j with y_j − ȳ and henceforth assume that ȳ = 0. With this simplification the statistics of interest are the bootstrap variance V* and the studentized average Z*.
[Figure 9.11. Saddlepoint approximations for the bootstrap variance V* and studentized average Z* for the maize data. Top left: approximations to quantiles of Z* by integration saddlepoint (solid) and simulation using 50000 bootstrap samples (every 20th order statistic is shown). Top right: density approximations for Z* by integration saddlepoint (heavy solid), approximate cumulant-generating function (solid), and simulation using 50000 bootstrap samples. Bottom left: corresponding approximations for V*. Bottom right: contours of −A(z, v), with local maxima along the dashed line z = −3.5 at A and at B.]
Example 9.20 (Robust M-estimate) For a second example of marginal approximation, we suppose that θ̂ and σ̂ are M-estimates found from a random sample y_1,...,y_n by simultaneous solution of the equations
Σ_{j=1}^n ψ{(y_j − θ)/σ} = 0,  Σ_{j=1}^n χ{(y_j − θ)/σ} = 0,
for suitable functions ψ and χ. The quantity of interest is a studentized form Z of θ̂, which is proportional to the usual Student-t statistic when ψ(ε) = ε. In order to set studentized bootstrap confidence limits for θ, we need approximations to the bootstrap quantiles of Z*. These may be obtained by applying the marginal saddlepoint approximation outlined above with T = Z, S = (S_1, S_2)ᵀ, p_j = n^{-1}, and
a_j(z, s) = ( ψ(σ̂e_j/s_1 − z d/s_2), ψ²(σ̂e_j/s_1 − z d/s_2) − γ_j, ψ'(σ̂e_j/s_1 − z d/s_2) − s_2 )ᵀ.   (9.52)
such as Hammersley and Handscomb (1964), Bratley, Fox and Schrage (1987), Ripley (1987), and Niederreiter (1992).
Balanced bootstrap simulation was introduced by Davison, Hinkley and Schechtman (1986). Ogbonmwan (1985) describes a slightly different method for achieving first-order balance. Graham et al. (1990) discuss second-order balance and the connections to classical experimental design. Algorithms for balanced simulation are described by Gleason (1988). Theoretical aspects of balanced resampling have been investigated by Do and Hall (1992b). Balanced sampling methods are related to number-theoretic methods for integration (Fang and Wang, 1994), and to Latin hypercube sampling (McKay, Conover and Beckman, 1979; Stein, 1987; Owen, 1992b). Diaconis and Holmes (1994) discuss the complete enumeration of bootstrap samples by methods based on Gray codes.
Linear approximations were used as control variates in bootstrap sampling by Davison, Hinkley and Schechtman (1986). A different approach was taken by Efron (1990), who suggested the re-centred bias estimate and the use of control variates in quantile estimation. Do and Hall (1992a) discuss the properties of this method, and provide comparisons with other approaches. Further discussion of control methods is contained in theses by Therneau (1983) and Hesterberg (1988).
Importance resampling was suggested by Johns (1988) and Davison (1988), and was exploited by Hinkley and Shi (1989) in the context of iterated bootstrap confidence intervals. Gigli (1994) outlines its use in parametric simulation for regression and certain time series problems. Hesterberg (1995b) suggests the application of ratio and regression estimators and of defensive mixture distributions in importance sampling, and describes their properties. The large-sample performance of importance resampling has been investigated by Do and Hall (1991). Booth, Hall and Wood (1993) describe algorithms for balanced importance resampling.
Bootstrap recycling was suggested by Davison, Hinkley and Worton (1992) and independently by Newton and Geyer (1994), following earlier ideas by J. W. Tukey; see Morgenthaler and Tukey (1991) for application of similar ideas to robust statistics. Properties of recycling in various applications are discussed by Ventura (1997).
Saddlepoint methods have a history in statistics stretching back to Daniels (1954), and they have been studied intensively in recent years. Reid (1988) reviews their use in statistical inference, while Jensen (1995) and Field and Ronchetti (1990) give longer accounts; see also Barndorff-Nielsen and Cox (1989). Jensen (1992) gives a direct account of the distribution function approximation we use. Saddlepoint approximation for permutation tests was proposed by Daniels (1955) and further discussed by Robinson (1982). Davison and Hinkley (1988), Daniels and Young (1991), and Wang (1993b)
9.7 Problems
1 Under the balanced bootstrap the descending factorial moments of the f*_rj are given by (9.53), with u and v ranging over the distinct values of row and column subscripts on the left-hand side of (9.53).
(a) Check the first- and second-order moments for the f*_rj at (9.9), and verify that the values in Problem 2.19 are recovered as R → ∞.
(b) Use the results from (a) to obtain the mean of the bias estimate under balanced resampling.
(c) Now suppose that T* is a linear statistic, and let V* = (R − 1)^{-1} Σ_r (T*_r − T̄*)² be the estimated variance of T based on the bootstrap samples. Show that the mean of V* under multinomial sampling is asymptotically equivalent to the mean under hypergeometric sampling, as R increases.
(Section 9.2.1; Appendix A; Haldane, 1940; Davison, Hinkley and Schechtman, 1986)
2^{1/2} Γ(n/2) / {n^{1/2} Γ((n−1)/2)}.
But suppose that we estimate the bias by parametric resampling; that is, we generate samples y*_1,...,y*_n from the N(ȳ, t²) distribution. Show that the raw and adjusted bootstrap estimates of B can be expressed as
B_R = t ( R^{-1} Σ_{r=1}^R X_r^{1/2} − 1 ),
where the X_r are independent χ²_{n−1}/n random variables, and a corresponding adjusted form.
where δ(u) = 1 if u = 0 and equals zero otherwise, to show that the ξ-balanced design is balanced in terms of the f*_rj. Is the converse true?
(c) Suppose that we have a regression model Y_j = βx_j + ε_j, where the independent errors ε_j have mean zero and variance σ². We estimate β by T = Σ Y_j x_j / Σ x_j². Let T* = Σ (t x_j + ε*_j) x_j / Σ x_j² denote a resampled version of T, where ε*_j is selected randomly from the centred residuals e_j − ē, with e_j = y_j − t x_j and ē = n^{-1} Σ e_j. Show that the average value of T* equals t if R values of T* are generated from a ξ-balanced design, but not necessarily if the design is balanced in terms of the f*_rj.
(Section 9.2.1; Graham et al., 1990)
8 Suppose that you wish to estimate the normal tail probability ∫ I{z ≤ a} φ(z) dz, where φ(·) is the standard normal density function and I{A} is the indicator of the event A, by importance sampling from a distribution H(·).
Let H be the normal distribution with mean μ and unit variance. Show that the maximum efficiency is
Φ(a){1 − Φ(a)} / [ exp(μ²)Φ(a + μ) − Φ(a)² ],
where μ is chosen to minimize exp(μ²)Φ(a + μ). Use the fact that Φ(z) ≈ −φ(z)/z for z < 0 to give an approximate value for μ, and plot the corresponding approximate efficiency for −3 ≤ a ≤ 0. What happens when a > 0?
(Section 9.4.1)
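A numerical version of the plot (our sketch; eff is a hypothetical helper that minimizes exp(μ²)Φ(a + μ) directly rather than using the approximation) is:

# A sketch: maximum efficiency computed numerically for each a.
eff <- function(a)
{ g <- function(mu) exp(mu^2)*pnorm(a + mu)
  gmin <- optimize(g, c(-10, 0))$objective
  pnorm(a)*(1 - pnorm(a))/(gmin - pnorm(a)^2) }
a <- seq(-3, -0.1, length=50)
plot(a, sapply(a, eff), type="l", xlab="a", ylab="efficiency")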
Let t*_(1) ≤ ··· ≤ t*_(R) denote the order statistics of the t*_r, set
10 Suppose that T* has a linear approximation T*_L, and let p̃ be the distribution on y_1,...,y_n with probabilities p_j ∝ exp{λ l_j/(n v_L^{1/2})}, where v_L = n^{-2} Σ l_j². Find the moment-generating function of T*_L under sampling from p̃, and hence show that in this case T*_L is approximately N(t + λ v_L^{1/2}, v_L). You may assume that T*_L is approximately N(t, v_L) when λ = 0.
(Section 9.4.1; Johns, 1988; Hinkley and Shi, 1989)
l_j(p) = (d/dε) t{(1 − ε)p + ε δ_j} |_{ε=0}.
(c) For the ratio estimates in Example 2.22, compare numerically t*_L, t*_{L,p̃}, and the quadratic approximation
t*_Q = t + n^{-1} Σ_{j=1}^n f*_j l_j + (1/2) n^{-2} Σ_{j=1}^n Σ_{k=1}^n f*_j f*_k q_{jk}
with t*.
(Sections 2.7.4, 3.10.2, 9.4.1; Hesterberg, 1995a)
Σ_r m(y_r)w(y_r) / Σ_r w(y_r) = (μ + R^{-1/2}ε_1) / (1 + R^{-1/2}ε_0),
where ε_1 = R^{-1/2} Σ_r {m(y_r)w(y_r) − μ} and ε_0 = R^{-1/2} Σ_r {w(y_r) − 1}. Show that this implies that
var(μ̂_{H,rat}) ≈ R^{-1} var{m(Y)w(Y) − μ w(Y)}.
by simulating from density h(y) = βe^{−βy}, y > 0, β > 0. Give w(y) and show that E{m(Y)w(Y)} = μ for any β and θ, but that var(μ̂_{H,rat}) is only finite when 0 < β < 2θ. Calculate var{m(Y)w(Y)}, cov{m(Y)w(Y), w(Y)}, and var{w(Y)}.
Plot the asymptotic efficiencies var(μ̂_{H,raw})/var(μ̂_{H,rat}) and var(μ̂_{H,raw})/var(μ̂_{H,reg}) as functions of β for θ = 2 and k = 0, 1, 2, 3. Discuss your findings.
(Section 9.4.2; Hesterberg, 1995b)
13 Suppose that an application of importance resampling to a statistic T* has resulted in estimates t*_1 ≤ ··· ≤ t*_R and associated weights w*_r, and that the importance re-weighting regression estimate of the CDF of T* is required. Let A be the R × R matrix whose (r, s) element is w*_s I(t*_s ≤ t*_r) and B be the R × 2 matrix whose rth row is (1, w*_r). Show that the regression estimate of the CDF at t*_1,...,t*_R equals (1, 1)(BᵀB)^{-1}BᵀA.
(Section 9.4.2)
14 (a) Let I_k = (I_{k1},...,I_{kn}), k = 1,...,nR, denote independent identically distributed multinomial random variables with denominator 1 and probability vector p = (p_1,...,p_n). Show that S_{nR} = Σ_{k=1}^{nR} I_k has a multinomial distribution with denominator nR and probability vector p, and that the conditional distribution of I_{nR} given that S_{nR} = q is multinomial with denominator 1 and mean vector (nR)^{-1}q, where q = (R_1,...,R_n) is a fixed vector. Show also that
15 For the bootstrap recycling estimate of bias described in Example 9.12, consider the case T = Ȳ with the parametric model Y ~ N(θ, 1). Show that if H is taken to be the N(ȳ, a) distribution, then the simulation variance of the recycling estimate of C is approximately
(1/n) [ 1/R + {a²/(2a − 1)}^{(n−1)/2} { n(n−1) / ((2a − 3)^{3/2} R N) + a² / (8(a − 1)^{3/2} N) } ],
provided a > 3/2. Compare this to the simulation variance when ordinary double bootstrap methods are used.
What are the implications for nonparametric double bootstrap calculations? Investigate the use of defensive mixtures for H in this problem.
(Section 9.4.4; Ventura, 1997)
16 Consider exponential tilting for a statistic whose linear approximation is
T*_L = t + Σ_{s=1}^S n_s^{-1} Σ_{j=1}^{n_s} f*_{sj} l_{sj},
where the (f*_{s1},...,f*_{sn_s}), s = 1,...,S, are independent sets of multinomial frequencies.
(a) Show that the cumulant-generating function of T*_L is
K(ξ) = ξt + Σ_{s=1}^S n_s log{ n_s^{-1} Σ_{j=1}^{n_s} exp(ξ l_{sj}/n_s) }.
20 (a) Show that the bootstrap correlation coefficient t* based on data pairs (x_j, z_j), j = 1,...,n, may be expressed as the solution to the estimating equation (9.40) with
a_j(t, s) = ( x_j − s_1, z_j − s_2, (x_j − s_1)² − s_3, (z_j − s_2)² − s_4, (x_j − s_1)(z_j − s_2) − t(s_3 s_4)^{1/2} )ᵀ,
where sᵀ = (s_1, s_2, s_3, s_4), and show that the Jacobian J(t, s; ξ) = n⁵(s_3 s_4)^{1/2}. Obtain the quantities needed for the marginal saddlepoint approximation (9.43) to the density of T*.
(b) What further quantities would be needed for saddlepoint approximation to the marginal density of the studentized form of T*?
(Section 9.5.3; Davison, Hinkley and Worton, 1995; DiCiccio, Martin and Young, 1994)
21 Let T*_1 be a statistic calculated from a bootstrap sample in which y_j appears with frequency f*_j (j = 1,...,n), and suppose that the linear approximation to T*_1 is T*_{L1} = t + n^{-1} Σ f*_j l_j, where l_1 ≤ l_2 ≤ ··· ≤ l_n. The statistic T*_2 antithetic to T*_1 is calculated from the bootstrap sample in which y_j appears with frequency f*_{n+1−j}.
(a) Show that if T*_1 and T*_2 are antithetic,
and deduce that when T is the sample average and F is the exponential distribution the large-sample performance gain of antithetic resampling is 6/(12 − π²) ≈ 2.8.
(c) What happens if F is symmetric? Explain qualitatively why.
(Hall, 1989a)
where z_0, z_1, z_2 are unknown but a is known; often a = 1/2. Suppose that we resample from the EDF F̂, but with sample sizes n_0, n_1, where 1 ≤ n_0 < n_1 ≤ n, instead of the usual n, giving simulation estimates z*(n_0), z*(n_1) of z(n_0), z(n_1).
(a) Show that z*(n) can be estimated by
9.8 Practicals
1 For ordinary bootstrap sampling, balanced resampling, and balanced resampling within strata:
y <- rnorm(10)
junk.fun <- function(y, i) var(y[i])
junk <- boot(y, junk.fun, R=9)
boot.array(junk)
apply(junk$t, 2, sum)
junk <- boot(y, junk.fun, R=9, sim="balanced")
boot.array(junk)
apply(junk$t, 2, sum)
junk <- boot(y, junk.fun, R=9, sim="balanced",
             strata=rep(1:2, c(5,5)))
boot.array(junk)
apply(junk$t, 2, sum)
Now use balanced resampling in earnest to estimate the bias for the gravity data weighted average:
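The original code block is missing here; a plausible sketch (ours, with hypothetical names grav.fun, grav.ord, grav.bal, and the weighted average taken with inverse-variance weights across series, an assumption) is:

# A sketch, not the book's code: ordinary and balanced resampling for a
# weighted average of the gravity series means.
grav.fun <- function(data, i)
{ d <- data[i,]
  m <- tapply(d$g, d$series, mean)
  v <- tapply(d$g, d$series, var)/table(d$series)
  sum(m/v)/sum(1/v) }
grav.ord <- boot(gravity, grav.fun, R=999, strata=gravity$series)
grav.bal <- boot(gravity, grav.fun, R=999, strata=gravity$series,
                 sim="balanced")
mean(grav.ord$t) - grav.ord$t0    # bias estimate, ordinary sampling
mean(grav.bal$t) - grav.bal$t0    # bias estimate, balanced sampling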
What are the efficiency gains due to using balanced simulation and post-simulation adjustment for bias estimation here? Now a calculation to see the correlation between T* and its linear approximation:
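Again the original block is missing; a sketch (ours), using influence values estimated by regression and the frequency array, is:

# A sketch, not the book's code: correlation of t* with its linear
# approximation t0 + sum(f*_j l_j / n).
grav.L <- empinf(grav.ord)                   # influence values by regression
f <- boot.array(grav.ord)                    # R x n frequency array
tL <- grav.ord$t0 + f %*% grav.L/nrow(gravity)
cor(grav.ord$t[,1], tL)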
Finally, calculations for the estimates of bias, variance and quantiles using the linear approximation as control variate:
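The original block is also missing here; the control function of the boot library (our assumption that this is the intended tool) gives these estimates:

# A sketch, not the book's code: control-variate estimates from the
# ordinary simulation, using the influence values computed above.
grav.ctrl <- control(grav.ord, L=grav.L, alpha=c(0.05,0.95))
grav.ctrl$bias; grav.ctrl$var      # adjusted bias and variance estimates
grav.ctrl$quantiles                # quantile estimates via the control method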
exp.tilt(tau.L, theta=c(14,18), t0=16.16)
tau.tilt <- tilt.boot(tau, tau.w, R=c(199,100,100), strata=tau$decay,
                      stype="w", L=tau.L, alpha=c(0.05,0.95))
split.screen(c(1,2))
screen(1); plot(tau.tilt$t, imp.weights(tau.tilt), log="y")
screen(2); plot(tau.tilt$t, imp.weights(tau.tilt, def=F), log="y")
imp.quantile(tau.tilt, alpha=c(0.05,0.95))
imp.quantile(tau.tilt, alpha=c(0.05,0.95), def=F)
The same can be done with frequency smoothing, but then the initial value of R must be larger:
tau.freq <- tilt.boot(tau, tau.w, R=c(499,250,250),
                      strata=tau$decay, stype="w", tilt=F, alpha=c(0.05,0.95))
imp.quantile(tau.freq, alpha=c(0.05,0.95))
For balanced importance resampling we simply add sim="balanced" to the arguments of tilt.boot. For a small simulation study to see the potential efficiency gains over ordinary sampling, we compare the performance of ordinary sampling and importance resampling with and without balance, in estimating the 0.1 and 0.9 quantiles of the distribution of t*.
tau.test <- NULL
for (irep in 1:10)
{ tau.boot <- boot(tau, tau.w, R=199, stype="w",
                   strata=tau$decay)
  q.ord <- sort(tau.boot$t)[c(20,180)]
  tau.tilt <- tilt.boot(tau, tau.w, R=c(99,50,50),
                        strata=tau$decay, stype="w", L=tau.L,
                        alpha=c(0.1,0.9))
  q.tilt <- imp.quantile(tau.tilt, alpha=c(0.1,0.9))$raw
  tau.bal <- tilt.boot(tau, tau.w, R=c(99,50,50),
                       strata=tau$decay, stype="w", L=tau.L,
                       alpha=c(0.1,0.9), sim="balanced")
  q.bal <- imp.quantile(tau.bal, alpha=c(0.1,0.9))$raw
  tau.test <- rbind(tau.test, c(q.ord, q.tilt, q.bal)) }
sqrt(apply(tau.test, 2, var))
What are the efficiency gains of the two importance resampling methods?
Consider the bias and standard deviation functions for the correlation of the claridge data (Example 4.9). To estimate them, we perform a double bootstrap and plot the results, as follows.
clar.fun <- function(data, f)
{ r <- corr(data, f/sum(f))
  n <- nrow(data)
  d <- data[rep(1:n, f), ]
  us <- (d[,1] - mean(d[,1]))/sqrt(var(d[,1]))
  xs <- (d[,2] - mean(d[,2]))/sqrt(var(d[,2]))
To obtain recycled estimates using only the results from a single bootstrap, and to compare them with those from the double bootstrap:
Do you think these results are close enough to those from the double bootstrap? Compare the values of θ̂ in IS.clar[,1] to the values of θ* = t(F̂*_θ) in IS.clar[,2].
r5 <- apply(matrix(capability$y, 15, 5, byrow=T), 1,
            function(x) diff(range(x)))
m <- 300; top <- 10; bot <- 4
sad <- matrix(, m, 3)
th <- seq(bot, top, length=m)
for (i in 1:m)
{ sp <- saddle(A=psi(th[i], r5), u=0)
  sad[i,] <- c(th[i], sp$spa[1]*det.psi(th[i], r5, xi=sp$zeta.hat),
               sp$spa[2]) }
sad <- sad[!is.na(sad[,2]) & !is.na(sad[,3]), ]
plot(sad[,1], sad[,2], type="l", xlab="theta hat", ylab="PDF")
To obtain the quantiles of the distribution of θ̂*, we use the following code; here capab.t0 contains θ̂ and its standard error.
theta.fun <- function(d, w, k=2.236*(5.79-5.49)) k*sum(w)/sum(d*w)
capab.v <- var.linear(empinf(data=r5, statistic=theta.fun))
capab.t0 <- c(2.236*(5.79-5.49)/mean(r5), sqrt(capab.v))
Afn <- function(t, data, k=2.236*(5.79-5.49)) k - t*data
ufn <- function(t, data, k=2.236*(5.79-5.49)) 0
capab.sp <- saddle.distn(A=Afn, u=ufn, t0=capab.t0, data=r5)
capab.sp
We can use the same ideas to apply the block bootstrap. Now we take b = 15 of the n − l + 1 blocks of successive observations of length l = 5. We concatenate them to form a new series, and then take the ranges of each block of successive observations. This is equivalent to selecting b ranges from among the n − l + 1 possible ranges, with replacement. The quantiles of the saddlepoint approximation to the distribution of θ̂* under this scheme are found as follows.
r5 <- NULL
for (j in 1:71) r5 <- c(r5, diff(range(capability$y[j:(j+4)])))
Afn <- function(t, data, k=2.236*(5.79-5.49)) cbind(k - t*data, 1)
ufn <- function(t, data, k=2.236*(5.79-5.49)) c(0, 15)
capab.sp1 <- saddle.distn(A=Afn, u=ufn, wdist="p",
                          type="cond", t0=capab.t0, data=r5)
capab.sp1$quantiles
Compare them with the quantiles above. How do they differ? Why?
5 To apply the saddlepoint approximation given in Problem 9.17 to the paired comparison data of Problem 4.7, and obtain a one-sided significance level Pr*(D* ≥ d):
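The code itself is not shown here; one direct way (our sketch, written from first principles rather than Problem 9.17's exact formulation, with D* = Σ ε_j y_j for random signs ε_j = ±1) is:

# A sketch: randomization CGF K(xi) = sum log cosh(xi y_j), with the
# significance level from the tail approximation (9.29); d must be an
# attainable nonzero value of D*.
pval.saddle <- function(y, d)
{ K  <- function(xi) sum(log(cosh(xi*y)))
  K1 <- function(xi) sum(y*tanh(xi*y))
  K2 <- function(xi) sum(y^2/cosh(xi*y)^2)
  xi <- uniroot(function(x) K1(x) - d, c(-20, 20))$root  # saddlepoint
  w <- sign(xi)*sqrt(2*(xi*d - K(xi)))
  v <- xi*sqrt(K2(xi))
  1 - (pnorm(w) + dnorm(w)*(1/w - 1/v)) }   # Pr*(D* >= d)
# e.g. with the maize differences y: pval.saddle(y, sum(y))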
10.1 Likelihood
The likelihood function is central to inference in parametric statistical models. Suppose that data y are believed to have come from a distribution F_ψ, where ψ is an unknown p × 1 vector parameter. Then the likelihood for ψ is the corresponding density evaluated at y, namely
L(ψ) = f_ψ(y).
The log likelihood is ℓ(ψ) = log L(ψ), and the likelihood ratio statistic comparing the maximum likelihood estimate ψ̂ with another value ψ is
W(ψ) = 2{ℓ(ψ̂) − ℓ(ψ)}.
in parametric models. One special feature is that the likelihood determines the shape of confidence regions when ψ is a vector.
Unlike many of the confidence interval methods described in Chapter 5, likelihood provides a natural basis for the combination of information from different experiments. If we have two independent sets of data, y and z, that bear on the same parameter, the overall likelihood is simply L(ψ) = f(y | ψ)f(z | ψ), and tests and confidence intervals concerning ψ may be based on this. This type of combination is particularly useful in applications where several independent experiments are linked by common parameters; see Practical 10.1.
In applications we can often write ψ = (θ, λ), where the components of θ are of primary interest, while the so-called nuisance parameters λ are of secondary concern. In such situations inference for θ is based on the profile likelihood
L_p(θ) = max_λ L(θ, λ),   (10.1)
nonparametric maximum likelihood estimate t = t(F̂) for θ (Problem 10.1). The EDF is a multinomial distribution with denominator one and probability vector (n^{-1},...,n^{-1}) attached to the y_j. We can think of this distribution as embedded in a more general multinomial distribution with arbitrary probability vector p = (p_1,...,p_n) attached to the data values. If F is restricted to be such a multinomial distribution, then we can write t(p) rather than t(F) for the function which defines θ. The special multinomial probability vector (n^{-1},...,n^{-1}) corresponding to the EDF is p̂, and t = t(p̂) is the nonparametric maximum likelihood estimate of θ. This multinomial representation was used earlier in Sections 4.4 and 5.4.2.
Restricting the model to be multinomial on the data values with probability vector p, the parameter value is θ = t(p) and the likelihood for p is L(p) = Π_{j=1}^n p_j^{f_j}, with f_j equal to the frequency of value y_j in the sample. But, assuming there are no tied observations, all f_j are equal to 1, so that L(p) = p_1 × ··· × p_n: this is the analogue of L(ψ) in the parametric case. We are interested only in θ = t(p), for which we can use the profile likelihood
L_EL(θ) = sup_{p : t(p) = θ} Π_{j=1}^n p_j,   (10.3)
which is called the empirical likelihood for θ. Notice that the value of θ which maximizes L_EL(θ) corresponds to the value of p maximizing L(p) with only the constraint Σ p_j = 1, that is p̂. In other words, the empirical likelihood is maximized by the nonparametric maximum likelihood estimate t.
In (10.3) we maximize over the p_j subject to the constraints imposed by fixing t(p) = θ and Σ p_j = 1, which is effectively a maximization over n − 2 quantities when θ is scalar. Remarkably, although the number of parameters over which we maximize is comparable with the sample size, the approximate distributional results from the parametric situation carry over. Let θ_0 be the true value of θ, with T the maximum empirical likelihood estimator. Then under mild conditions on F and in large samples, the empirical likelihood ratio statistic W_EL(θ_0) = 2{ℓ_EL(t) − ℓ_EL(θ_0)} has an approximate chi-squared distribution.
[Figure 10.1. Likelihood and log likelihoods for the mean of the air-conditioning data: empirical (dots), exponential (dashes), and gamma profile (solid). Values of θ whose log likelihood lies above the horizontal dotted line in the right panel are contained in an asymptotic 95% confidence set for the true mean.]
For the mean θ of the y_j, the constrained maximization can be done explicitly: the maximizing probabilities are p_j ∝ {1 + η_θ(y_j − θ)}^{-1}, where the Lagrange multiplier η_θ satisfies
Σ_{j=1}^n (y_j − θ) / {1 + η_θ(y_j − θ)} = 0.   (10.4)
Thus the log empirical likelihood, normalized to have maximum zero, is
ℓ_EL(θ) = −Σ_{j=1}^n log{1 + η_θ(y_j − θ)}.   (10.5)
This is maximized at the sample average θ̂ = ȳ, where η_ȳ = 0 and p_j = n^{-1}. It is undefined outside (min y_j, max y_j), because no multinomial distribution on the y_j can have mean outside this interval.
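Equations (10.4) and (10.5) are easily programmed; a sketch (ours, not the book's code; el.loglik is a hypothetical name, and θ must lie strictly inside the data range) is:

# A sketch: log empirical likelihood (10.5) for a mean, solving (10.4)
# for eta within the interval that keeps all p_j in (0, 1).
el.loglik <- function(theta, y)
{ n <- length(y)
  g <- function(eta) sum((y - theta)/(1 + eta*(y - theta)))
  lo <- (1/n - 1)/(max(y) - theta)   # keep 1 + eta(y_j - theta) > 1/n
  hi <- (1/n - 1)/(min(y) - theta)
  eta <- uniroot(g, c(lo + 1e-8, hi - 1e-8))$root
  -sum(log(1 + eta*(y - theta))) }
# e.g. sapply(seq(min(y)+0.5, max(y)-0.5, length=99), el.loglik, y=y)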
Figure 10.1 shows L_EL(θ), which is calculated by successive solution of (10.4) to yield η_θ at values of θ small steps apart. The exponential likelihood and gamma profile likelihood for the mean are also shown. As we should expect, the gamma profile likelihood is always higher than the exponential likelihood, which corresponds to the gamma likelihood but with shape parameter κ = 1. Both parametric likelihoods are wider than the empirical likelihood. Direct comparison between parametric and empirical likelihoods is misleading, however, since they are based on different models, and here and in later figures
we give the gamma likelihood purely as a visual reference. The circumstances in which empirical and parametric likelihoods are close are discussed in Problem 10.3.
The endpoints of an approximate 95% confidence interval for θ are obtained by reading off where ℓ_EL(θ) = −(1/2)c_{1,0.95}, where c_{d,α} is the α quantile of the chi-squared distribution with d degrees of freedom. The interval is (43.3, 92.3), which compares well with the nonparametric BC_a interval of (42.4, 93.2). The likelihood ratio intervals for the exponential and gamma models are (44.1, 98.4) and (44.0, 98.6).
Figure 10.2 shows the empirical likelihood and gamma profile likelihood ratio statistics for 500 exponential samples of size 24. Though good for the parametric statistic, the chi-squared approximation is poor for W_EL, whose estimated 95% quantile is 5.92 compared to the χ²_1 quantile of 3.84. This suggests strongly that the empirical likelihood-based confidence interval given above is too narrow. However, the simulations are only relevant when the data are exponential, in which case we would not be concerned with empirical likelihood.
We can use the bootstrap to estimate quantiles for W_EL(θ_0), by setting θ_0 = ȳ and then calculating W*_EL(θ_0) for bootstrap samples from the original data. The resulting Q-Q plot is less extreme than the left panel of Figure 10.2, with a 95% quantile estimate of 4.08 based on 999 bootstrap samples; the corresponding empirical likelihood ratio interval is (42.8, 93.3). With a sample of size 12, 41 of the 999 simulations gave infinite values of W*_EL(θ_0) because ȳ did not lie within the limits (min y*, max y*) of the bootstrap sample. With a sample of size 24, this problem did not arise. ■
Vector parameter
In principle, empirical likelihood is straightforward to construct when θ has dimension d ≤ n − 1. Suppose that θ = (θ_1,...,θ_d)ᵀ is determined implicitly as the root of the simultaneous equations
∫ u_i(θ; y) dF(y) = 0,  i = 1,...,d,
where u(θ; y) is a d × 1 vector whose ith element is u_i(θ; y). Then the estimate θ̂ is the solution to the d estimating equations
n^{-1} Σ_{j=1}^n u(θ; y_j) = 0.   (10.6)
An extension of the argument in Example 10.1, involving the vector of Lagrange multipliers η_θ = (η_{θ1},...,η_{θd})ᵀ, shows that the log empirical likelihood is
ℓ_EL(θ) = −Σ_{j=1}^n log{1 + η_θᵀ u_j(θ)},   (10.7)
where u_j(θ) = u(θ; y_j) and η_θ satisfies
Σ_{j=1}^n u_j(θ) / {1 + η_θᵀ u_j(θ)} = 0.   (10.8)
The simplest approximate confidence region for the true θ is the set of values such that W_EL(θ) ≤ c_{d,1−α}, but in small samples it will again be preferable to replace the χ²_d quantile by its bootstrap estimate.
which is the axis given by the eigenvector corresponding to the largest eigenvalue of E(YYᵀ). The data consist of positions on the lower half-sphere, or equivalently the sample values of a(θ, φ), which we denote by y_j, j = 1,...,n. In order to set an empirical likelihood confidence region for the mean polar axis, or equivalently for the spherical polar coordinates (θ, φ), we let
b(θ, φ) = (sin θ cos φ, sin θ sin φ, −cos θ)ᵀ,  c(θ, φ) = (−sin φ, cos φ, 0)ᵀ
denote the unit vectors orthogonal to a(θ, φ). Then since the eigenvectors of E(YYᵀ) may be taken to be orthogonal, the population values of (θ, φ) satisfy simultaneously the equations
The left panel of Figure 10.3 shows the empirical likelihood contours based on (10.7) and (10.8), in the square region shown in Figure 5.10. The corresponding contours for Q_EEF(θ) are shown on the right. The dashed lines show the boundaries of the 95% confidence regions for (θ, φ) using bootstrap calibration; these differ little from those based on the asymptotic χ²_2 distribution. In each panel the dotted ellipse is a 95% confidence region based on a studentized form of the sample mean polar axis, for which the contours are ellipses. The elliptical contours are appreciably tighter than those for the likelihood-based statistics.
Table 10.1 compares theoretical and bootstrap quantiles for several likelihood-based statistics and the studentized bootstrap statistic, Q, for the full data and for a random subset of size 20. For the full data, the quantiles for Q_EEF and W_EL are close to those for the large-sample distribution. For the subset, Q_EEF is close to its nominal distribution, but the other statistics seem considerably more variable:
p      χ²_2    full data                      subset of size 20
0.80   3.22    3.23  3.40  3.37  3.15        3.67  3.70  3.61  3.15
0.90   4.61    4.77  4.81  5.05  4.69        5.39  5.66  5.36  4.45
0.95   5.99    6.08  6.18  6.94  6.43        7.17  7.99  10.82 7.03
each first-level bootstrap value t*_r is assigned the kernel estimate of likelihood
L(t*_r) = (Mε)^{-1} Σ_{m=1}^M w{(t − t**_rm)/ε}.   (10.11)
On repeating this procedure for R different first-level bootstrap samples, we obtain R approximate likelihood values L(t*_r), r = 1,...,R, from which a smooth likelihood curve L_B(θ) can be produced by nonparametric smoothing.
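A direct, if slow, implementation of this double-bootstrap scheme is easy to sketch (our code, not the book's; boot.lik is hypothetical, and the statistic is taken to be the average):

# A sketch: bootstrap likelihood values (10.11) for a sample average,
# with a normal kernel of bandwidth eps.
boot.lik <- function(y, R=200, M=500, eps=sqrt(var(y)/length(y)))
{ n <- length(y); t0 <- mean(y)
  tstar <- numeric(R); L <- numeric(R)
  for (r in 1:R)
  { ys <- sample(y, n, replace=T)
    tstar[r] <- mean(ys)
    tss <- replicate(M, mean(sample(ys, n, replace=T)))
    L[r] <- mean(dnorm((t0 - tss)/eps))/eps }   # kernel estimate at t
  list(tstar=tstar, L=L) }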
Computational improvements
There are various ways to reduce the large amount of computation needed to obtain a smooth curve. One, which was used earlier in Section 3.9.2, is to generate second-level samples from smoothed versions of the first-level samples. As before, probability distributions on the values y_1,...,y_n are denoted by their probability vectors; the smoothed distribution centred at θ⁰ has probabilities
p*_j(θ⁰) ∝ Σ_{r=1}^R w{(t*_r − θ⁰)/ε} f*_rj,  j = 1,...,n,   (10.12)
where typically w(·) is the standard normal density and ε = v_L^{1/2}; as usual v_L is the nonparametric delta method variance estimate for t. The distribution p*(θ⁰) will have parameter value not θ⁰ but θ = t{p*(θ⁰)}. With the understanding that θ is defined in this way, we shall for simplicity write p*(θ) rather than p*(θ⁰). For a fixed collection of R first-level samples and bandwidth ε > 0, the probability vectors p*(θ) change gradually as θ varies over its range of interest.
Second-level bootstrap sampling now uses vectors p*(θ) as sampling distributions on the data values, in place of the p̂*_r. The second-level sample values t** are then used in (10.11) to obtain L_B(θ). Repeating this calculation for, say, 100 values of θ in the range t ± 4v_L^{1/2}, followed by smooth interpolation, should give a good result.
Experience suggests that the value ε = v_L^{1/2} is safe to use in (10.12) if the t*_r are roughly equally spaced, which can be arranged by weighted first-level sampling, as outlined in Problem 10.6.
A way to reduce further the amount of calculation is to use recycling, as described in Section 9.4.4. Rather than generate second-level samples from each p*(θ) of interest, one set of M samples can be generated using a distribution p̃ on the data values, and the associated values t**_1,...,t**_M calculated. Then, following the general re-weighting method (9.24), the likelihood values are calculated as
L_B(θ) = (Mε)^{-1} Σ_{m=1}^M w{(t − t**_m)/ε} Π_{j=1}^n {p*_j(θ)/p̃_j}^{f**_mj},   (10.13)
where f**_mj is the frequency of the jth case in the mth second-level bootstrap sample. One simple choice for p̃ is the EDF p̂. In special cases it will be possible to replace the second level of sampling by use of the saddlepoint approximation method of Section 9.5. This would give an accurate and smooth approximation to the density of T** for sampling from each p*(θ).
the data from Example 10.1. The solid points in the left panel of Figure 10.4 are bootstrap likelihood values for the mean θ for 200 resamples, obtained by saddlepoint approximation. This replaces the kernel density estimate (10.11) and avoids the second level of resampling, but does not remove the variation in estimated likelihood values for different bootstrap samples with similar values of t*_r. A locally quadratic nonparametric smoother (on the log likelihood scale) could be used to produce a smooth likelihood curve from the values of L(t*_r), but another approach is better, as we now describe.
The solid line in the left panel of Figure 10.4 interpolates values obtained by applying the saddlepoint approximation using probabilities (10.12) at a few values of θ⁰. Here the values of t*_r are generated at random, and we have taken ε = 0.5v_L^{1/2}; the results depend little on the value of ε.
The log bootstrap likelihood is very close to the log empirical likelihood, with 95% confidence interval (43.8, 92.1). ■
Subscripts here indicate sample size. Note that F̂ and t are the same for both sample sizes, but quantities such as variance estimates will depend upon sample size. Note also that the implied prior is estimated by L_{2n}(θ)/L_n(θ).
In practice the distribution of Z must be estimated, in general by bootstrap simulation. With kernel density estimates k̂_n(·) and k̂_{2n}(·) constructed from R bootstrapped values of the pivot at sample sizes n and 2n, this gives

L_z(\theta) = \hat k_{2n}\{z_{2n}(\theta)\} / \hat k_n\{z_n(\theta)\}, \quad (10.16)

where

\hat k_n(z) = \frac{1}{Rh} \sum_{r=1}^{R} w\{(z - z_r^*)/h\}. \quad (10.17)

In practice these values can be computed via spline smoothing from a dense set of values of the kernel density estimates k̂_n(z).
There are difficulties with this method. First, just as with bootstrap likelihood, it is necessary to use a large number of simulations R. A second difficulty is that of ascertaining whether or not the chosen Z is a pivot, or else what prior transformation of T could be used to make Z pivotal; see Section 5.2.2. This is especially true if we extend (10.16) to vector θ, which is theoretically possible. Note that if the studentized bootstrap is applied to a transformation of t rather than t itself, then the factor |ż(θ)| in (10.14) can be ignored when applying (10.16).
The ABC confidence set method of Section 5.4 yields the likelihood

L_{ABC}(\theta) = \exp\{-\tfrac{1}{2} u^2(\theta)\}, \qquad u(\theta) = \frac{2 z(\theta)}{1 + 2a\,z(\theta) + \{1 + 4a\,z(\theta)\}^{1/2}},

with z(θ) = (t − θ)/v_L^{1/2} as before. This is called the implied likelihood. Based on the discussion in Section 5.4, one would expect results similar to those from applying (10.16).
A further modification is to multiply L_{ABC}(θ) by exp{(c v^{1/2} − b)θ/v_L}, with b the bias estimate defined in (5.49). The effect of this modification is to make the likelihood even more compatible with the Bayesian interpretation, somewhat akin to the adjusted profile likelihood (10.2).
Example 10.5 (Air-conditioning data) Figure 10.5 shows confidence likelihoods for the two sets of air-conditioning data in Table 5.6, samples of size 12 and 24 respectively. The implied likelihoods L_{ABC}(θ) are similar to the empirical likelihoods for these data. The pivotal likelihood L_z(θ), calculated from R = 9999 samples with bandwidths equal to 1.0 in (10.17), is clearly quite unstable for the smaller sample size. This also occurred with bootstrap likelihood for these data and seems to be due to the discreteness of the simulations with so small a sample. ■
10.5 Bayesian Bootstrap

Suppose that the data are sampled from a distribution putting probabilities p_1, ..., p_N on possible values u_1, ..., u_N, so that

\Pr(Y = u_j \mid p_1, \ldots, p_N) = p_j, \qquad \sum_{j=1}^{N} p_j = 1.

If our data consist of the random sample y_1, ..., y_n, and f_j counts how many y_i equal u_j, the probability of the observed data given the values of the p_j is proportional to \prod_{j=1}^{N} p_j^{f_j}. If the prior information regarding the p_j is summarized in the prior density π(p_1, ..., p_N), the joint posterior density of the p_j given the data is proportional to

\pi(p_1, \ldots, p_N) \prod_{j=1}^{N} p_j^{f_j},

and this induces a posterior density for θ. Its calculation is particularly straightforward when π is the Dirichlet density, in which case the prior and posterior densities are respectively proportional to

\prod_{j=1}^{N} p_j^{a_j} \quad \text{and} \quad \prod_{j=1}^{N} p_j^{a_j + f_j},

so the posterior density is Dirichlet also. Bayesian bootstrap samples and the corresponding values of θ are generated from the joint posterior density for the p_j, as follows.
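In outline, the generation step can use the fact that normalized gamma variables have a Dirichlet distribution. The following sketch (with illustrative data y and prior constant a, and assuming the y_j are distinct so that all f_j = 1) draws R posterior values of the mean:

a <- 2; R <- 999
n <- length(y)
theta.post <- numeric(R)
for (r in 1:R) {
  g <- rgamma(n, shape=a+2)   # posterior is Dirichlet with parameters a+f_j+1 = a+2
  p <- g/sum(g)
  theta.post[r] <- sum(p*y)   # posterior draw of theta
}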
The left panel of Figure 10.6 shows kernel density estimates of the posterior density of θ based on R = 999 simulations with all the a_j equal to a = −1, 2, 5, and 10. The increasingly strong prior information results in posterior densities that are more and more sharply peaked.
The right panel shows the implied priors on θ, obtained using the data-doubling device from Section 10.4. The priors seem highly informative, even when a = −1. ■
The primary use of the Bayesian bootstrap is likely to be for imputation when data are missing, rather than in inference for θ per se. There are theoretical advantages to such weighted bootstraps, in which the probabilities P_j^* vary smoothly, but as yet they have been little used in applications.
10.7 Problems
1 Consider empirical likelihood for a parameter θ = t(F) defined by an estimating equation ∫ u(t; y) dF(y) = 0, based on a random sample y_1, ..., y_n.
(a) Use Lagrange multipliers to maximize Σ log p_j subject to the conditions Σ p_j = 1 and Σ p_j u(t; y_j) = 0, and hence show that the log empirical likelihood is given by (10.7) with d = 1. Verify that the empirical likelihood is maximized at the sample EDF, when θ = t(F̂).
(b) Suppose that u(t; y) = y − t and n = 2, with y_1 < y_2. Show that η_θ can be written as (θ − ȳ)/{(θ − y_1)(y_2 − θ)}, and sketch it as a function of θ.
(Section 10.2.1)
2 Under the empirical exponential family likelihood of Section 10.2.2, the probabilities are

\pi_j(\theta) = \Pr(Y = y_j) = \frac{\exp\{\xi u(\theta; y_j)\}}{\sum_{i=1}^{n} \exp\{\xi u(\theta; y_i)\}}, \qquad j = 1, \ldots, n,

where ξ = ξ(θ) is determined by

\sum_{j=1}^{n} \pi_j(\theta)\, u(\theta; y_j) = 0. \quad (10.19)

(a) Let Z_1, ..., Z_n be independent Poisson variables with means exp(ξu_j), where u_j = u(θ; y_j); we treat θ as fixed. Write down the likelihood equation for ξ and show that when the observed values of the Z_j all equal zero, it is equivalent to (10.19). Hence outline how software that fits generalized linear models may be used to find ξ̂.
(b) Show that the formulation in terms of Poisson variables suggests that the empirical exponential family likelihood ratio statistic is the Poisson deviance W′_{EEF}(θ_0),

W'_{EEF}(\theta_0) = 2 \sum_j \{1 - \exp(\hat\xi u_j)\}.
(c) Plot the log likelihood functions corresponding to W_{EEF} and W′_{EEF} for the data in Example 10.1; take u_j = y_j − θ. Perform a small simulation study to compare the behaviour of W_{EEF} and W′_{EEF} when the underlying data are samples of size 24 from the exponential distribution.
(Section 10.2.2)
6 Consider first-level resampling probabilities given by the exponential tilt

p_j(\theta^0) = \frac{e^{\xi l_j}}{\sum_{i=1}^{n} e^{\xi l_i}}, \quad (10.20)

where ξ ≐ (θ^0 − t)/(nv) and v = n^{-2} Σ l_j^2.
(a) Show that t(F_ξ) ≐ θ^0, where F_ξ denotes the CDF corresponding to (10.20). Hence describe how to space out the values t_r^* in the first-level resampling for a bootstrap likelihood.
(b) Rather than use the tilted probabilities (10.12) to construct a bootstrap likelihood by simulation, suppose that we use those in (10.20). For a linear statistic, show that the cumulant-generating function of T^{**} in sampling from (10.20) is λt + n{K(ξ + n^{-1}λ) − K(ξ)}, where K(ξ) = log(Σ e^{ξ l_j}). Deduce that the saddlepoint approximation to f_{T^{**}|T^*}(t | θ^0) is proportional to exp[−n{K(ξ) − ξt}], where θ^0 = K′(ξ). Hence show that for the sample average, the log likelihood at θ^0 = Σ y_j e^{ξ y_j} / Σ e^{ξ y_j} is n{ξt − log(Σ e^{ξ y_j})}.
(c) Extend (b) to the situation where t is defined as the solution to a monotonic estimating equation.
(Section 10.3; Davison, Hinkley and Worton, 1992)
7 Consider the choice of h for the raw bootstrap likelihood values (10.11), when w(·) is the standard normal density. As is often roughly true, suppose that T^* ∼ N(t, v), and that conditional on T^* = t^*, T^{**} ∼ N(t^*, v).
(a) Show that the mean and variance of the product of v^{1/2} with (10.11) are I_1 and M^{-1}(I_2 − I_1^2), where

I_1 = \{2\pi(1+\gamma^2)\}^{-1/2} \exp\left\{-\frac{\delta^2}{2(1+\gamma^2)}\right\}, \qquad I_2 = (2\pi)^{-1}\{\gamma(2+\gamma^2)^{1/2}\}^{-1} \exp\left\{-\frac{\delta^2}{2+\gamma^2}\right\},

where γ = hv^{-1/2} and δ = v^{-1/2}(t^* − t). Hence verify some of the values in the following table:
                              δ = 0              δ = 1              δ = 2
γ                        0.2   0.4   0.6    0.2   0.4   0.6    0.2   0.4   0.6
Density ×10^{-2}        39.9  39.9  39.9   24.2  24.2  24.2    5.4   5.4   5.4
Bias ×10^{-2}           −0.8  −2.9  −5.7    0    −0.1  −0.5    0.3   1.2   2.5
M × variance ×10^{-2}   40.4  13.4   5.6   28.3  11.2   5.7    7.5   3.8   2.6
(b) If γ is small, show that the variance of (10.11) is roughly proportional to the square of its mean, and deduce that the variance is approximately constant on the log scale.
(c) Extend the calculations in (a) to (10.13).
(Section 10.3; Davison, Hinkley and Worton, 1992)
exp[−½{g(t) − g(θ)}^2], (1 − θ^2)^{-2}, |θ| < 1.

Hence show that the posterior mean and variance of θ = Σ y_j P_j are ȳ and (2n + an + 1)^{-1} m_2, where m_2 = n^{-1} Σ (y_j − ȳ)^2.
(b) Now consider the average Ȳ^* of bootstrap samples generated as follows. We generate a distribution F^* = (P_1^*, ..., P_n^*) on y_1, ..., y_n under the Bayesian bootstrap, and then, conditional on F^*, generate Y_1^*, ..., Y_n^* by random sampling from F^*. Show that

E(\bar Y^*) = \bar y, \qquad \mathrm{var}(\bar Y^*) = \frac{(an + 3n)\, m_2}{n(2n + an + 1)}.

Are the properties of this as n→∞ and a→∞ what you would expect? How does this compare with samples generated by the usual nonparametric bootstrap?
(Section 10.5)
10.8 Practicals
1 We compare the empirical likelihoods and 95% confidence intervals for the mean of the data in Table 3.1, (a) pooling the eight series:

attach(gravity)
grav.EL <- EL.profile(g, tmin=70, tmax=85, n.t=51)
plot(grav.EL[,1], exp(grav.EL[,2]), type="l", xlab="mu",
     ylab="empirical likelihood")
lik.CI(grav.EL, lim=-0.5*qchisq(0.95,1))
and (b) treating the series as arising from separate distributions with the same
mean and plotting eight individual likelihoods:
Compare the intervals with those in Example 3.2. Does the result for (b) suggest a limitation of multinomial likelihoods in general?
Compare the empirical likelihoods with the profile likelihood (10.1) and the adjusted profile likelihood (10.2), obtained when the series are treated as independent normal samples with different variances but the same mean.
(Section 10.2.1)
2 We compute empirical and empirical exponential family likelihoods for the mean direction of the islay data:

attach(islay)
th <- ifelse(theta > 180, theta - 360, theta)
a.t <- function(th) c(sin(th*pi/180), cos(th*pi/180))
b.t <- function(th) c(cos(th*pi/180), -sin(th*pi/180))
y <- t(apply(matrix(theta, 18, 1), 1, a.t))
thetahat <- function(y)
{ m <- apply(y, 2, sum)
  m <- m/sqrt(m[1]^2 + m[2]^2)
  180*atan(m[1]/m[2])/pi }
thetahat(y)
u.t <- function(y, th) crossprod(b.t(th), t(y))
islay.EL <- EL.profile(y, tmin=-100, tmax=120, n.t=40, u=u.t)
plot(islay.EL[,1], islay.EL[,2], type="l", xlab="theta",
     ylab="log empirical likelihood", ylim=c(-25,0))
points(th, rep(-25,18)); abline(h=-3.84/2, lty=2)
lik.CI(islay.EL, lim=-0.5*qchisq(0.95,1))
islay.EEF <- EEF.profile(y, tmin=-100, tmax=120, n.t=40, u=u.t)
lines(islay.EEF[,1], islay.EEF[,2], lty=3)
lik.CI(islay.EEF, lim=-0.5*qchisq(0.95,1))
3 We compare posterior densities for the mean of the air-conditioning data using (a) the Bayesian bootstrap with a_j = −1:
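A minimal sketch: with all a_j = −1 the posterior on (p_1, ..., p_n) is the symmetric Dirichlet distribution with unit parameters, so the weights can be generated from standard exponential variables.

y <- aircondit$hours
R <- 999
theta.bb <- numeric(R)
for (r in 1:R) {
  g <- rexp(length(y))             # gamma(1) variables
  theta.bb[r] <- sum(g*y)/sum(g) } # normalized weights give a posterior draw
plot(density(theta.bb), xlab="theta", ylab="posterior density")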
11 Computer Implementation

11.1 Introduction
The key requirements for computer implementation of resampling methods are a flexible programming language with a suite of reliable quasi-random number generators, a wide range of built-in statistical procedures to bootstrap, and a reasonably fast processor. In this chapter we outline how to use one implementation, using the current (May 1997) commercial version S-Plus 3.3 of the statistical language S, although the methods could be realized in a number of other statistical computing environments.
The remainder of this section outlines the installation of the library, and gives a quick summary of features of S-Plus essential to our purpose. Each subsequent section describes aspects of the library needed for the material in the corresponding chapter: Section 11.2 corresponds to Chapter 2, Section 11.3 to Chapter 3, and so forth. These sections take the form of a tutorial on the use of the library functions. The outline given here is not intended to replace the help files distributed with the library, which can be viewed by typing help(boot, library="boot") within S-Plus. At various points below, you will need to consult these files for more details on functions.
The main functions in the library are summarized in Table 11.1.
The best way to learn to use software is to use it, and from Section 11.1.2 onwards, we assume that you, dear reader, know the basics of S, including how to write simple functions, that you are seated comfortably at your favourite computer with S-Plus launched and a graphics window open, and that you are working through this chapter. We do not show the S-Plus prompt >, nor the continuation prompt +.
11.1.1 Installation

UNIX

The bootstrap library can be obtained from the home page for this book,

http://dmawww.epfl.ch/davison.mosaic/BMA/

Once the file bootlib.sh has been downloaded into the directory where the library is to be installed, type

sh bootlib.sh
rm bootlib.sh

You should then follow the instructions in the README file to complete the installation of the library.
It is best to set up an S-Plus library boot containing the library files; you may need to ask your system manager to do this. Once this is done, and once inside S-Plus in your usual working directory, the functions and data are accessed by typing

library(boot, first=T)
This will avoid cluttering your working directory with library files, and reduce the chance that you accidentally overwrite them.

Windows

The disk at the back of this book contains the library functions and documentation for use with S-Plus for Windows. For instructions on the installation, see the file README.TXT on the disk. The contents of the disk can also be retrieved in the form of a zip file from the home page for the book given above.
For example, a vector of 20 standard normal variables is generated by typing

y <- rnorm(20)
y

Here <- is the assignment symbol. To see the contents of any S object, simply type its name, as above. This is often done below, and we do not show the output.
In general quasi-random numbers from a distribution are generated by the functions rexp, rgamma, rchisq, rt, ..., with arguments to give parameters where needed. For example,

y <- rgamma(n=10, shape=2)
y <- rgamma(n=10, shape=c(1:10))

generates a vector of ten gamma variables with shape parameter 2, and then a vector of ten gamma variables with shape parameters 1, 2, ..., 10.
The function sample is used to sample from a set with or without replacement. For example, to get a random permutation of the numbers 1, ..., 10, a random sample with replacement from them, a random permutation of 11, 22, 33, 44, 55, a sample of size 10 from them, and a sample of size 10 taken with unequal probabilities:

sample(10)
sample(10, replace=T)
set <- c(11,22,33,44,55)
sample(set)
sample(set, size=10, replace=T)
sample(set, size=10, replace=T, prob=c(0.1,0.1,0.1,0.1,0.6))
Subscripts

The city population data with n = 10 are

city
city$u
city$x

where the second two commands show the individual variables of city. This S-Plus object is a dataframe, an array of data in which rows correspond to cases, and the named columns to variables. Elements of an object are accessed by subscripts, so

city$x[1]
city$x[c(1:4)]
city$x[c(1,5,10)]
city[c(1,5,10),2]
city$x[-1]
city[c(1:3),]
i <- sample(10, replace=T)
city[i,]

The row labels result from the algorithm used to give unique labels to rows, and can be ignored for our purposes.
Bootstrap objects

The result of a call to boot is a bootstrap object. This is implemented as a list of quantities which is given the class "boot" and for which various methods are defined. For example, typing

city.boot

prints the original statistic, its estimated bias and its standard error, while

plot(city.boot)
names(city.boot)

produce summary plots and show the names of the components of the object.
Timing

To repeat the simulation, checking how long it takes, type

unix.time(city.boot <- boot(city, city.fun, R=50))

on a UNIX system, or the analogous timing command on a DOS system. The first number returned is the time the simulation took, and is useful for estimating how long a larger simulation would take.
Although code is generally clearer when dataframes are used, the computation can be speeded up by avoiding them, as here:
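A sketch of the kind of matrix-based version meant here, consistent with the city ratio, is:

city.mat <- cbind(city$u, city$x)
city.fun.m <- function(data, i) sum(data[i,2])/sum(data[i,1])
unix.time(city.boot <- boot(city.mat, city.fun.m, R=50))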
Frequency array

To obtain the R × n array of bootstrap frequencies for city.boot and to display its first 20 lines, type

f <- boot.array(city.boot)
f[1:20,]
Types of statistic

For a nonparametric bootstrap, the function statistic can be of one of three types. We have already seen examples of the first, index type, where the arguments are the dataframe data and the vector of indices, i; this is specified by stype="i" (the default).
For the second, weighted type, the arguments are data and a vector of weights w. For example,
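a weighted version of the city ratio can be sketched as follows (this is a minimal version consistent with the description below, not necessarily the library's own definition):

city.w <- function(data, w=rep(1, nrow(data))/nrow(data))
{ w <- w/sum(w)                       # ensure the weights sum to one
  sum(w*data$x)/sum(w*data$u) }
boot(city, city.w, R=200, stype="w")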
This function writes

t = \sum_j w_j^* x_j \Big/ \sum_j w_j^* u_j,

where w_j^* is the weight put on the jth case of the dataframe in the bootstrap sample; the first line of city.w ensures that Σ w_j^* = 1. Setting w in the initial line of the function gives the default value for w, which is a vector of n^{-1}s; this enables the original value of t to be obtained by city.w(city). A more complicated example is given by the library correlation function corr. Not all statistics can be written in this form, but when they can, numerical differentiation can be used to obtain empirical influence values and ABC confidence intervals.
For the third, frequency type, the arguments are data and a vector of frequencies f. For example,
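a frequency version of the ratio can be sketched as (named city.frq here to avoid clashing with the parametric city.f defined later):

city.frq <- function(data, f) sum(f*data$x)/sum(f*data$u)
boot(city, city.frq, R=200, stype="f")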
This uses

t = n^{-1} \sum_j f_j^* x_j \Big/ \Big\{ n^{-1} \sum_j f_j^* u_j \Big\},

where f_j^* is the frequency with which the jth row of the dataframe occurs in the bootstrap sample. Not all statistics can be written in this form. It differs from the preceding type in that whereas weights can in principle take any positive values, frequencies must be non-negative integers that sum to n.
Function statistic

The contents of statistic can be more-or-less arbitrarily complicated, provided that its output is a scalar or fixed-length vector. For example,

air.fun <- function(data, i)
{ d <- data[i,]
  c(mean(d), var(d)/nrow(data)) }
air.boot <- boot(data=aircondit, statistic=air.fun, R=200)
city.subset <- function(data, i, n=10)
{ d <- data[i[1:n],]
  mean(d[,2])/mean(d[,1]) }
city.boot <- boot(data=city, statistic=city.subset, R=200, n=5)

gives resampled ratios for bootstrap samples of size 5. Note that the frequency array for city.boot would not be useful in this case. The indices can be obtained by

boot.array(city.boot, indices=T)[,1:5]
A new argument to boot, sim="parametric", tells boot to perform a parametric simulation: by default the simulation is nonparametric and sim="ordinary". Other possible values for sim are described below.
For example, for parametric simulation from the exponential model fitted to the air-conditioning data in Table 1.2, we set
aircondit.fun <- function(data) mean(data$hours)
aircondit.sim <- function(data, mle)
{ d <- data
  d$hours <- rexp(n=nrow(data), rate=mle)
  d }
aircondit.mle <- 1/mean(aircondit$hours)
aircondit.para <- boot(data=aircondit, statistic=aircondit.fun,
     R=20, sim="parametric", ran.gen=aircondit.sim,
     mle=aircondit.mle)
l.city <- log(city)
city.mle <- c(apply(l.city, 2, mean), sqrt(apply(l.city, 2, var)),
     corr(l.city))
city.sim <- function(data, mle)
{ n <- nrow(data)
  d <- matrix(rnorm(2*n), n, 2)
  d[,2] <- mle[2] + mle[4]*(mle[5]*d[,2] + sqrt(1 - mle[5]^2)*d[,1])
  d[,1] <- mle[1] + mle[3]*d[,1]
  data$x <- exp(d[,2])
  data$u <- exp(d[,1])
  data }
city.f <- function(data) mean(data[,2])/mean(data[,1])
city.para <- boot(city, city.f, R=200, sim="parametric",
     ran.gen=city.sim, mle=city.mle)

This is useful when comparing parametric and nonparametric bootstraps for the same problem. Compare them for the city data.
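A sketch of the kind of call meant in the next sentence, assuming city.fun as before, is:

city.boot <- boot(city, city.fun, R=999)
L.reg <- empinf(boot.out=city.boot, type="reg")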
uses regression with the 999 samples in city.boot to estimate the l_j. Jackknife values can be obtained by

J <- empinf(data=city, statistic=city.fun, stype="i", type="jack")

The argument type controls how the influence values are to be calculated, but this also depends on the quantities input to empinf; for details see the help file.
Variance approximations

var.linear uses empirical influence values to calculate the nonparametric delta method variance approximation for a statistic:

var.linear(L.diff)
var.linear(L.reg)
Linear approximation

linear.approx uses output from a nonparametric bootstrap simulation to calculate the linear approximations to the bootstrapped quantities. The empirical influence values can be supplied, but if not, they are estimated by a call to empinf. For the city population ratio,
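a sketch using the two sets of influence values estimated above (L.reg and the jackknife values J) is:

tL.reg <- linear.approx(city.boot, L=L.reg)
tL.jack <- linear.approx(city.boot, L=J)
split.screen(c(1,2)); screen(1)
plot(tL.reg, city.boot$t[,1]); abline(0, 1, lty=2)
screen(2)
plot(tL.jack, city.boot$t[,1]); abline(0, 1, lty=2)
close.screen(all=T)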
This calculates the linear approximation for the two sets of empirical influence values and plots the actual t* against them.
gravity
grav <- gravity[as.numeric(gravity$series) >= 7,]
grav
grav.fun <- function(data, i, trim=0.125)
{ d <- data[i,]
  m <- tapply(d$g, d$series, mean, trim=trim)
  m[7] - m[8] }
grav.boot <- boot(grav, grav.fun, R=200, strata=grav$series)
11.3.2 Smoothing

The neatest way to perform smooth bootstrapping is to use sim="parametric". For example, to estimate the variance of the median of the data in y, using smoothing parameter h = 0.5:

y <- rnorm(99)
h <- 0.5
y.gen <- function(data, mle)
{ n <- length(data)
  i <- sample(n, n, replace=T)
  data[i] + mle*rnorm(n) }
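A sketch of the matching call to boot, completing the example, is:

y.boot <- boot(y, median, R=199, sim="parametric", ran.gen=y.gen, mle=h)
var(y.boot$t)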
city.boot <- boot(city, city.fun, R=999)
city.L <- city.boot$t0[3:12]
split.screen(c(1,2)); screen(1); split.screen(c(2,1)); screen(4)
attach(city)
plot(u, x, type="n", xlim=c(0,300), ylim=c(0,300))
text(u, x, round(city.L, 2))
screen(3)
plot(u, x, type="n", xlim=c(0,300), ylim=c(0,300))
text(u, x, c(1:10)); abline(0, city.boot$t0[1], lty=2)
screen(2)
jack.after.boot(boot.out=city.boot, useJ=F, stinf=F, L=city.L)
close.screen(all=T)
The two left panels show the data with case numbers and empirical influence values as plotting symbols. The jackknife-after-bootstrap plot on the right shows the effect of deleting cases in turn: values of t* are more variable when case 4 is deleted and less variable when cases 9 and 10 are deleted. We see from the empirical influence values that the distribution of t* shifts downwards when cases with positive empirical influence values are deleted, and conversely.
This plot is also produced by setting true the jack argument to plot when applied to a bootstrap object, as in plot(city.boot, jack=T).
Other arguments for jack.after.boot control whether the influence values are standardized (by default they are, stinf=T), and whether the empirical influence values are used (by default jackknife values are used, based on the simulation, so the default values are useJ=T and L=NULL).
Most post-processing functions allow the user to specify either an index for the component of interest, or a vector of length boot.out$R to be treated as the main statistic. Thus a jackknife-after-bootstrap plot using the second component of city.boot$t — the estimated variances for t* — would be obtained by either of

jack.after.boot(city.boot, useJ=F, stinf=F, index=2)
jack.after.boot(city.boot, useJ=F, stinf=F, t=city.boot$t[,2])
Frequency smoothing

smooth.f smooths the frequencies of a nonparametric bootstrap object to give a "typical" distribution with expected value roughly at θ. In order to find the smoothed frequencies for θ = 1.4 for the city ratio, and to obtain the corresponding value of t, we set

city.freq <- smooth.f(theta=1.4, boot.out=city.boot)
city.w(city, city.freq)
11.4 Tests

11.4.1 Parametric tests

Simple parametric tests can be conducted using parametric simulation. For example, to perform the conditional simulation for the data in fir (Example 4.2):
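A sketch of what is needed: simulate quadrat counts conditional on their total and compute the dispersion statistic (the column name count, the statistic, and the final plotting commands are assumptions standing in for the book's own code).

fir.tot <- sum(fir$count)
n <- nrow(fir)
R <- 999
fir.disp <- numeric(R)
for (r in 1:R) {
  quad <- sample(n, fir.tot, replace=T)   # allocate seedlings at random
  y <- tabulate(quad, nbins=n)            # counts given the fixed total
  fir.disp[r] <- (n-1)*var(y)/mean(y) }
qqplot(qchisq(ppoints(R), n-1), fir.disp)
abline(0, 1, lty=2)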
The last two lines here display the results (almost) as in the right panel of Figure 4.1.
This uses the same seed as for the permutation test, for a more precise comparison. Is the significance level similar to that for the permutation test? Why cannot boot be directly applied to ducks to perform a bootstrap test?
Exponential tilting

The test of equality of means for two sets of data in Example 4.16 involves exponential tilting. The null distribution puts probabilities given by (4.25) on the two sets of data, and the tilt parameter λ solves the equation

\frac{\sum_{i,j} z_{ij} \exp(\lambda z_{ij})}{\sum_{i,j} \exp(\lambda z_{ij})} = \theta,

where z_{1j} = y_{1j}, z_{2j} = −y_{2j}, and θ = 0. The fitted null distribution is obtained using exp.tilt, as follows:
z <- grav$g
z[grav$series==8] <- -z[grav$series==8]
z.tilt <- exp.tilt(L=z, theta=0, strata=grav$series)
z.tilt

where z.tilt contains the fitted probabilities (which sum to one for each stratum) and the values of λ and θ. Other arguments can be input to exp.tilt: see its help file.
The significance probability is then obtained by using the weights argument to boot. This argument is a vector containing the probabilities with which to select the rows of data, when bootstrap sampling is to be performed with unequal probabilities. In this case the unequal probabilities are given by the tilted distribution, under which the expected value of the test statistic is zero. The code needed to perform the simulation and get the estimated significance level is:
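A sketch, reusing grav.fun and sampling directly from the fitted null distribution (the extremity counting in the last line is an assumption about the test's direction):

grav.tilt <- boot(grav, grav.fun, R=999, weights=z.tilt$p,
     strata=grav$series)
(1 + sum(grav.tilt$t >= grav.tilt$t0))/(999 + 1)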
boot.ci(boot.out=city.boot)
boot.ci(boot.out=city.boot, type=c("norm","perc","basic","bca"),
     conf=c(0.8,0.9))
boot.ci(city.boot, h=log, hinv=exp, hdot=function(u) 1/u)

where hinv and hdot are the inverse and first derivative of h(·). Note how transformation improves the basic bootstrap interval.
Nonparametric ABC intervals are calculated using abc.ci. For example

abc.ci(data=city, statistic=city.w)
Empirical influence values and the nonparametric delta method standard error for the slope of the linear model could be obtained by putting the slope estimate in weighted form:
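A sketch for a dataframe d with covariate x and response y (both names are assumptions):

slope.w <- function(data, w=rep(1, nrow(data))/nrow(data))
{ w <- w/sum(w)
  xbar <- sum(w*data$x)
  sum(w*(data$x - xbar)*data$y)/sum(w*(data$x - xbar)^2) }
L.slope <- empinf(data=d, statistic=slope.w, stype="w")
sqrt(var.linear(L.slope))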
For more complicated regressions, for example with unequal response variances, more information must be added to the new dataframe.
Wild bootstrap

The wild bootstrap can be implemented using sim="parametric", as follows:
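A sketch using the two-point distribution of Mammen for the resampled errors, for a linear model fit d.lm on a dataframe d (the names d, d.lm and the model formula are assumptions):

wild.gen <- function(data, mle)
{ n <- nrow(data)
  a <- (1 - sqrt(5))/2; b <- (1 + sqrt(5))/2   # two-point support
  p <- (sqrt(5) + 1)/(2*sqrt(5))               # gives mean 0, variance 1
  data$y <- mle$fit + mle$res*ifelse(runif(n) < p, a, b)
  data }
d.wild <- boot(d, function(data) coef(lm(y ~ x, data=data)), R=199,
     sim="parametric", ran.gen=wild.gen,
     mle=list(fit=fitted(d.lm), res=resid(d.lm)))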
11.6.2 Prediction

Now consider prediction of the log brain weight of new mammals with body weights equal to those for the chimpanzee and baboon. For this we introduce yet another argument to boot — m, which gives the number of errors e*_m to be simulated with each bootstrap sample (see Algorithm 6.4). In this case we want to predict at m = 2 "new" mammals, with covariates contained in d.pred. The statistic function supplied to boot must now take at least one more argument, namely the additional indices for constructing the bootstrap versions of the two "new" mammals. We implement this as follows:
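A sketch of the shape such a statistic can take (the mammals dataframe, d.pred, and the residual-resampling details are assumptions, not the book's own code):

mam.lm <- lm(log(brain) ~ log(body), data=mammals)
mam.fun <- function(data, i, i.pred)
{ d <- data
  d$y <- fitted(mam.lm) + resid(mam.lm)[i]    # model-based resample
  d.lm <- lm(y ~ log(body), data=d)
  # bootstrap predictions plus resampled prediction errors
  predict(d.lm, newdata=d.pred) + resid(mam.lm)[i.pred] }
mam.boot <- boot(mammals, mam.fun, R=199, m=2)
apply(mam.boot$t, 2, quantile, probs=c(0.025, 0.5, 0.975))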
giving the 0.025, 0.5, and 0.975 prediction limits for the brain sizes of the "new" mammals. The actual brain sizes lie close to or above the upper limits of these intervals: primates tend to have larger brains than other mammals.
The other procedures for model-based resampling of generalized linear models are applied similarly. Try to modify this code to resample the linear predictor residuals according to (7.13) (they are already calculated above).
The bootstrap function mel.fun given below need only take one argument, a dataframe containing the data themselves. Note how the function uses a smoothing spline to interpolate fitted values for the full range of thickness; this avoids difficulties due to the variability of the covariate when resampling cases. The output of mel.fun is the vector of fitted linear predictors predicted by the spline.
The next three commands give the syntax for case resampling, for model-based resampling and for conditional resampling. For either of these last two schemes, the baseline survivor functions for the survival times and censoring times, and the fitted proportional hazards (Cox) model for the survival distribution must be supplied via the F.surv, G.surv, and cox arguments.
attach(melanoma)
mel.boot <- censboot(melanoma, mel.fun, R=99, strata=ulcer)
mel.boot.mod <- censboot(melanoma, mel.fun, R=99,
     F.surv=mel.surv, G.surv=mel.cens, strata=ulcer,
     cox=mel.cox, sim="model")
mel.boot.con <- censboot(melanoma, mel.fun, R=99,
     F.surv=mel.surv, G.surv=mel.cens, strata=ulcer,
     cox=mel.cox, sim="cond")
The bootstrap results are best displayed graphically. Here is the code for the analogue of the left panels of Figure 7.9:

th <- seq(from=0.25, to=10, by=0.25)
split.screen(c(2,1))
screen(1)
plot(th, mel.boot$t0, type="n", xlab="Tumour thickness (mm)",
     xlim=c(0,10), ylim=c(-2,2), ylab="Linear predictor")
lines(th, mel.boot$t0, lwd=3)
rug(jitter(thickness))
for (i in 1:19) lines(th, mel.boot$t[i,], lwd=0.5)
screen(2)
plot(th, mel.boot$t0, type="n", xlab="Tumour thickness (mm)",
     xlim=c(0,10), ylim=c(-2,2), ylab="Linear predictor")
lines(th, mel.boot$t0, lwd=3)
mel.env <- envelope(mel.boot$t, level=0.95)
lines(th, mel.env$point[1,], lty=1)
lines(th, mel.env$point[2,], lty=1)
mel.env <- envelope(mel.boot.mod$t, level=0.95)
lines(th, mel.env$point[1,], lty=2)
lines(th, mel.env$point[2,], lty=2)
mel.env <- envelope(mel.boot.con$t, level=0.95)
lines(th, mel.env$point[1,], lty=3)
lines(th, mel.env$point[2,], lty=3)
detach("melanoma")
Note how tight the confidence envelope is relative to that for the more highly parametrized model used in the example. Try again with larger values of R, if you have the patience.
The best model is AR(9). How well determined is this, and what is the variance of the series average? We bootstrap to see, using
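a function of the following shape (a sketch consistent with the description in the next sentence; the order.max value is an assumption):

sun.fun <- function(tsb)
{ ar.fit <- ar(tsb, order.max=25)
  c(ar.fit$order, mean(tsb), tsb) }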
which calculates the order of the fitted autoregressive model, the series average, and saves the series itself.
Our function for bootstrapping time series is tsboot. Here are results for fixed-block bootstraps with block length l = 20:
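Only the fixed-block call is sketched here (the summary commands are illustrative):

sun.fixed <- tsboot(sun, sun.fun, R=99, l=20, sim="fixed")
table(sun.fixed$t[,1])     # orders of the fitted models
var(sun.fixed$t[,2])       # variance of the series average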
Check the orders of the fitted models for this scheme. Are they similar to those obtained using the block schemes above?
For "post-blackening" we need to define yet another function:
sun.black <- function(res, n.sim, ran.args)
{ ts.orig <- ran.args$ts
  ts.mod <- ran.args$model
  mean(ts.orig) + rts(arima.sim(model=ts.mod, n=n.sim, innov=res)) }
sun.1b <- tsboot(sun.res, sun.fun, R=99, l=20, sim="fixed",
     ran.gen=sun.black, ran.args=list(ts=sun, model=sun.model),
     n.sim=length(sun))
Compare these results with those above, and try it with sim="geom".
city.con <- control(city.boot)

gives a list consisting of the regression estimates of the empirical influence values, linear approximations to the bootstrap statistics, the control estimates of bias, variance, and the third cumulant of the t*, control estimates of selected quantiles of the distribution of t*, and a spline object that summarizes the approximate quantiles used to obtain the control quantile estimates. Saddlepoint approximation is used to obtain these approximate quantiles. Typing
city.con$L
city.con$bias
city.con$var
city.con$quantiles
city.top <- exp.tilt(L=city.L, theta=2, t0=city.w(city))
city.bot <- exp.tilt(L=city.L, theta=1.2, t0=city.w(city))
city.tilt <- boot(city, city.fun, R=c(100,99),
     weights=rbind(city.top$p, city.bot$p))
imp.weights(city.tilt)
imp.moments(city.tilt)
imp.quantile(city.tilt)
Each of these returns raw, ratio and regression estimates of the corresponding quantities. Some other uses of importance resampling are exemplified by

imp.prob(city.tilt, t0=1.2, def=F)
z <- (city.tilt$t[,1] - city.tilt$t0[1])/sqrt(city.tilt$t[,2])
imp.quantile(boot.out=city.tilt, t=z)

The call to imp.prob calculates the importance sampling estimate of the probability that t* ≤ 1.2, without using defensive mixture distributions (by default def=T, i.e. defensive mixture distributions are used to obtain the weights and estimates). The last two lines show how importance sampling is used to estimate quantiles of the studentized bootstrap statistic.
For more details and further arguments to the functions, see their help files.
Function tilt.boot

The description above relies on exponential tilting to obtain the resampling probabilities, and requires knowing where to tilt to. If this is difficult, tilt.boot can be used to avoid this, by performing an initial bootstrap with equal resampling probabilities, then using frequency smoothing to estimate appropriate tilted probabilities. For example,

city.tilt <- tilt.boot(city, city.fun, R=c(500,250,249))

performs 500 ordinary bootstraps, uses the results to estimate probability distributions tilted to the 0.025 and 0.975 points of the simulations, and then performs 250 bootstraps tilted to the 0.025 quantile, and 249 tilted to the 0.975 quantile, before assigning the result to a bootstrap object. More complex uses of tilt.boot are possible; see its help file.
Importance re-weighting

These functions allow for importance re-weighting as well as importance sampling. For example, suppose that we require to re-weight the simulated values so that they appear to have been simulated from a distribution with expected ratio close to 1.4. We then use the q= option to the importance sampling functions as follows:

q <- smooth.f(theta=1.4, boot.out=city.tilt)
city.w(city, q)
imp.moments(city.tilt, q=q)
imp.quantile(city.tilt, q=q)

where the first line calculates the smoothed distribution, the second obtains the corresponding ratio, and the third and fourth obtain the moment and quantile estimates corresponding to simulation from the distribution q.
saddle(A=city.L/nrow(city), u=2-city.w(city))

The Lugannani–Rice formula can be applied by setting LR=T in the calls to saddle and saddle.distn; by default LR=F.
For more sophisticated applications, the arguments A and u to saddle.distn can be replaced by functions. For example, the bootstrapped ratio can be defined through the estimating equation

\sum_j f_j^*(x_j - t u_j) = 0, \quad (11.1)

where the f_j^* have a joint multinomial distribution with equal probabilities and denominator n = 10, the number of rows of city, as outlined in Example 9.16. Accordingly we set
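what follows is a sketch, under the assumption that A and u may be supplied as functions of t whose extra arguments are passed through by saddle.distn:

Afn <- function(t, data) data$x - t*data$u
ufn <- function(t, data) 0
saddle(A=Afn(2, city), u=0)
saddle.distn(A=Afn, u=ufn, t0=c(city.w(city), 0.2), data=city)  # 0.2: rough spread of t*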
The penultimate line here gives the exact version of the call to saddle that started this section, while the last line calculates the saddlepoint approximation to the exact distribution of T*. For saddle.distn the quantiles of the distribution of T* are estimated by obtaining the CDF approximation at a number of values of t, and then interpolating the CDF using a spline smoother. The range of values of t used is determined by the contents of t0, whose first value contains the original value of the statistic, and whose second value contains a measure of the spread of the distribution of T*, such as its standard error.
Another use of saddle and saddle.distn is to give them directly the adjusted cumulant generating function K(ξ) − tξ, and the second derivative K″(ξ). For example, the city data above can be tackled as follows:
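A sketch, with K(ξ) the multinomial cumulant-generating function of the estimating function in (11.1) at the illustrative value t = 1.4:

t1 <- 1.4
a <- city$x - t1*city$u
K.adj <- function(xi) nrow(city)*log(mean(exp(xi*a)))
K2 <- function(xi)
{ w <- exp(xi*a); m <- sum(w*a)/sum(w)
  nrow(city)*(sum(w*a^2)/sum(w) - m^2) }
saddle(K.adj=K.adj, K2=K2)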
This is most useful when K(·) is not of the standard form that follows from a multinomial distribution.
Conditional approximations

Conditional saddlepoint approximation is applied by giving Afn and ufn more columns, and setting the wdist and type arguments to saddle appropriately. For example, suppose that we want to find the distribution of T*, defined as the root of (11.1), but resampling 25 rather than 49 cases of bigcity. Then we set
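a call of roughly the following shape (a sketch: the second column of A and second element of u impose the condition that the Poisson counts total 25):

n <- nrow(bigcity)
t1 <- 1.4      # an arbitrary illustrative value of t
saddle(A=cbind(bigcity$x - t1*bigcity$u, rep(1, n)), u=c(0, 25),
     wdist="p", type="cond")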
Here the wdist argument gives the distribution of the random variables W_j, which is Poisson in this case, and the type argument specifies that a conditional approximation is required. For resampling without replacement, see the help file. A further argument mu allows these variables to have differing means, in which case the conditional saddlepoint will correspond to sampling from multinomial or hypergeometric distributions with unequal probabilities.
Note how close the two semiparametric log likelihoods are, compared to the parametric one. The practicals at the end of Chapter 10 give more examples of their use (and abuse).
More general (and more robust!) code to calculate empirical likelihoods is provided by Professor A. B. Owen at Stanford University; see World Wide Web reference http://playfair.stanford.edu/reports/owen/el.S.
APPENDIX A

Cumulant Calculations

The cumulant-generating function of a scalar random variable Y is

K(t) = \log E(e^{tY}) = \sum_{s=1}^{\infty} \frac{t^s}{s!} \kappa_s,

where κ_s is the sth cumulant, while the moment-generating function of Y is

M(t) = E(e^{tY}) = \sum_{s=0}^{\infty} \frac{t^s}{s!} \mu'_s,

where μ′_s = E(Y^s) is the sth moment. A simple example is a N(μ, σ²) random variable, for which K(t) = tμ + ½t²σ²; note the appealing fact that its cumulants of order higher than two are zero. By equating powers of t in the expansions of K(t) and log M(t) we find that κ_1 = μ′_1 and that

\kappa_2 = \mu'_2 - (\mu'_1)^2,
\kappa_3 = \mu'_3 - 3\mu'_2\mu'_1 + 2(\mu'_1)^3,
\kappa_4 = \mu'_4 - 4\mu'_3\mu'_1 - 3(\mu'_2)^2 + 12\mu'_2(\mu'_1)^2 - 6(\mu'_1)^4,

with inverse relations

\mu'_2 = \kappa_2 + (\kappa_1)^2,
\mu'_3 = \kappa_3 + 3\kappa_2\kappa_1 + (\kappa_1)^3, \quad (A.1)
\mu'_4 = \kappa_4 + 4\kappa_3\kappa_1 + 3(\kappa_2)^2 + 6\kappa_2(\kappa_1)^2 + (\kappa_1)^4.

The cumulants κ_1, κ_2, κ_3, and κ_4 are the mean, variance, skewness and kurtosis of Y.
For vector Y it is better to drop the power notation used above and to write

K(t) = t_i\kappa^i + \tfrac{1}{2} t_i t_j \kappa^{i,j} + \tfrac{1}{6} t_i t_j t_k \kappa^{i,j,k} + \cdots,

where summation is implied over repeated indices, so that, for example,

t_i\kappa^i = t_1\kappa^1 + \cdots + t_n\kappa^n, \qquad t_i t_j \kappa^{i,j} = t_1 t_1 \kappa^{1,1} + t_1 t_2 \kappa^{1,2} + \cdots + t_n t_n \kappa^{n,n}.

Thus the n-dimensional normal distribution with means κ^i and covariance matrix κ^{i,j} has cumulant-generating function t_iκ^i + ½t_it_jκ^{i,j}. We sometimes write κ^{i,j} = cum(Y^i, Y^j), κ^{i,j,k} = cum(Y^i, Y^j, Y^k) and so forth for the coefficients of t_it_j, t_it_jt_k in K(t). The cumulant arrays κ^{i,j}, κ^{i,j,k}, etc. are invariant to index permutation, so for example κ^{1,2,3} = κ^{2,3,1}.
A key feature that simplifies calculations with cumulants as opposed to moments is that cumulants involving two or more independent random variables are zero: for independent variables, κ^{i,j} = κ^{i,j,k} = ··· = 0 unless all the indices are equal.
The above notation extends to generalized cumulants such as

\mathrm{cum}(Y^i Y^j Y^k) = E(Y^i Y^j Y^k) = \kappa^{ijk}, \qquad \mathrm{cum}(Y^i, Y^j Y^k) = \kappa^{i,jk}, \qquad \mathrm{cum}(Y^i Y^j, Y^k, Y^l) = \kappa^{ij,k,l}.
It is useful to define

\delta_{ij} = 1 \text{ if } i = j, \text{ and } 0 \text{ otherwise},

the Kronecker delta symbol. The covariance cov{Ȳ, (n−1)^{-1} Σ(Y_i − Ȳ)²} then reduces, on using Table A.1, the fact that κ^i = 0, and the independence of the identically distributed observations, to a multiple of a third cumulant. In power notation κ^{1,1,1} is κ_3, the third cumulant of Y_1; so cov{Ȳ, (n−1)^{-1} Σ(Y_i − Ȳ)²} = κ_3/n.
Table A.1 Complementary set partitions of {1}, {1,2}, {1,2,3} and {1,2,3,4}; each partition (left) is listed with its complementary partitions and their multiplicities in brackets.

1:        1 [1]

12:       12 [1], 1|2 [1]
1|2:      12 [1]

123:      123 [1], 12|3 [3], 1|2|3 [1]
12|3:     123 [1], 13|2 [2]
1|2|3:    123 [1]

1234:     1234 [1], 123|4 [4], 12|34 [3], 12|3|4 [6], 1|2|3|4 [1]
123|4:    1234 [1], 124|3 [3], 12|34 [3], 14|2|3 [3]
12|34:    1234 [1], 123|4 [2], 134|2 [2], 13|24 [2], 13|2|4 [4]
12|3|4:   1234 [1], 134|2 [2], 13|24 [2]
1|2|3|4:  1234 [1]
Bibliography
W. Sendler, volum e 376 o f L ecture N otes in Economics B ooth, J. G . (1996) B o o tstrap m eth o d s for generalized
and M athem atical Systems, pp. 23-30. N ew Y ork: linear m ixed m odels w ith ap p licatio n s to sm all area
Springer. estim ation. In Statistical Modelling, eds G. U . H. Seeber,
B. J. F rancis, R . H atzin g er an d G . Steckel-Berger,
B eran, R. J. (1997) D iag n o sin g b o o tstra p success. Annals
o f the Institute o f Statistical M athem atics 49, to appear. volum e 104 o f Lecture N otes in Statistics, pp. 43-51.
N ew Y ork: Springer.
Berger, J. O. an d B ern ard o , J. M . (1992) O n the
dev elo p m en t o f reference p rio rs (w ith D iscussion). In B ooth, J. G . an d B utler, R. W. (1990) R a n d o m iza tio n
d istrib u tio n s an d sa d d lep o in t a p p ro x im atio n s in
Bayesian S tatistics 4, eds J. M . B ernardo, J. O. Berger,
A. P. D aw id an d A. F. M . Sm ith, pp. 35-60. O xford: g eneralized linear m odels. Biometrika 77, 787-796.
C laren d o n Press. B ooth, J. G., B utler, R. W. an d H all, P. (1994) B o o tstrap
Besag, J. E. a n d Clifford, P. (1989) G eneralized M o n te m eth o d s for finite p o p ulations. Journal o f the American
C a rlo significance tests. Biometrika 76, 633-642. Statistical Association 89, 1282-1289.
Bithell, J. F. an d Stone, R. A. (1989) O n statistical m ethods B retagnolle, J. (1983) L ois lim ites d u b o o tstra p de
fo r analysing th e geo g rap h ical d istrib u tio n o f cancer certaines fonctionelles. Annales de I'Institut Henri
cases n ear n u clear installatio ns. Journal o f Epidemiology Poincare, Section B 19, 281-296.
and Community Health 43, 79-85. Brillinger, D. R . (1981) Time Series: Data Analysis and
Bloom field, P. (1976) Fourier Analysis o f Time Series: An Theory. E xpan d ed edition. San F ran cisco : H olden-D ay.
Introduction. N ew Y o rk : Wiley. Brillinger, D. R . (1988) A n elem entary tren d analysis o f
Boos, D. D . an d M o n a h an , J. F. (1986) B o o tstrap m ethods R io N eg ro levels a t M a n au s, 1903-1985. Brazilian
u sin g p rio r in fo rm atio n . Biom etrika 73, 77-83. Journal o f Probability and Statistics 2, 63-79.
Bibliography 557
B rillinger, D. R. (1989) C o n sistent d etection o f a b o o tstra p , pivots an d confidence lim its. T echnical
m o n o to n ic tren d su p erp o sed on a sta tio n ary tim e series. R e p o rt 34, C enter for S tatistical Sciences, U niversity o f
Biometrika 76, 23-30. T exas a t A ustin.
Brockw ell, P. J. an d D avis, R . A. (1991) Time Series: Chen, C., D avis, R . A., Brockwell, P. J. a n d Bai, Z. D.
Theory and Methods. Second edition. N ew Y ork: (1993) O rd e r d eterm in atio n for autoregressive processes
Springer. using resam pling m ethods. Statistica Sinica 3, 481-500.
Brockw ell, P. J. an d D avis, R. A. (1996) Introduction to C hen, C.-H. a n d G eorge, S. L. (1985) T he b o o tstra p an d
Time Series and Forecasting. N ew Y o rk : Springer. identification o f pro g n o stic factors via C ox’s
p ro p o rtio n a l h azard s regression m odel. Statistics in
Brow n, B. W. (1980) P red iction analysis for b inary d a ta . In
Medicine 4, 39-46.
Biostatistics Casebook, eds R. G. M iller, B. E fron, B. W.
B row n a n d L. E. M oses, pp. 3-18. N ew Y ork: Wiley. C hen, S. X. (1996) E m pirical likelihood confidence
intervals for n o n p aram etric density estim ation.
B uckland, S. T. an d G arth w aite, P. H . (1990) A lgorithm
Biometrika 83, 329-341.
A S 259: estim atin g confidence intervals by the
R o b b in s -M o n ro search process. Applied Statistics 39, C hen, S. X. an d H all, P. (1993) S m oothed em pirical
413-424. likelihood confidence intervals for quantiles. Annals of
Statistics 21, 1166-1181.
B iihlm ann, P. an d K iinsch, H. R. (1995) T he blockw ise
b o o tstra p for general p aram eters o f a sta tio n ary tim e C hen, Z. an d D o , K.-A. (1994) T he b o o tstra p m ethods
series. Scandinavian Journal of Statistics 22, 35-54. w ith sad d lep o in t ap p ro x im atio n s an d im p o rtan ce
resam pling. Statistica Sinica 4, 407-421.
B unke, O. an d D ro g e, B. (1984) B o o tstrap and
cro ss-v alid atio n estim ates o f the pred ictio n e rro r for C obb, G . W. (1978) T he problem o f th e N ile: conditional
lin ear regression m odels. Annals of Statistics 12, solution to a c h an g ep o in t problem . Biometrika 65,
1400-1424. 243-252.
B u rm an , P. (1989) A co m p arativ e study o f o rd in ary C o chran, W. G . (1977) Sampling Techniques. T h ird edition.
cross-v alid atio n , D-fold cross-validation a n d the rep eated N ew Y o rk : Wiley.
learn in g -testin g m eth o d s. Biometrika 76, 503-514. C ollings, B. J. an d H am ilto n , M . A. (1988) E stim atin g the
B urr, D . (1994) A co m p ariso n o f certain b o o tstra p pow er o f the tw o-sam ple W ilcoxon test for location
confidence intervals in th e Cox m odel. Journal of the shift. Biometrics 44, 847-860.
American Statistical Association 89, 1290-1302. C ook, R . D ., H aw kins, D. M . an d W eisberg, S. (1992)
B urr, D . a n d D oss, H. (1993) C onfidence b an d s for the C o m p ariso n o f m odel m isspecification diagnostics using
m ed ian survival tim e as a function o f covariates in the residuals from least m ean o f squares an d least m edian o f
C ox m odel. Journal of the American Statistical squares fits. Journal of the American Statistical
Association 88, 1330-1340. Association 87, 4 1 9 ^ 2 4 .
C anty, A. J., D avison, A. C. an d H inkley, D. V. (1996) C ook, R. D ., Tsai, C.-L. an d Wei, B. C. (1986) Bias in
R eliable confidence intervals. D iscussion o f “ B o o tstrap n o n lin ear regression. Biometrika 73, 615-623.
confidence in terv als”, by T. J. D iC iccio an d B. Efron. C ook, R . D . a n d W eisberg, S. (1982) Residuals and
Statistical Science 11, 214-219. Influence in Regression. L o n d o n : C h a p m a n & Hall.
C arlstein, E. (1986) T h e use o f subseries values for C ook, R . D . a n d W eisberg, S. (1994) T ran sfo rm in g a
estim atin g th e v arian ce o f a general statistic from a response variable for linearity. Biometrika 81, 731-737.
sta tio n a ry sequence. Annals of Statistics 14, 1171-1179.
C o rco ran , S. A., D avison, A. C. an d Spady, R. H . (1996)
C a rp en ter, J. R. (1996) Simulated confidence regions for R eliable inference from em pirical likelihoods. P reprint,
parameters in epidemiological models. Ph.D . thesis, D e p a rtm e n t o f Statistics, U niversity o f O xford.
D e p a rtm e n t o f S tatistics, U niversity o f O xford.
Cow ling, A., H all, P. an d Phillips, M . J. (1996) B o o tstrap
C h am b ers, J. M . a n d H astie, T. J. (eds) (1992) Statistical confidence regions for th e intensity o f a Poisson process.
Models in S. Pacific G rove, C alifo rn ia: W adsw orth & Journal of the American Statistical Association 91,
B ro o k s/C o le. 1516-1524.
C h ao , M.-T. a n d Lo, S.-H. (1994) M ax im u m likelihood Cox, D . R . a n d H inkley, D . V. (1974) Theoretical Statistics.
su m m ary an d th e b o o tstra p m eth o d in stru ctu red finite L o n d o n : C h a p m a n & Hall.
p o p u latio n s. Statistica Sinica 4, 389-406.
Cox, D . R . a n d Isham , V. (1980) Point Processes. L o n d o n :
C h a p m a n , P. a n d H inkley, D . V. (1986) T h e double C h a p m a n & H all.
558 Bibliography
Cox, D. R. a n d Lewis, P. A. W. (1966) The Statistical likelihoods. Statistics and Computing 5, 257-264.
Analysis of Series of Events. L o n d o n : C h a p m a n & H all. D avison, A. C. a n d Snell, E. J. (1991) R esiduals an d
Cox, D. R. a n d O akes, D . (1984) Analysis of Survival Data. diagnostics. In Statistical Theory and Modelling: In
L o n d o n : C h a p m a n & H all. Honour of Sir David Cox, FRS, eds D. V. H inkley,
N . R eid a n d E. J. Snell, pp. 83-106. L o n d o n : C h a p m a n
Cox, D. R. a n d Snell, E. J. (1981) Applied Statistics:
Principles and Examples. L o n d o n : C h a p m a n & H all. & H all.
D e A ngelis, D . an d G ilks, W. R. (1994) E stim ating
Cressie, N. A. C. (1982) P laying safe w ith m isw eighted
m eans. Journal of the American Statistical Association acquired im m une deficiency syndrom e incidence
acco u n tin g for re p o rtin g delay. Journal of the Royal
77, 754-759.
Statistical Society series A 157, 31-40.
Cressie, N . A. C. (1991) Statistics for Spatial Data. N ew
Y o rk : Wiley. D e A ngelis, D ., H all, P. a n d Y oung, G . A. (1993)
A nalytical an d b o o tstra p ap p ro x im atio n s to estim ato r
D ah lh au s, R. a n d Ja n as, D. (1996) A frequency d o m ain d istrib u tio n s in L\ regression. Journal of the American
b o o tstra p fo r ra tio statistics in tim e series analysis. Statistical Association 88, 1310-1316.
Annals of Statistics 24, to ap p ear.
D e A ngelis, D . an d Y oung, G . A. (1992) S m oothing the
D aley, D. J. an d V ere-Jones, D. (1988) A n Introduction to b o o tstrap . International Statistical Review 60, 45-56.
the Theory of Point Processes. N ew Y ork: S pringer.
D em pster, A. P., L aird, N . M . a n d R ubin, D . B. (1977)
D aniels, H . E. (1954) S ad d lep o in t a p p ro x im atio n s in M ax im u m likelihood from incom plete d a ta via the E M
statistics. Annals of Mathematical Statistics 25, 631-650. alg o rith m (w ith D iscussion). Journal of the Royal
D aniels, H. E. (1955) D iscussion o f “ P e rm u ta tio n theory in Statistical Society series B 39, 1-38.
the d eriv atio n o f ro b u st criteria an d th e study o f D iaconis, P. an d H olm es, S. (1994) G ray codes for
d ep artu re s fro m assu m p tio n ”, by G . E. P. Box a n d S. L. ran d o m izatio n procedures. Statistics and Computing 4,
A ndersen. Journal of the Royal Statistical Society series 287-302.
B 17, 27-28.
D iC iccio, T. J. a n d E fron, B. (1992) M ore accurate
D aniels, H . E. (1958) D iscussion o f “T h e regression confidence intervals in exponential fam ilies. Biometrika
analysis o f b in ary sequences”, by D . R. Cox. Journal of 79, 231-245.
the Royal Statistical Society series B 20, 236-238.
D iC iccio, T. J. an d E fron, B. (1996) B o o tstrap confidence
D aniels, H. E. an d Y oung, G . A. (1991) S ad d lep o in t intervals (w ith D iscussion). Statistical Science 11,
ap p ro x im atio n fo r th e stu d entized m ean, w ith an 189-228.
a p p licatio n to th e b o o tstra p . Biometrika 78, 169-179.
D iC iccio, T. J., H all, P. a n d R o m ano, J. P. (1989)
D av iso n , A. C. (1988) D iscussion o f th e R oyal S tatistical C o m p ariso n o f p aram etric an d em pirical likelihood
Society m eeting o n th e b o o tstrap . Journal of the Royal functions. Biometrika 76, 465-476.
Statistical Society series B 50, 356-357.
D iC iccio, T. J., H all, P. an d R o m ano, J. P. (1991) E m pirical
D av iso n , A. C. an d H all, P. (1992) O n the bias a n d likelihood is B a rtlett-correctable. Annals of Statistics 19,
variab ility o f b o o tstra p an d cross-validation estim ates o f 1053-1061.
e rro r rate in d iscrim in atio n problem s. Biometrika 79,
D iC iccio, T. J., M a rtin , M . A. an d Y oung, G . A. (1992a)
279-284.
A nalytic ap p ro x im atio n s for iterated b o o tstra p
D av iso n , A. C. a n d H all, P. (1993) O n S tudentizing an d confidence intervals. Statistics and Computing 2,
blo ck in g m eth o d s fo r im plem enting the b o o tstra p w ith 161-171.
d ep en d en t d a ta . Australian Journal of Statistics 35,
D iC iccio, T. J., M a rtin , M . A. a n d Y oung, G . A. (1992b)
215-224.
F a st an d accu rate ap p ro x im ate dou b le b o o tstra p
D av iso n , A. C. a n d H inkley, D. V. (1988) S ad d lep o in t confidence intervals. Biometrika 79, 285-295.
ap p ro x im atio n s in resam pling m ethods. Biometrika 75,
D iC iccio, T. J., M a rtin , M . A. an d Y oung, G . A. (1994)
417-431.
A nalytical a p p ro x im atio n s to b o o tstra p d istrib u tio n
D av iso n , A. C., H inkley, D. V. a n d S chechtm an, E. (1986) functions using sad d lep o in t m ethods. Statistica Sinica 4,
Efficient b o o tstra p sim ulation. Biometrika 73, 555-566. 281-295.
D av iso n , A. C., H inkley, D. V. a n d W orton, B. J. (1992) D iC iccio, T. J. an d R o m an o , J. P. (1988) A review o f
B o o tstrap likelihoods. Biometrika 79, 113-130. b o o tstra p confidence intervals (w ith D iscussion). Journal
D av iso n , A. C., H inkley, D . V. a n d W orton, B. J. (1995) of the Royal Statistical Society series B 50, 338-370.
A ccu rate an d efficient co n stru ctio n o f b o o tstra p C o rrectio n , volum e 51, p. 470.
Bibliography 559
D o, K .-A . an d H all, P. (1992a) D istrib u tio n estim ation E fron, B. (1992) Ja ck k n ife-after-b o o tstrap sta n d a rd erro rs
using co n co m itan ts o f o rd er statistics, w ith application an d influence functions (w ith D iscussion). Journal of the
to M o n te C a rlo sim u latio n for the b o o tstrap . Journal of Royal Statistical Society series B 54, 83-127.
the Royal Statistical Society series B 54, 595-607. E fron, B. (1993) Bayes a n d likelihood calcu latio n s from
D o , K .-A . a n d H all, P. (1992b) Q u asi-ran d o m resam pling confidence intervals. Biometrika 80, 3-26.
fo r th e b o o tstra p . Statistics and Computing 1, 13-22. E fron, B. (1994) M issing d a ta , im p u tatio n , an d the
D o b so n , A. J. (1990) An Introduction to Generalized Linear b o o tstra p (w ith D iscussion). Journal of the American
Models. L o n d o n : C h a p m a n & H all. Statistical Association 89, 463-479.
D o n eg an i, M . (1991) A n ad aptive an d pow erful E fron, B., H allo ran , M . E. an d H olm es, S. (1996) B o o tstrap
ra n d o m izatio n test. Biometrika 78, 930-933. confidence levels fo r phylogenetic trees. Proceedings of
the National Academy of Sciences, U S A 93, 13429-13434.
D oss, H. a n d G ill, R. D. (1992) A n elem entary ap p ro ach
to w eak convergence fo r q u an tile processes, with E fron, B. a n d Stein, C. M . (1981) T he jack k n ife estim ate o f
ap p licatio n s to cen so red survival d ata. Journal of the variance. Annals of Statistics 9, 586-596.
American Statistical Association 87, 869-877. E fron, B. an d T ibshirani, R . J. (1986) B o o tstra p m ethods
D rap er, N . R. an d Sm ith, H . (1981) Applied Regression for sta n d a rd errors, confidence intervals, a n d o th er
Analysis. S econd edition. N ew Y o rk : Wiley. m easures o f statistical accuracy (w ith D iscussion).
Statistical Science 1, 54-96.
D u ch arm e, G . R., Jh u n , M., R o m ano, J. P. a n d T ruong,
K . N . (1985) B o o tstrap confidence cones for directional E fron, B. a n d T ibshirani, R . J. (1993) An Introduction to
d a ta . Biometrika 72, 637-645. the Bootstrap. N ew Y ork: C h a p m a n & H all.
E asto n , G . S. an d R o n ch etti, E. M. (1986) G en eral E fron, B. a n d T ibshirani, R . J. (1997) Im provem ents on
sa d d lep o in t a p p ro x im atio n s w ith ap p licatio n s to L cross-validation: the .632+ b o o tstra p m ethod. Journal of
statistics. Journal of the American Statistical Association the American Statistical Association 92, 548-560.
81, 420-430. Fang, K . T. a n d W ang, Y. (1994) Number-Theoretic
E fron, B. (1979) B o o tstrap m eth o d s: a n o th e r look a t the Methods in Statistics. L o n d o n : C h a p m a n & H all.
560 Bibliography
Faraway, J. J. (1992) On the cost of data analysis. Journal of Computational and Graphical Statistics 1, 213-229.
Feigl, P. and Zelen, M. (1965) Estimation of exponential survival probabilities with concomitant information. Biometrics 21, 826-838.
Feller, W. (1968) An Introduction to Probability Theory and its Applications. Third edition, volume I. New York: Wiley.
Fernholtz, L. T. (1983) von Mises Calculus for Statistical Functionals. Volume 19 of Lecture Notes in Statistics. New York: Springer.
Ferretti, N. and Romo, J. (1996) Unit root bootstrap tests for AR(1) models. Biometrika 83, 849-860.
Field, C. and Ronchetti, E. M. (1990) Small Sample Asymptotics. Volume 13 of Lecture Notes — Monograph Series. Hayward, California: Institute of Mathematical Statistics.
Firth, D. (1991) Generalized linear models. In Statistical Theory and Modelling: In Honour of Sir David Cox, FRS, eds D. V. Hinkley, N. Reid and E. J. Snell, pp. 55-82. London: Chapman & Hall.
Firth, D. (1993) Bias reduction of maximum likelihood estimates. Biometrika 80, 27-38.
Firth, D., Glosup, J. and Hinkley, D. V. (1991) Model checking with nonparametric curves. Biometrika 78, 245-252.
Fisher, N. I., Hall, P., Jing, B.-Y. and Wood, A. T. A. (1996) Improved pivotal methods for constructing confidence regions with directional data. Journal of the American Statistical Association 91, 1062-1070.
Fisher, N. I., Lewis, T. and Embleton, B. J. J. (1987) Statistical Analysis of Spherical Data. Cambridge: Cambridge University Press.
Fisher, R. A. (1935) The Design of Experiments. Edinburgh: Oliver and Boyd.
Fisher, R. A. (1947) The analysis of covariance method for the relation between a part and the whole. Biometrics 3, 65-68.
Fleming, T. R. and Harrington, D. P. (1991) Counting Processes and Survival Analysis. New York: Wiley.
Forster, J. J., McDonald, J. W. and Smith, P. W. F. (1996) Monte Carlo exact conditional tests for log-linear and logistic models. Journal of the Royal Statistical Society series B 58, 445-453.
Franke, J. and Härdle, W. (1992) On bootstrapping kernel spectral estimates. Annals of Statistics 20, 121-145.
Freedman, D. A. (1981) Bootstrapping regression models. Annals of Statistics 9, 1218-1228.
Freedman, D. A. (1984) On bootstrapping two-stage least-squares estimates in stationary linear models. Annals of Statistics 12, 827-842.
Freedman, D. A. and Peters, S. C. (1984a) Bootstrapping a regression equation: some empirical results. Journal of the American Statistical Association 79, 97-106.
Freedman, D. A. and Peters, S. C. (1984b) Bootstrapping an econometric model: some empirical results. Journal of Business & Economic Statistics 2, 150-158.
Freeman, D. H. (1987) Applied Categorical Data Analysis. New York: Marcel Dekker.
Frets, G. P. (1921) Heredity of head form in man. Genetica 3, 193-384.
Garcia-Soidan, P. H. and Hall, P. (1997) On sample reuse methods for spatial data. Biometrics 53, 273-281.
Garthwaite, P. H. and Buckland, S. T. (1992) Generating Monte Carlo confidence intervals by the Robbins-Monro process. Applied Statistics 41, 159-171.
Gatto, R. (1994) Saddlepoint methods and nonparametric approximations for econometric models. Ph.D. thesis, Faculty of Economic and Social Sciences, University of Geneva.
Gatto, R. and Ronchetti, E. M. (1996) General saddlepoint approximations of marginal densities and tail probabilities. Journal of the American Statistical Association 91, 666-673.
Geisser, S. (1975) The predictive sample reuse method with applications. Journal of the American Statistical Association 70, 320-328.
Geisser, S. (1993) Predictive Inference: An Introduction. London: Chapman & Hall.
Geyer, C. J. (1991) Constrained maximum likelihood exemplified by isotonic convex logistic regression. Journal of the American Statistical Association 86, 717-724.
Geyer, C. J. (1995) Likelihood ratio tests and inequality constraints. Technical Report 610, School of Statistics, University of Minnesota.
Gigli, A. (1994) Contributions to importance sampling and resampling. Ph.D. thesis, Department of Mathematics, Imperial College, London.
Gilks, W. R., Richardson, S. and Spiegelhalter, D. J. (eds) (1996) Markov Chain Monte Carlo in Practice. London: Chapman & Hall.
Gleason, J. R. (1988) Algorithms for balanced bootstrap simulations. American Statistician 42, 263-266.
Gong, G. (1983) Cross-validation, the jackknife, and the bootstrap: excess error estimation in forward logistic regression. Journal of the American Statistical Association 78, 108-113.
Götze, F. and Künsch, H. R. (1996) Second order correctness of the blockwise bootstrap for stationary observations. Annals of Statistics 24, 1914-1933.
Graham, R. L., Hinkley, D. V., John, P. W. M. and Shi, S. (1990) Balanced design of bootstrap simulations. Journal of the Royal Statistical Society series B 52, 185-202.
Gray, H. L. and Schucany, W. R. (1972) The Generalized Jackknife Statistic. New York: Marcel Dekker.
Green, P. J. and Silverman, B. W. (1994) Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. London: Chapman & Hall.
Gross, S. (1980) Median estimation in sample surveys. In Proceedings of the Section on Survey Research Methods, pp. 181-184. Alexandria, Virginia: American Statistical Association.
Haldane, J. B. S. (1940) The mean and variance of χ², when used as a test of homogeneity, when expectations are small. Biometrika 31, 346-355.
Hall, P. (1985) Resampling a coverage pattern. Stochastic Processes and their Applications 20, 231-246.
Hall, P. (1986) On the bootstrap and confidence intervals. Annals of Statistics 14, 1431-1452.
Hall, P. (1987) On the bootstrap and likelihood-based confidence regions. Biometrika 74, 481-493.
Hall, P. (1988a) Theoretical comparison of bootstrap confidence intervals (with Discussion). Annals of Statistics 16, 927-985.
Hall, P. (1988b) On confidence intervals for spatial parameters estimated from nonreplicated data. Biometrics 44, 271-277.
Hall, P. (1989a) Antithetic resampling for the bootstrap. Biometrika 76, 713-724.
Hall, P. (1989b) Unusual properties of bootstrap confidence intervals in regression problems. Probability Theory and Related Fields 81, 247-273.
Hall, P. (1990) Pseudo-likelihood theory for empirical likelihood. Annals of Statistics 18, 121-140.
Hall, P. (1992a) The Bootstrap and Edgeworth Expansion. New York: Springer.
Hall, P. (1992b) On bootstrap confidence intervals in nonparametric regression. Annals of Statistics 20, 695-711.
Hall, P. (1995) On the biases of error estimators in prediction problems. Statistics and Probability Letters 24, 257-262.
Hall, P., DiCiccio, T. J. and Romano, J. P. (1989) On smoothing and the bootstrap. Annals of Statistics 17, 692-704.
Hall, P. and Horowitz, J. L. (1993) Corrections and blocking rules for the block bootstrap with dependent data. Technical Report SRI 1-93, Centre for Mathematics and its Applications, Australian National University.
Hall, P., Horowitz, J. L. and Jing, B.-Y. (1995) On blocking rules for the bootstrap with dependent data. Biometrika 82, 561-574.
Hall, P. and Jing, B.-Y. (1996) On sample reuse methods for dependent data. Journal of the Royal Statistical Society series B 58, 727-737.
Hall, P. and Keenan, D. M. (1989) Bootstrap methods for constructing confidence regions for hands. Communications in Statistics — Stochastic Models 5, 555-562.
Hall, P. and La Scala, B. (1990) Methodology and algorithms of empirical likelihood. International Statistical Review 58, 109-127.
Hall, P. and Martin, M. A. (1988) On bootstrap resampling and iteration. Biometrika 75, 661-671.
Hall, P. and Owen, A. B. (1993) Empirical likelihood confidence bands in density estimation. Journal of Computational and Graphical Statistics 2, 273-289.
Hall, P. and Titterington, D. M. (1989) The effect of simulation order on level accuracy and power of Monte Carlo tests. Journal of the Royal Statistical Society series B 51, 459-467.
Hall, P. and Wilson, S. R. (1991) Two guidelines for bootstrap hypothesis testing. Biometrics 47, 757-762.
Hamilton, M. A. and Collings, B. J. (1991) Determining the appropriate sample size for nonparametric tests for location shift. Technometrics 33, 327-337.
Hammersley, J. M. and Handscomb, D. C. (1964) Monte Carlo Methods. London: Methuen.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986) Robust Statistics: The Approach Based on Influence Functions. New York: Wiley.
Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J. and Ostrowski, E. (eds) (1994) A Handbook of Small Data Sets. London: Chapman & Hall.
Härdle, W. (1989) Resampling for inference from curves. In Bulletin of the 47th Session of the International Statistical Institute, Paris, August 1989, volume 3, pp. 53-63.
Härdle, W. (1990) Applied Nonparametric Regression. Cambridge: Cambridge University Press.
Härdle, W. and Bowman, A. W. (1988) Bootstrapping in nonparametric regression: local adaptive smoothing and confidence bands. Journal of the American Statistical Association 83, 102-110.
Härdle, W. and Marron, J. S. (1991) Bootstrap simultaneous error bars for nonparametric regression. Annals of Statistics 19, 778-796.
Hartigan, J. A. (1969) Using subsample values as typical values. Journal of the American Statistical Association 64, 1303-1317.
Hartigan, J. A. (1971) Error analysis by replaced samples. Journal of the Royal Statistical Society series B 33, 98-110.
Hartigan, J. A. (1975) Necessary and sufficient conditions for asymptotic joint normality of a statistic and its subsample values. Annals of Statistics 3, 573-580.
Hartigan, J. A. (1990) Perturbed periodogram estimates of variance. International Statistical Review 58, 1-7.
Hastie, T. J. and Loader, C. (1993) Local regression: automatic kernel carpentry (with Discussion). Statistical Science 8, 120-143.
Hastie, T. J. and Tibshirani, R. J. (1990) Generalized Additive Models. London: Chapman & Hall.
Hayes, K. G., Perl, M. L. and Efron, B. (1989) Application of the bootstrap statistical method to the tau-decay-mode problem. Physical Review Series D 39, 274-279.
Heller, G. and Venkatraman, E. S. (1996) Resampling procedures to compare two survival distributions in the presence of right-censored data. Biometrics 52, 1204-1213.
Hesterberg, T. C. (1988) Advances in importance sampling. Ph.D. thesis, Department of Statistics, Stanford University, California.
Hesterberg, T. C. (1995a) Tail-specific linear approximations for efficient bootstrap simulations. Journal of Computational and Graphical Statistics 4, 113-133.
Hesterberg, T. C. (1995b) Weighted average importance sampling and defensive mixture distributions. Technometrics 37, 185-194.
Hinkley, D. V. (1977) Jackknifing in unbalanced situations. Technometrics 19, 285-292.
Hinkley, D. V. and Schechtman, E. (1987) Conditional bootstrap methods in the mean-shift model. Biometrika 74, 85-93.
Hinkley, D. V. and Shi, S. (1989) Importance sampling and the nested bootstrap. Biometrika 76, 435-446.
Hinkley, D. V. and Wang, S. (1991) Efficiency of robust standard errors for regression coefficients. Communications in Statistics — Theory and Methods 20, 1-11.
Hinkley, D. V. and Wei, B. C. (1984) Improvements of jackknife confidence limit methods. Biometrika 71, 331-339.
Hirose, H. (1993) Estimation of threshold stress in accelerated life-testing. IEEE Transactions on Reliability 42, 650-657.
Hjort, N. L. (1985) Bootstrapping Cox's regression model. Technical Report NSF-241, Department of Statistics, Stanford University.
Hjort, N. L. (1992) On inference in parametric survival data models. International Statistical Review 60, 355-387.
Horváth, L. and Yandell, B. S. (1987) Convergence rates for the bootstrapped product-limit process. Annals of Statistics 15, 1155-1173.
Hosmer, D. W. and Lemeshow, S. (1989) Applied Logistic Regression. New York: Wiley.
Hu, F. and Zidek, J. V. (1995) A bootstrap based on the estimating equations of the linear model. Biometrika 82, 263-275.
Huet, S., Jolivet, E. and Messean, A. (1990) Some simulations results about confidence intervals and bootstrap methods in nonlinear regression. Statistics 3, 369-432.
Hyde, J. (1980) Survival analysis with incomplete observations. In Biostatistics Casebook, eds R. G. Miller, B. Efron, B. W. Brown and L. E. Moses, pp. 31-46. New York: Wiley.
Janas, D. (1993) Bootstrap Procedures for Time Series. Aachen: Verlag Shaker.
Jennison, C. (1992) Bootstrap tests and confidence intervals for a hazard ratio when the number of observed failures is small, with applications to group sequential survival studies. In Computer Science and Statistics: Proceedings of the 22nd Symposium on the Interface, eds C. Page and R. LePage, pp. 89-97. New York: Springer.
Jensen, J. L. (1992) The modified signed likelihood statistic and saddlepoint approximations. Biometrika 79, 693-703.
Jensen, J. L. (1995) Saddlepoint Approximations. Oxford: Clarendon Press.
Jeong, J. and Maddala, G. S. (1993) A perspective on application of bootstrap methods in econometrics. In Handbook of Statistics, vol. 11: Econometrics, eds G. S. Maddala, C. R. Rao and H. D. Vinod, pp. 573-610. Amsterdam: North-Holland.
Jing, B.-Y. and Robinson, J. (1994) Saddlepoint approximations for marginal and conditional probabilities of transformed variables. Annals of Statistics 22, 1115-1132.
Jing, B.-Y. and Wood, A. T. A. (1996) Exponential empirical likelihood is not Bartlett correctable. Annals of Statistics 24, 365-369.
Jöckel, K.-H. (1986) Finite sample properties and asymptotic efficiency of Monte Carlo tests. Annals of Statistics 14, 336-347.
Johns, M. V. (1988) Importance sampling for bootstrap confidence intervals. Journal of the American Statistical Association 83, 709-714.
Journel, A. G. (1994) Resampling from stochastic simulations (with Discussion). Environmental and Ecological Statistics 1, 63-91.
Kabaila, P. (1993a) Some properties of profile bootstrap confidence intervals. Australian Journal of Statistics 35, 205-214.
Kabaila, P. (1993b) On bootstrap predictive inference for autoregressive processes. Journal of Time Series Analysis 14, 473-484.
Kalbfleisch, J. D. and Prentice, R. L. (1980) The Statistical Analysis of Failure Time Data. New York: Wiley.
Kaplan, E. L. and Meier, P. (1958) Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 53, 457-481.
Karr, A. F. (1991) Point Processes and their Statistical Inference. Second edition. New York: Marcel Dekker.
Katz, R. (1995) Spatial analysis of pore images. Ph.D. thesis, Department of Statistics, University of Oxford.
Kendall, D. G. and Kendall, W. S. (1980) Alignments in two-dimensional random sets of points. Advances in Applied Probability 12, 380-424.
Kim, J.-H. (1990) Conditional bootstrap methods for censored data. Ph.D. thesis, Department of Statistics, Florida State University.
Künsch, H. R. (1989) The jackknife and bootstrap for general stationary observations. Annals of Statistics 17, 1217-1241.
Lahiri, S. N. (1991) Second-order optimality of stationary bootstrap. Statistics and Probability Letters 11, 335-341.
Lahiri, S. N. (1995) On the asymptotic behaviour of the moving block bootstrap for normalized sums of heavy-tail random variables. Annals of Statistics 23, 1331-1349.
Laird, N. M. (1978) Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the American Statistical Association 73, 805-811.
Laird, N. M. and Louis, T. A. (1987) Empirical Bayes confidence intervals based on bootstrap samples (with Discussion). Journal of the American Statistical Association 82, 739-757.
Lawson, A. B. (1993) On the analysis of mortality events associated with a prespecified fixed point. Journal of the Royal Statistical Society series A 156, 363-377.
Lee, S. M. S. and Young, G. A. (1995) Asymptotic iterated bootstrap confidence intervals. Annals of Statistics 23, 1301-1330.
Léger, C., Politis, D. N. and Romano, J. P. (1992) Bootstrap technology and applications. Technometrics 34, 378-398.
Léger, C. and Romano, J. P. (1990a) Bootstrap choice of tuning parameters. Annals of the Institute of Statistical Mathematics 42, 709-735.
Léger, C. and Romano, J. P. (1990b) Bootstrap adaptive estimation: the trimmed mean example. Canadian Journal of Statistics 18, 297-314.
Lehmann, E. L. (1986) Testing Statistical Hypotheses. Second edition. New York: Wiley.
Li, G. (1995) Nonparametric likelihood ratio estimation of probabilities for truncated data. Journal of the American Statistical Association 90, 997-1003.
Li, H. and Maddala, G. S. (1996) Bootstrapping time series models (with Discussion). Econometric Reviews 15, 115-195.
Li, K.-C. (1987) Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: discrete index set. Annals of Statistics 15, 958-975.
Liu, R. Y. and Singh, K. (1992a) Moving blocks jackknife and bootstrap capture weak dependence. In Exploring the Limits of Bootstrap, eds R. LePage and L. Billard, pp. 225-248. New York: Wiley.
Liu, R. Y. and Singh, K. (1992b) Efficiency and robustness in resampling. Annals of Statistics 20, 370-384.
Lloyd, C. J. (1994) Approximate pivots from M-estimators. Statistica Sinica 4, 701-714.
Lo, S.-H. and Singh, K. (1986) The product-limit estimator and the bootstrap: some asymptotic representations. Probability Theory and Related Fields 71, 455-465.
Loh, W.-Y. (1987) Calibrating confidence coefficients. Journal of the American Statistical Association 82, 155-162.
Mallows, C. L. (1973) Some comments on Cp. Technometrics 15, 661-675.
Mammen, E. (1989) Asymptotics with increasing dimension for robust regression with applications to the bootstrap. Annals of Statistics 17, 382-400.
Mammen, E. (1992) When Does Bootstrap Work? Asymptotic Results and Simulations. Volume 77 of Lecture Notes in Statistics. New York: Springer.
Mammen, E. (1993) Bootstrap and wild bootstrap for high dimensional linear models. Annals of Statistics 21, 255-285.
Manly, B. F. J. (1991) Randomization and Monte Carlo Methods in Biology. London: Chapman & Hall.
Marriott, F. H. C. (1979) Barnard's Monte Carlo tests: how many simulations? Applied Statistics 28, 75-77.
McCarthy, P. J. (1969) Pseudo-replication: half samples. Review of the International Statistical Institute 37, 239-264.
McCarthy, P. J. and Snowden, C. B. (1985) The Bootstrap and Finite Population Sampling. Vital and Public Health Statistics (Ser. 2, No. 95), Public Health Service Publication. Washington, DC: United States Government Printing Office, 85-1369.
McCullagh, P. (1987) Tensor Methods in Statistics. London: Chapman & Hall.
McCullagh, P. and Nelder, J. A. (1989) Generalized Linear Models. Second edition. London: Chapman & Hall.
McKay, M. D., Beckman, R. J. and Conover, W. J. (1979) A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21, 239-245.
McKean, J. W., Sheather, S. J. and Hettmansperger, T. P. (1993) The use and interpretation of residuals based on robust estimation. Journal of the American Statistical Association 88, 1254-1263.
McLachlan, G. J. (1992) Discriminant Analysis and Statistical Pattern Recognition. New York: Wiley.
Milan, L. and Whittaker, J. (1995) Application of the parametric bootstrap to models that incorporate a singular value decomposition. Applied Statistics 44, 31-49.
Miller, R. G. (1974) The jackknife — a review. Biometrika 61, 1-15.
Miller, R. G. (1981) Survival Analysis. New York: Wiley.
Monti, A. C. (1997) Empirical likelihood confidence regions in time series models. Biometrika 84, 395-405.
Morgenthaler, S. and Tukey, J. W. (eds) (1991) Configural Polysampling: A Route to Practical Robustness. New York: Wiley.
Moulton, L. H. and Zeger, S. L. (1989) Analyzing repeated measures on generalized linear models via the bootstrap. Biometrics 45, 381-394.
Moulton, L. H. and Zeger, S. L. (1991) Bootstrapping generalized linear models. Computational Statistics and Data Analysis 11, 53-63.
Muirhead, C. R. and Darby, S. C. (1989) Royal Statistical Society meeting on cancer near nuclear installations. Journal of the Royal Statistical Society series A 152, 305-384.
Murphy, S. A. (1995) Likelihood-based confidence intervals in survival analysis. Journal of the American Statistical Association 90, 1399-1405.
Mykland, P. A. (1995) Dual likelihood. Annals of Statistics 23, 396-421.
Nelder, J. A. and Pregibon, D. (1987) An extended quasi-likelihood function. Biometrika 74, 221-232.
Newton, M. A. and Geyer, C. J. (1994) Bootstrap recycling: a Monte Carlo alternative to the nested bootstrap. Journal of the American Statistical Association 89, 905-912.
Newton, M. A. and Raftery, A. E. (1994) Approximate Bayesian inference with the weighted likelihood bootstrap (with Discussion). Journal of the Royal Statistical Society series B 56, 3-48.
Niederreiter, H. (1992) Random Number Generation and Quasi-Monte Carlo Methods. Number 63 in CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia: SIAM.
Nordgaard, A. (1990) On the resampling of stochastic processes using a bootstrap approach. Ph.D. thesis, Department of Mathematics, Linköping University, Sweden.
Noreen, E. W. (1989) Computer Intensive Methods for Testing Hypotheses: An Introduction. New York: Wiley.
Ogbonmwan, S.-M. (1985) Accelerated resampling codes with application to likelihood. Ph.D. thesis, Department of Mathematics, Imperial College, London.
Ogbonmwan, S.-M. and Wynn, H. P. (1986) Accelerated resampling codes with low discrepancy. Preprint, Department of Statistics and Actuarial Science, The City University.
Olshen, R. A., Biden, E. N., Wyatt, M. P. and Sutherland, D. H. (1989) Gait analysis and the bootstrap. Annals of Statistics 17, 1419-1440.
Owen, A. B. (1988) Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75, 237-249.
Owen, A. B. (1990) Empirical likelihood ratio confidence regions. Annals of Statistics 18, 90-120.
Owen, A. B. (1991) Empirical likelihood for linear models. Annals of Statistics 19, 1725-1747.
Owen, A. B. (1992a) Empirical likelihood and small samples. In Computer Science and Statistics: Proceedings of the 22nd Symposium on the Interface, eds C. Page and R. LePage, pp. 79-88. New York: Springer.
Owen, A. B. (1992b) A central limit theorem for Latin hypercube sampling. Journal of the Royal Statistical Society series B 54, 541-551.
Rubin, D. B. and Schenker, N. (1986) Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. Journal of the American Statistical Association 81, 366-374.
Ruppert, D. and Carroll, R. J. (1980) Trimmed least squares estimation in the linear model. Journal of the American Statistical Association 75, 828-838.
Samawi, H. M. (1994) Power estimation for two-sample tests using importance and antithetic resampling. Ph.D. thesis, Department of Statistics and Actuarial Science, University of Iowa, Ames.
Sauerbrei, W. and Schumacher, M. (1992) A bootstrap resampling procedure for model building: application to the Cox regression model. Statistics in Medicine 11, 2093-2109.
Schenker, N. (1985) Qualms about bootstrap confidence intervals. Journal of the American Statistical Association 80, 360-361.
Seber, G. A. F. (1977) Linear Regression Analysis. New York: Wiley.
Shao, J. (1988) On resampling methods for variance and bias estimation in linear models. Annals of Statistics 16, 986-1008.
Shao, J. (1993) Linear model selection by cross-validation. Journal of the American Statistical Association 88, 486-494.
Shao, J. (1996) Bootstrap model selection. Journal of the American Statistical Association 91, 655-665.
Shao, J. and Tu, D. (1995) The Jackknife and Bootstrap. New York: Springer.
Shao, J. and Wu, C. F. J. (1989) A general theory for jackknife variance estimation. Annals of Statistics 17, 1176-1197.
Shorack, G. (1982) Bootstrapping robust regression. Communications in Statistics — Theory and Methods 11, 961-972.
Silverman, B. W. (1981) Using kernel density estimates to investigate multimodality. Journal of the Royal Statistical Society series B 43, 97-99.
Silverman, B. W. (1985) Some aspects of the spline smoothing approach to non-parametric regression curve fitting (with Discussion). Journal of the Royal Statistical Society series B 47, 1-52.
Silverman, B. W. and Young, G. A. (1987) The bootstrap: to smooth or not to smooth? Biometrika 74, 469-479.
Simonoff, J. S. and Tsai, C.-L. (1994) Use of modified profile likelihood for improved tests of constancy of variance in regression. Applied Statistics 43, 357-370.
Singh, K. (1981) On the asymptotic accuracy of Efron's bootstrap. Annals of Statistics 9, 1187-1195.
Sitter, R. R. (1992) A resampling procedure for complex survey data. Journal of the American Statistical Association 87, 755-765.
Smith, P. W. F., Forster, J. J. and McDonald, J. W. (1996) Monte Carlo exact tests for square contingency tables. Journal of the Royal Statistical Society series A 159, 309-321.
Spady, R. H. (1991) Saddlepoint approximations for regression models. Biometrika 78, 879-889.
St. Laurent, R. T. and Cook, R. D. (1993) Leverage, local influence, and curvature in nonlinear regression. Biometrika 80, 99-106.
Stangenhaus, G. (1987) Bootstrap and inference procedures for L1 regression. In Statistical Data Analysis Based on the L1-Norm and Related Methods, ed. Y. Dodge, pp. 323-332. Amsterdam: North-Holland.
Stein, C. M. (1985) On the coverage probability of confidence sets based on a prior distribution. Volume 16 of Banach Centre Publications. Warsaw: PWN — Polish Scientific Publishers.
Stein, M. (1987) Large sample properties of simulations using Latin hypercube sampling. Technometrics 29, 143-151.
Sternberg, H. O'R. (1987) Aggravation of floods in the Amazon River as a consequence of deforestation? Geografiska Annaler 69A, 201-219.
Sternberg, H. O'R. (1995) Water and wetlands of Brazilian Amazonia: an uncertain future. In The Fragile Tropics of Latin America: Sustainable Management of Changing Environments, eds T. Nishizawa and J. I. Uitto, pp. 113-179. Tokyo: United Nations University Press.
Stine, R. A. (1985) Bootstrap prediction intervals for regression. Journal of the American Statistical Association 80, 1026-1031.
Stoffer, D. S. and Wall, K. D. (1991) Bootstrapping state-space models: Gaussian maximum likelihood estimation and the Kalman filter. Journal of the American Statistical Association 86, 1024-1033.
Stone, M. (1974) Cross-validatory choice and assessment of statistical predictions (with Discussion). Journal of the Royal Statistical Society series B 36, 111-147.
Stone, M. (1977) An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society series B 39, 44-47.
Swanepoel, J. W. H. and van Wyk, J. W. J. (1986) The bootstrap applied to power spectral density function estimation. Biometrika 73, 135-141.
Tanner, M. A. (1996) Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. Third edition. New York: Springer.
Tanner, M. A. and Wong, W. H. (1987) The calculation of posterior densities by data augmentation (with Discussion). Journal of the American Statistical Association 82, 528-550.
Theiler, J., Galdrikian, B., Longtin, A., Eubank, S. and Farmer, J. D. (1992) Using surrogate data to detect nonlinearity in time series. In Nonlinear Modeling and Forecasting, eds M. Casdagli and S. Eubank, number XII in Santa Fe Institute Studies in the Sciences of Complexity, pp. 163-188. New York: Addison-Wesley.
Therneau, T. (1983) Variance reduction techniques for the bootstrap. Ph.D. thesis, Department of Statistics, Stanford University, California.
Tibshirani, R. J. (1988) Variance stabilization and the bootstrap. Biometrika 75, 433-444.
Tong, H. (1990) Non-linear Time Series: A Dynamical System Approach. Oxford: Clarendon Press.
Tsay, R. S. (1992) Model checking via parametric bootstraps in time series. Applied Statistics 41, 1-15.
Tukey, J. W. (1958) Bias and confidence in not quite large samples (Abstract). Annals of Mathematical Statistics 29, 614.
Venables, W. N. and Ripley, B. D. (1994) Modern Applied Statistics with S-Plus. New York: Springer.
Ventura, V. (1997) Likelihood inference by Monte Carlo methods and efficient nested bootstrapping. D.Phil. thesis, Department of Statistics, University of Oxford.
Ventura, V., Davison, A. C. and Boniface, S. J. (1997) Statistical inference for the effect of magnetic brain stimulation on a motoneurone. Applied Statistics 46, to appear.
Wahrendorf, J., Becher, H. and Brown, C. C. (1987) Bootstrap comparison of non-nested generalized linear models: applications in survival analysis and epidemiology. Applied Statistics 36, 72-81.
Wand, M. P. and Jones, M. C. (1995) Kernel Smoothing. London: Chapman & Hall.
Wang, S. (1990) Saddlepoint approximations in resampling analysis. Annals of the Institute of Statistical Mathematics 42, 115-131.
Wang, S. (1992) General saddlepoint approximations in the bootstrap. Statistics and Probability Letters 13, 61-66.
Wang, S. (1993a) Saddlepoint expansions in finite population problems. Biometrika 80, 583-590.
Wang, S. (1993b) Saddlepoint methods for bootstrap confidence bands in nonparametric regression. Australian Journal of Statistics 35, 93-101.
Wang, S. (1995) Optimizing the smoothed bootstrap. Annals of the Institute of Statistical Mathematics 47, 65-80.
Weisberg, S. (1985) Applied Linear Regression. Second edition. New York: Wiley.
Welch, B. L. and Peers, H. W. (1963) On formulae for confidence points based on integrals of weighted likelihoods. Journal of the Royal Statistical Society series B 25, 318-329.
Welch, W. J. (1990) Construction of permutation tests. Journal of the American Statistical Association 85, 693-698.
Welch, W. J. and Fahey, T. J. (1994) Correcting for covariates in permutation tests. Technical Report STAT-94-12, Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario.
Westfall, P. H. and Young, S. S. (1993) Resampling-Based Multiple Testing: Examples and Methods for p-value Adjustment. New York: Wiley.
Woods, H., Steinour, H. H. and Starke, H. R. (1932) Effect of composition of Portland cement on heat evolved during hardening. Industrial Engineering and Chemistry 24, 1207-1214.
Wu, C. J. F. (1986) Jackknife, bootstrap and other resampling methods in regression analysis (with Discussion). Annals of Statistics 14, 1261-1350.
Wu, C. J. F. (1990) On the asymptotic properties of the jackknife histogram. Annals of Statistics 18, 1438-1452.
Wu, C. J. F. (1991) Balanced repeated replications based on mixed orthogonal arrays. Biometrika 78, 181-188.
Young, G. A. (1986) Conditioned data-based simulations: Some examples from geometrical statistics. International Statistical Review 54, 1-13.
Young, G. A. (1990) Alternative smoothed bootstraps. Journal of the Royal Statistical Society series B 52, 477-484.
Young, G. A. and Daniels, H. E. (1990) Bootstrap bias. Biometrika 77, 179-185.
Name Index

Witkowski, J. A. 417
Wong, W. H. 124
Wood, A. T. A. 247, 251, 486, 488, 491, 515, 517
Woods, H. 277
Worton, B. J. 486, 487, 493, 515, 517, 518
Wu, C. J. F. 60, 125, 130, 315, 316
Wyatt, M. P. 316
Wynn, H. P. 515
Yahav, J. A. 487, 494
Yandell, B. S. 374
Ying, Z. 250
Young, G. A. x, 59, 60, 124, 128, 246, 316, 428, 486, 487, 493
Young, S. S. 184
Zeger, S. L. 374, 376, 377
Zelen, M. 328
Zidek, J. V. 318
Example index

accelerated life test, 346, 379
adaptive test, 187, 188
AIDS data, 1, 342, 369
air-conditioning data, 4, 15, 17, 19, 25, 27, 30, 33, 36, 197, 199, 203, 205, 207, 209, 216, 217, 233, 501, 508, 513, 520
Amis data, 253
AML data, 83, 86, 146, 160, 187
antithetic bootstrap, 493
association, 421, 422
autoregression, 388, 391, 393, 398, 432, 434
average, 13, 15, 17, 19, 22, 25, 27, 30, 33, 36, 47, 51, 88, 92, 94, 98, 128, 501, 508, 513, 516
axial data, 234, 505
balanced bootstrap, 440, 441, 442, 445, 487, 488, 489, 494
Bayesian bootstrap, 513, 518, 520
beaver data, 434
bias estimation, 106, 440, 464, 466, 488, 492, 495
binomial data, 338, 359, 361
bivariate missing data, 90, 128
block bootstraps, 398, 401, 403, 432
bootstrap likelihood, 508, 517, 518
bootstrap recycling, 464, 466, 492, 496
brambles data, 422
Breslow data, 378
calcium uptake data, 355, 441, 442
capability index, 248, 253, 497
carbon monoxide data, 67
cats data, 321
caveolae data, 416, 425
CD4 data, 68, 134, 190, 251, 252, 254
cement data, 277
changepoint estimation, 241
Channing House data, 131
circular data, 126, 517, 520
city population data, 6, 13, 22, 30, 49, 52, 53, 54, 66, 95, 108, 110, 113, 118, 201, 238, 439, 440, 447, 464, 473, 490, 492, 513
Claridge data, 157, 158, 496
cloth data, 382
coal-mining data, 435
comparison of means, 159, 162, 163, 166, 171, 172, 176, 181, 186, 454, 457, 519
comparison of variable selection methods, 306
convex regression, 371
correlation coefficient, 48, 61, 63, 68, 80, 90, 108, 115, 157, 158, 187, 247, 251, 254, 475, 493, 496, 518
correlogram, 388
Darwin data, 186, 188, 471, 481, 498
difference of means, 71, 75
dogs data, 187
double bootstrap, 176, 177, 224, 226, 254, 464, 466, 469
Downs' syndrome data, 371
ducks data, 134
eigenvalue, 64, 134, 252, 277, 445, 447
empirical exponential family likelihood, 505, 516, 520
empirical likelihood, 501, 516, 519, 520
equal marginal distributions, 78
exponential mean, 15, 17, 19, 30, 61, 176, 224, 250, 510
exponential model, 188, 328, 334, 367
factorial experiment, 320, 322
fir seedlings data, 142
Frets' heads data, 115, 447
gamma model, 5, 25, 36, 148, 207, 233, 247, 376
generalized additive model, 367, 369, 371, 382, 383
generalized linear model, 328, 334, 338, 342, 367, 376, 378, 381, 383
gravity data, 72, 121, 131, 454, 457, 494, 519
handedness data, 157, 496
hazard ratio, 221
head size data, 115
heart disease data, 378
hypergeometric distribution, 487
importance sampling, 454, 457, 461, 464, 466, 489, 490, 491, 495
imputation, 88, 90
independence, 177
influence function, 48, 53
intensity estimate, 418
Islay data, 520
isotonic regression, 371
jackknife, 51, 64, 65, 317
jackknife-after-bootstrap, 115, 130, 134, 313, 325
K-function, 416, 422
kernel density estimate, 226, 413, 469
kernel intensity estimate, 418, 431
laterite data, 234, 505
leukaemia data, 328, 334, 367
likelihood ratio statistic, 62, 148, 247, 346, 501
linear approximation, 118, 468, 490
logistic regression, 141, 146, 338, 359, 361, 371, 376
log-linear model, 342, 369
lognormal model, 66, 148
low birth weights data, 361
lynx data, 432
maize data, 181
mammals data, 257, 262, 265, 324
MCMC, 146, 184, 185
matched pairs, 186, 187, 188, 492
mean, see average
mean polar axis, 234, 505
median, see sample median
median survival time, 86
melanoma data, 352
misclassification error, 359, 361, 381
missing data, 88, 90, 128
mixed continuous-discrete distributions, 78
model selection, 304, 306, 393, 432
motorcycle impact data, 363, 365
multinomial distribution, 66, 487
multiple regression, 276, 277, 281, 286, 287, 298, 300, 304, 306, 309, 313
neurophysiological point process data, 418
neurotransmission data, 189
Nile data, 241
nitrofen data, 383
nodal involvement data, 381
nonlinear regression, 355, 441, 442
nonlinear time series, 393, 401
nonparametric regression, 365
normal plot, 150, 152, 154
normal prediction limit, 244
normal variance, 208
nuclear power data, 286, 298, 304, 323
one-way model, 276, 319, 320
overdispersion, 142, 338, 342
paired comparison, 471, 481, 498
partial correlation, 115
Paulsen data, 189
periodogram, 388
periodogram resampling, 413, 430
PET film data, 346, 379
phase scrambling, 410, 430, 435
point process data, 416, 418, 421
poisons data, 322
Poisson process, 416, 418, 422, 425, 431, 435
Poisson regression, 342, 369, 378, 382, 383
prediction, 244, 286, 287, 323, 324, 342
prediction error, 298, 300, 320, 321, 359, 361, 369, 381, 393, 401
product-limit estimator, 86, 128
proportional hazards, 146, 160, 221, 352
quantile, 48, 253, 352
quartzite data, 520
ratio, 6, 13, 22, 30, 49, 52, 53, 54, 66, 95, 98, 108, 110, 113, 118, 126, 127, 165, 201, 217, 238, 249, 439, 447, 464, 473, 490, 513
regression, see convex regression, generalized additive model, generalized linear model, logistic regression, log-linear model, multiple regression, nonlinear regression, nonparametric regression, robust regression, straight-line regression
regression prediction, 286, 287, 298, 300, 320, 321, 323, 324, 342, 359, 361, 369, 381
reliability data, 346, 379
remission data, 378
returns data, 269, 272, 449, 461
Richardson extrapolation, 494
Rio Negro data, 388, 398, 403, 410
robust M-estimate, 318, 471, 483
robust regression, 308, 309, 313, 318, 324
robust variance, 265, 318, 376
rock data, 281, 287
saddlepoint approximation, 468, 469, 471, 473, 475, 477, 481, 483, 492, 493, 497
salinity data, 309, 313, 324
sample maximum, 39, 56, 247
sample median, 41, 61, 65, 80
sample variance, 61, 62, 64, 104, 481
separate families test, 148
several samples, 72, 126, 131, 133, 519
simulated data, 306
smoothed bootstrap, 80, 127, 168, 169, 418, 431
spatial association, 421, 422
spatial clustering, 416
spatial epidemiology, 421
spectral density estimation, 413
spherical data, 126, 234, 505
spline model, 365
stationary bootstrap, 398, 403, 428, 429
straight-line regression, 257, 262, 265, 269, 272, 308, 317, 321, 322, 449, 461
stratified ratio, 98
Strauss process, 416, 425
studentized statistic, 477, 481, 483
sugar cane data, 338
sunspot data, 393, 401, 435
survival probability, 86, 131
survival proportion data, 308, 322
survival time data, 328, 334, 346, 352, 367
survivor functions, 83
symmetric distribution, 78, 251, 471, 483
tau particle data, 133, 495
test of correlation, 157
test for overdispersion, 142, 184
test for regression coefficient, 269, 281, 313
test of interaction, 322
tile resampling, 425, 432
times on delivery suite data, 300
traffic data, 253
transformation, 33, 108, 118, 169, 226, 322, 355, 418
trend test in time series, 403, 410
trimmed average, 64, 121, 130, 133
tuna data, 169, 228, 469
two-sample problem, see comparison of means
two-way model, 177, 184, 338
unimodality, 168, 169, 189
unit root test, 391
urine data, 359
variable selection, 304, 306
variance estimation, 208, 446, 464, 488, 495
Weibull model, 346, 379
weighted average, 72, 126, 131
weird bootstrap, 128
Wilcoxon test, 181
wild bootstrap, 272, 319
wool prices data, 391
Subject index

bootstrap
  balanced
    higher-order, 441, 486, 489
    theory, 443-445, 487
  balanced importance resampling, 460-463, 486, 496
  Bayesian, 512-514, 515, 518, 520
  block, 396-408, 427, 428, 433
  calibration, 246
  case resampling, 84
  consistency, 37-39
  conditional, 84, 124, 132, 351, 374, 474
  discreteness, 27, 61
  double, 103-113, 122, 125, 130, 175-180, 223-230, 254, 373, 463-466, 469, 486, 497, 507-509
    theory for, 105-107, 125
  generalized, 56
  hierarchical, 100-102, 125, 130, 288
  imputation, 89-92, 124-125
  jittered, 124
  mirror-match, 93, 125, 129
  model-based, 349, 433, 434
  nested, see double
  nonparametric, 22
  parametric, 15-21, 261, 333, 334, 339, 344, 347, 373, 378, 379, 416, 528, 534
  population, 94, 125, 129
  post-blackened, 397, 399, 433
  post-simulation balance, 441-445, 486, 488, 495
  quantile, 18-21, 36, 69, 441, 442, 448-450, 453-456, 457, 463, 468, 490
  recycling, 463-466, 486, 492, 496
  robustness, 264
  shrunk smoothed, 79, 81, 127
  simulation size, 17-21, 34-37, 69, 155-156, 178-180, 183, 185, 202, 226, 246, 248
  smoothed, 79-81, 124, 127, 168, 169, 310, 418, 431, 531
  spectral, 412-415, 427, 430
  stationary, 398-408, 427, 428-429, 433
  stratified, 89, 90, 306, 340, 344, 365, 371, 457, 494, 531
  superpopulation, 94, 125, 129
  symmetrized, 78, 122, 169, 471, 485
  tile, 424-426, 428, 432
  tilted, 166-167, 452-456, 459, 462, 546-547
  weighted, 60, 514, 516
  weird, 86-87, 124, 128, 132
  wild, 272-273, 316, 319, 538
bootstrap diagnostics, 113-120, 125
  bias function, 108, 110, 464-465
  jackknife-after-bootstrap, 113-118, 532
  linearity, 118-120
  variance function, 107-111, 464-465
bootstrap frequencies, 22-23, 66, 76, 110-111, 438-445, 464, 526, 527
bootstrap likelihood, 507-509, 515, 517, 518
bootstrap recycling, 463-466, 487, 492, 497, 508
bootstrap test, see significance test
Box-Cox transformation, 118
brambles data example, 422
Breslow estimator, 350
calcium uptake data example, 355, 441, 442
capability index example, 248, 253, 497
carbon monoxide data, 67
cats data example, 321
caveolae data example, 416, 425
CD4 data example, 68, 134, 190, 251, 252, 254
cement data example, 277
censboot, 532, 541
censored data, 82-87, 124, 128, 131, 160, 346-353, 514, 532, 541
changepoint model, 241
Channing House data example, 131
choice of estimator, 120-123, 125, 134
choice of predictor, 301-305
choice of test statistic, 173, 180, 184, 187
circular data, 126, 517, 520
city population data example, 6, 13, 22, 30, 49, 52, 53, 54, 66, 95, 108, 110, 113, 118, 201, 238, 249, 440, 447, 464, 473, 490, 492, 513
Claridge data example, 157, 158, 496
cloth data example, 382
coal-mining data example, 435
collinearity, 276-278
complementary set partitions, 552, 554
complete enumeration, 27, 60, 438, 440, 486
conditional inference, 43, 138, 145, 238-243, 247, 251
confidence band, 375, 417, 420, 435
confidence interval
  ABC, 214-220, 231, 246, 511, 536
  BCa, 203-213, 246, 249, 336-337, 383, 536
  basic bootstrap, 28-29, 194-195, 199, 213-214, 337, 365, 374, 383, 435
  coefficient, 191
  comparison of methods, 211-214, 230-233, 246, 336-338
  conditional, 238-243, 247, 251
  double bootstrap, 223-230, 250, 254, 374, 469
  normal approximation, 14, 194, 198, 337, 374, 383, 435
  percentile method, 202-203, 213-214, 336-337, 352, 383
  profile likelihood, 196, 346
  studentized bootstrap, 29-31, 95, 125, 194-196, 199, 212, 227-228, 231, 233, 246, 248, 250, 336-337, 391, 449, 454, 483-485
  test inversion, 220-223, 246
confidence limits, 193
confidence region, 192, 231-237, 504-506
consistency, 13
contingency table, 177, 183, 184, 342
control, 545
control methods, 446-450, 486
  bias estimation, 446-448, 496
  efficiency, 447, 448, 450, 462
  importance resampling weight, 456
  linear approximation, 446, 486, 495
  quantile estimation, 446-450, 461-463, 486, 495-496
  saddlepoint approximation, 449
  variance estimation, 446-448, 495
Cornish-Fisher expansion, 40, 211, 449
correlation estimate, 48, 61, 63, 68, 69, 80, 90-92, 108, 115-116, 134, 138, 157, 158, 247, 251, 254, 266, 475, 493
correlogram, 386, 389
  partial, 386, 389
coverage process, 428
cross-validation, 153, 292-295, 296-301, 303, 305-307, 316, 320, 321, 324, 360-361, 365, 377, 381
  K-fold, 294-295, 316, 320, 324, 360-361, 381
cumulant-generating function, 66, 466, 467, 472, 479, 551-553
  approximate, 476-478, 482, 492
  paired comparison test, 492
cumulants, 551-553
  approximate, 476
  generalized, 552
cumulative hazard function, 82, 83, 86, 350
Darwin data example, 186, 188, 471, 481, 498
defensive mixture distribution, 457-459, 462, 464, 486, 496
delivery suite data example, 300
delta method, 45-46, 195, 227, 233, 419, 432, see also nonparametric delta method
density estimate, see kernel density estimate
deviance, 330-331, 332, 335, 367-369, 370, 373, 378, 382
deviance residuals, see regression residuals
diagnostics, see bootstrap diagnostics
difference of means, see average, comparison of two
directional data, 126, 234, 505, 515, 517, 520
dirty data, 44
discreteness effects, 26-27, 61
dispersion parameter, 327, 328, 331, 339, see also overdispersion
distribution
  F, 331, 368
  t, 81, 331, 484
  Bernoulli, 376, 378, 381, 474, 475
  beta, 187, 248, 377
  beta-binomial, 338, 377
  binomial, 86, 128, 327, 333, 338, 377
  bivariate normal, 63, 80, 91, 108, 128
  Cauchy, 42, 81
  chi-squared, 139, 142, 163, 233, 234, 237, 303, 330, 335, 368, 373, 378, 382, 484, 500, 501, 503, 504, 505, 506
  defensive mixture, see defensive mixture distribution
  Dirichlet, 513, 518
  double exponential, 516
  empirical, see empirical distribution function, empirical exponential family
  exponential, 4, 81, 82, 130, 132, 176, 188, 197, 203, 205, 224, 249, 328, 334, 336, 430, 491, 503, 521
  exponential family, 504-507, 516
  gamma, 5, 131, 149, 207, 216, 230, 233, 247, 328, 332, 334, 376, 503, 512, 513, 521
  geometric, 398, 428
  hypergeometric, 444, 487
  least-favourable, 206, 209
  lognormal, 66, 149, 336
  multinomial, 66, 111, 129, 443, 446, 452, 468, 473, 491, 492, 493, 501, 502, 517, 519
  multivariate normal, 445, 552
  negative binomial, 337, 344, 345, 371
  normal, 10, 150, 152, 154, 208, 244, 327, 485, 488, 489, 518, 551
  Poisson, 327, 332, 333, 337, 342, 344, 345, 370, 378, 382, 383, 416, 419, 431, 473, 474, 493, 516
  posterior, see posterior distribution
  prior, see prior distribution
  slash, 485
  tilted, see exponential tilting
  Weibull, 346, 379
dogs data example, 187
Downs' syndrome data example, 371
ducks data example, 134
Edgeworth expansion, 39-41, 60, 408, 476-477
EEF.profile, 550
eigenvalue example, 64, 134, 252, 278, 445, 447
eigenvector, 505
EL.profile, 550
empinf, 530
empirical Bayes, 125
empirical distribution function, 11-12, 60-61, 128, 501, 508
  as model, 108
  marginal, 267
  missing data, 89-91
  residuals, 77, 181, 261
  several, 71, 75
  smoothed, 79-81, 127, 169, 227, 228
  symmetrized, 78, 122, 165, 169, 228, 251
  tilted, 166-167, 183, 209-210, 452-456, 459, 504
empirical exponential family likelihood, 504-506, 515, 516, 520
empirical influence values, 46-47, 49, 51-53, 54, 63, 64, 65, 75, 209, 210, 452, 461, 462, 476, 517
  generalized linear models, 376
  linear regression, 260, 275, 317
  numerical approximation of, 47, 51-53, 76
  several samples, 75, 127, 210
  see also influence values
empirical likelihood, 500-504, 509, 512, 514-515, 516, 517, 519, 520
empirical likelihood ratio statistic, 501, 503, 506, 515
envelope test, see graphical test
equal marginal distributions example, 78
error rate, 137, 153, 174, 175
estimating function, 50, 63, 105, 250, 318, 329, 470-471, 478, 483, 504, 505, 514, 516
excess error, 292, 296
exchangeability, 143, 145
expansion
  Cornish-Fisher, 40, 211, 449
  cubic, 475-478
  Edgeworth, 39-41, 60, 411, 476-478, 487
  linear, 47, 51, 69, 75, 76, 118, 443, 446, 468
  notation, 39
  quadratic, 50, 66, 76, 443
  Taylor series, 45, 46
experimental design
  relation to resampling, 58, 439, 486
exponential mean example, 15, 17, 19, 30, 61, 176, 250, 510
exponential quantile plot, 5, 188
exponential tilting, 166-167, 183, 209-210, 452-454, 456-458, 461-463, 492, 495, 504, 517, 535, 546, 547
exp.tilt, 535
factorial experiment, 320, 322
finite population sampling, 92-100, 125, 128, 129, 130, 474
fir seedlings data, 142
Fisher information, 193, 206, 349, 516
Fourier frequencies, 387
Fourier transform, 387
  empirical, 388, 408, 430
  fast, 388
  inverse, 387
frequency array, 23, 52, 443
frequency smoothing, 110, 456, 462, 463, 464-465, 496, 508
Frets' heads data example, 115, 447
gamma model, 5, 25, 62, 131, 149, 207, 216, 233, 247, 376
generalized additive model, 366-371, 375, 382, 383
generalized likelihood ratio, 139
generalized linear model, 327-346, 368, 369, 374, 376-377, 378, 381-384, 516
  comparison of resampling schemes for, 336-338
graphical test, 150-154, 183, 188, 416, 422
gravity data example, 72, 121, 131, 150, 152, 154, 162, 163, 166, 171, 172, 454, 457, 494, 519
Greenwood's formula, 83, 128
half-sample methods, 57-59, 125
handedness data example, 157, 158, 496
hat matrix, 258, 275, 278, 318, 330
hazard function, 82, 146-147, 221-222, 350
heads data example, see Frets' heads data example
heart disease data example, 378
heteroscedasticity, 259-260, 264, 269, 270-271, 307, 318, 319, 323, 341, 363, 365
hierarchical data, 100-102, 125, 130, 251-253, 287-289, 374
Huber M-estimate, see robust M-estimate example
hypergeometric distribution, 487
hypothesis test, see significance test
implied likelihood, 511-512, 515, 518
imp.moments, 546
importance resampling, 450-466, 486, 491, 497
  balanced, 460-463
    algorithm, 491
    efficiency, 461, 462
  efficiency, 452, 458, 461, 462, 486
  improved estimators, 456-460
  iterated bootstrap confidence intervals, 486
  quantile estimation, 453-456, 457, 495
  ratio estimator, 456, 459, 464, 486, 490
  raw estimator, 459, 464, 486
  regression, 486
  regression estimator, 457, 459, 464, 486, 491
  tail probability estimate, 452, 455
  time series, 486
  weights, 451, 455, 456-457, 458, 464
importance sampling, 450-452, 489
  efficiency, 452, 456, 459, 460, 462, 489
  identity, 116, 451, 463
  misapplication, 453
  quantile estimate, 489, 490
  ratio estimator, 490
  raw estimator, 451
  regression estimator, 491
  tail probability estimate, 453
  weight, 451
imp.prob, 546
imp.quantile, 546
imputation, 88, 90
imp.weights, 546
incomplete data, 43-44, 88-92
index notation, 551-553
infinitesimal jackknife, see nonparametric delta method
influence functions, 46-50, 60, 63-64
  chain rule, 48
  correlation, 48
  covariance, 316, 319
  eigenvalue, 64
  estimating equation, 50, 63
  least squares estimates, 260, 317
  M-estimation, 318
  mean, 47, 316
  moments, 48, 63
  multiple samples, 74-76, 126
  quantile, 48
  ratio of means, 49, 65, 126
  regression, 260, 317, 319
  studentized statistic, 63
  trimmed mean, 64
  two-sample t statistic, 454
  variance, 64
  weighted mean, 126
information distance, 165-166
neurophysiological data example, 418, 428
Nile data example, 241
nitrofen data example, 383
nodal involvement data example, 381
nonlinear regression, 353-358
nonlinear time series, 393-396, 401, 410, 426
nonparametric delta method, 46-50, 75
  balanced bootstrap, 443-444
  cubic approximation, 475-478
  linear approximation, 47, 51, 52, 60, 69, 76, 118, 126, 127, 205, 261, 443, 454, 468, 487, 488, 490, 492
    control variate, 446
    importance resampling, 452
    tilted, 490
  quadratic approximation, 50, 79, 212, 215, 443, 487, 490
  variance approximation, 47, 50, 63, 64, 75, 76, 108, 120, 199, 260, 261, 265, 275, 312, 318, 319, 376, 477, 478, 483
nonparametric maximum likelihood, 165-166, 186, 209, 501
nonparametric regression, 362-373, 375, 382, 383
normal prediction limit, 244
normal quantile plot test, 150
notation, 9-10
nuclear power data example, 286, 298, 304, 323
null distribution, 137
null hypothesis, 136
one-way model example, 208, 276, 319, 320
outliers, 27, 307-308, 363
overdispersion, 327, 332, 338-339, 343-344, 370, 382
  test for, 142
paired comparison, see matched-pair data
parameter transformation, see transformation of statistic
partial autocorrelation, 386
partial correlation example, 115
periodogram, 387-389, 408, 430
  resampling, 412-415, 427, 430
permutation test, 141, 146, 156-160, 173, 183, 185-186, 266, 279, 422, 486, 492
  for regression slope, 266, 378
  saddlepoint approximation, 475, 487
PET film data example, 346, 379
phase scrambling, 408-412, 427, 430, 435
pivot, 29, 31, 33, 510-511, see also studentized statistic
point process, 415-426, 427-428
poisons data example, 322
Poisson process, 416-422, 425, 428, 431-432, 435
Poisson regression example, 342, 369, 378, 382, 383
posterior distribution, 499, 510, 513, 515, 520
power notation, 551-553
prediction error, 244, 375, 378
  K-fold cross-validation estimate, 293-295, 298-301, 316, 320, 324, 358-362, 381
  0.632 estimator, 298, 316, 324, 358-362, 381
  adjusted cross-validation estimate, 295, 298-301, 316, 324, 358-362
  aggregate, 290-301, 320, 321, 324, 358-362
  apparent, 292, 298-301, 320, 324, 381
  bootstrap estimate, 295-301, 316, 324, 358-362, 381
  comparison of estimators, 300-301
  cross-validation estimate, 292-293, 298-301, 320, 324, 358-362, 381
  generalized linear model, 340-346
  leave-one-out bootstrap estimate, 297, 321
  time series, 393-396, 401, 427
prediction limits, 243-245, 251, 284-289, 340-346, 369-371
prediction rule, 290, 358, 359
prior distribution, 499, 510, 513
product factorial moments, 487
product-limit estimator, 82-83, 87, 124, 128, 350, 351, 352, 515
profile likelihood, 62, 206, 248, 347, 501, 515, 519
proportional hazards model, 146, 160, 221, 350-353, 374
P-value, 137, 138, 141, 148, 158, 161, 437
  adjusted, 175-180, 183, 187
  importance sampling, 452, 454, 459
quadratic approximation, see nonparametric delta method
quantile estimator, 18-21, 48, 80, 86, 124, 253, 352
quartzite data example, 520
quasi-likelihood, 332, 344
random effects model, see hierarchical data
random walk model, 391
randomization test, 183, 492, 498
randomized block design, 489
ratio
  in finite population sampling, 98
  stratified sampling for, 98
ratio estimate
  in finite population sampling, 95
ratio example, 6, 13, 22, 30, 49, 52, 53, 54, 62, 66, 98, 108, 110, 113, 118, 126, 127, 165, 178, 186, 201, 217, 238, 249, 439, 447, 464, 473, 490, 513
recycling, see bootstrap recycling
regression
  L1, 124, 311, 312, 316, 325
  case deletion, 317, 377
  case resampling, 264-266, 269, 275, 277, 279, 312, 333, 355, 364
  convex, 372
  design, 260, 261, 263, 264, 276, 277, 305
  generalized additive, 366-371, 375, 382, 383
  generalized linear, 327-346, 374, 376, 377, 378, 381, 382, 383
  isotonic, 371
  least trimmed squares, 308, 311, 313, 325
  linear, 256-325, 434
  local, 363, 367, 375
  logistic, 141, 146, 338, 371, 376, 378, 381, 474
  loglinear, 342, 369, 383
  M-estimation, 311-313, 316, 318
  many covariates with, 275-277
  model-based resampling, 261-264, 267, 270-272, 275, 276, 279, 280, 312, 333-335, 346-351, 364-365
  multiple, 273-307
  no intercept, 263, 317
  nonconstant variance, 270-273
  nonlinear, 353-358, 375, 441, 442
  nonparametric, 362-373, 375, 427
  Poisson, 337, 342, 378, 382, 383, 473, 504, 516
  prediction, 284-301, 315, 316, 323, 324, 340-346, 369
  repeated design points in, 263
  resampling moments, 262
  residuals, see residuals
  resistant, 308
  robust, 307-314, 315, 316, 318, 325
  significance tests, 266-270, 279-284, 322, 325, 367, 371, 382, 383
  straight-line, 257-273, 308, 317, 322, 391, 449, 461, 489
  survival data, 346-353
  weighted least squares, 271-272, 278-279, 329
regression estimate
  in finite population sampling, 95
remission data example, 378
repeated measures, see hierarchical data
resampling, see bootstrap
residuals
  deviance, 332, 333, 334, 345, 376
  in multiple imputations, 89-91
  inhomogeneous, 338-340, 344
  linear predictor, 331, 333, 376
  modified, 77, 259, 270, 272, 275, 279, 312, 318, 331, 355, 365
  nonlinear regression, 355, 375
  nonstandard, 349
  raw, 258, 275, 278, 317, 319
  Pearson, 331, 333, 334, 342, 370, 376, 382
  standardized, 259, 331, 332, 333, 376
  time series, 390, 392
returns data example, 269, 272, 449, 461
Richardson extrapolation, 487, 494
Rio Negro data example, 388, 398, 403, 410, 427
robustness, 3, 14, 264, 318
robust M-estimate example, 471, 483
robust regression example, 308, 309, 313, 318, 325
rock data example, 281, 287
rough statistics, 41-43
saddle, 547
saddle.distn, 547
saddlepoint approximation, 466-485, 486, 487, 492, 493, 498, 508, 509, 517, 547
  accuracy, 467, 477, 487
  conditional, 472-475, 487, 493
  density function, 467, 470
  distribution function, 467, 468, 470, 486-487
  double, 473-475
  equation, 467, 473, 479
  estimating function, 470-472
  integration approach, 478-485
  linear statistic for, 468-469, 517
  Lugannani-Rice formula, 467
  marginal, 473, 475-485, 487, 493
  permutation distribution, 475, 486, 487
  quantile estimate, 449, 468, 480, 483
  randomization distribution, 492, 498
salinity data example, 309, 311, 324
sample average, see average
sample maximum example, 39, 56, 247
sample median, 41, 61, 65, 80, 121, 181, 518
sample variance, 61, 62, 64, 104, 208, 432, 488
sampling
  stratified, see stratified sampling
  without replacement, 92
sampling fraction, 92-93
sandwich variance estimate, 63, 275, 318, 376
second-order accuracy, 39-41, 211-214, 246
semiparametric model, 77-78, 123
sensitivity analysis, 113
separate families example, 148
sequential spatial inhibition process, 425
several samples, 71-76, 123, 126, 127, 130, 131, 133, 163, 208, 210-211, 217-220, 253
shrinkage estimate, 102, 130
significance probability, see P-value
significance test
  adaptive, 173-174, 184, 187, 188
  conditional, 138, 173-174
  confidence interval, 220-223
  critical region, 137
  double bootstrap, 175-180, 183, 186, 187
  error rate, 137, 175-176
  generalized linear regression, 330-331, 367-369, 378, 382
  graphical, 150-154, 188, 416, 422, 428
  linear regression, 266-270, 279-284, 317, 322, 392
  Monte Carlo, 140-147, 151-154
  multiple, 174-175, 184
  nonparametric bootstrap, 161-175, 267-270
  nonparametric regression, 367, 371, 382, 383
  parametric bootstrap, 148-149
  permutation, 141, 146, 156-160, 173, 183, 185, 188, 266, 317, 378, 475, 486
  pivot, 138-139, 268-269, 280, 284, 392, 454, 486
  power, 155-156, 180-184
  P-value, 137, 138, 141, 148, 158, 161, 175-176
  randomization, 183, 185, 186, 492, 498
  separate families, 148, 378
  sequential, 182
  spatial data, 416, 421, 422, 428
  studentized, see pivot
  time series, 392, 396, 403, 410
simulated data example, 306
simulation error, 34-37, 62
simulation outlier, 73
simulation size, 17-21, 34-37, 69, 155-156, 178-180, 183, 185, 202, 226, 246, 248
size of test, 137
smooth.f, 533
smooth estimates of F, 79-81
spatial association example, 421, 428
spatial clustering example, 416
spatial data, 124, 416, 421-426, 428
spatial epidemiology, 421, 428
species abundance example, 169, 228
spectral density estimation example, 413
spectral resampling, see periodogram resampling
spectrum, 387, 408
spherical data example, 126, 234, 505
spline smoother, 352, 364, 365, 367, 368, 371, 468
standardized residuals, see residuals, standardized
stationarity, 385-387, 391, 398, 416
statistical error, 31-34
statistical function, 12-14, 46, 60, 75
Stirling's approximation, 61, 155
straight-line regression, see regression, straight-line
stratified resampling, 71, 89, 90, 306, 340, 344, 365, 371, 457, 494
stratified sampling, 97-100
Strauss process, 417, 425
studentized statistic, 29, 53, 119, 139, 171-173, 223, 249, 268, 280-281, 284, 286, 313, 315, 324, 325, 326, 330, 477, 481, 483, 513
subsampling, 55-59
  balanced, 125
  in model selection, 303-304
  spatial, 424-426
sugar cane data example, 338
summation convention, 552
sunspot data example, 393, 401, 435
survival data
  nonparametric, 82-87, 124, 128, 131, 132, 350-353, 374-375
  parametric, 346-350, 379
survival probability, 86, 132, 160, 352, 515
survival proportion data example, 308, 322
survivor function, 82, 160, 350, 351, 352, 455
symmetric distribution example, 78, 169, 228, 251, 470, 471, 485
tau particle data example, 133, 495
test, see significance test
tile resampling, 424-426, 427, 428, 432
tilt.boot, 547
tilted distribution, see exponential tilting
time series, 385-415, 426-427, 428-431, 514
  econometric, 427
  nonlinear, 396, 410, 426
toroidal shifts, 423
traffic data, 253
training set, 292
transformation of statistic
  empirical, 112, 113, 118, 125, 201
  for confidence interval, 195, 200, 233
  linearizing, 118-120
  variance stabilizing, 32, 63, 108, 109, 111-113, 125, 195, 201, 227, 246, 252, 394, 419, 432
trend test in time series example, 403, 410
trimmed average example, 64, 121, 130, 133, 189
tsboot, 544
tuna data example, 169, 228, 469
two-way model example, 338, 342, 369
two-way table, 177, 184
unimodality test, 168, 169, 189
unit root test, 391, 427
urine data example, 359
variable selection, 301-307, 316, 375
variance approximations, see nonparametric delta method
variance estimate, see sample variance
variance function, 327-330, 332, 336, 337, 338, 339, 344, 367
  estimation of, 107-113, 465
variance stabilization, 32, 63, 108, 109, 111-113, 125, 195, 201, 227, 246, 419, 432
variation of properties of T, 107-113
var.linear, 530
weighted average example, 72, 126, 131
weighted least squares, 270-272, 278-279, 329-330, 377
white noise, 386
Wilcoxon test, 181
wool prices data example, 391