Thinking The Unthinkable
2 Shortcuts and approximations are possible, the simplest of which results in using exactly the standard t method described first! R. A. Fisher, the principal figure in the development of normal theory methods, advocated what we have called the nonparametric approach as early as 1935, but most of the theoretical development took place after 1950.
3 A referee points out that Monte Carlo allows one to go much farther in studying standard statistical methods, such as the t test, under nonstandard (i.e. nonnormal) conditions. This is another way in which the computer impacts on statistical theory.
[Fig. 1: scatter plot of the 15 law schools (numbered 1 through 15), GPA (vertical axis, 2.70 to 3.50) versus LSAT (horizontal axis, 540 to 670).]

FIG. 1. A plot of the law school data given in Table 1.
the root mean square difference of ρ̂, based on n = 15 pairs, from ρ. Calling (2.2) the standard deviation assumes that ρ̂ is unbiased for ρ, that is Eρ̂ = ρ. This isn't exactly true.

(2.3)  $\hat{\sigma}_J = \left[ \frac{n-1}{n} \sum_{i=1}^{n} \left( \hat{\rho}_{(i)} - \hat{\rho} \right)^2 \right]^{1/2}$

(It is usual to replace ρ̂ by $\sum_{i=1}^{n} \hat{\rho}_{(i)}/n$ in (2.3), again for reasons of bias correction, but the difference in the estimate of σ̂_J is less than .01% in our example.)
Table 2 displays the values of ρ̂_(i) − ρ̂ for the law school data. The jackknife estimate of accuracy is

σ̂_J = √.0203 = .143.

Notice that we have had to do about n times as much computation to get σ̂_J as to get the estimate ρ̂ itself.
TABLE 2
The values of ρ̂_(i) − ρ̂ for the law school data.

i:            1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
ρ̂_(i) − ρ̂:  [entries not recoverable in this reproduction]
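As a concrete illustration, here is a minimal sketch in Python of the jackknife computation (2.3); the language and the code are an editorial addition, not part of the original paper. The fifteen (LSAT, GPA) pairs below are the values commonly reported for this law school example, and should be treated as illustrative.

```python
import numpy as np

# The fifteen (LSAT, GPA) pairs of the law school example (illustrative values).
lsat = np.array([576, 635, 558, 578, 666, 580, 555,
                 661, 651, 605, 653, 575, 545, 572, 594])
gpa = np.array([3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00,
                3.43, 3.36, 3.13, 3.12, 2.74, 2.76, 2.88, 2.96])
n = len(lsat)

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

rho_hat = corr(lsat, gpa)                  # the point estimate, about .776

# rho_(i): the estimate recomputed with the ith school deleted.
rho_del = np.array([corr(np.delete(lsat, i), np.delete(gpa, i))
                    for i in range(n)])

# Equation (2.3): the jackknife estimate of standard deviation.
sigma_J = np.sqrt((n - 1) / n * np.sum((rho_del - rho_hat) ** 2))
print(rho_hat, sigma_J)                    # sigma_J comes out near .14
```

The n deletions make the cost of σ̂_J about n times the cost of ρ̂ itself, exactly as the text notes.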
Define the bootstrap estimate of the standard deviation σ, say σ̂_B, to be half the length of this interval,

σ̂_B = (b* − a*)/2.
[Fig. 3: histogram of bootstrap replications of ρ̂* − ρ̂ (horizontal axis −.4 to .2), with the normal-theory density and percentiles marked; gives σ̂_B = .127.]
[Fig. 4: histogram of ρ̂* − ρ̂ (horizontal axis −.4 to .2), with the normal-theory density and the 16%, 50%, and 84% percentiles marked; gives σ̂_B = .125.]
FIG. 4. Histogram, 1000 bootstrap replications of ρ̂* − ρ̂, using the smoothed sampling distribution F̂_C, C = 1/√5, described in the text. The histogram follows the normal-theory density more closely than in Fig. 3, but σ̂_B = .125, almost the same value as for the unsmoothed bootstrap.
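A minimal sketch of the bootstrap computation just described, continuing the Python example given after Table 2 (it reuses lsat, gpa, n, corr, and rho_hat from that sketch; the 1000 replications follow the figure captions, and σ̂_B is taken as half the 16th-to-84th percentile interval, per the text):

```python
import numpy as np

rng = np.random.default_rng(0)
B = 1000                                    # bootstrap replications, as in Figs. 3-4

# Each replication resamples the n pairs with replacement and recomputes rho.
idx = rng.integers(0, n, size=(B, n))
rho_star = np.array([corr(lsat[i], gpa[i]) for i in idx])

# sigma_B: half the length of the central 68% interval of rho* - rho_hat,
# i.e. the interval from the 16th to the 84th percentile.
a_star, b_star = np.percentile(rho_star - rho_hat, [16, 84])
sigma_B = (b_star - a_star) / 2
print(sigma_B)                              # near .13 for these data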
where g^(i) is the partial derivative of g(·) with respect to the ith coordinate of P*, evaluated at P* = (1/n, ···, 1/n). Together, (3.2) and (3.3) suggest an easy approximation to the standard deviation of ρ̂* − ρ̂ under bootstrap sampling, namely

(3.4)  $\hat{\sigma}_B \approx \left[ \frac{1}{n^2} \sum_{i=1}^{n} \left( g^{(i)} \right)^2 \right]^{1/2}$.

This is almost exactly the jackknife estimate σ̂_J, the main difference being the substitution in (2.3) of finite differences in place of the derivatives g^(i) appearing in (3.4). (It is not hard to show that (n − 1)(ρ̂ − ρ̂_(i)) approximates g^(i); see §5 of Efron [9].) Jaeckel [15] originally suggested the right side of (3.4) as an accuracy approximation, calling it the "infinitesimal jackknife"; see also Efron [8].
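The right side of (3.4) can be approximated numerically. A sketch in the same law-school setup as before (it reuses lsat, gpa, and n from the earlier code; the weighted-correlation helper and the step size eps are choices of this illustration, not of the paper):

```python
import numpy as np

def weighted_corr(x, y, w):
    # Correlation of (x, y) when observation i carries probability weight w[i].
    mx, my = np.sum(w * x), np.sum(w * y)
    cxy = np.sum(w * (x - mx) * (y - my))
    return cxy / np.sqrt(np.sum(w * (x - mx) ** 2) * np.sum(w * (y - my) ** 2))

p0 = np.full(n, 1.0 / n)           # the central point P* = (1/n, ..., 1/n)
g0 = weighted_corr(lsat, gpa, p0)
eps = 1e-6

# g_i: numerical directional derivative of g at p0, toward extra mass on i.
g_i = np.empty(n)
for i in range(n):
    p = (1 - eps) * p0
    p[i] += eps
    g_i[i] = (weighted_corr(lsat, gpa, p) - g0) / eps

# Equation (3.4): the infinitesimal jackknife estimate.
sigma_IJ = np.sqrt(np.sum(g_i ** 2) / n ** 2)
print(sigma_IJ)                    # an accuracy estimate comparable to sigma_J
```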
4. Cross-validation. In its original form, cross-validation referred to the following simple, but useful, idea: given a large class of possible models to fit to a set of data, for example linear regression models in which the choice of predictor variables is open to question, first randomly divide the data into two halves. Then fit a model to the first half of the data, using any fitting method at all, and see how well the fitted model predicts the second half of the data. This last step, which is the cross-validation, protects the statistician against an overly optimistic assessment of goodness-of-fit.
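In code, the original two-half idea takes only a few lines. A minimal Python sketch (the straight-line fit and the fabricated data are arbitrary stand-ins for "any fitting method at all"):

```python
import numpy as np

rng = np.random.default_rng(1)

def half_sample_cv(x, y):
    """Fit on a random half of the data; assess prediction on the other half."""
    m = len(x)
    perm = rng.permutation(m)
    train, test = perm[: m // 2], perm[m // 2:]
    coef = np.polyfit(x[train], y[train], deg=1)      # fit a straight line
    resid = y[test] - np.polyval(coef, x[test])       # predict the held-out half
    return np.mean(resid ** 2)                        # honest prediction error

# Toy usage with fabricated data, for illustration only.
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + rng.normal(0, 1, size=50)
print(half_sample_cv(x, y))
```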
Recently many authors, in particular Stone [21] and Geisser [10], have proposed direct use of cross-validation for the selection of appropriate models. This approach is computer intensive, but potentially much broader in application than the familiar linear model approach. We illustrate the method with an example taken from Wahba and Wold [23].
Figure 5 shows 100 artificially generated data points, created according to the following model: the point (x_i, y_i) with abscissa x_i has ordinate y_i randomly determined by

(4.1)  $y_i = \mu(x_i) + \varepsilon_i, \qquad i = 1, 2, \cdots, 100,$

where

(4.2)  $\mu(x) = 4.26\left( e^{-x} - 4e^{-2x} + 3e^{-3x} \right)$

and the ε_i are independent normal random variables with mean 0 and standard deviation σ = 0.2. The x_i values are equally spaced from 0 to 3.10. The function μ(x), which in a real application would be unknown to the statistician, is shown as the dashed curve in Fig. 5.
[Fig. 5: scatter of the 100 points (vertical axis −1.20 to .80, horizontal axis .00 to 3.00), with the dashed true curve and the solid cross-validated fit.]

FIG. 5. 100 random points generated according to model (4.1), (4.2). The true mean function μ(x) is indicated by the dashed curve. The solid curve was obtained from the data points by the cross-validation method.
Wahba and Wold consider fitting a class of curves η(x, α) to these data. For a particular choice of the nonnegative parameter α, η(x, α) is by definition the curve η(x) minimizing

(4.3)  $\frac{1}{100} \sum_{i=1}^{100} \left[ y_i - \eta(x_i) \right]^2$

subject to a smoothness constraint (4.4) indexed by α. The expected prediction error of η(x, α) is

(4.6)  $E\,\frac{1}{100} \sum_{i=1}^{100} \left[ y_i^* - \eta(x_i, \alpha) \right]^2 = \sigma^2 + \frac{1}{100} \sum_{i=1}^{100} \left[ \mu(x_i) - \eta(x_i, \alpha) \right]^2,$

where y*_i denotes a new observation taken at x_i.
In other words, η_(i)(x, α) is the solution to the constrained minimization problem (4.3), (4.4) with point (x_i, y_i) removed from the data set. We then define

(4.8)  $Q^{\dagger}(\alpha) = \frac{1}{100} \sum_{i=1}^{100} \left[ y_i - \eta_{(i)}(x_i, \alpha) \right]^2,$

and select as "best" the α minimizing Q†(α), say α†. The curve η(x, α†) is the proposed estimate for μ(x).
The solid curve in Fig. 5 shows η(x, α†) in Wahba and Wold's example. The fit is obviously quite good, and Q†(α†), if it were presented, would give a good estimate of the expected prediction error (4.6) for η(x, α†). Of course we have had to do about 100 times as much work to compute the curve η(x, α) for any given α. (Wahba and Wold actually omit points 10 at a time, instead of one at a time, and so reduce the computational effort by a factor of 10.)
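A Python sketch of the leave-one-out computation (4.8). Polynomials of varying degree stand in here for Wahba and Wold's constrained smooth curves η(x, α), so the degree plays the role of α; the mean function is the one written in (4.2):

```python
import numpy as np

rng = np.random.default_rng(2)

# Data in the spirit of model (4.1)-(4.2).
x = np.linspace(0, 3.10, 100)
mu = 4.26 * (np.exp(-x) - 4 * np.exp(-2 * x) + 3 * np.exp(-3 * x))
y = mu + rng.normal(0, 0.2, size=100)

def q_dagger(deg):
    """Equation (4.8): average squared error of leave-one-out predictions."""
    errs = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        coef = np.polyfit(x[keep], y[keep], deg)   # refit without point i
        errs.append((y[i] - np.polyval(coef, x[i])) ** 2)
    return np.mean(errs)

# Select the degree minimizing Q-dagger; it plays the role of alpha-dagger.
best_deg = min(range(1, 10), key=q_dagger)
print(best_deg, q_dagger(best_deg))
```

The 100 refits per candidate degree make the factor-of-100 cost noted above concrete.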
Cross-validation resembles the jackknife in that data points are removed one at a time in both procedures, but the underlying connection between the two methods is still not clear to statistical researchers. The next example shows a situation where either cross-validation or the bootstrap can be applied, but the latter is quite a bit more effective. This isn't intended to disparage cross-validation, but rather to suggest that further research may lead to powerful combinations of cross-validation and jackknife-bootstrap methods.
Figure 6 shows 20 artificially generated random points, 10 from each of two populations. The underlying x population is bivariate normal with mean vector (−2, 0)′.

[Fig. 6: the 20 sample points, x's in Region A and y's in Region B, separated by the linear discriminant boundary.]
In the situation of Fig. 6, bias† = 0.10, which means that 4 out of 10 x values were misclassified during the cross-validation process.
The bootstrap estimate of bias takes considerably more computation:
1) Select a bootstrap sample of 10 new x points, x*_1, x*_2, ···, x*_10, by random sampling, independently and with replacement, from the given points x_1, x_2, ···, x_10. Likewise, construct a bootstrap sample of 10 new y points y*_1, y*_2, ···, y*_10 by random sampling from y_1, y_2, ···, y_10.
2) Construct the bootstrap linear discriminant boundary by substituting x̄*, ȳ*, S* for x̄, ȳ, S in (4.9). Denote the bootstrap discriminant regions as A*, B*.
3) Let

(4.12)  $b^* = \frac{\#\{x_i^* \in B^*\}}{10} - \frac{\#\{x_i \in B^*\}}{10}.$
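A Python sketch of steps 1)-3). The mean vector of the y population and the Fisher form of the discriminant rule are assumptions of this illustration, since neither the second mean vector nor equation (4.9) is reproduced above:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two training samples of 10 bivariate points (the y mean is assumed here).
x = rng.multivariate_normal([-2, 0], np.eye(2), size=10)
y = rng.multivariate_normal([2, 0], np.eye(2), size=10)

def discriminant(xs, ys):
    """A standard linear discriminant rule; f(z) < 0 places z in region B."""
    S = (np.cov(xs.T) + np.cov(ys.T)) / 2            # pooled covariance
    w = np.linalg.solve(S, xs.mean(0) - ys.mean(0))
    c = w @ (xs.mean(0) + ys.mean(0)) / 2
    return lambda z: z @ w - c

B = 200
b_star = np.empty(B)
for b in range(B):
    xs = x[rng.integers(0, 10, 10)]      # step 1: bootstrap x sample
    ys = y[rng.integers(0, 10, 10)]      #         bootstrap y sample
    f = discriminant(xs, ys)             # step 2: bootstrap regions A*, B*
    # step 3, equation (4.12): misclassification proportion of the bootstrap
    # x's minus that of the original x's, both under the bootstrap rule.
    b_star[b] = np.mean(f(xs) < 0) - np.mean(f(x) < 0)

print(b_star.mean())                     # the bootstrap estimate of bias
```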
TABLE 3
Microbe counts in 69 swabs of a Mariner space probe. (Part of a much larger data set.) The count was zero in 53 swabs, one in 6 swabs, etc. Removing the largest count, 1010, reduces the average count from 16.14 to 1.53.

Count            0   1   3   4   5   6   9   62   1010
Number of swabs  53  6   4   1   1   1   1   1    1
(5.2)  $\psi(x) = \begin{cases} -c, & x < -c, \\ x, & -c \le x \le c, \\ c, & c < x. \end{cases}$
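For concreteness, a Python sketch of the location M estimator built from (5.2), solving Σ ψ(x_i − t) = 0 by simple iteration; the cutoff c = 1.5 and the unit scale are assumptions of this sketch, and the closing example applies it to the Table 3 counts:

```python
import numpy as np

def psi(x, c):
    """Equation (5.2): the identity, clipped to [-c, c]."""
    return np.clip(x, -c, c)

def m_estimate(data, c=1.5, tol=1e-8, max_iter=1000):
    """Find t with sum(psi(x_i - t)) = 0; a robust alternative to the mean."""
    t = np.median(data)                   # sensible starting value
    for _ in range(max_iter):
        step = np.mean(psi(data - t, c))
        if abs(step) < tol:
            break
        t += step                         # psi has slope <= 1, so steps shrink
    return t

# The large count 1010 barely moves the M estimate, unlike the mean.
counts = np.array([0] * 53 + [1] * 6 + [3] * 4 + [4, 5, 6, 9, 62, 1010])
print(counts.mean(), m_estimate(counts))  # mean 16.14 vs a small robust value
```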
For a linear functional, such as the mean T(F) = ∫ x dF(x), (5.5) is exact. (For the mean, T(x; F) = x − T(F), so (1/n) Σ_{i=1}^{n} T(x_i; F) = x̄ − T(F) = T(F̂) − T(F).) Nonlinear functionals, such as the M estimator based on (5.2), are, under some regularity conditions, asymptotically linear, as n → ∞, in the sense of (5.5). The usefulness of (5.5) is that it approximates T(F̂) − T(F) by the average of independent, identically distributed random quantities T(x_i; F). The standard deviation of such an average is

$\left[ \frac{1}{n} \int T(x; F)^2 \, dF(x) \right]^{1/2}.$
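A quick numerical check of the linear case in Python: with F̂ in place of F, the influence quantities for the mean are x_i − x̄, and the standard deviation above reduces to the familiar σ̂/√n (the data vector is arbitrary):

```python
import numpy as np

data = np.array([1.2, 3.4, 0.7, 2.9, 5.1, 2.2])    # arbitrary sample
n = len(data)

# Influence values for the mean functional: T(x; F_hat) = x - mean.
infl = data - data.mean()

# Standard deviation of the average of the T(x_i; F), with F replaced by F_hat.
sd_delta = np.sqrt(np.sum(infl ** 2) / n ** 2)

# For the mean this equals the usual sigma_hat / sqrt(n) (dividing by n).
assert np.isclose(sd_delta, data.std(ddof=0) / np.sqrt(n))
print(sd_delta)
```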
TABLE 4
Survival times for 18 early heart transplant patients. Tabled is survival time, in days, following the transplant; "+" indicates that the patient was still alive on April 13, 1972, the day the data were collected. Abstracted from a larger data set in Brown and Turnbull [21]. "Number at risk" is used in the calculation of F̂.

Survival time   3   4+   10   25+  39   40+  43   54   65
Number at risk  18  17   16   15   14   13   12   11   10

Survival time   120+  136  147  157+  183+  312  546+  824  1025
Number at risk  9     8    7    6     5     4    3     2    1
[Figure: the estimated survival curve F̂(t), starting at 1.0 and declining with t.]
where

(6.7)  $h(s) = \mathrm{Prob}\{T = s \mid T > s - 1\},$

the conditional probability of dying on day s given survival past day s − 1. The function h(s) is called the hazard rate. The estimate (6.4) comes from estimating the factor 1 − h(s) by 1 − d_s/n_s, where d_s is the number of deaths observed on day s and n_s is the number of patients still at risk on day s.
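A Python sketch of this product-limit calculation applied to the Table 4 data (each listed time is distinct, so d_s = 1 at every observed death):

```python
# Table 4: survival times in days; True marks a censored ("+") entry.
times = [3, 4, 10, 25, 39, 40, 43, 54, 65,
         120, 136, 147, 157, 183, 312, 546, 824, 1025]
censored = [False, True, False, True, False, True, False, False, False,
            True, False, False, True, True, False, True, False, False]

surv = 1.0
n_risk = 18                          # "Number at risk" column of Table 4
curve = []
for t, c in zip(times, censored):
    if not c:                        # a death at time t
        surv *= 1 - 1.0 / n_risk     # multiply in the factor 1 - h_hat(t)
    curve.append((t, surv))          # estimated probability of surviving past t
    n_risk -= 1                      # one fewer patient under observation

for t, s in curve:
    print(t, round(s, 3))
```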
For example, a 57 year old white male might be coded (57, 0, 1) where "0" indicates white and "1" indicates male. Cox's model postulates that the hazard rate for patient i, say h_i(s), is of the form

(6.9)  $h_i(s) = g(s)\, e^{\beta' z_i},$

where g(s) is a hazard function common to all patients. Given that exactly one death occurs at time s, the probability that it is patient i, among the risk set R(s) of patients still under observation just before s, is

(6.10)  $\frac{e^{\beta' z_i}}{\sum_{j \in R(s)} e^{\beta' z_j}}.$

(Expression (6.10) is actually an approximation which becomes exact as the units in which we are measuring time become infinitesimal.)
The advantage of (6.10) is that it depends only on β and the observable vectors z_i, and not on the common hazard function g(s) in (6.9). This makes it easy to analyze the data for the effects of β, without any modeling of g(s) being necessary. Cox multiplies the factors (6.10) together, one from each observed death,

(6.11)  $\prod_{\text{observed deaths}} \frac{e^{\beta' z_i}}{\sum_{j \in R(s)} e^{\beta' z_j}}.$
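A Python sketch of evaluating the product (6.11) for a given β; the patients, covariates, and survival times below are hypothetical, coded (age, race, sex) as in the example above. In practice one would maximize (6.11) over β, usually through its logarithm; that step is omitted here.

```python
import numpy as np

def cox_partial_likelihood(beta, z, times, censored):
    """Product (6.11): one factor (6.10) for each observed death."""
    times = np.asarray(times, dtype=float)
    censored = np.asarray(censored, dtype=bool)
    scores = np.exp(z @ beta)               # e^{beta'z_i} for every patient
    lik = 1.0
    for i in np.argsort(times):
        if not censored[i]:
            at_risk = times >= times[i]     # risk set R(s) at this death time
            lik *= scores[i] / scores[at_risk].sum()
    return lik

# Hypothetical patients coded (age, race, sex), e.g. (57, 0, 1) as in the text.
z = np.array([[57, 0, 1], [44, 1, 0], [63, 0, 0], [51, 1, 1]], dtype=float)
times = [10, 40, 25, 5]
censored = [False, True, False, False]
print(cox_partial_likelihood(np.array([0.01, 0.1, -0.2]), z, times, censored))
```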
REFERENCES