
SIAM Review, Vol. 21, No. 4, October 1979, pp. 460-480.
© 1979 Society for Industrial and Applied Mathematics.

COMPUTERS AND THE THEORY OF STATISTICS:
THINKING THE UNTHINKABLE*

BRADLEY EFRON†

Abstract. This is a survey article concerning recent advances in certain areas of statistical theory, written for a mathematical audience with no background in statistics. The topics are chosen to illustrate a special point: how the advent of the high-speed computer has affected the development of statistical theory. The topics discussed include nonparametric methods, the jackknife, the bootstrap, cross-validation, error-rate estimation in discriminant analysis, robust estimation, the influence function, censored data, the EM algorithm, and Cox's likelihood function. The exposition is mainly by example, with only a little offered in the way of theoretical development.
1. Introduction. The editors have been kind enough to invite a survey article concerning what's new in the theory of statistics. Any answer to this question must be either incomplete or bewildering to the reader. Here I have tried to be incomplete, selecting my topics to illustrate a special point: how the advent of the high-speed computer has affected the theoretical structure of statistics.

Statistics concerns the comparison of sets of numbers: with each other, with theoretical models, and with past experience. The prototypical scientific question, "Is method A better than method B?," may boil down to the statistical question, "Is set of numbers A bigger than set of numbers B?" If, for example (see footnote 1), A = {94, 197, 16, 38, 99, 141, 23} and B = {52, 104, 146, 10, 50, 31, 40, 27, 46}, how can we precisely phrase such a question, in particular the crucial concept of "bigger," and answer it in a scientifically meaningful way? The statistician's standard answer, before 1950, would have been

1) Compute the t-statistic, which is the difference between the average of the set A numbers and the average of the set B numbers, divided by a certain quadratic function of all 16 numbers. (The divisor scales the difference between the two averages so that a single table can be used at step 2 below.)

2) Compare the observed value of t with its theoretical distribution calculated under the assumption that all 16 numbers were independently drawn from the same normal ("Gaussian") distribution. This theoretical distribution is published in a standard t-table.

3) Decide that set A is really bigger than set B, and not just accidentally bigger, if the observed value of t is in the upper 5% of the theoretical distribution.

The most obvious defect of this procedure is the use of normal distribution theory to determine the critical value at which the observed t becomes "significant." Nonparametric statistics, mainly developed since 1950, gives an answer that does not depend upon normal theory:

1) Combine all 16 numbers into one set C = {94, 197, · · · , 46}, and consider all 11,440 ways (= 16!/(7! 9!)) of partitioning C into two sets "a" and "b," a having 7 members and b having 9 members.

2) For each such partition compute the difference between the average of the set a numbers and the average of the set b numbers, say x̄a − x̄b. There are 11,440 such differences, one of which is the difference x̄A − x̄B corresponding to the data actually observed.
* Received by the editors June 28, 1978, and in revised version December 14, 1978. The preparation of this invited manuscript was supported by the U.S. Army Research Office under Contract DAAG29-79-C-0014.
† Department of Statistics, Stanford University, Stanford, California 94305.
1 These numbers are cell counts, in thousands, from an experiment involving 16 mice. The 7 mice in set A received an inoculation expected to increase the cell count. The 9 mice in set B did not receive an inoculation.


3) Decide that set A really is bigger than set B if x̄A − x̄B is in the upper 5% of the 11,440 x̄a − x̄b values.

The nonparametric method pays a stiff computational price for its freedom from normal distribution theory. There is no "significance table," corresponding to the t-table, with which one can compare the observed value of x̄A − x̄B. Essentially, such a table must be constructed anew for each set of data (footnote 2). On the other hand, more than just freedom from normality assumptions is gained. If a different table has to be constructed for each data set, the statistician may very well choose to table something other than the difference of the averages (which was chosen in the first place because of theoretical properties peculiar to the normal distribution). The recipe for a nonparametric test given above works just as well for the difference of the medians as for the difference of the averages. Or the statistician may first make a nonlinear transformation on each of the 16 numbers, say y = g(x), and compare ȳA − ȳB with the tabled values ȳa − ȳb. Or he may try several different transformations, and several different measures of difference between the two sets of numbers, going through the nonparametric recipe each time, in an attempt to understand how robust the perceived difference between A and B is to changes in the statistical procedure.
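For readers who want to see the recipe in executable form, here is a brief sketch, not part of the original article, that carries out steps 1)-3) on the mouse data of footnote 1. Python is used for this and the later illustrative sketches.

```python
from itertools import combinations

# The two samples from Section 1 (cell counts, in thousands, for 16 mice).
A = [94, 197, 16, 38, 99, 141, 23]
B = [52, 104, 146, 10, 50, 31, 40, 27, 46]

C = A + B
observed = sum(A) / len(A) - sum(B) / len(B)

# Enumerate all 11,440 = 16!/(7! 9!) ways of splitting C into a 7-member
# set "a" and a 9-member set "b", recording the difference of averages.
diffs = []
for idx in combinations(range(len(C)), len(A)):
    a = [C[i] for i in idx]
    b = [C[i] for i in range(len(C)) if i not in idx]
    diffs.append(sum(a) / len(a) - sum(b) / len(b))

# Set A is declared "really bigger" if the observed difference falls in the
# upper 5% of the 11,440 tabled values.
upper_tail = sum(d >= observed for d in diffs) / len(diffs)
print(f"observed difference = {observed:.2f}")
print(f"proportion of partitions with difference >= observed: {upper_tail:.3f}")
```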
The "unthinkable:"mentionedin the titleis simplythe thoughtthatone mightbe
willingto perform500,000 numericaloperationsin the analysisof 16 data points.Or
one mightbe willingto performa billion operationsto analyze 500 numbers.Such
statementswould have seemed insane thirtyyears ago, when a slow and noisy fifty
pound desk calculatorwhichadded, subtracted,multiplied,and dividedwas the most
sophisticatedcomputationalaid available to most scientists.Most of the statistical
theoryin common use was developed under the constraintof slow and expensive
computation.Now computationis fastand cheap. It is notsurprising thatnewtheoryis
being developed, which takes advantage of the high-speedcomputer.This paper
consistsofseveralexamplesofsuchtheory,presented,hopefully, ina manneraccessible
to nonstatisticians.
The set of examples presented here in no way exhausts the range of interesting current work in statistics, not even within the limited context of this article. Some notable omissions include the design of experiments, computer graphics and descriptive statistics ("data analysis"), time series and stochastic processes, Bayes and empirical Bayes methods, Stein estimation and ridge regression, analysis of categorical data, and Monte Carlo methods (footnote 3). Also unmentioned is the vigorous development of numerical analysis methods appropriate to large statistical analyses, see for example Golub and Styan [11], which could easily occupy an article of equal length.
This paper is intended for nonstatisticians, and in order to make it easily readable most of the examples involve artificially small data sets. This belies an important effect of the computer upon statistical thinking. Statistical problems have gotten much bigger, in raw size, during the past 30 years as scientists, emboldened by the data handling capabilities of the computer, have collected larger and larger data sets. It is not unusual these days to work with sets of a million or more numbers, sometimes fitting models which involve thousands of parameters. Even the most timeworn statistical technique, such as the standard linear model, takes on qualitatively new aspects when applied under these circumstances. A brief discussion of this point can be found in § 8 of Efron [9].

2 Shortcuts and approximations are possible, the simplest of which results in using exactly the standard t method described first! R. A. Fisher, the principal figure in the development of normal theory methods, advocated what we have called the nonparametric approach as early as 1935, but most of the theoretical development took place after 1950.

3 A referee points out that Monte Carlo allows one to go much farther in studying standard statistical methods, such as the t test, under nonstandard (i.e. nonnormal) conditions. This is another way in which the computer impacts on statistical theory.
The exposition proceeds by a series of examples, with only an occasional hint of the deeper theoretical questions lurking behind the methods. The references have been chosen for readability as well as importance, and are recommended to readers with some statistical background who wish to pursue these subjects further.
2. The jackknife. The jackknife (footnote 4), introduced by Quenouille and Tukey in the late 1950's, is an intriguing attempt to solve an important statistical problem: having computed an estimate of some quantity of interest, say a mean or a probability or a correlation, what accuracy can be attached to the estimate? Accuracy here refers to the "± something" which often accompanies statistical estimates. The usual ± quantities are based on normal distribution theory, or occasionally some other parametric theory, while the jackknife is a nonparametric technique which makes no such assumptions. Miller [18] gives an excellent review of the subject. Here the explanation will be given in terms of a simple example.

Table 1 refers to the 1973 entering classes of 15 American law schools. For each school two numbers are given,

xᵢ = average LSAT score of entering students in law school i,
yᵢ = average GPA of entering students in law school i,
TABLE 1
The average LSAT score and undergraduate GPA at 15 American law schools, entering classes of 1973.

School #     1      2      3      4      5      6      7      8
LSAT       576    635    558    578    666    580    555    661
GPA       3.39   3.30   2.81   3.03   3.44   3.07   3.00   3.43

School #     9     10     11     12     13     14     15
LSAT       651    605    653    575    545    572    594
GPA       3.36   3.13   3.12   2.74   2.76   2.88   2.96

i = 1, 2, · · · , 15. (The LSAT is a national test, similar to the Graduate Record Exam, while GPA refers to undergraduate grade point average.) These data are abstracted from Rubin [20]. The data are plotted in Fig. 1.

The correlation coefficient is a measure of association between two sets of numbers, or, in its abstract form, between two infinitely large sets of numbers, usually thought of as two related probability distributions. By definition, the correlation coefficient between the n pairs of numbers (xᵢ, yᵢ), i = 1, 2, · · · , n, is

(2.1)  ρ̂ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / [Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)²]^{1/2},   x̄ = Σᵢ xᵢ/n,   ȳ = Σᵢ yᵢ/n.

Because of the Cauchy-Schwarz inequality it is always true that −1 ≤ ρ̂ ≤ 1. The case ρ̂ = 1 occurs when the (xᵢ, yᵢ) pairs lie on a single straight line with positive slope, while ρ̂ = −1 indicates a perfect straight line relationship with negative slope.

4 The name "jackknife," coined by Tukey, is meant to convey the notion of a rough and ready tool, useful in a wide variety of situations.


[Figure omitted: scatterplot of the 15 schools, average GPA (2.70 to 3.50) versus average LSAT (540 to 670).]

FIG. 1. A plot of the law school data given in Table 1.

Figure 1 shows that while the law school data do not go to either of these extremes, they are "positively correlated," i.e. closer to ρ̂ = 1 than ρ̂ = −1. The actual value is ρ̂ = .776, which in most sociological studies would be taken to indicate a strongly positive relationship between the two variables. In plain language, higher LSAT usually goes with higher GPA, and vice versa.

We wish to know how accurate is the estimate ρ̂ = .776. In asking this question we assume that there is a true correlation ρ which ρ̂ is attempting to measure, and which ρ̂ would approach if the number of data pairs was increased from n = 15 toward n = ∞. The most commonly used measure of accuracy is the standard deviation,

(2.2)  σ = √E[(ρ̂ − ρ)²],

the root mean square difference of ρ̂, based on n = 15 pairs, from ρ. Calling (2.2) the standard deviation assumes that ρ̂ is unbiased for ρ, that is Eρ̂ = ρ. This isn't exactly true, but the bias is small enough to be ignored in the law school example, for the sake of simplified presentation. The jackknife theory actually includes a bias correction method which won't be discussed here.
The jackknife estimate of σ, say σ̂⁽ᴶ⁾, is obtained by the following procedure:

1) Delete pair (xᵢ, yᵢ) from the data set and recompute the correlation coefficient for the remaining 14 pairs. Call this recomputed value ρ̂₍ᵢ₎, i = 1, 2, · · · , n = 15.

2) Estimate σ by (footnote 5)

(2.3)  σ̂⁽ᴶ⁾ = [((n − 1)/n) Σᵢ (ρ̂₍ᵢ₎ − ρ̂)²]^{1/2}.

(It is usual to replace ρ̂ by Σᵢ ρ̂₍ᵢ₎/n in (2.3), again for reasons of bias correction, but the difference in the estimate σ̂⁽ᴶ⁾ is less than .01% in our example.)

5 Suppose that instead of the correlation coefficient, we wish to estimate the standard deviation of the mean x̄ of n numbers x₁, x₂, · · · , xₙ. The jackknife procedure, applied to this situation, gives the usual estimate [Σᵢ (xᵢ − x̄)²/(n(n − 1))]^{1/2}. The factor (n − 1)/n in (2.3) is included in order to make the jackknife give this, the "right" answer, for the standard deviation of x̄.


Table 2 displays the values of ρ̂₍ᵢ₎ − ρ̂ for the law school data. The jackknife estimate of accuracy is

σ̂⁽ᴶ⁾ = √.0203 = .143.

Notice that we have had to do about n times as much computation to get σ̂⁽ᴶ⁾ as to get the estimate ρ̂ itself.

TABLE 2
The values of ρ̂₍ᵢ₎ − ρ̂ for the law school data.

i             1      2      3      4      5      6      7      8
ρ̂₍ᵢ₎ − ρ̂    .116  −.013  −.021  −.000  −.045   .004   .008   .040

i             9     10     11     12     13     14     15
ρ̂₍ᵢ₎ − ρ̂   −.025  −.000   .042   .009  −.036  −.009   .003
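The jackknife recipe is easy to program. The following sketch, not part of the original article, recomputes ρ̂₍ᵢ₎ for the 15 law school pairs of Table 1 and evaluates (2.3); it should reproduce σ̂⁽ᴶ⁾ = .143 up to rounding.

```python
import math

# Law school data from Table 1: (LSAT, GPA) for the 15 schools.
lsat = [576, 635, 558, 578, 666, 580, 555, 661, 651, 605, 653, 575, 545, 572, 594]
gpa  = [3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43, 3.36, 3.13, 3.12, 2.74, 2.76, 2.88, 2.96]

def corr(x, y):
    """Sample correlation coefficient, definition (2.1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

n = len(lsat)
rho_hat = corr(lsat, gpa)

# Step 1: delete each pair in turn and recompute the correlation.
rho_i = [corr(lsat[:i] + lsat[i+1:], gpa[:i] + gpa[i+1:]) for i in range(n)]

# Step 2: jackknife estimate of standard deviation, equation (2.3).
sigma_J = math.sqrt((n - 1) / n * sum((r - rho_hat) ** 2 for r in rho_i))

print(f"rho_hat = {rho_hat:.3f}, jackknife sigma = {sigma_J:.3f}")
```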

The statistician might now report ρ̂ = .776 ± .143. This means that his best guess of the unknown true value ρ is ρ̂ = .776, with an expected root mean square error of .143 for ρ̂ − ρ. If ρ̂ − ρ has roughly a normal distribution, which for large amounts of data will always be the case, then the accuracy statement can also be interpreted as

(2.4)  Prob{ρ ∈ [.776 − .143, .776 + .143]} ≈ .68.

(Statement (2.4) is based upon the fact that a normal distribution puts 68% of its probability within one standard deviation of the mean.) Interval statements of accuracy like (2.4) have more intuitive appeal than root mean square error.

How good is the estimate σ̂⁽ᴶ⁾? We could, if we wanted to, jackknife the entire procedure which computed σ̂⁽ᴶ⁾, that is do a second order jackknife, to estimate a standard deviation of σ̂⁽ᴶ⁾. (This would require about n² times as many calculations as for ρ̂.) Instead, we will compare σ̂⁽ᴶ⁾ with the traditional normal-theory estimate of ρ̂'s standard deviation. In the next section we will calculate the standard deviation in another way which clarifies the connection between the two answers.
Suppose the n = 15 pairs (xᵢ, yᵢ) are actually drawn from a bivariate normal distribution with correlation coefficient ρ. Then the exact density function of ρ̂ can be calculated theoretically. This density function depends only upon ρ, not on the means or standard deviations of x and y, and so can be denoted f_ρ(ρ̂); by definition ∫ₐᵇ f_ρ(ρ̂) dρ̂ = Prob{a ≤ ρ̂ ≤ b}. Figure 2 shows f_ρ(·) for ρ = .776, the observed value in the law school sample. It is denoted f_ρ(ρ̂*) to preserve the definition of ρ̂ as the observed value; ρ̂* is just a convenient name for the dummy variable in f_ρ(·). The abscissa is plotted in terms of ρ̂* − ρ to emphasize the deviations of ρ̂* from ρ.

We see that the density function is not exactly normal, having a longer tail to the left than to the right, and also is not centered exactly at 0, i.e. at ρ̂* = ρ, having instead median value .011. (The normality can be dramatically improved by making Fisher's tanh⁻¹ transformation; see Cramér [5, p. 399].) The traditional normal theory estimate of σ can be described, at the expense of a slight oversimplification, in terms similar to (2.4): look at the central 68% of the distribution described by f_ρ(·), that is the interval from the 16th percentile to the 84th percentile. Half of the length of this interval is a reasonable definition of the normal-theory estimate of σ, say σ̂⁽ᴺ⁾. For ρ = .776 this gives σ̂⁽ᴺ⁾ = .113. For large values of n this definition of σ̂⁽ᴺ⁾ agrees with (2.2), but in small samples it is more meaningful, being less affected by occasional wild values of the random quantity whose accuracy we are trying to describe.


[Figure omitted: the normal theory density function f_ρ(ρ̂*), plotted against ρ̂* − ρ, with the 16th, 50th and 84th percentiles marked; half the distance between the 16th and 84th percentiles gives σ̂⁽ᴺ⁾ = .113.]

FIG. 2. The normal theory density function f_ρ(ρ̂*) of the observed correlation coefficient ρ̂* for 15 data pairs (xᵢ, yᵢ) drawn from a bivariate normal distribution with true correlation ρ = .776. The distribution puts 68% of its probability in the interval ρ̂* ∈ [ρ − .126, ρ + .099].

The calculations of the next section suggest an answer somewhat closer to σ̂⁽ᴺ⁾ = .113 than to σ̂⁽ᴶ⁾ = .143. One bad feature of σ̂⁽ᴶ⁾ can be spotted in Table 2. The first value, ρ̂₍₁₎ − ρ̂ = .116, accounts for two-thirds of the sum of squares in (2.3). Any estimate that depends so heavily on a single datum is prone to instability, as we discuss in § 5.

Figure 1 shows why ρ̂₍₁₎ − ρ̂ is so large. Data point 1 is far away from the other 14, so that its removal causes a large change in the estimated correlation coefficient. This notion is formalized in § 5 under the name "influence function," and furnishes a theoretic rationale for the jackknife estimate of accuracy. In addition to Miller [18], another good reference on the justification and use of the jackknife is Mosteller and Tukey [19].
3. Bootstrap methods. We consider another method, called the "bootstrap" in Efron [8], of assigning an accuracy to the estimated correlation ρ̂ = .776 for the law school data (footnote 6):

1) Let F̂ be the empirical distribution of the 15 observed data points, i.e. the probability distribution which puts mass 1/15 at each observed point (xᵢ, yᵢ).

2) Use a random number generator to draw 15 new points (xᵢ*, yᵢ*) independently and with replacement from F̂, so that each new point is an independent random selection of one of the 15 original data points. These new points, which we will call the "bootstrap sample," are a subset of the original points plotted in Fig. 1. Some of the original points will have been selected zero times, some once, some twice, etc.

3) Compute ρ̂*, the correlation coefficient for the bootstrap sample.

4) Repeat steps (2) and (3) a large number of times, say N times, each time using an independent set of new random numbers to generate the new bootstrap sample. Call the resulting sequence of bootstrap correlation coefficients ρ̂*¹, ρ̂*², · · · , ρ̂*ᴺ.

6 The name "bootstrap" is meant to be euphonic with "jackknife," the two methods being closely related as we shall see, and also to convey the self-help nature of the bootstrap algorithm.


5) Let [a*, b*] be the central 68% interval for the ρ̂* values, i.e.

#{ρ̂*ⁱ < a*}/N = .16,   #{ρ̂*ⁱ < b*}/N = .84.

Define the bootstrap estimate of the standard deviation σ, say σ̂⁽ᴮ⁾, to be half the length of this interval,

σ̂⁽ᴮ⁾ = (b* − a*)/2.
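Steps 1)-5) are equally short to program. The sketch below is ours, not the paper's; the random seed and the percentile interpolation are implementation details, and the resulting value will differ slightly from the .127 of Fig. 3 because the random numbers differ.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded only to make the illustration reproducible

# Law school data from Table 1.
lsat = np.array([576, 635, 558, 578, 666, 580, 555, 661, 651, 605, 653, 575, 545, 572, 594])
gpa  = np.array([3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43, 3.36, 3.13, 3.12, 2.74, 2.76, 2.88, 2.96])

n, N = len(lsat), 1000
rhos = np.empty(N)
for k in range(N):
    # Step 2: draw 15 pairs with replacement from the empirical distribution F-hat.
    idx = rng.integers(0, n, size=n)
    # Step 3: correlation coefficient of the bootstrap sample.
    rhos[k] = np.corrcoef(lsat[idx], gpa[idx])[0, 1]

# Step 5: central 68% interval of the bootstrap values; half its length is sigma-hat(B).
a_star, b_star = np.percentile(rhos, [16, 84])
print(f"bootstrap sigma = {(b_star - a_star) / 2:.3f}")
```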

Figure 3 shows the results of N = 1000 bootstrap replications. The histogram of the 1000 values ρ̂*¹ − ρ̂, ρ̂*² − ρ̂, · · · , ρ̂*ᴺ − ρ̂ is plotted, and it is seen that σ̂⁽ᴮ⁾ = .127. The similarity of the histogram to the normal-theory density function of ρ̂* − ρ̂, reproduced from Fig. 2, is apparent, the main difference being an excess of bootstrap values for ρ̂* − ρ̂ > .15 (coming from a deficit in the range 0 to .10). This excess pulls the 84% point of ρ̂* − ρ̂ up to .132, compared with the normal-theory value of .099, and is the reason σ̂⁽ᴮ⁾ = .127 is larger than the normal-theory estimate σ̂⁽ᴺ⁾ = .113, though it is still considerably smaller than σ̂⁽ᴶ⁾ = .143, the jackknife estimate.

[Figure omitted: histogram of the 1000 bootstrap values ρ̂* − ρ̂, with the normal theory density from Fig. 2 superimposed and the 16%, 50% and 84% percentiles marked.]

FIG. 3. Histogram, 1000 bootstrap replications of ρ̂* − ρ̂, gives a bootstrap estimate of accuracy σ̂⁽ᴮ⁾ = .127 for the correlation coefficient ρ̂ = .776 of the law school data. The normal theory density of ρ̂* − ρ̂, from Fig. 2, has a similar shape, but falls off more quickly at higher values of ρ̂* − ρ̂.

What we have called ρ before, the true correlation, might better be called ρ(F), where F is the true probability distribution giving rise to the data pairs (xᵢ, yᵢ). The notation ρ = ρ(F) emphasizes that the correlation coefficient is a functional, mapping any bivariate probability distribution into a real number in the interval [−1, 1]. Definition (2.1) can be written ρ̂ = ρ(F̂), where F̂ is the empirical probability distribution introduced at step 1 of the bootstrap procedure.

The rationale underlying the bootstrap procedure is simple: i) We want an estimate of the accuracy of ρ̂; ii) We would like to use σ(F), where σ(·) is some agreed upon functional which measures accuracy, such as (2.2), σ(F) = [E(ρ(F̂) − ρ(F))²]^{1/2}. (Notice that σ(F) depends only upon F since the expectation operator E averages over the possible F̂'s arising from a random sample of 15 independent pairs from F.) iii) We don't know F, so instead we estimate σ̂ = σ(F̂). In other words, we use the same basic method to estimate σ as to estimate ρ itself: a simple substitution of F̂ for the unknown true distribution F.


Instead of root mean square error, we have been employing a different functional to measure accuracy,

(3.1)  σ(F) = half the length of the central 68% of the probability distribution, under F, of ρ(F̂) − ρ(F).

Why one might prefer (3.1) to (2.2) is discussed in § 5, though the real reason here has been the ease of graphical presentation.
The empirical distribution F̂ is a crude estimate of F. Why not use a better estimate of F, say F⁺, and estimate the accuracy by σ̂⁺ = σ(F⁺)? That is exactly what we have done in obtaining the normal theory estimate σ̂⁽ᴺ⁾. The better estimate of F is F⁺ equal to a bivariate normal distribution whose correlation coefficient is the observed value ρ̂ = .776. (The means and variances of F⁺ are also set equal to the observed sample values.) In this sense σ̂⁽ᴺ⁾ is itself a bootstrap estimate, the only difference being the use of a better F̂ at step 1 (footnote 7). "Better," of course, may really be worse if the assumption that the true F is bivariate normal is wrong. It is reassuring to see the agreement between σ̂⁽ᴺ⁾ and σ̂⁽ᴮ⁾, since the latter makes no special assumptions about the form of F.

It is interesting to try a compromise between F̂, the empirical distribution, and F⁺, the best fitting normal distribution. Let F̂_c be the probability distribution of a random point v = (x, y) obtained as follows: take independent points v′ = (x′, y′) and v″ = (x″, y″) from F̂ and F⁺ respectively, and let v = √(1 − c²) v′ + c v″. Then F̂₀ = F̂, F̂₁ = F⁺, but for intermediate values of c we get a blend of the discrete distribution F̂ and the continuous normal distribution F⁺, which may more nearly approximate our actual beliefs about the form of the true F.
The choice of N = 1000 as the numberof bootstrapreplicationscan be shown,in
thepresentcase, to determineC(B) to an accuracyof about 2.5%. This meansthatifN
were increasedfrom1000 towardinfinity, thelimiting value of CJ(B) wouldbe expected
to differ from.127 byless than2.5%. Vastlymorebootstrapreplicationsmightresultin
'A
,A(B)=O
oCB= .130 or .125, but almostcertainlynot B = .120 or .135. We could have gotten
by withN = 250 replications,givingan expectedaccuracyof 5%, but N = 1000 is not
foolishlyexcessive.This impressiveexpenditureof computingpower, 1000 timesthat
forthe originalcalculationof ',Adoesn't includethe 1000 smoothedbootstraprepli-
cationsof Fig. 4. Of course,all the calculationstogetheronlytook a fewseconds and
cost perhaps $10, but, to reiteratethe obvious, they would have been practically
impossible 30 years ago. Bootstrap-likeprocedures have undergone very little
theoretical development since they have been computationallypractical for a

7Steps 2 through5 of the bootstrapprocedure are done theoretically,ratherthan by computer


simulation,in the normal-theory calculation.The bivariatenormalmodel is virtuallyunique in yieldingan
analyticallysimpledistributionforp. This getsback to our mainpoint,theeffectof thecomputeron whatis
considereda feasiblestatisticalprocedure.


[Figure omitted: histogram of 1000 smoothed bootstrap values ρ̂* − ρ̂, with the normal theory density superimposed and the 16%, 50% and 84% percentiles marked.]

FIG. 4. Histogram, 1000 bootstrap replications of ρ̂* − ρ̂, using the smoothed sampling distribution F̂_c, c = 1/√5, described in the text. The histogram follows the normal-theory density more closely than in Fig. 3, but σ̂⁽ᴮ⁾ = .125, almost the same value as for the unsmoothed bootstrap.

There is an interesting theoretical connection between the jackknife and the bootstrap (footnote 8). Considering now just one bootstrap replication, let Pᵢ* be the proportion of the bootstrap sample equal to the original data pair (xᵢ, yᵢ). For example, if (x₅, y₅) is included three times in the bootstrap sample, of size n = 15, then P₅* = 3/15 = .20. The vector P* = (P₁*, P₂*, · · · , Pₙ*) determines ρ̂* − ρ̂, so we can write, say, ρ̂* − ρ̂ = g(P*), where g(·) is a known function. (To be specific, g(P*) = [Σᵢ Pᵢ*(xᵢ − x̄*)(yᵢ − ȳ*)] / [Σᵢ Pᵢ*(xᵢ − x̄*)² · Σᵢ Pᵢ*(yᵢ − ȳ*)²]^{1/2} − ρ̂, where x̄* = Σᵢ Pᵢ* xᵢ, ȳ* = Σᵢ Pᵢ* yᵢ. Notice that the data (xᵢ, yᵢ), i = 1, 2, · · · , 15, are considered fixed in this definition.)

The statistics of the vector P* are completely known from the properties of the multinomial distribution. For example, P* has expected value (1/n, 1/n, · · · , 1/n), n = 15, and covariance matrix with ijth element

(3.2)  Covariance(Pᵢ*, Pⱼ*) = 1/n² − 1/n³,   i = j;
                            = −1/n³,         i ≠ j.

Expanding g(·) in a Taylor series around (1/n, 1/n, · · · , 1/n) gives

(3.3)  ρ̂* − ρ̂ = Σᵢ₌₁ⁿ g⁽ⁱ⁾ (Pᵢ* − 1/n) + higher order terms,

where g⁽ⁱ⁾ is the partial derivative of g(·) with respect to Pᵢ*, evaluated at P* = (1/n, · · · , 1/n). Together, (3.2) and (3.3) suggest an easy approximation to the standard deviation of ρ̂* − ρ̂ under bootstrap sampling, namely

(3.4)  σ̂⁽ᴮ⁾ ≈ [(1/n²) Σᵢ (g⁽ⁱ⁾)²]^{1/2}.

8 The remainder of this section assumes some knowledge of statistical theory, though the general drift of the argument still should be discernible to nonstatisticians.


This is almost exactly the jackknife estimate σ̂⁽ᴶ⁾, the main difference being the substitution in (2.3) of finite differences (footnote 9) in place of the derivatives g⁽ⁱ⁾ appearing in (3.4). Jaeckel [15] originally suggested the right side of (3.4) as an accuracy approximation, calling it the "infinitesimal jackknife"; see also Efron [8].

9 It is not hard to show that (n − 1)(ρ̂ − ρ̂₍ᵢ₎) approximates g⁽ⁱ⁾; see § 5 of Efron [9].
4. Cross-validation. In its original form, cross-validation referred to the following simple, but useful, idea: given a large class of possible models to fit to a set of data, for example linear regression models in which the choice of predictor variables is open to question, first randomly divide the data into two halves. Then fit a model to the first half of the data, using any fitting method at all, and see how well the fitted model predicts the second half of the data. This last step, which is the cross-validation, protects the statistician against an overly optimistic assessment of goodness-of-fit.

Recently many authors, in particular Stone [21] and Geisser [10], have proposed direct use of cross-validation for the selection of appropriate models. This approach is computer intensive, but potentially much broader in application than the familiar linear model approach. We illustrate the method with an example taken from Wahba and Wold [23].
Figure 5 shows 100 artificially generated data points, created according to the following model: the point (xᵢ, yᵢ) with abscissa xᵢ has ordinate yᵢ randomly determined by

(4.1)  yᵢ = μ(xᵢ) + εᵢ,   i = 1, 2, · · · , 100,

where

(4.2)  μ(x) = 4.26(e⁻ˣ − 4e⁻²ˣ + 3e⁻³ˣ)

and the εᵢ are independent normal random variables with mean 0 and standard deviation σ = 0.2. The xᵢ values are equally spaced from 0 to 3.10. The function μ(x), which in a real application would be unknown to the statistician, is shown as the dashed curve in Fig. 5.

[Figure omitted: scatterplot of the 100 generated points over 0 ≤ x ≤ 3.00, with the dashed true curve μ(x) and the solid cross-validated fit.]

FIG. 5. 100 random points generated according to model (4.1), (4.2). The true mean function μ(x) is indicated by the dashed curve. The solid curve was obtained from the data points by the cross-validation method.

Wahba and Wold consider fitting a class of curves η(x, α) to these data. For a particular choice of the nonnegative parameter α, η(x, α) is by definition the curve η(x) minimizing

(4.3)  (1/100) Σᵢ₌₁¹⁰⁰ [yᵢ − η(xᵢ)]²

subject to the constraint

(4.4)  ∫₀³·¹²⁵ [η″(x)]² dx = α.

Constraint (4.4) is a smoothness condition: if we take α = 0, η(x, 0) is very smooth indeed, being the ordinary least squares straight line for the data in Fig. 5. It is easy to see that this gives a very poor fit in the present case. At the opposite extreme, if we let α get large enough, η(x, α) will go through every data point. This fits the data perfectly, but is far too irregular a curve to be of any use for prediction or analysis. Intermediate values of α give cubic spline functions, with a trade-off between smoothness (4.4) and fit (4.3).

Cross-validation proposes to estimate the best value of α, without any prior knowledge of the generating mechanism for the data. "Best" here means the value of α minimizing

(4.5)  (1/100) Σᵢ₌₁¹⁰⁰ [μ(xᵢ) − η(xᵢ, α)]²,

in other words, the curve η(x, α) closest to the true mean function μ(x). Another way to state criterion (4.5) is to imagine that a new set of data, say (xᵢ, yᵢ*), i = 1, 2, · · · , 100, has been independently generated according to model (4.1), (4.2). How well will a curve η(x, α) fitted to the original data predict this new data set, in the sense of minimizing (1/100) Σᵢ₌₁¹⁰⁰ [yᵢ* − η(xᵢ, α)]²? The expected error of prediction, with η(x, α) fixed, is

(4.6)  E (1/100) Σᵢ₌₁¹⁰⁰ [yᵢ* − η(xᵢ, α)]² = σ² + (1/100) Σᵢ₌₁¹⁰⁰ [μ(xᵢ) − η(xᵢ, α)]².

Since σ² = (0.2)² is a fixed number, minimizing (4.5) is equivalent to minimizing the expected squared error of prediction (4.6).

If the new data set (xᵢ, yᵢ*) were actually available we could easily select α: for each α, the curve η(x, α) is determined from the original data set, by (4.3), (4.4), and then tested on the new data set by computing Q*(α) = (1/100) Σᵢ₌₁¹⁰⁰ [yᵢ* − η(xᵢ, α)]². The α which minimized Q*(α) would be the estimated best α.

Cross-validation does almost the same thing, without requiring any new data. For each choice of i, i = 1, 2, · · · , 100, let η₍ᵢ₎(x, α) be that curve η(x) satisfying constraint (4.4), and minimizing

(4.7)  (1/99) Σⱼ₌₁, ⱼ≠ᵢ¹⁰⁰ [yⱼ − η(xⱼ)]².

In other words, η₍ᵢ₎(x, α) is the solution to the constrained minimization problem (4.3), (4.4) with point (xᵢ, yᵢ) removed from the data set. We then define

(4.8)  Q†(α) = (1/100) Σᵢ₌₁¹⁰⁰ [yᵢ − η₍ᵢ₎(xᵢ, α)]²,

and select as "best" the α minimizing Q†(α), say α†. The curve η(x, α†) is the proposed estimate for μ(x).
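The constrained spline fit of (4.3)-(4.4) needs specialized code; to keep the illustration self-contained, the sketch below applies the same leave-one-out idea to a simpler family, using polynomial degree in place of the smoothing parameter α. This substitution is ours, not Wahba and Wold's; only the cross-validation logic of (4.7)-(4.8) is being illustrated, on data generated from model (4.1)-(4.2).

```python
import numpy as np

rng = np.random.default_rng(0)

# Generate 100 points from model (4.1)-(4.2).
x = np.linspace(0.0, 3.10, 100)
mu = 4.26 * (np.exp(-x) - 4 * np.exp(-2 * x) + 3 * np.exp(-3 * x))
y = mu + rng.normal(0.0, 0.2, size=100)

def loo_score(degree):
    """Q-dagger of (4.8): leave out each point, refit, and score the prediction."""
    errs = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        coeffs = np.polyfit(x[keep], y[keep], degree)   # fit on the other 99 points
        errs.append((y[i] - np.polyval(coeffs, x[i])) ** 2)
    return np.mean(errs)

degrees = range(1, 9)
scores = [loo_score(d) for d in degrees]
best = degrees[int(np.argmin(scores))]   # the analogue of alpha-dagger
print("cross-validated scores:", [round(s, 4) for s in scores])
print("selected degree:", best)
```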


The solid curve in Fig. 5 shows η(x, α†) in Wahba and Wold's example. The fit is obviously quite good, and Q†(α†), if it were presented, would give a good estimate of the expected prediction error (4.6) for η(x, α†). Of course we have had to do about 100 times as much work to compute the curve η(x, α) for any given α. (Wahba and Wold actually omit points 10 at a time, instead of one at a time, and so reduce the computational effort by a factor of 10.)

Cross-validation resembles the jackknife in that data points are removed one at a time in both procedures, but the underlying connection between the two methods is still not clear to statistical researchers. The next example shows a situation where either cross-validation or the bootstrap can be applied, but the latter is quite a bit more effective. This isn't intended to disparage cross-validation, but rather to suggest that further research may lead to powerful combinations of cross-validation and jackknife-bootstrap methods.
Figure 6 shows 20 artificially generated random points, 10 from each of two populations. The underlying x population is bivariate normal with mean vector (−2, 0)′ and covariance matrix the identity. The y population differs in having mean vector (2, 0)′. By definition, the linear discriminant boundary is the straight line

(4.9)  {z : (ȳ − x̄)′ S⁻¹ (z − (x̄ + ȳ)/2) = 0},

where x̄ and ȳ are the two mean vectors, x̄ = Σᵢ xᵢ/10, ȳ = Σᵢ yᵢ/10, and S is the pooled 2 × 2 sample covariance matrix [Σᵢ (xᵢ − x̄)(xᵢ − x̄)′ + Σᵢ (yᵢ − ȳ)(yᵢ − ȳ)′]/18. The linear discriminant boundary divides the plane into two regions, A and B, the intention being to classify an unlabeled future point z as being either an x or a y depending on whether it falls into A or B. (The optimum division line for future classification is actually {z = (z₁, z₂) : z₁ = 0}, but of course the statistician wouldn't know that in a real situation. Notice that the linear discriminant boundary is calculated from the observed data, and doesn't require knowledge of the underlying probability mechanisms. Definition (4.9) is motivated by an attempt to estimate the optimum division line, which is in fact the line obtained from (4.9) when x̄, ȳ and S are replaced by the true mean vectors and covariance matrix of the two normal populations. Using a linear boundary tacitly assumes that the covariance matrix is the same for both populations.)

[Figure omitted: scatterplot of the ten x points and ten y points, with the linear discriminant boundary separating Region A from Region B.]

FIG. 6. Ten x points independently generated from a bivariate normal population with mean vector (−2, 0)′, and ten y points independently generated from a bivariate normal population with mean vector (2, 0)′. (Covariance matrix is the identity in both groups.) The straight line is the linear discriminant boundary.
The probability that a future x random point will be misclassified is

error_x ≡ Prob{x ∈ B},

which happens to equal 0.41 for the situation in Fig. 6. In this definition, B is considered fixed as shown, and the random quantity is the hypothetical future x point. The obvious estimate of error_x is

êrror_x = #{xᵢ ∈ B}/10,

which equals 0.30 in Fig. 6. It is well known that êrror_x tends to underestimate error_x, that is to have an optimistic bias, and an important problem is to estimate the expected bias,

(4.10)  bias_x ≡ E{error_x − êrror_x}.

The corresponding quantity for the y population is equally important of course, but it is sufficient to discuss estimating bias_x.

Cross-validation estimates bias_x by i) successively eliminating each point xᵢ, i = 1, 2, · · · , 10; ii) recomputing the linear discriminant boundary on the basis of the nine remaining x's and 10 y's; and iii) seeing whether or not xᵢ is misclassified by the recomputed discrimination rule. Let êrror_x† be the proportion of the x points misclassified at step iii). Then the cross-validated estimate of bias is

(4.11)  bias_x† = êrror_x† − êrror_x.

In the situation of Fig. 6, bias_x† = 0.10, which means that 4 out of 10 x values were misclassified during the cross-validation process.
The bootstrap estimate of bias_x takes considerably more computation:

1) Select a bootstrap sample of 10 new x points, x₁*, x₂*, · · · , x₁₀*, by random sampling, independently and with replacement, from the given points x₁, x₂, · · · , x₁₀. Likewise, construct a bootstrap sample of 10 new y points y₁*, y₂*, · · · , y₁₀* by random sampling from y₁, y₂, · · · , y₁₀.

2) Construct the bootstrap linear discriminant boundary by substituting x̄*, ȳ*, S* for x̄, ȳ, S in (4.9). Denote the bootstrap discriminant regions as A*, B*.

3) Let

(4.12)  b* = #{xᵢ ∈ B*}/10 − #{xⱼ* ∈ B*}/10.

4) Repeat steps 1)-3) a large number N of times, obtaining independent values b*¹, b*², · · · , b*ᴺ, and estimate bias_x by

(4.13)  bias_x* = (1/N) Σᵢ₌₁ᴺ b*ⁱ.
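A computational sketch of steps 1)-4) follows. The article's 20 training points are not tabulated, so the sketch (ours) generates stand-in points from the same two normal populations; the discriminant rule, the apparent error rate, and the bootstrap bias estimate then follow the recipe directly.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated stand-ins for the 20 training points of Fig. 6 (10 from each population).
x_train = rng.normal(size=(10, 2)) + np.array([-2.0, 0.0])
y_train = rng.normal(size=(10, 2)) + np.array([2.0, 0.0])

def in_region_B(x_pts, y_pts, z):
    """True where z falls on the y side of the linear discriminant boundary (4.9)."""
    xbar, ybar = x_pts.mean(axis=0), y_pts.mean(axis=0)
    dx, dy = x_pts - xbar, y_pts - ybar
    S = (dx.T @ dx + dy.T @ dy) / 18          # pooled 2x2 sample covariance matrix
    w = np.linalg.solve(S, ybar - xbar)
    return (z - (xbar + ybar) / 2) @ w > 0

# Apparent error rate for the x population: proportion of training x's misclassified.
apparent = in_region_B(x_train, y_train, x_train).mean()

# Bootstrap estimate of the bias, equations (4.12)-(4.13).
N = 100
b_star = []
for _ in range(N):
    xs = x_train[rng.integers(0, 10, size=10)]       # bootstrap sample of x's
    ys = y_train[rng.integers(0, 10, size=10)]       # bootstrap sample of y's
    true_rate = in_region_B(xs, ys, x_train).mean()      # original x's judged by bootstrap rule
    apparent_rate = in_region_B(xs, ys, xs).mean()       # bootstrap x's judged by bootstrap rule
    b_star.append(true_rate - apparent_rate)

print(f"apparent error = {apparent:.2f}, bootstrap bias estimate = {np.mean(b_star):.3f}")
```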


In the present case, N = 100 bootstrap replications gave the estimate bias_x* = 0.078. Notice that (4.12) is of the form "true minus apparent error rate," where now "true" refers to the xᵢ and "apparent" refers to the xᵢ*. The justification of the bootstrap is the same here as in § 3.

When the x and y values are generated by the underlying normal distributions described earlier, the actual value of bias_x is .062. That is, êrror_x tends to underestimate error_x by .062, on the average. In a large number of Monte Carlo trials, reported in § 4 of Efron [8], both bias_x† and bias_x* were themselves nearly unbiased; that is, they averaged about .062. However, the bias_x† values were three times more variable than the bias_x* values, which made them much less dependable for assessing bias_x in any particular case.
5. Robust estimation. A fundamental statistical tactic is the combination of separate small pieces of information, each by itself nearly worthless, to produce an overall conclusion of substantial reliability. Independent tosses of a possibly biased coin offer the classic example. No one toss tells us very much about the coin, but having observed, say, 30 heads in 100 tosses, the true probability of heads can reliably be predicted to lie in the interval .300 ± .092. Averaging, which is what is done to get the estimate .300, is a powerful way of bringing diverse information to bear on a single important question. Some of the most useful statistical methods, such as linear regression and analysis of variance, are really no more than fancy averaging techniques, designed for situations where the individual observations are collected under varying circumstances.

Suppose we threw away any one of the 100 coin flips, leaving ourselves with the data from the remaining 99. The estimated true probability of heads, call it p̂, would then equal either p̂ = 30/99 = .303 or p̂ = 29/99 = .293, depending on whether we had thrown away a head or a tail. Both .303 and .293 are quite close to .300, the point here being that no one of the individual pieces of information is by itself very important to the estimate p̂ = .300. We say that p̂ is robust in this situation, to use Tukey's memorable terminology (somewhat differently than originally intended).

Unfortunately, it is not always true that the average x̄ = Σᵢ xᵢ/n is robust in the sense above. Table 3 shows microbe counts in 69 swabs from different portions of a Mariner space probe. The average count is x̄ = 16.14, but deleting the largest count, count #69, gives average x̄₍₆₉₎ = 1.53 for the remaining 68 numbers. Deleting the largest two counts, count #69 and count #68, gives x̄₍₆₈,₆₉₎ = .63. In this case x̄ is distinctly nonrobust.

Recently statisticians have become interested in robust estimators, averaging techniques which limit the influence of any one observation on the estimate, even in situations as extreme as that of Table 3. Huber's monograph [14] gives an excellent overview of the subject. Another good reference is Hampel [12].

TABLE 3
Microbe counts in 69 swabs of a Mariner space probe. (Part of a much larger data set.) The count was zero in 53 swabs, one in 6 swabs, etc. Removing the largest count, 1010, reduces the average count from 16.14 to 1.53.

Count              0    1    3    4    5    6    9   62   1010
Number of swabs   53    6    4    1    1    1    1    1      1


The average x̄ of a set of numbers x₁, x₂, · · · , xₙ can also be derived as that number T which minimizes the sum of squared deviations, Σᵢ₌₁ⁿ (xᵢ − T)². Differentiation shows that x̄ may also be characterized as the solution to the equation (in T), Σᵢ₌₁ⁿ (xᵢ − T) = 0. By definition, an M estimator is the solution in T to the equation

(5.1)  Σᵢ₌₁ⁿ ψ(xᵢ − T) = 0.

Here ψ(·) is a preselected function, which can be chosen to give good robustness properties. If ψ(x) = x then T is the ordinary average x̄. If

ψ(x) = sign(x)

then T is the sample median, the middle value of the observations listed in increasing order. (Reversing the differentiation argument at the beginning of this paragraph shows that the median minimizes the sum of absolute deviations Σᵢ₌₁ⁿ |xᵢ − T|.) For the microbe data the median equals 0 no matter how many of the nonzero counts are removed. This is more robustness than we want in many situations!
As a compromise between ψ(x) = x and ψ(x) = sign(x) we can take

(5.2)  ψ(x) = −c for x < −c;   ψ(x) = x for −c ≤ x ≤ c;   ψ(x) = c for c < x.

Choosing c = ∞ makes T equal to the average, while c = 0 (actually, the limit as c → 0, in which case ψ(x)/c → sign(x)) gives the median. The choice c = 10 results in the estimate T = .93 for the microbe data. Removing the largest count changes the estimate to T₍₆₉₎ = .78; also removing the second largest gives T₍₆₈,₆₉₎ = .63. These values can be obtained easily on a hand calculator, using Newton-Raphson iteration or just trial and error. Doing the computation gives a good feeling for the way in which the estimator based on (5.2) acts like x̄ near the middle of the data, but automatically limits the influence of outlying observations.
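These calculations are easy to automate as well. The sketch below (ours) solves (5.1) for the ψ of (5.2) by Newton-Raphson on the Table 3 counts; with c = 10 it should reproduce T = .93 and T₍₆₉₎ = .78.

```python
import numpy as np

# Microbe counts from Table 3: 53 zeros, 6 ones, 4 threes, and one each of 4, 5, 6, 9, 62, 1010.
counts = np.array([0]*53 + [1]*6 + [3]*4 + [4, 5, 6, 9, 62, 1010], dtype=float)

def psi(x, c):
    """The compromise psi function (5.2): linear in [-c, c], clipped outside."""
    return np.clip(x, -c, c)

def m_estimate(x, c, tol=1e-8):
    """Solve sum(psi(x_i - T)) = 0 by Newton-Raphson, starting from the median."""
    T = np.median(x)
    for _ in range(100):
        r = x - T
        step = psi(r, c).sum() / max((np.abs(r) <= c).sum(), 1)  # derivative of psi is 1 inside [-c, c]
        T += step
        if abs(step) < tol:
            break
    return T

print(f"c = 10:                    T = {m_estimate(counts, 10):.2f}")
print(f"largest count removed:     T = {m_estimate(counts[:-1], 10):.2f}")
print(f"two largest counts removed: T = {m_estimate(counts[:-2], 10):.2f}")
```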
How can we choose amongst possible estimators T in any given situation? If we knew that the observations were independently generated according to some probability density function f(x − θ), with θ an unknown parameter to be estimated (a "translation family" situation), we could use the maximum likelihood estimator, i.e. the number T which maximizes Πᵢ₌₁ⁿ f(xᵢ − T). Taking logarithms and differentiating shows that the maximum likelihood estimator is an M estimator (footnote 10), with ψ function equal to

(5.3)  ψ_f(x) = −f′(x)/f(x).

For the normal translation family, with f(x − θ) = (2π)^{−1/2} exp{−½(x − θ)²}, ψ_f(x) = x and so the average x̄ is the maximum likelihood estimator. The Laplace translation family f(x − θ) = (1/2) exp{−|x − θ|} gives the median as the maximum likelihood estimator. Maximum likelihood produces nearly optimal estimates in translation families, assuming of course that the f(·) used in (5.3) is actually the correct form of the density function.

The point of much of the work in robustness theory is that the statistician may not completely trust a given parametric model, such as the normal translation family, and so may prefer to change the "optimum" ψ_f(·) to a more robust choice ψ(·).

10 The name "M estimator" comes from Maximum likelihood.


This reduces the theoretical efficiency of the estimator somewhat, compared with that of the maximum likelihood estimator, if the model f(x − θ) is correct, but protects the statistician against disastrously foolish estimates if the model is somewhat off. It can be shown that the square of the correlation between ψ_f(x) and ψ(x), calculated under density f(x), determines the large-sample efficiency of the M estimator based on ψ(·). A correlation of .90, for example, means that the M estimator based on ψ(·) wastes about 19% (.19 = 1 − .9²) of the information available for estimating θ under the model f(x − θ). It turns out that for the normal translation model, reasonable choices of c in (5.2) give efficiencies better than 95% while still providing good protection against occasional wild observations.
We have discussed the influence of a single observation on an estimate T. This notion has been formalized under the name "influence function," and provides theoretical justification for the jackknife, as well as for robust estimators. The M estimators are functionals T(F̂), as was ρ(F̂) in § 3, and can be thought of as estimating the true value T(F), where F is the true probability distribution giving rise to the data x₁, x₂, x₃, · · · . If the sample size n were increased toward infinity, T(F̂) would approach T(F).

Let δ_x represent the degenerate probability distribution putting all of its mass at the point x. The influence function T(x; F), for a given estimator T, evaluated at the true distribution F, is the function of x defined by

(5.4)  T(x; F) ≡ (d/dε) T((1 − ε)F + ε δ_x) |_{ε=0}.

The influence function represents the effect upon T(F̂) of a small local change in F. By superimposing many such small changes we obtain, via a first order Taylor series expansion, an approximation to T(F̂) − T(F), the difference between the estimated and true value of T,

(5.5)  T(F̂) ≈ T(F) + (1/n) Σᵢ₌₁ⁿ T(xᵢ; F).

For a linear functional, such as the mean T(F) = ∫ x dF(x), (5.5) is exact. (For the mean, T(x; F) = x − T(F), so (1/n) Σᵢ T(xᵢ; F) = x̄ − T(F) = T(F̂) − T(F).) Nonlinear functionals, such as the M estimator based on (5.2), are, under some regularity conditions, asymptotically linear, as n → ∞, in the sense of (5.5). The usefulness of (5.5) is that it approximates T(F̂) − T(F) by the average of independent, identically distributed random quantities T(xᵢ; F). The standard deviation of such an average is

(5.6)  (1/√n) [∫ [T(x; F)]² dF(x)]^{1/2},

1/√n times the root mean square of the influence function. The jackknife standard deviation σ̂⁽ᴶ⁾, (2.3), is the nonparametric estimate of (5.6). (The values ρ̂₍ᵢ₎ − ρ̂ are rather crude estimates of the influence function. Expression (5.6) is closely related to (3.4).)
The principle of robust estimation can now be stated more quantitatively: only use estimators T(F̂) for which the influence function is sensibly bounded. It is easy to verify that the influence function of an M estimator is proportional to ψ(x − T(F)). The form of ψ in (5.2) is nothing more than a modification of ψ for the average, ψ(x) = x, with a bound put on the magnitude of the influence function. Definition (3.1) is motivated by similar considerations.
Robustness ideas are now being applied to regression situations such as (4.1), nice references being Andrews [1] and Mosteller and Tukey [19]. If the errors εᵢ occasionally take on wild values, then fitting models by the method of least squares can go disastrously wrong. The least squares method fits regression parameters β (for example, the coefficients of the exponential terms in (4.2), if they were unknown) by minimizing Σᵢ₌₁ⁿ (yᵢ − μ_β(xᵢ))². Instead, we can minimize Σᵢ₌₁ⁿ ρ(yᵢ − μ_β(xᵢ)), where

(5.7)  ρ(y) = ∫₀ʸ ψ(y′) dy′,

with ψ as in (5.2). The limiting case, as c → 0, fits a model by minimizing Σᵢ₌₁ⁿ |yᵢ − μ_β(xᵢ)|, the sum of absolute deviations. "Least absolute deviations" was the fitting method favored by Laplace, but it lost out to Gauss' least squares, mainly on the grounds of computational simplicity. Now, 150 years later, Laplace may reclaim the field, with the assistance of the modern computer.
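As a small illustration of fitting by minimizing Σ ρ(yᵢ − μ_β(xᵢ)), the sketch below (ours, with made-up straight-line data containing one gross error) compares a least squares fit with the robust fit based on (5.2) and (5.7), using a general-purpose optimizer.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# A small regression example with one wild observation.
x = np.linspace(0, 10, 20)
y = 2.0 + 0.5 * x + rng.normal(0, 0.2, size=20)
y[10] += 30.0          # a single gross error

def rho(r, c=1.0):
    """rho of (5.7) for the psi of (5.2): quadratic in the middle, linear in the tails."""
    a = np.abs(r)
    return np.where(a <= c, 0.5 * r**2, c * a - 0.5 * c**2)

def fit(loss):
    """Fit intercept and slope by minimizing sum(loss(residuals))."""
    obj = lambda b: np.sum(loss(y - (b[0] + b[1] * x)))
    return minimize(obj, x0=[0.0, 0.0], method="Nelder-Mead").x

ls = fit(lambda r: r**2)   # least squares
rb = fit(rho)              # robust fit based on (5.2)/(5.7)
print(f"least squares:   intercept {ls[0]:.2f}, slope {ls[1]:.2f}")
print(f"robust (c = 1):  intercept {rb[0]:.2f}, slope {rb[1]:.2f}")
```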
6. Censored data. We have made frequent use of the empirical distribution F̂, the probability distribution which puts mass 1/n at each of n observed data points x₁, x₂, · · · , xₙ. (In §§ 3 and 4 the xᵢ were points in a two dimensional space, while in § 5 the space was one dimensional.) It may seem that there is no way to make the calculation of F̂ difficult. If so, a look at some censored data should convince the reader otherwise.

Table 4 shows some early results from the heart transplant program at Stanford. The survival times in days, following the transplant operation, are listed for 18 patients. The first listed patient survived 3 days, the second 4+ days, where the "+" indicates that the patient was still alive on April 13, 1972, the point in time at which the data was collected. Here it would be wrong to let F̂ be the distribution putting mass 1/18 at each of the numbers 3, 4, 10, 25, · · · , 1025 since, for example, the actual survival time corresponding to 4+ is known only to lie in the interval (4, ∞). This is an example of censoring, in which the exact value of a measurement can't be seen, but some information on its whereabouts is available.
Let T represent the survival time of a heart transplant patient, a quantity which we will measure in days. The survival curve S(t) is the probability of surviving past a given time t,

(6.1)  S(t) ≡ Prob{T > t}.

Knowing the function S(t) is the same as knowing F, the true probability distribution of T. If there were no censoring we could construct an estimate of S(t) in the obvious way,

(6.2)  Ŝ(t) = #{Tᵢ > t}/n,

where n = 18 in the case above. In other words, we could use the ordinary estimate F̂, of which Ŝ(t) is another representation.
Figure 7 shows how Ŝ(t) is constructed when some of the data are censored. The construction depends upon the number of patients at risk at time t,

(6.3)  n(t) ≡ number of patients neither censored nor observed to die before time t,

which is given in Table 4. In our example, n(0) = 18, n(100) = 9, n(200) = 4, etc.


TABLE 4
Survival times for 18 early heart transplant patients. Tabled is survival time, in days, following the transplant; "+" indicates that the patient was still alive on April 13, 1972, the day the data were collected. Abstracted from a larger data set in Brown and Turnbull [2]. "Number at risk" is used in the calculation of F̂.

Survival time     3    4+   10   25+   39   40+   43   54   65
Number at risk   18    17   16   15    14   13    12   11   10

Survival time   120+  136  147  157+  183+  312  546+  824  1025
Number at risk     9    8    7    6     5     4     3    2     1

[Figure omitted: the estimated survival curve Ŝ(t) over 0 ≤ t ≤ 210 days, a decreasing step function starting at 1.0.]

FIG. 7. Estimated survival curve Ŝ(t) from the data in Table 4. Open circles represent censored data points, while jumps occur at uncensored observations. At each uncensored data point, i.e. at each observed death, Ŝ(t) is multiplied by a factor equal to the proportion of the observable population not dying.

The definition of Ŝ(t) is recursive, starting with Ŝ(0) = S(0) = 1:

(6.4)  Ŝ(t) = Ŝ(t − 1)                            if no observed deaths on day t;
       Ŝ(t) = Ŝ(t − 1) · [n(t) − 1]/n(t)           if one observed death on day t.

In the heart transplant example Ŝ(2) = Ŝ(1) = Ŝ(0) = 1, Ŝ(3) = Ŝ(2)(17/18) = .944, Ŝ(10) = Ŝ(9)(15/16) = .944 × .938 = .885, etc. Notice that the data point 4+ figures in the denominator of Ŝ(3), but has no further effect on Ŝ(t). Kaplan and Meier [16] give a very readable account of the theory behind (6.4).
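Recursion (6.4) takes only a few lines of code. The sketch below (ours) applies it to the Table 4 survival times, stepping through the observed death times rather than day by day, which yields the same step function.

```python
# Survival times from Table 4; True marks a censored ("+") observation.
data = [(3, False), (4, True), (10, False), (25, True), (39, False), (40, True),
        (43, False), (54, False), (65, False), (120, True), (136, False), (147, False),
        (157, True), (183, True), (312, False), (546, True), (824, False), (1025, False)]

def kaplan_meier(data):
    """Product-limit estimate S-hat(t) via recursion (6.4)."""
    S, curve = 1.0, []
    at_risk = len(data)
    for t, censored in sorted(data):
        if not censored:
            # an observed death: multiply by the proportion of the risk set not dying
            S *= (at_risk - 1) / at_risk
            curve.append((t, S))
        # either way, this patient now leaves the risk set
        at_risk -= 1
    return curve

for t, s in kaplan_meier(data):
    print(f"S-hat({t}) = {s:.3f}")
```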
The construction of Ŝ(t) may seem ad hoc, but Kaplan and Meier show that it produces the maximum likelihood estimate of the unknown S(t): among all possible survival curves S(t), i.e. among all possible true distributions F, the choice S(t) = Ŝ(t) maximizes the probability of obtaining the data actually observed. Bootstrap estimates of accuracy for functionals of censored data begin with F̂ corresponding to the survival curve Ŝ(t), at step 1 of the bootstrap algorithm.

Efron [7] suggested another motivation for Ŝ(t). Suppose we start out with any estimate Ŝ⁽⁰⁾(t). Define a new estimate Ŝ⁽¹⁾(t), along the lines of (6.2),

(6.5)  Ŝ⁽¹⁾(t) = E⁽⁰⁾ #{Tᵢ > t}/n,

whereE"() indicatesan expectationtaken withrespectto the probabilitydistribution


definedby the survivalcurve S(0)(t). Taking t = 20 in Table 4, for example, the 15
patientswithsurvivaltimes>20, censoredor not,contribute15 to the #{Ti > 20}. The
patientswithsurvivaltimes3 and 10 contributezero to #{Ti > 20}. The patientwith
survivaltime4+ may or may not have Ti > 20. This patientcontributesan expected
amountto the rightside of (6.5), the expectationbeing taken underthe distribution
"S(?)(so thatS(1)(20) is between 15/18 and 16/18.
We can iterate (6.5), giving the sequence of survival curves S(0)(t), S l1(t),
S (t), Efronshows thatthissequence convergesto S(t). The usefulnessof this
iterativeconstructionof S(t) is thatit can be applied undermore difficultcensoring
conditions.The data in Table 4 consistsof observeddeaths and right-censored obser-
vations,such as 4+. In othersituationstheremay also be left-censoredand doubly
censoredobservations("the eventoccurredbeforet = 17," "the eventdid not occur
duringthe interval(12, 20)"). Turnbull[22] showed that under general censoring
conditions,theiterativeconstruction (6.5) alwaysconvergesto themaximumlikelihood
estimateof S(t).
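The self-consistency iteration (6.5) can also be coded directly. The discretization below, which evaluates S only at the observed times, is our own; for the right-censored data of Table 4 the iteration settles on the same estimate as recursion (6.4).

```python
import numpy as np

# Survival times from Table 4; True marks a censored ("+") observation.
data = [(3, False), (4, True), (10, False), (25, True), (39, False), (40, True),
        (43, False), (54, False), (65, False), (120, True), (136, False), (147, False),
        (157, True), (183, True), (312, False), (546, True), (824, False), (1025, False)]

times = np.array([t for t, _ in data], dtype=float)
censored = np.array([c for _, c in data])
grid = np.unique(times)
n = len(data)

def survival_at(S_grid, t):
    """S(t) for the current step-function estimate (constant between grid points)."""
    idx = np.searchsorted(grid, t, side="right") - 1
    return 1.0 if idx < 0 else S_grid[idx]

# Start from the naive estimate (6.2) that ignores censoring.
S = np.array([(times > t).mean() for t in grid])

for _ in range(200):
    S_new = np.empty_like(S)
    for k, t in enumerate(grid):
        total = 0.0
        for ti, ci in zip(times, censored):
            if not ci:
                total += 1.0 if ti > t else 0.0                     # death time known exactly
            elif t <= ti:
                total += 1.0                                        # censored later than t: certainly alive at t
            else:
                total += survival_at(S, t) / survival_at(S, ti)     # expected contribution under current S
        S_new[k] = total / n                                        # this is equation (6.5)
    if np.max(np.abs(S_new - S)) < 1e-10:
        S = S_new
        break
    S = S_new

for t, s in zip(grid, S):
    print(f"S({int(t)}) = {s:.3f}")
```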
Recently, Dempster, Laird, and Rubin [6] have put the relationship between (6.5) and maximum likelihood estimation into a wider context, unifying work by many earlier writers. They consider a variety of situations in which it would be easy to calculate the maximum likelihood estimator if one had the full set of data, but where for one reason or another some of the information is missing. An example, for those familiar with analysis of variance, is a two way table with a few missing observations. In such a situation they show that an iterative procedure like (6.5) always leads to the maximum likelihood estimator, and moreover does so in a monotonic manner. They call this method the "EM Algorithm," in which one first Estimates the missing data and then Maximizes as if the full data set were present.
The survival function S(t) can be expressed as

(6.6)  S(t) = Πₛ₌₁ᵗ [1 − h(s)],

where

(6.7)  h(s) ≡ Prob{T = s | T > s − 1},

the conditional probability of dying on day s given survival past day s − 1. The function h(s) is called the hazard rate. The estimate (6.4) comes from estimating the factor 1 − h(s) by

(6.8)  1 − (number of observed deaths on day s)/n(s).

Hazard rates are more convenient to work with than density functions in censored data situations, an idea we now explore further.
We have treated the patients in Table 4 as if they were identical, at least as far as the probability distribution of their survival times is concerned. In fact, there are observable differences between the patients, such as age, sex, and race, which we might wish to examine for their effect on survival time. If there were no censoring we could run an ordinary regression analysis with the observed survival times as the dependent variable. Cox [4] has suggested a regression analysis which works directly with the hazard rates, and is not affected by data censoring.

Let zᵢ represent the vector of relevant observable information, such as age, race, and sex, about patient i, coded in some fashion so that all the entries of zᵢ are numbers.


For example, a 57 year old white male might be coded (57, 0, 1) where "0" indicates white and "1" indicates male. Cox's model postulates that the hazard rate for patient i, say hᵢ(s), is of the form

(6.9)  hᵢ(s) = g(s) e^{β′zᵢ}.

Here g(s) is an overall hazard rate applying to all the patients and β is a vector of unknown coefficients, corresponding to the regression coefficients in an ordinary regression model. If β = 0 then all the patients have the same hazard rate, i.e. identical probability distributions for their survival times, but if β ≠ 0, model (6.9) says that the survival time distributions are functions of zᵢ. (The vector zᵢ can itself be a time varying function, say zᵢ(s), as long as it is always observable.)

In order to analyze this model, Cox uses an approach similar to (6.8). Let R(s) be the risk set on day s, the set of patients available for observation on that day, i.e. those who have not been previously censored nor observed to die. Given that there is one death on day s, the probability under model (6.9) that it was some particular patient in R(s) who dies, say patient iₛ, equals

(6.10)  e^{β′z_{iₛ}} / Σ_{j∈R(s)} e^{β′zⱼ}.

(Expression (6.10) is actually an approximation which becomes exact as the units in which we are measuring time become infinitesimal.)
The advantage of (6.10) is that it depends only on β and the observable vectors zᵢ, and not on the common hazard function g(s) in (6.9). This makes it easy to analyze the data for the effects of β, without any modeling of g(s) being necessary. Cox multiplies the factors (6.10) together, one from each observed death,

(6.11)  Π_{observed deaths} [ e^{β′z_{iₛ}} / Σ_{j∈R(s)} e^{β′zⱼ} ],

and treats this product as if it were an ordinary likelihood function for β. For example, the β which maximizes (6.11) is treated as a maximum likelihood estimate. This approach ignores part of the data, those days on which no deaths occur, but has been shown to give reasonably efficient estimates of β nevertheless.
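Evaluating and maximizing (6.11) is straightforward to sketch for a single covariate. The data below are simulated, since the article does not tabulate covariates for the transplant patients; the risk-set bookkeeping and the product over observed deaths follow (6.10)-(6.11).

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)

# Simulated illustration: one covariate z, hazard proportional to exp(beta * z), beta = 1.
n, beta_true = 60, 1.0
z = rng.normal(size=n)
T = rng.exponential(1.0 / np.exp(beta_true * z))   # survival times under (6.9) with g(s) constant
C = rng.exponential(2.0, size=n)                    # censoring times
time = np.minimum(T, C)
death = T <= C                                      # True where the death was observed

def neg_log_partial_likelihood(beta):
    """Negative log of Cox's product (6.11): one factor (6.10) per observed death."""
    total = 0.0
    for i in np.where(death)[0]:
        risk_set = time >= time[i]                  # patients still under observation at this death time
        total += beta * z[i] - np.log(np.sum(np.exp(beta * z[risk_set])))
    return -total

beta_hat = minimize_scalar(neg_log_partial_likelihood, bounds=(-5, 5), method="bounded").x
print(f"estimated beta = {beta_hat:.2f} (true value 1.0)")
```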
If there are many patients, a hundred or more, and the vectors zᵢ are time-varying, expression (6.11) can be quite difficult to deal with computationally, taxing even a large computer. Without a computer, the method is hopeless, except in the simplest situations. (Mantel and Haenszel [17] discuss one such situation, the two sample comparison problem of § 1, with censored data.) Cox's regression method is a good example of a statistical theory which has developed in response to the capacity of modern computational equipment.
7. Conclusion. The purpose of mathematical theory, and in fact all scientific theory, is to reduce complicated situations to simple ones. Just what a scientist means by "simple" is determined by experience, training, convention, and the limitations of human reasoning faculties. A Taylor series expansion is a classic example of this process: a given function is expressed as a sum of multiples of powers. Since we are taught a lot about sums, multiples, and powers, the explanation may be a good deal easier to understand than the function as originally stated.

The advent of the high speed computer has redefined "simple" in the mathematical sciences. For example, an optimization problem which can be reduced to a problem in linear programming is, in most instances, now considered solved, since the simplex method is so efficient in numerically solving linear programs.

The purpose of this article has been to show this same process at work in mathematical statistics. A theory which enables a scientist to understand his data with the help of a high speed computer may now be as useful as a theory which only requires a table of the exponential function, particularly if the latter theory does not exist. Computer assisted theory is no less "mathematical" than the theory of the past, it is just less constrained by the limitations of the human brain.

The need for a more flexible, realistic, and dependable statistical theory is pressing, given the mountains of data now being amassed. The prospect for success is bright, but I believe the solution is likely to lie along the lines suggested in the previous sections: a blend of traditional mathematical thinking combined with the numerical and organizational aptitude of the computer.

REFERENCES

[1] D. F. ANDREWS, A robust method for multiple linear regression, Technometrics, 16 (1974), pp. 523-531.
[2] B. W. BROWN AND B. W. TURNBULL, Survivorship analysis of heart transplant data, Department of Statistics, Stanford University, Technical Report No. 34, 1972.
[3] B. W. BROWN, B. W. TURNBULL AND M. HU, Survivorship analysis of heart transplant data, Journal of the American Statistical Association, 69 (1974), pp. 74-80.
[4] D. R. COX, Regression models and life-tables, Journal of the Royal Statistical Society Series B, 34 (1972), pp. 187-220.
[5] H. CRAMÉR, Mathematical Methods of Statistics, Princeton University Press, Princeton, NJ, 1946.
[6] A. P. DEMPSTER, N. M. LAIRD AND D. B. RUBIN, Maximum likelihood estimation from incomplete data via the EM algorithm, Journal of the Royal Statistical Society Series B, 39 (1977), pp. 1-38.
[7] B. EFRON, The two sample problem with censored data, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability IV, 1967, pp. 831-853.
[8] ———, Bootstrap methods: Another look at the jackknife, Annals of Statistics, 7 (1979), no. 1, to appear.
[9] ———, Controversies in the foundations of statistics, American Mathematical Monthly, 85 (1978), no. 4, pp. 231-246.
[10] S. GEISSER, The predictive sample reuse method with applications, Journal of the American Statistical Association, 70 (1975), pp. 320-328.
[11] G. H. GOLUB AND G. P. H. STYAN, Numerical computations for univariate linear models, Journal of Statistical Computation and Simulation, 2 (1973), pp. 253-274.
[12] F. R. HAMPEL, The influence curve and its role in robust estimation, Journal of the American Statistical Association, 69 (1974), pp. 383-393.
[13] P. J. HUBER, Robust statistics: a review, Annals of Mathematical Statistics, 43 (1972), pp. 1041-1067.
[14] ———, Robust Statistical Procedures, Society for Industrial and Applied Mathematics, Philadelphia, PA, 1977.
[15] L. JAECKEL, The infinitesimal jackknife, Bell Labs. Memorandum #MM 72-1215-11, 1972.
[16] E. L. KAPLAN AND P. MEIER, Nonparametric estimation from incomplete observations, Journal of the American Statistical Association, 53 (1958), pp. 457-481.
[17] N. MANTEL AND W. HAENSZEL, Statistical aspects of the analysis of data from retrospective studies of disease, Journal of the National Cancer Institute, 22 (1959), pp. 719-748.
[18] R. G. MILLER, The jackknife - a review, Biometrika, 61 (1974), pp. 1-17.
[19] F. MOSTELLER AND J. W. TUKEY, Data Analysis and Regression, Addison-Wesley, Reading, MA, 1977.
[20] D. B. RUBIN, Using empirical Bayes techniques in the law school validity studies, Law School Admission Council Report 78-1, 1977.
[21] M. STONE, Cross-validatory choice and assessment of statistical predictions, Journal of the Royal Statistical Society Series B, 36 (1974), pp. 111-147.
[22] B. W. TURNBULL, The empirical distribution function with arbitrarily grouped, censored, and truncated data, Journal of the Royal Statistical Society Series B, 38 (1976), pp. 290-295.
[23] G. WAHBA AND S. WOLD, A completely automatic French curve: fitting spline functions by cross-validation, Communications in Statistics, 4 (1975), pp. 1-17.
