
SIAM Review, Vol. 21, No. 4, October 1979, pp. 460-480.
© 1979 Society for Industrial and Applied Mathematics.

COMPUTERS AND THE THEORY OF STATISTICS:
THINKING THE UNTHINKABLE*

BRADLEY EFRON†

Abstract. This is a survey article concerning recent advances in certain areas of statistical theory, written for a mathematical audience with no background in statistics. The topics are chosen to illustrate a special point: how the advent of the high-speed computer has affected the development of statistical theory. The topics discussed include nonparametric methods, the jackknife, the bootstrap, cross-validation, error-rate estimation in discriminant analysis, robust estimation, the influence function, censored data, the EM algorithm, and Cox's likelihood function. The exposition is mainly by example, with only a little offered in the way of theoretical development.
1. Introduction. The editors have been kind enough to invite a survey article concerning what's new in the theory of statistics. Any answer to this question must be either incomplete or bewildering to the reader. Here I have tried to be incomplete, selecting my topics to illustrate a special point: how the advent of the high-speed computer has affected the theoretical structure of statistics.

Statistics concerns the comparison of sets of numbers: with each other, with theoretical models, and with past experience. The prototypical scientific question, "Is method A better than method B?," may boil down to the statistical question, "Is set of numbers A bigger than set of numbers B?" If, for example (see footnote 1), A = {94, 197, 16, 38, 99, 141, 23} and B = {52, 104, 146, 10, 50, 31, 40, 27, 46}, how can we precisely phrase such a question, in particular the crucial concept of "bigger," and answer it in a scientifically meaningful way? The statistician's standard answer, before 1950, would have been

1) Compute the t-statistic, which is the difference between the average of the set A numbers and the average of the set B numbers, divided by a certain quadratic function of all 16 numbers. (The divisor scales the difference between the two averages so that a single table can be used at step 2 below.)

2) Compare the observed value of t with its theoretical distribution calculated under the assumption that all 16 numbers were independently drawn from the same normal ("Gaussian") distribution. This theoretical distribution is published in a standard t-table.

3) Decide that set A is really bigger than set B, and not just accidentally bigger, if the observed value of t is in the upper 5% of the theoretical distribution.

The most obvious defect of this procedure is the use of normal distribution theory to determine the critical value at which the observed t becomes "significant." Nonparametric statistics, mainly developed since 1950, gives an answer that does not depend upon normal theory:

1) Combine all 16 numbers into one set C = {94, 197, · · · , 46}, and consider all 11,440 ways (= 16!/(7! 9!)) of partitioning C into two sets "a" and "b," a having 7 members and b having 9 members.

2) For each such partition compute the difference between the average of the set a numbers and the average of the set b numbers, say x̄a − x̄b. There are 11,440 such differences, one of which is the difference x̄A − x̄B corresponding to the data actually observed.
* Received by the editors June 28, 1978, and in revised version December 14, 1978. The preparation of this invited manuscript was supported by the U.S. Army Research Office under Contract DAAG29-79-C-0014.
† Department of Statistics, Stanford University, Stanford, California 94305.
1 These numbers are cell counts, in thousands, from an experiment involving 16 mice. The 7 mice in set A received an inoculation expected to increase the cell count. The 9 mice in set B did not receive an inoculation.


3) Decide that set A really is bigger than set B if x̄A − x̄B is in the upper 5% of the 11,440 x̄a − x̄b values.

The nonparametric method pays a stiff computational price for its freedom from normal distribution theory. There is no "significance table," corresponding to the t-table, with which one can compare the observed value of x̄A − x̄B. Essentially, such a table must be constructed anew for each set of data (footnote 2). On the other hand, more than just freedom from normality assumptions is gained. If a different table has to be constructed for each data set, the statistician may very well choose to table something other than the difference of the averages (which was chosen in the first place because of theoretical properties peculiar to the normal distribution). The recipe for a nonparametric test given above works just as well for the difference of the medians as for the difference of the averages. Or the statistician may first make a nonlinear transformation on each of the 16 numbers, say y = g(x), and compare ȳA − ȳB with the tabled values ȳa − ȳb. Or he may try several different transformations, and several different measures of difference between the two sets of numbers, going through the nonparametric recipe each time, in an attempt to understand how robust the perceived difference between A and B is to changes in the statistical procedure.
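For readers who want to see the recipe in executable form, here is a brief sketch, not part of the original article, that carries out steps 1)-3) on the mouse data of footnote 1. Python is used for this and the later illustrative sketches.

```python
from itertools import combinations

# The two samples from Section 1 (cell counts, in thousands, for 16 mice).
A = [94, 197, 16, 38, 99, 141, 23]
B = [52, 104, 146, 10, 50, 31, 40, 27, 46]

C = A + B
observed = sum(A) / len(A) - sum(B) / len(B)

# Enumerate all 11,440 = 16!/(7! 9!) ways of splitting C into a 7-member
# set "a" and a 9-member set "b", recording the difference of averages.
diffs = []
for idx in combinations(range(len(C)), len(A)):
    a = [C[i] for i in idx]
    b = [C[i] for i in range(len(C)) if i not in idx]
    diffs.append(sum(a) / len(a) - sum(b) / len(b))

# Set A is declared "really bigger" if the observed difference falls in the
# upper 5% of the 11,440 tabled values.
upper_tail = sum(d >= observed for d in diffs) / len(diffs)
print(f"observed difference = {observed:.2f}")
print(f"proportion of partitions with difference >= observed: {upper_tail:.3f}")
```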
The "unthinkable:"mentionedin the titleis simplythe thoughtthatone mightbe
willingto perform500,000 numericaloperationsin the analysisof 16 data points.Or
one mightbe willingto performa billion operationsto analyze 500 numbers.Such
statementswould have seemed insane thirtyyears ago, when a slow and noisy fifty
pound desk calculatorwhichadded, subtracted,multiplied,and dividedwas the most
sophisticatedcomputationalaid available to most scientists.Most of the statistical
theoryin common use was developed under the constraintof slow and expensive
computation.Now computationis fastand cheap. It is notsurprising thatnewtheoryis
being developed, which takes advantage of the high-speedcomputer.This paper
consistsofseveralexamplesofsuchtheory,presented,hopefully, ina manneraccessible
to nonstatisticians.
The set of examples presented here in no way exhausts the range of interesting current work in statistics, not even within the limited context of this article. Some notable omissions include the design of experiments, computer graphics and descriptive statistics ("data analysis"), time series and stochastic processes, Bayes and empirical Bayes methods, Stein estimation and ridge regression, analysis of categorical data, and Monte Carlo methods (footnote 3). Also unmentioned is the vigorous development of numerical analysis methods appropriate to large statistical analyses, see for example Golub and Styan [11], which could easily occupy an article of equal length.
This paper is intended for nonstatisticians, and in order to make it easily readable most of the examples involve artificially small data sets. This belies an important effect of the computer upon statistical thinking. Statistical problems have gotten much bigger, in raw size, during the past 30 years as scientists, emboldened by the data handling capabilities of the computer, have collected larger and larger data sets. It is not unusual these days to work with sets of a million or more numbers, sometimes fitting models which involve thousands of parameters. Even the most timeworn statistical technique, such as the standard linear model, takes on qualitatively new aspects when applied under these circumstances. A brief discussion of this point can be found in § 8 of Efron [9].

2 Shortcuts and approximations are possible, the simplest of which results in using exactly the standard t method described first! R. A. Fisher, the principal figure in the development of normal theory methods, advocated what we have called the nonparametric approach as early as 1935, but most of the theoretical development took place after 1950.

3 A referee points out that Monte Carlo allows one to go much farther in studying standard statistical methods, such as the t test, under nonstandard (i.e. nonnormal) conditions. This is another way in which the computer impacts on statistical theory.
The exposition proceeds by a series of examples, with only an occasional hint of the deeper theoretical questions lurking behind the methods. The references have been chosen for readability as well as importance, and are recommended to readers with some statistical background who wish to pursue these subjects further.
2. The jackknife. The jackknife (footnote 4), introduced by Quenouille and Tukey in the late 1950's, is an intriguing attempt to solve an important statistical problem: having computed an estimate of some quantity of interest, say a mean or a probability or a correlation, what accuracy can be attached to the estimate? Accuracy here refers to the "± something" which often accompanies statistical estimates. The usual ± quantities are based on normal distribution theory, or occasionally some other parametric theory, while the jackknife is a nonparametric technique which makes no such assumptions. Miller [18] gives an excellent review of the subject. Here the explanation will be given in terms of a simple example.

Table 1 refers to the 1973 entering classes of 15 American law schools. For each school two numbers are given,

xᵢ = average LSAT score of entering students in law school i,
yᵢ = average GPA of entering students in law school i,
TABLE 1
The average LSAT score and undergraduate GPA at 15 American law schools, entering classes of 1973.

School #     1      2      3      4      5      6      7      8
LSAT       576    635    558    578    666    580    555    661
GPA       3.39   3.30   2.81   3.03   3.44   3.07   3.00   3.43

School #     9     10     11     12     13     14     15
LSAT       651    605    653    575    545    572    594
GPA       3.36   3.13   3.12   2.74   2.76   2.88   2.96

i = 1, 2, · · · , 15. (The LSAT is a national test, similar to the Graduate Record Exam, while GPA refers to undergraduate grade point average.) These data are abstracted from Rubin [20]. The data are plotted in Fig. 1.

The correlation coefficient is a measure of association between two sets of numbers, or, in its abstract form, between two infinitely large sets of numbers, usually thought of as two related probability distributions. By definition, the correlation coefficient between the n pairs of numbers (xᵢ, yᵢ), i = 1, 2, · · · , n, is

(2.1)  ρ̂ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / [Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)²]^{1/2},   x̄ = Σᵢ xᵢ/n,   ȳ = Σᵢ yᵢ/n.

Because of the Cauchy-Schwarz inequality it is always true that −1 ≤ ρ̂ ≤ 1. The case ρ̂ = 1 occurs when the (xᵢ, yᵢ) pairs lie on a single straight line with positive slope, while ρ̂ = −1 indicates a perfect straight line relationship with negative slope.

4 The name "jackknife," coined by Tukey, is meant to convey the notion of a rough and ready tool, useful in a wide variety of situations.


[Figure omitted: scatterplot of the 15 schools, average GPA (2.70 to 3.50) versus average LSAT (540 to 670).]

FIG. 1. A plot of the law school data given in Table 1.

Figure 1 shows that while the law school data do not go to either of these extremes, they are "positively correlated," i.e. closer to ρ̂ = 1 than ρ̂ = −1. The actual value is ρ̂ = .776, which in most sociological studies would be taken to indicate a strongly positive relationship between the two variables. In plain language, higher LSAT usually goes with higher GPA, and vice versa.

We wish to know how accurate is the estimate ρ̂ = .776. In asking this question we assume that there is a true correlation ρ which ρ̂ is attempting to measure, and which ρ̂ would approach if the number of data pairs was increased from n = 15 toward n = ∞. The most commonly used measure of accuracy is the standard deviation,

(2.2)  σ = √E[(ρ̂ − ρ)²],

the root mean square difference of ρ̂, based on n = 15 pairs, from ρ. Calling (2.2) the standard deviation assumes that ρ̂ is unbiased for ρ, that is Eρ̂ = ρ. This isn't exactly true, but the bias is small enough to be ignored in the law school example, for the sake of simplified presentation. The jackknife theory actually includes a bias correction method which won't be discussed here.
The jackknife estimate of σ, say σ̂⁽ᴶ⁾, is obtained by the following procedure:

1) Delete pair (xᵢ, yᵢ) from the data set and recompute the correlation coefficient for the remaining 14 pairs. Call this recomputed value ρ̂₍ᵢ₎, i = 1, 2, · · · , n = 15.

2) Estimate σ by (footnote 5)

(2.3)  σ̂⁽ᴶ⁾ = [((n − 1)/n) Σᵢ (ρ̂₍ᵢ₎ − ρ̂)²]^{1/2}.

(It is usual to replace ρ̂ by Σᵢ ρ̂₍ᵢ₎/n in (2.3), again for reasons of bias correction, but the difference in the estimate σ̂⁽ᴶ⁾ is less than .01% in our example.)

5 Suppose that instead of the correlation coefficient, we wish to estimate the standard deviation of the mean x̄ of n numbers x₁, x₂, · · · , xₙ. The jackknife procedure, applied to this situation, gives the usual estimate [Σᵢ (xᵢ − x̄)²/(n(n − 1))]^{1/2}. The factor (n − 1)/n in (2.3) is included in order to make the jackknife give this, the "right" answer, for the standard deviation of x̄.


Table 2 displays the values of ρ̂₍ᵢ₎ − ρ̂ for the law school data. The jackknife estimate of accuracy is

σ̂⁽ᴶ⁾ = √.0203 = .143.

Notice that we have had to do about n times as much computation to get σ̂⁽ᴶ⁾ as to get the estimate ρ̂ itself.

TABLE 2
The values of ρ̂₍ᵢ₎ − ρ̂ for the law school data.

i             1      2      3      4      5      6      7      8
ρ̂₍ᵢ₎ − ρ̂    .116  −.013  −.021  −.000  −.045   .004   .008   .040

i             9     10     11     12     13     14     15
ρ̂₍ᵢ₎ − ρ̂   −.025  −.000   .042   .009  −.036  −.009   .003
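The jackknife recipe is easy to program. The following sketch, not part of the original article, recomputes ρ̂₍ᵢ₎ for the 15 law school pairs of Table 1 and evaluates (2.3); it should reproduce σ̂⁽ᴶ⁾ = .143 up to rounding.

```python
import math

# Law school data from Table 1: (LSAT, GPA) for the 15 schools.
lsat = [576, 635, 558, 578, 666, 580, 555, 661, 651, 605, 653, 575, 545, 572, 594]
gpa  = [3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43, 3.36, 3.13, 3.12, 2.74, 2.76, 2.88, 2.96]

def corr(x, y):
    """Sample correlation coefficient, definition (2.1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

n = len(lsat)
rho_hat = corr(lsat, gpa)

# Step 1: delete each pair in turn and recompute the correlation.
rho_i = [corr(lsat[:i] + lsat[i+1:], gpa[:i] + gpa[i+1:]) for i in range(n)]

# Step 2: jackknife estimate of standard deviation, equation (2.3).
sigma_J = math.sqrt((n - 1) / n * sum((r - rho_hat) ** 2 for r in rho_i))

print(f"rho_hat = {rho_hat:.3f}, jackknife sigma = {sigma_J:.3f}")
```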

The statistician might now report ρ̂ = .776 ± .143. This means that his best guess of the unknown true value ρ is ρ̂ = .776, with an expected root mean square error of .143 for ρ̂ − ρ. If ρ̂ − ρ has roughly a normal distribution, which for large amounts of data will always be the case, then the accuracy statement can also be interpreted as

(2.4)  Prob{ρ ∈ [.776 − .143, .776 + .143]} ≈ .68.

(Statement (2.4) is based upon the fact that a normal distribution puts 68% of its probability within one standard deviation of the mean.) Interval statements of accuracy like (2.4) have more intuitive appeal than root mean square error.

How good is the estimate σ̂⁽ᴶ⁾? We could, if we wanted to, jackknife the entire procedure which computed σ̂⁽ᴶ⁾, that is do a second order jackknife, to estimate a standard deviation of σ̂⁽ᴶ⁾. (This would require about n² times as many calculations as for ρ̂.) Instead, we will compare σ̂⁽ᴶ⁾ with the traditional normal-theory estimate of ρ̂'s standard deviation. In the next section we will calculate the standard deviation in another way which clarifies the connection between the two answers.
Suppose the n = 15 pairs (xᵢ, yᵢ) are actually drawn from a bivariate normal distribution with correlation coefficient ρ. Then the exact density function of ρ̂ can be calculated theoretically. This density function depends only upon ρ, not on the means or standard deviations of x and y, and so can be denoted f_ρ(ρ̂); by definition ∫ₐᵇ f_ρ(ρ̂) dρ̂ = Prob{a ≤ ρ̂ ≤ b}. Figure 2 shows f_ρ(·) for ρ = .776, the observed value in the law school sample. It is denoted f_ρ(ρ̂*) to preserve the definition of ρ̂ as the observed value; ρ̂* is just a convenient name for the dummy variable in f_ρ(·). The abscissa is plotted in terms of ρ̂* − ρ to emphasize the deviations of ρ̂* from ρ.

We see that the density function is not exactly normal, having a longer tail to the left than to the right, and also is not centered exactly at 0, i.e. at ρ̂* = ρ, having instead median value .011. (The normality can be dramatically improved by making Fisher's tanh⁻¹ transformation; see Cramér [5, p. 399].) The traditional normal theory estimate of σ can be described, at the expense of a slight oversimplification, in terms similar to (2.4): look at the central 68% of the distribution described by f_ρ(·), that is the interval from the 16th percentile to the 84th percentile. Half of the length of this interval is a reasonable definition of the normal-theory estimate of σ, say σ̂⁽ᴺ⁾. For ρ = .776 this gives σ̂⁽ᴺ⁾ = .113. For large values of n this definition of σ̂⁽ᴺ⁾ agrees with (2.2), but in small samples it is more meaningful, being less affected by occasional wild values of the random quantity whose accuracy we are trying to describe.


[Figure omitted: the normal theory density function f_ρ(ρ̂*), plotted against ρ̂* − ρ, with the 16th, 50th and 84th percentiles marked; half the distance between the 16th and 84th percentiles gives σ̂⁽ᴺ⁾ = .113.]

FIG. 2. The normal theory density function f_ρ(ρ̂*) of the observed correlation coefficient ρ̂* for 15 data pairs (xᵢ, yᵢ) drawn from a bivariate normal distribution with true correlation ρ = .776. The distribution puts 68% of its probability in the interval ρ̂* ∈ [ρ − .126, ρ + .099].

The calculations of the next section suggest an answer somewhat closer to σ̂⁽ᴺ⁾ = .113 than to σ̂⁽ᴶ⁾ = .143. One bad feature of σ̂⁽ᴶ⁾ can be spotted in Table 2. The first value, ρ̂₍₁₎ − ρ̂ = .116, accounts for two-thirds of the sum of squares in (2.3). Any estimate that depends so heavily on a single datum is prone to instability, as we discuss in § 5.

Figure 1 shows why ρ̂₍₁₎ − ρ̂ is so large. Data point 1 is far away from the other 14, so that its removal causes a large change in the estimated correlation coefficient. This notion is formalized in § 5 under the name "influence function," and furnishes a theoretic rationale for the jackknife estimate of accuracy. In addition to Miller [18], another good reference on the justification and use of the jackknife is Mosteller and Tukey [19].
3. Bootstrap methods. We consider another method, called the "bootstrap" in Efron [8], of assigning an accuracy to the estimated correlation ρ̂ = .776 for the law school data (footnote 6):

1) Let F̂ be the empirical distribution of the 15 observed data points, i.e. the probability distribution which puts mass 1/15 at each observed point (xᵢ, yᵢ).

2) Use a random number generator to draw 15 new points (xᵢ*, yᵢ*) independently and with replacement from F̂, so that each new point is an independent random selection of one of the 15 original data points. These new points, which we will call the "bootstrap sample," are a subset of the original points plotted in Fig. 1. Some of the original points will have been selected zero times, some once, some twice, etc.

3) Compute ρ̂*, the correlation coefficient for the bootstrap sample.

4) Repeat steps (2) and (3) a large number of times, say N times, each time using an independent set of new random numbers to generate the new bootstrap sample. Call the resulting sequence of bootstrap correlation coefficients ρ̂*¹, ρ̂*², · · · , ρ̂*ᴺ.

6 The name "bootstrap" is meant to be euphonic with "jackknife," the two methods being closely related as we shall see, and also to convey the self-help nature of the bootstrap algorithm.


5) Let [a*, b*] be the central 68% interval for the ρ̂* values, i.e.

#{ρ̂*ⁱ < a*}/N = .16,   #{ρ̂*ⁱ < b*}/N = .84.

Define the bootstrap estimate of the standard deviation σ, say σ̂⁽ᴮ⁾, to be half the length of this interval,

σ̂⁽ᴮ⁾ = (b* − a*)/2.
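Steps 1)-5) are equally short to program. The sketch below is ours, not the paper's; the random seed and the percentile interpolation are implementation details, and the resulting value will differ slightly from the .127 of Fig. 3 because the random numbers differ.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded only to make the illustration reproducible

# Law school data from Table 1.
lsat = np.array([576, 635, 558, 578, 666, 580, 555, 661, 651, 605, 653, 575, 545, 572, 594])
gpa  = np.array([3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43, 3.36, 3.13, 3.12, 2.74, 2.76, 2.88, 2.96])

n, N = len(lsat), 1000
rhos = np.empty(N)
for k in range(N):
    # Step 2: draw 15 pairs with replacement from the empirical distribution F-hat.
    idx = rng.integers(0, n, size=n)
    # Step 3: correlation coefficient of the bootstrap sample.
    rhos[k] = np.corrcoef(lsat[idx], gpa[idx])[0, 1]

# Step 5: central 68% interval of the bootstrap values; half its length is sigma-hat(B).
a_star, b_star = np.percentile(rhos, [16, 84])
print(f"bootstrap sigma = {(b_star - a_star) / 2:.3f}")
```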

Figure 3 shows the results of N = 1000 bootstrap replications. The histogram of the 1000 values ρ̂*¹ − ρ̂, ρ̂*² − ρ̂, · · · , ρ̂*ᴺ − ρ̂ is plotted, and it is seen that σ̂⁽ᴮ⁾ = .127. The similarity of the histogram to the normal-theory density function of ρ̂* − ρ̂, reproduced from Fig. 2, is apparent, the main difference being an excess of bootstrap values for ρ̂* − ρ̂ > .15 (coming from a deficit in the range 0 to .10). This excess pulls the 84% point of ρ̂* − ρ̂ up to .132, compared with the normal-theory value of .099, and is the reason σ̂⁽ᴮ⁾ = .127 is larger than the normal-theory estimate σ̂⁽ᴺ⁾ = .113, though it is still considerably smaller than σ̂⁽ᴶ⁾ = .143, the jackknife estimate.

[Figure omitted: histogram of the 1000 bootstrap values ρ̂* − ρ̂, with the normal theory density from Fig. 2 superimposed and the 16%, 50% and 84% percentiles marked.]

FIG. 3. Histogram, 1000 bootstrap replications of ρ̂* − ρ̂, gives a bootstrap estimate of accuracy σ̂⁽ᴮ⁾ = .127 for the correlation coefficient ρ̂ = .776 of the law school data. The normal theory density of ρ̂* − ρ̂, from Fig. 2, has a similar shape, but falls off more quickly at higher values of ρ̂* − ρ̂.

What we have called ρ before, the true correlation, might better be called ρ(F), where F is the true probability distribution giving rise to the data pairs (xᵢ, yᵢ). The notation ρ = ρ(F) emphasizes that the correlation coefficient is a functional, mapping any bivariate probability distribution into a real number in the interval [−1, 1]. Definition (2.1) can be written ρ̂ = ρ(F̂), where F̂ is the empirical probability distribution introduced at step 1 of the bootstrap procedure.

The rationale underlying the bootstrap procedure is simple: i) We want an estimate of the accuracy of ρ̂; ii) We would like to use σ(F), where σ(·) is some agreed upon functional which measures accuracy, such as (2.2), σ(F) = [E(ρ(F̂) − ρ(F))²]^{1/2}. (Notice that σ(F) depends only upon F since the expectation operator E averages over the possible F̂'s arising from a random sample of 15 independent pairs from F.) iii) We don't know F, so instead we estimate σ̂ = σ(F̂). In other words, we use the same basic method to estimate σ as to estimate ρ itself: a simple substitution of F̂ for the unknown true distribution F.


Instead of root mean square error, we have been employing a different functional to measure accuracy,

(3.1)  σ(F) = half the length of the central 68% of the probability distribution, under F, of ρ(F̂) − ρ(F).

Why one might prefer (3.1) to (2.2) is discussed in § 5, though the real reason here has been the ease of graphical presentation.
The empirical distribution F̂ is a crude estimate of F. Why not use a better estimate of F, say F⁺, and estimate the accuracy by σ̂⁺ = σ(F⁺)? That is exactly what we have done in obtaining the normal theory estimate σ̂⁽ᴺ⁾. The better estimate of F is F⁺ equal to a bivariate normal distribution whose correlation coefficient is the observed value ρ̂ = .776. (The means and variances of F⁺ are also set equal to the observed sample values.) In this sense σ̂⁽ᴺ⁾ is itself a bootstrap estimate, the only difference being the use of a better F̂ at step 1 (footnote 7). "Better," of course, may really be worse if the assumption that the true F is bivariate normal is wrong. It is reassuring to see the agreement between σ̂⁽ᴺ⁾ and σ̂⁽ᴮ⁾, since the latter makes no special assumptions about the form of F.

It is interesting to try a compromise between F̂, the empirical distribution, and F⁺, the best fitting normal distribution. Let F̂_c be the probability distribution of a random point v = (x, y) obtained as follows: take independent points v′ = (x′, y′) and v″ = (x″, y″) from F̂ and F⁺ respectively, and let v = √(1 − c²) v′ + c v″. Then F̂₀ = F̂, F̂₁ = F⁺, but for intermediate values of c we get a blend of the discrete distribution F̂ and the continuous normal distribution F⁺, which may more nearly approximate our actual beliefs about the form of the true F.
The choice of N = 1000 as the numberof bootstrapreplicationscan be shown,in
thepresentcase, to determineC(B) to an accuracyof about 2.5%. This meansthatifN
were increasedfrom1000 towardinfinity, thelimiting value of CJ(B) wouldbe expected
to differ from.127 byless than2.5%. Vastlymorebootstrapreplicationsmightresultin
'A
,A(B)=O
oCB= .130 or .125, but almostcertainlynot B = .120 or .135. We could have gotten
by withN = 250 replications,givingan expectedaccuracyof 5%, but N = 1000 is not
foolishlyexcessive.This impressiveexpenditureof computingpower, 1000 timesthat
forthe originalcalculationof ',Adoesn't includethe 1000 smoothedbootstraprepli-
cationsof Fig. 4. Of course,all the calculationstogetheronlytook a fewseconds and
cost perhaps $10, but, to reiteratethe obvious, they would have been practically
impossible 30 years ago. Bootstrap-likeprocedures have undergone very little
theoretical development since they have been computationallypractical for a

7Steps 2 through5 of the bootstrapprocedure are done theoretically,ratherthan by computer


simulation,in the normal-theory calculation.The bivariatenormalmodel is virtuallyunique in yieldingan
analyticallysimpledistributionforp. This getsback to our mainpoint,theeffectof thecomputeron whatis
considereda feasiblestatisticalprocedure.


[Figure omitted: histogram of 1000 smoothed bootstrap values ρ̂* − ρ̂, with the normal theory density superimposed and the 16%, 50% and 84% percentiles marked.]

FIG. 4. Histogram, 1000 bootstrap replications of ρ̂* − ρ̂, using the smoothed sampling distribution F̂_c, c = 1/√5, described in the text. The histogram follows the normal-theory density more closely than in Fig. 3, but σ̂⁽ᴮ⁾ = .125, almost the same value as for the unsmoothed bootstrap.

There is an interesting theoretical connection between the jackknife and the bootstrap (footnote 8). Considering now just one bootstrap replication, let Pᵢ* be the proportion of the bootstrap sample equal to the original data pair (xᵢ, yᵢ). For example, if (x₅, y₅) is included three times in the bootstrap sample, of size n = 15, then P₅* = 3/15 = .20. The vector P* = (P₁*, P₂*, · · · , Pₙ*) determines ρ̂* − ρ̂, so we can write, say, ρ̂* − ρ̂ = g(P*), where g(·) is a known function. (To be specific, g(P*) = [Σᵢ Pᵢ*(xᵢ − x̄*)(yᵢ − ȳ*)] / [Σᵢ Pᵢ*(xᵢ − x̄*)² · Σᵢ Pᵢ*(yᵢ − ȳ*)²]^{1/2} − ρ̂, where x̄* = Σᵢ Pᵢ* xᵢ, ȳ* = Σᵢ Pᵢ* yᵢ. Notice that the data (xᵢ, yᵢ), i = 1, 2, · · · , 15, are considered fixed in this definition.)

The statistics of the vector P* are completely known from the properties of the multinomial distribution. For example, P* has expected value (1/n, 1/n, · · · , 1/n), n = 15, and covariance matrix with ijth element

(3.2)  Covariance(Pᵢ*, Pⱼ*) = 1/n² − 1/n³,   i = j;
                            = −1/n³,         i ≠ j.

Expanding g(·) in a Taylor series around (1/n, 1/n, · · · , 1/n) gives

(3.3)  ρ̂* − ρ̂ = Σᵢ₌₁ⁿ g⁽ⁱ⁾ (Pᵢ* − 1/n) + higher order terms,

where g⁽ⁱ⁾ is the partial derivative of g(·) with respect to Pᵢ*, evaluated at P* = (1/n, · · · , 1/n). Together, (3.2) and (3.3) suggest an easy approximation to the standard deviation of ρ̂* − ρ̂ under bootstrap sampling, namely

(3.4)  σ̂⁽ᴮ⁾ ≈ [(1/n²) Σᵢ (g⁽ⁱ⁾)²]^{1/2}.

8 The remainder of this section assumes some knowledge of statistical theory, though the general drift of the argument still should be discernible to nonstatisticians.


This is almost exactly the jackknife estimate σ̂⁽ᴶ⁾, the main difference being the substitution in (2.3) of finite differences (footnote 9) in place of the derivatives g⁽ⁱ⁾ appearing in (3.4). Jaeckel [15] originally suggested the right side of (3.4) as an accuracy approximation, calling it the "infinitesimal jackknife"; see also Efron [8].

9 It is not hard to show that (n − 1)(ρ̂ − ρ̂₍ᵢ₎) approximates g⁽ⁱ⁾; see § 5 of Efron [9].
4. Cross-validation. In its original form, cross-validation referred to the following simple, but useful, idea: given a large class of possible models to fit to a set of data, for example linear regression models in which the choice of predictor variables is open to question, first randomly divide the data into two halves. Then fit a model to the first half of the data, using any fitting method at all, and see how well the fitted model predicts the second half of the data. This last step, which is the cross-validation, protects the statistician against an overly optimistic assessment of goodness-of-fit.

Recently many authors, in particular Stone [21] and Geisser [10], have proposed direct use of cross-validation for the selection of appropriate models. This approach is computer intensive, but potentially much broader in application than the familiar linear model approach. We illustrate the method with an example taken from Wahba and Wold [23].
Figure 5 shows 100 artificially generated data points, created according to the following model: the point (xᵢ, yᵢ) with abscissa xᵢ has ordinate yᵢ randomly determined by

(4.1)  yᵢ = μ(xᵢ) + εᵢ,   i = 1, 2, · · · , 100,

where

(4.2)  μ(x) = 4.26(e⁻ˣ − 4e⁻²ˣ + 3e⁻³ˣ)

and the εᵢ are independent normal random variables with mean 0 and standard deviation σ = 0.2. The xᵢ values are equally spaced from 0 to 3.10. The function μ(x), which in a real application would be unknown to the statistician, is shown as the dashed curve in Fig. 5.

[Figure omitted: scatterplot of the 100 generated points over 0 ≤ x ≤ 3.00, with the dashed true curve μ(x) and the solid cross-validated fit.]

FIG. 5. 100 random points generated according to model (4.1), (4.2). The true mean function μ(x) is indicated by the dashed curve. The solid curve was obtained from the data points by the cross-validation method.

Wahba and Wold consider fitting a class of curves η(x, α) to these data. For a particular choice of the nonnegative parameter α, η(x, α) is by definition the curve η(x) minimizing

(4.3)  (1/100) Σᵢ₌₁¹⁰⁰ [yᵢ − η(xᵢ)]²

subject to the constraint

(4.4)  ∫₀³·¹²⁵ [η″(x)]² dx = α.

Constraint (4.4) is a smoothness condition: if we take α = 0, η(x, 0) is very smooth indeed, being the ordinary least squares straight line for the data in Fig. 5. It is easy to see that this gives a very poor fit in the present case. At the opposite extreme, if we let α get large enough, η(x, α) will go through every data point. This fits the data perfectly, but is far too irregular a curve to be of any use for prediction or analysis. Intermediate values of α give cubic spline functions, with a trade-off between smoothness (4.4) and fit (4.3).

Cross-validation proposes to estimate the best value of α, without any prior knowledge of the generating mechanism for the data. "Best" here means the value of α minimizing

(4.5)  (1/100) Σᵢ₌₁¹⁰⁰ [μ(xᵢ) − η(xᵢ, α)]²,

in other words, the curve η(x, α) closest to the true mean function μ(x). Another way to state criterion (4.5) is to imagine that a new set of data, say (xᵢ, yᵢ*), i = 1, 2, · · · , 100, has been independently generated according to model (4.1), (4.2). How well will a curve η(x, α) fitted to the original data predict this new data set, in the sense of minimizing (1/100) Σᵢ₌₁¹⁰⁰ [yᵢ* − η(xᵢ, α)]²? The expected error of prediction, with η(x, α) fixed, is

(4.6)  E (1/100) Σᵢ₌₁¹⁰⁰ [yᵢ* − η(xᵢ, α)]² = σ² + (1/100) Σᵢ₌₁¹⁰⁰ [μ(xᵢ) − η(xᵢ, α)]².

Since σ² = (0.2)² is a fixed number, minimizing (4.5) is equivalent to minimizing the expected squared error of prediction (4.6).

If the new data set (xᵢ, yᵢ*) were actually available we could easily select α: for each α, the curve η(x, α) is determined from the original data set, by (4.3), (4.4), and then tested on the new data set by computing Q*(α) = (1/100) Σᵢ₌₁¹⁰⁰ [yᵢ* − η(xᵢ, α)]². The α which minimized Q*(α) would be the estimated best α.

Cross-validation does almost the same thing, without requiring any new data. For each choice of i, i = 1, 2, · · · , 100, let η₍ᵢ₎(x, α) be that curve η(x) satisfying constraint (4.4), and minimizing

(4.7)  (1/99) Σⱼ₌₁, ⱼ≠ᵢ¹⁰⁰ [yⱼ − η(xⱼ)]².

In other words, η₍ᵢ₎(x, α) is the solution to the constrained minimization problem (4.3), (4.4) with point (xᵢ, yᵢ) removed from the data set. We then define

(4.8)  Q†(α) = (1/100) Σᵢ₌₁¹⁰⁰ [yᵢ − η₍ᵢ₎(xᵢ, α)]²,

and select as "best" the α minimizing Q†(α), say α†. The curve η(x, α†) is the proposed estimate for μ(x).
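The constrained spline fit of (4.3)-(4.4) needs specialized code; to keep the illustration self-contained, the sketch below applies the same leave-one-out idea to a simpler family, using polynomial degree in place of the smoothing parameter α. This substitution is ours, not Wahba and Wold's; only the cross-validation logic of (4.7)-(4.8) is being illustrated, on data generated from model (4.1)-(4.2).

```python
import numpy as np

rng = np.random.default_rng(0)

# Generate 100 points from model (4.1)-(4.2).
x = np.linspace(0.0, 3.10, 100)
mu = 4.26 * (np.exp(-x) - 4 * np.exp(-2 * x) + 3 * np.exp(-3 * x))
y = mu + rng.normal(0.0, 0.2, size=100)

def loo_score(degree):
    """Q-dagger of (4.8): leave out each point, refit, and score the prediction."""
    errs = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        coeffs = np.polyfit(x[keep], y[keep], degree)   # fit on the other 99 points
        errs.append((y[i] - np.polyval(coeffs, x[i])) ** 2)
    return np.mean(errs)

degrees = range(1, 9)
scores = [loo_score(d) for d in degrees]
best = degrees[int(np.argmin(scores))]   # the analogue of alpha-dagger
print("cross-validated scores:", [round(s, 4) for s in scores])
print("selected degree:", best)
```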


The solid curve in Fig. 5 shows η(x, α†) in Wahba and Wold's example. The fit is obviously quite good, and Q†(α†), if it were presented, would give a good estimate of the expected prediction error (4.6) for η(x, α†). Of course we have had to do about 100 times as much work to compute the curve η(x, α) for any given α. (Wahba and Wold actually omit points 10 at a time, instead of one at a time, and so reduce the computational effort by a factor of 10.)

Cross-validation resembles the jackknife in that data points are removed one at a time in both procedures, but the underlying connection between the two methods is still not clear to statistical researchers. The next example shows a situation where either cross-validation or the bootstrap can be applied, but the latter is quite a bit more effective. This isn't intended to disparage cross-validation, but rather to suggest that further research may lead to powerful combinations of cross-validation and jackknife-bootstrap methods.
Figure 6 shows 20 artificially generated random points, 10 from each of two populations. The underlying x population is bivariate normal with mean vector (−2, 0)′ and covariance matrix the identity. The y population differs in having mean vector (2, 0)′. By definition, the linear discriminant boundary is the straight line

(4.9)  {z : (ȳ − x̄)′ S⁻¹ (z − (x̄ + ȳ)/2) = 0},

where x̄ and ȳ are the two mean vectors, x̄ = Σᵢ xᵢ/10, ȳ = Σᵢ yᵢ/10, and S is the pooled 2 × 2 sample covariance matrix [Σᵢ (xᵢ − x̄)(xᵢ − x̄)′ + Σᵢ (yᵢ − ȳ)(yᵢ − ȳ)′]/18. The linear discriminant boundary divides the plane into two regions, A and B, the intention being to classify an unlabeled future point z as being either an x or a y depending on whether it falls into A or B. (The optimum division line for future classification is actually {z = (z₁, z₂) : z₁ = 0}, but of course the statistician wouldn't know that in a real situation. Notice that the linear discriminant boundary is calculated from the observed data, and doesn't require knowledge of the underlying probability mechanisms. Definition (4.9) is motivated by an attempt to estimate the optimum division line, which is in fact the line obtained from (4.9) when x̄, ȳ and S are replaced by the true mean vectors and covariance matrix of the two normal populations. Using a linear boundary tacitly assumes that the covariance matrix is the same for both populations.)

[Figure omitted: scatterplot of the ten x points and ten y points, with the linear discriminant boundary separating Region A from Region B.]

FIG. 6. Ten x points independently generated from a bivariate normal population with mean vector (−2, 0)′, and ten y points independently generated from a bivariate normal population with mean vector (2, 0)′. (Covariance matrix is the identity in both groups.) The straight line is the linear discriminant boundary.
The probability that a future x random point will be misclassified is

error_x ≡ Prob{x ∈ B},

which happens to equal 0.41 for the situation in Fig. 6. In this definition, B is considered fixed as shown, and the random quantity is the hypothetical future x point. The obvious estimate of error_x is

êrror_x = #{xᵢ ∈ B}/10,

which equals 0.30 in Fig. 6. It is well known that êrror_x tends to underestimate error_x, that is to have an optimistic bias, and an important problem is to estimate the expected bias,

(4.10)  bias_x ≡ E{error_x − êrror_x}.

The corresponding quantity for the y population is equally important of course, but it is sufficient to discuss estimating bias_x.

Cross-validation estimates bias_x by i) successively eliminating each point xᵢ, i = 1, 2, · · · , 10; ii) recomputing the linear discriminant boundary on the basis of the nine remaining x's and 10 y's; and iii) seeing whether or not xᵢ is misclassified by the recomputed discrimination rule. Let êrror_x† be the proportion of the x points misclassified at step iii). Then the cross-validated estimate of bias is

(4.11)  bias_x† = êrror_x† − êrror_x.

In the situation of Fig. 6, bias_x† = 0.10, which means that 4 out of 10 x values were misclassified during the cross-validation process.
The bootstrap estimate of bias_x takes considerably more computation:

1) Select a bootstrap sample of 10 new x points, x₁*, x₂*, · · · , x₁₀*, by random sampling, independently and with replacement, from the given points x₁, x₂, · · · , x₁₀. Likewise, construct a bootstrap sample of 10 new y points y₁*, y₂*, · · · , y₁₀* by random sampling from y₁, y₂, · · · , y₁₀.

2) Construct the bootstrap linear discriminant boundary by substituting x̄*, ȳ*, S* for x̄, ȳ, S in (4.9). Denote the bootstrap discriminant regions as A*, B*.

3) Let

(4.12)  b* = #{xᵢ ∈ B*}/10 − #{xⱼ* ∈ B*}/10.

4) Repeat steps 1)-3) a large number N of times, obtaining independent values b*¹, b*², · · · , b*ᴺ, and estimate bias_x by

(4.13)  bias_x* = (1/N) Σᵢ₌₁ᴺ b*ⁱ.
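A computational sketch of steps 1)-4) follows. The article's 20 training points are not tabulated, so the sketch (ours) generates stand-in points from the same two normal populations; the discriminant rule, the apparent error rate, and the bootstrap bias estimate then follow the recipe directly.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated stand-ins for the 20 training points of Fig. 6 (10 from each population).
x_train = rng.normal(size=(10, 2)) + np.array([-2.0, 0.0])
y_train = rng.normal(size=(10, 2)) + np.array([2.0, 0.0])

def in_region_B(x_pts, y_pts, z):
    """True where z falls on the y side of the linear discriminant boundary (4.9)."""
    xbar, ybar = x_pts.mean(axis=0), y_pts.mean(axis=0)
    dx, dy = x_pts - xbar, y_pts - ybar
    S = (dx.T @ dx + dy.T @ dy) / 18          # pooled 2x2 sample covariance matrix
    w = np.linalg.solve(S, ybar - xbar)
    return (z - (xbar + ybar) / 2) @ w > 0

# Apparent error rate for the x population: proportion of training x's misclassified.
apparent = in_region_B(x_train, y_train, x_train).mean()

# Bootstrap estimate of the bias, equations (4.12)-(4.13).
N = 100
b_star = []
for _ in range(N):
    xs = x_train[rng.integers(0, 10, size=10)]       # bootstrap sample of x's
    ys = y_train[rng.integers(0, 10, size=10)]       # bootstrap sample of y's
    true_rate = in_region_B(xs, ys, x_train).mean()      # original x's judged by bootstrap rule
    apparent_rate = in_region_B(xs, ys, xs).mean()       # bootstrap x's judged by bootstrap rule
    b_star.append(true_rate - apparent_rate)

print(f"apparent error = {apparent:.2f}, bootstrap bias estimate = {np.mean(b_star):.3f}")
```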


In the present case, N = 100 bootstrap replications gave the estimate bias_x* = 0.078. Notice that (4.12) is of the form "true minus apparent error rate," where now "true" refers to the xᵢ and "apparent" refers to the xᵢ*. The justification of the bootstrap is the same here as in § 3.

When the x and y values are generated by the underlying normal distributions described earlier, the actual value of bias_x is .062. That is, êrror_x tends to underestimate error_x by .062, on the average. In a large number of Monte Carlo trials, reported in § 4 of Efron [8], both bias_x† and bias_x* were themselves nearly unbiased; that is, they averaged about .062. However, the bias_x† values were three times more variable than the bias_x* values, which made them much less dependable for assessing bias_x in any particular case.
5. Robust estimation. A fundamental statistical tactic is the combination of separate small pieces of information, each by itself nearly worthless, to produce an overall conclusion of substantial reliability. Independent tosses of a possibly biased coin offer the classic example. No one toss tells us very much about the coin, but having observed, say, 30 heads in 100 tosses, the true probability of heads can reliably be predicted to lie in the interval .300 ± .092. Averaging, which is what is done to get the estimate .300, is a powerful way of bringing diverse information to bear on a single important question. Some of the most useful statistical methods, such as linear regression and analysis of variance, are really no more than fancy averaging techniques, designed for situations where the individual observations are collected under varying circumstances.

Suppose we threw away any one of the 100 coin flips, leaving ourselves with the data from the remaining 99. The estimated true probability of heads, call it p̂, would then equal either p̂ = 30/99 = .303 or p̂ = 29/99 = .293, depending on whether we had thrown away a head or a tail. Both .303 and .293 are quite close to .300, the point here being that no one of the individual pieces of information is by itself very important to the estimate p̂ = .300. We say that p̂ is robust in this situation, to use Tukey's memorable terminology (somewhat differently than originally intended).

Unfortunately, it is not always true that the average x̄ = Σᵢ xᵢ/n is robust in the sense above. Table 3 shows microbe counts in 69 swabs from different portions of a Mariner space probe. The average count is x̄ = 16.14, but deleting the largest count, count #69, gives average x̄₍₆₉₎ = 1.53 for the remaining 68 numbers. Deleting the largest two counts, count #69 and count #68, gives x̄₍₆₈,₆₉₎ = .63. In this case x̄ is distinctly nonrobust.

Recently statisticians have become interested in robust estimators, averaging techniques which limit the influence of any one observation on the estimate, even in situations as extreme as that of Table 3. Huber's monograph [14] gives an excellent overview of the subject. Another good reference is Hampel [12].

TABLE 3
Microbe counts in 69 swabs of a Mariner space probe. (Part of a much larger data set.) The count was zero in 53 swabs, one in 6 swabs, etc. Removing the largest count, 1010, reduces the average count from 16.14 to 1.53.

Count              0    1    3    4    5    6    9   62   1010
Number of swabs   53    6    4    1    1    1    1    1      1


The average x̄ of a set of numbers x₁, x₂, · · · , xₙ can also be derived as that number T which minimizes the sum of squared deviations, Σᵢ₌₁ⁿ (xᵢ − T)². Differentiation shows that x̄ may also be characterized as the solution to the equation (in T), Σᵢ₌₁ⁿ (xᵢ − T) = 0. By definition, an M estimator is the solution in T to the equation

(5.1)  Σᵢ₌₁ⁿ ψ(xᵢ − T) = 0.

Here ψ(·) is a preselected function, which can be chosen to give good robustness properties. If ψ(x) = x then T is the ordinary average x̄. If

ψ(x) = sign(x)

then T is the sample median, the middle value of the observations listed in increasing order. (Reversing the differentiation argument at the beginning of this paragraph shows that the median minimizes the sum of absolute deviations Σᵢ₌₁ⁿ |xᵢ − T|.) For the microbe data the median equals 0 no matter how many of the nonzero counts are removed. This is more robustness than we want in many situations!
As a compromise between ψ(x) = x and ψ(x) = sign(x) we can take

(5.2)  ψ(x) = −c for x < −c;   ψ(x) = x for −c ≤ x ≤ c;   ψ(x) = c for c < x.

Choosing c = ∞ makes T equal to the average, while c = 0 (actually, the limit as c → 0, in which case ψ(x)/c → sign(x)) gives the median. The choice c = 10 results in the estimate T = .93 for the microbe data. Removing the largest count changes the estimate to T₍₆₉₎ = .78; also removing the second largest gives T₍₆₈,₆₉₎ = .63. These values can be obtained easily on a hand calculator, using Newton-Raphson iteration or just trial and error. Doing the computation gives a good feeling for the way in which the estimator based on (5.2) acts like x̄ near the middle of the data, but automatically limits the influence of outlying observations.
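These calculations are easy to automate as well. The sketch below (ours) solves (5.1) for the ψ of (5.2) by Newton-Raphson on the Table 3 counts; with c = 10 it should reproduce T = .93 and T₍₆₉₎ = .78.

```python
import numpy as np

# Microbe counts from Table 3: 53 zeros, 6 ones, 4 threes, and one each of 4, 5, 6, 9, 62, 1010.
counts = np.array([0]*53 + [1]*6 + [3]*4 + [4, 5, 6, 9, 62, 1010], dtype=float)

def psi(x, c):
    """The compromise psi function (5.2): linear in [-c, c], clipped outside."""
    return np.clip(x, -c, c)

def m_estimate(x, c, tol=1e-8):
    """Solve sum(psi(x_i - T)) = 0 by Newton-Raphson, starting from the median."""
    T = np.median(x)
    for _ in range(100):
        r = x - T
        step = psi(r, c).sum() / max((np.abs(r) <= c).sum(), 1)  # derivative of psi is 1 inside [-c, c]
        T += step
        if abs(step) < tol:
            break
    return T

print(f"c = 10:                    T = {m_estimate(counts, 10):.2f}")
print(f"largest count removed:     T = {m_estimate(counts[:-1], 10):.2f}")
print(f"two largest counts removed: T = {m_estimate(counts[:-2], 10):.2f}")
```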
How can we choose amongst possible estimators T in any given situation? If we knew that the observations were independently generated according to some probability density function f(x − θ), with θ an unknown parameter to be estimated (a "translation family" situation), we could use the maximum likelihood estimator, i.e. the number T which maximizes Πᵢ₌₁ⁿ f(xᵢ − T). Taking logarithms and differentiating shows that the maximum likelihood estimator is an M estimator (footnote 10), with ψ function equal to

(5.3)  ψ_f(x) = −f′(x)/f(x).

For the normal translation family, with f(x − θ) = (2π)^{−1/2} exp{−½(x − θ)²}, ψ_f(x) = x and so the average x̄ is the maximum likelihood estimator. The Laplace translation family f(x − θ) = (1/2) exp{−|x − θ|} gives the median as the maximum likelihood estimator. Maximum likelihood produces nearly optimal estimates in translation families, assuming of course that the f(·) used in (5.3) is actually the correct form of the density function.

The point of much of the work in robustness theory is that the statistician may not completely trust a given parametric model, such as the normal translation family, and so may prefer to change the "optimum" ψ_f(·) to a more robust choice ψ(·).

10 The name "M estimator" comes from Maximum likelihood.


This reduces the theoretical efficiency of the estimator somewhat, compared with that of the maximum likelihood estimator, if the model f(x − θ) is correct, but protects the statistician against disastrously foolish estimates if the model is somewhat off. It can be shown that the square of the correlation between ψ_f(x) and ψ(x), calculated under density f(x), determines the large-sample efficiency of the M estimator based on ψ(·). A correlation of .90, for example, means that the M estimator based on ψ(·) wastes about 19% (.19 = 1 − .9²) of the information available for estimating θ under the model f(x − θ). It turns out that for the normal translation model, reasonable choices of c in (5.2) give efficiencies better than 95% while still providing good protection against occasional wild observations.
We have discussed the influence of a single observation on an estimate T. This notion has been formalized under the name "influence function," and provides theoretical justification for the jackknife, as well as for robust estimators. The M estimators are functionals T(F̂), as was ρ(F̂) in § 3, and can be thought of as estimating the true value T(F), where F is the true probability distribution giving rise to the data x₁, x₂, x₃, · · · . If the sample size n were increased toward infinity, T(F̂) would approach T(F).

Let δ_x represent the degenerate probability distribution putting all of its mass at the point x. The influence function T(x; F), for a given estimator T, evaluated at the true distribution F, is the function of x defined by

(5.4)  T(x; F) ≡ (d/dε) T((1 − ε)F + ε δ_x) |_{ε=0}.

The influence function represents the effect upon T(F̂) of a small local change in F. By superimposing many such small changes we obtain, via a first order Taylor series expansion, an approximation to T(F̂) − T(F), the difference between the estimated and true value of T,

(5.5)  T(F̂) ≈ T(F) + (1/n) Σᵢ₌₁ⁿ T(xᵢ; F).

For a linear functional, such as the mean T(F) = ∫ x dF(x), (5.5) is exact. (For the mean, T(x; F) = x − T(F), so (1/n) Σᵢ T(xᵢ; F) = x̄ − T(F) = T(F̂) − T(F).) Nonlinear functionals, such as the M estimator based on (5.2), are, under some regularity conditions, asymptotically linear, as n → ∞, in the sense of (5.5). The usefulness of (5.5) is that it approximates T(F̂) − T(F) by the average of independent, identically distributed random quantities T(xᵢ; F). The standard deviation of such an average is

(5.6)  (1/√n) [∫ [T(x; F)]² dF(x)]^{1/2},

1/√n times the root mean square of the influence function. The jackknife standard deviation σ̂⁽ᴶ⁾, (2.3), is the nonparametric estimate of (5.6). (The values ρ̂₍ᵢ₎ − ρ̂ are rather crude estimates of the influence function. Expression (5.6) is closely related to (3.4).)
The principle of robust estimation can now be stated more quantitatively: only use estimators T(F̂) for which the influence function is sensibly bounded. It is easy to verify that the influence function of an M estimator is proportional to ψ(x − T(F)). The form of ψ in (5.2) is nothing more than a modification of ψ for the average, ψ(x) = x, with a bound put on the magnitude of the influence function. Definition (3.1) is motivated by similar considerations.
Robustness ideas are now being applied to regression situations such as (4.1), nice references being Andrews [1] and Mosteller and Tukey [19]. If the errors εᵢ occasionally take on wild values, then fitting models by the method of least squares can go disastrously wrong. The least squares method fits regression parameters β (for example, the coefficients of the exponential terms in (4.2), if they were unknown) by minimizing Σᵢ₌₁ⁿ (yᵢ − μ_β(xᵢ))². Instead, we can minimize Σᵢ₌₁ⁿ ρ(yᵢ − μ_β(xᵢ)), where

(5.7)  ρ(y) = ∫₀ʸ ψ(y′) dy′,

with ψ as in (5.2). The limiting case, as c → 0, fits a model by minimizing Σᵢ₌₁ⁿ |yᵢ − μ_β(xᵢ)|, the sum of absolute deviations. "Least absolute deviations" was the fitting method favored by Laplace, but it lost out to Gauss' least squares, mainly on the grounds of computational simplicity. Now, 150 years later, Laplace may reclaim the field, with the assistance of the modern computer.
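As a small illustration of fitting by minimizing Σ ρ(yᵢ − μ_β(xᵢ)), the sketch below (ours, with made-up straight-line data containing one gross error) compares a least squares fit with the robust fit based on (5.2) and (5.7), using a general-purpose optimizer.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# A small regression example with one wild observation.
x = np.linspace(0, 10, 20)
y = 2.0 + 0.5 * x + rng.normal(0, 0.2, size=20)
y[10] += 30.0          # a single gross error

def rho(r, c=1.0):
    """rho of (5.7) for the psi of (5.2): quadratic in the middle, linear in the tails."""
    a = np.abs(r)
    return np.where(a <= c, 0.5 * r**2, c * a - 0.5 * c**2)

def fit(loss):
    """Fit intercept and slope by minimizing sum(loss(residuals))."""
    obj = lambda b: np.sum(loss(y - (b[0] + b[1] * x)))
    return minimize(obj, x0=[0.0, 0.0], method="Nelder-Mead").x

ls = fit(lambda r: r**2)   # least squares
rb = fit(rho)              # robust fit based on (5.2)/(5.7)
print(f"least squares:   intercept {ls[0]:.2f}, slope {ls[1]:.2f}")
print(f"robust (c = 1):  intercept {rb[0]:.2f}, slope {rb[1]:.2f}")
```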
6. Censored data. We have made frequent use of the empirical distribution F̂, the probability distribution which puts mass 1/n at each of n observed data points x₁, x₂, · · · , xₙ. (In §§ 3 and 4 the xᵢ were points in a two dimensional space, while in § 5 the space was one dimensional.) It may seem that there is no way to make the calculation of F̂ difficult. If so, a look at some censored data should convince the reader otherwise.

Table 4 shows some early results from the heart transplant program at Stanford. The survival times in days, following the transplant operation, are listed for 18 patients. The first listed patient survived 3 days, the second 4+ days, where the "+" indicates that the patient was still alive on April 13, 1972, the point in time at which the data was collected. Here it would be wrong to let F̂ be the distribution putting mass 1/18 at each of the numbers 3, 4, 10, 25, · · · , 1025 since, for example, the actual survival time corresponding to 4+ is known only to lie in the interval (4, ∞). This is an example of censoring, in which the exact value of a measurement can't be seen, but some information on its whereabouts is available.
Let T represent the survival time of a heart transplant patient, a quantity which we will measure in days. The survival curve S(t) is the probability of surviving past a given time t,

(6.1)  S(t) ≡ Prob{T > t}.

Knowing the function S(t) is the same as knowing F, the true probability distribution of T. If there were no censoring we could construct an estimate of S(t) in the obvious way,

(6.2)  Ŝ(t) = #{Tᵢ > t}/n,

where n = 18 in the case above. In other words, we could use the ordinary estimate F̂, of which Ŝ(t) is another representation.
Figure 7 shows how Ŝ(t) is constructed when some of the data are censored. The construction depends upon the number of patients at risk at time t,

(6.3)  n(t) ≡ number of patients neither censored nor observed to die before time t,

which is given in Table 4. In our example, n(0) = 18, n(100) = 9, n(200) = 4, etc.


TABLE 4
Survival times for 18 early heart transplant patients. Tabled is survival time, in days, following the transplant; "+" indicates that the patient was still alive on April 13, 1972, the day the data were collected. Abstracted from a larger data set in Brown and Turnbull [2]. "Number at risk" is used in the calculation of F̂.

Survival time     3    4+   10   25+   39   40+   43   54   65
Number at risk   18    17   16   15    14   13    12   11   10

Survival time   120+  136  147  157+  183+  312  546+  824  1025
Number at risk     9    8    7    6     5     4     3    2     1

[Figure omitted: the estimated survival curve Ŝ(t) over 0 ≤ t ≤ 210 days, a decreasing step function starting at 1.0.]

FIG. 7. Estimated survival curve Ŝ(t) from the data in Table 4. Open circles represent censored data points, while jumps occur at uncensored observations. At each uncensored data point, i.e. at each observed death, Ŝ(t) is multiplied by a factor equal to the proportion of the observable population not dying.

The definition of Ŝ(t) is recursive, starting with Ŝ(0) = S(0) = 1:

(6.4)  Ŝ(t) = Ŝ(t − 1)                            if no observed deaths on day t;
       Ŝ(t) = Ŝ(t − 1) · [n(t) − 1]/n(t)           if one observed death on day t.

In the heart transplant example Ŝ(2) = Ŝ(1) = Ŝ(0) = 1, Ŝ(3) = Ŝ(2)(17/18) = .944, Ŝ(10) = Ŝ(9)(15/16) = .944 × .938 = .885, etc. Notice that the data point 4+ figures in the denominator of Ŝ(3), but has no further effect on Ŝ(t). Kaplan and Meier [16] give a very readable account of the theory behind (6.4).
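Recursion (6.4) takes only a few lines of code. The sketch below (ours) applies it to the Table 4 survival times, stepping through the observed death times rather than day by day, which yields the same step function.

```python
# Survival times from Table 4; True marks a censored ("+") observation.
data = [(3, False), (4, True), (10, False), (25, True), (39, False), (40, True),
        (43, False), (54, False), (65, False), (120, True), (136, False), (147, False),
        (157, True), (183, True), (312, False), (546, True), (824, False), (1025, False)]

def kaplan_meier(data):
    """Product-limit estimate S-hat(t) via recursion (6.4)."""
    S, curve = 1.0, []
    at_risk = len(data)
    for t, censored in sorted(data):
        if not censored:
            # an observed death: multiply by the proportion of the risk set not dying
            S *= (at_risk - 1) / at_risk
            curve.append((t, S))
        # either way, this patient now leaves the risk set
        at_risk -= 1
    return curve

for t, s in kaplan_meier(data):
    print(f"S-hat({t}) = {s:.3f}")
```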
The construction of Ŝ(t) may seem ad hoc, but Kaplan and Meier show that it produces the maximum likelihood estimate of the unknown S(t): among all possible survival curves S(t), i.e. among all possible true distributions F, the choice S(t) = Ŝ(t) maximizes the probability of obtaining the data actually observed. Bootstrap estimates of accuracy for functionals of censored data begin with F̂ corresponding to the survival curve Ŝ(t), at step 1 of the bootstrap algorithm.

Efron [7] suggested another motivation for Ŝ(t). Suppose we start out with any estimate Ŝ⁽⁰⁾(t). Define a new estimate Ŝ⁽¹⁾(t), along the lines of (6.2),

(6.5)  Ŝ⁽¹⁾(t) = E⁽⁰⁾ #{Tᵢ > t}/n,

whereE"() indicatesan expectationtaken withrespectto the probabilitydistribution


definedby the survivalcurve S(0)(t). Taking t = 20 in Table 4, for example, the 15
patientswithsurvivaltimes>20, censoredor not,contribute15 to the #{Ti > 20}. The
patientswithsurvivaltimes3 and 10 contributezero to #{Ti > 20}. The patientwith
survivaltime4+ may or may not have Ti > 20. This patientcontributesan expected
amountto the rightside of (6.5), the expectationbeing taken underthe distribution
"S(?)(so thatS(1)(20) is between 15/18 and 16/18.
We can iterate (6.5), giving the sequence of survival curves S(0)(t), S l1(t),
S (t), Efronshows thatthissequence convergesto S(t). The usefulnessof this
iterativeconstructionof S(t) is thatit can be applied undermore difficultcensoring
conditions.The data in Table 4 consistsof observeddeaths and right-censored obser-
vations,such as 4+. In othersituationstheremay also be left-censoredand doubly
censoredobservations("the eventoccurredbeforet = 17," "the eventdid not occur
duringthe interval(12, 20)"). Turnbull[22] showed that under general censoring
conditions,theiterativeconstruction (6.5) alwaysconvergesto themaximumlikelihood
estimateof S(t).
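The self-consistency iteration (6.5) can also be coded directly. The discretization below, which evaluates S only at the observed times, is our own; for the right-censored data of Table 4 the iteration settles on the same estimate as recursion (6.4).

```python
import numpy as np

# Survival times from Table 4; True marks a censored ("+") observation.
data = [(3, False), (4, True), (10, False), (25, True), (39, False), (40, True),
        (43, False), (54, False), (65, False), (120, True), (136, False), (147, False),
        (157, True), (183, True), (312, False), (546, True), (824, False), (1025, False)]

times = np.array([t for t, _ in data], dtype=float)
censored = np.array([c for _, c in data])
grid = np.unique(times)
n = len(data)

def survival_at(S_grid, t):
    """S(t) for the current step-function estimate (constant between grid points)."""
    idx = np.searchsorted(grid, t, side="right") - 1
    return 1.0 if idx < 0 else S_grid[idx]

# Start from the naive estimate (6.2) that ignores censoring.
S = np.array([(times > t).mean() for t in grid])

for _ in range(200):
    S_new = np.empty_like(S)
    for k, t in enumerate(grid):
        total = 0.0
        for ti, ci in zip(times, censored):
            if not ci:
                total += 1.0 if ti > t else 0.0                     # death time known exactly
            elif t <= ti:
                total += 1.0                                        # censored later than t: certainly alive at t
            else:
                total += survival_at(S, t) / survival_at(S, ti)     # expected contribution under current S
        S_new[k] = total / n                                        # this is equation (6.5)
    if np.max(np.abs(S_new - S)) < 1e-10:
        S = S_new
        break
    S = S_new

for t, s in zip(grid, S):
    print(f"S({int(t)}) = {s:.3f}")
```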
Recently, Dempster, Laird, and Rubin [6] have put the relationship between (6.5) and maximum likelihood estimation into a wider context, unifying work by many earlier writers. They consider a variety of situations in which it would be easy to calculate the maximum likelihood estimator if one had the full set of data, but where for one reason or another some of the information is missing. An example, for those familiar with analysis of variance, is a two way table with a few missing observations. In such a situation they show that an iterative procedure like (6.5) always leads to the maximum likelihood estimator, and moreover does so in a monotonic manner. They call this method the "EM Algorithm," in which one first Estimates the missing data and then Maximizes as if the full data set were present.
The survival function S(t) can be expressed as

(6.6)  S(t) = Πₛ₌₁ᵗ [1 − h(s)],

where

(6.7)  h(s) ≡ Prob{T = s | T > s − 1},

the conditional probability of dying on day s given survival past day s − 1. The function h(s) is called the hazard rate. The estimate (6.4) comes from estimating the factor 1 − h(s) by

(6.8)  1 − (number of observed deaths on day s)/n(s).

Hazard rates are more convenient to work with than density functions in censored data situations, an idea we now explore further.
We have treated the patients in Table 4 as if they were identical, at least as far as the probability distribution of their survival times is concerned. In fact, there are observable differences between the patients, such as age, sex, and race, which we might wish to examine for their effect on survival time. If there were no censoring we could run an ordinary regression analysis with the observed survival times as the dependent variable. Cox [4] has suggested a regression analysis which works directly with the hazard rates, and is not affected by data censoring.

Let zᵢ represent the vector of relevant observable information, such as age, race, and sex, about patient i, coded in some fashion so that all the entries of zᵢ are numbers.


For example, a 57 year old white male might be coded (57, 0, 1) where "0" indicates white and "1" indicates male. Cox's model postulates that the hazard rate for patient i, say hᵢ(s), is of the form

(6.9)  hᵢ(s) = g(s) e^{β′zᵢ}.

Here g(s) is an overall hazard rate applying to all the patients and β is a vector of unknown coefficients, corresponding to the regression coefficients in an ordinary regression model. If β = 0 then all the patients have the same hazard rate, i.e. identical probability distributions for their survival times, but if β ≠ 0, model (6.9) says that the survival time distributions are functions of zᵢ. (The vector zᵢ can itself be a time varying function, say zᵢ(s), as long as it is always observable.)

In order to analyze this model, Cox uses an approach similar to (6.8). Let R(s) be the risk set on day s, the set of patients available for observation on that day, i.e. those who have not been previously censored nor observed to die. Given that there is one death on day s, the probability under model (6.9) that it was some particular patient in R(s) who dies, say patient iₛ, equals

(6.10)  e^{β′z_{iₛ}} / Σ_{j∈R(s)} e^{β′zⱼ}.

(Expression (6.10) is actually an approximation which becomes exact as the units in which we are measuring time become infinitesimal.)
The advantage of (6.10) is that it depends only on β and the observable vectors zᵢ, and not on the common hazard function g(s) in (6.9). This makes it easy to analyze the data for the effects of β, without any modeling of g(s) being necessary. Cox multiplies the factors (6.10) together, one from each observed death,

(6.11)  Π_{observed deaths} [ e^{β′z_{iₛ}} / Σ_{j∈R(s)} e^{β′zⱼ} ],

and treats this product as if it were an ordinary likelihood function for β. For example, the β which maximizes (6.11) is treated as a maximum likelihood estimate. This approach ignores part of the data, those days on which no deaths occur, but has been shown to give reasonably efficient estimates of β nevertheless.
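Evaluating and maximizing (6.11) is straightforward to sketch for a single covariate. The data below are simulated, since the article does not tabulate covariates for the transplant patients; the risk-set bookkeeping and the product over observed deaths follow (6.10)-(6.11).

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)

# Simulated illustration: one covariate z, hazard proportional to exp(beta * z), beta = 1.
n, beta_true = 60, 1.0
z = rng.normal(size=n)
T = rng.exponential(1.0 / np.exp(beta_true * z))   # survival times under (6.9) with g(s) constant
C = rng.exponential(2.0, size=n)                    # censoring times
time = np.minimum(T, C)
death = T <= C                                      # True where the death was observed

def neg_log_partial_likelihood(beta):
    """Negative log of Cox's product (6.11): one factor (6.10) per observed death."""
    total = 0.0
    for i in np.where(death)[0]:
        risk_set = time >= time[i]                  # patients still under observation at this death time
        total += beta * z[i] - np.log(np.sum(np.exp(beta * z[risk_set])))
    return -total

beta_hat = minimize_scalar(neg_log_partial_likelihood, bounds=(-5, 5), method="bounded").x
print(f"estimated beta = {beta_hat:.2f} (true value 1.0)")
```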
If there are many patients, a hundred or more, and the vectors zᵢ are time-varying, expression (6.11) can be quite difficult to deal with computationally, taxing even a large computer. Without a computer, the method is hopeless, except in the simplest situations. (Mantel and Haenszel [17] discuss one such situation, the two sample comparison problem of § 1, with censored data.) Cox's regression method is a good example of a statistical theory which has developed in response to the capacity of modern computational equipment.
7. Conclusion. The purpose of mathematical theory, and in fact all scientific theory, is to reduce complicated situations to simple ones. Just what a scientist means by "simple" is determined by experience, training, convention, and the limitations of human reasoning faculties. A Taylor series expansion is a classic example of this process: a given function is expressed as a sum of multiples of powers. Since we are taught a lot about sums, multiples, and powers, the explanation may be a good deal easier to understand than the function as originally stated.

The advent of the high speed computer has redefined "simple" in the mathematical sciences. For example, an optimization problem which can be reduced to a problem in linear programming is, in most instances, now considered solved, since the simplex method is so efficient in numerically solving linear programs.

The purpose of this article has been to show this same process at work in mathematical statistics. A theory which enables a scientist to understand his data with the help of a high speed computer may now be as useful as a theory which only requires a table of the exponential function, particularly if the latter theory does not exist. Computer assisted theory is no less "mathematical" than the theory of the past, it is just less constrained by the limitations of the human brain.

The need for a more flexible, realistic, and dependable statistical theory is pressing, given the mountains of data now being amassed. The prospect for success is bright, but I believe the solution is likely to lie along the lines suggested in the previous sections: a blend of traditional mathematical thinking combined with the numerical and organizational aptitude of the computer.

REFERENCES

[1] D. F. ANDREWS, A robust method for multiple linear regression, Technometrics, 16 (1974), pp. 523-531.
[2] B. W. BROWN AND B. W. TURNBULL, Survivorship analysis of heart transplant data, Department of Statistics, Stanford University, Technical Report No. 34, 1972.
[3] B. W. BROWN, B. W. TURNBULL AND M. HU, Survivorship analysis of heart transplant data, Journal of the American Statistical Association, 69 (1974), pp. 74-80.
[4] D. R. COX, Regression models and life-tables, Journal of the Royal Statistical Society Series B, 34 (1972), pp. 187-220.
[5] H. CRAMÉR, Mathematical Methods of Statistics, Princeton University Press, Princeton, NJ, 1946.
[6] A. P. DEMPSTER, N. M. LAIRD AND D. B. RUBIN, Maximum likelihood estimation from incomplete data via the EM algorithm, Journal of the Royal Statistical Society Series B, 39 (1977), pp. 1-38.
[7] B. EFRON, The two sample problem with censored data, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability IV, 1967, pp. 831-853.
[8] ———, Bootstrap methods: Another look at the jackknife, Annals of Statistics, 7 (1979), no. 1, to appear.
[9] ———, Controversies in the foundations of statistics, American Mathematical Monthly, 85 (1978), no. 4, pp. 231-246.
[10] S. GEISSER, The predictive sample reuse method with applications, Journal of the American Statistical Association, 70 (1975), pp. 320-328.
[11] G. H. GOLUB AND G. P. H. STYAN, Numerical computations for univariate linear models, Journal of Statistical Computation and Simulation, 2 (1973), pp. 253-274.
[12] F. R. HAMPEL, The influence curve and its role in robust estimation, Journal of the American Statistical Association, 69 (1974), pp. 383-393.
[13] P. J. HUBER, Robust statistics: a review, Annals of Mathematical Statistics, 43 (1972), pp. 1041-1067.
[14] ———, Robust Statistical Procedures, Society for Industrial and Applied Mathematics, Philadelphia, PA, 1977.
[15] L. JAECKEL, The infinitesimal jackknife, Bell Labs. Memorandum #MM 72-1215-11, 1972.
[16] E. L. KAPLAN AND P. MEIER, Nonparametric estimation from incomplete observations, Journal of the American Statistical Association, 53 (1958), pp. 457-481.
[17] N. MANTEL AND W. HAENSZEL, Statistical aspects of the analysis of data from retrospective studies of disease, Journal of the National Cancer Institute, 22 (1959), pp. 719-748.
[18] R. G. MILLER, The jackknife - a review, Biometrika, 61 (1974), pp. 1-17.
[19] F. MOSTELLER AND J. W. TUKEY, Data Analysis and Regression, Addison-Wesley, Reading, MA, 1977.
[20] D. B. RUBIN, Using empirical Bayes techniques in the law school validity studies, Law School Admission Council Report 78-1, 1977.
[21] M. STONE, Cross-validatory choice and assessment of statistical predictions, Journal of the Royal Statistical Society Series B, 36 (1974), pp. 111-147.
[22] B. W. TURNBULL, The empirical distribution function with arbitrarily grouped, censored, and truncated data, Journal of the Royal Statistical Society Series B, 38 (1976), pp. 290-295.
[23] G. WAHBA AND S. WOLD, A completely automatic French curve: fitting spline functions by cross-validation, Communications in Statistics, 4 (1975), pp. 1-17.
