Leo Breiman - Statistical Modeling - Two Cultures
$\sum_{1} b_m x_m + \varepsilon$, where the coefficients $\{b_m\}$ are to be estimated,
$\varepsilon$ is $N(0, \sigma^2)$ and $\sigma^2$ is to be estimated. Given that
the data is generated this way, elegant tests of
hypotheses, confidence intervals, distributions of
the residual sum-of-squares and asymptotics can be
derived. This made the model attractive in terms
of the mathematics involved. This theory was used
both by academic statisticians and others to derive
significance levels for coefficients on the basis of
model (R), with little consideration as to whether
the data on hand could have been generated by a
linear model. Hundreds, perhaps thousands of articles
were published claiming proof of something or
other because the coefficient was significant at the
5% level.
Goodness-of-fit was demonstrated mostly by giving
the value of the multiple correlation coefficient
$R^2$, which was often closer to zero than one and
which could be overinflated by the use of too many
parameters. Besides computing $R^2$, nothing else
was done to see if the observational data could have
been generated by model (R). For instance, a study
was done several decades ago by a well-known
member of a university statistics department to
assess whether there was gender discrimination in
the salaries of the faculty. All personnel files were
examined and a database set up which consisted of
salary as the response variable and 25 other variables
which characterized academic performance;
that is, papers published, quality of journals published
in, teaching record, evaluations, etc. Gender
appears as a binary predictor variable.
A linear regression was carried out on the data
and the gender coefficient was significant at the
5% level. That this was strong evidence of sex discrimination
was accepted as gospel. The design
of the study raises issues that enter before the
consideration of a model: Can the data gathered
answer the question posed? Is inference justified
when your sample is the entire population? Should
a data model be used? The deficiencies in analysis
occurred because the focus was on the model and
not on the problem.
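The earlier remark that $R^2$ can be inflated by the use of too many parameters can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the response is pure noise, the predictors are unrelated noise columns, and the sample size (40) and number of predictors (25) are arbitrary choices, not values from any study mentioned here. The helper names `solve` and `r_squared` are hypothetical.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def r_squared(columns, y):
    """R^2 of the least-squares fit of y on an intercept plus the given columns."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - rss / tss

random.seed(0)
N_CASES, N_PREDICTORS = 40, 25  # arbitrary illustrative sizes
y = [random.gauss(0, 1) for _ in range(N_CASES)]  # response is pure noise
noise_columns = [[random.gauss(0, 1) for _ in range(N_CASES)]
                 for _ in range(N_PREDICTORS)]    # predictors unrelated to y

# R^2 for nested models using the first p noise predictors, p = 0..25
r2_by_size = [r_squared(noise_columns[:p], y) for p in range(N_PREDICTORS + 1)]
for p in (0, 5, 15, 25):
    print(f"{p:2d} noise predictors: R^2 = {r2_by_size[p]:.3f}")
```

Because the models are nested, $R^2$ cannot decrease as predictors are added, even though none of them carries any information about the response.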
The linear regression model led to many erroneous
conclusions that appeared in journal articles
waving the 5% significance level without knowing
whether the model fit the data. Nowadays, I think
most statisticians will agree that this is a suspect
way to arrive at conclusions. At the time, there were
few objections from the statistical profession about
the fairy-tale aspect of the procedure. But, hidden in
an elementary textbook, Mosteller and Tukey (1977)
discuss many of the fallacies possible in regression
and write "The whole area of guided regression is
fraught with intellectual, statistical, computational,
and subject matter difficulties."
Even currently, there are only rare published critiques
of the uncritical use of data models. One of
the few critics is David Freedman, who examines the use
of regression models (1994), the use of path models
(1987) and data modeling (1991, 1995). The analysis
in these papers is incisive.
5.2 Problems in Current Data Modeling
Current applied practice is to check the data
model fit using goodness-of-fit tests and residual
analysis. At one point, some years ago, I set up a
simulated regression problem in seven dimensions
with a controlled amount of nonlinearity. Standard
tests of goodness-of-fit did not reject linearity until
the nonlinearity was extreme. Recent theory supports
this conclusion. Work by Bickel, Ritov and
Stoker (2001) shows that goodness-of-fit tests have
very little power unless the direction of the alternative
is precisely specified. The implication is that
omnibus goodness-of-fit tests, which test in many
directions simultaneously, have little power, and
will not reject until the lack of fit is extreme.
Furthermore, if the model is tinkered with on the
basis of the data, that is, if variables are deleted
or nonlinear combinations of the variables added,
then goodness-of-fit tests are not applicable. Residual
analysis is similarly unreliable. In a discussion
after a presentation of residual analysis in a seminar
at Berkeley in 1993, William Cleveland, one
of the fathers of residual analysis, admitted that it
could not uncover lack of fit in more than four to five
dimensions. The papers I have read on using residual
analysis to check lack of fit are confined to data
sets with two or three variables.
With higher dimensions, the interactions between
the variables can produce passable residual plots for
a variety of models. A residual plot is a goodness-of-fit
test, and lacks power in more than a few dimensions.
An acceptable residual plot does not imply
that the model is a good fit to the data.
There are a variety of ways of analyzing residuals.
For instance, Landwehr, Pregibon and Shoemaker
(1984, with discussion) give a detailed analysis of
fitting a logistic model to a three-variable data set
using various residual plots. But each of the four
discussants presents other methods for the analysis.
One is left with an unsettled sense about the arbitrariness
of residual analysis.
Misleading conclusions may follow from data
models that pass goodness-of-fit tests and residual
checks. But published applications to data often
show little care in checking model fit using these
methods or any other. For instance, many of the
current application articles in JASA that fit data
models have very little discussion of how well their
model fits the data. The question of how well the
model fits the data is of secondary importance compared
to the construction of an ingenious stochastic
model.
5.3 The Multiplicity of Data Models
One goal of statistics is to extract information
from the data about the underlying mechanism producing
the data. The greatest plus of data modeling
is that it produces a simple and understandable picture
of the relationship between the input variables
and responses. For instance, logistic regression in
classification is frequently used because it produces
a linear combination of the variables with weights
that give an indication of the variable importance.
The end result is a simple picture of how the prediction
variables affect the response variable plus
confidence intervals for the weights. Suppose two
statisticians, each one with a different approach
to data modeling, fit a model to the same data
set. Assume also that each one applies standard
goodness-of-fit tests, looks at residuals, etc., and
is convinced that their model fits the data. Yet
the two models give different pictures of nature's
mechanism and lead to different conclusions.
McCullagh and Nelder (1989) write "Data will
often point with almost equal emphasis on several
possible models, and it is important that the
statistician recognize and accept this." Well said,
but different models, all of them equally good, may
give different pictures of the relation between the
predictor and response variables. The question of
which one most accurately reflects the data is difficult
to resolve. One reason for this multiplicity
is that goodness-of-fit tests and other methods for
checking fit give a yes-no answer. With the lack of
power of these tests with data having more than a
small number of dimensions, there will be a large
number of models whose fit is acceptable. There is
no way, among the yes-no methods for gauging fit,
of determining which is the better model. A few
statisticians know this. Mountain and Hsiao (1989)
write, "It is difficult to formulate a comprehensive
model capable of encompassing all rival models.
Furthermore, with the use of finite samples, there
are dubious implications with regard to the validity
and power of various encompassing tests that rely
on asymptotic theory."
Data models in current use may have more damaging
results than the publications in the social sciences
based on a linear regression analysis. Just as
the 5% level of significance became a de facto standard
for publication, the Cox model for the analysis
of survival times and logistic regression for survive-nonsurvive
data have become the de facto standard
for publication in medical journals. That different
survival models, equally well fitting, could give different
conclusions is not an issue.
5.4 Predictive Accuracy
The most obvious way to see how well the model
box emulates nature's box is this: put a case x down
nature's box getting an output y. Similarly, put the
same case x down the model box getting an output
y'. The closeness of y and y' is a measure of
how good the emulation is. For a data model, this
translates as: fit the parameters in your model by
using the data, then, using the model, predict the
data and see how good the prediction is.
Prediction is rarely perfect. There are usu-
ally many unmeasured variables whose effect is
referred to as noise. But the extent to which the
model box emulates nature's box is a measure of
how well our model can reproduce the natural
phenomenon producing the data.
McCullagh and Nelder (1989) in their book on
generalized linear models also think the answer is
obvious. They write, "At first sight it might seem
as though a good model is one that fits the data
very well; that is, one that makes $\hat{\mu}$ (the model
predicted value) very close to y (the response value)."
Then they go on to note that the extent of the agreement
is biased by the number of parameters used
in the model and so is not a satisfactory measure.
They are, of course, right. If the model has too many
parameters, then it may overfit the data and give a
biased estimate of accuracy. But there are ways to
remove the bias. To get a less biased estimate
of predictive accuracy, cross-validation can be used,
as advocated in an important early work by Stone
(1974). If the data set is larger, put aside a test set.
Mosteller and Tukey (1977) were early advocates
of cross-validation. They write, "Cross-validation is
a natural route to the indication of the quality of any
data-derived quantity. ... We plan to cross-validate
carefully wherever we can."
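The difference between predicting the data used for fitting and predicting held-out data can be sketched in a few lines. The example below is a minimal illustration, assuming a simulated one-predictor data set and a straight-line fit; the function names (`fit_line`, `mse`, `k_fold_mse`), the number of folds and the simulated coefficients are all hypothetical choices made for the sketch.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted line on the given cases."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_fold_mse(xs, ys, k):
    """Cross-validated error: each case is predicted by a fit that never saw it."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        a, b = fit_line(train_x, train_y)
        total += sum((ys[i] - (a + b * xs[i])) ** 2 for i in held_out)
    return total / n

random.seed(1)
xs = [random.uniform(0, 10) for _ in range(30)]
ys = [2.0 + 0.5 * x + random.gauss(0, 1) for x in xs]  # simulated data

resub_mse = mse(fit_line(xs, ys), xs, ys)  # fit and evaluate on the same data
cv_mse = k_fold_mse(xs, ys, k=5)           # evaluate only on held-out cases
print(f"resubstitution MSE: {resub_mse:.3f}")
print(f"5-fold CV MSE:      {cv_mse:.3f}")
```

For least-squares fits the cross-validated error is never smaller than the resubstitution error, which is the optimism that cross-validation is meant to remove.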
Judging by the infrequency of estimates of predictive
accuracy in JASA, this measure of model
fit that seems natural to me (and to Mosteller and
Tukey) is not natural to others. More publication of
predictive accuracy estimates would establish standards
for comparison of models, a practice that is
common in machine learning.
6. THE LIMITATIONS OF DATA MODELS
With the insistence on data models, multivariate
analysis tools in statistics are frozen at discriminant
analysis and logistic regression in classification and
multiple linear regression in regression. Nobody
really believes that multivariate data is multivari-
ate normal, but that data model occupies a large
number of pages in every graduate textbook on
multivariate statistical analysis.
With data gathered from uncontrolled observations
on complex systems involving unknown physical,
chemical, or biological mechanisms, the a priori
assumption that nature would generate the data
through a parametric model selected by the statistician
can result in questionable conclusions that
cannot be substantiated by appeal to goodness-of-fit
tests and residual analysis. Usually, simple parametric
models imposed on data generated by complex
systems, for example, medical data, financial
data, result in a loss of accuracy and information as
compared to algorithmic models (see Section 11).
There is an old saying "If all a man has is a
hammer, then every problem looks like a nail." The
trouble for statisticians is that recently some of the
problems have stopped looking like nails. I conjecture
that the result of hitting this wall is that more
complicated data models are appearing in current
published applications. Bayesian methods combined
with Markov Chain Monte Carlo are cropping up all
over. This may signify that as data becomes more
complex, the data models become more cumbersome
and are losing the advantage of presenting a simple
and clear picture of nature's mechanism.
Approaching problems by looking for a data model
imposes an a priori straitjacket that restricts the
ability of statisticians to deal with a wide range of
statistical problems. The best available solution to
a data problem might be a data model; then again
it might be an algorithmic model. The data and the
problem guide the solution. To solve a wider range
of data problems, a larger set of tools is needed.
Perhaps the damaging consequence of the insis-
tence on data models is that statisticians have ruled
themselves out of some of the most interesting and
challenging statistical problems that have arisen
out of the rapidly increasing ability of computers
to store and manipulate data. These problems are
increasingly present in many fields, both scientific
and commercial, and solutions are being found by
nonstatisticians.
7. ALGORITHMIC MODELING
Under other names, algorithmic modeling has
been used by industrial statisticians for decades.
See, for instance, the delightful book "Fitting Equations
to Data" (Daniel and Wood, 1971). It has been
used by psychometricians and social scientists.
Reading a preprint of Gifi's book (1990) many years
ago uncovered a kindred spirit. It has made small
inroads into the analysis of medical data starting
with Richard Olshen's work in the early 1980s. For
further work, see Zhang and Singer (1999). Jerome
Friedman and Grace Wahba have done pioneering
work on the development of algorithmic methods.
But the list of statisticians in the algorithmic mod-
eling business is short, and applications to data are
seldom seen in the journals. The development of
algorithmic methods was taken up by a community
outside statistics.
7.1 A New Research Community
In the mid-1980s two powerful new algorithms
for fitting data became available: neural nets and
decision trees. A new research community using
these tools sprang up. Their goal was predictive
accuracy. The community consisted of young com-
puter scientists, physicists and engineers plus a few
aging statisticians. They began using the new tools
in working on complex prediction problems where it
was obvious that data models were not applicable:
speech recognition, image recognition, nonlinear
time series prediction, handwriting recognition,
prediction in financial markets.
Their interests range over many fields that were
once considered happy hunting grounds for statisticians,
and they have turned out thousands of interesting
research papers related to applications and methodology.
A large majority of the papers analyze real
data. The criterion for any model is its predictive
accuracy. An idea of the range of research
of this group can be got by looking at the Proceedings
of the Neural Information Processing Systems
Conference (their main yearly meeting) or at the
Machine Learning Journal.
7.2 Theory in Algorithmic Modeling
Data models are rarely used in this community.
The approach is that nature produces data in a
black box whose insides are complex, mysterious,
and, at least, partly unknowable. What is observed
is a set of x's that go in and a subsequent set of y's
that come out. The problem is to find an algorithm
f(x) such that for future x in a test set, f(x) will
be a good predictor of y.
The theory in this field shifts focus from data models
to the properties of algorithms. It characterizes
their strength as predictors, convergence if they
are iterative, and what gives them good predictive
accuracy. The one assumption made in the theory
is that the data is drawn i.i.d. from an unknown
multivariate distribution.
There is isolated work in statistics where the
focus is on the theory of the algorithms. Grace
Wahba's research on smoothing spline algorithms
and their applications to data (using cross-validation)
is built on theory involving reproducing
kernels in Hilbert Space (1990). The final chapter
of the CART book (Breiman et al., 1984) contains
a proof of the asymptotic convergence of the CART
algorithm to the Bayes risk by letting the trees grow
as the sample size increases. There are others, but
the relative frequency is small.
Theory resulted in a major advance in machine
learning. Vladimir Vapnik constructed informative
bounds on the generalization error (infinite test set
error) of classification algorithms which depend on
the "capacity" of the algorithm. These theoretical
bounds led to support vector machines (see Vapnik,
1995, 1998) which have proved to be more accurate
predictors in classification and regression than
neural nets, and are the subject of heated current
research (see Section 10).
My last paper "Some infinity theory for tree
ensembles" (Breiman, 2000) uses a function space
analysis to try and understand the workings of tree
ensemble methods. One section has the heading,
"My kingdom for some good theory." There is an
effective method for forming ensembles known as
"boosting," but there isn't any finite sample size
theory that tells us why it works so well.
7.3 Recent Lessons
The advances in methodology and increases in
predictive accuracy since the mid-1980s that have
occurred in the research of machine learning have
been phenomenal. There have been particularly
exciting developments in the last five years. What
has been learned? The three lessons that seem most
important to one:
Rashomon: the multiplicity of good models;
Occam: the conflict between simplicity and accuracy;
Bellman: dimensionality, curse or blessing.
8. RASHOMON AND THE MULTIPLICITY
OF GOOD MODELS
Rashomon is a wonderful Japanese movie in
which four people, from different vantage points,
witness an incident in which one person dies and
another is supposedly raped. When they come to
testify in court, they all report the same facts, but
their stories of what happened are very different.
What I call the Rashomon Effect is that there
is often a multitude of different descriptions [equations
f(x)] in a class of functions giving about the
same minimum error rate. The most easily understood
example is subset selection in linear regression.
Suppose there are 30 variables and we want to
find the best five-variable linear regressions. There
are about 140,000 five-variable subsets in competition.
Usually we pick the one with the lowest residual
sum-of-squares (RSS), or, if there is a test set,
the lowest test error. But there may be (and generally
are) many five-variable equations that have
RSS within 1.0% of the lowest RSS (see Breiman,
1996a). The same is true if test set error is being
measured.
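The subset count quoted above, and the way near-ties arise, can be illustrated with a short sketch. It assumes simulated data in which all predictors are noisy copies of one shared signal, and it searches a much smaller problem (three-variable subsets of 10 variables) so that exhaustive enumeration stays cheap. The helper names and all data-generating choices are hypothetical.

```python
import itertools
import math
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def rss(columns, y):
    """Residual sum-of-squares of the least-squares fit with an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

# Number of five-variable subsets of 30 variables ("about 140,000")
N_SUBSETS_30_5 = math.comb(30, 5)
print(N_SUBSETS_30_5)

# Smaller simulated problem: 10 correlated predictors, all 3-variable subsets
random.seed(2)
n_cases, n_vars, subset_size = 200, 10, 3
shared = [random.gauss(0, 1) for _ in range(n_cases)]
columns = [[s + 0.3 * random.gauss(0, 1) for s in shared] for _ in range(n_vars)]
y = [s + random.gauss(0, 1) for s in shared]

rss_by_subset = {subset: rss([columns[j] for j in subset], y)
                 for subset in itertools.combinations(range(n_vars), subset_size)}
best = min(rss_by_subset.values())
near_best = [s for s, v in rss_by_subset.items() if v <= 1.01 * best]
print(f"{len(near_best)} of {len(rss_by_subset)} subsets have RSS within 1% of the best")
```

How many subsets land within 1% depends on the simulated data; the point of the sketch is only that the winner of such a search need not be clearly separated from the runners-up.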
So here are three possible pictures with RSS or
test set error within 1.0% of each other:
Picture 1
$y = 2.1 + 3.8x_3 - 0.6x_8 + 83.2x_{12} - 2.1x_{17} + 3.2x_{27}$
Picture 2
$y = -8.9 + 4.6x_5 + 0.01x_6 + 12.0x_{15} + 17.5x_{21} + 0.2x_{22}$
Picture 3
$y = -76.7 + 9.3x_2 + 22.0x_7 - 13.2x_8 + 3.4x_{11} + 7.2x_{28}$
V(M) is large.
Lo and behold, with this definition, large margin
means large divergence.
Since the good old days at Fair, Isaac, there have
been many improvements in the algorithmic mod-
eling approaches. We now use genetic algorithms
to screen very large structured sets of prediction
characteristics. Our segmentation algorithms have
been automated to yield even more predictive sys-
tems. Our palatable GAM modeling tool now han-
dles smooth splines, as well as splines mixed with
step functions, with all kinds of constraint capabil-
ity. Maximizing divergence is still a favorite, but
we also maximize constrained GLM likelihood func-
tions. We also are experimenting with computa-
tionally intensive algorithms that will optimize any
objective function that makes sense in the busi-
ness environment. All of these improvements are
squarely in the culture of algorithmic modeling.
OVERFITTING THE TEST SAMPLE
Professor Breiman emphasizes the importance of
performance on the test sample. However, this can
be overdone. The test sample is supposed to repre-
sent the population to be encountered in the future.
But in reality, it is usually a random sample of the
current population. High performance on the test
sample does not guarantee high performance on
future samples; things do change. There are practices
that can be followed to protect against change.
One can monitor the performance of the models
over time and develop new models when there
has been sufficient degradation of performance. For
some of Fair, Isaac's core products, the redevelopment
cycle is about 18-24 months. Fair, Isaac also
does "score engineering" in an attempt to make
the models more robust over time. This includes
damping the influence of individual characteristics,
using monotone constraints and minimizing the size
of the models subject to performance constraints
on the current test sample. This score engineering
amounts to moving from very nonparametric (no
score engineering) to more semiparametric (lots of
score engineering).
SPIN-OFFS FROM THE DATA
MODELING CULTURE
In Section 6 of Professor Breiman's paper, he says
that "multivariate analysis tools in statistics are
frozen at discriminant analysis and logistic regression
in classification ..." This is not necessarily all
that bad. These tools can carry you very far as long
as you ignore all of the textbook advice on how to
use them. To illustrate, I use the saga of the Fat
Scorecard.
Early in my research days at Fair, Isaac, I
was searching for an improvement over segmented
scorecards. The idea was to develop first a very
good global scorecard and then to develop small
adjustments for a number of overlapping segments.
To develop the global scorecard, I decided to use
logistic regression applied to the attribute dummy
variables. There were 36 characteristics available
for fitting. A typical scorecard has about 15 characteristics.
My variable selection was structured
so that an entire characteristic was either in or
out of the model. What I discovered surprised
me. All models fit with anywhere from 27 to 36
characteristics had the same performance on the
test sample. This is what Professor Breiman calls
"Rashomon and the multiplicity of good models." To
keep the model as small as possible, I chose the one
with 27 characteristics. This model had 162 score
weights (logistic regression coefficients), whose P-values
ranged from 0.0001 to 0.984, with only one
less than 0.05; i.e., statistically significant. The confidence
intervals for the 162 score weights were useless.
To get this great scorecard, I had to ignore
the conventional wisdom on how to use logistic
regression.
So far, all I had was the scorecard GAM. So clearly
I was missing all of those interactions that just had
to be in the model. To model the interactions, I tried
developing small adjustments on various overlap-
ping segments. No matter how hard I tried, noth-
ing improved the test sample performance over the
global scorecard. I started calling it the Fat Score-
card.
Earlier, on this same data set, another Fair, Isaac
researcher had developed a neural network with
2,000 connection weights. The Fat Scorecard slightly
outperformed the neural network on the test sam-
ple. I cannot claim that this would work for every
data set. But for this data set, I had developed an
excellent algorithmic model with a simple data mod-
eling tool.
Why did the simple additive model work so well?
One idea is that some of the characteristics in the
model are acting as surrogates for certain interaction
terms that are not explicitly in the model.
Another reason is that the scorecard is really a
sophisticated neural net. The inputs are the original
inputs. Associated with each characteristic is a hidden
node. The summation functions coming into the
hidden nodes are the transformations defining the
characteristics. The transfer functions of the hidden
nodes are the step functions (compiled from the
score weights), all derived from the data. The final
output is a linear function of the outputs of the hidden
nodes. The result is highly nonlinear and interactive,
when looked at as a function of the original
inputs.
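The structure described here, one step function per characteristic with the final score a sum of the step-function outputs, can be sketched as follows. This is not Fair, Isaac's implementation: the characteristic names, cut points and score weights are invented purely for illustration.

```python
from bisect import bisect_right

# Hypothetical scorecard: cut points split each characteristic into attributes
# (bins), and every attribute carries a score weight. All numbers are made up.
SCORECARD = {
    "age":         {"cuts": [25, 40],   "weights": [10, 25, 40]},
    "utilization": {"cuts": [0.3, 0.7], "weights": [50, 20, 0]},
}

def attribute_index(value, cuts):
    """Index of the bin containing value; bins are [cut_i, cut_{i+1})."""
    return bisect_right(cuts, value)

def score(applicant):
    """Additive score: one step function per characteristic, then a plain sum."""
    total = 0
    for name, spec in SCORECARD.items():
        total += spec["weights"][attribute_index(applicant[name], spec["cuts"])]
    return total

print(score({"age": 30, "utilization": 0.5}))  # 25 + 20 = 45
print(score({"age": 20, "utilization": 0.1}))  # 10 + 50 = 60
```

Written this way, the score is linear in the attribute dummy variables but a step function of each original input, which is why such a model can be nonlinear in the raw inputs while remaining a sum of simple pieces.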
The Fat Scorecard study had an ingredient that
is rare. We not only had the traditional test sample,
but had three other test samples, taken one, two,
and three years later. In this case, the Fat Scorecard
outperformed the more traditional thinner scorecard
for all four test samples. So the feared overfitting
to the traditional test sample never materialized.
To get a better handle on this you need
an understanding of how the relationships between
variables evolve over time.
I recently encountered another connection
between algorithmic modeling and data modeling.
In classical multivariate discriminant analysis, one
assumes that the prediction variables have a mul-
tivariate normal distribution. But for a scorecard,
the prediction variables are hundreds of attribute
dummy variables, which are very nonnormal. How-
ever, if you apply the discriminant analysis algo-
rithm to the attribute dummy variables, you can
get a great algorithmic model, even though the
assumptions of discriminant analysis are severely
violated.
A SOLUTION TO THE OCCAM DILEMMA
I think that there is a solution to the Occam
dilemma without resorting to goal-oriented argu-
ments. Clients really do insist on interpretable func-
tions, f(x). Segmented palatable scorecards are
very interpretable by the customer and are very
accurate. Professor Breiman himself gave single
trees an A+ on interpretability. The shallow-to-
medium tree in a segmented scorecard rates an A++.
The palatable scorecards in the leaves of the trees
are built from interpretable (possibly complex) characteristics.
Sometimes we can't implement them
until the lawyers and regulators approve. And that
requires super interpretability. Our more sophisti-
cated products have 10 to 20 segments and up to
100 characteristics (not all in every segment). These
models are very accurate and very interpretable.
I coined a phrase called the Ping-Pong theorem.
This theorem says that if we revealed to Profes-
sor Breiman the performance of our best model and
gave him our data, then he could develop an algo-
rithmic model using random forests, which would
outperform our model. But if he revealed to us the
performance of his model, then we could develop
a segmented scorecard, which would outperform
his model. We might need more characteristics,
attributes and segments, but our experience in this
kind of contest is on our side.
However, all the competing models in this game of
Ping-Pong would surely be algorithmic models. But
some of them could be interpretable.
THE ALGORITHM TUNING DILEMMA
As far as I can tell, all approaches to algorithmic
model building contain tuning parameters, either
explicit or implicit. For example, we use penalized
objective functions for fitting and marginal contribution
thresholds for characteristic selection. With
experience, analysts learn how to set these tuning
parameters in order to get excellent test sample
or cross-validation results. However, in industry
and academia, there is sometimes a little tinkering,
which involves peeking at the test sample. The
result is some bias in the test sample or cross-validation
results. This is the same kind of tinkering
that upsets test of fit pureness. This is a challenge
for the algorithmic modeling approach. How do you
optimize your results and get an unbiased estimate
of the generalization error?
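The bias introduced by peeking can be made visible with a small simulation. The sketch below assumes an extreme case chosen for clarity: labels are coin flips and every candidate "model" is a random bit unrelated to the label, so no candidate has any real skill. The sample size, the number of candidates and all names are hypothetical.

```python
import random

random.seed(3)
N_CASES, N_CANDIDATES = 100, 200  # arbitrary illustrative sizes

def make_sample():
    """Coin-flip labels plus one random prediction column per candidate model."""
    labels = [random.randint(0, 1) for _ in range(N_CASES)]
    predictions = [[random.randint(0, 1) for _ in range(N_CASES)]
                   for _ in range(N_CANDIDATES)]
    return labels, predictions

def accuracy(predicted, labels):
    return sum(p == t for p, t in zip(predicted, labels)) / len(labels)

test_labels, test_preds = make_sample()    # the test sample that gets peeked at
fresh_labels, fresh_preds = make_sample()  # a sample never used for any choice

# "Tuning" by peeking: keep whichever candidate looks best on the test sample.
test_acc = [accuracy(p, test_labels) for p in test_preds]
chosen = max(range(N_CANDIDATES), key=lambda j: test_acc[j])

peeked_estimate = test_acc[chosen]                              # optimistic
honest_estimate = accuracy(fresh_preds[chosen], fresh_labels)   # untouched data
print(f"accuracy on the peeked test sample: {peeked_estimate:.2f}")
print(f"accuracy on a fresh sample:         {honest_estimate:.2f}")
```

One common response, offered here only as a suggestion, is to tune on a separate validation sample (or inside an inner cross-validation loop) and to consult the final test sample once, after all choices are fixed.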
GENERALIZING THE GENERALIZATION ERROR
In most commercial applications of algorithmic
modeling, the function, f(x), is used to make decisions.
In some academic research, classification is
used as a surrogate for the decision process, and
misclassification error is used as a surrogate for
profit. However, I see a mismatch between the algorithms
used to develop the models and the business
measurement of the model's value. For example, at
Fair, Isaac, we frequently maximize divergence. But
when we argue the model's value to the clients, we
don't necessarily brag about the great divergence.
We try to use measures that the client can relate to.
The ROC curve is one favorite, but it may not tell
the whole story. Sometimes, we develop simulations
of the client's business operation to show how the
model will improve their situation. For example, in
a transaction fraud control process, some measures
of interest are false positive rate, speed of detection
and dollars saved when 0.5% of the transactions
are flagged as possible frauds. The 0.5% reflects the
number of transactions that can be processed by
the current fraud management staff. Perhaps what
the client really wants is a score that will maximize
the dollars saved in their fraud control system.
The score that maximizes test set divergence
or minimizes test set misclassifications does not do
it. The challenge for algorithmic modeling is to find
an algorithm that maximizes the generalization dollars
saved, not generalization error.
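The gap between a classification-oriented score and a dollars-oriented score can be shown on a toy example. The ten transactions below are invented, the flagged fraction is 20% rather than 0.5% so that the arithmetic is easy to follow, and ranking by probability times amount is just one simple stand-in for a business-oriented score, not a method described in the text.

```python
def flagged_indices(ranking_scores, fraction):
    """Indices of the top `fraction` of transactions by ranking score."""
    n_flag = max(1, round(fraction * len(ranking_scores)))
    order = sorted(range(len(ranking_scores)),
                   key=lambda i: ranking_scores[i], reverse=True)
    return order[:n_flag]

def dollars_saved(flagged, amounts, is_fraud):
    """Total amount of the truly fraudulent transactions that were flagged."""
    return sum(amounts[i] for i in flagged if is_fraud[i])

# Hypothetical transactions: estimated fraud probability, amount, true label.
fraud_prob = [0.9, 0.8, 0.6, 0.3, 0.2, 0.1, 0.05, 0.04, 0.03, 0.02]
amounts    = [10, 20, 1000, 50, 5000, 30, 40, 60, 25, 15]
is_fraud   = [True, False, True, False, True, False, False, False, False, False]

FRACTION = 0.2  # staff can review 2 of the 10 transactions

by_probability = flagged_indices(fraud_prob, FRACTION)
expected_loss = [p * a for p, a in zip(fraud_prob, amounts)]
by_expected_loss = flagged_indices(expected_loss, FRACTION)

print(dollars_saved(by_probability, amounts, is_fraud))    # 10
print(dollars_saved(by_expected_loss, amounts, is_fraud))  # 6000
```

Ranking by probability alone flags the two most likely frauds, which are small transactions; weighting by amount flags the two large frauds instead, which is the kind of mismatch between misclassification error and dollars saved that the paragraph describes.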
We have made some progress in this area using
ideas from support vector machines and boosting.
By manipulating the observation weights used in
standard algorithms, we can improve the test set
performance on any objective of interest. But the
price we pay is computational intensity.
MEASURING IMPORTANCE: IS IT
REALLY POSSIBLE?
I like Professor Breiman's idea for measuring the
importance of variables in black box models. A Fair,
Isaac spin on this idea would be to build accurate
models for which no variable is much more important
than other variables. There is always a chance
that a variable and its relationships will change in
the future. After that, you still want the model to
work. So don't make any variable dominant.
I think that there is still an issue with measuring
importance. Consider a set of inputs and an algorithm
that yields a black box, for which $x_1$ is important.
From the Ping-Pong theorem there exists a
set of input variables, excluding $x_1$, and an algorithm
that will yield an equally accurate black box. For
this black box, $x_1$ is unimportant.
IN SUMMARY
Algorithmic modeling is a very important area of
statistics. It has evolved naturally in environments
with lots of data and lots of decisions. But you
can do it without suffering the Occam dilemma;
for example, use medium trees with interpretable
GAMs in the leaves. They are very accurate and
interpretable. And you can do it with data modeling
tools as long as you (i) ignore most textbook advice,
(ii) embrace the blessing of dimensionality, (iii) use
constraints in the fitting optimizations, (iv) use regularization,
and (v) validate the results.
Comment
Emanuel Parzen
1. BREIMAN DESERVES OUR
APPRECIATION
I strongly support the view that statisticians must
face the crisis of the difficulties in their practice of
regression. Breiman alerts us to systematic blunders
(leading to wrong conclusions) that have been
committed applying current statistical practice of
data modeling. In the spirit of "statistician, avoid
doing harm" I propose that the first goal of statistical
ethics should be to guarantee to our clients that
any mistakes in our analysis are unlike any mistakes
that statisticians have made before.
The two goals in analyzing data which Leo calls
prediction and information I prefer to describe as
"management" and "science." Management seeks
profit, practical answers (predictions) useful for
decision making in the short run. Science seeks
truth, fundamental knowledge about nature which
provides understanding and control in the long run.
As a historical note, Student's t-test has many scientific
applications but was invented by Student as
a management tool to make Guinness beer better
(bitter?).
Breiman does an excellent job of presenting the
case that the practice of statistical science, using
only the conventional data modeling culture, needs
reform. He deserves much thanks for alerting us to
the algorithmic modeling culture. Breiman warns us
that "if the model is a poor emulation of nature, the
conclusions may be wrong." This situation, which
I call "the right answer to the wrong question," is
called by statisticians "the error of the third kind."
Engineers at M.I.T. define "suboptimization" as
"elegantly solving the wrong problem."
Emanuel Parzen is Distinguished Professor, Depart-
ment of Statistics, Texas A&M University, 415 C
Block Building, College Station, Texas 77843 (e-mail:
[email protected]).
Breiman presents the potential benefits of algorithmic
models (better predictive accuracy than data
models, and consequently better information about
the underlying mechanism and avoiding questionable
conclusions which result from weak predictive
accuracy) and support vector machines (which provide
almost perfect separation and discrimination
between two classes by increasing the dimension of
the feature set). He convinces me that the methods
of algorithmic modeling are important contributions
to the tool kit of statisticians.
If the profession of statistics is to remain healthy,
and not limit its research opportunities, statis-
ticians must learn about the cultures in which
Breiman works, but also about many other cultures
of statistics.
2. HYPOTHESES TO TEST TO AVOID
BLUNDERS OF STATISTICAL MODELING
Breiman deserves our appreciation for pointing
out generic deviations from standard assumptions
(which I call bivariate dependence and two-sample
conditional clustering) for which we should routinely
check. "Test null hypothesis" can be a useful
algorithmic concept if we use tests that diagnose
in a model-free way the directions of deviation from
the null hypothesis model.
Bivariate dependence (correlation) may exist
between features [independent (input) variables] in
a regression causing them to be proxies for each
other and our models to be unstable, with different
forms of regression models fitting equally well.
We need tools to routinely test the hypothesis
of statistical independence of the distributions
of independent (input) variables.
Two sample conditional clustering arises in the
distributions of independent (input) variables to
discriminate between two classes, which we call the
conditional distribution of input variables X given
each class. Class I may have only one mode (clus-
ter) at low values of X while class II has two modes
(clusters) at low and high values of X. We would like
to conclude that high values of X are observed only
for members of class II but low values of X occur for
members of both classes. The hypothesis we propose
testing is equality of the pooled distribution of both
samples and the conditional distribution of sample
I, which is equivalent to $P[\text{class I} \mid X] = P[\text{class I}]$.
For successful discrimination one seeks to increase
the number (dimension) of inputs (features) X to
make $P[\text{class I} \mid X]$ close to 1 or 0.
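The stated equivalence is an instance of Bayes' rule. As a short sketch, write $f(x)$ for the pooled density of X and $f(x \mid \text{class I})$ for its density within class I (notation assumed here for illustration, not taken from the text):

```latex
\frac{f(x \mid \text{class I})}{f(x)}
  = \frac{P[\text{class I} \mid X = x]}{P[\text{class I}]}
```

The left side equals 1 for all x exactly when the class I and pooled distributions coincide, which is the same as $P[\text{class I} \mid X = x] = P[\text{class I}]$ for all x.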
3. STATISTICAL MODELING, MANY
CULTURES, STATISTICAL METHODS MINING
Breiman speaks of two cultures of statistics; I
believe statistics has many cultures. At specialized
workshops (on maximum entropy methods or robust
methods or Bayesian methods or ...) a main topic
of conversation is "Why don't all statisticians think
like us?"
I have my own eclectic philosophy of statis-
tical modeling to which I would like to attract
serious attention. I call it "statistical methods
mining," which seeks to provide a framework to
synthesize and apply the past half-century of
methodological progress in computationally intensive
methods for statistical modeling, including
EDA (exploratory data analysis), FDA (functional
data analysis), density estimation, Model DA (model
selection criteria data analysis), Bayesian priors
on function space, continuous parameter regression
analysis and reproducing kernels, fast algorithms,
Kalman filtering, complexity, information,
quantile data analysis, nonparametric regression,
conditional quantiles.
I believe data mining is a special case of data
modeling. We should teach in our introductory
courses that one meaning of statistics is statisti-
cal data modeling done in a systematic way by
an iterated series of stages which can be abbrevi-
ated SIEVE (specify problem and general form of
models, identify tentatively numbers of parameters
and specialized models, estimate parameters, validate
goodness-of-fit of estimated models, estimate
final model nonparametrically or algorithmically).
MacKay and Oldford (2000) brilliantly present the
statistical method as a series of stages PPDAC
(problem, plan, data, analysis, conclusions).
4. QUANTILE CULTURE,
ALGORITHMIC MODELS
A culture of statistical data modeling based on
quantile functions, initiated in Parzen (1979), has
been my main research interest since 1976. In
my discussion to Stone (1977) I outlined a novel
approach to estimation of conditional quantile func-
tions which I only recently fully implemented. I
would like to extend the concept of algorithmic statistical
models in two ways: (1) to mean data fitting
by representations which use approximation theory
and numerical analysis; (2) to use the notation
of probability to describe empirical distributions of
samples (data sets) which are not assumed to be
generated by a random mechanism.
My quantile culture has not yet become widely
applied because you cannot give away a good idea,
you have to sell it (by integrating it in computer
programs usable by applied statisticians and thus
promote statistical methods mining).
A quantile function $Q(u)$, $0 \le u \le 1$, is the
inverse $F^{-1}(u)$ of a distribution function $F(x)$,
$-\infty < x < \infty$. Its rigorous definition is
$Q(u) = \inf\{x : F(x) \ge u\}$. When $F$ is continuous with
density $f$, $F(Q(u)) = u$ and $q(u) = Q'(u) = 1/f(Q(u))$.
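The infimum definition can be applied directly to the empirical distribution of a sample. The following is a minimal sketch under that assumption; the function name `empirical_quantile` is hypothetical and chosen here only for illustration.

```python
def empirical_quantile(sample, u):
    """Q(u) = inf{x : F_n(x) >= u}, where F_n is the empirical
    distribution function of `sample`."""
    if not 0.0 < u <= 1.0:
        raise ValueError("u must lie in (0, 1]")
    ordered = sorted(sample)
    n = len(ordered)
    # The i-th smallest value x satisfies F_n(x) >= i / n, so the first
    # i with i / n >= u gives the smallest x whose F_n(x) reaches u.
    for i, x in enumerate(ordered, start=1):
        if i / n >= u:
            return x


data = [3, 1, 4, 2]
median = empirical_quantile(data, 0.5)
upper = empirical_quantile(data, 0.6)
```

Here `median` is the smallest observation whose empirical distribution value reaches 0.5, and `upper` is the next order statistic because 0.6 exceeds 2/4.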
We use the notation $Q$ for a true unknown quantile
function, $Q$ [...]
$$\int_0^1 du_1 \cdots \int_0^1 du_m \, d(u_1, \ldots, u_m)\, d_1(u_1, \ldots, u_m) = 1.$$
A decile quantile bin $B(k_1, \ldots, k_m)$ is defined
to be the set of observations $(Y_1, \ldots, Y_m)$ satisfying,
for $j = 1, \ldots, m$,
$Q_{Y_j}((k_j - 1)/10) < Y_j \le Q_{Y_j}(k_j/10)$.
Instead of deciles $k/10$ we could use
$k/M$ for another base $M$.
To test the hypothesis that $Y_1, \ldots, Y_m$ are statistically
independent we form, for all $k_j = 1, \ldots, 10$,
$$d(k_1, \ldots, k_m) = \frac{P[\mathrm{Bin}(k_1, \ldots, k_m)]}{P[\mathrm{Bin}(k_1, \ldots, k_m) \mid \text{independence}]}.$$
To test equality of distribution of a sample from population
I and the pooled sample we form
$$d_1(k_1, \ldots, k_m) = \frac{P[\mathrm{Bin}(k_1, \ldots, k_m) \mid \text{population I}]}{P[\mathrm{Bin}(k_1, \ldots, k_m) \mid \text{pooled sample}]}$$
for all $(k_1, \ldots, k_m)$ such that the denominator is
positive and otherwise defined arbitrarily. One can
show (letting $X$ denote the population observed)
$$d_1(k_1, \ldots, k_m) = \frac{P[X = \mathrm{I} \mid \text{observation from } \mathrm{Bin}(k_1, \ldots, k_m)]}{P[X = \mathrm{I}]}.$$
To test the null hypotheses in ways that detect
directions of deviations from the null hypothesis our
recommended first step is quantile data analysis of
the values $d(k_1, \ldots, k_m)$ and $d_1(k_1, \ldots, k_m)$.
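The bin ratios above can be computed from empirical distributions with a few lines of code. The sketch below is one possible reading, not the author's implementation: the function names, the simulated data, and the tie-breaking by sort order are all assumptions made for illustration, and only occupied cells are returned (an empty cell has ratio 0).

```python
import random
from collections import Counter


def quantile_bins(values, base=10):
    """Bin index k in 1..base per value, so Q((k-1)/base) < y <= Q(k/base).
    Uses empirical ranks; ties are broken by sort order (an assumption)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank0, i in enumerate(order):  # rank0 is the zero-based rank
        bins[i] = rank0 * base // n + 1
    return bins


def bin_cells(rows, base=10):
    """Map each observation (Y_1, ..., Y_m) to its cell (k_1, ..., k_m)."""
    columns = [quantile_bins(col, base) for col in zip(*rows)]
    return list(zip(*columns))


def independence_ratios(cells):
    """d = P[Bin] / P[Bin | independence] for every occupied cell."""
    n = len(cells)
    joint = Counter(cells)
    marginals = [Counter(col) for col in zip(*cells)]
    ratios = {}
    for cell, count in joint.items():
        expected = 1.0
        for j, k in enumerate(cell):
            expected *= marginals[j][k] / n
        ratios[cell] = (count / n) / expected
    return ratios


def population_ratios(cells, labels, target):
    """d_1 = P[Bin | target population] / P[Bin | pooled sample]."""
    n = len(cells)
    pooled = Counter(cells)
    inside = Counter(c for c, lab in zip(cells, labels) if lab == target)
    n_target = sum(inside.values())
    return {c: (inside[c] / n_target) / (pooled[c] / n) for c in pooled}


rng = random.Random(0)

# Two strongly dependent variables: cells near the diagonal get d well above 1.
y1 = [rng.random() for _ in range(2000)]
d = independence_ratios(bin_cells([(v, v + rng.gauss(0, 0.05)) for v in y1]))

# Class I has only low values of X; class II has low and high values.
xs = [rng.uniform(0, 0.5) for _ in range(500)] + [rng.random() for _ in range(1000)]
labels = ["I"] * 500 + ["II"] * 1000
d1 = population_ratios(bin_cells([(x,) for x in xs]), labels, "I")
```

In the second example `d1[(10,)]` is zero because no class I observation reaches the top decile of the pooled sample, while `d1[(1,)]` exceeds 1 because class I is over-represented among low values. Inspecting which cells deviate from 1, and in which direction, is the kind of diagnostic the text recommends.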
I appreciate this opportunity to bring to the atten-
tion of researchers on high dimensional data anal-
ysis the potential of quantile methods. My con-
clusion is that statistical science has many cul-
tures and statisticians will be more successful when
they emulate Leo Breiman and apply as many cul-
tures as possible (which I call statistical methods
mining). Additional references are on my web site
at stat.tamu.edu.
Rejoinder
Leo Breiman
I thank the discussants. I'm fortunate to have comments
from a group of experienced and creative
statisticians, even more so in that their comments
are diverse. Manny Parzen and Bruce Hoadley are
more or less in agreement, Brad Efron has serious
reservations and D. R. Cox is in downright
disagreement.
I address Professor Cox's comments first, since
our disagreement is crucial.
D. R. COX
Professor Cox is a worthy and thoughtful adversary.
We walk down part of the trail together and
then sharply diverge. To begin, I quote: "Professor
Breiman takes data as his starting point. I would
prefer to start with an issue, a question, or a scientific
hypothesis, ..." I agree, but would expand
the starting list to include the prediction of future
events. I have never worked on a project that has
started with "Here is a lot of data; let's look at it
and see if we can get some ideas about how to use
it." The data has been put together and analyzed
starting with an objective.
C1 Data Models Can Be Useful
Professor Cox is committed to the use of data mod-
els. I readily acknowledge that there are situations
where a simple data model may be useful and appro-
priate; for instance, if the science of the mechanism
producing the data is well enough known to deter-
mine the model apart from estimating parameters.
There are also situations of great complexity posing
important issues and questions in which there is not
enough data to resolve the questions to the accu-
racy desired. Simple models can then be useful in
giving qualitative understanding, suggesting future
research areas and the kind of additional data that
needs to be gathered.
At times, there is not enough data on which to
base predictions; but policy decisions need to be
made. In this case, constructing a model using whatever
data exists, combined with scientific common
sense and subject-matter knowledge, is a reasonable
path. Professor Cox points to examples when
he writes:
"Often the prediction is under quite different
conditions from the data; what
is the likely progress of the incidence
of the epidemic of v-CJD in the United
Kingdom, what would be the effect on
annual incidence of cancer in the United
States reducing by 10% the medical use
of X-rays, etc.? That is, it may be desired
to predict the consequences of something
only indirectly addressed by the
data available for analysis ... prediction,
always hazardous, without some understanding
of the underlying process and
linking with other sources of information,
becomes more and more tentative."
I agree.
C2 Data Models Only
From here on we part company. Professor Cox's
discussion consists of a justification of the use of
data models to the exclusion of other approaches.
For instance, although he admits, "I certainly
accept, although it goes somewhat against the grain
to do so, that there are situations where a directly
empirical approach is better ..." the two examples
he gives of such situations are short-term economic
forecasts and real-time flood forecasts, among the
less interesting of all of the many current successful
algorithmic applications. In his view, the
only use for algorithmic models is short-term forecasting;
there are no comments on the rich information
about the data and covariates available
from random forests or in the many fields, such as
pattern recognition, where algorithmic modeling is
fundamental.
He advocates construction of stochastic data models
that summarize the understanding of the phenomena
under study. The methodology in the Cox
and Wermuth book (1996) attempts to push understanding
further by finding causal orderings in the
covariate effects. The sixth chapter of this book contains
illustrations of this approach on four data sets.
The first is a small data set of 68 patients with
seven covariates from a pilot study at the University
of Mainz to identify psychological and socioeconomic
factors possibly important for glucose control in diabetes
patients. This is a regression-type problem
with the response variable measured by GHb (glycosylated
haemoglobin). The model fitting is done
by a number of linear regressions and validated
by the checking of various residual plots. The only
other reference to model validation is the statement,
"$R^2 = 0.34$, reasonably large by the standards usual
for this field of study." Predictive accuracy is not
computed, either for this example or for the three
other examples.
My comments on the questionable use of data
models apply to this analysis. Incidentally, I tried to
get one of the data sets used in the chapter to con-
duct an alternative analysis, but it was not possible
to get it before my rejoinder was due. It would have
been interesting to contrast our two approaches.
C3 Approach to Statistical Problems
Basing my critique on a small illustration in a
book is not fair to Professor Cox. To be fairer, I quote
his words about the nature of a statistical analysis:
"Formal models are useful and often
almost, if not quite, essential for incisive
thinking. Descriptively appealing
and transparent methods with a firm
model base are the ideal. Notions of significance
tests, confidence intervals, posterior
intervals, and all the formal apparatus
of inference are valuable tools to be
used as guides, but not in a mechanical
way; they indicate the uncertainty that
would apply under somewhat idealized,
maybe very idealized, conditions and as
such are often lower bounds to real
uncertainty. Analyses and model development
are at least partly exploratory.
Automatic methods of model selection
(and of variable selection in regression-like
problems) are to be shunned or, if
use is absolutely unavoidable, are to be
examined carefully for their effect on
the final conclusions. Unfocused tests of
model adequacy are rarely helpful."
Given the right kind of data: relatively small sam-
ple size and a handful of covariates, I have no doubt
that his experience and ingenuity in the craft of
model construction would result in an illuminating
model. But data characteristics are rapidly chang-
ing. In many of the most interesting current prob-
lems, the idea of starting with a formal model is not
tenable.
C4 Changes in Problems
My impression from Professor Cox's comments is
that he believes every statistical problem can be
best solved by constructing a data model. I believe
that statisticians need to be more pragmatic. Given
a statistical problem, find a good solution, whether
it is a data model, an algorithmic model or (although
it is somewhat against my grain), a Bayesian data
model or a completely different approach.
My work on the 1990 Census Adjustment
(Breiman, 1994) involved a painstaking analysis of
the sources of error in the data. This was done by
a long study of thousands of pages of evaluation
documents. This seemed the most appropriate way
of answering the question of the accuracy of the
adjustment estimates.
The conclusion that the adjustment estimates
were largely the result of bad data has never been
effectively contested and is supported by the results
of the Year 2000 Census Adjustment effort. The
accuracy of the adjustment estimates was, arguably,
the most important statistical issue of the last
decade, and could not be resolved by any amount
of statistical modeling.
A primary reason why we cannot rely on data
models alone is the rapid change in the nature of
statistical problems. The realm of applications of
statistics has expanded more in the last twenty-five
years than in any comparable period in the history
of statistics.
In an astronomy and statistics workshop this
year, a speaker remarked that in twenty-five years
we have gone from being a small sample-size science
to a very large sample-size science. Astronomical
data bases now contain data on two billion objects
comprising over 100 terabytes and the rate of new
information is accelerating.
A recent biostatistics workshop emphasized the
analysis of genetic data. An exciting breakthrough
is the use of microarrays to locate regions of gene
activity. Here the sample size is small, but the number
of variables ranges in the thousands. The questions
are which specific genes contribute to the
occurrence of various types of diseases.
Questions about the areas of thinking in the brain
are being studied using functional MRI. The data
gathered in each run consists of a sequence of
150,000 pixel images. Gigabytes of satellite infor-
mation are being used in projects to predict and
understand short- and long-term environmental
and weather changes.
Underlying this rapid change is the rapid evolution
of the computer, a device for gathering, storing
and manipulating incredible amounts of data,
together with technological advances incorporating
computing, such as satellites and MRI machines.
The problems are exhilarating. The methods used
in statistics for small sample sizes and a small number
of variables are not applicable. John Rice, in his
summary talk at the astronomy and statistics workshop
said, "Statisticians have to become opportunistic."
That is, faced with a problem, they must find
a reasonable solution by whatever method works.
One surprising aspect of both workshops was how
opportunistic statisticians faced with genetic and
astronomical data had become. Algorithmic methods
abounded.
C5 Mainstream Procedures and Tools
Professor Cox views my critique of the use of data
models as based in part on a caricature. Regard-
ing my references to articles in journals such as
JASA, he states that they are not typical of main-
stream statistical analysis, but are used to illustrate
technique rather than explain the process of anal-
ysis. His concept of mainstream statistical analysis
is summarized in the quote given in my Section C3.
It is the kind of thoughtful and careful analysis that
he prefers and is capable of.
Following this summary is the statement:
"By contrast, Professor Breiman equates
mainstream applied statistics to a relatively
mechanical process involving
somehow or other choosing a model, often
a default model of standard form, and
applying standard methods of analysis
and goodness-of-fit procedures."
The disagreement is definitional: what is "mainstream"?
In terms of numbers my definition of mainstream
prevails, I guess, at a ratio of at least 100
to 1. Simply count the number of people doing their
statistical analysis using canned packages, or count
the number of SAS licenses.
In the academic world, we often overlook the fact
that we are a small slice of all statisticians and
an even smaller slice of all those doing analyses of
data. There are many statisticians and nonstatisticians
in diverse fields using data to reach conclusions
and depending on tools supplied to them
by SAS, SPSS, etc. Their conclusions are important
and are sometimes published in medical or other
subject-matter journals. They do not have the statistical
expertise, computer skills, or time needed to
construct more appropriate tools. I was faced with
this problem as a consultant when confined to using
the BMDP linear regression, stepwise linear regression,
and discriminant analysis programs. My concept
of decision trees arose when I was faced with
nonstandard data that could not be treated by these
standard methods.
When I rejoined the university after my consult-
ing years, one of my hopes was to provide better
general purpose tools for the analysis of data. The
first step in this direction was the publication of the
CART book (Breiman et al., 1984). CART and other
similar decision tree methods are used in thousands
of applications yearly in many fields. CART has
proved robust and reliable. There are others that
are more recent; random forests is the latest. A pre-
liminary version of random forests is free source
with f77 code, S and R interfaces available at
www.stat.berkeley.edu/users/breiman.
A nearly completed second version will also be
put on the web site and translated into Java by the
Weka group. My collaborator, Adele Cutler, and I
will continue to upgrade, add new features, graph-
ics, and a good interface.
My philosophy about the field of academic statistics
is that we have a responsibility to provide
the many people working in applications outside of
academia with useful, reliable, and accurate analy-
sis tools. Two excellent examples are wavelets and
decision trees. More are needed.
BRAD EFRON
Brad seems to be a bit puzzled about how to react
to my article. I'll start with what appears to be his
biggest reservation.
E1 From Simple to Complex Models
Brad is concerned about the use of complex
models without simple interpretability in their
structure, even though these models may be the
most accurate predictors possible. But the evolution
of science is from simple to complex.
The equations of general relativity are considerably
more complex and difficult to understand than
Newton's equations. The quantum mechanical equations
for a system of molecules are extraordinarily
difficult to interpret. Physicists accept these complex
models as the facts of life, and do their best to
extract usable information from them.
There is no consideration given to trying to understand
cosmology on the basis of Newton's equations
or nuclear reactions in terms of hard ball models for
atoms. The scientific approach is to use these complex
models as the best possible descriptions of the
physical world and try to get usable information out
of them.
There are many engineering and scientific applications
where simpler models, such as Newton's
laws, are certainly sufficient, say, in structural
design. Even here, for larger structures, the model
is complex and the analysis difficult. In scientific
fields outside statistics, answering questions is done
by extracting information from increasingly complex
and accurate models.
The approach I suggest is similar. In genetics,
astronomy and many other current areas where statistics
is needed to answer questions, construct the most
accurate possible model, however complex, and then
extract usable information from it.
Random forests is in use at some major drug
companies whose statisticians were impressed by
its ability to determine gene expression (variable
importance) in microarray data. They were not
concerned about its complexity or black-box appear-
ance.
E2 Prediction
"Leo's paper overstates both its [prediction's]
role, and our profession's lack of
interest in it ... Most statistical surveys
have the identification of causal factors
as their ultimate role."
My point was that it is difficult to tell, using
goodness-of-fit tests and residual analysis, how well
a model fits the data. An estimate of its test set accuracy
is a preferable assessment. If, for instance, a
model gives predictive accuracy only slightly better
than the "all survived" or other baseline estimates,
we can't put much faith in its reliability in the identification
of causal factors.
I agree that often "... statistical surveys have
the identification of causal factors as their ultimate
role." I would add that the more predictively accurate
the model is, the more faith can be put into the
variables that it fingers as important.
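The comparison of test set accuracy with a baseline can be sketched as follows. Everything here is assumed for illustration: the simulated data (a 70% majority class and one weakly informative feature), the one-variable threshold rule standing in for a fitted model, and the function names.

```python
import random


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def majority_label(train_labels):
    """Label used by the 'predict the most common outcome' baseline."""
    return max(set(train_labels), key=train_labels.count)


def fit_stump(xs, ys):
    """Threshold t maximising training accuracy of the rule 'predict 1 if x > t'."""
    best_t, best_acc = float("-inf"), -1.0
    for t in [float("-inf")] + sorted(set(xs)):
        acc = accuracy([int(x > t) for x in xs], ys)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def draw(n, rng):
    """Hypothetical data: 70% of cases have outcome 1; x carries a weak signal."""
    ys = [int(rng.random() < 0.7) for _ in range(n)]
    xs = [rng.gauss(0.3 * y, 1.0) for y in ys]
    return xs, ys


rng = random.Random(1)
train_x, train_y = draw(500, rng)
test_x, test_y = draw(2000, rng)

baseline = majority_label(train_y)
base_acc = accuracy([baseline] * len(test_y), test_y)

threshold = fit_stump(train_x, train_y)
model_acc = accuracy([int(x > threshold) for x in test_x], test_y)

lift = model_acc - base_acc  # near zero here: the model adds little over the baseline
```

When `lift` is close to zero, as in this simulated case, the fitted rule says little more than the baseline does, and conclusions drawn from its structure deserve correspondingly little weight.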
E3 Variable Importance
A significant and often overlooked point raised
by Brad is what meaning can one give to statements
that "variable X is important or not important."
This has puzzled me on and off for quite a
while. In fact, variable importance has always been
defined operationally. In regression the "important"
variables are defined by doing "best subsets" or variable
deletion.
Another approach used in linear methods such as
logistic regression and survival models is to com-
pare the size of the slope estimate for a variable to
its estimated standard error. The larger the ratio,
the more important the variable. Both of these def-
initions can lead to erroneous conclusions.
My definition of variable importance is based
on prediction. A variable might be considered
important if deleting it seriously affects prediction
accuracy. This brings up the problem that if two
variables are highly correlated, deleting one or the
other of them will not affect prediction accuracy.
Deleting both of them may degrade accuracy considerably.
The definition used in random forests spots
both variables.
Importance does not yet have a satisfactory theoretical
definition (I haven't been able to locate the
article Brad references but I'll keep looking). It
depends on the dependencies between the output
variable and the input variables, and on the depen-
dencies between the input variables. The problem
begs for research.
E4 Other Reservations
"Sample sizes have swollen alarmingly
while goals grow less distinct ('find interesting
data structure')."
I have not noticed any increasing fuzziness in
goals, only that they have gotten more diverse. In
the last two workshops I attended (genetics and
astronomy) the goals in using the data were clearly
laid out. "Searching for structure" is rarely seen
even though data may be in the terabyte range.
"The new algorithms often appear in the
form of black boxes with enormous numbers
of adjustable parameters ('knobs to
twiddle')."
This is a perplexing statement and perhaps I don't
understand what Brad means. Random forests has
only one adjustable parameter that needs to be set
for a run, is insensitive to the value of this parameter
over a wide range, and has a quick and simple
way for determining a good value. Support vector
machines depend on the settings of 1-2 parameters.
Other algorithmic models are similarly sparse in the
number of knobs that have to be twiddled.
"New methods always look better than old
ones. Complicated models are harder
to criticize than simple ones."
In 1992 I went to my first NIPS conference. At
that time, the exciting algorithmic methodology was
neural nets. My attitude was grim skepticism. Neural
nets had been given too much hype, just as AI
had been given, and had failed expectations. I came away
a believer. Neural nets delivered on the bottom line!
In talk after talk, in problem after problem, neural
nets were being used to solve difficult prediction
problems with test set accuracies better than anything
I had seen up to that time.
My attitude toward new and/or complicated methods
is pragmatic. Prove that you've got a better
mousetrap and I'll buy it. But the proof had better
be concrete and convincing.
Brad questions where the bias and variance have
gone. It is surprising when, trained in classical bias-
variance terms and convinced of the curse of dimen-
sionality, one encounters methods that can handle
thousands of variables with little loss of accuracy.
It is not voodoo statistics; there is some simple the-
ory that illuminates the behavior of random forests
(Breiman, 1999). I agree that more theoretical work
is needed to increase our understanding.
Brad is an innovative and flexible thinker who
has contributed much to our field. He is opportunistic
in problem solving and may, perhaps not overtly,
already have algorithmic modeling in his bag of
tools.
BRUCE HOADLEY
I thank Bruce Hoadley for his description of
the algorithmic procedures developed at Fair, Isaac
since the 1960s. They sound like people I would
enjoy working with. Bruce makes two points of mild
contention. One is the following:
"High performance (predictive accuracy)
on the test sample does not guarantee
high performance on future samples;
things do change."
I agree: algorithmic models accurate in one context
must be modified to stay accurate in others.
This does not necessarily imply that the way the
model is constructed needs to be altered, but that
data gathered in the new context should be used in
the construction.
His other point of contention is that the Fair, Isaac
algorithm retains interpretability, so that it is pos-
sible to have both accuracy and interpretability. For
clients who like to know what's going on, that's a
sellable item. But developments in algorithmic mod-
eling indicate that the Fair, Isaac algorithm is an
exception.
A computer scientist working in the machine
learning area joined a large money management
company some years ago and set up a group to do
portfolio management using stock predictions given
by large neural nets. When we visited, I asked how
he explained the neural nets to clients. "Simple," he
said; "We fit binary trees to the inputs and outputs
of the neural nets and show the trees to the clients.
Keeps them happy!" In both stock prediction and
credit rating, the priority is accuracy. Interpretability
is a secondary goal that can be finessed.
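Fitting a tree to the inputs and outputs of an opaque predictor can be sketched in a few lines. The example below is illustrative only: `black_box` is a hypothetical stand-in for a large neural net, and the small greedy regression tree is an assumed, simplified fitting procedure rather than CART itself.

```python
import math
import random


def black_box(x):
    """Hypothetical stand-in for an opaque predictor such as a large neural net."""
    return math.tanh(3.0 * x[0]) + 0.5 * (x[1] > 0.2)


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(rows, targets, depth):
    """Greedy binary regression tree, returned as nested dicts."""
    if depth == 0 or len(rows) < 20:
        return {"leaf": sum(targets) / len(targets)}
    best = None
    for j in range(len(rows[0])):
        candidates = sorted({r[j] for r in rows})
        # Try roughly 20 evenly spaced thresholds per variable.
        for t in candidates[:: max(1, len(candidates) // 20)]:
            left = [y for r, y in zip(rows, targets) if r[j] <= t]
            right = [y for r, y in zip(rows, targets) if r[j] > t]
            if left and right:
                score = sse(left) + sse(right)
                if best is None or score < best[0]:
                    best = (score, j, t)
    if best is None:
        return {"leaf": sum(targets) / len(targets)}
    _, j, t = best
    left = [(r, y) for r, y in zip(rows, targets) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, targets) if r[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": fit_tree([r for r, _ in left], [y for _, y in left], depth - 1),
        "right": fit_tree([r for r, _ in right], [y for _, y in right], depth - 1),
    }


def predict(tree, x):
    """Follow the splits down to a leaf and return its value."""
    while "leaf" not in tree:
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]


rng = random.Random(0)
train = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(500)]
fresh = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(500)]

# The tree is trained on the black box's outputs, not on any true response.
tree = fit_tree(train, [black_box(x) for x in train], depth=3)

truth = [black_box(x) for x in fresh]
approx = [predict(tree, x) for x in fresh]
residual = sum((a - b) ** 2 for a, b in zip(approx, truth))
fidelity = 1.0 - residual / sse(truth)  # share of black-box variation the tree reproduces
```

The tree explains the black box, not the data: `fidelity` measures how closely the displayed tree tracks the opaque predictor on fresh inputs, which is a separate question from how accurate either one is.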
MANNY PARZEN
Manny Parzen opines that there are not two but
many modeling cultures. This is not an issue I want
to fiercely contest. I like my division because it is
pretty clear cut: are you modeling the inside of the
box or not? For instance, I would include Bayesians
in the data modeling culture. I will keep my eye on
the quantile culture to see what develops.
Most of all, I appreciate Manny's openness to the
issues raised in my paper. With the rapid changes
in the scope of statistical problems, more open and
concrete discussion of what works and what doesn't
should be welcomed.
WHERE ARE WE HEADING?
Many of the best statisticians I have talked to
over the past years have serious concerns about the
viability of statistics as a field. Oddly, we are in a
period where there has never been such a wealth of
new statistical problems and sources of data. The
danger is that if we define the boundaries of our
field in terms of familiar tools and familiar problems,
we will fail to grasp the new opportunities.
ADDITIONAL REFERENCES
Beveridge, W. I. B. (1952) The Art of Scientific Investigation.
Heinemann, London.
Breiman, L. (1994) The 1990 Census adjustment: undercount or
bad data (with discussion)? Statist. Sci. 9 458-475.
Cox, D. R. and Wermuth, N. (1996) Multivariate Dependencies.
Chapman and Hall, London.
Efron, B. and Gong, G. (1983) A leisurely look at the bootstrap,
the jackknife, and cross-validation. Amer. Statist. 37 36-48.
Efron, B. and Tibshirani, R. (1996) Improvements on cross-validation:
the 632 rule. J. Amer. Statist. Assoc. 91 548-560.
Efron, B. and Tibshirani, R. (1998) The problem of regions.
Ann. Statist. 26 1287-1318.
Gong, G. (1982) Cross-validation, the jackknife, and the bootstrap:
excess error estimation in forward logistic regression.
Ph.D. dissertation, Stanford Univ.
MacKay, R. J. and Oldford, R. W. (2000) Scientific method,
statistical method, and the speed of light. Statist. Sci. 15
224-253.
Parzen, E. (1979) Nonparametric statistical data modeling (with
discussion). J. Amer. Statist. Assoc. 74 105-131.
Stone, C. (1977) Consistent nonparametric regression. Ann.
Statist. 5 595-645.