Leo Breiman - Statistical Modeling - Two Cultures


Statistical Science

2001, Vol. 16, No. 3, 199–231


Statistical Modeling: The Two Cultures
Leo Breiman
Abstract. There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.
1. INTRODUCTION
Statistics starts with data. Think of the data as
being generated by a black box in which a vector of
input variables x (independent variables) go in one
side, and on the other side the response variables y
come out. Inside the black box, nature functions to
associate the predictor variables with the response
variables, so the picture is like this:
y ← nature ← x
There are two goals in analyzing the data:
Prediction. To be able to predict what the responses
are going to be to future input variables;
Information. To extract some information about
how nature is associating the response variables
to the input variables.
There are two different approaches toward these
goals:
The Data Modeling Culture
The analysis in this culture starts with assuming
a stochastic data model for the inside of the black
box. For example, a common data model is that data
are generated by independent draws from
response variables = f(predictor variables,
random noise, parameters)
Leo Breiman is Professor, Department of Statistics,
University of California, Berkeley, California 94720-
4735 (e-mail: [email protected]).
The values of the parameters are estimated from the data and the model then used for information and/or prediction. Thus the black box is filled in like this:

y ← linear regression, logistic regression, Cox model ← x

Model validation. Yes–no using goodness-of-fit tests and residual examination.
Estimated culture population. 98% of all statisticians.
The Algorithmic Modeling Culture
The analysis in this culture considers the inside of the box complex and unknown. Their approach is to find a function f(x), an algorithm that operates on x to predict the responses y. Their black box looks like this:

y ← unknown ← x

with algorithms such as decision trees and neural nets playing the role of f(x).

Model validation. Measured by predictive accuracy.
Estimated culture population. 2% of statisticians, many in other fields.
In this paper I will argue that the focus in the
statistical community on data models has:

- Led to irrelevant theory and questionable scientific conclusions;
- Kept statisticians from using more suitable algorithmic models;
- Prevented statisticians from working on exciting new problems.
I will also review some of the interesting new
developments in algorithmic modeling in machine
learning and look at applications to three data sets.
2. ROAD MAP
It may be revealing to understand how I became a
member of the small second culture. After a seven-
year stint as an academic probabilist, I resigned and
went into full-time free-lance consulting. After thir-
teen years of consulting I joined the Berkeley Statis-
tics Department in 1980 and have been there since.
My experiences as a consultant formed my views
about algorithmic modeling. Section 3 describes two
of the projects I worked on. These are given to show
how my views grew from such problems.
When I returned to the university and began
reading statistical journals, the research was dis-
tant from what I had done as a consultant. All
articles begin and end with data models. My obser-
vations about published theoretical research in
statistics are in Section 4.
Data modeling has given the statistics field many
successes in analyzing data and getting informa-
tion about the mechanisms producing the data. But
there is also misuse leading to questionable con-
clusions about the underlying mechanism. This is
reviewed in Section 5. Following that is a discussion
(Section 6) of how the commitment to data modeling
has prevented statisticians from entering new scientific and commercial fields where the data being gathered is not suitable for analysis by data models.
In the past fifteen years, the growth in algorithmic modeling applications and methodology has been rapid. It has occurred largely outside statistics in a new community, often called machine learning, that is mostly young computer scientists (Section 7). The advances, particularly over the last five years, have been startling. Three of the most important changes in perception to be learned from these advances are described in Sections 8, 9, and 10, and are associated with the following names:
Rashomon: the multiplicity of good models;
Occam: the conflict between simplicity and accuracy;
Bellman: dimensionality, curse or blessing?
Section 11 is titled "Information from a Black Box" and is important in showing that an algorithmic model can produce more and more reliable
information about the structure of the relationship
between inputs and outputs than data models. This
is illustrated using two medical data sets and a
genetic data set. A glossary at the end of the paper
explains terms that not all statisticians may be
familiar with.
3. PROJECTS IN CONSULTING
As a consultant I designed and helped supervise
surveys for the Environmental Protection Agency
(EPA) and the state and federal court systems. Con-
trolled experiments were designed for the EPA, and
I analyzed traffic data for the U.S. Department of
Transportation and the California Transportation
Department. Most of all, I worked on a diverse set
of prediction projects. Here are some examples:
Predicting next-day ozone levels.
Using mass spectra to identify halogen-containing
compounds.
Predicting the class of a ship from high altitude
radar returns.
Using sonar returns to predict the class of a sub-
marine.
Identity of hand-sent Morse Code.
Toxicity of chemicals.
On-line prediction of the cause of a freeway traffic
breakdown.
Speech recognition.
The sources of delay in criminal trials in state court
systems.
To understand the nature of these problems and
the approaches taken to solve them, I give a fuller
description of the first two on the list.
3.1 The Ozone Project
In the mid- to late 1960s ozone levels became a
serious health problem in the Los Angeles Basin.
Three different alert levels were established. At the
highest, all government workers were directed not
to drive to work, children were kept off playgrounds
and outdoor exercise was discouraged.
The major source of ozone at that time was auto-
mobile tailpipe emissions. These rose into the low
atmosphere and were trapped there by an inversion
layer. A complex chemical reaction, aided by sun-
light, cooked away and produced ozone two to three
hours after the morning commute hours. The alert
warnings were issued in the morning, but would be
more effective if they could be issued 12 hours in
advance. In the mid-1970s, the EPA funded a large
effort to see if ozone levels could be accurately pre-
dicted 12 hours in advance.
Commuting patterns in the Los Angeles Basin
are regular, with the total variation in any given
daylight hour varying only a few percent from
one weekday to another. With the total amount of
emissions about constant, the resulting ozone lev-
els depend on the meteorology of the preceding
days. A large data base was assembled consist-
ing of lower and upper air measurements at U.S.
weather stations as far away as Oregon and Ari-
zona, together with hourly readings of surface
temperature, humidity, and wind speed at the
dozens of air pollution stations in the Basin and
nearby areas.
Altogether, there were daily and hourly readings
of over 450 meteorological variables for a period of
seven years, with corresponding hourly values of
ozone and other pollutants in the Basin. Let x be
the predictor vector of meteorological variables on
the nth day. There are more than 450 variables in
x since information several days back is included.
Let y be the ozone level on the (n + 1)st day. Then
the problem was to construct a function f(x) such
that for any future day and future predictor vari-
ables x for that day, f(x) is an accurate predictor of
the next days ozone level y.
To estimate predictive accuracy, the first five
years of data were used as the training set. The
last two years were set aside as a test set. The
algorithmic modeling methods available in the pre-
1980s decades seem primitive now. In this project
large linear regressions were run, followed by vari-
able selection. Quadratic terms in, and interactions
among, the retained variables were added and vari-
able selection used again to prune the equations. In
the end, the project was a failure; the false alarm rate of the final predictor was too high. I have regrets that this project can't be revisited with the tools available today.
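The protocol described here, a chronological train/test split followed by greedy variable selection on a large linear regression, can be sketched on synthetic data. Everything below is hypothetical: the variable counts, the two "true" predictors, and the stagewise univariate selection are stand-ins for the actual 450-variable problem.

```python
import random

random.seed(1)

# Synthetic stand-in for the ozone data: 7 "years" of daily records,
# 20 candidate predictors, next-day ozone as the response.
days, years, p = 50, 7, 20
n = days * years
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
# Unknown to the analyst, only variables 3 and 7 actually matter.
y = [2.0 * row[3] - 1.5 * row[7] + random.gauss(0, 0.5) for row in X]

# Chronological split, as in the project: first five years for training,
# last two years held out for estimating predictive accuracy.
cut = 5 * days
Xtr, ytr, Xte, yte = X[:cut], y[:cut], X[cut:], y[cut:]

def fit_uni(xs, rs):
    # Least-squares intercept and slope of residuals rs on one predictor.
    m = len(xs)
    mx, mr = sum(xs) / m, sum(rs) / m
    b = sum((a - mx) * (c - mr) for a, c in zip(xs, rs)) / \
        sum((a - mx) ** 2 for a in xs)
    return mr - b * mx, b

# Greedy stagewise selection: repeatedly add the single variable that most
# reduces training RSS (a crude emulation of 1970s subset selection).
resid, model, chosen = ytr[:], [], []
for _ in range(3):
    cands = []
    for j in range(p):
        col = [row[j] for row in Xtr]
        a, b = fit_uni(col, resid)
        r2 = sum((r - a - b * x) ** 2 for x, r in zip(col, resid))
        cands.append((r2, j, a, b))
    _, j, a, b = min(cands)
    chosen.append(j)
    model.append((j, a, b))
    resid = [r - a - b * row[j] for row, r in zip(Xtr, resid)]

def predict(row):
    return sum(a + b * row[j] for j, a, b in model)

test_mse = sum((predict(r) - t) ** 2 for r, t in zip(Xte, yte)) / len(yte)
```

The split respects time order, unlike a random shuffle, because the goal is to predict genuinely future days.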
3.2 The Chlorine Project
The EPA samples thousands of compounds a year
and tries to determine their potential toxicity. In
the mid-1970s, the standard procedure was to mea-
sure the mass spectra of the compound and to try
to determine its chemical structure from its mass
spectra.
Measuring the mass spectra is fast and cheap. But
the determination of chemical structure from the
mass spectra requires a painstaking examination
by a trained chemist. The cost and availability of
enough chemists to analyze all of the mass spectra
produced daunted the EPA. Many toxic compounds
contain halogens. So the EPA funded a project to
determine if the presence of chlorine in a compound
could be reliably predicted from its mass spectra.
Mass spectra are produced by bombarding the
compound with ions in the presence of a magnetic
eld. The molecules of the compound split and the
lighter fragments are bent more by the magnetic
eld than the heavier. Then the fragments hit an
absorbing strip, with the position of the fragment on
the strip determined by the molecular weight of the
fragment. The intensity of the exposure at that posi-
tion measures the frequency of the fragment. The
resultant mass spectrum has numbers reflecting frequencies of fragments from molecular weight 1 up to
the molecular weight of the original compound. The
peaks correspond to frequent fragments and there
are many zeroes. The available data base consisted
of the known chemical structure and mass spectra
of 30,000 compounds.
The mass spectrum predictor vector x is of vari-
able dimensionality. Molecular weight in the data
base varied from 30 to over 10,000. The variable to
be predicted is
y = 1: contains chlorine,
y = 2: does not contain chlorine.
The problem is to construct a function f(x) that
is an accurate predictor of y where x is the mass
spectrum of the compound.
To measure predictive accuracy the data set was
randomly divided into a 25,000 member training
set and a 5,000 member test set. Linear discriminant analysis was tried, then quadratic discriminant analysis. These were difficult to adapt to the
variable dimensionality. By this time I was thinking
about decision trees. The hallmarks of chlorine in
mass spectra were researched. This domain knowl-
edge was incorporated into the decision tree algorithm by the design of the set of 1,500 yes–no questions that could be applied to a mass spectrum of any dimensionality. The result was a decision tree that
gave 95% accuracy on both chlorines and nonchlo-
rines (see Breiman, Friedman, Olshen and Stone,
1984).
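The key design idea, yes-no questions that apply to a spectrum of any dimensionality, can be sketched as follows. The spectra, the questions, and the tiny hand-built tree are illustrative only (the real project learned a tree over 1,500 such questions); the chlorine isotope pattern (masses 35 and 37 in roughly a 3:1 ratio, leaving paired peaks two mass units apart) is standard chemistry, used here as an example of encoded domain knowledge.

```python
# A spectrum is a mapping from fragment molecular weight to relative
# frequency -- naturally of variable dimensionality, like the EPA data.

def has_peak(spec, w):
    return spec.get(w, 0.0) > 0.05

def isotope_pair(spec):
    # Chlorine's two natural isotopes (masses 35 and 37, roughly 3:1)
    # leave paired peaks two mass units apart with a ~1:3 intensity ratio.
    return any(0.2 < spec.get(w + 2, 0.0) / v < 0.5
               for w, v in spec.items() if v > 0.1)

# A tiny decision "tree": internal nodes are (question, yes_branch, no_branch);
# leaves are class labels (1 = contains chlorine, 2 = does not).
tree = (isotope_pair, 1, (lambda s: has_peak(s, 35), 1, 2))

def classify(spec, node=tree):
    if not isinstance(node, tuple):
        return node
    question, yes, no = node
    return classify(spec, yes if question(spec) else no)

# Two toy spectra of different dimensionality, both handled by the same tree.
chloro = {112: 1.00, 114: 0.32, 77: 0.60}   # isotope pair at 112/114
plain = {78: 1.00, 52: 0.40}                # no chlorine signature
```

Because each question is a function of the whole spectrum rather than a fixed coordinate, the variable dimensionality that defeated discriminant analysis is not a problem.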
3.3 Perceptions on Statistical Analysis
As I left consulting to go back to the university,
these were the perceptions I had about working with
data to find answers to problems:
(a) Focus on finding a good solution; that's what consultants get paid for.
(b) Live with the data before you plunge into
modeling.
(c) Search for a model that gives a good solution,
either algorithmic or data.
(d) Predictive accuracy on test sets is the crite-
rion for how good the model is.
(e) Computers are an indispensable partner.
4. RETURN TO THE UNIVERSITY
I had one tip about what research in the uni-
versity was like. A friend of mine, a prominent
statistician from the Berkeley Statistics Depart-
ment, visited me in Los Angeles in the late 1970s.
After I described the decision tree method to him,
his first question was, "What's the model for the data?"
4.1 Statistical Research
Upon my return, I started reading the Annals of Statistics, the flagship journal of theoretical statistics, and was bemused. Every article started with "Assume that the data are generated by the following model:"
followed by mathematics exploring inference, hypo-
thesis testing and asymptotics. There is a wide
spectrum of opinion regarding the usefulness of the
theory published in the Annals of Statistics to the
field of statistics as a science that deals with data. I
am at the very low end of the spectrum. Still, there
have been some gems that have combined nice
theory and significant applications. An example is
wavelet theory. Even in applications, data models
are universal. For instance, in the Journal of the
American Statistical Association (JASA), virtually
every article contains a statement of the form:
"Assume that the data are generated by the following model:"
I am deeply troubled by the current and past use
of data models in applications, where quantitative
conclusions are drawn and perhaps policy decisions
made.
5. THE USE OF DATA MODELS
Statisticians in applied research consider data
modeling as the template for statistical analysis:
"Faced with an applied problem, think of a data model." This enterprise has at its heart the belief
that a statistician, by imagination and by looking
at the data, can invent a reasonably good para-
metric class of models for a complex mechanism
devised by nature. Then parameters are estimated
and conclusions are drawn. But when a model is fit to data to draw quantitative conclusions:

- The conclusions are about the model's mechanism, and not about nature's mechanism.
It follows that:
- If the model is a poor emulation of nature, the conclusions may be wrong.
These truisms have often been ignored in the enthusiasm for fitting data models. A few decades ago, the commitment to data models was such that even simple precautions such as residual analysis or goodness-of-fit tests were not used. The belief in the infallibility of data models was almost religious. It is a strange phenomenon: once a model is made, then it becomes truth and the conclusions from it are infallible.
5.1 An Example
I illustrate with a famous (also infamous) example: assume the data is generated by independent draws from the model

(R)  y = b_0 + Σ_{m=1}^{M} b_m x_m + ε,

where the coefficients {b_m} are to be estimated, ε is N(0, σ²) and σ² is to be estimated. Given that the data is generated this way, elegant tests of hypotheses, confidence intervals, distributions of the residual sum-of-squares and asymptotics can be derived. This made the model attractive in terms of the mathematics involved. This theory was used both by academic statisticians and others to derive significance levels for coefficients on the basis of model (R), with little consideration as to whether the data on hand could have been generated by a linear model. Hundreds, perhaps thousands of articles were published claiming proof of something or other because the coefficient was significant at the 5% level.
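A small simulation (mine, not Breiman's) illustrates how a "significant at the 5% level" coefficient can be manufactured by a wrong model. Here nature produces pure serially correlated noise with no real effect at all, but model (R)'s i.i.d.-noise assumption makes the nominal t-test reject far more often than 5% when the predictor is a time trend.

```python
import math
import random

random.seed(2)

def slope_t(xs, ys):
    # OLS slope and its nominal t-statistic under model (R)'s assumptions.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    return b / math.sqrt(rss / (n - 2) / sxx)

# Nature's mechanism: y is AR(1) noise with no dependence on x at all,
# violating model (R)'s independent-errors assumption.
n, reps, rho = 50, 500, 0.8
xs = list(range(n))               # the predictor is just a time trend
rejections = 0
for _ in range(reps):
    e, ys = 0.0, []
    for _ in range(n):
        e = rho * e + random.gauss(0, 1)
        ys.append(e)
    if abs(slope_t(xs, ys)) > 1.96:   # nominal two-sided 5% cutoff
        rejections += 1
false_positive_rate = rejections / reps   # far above the nominal 0.05
```

Each rejection here is a "proof of something or other" that model (R)'s theory certifies at the 5% level, even though there is nothing to find.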
Goodness-of-fit was demonstrated mostly by giving the value of the multiple correlation coefficient R², which was often closer to zero than one and which could be overinflated by the use of too many parameters. Besides computing R², nothing else
was done to see if the observational data could have
been generated by model (R). For instance, a study
was done several decades ago by a well-known
member of a university statistics department to
assess whether there was gender discrimination in
the salaries of the faculty. All personnel files were
examined and a data base set up which consisted of
salary as the response variable and 25 other vari-
ables which characterized academic performance;
that is, papers published, quality of journals pub-
lished in, teaching record, evaluations, etc. Gender
appears as a binary predictor variable.
A linear regression was carried out on the data
and the gender coefficient was significant at the
5% level. That this was strong evidence of sex dis-
crimination was accepted as gospel. The design
of the study raises issues that enter before the
consideration of a modelCan the data gathered
STATISTICAL MODELING: THE TWO CULTURES 203
answer the question posed? Is inference justied
when your sample is the entire population? Should
a data model be used? The deciencies in analysis
occurred because the focus was on the model and
not on the problem.
The linear regression model led to many erro-
neous conclusions that appeared in journal articles
waving the 5% significance level without knowing whether the model fit the data. Nowadays, I think
most statisticians will agree that this is a suspect
way to arrive at conclusions. At the time, there were
few objections from the statistical profession about
the fairy-tale aspect of the procedure. But, hidden in an elementary textbook, Mosteller and Tukey (1977) discuss many of the fallacies possible in regression and write, "The whole area of guided regression is fraught with intellectual, statistical, computational, and subject matter difficulties."
Even currently, there are only rare published cri-
tiques of the uncritical use of data models. One of
the few is David Freedman, who examines the use
of regression models (1994); the use of path models
(1987) and data modeling (1991, 1995). The analysis
in these papers is incisive.
5.2 Problems in Current Data Modeling
Current applied practice is to check the data model fit using goodness-of-fit tests and residual analysis. At one point, some years ago, I set up a simulated regression problem in seven dimensions with a controlled amount of nonlinearity. Standard tests of goodness-of-fit did not reject linearity until the nonlinearity was extreme. Recent theory supports this conclusion. Work by Bickel, Ritov and Stoker (2001) shows that goodness-of-fit tests have very little power unless the direction of the alternative is precisely specified. The implication is that omnibus goodness-of-fit tests, which test in many directions simultaneously, have little power, and will not reject until the lack of fit is extreme.
Furthermore, if the model is tinkered with on the basis of the data, that is, if variables are deleted or nonlinear combinations of the variables added, then goodness-of-fit tests are not applicable. Residual analysis is similarly unreliable. In a discussion after a presentation of residual analysis in a seminar at Berkeley in 1993, William Cleveland, one of the fathers of residual analysis, admitted that it could not uncover lack of fit in more than four to five dimensions. The papers I have read on using residual analysis to check lack of fit are confined to data sets with two or three variables.
With higher dimensions, the interactions between
the variables can produce passable residual plots for
a variety of models. A residual plot is a goodness-of-fit test, and lacks power in more than a few dimensions. An acceptable residual plot does not imply that the model is a good fit to the data.
There are a variety of ways of analyzing residuals.
For instance, Landwehr, Pregibon and Shoemaker (1984, with discussion) give a detailed analysis of fitting a logistic model to a three-variable data set using various residual plots. But each of the four discussants presents other methods for the analysis.
One is left with an unsettled sense about the arbi-
trariness of residual analysis.
Misleading conclusions may follow from data
models that pass goodness-of-t tests and residual
checks. But published applications to data often
show little care in checking model t using these
methods or any other. For instance, many of the current application articles in JASA that fit data models have very little discussion of how well their model fits the data. The question of how well the model fits the data is of secondary importance compared to the construction of an ingenious stochastic model.
5.3 The Multiplicity of Data Models
One goal of statistics is to extract information
from the data about the underlying mechanism pro-
ducing the data. The greatest plus of data modeling
is that it produces a simple and understandable pic-
ture of the relationship between the input variables
and responses. For instance, logistic regression in classification is frequently used because it produces a linear combination of the variables with weights that give an indication of the variable importance. The end result is a simple picture of how the prediction variables affect the response variable plus confidence intervals for the weights. Suppose two
statisticians, each one with a different approach to data modeling, fit a model to the same data set. Assume also that each one applies standard goodness-of-fit tests, looks at residuals, etc., and is convinced that their model fits the data. Yet the two models give different pictures of nature's mechanism and lead to different conclusions.
McCullagh and Nelder (1989) write, "Data will often point with almost equal emphasis on several possible models, and it is important that the statistician recognize and accept this." Well said, but different models, all of them equally good, may give different pictures of the relation between the predictor and response variables. The question of which one most accurately reflects the data is difficult to resolve. One reason for this multiplicity is that goodness-of-fit tests and other methods for checking fit give a yes–no answer. With the lack of
power of these tests with data having more than a small number of dimensions, there will be a large number of models whose fit is acceptable. There is no way, among the yes–no methods for gauging fit, of determining which is the better model. A few statisticians know this. Mountain and Hsiao (1989) write, "It is difficult to formulate a comprehensive model capable of encompassing all rival models. Furthermore, with the use of finite samples, there are dubious implications with regard to the validity and power of various encompassing tests that rely on asymptotic theory."
Data models in current use may have more dam-
aging results than the publications in the social sci-
ences based on a linear regression analysis. Just as
the 5% level of significance became a de facto standard for publication, the Cox model for the analysis of survival times and logistic regression for survive–nonsurvive data have become the de facto standard for publication in medical journals. That different survival models, equally well fitting, could give different conclusions is not an issue.
5.4 Predictive Accuracy
The most obvious way to see how well the model box emulates nature's box is this: put a case x down nature's box getting an output y. Similarly, put the same case x down the model box getting an output y′. The closeness of y and y′ is a measure of how good the emulation is. For a data model, this translates as: fit the parameters in your model by using the data, then, using the model, predict the data and see how good the prediction is.
Prediction is rarely perfect. There are usu-
ally many unmeasured variables whose effect is
referred to as noise. But the extent to which the
model box emulates nature's box is a measure of
how well our model can reproduce the natural
phenomenon producing the data.
McCullagh and Nelder (1989) in their book on generalized linear models also think the answer is obvious. They write, "At first sight it might seem as though a good model is one that fits the data very well; that is, one that makes μ̂ (the model predicted value) very close to y (the response value)." Then they go on to note that the extent of the agreement is biased by the number of parameters used in the model and so is not a satisfactory measure. They are, of course, right. If the model has too many parameters, then it may overfit the data and give a biased estimate of accuracy. But there are ways to remove the bias. To get a more unbiased estimate of predictive accuracy, cross-validation can be used, as advocated in an important early work by Stone (1974). If the data set is larger, put aside a test set.
Mosteller and Tukey (1977) were early advocates of cross-validation. They write, "Cross-validation is a natural route to the indication of the quality of any data-derived quantity…. We plan to cross-validate carefully wherever we can."
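A sketch of the point, using an illustrative k-nearest-neighbor regression rather than any model from the paper: judging fit on the training data itself is biased (for k = 1 the training error is exactly zero), while cross-validation predicts each case from a model that never saw it.

```python
import math
import random

random.seed(3)

# Illustrative data: a smooth curve plus noise.
pts = [(i / 40, math.cos(i / 20) + random.gauss(0, 0.3)) for i in range(120)]
random.shuffle(pts)

def knn(train, x, k):
    # Average the y-values of the k training points nearest to x.
    nbrs = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nbrs) / k

def mse(train, cases, k):
    return sum((knn(train, x, k) - y) ** 2 for x, y in cases) / len(cases)

# Resubstitution error: fit and evaluate on the same data. For k = 1 every
# point predicts itself, so this estimate is uselessly optimistic.
resub = mse(pts, pts, 1)

# 5-fold cross-validation: each case is predicted by a model fit without it.
def cv_mse(k, folds=5):
    total = 0.0
    for f in range(folds):
        held = pts[f::folds]
        rest = [p for i, p in enumerate(pts) if i % folds != f]
        total += mse(rest, held, k) * len(held)
    return total / len(pts)

best_k = min(range(1, 16), key=cv_mse)
```

The resubstitution estimate is exactly the biased agreement measure McCullagh and Nelder warn about; cross-validation removes most of that bias at the cost of extra computation.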
Judging by the infrequency of estimates of predictive accuracy in JASA, this measure of model fit that seems natural to me (and to Mosteller and Tukey) is not natural to others. More publication of
predictive accuracy estimates would establish stan-
dards for comparison of models, a practice that is
common in machine learning.
6. THE LIMITATIONS OF DATA MODELS
With the insistence on data models, multivariate
analysis tools in statistics are frozen at discriminant
analysis and logistic regression in classication and
multiple linear regression in regression. Nobody
really believes that multivariate data is multivari-
ate normal, but that data model occupies a large
number of pages in every graduate textbook on
multivariate statistical analysis.
With data gathered from uncontrolled observa-
tions on complex systems involving unknown physi-
cal, chemical, or biological mechanisms, the a priori
assumption that nature would generate the data
through a parametric model selected by the statistician can result in questionable conclusions that cannot be substantiated by appeal to goodness-of-fit tests and residual analysis. Usually, simple parametric models imposed on data generated by complex systems, for example, medical data, financial data, result in a loss of accuracy and information as compared to algorithmic models (see Section 11).
There is an old saying, "If all a man has is a hammer, then every problem looks like a nail." The trouble for statisticians is that recently some of the problems have stopped looking like nails. I conjecture that the result of hitting this wall is that more
complicated data models are appearing in current
published applications. Bayesian methods combined
with Markov Chain Monte Carlo are cropping up all
over. This may signify that as data becomes more
complex, the data models become more cumbersome
and are losing the advantage of presenting a simple and clear picture of nature's mechanism.
Approaching problems by looking for a data model imposes an a priori straitjacket that restricts the ability of statisticians to deal with a wide range of
statistical problems. The best available solution to
a data problem might be a data model; then again
it might be an algorithmic model. The data and the
problem guide the solution. To solve a wider range
of data problems, a larger set of tools is needed.
Perhaps the damaging consequence of the insis-
tence on data models is that statisticians have ruled
themselves out of some of the most interesting and
challenging statistical problems that have arisen
out of the rapidly increasing ability of computers
to store and manipulate data. These problems are increasingly present in many fields, both scientific and commercial, and solutions are being found by nonstatisticians.
7. ALGORITHMIC MODELING
Under other names, algorithmic modeling has
been used by industrial statisticians for decades.
See, for instance, the delightful book Fitting Equa-
tions to Data (Daniel and Wood, 1971). It has been
used by psychometricians and social scientists.
Reading a preprint of Gifi's book (1990) many years ago uncovered a kindred spirit. It has made small inroads into the analysis of medical data starting with Richard Olshen's work in the early 1980s. For further work, see Zhang and Singer (1999). Jerome
Friedman and Grace Wahba have done pioneering
work on the development of algorithmic methods.
But the list of statisticians in the algorithmic mod-
eling business is short, and applications to data are
seldom seen in the journals. The development of
algorithmic methods was taken up by a community
outside statistics.
7.1 A New Research Community
In the mid-1980s two powerful new algorithms
for fitting data became available: neural nets and
decision trees. A new research community using
these tools sprang up. Their goal was predictive
accuracy. The community consisted of young com-
puter scientists, physicists and engineers plus a few
aging statisticians. They began using the new tools
in working on complex prediction problems where it
was obvious that data models were not applicable:
speech recognition, image recognition, nonlinear
time series prediction, handwriting recognition,
prediction in financial markets.
Their interests range over many fields that were once considered happy hunting grounds for statisticians and have turned out thousands of interesting
research papers related to applications and method-
ology. A large majority of the papers analyze real
data. The criterion for any model is its predictive accuracy. An idea of the range of research
of this group can be got by looking at the Proceed-
ings of the Neural Information Processing Systems
Conference (their main yearly meeting) or at the
Machine Learning Journal.
7.2 Theory in Algorithmic Modeling
Data models are rarely used in this community.
The approach is that nature produces data in a
black box whose insides are complex, mysterious,
and, at least, partly unknowable. What is observed
is a set of x's that go in and a subsequent set of y's that come out. The problem is to find an algorithm f(x) such that for future x in a test set, f(x) will be a good predictor of y.
The theory in this eld shifts focus from data mod-
els to the properties of algorithms. It characterizes
their strength as predictors, convergence if they
are iterative, and what gives them good predictive
accuracy. The one assumption made in the theory
is that the data is drawn i.i.d. from an unknown
multivariate distribution.
There is isolated work in statistics where the
focus is on the theory of the algorithms. Grace Wahba's research on smoothing spline algorithms and their applications to data (using cross-validation) is built on theory involving reproducing kernels in Hilbert space (1990). The final chapter
of the CART book (Breiman et al., 1984) contains
a proof of the asymptotic convergence of the CART
algorithm to the Bayes risk by letting the trees grow
as the sample size increases. There are others, but
the relative frequency is small.
Theory resulted in a major advance in machine
learning. Vladimir Vapnik constructed informative bounds on the generalization error (infinite test set error) of classification algorithms which depend on the capacity of the algorithm. These theoretical bounds led to support vector machines (see Vapnik, 1995, 1998) which have proved to be more accurate predictors in classification and regression than neural nets, and are the subject of heated current research (see Section 10).
My last paper, "Some infinity theory for tree ensembles" (Breiman, 2000), uses a function space analysis to try and understand the workings of tree ensemble methods. One section has the heading, "My kingdom for some good theory." There is an effective method for forming ensembles known as boosting, but there isn't any finite sample size theory that tells us why it works so well.
7.3 Recent Lessons
The advances in methodology and increases in
predictive accuracy since the mid-1980s that have
occurred in the research of machine learning have
been phenomenal. There have been particularly
exciting developments in the last five years. What
has been learned? The three lessons that seem most
important to me:

Rashomon: the multiplicity of good models;
Occam: the conflict between simplicity and accuracy;
Bellman: dimensionality, curse or blessing.

206 L. BREIMAN
8. RASHOMON AND THE MULTIPLICITY
OF GOOD MODELS
Rashomon is a wonderful Japanese movie in
which four people, from different vantage points,
witness an incident in which one person dies and
another is supposedly raped. When they come to
testify in court, they all report the same facts, but
their stories of what happened are very different.
What I call the Rashomon Effect is that there
is often a multitude of different descriptions [equa-
tions f(x)] in a class of functions giving about the
same minimum error rate. The most easily under-
stood example is subset selection in linear regres-
sion. Suppose there are 30 variables and we want to
nd the best ve variable linear regressions. There
are about 140,000 ve-variable subsets in competi-
tion. Usually we pick the one with the lowest resid-
ual sum-of-squares (RSS), or, if there is a test set,
the lowest test error. But there may be (and gen-
erally are) many five-variable equations that have
RSS within 1.0% of the lowest RSS (see Breiman,
1996a). The same is true if test set error is being
measured.
So here are three possible pictures with RSS or
test set error within 1.0% of each other:
Picture 1
y = 2.1 + 3.8x3 - 0.6x8 + 83.2x12 - 2.1x17 + 3.2x27

Picture 2
y = -8.9 + 4.6x5 + 0.01x6 + 12.0x15 + 17.5x21 + 0.2x22

Picture 3
y = -76.7 + 9.3x2 + 22.0x7 - 13.2x8 + 3.4x11 + 7.2x28
Which one is better? The problem is that each one
tells a different story about which variables are
important.
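The multiplicity claim can be reproduced numerically. The following sketch (Python/NumPy, on synthetic correlated data; the sizes and seed are illustrative, not any data set used in this paper) fits every four-variable least-squares regression and counts how many fall within 1% of the best RSS:

```python
import itertools

import numpy as np

def rashomon_count(X, y, k, tol=0.01):
    """Fit every k-variable linear regression and count how many
    have RSS within a fraction `tol` of the lowest RSS found."""
    n, p = X.shape
    all_rss = []
    for subset in itertools.combinations(range(p), k):
        A = np.column_stack([np.ones(n), X[:, subset]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        r = y - A @ coef
        all_rss.append(float(r @ r))
    best = min(all_rss)
    return sum(v <= best * (1 + tol) for v in all_rss), best

# Correlated predictors: 15 columns built from only 5 latent factors,
# so many different subsets carry nearly the same information.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 5))
X = Z @ rng.normal(size=(5, 15)) + 0.1 * rng.normal(size=(200, 15))
y = X[:, 0] - X[:, 3] + rng.normal(size=200)

n_good, best_rss = rashomon_count(X, y, k=4)
print(n_good, "four-variable models within 1% of the best RSS")
```

With strongly correlated predictors the count of near-ties is typically well above one, which is the Rashomon Effect in miniature.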
The Rashomon Effect also occurs with decision
trees and neural nets. In my experiments with trees,
if the training set is perturbed only slightly, say by
removing a random 2-3% of the data, I can get
a tree quite different from the original but with
almost the same test set error. I once ran a small
neural net 100 times on simple three-dimensional
data reselecting the initial weights to be small and
random on each run. I found 32 distinct minima,
each of which gave a different picture, and having
about equal test set error.
This effect is closely connected to what I call
instability (Breiman, 1996a) that occurs when there
are many different models crowded together that
have about the same training or test set error. Then
a slight perturbation of the data or in the model
construction will cause a skip from one model to
another. The two models are close to each other in
terms of error, but can be distant in terms of the
form of the model.
If, in logistic regression or the Cox model, the
common practice of deleting the less important
covariates is carried out, then the model becomes
unstable: there are too many competing models.
Say you are deleting from 15 variables to 4 vari-
ables. Perturb the data slightly and you will very
possibly get a different four-variable model and
a different conclusion about which variables are
important. To improve accuracy by weeding out less
important covariates you run into the multiplicity
problem. The picture of which covariates are impor-
tant can vary significantly between two models
having about the same deviance.
Aggregating over a large set of competing mod-
els can reduce the nonuniqueness while improving
accuracy. Arena et al. (2000) bagged (see Glossary)
logistic regression models on a data base of toxic and
nontoxic chemicals where the number of covariates
in each model was reduced from 15 to 4 by stan-
dard best subset selection. On a test set, the bagged
model was significantly more accurate than the sin-
gle model with four covariates. It is also more stable.
This is one possible fix. The multiplicity problem
and its effect on conclusions drawn from models
need serious attention.
9. OCCAM AND SIMPLICITY VS. ACCURACY
Occam's Razor, long admired, is usually inter-
preted to mean that simpler is better. Unfortunately,
in prediction, accuracy and simplicity (interpretabil-
ity) are in conflict. For instance, linear regression
gives a fairly interpretable picture of the y, x rela-
tion. But its accuracy is usually less than that
of the less interpretable neural nets. An example
closer to my work involves trees.
On interpretability, trees rate an A. A project
I worked on in the late 1970s was the analysis of
delay in criminal cases in state court systems. The
Constitution gives the accused the right to a speedy
trial. The Center for the State Courts was concerned
STATISTICAL MODELING: THE TWO CULTURES 207
Table 1
Data set descriptions

Data set     Training sample size   Test sample size   Variables   Classes
Cancer                699                 --                9          2
Ionosphere            351                 --               34          2
Diabetes              768                 --                8          2
Glass                 214                 --                9          6
Soybean               683                 --               35         19
Letters            15,000              5,000               16         26
Satellite           4,435              2,000               36          6
Shuttle            43,500             14,500                9          7
DNA                 2,000              1,186               60          3
Digit               7,291              2,007              256         10
that in many states, the trials were anything but
speedy. It funded a study of the causes of the delay.
I visited many states and decided to do the anal-
ysis in Colorado, which had an excellent computer-
ized court data system. A wealth of information was
extracted and processed.
The dependent variable for each criminal case
was the time from arraignment to the time of sen-
tencing. All of the other information in the trial his-
tory were the predictor variables. A large decision
tree was grown, and I showed it on an overhead and
explained it to the assembled Colorado judges. One
of the splits was on District N, which had a larger
delay time than the other districts. I refrained from
commenting on this. But as I walked out I heard one
judge say to another, "I knew those guys in District
N were dragging their feet."
While trees rate an A on interpretability, they
are good, but not great, predictors. Give them, say,
a B on prediction.
9.1 Growing Forests for Prediction
Instead of a single tree predictor, grow a forest of
trees on the same data, say 50 or 100. If we are
classifying, put the new x down each tree in the for-
est and get a vote for the predicted class. Let the for-
est prediction be the class that gets the most votes.
There has been a lot of work in the last five years on
ways to grow the forest. All of the well-known meth-
ods grow the forest by perturbing the training set,
growing a tree on the perturbed training set, per-
turbing the training set again, growing another tree,
etc. Some familiar methods are bagging (Breiman,
1996b), boosting (Freund and Schapire, 1996), arc-
ing (Breiman, 1998), and additive logistic regression
(Friedman, Hastie and Tibshirani, 1998).
My preferred method to date is random forests. In
this approach successive decision trees are grown by
introducing a random element into their construc-
tion. For example, suppose there are 20 predictor
variables. At each node choose several of the 20 at
random to use to split the node. Or use a random
combination of a random selection of a few vari-
ables. This idea appears in Ho (1998), in Amit and
Geman (1997) and is developed in Breiman (1999).
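As a deliberately minimal illustration of the random-split idea, the sketch below grows a "forest" of one-split trees (stumps), each trained on a bootstrap sample and restricted to a random subset of the features. This is a toy version of the mechanism, not Breiman's random forests algorithm (no deep trees, no out-of-bag machinery):

```python
import numpy as np

def fit_stump(X, y, feats):
    """Best single split (feature, threshold, leaf labels) among `feats`."""
    best, best_err = None, np.inf
    for j in feats:
        for t in np.quantile(X[:, j], [0.25, 0.5, 0.75]):
            for ll, rl in ((0, 1), (1, 0)):
                pred = np.where(X[:, j] <= t, ll, rl)
                err = np.sum(pred != y)
                if err < best_err:
                    best_err, best = err, (j, float(t), ll, rl)
    return best

def fit_forest(X, y, n_trees=50, m_try=3, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    forest = []
    for _ in range(n_trees):
        boot = rng.integers(0, n, size=n)                 # bootstrap sample
        feats = rng.choice(p, size=m_try, replace=False)  # random feature subset
        forest.append(fit_stump(X[boot], y[boot], feats))
    return forest

def predict_forest(forest, X):
    """Each tree votes; the majority class wins."""
    votes = sum(np.where(X[:, j] <= t, ll, rl) for j, t, ll, rl in forest)
    return (votes > len(forest) / 2).astype(int)

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))           # only the first two features matter
y = (X[:, 0] + X[:, 1] > 0).astype(int)
forest = fit_forest(X, y)
acc = float(np.mean(predict_forest(forest, X) == y))
```

Even with such weak individual trees, the vote is usually markedly better than any single stump, which is the point of growing the forest.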
9.2 Forests Compared to Trees
We compare the performance of single trees
(CART) to random forests on a number of small
and large data sets, mostly from the UCI repository
(ftp.ics.uci.edu/pub/MachineLearningDatabases). A
summary of the data sets is given in Table 1.
Table 2 compares the test set error of a single tree
to that of the forest. For the five smaller data sets
above the line, the test set error was estimated by
leaving out a random 10% of the data, then run-
ning CART and the forest on the other 90%. The
left-out 10% was run down the tree and the forest
and the error on this 10% computed for both. This
was repeated 100 times and the errors averaged.
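The error-estimation protocol just described (100 random 90/10 splits, errors averaged) can be written generically. The sketch below applies it to a simple nearest-mean classifier on synthetic data (names, data, and constants are illustrative only):

```python
import numpy as np

def repeated_holdout_error(fit, predict, X, y, n_reps=100, frac=0.10, seed=0):
    """Average test-set misclassification over repeated random splits."""
    rng = np.random.default_rng(seed)
    n, errors = len(y), []
    for _ in range(n_reps):
        test = rng.choice(n, size=max(1, int(frac * n)), replace=False)
        train = np.setdiff1d(np.arange(n), test)
        model = fit(X[train], y[train])
        errors.append(np.mean(predict(model, X[test]) != y[test]))
    return float(np.mean(errors))

def fit_nearest_mean(X, y):
    return {int(c): X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_nearest_mean(model, X):
    classes = sorted(model)
    dists = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[np.argmin(dists, axis=0)]

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (100, 3)), rng.normal(1, 1, (100, 3))])
y = np.repeat([0, 1], 100)
err = repeated_holdout_error(fit_nearest_mean, predict_nearest_mean, X, y)
```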
The larger data sets below the line came with a
separate test set. People who have been in the clas-
sification field for a while find these increases in
accuracy startling. Some errors are halved. Others
are reduced by one-third. In regression, where the
Table 2
Test set misclassification error (%)

Data set          Forest   Single tree
Breast cancer       2.9        5.9
Ionosphere          5.5       11.2
Diabetes           24.2       25.3
Glass              22.0       30.4
Soybean             5.7        8.6
Letters             3.4       12.4
Satellite           8.6       14.8
Shuttle (x10^-3)    7.0       62.0
DNA                 3.9        6.2
Digit               6.2       17.1
forest prediction is the average over the individual
tree predictions, the decreases in mean-squared test
set error are similar.
9.3 Random Forests are A+ Predictors
The Statlog Project (Michie, Spiegelhalter and
Taylor, 1994) compared 18 different classifiers.
Included were neural nets, CART, linear and
quadratic discriminant analysis, nearest neighbor,
etc. The first four data sets below the line in Table 1
were the only ones used in the Statlog Project that
came with separate test sets. In terms of rank of
accuracy on these four data sets, the forest comes
in 1, 1, 1, 1 for an average rank of 1.0. The next
best classier had an average rank of 7.3.
The fifth data set below the line consists of 16 x 16
pixel gray scale depictions of handwritten ZIP Code
numerals. It has been extensively used by AT&T
Bell Labs to test a variety of prediction methods.
A neural net handcrafted to the data got a test set
error of 5.1% vs. 6.2% for a standard run of random
forest.
9.4 The Occam Dilemma
So forests are A+ predictors. But their mechanism
for producing a prediction is difficult to understand.
Trying to delve into the tangled web that generated
a plurality vote from 100 trees is a Herculean task.
So on interpretability, they rate an F. Which brings
us to the Occam dilemma:

Accuracy generally requires more complex pre-
diction methods. Simple and interpretable functions
do not make the most accurate predictors.
Using complex predictors may be unpleasant, but
the soundest path is to go for predictive accuracy
first, then try to understand why. In fact, Section
10 points out that from a goal-oriented statistical
viewpoint, there is no Occam's dilemma. (For more
on Occam's Razor see Domingos, 1998, 1999.)
10. BELLMAN AND THE CURSE OF
DIMENSIONALITY
The title of this section refers to Richard Bell-
man's famous phrase, "the curse of dimensionality."
For decades, the first step in prediction methodol-
ogy was to avoid the curse. If there were too many
prediction variables, the recipe was to find a few
features (functions of the predictor variables) that
contain most of the information and then use
these features to replace the original variables. In
procedures common in statistics such as regres-
sion, logistic regression and survival models the
advised practice is to use variable deletion to reduce
the dimensionality. The published advice was that
high dimensionality is dangerous. For instance, a
well-regarded book on pattern recognition (Meisel,
1972) states "the features ... must be relatively
few in number." But recent work has shown that
dimensionality can be a blessing.
10.1 Digging It Out in Small Pieces
Reducing dimensionality reduces the amount of
information available for prediction. The more pre-
dictor variables, the more information. There is also
information in various combinations of the predictor
variables. Let's try going in the opposite direction:

Instead of reducing dimensionality, increase it
by adding many functions of the predictor variables.
There may now be thousands of features. Each
potentially contains a small amount of information.
The problem is how to extract and put together
these little pieces of information. There are two
outstanding examples of work in this direction, The
Shape Recognition Forest (Y. Amit and D. Geman,
1997) and Support Vector Machines (V. Vapnik,
1995, 1998).
10.2 The Shape Recognition Forest
In 1992, the National Institute of Standards and
Technology (NIST) set up a competition for machine
algorithms to read handwritten numerals. They put
together a large set of pixel pictures of handwritten
numbers (223,000) written by over 2,000 individ-
uals. The competition attracted wide interest, and
diverse approaches were tried.
The Amit-Geman approach defined many thou-
sands of small geometric features in a hierarchi-
cal assembly. Shallow trees are grown, such that at
each node, 100 features are chosen at random from
the appropriate level of the hierarchy; and the opti-
mal split of the node based on the selected features
is found.
When a pixel picture of a number is dropped down
a single tree, the terminal node it lands in gives
probability estimates p0, ..., p9 that it represents
numbers 0, 1, ..., 9. Over 1,000 trees are grown, the
probabilities averaged over this forest, and the pre-
dicted number is assigned to the largest averaged
probability.
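The aggregation step is just an average of the per-tree probability vectors followed by an argmax; schematically (with stand-in "trees" that return fixed probability vectors, here for three classes rather than ten):

```python
import numpy as np

def forest_predict(prob_fns, x):
    """Average the per-tree class-probability vectors (p0, ..., p9 in the
    digit case) and predict the class with the largest average."""
    avg = np.mean([f(x) for f in prob_fns], axis=0)
    return avg, int(np.argmax(avg))

# Three stand-in "trees" whose terminal nodes return fixed probabilities.
trees = [lambda x: np.array([0.1, 0.7, 0.2]),
         lambda x: np.array([0.2, 0.5, 0.3]),
         lambda x: np.array([0.3, 0.6, 0.1])]
avg, label = forest_predict(trees, None)
```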
Using a 100,000 example training set and a
50,000 test set, the Amit-Geman method gives a
test set error of 0.7%, close to the limits of human
error.
10.3 Support Vector Machines
Suppose there is two-class data having prediction
vectors in M-dimensional Euclidean space. The pre-
diction vectors for class #1 are {x(1)} and those for
class #2 are {x(2)}. If these two sets of vectors can
be separated by a hyperplane then there is an opti-
mal separating hyperplane. Optimal is defined as
meaning that the distance of the hyperplane to any
prediction vector is maximal (see below).
The set of vectors in {x(1)} and in {x(2)} that
achieve the minimum distance to the optimal
separating hyperplane are called the support vec-
tors. Their coordinates determine the equation of
the hyperplane. Vapnik (1995) showed that if a
separating hyperplane exists, then the optimal sep-
arating hyperplane has low generalization error
(see Glossary).
[Diagram: an optimal separating hyperplane between the two classes, with the support vectors marked.]
In two-class data, separability by a hyperplane
does not often occur. However, let us increase the
dimensionality by adding as additional predictor
variables all quadratic monomials in the original
predictor variables; that is, all terms of the form
x_m1 x_m2. A hyperplane in the original variables plus
quadratic monomials in the original variables is a
more complex creature. The possibility of separa-
tion is greater. If no separation occurs, add cubic
monomials as input features. If there are originally
30 predictor variables, then there are about 40,000
features if monomials up to the fourth degree are
added.
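The feature count quoted above can be checked with a standard combinatorial formula: the number of distinct monomials of degree 1 through d in p variables is C(p+d, d) - 1. (That this is exactly the convention behind the "about 40,000" figure is an assumption; the count below comes out slightly higher.)

```python
from math import comb

def n_monomial_features(p, d):
    """Monomials of degree 1..d in p variables: multisets of size
    at most d drawn from p symbols, minus the empty (constant) term."""
    return comb(p + d, d) - 1

print(n_monomial_features(30, 4))   # roughly the "40,000 features" in the text
```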
The higher the dimensionality of the set of fea-
tures, the more likely it is that separation occurs. In
the ZIP Code data set, separation occurs with fourth
degree monomials added. The test set error is 4.1%.
Using a large subset of the NIST data base as a
training set, separation also occurred after adding
up to fourth degree monomials and gave a test set
error rate of 1.1%.
Separation can always be had by raising the
dimensionality high enough. But if the separating
hyperplane becomes too complex, the generalization
error becomes large. An elegant theorem (Vapnik,
1995) gives this bound for the expected generaliza-
tion error:
Ex(GE) <= Ex(number of support vectors)/(N - 1)

where N is the sample size and the expectation is
over all training sets of size N drawn from the same
underlying distribution as the original training set.
The number of support vectors increases with the
dimensionality of the feature space. If this number
becomes too large, the separating hyperplane will
not give low generalization error. If separation can-
not be realized with a relatively small number of
support vectors, there is another version of support
vector machines that denes optimality by adding
a penalty term for the vectors on the wrong side of
the hyperplane.
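The penalized ("soft margin") version just described minimizes, in its primal form, half the squared norm of the weight vector plus a constant C times the sum of hinge losses over cases on the wrong side of the margin. The toy trainer below uses plain subgradient descent rather than the quadratic-programming machinery discussed next; the data, learning rate, and C are illustrative only:

```python
import numpy as np

def linear_svm(X, y, C=1.0, lr=0.01, epochs=200, seed=0):
    """Soft-margin linear SVM via subgradient descent on the hinge loss.
    Labels y must be +1/-1."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            if y[i] * (X[i] @ w + b) < 1:      # inside the margin: hinge active
                w -= lr * (w - C * y[i] * X[i])
                b += lr * C * y[i]
            else:                              # outside: only the penalty term
                w -= lr * w
    return w, b

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])
w, b = linear_svm(X, y)
acc = float(np.mean(np.sign(X @ w + b) == y))
```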
Some ingenious algorithms make finding the opti-
mal separating hyperplane computationally feasi-
ble. These devices reduce the search to a solution
of a quadratic programming problem with linear
inequality constraints that are of the order of the
number N of cases, independent of the dimension
of the feature space. Methods tailored to this partic-
ular problem produce speed-ups of an order of mag-
nitude over standard methods for solving quadratic
programming problems.
Support vector machines can also be used to
provide accurate predictions in other areas (e.g.,
regression). It is an exciting idea that gives excel-
lent performance and is beginning to supplant the
use of neural nets. A readable introduction is in
Cristianini and Shawe-Taylor (2000).
11. INFORMATION FROM A BLACK BOX
The dilemma posed in the last section is that
the models that best emulate nature in terms of
predictive accuracy are also the most complex and
inscrutable. But this dilemma can be resolved by
realizing the wrong question is being asked. Nature
forms the outputs y from the inputs x by means of
a black box with complex and unknown interior.
y <-- nature <-- x
Current accurate prediction methods are also
complex black boxes.
y <-- [neural nets, forests, support vectors] <-- x
So we are facing two black boxes, where ours
seems only slightly less inscrutable than nature's.
In data generated by medical experiments, ensem-
bles of predictors can give cross-validated error
rates significantly lower than logistic regression.
My biostatistician friends tell me, "Doctors can
interpret logistic regression. There is no way they
can interpret a black box containing fifty trees
hooked together. In a choice between accuracy and
interpretability, they'll go for interpretability."
Framing the question as the choice between accu-
racy and interpretability is an incorrect interpre-
tation of what the goal of a statistical analysis is.
The point of a model is to get useful information
about the relation between the response and pre-
dictor variables. Interpretability is a way of getting
information. But a model does not have to be simple
to provide reliable information about the relation
between predictor and response variables; neither
does it have to be a data model.

The goal is not interpretability, but accurate
information.
The following three examples illustrate this point.
The first shows that random forests applied to a
medical data set can give more reliable informa-
tion about covariate strengths than logistic regres-
sion. The second shows that it can give interesting
information that could not be revealed by a logistic
regression. The third is an application to microar-
ray data where it is difficult to conceive of a data
model that would uncover similar information.
11.1 Example I: Variable Importance in a
Survival Data Set
The data set contains survival or nonsurvival
of 155 hepatitis patients with 19 covariates. It is
available at ftp.ics.uci.edu/pub/MachineLearning-
Databases and was contributed by Gail Gong. The
description is in a file called hepatitis.names. The
data set has been previously analyzed by Diaconis
and Efron (1983), and Cestnik, Konenenko and
Bratko (1987). The lowest reported error rate to
date, 17%, is in the latter paper.
Diaconis and Efron refer to work by Peter Gre-
gory of the Stanford Medical School who analyzed
this data and concluded that the important vari-
ables were numbers 6, 12, 14, 19 and reported an
estimated 20% predictive accuracy. The variables
were reduced in two stages: the first was by informal
data analysis. The second refers to a more formal
[Figure 1 (bar plot): standardized coefficients vs. variables 1-19.]
Fig. 1. Standardized coefficients, logistic regression.
(unspecified) statistical procedure which I assume
was logistic regression.
Efron and Diaconis drew 500 bootstrap samples
from the original data set and used a similar pro-
cedure to isolate the important variables in each
bootstrapped data set. The authors comment, "Of
the four variables originally selected not one was
selected in more than 60 percent of the samples.
Hence the variables identified in the original analy-
sis cannot be taken too seriously." We will come back
to this conclusion later.
Logistic Regression
The predictive error rate for logistic regression on
the hepatitis data set is 17.4%. This was evaluated
by doing 100 runs, each time leaving out a randomly
selected 10% of the data as a test set, and then
averaging over the test set errors.
Usually, the initial evaluation of which variables
are important is based on examining the absolute
values of the coefficients of the variables in the logis-
tic regression divided by their standard deviations.
Figure 1 is a plot of these values.
The conclusion from looking at the standard-
ized coefficients is that variables 7 and 11 are the
most important covariates. When logistic regres-
sion is run using only these two variables, the
cross-validated error rate rises to 22.9%. Another
way to nd important variables is to run a best
subsets search which, for any value k, finds the
subset of k variables having lowest deviance.
This procedure raises the problems of instability
and multiplicity of models (see Section 7.1). There
are about 4,000 subsets containing four variables.
Of these, there are almost certainly a substantial
number that have deviance close to the minimum
and give different pictures of what the underlying
mechanism is.
[Figure 2 (bar plot): percent increase in error vs. variables 1-19.]
Fig. 2. Variable importance, random forest.
Random Forests
The random forests predictive error rate, evalu-
ated by averaging errors over 100 runs, each time
leaving out 10% of the data as a test set, is 12.3%,
almost a 30% reduction from the logistic regression
error.
Random forests consists of a large number of
randomly constructed trees, each voting for a class.
Similar to bagging (Breiman, 1996b), a bootstrap
sample of the training set is used to construct each
tree. A random selection of the input variables is
searched to find the best split for each node.
To measure the importance of the mth variable,
the values of the mth variable are randomly per-
muted in all of the cases left out in the current
bootstrap sample. Then these cases are run down
the current tree and their classication noted. At
the end of a run consisting of growing many trees,
the percent increase in misclassification rate due to
noising up each variable is computed. This is the
measure of variable importance that is shown in
Figure 2.

[Figure 3 (two smoothed scatter plots): standardized values of variables 12 and 17 vs. probability of class 1.]
Fig. 3. Variables 12 and 17 vs. probability #1.
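The "noising up" computation described above is simple to state in code. The sketch below permutes one column at a time on a single evaluation set; Breiman's procedure does this on the out-of-bag cases of each tree, so this is the idea rather than the exact implementation (predictor and data here are synthetic):

```python
import numpy as np

def permutation_importance(predict, X, y, seed=0):
    """Percent increase in misclassification when each variable's
    values are randomly permuted."""
    rng = np.random.default_rng(seed)
    base_err = np.mean(predict(X) != y)
    importance = []
    for m in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, m] = rng.permutation(Xp[:, m])
        err = np.mean(predict(Xp) != y)
        importance.append(100.0 * (err - base_err) / max(base_err, 1e-12))
    return np.array(importance)

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)  # only column 0 matters
imp = permutation_importance(lambda Z: (Z[:, 0] > 0).astype(int), X, y)
```

Only the column the predictor actually uses shows a large increase in error; permuting the irrelevant columns leaves the error unchanged.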
Random forests singles out two variables, the
12th and the 17th, as being important. As a verifi-
cation both variables were run in random forests,
individually and together. The test set error rates
over 100 replications were 14.3% each. Running
both together did no better. We conclude that virtu-
ally all of the predictive capability is provided by a
single variable, either 12 or 17.
To explore the interaction between 12 and 17 a bit
further, at the end of a random forest run using all
variables, the output includes the estimated value
of the probability of each class vs. the case number.
This information is used to get plots of the vari-
able values (normalized to mean zero and standard
deviation one) vs. the probability of death. The vari-
able values are smoothed using a weighted linear
regression smoother. The results are in Figure 3 for
variables 12 and 17.
Fig. 4. Variable importance, Bupa data.
The graphs of the variable values vs. class death
probability are almost linear and similar. The two
variables turn out to be highly correlated. Thinking
that this might have affected the logistic regression
results, it was run again with one or the other of
these two variables deleted. There was little change.
Out of curiosity, I evaluated variable impor-
tance in logistic regression in the same way that I
did in random forests, by permuting variable val-
ues in the 10% test set and computing how much
that increased the test set error. Not much help:
variables 12 and 17 were not among the 3 variables
ranked as most important. In partial verication
of the importance of 12 and 17, I tried them sep-
arately as single variables in logistic regression.
Variable 12 gave a 15.7% error rate, variable 17
came in at 19.3%.
To go back to the original Diaconis-Efron analy-
sis, the problem is clear. Variables 12 and 17 are sur-
rogates for each other. If one of them appears impor-
tant in a model built on a bootstrap sample, the
other does not. So each one's frequency of occurrence
is automatically less than 50%. The paper lists the
variables selected in ten of the samples. Either 12
or 17 appear in seven of the ten.

Fig. 5. Cluster averages, Bupa data.
11.2 Example II: Clustering in Medical Data
The Bupa liver data set is a two-class biomedical
data set also available at ftp.ics.uci.edu/pub/Mac-
hineLearningDatabases. The covariates are:
1. mcv mean corpuscular volume
2. alkphos alkaline phosphotase
3. sgpt alamine aminotransferase
4. sgot aspartate aminotransferase
5. gammagt gamma-glutamyl transpeptidase
6. drinks half-pint equivalents of alcoholic
beverage drunk per day
The first five attributes are the results of blood
tests thought to be related to liver functioning. The
345 patients are classied into two classes by the
severity of their liver malfunctioning. Class two is
severe malfunctioning. In a random forests run,
the misclassification error rate is 28%. The variable
importance given by random forests is in Figure 4.
Blood tests 3 and 5 are the most important, fol-
lowed by test 4. Random forests also outputs an
intrinsic similarity measure which can be used to
cluster. When this was applied, two clusters were
discovered in class two. The average of each variable
is computed and plotted in each of these clusters in
Figure 5.
An interesting facet emerges. The class two sub-
jects consist of two distinct groups: those that have
high scores on blood tests 3, 4, and 5 and those that
have low scores on those tests.
11.3 Example III: Microarray Data
Random forests was run on a microarray lym-
phoma data set with three classes, sample size of
81 and 4,682 variables (genes) without any variable
selection [for more information about this data set,
see Dudoit, Fridlyand and Speed (2000)]. The error
rate was low. What was also interesting from a sci-
entific viewpoint was an estimate of the importance
of each of the 4,682 gene expressions.
The graph in Figure 6 was produced by a run
of random forests. This result is consistent with
assessments of variable importance made using
other algorithmic methods, but appears to have
sharper detail.
11.4 Remarks about the Examples
The examples show that much information is
available from an algorithmic model. Friedman
(1999) derives similar variable information from a
different way of constructing a forest. The similar-
ity is that they are both built as ways to give low
predictive error.

Fig. 6. Microarray variable importance.
There are 32 deaths and 123 survivors in the hep-
atitis data set. Calling everyone a survivor gives a
baseline error rate of 20.6%. Logistic regression low-
ers this to 17.4%. It is not extracting much useful
information from the data, which may explain its
inability to find the important variables. Its weak-
ness might have been unknown and the variable
importances accepted at face value if its predictive
accuracy was not evaluated.
Random forests is also capable of discovering
important aspects of the data that standard data
models cannot uncover. The potentially interesting
clustering of class two patients in Example II is an
illustration. The standard procedure when fitting
data models such as logistic regression is to delete
variables; to quote from Diaconis and Efron (1983)
again, "statistical experience suggests that it is
unwise to fit a model that depends on 19 variables
with only 155 data points available." Newer meth-
ods in machine learning thrive on variables: the
more the better. For instance, random forests does
not overfit. It gives excellent accuracy on the lym-
phoma data set of Example III which has over 4,600
variables, with no variable deletion and is capable
of extracting variable importance information from
the data.
These examples illustrate the following points:

Higher predictive accuracy is associated with
more reliable information about the underlying data
mechanism. Weak predictive accuracy can lead to
questionable conclusions.

Algorithmic models can give better predictive
accuracy than data models, and provide better infor-
mation about the underlying mechanism.
12. FINAL REMARKS
The goals in statistics are to use data to predict
and to get information about the underlying data
mechanism. Nowhere is it written on a stone tablet
what kind of model should be used to solve problems
involving data. To make my position clear, I am not
against data models per se. In some situations they
are the most appropriate way to solve the problem.
But the emphasis needs to be on the problem and
on the data.
Unfortunately, our field has a vested interest in
data models, come hell or high water. For instance,
see Dempster's (1998) paper on modeling. His posi-
tion on the 1990 Census adjustment controversy is
particularly interesting. He admits that he doesn't
know much about the data or the details, but argues
that the problem can be solved by a strong dose
of modeling. That more modeling can make error-
ridden data accurate seems highly unlikely to me.
Terabytes of data are pouring into computers
from many sources, both scientific and commer-
cial, and there is a need to analyze and understand
the data. For instance, data is being generated
at an awesome rate by telescopes and radio tele-
scopes scanning the skies. Images containing mil-
lions of stellar objects are stored on tape or disk.
Astronomers need automated ways to scan their
data to find certain types of stellar objects or novel
objects. This is a fascinating enterprise, and I doubt
if data models are applicable. Yet I would enter this
in my ledger as a statistical problem.
The analysis of genetic data is one of the most
challenging and interesting statistical problems
around. Microarray data, like that analyzed in
Section 11.3, can lead to significant advances in
understanding genetic effects. But the analysis
of variable importance in Section 11.3 would be
difficult to do accurately using a stochastic data
model.
Problems such as stellar recognition or analysis
of gene expression data could be high adventure for
statisticians. But it requires that they focus on solv-
ing the problem instead of asking what data model
they can create. The best solution could be an algo-
rithmic model, or maybe a data model, or maybe a
combination. But the trick to being a scientist is to
be open to using a wide variety of tools.
The roots of statistics, as in science, lie in work-
ing with data and checking theory against data. I
hope in this century our field will return to its roots.
There are signs that this hope is not illusory. Over
the last ten years, there has been a noticeable move
toward statistical work on real world problems and
reaching out by statisticians toward collaborative
work with other disciplines. I believe this trend will
continue and, in fact, has to continue if we are to
survive as an energetic and creative field.
GLOSSARY
Since some of the terms used in this paper may
not be familiar to all statisticians, I append some
definitions.
Infinite test set error. Assume a loss function
L(y, ŷ) that is a measure of the error when y is
the true response and ŷ the predicted response.
In classification, the usual loss is 1 if ŷ ≠ y and
zero if ŷ = y. In regression, the usual loss is
(y − ŷ)². Given a set of data (training set) consist-
ing of {(y_n, x_n), n = 1, ..., N}, use it to construct
a predictor function f(x) of y. Assume that the
training set is i.i.d. drawn from the distribution of
the random vector Y, X. The infinite test set error
is E(L(Y, f(X))). This is called the generalization
error in machine learning.
The generalization error is estimated either by
setting aside a part of the data as a test set or by
cross-validation.
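As a concrete illustration, here is a minimal sketch of the test-set estimate just described; the data-generating setup, the 10% label-noise rate, and the fixed classifier are all invented for the example.

```python
import random

random.seed(0)

# Toy data: y is "x > 0.5" with roughly 10% of labels flipped (invented setup).
xs = [random.random() for _ in range(1000)]
data = [(x, (x > 0.5) != (random.random() < 0.1)) for x in xs]

def predictor(x):
    # A fixed classifier phi(x); in practice phi is built from the training set.
    return x > 0.5

def loss(y, y_hat):
    # Zero-one loss, the usual classification loss in the glossary.
    return 0 if y == y_hat else 1

# Hold out the last 30% as a test set; the average loss over it estimates
# the infinite test set (generalization) error.
test = data[700:]
est_error = sum(loss(y, predictor(x)) for x, y in test) / len(test)
print(est_error)  # roughly the 10% flip rate
```

Cross-validation replaces the single held-out split with repeated splits, averaging the same loss over the folds.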
Predictive accuracy. This refers to the size of
the estimated generalization error. Good predictive
accuracy means low estimated error.
Trees and nodes. This terminology refers to decision trees as described in the book by Breiman et al. (1984).
Dropping an x down a tree. When a vector of pre-
dictor variables is dropped down a tree, at each
intermediate node it has instructions whether to go
left or right depending on the coordinates of x. It
stops at a terminal node and is assigned the predic-
tion given by that node.
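A minimal sketch of this traversal (this is only the "dropping down" step, not the CART tree-growing algorithm; the two-level tree below is invented):

```python
# A node is either a terminal prediction or a split on one coordinate of x.
def make_split(coord, threshold, left, right):
    return ("split", coord, threshold, left, right)

def make_leaf(prediction):
    return ("leaf", prediction)

def drop_down(tree, x):
    """Follow the left/right instructions until a terminal node is reached."""
    while tree[0] == "split":
        _, coord, threshold, left, right = tree
        tree = left if x[coord] <= threshold else right
    return tree[1]  # the prediction assigned to the terminal node

# Hypothetical two-level tree over x = (x0, x1).
tree = make_split(0, 0.5,
                  make_leaf("A"),
                  make_split(1, 2.0, make_leaf("B"), make_leaf("C")))
print(drop_down(tree, (0.3, 9.9)))  # "A": goes left at the root
print(drop_down(tree, (0.9, 3.0)))  # "C": right at the root, then right
```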
Bagging. An acronym for "bootstrap aggregating." Start with an algorithm such that, given any training set, the algorithm produces a prediction function φ(x). The algorithm can be a decision tree construction, logistic regression with variable deletion, etc. Take a bootstrap sample from the training set and use this bootstrap training set to construct the predictor φ_1(x). Take another bootstrap sample and, using this second training set, construct the predictor φ_2(x). Continue this way for K steps. In regression, average all of the {φ_k(x)} to get the
STATISTICAL MODELING: THE TWO CULTURES 215
bagged predictor at x. In classification, that class which has the plurality vote of the {φ_k(x)} is the bagged predictor. Bagging has been shown effective in variance reduction (Breiman, 1996b).
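A sketch of the procedure for regression, under a deliberately trivial base algorithm (fit the mean of y); any algorithm mapping a training set to a predictor φ(x) could be plugged in, and the data are invented:

```python
import random

random.seed(1)

# Trivial base algorithm for regression: predict the mean of y.
def fit_mean(training_set):
    m = sum(y for _, y in training_set) / len(training_set)
    return lambda x: m

def bag(training_set, base_algorithm, K=25):
    predictors = []
    for _ in range(K):
        # Bootstrap sample: draw N cases with replacement.
        boot = [random.choice(training_set) for _ in training_set]
        predictors.append(base_algorithm(boot))
    # The bagged predictor averages the K predictors at x.
    return lambda x: sum(phi(x) for phi in predictors) / K

train = [(x, 2.0 * x) for x in range(10)]
bagged = bag(train, fit_mean)
print(bagged(3))  # near the overall mean of y, here 9.0
```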
Boosting. This is a more complex way of forming
an ensemble of predictors in classication than bag-
ging (Freund and Schapire, 1996). It uses no ran-
domization but proceeds by altering the weights on
the training set. Its performance in terms of low pre-
diction error is excellent (for details see Breiman,
1998).
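As a sketch of the weight-altering idea (this follows the AdaBoost scheme of Freund and Schapire, 1996, in simplified form; the one-dimensional threshold "stumps" and the toy labels are invented):

```python
import math

def adaboost(train, base_learners, rounds):
    n = len(train)
    w = [1.0 / n] * n              # start with uniform case weights
    ensemble = []                  # list of (alpha, learner) pairs
    for _ in range(rounds):
        def weighted_error(h):
            return sum(wi for wi, (x, y) in zip(w, train) if h(x) != y)
        h = min(base_learners, key=weighted_error)
        err = max(weighted_error(h), 1e-12)
        if err >= 0.5:
            break                  # no base learner better than chance
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # Alter the weights: misclassified cases are weighted up,
        # correctly classified cases down; then renormalize.
        w = [wi * math.exp(alpha if h(x) != y else -alpha)
             for wi, (x, y) in zip(w, train)]
        total = sum(w)
        w = [wi / total for wi in w]

    def predict(x):
        # Weighted vote of the ensemble; labels are in {-1, +1}.
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return predict

# Toy one-dimensional problem: the true rule is "x > 4".
stumps = [lambda x, t=t: 1 if x > t else -1 for t in range(1, 10)]
train = [(x, 1 if x > 4 else -1) for x in range(10)]
clf = adaboost(train, stumps, rounds=5)
print([clf(x) for x in range(10)])  # -1 for x <= 4, +1 for x > 4
```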
ACKNOWLEDGMENTS
Many of my ideas about data modeling were
formed in three decades of conversations with my
old friend and collaborator, Jerome Friedman. Con-
versations with Richard Olshen about the Cox
model and its use in biostatistics helped me to
understand the background. I am also indebted to
William Meisel, who headed some of the predic-
tion projects I consulted on and helped me make
the transition from probability theory to algorithms,
and to Charles Stone for illuminating conversations
about the nature of statistics and science. I'm grateful also for the comments of the editor, Leon Gleser, which prompted a major rewrite of the first draft of this manuscript and resulted in a different and better paper.
REFERENCES
Amit, Y. and Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Computation 9 1545–1588.
Arena, C., Sussman, N., Chiang, K., Mazumdar, S., Macina, O. and Li, W. (2000). Bagging Structure-Activity Relationships: A simulation study for assessing misclassification rates. Presented at the Second Indo-U.S. Workshop on Mathematical Chemistry, Duluth, MI. (Available at [email protected].)
Bickel, P., Ritov, Y. and Stoker, T. (2001). Tailor-made tests for goodness of fit for semiparametric hypotheses. Unpublished manuscript.
Breiman, L. (1996a). The heuristics of instability in model selection. Ann. Statist. 24 2350–2381.
Breiman, L. (1996b). Bagging predictors. Machine Learning J. 26 123–140.
Breiman, L. (1998). Arcing classifiers. Discussion paper, Ann. Statist. 26 801–824.
Breiman, L. (2000). Some infinity theory for tree ensembles. (Available at www.stat.berkeley.edu/technical reports.)
Breiman, L. (2001). Random forests. Machine Learning J. 45 5–32.
Breiman, L. and Friedman, J. (1985). Estimating optimal transformations in multiple regression and correlation. J. Amer. Statist. Assoc. 80 580–619.
Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge Univ. Press.
Daniel, C. and Wood, F. (1971). Fitting Equations to Data. Wiley, New York.
Dempster, A. (1998). Logicist statistics I. Models and modeling. Statist. Sci. 13 248–276.
Diaconis, P. and Efron, B. (1983). Computer intensive methods in statistics. Scientific American 248 116–131.
Domingos, P. (1998). Occam's two razors: the sharp and the blunt. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (R. Agrawal and P. Stolorz, eds.) 37–43. AAAI Press, Menlo Park, CA.
Domingos, P. (1999). The role of Occam's razor in knowledge discovery. Data Mining and Knowledge Discovery 3 409–425.
Dudoit, S., Fridlyand, J. and Speed, T. (2000). Comparison of discrimination methods for the classification of tumors. (Available at www.stat.berkeley.edu/technical reports.)
Freedman, D. (1987). As others see us: a case study in path analysis (with discussion). J. Ed. Statist. 12 101–223.
Freedman, D. (1991). Statistical models and shoe leather. Sociological Methodology 1991 (with discussion) 291–358.
Freedman, D. (1991). Some issues in the foundations of statistics. Foundations of Science 1 19–83.
Freedman, D. (1994). From association to causation via regression. Adv. in Appl. Math. 18 59–110.
Freund, Y. and Schapire, R. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 148–156. Morgan Kaufmann, San Francisco.
Friedman, J. (1999). Greedy predictive approximation: a gradient boosting machine. Technical report, Dept. Statistics, Stanford Univ.
Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Ann. Statist. 28 337–407.
Gifi, A. (1990). Nonlinear Multivariate Analysis. Wiley, New York.
Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Trans. Pattern Analysis and Machine Intelligence 20 832–844.
Landwehr, J., Pregibon, D. and Shoemaker, A. (1984). Graphical methods for assessing logistic regression models (with discussion). J. Amer. Statist. Assoc. 79 61–83.
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall, London.
Meisel, W. (1972). Computer-Oriented Approaches to Pattern Recognition. Academic Press, New York.
Michie, D., Spiegelhalter, D. and Taylor, C. (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood, New York.
Mosteller, F. and Tukey, J. (1977). Data Analysis and Regression. Addison-Wesley, Reading, MA.
Mountain, D. and Hsiao, C. (1989). A combined structural and flexible functional approach for modeling energy substitution. J. Amer. Statist. Assoc. 84 76–87.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. B 36 111–147.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, New York.
Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York.
Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.
Zhang, H. and Singer, B. (1999). Recursive Partitioning in the Health Sciences. Springer, New York.
216 L. BREIMAN
Comment
D. R. Cox
Professor Breiman's interesting paper gives both a clear statement of the broad approach underlying some of his influential and widely admired contributions and outlines some striking applications and developments. He has combined this with a critique of what, for want of a better term, I will call mainstream statistical thinking, based in part on a caricature. Like all good caricatures, it contains enough truth and exposes enough weaknesses to be thought-provoking.
There is not enough space to comment on all the
many points explicitly or implicitly raised in the
paper. There follow some remarks about a few main
issues.
One of the attractions of our subject is the astonishingly wide range of applications as judged not only in terms of substantive field but also in terms of objectives, quality and quantity of data and so on. Thus any unqualified statement that in applications has to be treated sceptically. One of our failings has, I believe, been, in a wish to stress generality, not to set out more clearly the distinctions between different kinds of application and the consequences for the strategy of statistical analysis. Of course we have distinctions between decision-making and inference, between tests and estimation, and between estimation and prediction and these are useful but, I think, are, except perhaps the first, too phrased in terms of the technology rather than the spirit of statistical analysis.
I entirely agree with Professor Breiman that it
would be an impoverished and extremely unhis-
torical view of the subject to exclude the kind of
work he describes simply because it has no explicit
probabilistic base.
Professor Breiman takes data as his starting
point. I would prefer to start with an issue, a question or a scientific hypothesis, although I would
be surprised if this were a real source of disagree-
ment. These issues may evolve, or even change
radically, as analysis proceeds. Data looking for
a question are not unknown and raise puzzles
but are, I believe, atypical in most contexts. Next,
even if we ignore design aspects and start with data,
D. R. Cox is an Honorary Fellow, Nuffield College,
Oxford OX1 1NF, United Kingdom, and associate
member, Department of Statistics, University of
Oxford (e-mail: [email protected]).
key points concern the precise meaning of the data,
the possible biases arising from the method of ascer-
tainment, the possible presence of major distorting
measurement errors and the nature of processes
underlying missing and incomplete data and data
that evolve in time in a way involving complex inter-
dependencies. For some of these, at least, it is hard
to see how to proceed without some notion of prob-
abilistic modeling.
Next Professor Breiman emphasizes prediction
as the objective, success at prediction being the
criterion of success, as contrasted with issues
of interpretation or understanding. Prediction is
indeed important from several perspectives. The
success of a theory is best judged from its ability to
predict in new contexts, although one cannot dis-
miss as totally useless theories such as the rational
action theory (RAT), in political science, which, as
I understand it, gives excellent explanations of the
past but which has failed to predict the real politi-
cal world. In a clinical trial context it can be argued
that an objective is to predict the consequences of
treatment allocation to future patients, and so on.
If the prediction is localized to situations directly
similar to those applying to the data there is then
an interesting and challenging dilemma. Is it prefer-
able to proceed with a directly empirical black-box
approach, as favored by Professor Breiman, or is
it better to try to take account of some underly-
ing explanatory process? The answer must depend
on the context but I certainly accept, although it
goes somewhat against the grain to do so, that
there are situations where a directly empirical
approach is better. Short term economic forecasting
and real-time flood forecasting are probably further
exemplars. Key issues are then the stability of the
predictor as practical prediction proceeds, the need
from time to time for recalibration and so on.
However, much prediction is not like this. Often
the prediction is under quite different conditions
from the data; what is the likely progress of the
incidence of the epidemic of v-CJD in the United
Kingdom, what would be the effect on annual inci-
dence of cancer in the United States of reducing by
10% the medical use of X-rays, etc.? That is, it may
be desired to predict the consequences of something
only indirectly addressed by the data available for
analysis. As we move toward such more ambitious
tasks, prediction, always hazardous, without some
understanding of underlying process and linking
with other sources of information, becomes more
and more tentative. Formulation of the goals of
analysis solely in terms of direct prediction over the
data set seems then increasingly unhelpful.
This is quite apart from matters where the direct
objective is understanding of and tests of subject-
matter hypotheses about underlying process, the
nature of pathways of dependence and so on.
What is the central strategy of mainstream sta-
tistical analysis? This can most certainly not be dis-
cerned from the pages of Bernoulli, The Annals of
Statistics or the Scandinavian Journal of Statistics nor from Biometrika and the Journal of the Royal Statistical Society, Series B or even from the application
pages of Journal of the American Statistical Associa-
tion or Applied Statistics, estimable though all these
journals are. Of course as we move along the list,
there is an increase from zero to 100% in the papers
containing analyses of real data. But the papers
do so nearly always to illustrate technique rather
than to explain the process of analysis and inter-
pretation as such. This is entirely legitimate, but
is completely different from live analysis of current
data to obtain subject-matter conclusions or to help
solve specic practical issues. Put differently, if an
important conclusion is reached involving statisti-
cal analysis it will be reported in a subject-matter
journal or in a written or verbal report to colleagues,
government or business. When that happens, statis-
tical details are typically and correctly not stressed.
Thus the real procedures of statistical analysis can
be judged only by looking in detail at specic cases,
and access to these is not always easy. Failure to
discuss enough the principles involved is a major
criticism of the current state of theory.
I think tentatively that the following quite com-
monly applies. Formal models are useful and often
almost, if not quite, essential for incisive thinking.
Descriptively appealing and transparent methods
with a firm model base are the ideal. Notions of significance tests, confidence intervals, posterior intervals and all the formal apparatus of inference are valuable tools to be used as guides, but not in a mechanical way; they indicate the uncertainty that would apply under somewhat idealized, maybe very idealized, conditions and as such are often lower bounds to real uncertainty. Analyses and model development are at least partly exploratory. Automatic methods of model selection (and of variable selection in regression-like problems) are to be shunned or, if use is absolutely unavoidable, are to be examined carefully for their effect on the final
conclusions. Unfocused tests of model adequacy are
rarely helpful.
By contrast, Professor Breiman equates main-
stream applied statistics to a relatively mechanical
process involving somehow or other choosing a
model, often a default model of standard form,
and applying standard methods of analysis and
goodness-of-fit procedures. Thus for survival data
choose a priori the proportional hazards model.
(Note, incidentally, that in the paper, often quoted
but probably rarely read, that introduced this
approach there was a comparison of several of the
many different models that might be suitable for
this kind of data.) It is true that many of the anal-
yses done by nonstatisticians or by statisticians
under severe time constraints are more or less like
those Professor Breiman describes. The issue then
is not whether they could ideally be improved, but
whether they capture enough of the essence of the
information in the data, together with some rea-
sonable indication of precision as a guard against
under- or overinterpretation. Would more refined analysis, possibly with better predictive power and better fit, produce subject-matter gains? There can
be no general answer to this, but one suspects that
quite often the limitations of conclusions lie more
in weakness of data quality and study design than
in ineffective analysis.
There are two broad lines of development active
at the moment arising out of mainstream statistical
ideas. The first is the invention of models strongly
tied to subject-matter considerations, represent-
ing underlying dependencies, and their analysis,
perhaps by Markov chain Monte Carlo methods.
In fields where subject-matter considerations are
largely qualitative, we see a development based on
Markov graphs and their generalizations. These
methods in effect assume, subject in principle to
empirical test, more and more about the phenom-
ena under study. By contrast, there is an emphasis
on assuming less and less via, for example, kernel
estimates of regression functions, generalized addi-
tive models and so on. There is a need to be clearer
about the circumstances favoring these two broad
approaches, synthesizing them where possible.
My own interest tends to be in the former style
of work. From this perspective Cox and Wermuth
(1996, page 15) listed a number of requirements of a
statistical model. These are to establish a link with
background knowledge and to set up a connection
with previous work, to give some pointer toward
a generating process, to have primary parameters
with individual clear subject-matter interpretations,
to specify haphazard aspects well enough to lead
to meaningful assessment of precision and, finally, that the fit should be adequate. From this perspective, fit, which is broadly related to predictive success, is not the primary basis for model choice and
formal methods of model choice that take no account
of the broader objectives are suspect in the present
context. In a sense these are efforts to establish data
descriptions that are potentially causal, recognizing
that causality, in the sense that a natural scientist
would use the term, can rarely be established from
one type of study and is at best somewhat tentative.
Professor Breiman takes a rather defeatist attitude toward attempts to formulate underlying processes; is this not to reject the base of much scientific progress? The interesting illustrations given
by Beveridge (1952), where hypothesized processes
in various biological contexts led to important
progress, even though the hypotheses turned out in
the end to be quite false, illustrate the subtlety of
the matter. Especially in the social sciences, repre-
sentations of underlying process have to be viewed
with particular caution, but this does not make
them fruitless.
The absolutely crucial issue in serious main-
stream statistics is the choice of a model that
will translate key subject-matter questions into a
form for analysis and interpretation. If a simple
standard model is adequate to answer the subject-
matter question, this is fine: there are severe
hidden penalties for overelaboration. The statisti-
cal literature, however, concentrates on how to do
the analysis, an important and indeed fascinating
question, but a secondary step. Better a rough
answer to the right question than an exact answer
to the wrong question, an aphorism, due perhaps to
Lord Kelvin, that I heard as an undergraduate in
applied mathematics.
I have stayed away from the detail of the paper
but will comment on just one point, the interesting
theorem of Vapnik about complete separation. This
confirms folklore experience with empirical logistic
regression that, with a largish number of explana-
tory variables, complete separation is quite likely to
occur. It is interesting that in mainstream thinking
this is, I think, regarded as insecure in that com-
plete separation is thought to be a priori unlikely
and the estimated separating plane unstable. Pre-
sumably bootstrap and cross-validation ideas may
give here a quite misleading illusion of stability.
Of course if the complete separator is subtle and
stable Professor Breiman's methods will emerge triumphant and ultimately it is an empirical question
in each application as to what happens.
It will be clear that while I disagree with the main thrust of Professor Breiman's paper I found it stimulating and interesting.
Comment
Brad Efron
At first glance Leo Breiman's stimulating paper looks like an argument against parsimony and scientific insight, and in favor of black boxes with lots of knobs to twiddle. At second glance it still looks that way, but the paper is stimulating, and Leo has some important points to hammer home. At the risk of distortion I will try to restate one of those points, the most interesting one in my opinion, using less confrontational and more historical language.
From the point of view of statistical development
the twentieth century might be labeled 100 years
of unbiasedness. Following Fisher's lead, most of
our current statistical theory and practice revolves
around unbiased or nearly unbiased estimates (par-
ticularly MLEs), and tests based on such estimates.
The power of this theory has made statistics the
Brad Efron is Professor, Department of Statis-
tics, Sequoia Hall, 390 Serra Mall, Stanford Uni-
versity, Stanford, California 94305-4065 (e-mail:
[email protected]).
dominant interpretational methodology in dozens of
fields, but, as we say in California these days, it
is power purchased at a price: the theory requires a
modestly high ratio of signal to noise, sample size to
number of unknown parameters, to have much hope
of success. Good experimental design amounts to
enforcing favorable conditions for unbiased estimation and testing, so that the statistician won't find
himself or herself facing 100 data points and 50
parameters.
Now it is the twenty-rst century when, as the
paper reminds us, we are being asked to face prob-
lems that never heard of good experimental design.
Sample sizes have swollen alarmingly while goals
grow less distinct (find interesting data structure).
New algorithms have arisen to deal with new problems, a healthy sign it seems to me even if the innovators aren't all professional statisticians. There are
enough physicists to handle the physics case load,
but there are fewer statisticians and more statistics
problems, and we need all the help we can get. An
attractive feature of Leo's paper is his openness to
new ideas whatever their source.
The new algorithms often appear in the form of
black boxes with enormous numbers of adjustable
parameters (knobs to twiddle), sometimes more
knobs than data points. These algorithms can be
quite successful, as Leo points out, sometimes
more so than their classical counterparts. However,
unless the bias-variance trade-off has been sus-
pended to encourage new statistical industries, their
success must hinge on some form of biased estima-
tion. The bias may be introduced directly as with the
regularization of overparameterized linear mod-
els, more subtly as in the pruning of overgrown
regression trees, or surreptitiously as with support
vector machines, but it has to be lurking somewhere
inside the theory.
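A one-dimensional sketch of the first case, regularization introducing bias directly: ridge regression shrinks the least-squares slope toward zero, trading bias for variance (a deliberately tiny one-variable version with invented numbers, not an overparameterized model).

```python
# Toy data for a no-intercept linear model y = b * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

def ridge_slope(xs, ys, lam):
    # Minimizes sum (y - b*x)^2 + lam * b^2; closed-form solution below.
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

print(ridge_slope(xs, ys, 0.0))   # ordinary least squares (no shrinkage)
print(ridge_slope(xs, ys, 10.0))  # shrunk toward zero: biased, lower variance
```

Larger lam means more shrinkage: more bias, less variance.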
Of course the trouble with biased estimation is
that we have so little theory to fall back upon.
Fisher's information bound, which tells us how well
a (nearly) unbiased estimator can possibly perform,
is of no help at all in dealing with heavily biased
methodology. Numerical experimentation by itself,
unguided by theory, is prone to faddish wandering:
Rule 1. New methods always look better than old
ones. Neural nets are better than logistic regression, support vector machines are better than neural nets, etc. In fact it is very difficult to run an
honest simulation comparison, and easy to inadver-
tently cheat by choosing favorable examples, or by
not putting as much effort into optimizing the dull
old standard as the exciting new challenger.
Rule 2. Complicated methods are harder to crit-
icize than simple ones. By now it is easy to check
the efficiency of a logistic regression, but it is no
small matter to analyze the limitations of a support vector machine. One of the best things statisticians do, and something that doesn't happen outside our profession, is clarify the inferential basis of a proposed new methodology, a nice recent example being Friedman, Hastie and Tibshirani's analysis of boosting (2000). The past half-century has seen the clarification process successfully at work on
nonparametrics, robustness and survival analysis.
There has even been some success with biased esti-
mation in the form of Stein shrinkage and empirical
Bayes, but I believe the hardest part of this work
remains to be done. Papers like Leo's are a call for
more analysis and theory, not less.
Prediction is certainly an interesting subject but Leo's paper overstates both its role and our profession's lack of interest in it.
• The prediction culture, at least around Stanford, is a lot bigger than 2%, though its constituency changes and most of us wouldn't welcome being typecast.

• Estimation and testing are a form of prediction: "In our sample of 20 patients drug A outperformed drug B; would this still be true if we went on to test all possible patients?"

• Prediction by itself is only occasionally sufficient. The post office is happy with any method that predicts correct addresses from hand-written scrawls. Peter Gregory undertook his study for prediction purposes, but also to better understand the medical basis of hepatitis. Most statistical surveys have the identification of causal factors as their ultimate goal.
The hepatitis data was first analyzed by Gail
Gong in her 1982 Ph.D. thesis, which concerned pre-
diction problems and bootstrap methods for improv-
ing on cross-validation. (Cross-validation itself is an
uncertain methodology that deserves further critical scrutiny; see, for example, Efron and Tibshirani, 1996). The Scientific American discussion is
quite brief, a more thorough description appearing
in Efron and Gong (1983). Variables 12 or 17 (13
or 18 in Efron and Gong's numbering) appeared
as important in 60% of the bootstrap simulations,
which might be compared with the 59% for variable
19, the most for any single explanator.
In what sense are variable 12 or 17 or 19
important or not important? This is the kind of
interesting inferential question raised by prediction
methodology. Tibshirani and I made a stab at an
answer in our 1998 Annals paper. I believe that the
current interest in statistical prediction will eventu-
ally invigorate traditional inference, not eliminate
it.
A third front seems to have been opened in
the long-running frequentist-Bayesian wars by the
advocates of algorithmic prediction, who don't really believe in any inferential school. Leo's paper is at its
best when presenting the successes of algorithmic
modeling, which comes across as a positive devel-
opment for both statistical practice and theoretical
innovation. This isn't an argument against traditional data modeling any more than splines are an
argument against polynomials. The whole point of
science is to open up black boxes, understand their
insides, and build better boxes for the purposes of
mankind. Leo himself is a notably successful sci-
entist, so we can hope that the present paper was
written more as an advocacy device than as the con-
fessions of a born-again black boxist.
Comment
Bruce Hoadley
INTRODUCTION
Professor Breiman's paper is an important one
for statisticians to read. He and Statistical Science
should be applauded for making this kind of mate-
rial available to a large audience. His conclusions
are consistent with how statistics is often practiced
in business. This discussion will consist of an anec-
dotal recital of my encounters with the algorithmic
modeling culture. Along the way, areas of mild dis-
agreement with Professor Breiman are discussed. I
also include a few proposals for research topics in
algorithmic modeling.
CASE STUDY OF AN ALGORITHMIC
MODELING CULTURE
Although I spent most of my career in manage-
ment at Bell Labs and Bellcore, the last seven years
have been with the research group at Fair, Isaac.
This company provides all kinds of decision sup-
port solutions to several industries, and is very
well known for credit scoring. Credit scoring is a
great example of the problem discussed by Professor
Breiman. The input variables, x, might come from
company databases or credit bureaus. The output
variable, y, is some indicator of credit worthiness.
Credit scoring has been a protable business for
Fair, Isaac since the 1960s, so it is instructive to
look at the Fair, Isaac analytic approach to see how
it ts into the two cultures described by Professor
Breiman. The Fair, Isaac approach was developed by
engineers and operations research people and was
driven by the needs of the clients and the quality
of the data. The influences of the statistical community were mostly from the nonparametric side: things like jackknife and bootstrap.
Consider an example of behavior scoring, which
is used in credit card account management. For
pedagogical reasons, I consider a simplied version
(in the real world, things get more complicated) of
monthly behavior scoring. The input variables, x,
in this simplied version, are the monthly bills and
payments over the last 12 months. So the dimen-
sion of x is 24. The output variable is binary and
is the indicator of no severe delinquency over the
Dr. Bruce Hoadley is with Fair, Isaac and Co.,
Inc., 120 N. Redwood Drive, San Rafael, California
94903-1996 (e-mail: [email protected]).
next 6 months. The goal is to estimate the function f(x) = log(Pr{y = 1|x} / Pr{y = 0|x}). Professor
Breiman argues that some kind of simple logistic
regression from the data modeling culture is not
the way to solve this problem. I agree. Let's take a look at how the engineers at Fair, Isaac solved this problem, way back in the 1960s and 1970s.
The general form used for f(x) was called a
segmented scorecard. The process for developing
a segmented scorecard was clearly an algorithmic
modeling process.
The rst step was to transform x into many inter-
pretable variables called prediction characteristics.
This was done in stages. The rst stage was to
compute several time series derived from the original two. An example is the time series of months delinquent, a nonlinear function. The second stage
was to dene characteristics as operators on the
time series. For example, the number of times in
the last six months that the customer was more
than two months delinquent. This process can lead
to thousands of characteristics. A subset of these
characteristics passes a screen for further analysis.
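A sketch of one such characteristic as an operator on a derived time series (the function name, the encoding, and the data are invented for illustration):

```python
# months_delinquent is a 12-month derived time series, most recent month last.
def times_severely_delinquent(months_delinquent, window=6, threshold=2):
    # Characteristic: number of months in the recent window in which the
    # customer was more than `threshold` months delinquent.
    recent = months_delinquent[-window:]
    return sum(1 for m in recent if m > threshold)

series = [0, 0, 1, 2, 3, 0, 0, 1, 2, 3, 4, 0]
print(times_severely_delinquent(series))  # counts the months > 2 in the last 6
```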
The next step was to segment the population
based on the screened characteristics. The segmen-
tation was done somewhat informally. But when
I looked at the process carefully, the segments
turned out to be the leaves of a shallow-to-medium
tree. And the tree was built sequentially using
mostly binary splits based on the best splitting
characteristics, defined in a reasonable way. The
algorithm was manual, but similar in concept to the
CART algorithm, with a different purity index.
Next, a separate function, f(x), was developed for
each segment. The function used was called a score-
card. Each characteristic was chopped up into dis-
crete intervals or sets called attributes. A score-
card was a linear function of the attribute indicator
(dummy) variables derived from the characteristics.
The coefficients of the dummy variables were called
score weights.
This construction amounted to an explosion of
dimensionality. They started with 24 predictors.
These were transformed into hundreds of charac-
teristics and pared down to about 100 characteris-
tics. Each characteristic was discretized into about
10 attributes, and there were about 10 segments.
This makes 100 × 10 × 10 = 10,000 features. Yes
indeed, dimensionality is a blessing.
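A minimal sketch of a scorecard of this kind: each characteristic value is mapped to an attribute (an interval), and the score is linear in the attribute indicators. The characteristic names, bin edges, and score weights below are invented; a real scorecard would have on the order of 100 characteristics.

```python
def attribute_index(value, bin_edges):
    # Returns which interval (attribute) the characteristic value falls in.
    for i, edge in enumerate(bin_edges):
        if value < edge:
            return i
    return len(bin_edges)

def scorecard(characteristics, bins, weights):
    # Linear in the attribute indicator (dummy) variables: for each
    # characteristic, add the score weight of the indicated attribute.
    score = 0.0
    for name, value in characteristics.items():
        idx = attribute_index(value, bins[name])
        score += weights[name][idx]
    return score

bins = {"months_delinquent": [1, 3], "utilization": [0.3, 0.9]}
weights = {"months_delinquent": [40.0, 10.0, -25.0],
           "utilization": [30.0, 5.0, -20.0]}
print(scorecard({"months_delinquent": 0, "utilization": 0.5},
                bins, weights))  # 40.0 + 5.0 = 45.0
```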
What Fair, Isaac calls a scorecard is now elsewhere called a generalized additive model (GAM) with bin smoothing. However, a simple GAM would not do. Client demand, legal considerations and robustness over time led to the concept of "score engineering." For example, the score had to be monotonically decreasing in certain delinquency characteristics. Prior judgment also played a role in the design of scorecards. For some characteristics, the score weights were shrunk toward zero in order to moderate the influence of these characteristics. For other characteristics, the score weights were expanded in order to increase the influence of these characteristics. These adjustments were not done willy-nilly. They were done to overcome known weaknesses in the data.
So how did these Fair, Isaac pioneers fit these complicated GAM models back in the 1960s and 1970s? Logistic regression was not generally available. And besides, even today's commercial GAM software will not handle complex constraints. What they did was to maximize (subject to constraints) a measure called divergence, which measures how well the score, S, separates the two populations with different values of y. The formal definition of divergence is

2(E[S|y = 1] − E[S|y = 0])² / (V[S|y = 1] + V[S|y = 0]).

This constrained fitting was done with a heuristic nonlinear programming algorithm. A linear transformation was used to convert to a log odds scale.
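Taking the definition of divergence as 2(E[S|y = 1] − E[S|y = 0])² / (V[S|y = 1] + V[S|y = 0]), a minimal computation looks like this; the two score samples are invented:

```python
# Divergence between scores of the two populations, following the
# definition in the text. The score samples below are illustrative,
# not real credit data.
from statistics import mean, variance

def divergence(scores_good, scores_bad):
    num = 2.0 * (mean(scores_good) - mean(scores_bad)) ** 2
    den = variance(scores_good) + variance(scores_bad)
    return num / den

goods = [220, 240, 250, 260, 280]  # scores where y = 1
bads = [150, 170, 180, 190, 210]   # scores where y = 0
print(round(divergence(goods, bads), 3))  # → 9.8
```

Maximizing this quantity subject to monotonicity and shrinkage constraints is what distinguishes score engineering from a plain fit.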
Characteristic selection was done by analyzing the change in divergence after adding (removing) each candidate characteristic to (from) the current best model. The analysis was done informally to achieve good performance on the test sample. There were no formal tests of fit and no tests of score weight statistical significance. What counted was performance on the test sample, which was a surrogate for the future real world.
These early Fair, Isaac engineers were ahead of their time and charter members of the algorithmic modeling culture. The score formula was linear in an exploded dimension. A complex algorithm was used to fit the model. There was no claim that the final score formula was correct, only that it worked well on the test sample. This approach grew naturally out of the demands of the business and the quality of the data. The overarching goal was to develop tools that would help clients make better decisions through data. What emerged was a very accurate and palatable algorithmic modeling solution, which belies Breiman's statement: "The algorithmic modeling methods available in the pre-1980s decades seem primitive now." At a recent ASA meeting, I heard talks on treed regression, which looked like segmented scorecards to me.
After a few years with Fair, Isaac, I developed a talk entitled "Credit Scoring: A Parallel Universe of Prediction and Classification." The theme was that Fair, Isaac developed in parallel many of the concepts used in modern algorithmic modeling.
Certain aspects of the data modeling culture crept into the Fair, Isaac approach. The use of divergence was justified by assuming that the score distributions were approximately normal. So rather than making assumptions about the distribution of the inputs, they made assumptions about the distribution of the output. This assumption of normality was supported by a central limit theorem, which said that sums of many random variables are approximately normal, even when the component random variables are dependent and multiples of dummy random variables.
Modern algorithmic classification theory has shown that excellent classifiers have one thing in common: they all have large margin. Margin, M, is a random variable that measures the comfort level with which classifications are made. When the correct classification is made, the margin is positive; it is negative otherwise. Since margin is a random variable, the precise definition of large margin is tricky. It does not mean that E[M] is large. When I put my data modeling hat on, I surmised that large margin means that E[M]/√V(M) is large. Lo and behold, with this definition, large margin means large divergence.
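The margin statistic E[M]/√V(M) is equally simple to compute on a sample of margins; the margins below are invented for illustration:

```python
# Hoadley's "large margin" statistic E[M]/sqrt(V(M)) on a sample of
# margins (positive when the classification is correct, negative
# otherwise). The margin values are illustrative.
from statistics import mean, stdev

def margin_statistic(margins):
    return mean(margins) / stdev(margins)

margins = [0.9, 0.7, 0.8, -0.1, 0.6, 0.5]
print(round(margin_statistic(margins), 3))  # → 1.592
```

Like divergence, it is a mean separation standardized by spread, which is what makes the two definitions line up.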
Since the good old days at Fair, Isaac, there have been many improvements in the algorithmic modeling approaches. We now use genetic algorithms to screen very large structured sets of prediction characteristics. Our segmentation algorithms have been automated to yield even more predictive systems. Our palatable GAM modeling tool now handles smooth splines, as well as splines mixed with step functions, with all kinds of constraint capability. Maximizing divergence is still a favorite, but we also maximize constrained GLM likelihood functions. We also are experimenting with computationally intensive algorithms that will optimize any objective function that makes sense in the business environment. All of these improvements are squarely in the culture of algorithmic modeling.
OVERFITTING THE TEST SAMPLE
Professor Breiman emphasizes the importance of performance on the test sample. However, this can be overdone. The test sample is supposed to represent the population to be encountered in the future. But in reality, it is usually a random sample of the current population. High performance on the test sample does not guarantee high performance on future samples; things do change. There are practices that can be followed to protect against change.
One can monitor the performance of the models over time and develop new models when there has been sufficient degradation of performance. For some of Fair, Isaac's core products, the redevelopment cycle is about 18-24 months. Fair, Isaac also does "score engineering" in an attempt to make the models more robust over time. This includes damping the influence of individual characteristics, using monotone constraints and minimizing the size of the models subject to performance constraints on the current test sample. This score engineering amounts to moving from very nonparametric (no score engineering) to more semiparametric (lots of score engineering).
SPIN-OFFS FROM THE DATA
MODELING CULTURE
In Section 6 of Professor Breiman's paper, he says that multivariate analysis tools in statistics are "frozen at discriminant analysis and logistic regression in classification ..." This is not necessarily all that bad. These tools can carry you very far as long as you ignore all of the textbook advice on how to use them. To illustrate, I use the saga of the Fat Scorecard.
Early in my research days at Fair, Isaac, I was searching for an improvement over segmented scorecards. The idea was to develop first a very good global scorecard and then to develop small adjustments for a number of overlapping segments. To develop the global scorecard, I decided to use logistic regression applied to the attribute dummy variables. There were 36 characteristics available for fitting. A typical scorecard has about 15 characteristics. My variable selection was structured so that an entire characteristic was either in or out of the model. What I discovered surprised me. All models fit with anywhere from 27 to 36 characteristics had the same performance on the test sample. This is what Professor Breiman calls "Rashomon and the multiplicity of good models." To keep the model as small as possible, I chose the one with 27 characteristics. This model had 162 score weights (logistic regression coefficients), whose P-values ranged from 0.0001 to 0.984, with only one less than 0.05; i.e., statistically significant. The confidence intervals for the 162 score weights were useless. To get this great scorecard, I had to ignore the conventional wisdom on how to use logistic regression.
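A toy reconstruction of this kind of fit, assuming nothing about Fair, Isaac's actual tools: expand a discretized characteristic into attribute dummy variables, then fit a logistic regression by plain gradient descent. The data, attributes and resulting weights are synthetic; in this saturated one-characteristic case the fitted score weights are just the per-attribute log odds.

```python
# Dummy-variable expansion plus gradient-descent logistic regression,
# a minimal sketch of fitting score weights to attributes. Synthetic data.
import math

def expand(attribute, n_attributes):
    """One-hot (dummy) encoding of an attribute index."""
    row = [0.0] * n_attributes
    row[attribute] = 1.0
    return row

def fit_logistic(X, y, steps=2000, lr=0.5):
    """Plain batch gradient descent on the logistic log-likelihood."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

# Attribute 0 is mostly good (y = 1), attribute 2 mostly bad (y = 0).
data = ([(0, 1)] * 40 + [(0, 0)] * 10 + [(1, 1)] * 25 + [(1, 0)] * 25
        + [(2, 1)] * 10 + [(2, 0)] * 40)
X = [expand(a, 3) for a, _ in data]
y = [label for _, label in data]
weights = fit_logistic(X, y)
print([round(wj, 2) for wj in weights])  # → [1.39, 0.0, -1.39]
```

The fitted weights recover log(40/10), log(25/25) and log(10/40): the log odds of each attribute.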
So far, all I had was the scorecard GAM. So clearly I was missing all of those interactions that just had to be in the model. To model the interactions, I tried developing small adjustments on various overlapping segments. No matter how hard I tried, nothing improved the test sample performance over the global scorecard. I started calling it the Fat Scorecard.
Earlier, on this same data set, another Fair, Isaac researcher had developed a neural network with 2,000 connection weights. The Fat Scorecard slightly outperformed the neural network on the test sample. I cannot claim that this would work for every data set. But for this data set, I had developed an excellent algorithmic model with a simple data modeling tool.
Why did the simple additive model work so well? One idea is that some of the characteristics in the model are acting as surrogates for certain interaction terms that are not explicitly in the model. Another reason is that the scorecard is really a sophisticated neural net. The inputs are the original inputs. Associated with each characteristic is a hidden node. The summation functions coming into the hidden nodes are the transformations defining the characteristics. The transfer functions of the hidden nodes are the step functions (compiled from the score weights), all derived from the data. The final output is a linear function of the outputs of the hidden nodes. The result is highly nonlinear and interactive, when looked at as a function of the original inputs.
The Fat Scorecard study had an ingredient that is rare. We not only had the traditional test sample, but had three other test samples, taken one, two, and three years later. In this case, the Fat Scorecard outperformed the more traditional thinner scorecard for all four test samples. So the feared overfitting to the traditional test sample never materialized. To get a better handle on this you need an understanding of how the relationships between variables evolve over time.
I recently encountered another connection between algorithmic modeling and data modeling. In classical multivariate discriminant analysis, one assumes that the prediction variables have a multivariate normal distribution. But for a scorecard, the prediction variables are hundreds of attribute dummy variables, which are very nonnormal. However, if you apply the discriminant analysis algorithm to the attribute dummy variables, you can get a great algorithmic model, even though the assumptions of discriminant analysis are severely violated.
A SOLUTION TO THE OCCAM DILEMMA
I think that there is a solution to the Occam dilemma without resorting to goal-oriented arguments. Clients really do insist on interpretable functions, f(x). Segmented palatable scorecards are very interpretable by the customer and are very accurate. Professor Breiman himself gave single trees an A+ on interpretability. The shallow-to-medium tree in a segmented scorecard rates an A++. The palatable scorecards in the leaves of the trees are built from interpretable (possibly complex) characteristics. Sometimes we can't implement them until the lawyers and regulators approve. And that requires super interpretability. Our more sophisticated products have 10 to 20 segments and up to 100 characteristics (not all in every segment). These models are very accurate and very interpretable.
I coined a phrase called the "Ping-Pong theorem." This theorem says that if we revealed to Professor Breiman the performance of our best model and gave him our data, then he could develop an algorithmic model using random forests, which would outperform our model. But if he revealed to us the performance of his model, then we could develop a segmented scorecard, which would outperform his model. We might need more characteristics, attributes and segments, but our experience in this kind of contest is on our side.
However, all the competing models in this game of
Ping-Pong would surely be algorithmic models. But
some of them could be interpretable.
THE ALGORITHM TUNING DILEMMA
As far as I can tell, all approaches to algorithmic model building contain tuning parameters, either explicit or implicit. For example, we use penalized objective functions for fitting and marginal contribution thresholds for characteristic selection. With experience, analysts learn how to set these tuning parameters in order to get excellent test sample or cross-validation results. However, in industry and academia, there is sometimes a little tinkering, which involves peeking at the test sample. The result is some bias in the test sample or cross-validation results. This is the same kind of tinkering that upsets test of fit pureness. This is a challenge for the algorithmic modeling approach. How do you optimize your results and get an unbiased estimate of the generalization error?
GENERALIZING THE GENERALIZATION ERROR
In most commercial applications of algorithmic modeling, the function, f(x), is used to make decisions. In some academic research, classification is used as a surrogate for the decision process, and misclassification error is used as a surrogate for profit. However, I see a mismatch between the algorithms used to develop the models and the business measurement of the model's value. For example, at Fair, Isaac, we frequently maximize divergence. But when we argue the model's value to the clients, we don't necessarily brag about the great divergence. We try to use measures that the client can relate to. The ROC curve is one favorite, but it may not tell the whole story. Sometimes, we develop simulations of the client's business operation to show how the model will improve their situation. For example, in a transaction fraud control process, some measures of interest are false positive rate, speed of detection and dollars saved when 0.5% of the transactions are flagged as possible frauds. The 0.5% reflects the number of transactions that can be processed by the current fraud management staff. Perhaps what the client really wants is a score that will maximize the dollars saved in their fraud control system. The score that maximizes test set divergence or minimizes test set misclassifications does not do it. The challenge for algorithmic modeling is to find an algorithm that maximizes the generalization dollars saved, not generalization error.
We have made some progress in this area using
ideas from support vector machines and boosting.
By manipulating the observation weights used in
standard algorithms, we can improve the test set
performance on any objective of interest. But the
price we pay is computational intensity.
MEASURING IMPORTANCE: IS IT REALLY POSSIBLE?
I like Professor Breiman's idea for measuring the importance of variables in black box models. A Fair, Isaac spin on this idea would be to build accurate models for which no variable is much more important than other variables. There is always a chance that a variable and its relationships will change in the future. After that, you still want the model to work. So don't make any variable dominant.
I think that there is still an issue with measuring importance. Consider a set of inputs and an algorithm that yields a black box, for which x1 is important. From the Ping-Pong theorem there exists a set of input variables, excluding x1, and an algorithm that will yield an equally accurate black box. For this black box, x1 is unimportant.
IN SUMMARY
Algorithmic modeling is a very important area of statistics. It has evolved naturally in environments with lots of data and lots of decisions. But you can do it without suffering the Occam dilemma; for example, use medium trees with interpretable GAMs in the leaves. They are very accurate and interpretable. And you can do it with data modeling tools as long as you (i) ignore most textbook advice, (ii) embrace the blessing of dimensionality, (iii) use constraints in the fitting optimizations, (iv) use regularization, and (v) validate the results.
Comment
Emanuel Parzen
1. BREIMAN DESERVES OUR
APPRECIATION
I strongly support the view that statisticians must face the crisis of the difficulties in their practice of regression. Breiman alerts us to systematic blunders (leading to wrong conclusions) that have been committed applying current statistical practice of data modeling. In the spirit of "statistician, avoid doing harm," I propose that the first goal of statistical ethics should be to guarantee to our clients that any mistakes in our analysis are unlike any mistakes that statisticians have made before.
The two goals in analyzing data which Leo calls "prediction" and "information" I prefer to describe as "management" and "science." Management seeks profit, practical answers (predictions) useful for decision making in the short run. Science seeks truth, fundamental knowledge about nature which provides understanding and control in the long run. As a historical note, Student's t-test has many scientific applications but was invented by Student as a management tool to make Guinness beer better (bitter?).
Breiman does an excellent job of presenting the case that the practice of statistical science, using only the conventional data modeling culture, needs reform. He deserves much thanks for alerting us to the algorithmic modeling culture. Breiman warns us that "if the model is a poor emulation of nature, the conclusions may be wrong." This situation, which I call "the right answer to the wrong question," is called by statisticians the error of the third kind. Engineers at M.I.T. define "suboptimization" as "elegantly solving the wrong problem."
Emanuel Parzen is Distinguished Professor, Department of Statistics, Texas A&M University, 415 C Block Building, College Station, Texas 77843 (e-mail: [email protected]).
Breiman presents the potential benefits of algorithmic models (better predictive accuracy than data models, and consequently better information about the underlying mechanism and avoiding questionable conclusions which result from weak predictive accuracy) and support vector machines (which provide almost perfect separation and discrimination between two classes by increasing the dimension of the feature set). He convinces me that the methods of algorithmic modeling are important contributions to the tool kit of statisticians.
If the profession of statistics is to remain healthy, and not limit its research opportunities, statisticians must learn about the cultures in which Breiman works, but also about many other cultures of statistics.
2. HYPOTHESES TO TEST TO AVOID
BLUNDERS OF STATISTICAL MODELING
Breiman deserves our appreciation for pointing out generic deviations from standard assumptions (which I call bivariate dependence and two-sample conditional clustering) for which we should routinely check. "Test null hypothesis" can be a useful algorithmic concept if we use tests that diagnose in a model-free way the directions of deviation from the null hypothesis model.
Bivariate dependence (correlation) may exist between features [independent (input) variables] in a regression, causing them to be proxies for each other and our models to be unstable, with different forms of regression models being equally well fitting. We need tools to routinely test the hypothesis of statistical independence of the distributions of independent (input) variables.
Two sample conditional clustering arises in the distributions of independent (input) variables to discriminate between two classes, which we call the conditional distribution of input variables X given each class. Class I may have only one mode (cluster) at low values of X while class II has two modes (clusters) at low and high values of X. We would like to conclude that high values of X are observed only for members of class II but low values of X occur for members of both classes. The hypothesis we propose testing is equality of the pooled distribution of both samples and the conditional distribution of sample I, which is equivalent to P[class I | X] = P[class I]. For successful discrimination one seeks to increase the number (dimension) of inputs (features) X to make P[class I | X] close to 1 or 0.
3. STATISTICAL MODELING, MANY
CULTURES, STATISTICAL METHODS MINING
Breiman speaks of two cultures of statistics; I believe statistics has many cultures. At specialized workshops (on maximum entropy methods or robust methods or Bayesian methods or ...) a main topic of conversation is "Why don't all statisticians think like us?"
I have my own eclectic philosophy of statistical modeling to which I would like to attract serious attention. I call it "statistical methods mining," which seeks to provide a framework to synthesize and apply the past half-century of methodological progress in computationally intensive methods for statistical modeling, including EDA (exploratory data analysis), FDA (functional data analysis), density estimation, Model DA (model selection criteria data analysis), Bayesian priors on function space, continuous parameter regression analysis and reproducing kernels, fast algorithms, Kalman filtering, complexity, information, quantile data analysis, nonparametric regression, conditional quantiles.
I believe data mining is a special case of data modeling. We should teach in our introductory courses that one meaning of statistics is statistical data modeling done in a systematic way by an iterated series of stages which can be abbreviated SIEVE (specify problem and general form of models, identify tentatively numbers of parameters and specialized models, estimate parameters, validate goodness-of-fit of estimated models, estimate final model nonparametrically or algorithmically). MacKay and Oldford (2000) brilliantly present the statistical method as a series of stages PPDAC (problem, plan, data, analysis, conclusions).
4. QUANTILE CULTURE,
ALGORITHMIC MODELS
A culture of statistical data modeling based on quantile functions, initiated in Parzen (1979), has been my main research interest since 1976. In my discussion to Stone (1977) I outlined a novel approach to estimation of conditional quantile functions which I only recently fully implemented. I would like to extend the concept of algorithmic statistical models in two ways: (1) to mean data fitting by representations which use approximation theory and numerical analysis; (2) to use the notation of probability to describe empirical distributions of samples (data sets) which are not assumed to be generated by a random mechanism.
My quantile culture has not yet become widely applied because you cannot give away a good idea, you have to sell it (by integrating it in computer programs usable by applied statisticians and thus promote statistical methods mining).
A quantile function Q(u), 0 ≤ u ≤ 1, is the inverse F⁻¹(u) of a distribution function F(x), −∞ < x < ∞. Its rigorous definition is Q(u) = inf(x: F(x) ≥ u). When F is continuous with density f, F(Q(u)) = u and q(u) = Q′(u) = 1/f(Q(u)). We use the notation Q for a true unknown quantile function, Q~ for a raw estimator from a sample, and Q^ for a smooth estimator of the true Q.
Concepts defined for Q(u) can be defined also for other versions of quantile functions. Quantile functions can compress data by a five-number summary, values of Q(u) at u = 0.5, 0.25, 0.75, 0.1, 0.9 (or 0.05, 0.95). Measures of location and scale are QM = 0.5(Q(0.25) + Q(0.75)), QD = 2(Q(0.75) − Q(0.25)). To use quantile functions to identify distributions fitting data we propose the quantile quartile function Q/Q(u) = (Q(u) − QM)/QD. The five-number summary of a distribution becomes QM, QD, Q/Q(0.5) skewness, Q/Q(0.1) left-tail, Q/Q(0.9) right-tail. The elegance of Q/Q(u) is its universal values at u = 0.25, 0.75. Values |Q/Q(u)| > 1 are outliers as defined by Tukey EDA.
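These empirical quantities are straightforward to compute; the sketch below implements the left-continuous sample quantile Q(u) = inf{x : F(x) ≥ u} together with QM, QD and Q/Q(u) on an invented sample:

```python
# Empirical quantile function Q(u), the location/scale measures QM and QD,
# and the quantile quartile function Q/Q(u) = (Q(u) - QM)/QD.
# The sample is invented for illustration.

def Q(u, sample):
    """Left-continuous empirical quantile: smallest x with F(x) >= u."""
    xs = sorted(sample)
    n = len(xs)
    for i, x in enumerate(xs, start=1):
        if i / n >= u:
            return x
    return xs[-1]

def QM(sample):
    return 0.5 * (Q(0.25, sample) + Q(0.75, sample))

def QD(sample):
    return 2.0 * (Q(0.75, sample) - Q(0.25, sample))

def QQ(u, sample):
    return (Q(u, sample) - QM(sample)) / QD(sample)

sample = [1, 2, 3, 4, 5, 6, 7, 8]
print(Q(0.5, sample), QM(sample), QD(sample), QQ(0.75, sample))  # 4 4.0 8.0 0.25
```

On any sample, Q/Q(0.25) = −0.25 and Q/Q(0.75) = 0.25 by construction: the universal quartile values.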
For the fundamental problem of comparison of two distributions F and G we define the comparison distribution D(u; F, G) and comparison density d(u; F, G) = D′(u; F, G). For F, G continuous, define D(u; F, G) = G(F⁻¹(u)) and d(u; F, G) = g(F⁻¹(u))/f(F⁻¹(u)), assuming f(x) = 0 implies g(x) = 0, written G ≪ F. For F, G discrete with probability mass functions p_F and p_G define (assuming G ≪ F) d(u; F, G) = p_G(F⁻¹(u))/p_F(F⁻¹(u)).
Our applications of comparison distributions often assume F to be an unconditional distribution and G a conditional distribution. To analyze bivariate data (X, Y) a fundamental tool is the dependence density d(t, u) = d(u; F_Y, F_{Y|X=Q_X(t)}). When (X, Y) is jointly continuous,

d(t, u) = f_{X,Y}(Q_X(t), Q_Y(u)) / (f_X(Q_X(t)) f_Y(Q_Y(u))).
The statistical independence hypothesis F_{X,Y} = F_X F_Y is equivalent to d(t, u) = 1 for all t, u. A fundamental formula for estimation of conditional quantile functions is

Q_{Y|X=x}(u) = Q_Y(D⁻¹(u; F_Y, F_{Y|X=x})),

that is, Q_{Y|X=x}(u) = Q_Y(s) where u = D(s; F_Y, F_{Y|X=x}).

To compare the distributions of two univariate samples, let Y denote the continuous response variable and X be binary 0, 1 denoting the population from which Y is observed. The comparison density is defined (note F_Y is the pooled distribution function)

d_1(u) = d(u; F_Y, F_{Y|X=1}) = P[X = 1 | Y = Q_Y(u)] / P[X = 1].
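As a sketch, d_1(u) can be estimated by cutting the pooled sample into equal-count bins and comparing the share of sample I in each bin with its overall share; the samples and the choice of five bins are invented:

```python
# Binned estimate of the two-sample comparison density
# d1(u) = P[X=1 | Y = Q_Y(u)] / P[X=1], using equal-count bins of the
# pooled sample. The two samples are illustrative.

def comparison_density(sample1, sample0, n_bins=5):
    labeled = [(y, 1) for y in sample1] + [(y, 0) for y in sample0]
    labeled.sort()                      # order by the pooled quantiles of Y
    n = len(labeled)
    p1 = len(sample1) / n               # overall share of population I
    dens = []
    for b in range(n_bins):
        chunk = labeled[b * n // n_bins:(b + 1) * n // n_bins]
        frac1 = sum(label for _, label in chunk) / len(chunk)
        dens.append(frac1 / p1)
    return dens

sample1 = [6, 7, 8, 9, 10]  # population I sits at high values of Y
sample0 = [1, 2, 3, 4, 5]
print(comparison_density(sample1, sample0))  # → [0.0, 0.0, 1.0, 2.0, 2.0]
```

Values above 1 mark quantile regions of Y where population I is over-represented relative to the pooled sample.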
5. QUANTILE IDEAS FOR HIGH
DIMENSIONAL DATA ANALYSIS
By high dimensional data we mean multivariate data (Y_1, ..., Y_m). We form approximate high dimensional comparison densities d(u_1, ..., u_m) to test statistical independence of the variables and, when we have two samples, d_1(u_1, ..., u_m) to test equality of sample I with the pooled sample. All our distributions are empirical distributions but we use notation for true distributions in our formulas. Note that

∫_0^1 du_1 ... ∫_0^1 du_m d(u_1, ..., u_m) = ∫_0^1 du_1 ... ∫_0^1 du_m d_1(u_1, ..., u_m) = 1.
A decile quantile bin B(k_1, ..., k_m) is defined to be the set of observations (Y_1, ..., Y_m) satisfying, for j = 1, ..., m, Q_{Y_j}((k_j − 1)/10) < Y_j ≤ Q_{Y_j}(k_j/10). Instead of deciles k/10 we could use k/M for another base M.
To test the hypothesis that Y_1, ..., Y_m are statistically independent we form, for all k_j = 1, ..., 10,

d(k_1, ..., k_m) = P[Bin(k_1, ..., k_m)] / P[Bin(k_1, ..., k_m) | independence].
To test equality of distribution of a sample from population I and the pooled sample we form

d_1(k_1, ..., k_m) = P[Bin(k_1, ..., k_m) | population I] / P[Bin(k_1, ..., k_m) | pooled sample]

for all (k_1, ..., k_m) such that the denominator is positive, and otherwise defined arbitrarily. One can show (letting X denote the population observed)

d_1(k_1, ..., k_m) = P[X = I | observation from Bin(k_1, ..., k_m)] / P[X = I].
To test the null hypotheses in ways that detect directions of deviations from the null hypothesis our recommended first step is quantile data analysis of the values d(k_1, ..., k_m) and d_1(k_1, ..., k_m).
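As an illustrative sketch for the bivariate case (with M = 2 quantile bins per margin rather than deciles, and invented data), the binned dependence density can be computed directly from cell counts:

```python
# Binned dependence density d(k1, k2) = P[Bin(k1, k2)] / (P[k1] P[k2])
# on M quantile bins per margin; values near 1 everywhere indicate
# independence. Data and bin count are invented.

def dependence_density(pairs, M=2):
    def bin_index(value, sample):
        n = len(sample)
        rank = sum(1 for x in sample if x <= value)  # n * empirical F(value)
        return min(M - 1, (rank - 1) * M // n)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(pairs)
    counts = [[0] * M for _ in range(M)]
    for x, y in pairs:
        counts[bin_index(x, xs)][bin_index(y, ys)] += 1
    rows = [sum(r) for r in counts]
    cols = [sum(counts[i][j] for i in range(M)) for j in range(M)]
    return [[(counts[i][j] / n) / ((rows[i] / n) * (cols[j] / n))
             if rows[i] and cols[j] else 0.0 for j in range(M)]
            for i in range(M)]

dependent = [(i, i) for i in range(8)]                      # Y = X exactly
independent = [(x, y) for x in range(4) for y in range(4)]  # all combinations
print(dependence_density(dependent))    # → [[2.0, 0.0], [0.0, 2.0]]
print(dependence_density(independent))  # → [[1.0, 1.0], [1.0, 1.0]]
```

The dependent data piles all mass on the diagonal cells, while the independent data gives a flat table of ones, exactly the directional diagnostic the text describes.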
I appreciate this opportunity to bring to the attention of researchers on high dimensional data analysis the potential of quantile methods. My conclusion is that statistical science has many cultures and statisticians will be more successful when they emulate Leo Breiman and apply as many cultures as possible (which I call statistical methods mining). Additional references are on my web site at stat.tamu.edu.
Rejoinder
Leo Breiman
I thank the discussants. I'm fortunate to have comments from a group of experienced and creative statisticians, even more so in that their comments are diverse. Manny Parzen and Bruce Hoadley are more or less in agreement, Brad Efron has serious reservations and D. R. Cox is in downright disagreement.
I address Professor Cox's comments first, since our disagreement is crucial.
D. R. COX
Professor Cox is a worthy and thoughtful adversary. We walk down part of the trail together and then sharply diverge. To begin, I quote: "Professor Breiman takes data as his starting point. I would prefer to start with an issue, a question, or a scientific hypothesis, ..." I agree, but would expand the starting list to include the prediction of future events. I have never worked on a project that has started with "Here is a lot of data; let's look at it and see if we can get some ideas about how to use it." The data has been put together and analyzed starting with an objective.
C1 Data Models Can Be Useful
Professor Cox is committed to the use of data models. I readily acknowledge that there are situations where a simple data model may be useful and appropriate; for instance, if the science of the mechanism producing the data is well enough known to determine the model apart from estimating parameters. There are also situations of great complexity posing important issues and questions in which there is not enough data to resolve the questions to the accuracy desired. Simple models can then be useful in giving qualitative understanding, suggesting future research areas and the kind of additional data that needs to be gathered.
At times, there is not enough data on which to base predictions; but policy decisions need to be made. In this case, constructing a model using whatever data exists, combined with scientific common sense and subject-matter knowledge, is a reasonable path. Professor Cox points to examples when he writes:

Often the prediction is under quite different conditions from the data; what is the likely progress of the incidence of the epidemic of v-CJD in the United Kingdom, what would be the effect on annual incidence of cancer in the United States of reducing by 10% the medical use of X-rays, etc.? That is, it may be desired to predict the consequences of something only indirectly addressed by the data available for analysis. ... Prediction, always hazardous, without some understanding of the underlying process and linking with other sources of information, becomes more and more tentative.

I agree.
C2 Data Models Only
From here on we part company. Professor Cox's discussion consists of a justification of the use of data models to the exclusion of other approaches. For instance, although he admits, "I certainly accept, although it goes somewhat against the grain to do so, that there are situations where a directly empirical approach is better," the two examples he gives of such situations are short-term economic forecasts and real-time flood forecasts, among the less interesting of all of the many current successful algorithmic applications. In his view, the only use for algorithmic models is short-term forecasting; there are no comments on the rich information about the data and covariates available from random forests or in the many fields, such as pattern recognition, where algorithmic modeling is fundamental.
He advocates construction of stochastic data models that summarize the understanding of the phenomena under study. The methodology in the Cox and Wermuth book (1996) attempts to push understanding further by finding causal orderings in the covariate effects. The sixth chapter of this book contains illustrations of this approach on four data sets.
The first is a small data set of 68 patients with seven covariates from a pilot study at the University of Mainz to identify psychological and socioeconomic factors possibly important for glucose control in diabetes patients. This is a regression-type problem with the response variable measured by GHb (glycosylated haemoglobin). The model fitting is done by a number of linear regressions and validated by the checking of various residual plots. The only other reference to model validation is the statement, "R² = 0.34, reasonably large by the standards usual for this field of study." Predictive accuracy is not computed, either for this example or for the three other examples.
My comments on the questionable use of data models apply to this analysis. Incidentally, I tried to get one of the data sets used in the chapter to conduct an alternative analysis, but it was not possible to get it before my rejoinder was due. It would have been interesting to contrast our two approaches.
C3 Approach to Statistical Problems
Basing my critique on a small illustration in a
book is not fair to Professor Cox. To be fairer, I quote
his words about the nature of a statistical analysis:
Formal models are useful and often
almost, if not quite, essential for inci-
sive thinking. Descriptively appealing
and transparent methods with a rm
model base are the ideal. Notions of sig-
nicance tests, condence intervals, pos-
terior intervals, and all the formal appa-
ratus of inference are valuable tools to be
used as guides, but not in a mechanical
way; they indicate the uncertainty that
would apply under somewhat idealized,
maybe very idealized, conditions and as
such are often lower bounds to real
uncertainty. Analyses and model devel-
opment are at least partly exploratory.
Automatic methods of model selection
(and of variable selection in regression-
like problems) are to be shunned or, if
use is absolutely unavoidable, are to be
examined carefully for their effect on
the final conclusions. Unfocused tests of
model adequacy are rarely helpful.
Given the right kind of data: relatively small sam-
ple size and a handful of covariates, I have no doubt
that his experience and ingenuity in the craft of
model construction would result in an illuminating
model. But data characteristics are rapidly chang-
ing. In many of the most interesting current prob-
lems, the idea of starting with a formal model is not
tenable.
C4 Changes in Problems
My impression from Professor Cox's comments is
that he believes every statistical problem can be
best solved by constructing a data model. I believe
that statisticians need to be more pragmatic. Given
a statistical problem, find a good solution, whether
it is a data model, an algorithmic model or (although
it is somewhat against my grain), a Bayesian data
model or a completely different approach.
My work on the 1990 Census Adjustment
(Breiman, 1994) involved a painstaking analysis of
the sources of error in the data. This was done by
a long study of thousands of pages of evaluation
documents. This seemed the most appropriate way
of answering the question of the accuracy of the
adjustment estimates.
The conclusion that the adjustment estimates
were largely the result of bad data has never been
effectively contested and is supported by the results
of the Year 2000 Census Adjustment effort. The
accuracy of the adjustment estimates was, arguably,
the most important statistical issue of the last
decade, and could not be resolved by any amount
of statistical modeling.
A primary reason why we cannot rely on data
models alone is the rapid change in the nature of
statistical problems. The realm of applications of
statistics has expanded more in the last twenty-five
years than in any comparable period in the history
of statistics.
In an astronomy and statistics workshop this
year, a speaker remarked that in twenty-five years
we have gone from being a small sample-size science
to a very large sample-size science. Astronomical
data bases now contain data on two billion objects
comprising over 100 terabytes and the rate of new
information is accelerating.
A recent biostatistics workshop emphasized the
analysis of genetic data. An exciting breakthrough
is the use of microarrays to locate regions of gene
activity. Here the sample size is small, but the num-
ber of variables ranges in the thousands. The ques-
tions are which specific genes contribute to the
occurrence of various types of diseases.
Questions about the areas of thinking in the brain
are being studied using functional MRI. The data
gathered in each run consists of a sequence of
150,000 pixel images. Gigabytes of satellite infor-
mation are being used in projects to predict and
understand short- and long-term environmental
and weather changes.
Underlying this rapid change is the rapid evo-
lution of the computer, a device for gathering, storing,
and manipulating incredible amounts of data,
together with technological advances incorporating
computing, such as satellites and MRI machines.
The problems are exhilarating. The methods used
in statistics for small sample sizes and a small num-
ber of variables are not applicable. John Rice, in his
summary talk at the astronomy and statistics work-
shop said, "Statisticians have to become opportunis-
tic." That is, faced with a problem, they must find
a reasonable solution by whatever method works.
One surprising aspect of both workshops was how
opportunistic statisticians faced with genetic and
astronomical data had become. Algorithmic meth-
ods abounded.
C5 Mainstream Procedures and Tools
Professor Cox views my critique of the use of data
models as based in part on a caricature. Regard-
ing my references to articles in journals such as
JASA, he states that they are not typical of main-
stream statistical analysis, but are used to illustrate
technique rather than explain the process of anal-
ysis. His concept of mainstream statistical analysis
is summarized in the quote given in my Section C3.
It is the kind of thoughtful and careful analysis that
he prefers and is capable of.
Following this summary is the statement:
By contrast, Professor Breiman equates
mainstream applied statistics to a rel-
atively mechanical process involving
somehow or other choosing a model, often
a default model of standard form, and
applying standard methods of analysis
and goodness-of-fit procedures.
The disagreement is definitional: what is main-
stream? In terms of numbers my definition of main-
stream prevails, I guess, at a ratio of at least 100
to 1. Simply count the number of people doing their
statistical analysis using canned packages, or count
the number of SAS licenses.
In the academic world, we often overlook the fact
that we are a small slice of all statisticians and
an even smaller slice of all those doing analyses of
data. There are many statisticians and nonstatis-
ticians in diverse elds using data to reach con-
clusions and depending on tools supplied to them
by SAS, SPSS, etc. Their conclusions are important
and are sometimes published in medical or other
subject-matter journals. They do not have the sta-
tistical expertise, computer skills, or time needed to
construct more appropriate tools. I was faced with
this problem as a consultant when confined to using
the BMDP linear regression, stepwise linear regres-
sion, and discriminant analysis programs. My con-
cept of decision trees arose when I was faced with
nonstandard data that could not be treated by these
standard methods.
When I rejoined the university after my consult-
ing years, one of my hopes was to provide better
general purpose tools for the analysis of data. The
first step in this direction was the publication of the
CART book (Breiman et al., 1984). CART and other
similar decision tree methods are used in thou-
sands of applications yearly in many fields. CART
has proved robust and reliable. There are others that
are more recent; random forests is the latest. A pre-
liminary version of random forests is free source
with f77 code, S and R interfaces available at
www.stat.berkeley.edu/users/breiman.
A nearly completed second version will also be
put on the web site and translated into Java by the
Weka group. My collaborator, Adele Cutler, and I
will continue to upgrade, add new features, graph-
ics, and a good interface.
My philosophy about the field of academic statis-
tics is that we have a responsibility to provide
the many people working in applications outside of
academia with useful, reliable, and accurate analy-
sis tools. Two excellent examples are wavelets and
decision trees. More are needed.
BRAD EFRON
Brad seems to be a bit puzzled about how to react
to my article. I'll start with what appears to be his
biggest reservation.
E1 From Simple to Complex Models
Brad is concerned about the use of complex
models without simple interpretability in their
structure, even though these models may be the
most accurate predictors possible. But the evolution
of science is from simple to complex.
The equations of general relativity are consider-
ably more complex and difficult to understand than
Newton's equations. The quantum mechanical equa-
tions for a system of molecules are extraordinarily
difficult to interpret. Physicists accept these com-
plex models as the facts of life, and do their best to
extract usable information from them.
There is no consideration given to trying to under-
stand cosmology on the basis of Newton's equations
or nuclear reactions in terms of hard ball models for
atoms. The scientific approach is to use these com-
plex models as the best possible descriptions of the
physical world and try to get usable information out
of them.
There are many engineering and scientific appli-
cations where simpler models, such as Newton's
laws, are certainly sufficient, say, in structural
design. Even here, for larger structures, the model
is complex and the analysis difficult. In scientific
fields outside statistics, answering questions is done
by extracting information from increasingly com-
plex and accurate models.
The approach I suggest is similar. In genetics,
astronomy and many other current areas statistics
is needed to answer questions, construct the most
accurate possible model, however complex, and then
extract usable information from it.
Random forests is in use at some major drug
companies whose statisticians were impressed by
its ability to determine gene expression (variable
importance) in microarray data. They were not
concerned about its complexity or black-box appear-
ance.
E2 Prediction
"Leo's paper overstates both its [predic-
tion's] role, and our profession's lack of
interest in it … Most statistical surveys
have the identification of causal factors
as their ultimate role."
My point was that it is difficult to tell, using
goodness-of-fit tests and residual analysis, how well
a model fits the data. An estimate of its test set accu-
racy is a preferable assessment. If, for instance, a
model gives predictive accuracy only slightly better
than the "all survived" or other baseline estimates,
we can't put much faith in its reliability in the iden-
tification of causal factors.
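The arithmetic behind this point is easy to make concrete. The sketch below uses invented numbers (a 90/10 class split and a hypothetical model) to show how little a 92% accuracy means when the "all survived" baseline already achieves 90%:

```python
# Hypothetical data: 90 survivors (coded 1) and 10 deaths (coded 0).
y_true = [1] * 90 + [0] * 10

def accuracy(pred, truth):
    """Fraction of cases where the prediction matches the truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

# The "all survived" baseline ignores the covariates entirely.
baseline_pred = [1] * 100
baseline_acc = accuracy(baseline_pred, y_true)  # 0.90

# An invented model: right on all 90 survivors and on 2 of the 10 deaths.
model_pred = [1] * 90 + [0, 0] + [1] * 8
model_acc = accuracy(model_pred, y_true)        # 0.92

# A 2-point gain over a 90% baseline is weak evidence that the model
# has captured anything about the causal structure of the data.
print(baseline_acc, model_acc)
```

The same comparison on held-out test data, rather than the training sample, is the assessment argued for above.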
I agree that often statistical surveys have
the identification of causal factors as their ultimate
role. I would add that the more predictively accu-
rate the model is, the more faith can be put into the
variables that it fingers as important.
E3 Variable Importance
A significant and often overlooked point raised
by Brad is what meaning can one give to state-
ments that variable X is important or not impor-
tant. This has puzzled me on and off for quite a
while. In fact, variable importance has always been
defined operationally. In regression the important
variables are defined by doing best subsets or vari-
able deletion.
Another approach used in linear methods such as
logistic regression and survival models is to com-
pare the size of the slope estimate for a variable to
its estimated standard error. The larger the ratio,
the more important the variable. Both of these def-
initions can lead to erroneous conclusions.
My definition of variable importance is based
on prediction. A variable might be considered
important if deleting it seriously affects prediction
accuracy. This brings up the problem that if two
variables are highly correlated, deleting one or the
other of them will not affect prediction accuracy.
Deleting both of them may degrade accuracy consid-
erably. The definition used in random forests spots
both variables.
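To make the contrast concrete, here is a small self-contained sketch. It is not Breiman's code: the data, the two-variable "model," and the permutation scheme are all invented for illustration. Deletion-based importance is fooled by a correlated pair, while a permutation-style measure, in the spirit of the random forests definition, flags the variable the fitted model actually leans on:

```python
import random

random.seed(0)

# Toy data: x1 drives the response; x2 is a near-copy of x1.
n = 1000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [v + random.gauss(0, 0.1) for v in x1]
y = [1 if v > 0 else 0 for v in x1]

def model(a, b):
    # A fitted model that happens to split its weight across both variables.
    return [1 if (ai + bi) / 2 > 0 else 0 for ai, bi in zip(a, b)]

def accuracy(pred):
    return sum(p == t for p, t in zip(pred, y)) / n

base = accuracy(model(x1, x2))

# Deletion-style importance: drop x1 and let x2 stand in for it (mimicking
# a refit). Because x2 carries the same signal, accuracy barely moves, so
# deletion wrongly declares x1 unimportant.
del_x1 = accuracy(model(x2, x2))

# Permutation-style importance: shuffle x1's values, keep the model fixed.
# The model still leans on x1, so accuracy drops sharply; applying the same
# test to x2 would flag it too -- both correlated variables are spotted.
x1_perm = x1[:]
random.shuffle(x1_perm)
perm_x1 = accuracy(model(x1_perm, x2))

print(base, del_x1, perm_x1)
```

Running this, the baseline and deletion accuracies stay high while the permuted accuracy falls well below them, which is the behavior described above.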
Importance does not yet have a satisfactory the-
oretical definition (I haven't been able to locate the
article Brad references but I'll keep looking). It
depends on the dependencies between the output
variable and the input variables, and on the depen-
dencies between the input variables. The problem
begs for research.
E4 Other Reservations
Sample sizes have swollen alarmingly
while goals grow less distinct (find inter-
esting data structure).
I have not noticed any increasing fuzziness in
goals, only that they have gotten more diverse. In
the last two workshops I attended (genetics and
astronomy) the goals in using the data were clearly
laid out. Searching for structure is rarely seen
even though data may be in the terabyte range.
The new algorithms often appear in the
form of black boxes with enormous num-
bers of adjustable parameters (knobs to
twiddle).
This is a perplexing statement and perhaps I don't
understand what Brad means. Random forests has
only one adjustable parameter that needs to be set
for a run, is insensitive to the value of this param-
eter over a wide range, and has a quick and simple
way for determining a good value. Support vector
machines depend on the settings of 1–2 parameters.
Other algorithmic models are similarly sparse in the
number of knobs that have to be twiddled.
New methods always look better than old
ones. Complicated models are harder
to criticize than simple ones.
In 1992 I went to my first NIPS conference. At
that time, the exciting algorithmic methodology was
neural nets. My attitude was grim skepticism. Neu-
ral nets had been given too much hype, just as AI
had been given and failed expectations. I came away
a believer. Neural nets delivered on the bottom line!
In talk after talk, in problem after problem, neu-
ral nets were being used to solve difcult prediction
problems with test set accuracies better than any-
thing I had seen up to that time.
My attitude toward new and/or complicated meth-
ods is pragmatic. Prove that you've got a better
mousetrap and I'll buy it. But the proof had bet-
ter be concrete and convincing.
Brad questions where the bias and variance have
gone. It is surprising when, trained in classical bias-
variance terms and convinced of the curse of dimen-
sionality, one encounters methods that can handle
thousands of variables with little loss of accuracy.
It is not voodoo statistics; there is some simple the-
ory that illuminates the behavior of random forests
(Breiman, 1999). I agree that more theoretical work
is needed to increase our understanding.
Brad is an innovative and flexible thinker who
has contributed much to our eld. He is opportunis-
tic in problem solving and may, perhaps not overtly,
already have algorithmic modeling in his bag of
tools.
BRUCE HOADLEY
I thank Bruce Hoadley for his description of
the algorithmic procedures developed at Fair, Isaac
since the 1960s. They sound like people I would
enjoy working with. Bruce makes two points of mild
contention. One is the following:
High performance (predictive accuracy)
on the test sample does not guaran-
tee high performance on future samples;
things do change.
I agreealgorithmic models accurate in one con-
text must be modied to stay accurate in others.
This does not necessarily imply that the way the
model is constructed needs to be altered, but that
data gathered in the new context should be used in
the construction.
His other point of contention is that the Fair, Isaac
algorithm retains interpretability, so that it is pos-
sible to have both accuracy and interpretability. For
clients who like to know what's going on, that's a
sellable item. But developments in algorithmic mod-
eling indicate that the Fair, Isaac algorithm is an
exception.
A computer scientist working in the machine
learning area joined a large money management
company some years ago and set up a group to do
portfolio management using stock predictions given
by large neural nets. When we visited, I asked how
he explained the neural nets to clients. "Simple," he
said; "We fit binary trees to the inputs and outputs
of the neural nets and show the trees to the clients.
Keeps them happy!" In both stock prediction and
credit rating, the priority is accuracy. Interpretabil-
ity is a secondary goal that can be finessed.
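The surrogate-tree trick described above is easy to sketch. In this hypothetical example (the black box, the data, and the stump fitter are all invented), a one-split "tree," a decision stump, is fit to input/output pairs queried from the black box and recovers its decision rule:

```python
import random

random.seed(1)

# A stand-in black box playing the role of the neural net.
def black_box(x):
    return 1 if x > 0.6 else 0

# Query the black box on a sample of its own inputs...
X = [random.random() for _ in range(500)]
y = [black_box(x) for x in X]

# ...and fit a one-split surrogate "tree" to the input/output pairs
# by scanning candidate thresholds for the best agreement.
def fit_stump(X, y):
    best_t, best_acc = None, -1.0
    for t in sorted(set(X)):
        acc = sum((1 if x > t else 0) == yi for x, yi in zip(X, y)) / len(y)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

t, acc = fit_stump(X, y)
# The stump recovers a threshold near 0.6, giving clients a readable
# summary of what the black box is doing.
print(t, acc)
```

A real distillation would fit a deeper tree to a multivariate model, but the idea is the same: the surrogate explains, while the black box predicts.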
MANNY PARZEN
Manny Parzen opines that there are not two but
many modeling cultures. This is not an issue I want
to fiercely contest. I like my division because it is
pretty clear cut: are you modeling the inside of the
box or not? For instance, I would include Bayesians
in the data modeling culture. I will keep my eye on
the quantile culture to see what develops.
Most of all, I appreciate Manny's openness to the
issues raised in my paper. With the rapid changes
in the scope of statistical problems, more open and
concrete discussion of what works and what doesn't
should be welcomed.
WHERE ARE WE HEADING?
Many of the best statisticians I have talked to
over the past years have serious concerns about the
viability of statistics as a field. Oddly, we are in a
period where there has never been such a wealth of
new statistical problems and sources of data. The
danger is that if we define the boundaries of our
field in terms of familiar tools and familiar problems,
we will fail to grasp the new opportunities.
ADDITIONAL REFERENCES
Beveridge, W. I. B. (1952). The Art of Scientific Investigation.
Heinemann, London.
Breiman, L. (1994). The 1990 Census adjustment: undercount or
bad data (with discussion)? Statist. Sci. 9 458–475.
Cox, D. R. and Wermuth, N. (1996). Multivariate Dependencies.
Chapman and Hall, London.
Efron, B. and Gong, G. (1983). A leisurely look at the bootstrap,
the jackknife, and cross-validation. Amer. Statist. 37 36–48.
Efron, B. and Tibshirani, R. (1996). Improvements on cross-
validation: the .632 rule. J. Amer. Statist. Assoc. 91 548–560.
Efron, B. and Tibshirani, R. (1998). The problem of regions.
Ann. Statist. 26 1287–1318.
Gong, G. (1982). Cross-validation, the jackknife, and the boot-
strap: excess error estimation in forward logistic regression.
Ph.D. dissertation, Stanford Univ.
MacKay, R. J. and Oldford, R. W. (2000). Scientific method,
statistical method, and the speed of light. Statist. Sci. 15
224–253.
Parzen, E. (1979). Nonparametric statistical data modeling (with
discussion). J. Amer. Statist. Assoc. 74 105–131.
Stone, C. (1977). Consistent nonparametric regression. Ann.
Statist. 5 595–645.