A Study of Cross-Validation and Bootstrap For Accuracy Estimation and Model Selection
Ron Kohavi
Abstract

We review accuracy estimation methods and compare the two most common methods: cross-validation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), ten-fold cross-validation may be better than the more expensive leave-one-out cross-validation. We report on a large-scale experiment, over half a million runs of C4.5 and a Naive-Bayes algorithm, to estimate the effects of different parameters on these algorithms on real-world datasets. For cross-validation, we vary the number of folds and whether the folds are stratified or not; for bootstrap, we vary the number of bootstrap samples. Our results indicate that for real-world datasets similar to ours, the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.

1 Introduction

  It can not be emphasized enough that no claim whatsoever is being made in this paper that all algorithms are equivalent in practice, in the real world. In particular, no claim is being made that one should not use cross-validation in the real world. -- Wolpert (1994a)

Estimating the accuracy of a classifier induced by supervised learning algorithms is important not only to predict its future prediction accuracy, but also for choosing a classifier from a given set (model selection), or combining classifiers (Wolpert 1992). For estimating the final accuracy of a classifier, we would like an estimation method with low bias and low variance. To choose a classifier or to combine classifiers, the absolute accuracies are less important and we are willing to trade off bias for low variance, assuming the bias affects all classifiers similarly (e.g., estimates are 5% pessimistic).

In this paper we explain some of the assumptions made by the different estimation methods, and present concrete examples where each method fails. While it is known that no accuracy estimation can be correct all the time (Wolpert 1994b, Schaffer 1994), we are interested in identifying a method that is well suited for the biases and trends in typical real-world datasets.

Recent results, both theoretical and experimental, have shown that it is not always the case that increasing the computational cost is beneficial, especially if the relative accuracies are more important than the exact values. For example, leave-one-out is almost unbiased, but it has high variance, leading to unreliable estimates (Efron 1983). For linear models, using leave-one-out cross-validation for model selection is asymptotically inconsistent in the sense that the probability of selecting the model with the best predictive power does not converge to one as the total number of observations approaches infinity (Zhang 1992, Shao 1993).

This paper is organized as follows. Section 2 describes the common accuracy estimation methods and ways of computing confidence bounds that hold under some assumptions. Section 3 discusses related work comparing cross-validation variants and bootstrap variants. Section 4 discusses the methodology underlying our experiment. The results of the experiments are given in Section 5 with a discussion of important observations. We conclude with a summary in Section 6.

A longer version of the paper can be retrieved by anonymous ftp to starry.stanford.edu:pub/ronnyk/accEst-long.ps.

2 Methods for Accuracy Estimation

A classifier is a function that maps an unlabelled instance to a label using internal data structures. An inducer, or an induction algorithm, builds a classifier from a given dataset. CART and C4.5 (Breiman, Friedman, Olshen & Stone 1984, Quinlan 1993) are decision tree inducers that build decision tree classifiers. In this paper, we are not interested in the specific method for inducing classifiers, but assume access to a dataset and an inducer of interest.
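To make the classifier/inducer terminology concrete, the following minimal Python sketch (ours, not part of the original paper) implements a majority inducer: a function that takes a dataset of labelled instances and returns a classifier which ignores the instance and predicts the most frequent training label. Inducers of this simple form are used in the examples of Sections 2.1 and 2.2.

    from collections import Counter

    def majority_inducer(train):
        """Induce a classifier from labelled instances (v, y): the returned
        classifier ignores v and predicts the most common training label."""
        majority_label = Counter(y for _, y in train).most_common(1)[0][0]
        return lambda v: majority_label

    # Build a classifier from a dataset D, then use it to label a new instance.
    D = [((5.1, 3.5), "setosa"), ((6.0, 2.2), "versicolor"), ((5.9, 3.0), "setosa")]
    print(majority_inducer(D)((4.9, 3.1)))   # -> setosa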
Let V be the space of unlabelled instances and Y the set of possible labels. Let X = V × Y be the space of labelled instances and D = {x_1, x_2, ..., x_n} be a dataset (possibly a multiset) consisting of n labelled instances, where x_i = ⟨v_i ∈ V, y_i ∈ Y⟩. A classifier C maps an unlabelled instance v ∈ V to a label y ∈ Y and an inducer I maps a given dataset D into a classifier C. The notation I(D, v) will denote the label assigned to an unlabelled instance v by the classifier built by inducer I on dataset D, i.e., I(D, v) = (I(D))(v). We assume that there exists a distribution on the set of labelled instances and that our dataset consists of i.i.d. (independently and identically distributed) instances. We consider equal misclassification costs using a 0/1 loss function, but the accuracy estimation methods can easily be extended to other loss functions.

The accuracy of a classifier C is the probability of correctly classifying a randomly selected instance, i.e., acc = Pr(C(v) = y) for a randomly selected instance ⟨v, y⟩ ∈ X, where the probability distribution over the instance space is the same as the distribution that was used to select instances for the inducer's training set. Given a finite dataset, we would like to estimate the future performance of a classifier induced by the given inducer and dataset. A single accuracy estimate is usually meaningless without a confidence interval; thus we will consider how to approximate such an interval when possible. In order to identify weaknesses, we also attempt to identify cases where the estimates fail.

2.1 Holdout

The holdout method, sometimes called test sample estimation, partitions the data into two mutually exclusive subsets called a training set and a test set, or holdout set. It is common to designate 2/3 of the data as the training set and the remaining 1/3 as the test set. The training set is given to the inducer, and the induced classifier is tested on the test set. Formally, let D_h, the holdout set, be a subset of D of size h, and let D_t be D \ D_h. The holdout estimated accuracy is defined as

    acc_h = \frac{1}{h} \sum_{\langle v_i, y_i \rangle \in D_h} \delta(I(D_t, v_i), y_i),    (1)

where \delta(i, j) = 1 if i = j and 0 otherwise. Assuming that the inducer's accuracy increases as more instances are seen, the holdout method is a pessimistic estimator because only a portion of the data is given to the inducer for training. The more instances we leave for the test set, the higher the bias of our estimate; however, fewer test set instances means that the confidence interval for the accuracy will be wider, as shown below.

Each test instance can be viewed as a Bernoulli trial: correct or incorrect prediction. Let S be the number of correct classifications on the test set; then S is distributed binomially (a sum of Bernoulli trials). For reasonably large holdout sets, the distribution of S/h is approximately normal with mean acc (the true accuracy of the classifier) and a variance of acc(1 − acc)/h. Thus, by the De Moivre-Laplace limit theorem, we have

    \Pr\left( -z < \frac{acc_h - acc}{\sqrt{acc(1 - acc)/h}} < z \right) \approx \gamma,    (2)

where z is the (1 + \gamma)/2-th quantile point of the standard normal distribution. To get a 100\gamma percent confidence interval, one determines z and inverts the inequalities. Inversion of the inequalities leads to a quadratic equation in acc, the roots of which are the low and high confidence points:

    \frac{2h \cdot acc_h + z^2 \pm z \sqrt{4h \cdot acc_h + z^2 - 4h \cdot acc_h^2}}{2(h + z^2)}.    (3)

The above equation is not conditioned on the dataset D; if more information is available about the probability of the given dataset, it must be taken into account.
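As a concrete illustration (our own Python sketch, not part of the original paper), the following function computes the holdout estimate of Equation 1 together with the confidence interval of Equation 3. The inducer argument is any function mapping a training list to a classifier; z defaults to 1.96, which corresponds to a roughly 95% interval.

    import math
    import random

    def holdout_estimate(dataset, inducer, test_fraction=1/3, z=1.96, seed=0):
        """Split the data, train on D_t, test on D_h (Equation 1), and
        return (acc_h, (low, high)), the interval following Equation 3."""
        data = list(dataset)
        random.Random(seed).shuffle(data)
        h = int(len(data) * test_fraction)
        holdout, train = data[:h], data[h:]
        classifier = inducer(train)
        acc_h = sum(classifier(v) == y for v, y in holdout) / h
        # Roots of the quadratic obtained by inverting Equation 2.
        disc = z * math.sqrt(4 * h * acc_h + z ** 2 - 4 * h * acc_h ** 2)
        low = (2 * h * acc_h + z ** 2 - disc) / (2 * (h + z ** 2))
        high = (2 * h * acc_h + z ** 2 + disc) / (2 * (h + z ** 2))
        return acc_h, (low, high)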
The holdout estimate is a random number that depends on the division into a training set and a test set. In random subsampling, the holdout method is repeated k times, and the estimated accuracy is derived by averaging the runs. The standard deviation can be estimated as the standard deviation of the accuracy estimations from each holdout run.

The main assumption that is violated in random subsampling is the independence of the instances in the test set from those in the training set. If the training and test sets are formed by a split of an original dataset, then an over-represented class in one subset will be under-represented in the other. To demonstrate the issue, we simulated a 2/3, 1/3 split of Fisher's famous iris dataset and used a majority inducer that builds a classifier predicting the prevalent class in the training set. The iris dataset describes iris plants using four continuous features, and the task is to classify each instance (an iris) as Iris Setosa, Iris Versicolour, or Iris Virginica. For each class label, exactly one third of the instances carry that label (50 instances of each class from a total of 150 instances); thus we expect 33.3% prediction accuracy. However, because the test set will always contain less than 1/3 of the instances of the class that was prevalent in the training set, the accuracy predicted by the holdout method is 27.68% with a standard deviation of 0.13% (estimated by averaging 500 holdouts).

In practice, the dataset size is always finite, and usually smaller than we would like it to be. The holdout method makes inefficient use of the data: a third of the dataset is not used for training the inducer.
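The bias described above is easy to reproduce. The sketch below (ours, not from the paper) repeats a 2/3, 1/3 random split on a perfectly class-balanced three-class label set and trains a majority inducer on each split; the features are irrelevant to this inducer, so only labels are simulated. The average holdout accuracy comes out noticeably below the 33.3% one might expect.

    import random
    from collections import Counter

    def repeated_holdout_majority(n_per_class=50, classes=3, runs=500, seed=0):
        """Average holdout accuracy of a majority inducer on a perfectly
        class-balanced label set, using repeated 2/3, 1/3 splits."""
        rng = random.Random(seed)
        labels = [c for c in range(classes) for _ in range(n_per_class)]
        cut = 2 * len(labels) // 3
        accs = []
        for _ in range(runs):
            rng.shuffle(labels)
            train, test = labels[:cut], labels[cut:]
            prediction = Counter(train).most_common(1)[0][0]
            accs.append(sum(y == prediction for y in test) / len(test))
        return sum(accs) / len(accs)

    print(repeated_holdout_majority())   # typically about 0.27-0.28, below 1/3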
2.2 Cross-Validation, Leave-one-out, and Stratification

In k-fold cross-validation, sometimes called rotation estimation, the dataset D is randomly split into k mutually exclusive subsets (the folds) D_1, D_2, ..., D_k of approximately equal size. The inducer is trained and tested k times; each time t ∈ {1, 2, ..., k}, it is trained on D \ D_t and tested on D_t. The cross-validation estimate of accuracy is the overall number of correct classifications, divided by the number of instances in the dataset. Formally, let D_(i) be the test set that includes instance x_i = ⟨v_i, y_i⟩; then the cross-validation estimate of accuracy is

    acc_{cv} = \frac{1}{n} \sum_{\langle v_i, y_i \rangle \in D} \delta(I(D \setminus D_{(i)}, v_i), y_i).    (4)

The cross-validation estimate is a random number that depends on the division into folds. Complete cross-validation is the average of all \binom{m}{m/k} possibilities for choosing m/k instances out of m, but it is usually too expensive. Except for leave-one-out (n-fold cross-validation), which is always complete, k-fold cross-validation is estimating complete k-fold cross-validation using a single split of the data into the folds. Repeating cross-validation multiple times using different splits into folds provides a better Monte-Carlo estimate of the complete cross-validation at an added cost. In stratified cross-validation, the folds are stratified so that they contain approximately the same proportions of labels as the original dataset.
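A compact sketch of stratified k-fold cross-validation is given below (ours, not from the paper; a real experiment would normally rely on a tested library implementation). Instances are dealt into folds class by class so that each fold keeps approximately the original label proportions, and the accuracy is accumulated as in Equation 4.

    import random
    from collections import defaultdict

    def stratified_cv_accuracy(dataset, inducer, k=10, seed=0):
        """Estimate accuracy by stratified k-fold cross-validation.
        `dataset` is a list of (v, y); `inducer` maps a training list to a
        classifier, i.e., a callable v -> y."""
        rng = random.Random(seed)
        by_label = defaultdict(list)
        for instance in dataset:
            by_label[instance[1]].append(instance)
        folds = [[] for _ in range(k)]
        for instances in by_label.values():
            rng.shuffle(instances)
            for i, instance in enumerate(instances):   # deal each class across folds
                folds[i % k].append(instance)
        correct = 0
        for t in range(k):
            train = [x for i in range(k) if i != t for x in folds[i]]
            classifier = inducer(train)
            correct += sum(classifier(v) == y for v, y in folds[t])
        return correct / len(dataset)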
An inducer is stable for a given dataset and a set of perturbations if it induces classifiers that make the same predictions when it is given the perturbed datasets.

Proposition 1 (Variance in k-fold CV)
Given a dataset and an inducer, if the inducer is stable under the perturbations caused by deleting the instances for the folds in k-fold cross-validation, then the cross-validation estimate will be unbiased and the variance of the estimated accuracy will be approximately acc_cv(1 − acc_cv)/n, where n is the number of instances in the dataset.

Proof: If we assume that the k classifiers produced make the same predictions, then the estimated accuracy has a binomial distribution with n trials and probability of success equal to the accuracy of the classifier.

For large enough n, a confidence interval may be computed using Equation 3 with h equal to n, the number of instances.

In reality, a complex inducer is unlikely to be stable for large perturbations, unless it has reached its maximal learning capacity. We expect the perturbations induced by leave-one-out to be small and therefore the classifier should be very stable. As we increase the size of the perturbations, stability is less likely to hold: we expect stability to hold more in 20-fold cross-validation than in 10-fold cross-validation, and both should be more stable than a holdout of 1/3. The proposition does not apply to the resubstitution estimate because it requires the inducer to be stable when no instances are given in the dataset.

The above proposition helps understand one possible assumption that is made when using cross-validation: if an inducer is unstable for a particular dataset under a set of perturbations introduced by cross-validation, the accuracy estimate is likely to be unreliable. If the inducer is almost stable on a given dataset, we should expect a reliable estimate. The next corollary takes the idea slightly further and shows a result that we have observed empirically: there is almost no change in the variance of the cross-validation estimate when the number of folds is varied.

Corollary 2 (Variance in cross-validation)
Given a dataset and an inducer, if the inducer is stable under the perturbations caused by deleting the test instances for the folds in k-fold cross-validation for various values of k, then the variance of the estimates will be the same.

Proof: The variance of k-fold cross-validation in Proposition 1 does not depend on k.

While some inducers are likely to be inherently more stable, the following example shows that one must also take into account the dataset and the actual perturbations.

Example 1 (Failure of leave-one-out)
Fisher's iris dataset contains 50 instances of each class, leading one to expect that a majority inducer should have accuracy of about 33%. However, the combination of this dataset with a majority inducer is unstable for the small perturbations performed by leave-one-out. When an instance is deleted from the dataset, its label is a minority in the training set; thus the majority inducer predicts one of the other two classes and always errs in classifying the test instance. The leave-one-out estimated accuracy for a majority inducer on the iris dataset is therefore 0%. Moreover, all folds have this estimated accuracy; thus the standard deviation of the folds is again 0%, giving the unjustified assurance that the estimate is stable.

The example shows an inherent problem with cross-validation that applies to more than just a majority inducer. In a no-information dataset, where the label values are completely random, the best an induction algorithm can do is predict majority. Leave-one-out on such a dataset with 50% of the labels for each class and a majority inducer (the best possible inducer) would still predict 0% accuracy.
2.3 Bootstrap

The bootstrap family was introduced by Efron and is fully described in Efron & Tibshirani (1993). Given a dataset of size n, a bootstrap sample is created by sampling n instances uniformly from the data (with replacement). Since the dataset is sampled with replacement, the probability of any given instance not being chosen after n samples is (1 − 1/n)^n ≈ e^{−1} ≈ 0.368; the expected number of distinct instances from the original dataset appearing in the test set is thus 0.632n. The ε0 accuracy estimate is derived by using the bootstrap sample for training and the rest of the instances for testing. Given a number b, the number of bootstrap samples, let ε0_i be the accuracy estimate for bootstrap sample i. The .632 bootstrap estimate is defined as

    acc_{boot} = \frac{1}{b} \sum_{i=1}^{b} \left( 0.632 \cdot \varepsilon0_i + 0.368 \cdot acc_s \right),    (5)

where acc_s is the resubstitution accuracy estimate on the full dataset (i.e., the accuracy on the training set). The variance of the estimate can be determined by computing the variance of the estimates for the samples.

The assumptions made by bootstrap are basically the same as those of cross-validation, i.e., stability of the algorithm on the dataset: the "bootstrap world" should closely approximate the real world. The .632 bootstrap fails to give the expected result when the classifier is a perfect memorizer (e.g., an unpruned decision tree or a one-nearest-neighbor classifier) and the dataset is completely random, say with two classes. The resubstitution accuracy is 100%, and the ε0 accuracy is about 50%. Plugging these into the bootstrap formula, one gets an estimated accuracy of about 68.4%, far from the real accuracy of 50%. Bootstrap can be shown to fail if we add a memorizer module to any given inducer and adjust its predictions. If the memorizer remembers the training set and makes its memorized predictions when the test instance was a training instance, adjusting its predictions can make the resubstitution accuracy change from 0% to 100% and can thus bias the overall estimated accuracy in any direction we want.
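The following sketch (ours, not from the paper) computes the .632 bootstrap estimate of Equation 5: each bootstrap sample serves as a training set, the instances left out of that sample provide the ε0 accuracy, and each ε0 estimate is blended with the resubstitution accuracy.

    import random

    def bootstrap_632_accuracy(dataset, inducer, b=50, seed=0):
        """Estimate accuracy with the .632 bootstrap (Equation 5).
        `dataset` is a list of (v, y); `inducer` maps a training list to a
        classifier, i.e., a callable v -> y."""
        rng = random.Random(seed)
        n = len(dataset)
        resub = inducer(dataset)
        acc_s = sum(resub(v) == y for v, y in dataset) / n   # resubstitution accuracy
        estimates = []
        for _ in range(b):
            indices = [rng.randrange(n) for _ in range(n)]   # sample with replacement
            chosen = set(indices)
            sample = [dataset[i] for i in indices]
            out_of_bag = [dataset[i] for i in range(n) if i not in chosen]
            if not out_of_bag:           # degenerate sample; skip it
                continue
            classifier = inducer(sample)
            eps0 = sum(classifier(v) == y for v, y in out_of_bag) / len(out_of_bag)
            estimates.append(0.632 * eps0 + 0.368 * acc_s)
        return sum(estimates) / len(estimates)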
3 Related Work

Some experimental studies comparing different accuracy estimation methods have been previously done, but most of them were on artificial or small datasets. We now describe some of these efforts.

Efron (1983) conducted five sampling experiments and compared leave-one-out cross-validation, several variants of bootstrap, and several other methods. The purpose of the experiments was to "investigate some related estimators, which seem to offer considerably improved estimation in small samples." The results indicate that leave-one-out cross-validation gives nearly unbiased estimates of the accuracy, but often with unacceptably high variability, particularly for small samples; and that the .632 bootstrap performed best.

Breiman et al. (1984) conducted experiments using cross-validation for decision tree pruning. They chose ten-fold cross-validation for the CART program and claimed it was satisfactory for choosing the correct tree. They claimed that "the difference in the cross-validation estimates of the risks of two rules tends to be much more accurate than the two estimates themselves."

Jain, Dubes & Chen (1987) compared the performance of the ε0 bootstrap and leave-one-out cross-validation on nearest-neighbor classifiers using artificial data and claimed that the confidence interval of the bootstrap estimator is smaller than that of leave-one-out. Weiss (1991) followed similar lines and compared stratified cross-validation and two bootstrap methods with nearest-neighbor classifiers. His results were that stratified two-fold cross-validation has relatively low variance and is superior to leave-one-out.

Breiman & Spector (1992) conducted feature subset selection experiments for regression, and compared leave-one-out cross-validation, k-fold cross-validation for various k, stratified k-fold cross-validation, bias-corrected bootstrap, and partial cross-validation (not discussed here). Tests were done on artificial datasets with 60 and 160 instances. The behavior observed was: (1) leave-one-out has low bias and RMS (root mean square) error, whereas two-fold and five-fold cross-validation have larger bias and RMS error only at models with many features; (2) the pessimistic bias of ten-fold cross-validation at small samples was significantly reduced for the samples of size 160; (3) for model selection, ten-fold cross-validation is better than leave-one-out.

Bailey & Elkan (1993) compared leave-one-out cross-validation to the .632 bootstrap using the FOIL inducer and four synthetic datasets involving Boolean concepts. They observed high variability and little bias in the leave-one-out estimates, and low variability but large bias in the .632 estimates.

Weiss and Indurkhya (Weiss & Indurkhya 1994) conducted experiments on real-world data to determine the applicability of cross-validation to decision tree pruning. Their results were that for samples of size at least 200, using stratified ten-fold cross-validation to choose the amount of pruning yields unbiased trees (with respect to their optimal size).

4 Methodology

In order to conduct a large-scale experiment we decided to use C4.5 and a Naive-Bayesian classifier. The C4.5 algorithm (Quinlan 1993) is a descendant of ID3 that builds decision trees top-down. The Naive-Bayesian classifier (Langley, Iba & Thompson 1992) used was the one implemented in MLC++ (Kohavi, John, Long, Manley & Pfleger 1994) that uses the observed ratios for nominal features and assumes a Gaussian distribution for continuous features. The exact details are not crucial for this paper because we are interested in the behavior of the accuracy estimation methods more than the internals of the induction algorithms. The underlying hypothesis spaces (decision trees for C4.5 and summary statistics for Naive-Bayes) are different enough that we hope conclusions based on these two induction algorithms will apply to other induction algorithms.
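As an illustration of the kind of Naive-Bayes model described above (our sketch, not the MLC++ implementation), the code below fits per-class priors and per-feature Gaussians to continuous features and predicts the class with the highest log-posterior; nominal features, which the paper's version handles with observed frequency ratios, are omitted for brevity.

    import math
    from collections import defaultdict

    def naive_bayes_inducer(train):
        """Fit class priors and per-feature Gaussians; return a classifier.
        `train` is a list of (v, y) with v a tuple of continuous features."""
        by_class = defaultdict(list)
        for v, y in train:
            by_class[y].append(v)
        params = {}
        for y, rows in by_class.items():
            prior = len(rows) / len(train)
            stats = []
            for feature in zip(*rows):                  # one feature column at a time
                mean = sum(feature) / len(feature)
                var = sum((f - mean) ** 2 for f in feature) / len(feature) or 1e-9
                stats.append((mean, var))
            params[y] = (prior, stats)

        def classify(v):
            def log_posterior(y):
                prior, stats = params[y]
                log_p = math.log(prior)
                for f, (mean, var) in zip(v, stats):
                    log_p += -0.5 * math.log(2 * math.pi * var) - (f - mean) ** 2 / (2 * var)
                return log_p
            return max(params, key=log_posterior)

        return classify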
Because the target concept is unknown for real-world concepts, we used the holdout method to estimate the quality of the cross-validation and bootstrap estimates. We chose datasets from the UC Irvine repository (Murphy & Aha 1995) that contained more than 500 instances, and we estimated the true accuracies using the holdout method. The "true" accuracy estimates in Table 1 were computed by taking a random sample of the given size, computing the accuracy of the induced classifier on the remaining instances, and averaging over repeated samples. Datasets were chosen such that the learning curve for both algorithms did not flatten out too early, that is, before one hundred instances.

Figure 2: C4.5: The bias of bootstrap with varying samples. Estimates are good for mushroom, hypothyroid, and chess, but are extremely biased (optimistically) for vehicle and rand, and somewhat biased for soybean.

Figure 3: Cross-validation: standard deviation of accuracy (population). Different line styles are used to help differentiate between curves.
In what follows, all figures for standard deviation will be drawn with the same range for the standard deviation: 0 to 7.5%. Figure 3 shows the standard deviations for C4.5 and Naive-Bayes using varying numbers of folds for cross-validation. The results for stratified cross-validation were similar, with slightly lower variance. Figure 4 shows the same information for the .632 bootstrap.

Cross-validation has high variance at 2 folds on both C4.5 and Naive-Bayes. On C4.5, there is high variance at the high ends too (at leave-one-out and leave-two-out) for three files out of the seven datasets. Stratification reduces the variance slightly, and thus seems to be uniformly better than cross-validation, both for bias and variance.

6 Summary

We reviewed common accuracy estimation methods including holdout, cross-validation, and bootstrap, and showed examples where each one fails to produce a good estimate. We have compared the latter two approaches on a variety of real-world datasets with differing characteristics.

Proposition 1 shows that if the induction algorithm is stable for a given dataset, the variance of the cross-validation estimates should be approximately the same, independent of the number of folds. Although the induction algorithms are not stable, they are approximately stable. k-fold cross-validation with moderate k values (10-20) reduces the variance while increasing the bias. As k decreases (2-5) and the sample sizes get smaller, there is variance due to the instability of the training sets themselves.
References

Jain, A. K., Dubes, R. C. & Chen, C. (1987), "Bootstrap techniques for error estimation", IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-9(5), 628-633.