


Appears in the International Joint Conference on Artificial Intelligence (IJCAI), 1995

A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection

Ron Kohavi
Computer Science Department
Stanford University
Stanford, CA 94305
[email protected]
http://robotics.stanford.edu/~ronnyk

Abstract

We review accuracy estimation methods and compare the two most common methods: cross-validation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), ten-fold cross-validation may be better than the more expensive leave-one-out cross-validation. We report on a large-scale experiment, over half a million runs of C4.5 and a Naive-Bayes algorithm, to estimate the effects of different parameters on these algorithms on real-world datasets. For cross-validation, we vary the number of folds and whether the folds are stratified or not; for bootstrap, we vary the number of bootstrap samples. Our results indicate that for real-world datasets similar to ours, the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.

1 Introduction

"It can not be emphasized enough that no claim whatsoever is being made in this paper that all algorithms are equivalent in practice, in the real world. In particular, no claim is being made that one should not use cross-validation in the real world." (Wolpert 1994a)

Estimating the accuracy of a classifier induced by supervised learning algorithms is important not only to predict its future prediction accuracy, but also for choosing a classifier from a given set (model selection), or combining classifiers (Wolpert 1992). For estimating the final accuracy of a classifier, we would like an estimation method with low bias and low variance. To choose a classifier or to combine classifiers, the absolute accuracies are less important and we are willing to trade off bias for low variance, assuming the bias affects all classifiers similarly (e.g., estimates are 5% pessimistic).

(A longer version of the paper can be retrieved by anonymous ftp to starry.stanford.edu:pub/ronnyk/accEst-long.ps)

In this paper we explain some of the assumptions made by the different estimation methods, and present concrete examples where each method fails. While it is known that no accuracy estimation can be correct all the time (Wolpert 1994b, Schaffer 1994), we are interested in identifying a method that is well suited for the biases and trends in typical real-world datasets.

Recent results, both theoretical and experimental, have shown that it is not always the case that increasing the computational cost is beneficial, especially if the relative accuracies are more important than the exact values. For example, leave-one-out is almost unbiased, but it has high variance, leading to unreliable estimates (Efron 1983). For linear models, using leave-one-out cross-validation for model selection is asymptotically inconsistent in the sense that the probability of selecting the model with the best predictive power does not converge to one as the total number of observations approaches infinity (Zhang 1992, Shao 1993).

This paper is organized as follows. Section 2 describes the common accuracy estimation methods and ways of computing confidence bounds that hold under some assumptions. Section 3 discusses related work comparing cross-validation variants and bootstrap variants. Section 4 discusses the methodology underlying our experiment. The results of the experiments are given in Section 5 with a discussion of important observations. We conclude with a summary in Section 6.

2 Methods for Accuracy Estimation

A classifier is a function that maps an unlabelled instance to a label using internal data structures. An inducer, or an induction algorithm, builds a classifier from a given dataset. CART and C4.5 (Breiman, Friedman, Olshen & Stone 1984, Quinlan 1993) are decision tree inducers that build decision tree classifiers. In this paper, we are not interested in the specific method for inducing classifiers, but assume access to a dataset and an inducer of interest.
Let V be the space of unlabelled instances and Y the set of possible labels. Let X = V × Y be the space of labelled instances and D = {x1, x2, ..., xn} be a dataset (possibly a multiset) consisting of n labelled instances, where xi = ⟨vi ∈ V, yi ∈ Y⟩. A classifier C maps an unlabelled instance v ∈ V to a label y ∈ Y, and an inducer I maps a given dataset D into a classifier C. The notation I(D, v) will denote the label assigned to an unlabelled instance v by the classifier built by inducer I on dataset D, i.e., I(D, v) = (I(D))(v). We assume that there exists a distribution on the set of labelled instances and that our dataset consists of i.i.d. (independently and identically distributed) instances. We consider equal misclassification costs using a 0/1 loss function, but the accuracy estimation methods can easily be extended to other loss functions.

The accuracy of a classifier C is the probability of correctly classifying a randomly selected instance, i.e., acc = Pr(C(v) = y) for a randomly selected instance ⟨v, y⟩ ∈ X, where the probability distribution over the instance space is the same as the distribution that was used to select instances for the inducer's training set. Given a finite dataset, we would like to estimate the future performance of a classifier induced by the given inducer and dataset. A single accuracy estimate is usually meaningless without a confidence interval; thus we will consider how to approximate such an interval when possible. In order to identify weaknesses, we also attempt to identify cases where the estimates fail.

2.1 Holdout

The holdout method, sometimes called test sample estimation, partitions the data into two mutually exclusive subsets called a training set and a test set, or holdout set. It is common to designate 2/3 of the data as the training set and the remaining 1/3 as the test set. The training set is given to the inducer, and the induced classifier is tested on the test set. Formally, let Dh, the holdout set, be a subset of D of size h, and let Dt be D \ Dh. The holdout estimated accuracy is defined as

acc_h = \frac{1}{h} \sum_{\langle v_i, y_i \rangle \in D_h} \delta(I(D_t, v_i), y_i)    (1)

where δ(i, j) = 1 if i = j and 0 otherwise. Assuming that the inducer's accuracy increases as more instances are seen, the holdout method is a pessimistic estimator because only a portion of the data is given to the inducer for training. The more instances we leave for the test set, the higher the bias of our estimate; however, fewer test set instances means that the confidence interval for the accuracy will be wider, as shown below.

Each test instance can be viewed as a Bernoulli trial: correct or incorrect prediction. Let S be the number of correct classifications on the test set; then S is distributed binomially (sum of Bernoulli trials). For reasonably large holdout sets, the distribution of S/h is approximately normal with mean acc (the true accuracy of the classifier) and a variance of acc(1 - acc)/h. Thus, by the De Moivre-Laplace limit theorem, we have

\Pr\left( -z < \frac{acc_h - acc}{\sqrt{acc(1 - acc)/h}} < z \right) \approx \gamma    (2)

where z is the (1 + γ)/2-th quantile point of the standard normal distribution. To get a 100γ percent confidence interval, one determines z and inverts the inequalities. Inversion of the inequalities leads to a quadratic equation in acc, the roots of which are the low and high confidence points:

\frac{2h \cdot acc_h + z^2 \pm z\sqrt{4h \cdot acc_h + z^2 - 4h \cdot acc_h^2}}{2(h + z^2)}    (3)

The above equation is not conditioned on the dataset D; if more information is available about the probability of the given dataset, it must be taken into account.

The holdout estimate is a random number that depends on the division into a training set and a test set. In random subsampling, the holdout method is repeated k times, and the estimated accuracy is derived by averaging the runs. The standard deviation can be estimated as the standard deviation of the accuracy estimations from each holdout run.

The main assumption that is violated in random subsampling is the independence of instances in the test set from those in the training set. If the training and test set are formed by a split of an original dataset, then an over-represented class in one subset will be under-represented in the other. To demonstrate the issue, we simulated a 2/3, 1/3 split of Fisher's famous iris dataset and used a majority inducer that builds a classifier predicting the prevalent class in the training set. The iris dataset describes iris plants using four continuous features, and the task is to classify each instance (an iris) as Iris Setosa, Iris Versicolour, or Iris Virginica. For each class label, exactly one third of the instances have that label (50 instances of each class from a total of 150 instances); thus we expect 33.3% prediction accuracy. However, because the test set will always contain less than 1/3 of the instances of the class that was prevalent in the training set, the accuracy predicted by the holdout method is 27.68% with a standard deviation of 0.13% (estimated by averaging 500 holdouts).

In practice, the dataset size is always finite, and usually smaller than we would like it to be. The holdout method makes inefficient use of the data: a third of the dataset is not used for training the inducer.
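As an illustration of Equations 1 and 3, the following minimal Python sketch computes a holdout estimate and its confidence points. The function names and the `inducer` interface (a callable that takes a list of (instance, label) pairs and returns a classify(instance) function) are illustrative assumptions, not part of the paper or of MLC++.

```python
import math
import random

def holdout_accuracy(data, inducer, test_fraction=1/3, seed=0):
    """Holdout estimate (Equation 1): train on 2/3 of the data, test on the rest.

    `data` is a list of (instance, label) pairs; `inducer(train)` must return a
    function mapping an instance to a predicted label (illustrative interface).
    """
    rng = random.Random(seed)
    items = list(data)
    rng.shuffle(items)
    h = int(len(items) * test_fraction)        # size of the holdout (test) set
    test, train = items[:h], items[h:]
    classify = inducer(train)
    correct = sum(1 for v, y in test if classify(v) == y)
    return correct / h, h

def confidence_points(acc_h, h, z=1.96):
    """Low and high confidence points from Equation 3 (z = 1.96 for roughly 95%)."""
    disc = z * math.sqrt(4 * h * acc_h + z * z - 4 * h * acc_h * acc_h)
    low = (2 * h * acc_h + z * z - disc) / (2 * (h + z * z))
    high = (2 * h * acc_h + z * z + disc) / (2 * (h + z * z))
    return low, high
```

Averaging `holdout_accuracy` over several random seeds gives the random subsampling estimate described above.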
2.2 Cross-Validation, Leave-one-out, and Stratification

In k-fold cross-validation, sometimes called rotation estimation, the dataset D is randomly split into k mutually exclusive subsets (the folds) D1, D2, ..., Dk of approximately equal size.
The inducer is trained and tested k times; each time t ∈ {1, 2, ..., k}, it is trained on D \ Dt and tested on Dt. The cross-validation estimate of accuracy is the overall number of correct classifications, divided by the number of instances in the dataset. Formally, let D_(i) be the test set that includes instance xi = ⟨vi, yi⟩; then the cross-validation estimate of accuracy is

acc_{cv} = \frac{1}{n} \sum_{\langle v_i, y_i \rangle \in D} \delta(I(D \setminus D_{(i)}, v_i), y_i)    (4)

The cross-validation estimate is a random number that depends on the division into folds. Complete cross-validation is the average of all \binom{m}{m/k} possibilities for choosing m/k instances out of m, but it is usually too expensive. Except for leave-one-out (n-fold cross-validation), which is always complete, k-fold cross-validation is estimating complete k-fold cross-validation using a single split of the data into the folds. Repeating cross-validation multiple times using different splits into folds provides a better Monte-Carlo estimate of the complete cross-validation at an added cost. In stratified cross-validation, the folds are stratified so that they contain approximately the same proportions of labels as the original dataset.
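A sketch of the k-fold estimate of Equation 4, with an optional stratified variant, is shown below. Sorting by label and then dealing instances round-robin into folds is one simple way to approximate stratification; it is an assumption of this sketch, not a procedure taken from the paper. The `inducer` interface is the same illustrative one used in the holdout sketch above.

```python
import random

def cross_validation_accuracy(data, inducer, k=10, stratified=False, seed=0):
    """k-fold cross-validation estimate (Equation 4): every instance is tested
    exactly once by a classifier trained on the other k-1 folds."""
    rng = random.Random(seed)
    items = list(data)
    rng.shuffle(items)
    if stratified:
        # Sort by label, then deal round-robin so each fold keeps roughly the
        # label proportions of the full dataset (one simple way to stratify).
        items.sort(key=lambda pair: str(pair[1]))
    folds = [items[i::k] for i in range(k)]
    correct = 0
    for t in range(k):
        train = [pair for j, fold in enumerate(folds) if j != t for pair in fold]
        classify = inducer(train)
        correct += sum(1 for v, y in folds[t] if classify(v) == y)
    return correct / len(items)
```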
An inducer is stable for a given dataset and a set of perturbations if it induces classifiers that make the same predictions when it is given the perturbed datasets.

Proposition 1 (Variance in k-fold CV)
Given a dataset and an inducer. If the inducer is stable under the perturbations caused by deleting the instances for the folds in k-fold cross-validation, the cross-validation estimate will be unbiased and the variance of the estimated accuracy will be approximately acc_cv(1 - acc_cv)/n, where n is the number of instances in the dataset.

Proof: If we assume that the k classifiers produced make the same predictions, then the estimated accuracy has a binomial distribution with n trials and probability of success equal to the accuracy of the classifier.

For large enough n, a confidence interval may be computed using Equation 3 with h equal to n, the number of instances.

In reality, a complex inducer is unlikely to be stable for large perturbations unless it has reached its maximal learning capacity. We expect the perturbations induced by leave-one-out to be small, and therefore the classifier should be very stable. As we increase the size of the perturbations, stability is less likely to hold: we expect stability to hold more in 20-fold cross-validation than in 10-fold cross-validation, and both should be more stable than holdout of 1/3. The proposition does not apply to the resubstitution estimate because it requires the inducer to be stable when no instances are given in the dataset.

The above proposition helps understand one possible assumption that is made when using cross-validation: if an inducer is unstable for a particular dataset under a set of perturbations introduced by cross-validation, the accuracy estimate is likely to be unreliable. If the inducer is almost stable on a given dataset, we should expect a reliable estimate. The next corollary takes the idea slightly further and shows a result that we have observed empirically: there is almost no change in the variance of the cross-validation estimate when the number of folds is varied.

Corollary 2 (Variance in cross-validation)
Given a dataset and an inducer. If the inducer is stable under the perturbations caused by deleting the test instances for the folds in k-fold cross-validation for various values of k, then the variance of the estimates will be the same.

Proof: The variance of k-fold cross-validation in Proposition 1 does not depend on k.

While some inducers are likely to be inherently more stable, the following example shows that one must also take into account the dataset and the actual perturbations.

Example 1 (Failure of leave-one-out)
Fisher's iris dataset contains 50 instances of each class, leading one to expect that a majority inducer should have accuracy about 33%. However, the combination of this dataset with a majority inducer is unstable for the small perturbations performed by leave-one-out. When an instance is deleted from the dataset, its label is a minority in the training set; thus the majority inducer predicts one of the other two classes and always errs in classifying the test instance. The leave-one-out estimated accuracy for a majority inducer on the iris dataset is therefore 0%. Moreover, all folds have this estimated accuracy; thus the standard deviation of the folds is again 0%, giving the unjustified assurance that the estimate is stable.

The example shows an inherent problem with cross-validation that applies to more than just a majority inducer. In a no-information dataset, where the label values are completely random, the best an induction algorithm can do is predict majority. Leave-one-out on such a dataset with 50% of the labels for each class and a majority inducer (the best possible inducer) would still predict 0% accuracy.
2.3 Bootstrap

The bootstrap family was introduced by Efron and is fully described in Efron & Tibshirani (1993). Given a dataset of size n, a bootstrap sample is created by sampling n instances uniformly from the data (with replacement).
Since the dataset is sampled with replacement, the probability of any given instance not being chosen after n samples is (1 - 1/n)^n ≈ e^{-1} ≈ 0.368; the expected number of distinct instances from the original dataset appearing in the test set is thus 0.632n. The ε0 accuracy estimate is derived by using the bootstrap sample for training and the rest of the instances for testing. Given a number b, the number of bootstrap samples, let ε0_i be the accuracy estimate for bootstrap sample i. The .632 bootstrap estimate is defined as

acc_{boot} = \frac{1}{b} \sum_{i=1}^{b} \left( 0.632 \cdot ε0_i + 0.368 \cdot acc_s \right)    (5)

where acc_s is the resubstitution accuracy estimate on the full dataset (i.e., the accuracy on the training set). The variance of the estimate can be determined by computing the variance of the estimates for the samples.
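The following sketch computes the .632 bootstrap estimate of Equation 5 under the same illustrative `inducer` interface as the earlier sketches; the default of b = 50 samples is an arbitrary choice for the example, not a value prescribed by the paper.

```python
import random

def bootstrap_632_accuracy(data, inducer, b=50, seed=0):
    """.632 bootstrap estimate (Equation 5), averaged over b bootstrap samples."""
    rng = random.Random(seed)
    n = len(data)
    # Resubstitution accuracy acc_s: train and test on the full dataset.
    classify_full = inducer(data)
    acc_s = sum(1 for v, y in data if classify_full(v) == y) / n
    estimates = []
    for _ in range(b):
        idx = [rng.randrange(n) for _ in range(n)]   # sample n indices with replacement
        sample = [data[i] for i in idx]
        chosen = set(idx)
        held_out = [data[i] for i in range(n) if i not in chosen]  # ~36.8% of the data
        classify = inducer(sample)
        eps0 = sum(1 for v, y in held_out if classify(v) == y) / max(len(held_out), 1)
        estimates.append(0.632 * eps0 + 0.368 * acc_s)
    return sum(estimates) / b
```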
The assumptions made by bootstrap are basically the same as those of cross-validation, i.e., stability of the algorithm on the dataset: the "bootstrap world" should closely approximate the real world. The .632 bootstrap fails to give the expected result when the classifier is a perfect memorizer (e.g., an unpruned decision tree or a one nearest neighbor classifier) and the dataset is completely random, say with two classes. The resubstitution accuracy is 100%, and the ε0 accuracy is about 50%. Plugging these into the bootstrap formula, one gets an estimated accuracy of about 68.4%, far from the real accuracy of 50%. Bootstrap can be shown to fail if we add a memorizer module to any given inducer and adjust its predictions. If the memorizer remembers the training set and makes the predictions when the test instance is a training instance, adjusting its predictions can make the resubstitution accuracy change from 0% to 100% and can thus bias the overall estimated accuracy in any direction we want.

3 Related Work

Some experimental studies comparing different accuracy estimation methods have been previously done, but most of them were on artificial or small datasets. We now describe some of these efforts.

Efron (1983) conducted five sampling experiments and compared leave-one-out cross-validation, several variants of bootstrap, and several other methods. The purpose of the experiments was to "investigate some related estimators, which seem to offer considerably improved estimation in small samples." The results indicate that leave-one-out cross-validation gives nearly unbiased estimates of the accuracy, but often with unacceptably high variability, particularly for small samples; and that the .632 bootstrap performed best.

Breiman et al. (1984) conducted experiments using cross-validation for decision tree pruning. They chose ten-fold cross-validation for the CART program and claimed it was satisfactory for choosing the correct tree. They claimed that "the difference in the cross-validation estimates of the risks of two rules tends to be much more accurate than the two estimates themselves."

Jain, Dubes & Chen (1987) compared the performance of the ε0 bootstrap and leave-one-out cross-validation on nearest neighbor classifiers using artificial data and claimed that the confidence interval of the bootstrap estimator is smaller than that of leave-one-out. Weiss (1991) followed similar lines and compared stratified cross-validation and two bootstrap methods with nearest neighbor classifiers. His results were that stratified two-fold cross-validation has relatively low variance and is superior to leave-one-out.

Breiman & Spector (1992) conducted feature subset selection experiments for regression, and compared leave-one-out cross-validation, k-fold cross-validation for various k, stratified k-fold cross-validation, bias-corrected bootstrap, and partial cross-validation (not discussed here). Tests were done on artificial datasets with 60 and 160 instances. The behavior observed was: (1) leave-one-out has low bias and RMS (root mean square) error, whereas two-fold and five-fold cross-validation have larger bias and RMS error only at models with many features; (2) the pessimistic bias of ten-fold cross-validation at small samples was significantly reduced for the samples of size 160; (3) for model selection, ten-fold cross-validation is better than leave-one-out.

Bailey & Elkan (1993) compared leave-one-out cross-validation to .632 bootstrap using the FOIL inducer and four synthetic datasets involving Boolean concepts. They observed high variability and little bias in the leave-one-out estimates, and low variability but large bias in the .632 estimates.

Weiss and Indurkhya (Weiss & Indurkhya 1994) conducted experiments on real-world data to determine the applicability of cross-validation to decision tree pruning. Their results were that for samples of size at least 200, using stratified ten-fold cross-validation to choose the amount of pruning yields unbiased trees (with respect to their optimal size).

4 Methodology

In order to conduct a large-scale experiment we decided to use C4.5 and a Naive-Bayesian classifier. The C4.5 algorithm (Quinlan 1993) is a descendant of ID3 that builds decision trees top-down. The Naive-Bayesian classifier (Langley, Iba & Thompson 1992) used was the one implemented in MLC++ (Kohavi, John, Long, Manley & Pfleger 1994) that uses the observed ratios for nominal features and assumes a Gaussian distribution for continuous features. The exact details are not crucial for this paper because we are interested in the behavior of the accuracy estimation methods more than the internals of the induction algorithms. The underlying hypothesis spaces (decision trees for C4.5 and summary statistics for Naive-Bayes) are different enough that we hope conclusions based on these two induction algorithms will apply to other induction algorithms.
Because the target concept is unknown for real-world concepts, we used the holdout method to estimate the quality of the cross-validation and bootstrap estimates.

To choose a set of datasets, we looked at the learning curves for C4.5 and Naive-Bayes for most of the supervised classification datasets at the UC Irvine repository (Murphy & Aha 1995) that contained more than 500 instances (about 25 such datasets). We felt that a minimum of 500 instances was required for testing. While the true accuracies of a real dataset cannot be computed because we do not know the target concept, we can estimate the true accuracies using the holdout method. The "true" accuracy estimates in Table 1 were computed by taking a random sample of the given size, computing the accuracy using the rest of the dataset as a test set, and repeating 500 times.

We chose six datasets from a wide variety of domains, such that the learning curve for both algorithms did not flatten out too early, that is, before one hundred instances. We also added a no-information dataset, rand, with 20 Boolean features and a Boolean random label. On one dataset, vehicle, the generalization accuracy of the Naive-Bayes algorithm deteriorated by more than 4% as more instances were given. A similar phenomenon was observed on the shuttle dataset. Such a phenomenon was predicted by Schaffer and Wolpert (Schaffer 1994, Wolpert 1994b), but we were surprised that it was observed on two real-world datasets.

To see how well an accuracy estimation method performs, we sampled instances from the dataset (uniformly without replacement) and created a training set of the desired size. We then ran the induction algorithm on the training set and tested the classifier on the rest of the instances in the dataset. This was repeated 50 times at points where the learning curve was sloping up. The same folds in cross-validation and the same samples in bootstrap were used for both algorithms compared.
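A rough sketch of this sampling protocol is shown below. The function name and the `inducer` interface are illustrative assumptions; the actual experiments ran C4.5 and Naive-Bayes through MLC++, which this sketch does not reproduce.

```python
import random

def repeated_holdout(data, inducer, train_size, repeats=50, seed=0):
    """Protocol used in the experiments: sample a training set of the desired
    size uniformly without replacement, test on all remaining instances, repeat."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        items = list(data)
        rng.shuffle(items)
        train, test = items[:train_size], items[train_size:]
        classify = inducer(train)
        accuracies.append(sum(1 for v, y in test if classify(v) == y) / len(test))
    mean = sum(accuracies) / repeats
    return mean, accuracies
```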
5 Results and Discussion

We now show the experimental results and discuss their significance. We begin with a discussion of the bias in the estimation methods and follow with a discussion of the variance. Due to lack of space, we omit some graphs for the Naive-Bayes algorithm when the behavior is approximately the same as that of C4.5.

5.1 The Bias

The bias of a method to estimate a parameter θ is defined as the expected value minus the estimated value. An unbiased estimation method is a method that has zero bias. Figure 1 shows the bias and variance of k-fold cross-validation on several datasets (the breast cancer dataset is not shown).

Figure 1: C4.5: The bias of cross-validation with varying folds. A negative k folds stands for leave-k-out. Error bars are 95% confidence intervals for the mean. The gray regions indicate 95% confidence intervals for the true accuracies. Note the different ranges for the accuracy axis. (Top panel: Soybean, Vehicle, Rand; bottom panel: Mushroom, Hypo, Chess; % accuracy against 2, 5, 10, 20, -5, -2, and -1 folds.)

The diagrams clearly show that k-fold cross-validation is pessimistically biased, especially for two and five folds. For the learning curves that have a large derivative at the measurement point, the pessimism in k-fold cross-validation for small k's is apparent. Most of the estimates are reasonably good at 10 folds, and at 20 folds they are almost unbiased.

Stratified cross-validation (not shown) had similar behavior, except for lower pessimism. The estimated accuracy for soybean at 2-fold was 7% higher and at five-fold, 4.7% higher; for vehicle at 2-fold, the accuracy was 2.8% higher and at five-fold, 1.9% higher. Thus stratification seems to be a less biased estimation method.

Figure 2 shows the bias and variance for the .632 bootstrap accuracy estimation method. Although the .632 bootstrap is almost unbiased for chess, hypothyroid, and mushroom for both inducers, it is highly biased for soybean with C4.5, vehicle with both inducers, and rand with both inducers. The bias with C4.5 and vehicle is 9.8%.

5.2 The Variance

While a given method may have low bias, its performance (accuracy estimation in our case) may be poor due to high variance. In the experiments above, we have formed confidence intervals by using the standard deviation of the mean accuracy. We now switch to the standard deviation of the population, i.e., the expected standard deviation of a single accuracy estimation run. In practice, if one does a single cross-validation run, the expected accuracy will be the mean reported above, but the standard deviation will be higher by a factor of √50, the number of runs we averaged in the experiments.
Dataset        | No. of attr. | Sample size / total size | No. of categories | Duplicate instances | C4.5       | Naive-Bayes
Breast cancer  | 10           | 50/699                   | 2                 | 8                   | 91.37±0.10 | 94.22±0.10
Chess          | 36           | 900/3196                 | 2                 | 0                   | 98.19±0.03 | 86.80±0.07
Hypothyroid    | 25           | 400/3163                 | 2                 | 77                  | 98.52±0.03 | 97.63±0.02
Mushroom       | 22           | 800/8124                 | 2                 | 0                   | 99.36±0.02 | 94.54±0.03
Soybean large  | 35           | 100/683                  | 19                | 52                  | 70.49±0.22 | 79.76±0.14
Vehicle        | 18           | 100/846                  | 4                 | 0                   | 60.11±0.16 | 46.80±0.16
Rand           | 20           | 100/3000                 | 2                 | 9                   | 49.96±0.04 | 49.90±0.04

Table 1: True accuracy estimates for the datasets using C4.5 and Naive-Bayes classifiers at the chosen sample sizes.

Figure 2: C4.5: The bias of bootstrap with varying samples. Estimates are good for mushroom, hypothyroid, and chess, but are extremely biased (optimistically) for vehicle and rand, and somewhat biased for soybean. (Panels plot estimated % accuracy against 1, 2, 5, 10, 20, 50, and 100 bootstrap samples.)

Figure 3: Cross-validation: standard deviation of accuracy (population). Different line styles are used to help differentiate between curves. (Panels for C4.5 and Naive-Bayes plot the standard deviation against 2, 5, 10, 20, -5, -2, and -1 folds.)

In what follows, all figures for standard deviation will be drawn with the same range for the standard deviation: 0 to 7.5%. Figure 3 shows the standard deviations for C4.5 and Naive-Bayes for varying numbers of folds in cross-validation. The results for stratified cross-validation were similar, with slightly lower variance. Figure 4 shows the same information for the .632 bootstrap.

Cross-validation has high variance at 2 folds on both C4.5 and Naive-Bayes. On C4.5, there is high variance at the high ends too (at leave-one-out and leave-two-out) for three files out of the seven datasets. Stratification reduces the variance slightly, and thus seems to be uniformly better than cross-validation, both for bias and variance.

6 Summary

We reviewed common accuracy estimation methods including holdout, cross-validation, and bootstrap, and showed examples where each one fails to produce a good estimate. We have compared the latter two approaches on a variety of real-world datasets with differing characteristics.

Proposition 1 shows that if the induction algorithm is stable for a given dataset, the variance of the cross-validation estimates should be approximately the same, independent of the number of folds. Although the induction algorithms are not stable, they are approximately stable. k-fold cross-validation with moderate k values (10-20) reduces the variance while increasing the bias. As k decreases (2-5) and the sample sizes get smaller, there is variance due to the instability of the training sets themselves, leading to an increase in variance.
This is most apparent for datasets with many categories, such as soybean. In these situations, stratification seems to help, but repeated runs may be a better approach.

Figure 4: .632 Bootstrap: standard deviation in accuracy (population). (Standard deviation against 1, 2, 5, 10, 20, 50, and 100 bootstrap samples.)

Our results indicate that stratification is generally a better scheme, both in terms of bias and variance, when compared to regular cross-validation. Bootstrap has low variance, but extremely large bias on some problems. We recommend using stratified ten-fold cross-validation for model selection.

Acknowledgments. We thank David Wolpert for a thorough reading of this paper and many interesting discussions. We thank Wray Buntine, Tom Bylander, Brad Efron, Jerry Friedman, Rob Holte, George John, Pat Langley, Rob Tibshirani, and Sholom Weiss for their helpful comments and suggestions. Dan Sommerfield implemented the bootstrap method in MLC++. All experiments were conducted using MLC++, partly funded by ONR grant N00014-94-1-0448 and NSF grants IRI-9116399 and IRI-9411306.

References

Bailey, T. L. & Elkan, C. (1993), Estimating the accuracy of learned concepts, in "Proceedings of the International Joint Conference on Artificial Intelligence", Morgan Kaufmann Publishers, pp. 895-900.

Breiman, L. & Spector, P. (1992), "Submodel selection and evaluation in regression. The X-random case", International Statistical Review 60(3), 291-319.

Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. (1984), Classification and Regression Trees, Wadsworth International Group.

Efron, B. (1983), "Estimating the error rate of a prediction rule: improvement on cross-validation", Journal of the American Statistical Association 78(382), 316-330.

Efron, B. & Tibshirani, R. (1993), An Introduction to the Bootstrap, Chapman & Hall.

Jain, A. K., Dubes, R. C. & Chen, C. (1987), "Bootstrap techniques for error estimation", IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-9(5), 628-633.

Kohavi, R., John, G., Long, R., Manley, D. & Pfleger, K. (1994), MLC++: A machine learning library in C++, in "Tools with Artificial Intelligence", IEEE Computer Society Press, pp. 740-743. Available by anonymous ftp from: starry.Stanford.EDU:pub/ronnyk/mlc/toolsmlc.ps.

Langley, P., Iba, W. & Thompson, K. (1992), An analysis of Bayesian classifiers, in "Proceedings of the Tenth National Conference on Artificial Intelligence", AAAI Press and MIT Press, pp. 223-228.

Murphy, P. M. & Aha, D. W. (1995), UCI repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html.

Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, California.

Schaffer, C. (1994), A conservation law for generalization performance, in "Machine Learning: Proceedings of the Eleventh International Conference", Morgan Kaufmann, pp. 259-265.

Shao, J. (1993), "Linear model selection via cross-validation", Journal of the American Statistical Association 88(422), 486-494.

Weiss, S. M. (1991), "Small sample error rate estimation for k-nearest neighbor classifiers", IEEE Transactions on Pattern Analysis and Machine Intelligence 13(3), 285-289.

Weiss, S. M. & Indurkhya, N. (1994), Decision tree pruning: Biased or optimal?, in "Proceedings of the Twelfth National Conference on Artificial Intelligence", AAAI Press and MIT Press, pp. 626-632.

Wolpert, D. H. (1992), "Stacked generalization", Neural Networks 5, 241-259.

Wolpert, D. H. (1994a), Off-training set error and a priori distinctions between learning algorithms, Technical Report SFI TR 94-12-123, The Santa Fe Institute.

Wolpert, D. H. (1994b), The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework, Technical report, The Santa Fe Institute, Santa Fe, NM.

Zhang, P. (1992), "On the distributional properties of model selection criteria", Journal of the American Statistical Association 87(419), 732-737.
