A Study of Cross-Validation and Bootstrap For Accuracy Estimation and Model Selection
Ron Kohavi
Abstract

We review accuracy estimation methods and compare the two most common methods: cross-validation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), ten-fold cross-validation may be better than the more expensive leave-one-out cross-validation. We report on a large-scale experiment, over half a million runs of C4.5 and a Naive-Bayes algorithm, to estimate the effects of different parameters on these algorithms on real-world datasets. For cross-validation, we vary the number of folds and whether the folds are stratified or not; for bootstrap, we vary the number of bootstrap samples. Our results indicate that for real-world datasets similar to ours, the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.

1 Introduction

  It can not be emphasized enough that no claim whatsoever is being made in this paper that all algorithms are equivalent in practice, in the real world. In particular, no claim is being made that one should not use cross-validation in the real world. -- Wolpert (1994a)

Estimating the accuracy of a classifier induced by supervised learning algorithms is important not only to predict its future prediction accuracy, but also for choosing a classifier from a given set (model selection), or combining classifiers (Wolpert 1992). For estimating the final accuracy of a classifier, we would like an estimation method with low bias and low variance. To choose a classifier or to combine classifiers, the absolute accuracies are less important and we are willing to trade off bias for low variance, assuming the bias affects all classifiers similarly (e.g., estimates are 5% pessimistic).

In this paper we explain some of the assumptions made by the different estimation methods, and present concrete examples where each method fails. While it is known that no accuracy estimation can be correct all the time (Wolpert 1994b, Schaffer 1994), we are interested in identifying a method that is well suited for the biases and trends in typical real-world datasets.

Recent results, both theoretical and experimental, have shown that it is not always the case that increasing the computational cost is beneficial, especially if the relative accuracies are more important than the exact values. For example, leave-one-out is almost unbiased, but it has high variance, leading to unreliable estimates (Efron 1983). For linear models, using leave-one-out cross-validation for model selection is asymptotically inconsistent in the sense that the probability of selecting the model with the best predictive power does not converge to one as the total number of observations approaches infinity (Zhang 1992, Shao 1993).

This paper is organized as follows. Section 2 describes the common accuracy estimation methods and ways of computing confidence bounds that hold under some assumptions. Section 3 discusses related work comparing cross-validation variants and bootstrap variants. Section 4 discusses the methodology underlying our experiment. The results of the experiments are given in Section 5 with a discussion of important observations. We conclude with a summary in Section 6.

A longer version of the paper can be retrieved by anonymous ftp to starry.stanford.edu:pub/ronnyk/accEst-long.ps.

2 Methods for Accuracy Estimation

A classifier is a function that maps an unlabelled instance to a label using internal data structures. An inducer, or an induction algorithm, builds a classifier from a given dataset. CART and C4.5 (Breiman, Friedman, Olshen & Stone 1984, Quinlan 1993) are decision tree inducers that build decision tree classifiers. In this paper, we are not interested in the specific method for inducing classifiers, but assume access to a dataset and an inducer of interest.
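To make the classifier/inducer terminology concrete, the following minimal Python sketch (ours, not part of the original paper) implements a majority inducer: a function that takes a dataset of labelled instances and returns a classifier which ignores the instance and predicts the most frequent training label. Inducers of this simple form are used in the examples of Sections 2.1 and 2.2.

    from collections import Counter

    def majority_inducer(train):
        """Induce a classifier from labelled instances (v, y): the returned
        classifier ignores v and predicts the most common training label."""
        majority_label = Counter(y for _, y in train).most_common(1)[0][0]
        return lambda v: majority_label

    # Build a classifier from a dataset D, then use it to label a new instance.
    D = [((5.1, 3.5), "setosa"), ((6.0, 2.2), "versicolor"), ((5.9, 3.0), "setosa")]
    print(majority_inducer(D)((4.9, 3.1)))   # -> setosa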
Let V be the space of unlabelled instances and Y the set of possible labels. Let X = V × Y be the space of labelled instances and D = {x_1, x_2, ..., x_n} be a dataset (possibly a multiset) consisting of n labelled instances, where x_i = ⟨v_i ∈ V, y_i ∈ Y⟩. A classifier C maps an unlabelled instance v ∈ V to a label y ∈ Y and an inducer I maps a given dataset D into a classifier C. The notation I(D, v) will denote the label assigned to an unlabelled instance v by the classifier built by inducer I on dataset D, i.e., I(D, v) = (I(D))(v). We assume that there exists a distribution on the set of labelled instances and that our dataset consists of i.i.d. (independently and identically distributed) instances. We consider equal misclassification costs using a 0/1 loss function, but the accuracy estimation methods can easily be extended to other loss functions.

The accuracy of a classifier C is the probability of correctly classifying a randomly selected instance, i.e., acc = Pr(C(v) = y) for a randomly selected instance ⟨v, y⟩ ∈ X, where the probability distribution over the instance space is the same as the distribution that was used to select instances for the inducer's training set. Given a finite dataset, we would like to estimate the future performance of a classifier induced by the given inducer and dataset. A single accuracy estimate is usually meaningless without a confidence interval; thus we will consider how to approximate such an interval when possible. In order to identify weaknesses, we also attempt to identify cases where the estimates fail.

2.1 Holdout

The holdout method, sometimes called test sample estimation, partitions the data into two mutually exclusive subsets called a training set and a test set, or holdout set. It is common to designate 2/3 of the data as the training set and the remaining 1/3 as the test set. The training set is given to the inducer, and the induced classifier is tested on the test set. Formally, let D_h, the holdout set, be a subset of D of size h, and let D_t be D \ D_h. The holdout estimated accuracy is defined as

    acc_h = \frac{1}{h} \sum_{\langle v_i, y_i \rangle \in D_h} \delta(I(D_t, v_i), y_i),    (1)

where \delta(i, j) = 1 if i = j and 0 otherwise. Assuming that the inducer's accuracy increases as more instances are seen, the holdout method is a pessimistic estimator because only a portion of the data is given to the inducer for training. The more instances we leave for the test set, the higher the bias of our estimate; however, fewer test set instances means that the confidence interval for the accuracy will be wider, as shown below.

Each test instance can be viewed as a Bernoulli trial: correct or incorrect prediction. Let S be the number of correct classifications on the test set; then S is distributed binomially (a sum of Bernoulli trials). For reasonably large holdout sets, the distribution of S/h is approximately normal with mean acc (the true accuracy of the classifier) and a variance of acc(1 − acc)/h. Thus, by the De Moivre-Laplace limit theorem, we have

    \Pr\left( -z < \frac{acc_h - acc}{\sqrt{acc(1 - acc)/h}} < z \right) \approx \gamma,    (2)

where z is the (1 + \gamma)/2-th quantile point of the standard normal distribution. To get a 100\gamma percent confidence interval, one determines z and inverts the inequalities. Inversion of the inequalities leads to a quadratic equation in acc, the roots of which are the low and high confidence points:

    \frac{2h \cdot acc_h + z^2 \pm z \sqrt{4h \cdot acc_h + z^2 - 4h \cdot acc_h^2}}{2(h + z^2)}.    (3)

The above equation is not conditioned on the dataset D; if more information is available about the probability of the given dataset, it must be taken into account.
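As a concrete illustration (our own Python sketch, not part of the original paper), the following function computes the holdout estimate of Equation 1 together with the confidence interval of Equation 3. The inducer argument is any function mapping a training list to a classifier; z defaults to 1.96, which corresponds to a roughly 95% interval.

    import math
    import random

    def holdout_estimate(dataset, inducer, test_fraction=1/3, z=1.96, seed=0):
        """Split the data, train on D_t, test on D_h (Equation 1), and
        return (acc_h, (low, high)), the interval following Equation 3."""
        data = list(dataset)
        random.Random(seed).shuffle(data)
        h = int(len(data) * test_fraction)
        holdout, train = data[:h], data[h:]
        classifier = inducer(train)
        acc_h = sum(classifier(v) == y for v, y in holdout) / h
        # Roots of the quadratic obtained by inverting Equation 2.
        disc = z * math.sqrt(4 * h * acc_h + z ** 2 - 4 * h * acc_h ** 2)
        low = (2 * h * acc_h + z ** 2 - disc) / (2 * (h + z ** 2))
        high = (2 * h * acc_h + z ** 2 + disc) / (2 * (h + z ** 2))
        return acc_h, (low, high)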
The holdout estimate is a random number that depends on the division into a training set and a test set. In random subsampling, the holdout method is repeated k times, and the estimated accuracy is derived by averaging the runs. The standard deviation can be estimated as the standard deviation of the accuracy estimations from each holdout run.

The main assumption that is violated in random subsampling is the independence of the instances in the test set from those in the training set. If the training and test sets are formed by a split of an original dataset, then an over-represented class in one subset will be under-represented in the other. To demonstrate the issue, we simulated a 2/3, 1/3 split of Fisher's famous iris dataset and used a majority inducer that builds a classifier predicting the prevalent class in the training set. The iris dataset describes iris plants using four continuous features, and the task is to classify each instance (an iris) as Iris Setosa, Iris Versicolour, or Iris Virginica. For each class label, exactly one third of the instances carry that label (50 instances of each class from a total of 150 instances); thus we expect 33.3% prediction accuracy. However, because the test set will always contain less than 1/3 of the instances of the class that was prevalent in the training set, the accuracy predicted by the holdout method is 27.68% with a standard deviation of 0.13% (estimated by averaging 500 holdouts).

In practice, the dataset size is always finite, and usually smaller than we would like it to be. The holdout method makes inefficient use of the data: a third of the dataset is not used for training the inducer.
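The bias described above is easy to reproduce. The sketch below (ours, not from the paper) repeats a 2/3, 1/3 random split on a perfectly class-balanced three-class label set and trains a majority inducer on each split; the features are irrelevant to this inducer, so only labels are simulated. The average holdout accuracy comes out noticeably below the 33.3% one might expect.

    import random
    from collections import Counter

    def repeated_holdout_majority(n_per_class=50, classes=3, runs=500, seed=0):
        """Average holdout accuracy of a majority inducer on a perfectly
        class-balanced label set, using repeated 2/3, 1/3 splits."""
        rng = random.Random(seed)
        labels = [c for c in range(classes) for _ in range(n_per_class)]
        cut = 2 * len(labels) // 3
        accs = []
        for _ in range(runs):
            rng.shuffle(labels)
            train, test = labels[:cut], labels[cut:]
            prediction = Counter(train).most_common(1)[0][0]
            accs.append(sum(y == prediction for y in test) / len(test))
        return sum(accs) / len(accs)

    print(repeated_holdout_majority())   # typically about 0.27-0.28, below 1/3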
2.2 Cross-Validation, Leave-one-out, and Stratification

In k-fold cross-validation, sometimes called rotation estimation, the dataset D is randomly split into k mutually exclusive subsets (the folds) D_1, D_2, ..., D_k of approximately equal size. The inducer is trained and tested k times; each time t ∈ {1, 2, ..., k}, it is trained on D \ D_t and tested on D_t. The cross-validation estimate of accuracy is the overall number of correct classifications, divided by the number of instances in the dataset. Formally, let D_(i) be the test set that includes instance x_i = ⟨v_i, y_i⟩; then the cross-validation estimate of accuracy is

    acc_{cv} = \frac{1}{n} \sum_{\langle v_i, y_i \rangle \in D} \delta(I(D \setminus D_{(i)}, v_i), y_i).    (4)

The cross-validation estimate is a random number that depends on the division into folds. Complete cross-validation is the average of all \binom{m}{m/k} possibilities for choosing m/k instances out of m, but it is usually too expensive. Except for leave-one-out (n-fold cross-validation), which is always complete, k-fold cross-validation is estimating complete k-fold cross-validation using a single split of the data into the folds. Repeating cross-validation multiple times using different splits into folds provides a better Monte-Carlo estimate of the complete cross-validation at an added cost. In stratified cross-validation, the folds are stratified so that they contain approximately the same proportions of labels as the original dataset.
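A compact sketch of stratified k-fold cross-validation is given below (ours, not from the paper; a real experiment would normally rely on a tested library implementation). Instances are dealt into folds class by class so that each fold keeps approximately the original label proportions, and the accuracy is accumulated as in Equation 4.

    import random
    from collections import defaultdict

    def stratified_cv_accuracy(dataset, inducer, k=10, seed=0):
        """Estimate accuracy by stratified k-fold cross-validation.
        `dataset` is a list of (v, y); `inducer` maps a training list to a
        classifier, i.e., a callable v -> y."""
        rng = random.Random(seed)
        by_label = defaultdict(list)
        for instance in dataset:
            by_label[instance[1]].append(instance)
        folds = [[] for _ in range(k)]
        for instances in by_label.values():
            rng.shuffle(instances)
            for i, instance in enumerate(instances):   # deal each class across folds
                folds[i % k].append(instance)
        correct = 0
        for t in range(k):
            train = [x for i in range(k) if i != t for x in folds[i]]
            classifier = inducer(train)
            correct += sum(classifier(v) == y for v, y in folds[t])
        return correct / len(dataset)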
An inducer is stable for a given dataset and a set of perturbations if it induces classifiers that make the same predictions when it is given the perturbed datasets.

Proposition 1 (Variance in k-fold CV)
Given a dataset and an inducer, if the inducer is stable under the perturbations caused by deleting the instances for the folds in k-fold cross-validation, then the cross-validation estimate will be unbiased and the variance of the estimated accuracy will be approximately acc_cv(1 − acc_cv)/n, where n is the number of instances in the dataset.

Proof: If we assume that the k classifiers produced make the same predictions, then the estimated accuracy has a binomial distribution with n trials and probability of success equal to the accuracy of the classifier.

For large enough n, a confidence interval may be computed using Equation 3 with h equal to n, the number of instances.

In reality, a complex inducer is unlikely to be stable for large perturbations, unless it has reached its maximal learning capacity. We expect the perturbations induced by leave-one-out to be small and therefore the classifier should be very stable. As we increase the size of the perturbations, stability is less likely to hold: we expect stability to hold more in 20-fold cross-validation than in 10-fold cross-validation, and both should be more stable than a holdout of 1/3. The proposition does not apply to the resubstitution estimate because it requires the inducer to be stable when no instances are given in the dataset.

The above proposition helps understand one possible assumption that is made when using cross-validation: if an inducer is unstable for a particular dataset under a set of perturbations introduced by cross-validation, the accuracy estimate is likely to be unreliable. If the inducer is almost stable on a given dataset, we should expect a reliable estimate. The next corollary takes the idea slightly further and shows a result that we have observed empirically: there is almost no change in the variance of the cross-validation estimate when the number of folds is varied.

Corollary 2 (Variance in cross-validation)
Given a dataset and an inducer, if the inducer is stable under the perturbations caused by deleting the test instances for the folds in k-fold cross-validation for various values of k, then the variance of the estimates will be the same.

Proof: The variance of k-fold cross-validation in Proposition 1 does not depend on k.

While some inducers are likely to be inherently more stable, the following example shows that one must also take into account the dataset and the actual perturbations.

Example 1 (Failure of leave-one-out)
Fisher's iris dataset contains 50 instances of each class, leading one to expect that a majority inducer should have accuracy of about 33%. However, the combination of this dataset with a majority inducer is unstable for the small perturbations performed by leave-one-out. When an instance is deleted from the dataset, its label is a minority in the training set; thus the majority inducer predicts one of the other two classes and always errs in classifying the test instance. The leave-one-out estimated accuracy for a majority inducer on the iris dataset is therefore 0%. Moreover, all folds have this estimated accuracy; thus the standard deviation of the folds is again 0%, giving the unjustified assurance that the estimate is stable.

The example shows an inherent problem with cross-validation that applies to more than just a majority inducer. In a no-information dataset, where the label values are completely random, the best an induction algorithm can do is predict majority. Leave-one-out on such a dataset with 50% of the labels for each class and a majority inducer (the best possible inducer) would still predict 0% accuracy.
2.3 Bootstrap

The bootstrap family was introduced by Efron and is fully described in Efron & Tibshirani (1993). Given a dataset of size n, a bootstrap sample is created by sampling n instances uniformly from the data (with replacement). Since the dataset is sampled with replacement, the probability of any given instance not being chosen after n samples is (1 − 1/n)^n ≈ e^{−1} ≈ 0.368; the expected number of distinct instances from the original dataset appearing in the test set is thus 0.632n. The ε0 accuracy estimate is derived by using the bootstrap sample for training and the rest of the instances for testing. Given a number b, the number of bootstrap samples, let ε0_i be the accuracy estimate for bootstrap sample i. The .632 bootstrap estimate is defined as

    acc_{boot} = \frac{1}{b} \sum_{i=1}^{b} \left( 0.632 \cdot \varepsilon0_i + 0.368 \cdot acc_s \right),    (5)

where acc_s is the resubstitution accuracy estimate on the full dataset (i.e., the accuracy on the training set). The variance of the estimate can be determined by computing the variance of the estimates for the samples.

The assumptions made by bootstrap are basically the same as those of cross-validation, i.e., stability of the algorithm on the dataset: the "bootstrap world" should closely approximate the real world. The .632 bootstrap fails to give the expected result when the classifier is a perfect memorizer (e.g., an unpruned decision tree or a one-nearest-neighbor classifier) and the dataset is completely random, say with two classes. The resubstitution accuracy is 100%, and the ε0 accuracy is about 50%. Plugging these into the bootstrap formula, one gets an estimated accuracy of about 68.4%, far from the real accuracy of 50%. Bootstrap can be shown to fail if we add a memorizer module to any given inducer and adjust its predictions. If the memorizer remembers the training set and makes its memorized predictions when the test instance was a training instance, adjusting its predictions can make the resubstitution accuracy change from 0% to 100% and can thus bias the overall estimated accuracy in any direction we want.
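The following sketch (ours, not from the paper) computes the .632 bootstrap estimate of Equation 5: each bootstrap sample serves as a training set, the instances left out of that sample provide the ε0 accuracy, and each ε0 estimate is blended with the resubstitution accuracy.

    import random

    def bootstrap_632_accuracy(dataset, inducer, b=50, seed=0):
        """Estimate accuracy with the .632 bootstrap (Equation 5).
        `dataset` is a list of (v, y); `inducer` maps a training list to a
        classifier, i.e., a callable v -> y."""
        rng = random.Random(seed)
        n = len(dataset)
        resub = inducer(dataset)
        acc_s = sum(resub(v) == y for v, y in dataset) / n   # resubstitution accuracy
        estimates = []
        for _ in range(b):
            indices = [rng.randrange(n) for _ in range(n)]   # sample with replacement
            chosen = set(indices)
            sample = [dataset[i] for i in indices]
            out_of_bag = [dataset[i] for i in range(n) if i not in chosen]
            if not out_of_bag:           # degenerate sample; skip it
                continue
            classifier = inducer(sample)
            eps0 = sum(classifier(v) == y for v, y in out_of_bag) / len(out_of_bag)
            estimates.append(0.632 * eps0 + 0.368 * acc_s)
        return sum(estimates) / len(estimates)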
3 Related Work

Some experimental studies comparing different accuracy estimation methods have been previously done, but most of them were on artificial or small datasets. We now describe some of these efforts.

Efron (1983) conducted five sampling experiments and compared leave-one-out cross-validation, several variants of bootstrap, and several other methods. The purpose of the experiments was to "investigate some related estimators, which seem to offer considerably improved estimation in small samples." The results indicate that leave-one-out cross-validation gives nearly unbiased estimates of the accuracy, but often with unacceptably high variability, particularly for small samples; and that the .632 bootstrap performed best.

Breiman et al. (1984) conducted experiments using cross-validation for decision tree pruning. They chose ten-fold cross-validation for the CART program and claimed it was satisfactory for choosing the correct tree. They claimed that "the difference in the cross-validation estimates of the risks of two rules tends to be much more accurate than the two estimates themselves."

Jain, Dubes & Chen (1987) compared the performance of the ε0 bootstrap and leave-one-out cross-validation on nearest-neighbor classifiers using artificial data and claimed that the confidence interval of the bootstrap estimator is smaller than that of leave-one-out. Weiss (1991) followed similar lines and compared stratified cross-validation and two bootstrap methods with nearest-neighbor classifiers. His results were that stratified two-fold cross-validation has relatively low variance and is superior to leave-one-out.

Breiman & Spector (1992) conducted feature subset selection experiments for regression, and compared leave-one-out cross-validation, k-fold cross-validation for various k, stratified k-fold cross-validation, bias-corrected bootstrap, and partial cross-validation (not discussed here). Tests were done on artificial datasets with 60 and 160 instances. The behavior observed was: (1) leave-one-out has low bias and RMS (root mean square) error, whereas two-fold and five-fold cross-validation have larger bias and RMS error only at models with many features; (2) the pessimistic bias of ten-fold cross-validation at small samples was significantly reduced for the samples of size 160; (3) for model selection, ten-fold cross-validation is better than leave-one-out.

Bailey & Elkan (1993) compared leave-one-out cross-validation to the .632 bootstrap using the FOIL inducer and four synthetic datasets involving Boolean concepts. They observed high variability and little bias in the leave-one-out estimates, and low variability but large bias in the .632 estimates.

Weiss and Indurkhya (Weiss & Indurkhya 1994) conducted experiments on real-world data to determine the applicability of cross-validation to decision tree pruning. Their results were that for samples of size at least 200, using stratified ten-fold cross-validation to choose the amount of pruning yields unbiased trees (with respect to their optimal size).

4 Methodology

In order to conduct a large-scale experiment we decided to use C4.5 and a Naive-Bayesian classifier. The C4.5 algorithm (Quinlan 1993) is a descendant of ID3 that builds decision trees top-down. The Naive-Bayesian classifier (Langley, Iba & Thompson 1992) used was the one implemented in MLC++ (Kohavi, John, Long, Manley & Pfleger 1994) that uses the observed ratios for nominal features and assumes a Gaussian distribution for continuous features. The exact details are not crucial for this paper because we are interested in the behavior of the accuracy estimation methods more than the internals of the induction algorithms. The underlying hypothesis spaces (decision trees for C4.5 and summary statistics for Naive-Bayes) are different enough that we hope conclusions based on these two induction algorithms will apply to other induction algorithms.
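As an illustration of the kind of Naive-Bayes model described above (our sketch, not the MLC++ implementation), the code below fits per-class priors and per-feature Gaussians to continuous features and predicts the class with the highest log-posterior; nominal features, which the paper's version handles with observed frequency ratios, are omitted for brevity.

    import math
    from collections import defaultdict

    def naive_bayes_inducer(train):
        """Fit class priors and per-feature Gaussians; return a classifier.
        `train` is a list of (v, y) with v a tuple of continuous features."""
        by_class = defaultdict(list)
        for v, y in train:
            by_class[y].append(v)
        params = {}
        for y, rows in by_class.items():
            prior = len(rows) / len(train)
            stats = []
            for feature in zip(*rows):                  # one feature column at a time
                mean = sum(feature) / len(feature)
                var = sum((f - mean) ** 2 for f in feature) / len(feature) or 1e-9
                stats.append((mean, var))
            params[y] = (prior, stats)

        def classify(v):
            def log_posterior(y):
                prior, stats = params[y]
                log_p = math.log(prior)
                for f, (mean, var) in zip(v, stats):
                    log_p += -0.5 * math.log(2 * math.pi * var) - (f - mean) ** 2 / (2 * var)
                return log_p
            return max(params, key=log_posterior)

        return classify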
Because the target concept is unknown for real-world concepts, we used the holdout method to estimate the quality of the cross-validation and bootstrap estimates. We chose datasets from the UC Irvine repository (Murphy & Aha 1995) that contained more than 500 instances, and we estimated the true accuracies using the holdout method. The "true" accuracy estimates in Table 1 were computed by taking a random sample of the given size, computing the accuracy of the induced classifier on the remaining instances, and averaging over repeated samples. Datasets were chosen such that the learning curve for both algorithms did not flatten out too early, that is, before one hundred instances.

Figure 2: C4.5: The bias of bootstrap with varying samples. Estimates are good for mushroom, hypothyroid, and chess, but are extremely biased (optimistically) for vehicle and rand, and somewhat biased for soybean.

Figure 3: Cross-validation: standard deviation of accuracy (population). Different line styles are used to help differentiate between curves.
In what follows, all figures for standard deviation will be drawn with the same range for the standard deviation: 0 to 7.5%. Figure 3 shows the standard deviations for C4.5 and Naive-Bayes using varying numbers of folds for cross-validation. The results for stratified cross-validation were similar, with slightly lower variance. Figure 4 shows the same information for the .632 bootstrap.

Cross-validation has high variance at 2 folds on both C4.5 and Naive-Bayes. On C4.5, there is high variance at the high ends too (at leave-one-out and leave-two-out) for three files out of the seven datasets. Stratification reduces the variance slightly, and thus seems to be uniformly better than cross-validation, both for bias and variance.

6 Summary

We reviewed common accuracy estimation methods including holdout, cross-validation, and bootstrap, and showed examples where each one fails to produce a good estimate. We have compared the latter two approaches on a variety of real-world datasets with differing characteristics.

Proposition 1 shows that if the induction algorithm is stable for a given dataset, the variance of the cross-validation estimates should be approximately the same, independent of the number of folds. Although the induction algorithms are not stable, they are approximately stable. k-fold cross-validation with moderate k values (10-20) reduces the variance while increasing the bias. As k decreases (2-5) and the sample sizes get smaller, there is variance due to the instability of the training sets themselves.
References

Jain, A. K., Dubes, R. C. & Chen, C. (1987), "Bootstrap techniques for error estimation", IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-9(5), 628-633.