
Pattern Recognition 48 (2015) 2839–2846


Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation

Tzu-Tsung Wong*
Institute of Information Management, National Cheng Kung University, 1, Ta-Sheuh Road, Tainan City 701, Taiwan, ROC
* Tel.: +886 6 2757575x53722; fax: +886 6 2362162. E-mail address: [email protected]
http://dx.doi.org/10.1016/j.patcog.2015.03.009

Article info

Article history:
Received 6 November 2014
Received in revised form 4 February 2015
Accepted 8 March 2015
Available online 17 March 2015

Keywords: Classification; Independence; k-Fold cross validation; Leave-one-out cross validation; Sampling distribution

Abstract

Classification is an essential task for predicting the class values of new instances. Both k-fold and leave-one-out cross validation are very popular for evaluating the performance of classification algorithms. Many data mining texts introduce the operations for these two kinds of cross validation and the statistical methods that can be used to analyze the resulting accuracies of algorithms, but those contents are generally not all consistent. Analysts can therefore be confused in performing a cross validation procedure. In this paper, the independence assumptions in cross validation are introduced, and the circumstances that satisfy the assumptions are addressed. The independence assumptions are then used to derive the sampling distributions of the point estimators for k-fold and leave-one-out cross validation. The cross validation procedure that yields such sampling distributions is discussed to provide new insights into evaluating the performance of classification algorithms.

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

Data mining is an emerging technique for automatically processing the huge amounts of data stored in computers, and classification is an essential task in data mining for assigning the class values of new instances. Two popular approaches for evaluating the performance of a classification algorithm on a data set are k-fold and leave-one-out cross validation. When the amount of data is large, k-fold cross validation should be employed to estimate the accuracy of the model induced from a classification algorithm, because the accuracy resulting from the training data of the model is generally too optimistic [1]. Leave-one-out cross validation is a special case of k-fold cross validation in which the number of folds equals the number of instances. When the number of instances either in a data set or for a class value is small, as in gene microarray data and gene sequence data, leave-one-out cross validation should be adopted to obtain a reliable accuracy estimate for a classification algorithm. Unlike leave-one-out cross validation, k-fold cross validation contains a randomness mechanism, so the mean accuracy resulting from k-fold cross validation on a data set is not a constant.

Bias and variance are two main measures for investigating the impact of this randomness mechanism, where bias represents the expected difference between an accuracy estimate and the actual accuracy, and variance represents the variability of an accuracy estimate. The concept of bias and variance has been employed to argue why simple classification algorithms such as naïve Bayesian classifiers and k-nearest neighbors can achieve competitive performance [2], and Bengio and Grandvalet [3] showed that an unbiased estimator of the variance of k-fold cross validation does not exist. The bias of an accuracy estimate is smaller when the number of folds is either five or ten [4].

The following four factors can affect an accuracy estimate obtained by k-fold cross validation:

• The number of folds.
• The number of instances in a fold.
• The level of averaging.
• The repetition of k-fold cross validation.

Introducing the ways of executing k-fold and leave-one-out cross validation is almost obligatory in every data mining book. Table 1 summarizes the treatment of the two approaches in six books, including whether a book introduces a way to make statistical inference for the results obtained by leave-one-out cross validation; a hyphen denotes that a book does not explicitly describe the item corresponding to a column. The settings of the four factors for executing k-fold cross validation are not all the same in any pair of books in Table 1. This implies that not only the randomness mechanism for dividing instances into folds, but also the settings of the four factors, can affect the accuracy estimate obtained by k-fold cross validation.

Every book in Table 1 provides ways to derive a confidence interval or to perform hypothesis testing for the results obtained by k-fold cross validation.

In leave-one-out cross validation, every fold has only one instance, and hence random partition is not necessary. The ways of making statistical inference on the results obtained by k-fold cross validation therefore cannot be applied to analyze those obtained by leave-one-out cross validation. However, none of the six books given in Table 1 introduces statistical methods specifically suitable for evaluating the performance of a classification algorithm by leave-one-out cross validation.

Table 1
A summary of k-fold and leave-one-out cross validation.

Book                k-Fold cross validation                                           Leave-one-out
                    No. of folds   Fold size               Averaging level Repetition cross validation
Alpaydin [5]        10 or 30       np ≥ 5 and n(1−p) ≥ 5*  Fold            Suggested  –
Han et al. [6]      10             –                       Data set        Suggested  –
Kantardzic [7]      –              –                       Fold            –          –
Mitchell [8]        –              30                      Fold            –          –
Tan et al. [9]      Large          –                       Data set        –          –
Witten et al. [1]   10             100                     Fold            Suggested  –

* n and p represent the number of instances in a fold and the actual accuracy, respectively.

For any population parameter, a necessary step for deriving a confidence interval or performing hypothesis testing is to know the sampling distribution of its corresponding point estimator. In this paper, we will derive the sampling distributions of the point estimators for k-fold and leave-one-out cross validation. The sampling distributions will be used to analyze the proper settings of the four factors in performing k-fold cross validation, and to design statistical inference methods for leave-one-out cross validation.

This paper is organized as follows. Section 2 briefly introduces the definition of sampling distribution and the central limit theorem. A sample is generally assumed to be collected by simple random sampling for deriving the sampling distribution of a point estimator. The observations in a random sample need to be independent and come from the same population. Section 3 therefore discusses the independence properties of the testing results obtained by k-fold and leave-one-out cross validation. The sampling distributions for k-fold and leave-one-out cross validation are derived in Sections 4 and 5, respectively. The conclusions and future directions for research are summarized in Section 6.

2. Sampling distributions

Accuracy is a critical measure for evaluating the performance of a classification algorithm. When all instances in a data set have the same weight, the accuracy of a classification algorithm on a data set is defined as the number of instances predicted correctly over the total number of instances. In this case, accuracy is a sample proportion, which is a special case of a sample mean. Sample mean x̄ and sample proportion p̄ are point estimators of population mean μ and population proportion p, respectively, and the probability distributions governing point estimators are called sampling distributions. This section briefly introduces the sampling distributions of x̄ and p̄. The sampling distribution of the sample mean can be approximated by a normal distribution as the sample size becomes large.

The usefulness of the central limit theorem is that regardless of the probability distribution governing a population, the sample mean can be assumed to have a normal distribution when the sample size is large. Since a sample proportion is a special case of a sample mean, the central limit theorem can be applied to a sample proportion as well. If a population follows a normal distribution, then the sampling distribution of a sample mean calculated from independent observations will be normally distributed for any sample size, because the sum of two independent normal random variables is also normally distributed [10]. When the probability distribution governing a population is unknown, the sampling distribution will be approximately normal only when the central limit theorem can be applied. In this case, criteria for determining whether a sample size n is large are necessary.

Let x_i be the ith observation in a sample. Many statistics texts suggest that if the probability distribution for the population is not highly skewed, a sample is large enough to assume a normal distribution for the sample mean if the sample size n ≥ 30. When the population proportion p is very close to zero or one, n ≥ 30 is no longer an appropriate criterion for applying the central limit theorem. The criterion for a large sample is therefore revised as np ≥ 5 and n(1−p) ≥ 5 for a sample proportion [11,12], which is derived from the normal approximation of the binomial distribution. For instance, if p = 0.9, a sample with size n = 40 is not large enough to assume that p̄ follows a normal distribution, because n(1−p) = 4 < 5. Note that np and n(1−p) represent the expected numbers of successes and failures, respectively, in a sample with size n. This means that the resulting accuracy of a classification algorithm can be assumed to follow a normal distribution if the expected numbers of correct and wrong predictions on n instances are both not less than five.

The procedure for obtaining a simple random sample with size n from a finite population is to ensure that every possible sample of size n has the same probability of being chosen. When the size of a population is infinite, the number of possible samples of size n will also be infinite. A random sample of size n from an infinite population must therefore satisfy two conditions: each observation is selected independently and comes from the same population [12]. To ensure that the predictions of two instances are independent, they must be independently drawn from the same population. Let x_i = 1 if the prediction of instance i by a classification algorithm is correct, and x_i = 0 otherwise. Then we have E(x_i) = p and Var(x_i) = p(1−p), where p is the prediction accuracy of the model induced by the algorithm on the population. The sampling distribution of p̄ = Σ_{i=1}^n x_i / n can be approximated by N(p, p(1−p)/n) by the central limit theorem if np ≥ 5 and n(1−p) ≥ 5.

Since the population proportion p is generally unknown, it is not easy to determine whether the conditions np ≥ 5 and n(1−p) ≥ 5 hold for a random sample with n instances. Hence, in identifying whether a random sample with size n is large or not, we replace p by p̄ in the two conditions, because p̄ is an unbiased estimator of p. In the remainder of this paper, a random sample with n instances is therefore considered to be large if the numbers of correct and wrong predictions are both not less than five; these are called the large-sample conditions. For instance, if the number of correct predictions for a random sample with 50 instances is 42, then the resulting accuracy can be assumed to be normally distributed. On the contrary, if the number of correct predictions for this sample is 47, then it is inappropriate to assume that the resulting accuracy follows a normal distribution, because the number of wrong predictions is only three.
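Because the large-sample conditions are used throughout the paper, it helps to state them operationally. The following Python sketch is our illustration, not code from the paper; the function names and the use of SciPy for the normal quantile are our own choices. It checks the conditions and, when they hold, builds the confidence interval implied by N(p, p(1−p)/n):

```python
from math import sqrt
from scipy.stats import norm

def satisfies_large_sample(correct, n):
    """Large-sample conditions: at least five correct and five wrong
    predictions among the n tested instances."""
    return correct >= 5 and (n - correct) >= 5

def accuracy_interval(correct, n, alpha=0.05):
    """Normal-approximation confidence interval for accuracy,
    justified only when the large-sample conditions hold."""
    if not satisfies_large_sample(correct, n):
        raise ValueError("normal approximation N(p, p(1-p)/n) is unjustified")
    p_hat = correct / n
    half = norm.ppf(1 - alpha / 2) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

print(satisfies_large_sample(42, 50))  # True: 42 correct, 8 wrong
print(satisfies_large_sample(47, 50))  # False: only 3 wrong predictions
```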

3. Independence

Most statistical inference techniques need data to be collected by simple random sampling. As described in the previous section, when the population size is infinite, two necessary conditions for simple random sampling are: every observation comes from the same population, and all observations are collected independently.

The instances in a data set, called a random sample, are generally assumed to come from the same population. Otherwise, they should not be in the same data set for learning. It is therefore reasonable to assume that the instances in a data set are all governed by the probability distribution for a population. When data are collected by simple random sampling, every two instances in a data set are considered to be independent. This section discusses the impact of the independence assumptions on the point estimators obtained by cross validation.

The purpose of classification is to find a model from training data such that the model can make a correct prediction of the class value for most new instances. Let A, R, and e represent a classification algorithm, a training data set, and a new instance, respectively. Then the model learned from training data R by classification algorithm A can be represented as M_{A,R}, and the class value predicted by this model for instance e can be denoted by M_{A,R}(e). Let c(e) be the actual class value of instance e, and let p_A be the actual probability of correct prediction of classification algorithm A on the population; i.e., P{M_{A,R}(e) = c(e)} = p_A. This expression implies that whether a prediction is correct or not depends on the classification algorithm, the training data, and the new instance.

Definition 1. Let the instances for training and testing be independent.

(a) Instance-independence assumption: For any two independent instances e_1 and e_2, M_{A,R}(e_1) = c(e_1) is independent of M_{B,R'}(e_2) = c(e_2) for any A, B, R, and R'.
(b) Scheme-independence assumption: For any two different classification methods A and B, M_{A,R}(e_1) = c(e_1) is independent of M_{B,R'}(e_2) = c(e_2) for any R, R', e_1, and e_2.

In cross validation, an instance in R cannot be used for testing. It is therefore reasonable to assume that any new instance is independent of training data set R. Two new instances e_1 and e_2 are independent when they are collected by simple random sampling. The predictions of e_1 and e_2 are therefore independent regardless of the classification algorithms and training data for generating prediction models. This means that the instance-independence assumption is always true for both k-fold and leave-one-out cross validation. However, the conditions under which the scheme-independence assumption is true are more complicated.

Every classification algorithm has its own mechanism for learning a model from training data. It seems that assuming M_{A,R}(e) is independent of M_{B,R}(e) for two different classification algorithms A and B should be reasonable. Let A be the algorithm for finding a fully grown decision tree by the gain ratio from R, and let B be the algorithm for finding a fully grown decision tree by the gain ratio from R and pruning this tree by some measure. The only difference between algorithms A and B is that B has a mechanism to prune the fully grown tree. Since the fully grown trees found by algorithms A and B must be identical, it is inappropriate to assume that M_{A,R}(e) is independent of M_{B,R}(e) in this case.

As addressed by Mitchell [8], a classification algorithm must have an inductive bias to predict class values for new instances, and the inductive bias can be language bias, search bias, or both. The models considered in the learning mechanism of a classification algorithm form a model space. A classification algorithm has no language bias if all possible models are included in its model space. The search bias describes whether a classification algorithm uses a measure to prefer one model over another. If the model spaces of two classification algorithms have nonequivalent representations, then the scheme-independence assumption must be true. For instance, a model in decision tree induction is a decision tree, and a model in support vector machines is a hyperplane. Since the two algorithms have nonequivalent representations of a model, the scheme-independence assumption is true in comparing their performance. Note that a decision tree can be represented as a set of classification rules. The model spaces of decision tree induction and the sequential covering algorithm will therefore have common models.

When two classification algorithms have the same model space, they can still be scheme-independent if their preferences on models are independent. For instance, the growing measure in decision tree induction can be the gain ratio or the gini index. Since the two measures set different preferences on models, the two models found by the gain ratio and the gini index can be assumed to be independent. In summary, if the model spaces of two classification algorithms do not have equivalent representations, then the scheme-independence assumption is true regardless of their search bias. When the model spaces of two algorithms have common models, the scheme-independence assumption can be true only when they have independent search biases. This guideline can also be applied to analyze whether the scheme-independence assumption is true for discretization and feature selection, two popular tasks in data preprocessing.

Discretization transforms continuous attributes into discrete ones. The two main operations of discretization are to determine the number of intervals and the boundaries of every interval. If two discretization methods perform the two operations independently, then the models represented by the discretized continuous attributes can be assumed to be nonequivalent. In this case, the scheme-independence assumption is true even when the classification algorithms for evaluating the two discretization methods are the same. For instance, the model spaces formed by the discrete attributes resulting from equal-width discretization and entropy-based discretization are nonequivalent.

Feature selection is a tool to remove redundant and irrelevant attributes for classification. Suppose that the classification algorithms for evaluating two feature selection methods are the same. If the intersection of the attribute subsets chosen by the two feature selection methods is not empty, then the model spaces resulting from the two attribute subsets will have common models. The scheme-independence assumption is therefore false in this case.

In leave-one-out cross validation, every instance is in turn used to test the model induced from the other instances. Thus, the instance-independence assumption guarantees that every prediction in leave-one-out cross validation is independent of the others. The scheme-independence assumption indicates that M_{A,R}(e) = c(e) is independent of M_{B,R}(e) = c(e). This assumption provides a basis for comparing the performance of two classification algorithms by cross validation.

Though training data can affect the result of a prediction, they cannot be used to determine whether two predictions are independent without considering the testing instances and classification algorithms. When two training data sets R and R' are both collected by simple random sampling, R and R' are independent. Since they are governed by the same probability distribution for the population, it is possible that a classification algorithm will induce the same model from R and R'. In this case, M_{A,R}(e) and M_{A,R'}(e) will be the same. It is therefore inappropriate to assume that training data sets are independent in cross validation.

Proposition 1. The k accuracies resulting from k-fold cross validation are independent.

Proof. Let a data set D be divided into folds F_1, F_2, …, F_k such that F_i ∩ F_j = ∅ for any i ≠ j. In evaluating the performance of classification algorithm A, the accuracy on fold F_j is calculated as Σ_{e∈F_j} I(M_{A,D∖F_j}(e) = c(e)) / |F_j|, where |F_j| is the number of instances in F_j, and I(Y = y) is an indicator function that has value one when condition Y = y holds, and zero otherwise. Since F_i ∩ F_j = ∅ for any i ≠ j, by the instance-independence assumption, Σ_{e∈F_i} I(M_{A,D∖F_i}(e) = c(e)) / |F_i| and Σ_{e∈F_j} I(M_{A,D∖F_j}(e) = c(e)) / |F_j| are independent. □
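The fold accuracy in this proof is just the mean of indicator values. As a minimal sketch only — the `predict` method assumes a scikit-learn-style classifier interface, which the paper does not prescribe:

```python
import numpy as np

def fold_accuracy(model, X_fold, y_fold):
    """Sum of I(M(e) = c(e)) over the instances e in fold F_j,
    divided by |F_j|, as in the proof of Proposition 1."""
    correct = np.sum(model.predict(X_fold) == y_fold)
    return correct / len(y_fold)
```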
Proposition 1 indicates that the k accuracies obtained by k-fold cross validation are all independent. Since they are generated by the same procedure, the k accuracies are considered to come from the same population. The k accuracies resulting from k-fold cross validation can therefore be the observations in a sample collected by simple random sampling for estimating the actual accuracy of a classification algorithm.

Some studies randomly choose a specific proportion of the instances from a data set to be testing data, and the remaining instances are for training. This procedure can be repeated to obtain several accuracies for evaluating the performance of a classification algorithm. The mean value of the accuracies is an unbiased point estimator of the actual accuracy of the classification algorithm. However, since the accuracies are not independent, it will be very difficult to derive a sampling distribution for statistical inference.

4. k-Fold cross validation

A popular procedure for estimating the performance of a classification algorithm, or for comparing the performance of two classification algorithms on a data set, is k-fold cross validation. This procedure randomly divides a data set into k disjoint folds of approximately equal size, and each fold is in turn used to test the model induced from the other k−1 folds by a classification algorithm. The performance of the classification algorithm is evaluated by the average of the k accuracies resulting from k-fold cross validation, and hence the level of averaging is assumed to be at fold. This section presents the sampling distributions of the point estimators for k-fold cross validation and discusses the appropriate way to apply them. All folds are assumed to contain the same number of instances except where explicitly specified.

4.1. Single algorithm

Let a data set D be divided into disjoint folds F_1, F_2, …, F_k, and let |D| = n and |F_j| = m be the numbers of instances in D and F_j, respectively; i.e., n = km. Furthermore, let the number of correct predictions on F_j by a classification algorithm A be r_j, and let the prediction accuracy of A on the whole population corresponding to data set D be p. The following theorem presents the necessary conditions for the mean accuracy resulting from k-fold cross validation to be approximately normally distributed.

Theorem 1. If r_j ≥ 5 and m − r_j ≥ 5 for j = 1, 2, …, k, then the sampling distribution of the resulting mean accuracy p̄ = Σ_{j=1}^k p_j / k can be approximated by a normal distribution with mean p and variance p(1−p)/n, where p_j = r_j/m for j = 1, 2, …, k.

Proof. According to the discussion given in Section 2, if r_j ≥ 5 and m − r_j ≥ 5, then by the central limit theorem, p_j = r_j/m can be assumed to follow a normal distribution with mean p and variance p(1−p)/m. Since every pair of instances in D are independent, and F_i ∩ F_j = ∅ for any i ≠ j, by Proposition 1, accuracies p_1 through p_k are independent and identically distributed random variables. Since the sum of independent random variables governed by normal distributions is also normally distributed, the sampling distribution of p̄ = Σ_{j=1}^k p_j / k can be approximated by a normal distribution. We also have

E(p̄) = Σ_{j=1}^k E(p_j) / k = p

and

Var(p̄) = Σ_{j=1}^k Var(p_j) / k² = p(1−p)/n. □

Theorem 1 shows that if the testing results of all folds satisfy the large-sample conditions, the sampling distribution of p̄ can be assumed to be a normal distribution regardless of the number of folds. However, when either r_j < 5 or m − r_j < 5 for fold F_j, p_j should not be used to calculate p̄. Otherwise, the sampling distribution of p̄ will not be approximately normal for estimating the accuracy of the classification algorithm.
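Theorem 1 translates into a short routine: verify the large-sample conditions fold by fold, then form the normal-approximation interval around the mean accuracy. This sketch is our own rendering under the theorem's assumptions (equal fold sizes), not code from the paper:

```python
from math import sqrt
from scipy.stats import norm

def kfold_accuracy_interval(correct_per_fold, m, alpha=0.05):
    """correct_per_fold: the counts r_1, ..., r_k; m: instances per fold,
    so n = k*m. Applies Theorem 1 when every fold is a large sample."""
    k = len(correct_per_fold)
    n = k * m
    if any(r < 5 or m - r < 5 for r in correct_per_fold):
        raise ValueError("a fold violates the large-sample conditions")
    p_bar = sum(r / m for r in correct_per_fold) / k
    half = norm.ppf(1 - alpha / 2) * sqrt(p_bar * (1 - p_bar) / n)
    return p_bar, (p_bar - half, p_bar + half)
```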
4.2. Two algorithms

There are two ways to compare the prediction accuracies of two classification algorithms by k-fold cross validation. When it is possible to evaluate the two algorithms on the same data in each iteration, the matched sample approach is more suitable. If the testing data for the two algorithms in an iteration are different, then the independent sample approach should be adopted to compare their accuracies.

Let p_A and p_B be the prediction accuracies of classification algorithms A and B, respectively, on the population corresponding to a data set D. When D is randomly divided into disjoint folds F_1, F_2, …, F_k, both A and B are trained on the instances in D∖F_j and tested on F_j in the jth iteration. Assume as before that |D| = n and |F_j| = m for j = 1, 2, …, k. Let r_ij be the number of instances in F_j correctly classified by algorithm i for i = A, B and j = 1, 2, …, k. In this matched sample case, the point estimator for identifying whether it is appropriate to assume p_A = p_B is calculated as d̄ = Σ_{j=1}^k d_j / k, where d_j = (r_Aj − r_Bj)/m for j = 1, 2, …, k.

Theorem 2. If r_ij ≥ 5 and m − r_ij ≥ 5 for i = A, B and j = 1, 2, …, k, then the sampling distribution of the point estimator d̄ = Σ_{j=1}^k d_j / k can be approximated by a normal distribution with mean p_A − p_B when the scheme-independence assumption is satisfied.

Proof. Since r_ij ≥ 5 and m − r_ij ≥ 5 for i = A, B and j = 1, 2, …, k, r_ij/m can be assumed to follow a normal distribution with mean p_i. When the scheme-independence assumption is satisfied, point estimators r_Aj/m and r_Bj/m are independent. Hence, the sampling distribution of d_j = r_Aj/m − r_Bj/m can be approximated by a normal distribution. We also have

E(d̄) = E(Σ_{j=1}^k d_j / k) = Σ_{j=1}^k E(d_j) / k = p_A − p_B. □

When classification algorithms A and B are independent, it can be shown that

Var(d_j) = p_A(1−p_A)/m + p_B(1−p_B)/m.

Note that d̄ is a sample mean instead of a sample proportion, and that both p_A and p_B are unknown. The variance of the sample {d_1, d_2, …, d_k} is calculated as s_d² = Σ_{j=1}^k (d_j − d̄)² / (k−1), which is an estimate of Var(d_j), and a t value will be the test statistic in this case. When the null hypothesis is H_0: p_A − p_B = 0 with significance level α, the test statistic is calculated as t = d̄ / (s_d/√k) with k−1 degrees of freedom. The two classification algorithms A and B have significantly different accuracies if the p-value corresponding to the t value is less than α. If the conditions specified in Theorem 2 hold, this matched sample approach can be applied for any value of k.
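The matched sample test of Theorem 2 can be computed directly from the per-fold counts. A sketch with names of our own choosing (SciPy supplies the t tail probability):

```python
from math import sqrt
from scipy.stats import t as t_dist

def matched_sample_test(r_A, r_B, m, alpha=0.05):
    """r_A[j], r_B[j]: correct predictions of algorithms A and B on the
    same fold j of size m. Tests H0: pA - pB = 0 with k-1 df."""
    k = len(r_A)
    d = [(a - b) / m for a, b in zip(r_A, r_B)]
    d_bar = sum(d) / k
    s2_d = sum((x - d_bar) ** 2 for x in d) / (k - 1)
    t_stat = d_bar / sqrt(s2_d / k)
    p_value = 2 * t_dist.sf(abs(t_stat), df=k - 1)
    return t_stat, p_value, p_value < alpha
```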
When the testing data at each iteration for the two algorithms are different, the independent sample approach can be used to derive the sampling distribution for comparing their performance. Let p_Aj = r_Aj/m be the accuracy of A at iteration j for j = 1, 2, …, k, and similarly let p_Bj be the accuracy of B at iteration j for j = 1, 2, …, q. Then p̄_A = Σ_{j=1}^k p_Aj / k and p̄_B = Σ_{j=1}^q p_Bj / q are unbiased estimators of p_A and p_B, respectively. Hence, the point estimator of p_A − p_B is p̄_A − p̄_B in this case.

Theorem 3. If the testing results of the k+q folds for evaluating the performance of algorithms A and B all satisfy the large-sample conditions, then the sampling distribution of the point estimator p̄_A − p̄_B can be approximated by a normal distribution with mean p_A − p_B and variance (p_A(1−p_A)/n) + (p_B(1−p_B)/n) when the scheme-independence assumption is true.

Proof. Since the numbers of correct and wrong predictions in each fold are not less than five, by Theorem 1, p̄_i approximately follows a normal distribution with mean p_i and variance p_i(1−p_i)/n for i = A, B. When the scheme-independence assumption is true, the prediction of an instance by a model induced by algorithm A is independent of the prediction of the same instance by a model learned by algorithm B. This implies that p̄_A and p̄_B are independent, and hence the sampling distribution of p̄_A − p̄_B can be assumed to be normally distributed. We also have E(p̄_A − p̄_B) = p_A − p_B and

Var(p̄_A − p̄_B) = Var(p̄_A) + Var(p̄_B) = p_A(1−p_A)/n + p_B(1−p_B)/n. □

When the null hypothesis is H_0: p_A − p_B = 0, the two samples for estimating p_A and p_B are pooled together to calculate a more reliable estimate for p_A and p_B as p̄ = (p̄_A + p̄_B)/2. The test statistic is therefore calculated as z = (p̄_A − p̄_B) / √(2p̄(1−p̄)/n).

There is another way to calculate a test statistic for comparing the performance of two classification algorithms A and B. Compute p̄_A = Σ_{i=1}^k p_Ai / k and p̄_B = Σ_{j=1}^q p_Bj / q as before. If the numbers of correct and wrong predictions in every fold are not less than five, then p̄_A − p̄_B can be assumed to have a normal distribution with mean p_A − p_B, and the variance of this distribution can be estimated as (s_A²/k) + (s_B²/q), where s_A² = Σ_{i=1}^k (p_Ai − p̄_A)² / (k−1) and s_B² = Σ_{j=1}^q (p_Bj − p̄_B)² / (q−1). The test statistic is therefore t = (p̄_A − p̄_B) / √(s_A²/k + s_B²/q). This is the hypothesis test for comparing two population means by the independent sample approach. As discussed in the previous paragraph, however, the test statistic for this purpose can be a z value, which provides more precise information for determining the hypothesis testing result.
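The pooled z test above is equally mechanical; the sketch below is our own helper, not the paper's code:

```python
from math import sqrt
from scipy.stats import norm

def independent_sample_z(p_bar_A, p_bar_B, n, alpha=0.05):
    """Pooled z test of H0: pA - pB = 0 when each algorithm was
    evaluated on the same n instances (in different partitions)."""
    p_pool = (p_bar_A + p_bar_B) / 2
    z = (p_bar_A - p_bar_B) / sqrt(2 * p_pool * (1 - p_pool) / n)
    p_value = 2 * norm.sf(abs(z))
    return z, p_value, p_value < alpha
```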
4.3. Discussion

As described in Section 1, there are four factors that can affect the results obtained by k-fold cross validation. The sampling distributions derived in the previous two subsections will be used to investigate the appropriate settings of the four factors in performing k-fold cross validation.

4.3.1. The number of folds

Theorem 1 shows that the sampling distribution of the mean accuracy obtained by k-fold cross validation is independent of the number of folds k. This is reasonable because the sampling distributions for various values of k are all derived from the same testing instances. When k is large, the number of training instances becomes large in each iteration, while the computational cost of k-fold cross validation will be high, and the number of instances in a fold will be small. This implies that the testing results of the instances in a fold have a larger chance of violating the large-sample conditions. In this case, the variance of the sampling distribution estimated by p̄(1−p̄)/n will be larger, because the instances in a fold that does not satisfy the large-sample conditions cannot be counted in n. When the testing results of all folds satisfy the large-sample conditions, the number of folds can therefore be as large as possible.

4.3.2. The number of instances in a fold

Most texts define the first step of k-fold cross validation as: randomly partition a data set into k disjoint folds of approximately equal size. The way to perform this operation is not unique. The following are three possible ways to divide a data set with 203 instances into k = 5 folds; a code sketch of the recommended one appears after this subsection.

(1) Each instance is independently assigned to fold j, where j is the smallest integer larger than or equal to 5u and u is a random number larger than zero. For instance, an instance is assigned to the second fold when u = 0.26. Let the numbers of instances in the five folds be 38, 41, 44, 37, and 43, respectively.
(2) Generate a random number for each instance, and sort the random numbers into ascending order. Then divide the instances into five folds according to the sorted random numbers, so that the numbers of instances in the five folds are 40, 40, 40, 40, and 43, respectively.
(3) The process of generating and sorting random numbers is the same as in approach (2), while the numbers of instances in the five folds are 40, 40, 41, 41, and 41, respectively.

The mean accuracies resulting from the three approaches are generally different, so which one should be adopted to divide this data set into five folds?

Let m_j be the number of instances in fold j. The p_j for j = 1, 2, …, k are all unbiased estimators of the actual accuracy p. If they are considered as the observations for estimating p, then they should come from the same population; i.e., they should follow the same normal distribution. This means that they should have, at least approximately, the same variance. By Theorem 1, p_j can be assumed to have a normal distribution with mean p and variance p(1−p)/m_j. The m_j for j = 1, 2, …, k should therefore be as close to each other as possible. Hence, approach (3) is the most recommended one for dividing a data set into folds. When the number of instances n in a data set is large, such that n/k is far larger than k, the other two approaches can also be adopted.

The large-sample conditions indicate that it is inappropriate to decide whether a sample is large or not only by the number of instances. For instance, when the classification accuracy of an algorithm on a data set is close to 50%, the testing results of a fold that contains only 20 instances are likely to satisfy the large-sample conditions. On the contrary, when an algorithm has close to 100% prediction accuracy on a data set, a fold containing more than 200 instances may fail to be a large sample.
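Approach (3) corresponds to shuffling the instance indices and cutting them into folds whose sizes differ by at most one. A sketch with numpy (our illustration; `array_split` happens to place the larger folds first, giving the same multiset of sizes as approach (3)):

```python
import numpy as np

def partition_into_folds(n, k, seed=0):
    """Randomly permute the n instance indices and split them into
    k folds whose sizes differ by at most one (approach (3))."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

folds = partition_into_folds(203, 5)
print([len(f) for f in folds])  # [41, 41, 41, 40, 40]
```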
4.3.3. The level of averaging

The averaging of accuracy estimates can be performed at the level of fold or of data set. The level of averaging in deriving the sampling distributions above is set at fold, and hence the large-sample conditions are checked fold by fold. When an accuracy estimate is calculated as the total number of correct predictions in the k folds over the number of instances in the data set, it can be shown that the sampling distribution of this point estimator can be approximated by N(p, p(1−p)/n), the same as the one given in Theorem 1.

Example 1. A data set containing 200 instances is randomly divided into five folds of 40 instances for evaluating the performance of a classification algorithm, and the numbers of correct predictions in the five folds are 32, 28, 30, 30, and 32, respectively. If the level of averaging is at data set, the mean and variance of the accuracy estimator for the classification algorithm are calculated as

p̄ = (32 + 28 + 30 + 30 + 32)/200 = 0.76

and

Var(p̄) = 0.76(1 − 0.76)/200 = 0.000912.

The interval with confidence level 1−α is 0.76 ± 0.030 z_{α/2}. If the level of averaging is at fold, the fold accuracies are 0.80, 0.70, 0.75, 0.75, and 0.80, and the two values are calculated as

p̄ = (0.80 + 0.70 + 0.75 + 0.75 + 0.80)/5 = 0.76

and

s² = (0.04² + (−0.06)² + (−0.01)² + (−0.01)² + 0.04²)/(5 − 1) = 0.00175.

In this case, the point estimator p̄ is a sample mean instead of a sample proportion. Hence, the interval with confidence level 1−α is 0.76 ± t_{α/2}√(0.00175/5) = 0.76 ± 0.019 t_{α/2}.

Example 1 shows that when the level of averaging is set at data set, the variance of the point estimator depends only on the sample proportion. In this case, when two classification algorithms evaluated by k-fold cross validation have the same mean accuracy, the variances of the two sample proportions will be the same. We will not be able to know which algorithm has a more stable performance. This does not occur when prediction accuracy is calculated fold by fold. Hence, if the testing results of every fold satisfy the large-sample conditions, the level of averaging should be set at fold.
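The two intervals in Example 1 can be reproduced with a few lines of arithmetic. This verification sketch is ours, not the paper's:

```python
from math import sqrt

correct = [32, 28, 30, 30, 32]   # correct predictions in five folds of 40
n, k, m = 200, 5, 40

# Level of averaging = data set: the estimator is a sample proportion.
p = sum(correct) / n                                  # 0.76
half_z = sqrt(p * (1 - p) / n)                        # 0.030 -> 0.76 +/- 0.030 z

# Level of averaging = fold: the estimator is a sample mean.
acc = [r / m for r in correct]                        # 0.80, 0.70, 0.75, 0.75, 0.80
p_bar = sum(acc) / k                                  # 0.76
s2 = sum((a - p_bar) ** 2 for a in acc) / (k - 1)     # 0.00175
half_t = sqrt(s2 / k)                                 # 0.019 -> 0.76 +/- 0.019 t
print(p, half_z, p_bar, s2, half_t)
```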
4.3.4. Repetition

As shown in Table 1, several books suggest that k-fold cross validation can be repeatedly performed to obtain several unbiased estimates, so that the point estimate of p can be more reliable. For instance, let p_j and p'_j for j = 1, 2, …, k be the estimates obtained in the first and second rounds of k-fold cross validation, respectively. Then p̄' = Σ_{j=1}^k (p_j + p'_j)/2k should be a better estimate of p than p̄ = Σ_{j=1}^k p_j / k, because p̄' is an unbiased estimator obtained from a larger sample. Theoretically, p̄' will have a smaller variance than p̄.

An interesting consequence would be that the expected difference between the aggregate point estimator and p becomes smaller as the number of rounds of k-fold cross validation becomes larger. Since the number of instances for learning does not increase, what is the new information that makes the point estimate more precise? Note that p̄' is a more reliable estimate of p than p̄ only if the p_j and the p'_j for j = 1, 2, …, k are all independent. The same data are used for random partition in the first and second rounds. An instance in fold i of the first round will be assigned to fold j for some j in the second round. Since the classification algorithm is still the same one, neither the instance-independence nor the scheme-independence assumption holds in this case. This means that p_i and p'_j are not independent. The predictions of the same instance in the first and second rounds will be positively correlated. For any two random variables X and Y, we have Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y). The variance of X+Y increases when X and Y are positively correlated. Since p_i is positively correlated with some of the p'_j, the variance of p̄' does not actually decrease even though it is calculated from a larger sample. This dispels the myth that repeatedly performing k-fold cross validation can yield a more reliable estimate of p.

In comparing the performance of two classification algorithms A and B, let p̂_i = Σ_{j=1}^k r_ij / n for i = A, B when the averaging level is set at data set. As argued previously, it is inappropriate to perform k-fold cross validation repeatedly. The matched sample approach therefore cannot be used to determine whether p_A and p_B are significantly different in this case. The point estimator p̂_A − p̂_B will have the same sampling distribution as p̄_A − p̄_B in the independent sample approach.

In dividing the instances in a data set into independent training and testing data, an instance can play only one role in an iteration. Another approach that also satisfies this requirement is to randomly choose a pre-specified proportion of the instances from a data set for testing; the instances not chosen for testing are for training. If this procedure is executed only once, and the testing results satisfy the large-sample conditions, then we have a sampling distribution for making statistical inference about prediction accuracy. Since the number of testing instances is generally far less than the number of instances in the data set, the variance of the point estimator resulting from this approach will be larger than that resulting from k-fold cross validation. The procedure of randomly choosing testing instances should not be repeated, because the testing sets in two rounds may have common instances. In that case, the testing results of the two rounds will not be independent, and hence they should not be aggregated for deriving a more reliable point estimate.

5. Leave-one-out cross validation

Many studies adopt leave-one-out cross validation to evaluate the performance of a classification algorithm when the number of instances in a data set or the number of instances for a class value is small. Since the randomness of dividing instances into training and testing data does not exist, the point estimate of accuracy for a given data set is constant. This section derives the sampling distributions for leave-one-out cross validation to make statistical inference about the mean accuracies of classification algorithms.

5.1. Single algorithm

The prediction of an instance can be either correct or wrong. This means the random variable corresponding to the prediction of an instance follows a Bernoulli distribution with success probability p. When every instance is independent of the others, the number of correct predictions on n instances follows a binomial distribution with parameters n and p. Let x_i be the random variable corresponding to the prediction of the ith instance. Then P{x_i = 1} = p and P{x_i = 0} = 1−p. Hence, the sampling distribution of the point estimator p̄ = Σ_{i=1}^n x_i / n is approximated by N(p, p(1−p)/n) when Σ_{i=1}^n x_i ≥ 5 and n − Σ_{i=1}^n x_i ≥ 5. Note that this sampling distribution is the same as the one obtained in Theorem 1. This is reasonable because leave-one-out cross validation is a special case of k-fold cross validation. Since k-fold cross validation is more efficient, leave-one-out cross validation will be used only when the random partition in k-fold cross validation has a large impact on performance evaluation.

Unlike k-fold cross validation, the sample variance obtained by leave-one-out cross validation is constant. If leave-one-out cross validation is executed for several rounds on a data set, every round will have the same resulting sample mean and sample variance. These sample means cannot be pooled together to derive a more reliable point estimate because their variance equals zero. It is therefore futile to repeatedly perform leave-one-out cross validation to obtain a more reliable point estimate. This provides another explanation of why k-fold cross validation should be executed only once.
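A sketch of how a leave-one-out estimate is produced, and why repetition adds nothing: the loop below is deterministic, so a second run returns the identical count. The scikit-learn-style `fit`/`predict` interface and the `make_model` factory are assumed for illustration; neither comes from the paper:

```python
import numpy as np

def loocv_correct_count(make_model, X, y):
    """Train on all instances but one, test on the hold-out, and count
    correct predictions; X is an array of shape (n, features)."""
    n = len(y)
    correct = 0
    for i in range(n):
        mask = np.arange(n) != i
        model = make_model().fit(X[mask], y[mask])
        correct += int(model.predict(X[i:i + 1])[0] == y[i])
    # Inference then uses N(p, p(1-p)/n), provided correct >= 5
    # and n - correct >= 5.
    return correct
```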
5.2. Two algorithms

Similar to k-fold cross validation, both the independent and the matched sample approaches can be used to compare the performance of two classification algorithms in leave-one-out cross validation. Since leave-one-out cross validation does not have a mechanism for random partition, the independent sample approach is relatively simple. Note that the difference between the predictions of a single instance by two classification algorithms will not be approximately normally distributed. Deriving the sampling distribution for the matched sample approach in leave-one-out cross validation is therefore not so straightforward.

In a data set containing n instances, let r_i be the number of correct predictions made by classification algorithm i evaluated by leave-one-out cross validation for i = A, B. If r_i ≥ 5 and n − r_i ≥ 5 for i = A, B, then by the same argument as given in the proof of Theorem 3, p̄_A − p̄_B will approximately follow a normal distribution with mean p_A − p_B and variance (p_A(1−p_A)/n) + (p_B(1−p_B)/n) when the scheme-independence assumption is true for algorithms A and B. So, when the independent sample approach is used, k-fold and leave-one-out cross validation actually have the same sampling distribution for making statistical inference. Alternatively, when the matched sample approach is used, let x_ij represent the random variable corresponding to the prediction of the jth instance by classification algorithm i for i = A, B and j = 1, 2, …, n. Then the value of y_j = x_Aj − x_Bj can be −1, 0, or +1.

Theorem 4. Let n_{−1}, n_0, and n_{+1} be the frequencies of y_j = −1, 0, and +1, respectively. If n_{−1} ≥ 5, n_0 ≥ 5, and n_{+1} ≥ 5, then the sampling distribution of ȳ = Σ_{j=1}^n y_j / n can be approximated by a normal distribution with mean p_A − p_B when the scheme-independence assumption is satisfied.

Proof. When n_{−1} ≥ 5, n_0 ≥ 5, and n_{+1} ≥ 5, by the central limit theorem, point estimator ȳ can be assumed to follow a normal distribution. Since P{x_ij = 1} = p_i and P{x_ij = 0} = 1−p_i for i = A, B, and the scheme-independence assumption is satisfied, we have P{y_j = −1} = (1−p_A)p_B, P{y_j = 0} = (1−p_A)(1−p_B) + p_A p_B, and P{y_j = +1} = p_A(1−p_B), and hence E(y_j) = p_A − p_B. Since every y_j is an unbiased estimator of p_A − p_B, E(ȳ) = Σ_{j=1}^n E(y_j)/n = p_A − p_B. □

Theorem 4 shows that the matched sample approach is applicable for leave-one-out cross validation. The sample variance s_y² = Σ_{j=1}^n (y_j − ȳ)² / (n−1) can be an estimate of Var(y_j) for j = 1, 2, …, n. The test statistic is therefore calculated as t = ȳ / √(s_y²/n) with n−1 degrees of freedom for testing the null hypothesis H_0: p_A − p_B = 0.
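The matched sample test of Theorem 4 takes the two 0/1 correctness vectors as input. A minimal sketch of our own, with the frequency conditions checked first:

```python
from math import sqrt
from scipy.stats import t as t_dist

def loocv_matched_test(x_A, x_B):
    """x_A[j], x_B[j] in {0, 1}: whether algorithms A and B predicted
    instance j correctly in leave-one-out cross validation."""
    n = len(x_A)
    y = [a - b for a, b in zip(x_A, x_B)]
    if any(y.count(v) < 5 for v in (-1, 0, 1)):   # Theorem 4 conditions
        raise ValueError("n_-1, n_0, and n_+1 must all be at least five")
    y_bar = sum(y) / n
    s2_y = sum((v - y_bar) ** 2 for v in y) / (n - 1)
    t_stat = y_bar / sqrt(s2_y / n)
    p_value = 2 * t_dist.sf(abs(t_stat), df=n - 1)  # H0: pA - pB = 0
    return t_stat, p_value
```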
Example 2. Suppose that the numbers of correct predictions of leave-one-out cross validation for classification algorithms A and B on a data set with 100 instances are 80 and 84, respectively. In the independent sample approach, the test statistic for identifying whether the accuracies of the two classification algorithms are significantly different is calculated as

z = (p̄_A − p̄_B) / √(2p̄(1−p̄)/n) = (0.80 − 0.84) / √(2 × 0.82 × 0.18/100) = −0.7362,

where p̄ is the pooled mean accuracy. In the matched sample approach, let the frequencies n_{−1}, n_0, and n_{+1} be 10, 84, and 6, respectively, so that ȳ = (6 − 10)/100 = −0.04 and the sample variance is s_y² = 0.16. Then the test statistic for this case is

t = −0.04 / √(0.16/100) = −1.0.
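Both statistics in Example 2 check out numerically; the sketch below is our verification, not code from the paper:

```python
from math import sqrt

n = 100
p_A, p_B = 0.80, 0.84
p_pool = (p_A + p_B) / 2                               # 0.82
z = (p_A - p_B) / sqrt(2 * p_pool * (1 - p_pool) / n)  # -0.7362
print(round(z, 4))

n_neg, n_zero, n_pos = 10, 84, 6                       # frequencies of y_j
y_bar = (n_pos - n_neg) / n                            # -0.04
s2_y = (n_neg * (-1 - y_bar) ** 2 + n_zero * (0 - y_bar) ** 2
        + n_pos * (1 - y_bar) ** 2) / (n - 1)          # 0.16
t = y_bar / sqrt(s2_y / n)                             # -1.0
print(round(s2_y, 2), round(t, 1))
```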
6. Conclusions

Both k-fold and leave-one-out cross validation are popular approaches for evaluating the performance of a classification algorithm, yet how to use the testing results of the two evaluation approaches for making statistical inference is not treated consistently in the literature. In this paper, we consider four factors to investigate the usage of k-fold cross validation. The factors include the number of folds, the number of instances in a fold, the level of averaging, and the repetition of cross validation. In order to study the impact of the four factors, we first propose the independence assumptions and define the large-sample conditions in cross validation. They are then used to derive the sampling distributions of the point estimators for the two cross validation approaches in evaluating the performance of one algorithm or comparing the performance of two algorithms.

According to the sampling distributions for k-fold cross validation, the large-sample conditions determine the number of instances in a fold, and the number of folds can be as large as possible as long as the large-sample conditions still hold in every fold. If the variability of the performance of a classification algorithm is important, the level of averaging should be set at fold. Since the mean accuracies obtained by any two rounds of k-fold cross validation are dependent, repeatedly performing k-fold cross validation cannot provide a more reliable point estimate. Both the independent and the matched sample approaches can be used to make statistical inference about the testing results of leave-one-out cross validation.

When the scheme-independence assumption is not satisfied, neither the matched nor the independent sample approach can be applied to compare the performance of two classification algorithms for k-fold and leave-one-out cross validation. New statistical inference methods should be established to serve this purpose. Since there is a random partition mechanism in k-fold cross validation, how to reduce the variability of a point estimate obtained from k-fold cross validation is an interesting research topic. Repeating k-fold cross validation is not an appropriate way for this purpose, and hence new efficient methods should be developed to obtain more reliable accuracy estimates.

Conflict of interest statement

None declared.

Acknowledgments

This research was supported by the National Science Council Taiwan under Grant no. 101-2410-H-006-006.

References

[1] I.H. Witten, E. Frank, M.A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Morgan Kaufmann, Massachusetts, 2011.
[2] J.H. Friedman, On bias, variance, 0/1-loss, and the curse-of-dimensionality, Data Min. Knowl. Discov. 1 (1997) 55–77.
[3] Y. Bengio, Y. Grandvalet, No unbiased estimator of the variance of K-fold cross-validation, J. Mach. Learn. Res. 5 (2004) 1089–1105.
[4] J.D. Rodriguez, A. Perez, J.A. Lozano, Sensitivity analysis of k-fold cross validation in prediction error estimation, IEEE Trans. Pattern Anal. Mach. Intell. 32 (2010) 569–575.
[5] E. Alpaydin, Introduction to Machine Learning, 2nd Edition, MIT Press, Massachusetts, 2010.
[6] J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann, Massachusetts, 2012.
[7] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, 2nd Edition, John Wiley & Sons, New Jersey, 2011.
[8] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[9] P.N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison Wesley, Massachusetts, 2006.
[10] G. Casella, R.L. Berger, Statistical Inference, 2nd Edition, Duxbury, California, 2002.
[11] R.J. Freund, W.J. Wilson, D.L. Mohr, Statistical Methods, 3rd Edition, Academic Press, Massachusetts, 2010.
[12] D.R. Anderson, D.J. Sweeney, T.A. Williams, J.D. Camm, J.J. Cochran, Statistics for Business and Economics, 12th Edition, South-Western, Tennessee, 2012.

Tzu-Tsung Wong is a professor in the Institute of Information Management at National Cheng Kung University, Taiwan, ROC. He received his Ph.D. degree in industrial engineering from the University of Wisconsin at Madison. His research interests include Bayesian statistical analysis, naïve Bayesian classifiers, and classification methods for gene sequence data.
