Pattern Recognition 48 (2015) 2839–2846

Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation

journal homepage: www.elsevier.com/locate/pr

Abstract

Classification is an essential task for predicting the class values of new instances. Both k-fold and leave-one-out cross validation are very popular for evaluating the performance of classification algorithms. Much of the data mining literature describes the operation of these two kinds of cross validation and the statistical methods that can be used to analyze the resulting accuracies of algorithms, but these accounts are not all consistent. Analysts can therefore be confused in performing a cross validation procedure. In this paper, the independence assumptions in cross validation are introduced, and the circumstances that satisfy the assumptions are also addressed. The independence assumptions are then used to derive the sampling distributions of the point estimators for k-fold and leave-one-out cross validation. The cross validation procedure that yields such sampling distributions is discussed to provide new insights into evaluating the performance of classification algorithms.

Keywords: Classification; Independence; k-Fold cross validation; Leave-one-out cross validation; Sampling distribution
The instances in a data set, called a random sample, are generally assumed to come from the same population. Otherwise, they should not be in the same data set for learning. It is therefore reasonable to assume that the instances in a data set are all governed by the probability distribution of a single population. When data are collected by simple random sampling, every two instances in a data set are considered to be independent. This section discusses the impact of the independence assumptions on the point estimators obtained by cross validation.

The purpose of classification is to find a model from training data such that the model gives a correct prediction of the class value for most new instances. Let $A$, $R$, and $e$ represent a classification algorithm, a training data set, and a new instance, respectively. The model learned from training data $R$ by classification algorithm $A$ can then be represented as $M_{A,R}$, and the class value predicted by this model for instance $e$ can be denoted by $M_{A,R}(e)$. Let $c(e)$ be the actual class value of instance $e$, and let $p_A$ be the actual probability of correct prediction of classification algorithm $A$ on the population; i.e., $P\{M_{A,R}(e) = c(e)\} = p_A$. This expression implies that whether a prediction is correct depends on the classification algorithm, the training data, and the new instance.
Definition 1. Let the instances for training and testing be independent.

(a) Instance-independence assumption: For any two independent instances $e_1$ and $e_2$, $M_{A,R}(e_1) = c(e_1)$ is independent of $M_{B,R'}(e_2) = c(e_2)$ for any $A$, $B$, $R$, and $R'$.

(b) Scheme-independence assumption: For any two different classification methods $A$ and $B$, $M_{A,R}(e_1) = c(e_1)$ is independent of $M_{B,R'}(e_2) = c(e_2)$ for any $R$, $R'$, $e_1$, and $e_2$.
In cross validation, an instance in $R$ cannot be used for testing. It is therefore reasonable to assume that any new instance is independent of training data set $R$. Two new instances $e_1$ and $e_2$ are independent when they are collected by simple random sampling. The predictions for $e_1$ and $e_2$ are therefore independent regardless of the classification algorithms and the training data used to generate the prediction models. This means that the instance-independence assumption is always true for both k-fold and leave-one-out cross validation. However, the conditions under which the scheme-independence assumption is true are more complicated.

Every classification algorithm has its own mechanism for learning a model from training data. It may seem reasonable to assume that $M_{A,R}(e)$ is independent of $M_{B,R}(e)$ for two different classification algorithms $A$ and $B$. Let $A$ be the algorithm that finds a fully grown decision tree by the gain ratio from $R$, and let $B$ be the algorithm that finds a fully grown decision tree by the gain ratio from $R$ and then prunes this tree by some measure. The only difference between algorithms $A$ and $B$ is that $B$ has a mechanism to prune the fully grown tree. Since the fully grown trees found by algorithms $A$ and $B$ must be identical, it is inappropriate to assume that $M_{A,R}(e)$ is independent of $M_{B,R}(e)$ in this case.
As addressed by Mitchell [8], a classification algorithm must have an inductive bias to predict class values for new instances, and the inductive bias can be language bias, search bias, or both. The models considered in the learning mechanism of a classification algorithm form a model space. A classification algorithm has no language bias if all possible models are included in its model space. The search bias describes whether a classification algorithm uses a measure to prefer one model over another. If the model spaces of two classification algorithms have nonequivalent representations, then the scheme-independence assumption must be true. For instance, a model in decision tree induction is a decision tree, and a model in support vector machines is a hyperplane. Since the two algorithms have nonequivalent representations of a model, the scheme-independence assumption is true in comparing their performance. Note that a decision tree can be represented as a set of classification rules, so the model spaces of decision tree induction and the sequential covering algorithm will have common models.

When two classification algorithms have the same model space, they can still be scheme-independent if their preferences on models are independent. For instance, the growing measure in decision tree induction can be the gain ratio or the Gini index. Since the two measures impose different preferences on models, the two models found by the gain ratio and the Gini index can be assumed to be independent. In summary, if the model spaces of two classification algorithms do not have equivalent representations, then the scheme-independence assumption is true regardless of their search bias. When the model spaces of two algorithms have common models, the scheme-independence assumption can be true only when they have independent search bias. This guideline can also be applied to analyze whether the scheme-independence assumption is true for discretization and feature selection, two popular tasks in data preprocessing.

Discretization transforms continuous attributes into discrete ones. The two main operations of discretization are to determine the number of intervals and the boundaries of every interval. If two discretization methods perform these two operations independently, then the models represented by the discretized continuous attributes can be assumed to be nonequivalent. In this case, the scheme-independence assumption is true even when the classification algorithms for evaluating the two discretization methods are the same. For instance, the model spaces formed by the discrete attributes resulting from equal-width discretization and entropy-based discretization are nonequivalent.

Feature selection is a tool to remove redundant and irrelevant attributes for classification. Suppose that the classification algorithms for evaluating two feature selection methods are the same. If the intersection of the attribute subsets chosen by the two feature selection methods is not empty, then the model spaces resulting from the two attribute subsets will have common models. The scheme-independence assumption is therefore false in this case.

In leave-one-out cross validation, every instance is in turn used to test the model induced from the other instances. Thus, the instance-independence assumption guarantees that every prediction in leave-one-out cross validation is independent of the others. The scheme-independence assumption indicates that $M_{A,R}(e) = c(e)$ is independent of $M_{B,R}(e) = c(e)$. This assumption provides a basis for comparing the performance of two classification algorithms by cross validation.

Though training data can affect the result of a prediction, they cannot be used to determine whether two predictions are independent without considering the testing instances and the classification algorithms. When two training data sets $R$ and $R'$ are both collected by simple random sampling, $R$ and $R'$ are independent. Since they are governed by the same probability distribution of the population, it is possible that a classification algorithm will induce the same model from $R$ and $R'$. In this case, $M_{A,R}(e)$ and $M_{A,R'}(e)$ will be the same. It is therefore inappropriate to assume that training data sets are independent in cross validation.

Proposition 1. The $k$ accuracies resulting from k-fold cross validation are independent.

Proof. Let a data set $D$ be divided into folds $F_1, F_2, \ldots, F_k$ such that $F_i \cap F_j = \emptyset$ for any $i \neq j$. In evaluating the performance of classification algorithm $A$, the accuracy of fold $F_j$ is calculated as $\sum_{e \in F_j} I(M_{A, D \setminus F_j}(e) = c(e))/|F_j|$, where $|F_j|$ is the number of instances in $F_j$, and $I(Y = y)$ is an indicator function whose value is one when condition $Y = y$ holds and zero otherwise. Since $F_i \cap F_j = \emptyset$ for any $i \neq j$, by the instance-independence assumption, $\sum_{e \in F_i} I(M_{A, D \setminus F_i}(e) = c(e))/|F_i|$ and $\sum_{e \in F_j} I(M_{A, D \setminus F_j}(e) = c(e))/|F_j|$ are independent.
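The fold-accuracy formula in this proof translates directly into code. The following is a minimal sketch, not taken from the paper: `fold_accuracies` and `train_and_predict` are hypothetical names, `train_and_predict` stands in for an arbitrary classification algorithm $A$, and the toy majority-class learner and synthetic data are assumptions made purely for illustration.

```python
import numpy as np

def fold_accuracies(X, y, k, train_and_predict, seed=0):
    """Accuracy of each fold F_j: fraction of instances in F_j whose class value
    is correctly predicted by the model trained on all the other folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)   # disjoint folds F_1, ..., F_k
    accs = []
    for j in range(k):
        test = folds[j]
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        pred = train_and_predict(X[train], y[train], X[test])  # model fitted without F_j
        accs.append(np.mean(pred == y[test]))                  # accuracy of fold F_j
    return accs

# Toy "algorithm": always predict the majority class of the training data.
majority = lambda X_tr, y_tr, X_te: np.full(len(X_te), np.bincount(y_tr).argmax())

rng = np.random.default_rng(1)
X, y = rng.normal(size=(200, 3)), rng.integers(0, 2, size=200)
print(fold_accuracies(X, y, k=5, train_and_predict=majority))
```

Because the folds are disjoint, the printed accuracies are independent in exactly the sense of Proposition 1.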
…derive the sampling distribution for comparing their performance. Let $\bar{p}_{Aj} = r_{Aj}/m$ be the accuracy of $A$ at iteration $j$ for $j = 1, 2, \ldots, k$, and similarly let $\bar{p}_{Bj}$ be the accuracy of $B$ at iteration $j$ for $j = 1, 2, \ldots, q$. Then $\bar{p}_A = \sum_{j=1}^{k} \bar{p}_{Aj}/k$ and $\bar{p}_B = \sum_{j=1}^{q} \bar{p}_{Bj}/q$ are unbiased estimators of $p_A$ and $p_B$, respectively. Hence, the point estimator of $p_A - p_B$ is $\bar{p}_A - \bar{p}_B$ in this case.
Theorem 3. If the testing results of the $k + q$ folds for evaluating the performance of algorithms $A$ and $B$ all satisfy the large-sample conditions, then the sampling distribution of the point estimator $\bar{p}_A - \bar{p}_B$ can be approximated by a normal distribution with mean $p_A - p_B$ and variance $(p_A(1-p_A)/n) + (p_B(1-p_B)/n)$ when the scheme-independence assumption is true.
Proof. Since the numbers of correct and wrong predictions in each fold are not less than five, by Theorem 1, $\bar{p}_i$ approximately follows a normal distribution with mean $p_i$ and variance $p_i(1-p_i)/n$ for $i = A, B$. When the scheme-independence assumption is true, the prediction of an instance by a model induced by algorithm $A$ is independent of the prediction of the same instance by a model learned by algorithm $B$. This implies that $\bar{p}_A$ and $\bar{p}_B$ are independent, and hence the sampling distribution of $\bar{p}_A - \bar{p}_B$ can be assumed to be normally distributed. We also have $E(\bar{p}_A - \bar{p}_B) = p_A - p_B$ and
$$\mathrm{Var}(\bar{p}_A - \bar{p}_B) = \mathrm{Var}(\bar{p}_A) + \mathrm{Var}(\bar{p}_B) = \frac{p_A(1-p_A)}{n} + \frac{p_B(1-p_B)}{n}.$$
When the null hypothesis is $H_0: p_A - p_B = 0$, the two samples for estimating $p_A$ and $p_B$ are pooled together to calculate a more reliable common estimate $\bar{p} = (\bar{p}_A + \bar{p}_B)/2$. The test statistic is therefore calculated as $z = (\bar{p}_A - \bar{p}_B)/\sqrt{2\bar{p}(1-\bar{p})/n}$.
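As a quick numerical companion to this pooled test, here is a minimal sketch; the function name is an illustrative assumption, and the sample values anticipate Example 2 in Section 5.

```python
import numpy as np

def pooled_z(p_bar_A, p_bar_B, n):
    """z = (p_bar_A - p_bar_B) / sqrt(2 * p_bar * (1 - p_bar) / n) under H0: p_A = p_B."""
    p_bar = (p_bar_A + p_bar_B) / 2            # pooled estimate of the common accuracy
    return (p_bar_A - p_bar_B) / np.sqrt(2 * p_bar * (1 - p_bar) / n)

print(pooled_z(0.80, 0.84, 100))               # about -0.736
```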
There is another way to calculate a test statistic for comparing the performance of two classification algorithms $A$ and $B$. Compute $\bar{p}_A = \sum_{i=1}^{k} \bar{p}_{Ai}/k$ and $\bar{p}_B = \sum_{j=1}^{q} \bar{p}_{Bj}/q$ as before. If the numbers of correct and wrong predictions in every fold are not less than five, then $\bar{p}_A - \bar{p}_B$ can be assumed to have a normal distribution with mean $p_A - p_B$, and the variance of this distribution can be estimated as $(s_A^2/k) + (s_B^2/q)$, where $s_A^2 = \sum_{i=1}^{k} (\bar{p}_{Ai} - \bar{p}_A)^2/(k-1)$ and $s_B^2 = \sum_{j=1}^{q} (\bar{p}_{Bj} - \bar{p}_B)^2/(q-1)$. The test statistic is therefore $t = (\bar{p}_A - \bar{p}_B)/\sqrt{s_A^2/k + s_B^2/q}$. This approach employs the hypothesis test for comparing two population means by the independent sample approach. As discussed in the previous paragraph, the test statistic for this purpose can also be the $z$ value, which provides more precise information for determining the result of the hypothesis test.
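The $t$ statistic above can likewise be computed directly from per-fold accuracies. A small sketch under stated assumptions: the function name and the two lists of fold accuracies are hypothetical.

```python
import numpy as np

def independent_t(acc_A, acc_B):
    """t = (p_bar_A - p_bar_B) / sqrt(s_A^2 / k + s_B^2 / q) from per-fold accuracies."""
    acc_A, acc_B = np.asarray(acc_A), np.asarray(acc_B)
    k, q = len(acc_A), len(acc_B)
    s2_A = acc_A.var(ddof=1)                   # s_A^2, with k - 1 in the denominator
    s2_B = acc_B.var(ddof=1)                   # s_B^2, with q - 1 in the denominator
    return (acc_A.mean() - acc_B.mean()) / np.sqrt(s2_A / k + s2_B / q)

# Hypothetical fold accuracies of algorithms A and B over five folds each:
print(independent_t([0.80, 0.70, 0.75, 0.75, 0.80], [0.85, 0.80, 0.78, 0.82, 0.80]))
```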
4.3. Discussion

As described in Section 1, there are four factors that can affect the results obtained by k-fold cross validation. The sampling distributions derived in the previous two subsections will be used to investigate the appropriate settings of the four factors in performing k-fold cross validation.

4.3.1. The number of folds

…If the testing results of all folds satisfy the large-sample conditions, the number of folds can be as large as possible.

4.3.2. The number of instances in a fold

Most of the literature defines the first step of k-fold cross validation as: randomly partition a data set into k disjoint folds of approximately equal size. The way to perform this operation is not unique. The following are three possible ways to divide a data set with 203 instances into $k = 5$ folds.

(1) Each instance is independently assigned to fold $j$, where $j$ is the smallest integer larger than or equal to $5u$ and $u$ is a random number in $(0, 1]$. For instance, an instance will be assigned to the second fold when $u = 0.26$. Suppose the numbers of instances in the five folds turn out to be 38, 41, 44, 37, and 43, respectively.

(2) Generate a random number for each instance, and sort the random numbers into ascending order. Then divide the instances into five folds according to the sorted random numbers, so that the numbers of instances in the five folds are 40, 40, 40, 40, and 43, respectively.

(3) The process of generating and sorting random numbers is the same as in approach (2), while the numbers of instances in the five folds are 40, 40, 41, 41, and 41, respectively.

The mean accuracies resulting from the three approaches are generally different, so which one should be adopted to divide this data set into five folds?

Let $m_j$ be the number of instances in fold $j$. The $\bar{p}_j$ for $j = 1, 2, \ldots, k$ are all unbiased estimators of the actual accuracy $p$. If they are considered as observations for estimating $p$, then they should come from the same population; i.e., they should follow the same normal distribution. This means that they should have, at least approximately, the same variance. By Theorem 1, $\bar{p}_j$ can be assumed to have a normal distribution with mean $p$ and variance $p(1-p)/m_j$. The $m_j$ for $j = 1, 2, \ldots, k$ should therefore be as close to each other as possible. Hence, approach (3), sketched in code below, is the most recommended one for dividing a data set into folds. When the number of instances $n$ in a data set is large such that $n/k$ is far larger than $k$, the other two approaches can also be adopted.
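The following minimal sketch, assuming NumPy and a hypothetical function name, implements approach (3): sort random numbers and cut the resulting permutation into folds whose sizes differ by at most one.

```python
import numpy as np

def equal_size_folds(n, k, seed=0):
    """Approach (3): fold sizes differ by at most one (n = 203, k = 5 -> 41, 41, 41, 40, 40)."""
    u = np.random.default_rng(seed).random(n)   # one random number per instance
    order = np.argsort(u)                       # sort the random numbers
    return np.array_split(order, k)             # cut into k nearly equal folds

folds = equal_size_folds(203, 5)
print([len(f) for f in folds])                  # [41, 41, 41, 40, 40]
```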
The large-sample conditions indicate that it is inappropriate to decide whether a sample is large only by the number of instances. For instance, when the classification accuracy of an algorithm on a data set is close to 50%, the testing results of a fold that contains only 20 instances are likely to satisfy the large-sample conditions. On the contrary, when an algorithm has close to 100% prediction accuracy on a data set, a fold containing more than 200 instances may fail to be a large sample.
4.3.3. The level of averaging

…Consider Example 1, in which a classification algorithm evaluated by 5-fold cross validation makes 32, 28, 30, 30, and 32 correct predictions in five folds of 40 instances each. If the level of averaging is at data set, the mean and variance of the accuracy estimator for the classification algorithm are calculated as
$$\bar{p} = (32 + 28 + 30 + 30 + 32)/200 = 0.76$$
and
$$\mathrm{Var}(\bar{p}) = 0.76(1 - 0.76)/200 = 0.000912.$$
The interval with confidence level $1 - \alpha$ is $0.76 \pm 0.030\, z_{\alpha/2}$. If the level of averaging is at fold, these two values are calculated as
$$\bar{p} = \frac{0.80 + 0.70 + 0.75 + 0.75 + 0.80}{5} = 0.76$$
and
$$s^2 = \frac{0.04^2 + (-0.06)^2 + (-0.01)^2 + (-0.01)^2 + 0.04^2}{5 - 1} = 0.00175.$$
In this case, the point estimator $\bar{p}$ is a sample mean instead of a sample proportion. Hence, the interval with confidence level $1 - \alpha$ is $0.76 \pm t_{\alpha/2}\sqrt{0.00175/5} = 0.76 \pm 0.019\, t_{\alpha/2}$.
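A minimal NumPy sketch reproducing these two computations; the fold counts and fold size are read off the equations above, and the variable names are illustrative.

```python
import numpy as np

correct = np.array([32, 28, 30, 30, 32])   # correct predictions in five folds of 40
n = 200                                    # total number of instances

# Level of averaging = data set: p_bar is a sample proportion.
p_ds = correct.sum() / n                   # 0.76
var_ds = p_ds * (1 - p_ds) / n             # 0.000912

# Level of averaging = fold: p_bar is a sample mean of five fold accuracies.
acc = correct / 40                         # 0.80, 0.70, 0.75, 0.75, 0.80
s2 = acc.var(ddof=1)                       # 0.00175
half_width = np.sqrt(s2 / 5)               # about 0.019, multiplies t_{alpha/2}

print(p_ds, var_ds, acc.mean(), s2, half_width)
```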
Example 1 shows that when the level of averaging is set at data set, the variance of a point estimator depends only on the sample proportion. In this case, when two classification algorithms evaluated by k-fold cross validation have the same mean accuracy, the variances of the two sample proportions will be the same. We will then not be able to tell which algorithm has the more stable performance. This will not occur when prediction accuracy is calculated fold by fold. Hence, if the testing results of every fold satisfy the large-sample conditions, the level of averaging should be set at fold.

4.3.4. Repetition

As shown in Table 1, several sources suggest that k-fold cross validation can be repeatedly performed to obtain several unbiased estimates such that the point estimate of $p$ becomes more reliable. For instance, let $\bar{p}_j$ and $\bar{p}'_j$ for $j = 1, 2, \ldots, k$ be the estimates obtained by the first and the second rounds of k-fold cross validation, respectively. Then $\bar{p}' = \sum_{j=1}^{k} (\bar{p}_j + \bar{p}'_j)/2k$ should be a better estimate of $p$ than $\bar{p} = \sum_{j=1}^{k} \bar{p}_j/k$, because $\bar{p}'$ is an unbiased estimator obtained from a larger sample. Theoretically, $\bar{p}'$ would then have a smaller variance than $\bar{p}$.
An interesting consequence would be that the expected difference between the aggregate point estimator and $p$ becomes smaller as the number of rounds of k-fold cross validation grows. Since the number of instances for learning does not increase, what is the new information that makes the point estimate more precise? Note that $\bar{p}'$ is a more reliable estimate of $p$ than $\bar{p}$ only if the $\bar{p}_j$ and the $\bar{p}'_j$ for $j = 1, 2, \ldots, k$ are all independent. The same data are used for the random partitions in the first and the second rounds. An instance in fold $i$ of the first round will be assigned to fold $j$ for some $j$ in the second round. Since the classification algorithm is still the same one, neither the instance-independence nor the scheme-independence assumption holds in this case. This means that $\bar{p}_i$ and $\bar{p}'_j$ are not independent. The predictions of the same instance in the first and the second rounds will be positively correlated. For any two random variables $X$ and $Y$, we have $\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$, so the variance of $X+Y$ increases when $X$ and $Y$ are positively correlated. Since $\bar{p}_i$ is positively correlated with some of the $\bar{p}'_j$, the variance of $\bar{p}'$ does not actually decrease even though it is calculated from a larger sample. This dispels the myth that repeatedly performing k-fold cross validation yields a more reliable estimate of $p$.
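The variance identity invoked here is easy to check numerically. This is a toy demonstration with synthetic, positively correlated variables; it is not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.8 * x + 0.6 * rng.normal(size=100_000)     # positively correlated with x

# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y); the covariance term is why
# averaging positively correlated estimates does not shrink the variance
# the way averaging independent estimates would.
lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, ddof=0)[0, 1]
print(lhs, rhs)                                   # both close to 3.6
```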
In comparing the performance of two classification algorithms $A$ and $B$, let $\hat{p}_i = \sum_{j=1}^{k} r_{ij}/n$ for $i = A, B$ when the averaging level is set at data set. As argued previously, it is inappropriate to perform k-fold cross validation repeatedly. The matched sample approach therefore cannot be used to determine whether $\hat{p}_A$ and $\hat{p}_B$ are significantly different in this case. The point estimator $\hat{p}_A - \hat{p}_B$ will have the same sampling distribution as $\bar{p}_A - \bar{p}_B$ in the independent sample approach.

When the instances in a data set are divided into independent training and testing data, an instance can play only one role in an iteration. Another approach that also satisfies this requirement is to randomly choose a pre-specified proportion of instances from a data set for testing; the instances not chosen for testing are used for training. If this procedure is executed only once, and the testing results satisfy the large-sample conditions, then we have a sampling distribution for making statistical inference about prediction accuracy. Since the number of testing instances is generally far less than the number of instances in a data set, the variance of the point estimator resulting from this approach will be larger than that resulting from k-fold cross validation. The procedure of randomly choosing testing instances should not be repeated, because the testing sets in two rounds may have common instances. In this case, the testing results of the two rounds will not be independent, and hence they should not be aggregated together to derive a more reliable point estimate.

5. Leave-one-out cross validation

Many studies adopt leave-one-out cross validation to evaluate the performance of a classification algorithm when the number of instances in a data set or the number of instances for a class value is small. Since the randomness of dividing instances into folds for training and testing does not exist, the point estimate of accuracy for a given data set is constant. This section derives the sampling distributions for leave-one-out cross validation to make statistical inference about the mean accuracies of classification algorithms.

5.1. Single algorithm

The prediction of an instance can be either correct or wrong. This means that the random variable corresponding to the prediction of an instance follows a Bernoulli distribution with success probability $p$. When the instances are independent of each other, the number of correct predictions on $n$ instances follows a binomial distribution with parameters $n$ and $p$. Let $x_i$ be the random variable corresponding to the prediction of the $i$th instance. Then $P\{x_i = 1\} = p$ and $P\{x_i = 0\} = 1 - p$. Hence, the sampling distribution of the point estimator $\bar{p} = \sum_{i=1}^{n} x_i/n$ is approximated by $N(p, p(1-p)/n)$ when $\sum_{i=1}^{n} x_i \ge 5$ and $n - \sum_{i=1}^{n} x_i \ge 5$. Note that this sampling distribution is the same as the one obtained in Theorem 1. This is reasonable because leave-one-out cross validation is a special case of k-fold cross validation. Since k-fold cross validation is more efficient, leave-one-out cross validation will be used only when the random partition in k-fold cross validation has a large impact on performance evaluation.

Unlike k-fold cross validation, the sample variance obtained by leave-one-out cross validation is constant. If leave-one-out cross validation is executed for several rounds on a data set, every round will produce the same sample mean and sample variance. These sample means cannot be pooled together to derive a more reliable point estimate because their variance equals zero. It is therefore pointless to repeatedly perform leave-one-out cross validation in the hope of obtaining a more reliable point estimate. This provides another explanation of why k-fold cross validation should be executed only once.
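A minimal sketch of the normal-approximation interval implied by this sampling distribution; the function name and the assertion guarding the large-sample conditions are illustrative assumptions.

```python
import numpy as np

def loo_accuracy_interval(n_correct, n, z=1.96):
    """N(p, p(1-p)/n) approximation for the LOOCV accuracy estimator; requires the
    large-sample conditions n_correct >= 5 and n - n_correct >= 5."""
    assert n_correct >= 5 and n - n_correct >= 5, "large-sample conditions violated"
    p_hat = n_correct / n
    half = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

print(loo_accuracy_interval(80, 100))   # e.g. 80 correct predictions on 100 instances
```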
Similar to k-fold cross validation, both the independent and the matched sample approaches can be used to compare the performance of two classification algorithms in leave-one-out cross validation. Since leave-one-out cross validation does not have a mechanism for random partition, the independent sample approach is relatively simple. Note that the difference between the predictions of an instance by two classification algorithms will not be approximately normally distributed. Deriving the sampling distribution for the matched sample approach in leave-one-out cross validation is therefore not so straightforward.
In a data set containing $n$ instances, let $r_i$ be the number of correct predictions made by classification algorithm $i$ evaluated by leave-one-out cross validation for $i = A, B$. If $r_i \ge 5$ and $n - r_i \ge 5$ for $i = A, B$, then by the same argument as given in the proof of Theorem 3, $\bar{p}_A - \bar{p}_B$ will approximately follow a normal distribution with mean $p_A - p_B$ and variance $(p_A(1-p_A)/n) + (p_B(1-p_B)/n)$ when the scheme-independence assumption is true for algorithms $A$ and $B$. So, when the independent sample approach is used, k-fold and leave-one-out cross validation actually have the same sampling distribution for making statistical inference. Alternatively, when the matched sample approach is used, let $x_{ij}$ represent the random variable corresponding to the prediction of the $j$th instance by classification algorithm $i$ for $i = A, B$ and $j = 1, 2, \ldots, n$. Then the value of $y_j = x_{Aj} - x_{Bj}$ can be $-1$, $0$, or $+1$.
Theorem 4. Let $n_{-1}$, $n_0$, and $n_{+1}$ be the frequencies of $y_j = -1$, $0$, and $+1$, respectively. If $n_{-1} \ge 5$, $n_0 \ge 5$, and $n_{+1} \ge 5$, then the sampling distribution of $\bar{y} = \sum_{j=1}^{n} y_j/n$ can be approximated by a normal distribution with mean $p_A - p_B$ when the scheme-independence assumption is satisfied.
Proof. When $n_{-1} \ge 5$, $n_0 \ge 5$, and $n_{+1} \ge 5$, by the central limit theorem, the point estimator $\bar{y}$ can be assumed to follow a normal distribution. Since $P\{x_{ij} = 1\} = p_i$ and $P\{x_{ij} = 0\} = 1 - p_i$ for $i = A, B$, and the scheme-independence assumption is satisfied, we have $P\{y_j = -1\} = (1-p_A)p_B$, $P\{y_j = 0\} = (1-p_A)(1-p_B) + p_A p_B$, and $P\{y_j = +1\} = p_A(1-p_B)$, and hence $E(y_j) = p_A - p_B$. Since every $y_j$ is an unbiased estimator of $p_A - p_B$, $E(\bar{y}) = \sum_{j=1}^{n} E(y_j)/n = p_A - p_B$.
Theorem 4 shows that the matched sample approach is applicable for leave-one-out cross validation. The sample variance $s_y^2 = \sum_{j=1}^{n} (y_j - \bar{y})^2/(n-1)$ can be used as an estimate of $\mathrm{Var}(y_j)$ for $j = 1, 2, \ldots, n$. The test statistic is therefore calculated as $t = \bar{y}/\sqrt{s_y^2/n}$ with $n-1$ degrees of freedom for testing the null hypothesis $H_0: p_A - p_B = 0$.
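The matched-sample statistic can be computed directly from the two prediction vectors. A minimal sketch, where the function name and the synthetic LOOCV predictions are assumptions for illustration:

```python
import numpy as np

def matched_t(pred_A, pred_B, truth):
    """t = y_bar / sqrt(s_y^2 / n) with n - 1 degrees of freedom, where
    y_j = x_Aj - x_Bj and x_ij indicates a correct prediction."""
    y = ((np.asarray(pred_A) == truth).astype(int)
         - (np.asarray(pred_B) == truth).astype(int))   # y_j in {-1, 0, +1}
    return y.mean() / np.sqrt(y.var(ddof=1) / len(y))

# Synthetic LOOCV predictions on 100 instances: A correct ~80%, B correct ~70%.
rng = np.random.default_rng(3)
truth = rng.integers(0, 2, size=100)
pred_A = np.where(rng.random(100) < 0.8, truth, 1 - truth)
pred_B = np.where(rng.random(100) < 0.7, truth, 1 - truth)
print(matched_t(pred_A, pred_B, truth))
```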
Example 2. Suppose that the numbers of correct predictions of leave-one-out cross validation for classification algorithms $A$ and $B$ on a data set with 100 instances are 80 and 84, respectively. In the independent sample approach, the test statistic for identifying whether the accuracies of the two classification algorithms are significantly different is calculated as
$$z = \frac{\bar{p}_A - \bar{p}_B}{\sqrt{2\bar{p}(1-\bar{p})/n}} = \frac{0.80 - 0.84}{\sqrt{2 \times 0.82 \times 0.18/100}} = -0.7362,$$
where $\bar{p}$ is the pooled mean accuracy. In the matched sample approach, let the frequencies $n_{-1}$, $n_0$, and $n_{+1}$ be 30, 44, and 26, respectively, so that $\bar{y} = (26 - 30)/100 = -0.04$ and the sample variance is $s_y^2 = [30(-1+0.04)^2 + 44(0.04)^2 + 26(1+0.04)^2]/99 \approx 0.564$. The test statistic for this case is then
$$t = \frac{-0.04}{\sqrt{0.564/100}} \approx -0.53.$$
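A short check of Example 2's arithmetic; the frequencies fully determine $\bar{y}$ and $s_y^2$, so both statistics can be recomputed.

```python
import numpy as np

# Independent sample approach: 80 and 84 correct predictions out of n = 100.
p_A, p_B, n = 0.80, 0.84, 100
p = (p_A + p_B) / 2                                  # pooled mean accuracy, 0.82
z = (p_A - p_B) / np.sqrt(2 * p * (1 - p) / n)       # about -0.7362

# Matched sample approach: frequencies of y_j = -1, 0, +1 are 30, 44, 26.
y = np.repeat([-1, 0, 1], [30, 44, 26])
t = y.mean() / np.sqrt(y.var(ddof=1) / n)            # y_bar = -0.04, s_y^2 ~ 0.564

print(z, t)                                          # -0.736..., -0.53...
```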
6. Conclusion

Both k-fold and leave-one-out cross validation are popular approaches for evaluating the performance of a classification algorithm, yet how the testing results of the two evaluation approaches should be used for making statistical inference is not treated consistently in the literature. In this paper, we consider four factors to investigate the usage of k-fold cross validation: the number of folds, the number of instances in a fold, the level of averaging, and the repetition of cross validation. In order to study the impact of the four factors, we first propose the independence assumptions and define the large-sample conditions in cross validation. They are then used to derive the sampling distributions of the point estimators for the two cross validation approaches in evaluating the performance of one algorithm or comparing the performance of two algorithms.

According to the sampling distributions for k-fold cross validation, the large-sample conditions determine the number of instances in a fold, and the number of folds can be as large as possible as long as the large-sample conditions still hold in every fold. If the variability of the performance of a classification algorithm is important, the level of averaging should be set at fold. Since the mean accuracies obtained by any two rounds of k-fold cross validation are dependent, repeatedly performing k-fold cross validation cannot provide a more reliable point estimate. Both the independent and the matched sample approaches can be used to make statistical inference about the testing results of leave-one-out cross validation.

When the scheme-independence assumption is not satisfied, neither the matched nor the independent sample approach can be applied to compare the performance of two classification algorithms for k-fold and leave-one-out cross validation. New statistical inference methods should be established to serve this purpose. Since there is a random partition mechanism in k-fold cross validation, how to reduce the variability of a point estimate obtained from k-fold cross validation is an interesting research topic. Repeating k-fold cross validation is not an appropriate way to achieve this, and hence new efficient methods should be developed to obtain more reliable accuracy estimates.

Conflict of interest statement

None declared.

Acknowledgments

This research was supported by the National Science Council of Taiwan under Grant no. 101-2410-H-006-006.

References

[1] I.H. Witten, E. Frank, M.A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Morgan Kaufmann, Massachusetts, 2011.
[2] J.H. Friedman, On bias, variance, 0/1-loss, and the curse-of-dimensionality, Data Min. Knowl. Discov. 1 (1997) 55–77.
[3] Y. Bengio, Y. Grandvalet, No unbiased estimator of the variance of K-fold cross-validation, J. Mach. Learn. Res. 5 (2004) 1089–1105.
[4] J.D. Rodriguez, A. Perez, J.A. Lozano, Sensitivity analysis of k-fold cross validation in prediction error estimation, IEEE Trans. Pattern Anal. Mach. Intell. 32 (2010) 569–575.
[5] E. Alpaydin, Introduction to Machine Learning, 2nd Edition, MIT Press, Massachusetts, 2010.
[6] J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann, Massachusetts, 2012.
[7] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, 2nd Edition, John Wiley & Sons, New Jersey, 2011.
[8] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[9] P.N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison Wesley, Massachusetts, 2006.
[10] G. Casella, R.L. Berger, Statistical Inference, 2nd Edition, Duxbury, California, 2002.
[11] R.J. Freund, W.J. Wilson, D.L. Mohr, Statistical Methods, 3rd Edition, Academic Press, Massachusetts, 2010.
[12] D.R. Anderson, D.J. Sweeney, T.A. Williams, J.D. Camm, J.J. Cochran, Statistics for Business and Economics, 12th Edition, South-Western, Tennessee, 2012.
Tzu-Tsung Wong is a professor in the Institute of Information Management at National Cheng Kung University, Taiwan, ROC. He received his Ph.D. degree in industrial engineering from the University of Wisconsin at Madison. His research interests include Bayesian statistical analysis, naïve Bayesian classifiers, and classification methods for gene sequence data.