4.5.4 Bootstrap
The methods presented so far assume that the training records are sampled without replacement. As a result, there are no duplicate records in the training and test sets. In the bootstrap approach, the training records are sampled with replacement; i.e., a record already chosen for training is put back into the original pool of records so that it is equally likely to be redrawn. If the original data has N records, it can be shown that, on average, a bootstrap sample of size N contains about 63.2% of the records in the original data. This approximation follows from the fact that the probability a record is chosen by a bootstrap sample is 1 − (1 − 1/N)^N. When N is sufficiently large, the probability asymptotically approaches 1 − e^{−1} = 0.632. Records that are not included in the bootstrap sample become part of the test set. The model induced from the training set is then applied to the test set to obtain an estimate of the accuracy of the bootstrap sample, ε_i. The sampling procedure is then repeated b times to generate b bootstrap samples.
There are several variations to the bootstrap sampling approach in terms of how the overall accuracy of the classifier is computed. One of the more widely used approaches is the .632 bootstrap, which computes the overall accuracy by combining the accuracies of each bootstrap sample (ε_i) with the accuracy computed from a training set that contains all the labeled examples in the original data (acc_s):

$$acc_{boot} = \frac{1}{b}\sum_{i=1}^{b}\left(0.632 \times \epsilon_i + 0.368 \times acc_s\right). \qquad (4.11)$$
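To make the procedure concrete, here is a minimal Python sketch of the .632 bootstrap, assuming NumPy arrays and scikit-learn's DecisionTreeClassifier as the classifier under evaluation; the function name and the choice of classifier are illustrative, not prescribed by the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def accuracy_632_bootstrap(X, y, b=200, seed=0):
    """Estimate overall accuracy with the .632 bootstrap (Equation 4.11)."""
    rng = np.random.default_rng(seed)
    N = len(y)

    # acc_s: accuracy of a model trained and evaluated on all labeled examples.
    acc_s = DecisionTreeClassifier().fit(X, y).score(X, y)

    eps = []  # epsilon_i, one accuracy estimate per bootstrap sample
    for _ in range(b):
        # Sample N records with replacement; on average ~63.2% are unique.
        idx = rng.integers(0, N, size=N)
        oob = np.setdiff1d(np.arange(N), idx)  # left-out records form the test set
        if oob.size == 0:  # vanishingly rare for realistic N
            continue
        model = DecisionTreeClassifier().fit(X[idx], y[idx])
        eps.append(model.score(X[oob], y[oob]))

    eps = np.asarray(eps)
    return float(np.mean(0.632 * eps + 0.368 * acc_s))
```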
4.6 Methods for Comparing Classifiers
It is often useful to compare the performance of different classifiers to determine which classifier works better on a given data set. However, depending on the size of the data, the observed difference in accuracy between two classifiers may not be statistically significant. This section examines some of the statistical tests available to compare the performance of different models and classifiers.

For illustrative purposes, consider a pair of classification models, M_A and M_B. Suppose M_A achieves 85% accuracy when evaluated on a test set containing 30 records, while M_B achieves 75% accuracy on a different test set containing 5000 records. Based on this information, is M_A a better model than M_B?
The preceding example raises two key questions regarding the statistical significance of the performance metrics:

1. Although M_A has a higher accuracy than M_B, it was tested on a smaller test set. How much confidence can we place on the accuracy for M_A?

2. Is it possible to explain the difference in accuracy as a result of variations in the composition of the test sets?

The first question relates to the issue of estimating the confidence interval of a given model accuracy. The second question relates to the issue of testing the statistical significance of the observed deviation. These issues are investigated in the remainder of this section.
4.6.1 Estimating a Confidence Interval for Accuracy

To determine the confidence interval, we need to establish the probability distribution that governs the accuracy measure. This section describes an approach for deriving the confidence interval by modeling the classification task as a binomial experiment. Following is a list of characteristics of a binomial experiment:
1. The experiment consists of N independent trials, where each trial has two possible outcomes: success or failure.

2. The probability of success, p, in each trial is constant.
An example of a binomial experiment is counting the number of heads that turn up when a coin is flipped N times. If X is the number of successes observed in N trials, then the probability that X takes a particular value is given by a binomial distribution with mean Np and variance Np(1 − p):

$$P(X = v) = \binom{N}{v} p^v (1-p)^{N-v}.$$

For example, if the coin is fair (p = 0.5) and is flipped fifty times, then the probability that the head shows up 20 times is

$$P(X = 20) = \binom{50}{20} 0.5^{20} (1-0.5)^{30} = 0.0419.$$

If the experiment is repeated many times, then the average number of heads expected to show up is 50 × 0.5 = 25, while its variance is 50 × 0.5 × 0.5 = 12.5.
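These quantities are easy to check numerically; a quick sketch, assuming SciPy is available:

```python
from scipy.stats import binom

N, p = 50, 0.5
print(binom.pmf(20, N, p))   # P(X = 20) ~ 0.0419
print(binom.mean(N, p))      # Np = 25.0
print(binom.var(N, p))       # Np(1 - p) = 12.5
```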
The task of predicting the class labels of test records can also be considered as a binomial experiment. Given a test set that contains N records, let X be the number of records correctly predicted by a model and p be the true accuracy of the model. By modeling the prediction task as a binomial experiment, X has a binomial distribution with mean Np and variance Np(1 − p). It can be shown that the empirical accuracy, acc = X/N, also has a binomial distribution with mean p and variance p(1 − p)/N (see Exercise 12). Although the binomial distribution can be used to estimate the confidence interval for acc, it is often approximated by a normal distribution when N is sufficiently large. Based on the normal distribution, the following confidence interval for acc can be derived:
$$P\left(-Z_{\alpha/2} \le \frac{acc - p}{\sqrt{p(1-p)/N}} \le Z_{1-\alpha/2}\right) = 1 - \alpha, \qquad (4.12)$$

where $Z_{\alpha/2}$ and $Z_{1-\alpha/2}$ are the upper and lower bounds obtained from a standard normal distribution at confidence level (1 − α). Since a standard normal distribution is symmetric around Z = 0, it follows that $Z_{\alpha/2} = Z_{1-\alpha/2}$. Rearranging this inequality leads to the following confidence interval for p:

$$\frac{2 \times N \times acc + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4N\,acc - 4N\,acc^2}}{2(N + Z_{\alpha/2}^2)}. \qquad (4.13)$$
The following table shows the values of $Z_{\alpha/2}$ at different confidence levels:

1 − α   | 0.99 | 0.98 | 0.95 | 0.9  | 0.8  | 0.7  | 0.5
Z_{α/2} | 2.58 | 2.33 | 1.96 | 1.65 | 1.28 | 1.04 | 0.67
Example 4.4. Consider a model that has an accuracy of 80% when evaluated on 100 test records. What is the confidence interval for its true accuracy at a 95% confidence level? The confidence level of 95% corresponds to $Z_{\alpha/2} = 1.96$ according to the table given above. Inserting this term into Equation 4.13 yields a confidence interval between 71.1% and 86.7%. The following table shows the confidence interval when the number of records, N, increases:

N                   | 20            | 50            | 100           | 500           | 1000          | 5000
Confidence Interval | 0.584 – 0.919 | 0.670 – 0.888 | 0.711 – 0.867 | 0.763 – 0.833 | 0.774 – 0.824 | 0.789 – 0.811

Note that the confidence interval becomes tighter when N increases.
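Equation 4.13 is straightforward to evaluate in code; the following Python sketch (the function name is illustrative) reproduces the numbers in Example 4.4 and in the table above.

```python
import math

def accuracy_confidence_interval(acc, N, z=1.96):
    """Confidence interval for the true accuracy p (Equation 4.13)."""
    center = 2 * N * acc + z**2
    spread = z * math.sqrt(z**2 + 4 * N * acc - 4 * N * acc**2)
    denom = 2 * (N + z**2)
    return (center - spread) / denom, (center + spread) / denom

for n in (20, 50, 100, 500, 1000, 5000):
    lo, hi = accuracy_confidence_interval(0.8, n)
    print(n, round(lo, 3), round(hi, 3))   # e.g. 100 -> 0.711, 0.867
```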
4.6.2 Comparing the Performance of Two Models

Consider a pair of models, M_1 and M_2, that are evaluated on two independent test sets, D_1 and D_2. Let n_1 denote the number of records in D_1 and n_2 denote the number of records in D_2. In addition, suppose the error rate for M_1 on D_1 is e_1 and the error rate for M_2 on D_2 is e_2. Our goal is to test whether the observed difference between e_1 and e_2 is statistically significant.

Assuming that n_1 and n_2 are sufficiently large, the error rates e_1 and e_2 can be approximated using normal distributions. If the observed difference in the error rate is denoted as d = e_1 − e_2, then d is also normally distributed with mean d_t, its true difference, and variance σ_d². The variance of d can be computed as follows:
$$\hat{\sigma}_d^2 \simeq \sigma_d^2 = \frac{e_1(1-e_1)}{n_1} + \frac{e_2(1-e_2)}{n_2}, \qquad (4.14)$$

where $e_1(1-e_1)/n_1$ and $e_2(1-e_2)/n_2$ are the variances of the error rates. Finally, at the (1 − α)% confidence level, it can be shown that the confidence interval for the true difference d_t is given by the following equation:

$$d_t = d \pm z_{\alpha/2}\,\hat{\sigma}_d. \qquad (4.15)$$
Example 4.5. Consider the problem described at the beginning of this section. Model M_A has an error rate of e_1 = 0.15 when applied to N_1 = 30 test records, while model M_B has an error rate of e_2 = 0.25 when applied to N_2 = 5000 test records. The observed difference in their error rates is d = |0.15 − 0.25| = 0.1. In this example, we are performing a two-sided test to check whether d_t = 0 or d_t ≠ 0. The estimated variance of the observed difference in error rates can be computed as follows:

$$\hat{\sigma}_d^2 = \frac{0.15(1-0.15)}{30} + \frac{0.25(1-0.25)}{5000} = 0.0043,$$

or $\hat{\sigma}_d = 0.0655$. Inserting this value into Equation 4.15, we obtain the following confidence interval for d_t at 95% confidence level:

$$d_t = 0.1 \pm 1.96 \times 0.0655 = 0.1 \pm 0.128.$$

As the interval spans the value zero, we can conclude that the observed difference is not statistically significant at a 95% confidence level.
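The same computation can be packaged as a short Python helper; the function name is illustrative.

```python
import math

def difference_confidence_interval(e1, n1, e2, n2, z=1.96):
    """Confidence interval for the true difference d_t (Equations 4.14 and 4.15)."""
    d = abs(e1 - e2)
    var_d = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2   # Equation 4.14
    sigma_d = math.sqrt(var_d)
    return d - z * sigma_d, d + z * sigma_d            # Equation 4.15

# Example 4.5: e1 = 0.15 on 30 records, e2 = 0.25 on 5000 records.
lo, hi = difference_confidence_interval(0.15, 30, 0.25, 5000)
print(lo, hi)   # ~(-0.028, 0.228); the interval spans zero
```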
At what confidence level can we reject the hypothesis that d_t = 0? To do this, we need to determine the value of $Z_{\alpha/2}$ such that the confidence interval for d_t does not span the value zero. We can reverse the preceding computation and look for the value $Z_{\alpha/2}$ such that $d > Z_{\alpha/2}\hat{\sigma}_d$. Replacing the values of d and $\hat{\sigma}_d$ gives $Z_{\alpha/2} < 1.527$. This value first occurs when (1 − α) ≤ 0.936 (for a two-sided test). The result suggests that the null hypothesis can be rejected at confidence level of 93.6% or lower.
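The reverse computation amounts to evaluating the standard normal CDF at $d/\hat{\sigma}_d$; a one-line check, assuming SciPy:

```python
from scipy.stats import norm

z = 0.1 / 0.0655      # ~1.527
print(norm.cdf(z))    # ~0.936, the cumulative probability quoted in the text
```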
4.6.3 Comparing the Performance of Two Classifiers

Suppose we want to compare the performance of two classifiers using the k-fold cross-validation approach. Initially, the data set D is divided into k equal-sized partitions. We then apply each classifier to construct a model from k − 1 of the partitions and test it on the remaining partition. This step is repeated k times, each time using a different partition as the test set.

Let $M_{ij}$ denote the model induced by classification technique $L_i$ during the jth iteration. Note that each pair of models $M_{1j}$ and $M_{2j}$ are tested on the same partition j. Let $e_{1j}$ and $e_{2j}$ be their respective error rates. The difference between their error rates during the jth fold can be written as $d_j = e_{1j} - e_{2j}$. If k is sufficiently large, then $d_j$ is normally distributed with mean $d_t^{cv}$, which is the true difference in their error rates, and variance $\sigma^{cv}$. Unlike the previous approach, the overall variance in the observed differences is estimated using the following formula:
$$\hat{\sigma}_{d^{cv}}^2 = \frac{\sum_{j=1}^{k}(d_j - \bar{d})^2}{k(k-1)}, \qquad (4.16)$$

where $\bar{d}$ is the average difference. For this approach, we need to use a t-distribution to compute the confidence interval for $d_t^{cv}$:

$$d_t^{cv} = \bar{d} \pm t_{(1-\alpha),k-1}\,\hat{\sigma}_{d^{cv}}.$$

The coefficient $t_{(1-\alpha),k-1}$ is obtained from a probability table with two input parameters, its confidence level (1 − α) and the number of degrees of freedom, k − 1. The probability table for the t-distribution is shown in Table 4.6.
Example 4.6. Suppose the estimated difference in the accuracy of models generated by two classification techniques has a mean equal to 0.05 and a standard deviation equal to 0.002. If the accuracy is estimated using a 30-fold cross-validation approach, then at a 95% confidence level, the true accuracy difference is

$$d_t^{cv} = 0.05 \pm 2.04 \times 0.002. \qquad (4.17)$$
Table 4.6. Probability table for t-distribution.

        (1 − α)
k − 1 | 0.8  | 0.9  | 0.95 | 0.98 | 0.99
1     | 3.08 | 6.31 | 12.7 | 31.8 | 63.7
2     | 1.89 | 2.92 | 4.30 | 6.96 | 9.92
4     | 1.53 | 2.13 | 2.78 | 3.75 | 4.60
9     | 1.38 | 1.83 | 2.26 | 2.82 | 3.25
14    | 1.34 | 1.76 | 2.14 | 2.62 | 2.98
19    | 1.33 | 1.73 | 2.09 | 2.54 | 2.86
24    | 1.32 | 1.71 | 2.06 | 2.49 | 2.80
29    | 1.31 | 1.70 | 2.04 | 2.46 | 2.76
Since the confidence interval does not span the value zero, the observed difference between the techniques is statistically significant.
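Putting Equation 4.16 and the t-based interval together, a minimal Python sketch follows; the helper name is illustrative, and SciPy's t quantile stands in for Table 4.6.

```python
import math
from scipy.stats import t

def cv_difference_interval(d_list, alpha=0.05):
    """Confidence interval for d_t^cv from per-fold differences (Equation 4.16)."""
    k = len(d_list)
    d_bar = sum(d_list) / k
    var = sum((dj - d_bar) ** 2 for dj in d_list) / (k * (k - 1))  # Equation 4.16
    t_coef = t.ppf(1 - alpha / 2, df=k - 1)  # ~2.04 for k = 30, matching Table 4.6
    half = t_coef * math.sqrt(var)
    return d_bar - half, d_bar + half
```

With per-fold differences whose mean is 0.05 and whose estimated standard deviation is 0.002, this reproduces the interval 0.05 ± 2.04 × 0.002 of Example 4.6.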
4.7 Bibliographic Notes
Early classification systems were developed to organize a large collection of objects. For example, the Dewey Decimal and Library of Congress classification systems were designed to catalog and index the vast number of library books. The categories are typically identified in a manual fashion, with the help of domain experts.

Automated classification has been a subject of intensive research for many years. The study of classification in classical statistics is sometimes known as discriminant analysis, where the objective is to predict the group membership of an object based on a set of predictor variables. A well-known classical method is Fisher's linear discriminant analysis [117], which seeks to find a linear projection of the data that produces the greatest discrimination between objects that belong to different classes.

Many pattern recognition problems also require the discrimination of objects from different classes. Examples include speech recognition, handwritten character identification, and image classification. Readers who are interested in the application of classification techniques for pattern recognition can refer to the survey articles by Jain et al. [122] and Kulkarni et al. [128] or classic pattern recognition books by Bishop [107], Duda et al. [114], and Fukunaga [118]. The subject of classification is also a major research topic in the fields of neural networks, statistical learning, and machine learning. An in-depth treat-