Methods For Comparing Classifiers
4.5.4 Bootstrap

The methods presented so far assume that the training records are sampled without replacement. As a result, there are no duplicate records in the training and test sets. In the bootstrap approach, the training records are sampled with replacement; i.e., a record already chosen for training is put back into the original pool of records so that it is equally likely to be redrawn. If the original data has N records, it can be shown that, on average, a bootstrap sample of size N contains about 63.2% of the records in the original data. This approximation follows from the fact that the probability a record is chosen by a bootstrap sample is $1 - (1 - 1/N)^N$. When N is sufficiently large, the probability asymptotically approaches $1 - e^{-1} = 0.632$. Records that are not included in the bootstrap sample become part of the test set. The model induced from the training set is then applied to the test set to obtain an estimate of the accuracy of the bootstrap sample, $\epsilon_i$. The sampling procedure is then repeated $b$ times to generate $b$ bootstrap samples.

There are several variations to the bootstrap sampling approach in terms of how the overall accuracy of the classifier is computed. One of the more widely used approaches is the .632 bootstrap, which computes the overall accuracy by combining the accuracies of each bootstrap sample ($\epsilon_i$) with the accuracy computed from a training set that contains all the labeled examples in the original data ($acc_s$):

$$acc_{boot} = \frac{1}{b} \sum_{i=1}^{b} (0.632 \times \epsilon_i + 0.368 \times acc_s). \qquad (4.11)$$
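To make the procedure concrete, here is a minimal Python sketch of the .632 bootstrap of Equation 4.11. It is our own illustration, not code from the text: the function name, the `make_model` factory, and the use of NumPy arrays with a scikit-learn classifier are all assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

def dot632_bootstrap(X, y, make_model, b=200, seed=0):
    """Estimate classifier accuracy with the .632 bootstrap (Equation 4.11).

    X, y are NumPy arrays; make_model is a factory returning a fresh,
    unfitted classifier; b is the number of bootstrap samples.
    """
    rng = np.random.default_rng(seed)
    n = len(y)

    # acc_s: accuracy of a model built from all labeled examples.
    acc_s = accuracy_score(y, make_model().fit(X, y).predict(X))

    terms = []
    for _ in range(b):
        # Draw n records with replacement; records never drawn form the
        # test set (on average about 36.8% of the data).
        idx = rng.integers(0, n, size=n)
        test = np.setdiff1d(np.arange(n), idx)
        if len(test) == 0:
            continue  # degenerate resample with no test records; skip
        model = make_model().fit(X[idx], y[idx])
        eps_i = accuracy_score(y[test], model.predict(X[test]))
        # Equation 4.11: weight the bootstrap accuracy eps_i with acc_s.
        terms.append(0.632 * eps_i + 0.368 * acc_s)

    return np.mean(terms)

# Hypothetical usage:
# acc = dot632_bootstrap(X, y, lambda: DecisionTreeClassifier(max_depth=3))
```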
4.6 Methods for Comparing Classifiers

It is often useful to compare the performance of different classifiers to determine which classifier works better on a given data set. However, depending on the size of the data, the observed difference in accuracy between two classifiers may not be statistically significant. This section examines some of the statistical tests available to compare the performance of different models and classifiers.

For illustrative purposes, consider a pair of classification models, $M_A$ and $M_B$. Suppose $M_A$ achieves 85% accuracy when evaluated on a test set containing 30 records, while $M_B$ achieves 75% accuracy on a different test set containing 5000 records. Based on this information, is $M_A$ a better model than $M_B$?

The preceding example raises two key questions regarding the statistical significance of the performance metrics:

1. Although $M_A$ has a higher accuracy than $M_B$, it was tested on a smaller test set. How much confidence can we place on the accuracy for $M_A$?

2. Is it possible to explain the difference in accuracy as a result of variations in the composition of the test sets?

The first question relates to the issue of estimating the confidence interval of a given model accuracy. The second question relates to the issue of testing the statistical significance of the observed deviation. These issues are investigated in the remainder of this section.

4.6.1 Estimating a Confidence Interval for Accuracy

To determine the confidence interval, we need to establish the probability distribution that governs the accuracy measure. This section describes an approach for deriving the confidence interval by modeling the classification task as a binomial experiment. Following is a list of characteristics of a binomial experiment:

1. The experiment consists of N independent trials, where each trial has two possible outcomes: success or failure.

2. The probability of success, p, in each trial is constant.

An example of a binomial experiment is counting the number of heads that turn up when a coin is flipped N times. If X is the number of successes observed in N trials, then the probability that X takes a particular value is given by a binomial distribution with mean Np and variance Np(1 - p):

$$P(X = v) = \binom{N}{v} p^v (1 - p)^{N - v}.$$

For example, if the coin is fair (p = 0.5) and is flipped fifty times, then the probability that the head shows up 20 times is

$$P(X = 20) = \binom{50}{20} 0.5^{20} (1 - 0.5)^{30} = 0.0419.$$

If the experiment is repeated many times, then the average number of heads expected to show up is 50 × 0.5 = 25, while its variance is 50 × 0.5 × 0.5 = 12.5.

The task of predicting the class labels of test records can also be considered as a binomial experiment. Given a test set that contains N records, let X be the number of records correctly predicted by a model and p be the true accuracy of the model. By modeling the prediction task as a binomial experiment, X has a binomial distribution with mean Np and variance Np(1 - p). It can be shown that the empirical accuracy, $acc = X/N$, also has a binomial distribution with mean p and variance p(1 - p)/N (see Exercise 12). Although the binomial distribution can be used to estimate the confidence interval for acc, it is often approximated by a normal distribution when N is sufficiently large. Based on the normal distribution, the following confidence interval for acc can be derived:

$$P\left( -Z_{\alpha/2} \le \frac{acc - p}{\sqrt{p(1 - p)/N}} \le Z_{1-\alpha/2} \right) = 1 - \alpha, \qquad (4.12)$$

where $Z_{\alpha/2}$ and $Z_{1-\alpha/2}$ are the upper and lower bounds obtained from a standard normal distribution at confidence level $(1 - \alpha)$. Since a standard normal distribution is symmetric around Z = 0, it follows that $Z_{\alpha/2} = Z_{1-\alpha/2}$. Rearranging this inequality leads to the following confidence interval for p:

$$\frac{2 \times N \times acc + Z_{\alpha/2}^2 \pm Z_{\alpha/2} \sqrt{Z_{\alpha/2}^2 + 4 N acc - 4 N acc^2}}{2(N + Z_{\alpha/2}^2)}. \qquad (4.13)$$

The following table shows the values of $Z_{\alpha/2}$ at different confidence levels:

  1 - α:     0.99   0.98   0.95   0.9    0.8    0.7    0.5
  Z_{α/2}:   2.58   2.33   1.96   1.65   1.28   1.04   0.67

Example 4.4. Consider a model that has an accuracy of 80% when evaluated on 100 test records. What is the confidence interval for its true accuracy at a 95% confidence level? The confidence level of 95% corresponds to $Z_{\alpha/2} = 1.96$ according to the table given above. Inserting this term into Equation 4.13 yields a confidence interval between 71.1% and 86.7%. The following table shows the confidence interval when the number of records, N, increases:

  N:                     20             50             100            500            1000           5000
  Confidence interval:   0.584–0.919    0.670–0.888    0.711–0.867    0.763–0.833    0.774–0.824    0.789–0.811

Note that the confidence interval becomes tighter when N increases.
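Equation 4.13 is straightforward to evaluate directly. The helper below is our own sketch (the function name is ours); it reproduces the numbers of Example 4.4.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Confidence interval for the true accuracy p (Equation 4.13),
    using the normal approximation to the binomial distribution."""
    center = 2 * n * acc + z * z
    half = z * math.sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (center - half) / denom, (center + half) / denom

# Example 4.4: acc = 0.8 on N = 100 records at 95% confidence (z = 1.96).
print(accuracy_interval(0.8, 100))  # approximately (0.711, 0.867)
```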
4.6.2 Comparing the Performance of Two Models

Consider a pair of models, $M_1$ and $M_2$, that are evaluated on two independent test sets, $D_1$ and $D_2$. Let $n_1$ denote the number of records in $D_1$ and $n_2$ denote the number of records in $D_2$. In addition, suppose the error rate for $M_1$ on $D_1$ is $e_1$ and the error rate for $M_2$ on $D_2$ is $e_2$. Our goal is to test whether the observed difference between $e_1$ and $e_2$ is statistically significant.

Assuming that $n_1$ and $n_2$ are sufficiently large, the error rates $e_1$ and $e_2$ can be approximated using normal distributions. If the observed difference in the error rate is denoted as $d = e_1 - e_2$, then d is also normally distributed with mean $d_t$, its true difference, and variance $\sigma_d^2$. The variance of d can be computed as follows:

$$\hat{\sigma}_d^2 = \frac{e_1(1 - e_1)}{n_1} + \frac{e_2(1 - e_2)}{n_2}, \qquad (4.14)$$

where $e_1(1 - e_1)/n_1$ and $e_2(1 - e_2)/n_2$ are the variances of the error rates. Finally, at the $(1 - \alpha)\%$ confidence level, it can be shown that the confidence interval for the true difference $d_t$ is given by the following equation:

$$d_t = d \pm z_{\alpha/2} \hat{\sigma}_d. \qquad (4.15)$$

Example 4.5. Consider the problem described at the beginning of this section. Model $M_A$ has an error rate of $e_1 = 0.15$ when applied to $n_1 = 30$ test records, while model $M_B$ has an error rate of $e_2 = 0.25$ when applied to $n_2 = 5000$ test records. The observed difference in their error rates is $d = |0.15 - 0.25| = 0.1$. In this example, we are performing a two-sided test to check whether $d_t = 0$ or $d_t \ne 0$. The estimated variance of the observed difference in error rates can be computed as follows:

$$\hat{\sigma}_d^2 = \frac{0.15(1 - 0.15)}{30} + \frac{0.25(1 - 0.25)}{5000} = 0.0043,$$

or $\hat{\sigma}_d = 0.0655$. Inserting this value into Equation 4.15, we obtain the following confidence interval for $d_t$ at the 95% confidence level:

$$d_t = 0.1 \pm 1.96 \times 0.0655 = 0.1 \pm 0.128.$$

As the interval spans the value zero, we can conclude that the observed difference is not statistically significant at a 95% confidence level.

At what confidence level can we reject the hypothesis that $d_t = 0$? To do this, we need to determine the value of $Z_{\alpha/2}$ such that the confidence interval for $d_t$ does not span the value zero. We can reverse the preceding computation and look for the value $Z_{\alpha/2}$ such that $d > Z_{\alpha/2} \hat{\sigma}_d$. Replacing the values of d and $\hat{\sigma}_d$ gives $Z_{\alpha/2} < 1.527$. This value first occurs when $(1 - \alpha) \le 0.936$ (for a two-sided test). The result suggests that the null hypothesis can be rejected at a confidence level of 93.6% or lower.
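The test in Equations 4.14 and 4.15 amounts to a few lines of code. The helper below is our own illustration; it reproduces Example 4.5.

```python
import math

def compare_models(e1, n1, e2, n2, z=1.96):
    """Confidence interval for the true difference d_t between the error
    rates of two models tested on independent test sets (Eqs. 4.14-4.15)."""
    var_d = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2   # Equation 4.14
    d = abs(e1 - e2)
    half = z * math.sqrt(var_d)                        # Equation 4.15
    return d - half, d + half

# Example 4.5: M_A (e1 = 0.15, n1 = 30) vs. M_B (e2 = 0.25, n2 = 5000).
low, high = compare_models(0.15, 30, 0.25, 5000)
print(low, high)  # approximately (-0.028, 0.228): the interval spans zero,
                  # so the difference is not significant at the 95% level
```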
4.6.3 Comparing the Performance of Two Classifiers

Suppose we want to compare the performance of two classifiers using the k-fold cross-validation approach. Initially, the data set D is divided into k equal-sized partitions. We then apply each classifier to construct a model from k - 1 of the partitions and test it on the remaining partition. This step is repeated k times, each time using a different partition as the test set.

Let $M_{ij}$ denote the model induced by classification technique $L_i$ during the $j$th iteration. Note that each pair of models $M_{1j}$ and $M_{2j}$ are tested on the same partition j. Let $e_{1j}$ and $e_{2j}$ be their respective error rates. The difference between their error rates during the $j$th fold can be written as $d_j = e_{1j} - e_{2j}$. If k is sufficiently large, then $d_j$ is normally distributed with mean $d_t^{cv}$, which is the true difference in their error rates, and variance $\sigma_{d^{cv}}^2$. Unlike the previous approach, the overall variance in the observed differences is estimated using the following formula:

$$\hat{\sigma}_{d^{cv}}^2 = \frac{\sum_{j=1}^{k} (d_j - \bar{d})^2}{k(k - 1)}, \qquad (4.16)$$

where $\bar{d}$ is the average difference. For this approach, we need to use a t-distribution to compute the confidence interval for $d_t^{cv}$:

$$d_t^{cv} = \bar{d} \pm t_{(1-\alpha),k-1} \hat{\sigma}_{d^{cv}}.$$

The coefficient $t_{(1-\alpha),k-1}$ is obtained from a probability table with two input parameters, its confidence level $(1 - \alpha)$ and the number of degrees of freedom, k - 1. The probability table for the t-distribution is shown in Table 4.6.

Table 4.6. Probability table for t-distribution.

            (1 - α)
  k-1    0.8    0.9    0.95   0.98   0.99
   1     3.08   6.31   12.7   31.8   63.7
   2     1.89   2.92   4.30   6.96   9.92
   4     1.53   2.13   2.78   3.75   4.60
   9     1.38   1.83   2.26   2.82   3.25
  14     1.34   1.76   2.14   2.62   2.98
  19     1.33   1.73   2.09   2.54   2.86
  24     1.32   1.71   2.06   2.49   2.80
  29     1.31   1.70   2.04   2.46   2.76

Example 4.6. Suppose the estimated difference in the accuracy of models generated by two classification techniques has a mean equal to 0.05 and a standard deviation equal to 0.002. If the accuracy is estimated using a 30-fold cross-validation approach, then at a 95% confidence level, the true accuracy difference is

$$d_t^{cv} = 0.05 \pm 2.04 \times 0.002. \qquad (4.17)$$

Since the confidence interval does not span the value zero, the observed difference between the techniques is statistically significant.
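The paired, fold-by-fold comparison can be coded directly from Equation 4.16 and the t-based interval. The function below is our own sketch (names are ours): the caller supplies the k per-fold differences and the coefficient $t_{(1-\alpha),k-1}$ looked up from Table 4.6.

```python
import math

def paired_cv_interval(d, t_coeff):
    """Confidence interval for the true accuracy difference between two
    classifiers compared with k-fold cross-validation. `d` holds the k
    per-fold differences d_j = e_1j - e_2j; `t_coeff` is t_{(1-alpha),k-1}
    from Table 4.6."""
    k = len(d)
    d_bar = sum(d) / k
    # Equation 4.16: variance of the observed per-fold differences.
    var = sum((dj - d_bar) ** 2 for dj in d) / (k * (k - 1))
    half = t_coeff * math.sqrt(var)
    return d_bar - half, d_bar + half

# In the spirit of Example 4.6: with k = 30 folds, t_{0.95,29} = 2.04, a mean
# difference of 0.05, and sigma = 0.002, the interval is 0.05 +/- 2.04 * 0.002.
```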
4.7 Bibliographic Notes

Early classification systems were developed to organize a large collection of objects. For example, the Dewey Decimal and Library of Congress classification systems were designed to catalog and index the vast number of library books. The categories are typically identified in a manual fashion, with the help of domain experts.

Automated classification has been a subject of intensive research for many years. The study of classification in classical statistics is sometimes known as discriminant analysis, where the objective is to predict the group membership of an object based on a set of predictor variables. A well-known classical method is Fisher's linear discriminant analysis [117], which seeks to find a linear projection of the data that produces the greatest discrimination between objects that belong to different classes.

Many pattern recognition problems also require the discrimination of objects from different classes. Examples include speech recognition, handwritten character identification, and image classification. Readers who are interested in the application of classification techniques for pattern recognition can refer to the survey articles by Jain et al. [122] and Kulkarni et al. [128] or classic pattern recognition books by Bishop [107], Duda et al. [114], and Fukunaga [118]. The subject of classification is also a major research topic in the fields of neural networks, statistical learning, and machine learning. An in-depth treat-