Combining Classifiers: Outline
Outline
Types of Classifier Outputs
Fusion of Label Outputs
Fusion of Continuous-Valued Outputs
Classifier Selection
Bagging
Boosting
Majority Voting
Unanimity, simple majority, and plurality. Assume that the label outputs of the classifiers are given as c-dimensional binary vectors $[d_{i,1}, \dots, d_{i,c}]^T \in \{0,1\}^c$, $i = 1, \dots, L$, where $d_{i,j} = 1$ if $D_i$ labels $x$ in class $\omega_j$, and 0 otherwise.
The plurality vote (also known as majority vote) will result in an ensemble decision for class $\omega_k$ if
$$\sum_{i=1}^{L} d_{i,k} = \max_{j=1,\dots,c} \sum_{i=1}^{L} d_{i,j}.$$
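A minimal sketch of the plurality vote over the binary label-output vectors defined above; the array values and variable names are illustrative, not from the slides:

```python
import numpy as np

# Binary label outputs: d[i, j] = 1 if classifier D_i assigns x to class j.
# Shape (L, c): L = 5 classifiers, c = 3 classes (illustrative values).
d = np.array([
    [1, 0, 0],   # D_1 votes for class 0
    [0, 1, 0],   # D_2 votes for class 1
    [1, 0, 0],   # D_3 votes for class 0
    [1, 0, 0],   # D_4 votes for class 0
    [0, 0, 1],   # D_5 votes for class 2
])

votes = d.sum(axis=0)            # sum_i d_{i,j}, total votes per class
ensemble_label = votes.argmax()  # class with the plurality of votes
print(votes, ensemble_label)     # [3 1 1] 0
```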
Weighted Voting
If the classifiers in the ensemble are not of identical accuracy, it is reasonable to give the more competent classifiers more power in making the final decision. The label outputs are treated as degrees of support for the classes ($d_{i,j} \in \{0,1\}$ as above), and the discriminant function for class $\omega_j$ obtained through weighted voting is
$$g_j(x) = \sum_{i=1}^{L} w_i\, d_{i,j},$$
where $w_i$ is the weight for classifier $D_i$.
Selecting weights
One way to select the weights is based on the individual accuracies of the classifiers. Consider an ensemble of L independent classifiers $D_1, \dots, D_L$, with individual accuracies $p_1, \dots, p_L$, whose outputs are combined by the weighted majority vote. The accuracy of the ensemble is maximized by assigning weights
$$w_i \propto \log\frac{p_i}{1 - p_i}.$$
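A small sketch of the weighted majority vote with the log-odds weights above, assuming estimated accuracies are available for each classifier (all values illustrative):

```python
import numpy as np

def weighted_majority_vote(d, accuracies):
    """Weighted majority vote over binary label outputs.

    d          : (L, c) array, d[i, j] = 1 if classifier i votes for class j
    accuracies : length-L sequence of estimated individual accuracies p_i
    Weights follow the rule w_i = log(p_i / (1 - p_i)).
    """
    p = np.asarray(accuracies, dtype=float)
    w = np.log(p / (1.0 - p))        # log-odds weights
    support = w @ d                  # g_j(x) = sum_i w_i * d[i, j]
    return support.argmax()

# Illustrative example: three classifiers, two classes.
d = np.array([[1, 0],    # the accurate classifier votes class 0
              [0, 1],    # two weaker classifiers vote class 1
              [0, 1]])
print(weighted_majority_vote(d, [0.9, 0.6, 0.6]))  # class 0 wins
```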
Practical implementation
For each classifier $D_i$, a $c \times c$ confusion matrix $CM^i$ is calculated by applying $D_i$ to the training data set. The $(k,s)$th entry of this matrix, $cm^i_{k,s}$, is the number of elements of the data set whose true class label was $\omega_k$ and which were assigned by $D_i$ to class $\omega_s$. By $N_s$ we denote the total number of elements of Z from class $\omega_s$. Taking $cm^i_{k,s_i}/N_k$ as an estimate of the probability $P(s_i \mid \omega_k)$, and $N_k/N$ as an estimate of the prior probability for class $\omega_k$, the support for class $\omega_k$ is equivalent to
$$\mu_k(x) \propto \frac{N_k}{N} \prod_{i=1}^{L} \frac{cm^i_{k,s_i}}{N_k},$$
where $s_i$ is the class label assigned to x by classifier $D_i$.
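A small sketch of this confusion-matrix-based combination; the function name, the example matrices, and the counts are illustrative, not from the slides:

```python
import numpy as np

def confusion_matrix_combine(confusion_matrices, labels_for_x, class_counts):
    """Confusion-matrix-based combination sketch.

    confusion_matrices : list of (c, c) arrays, CM^i[k, s] = number of training
                         objects of true class k labelled s by classifier i
    labels_for_x       : length-L list, s_i = label assigned to x by D_i
    class_counts       : length-c array, N_k = number of training objects of class k
    Returns the unnormalized supports mu_k(x).
    """
    N = class_counts.sum()
    support = class_counts / N                            # priors N_k / N
    for cm, s_i in zip(confusion_matrices, labels_for_x):
        support = support * (cm[:, s_i] / class_counts)   # estimates of P(s_i | k)
    return support

# Illustrative two-class, two-classifier example.
cm1 = np.array([[40, 10], [5, 45]])
cm2 = np.array([[35, 15], [10, 40]])
counts = np.array([50, 50])
mu = confusion_matrix_combine([cm1, cm2], labels_for_x=[0, 0], class_counts=counts)
print(mu, mu.argmax())
```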
An Example
Results from a 10-fold cross-validation with the Pima Indians Diabetes database from UCI. Each individual classifier is an MLP with one hidden layer of 25 nodes. In the accompanying figure, 'o' denotes the training results and the remaining marker the testing results.
Class-Indifferent Combiners
Decision Templates, Dempster-Shafer Combination
Decision Profile
The current understanding is that, while the average may be less accurate than the product for some problems, it is the more stable of the two combiners.
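As a small illustration of combining the rows of the decision profile, the sketch below (illustrative values, not from the slides) applies the average and product rules to the continuous-valued outputs of three classifiers:

```python
import numpy as np

# Decision profile DP(x): row i holds the soft outputs of classifier D_i
# for the c classes (values are illustrative).
DP = np.array([
    [0.6, 0.3, 0.1],   # D_1
    [0.5, 0.4, 0.1],   # D_2
    [0.2, 0.7, 0.1],   # D_3
])

avg_support = DP.mean(axis=0)    # average combiner
prod_support = DP.prod(axis=0)   # product combiner

print(avg_support.argmax(), prod_support.argmax())
```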
Trainable Combiners
The Weighted Average
Three groups can be distinguished based on the number of weights:
L weights: one weight per classifier. The weight for classifier $D_i$ is usually based on its estimated error rate.
c × L weights: the weights are specific for each class.
c × c × L weights: the support for each class is obtained as a linear combination of all elements of the decision profile DP(x), as sketched below.
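A sketch of the linear combiner with c × c × L weights; the weight tensor here is random for illustration, whereas in practice it would be trained (for example by linear regression on a validation set):

```python
import numpy as np

L_classifiers, c = 3, 2

# Decision profile DP(x): (L, c) soft outputs (illustrative values).
DP = np.array([[0.7, 0.3],
               [0.4, 0.6],
               [0.8, 0.2]])

# c x c x L weights: W[i, k, j] weighs element dp_{i,k}(x) when
# computing the support for class j.
rng = np.random.default_rng(0)
W = rng.random((L_classifiers, c, c))

# mu_j(x) = sum_i sum_k W[i, k, j] * DP[i, k]
support = np.einsum('ik,ikj->j', DP, W)
print(support, support.argmax())
```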
Fuzzy Integrals
Decision Templates
Classifier selection
Decision-Independent Estimates
Decision-Dependent Estimates
Selection or Fusion?
Kuncheva's proposal
Kuncheva's approach
Selection is guaranteed by design to give at least the same training accuracy as the best individual classifier D*. However, the model might overtrain. To guard against overtraining, we may use a statistical significance test (confidence intervals) and nominate a classifier only when it is significantly better than the others. If $D_{i(j)}$ is significantly better than the second best in the region $R_j$, then $D_{i(j)}$ can be nominated as the classifier responsible for $R_j$. Otherwise, a scheme involving more than one classifier might pay off, as in the sketch below.
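A sketch of this nomination rule for a single region, assuming per-region validation results are available and using an approximate two-proportion z-interval as the significance check (the function name and the 1.96 threshold are illustrative choices, not from the slides):

```python
import numpy as np

def nominate_classifier(correct, z=1.96):
    """Nominate a classifier for one region R_j, or decline.

    correct : (L, n_j) boolean array; correct[i, m] says whether classifier
              D_i classified the m-th validation point of region R_j correctly.
    Returns the index of the best classifier if its accuracy is significantly
    higher than the runner-up's (approximate z-interval for the difference
    of two proportions), otherwise None, signalling that a fusion scheme
    over several classifiers may be preferable in R_j.
    """
    n = correct.shape[1]
    acc = correct.mean(axis=1)
    order = np.argsort(acc)[::-1]
    best, second = order[0], order[1]
    diff = acc[best] - acc[second]
    se = np.sqrt(acc[best] * (1 - acc[best]) / n
                 + acc[second] * (1 - acc[second]) / n)
    return best if diff - z * se > 0 else None
```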
Bootstrap AGGregatING
Training phase
1. Initialize the parameters: $D = \emptyset$, the ensemble; L, the number of classifiers to train.
2. For k = 1, . . . , L:
   Take a bootstrap sample $S_k$ from Z.
   Build a classifier $D_k$ using $S_k$ as the training set.
   Add the classifier to the current ensemble, $D = D \cup \{D_k\}$.
3. Return D.
Classification phase
4. Run $D_1, \dots, D_L$ on the input x.
5. The class with the maximum number of votes is chosen as the label for x.
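A minimal Python sketch of the bagging procedure above, assuming scikit-learn decision trees as the base classifiers and integer class labels (any base learner would do):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, L=25, random_state=0):
    """Train L classifiers on bootstrap samples of (X, y)."""
    rng = np.random.default_rng(random_state)
    ensemble = []
    N = len(X)
    for _ in range(L):
        idx = rng.integers(0, N, size=N)           # bootstrap sample S_k
        clf = DecisionTreeClassifier().fit(X[idx], y[idx])
        ensemble.append(clf)
    return ensemble

def bagging_predict(ensemble, X):
    """Plurality vote over the ensemble members (integer labels assumed)."""
    votes = np.stack([clf.predict(X) for clf in ensemble])   # shape (L, n)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```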
Bagging for the rotated check-board data using bootstrap samples and independent samples: (a) averaged pairwise correlation versus the ensemble size; (b) testing error rate versus the ensemble size.
Boosting (AdaBoost.M1)
Training phase (1/2)
1. Initialize the parameters: set the weights $w^1_j$, $j = 1, \dots, N$ (usually $w^1_j = 1/N$); initialize the ensemble $D = \emptyset$; pick L, the number of classifiers to train.
2. For k = 1, . . . , L:
   Take a sample $S_k$ from Z using distribution $w^k$.
   Build a classifier $D_k$ using $S_k$ as the training set.
   Calculate the weighted ensemble error at step k by
   $$\epsilon_k = \sum_{j=1}^{N} w^k_j\, l_k(z_j),$$
   where $l_k(z_j) = 1$ if $D_k$ misclassifies $z_j$ and $l_k(z_j) = 0$ otherwise.
Training phase (2/2)
   If $\epsilon_k = 0$ or $\epsilon_k \geq 0.5$, ignore $D_k$, reinitialize the weights $w^k_j$ to $1/N$, and continue.
   Else, calculate
   $$\beta_k = \frac{\epsilon_k}{1 - \epsilon_k},$$
   and update the individual weights
   $$w^{k+1}_j = \frac{w^k_j\, \beta_k^{\,1 - l_k(z_j)}}{\sum_{m=1}^{N} w^k_m\, \beta_k^{\,1 - l_k(z_m)}}, \quad j = 1, \dots, N.$$
Classification phase
4. Calculate the support for class $\omega_t$ by
$$\mu_t(x) = \sum_{D_k(x) = \omega_t} \ln\!\left(\frac{1}{\beta_k}\right).$$
5. The class with the maximum support is chosen as the label for x.
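A compact Python sketch of AdaBoost.M1 as described above, assuming scikit-learn decision stumps as the weak learners and integer class labels 0..c-1; it resamples the training set according to $w^k$, as in the pseudocode:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1_train(X, y, L=20, random_state=0):
    rng = np.random.default_rng(random_state)
    N = len(X)
    w = np.full(N, 1.0 / N)                      # initial weights w^1_j = 1/N
    ensemble, betas = [], []
    for _ in range(L):
        idx = rng.choice(N, size=N, p=w)         # sample S_k from distribution w^k
        clf = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        miss = (clf.predict(X) != y)             # l_k(z_j) = 1 on misclassified objects
        eps = np.sum(w[miss])                    # weighted ensemble error
        if eps == 0 or eps >= 0.5:               # ignore D_k, reset the weights
            w = np.full(N, 1.0 / N)
            continue
        beta = eps / (1.0 - eps)
        w = w * beta ** (1 - miss)               # shrink weights of correct objects
        w /= w.sum()                             # normalize
        ensemble.append(clf)
        betas.append(beta)
    return ensemble, betas

def adaboost_m1_predict(ensemble, betas, X, n_classes):
    support = np.zeros((len(X), n_classes))
    for clf, beta in zip(ensemble, betas):
        pred = clf.predict(X)                    # integer class labels assumed
        support[np.arange(len(X)), pred] += np.log(1.0 / beta)
    return support.argmax(axis=1)                # class with maximum support
```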
All objects that are misclassified will have negative margins, and those correctly classified will have positive margins.
Margin distribution graphs for bagging and AdaBoost for the rotated check-board data.
Conclusions
In his invited lecture at the 3rd International Workshop on Multiple Classifier Systems, 2002, Ghosh proposes that: ". . . our current understanding of ensemble-type multiclassifier systems is now quite mature . . ." And yet, in an invited book chapter the same year, Ho states that: ". . . many of the above questions are there because we do not yet have a scientific understanding of the classifier combination mechanisms." The area of combining classifiers is very dynamic and active at present, and is likely to grow and expand in the near future.