
Combining Classifiers

Outline
- Types of Classifier Outputs
- Fusion of Label Outputs
- Fusion of Continuous-Valued Outputs
- Classifier Selection
- Bagging
- Boosting

Types of Classifier Outputs


- Type 0: The Oracle level (only whether the classification is correct or not)
- Type 1: The Abstract level (a single class label)
- Type 2: The Rank level (a ranking of the class labels)
- Type 3: The Measurement level (a degree of support for each class)

Fusion of Label Outputs


- Majority voting
- Weighted voting
- Naïve Bayes combination
- Behavior Knowledge Space (BKS) and Wernecke's method
- Dempster-Shafer theory, probabilistic approximation, classifier combination using singular value decomposition, etc.

Majority Voting
Unanimity, simple majority, and plurality. Assume that the label outputs of the classifiers are given as c-dimensional binary vectors $[d_{i,1}, \ldots, d_{i,c}]^T \in \{0, 1\}^c$, $i = 1, \ldots, L$, where $d_{i,j} = 1$ if $D_i$ labels $x$ in class $\omega_j$, and 0 otherwise.

The plurality vote (also known as majority vote) will result in an ensemble decision for class $\omega_k$ if

$$\sum_{i=1}^{L} d_{i,k} = \max_{j=1}^{c} \sum_{i=1}^{L} d_{i,j}.$$
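As a minimal illustration (not from the slides), the plurality rule can be written directly from the binary vote matrix; the array values below are hypothetical:

```python
# Plurality vote over L one-hot label outputs: votes[i, j] = d_{i,j}.
import numpy as np

def plurality_vote(votes: np.ndarray) -> int:
    """Return the class index k that collects the most votes."""
    per_class = votes.sum(axis=0)      # sum_i d_{i,j} for each class j
    return int(per_class.argmax())     # ties resolved by the lowest index

# Three classifiers, three classes: two vote for class 1, one for class 2.
votes = np.array([[0, 1, 0],
                  [0, 1, 0],
                  [0, 0, 1]])
print(plurality_vote(votes))           # -> 1
```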

Results about Majority Voting


Assuming that:
- the number of classifiers, L, is odd;
- the probability of each classifier giving the correct class label is p for any $x \in \mathbb{R}^n$;
- the classifier outputs are independent;

the accuracy of the ensemble is

$$P_{maj} = \sum_{m=\lfloor L/2 \rfloor + 1}^{L} \binom{L}{m} p^m (1-p)^{L-m}.$$
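The formula is easy to evaluate numerically; a sketch using only the standard library, with illustrative values of L and p:

```python
# P_maj for L independent classifiers of equal accuracy p (L odd).
from math import comb

def p_majority(L: int, p: float) -> float:
    return sum(comb(L, m) * p**m * (1 - p)**(L - m)
               for m in range(L // 2 + 1, L + 1))

print(p_majority(3, 0.7))    # 0.784: three voters already beat a single 0.7
print(p_majority(21, 0.7))   # ~0.97: P_maj keeps growing with L when p > 0.5
```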

Results about Majority Voting


- If p > 0.5, then $P_{maj}$ is monotonically increasing in L and $P_{maj} \to 1$ as $L \to \infty$.
- If p < 0.5, then $P_{maj}$ is monotonically decreasing in L and $P_{maj} \to 0$ as $L \to \infty$.
- If p = 0.5, then $P_{maj} = 0.5$ for any L.

Warning: Patterns of Success and Failure

Weighted Voting
If the classifiers in the ensemble are not of identical accuracy, it is reasonable to give the more competent classifiers more power in making the final decision. The label outputs can be represented as degrees of support for the classes: $d_{i,j} = 1$ if $D_i$ labels $x$ in class $\omega_j$, and 0 otherwise.

The discriminant function for class $\omega_j$ obtained through weighted voting is

$$g_j(x) = \sum_{i=1}^{L} b_i \, d_{i,j},$$

where $b_i$ is a coefficient (weight) for classifier $D_i$.

Selecting weights
One way to select the weights for the classifiers:

Consider an ensemble of L independent classifiers $D_1, \ldots, D_L$ with individual accuracies $p_1, \ldots, p_L$, whose outputs are combined by the weighted majority vote. The accuracy of the ensemble is maximized by assigning weights

$$b_i \propto \log \frac{p_i}{1 - p_i}.$$
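A short sketch of this weighting rule with made-up accuracies; the log-odds weights let one strong classifier outvote two weaker ones:

```python
# Weighted majority vote with b_i = log(p_i / (1 - p_i)).
import numpy as np

accuracies = np.array([0.9, 0.7, 0.6])           # estimated p_i (illustrative)
weights = np.log(accuracies / (1 - accuracies))  # b_i, the log-odds

votes = np.array([[1, 0],    # the 0.9 classifier says class 0
                  [0, 1],    # the two weaker ones say class 1
                  [0, 1]])

support = weights @ votes    # g_j(x) = sum_i b_i d_{i,j}
print(support.argmax())      # -> 0: the strong classifier wins the vote
```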

Naïve Bayes Combination


It assumes that the classifiers are mutually independent given a class label (conditional independence).

So the support for class $\omega_k$ can be calculated as

$$\mu_k(x) \propto P(\omega_k) \prod_{i=1}^{L} P(s_i \mid \omega_k),$$

where $s_i$ is the label assigned to $x$ by classifier $D_i$.

Practical implementation
For each classifier $D_i$, a $c \times c$ confusion matrix $CM^i$ is calculated by applying $D_i$ to the training data set Z. The (k, s)th entry of this matrix, $cm^i_{k,s}$, is the number of elements of the data set whose true class label was $\omega_k$ and which were assigned by $D_i$ to class $\omega_s$. By $N_k$ we denote the total number of elements of Z from class $\omega_k$. Taking $cm^i_{k,s_i}/N_k$ as an estimate of the probability $P(s_i \mid \omega_k)$, and $N_k/N$ as an estimate of the prior probability for class $\omega_k$, the support for class $\omega_k$ is equivalent to

$$\mu_k(x) \propto \frac{N_k}{N} \prod_{i=1}^{L} \frac{cm^i_{k,s_i}}{N_k}.$$
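A sketch of this estimate with a hypothetical pair of confusion matrices; `nb_support` returns the normalized supports $\mu_k(x)$ for a given label vector:

```python
# Naive Bayes combination from confusion matrices CM[i] (rows: true class k,
# columns: assigned class s), for L = 2 classifiers and c = 2 classes.
import numpy as np

CM = np.array([[[40, 10],
                [ 5, 45]],
               [[35, 15],
                [10, 40]]], dtype=float)
N_k = CM[0].sum(axis=1)            # N_k: training objects per true class
N = N_k.sum()

def nb_support(s):
    """mu_k(x) proportional to (N_k / N) * prod_i cm^i[k, s_i] / N_k."""
    mu = N_k / N                   # prior estimates
    for i, s_i in enumerate(s):
        mu = mu * CM[i][:, s_i] / N_k
    return mu / mu.sum()           # normalize to sum to 1

print(nb_support([0, 1]))          # supports for classes 0 and 1
```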

BKS & Wernecke's Method


Behavior Knowledge Space (BKS) is a fancy name for the multinomial combination (no independence assumption).

The vector of labels $s = [s_1, \ldots, s_L]$ gives an index to a cell in a look-up table (the BKS table). The table is designed using a labeled data set Z: each $z_j \in Z$ is placed in the cell indexed by the s for that object, the number of elements of each class in the cell is tallied, and the most representative class label is selected for the cell. The highest score corresponds to the highest estimated posterior probability. Ties are resolved arbitrarily, and empty cells are labeled in some appropriate way, e.g., by choosing a label at random or by taking a majority vote between the elements of s. A minimal sketch of such a table is given below.
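The sketch assumes hypothetical arrays `S` (the label each of the L classifiers gives each training object) and `y` (the true labels):

```python
# BKS table: each distinct label vector s = (s_1, ..., s_L) indexes a cell;
# the cell is labeled with the most frequent true class among its objects.
from collections import Counter, defaultdict
import numpy as np

def build_bks_table(S: np.ndarray, y: np.ndarray) -> dict:
    cells = defaultdict(Counter)
    for s, true_label in zip(map(tuple, S), y):
        cells[s][true_label] += 1
    return {s: c.most_common(1)[0][0] for s, c in cells.items()}

S = np.array([[0, 0], [0, 0], [0, 1], [1, 1]])   # labels from L = 2 classifiers
y = np.array([0, 0, 1, 1])                       # true labels
table = build_bks_table(S, y)
print(table[(0, 1)])   # -> 1; unseen (empty) cells still need a fallback rule
```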

Wernecke's combination method is similar to BKS and aims at reducing overtraining.

It also uses a look-up table with labels. The difference is that, in constructing the table, Wernecke considers the 95 percent confidence intervals of the class frequencies in each cell. If the intervals overlap, the prevailing class is not considered dominant enough to label the cell; instead, the least wrong classifier among the L members of the team is identified and authorized to label the cell.

An Example
Results from a 10-fold cross-validation with the Pima Indian Diabetes database from UCI. Each individual classifier is an MLP with one hidden layer of 25 nodes. [Figure omitted: 'o' marks the training results, the other marker the testing results.]

Fusion of Continuous-Valued Outputs


- Class-Conscious Combiners
  - Nontrainable combiners
  - Trainable combiners
- Class-Indifferent Combiners
  - Decision templates
  - Dempster-Shafer combination

Decision Profile

The decision profile DP(x) is the $L \times c$ matrix whose (i, j)th entry $d_{i,j}(x)$ is the degree of support that classifier $D_i$ gives to the hypothesis that $x$ comes from class $\omega_j$.

Nontrainable (Class-Conscious) Combiners


The average and the product are the two most intensively studied combiners, yet there is no guideline as to which one is better for a specific problem.

The current understanding is that the average, in general, might be less accurate than the product for some problems, but is the more stable of the two. The sketch below illustrates the difference.
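A two-line comparison on a made-up decision profile shows the difference in temperament: one near-zero support vetoes a class under the product rule but is absorbed by the average:

```python
# Average vs. product rule on a decision profile DP(x) (L = 3, c = 2).
import numpy as np

DP = np.array([[0.9, 0.1],
               [0.9, 0.1],
               [0.0, 1.0]])       # the third classifier vetoes class 0

print(DP.mean(axis=0).argmax())   # average: [0.6, 0.4] -> class 0
print(DP.prod(axis=0).argmax())   # product: [0.0, 0.01] -> class 1
```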

Trainable Combiners
The Weighted Average
Three groups can be distinguished based on the number of weights:

- L weights: one weight per classifier. The weight for classifier $D_i$ is usually based on its estimated error rate.
- c × L weights: weights are specific for each class.
- c × c × L weights: the support for each class is obtained by a linear combination of all elements of the decision profile DP(x).

Fuzzy Integrals

The fuzzy integral combines the sorted supports for a class with respect to a fuzzy measure expressing the competence of every subset of classifiers.

Decision Templates

The decision template $DT_j$ for class $\omega_j$ is the average of the decision profiles of the training objects whose true label is $\omega_j$; a new object x is assigned to the class whose template is closest (most similar) to DP(x).

Classifier Selection

- Decision-Independent Estimates
- Decision-Dependent Estimates
- Selection or Fusion? Kuncheva's proposal

Some Selection Criteria


- Decision-Independent Estimates (e.g., the direct k-nn estimate). One way to estimate the competence is to identify the K nearest neighbors of x from either the training set or a validation set and find out how accurate the classifiers are on these K objects. K is a parameter of the algorithm which needs to be tuned prior to the operational phase.
- Decision-Dependent Estimates (e.g., the direct k-nn estimate). Let $s_i$ be the class label assigned to x by classifier $D_i$. Denote by $N_x(s_i)$ the set of K nearest neighbors of x from Z which classifier $D_i$ labeled as $s_i$. The competence of classifier $D_i$ for the given x is calculated as the proportion of elements of $N_x(s_i)$ whose true class label was $s_i$. This estimate is called the local class accuracy; see the sketch after this list.
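A sketch of the local class accuracy under the assumptions above; `X_ref`, `y_ref`, and `pred_ref` (classifier $D_i$'s labels on Z) are hypothetical names:

```python
# Decision-dependent competence: proportion of the K nearest neighbors of x
# among the objects D_i labeled s_i whose true label really was s_i.
import numpy as np

def local_class_accuracy(x, X_ref, y_ref, pred_ref, s_i, K=5):
    mask = pred_ref == s_i                     # objects D_i labeled as s_i
    X_s, y_s = X_ref[mask], y_ref[mask]
    nearest = np.argsort(np.linalg.norm(X_s - x, axis=1))[:K]
    return float(np.mean(y_s[nearest] == s_i))
```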

Kuncheva's approach

Selection is guaranteed by design to give at least the same training accuracy as the best individual classifier D*. However, the model might overtrain. To guard against overtraining we may use a statistical significance test (confidence intervals) and nominate a classifier only when it is significantly better than the others: if $D_{i(j)}$ is significantly better than the second best in the region $R_j$, then $D_{i(j)}$ can be nominated as the classifier responsible for $R_j$; otherwise, a scheme involving more than one classifier might pay off.

Bootstrap AGGregatING
Training phase
1. Initialize the parameters: $\mathcal{D} = \emptyset$, the ensemble; L, the number of classifiers to train.
2. For k = 1, . . . , L:
   - Take a bootstrap sample $S_k$ from Z.
   - Build a classifier $D_k$ using $S_k$ as the training set.
   - Add the classifier to the current ensemble, $\mathcal{D} = \mathcal{D} \cup \{D_k\}$.
3. Return $\mathcal{D}$.

Classification phase
4. Run $D_1, \ldots, D_L$ on the input x.
5. The class with the maximum number of votes is chosen as the label for x. A compact sketch follows.
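The sketch assumes scikit-learn decision trees as the base classifier (any learner would do) and integer class labels:

```python
# Bagging: L classifiers trained on bootstrap samples, combined by plurality.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, L=11, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    ensemble = []
    for _ in range(L):
        idx = rng.integers(0, N, size=N)   # bootstrap sample S_k (with replacement)
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def bagging_predict(ensemble, X):
    votes = np.stack([D.predict(X) for D in ensemble])          # L x n labels
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```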

Why Does Bagging Work?


If classifier outputs were independent and all classifiers had the same individual accuracy p > 0.5, then the majority vote would be guaranteed to improve on the individual performance. Bagging aims at developing independent classifiers by taking bootstrap replicates as the training sets. The samples are only pseudo-independent because they are taken from the same Z. Warning: even if they were drawn independently from the distribution of the problem, the classifiers built on these training sets might not give independent outputs.

Example: Independent Samples, Bootstrap Samples, and Bagging

[Figure] Bagging for the rotated check-board data using bootstrap samples and independent samples: (a) averaged pairwise correlation versus the ensemble size; (b) testing error rate versus the ensemble size.

Boosting (AdaBoost.M1)
Training phase (1/2)
1. Initialize the parameters: set the weights $w^1_j = 1/N$, $j = 1, \ldots, N$; initialize the ensemble $\mathcal{D} = \emptyset$; pick L, the number of classifiers to train.
2. For k = 1, . . . , L:
   - Take a sample $S_k$ from Z using distribution $w^k$.
   - Build a classifier $D_k$ using $S_k$ as the training set.
   - Calculate the weighted ensemble error at step k by
   $$\epsilon_k = \sum_{j=1}^{N} w^k_j \, l^k_j,$$
   where $l^k_j = 1$ if $D_k$ misclassifies $z_j$ and $l^k_j = 0$ otherwise.

Boosting (AdaBoost.M1)
Training phase (2/2)
   - If $\epsilon_k = 0$ or $\epsilon_k \geq 0.5$, ignore $D_k$, reinitialize the weights $w^k_j$ to 1/N, and continue.
   - Else, calculate
   $$\beta_k = \frac{\epsilon_k}{1 - \epsilon_k},$$
   and update the individual weights
   $$w^{k+1}_j = \frac{w^k_j \, \beta_k^{1 - l^k_j}}{\sum_{i=1}^{N} w^k_i \, \beta_k^{1 - l^k_i}}, \quad j = 1, \ldots, N.$$
3. Return $\mathcal{D}$ and $\beta_1, \ldots, \beta_L$.

Boosting (AdaBoost.M1)
Classification phase
4. Calculate the support for class $\omega_t$ by
$$\mu_t(x) = \sum_{k:\, D_k(x) = \omega_t} \ln\!\left(\frac{1}{\beta_k}\right).$$
5. The class with the maximum support is chosen as the label for x. A sketch of the full procedure follows.
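The sketch follows the steps above, assuming scikit-learn decision stumps as the base classifier and integer labels 0..c-1:

```python
# AdaBoost.M1 as outlined above: resample by w, check eps_k, update by beta_k.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1_fit(X, y, L=20, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    w = np.full(N, 1.0 / N)                    # w_j^1 = 1/N
    ensemble, betas = [], []
    for _ in range(L):
        idx = rng.choice(N, size=N, p=w)       # sample S_k using distribution w^k
        D_k = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        miss = D_k.predict(X) != y             # l_j^k
        eps = float(np.sum(w * miss))          # weighted ensemble error
        if eps == 0 or eps >= 0.5:             # ignore D_k and restart the weights
            w = np.full(N, 1.0 / N)
            continue
        beta = eps / (1 - eps)
        w = w * beta ** (1 - miss)             # shrink weights of correct objects
        w /= w.sum()                           # renormalize
        ensemble.append(D_k)
        betas.append(beta)
    return ensemble, betas

def adaboost_m1_predict(ensemble, betas, X, n_classes):
    support = np.zeros((len(X), n_classes))
    for D_k, beta in zip(ensemble, betas):
        support[np.arange(len(X)), D_k.predict(X)] += np.log(1 / beta)
    return support.argmax(axis=1)              # class with maximum support
```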

Why Does AdaBoost Work?


The Margin Theory (remember SVMs)

The margin of an object can be taken as the (normalized) support for its true class minus the largest support for any other class, so all objects that are misclassified have negative margins, and those correctly classified have positive margins. Boosting is claimed to push the margin distribution to the right.

[Figure omitted: margin distribution graphs for bagging and AdaBoost for the rotated check-board data.]
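As a small illustration of this definition (made-up supports):

```python
# Margin of one object: normalized support for the true class minus the
# largest support for any other class; negative iff the object is misclassified.
import numpy as np

def margin(support: np.ndarray, true_class: int) -> float:
    s = support / support.sum()
    rival = np.delete(s, true_class).max()
    return float(s[true_class] - rival)

print(margin(np.array([3.0, 1.0, 1.0]), 0))   # -> 0.4 (correctly classified)
print(margin(np.array([1.0, 3.0, 1.0]), 0))   # -> -0.4 (misclassified)
```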

Conclusions
In his invited lecture at the 3rd International Workshop on Multiple Classifier Systems, 2002, Ghosh proposes that "... our current understanding of ensemble-type multiclassifier systems is now quite mature ...". And yet, in an invited book chapter the same year, Ho states that "... many of the above questions are there because we do not yet have a scientific understanding of the classifier combination mechanisms." The area of combining classifiers is very dynamic and active at present, and is likely to grow and expand in the near future.
