Lec 18
Step 1: Create multiple data sets D1, D2, ..., Dt-1, Dt
Step 2: Build multiple classifiers C1, C2, ..., Ct-1, Ct
Step 3: Combine the classifiers into C*
Figure taken from Tan et al. book “Introduction to Data Mining”
Ensemble Classifiers (EC)
• An ensemble classifier constructs a set of ‘base
classifiers’ from the training data
• Methods for constructing an EC
• Manipulating training set
• Manipulating input features
• Manipulating class labels
• Manipulating learning algorithms
Ensemble Classifiers (EC)
• Manipulating training set
• Multiple training sets are created by resampling the data
according to some sampling distribution
• Sampling distribution determines how likely it is that an
example will be selected for training – may vary from one trial
to another
• A classifier is built from each training set using a particular
learning algorithm
• Examples: Bagging & Boosting
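A minimal sketch of the resampling step, assuming the training data is held in NumPy arrays (the function name and signature are illustrative):

```python
import numpy as np

def resample(X, y, weights=None, rng=None):
    """Create one training set D_i by sampling with replacement.

    `weights` is the sampling distribution over examples: None (uniform)
    gives a bagging-style bootstrap sample, while boosting would pass in
    weights that change from one trial to the next.
    """
    rng = rng or np.random.default_rng()
    n = len(X)
    idx = rng.choice(n, size=n, replace=True, p=weights)
    return X[idx], y[idx]

# e.g. build t = 10 training sets, one per base classifier:
# datasets = [resample(X, y) for _ in range(10)]
```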
Ensemble Classifiers (EC)
• Manipulating input features
• Subset of input features chosen to form each training set
• Subset can be chosen randomly or based on inputs given by
Domain Experts
• Good for data that has redundant features
• Random Forest is an example, which uses decision trees (DTs) as its
base classifiers
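A minimal sketch of feature-subset selection, assuming X is a NumPy feature matrix (names are illustrative):

```python
import numpy as np

def random_feature_subset(X, n_features, rng=None):
    """Pick a random subset of input features (columns) for one base classifier.

    The chosen column indices are returned as well, so the same features
    can be used again when the classifier makes predictions.
    """
    rng = rng or np.random.default_rng()
    cols = rng.choice(X.shape[1], size=n_features, replace=False)
    return X[:, cols], cols
```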
Ensemble Classifiers (EC)
• Manipulating class labels
• Useful when the number of classes is sufficiently large
• Training data is transformed into a binary class problem by
randomly partitioning the class labels into 2 disjoint subsets,
A0 & A1
• Re-labelled examples are used to train a base classifier
• By repeating the relabelling and model-building steps
several times, an ensemble of base classifiers is obtained
• How is a new tuple classified?
• Example – error-correcting output coding (p. 307)
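A minimal sketch of the relabelling step, assuming integer class labels in a NumPy array; the random half/half partition is just one possible choice:

```python
import numpy as np

def relabel_binary(y, rng=None):
    """Randomly partition the original class labels into two disjoint subsets
    A0 and A1, then relabel every tuple as 0 (class in A0) or 1 (class in A1)."""
    rng = rng or np.random.default_rng()
    classes = rng.permutation(np.unique(y))
    split = max(1, len(classes) // 2)
    A0 = set(classes[:split])          # the remaining classes form A1
    return np.where(np.isin(y, list(A0)), 0, 1)
```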
Ensemble Classifiers (EC)
• Manipulating learning algorithm
• Learning algorithms can be manipulated in such a way that
applying the algorithm several times on the same training
data may result in different models
• Example – ANN can produce different models by changing
network topology or the initial weights of links between
neurons
• Example – ensemble of DTs can be constructed by
introducing randomness into the tree growing procedure –
instead of choosing the best split attribute at each node, we
randomly choose one of the top k attributes
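A minimal sketch of the randomized split choice, where `gain` is a hypothetical scoring function (e.g. information gain) supplied by the tree-growing code:

```python
import random

def choose_split_attribute(candidates, gain, k=3):
    """Rank the candidate split attributes by their score and pick one of the
    top k at random, instead of always taking the single best split."""
    ranked = sorted(candidates, key=gain, reverse=True)
    return random.choice(ranked[:k])
```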
Ensemble Classifiers (EC)
• The first three approaches are generic – they can be applied to any
classifier
• The fourth approach depends on the type of classifier used
• Base classifiers can be generated sequentially or in
parallel
Ensemble Classifiers
• Ensemble methods work better with ‘unstable
classifiers’
• Classifiers that are sensitive to minor perturbations in
the training set
• Examples:
– Decision trees
– Rule-based
– Artificial neural networks
Why does it work?
• Suppose there are 25 base classifiers
– Each classifier has error rate, e = 0.35
– Assume classifiers are independent
– Probability that the ensemble classifier makes a wrong
prediction (i.e., at least 13 of the 25 base classifiers are wrong):
$\sum_{i=13}^{25} \binom{25}{i} e^{i} (1-e)^{25-i} = 0.06$
– Check for yourself that this is correct!
Example taken from Tan et al. book “Introduction to Data Mining”
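A quick way to check the number above, using only the Python standard library:

```python
from math import comb

e = 0.35   # error rate of each of the 25 independent base classifiers
p_wrong = sum(comb(25, i) * e**i * (1 - e)**(25 - i) for i in range(13, 26))
print(round(p_wrong, 2))   # 0.06 -- far lower than the 0.35 of a single classifier
```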
Examples of Ensemble Methods
Example taken from Tan et al. book “Introduction to Data Mining”
Bagging
• Accuracy of bagging:
$Acc(M) = \sum_{i=1}^{k} \big(0.632 \cdot Acc(M_i)_{test\_set} + 0.368 \cdot Acc(M_i)_{train\_set}\big)$
• Works well for small data sets
• Example (actual class labels):
X: 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
y:   1   1   1  -1  -1  -1  -1   1   1   1
Example taken from Tan et al. book “Introduction to Data Mining”
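A small numeric sketch of the estimate above; the per-round accuracies are made up for illustration, and the sum is averaged over the k rounds here so that the result stays between 0 and 1 (an assumption on my part):

```python
# Hypothetical accuracies for k = 3 bagging rounds (illustrative numbers only)
acc_test = [0.70, 0.80, 0.70]    # Acc(M_i) measured on held-out samples
acc_train = [0.90, 1.00, 0.90]   # Acc(M_i) measured on the bootstrap training samples

k = len(acc_test)
acc_M = sum(0.632 * te + 0.368 * tr for te, tr in zip(acc_test, acc_train)) / k
print(round(acc_M, 3))   # ~0.807
```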
Bagging
• Decision Stump
• A single-level binary decision tree
• Entropy-based split: x <= 0.35 or x <= 0.75
• Accuracy at most 70% on the actual class labels
Example taken from Tan et al. book “Introduction to Data Mining”
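A minimal sketch of a decision stump on the example data set, confirming that no single split does better than 70%:

```python
import numpy as np

# Example data set from the slides
X = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

def fit_stump(X, y, weights=None):
    """One-level binary tree: predict `left` for x <= threshold, -left otherwise.
    The threshold and orientation are chosen to minimize (weighted) error."""
    if weights is None:
        weights = np.full(len(X), 1.0 / len(X))
    best = None
    for thr in np.unique(X):
        for left in (1, -1):
            pred = np.where(X <= thr, left, -left)
            err = np.sum(weights[pred != y])
            if best is None or err < best[0]:
                best = (err, thr, left)
    return best   # (weighted error, threshold, label on the <= side)

print(fit_stump(X, y))   # best stump still misclassifies 3 of 10 points (70% accuracy)
```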
Bagging
[Figure: the bagging rounds on the example data set – taken from Tan et al. book “Introduction to Data Mining”]
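A compact sketch of bagging with decision stumps on the same example data, combined by simple majority vote (round count and seed are arbitrary):

```python
import numpy as np

X = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

def fit_stump(Xs, ys):
    """Best one-level split (threshold, label for the <= side) on a bootstrap sample."""
    best = None
    for thr in np.unique(Xs):
        for left in (1, -1):
            err = np.mean(np.where(Xs <= thr, left, -left) != ys)
            if best is None or err < best[0]:
                best = (err, thr, left)
    return best[1], best[2]

rng = np.random.default_rng(0)
stumps = []
for _ in range(11):                                        # 11 bagging rounds
    idx = rng.choice(len(X), size=len(X), replace=True)    # bootstrap sample of D
    stumps.append(fit_stump(X[idx], y[idx]))

# Majority vote of the stumps on the original data
votes = sum(np.where(X <= thr, left, -left) for thr, left in stumps)
print(np.sign(votes))   # the bagged vote is usually far better than any single stump
```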
Boosting
• Equal weights are assigned to each training tuple
(1/d for round 1)
• After a classifier Mi is learned, the weights are
adjusted to allow the subsequent classifier Mi+1 to
“pay more attention” to tuples that were
misclassified by Mi.
• Final boosted classifier M* combines the votes of
each individual classifier
• Weight of each classifier’s vote is a function of its
accuracy
• AdaBoost – a popular boosting algorithm
AdaBoost
• Input:
– Training set D containing d tuples
– k rounds
– A classification learning scheme
• Output:
– A composite model
AdaBoost
• Data set D containing d class-labeled tuples (X1,y1),
(X2,y2), (X3,y3),….(Xd,yd)
• Initially assign equal weight 1/d to each tuple
• To generate k base classifiers, we need k rounds or
iterations
• In round i, tuples from D are sampled with
replacement to form Di (of size d)
• Each tuple’s chance of being selected depends on its
weight
AdaBoost
• Base classifier Mi, is derived from training tuples of
Di
• Error of Mi is tested using Di
• Weights of training tuples are adjusted depending
on how they were classified
– Correctly classified: Decrease weight
– Incorrectly classified: Increase weight
• Weight of a tuple indicates how hard it is to classify
it (directly proportional)
AdaBoost
• Some classifiers may be better at classifying some
“hard” tuples than others
• We finally have a series of classifiers that
complement each other!
• Error rate of model Mi:
$error(M_i) = \sum_{j=1}^{d} w_j \cdot err(X_j)$
where err(Xj) = 1 if Xj is misclassified and 0 otherwise
• If classifier error exceeds 0.5, we abandon it
• Try again with a new Di and a new Mi derived from it
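A minimal sketch of the weighted error and the abandon-and-retry rule, where `predict` stands for the trained base classifier's prediction function (a hypothetical name):

```python
import numpy as np

def weighted_error(predict, X, y, weights):
    """error(M_i) = sum_j w_j * err(X_j), with err(X_j) = 1 for a
    misclassified tuple and 0 otherwise."""
    err = (predict(X) != y).astype(float)
    return np.sum(weights * err)

# error_i = weighted_error(model_i.predict, X, y, weights)
# if error_i > 0.5:
#     ...   # discard M_i, draw a fresh sample D_i and train a new M_i
```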
AdaBoost
• error(Mi) affects how the weights of the training tuples are updated
• If a tuple is correctly classified in round i, its weight is multiplied by $\frac{error(M_i)}{1 - error(M_i)}$
• Adjust the weights of all correctly classified tuples
• The weights of all tuples (including the misclassified ones) are then normalized
• Normalization factor = $\frac{sum\_of\_old\_weights}{sum\_of\_new\_weights}$
• The weight of classifier Mi's vote is $\log\frac{1 - error(M_i)}{error(M_i)}$
AdaBoost
• The lower a classifier's error rate, the more accurate it is, and therefore
the higher its voting weight should be
• The weight of classifier Mi's vote is
$\log\frac{1 - error(M_i)}{error(M_i)}$
• For each class c, sum the weights of each classifier that
assigned class c to X (unseen tuple)
• The class with the highest sum is the WINNER!
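A minimal sketch of the weight update and the weighted vote just described (illustrative names: `errors` holds error(M_i) for each round, `predictions` the class each M_i assigns to the unseen tuple):

```python
import numpy as np

def update_weights(weights, correct, error_i):
    """Multiply the weights of correctly classified tuples (boolean mask `correct`)
    by error(M_i) / (1 - error(M_i)), then rescale so the total weight is unchanged."""
    new_w = weights.copy()
    new_w[correct] *= error_i / (1.0 - error_i)
    return new_w * (weights.sum() / new_w.sum())   # normalization factor

def classify(predictions, errors, classes):
    """For each class c, add up log((1 - error)/error) of every classifier that
    voted for c; the class with the highest total is the winner."""
    votes = {c: 0.0 for c in classes}
    for pred, err in zip(predictions, errors):
        votes[pred] += np.log((1.0 - err) / err)
    return max(votes, key=votes.get)
```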
Example: AdaBoost
• Base classifiers: C1, C2, …, CT
• Error rate:
$\epsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, \delta\big(C_i(x_j) \neq y_j\big)$
• Importance of a classifier:
$\alpha_i = \frac{1}{2} \ln\!\left(\frac{1 - \epsilon_i}{\epsilon_i}\right)$
Example: AdaBoost
• Weight update:
$w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} \exp(-\alpha_j) & \text{if } C_j(x_i) = y_i \\ \exp(\alpha_j) & \text{if } C_j(x_i) \neq y_i \end{cases}$
where $Z_j$ is the normalization factor
• If any intermediate round produces an error rate higher than 50%, the weights are reverted to 1/n and the re-sampling procedure is repeated
• Classification:
$C^{*}(x) = \arg\max_{y} \sum_{j=1}^{T} \alpha_j \, \delta\big(C_j(x) = y\big)$
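Putting the formulas together, a compact sketch of the whole AdaBoost loop with decision stumps as base classifiers on the example data set; it re-weights the tuples directly rather than re-sampling, which is a simplifying assumption:

```python
import numpy as np

X = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

def fit_stump(X, y, w):
    """Best one-level split under the current tuple weights."""
    best = None
    for thr in np.unique(X):
        for left in (1, -1):
            pred = np.where(X <= thr, left, -left)
            err = np.sum(w[pred != y])
            if best is None or err < best[0]:
                best = (err, thr, left)
    return best

T, n = 5, len(X)
w = np.full(n, 1.0 / n)
stumps, alphas = [], []
for _ in range(T):
    eps, thr, left = fit_stump(X, y, w)
    if eps > 0.5:                        # error above 50%: revert weights, skip the round
        w = np.full(n, 1.0 / n)
        continue
    alpha = 0.5 * np.log((1 - eps) / eps)
    pred = np.where(X <= thr, left, -left)
    w = w * np.exp(-alpha * y * pred)    # exp(-alpha) if correct, exp(+alpha) if wrong
    w = w / w.sum()                      # Z_j, the normalization factor
    stumps.append((thr, left))
    alphas.append(alpha)

# C*(x): weighted vote of the base classifiers
scores = sum(a * np.where(X <= thr, left, -left) for a, (thr, left) in zip(alphas, stumps))
print(np.sign(scores))   # matches y after a few rounds, unlike any single stump
```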
Illustrating AdaBoost
[Figure: three boosting rounds (B1–B3) on the example data set; the initial weights of the data points used for training are all equal.
Round 1 (B1): predictions + + + - - - - - - -, example point weights 0.0094, 0.0094, 0.4623, α = 1.9459
Round 2 (B2): predictions - - - - - - - - + +, example point weights 0.3037, 0.0009, 0.0422, α = 2.9323
Round 3 (B3): predictions + + + + + + + + + +, example point weights 0.0276, 0.1819, 0.0038, α = 3.8744
Overall: + + + - - - - - + +]