
Ensemble Learning

Prof. Navneet Goyal


Ensemble Learning
• Introduction & Motivation
• Construction of Ensemble Classifiers
– Boosting (AdaBoost)
– Bagging
– Random Forests
• Empirical Comparison
Introduction & Motivation
• Suppose that you are a patient with a set of symptoms
• Instead of taking the opinion of just one doctor (classifier), you decide to take the opinion of a few doctors!
• Is this a good idea? Indeed it is.
• Consult many doctors and, based on their combined diagnoses, you can get a fairly accurate idea of the true diagnosis.
• Majority voting - ‘bagging’
• More weight to the opinions of the ‘good’ (accurate) doctors - ‘boosting’
• In bagging, you give equal weight to all classifiers, whereas in boosting you weight each classifier according to its accuracy.
Introduction & Motivation
• Suppose you are preparing for an exam
• As part of the preparation, you have to solve 50
problems
• You, as an individual, can solve only 15 problems
• But one of your wingies can solve 20 problems, some of which are the same as your 15
• Another wingie can solve a set of 12 problems
• When you put them all together, you find that you have solutions to 32 problems
• 32 is much better than what you could solve
individually!!
• Collaborative Learning
Introduction & Motivation
• Why do you think I ask you to form groups for
assignments?
– I am against an individual student doing an assignment
• Maybe because I want to reduce my evaluation work!!
– NOT TRUE!!!
• Group members complement each other and do a
better job
– Provided it is not just one student who actually does the assignment while the rest piggyback!!
• If all students in a group contribute properly, it’s a
WIN-WIN situation for both students and faculty!!
Ensemble Methods
• Construct a set of classifiers from the training data
• Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers
General Idea
• Step 1: Create multiple data sets D1, D2, ..., Dt-1, Dt by sampling from the original training data D
• Step 2: Build one classifier per data set: C1, C2, ..., Ct-1, Ct
• Step 3: Combine the classifiers into a single ensemble classifier C*

Figure taken from Tan et al., “Introduction to Data Mining”
Ensemble Classifiers (EC)
• An ensemble classifier constructs a set of ‘base
classifiers’ from the training data
• Methods for constructing an EC
• Manipulating training set
• Manipulating input features
• Manipulating class labels
• Manipulating learning algorithms
Ensemble Classifiers (EC)
• Manipulating training set
• Multiple training sets are created by resampling the data
according to some sampling distribution
• Sampling distribution determines how likely it is that an
example will be selected for training – may vary from one trial
to another
• A classifier is built from each training set using a particular learning algorithm
• Examples: Bagging & Boosting
Ensemble Classifiers (EC)
• Manipulating input features
• Subset of input features chosen to form each training set
• Subset can be chosen randomly or based on inputs given by
Domain Experts
• Good for data that has redundant features
• Random Forest is an example, which uses decision trees (DTs) as its base classifiers (illustrated below)
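As an illustration (not from the slides), scikit-learn's RandomForestClassifier manipulates input features in exactly this way: each tree is grown on a bootstrap sample, and only a random subset of attributes (max_features) is considered at each split. The data set below is synthetic and purely for demonstration.

```python
# Hedged illustration: a random forest on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with informative and redundant features (illustrative only).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)

# Each tree sees a bootstrap sample and sqrt(n_features) candidate attributes per split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print("Training accuracy:", forest.score(X, y))
```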
Ensemble Classifiers (EC)
• Manipulating class labels
• When no. of classes is sufficiently large
• Training data is transformed into a binary class problem by
randomly partitioning the class labels into 2 disjoint subsets,
A0 & A1
• Re-labelled examples are used to train a base classifier
• By repeating the class-relabelling and model-building steps several times, an ensemble of base classifiers is obtained
• How is a new tuple classified? (see the sketch below)
• Example – error-correcting output coding (pp. 307)
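A minimal sketch (not from the slides) of this class-relabelling scheme, using hypothetical class IDs and a decision tree as the base classifier: each round partitions the classes into two groups A0 and A1, trains a binary model on the relabelled data, and at prediction time the binary model's vote is credited to every original class in the predicted group.

```python
# Sketch of ensembles via class-label manipulation (random class partitioning).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_relabelled_ensemble(X, y, classes, rounds=15, seed=0):
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(rounds):
        # Randomly split the original classes into two disjoint groups A0 and A1.
        a1 = set(rng.permutation(classes)[: len(classes) // 2].tolist())
        y_bin = np.array([1 if label in a1 else 0 for label in y])
        clf = DecisionTreeClassifier(random_state=0).fit(X, y_bin)
        ensemble.append((clf, a1))
    return ensemble

def predict(ensemble, x, classes):
    votes = {c: 0 for c in classes}
    for clf, a1 in ensemble:
        group = a1 if clf.predict(x.reshape(1, -1))[0] == 1 else set(classes) - a1
        for c in group:                       # credit every class in the predicted group
            votes[c] += 1
    return max(votes, key=votes.get)          # class with the most votes wins

# Tiny illustrative run on random data (labels are arbitrary here).
X = np.random.default_rng(1).normal(size=(60, 4))
y = np.repeat([0, 1, 2, 3], 15)
ens = train_relabelled_ensemble(X, y, classes=[0, 1, 2, 3])
print(predict(ens, X[0], classes=[0, 1, 2, 3]))
```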
Ensemble Classifiers (EC)
• Manipulating learning algorithm
• Learning algorithms can be manipulated in such a way that
applying the algorithm several times on the same training
data may result in different models
• Example – ANN can produce different models by changing
network topology or the initial weights of links between
neurons
• Example – an ensemble of DTs can be constructed by introducing randomness into the tree-growing procedure – instead of choosing the best split attribute at each node, we randomly choose one of the top k attributes (a minimal sketch follows below)
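A minimal sketch (not from the slides) of this randomised split selection; the gain values and attribute names below are hypothetical.

```python
# Sketch: pick a split attribute uniformly at random from the top-k candidates.
import random

def choose_split_attribute(gains, k=3, seed=0):
    ranked = sorted(gains, key=gains.get, reverse=True)   # best attribute first
    return random.Random(seed).choice(ranked[:k])         # random pick among top k

# Hypothetical information gains computed at some tree node.
gains = {"age": 0.42, "income": 0.40, "student": 0.38, "credit_rating": 0.10}
print(choose_split_attribute(gains))
```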
Ensemble Classifiers (EC)
• First 3 approaches are generic – can be applied to any
classifier
• Fourth approach depends on the type of classifier used
• Base classifiers can be generated sequentially or in
parallel
Ensemble Classifiers
• Ensemble methods work better with ‘unstable
classifiers’
• Classifiers that are sensitive to minor perturbations in
the training set
• Examples:
– Decision trees
– Rule-based
– Artificial neural networks
Why does it work?
• Suppose there are 25 base classifiers
– Each classifier has error rate, e = 0.35
– Assume classifiers are independent
– Probability that the ensemble classifier makes a wrong
prediction:
$\sum_{i=13}^{25} \binom{25}{i} e^{i} (1-e)^{25-i} = 0.06$
– Check for yourself that this figure is correct!!

Example taken from Tan et al., “Introduction to Data Mining”
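The figure above is easy to verify numerically; here is a minimal Python sketch (not part of the original slides) that evaluates the binomial sum:

```python
# The ensemble errs when 13 or more of the 25 independent base classifiers err.
from math import comb

e, n = 0.35, 25
p_wrong = sum(comb(n, i) * e**i * (1 - e)**(n - i) for i in range(13, n + 1))
print(round(p_wrong, 3))   # ~0.06, matching the slide
```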
Examples of Ensemble Methods

• How to generate an ensemble of classifiers?


– Bagging
– Boosting
– Random Forests
Bagging
• Also known as bootstrap aggregation
Original Data 1 2 3 4 5 6 7 8 9 10
Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9
Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2
Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7
• Sampling uniformly with replacement
• Build classifier on each bootstrap sample
• 0.632 bootstrap
• Each bootstrap sample Di contains approx. 63.2% of
the original training data
• The remaining records (approx. 36.8%) are used as the test set

Example taken from Tan et al., “Introduction to Data Mining”
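For concreteness, here is a minimal sketch (not from the slides) of how such bootstrap rounds can be generated with NumPy; the record IDs 1-10 mirror the table above, although the exact samples drawn depend on the random seed.

```python
# Each bagging round samples n records uniformly with replacement,
# so roughly 63.2% of the distinct records appear in each bootstrap sample.
import numpy as np

rng = np.random.default_rng(0)
original = np.arange(1, 11)                 # records 1..10
for r in range(1, 4):
    sample = rng.choice(original, size=original.size, replace=True)
    print(f"Bagging (Round {r}):", sorted(sample.tolist()))
```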
Bagging

• Accuracy of bagging:
$Acc(M) = \sum_{i=1}^{k} \left( 0.632 \cdot Acc(M_i)_{\text{test\_set}} + 0.368 \cdot Acc(M_i)_{\text{train\_set}} \right)$
• Works well for small data sets
• Example:

x:  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
y:   1    1    1   -1   -1   -1   -1    1    1    1   (actual class labels)

Example taken from Tan et al., “Introduction to Data Mining”
Bagging
• Decision stump
• A single-level binary decision tree
• Splitting on entropy yields either x <= 0.35 or x <= 0.75
• Accuracy is at most 70%

Example taken from Tan et al., “Introduction to Data Mining”
Bagging

Accuracy of the ensemble classifier: 100%

Example taken from Tan et al., “Introduction to Data Mining”
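This toy experiment can be approximated with scikit-learn. The hedged sketch below (assuming a recent scikit-learn, >= 1.2, where BaggingClassifier takes an `estimator` argument) bags decision stumps on the 10-point data set above; the exact accuracy depends on which bootstrap samples are drawn, so treat the 100% figure as illustrative.

```python
# Bagging decision stumps (single-level trees) on the toy data set.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X = np.linspace(0.1, 1.0, 10).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

stump = DecisionTreeClassifier(max_depth=1)
bag = BaggingClassifier(estimator=stump, n_estimators=10, random_state=1)
bag.fit(X, y)
print("Ensemble accuracy:", bag.score(X, y))
```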
Bagging- Final Points
• Works well if the base classifiers are unstable
• Increased accuracy because it reduces the variance
of the individual classifier
• Does not focus on any particular instance of the
training data
• Therefore, less susceptible to model over-fitting
when applied to noisy data
• What if we want to focus on particular instances of the training data?
Boosting
• An iterative procedure to adaptively change
distribution of training data by focusing more on
previously misclassified records
– Initially, all N records are assigned equal weights
– Unlike bagging, weights may change at the end of a
boosting round
Boosting
• Records that are wrongly classified will have their
weights increased
• Records that are classified correctly will have their
weights decreased
Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4

• Example 4 is hard to classify
• Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds

Example taken from Tan et al., “Introduction to Data Mining”
Boosting
• Equal weights are assigned to each training tuple
(1/d for round 1)
• After a classifier Mi is learned, the weights are
adjusted to allow the subsequent classifier Mi+1 to
“pay more attention” to tuples that were
misclassified by Mi.
• Final boosted classifier M* combines the votes of
each individual classifier
• Weight of each classifier’s vote is a function of its
accuracy
• AdaBoost – a popular boosting algorithm
AdaBoost
• Input:
– Training set D containing d tuples
– k rounds
– A classification learning scheme
• Output:
– A composite model
AdaBoost
• Data set D containing d class-labeled tuples (X1,y1),
(X2,y2), (X3,y3),….(Xd,yd)
• Initially assign equal weight 1/d to each tuple
• To generate k base classifiers, we need k rounds or
iterations
• In round i, tuples from D are sampled with replacement to form Di (of size d)
• Each tuple’s chance of being selected depends on its
weight
AdaBoost
• Base classifier Mi, is derived from training tuples of
Di
• Error of Mi is tested using Di
• Weights of training tuples are adjusted depending
on how they were classified
– Correctly classified: Decrease weight
– Incorrectly classified: Increase weight
• Weight of a tuple indicates how hard it is to classify
it (directly proportional)
AdaBoost
• Some classifiers may be better at classifying some
“hard” tuples than others
• We finally have a series of classifiers that
complement each other!
• Error rate of model Mi:
$error(M_i) = \sum_{j=1}^{d} w_j \cdot err(X_j)$
where err(Xj) is the misclassification error of Xj (1 if Xj is misclassified, 0 otherwise)
• If classifier error exceeds 0.5, we abandon it
• Try again with a new Di and a new Mi derived from it
AdaBoost
• error(Mi) affects how the weights of training tuples are updated
• If a tuple is correctly classified in round i, its weight is multiplied by $\frac{error(M_i)}{1 - error(M_i)}$
• Adjust the weights of all correctly classified tuples
• Now the weights of all tuples (including the misclassified tuples) are normalized
• Normalization factor = $\frac{sum\_of\_old\_weights}{sum\_of\_new\_weights}$
• The weight of classifier Mi's vote is $\log \frac{1 - error(M_i)}{error(M_i)}$
AdaBoost
• The lower a classifier's error rate, the more accurate it is, and therefore the higher its weight for voting should be
• The weight of classifier Mi's vote is
$\log \frac{1 - error(M_i)}{error(M_i)}$
• For each class c, sum the weights of each classifier that
assigned class c to X (unseen tuple)
• The class with the highest sum is the WINNER!
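A minimal sketch of this weight-update and weighted-voting arithmetic, assuming the tuple weights already sum to 1 and a made-up round in which tuples 3 and 7 are misclassified (the round and tuple indices are purely illustrative):

```python
# Han & Kamber-style AdaBoost bookkeeping for one round.
import math
import numpy as np

def update_weights(w, correct, error):
    """Shrink weights of correctly classified tuples, then renormalise."""
    old_sum = w.sum()
    w = w.copy()
    w[correct] *= error / (1.0 - error)        # decrease weights of correct tuples
    return w * (old_sum / w.sum())             # normalization factor = old_sum / new_sum

def classifier_vote_weight(error):
    return math.log((1.0 - error) / error)     # lower error -> larger vote weight

w = np.full(10, 0.1)                           # 10 tuples, equal initial weights
correct = np.ones(10, dtype=bool)
correct[[3, 7]] = False                        # tuples 3 and 7 were misclassified
error = w[~correct].sum()                      # weighted error = 0.2
w = update_weights(w, correct, error)
print(error, classifier_vote_weight(error), w.round(3))
```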
Example: AdaBoost
• Base classifiers: C1, C2, …, CT

• Error rate:
$\epsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, \delta\big(C_i(x_j) \neq y_j\big)$
• Importance of a classifier:
$\alpha_i = \frac{1}{2} \ln\left( \frac{1 - \epsilon_i}{\epsilon_i} \right)$
Example: AdaBoost
• Weight update:
$w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} \exp(-\alpha_j) & \text{if } C_j(x_i) = y_i \\ \exp(\alpha_j) & \text{if } C_j(x_i) \neq y_i \end{cases}$
where $Z_j$ is the normalization factor
• If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/n and the re-sampling procedure is repeated
• Classification:
$C^{*}(x) = \arg\max_{y} \sum_{j=1}^{T} \alpha_j \, \delta\big(C_j(x) = y\big)$
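Putting the formulas together, here is a hedged sketch (not the authors' code) of the AdaBoost loop with decision stumps as base classifiers; fitting with `sample_weight` stands in for the weighted resampling described earlier, and the toy data set is the one used in the bagging example.

```python
# AdaBoost sketch: alpha = 0.5*ln((1-eps)/eps), exponential weight update, weighted vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, rounds=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                        # equal initial weights
    models, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.sum(w * (pred != y))              # weighted error rate
        if eps > 0.5:                              # revert weights and try again
            w = np.full(n, 1.0 / n)
            continue
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))
        w *= np.exp(-alpha * y * pred)             # shrink correct, grow misclassified
        w /= w.sum()                               # Z_j: normalization factor
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def predict(models, alphas, X):
    votes = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(votes)                          # alpha-weighted majority vote

X = np.linspace(0.1, 1.0, 10).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])
models, alphas = adaboost(X, y)
print(predict(models, alphas, X))
```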
Illustrating AdaBoost
[Figure: three boosting rounds on a one-dimensional toy data set; the weights shown are the per-point weights in the three regions of the x-axis]
• Original data: + + + - - - - - + +, with an initial weight of 0.1 for every data point
• Boosting Round 1 (B1): + + + - - - - - - -, α = 1.9459 (weights shown: 0.0094, 0.0094, 0.4623)
• Boosting Round 2 (B2): - - - - - - - - + +, α = 2.9323 (weights shown: 0.3037, 0.0009, 0.0422)
• Boosting Round 3 (B3): + + + + + + + + + +, α = 3.8744 (weights shown: 0.0276, 0.1819, 0.0038)
• Overall (weighted vote): + + + - - - - - + +
