Ensemble Learning
Wisdom of the Crowd
Wisdom of the Crowd (of machines)
What is ensemble learning?
• An ensemble is a group of predictors
• An ensemble can be a strong learner even if each predictor is a weak learner
• Provided there are a sufficient number of weak learners and they are sufficiently diverse
How does combining lead to a strong learner?
• Think of a slightly biased coin with a 51% chance of heads and a 49% chance of tails.
• Law of large numbers: as you keep tossing the coin, assuming every toss is independent of the others, the ratio of heads gets closer and closer to the probability of heads, 51%.
How does combining lead to a strong learner?
• Tossing the coin 1,000 times, we will end up with roughly 510 heads and 490 tails.
• For an ensemble of 1,000 classifiers, each correct only 51% of the time, the probability that the majority vote is correct is close to 75% (a quick check follows below).
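A quick numerical check of this claim, under the simplifying assumption that the 1,000 classifiers err independently (real ensembles trained on the same data rarely achieve this):

```python
# Probability that a majority vote of n classifiers is correct, assuming each
# classifier is correct with probability p and all errors are independent.
from scipy.stats import binom

n, p = 1000, 0.51
p_strict_majority = binom.sf(500, n, p)   # P(X >= 501): strictly more than half correct
p_at_least_half = binom.sf(499, n, p)     # P(X >= 500): at least half correct
print(f"strict majority correct: {p_strict_majority:.3f}")  # about 0.73
print(f"at least half correct:   {p_at_least_half:.3f}")    # about 0.75
```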
Voting Classifier
• Train diverse predictors on the same data
Voting Classifier
• Use voting for predictions
• Hard voting: the ensemble prediction is the class predicted by the majority of the predictors
• Soft voting: the ensemble prediction is the class with the highest averaged class probability (a code sketch follows below)
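A minimal sketch of both voting schemes with scikit-learn's VotingClassifier; the dataset and the base models below are arbitrary placeholders:

```python
# Hard vs. soft voting with scikit-learn on a toy dataset.
from sklearn.datasets import make_moons
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

estimators = [
    ("lr", LogisticRegression()),
    ("dt", DecisionTreeClassifier(random_state=42)),
    ("svc", SVC(probability=True, random_state=42)),  # probability=True is needed for soft voting
]

hard = VotingClassifier(estimators, voting="hard").fit(X_train, y_train)
soft = VotingClassifier(estimators, voting="soft").fit(X_train, y_train)

print("hard voting accuracy:", hard.score(X_test, y_test))
print("soft voting accuracy:", soft.score(X_test, y_test))
```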
Bagging and Pasting
• Use a different random subset of the training samples for each predictor
• Bagging samples with replacement; pasting samples without replacement
• Usually predictors of the same type are used
• The individual predictions are aggregated into a combined prediction
Example of Pasting
[Figure: original data with samples 1 to 10; each predictor is trained on a random subset of 6 samples drawn without replacement, and the individual predictions are aggregated into a combined prediction; a code sketch follows below]
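A minimal sketch of bagging versus pasting with scikit-learn's BaggingClassifier, where the only difference is the bootstrap flag; the dataset and hyperparameters are arbitrary choices:

```python
# Bagging (sampling with replacement) vs. pasting (sampling without replacement).
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 200 trees, each trained on a random subset of 100 training instances
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                            max_samples=100, bootstrap=True, random_state=42)
pasting = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                            max_samples=100, bootstrap=False, random_state=42)

for name, model in [("bagging", bagging), ("pasting", pasting)]:
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))
```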
Random Subspaces and Random Patches
• Random subspaces: each predictor is trained on a random subset of the features (keeping all training instances)
• Random patches: each predictor is trained on a random subset of both the training instances and the features
Instance X1 X2 X3 X4 X5
1 10 21 30 44 15
2 12 20 35 40 20
3 10 24 34 43 14
4 15 22 31 41 12
5 19 25 35 42 19
6 12 29 30 45 11
[Figure: example random subsets of the table above, obtained by sampling both instances and features (random patches); a code sketch follows below]
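Both schemes can be sketched with BaggingClassifier's feature-sampling options; the dataset and the specific numbers below are arbitrary:

```python
# Random subspaces (sample features only) and random patches (sample instances
# and features) via BaggingClassifier's sampling options.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=42)

# Random subspaces: keep all instances, give each tree 3 of the 5 features
subspaces = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                              max_samples=1.0, bootstrap=False,   # all instances
                              max_features=3,                     # random feature subset
                              random_state=42)

# Random patches: sample both instances and features
patches = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            max_samples=0.7, bootstrap=True,      # random instance subset
                            max_features=3,                       # random feature subset
                            random_state=42)

for name, model in [("random subspaces", subspaces), ("random patches", patches)]:
    print(name, "CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```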
AdaBoost (Adaptive Boosting)
• Instead of sampling, re-weight the samples
• Each sample is given a weight
• Start with uniform weights
• At each iteration, a model is learned and the samples are re-weighted so that the next classifier focuses on the samples that were wrongly predicted by the previous classifier
• Weights of correctly predicted samples are decreased
• Weights of incorrectly predicted samples are increased (a code sketch follows below)
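A minimal AdaBoost sketch with scikit-learn, using decision stumps as the weak learners; the dataset and hyperparameters are arbitrary choices:

```python
# AdaBoost with shallow decision trees ("stumps") as the weak learners.
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),  # weak learner: a decision stump
                         n_estimators=200, learning_rate=0.5, random_state=42)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))
```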
AdaBoost
[Figure: five training samples shown as original data D1, re-weighted data D2, and re-weighted data D3; Classifier 1, Classifier 2, and Classifier 3 are trained in sequence on these weighted datasets and then combined into the final classifier]
AdaBoost algorithm
• Each sample weight 𝑤𝑖 in the training set is initialized to 1/𝑚, where 𝑚 is the number of samples in the training set
• For each trained predictor 𝑗, compute its weighted error 𝑟𝑗 and its predictor weight 𝛼𝑗
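The slides leave the formulas implicit; one standard way to write the quantities named above (as in the SAMME-style formulation, with 𝜂 a learning-rate hyperparameter; which exact variant the course intends is an assumption):

```latex
% Requires amsmath. Weighted error rate of the j-th predictor:
\[
r_j = \frac{\displaystyle\sum_{\substack{i=1 \\ \hat{y}_j^{(i)} \neq y^{(i)}}}^{m} w^{(i)}}
           {\displaystyle\sum_{i=1}^{m} w^{(i)}}
\qquad
\alpha_j = \eta \, \ln\!\frac{1 - r_j}{r_j}
\]
% Weight update: boost the misclassified samples, then renormalize the weights.
\[
w^{(i)} \leftarrow
\begin{cases}
w^{(i)} & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\
w^{(i)} \exp(\alpha_j) & \text{if } \hat{y}_j^{(i)} \neq y^{(i)}
\end{cases}
\qquad \text{then normalize so that } \textstyle\sum_i w^{(i)} = 1
\]
```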
Gradient Boosting
• Train predictors sequentially, each new predictor fitting the residual errors made by the previous one
• Dataset 𝐷 = {(𝑥𝑖 , 𝑦𝑖)}, 𝑖 = 1, … , 𝑚
• The first predictor is trained on 𝐷1 = 𝐷, the original dataset
Gradient boosting: example
• Training data: square footage of five apartments and their rent prices in dollars per month
• We use the mean (average) of the rent prices as our initial model 𝐹0
Gradient boosting: example
• Next, we train weak models Δ to predict the residuals of the current model for all observations 𝑖
Gradient boosting: example
• Summing the three learners gives the final model (a code sketch follows below)
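Since the slide's rent figures are not reproduced here, the sketch below uses made-up numbers; it only illustrates the mechanics described above: start from the mean, fit each weak learner to the current residuals, and sum the learners.

```python
# Manual gradient boosting for regression (squared loss) with hypothetical data:
# start from the mean, then repeatedly fit a small tree to the residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[750], [800], [850], [900], [950]])   # square footage (hypothetical)
y = np.array([1160, 1200, 1280, 1450, 2000])        # rent in $/month (hypothetical)

F = np.full(len(y), y.mean())        # F0: initial model = mean rent
learners = []
for _ in range(3):                   # three boosting rounds, as in the slides
    residuals = y - F                # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    learners.append(tree)
    F = F + tree.predict(X)          # F_m = F_{m-1} + Delta_m

# Prediction for a new apartment = mean + sum of the weak learners' corrections
x_new = np.array([[880]])
prediction = y.mean() + sum(t.predict(x_new) for t in learners)
print("predicted rent:", float(prediction[0]))
```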
Stacking (Stacked generalization)
• Basic idea: train a separate model to aggregate the predictions of the individual classifiers
• Predictions made on the training set are used as input features for the level-1 model
• The level-1 model, called the blender or meta-learner, is used to make the final prediction on the test set
Stacking (Stacked generalization)
• Stacking with a hold-out set is called blending: the base models are trained on one part of the training data, and the blender is trained on their predictions for the held-out part (a code sketch follows below)
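A sketch with scikit-learn's StackingClassifier, which trains the blender on out-of-fold (cross-validated) predictions rather than on a single hold-out set, but the idea is the same; the base models and data are placeholders:

```python
# Stacking: level-0 models feed their predictions to a level-1 blender.
from sklearn.datasets import make_moons
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stack = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier(random_state=42)),
                ("knn", KNeighborsClassifier()),
                ("svc", SVC(random_state=42))],
    final_estimator=LogisticRegression(),  # the blender / meta-learner
    cv=5,  # level-1 features are out-of-fold predictions of the level-0 models
)
stack.fit(X_train, y_train)
print("stacking accuracy:", stack.score(X_test, y_test))
```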
Stacking
• Training the level-0 models
[Figure: the training set (features X1 to X4 and class label C) is split into subsets; Model 1 is trained on training subset 1, Model 2 on another subset, and so on]
Multilayer stacking
• Several layers of blenders can be stacked: the predictions of the models in one layer become the input features of the models in the next layer (a sketch follows below)
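One way to sketch two layers of blenders with scikit-learn is to use a StackingClassifier as the final estimator of another StackingClassifier; this is an illustrative construction, not the only way to build a multilayer stack:

```python
# Multilayer stacking sketch: the outer stack's blender is itself a stacking
# ensemble, giving two layers of blenders on top of the base models.
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

inner_blender = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=42)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),
)

two_layer_stack = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier(random_state=42)),
                ("knn", KNeighborsClassifier())],
    final_estimator=inner_blender,  # the blender is itself a stacking ensemble
)
two_layer_stack.fit(X, y)  # the inner stack is fitted on the outer stack's out-of-fold predictions
print("training accuracy:", two_layer_stack.score(X, y))
```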
Conclusions
• Ensemble learning is about training multiple base models and combining them to obtain a strong model with better performance
• Ideally low bias, low variance
• In bagging ensembles, instances of the same base model are trained in parallel on random subsets of the data and then aggregated
• Using random sampling reduces variance
• In boosting ensembles, instances of the same base model are trained iteratively, such that each model attempts to correct the predictions of the previous model
• Stacking ensembles use multi-stage training: different types of base models are trained at the first stage, on top of which a meta-model is trained to make predictions based on the base models' predictions
Ensemble learning on diabetes data
• Load the diabetes data and split it into a training set, a validation set, and a test set
• 30% of the data for the test set, 30% of the training data for validation
• Train various classifiers individually: decision tree, KNN, and SVM
• Voting ensemble: combine the classifiers into an ensemble using hard or soft voting
• Use the validation set to find the best ensemble (it must outperform the individual classifiers)
• Evaluate the best model found on the test set and compare the results (a solution sketch follows below)
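A possible solution outline, assuming the Pima Indians diabetes CSV with an "Outcome" target column; the file name diabetes.csv is a placeholder, and feature scaling for KNN/SVM is omitted for brevity:

```python
# Exercise sketch: individual classifiers vs. hard/soft voting on diabetes data.
import pandas as pd
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("diabetes.csv")                      # placeholder path to the course data
X, y = df.drop(columns="Outcome"), df["Outcome"]

# 30% of the data for the test set, then 30% of the remaining training data for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42)

# Individual classifiers
clfs = {"tree": DecisionTreeClassifier(random_state=42),
        "knn": KNeighborsClassifier(),
        "svm": SVC(probability=True, random_state=42)}  # probability=True enables soft voting
for name, clf in clfs.items():
    clf.fit(X_tr, y_tr)
    print(name, "validation accuracy:", clf.score(X_val, y_val))

# Voting ensembles (hard and soft); keep the one that does best on the validation set
best_name, best_model, best_acc = None, None, -1.0
for voting in ("hard", "soft"):
    ens = VotingClassifier(list(clfs.items()), voting=voting).fit(X_tr, y_tr)
    acc = ens.score(X_val, y_val)
    print(voting, "voting validation accuracy:", acc)
    if acc > best_acc:
        best_name, best_model, best_acc = voting, ens, acc

# Final evaluation of the selected ensemble on the held-out test set
print(best_name, "voting test accuracy:", best_model.score(X_test, y_test))
```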