Ensemble Learning

The document discusses ensemble learning, which is the process of combining multiple machine learning models to obtain better predictive performance than could be obtained from any of the constituent models alone. It describes different types of ensembles including voting, bagging, boosting, and stacking. It also explains how combining models can lead to stronger learners even when the individual models are weak.

Uploaded by Hiba Saghir

ENSEMBLE LEARNING

The science of combining models

Wisdom of the Crowd

Wisdom of the Crowd (of machines)

• The wisdom of the crowd: aggregated judgments from a diverse group are often better than any individual judgment
• A diverse set of models is likely to make better decisions than any single model
• Combining decisions from multiple models improves the overall performance

[Figure: samples from some unknown distribution feeding six models (Model 1 through Model 6), whose decisions are combined]
What is ensemble learning?
• An ensemble is a group of predictors
• An ensemble can be a strong learner even if each predictor is a weak learner
• Provided there are a sufficient number of weak learners and they are sufficiently diverse

How does combining lead to a strong learner?
• Think of a slightly biased coin with a 51% chance of heads and a 49% chance of tails.

• Law of large numbers: as you keep tossing the coin, assuming every toss is independent of the others, the ratio of heads gets closer and closer to the probability of heads, 51%.
How does combining lead to a strong learner?
• Tossing the coin 1000 times, we will end up with roughly 510 heads and 490 tails, i.e., a majority of heads.

• Similarly, for an ensemble of 1000 classifiers, each correct 51% of the time, the probability that the majority vote is correct is close to 75%.
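The 75% figure can be checked directly: the majority of 1000 classifiers is correct exactly when more than 500 of the independent 51%-accurate votes are right, which is a binomial tail probability. A minimal sketch using only the standard library (the function name is mine):

```python
from math import comb

def majority_correct_prob(n: int, p: float) -> float:
    """Probability that a strict majority of n independent voters,
    each correct with probability p, gets the right answer."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(f"{majority_correct_prob(1000, 0.51):.3f}")  # roughly 0.73
```

This matches the slide's "close to 75%" claim; note it assumes the classifiers' errors are truly independent, which real ensembles only approximate.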
Ensemble Learning

• Different types of ensemble-learning approaches: voting, bagging and pasting, boosting, and stacking
• Ensembles may use the same or different learning algorithms: homogeneous vs. heterogeneous ensembles
• They may train on the same dataset or on random subsets of the data
• They may use the same or different sets of features
Voting Classifier
• Train diverse predictors on the same data

Voting Classifier
• Use voting for predictions
• Hard voting: the ensemble prediction is the prediction of the majority

Example: Predictors 1–5 predict 5, 4, 5, 4, 4, so the ensemble's prediction is 4.
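Hard voting amounts to taking the mode over the predictor outputs; a minimal plain-Python sketch using the slide's example values:

```python
from collections import Counter

def hard_vote(predictions):
    """Return the most common prediction among the base predictors."""
    return Counter(predictions).most_common(1)[0][0]

print(hard_vote([5, 4, 5, 4, 4]))  # prints 4
```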
Voting Classifier
• Use voting for predictions
• Soft voting: the ensemble prediction is the class with the highest averaged predicted probability
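Soft voting averages per-class probabilities instead of counting votes; a minimal sketch (the probability values are illustrative, not from the slide):

```python
def soft_vote(probabilities):
    """probabilities: one [p_class0, p_class1, ...] list per predictor.
    Average the class probabilities, then pick the argmax class."""
    n = len(probabilities)
    avg = [sum(p[c] for p in probabilities) / n
           for c in range(len(probabilities[0]))]
    return max(range(len(avg)), key=avg.__getitem__)

# Two classes, three predictors (illustrative probabilities):
print(soft_vote([[0.4, 0.6], [0.7, 0.3], [0.3, 0.7]]))  # prints 1
```

Note how soft voting can disagree with hard voting: here two of three predictors favor class 1, but even a 2-to-1 hard vote for class 0 could be overruled if the minority predictor were confident enough.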
Bagging and Pasting
• Use different random subsets of the samples
• Usually predictors of the same type are used

• Bagging: sampling with replacement
  • For a given predictor, a training instance may be sampled several times

• Pasting: sampling without replacement
  • Within a predictor each instance appears at most once, but instances may still be sampled several times across different predictors

• Training and predictions can be performed in parallel
Example of Bagging
Original data: 1 2 3 4 5 6 7 8 9 10 (sample size = 10)

Bootstrap 1: 7 8 10 8 2 5 10 10 5 9 → Model 1
Bootstrap 2: 1 8 5 10 5 5 9 6 3 7 → Model 2
Bootstrap 3: 1 4 9 1 2 3 2 7 3 2 → Model 3

The three models' outputs are aggregated into a combined prediction.
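The sampling difference between bagging and pasting can be sketched with the standard library's `random` module, using the slide's data 1–10 (the seed is arbitrary, chosen only for reproducibility):

```python
import random

data = list(range(1, 11))  # the slide's original data: 1..10
rng = random.Random(0)     # fixed seed, arbitrary

# Bagging: sample WITH replacement (duplicates allowed within a bootstrap)
bootstrap = rng.choices(data, k=10)

# Pasting: sample WITHOUT replacement (no duplicates within a sample)
paste = rng.sample(data, k=6)

print(bootstrap)
print(paste)
```

Repeated values can appear in `bootstrap` (as in Bootstrap 1 above, where 8 and 10 repeat), while `paste` always contains six distinct instances.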
Example of Pasting
Original data: 1 2 3 4 5 6 7 8 9 10 (sample size = 6)

Sample 1: 7 8 10 1 2 3 → Model 1
Sample 2: 1 8 5 10 9 6 → Model 2
Sample 3: 1 4 9 2 3 7 → Model 3

The three models' outputs are aggregated into a combined prediction.
Random Subspaces and Random Patches

Original data (6 instances, 5 features):

    X1  X2  X3  X4  X5
1   10  21  30  44  15
2   12  20  35  40  20
3   10  24  34  43  14
4   15  22  31  41  12
5   19  25  35  42  19
6   12  29  30  45  11

• Random subspaces: keep all instances and sample the features. In the slide's example, the three predictors see the feature subsets {X2, X4, X5}, {X1, X3, X4}, and {X1, X2, X3}, each over all six instances.
• Random patches: sample both instances and features, so each predictor sees a random subset of the rows and a random subset of the columns.


Boosting
• Unlike bagging, individual predictors are trained sequentially, each trying to correct
its predecessor.

Adaboost (Adaptive boosting)
• Instead of sampling, re-weight the samples
• Samples are given weights
• Start with uniform weighting

• At each iteration, a model is learned and the samples are re-weighted so that the next classifier focuses on samples that were wrongly predicted by the previous classifier
  • Weights of correctly predicted samples are decreased
  • Weights of incorrectly predicted samples are increased

• The final prediction is a combination of the model predictions, weighted by their respective error measures
Adaboost

[Figure: the original data D1 and the re-weighted datasets D2 and D3, each used to train one classifier (Classifiers 1–3); the three classifiers are combined into the final classifier]
Adaboost algorithm
• Each sample weight $w_i$ in the training set is initialized to $\frac{1}{m}$, where $m$ is the number of samples in the training set
• For each trained predictor $j$, compute its weighted error rate $r_j$ and its predictor weight $\alpha_j$; in the standard formulation, $r_j$ is the weight-normalized fraction of misclassified samples and $\alpha_j = \log\frac{1 - r_j}{r_j}$
• Update the sample weights: multiply the weight of each misclassified sample by $\exp(\alpha_j)$, then renormalize
• For final predictions, each predictor votes for its predicted class with weight $\alpha_j$; the class with the highest total vote wins
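One AdaBoost re-weighting round can be sketched in plain Python, following the standard update in which misclassified samples have their weights multiplied by exp(α) and the weights are then renormalized (the toy labels and predictions are illustrative):

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost step: weighted error r, predictor weight alpha,
    and the updated, renormalized sample weights."""
    r = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / sum(weights)
    alpha = math.log((1 - r) / r)
    new_w = [w * math.exp(alpha) if t != p else w
             for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return r, alpha, [w / total for w in new_w]

w0 = [0.2] * 5  # uniform initial weights, m = 5
# Sample at index 1 is misclassified; its weight will grow.
r, alpha, w1 = adaboost_round(w0, [1, 1, -1, -1, 1], [1, -1, -1, -1, 1])
print(round(r, 2), round(alpha, 2))  # 0.2 1.39
```

After the round, the misclassified sample carries weight 0.5 while the four correct ones carry 0.125 each, so the next classifier concentrates on the mistake.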
Gradient Boosting
• Unlike AdaBoost, Gradient Boosting tries to fit the new predictor to the residual errors
made by the previous predictor.

Gradient Boosting

Dataset $D = \{(x_i, y_i)\}_{i=1}^{m}$

$D_1 = \{(x_i,\ y_i)\}_{i=1}^{m}$

$D_2 = \{(x_i,\ y_i - h_1(x_i))\}_{i=1}^{m}$

$D_3 = \{(x_i,\ y_i - h_1(x_i) - h_2(x_i))\}_{i=1}^{m}$
Gradient boosting: example
• Training data: square footage of five apartments and their rent prices in dollars per month
• We use the mean (average) of the rent prices as our initial model F0
Gradient boosting: example
• Next, we train weak models Δᵢ to predict the residuals for all observations i
Gradient boosting: example
• Next, we train weak models Δᵢ to predict the residuals for all observations i
• The residuals (blue dots) get smaller as we add more learners
Gradient boosting: example
Gradient boosting: example
• Summing the three learners
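The rent example can be sketched end to end with regression stumps fit to the residuals; the square footages and rents below are hypothetical stand-ins, since the slide's actual numbers live in the figure:

```python
def fit_stump(x, residuals):
    """Find the threshold split on x minimizing squared error; return a
    predictor mapping x -> mean residual of its side of the split."""
    best = None
    for t in sorted(set(x))[1:]:
        left = [r for xi, r in zip(x, residuals) if xi < t]
        right = [r for xi, r in zip(x, residuals) if xi >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi < t else rm

x = [700, 800, 850, 900, 950]       # hypothetical square footages
y = [1000, 1150, 1300, 1400, 1500]  # hypothetical monthly rents

f0 = sum(y) / len(y)                # initial model F0: the mean rent
pred = [f0] * len(y)
for _ in range(3):                  # add three stump learners on the residuals
    residuals = [yi - pi for yi, pi in zip(y, pred)]
    stump = fit_stump(x, residuals)
    pred = [pi + stump(xi) for pi, xi in zip(pred, x)]

print([round(p) for p in pred])     # predictions approach the true rents
```

Each pass fits the next learner to what the current ensemble still gets wrong, so the squared error of `pred` against `y` shrinks with every added stump, mirroring the shrinking blue residual dots on the slide.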
Stacking (Stacked generalization)
• Basic idea: train a separate model, called the blender or meta-learner, to aggregate the predictions of the individual classifiers
• Predictions on the training set are used as features for the level 1 model
• The level 1 model is then used to make predictions on the test set
Stacking (Stacked generalization)
• Stacking with a hold-out set → Blending

Stacking
• Training the level 0 models
• The training set (features X1–X4, class label C) is split into two subsets
• Training subset 1 is used to train the level 0 models (Model 1, Model 2, Model 3); training subset 2 is held out for the next stage
Stacking
• The trained level 0 models predict on training subset 2
• Their predictions (M1, M2, M3), together with the true targets C, form the training set for the level 1 model (the generalizer), which produces the final predictions
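A minimal blending-style sketch in plain Python: the "level 0 models" are hand-written threshold rules and the "level 1 model" simply weights each one by its hold-out accuracy. Everything here (data, rules, weighting scheme) is illustrative, not a production stacking pipeline:

```python
# Toy 1-D hold-out data: the true label is 1 when x > 5 (illustrative)
holdout_x = [1, 2, 4, 6, 8, 9]
holdout_y = [0, 0, 0, 1, 1, 1]

# Three hand-written "level 0 models" (threshold rules of varying quality)
models = [lambda x: int(x > 5), lambda x: int(x > 3), lambda x: int(x > 7)]

# Level 1: weight each base model by its accuracy on the hold-out set
weights = []
for m in models:
    acc = sum(m(x) == y for x, y in zip(holdout_x, holdout_y)) / len(holdout_x)
    weights.append(acc)

def blended_predict(x):
    """Weighted vote of the base models, weights learned on the hold-out set."""
    score = sum(w * m(x) for w, m in zip(weights, models))
    return int(score > sum(weights) / 2)

print([blended_predict(x) for x in [2, 6, 10]])  # prints [0, 1, 1]
```

A real meta-learner (e.g., a logistic regression or random forest over the base predictions) would replace the accuracy weighting, but the data flow is the same: base models are trained on one subset, and the aggregator is fit on their hold-out predictions.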
Multilayer stacking

Conclusions
• Ensemble learning is about training multiple base models and combining them to obtain a strong model with better performance
  • Ideally low bias and low variance

• In bagging ensembles, instances of the same base model are trained in parallel on random subsets of the data and then aggregated
  • Random sampling reduces variance

• In boosting ensembles, instances of the same base model are trained iteratively, so that each model attempts to correct the predictions of the previous model
• Stacking ensembles use multi-stage training: different types of base models are trained at the first stage, on top of which a meta-model is trained to make predictions based on the base models' predictions
Ensemble learning on diabetes data
• Load the diabetes data and split it into a training set, a validation set, and a test set
  • 30% of the data for testing, 30% of the training set for validation
• Train various classifiers individually: decision tree, KNN, and SVM
• Voting ensemble: combine the classifiers into an ensemble using hard or soft voting
  • Use the validation set to find the best ensemble (it must outperform the individual classifiers)
  • Evaluate the best model found on the test set and compare the results

• Stacking ensemble: using the previous classifiers
  • Create a new training set (for the meta-learner) using the predictions on the validation set
  • Train a classifier (e.g., random forest) on the new training set
  • Evaluate the model on the test set and compare the results