ML8 Ensembles
Why does it work?
Suppose there are 25 base classifiers
◦ Each classifier has an error rate ε = 0.35
◦ Assume the classifiers are independent
◦ The probability that the ensemble classifier makes a wrong prediction is the probability that 13 or more classifiers err:
$$\sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06$$
This value approaches 0 as the number of classifiers increases. This is just the law of large numbers, but in this context it is sometimes called "Condorcet's Jury Theorem".
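A quick way to check this number is to evaluate the binomial sum directly; a minimal sketch in Python (mine, not from the slides):

```python
from math import comb

def ensemble_error(n=25, eps=0.35):
    """Probability that a majority of n independent base classifiers,
    each with error rate eps, are wrong at the same time."""
    k = n // 2 + 1   # smallest number of wrong votes that flips the majority (13 for n = 25)
    return sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(k, n + 1))

print(round(ensemble_error(), 3))        # ~0.06
print(round(ensemble_error(n=101), 3))   # shrinks toward 0 as the ensemble grows
```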
Homogeneous Ensembles
Use a single, arbitrary learning algorithm but manipulate the training data so that it learns multiple models
◦ Data1 ≠ Data2 ≠ … ≠ Data m
◦ Learner1 = Learner2 = … = Learner m
Different methods for changing the training data:
◦ Bagging: resample the training data
◦ Boosting: reweight the training data
◦ DECORATE: add additional artificial training data
In WEKA, these are called meta-learners: they take a learning algorithm as an argument (the base learner) and create a new learning algorithm
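scikit-learn follows the same meta-learner pattern as WEKA: an ensemble class takes a base estimator as an argument and itself behaves like a new learning algorithm. A minimal sketch (scikit-learn shown for illustration; the slides use WEKA):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The meta-learner wraps a base learner (here, a decision tree) and
# exposes the usual fit/predict interface of a learning algorithm.
meta = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25)
meta.fit(X, y)
print(meta.predict(X[:5]))
```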
Bagging
Create ensembles by repeatedly, randomly resampling the training data (Breiman, 1996)
Given a training set of size n, create m samples of size n by drawing n examples from the original data, with replacement
Combine the m resulting models using a simple majority vote
Decreases error by decreasing the variance in the results due to unstable learners: algorithms (like decision trees) whose output can change dramatically when the training data is slightly changed
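A from-scratch sketch of this procedure (my illustration, not code from the slides), using decision trees as the unstable base learner:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, m=10, seed=0):
    """Fit m trees, each on a bootstrap sample of size n drawn with replacement.
    X and y are NumPy arrays."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(m):
        idx = rng.choice(n, size=n, replace=True)            # bootstrap sample indices
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the m models using a simple majority vote."""
    votes = np.array([model.predict(X) for model in models])  # shape (m, n_test)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```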
The Problem with Single Decision Trees
(Figure: a single tree's predictions can change dramatically when the training data is changed only slightly.)
Bagging: Bootstrap Aggregating
(Figure sequence: draw bootstrap samples, fit a tree to each, then, in step 3, average the predictions.)
As we add more trees, our average prediction error reduces
Bagging
Sampling with replacement from the training data (table entries are data IDs):
Original Data       1  2  3  4  5  6  7  8  9  10
Bagging (Round 1)   7  8  10 8  2  5  10 10 5  9
Bagging (Round 2)   1  4  9  1  2  3  2  7  3  2
Bagging (Round 3)   1  8  5  10 5  5  9  6  3  7
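Rounds like these can be generated with a single call; a small sketch (illustrative, not the slides' code):

```python
import numpy as np

rng = np.random.default_rng(1)
ids = np.arange(1, 11)                                  # original data IDs 1..10
for r in range(1, 4):
    sample = rng.choice(ids, size=10, replace=True)     # sampling with replacement
    print(f"Bagging (Round {r}):", sample)
```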
The 0.632 bootstrap
This method is also called the 0.632 bootstrap
◦ A particular training instance has a probability of 1 - 1/n of not being picked in a single draw
◦ Thus, its probability of ending up in the test data (never being selected in n draws) is:
$$\left(1 - \frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.368$$
◦ This means each bootstrap training sample will contain approximately 63.2% of the distinct original instances
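A quick empirical check of the 63.2% figure (an illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
sample = rng.choice(n, size=n, replace=True)            # one bootstrap sample
unique_fraction = len(np.unique(sample)) / n
print(f"analytic: {1 - (1 - 1/n)**n:.3f}, empirical: {unique_fraction:.3f}")   # both ~0.632
```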
Bagging Algorithm
Bagging Example
Consider a 1-dimensional data set:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Base classifier: a one-level decision stump that tests x ≤ k, predicting y_left on the True branch and y_right on the False branch
Bagging Example
Bagging Round 1:
x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1
Bagging Round 2:
x 0.1 0.2 0.3 0.4 0.5 0.5 0.9 1 1 1
y 1 1 1 -1 -1 -1 1 1 1 1
Bagging Round 3:
x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 -1 1 1
Bagging Round 4:
x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 -1 1 1
Bagging Round 5:
x 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1 1 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Bagging Example
Bagging Round 6:
x 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1
y 1 -1 -1 -1 -1 -1 -1 1 1 1
Bagging Round 7:
x 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1
y 1 -1 -1 -1 -1 1 1 1 1 1
Bagging Round 8:
x 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1
y 1 1 -1 -1 -1 -1 -1 1 1 1
Bagging Round 9:
x 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1 1
y 1 1 -1 -1 -1 -1 -1 1 1 1
Bagging Example
Summary of Training sets:
Round Split Point Left Class Right Class
1 0.35 1 -1
2 0.7 1 1
3 0.35 1 -1
4 0.3 1 -1
5 0.35 1 -1
6 0.75 -1 1
7 0.75 -1 1
8 0.75 -1 1
9 0.75 -1 1
10 0.05 1 1
Bagging Example
Assume the test set is the same as the original data
Use a majority vote to determine the class predicted by the ensemble classifier
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 1 1 1 -1 -1 -1 -1 -1 -1 -1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
4 1 1 1 -1 -1 -1 -1 -1 -1 -1
5 1 1 1 -1 -1 -1 -1 -1 -1 -1
6 -1 -1 -1 -1 -1 -1 -1 1 1 1
7 -1 -1 -1 -1 -1 -1 -1 1 1 1
8 -1 -1 -1 -1 -1 -1 -1 1 1 1
9 -1 -1 -1 -1 -1 -1 -1 1 1 1
10 1 1 1 1 1 1 1 1 1 1
Sum 2 2 2 -6 -6 -6 -6 2 2 2
Predicted class (sign) 1 1 1 -1 -1 -1 -1 1 1 1
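The vote can be reproduced directly from the stumps in the summary table; a sketch (illustrative):

```python
import numpy as np

# (split point, left class, right class) for each of the 10 bagging rounds
stumps = [(0.35, 1, -1), (0.7, 1, 1), (0.35, 1, -1), (0.3, 1, -1), (0.35, 1, -1),
          (0.75, -1, 1), (0.75, -1, 1), (0.75, -1, 1), (0.75, -1, 1), (0.05, 1, 1)]

x_test = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])

votes = np.array([np.where(x_test <= split, left, right) for split, left, right in stumps])
print(votes.sum(axis=0))            # 2 2 2 -6 -6 -6 -6 2 2 2
print(np.sign(votes.sum(axis=0)))   # ensemble prediction by majority vote
```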
Boosting
Originally developed by computational learning theorists to guarantee performance improvements on fitting the training data for a weak learner, i.e. one that only needs to generate a hypothesis with a training accuracy greater than 0.5 (Schapire, 1990)
Revised into a practical algorithm, AdaBoost, for building ensembles that empirically improves generalization performance (Freund & Schapire, 1996)
Examples are given weights. At each iteration, a new hypothesis is learned and the examples are reweighted to focus the system on the examples that the most recently learned classifier got wrong
Learning with Weighted Examples
The generic approach is to replicate examples in the training set in proportion to their weights (e.g. 10 replicates of an example with a weight of 0.01 and 100 for one with a weight of 0.1)
Most algorithms can be enhanced to incorporate weights directly and efficiently in the learning algorithm so that the effect is the same (e.g. implement the WeightedInstancesHandler interface in WEKA)
For decision trees, when calculating information gain, count example i by incrementing the corresponding count by wi rather than by 1
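A small illustration of weighted counting for entropy, the quantity behind information gain (my sketch, not the slides'):

```python
import numpy as np
from collections import defaultdict

def weighted_entropy(labels, weights):
    """Entropy where each example contributes its weight w_i instead of a count of 1."""
    totals = defaultdict(float)
    for y, w in zip(labels, weights):
        totals[y] += w
    p = np.array(list(totals.values())) / sum(totals.values())
    return float(-(p * np.log2(p)).sum())

labels  = [1, 1, 1, -1, -1]
uniform = [1.0, 1.0, 1.0, 1.0, 1.0]        # unweighted counting: 3 vs 2
skewed  = [0.05, 0.05, 0.05, 0.4, 0.45]    # boosting has up-weighted the -1 examples
print(weighted_entropy(labels, uniform), weighted_entropy(labels, skewed))
```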
Boosting
Records that are wrongly classified will have their weights increased
Records that are classified correctly will have their weights decreased
The table entries are data IDs; each round samples records in proportion to their current weights:
Original Data       1  2  3  4  5  6  7  8  9  10
Boosting (Round 1)  7  3  2  8  7  9  4  10 6  3
Boosting (Round 2)  5  4  9  4  2  5  1  7  4  2
Boosting (Round 3)  4  4  8  10 4  5  4  6  3  4
Boosting: Basic Algorithm
General Loop:
Set all examples to have equal uniform weights
For t from 1 to T do:
Learn a hypothesis, ht, from the weighted examples
Decrease the weights of examples ht classifies
correctly
Base (weak) learner must focus on correctly classifying
the most highly weighted examples while strongly
avoiding over-fitting
During testing, each of the T hypotheses gets a weighted vote proportional to its accuracy on the training data
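A minimal sketch of this loop (my illustration; AdaBoost, described next, pins down the specific weight and vote formulas used here):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, T=10):
    """Generic boosting loop for labels in {-1, +1}; X, y are NumPy arrays."""
    n = len(X)
    w = np.full(n, 1.0 / n)                               # equal uniform weights
    hypotheses, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = w[pred != y].sum()                          # weighted training error
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))   # AdaBoost-style vote weight
        w *= np.exp(-alpha * y * pred)                    # shrink correct, grow mistakes
        w /= w.sum()                                      # renormalize
        hypotheses.append(h)
        alphas.append(alpha)
    return hypotheses, np.array(alphas)

def boost_predict(hypotheses, alphas, X):
    """Weighted vote of the T hypotheses."""
    return np.sign(sum(a * h.predict(X) for h, a in zip(hypotheses, alphas)))
```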
Types of Boosting Algorithms
The underlying engine used for boosting can be almost anything: a decision stump, a margin-maximizing classification algorithm, etc. There are many boosting algorithms, including:
◦ AdaBoost (Adaptive Boosting)
◦ Gradient Tree Boosting
◦ GentleBoost
◦ LPBoost
◦ BrownBoost
◦ XGBoost
◦ CatBoost
◦ LightGBM
AdaBoost (Adaptive Boosting)
It works in the way discussed above: it fits a sequence of weak learners on differently weighted versions of the training data
It starts by giving every observation in the original data set an equal weight. Observations that the first learner predicts incorrectly are then given a higher weight. Being an iterative process, it continues to add learners until a limit on the number of models or on accuracy is reached
Mostly, we use decision stumps with AdaBoost, but any machine learning algorithm can be used as the base learner if it accepts weights on the training data set
We can use AdaBoost for both classification and regression problems
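In scikit-learn (shown for illustration; the slides use WEKA), the AdaBoost meta-learner is typically paired with depth-1 decision trees, i.e. decision stumps:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Decision stumps as the weak base learner, 50 boosting rounds.
stump = DecisionTreeClassifier(max_depth=1)
ada = AdaBoostClassifier(stump, n_estimators=50, random_state=0)
print(cross_val_score(ada, X, y, cv=5).mean())
```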
AdaBoost Algorithm
Example: Error and Classifier Weight in AdaBoost
Base classifiers: C1, C2, …, CT
Importance (vote weight) of classifier Ci, where εi is its weighted training error:
$$\alpha_i = \frac{1}{2}\ln\!\left(\frac{1-\varepsilon_i}{\varepsilon_i}\right)$$
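The vote weight grows rapidly as the weighted error drops below 0.5 (and is 0 at ε = 0.5); a small numerical sketch:

```python
import numpy as np

def classifier_weight(eps):
    """AdaBoost importance: alpha_i = 0.5 * ln((1 - eps_i) / eps_i)."""
    return 0.5 * np.log((1 - eps) / eps)

for eps in [0.45, 0.35, 0.1, 0.01]:
    print(f"eps = {eps:<4}: alpha = {classifier_weight(eps):.3f}")
```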
Example: Data Instance Weight in AdaBoost
Assume: N training examples (xj, yj) in D, T rounds, and Ci, αi are the classifier and its weight in the ith round, respectively
Weight update on all training data in D:
1. $$w_j^{(i+1)} = \frac{w_j^{(i)}}{Z_i} \times \begin{cases} e^{-\alpha_i} & \text{if } C_i(x_j) = y_j \\ e^{\alpha_i} & \text{if } C_i(x_j) \neq y_j \end{cases}$$
   where $Z_i$ is the normalization factor (chosen so that the updated weights sum to 1)
2. Equivalently, with the normalization carried out: $$w_j^{(i+1)} = \begin{cases} \dfrac{w_j^{(i)}}{2(1-\varepsilon_i)} & \text{if } C_i(x_j) = y_j \\[4pt] \dfrac{w_j^{(i)}}{2\,\varepsilon_i} & \text{if } C_i(x_j) \neq y_j \end{cases}$$
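A sketch of one round of this update for labels in {-1, +1}, using round 1 of the AdaBoost example that follows (α = 1.738, errors at x = 0.1, 0.2, 0.3):

```python
import numpy as np

def update_weights(w, y_true, y_pred, alpha):
    """One AdaBoost weight update: shrink correct examples, grow mistakes, renormalize."""
    w_new = w * np.exp(-alpha * y_true * y_pred)   # e^{-alpha} if correct, e^{+alpha} if wrong
    return w_new / w_new.sum()                     # divide by the normalization factor Z_i

w      = np.full(10, 0.1)
y_true = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])
y_pred = np.array([-1, -1, -1, -1, -1, -1, -1, 1, 1, 1])   # round-1 stump predictions
print(np.round(update_weights(w, y_true, y_pred, alpha=1.738), 3))
# misclassified points get weight ~0.311, the rest ~0.01 (cf. the weights table below)
```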
AdaBoost Example
Consider the same 1-dimensional data set as in the bagging example:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Base classifier: a one-level decision stump that tests x ≤ k, predicting y_left on the True branch and y_right on the False branch
AdaBoost Example
Training sets for the first 3 boosting rounds:
Boosting Round 1:
x 0.1 0.4 0.5 0.6 0.6 0.7 0.7 0.7 0.8 1
y 1 -1 -1 -1 -1 -1 -1 -1 1 1
Boosting Round 2:
x 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3
y 1 1 1 1 1 1 1 1 1 1
Boosting Round 3:
x 0.2 0.2 0.4 0.4 0.4 0.4 0.5 0.6 0.6 0.7
y 1 1 -1 -1 -1 -1 -1 -1 -1 -1
Summary:
Round Split Point Left Class Right Class alpha
1 0.75 -1 1 1.738
2 0.05 1 1 2.7784
3 0.3 1 -1 4.1195
AdaBoost Example
Weights
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
2 0.311 0.311 0.311 0.01 0.01 0.01 0.01 0.01 0.01 0.01
3 0.029 0.029 0.029 0.228 0.228 0.228 0.228 0.009 0.009 0.009
Classification
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 -1 -1 -1 -1 -1 -1 -1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
Sum 5.16 5.16 5.16 -3.08 -3.08 -3.08 -3.08 0.397 0.397 0.397
Predicted class (sign) 1 1 1 -1 -1 -1 -1 1 1 1
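The Sum row is the α-weighted vote of the three stumps; a sketch that reproduces it from the boosting summary above (illustrative):

```python
import numpy as np

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
# (split point, left class, right class, alpha) from the boosting summary
stumps = [(0.75, -1, 1, 1.738), (0.05, 1, 1, 2.7784), (0.3, 1, -1, 4.1195)]

weighted_sum = sum(alpha * np.where(x <= split, left, right)
                   for split, left, right, alpha in stumps)
print(np.round(weighted_sum, 2))   # 5.16 ... -3.08 ... 0.4
print(np.sign(weighted_sum))       # final AdaBoost prediction: 1 1 1 -1 -1 -1 -1 1 1 1
```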