ML8 Ensembles

The document discusses machine learning ensembles, focusing on methods like bagging and boosting to improve model accuracy by combining multiple classifiers. It explains how ensembles can reduce error rates by averaging predictions and emphasizes the importance of diverse and independent classifiers. Techniques such as resampling training data in bagging and adjusting weights in boosting are highlighted as effective strategies for enhancing learning performance.


MACHINE LEARNING

ENSEMBLES

Slides adjusted from Raymond J. Mooney, University of Texas at Austin


Learning Ensembles
Learn multiple alternative definitions of a concept using
different training data or different learning algorithms
Combine decisions of multiple definitions, e.g. using
weighted voting

[Diagram: the Training Data is resampled into Data1, Data2, ..., Data m;
Learner1, Learner2, ..., Learner m each produce Model1, Model2, ..., Model m;
a Model Combiner merges these into the Final Model]


2
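To make the "Model Combiner" box concrete, here is a minimal sketch (not from the slides) of weighted voting over already-trained models that expose a predict method; the model objects and their weights are assumed to come from elsewhere.

import numpy as np

def weighted_vote(models, weights, X):
    # Each already-trained model predicts a label for every row of X;
    # per row, the label with the largest total weight wins.
    preds = np.array([m.predict(X) for m in models])      # shape: (n_models, n_rows)
    labels = np.unique(preds)
    scores = np.array([
        np.sum([w * (p == lab) for w, p in zip(weights, preds)], axis=0)
        for lab in labels
    ])                                                     # shape: (n_labels, n_rows)
    return labels[np.argmax(scores, axis=0)]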
Value of Ensembles
When combining multiple independent and
diverse decisions, each of which is at least
more accurate than random guessing,
random errors cancel each other out and
correct decisions are reinforced
Human ensembles are demonstrably better
◦ How many jelly beans in the jar?
https://www.youtube.com/watch?v=iOucwX7Z1HU&t=103s :
Individual estimates vs. group average
◦ Who Wants to be a Millionaire: Expert friend vs.
audience vote

3
Why does it work?
Suppose there are 25 base classifiers
◦ Each classifier has error rate ε = 0.35
◦ Assume classifiers are independent
◦ Probability that the ensemble classifier makes
a wrong prediction is the probability that 13
or more classifiers err:

$$\sum_{i=13}^{25} \binom{25}{i}\,\varepsilon^{i}\,(1-\varepsilon)^{25-i} \approx 0.06$$

This value approaches 0 as the number of classifiers
increases. This is just the law of large numbers, but in this
context it is sometimes called “Condorcet’s Jury Theorem”
4
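To check the number above, here is a minimal sketch (assuming SciPy is available) that evaluates the same binomial tail for 25 independent base classifiers with ε = 0.35:

from scipy.stats import binom

eps, n = 0.35, 25
# The ensemble errs when 13 or more of the 25 base classifiers err.
p_error = 1 - binom.cdf(12, n, eps)
print(round(p_error, 3))  # ≈ 0.06, matching the slide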
Homogenous Ensembles
Use a single, arbitrary learning algorithm but
manipulate training data to make it learn
multiple models
◦ Data1  Data2  …  Data m
◦ Learner1 = Learner2 = … = Learner m
Different methods for changing training data:
◦ Bagging: Resample training data
◦ Boosting: Reweight training data
◦ DECORATE: Add additional artificial training data
In WEKA, these are called meta-learners: they
take a learning algorithm as an argument (base
learner) and create a new learning algorithm

5
Bagging
Create ensembles by repeatedly randomly resampling
the training data (Breiman, 1996)
Given a training set of size n, create m samples of size n
by drawing n examples from the original data, with
replacement
Combine the m resulting models using simple majority
vote
Decreases error by decreasing the variance in the
results due to unstable learners, algorithms (like
decision trees) whose output can change dramatically
when the training data is slightly changed

6
The Problem with Single Decision Trees

7
Bagging : Bootstrap Aggregating

1. Sample records with replacement (aka "bootstrap"
the training data)
Sampling is the process of selecting a subset of items
from a vast collection of items
Bootstrap = sampling with replacement. It means a data
point in a drawn sample can reappear in future drawn
samples as well

8
Bagging : Bootstrap Aggregating

2. Fit an overgrown tree to each resampled data set

3. Average predictions

9
Bagging : Bootstrap Aggregating
As we add more trees... our average prediction error reduces

10
Bagging
Sampling with replacement
Training Data
Data ID
Original Data      1  2  3   4  5  6  7   8   9  10
Bagging (Round 1)  7  8  10  8  2  5  10  10  5  9
Bagging (Round 2)  1  4  9   1  2  3  2   7   3  2
Bagging (Round 3)  1  8  5   10 5  5  9   6   3  7

Build classifier on each bootstrap sample


Each record has probability (1 – 1/n)^n of never being selected,
i.e., of ending up as out-of-bag "test" data
Each bootstrap training sample therefore contains about
1 – (1 – 1/n)^n of the distinct original records

11
The 0.632 bootstrap
This method is also called the 0.632 bootstrap
◦ A particular training example has a probability of
1 - 1/n of not being picked in a single draw
◦ Thus, its probability of ending up in the test
data (never selected in n draws) is:

$$\left(1 - \frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.368$$
◦ This means the training data will contain
approximately 63.2% of the instances

12
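A minimal sketch (not from the slides) that checks the 0.632 figure both analytically and by simulating one bootstrap sample with NumPy; n = 1000 is an arbitrary choice:

import numpy as np

n = 1000
analytic = 1 - (1 - 1/n) ** n            # fraction of distinct records in a bootstrap sample
rng = np.random.default_rng(0)
sample = rng.integers(0, n, size=n)      # one bootstrap sample (with replacement)
empirical = len(np.unique(sample)) / n
print(analytic, empirical)               # both ≈ 0.632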
Bagging Algorithm
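The algorithm figure for this slide is not reproduced here; the following is a minimal sketch of bagging under the assumptions stated above (m bootstrap samples, one base model per sample, simple majority vote over ±1 labels). The function and variable names are illustrative, not from the slides.

import numpy as np

def bagging_fit(make_base_learner, X, y, m=10, seed=0):
    # Train m base models, each on a bootstrap sample of (X, y).
    # e.g. (hypothetical usage): models = bagging_fit(lambda: SomeClassifier(), X, y)
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(m):
        idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
        model = make_base_learner()
        model.fit(X[idx], y[idx])
        models.append(model)
    return models

def bagging_predict(models, X):
    # Simple majority vote over the base models' predictions (labels +1/-1);
    # with an even number of models a tie (sum == 0) is possible.
    preds = np.array([m.predict(X) for m in models])
    return np.sign(preds.sum(axis=0))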
Bagging Example
Consider 1-dimensional data set:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1

Classifier is a decision stump


◦ Decision rule: x ≤ k versus x > k
◦ Split point k is chosen based on entropy

[Diagram: stump tests x ≤ k; the True branch predicts y_left, the False branch predicts y_right]

14
Bagging Example
Bagging Round 1:
x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1

15
Bagging Example
Bagging Round 1:
x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1

Bagging Round 2:
x 0.1 0.2 0.3 0.4 0.5 0.5 0.9 1 1 1
y 1 1 1 -1 -1 -1 1 1 1 1

Bagging Round 3:
x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 -1 1 1

Bagging Round 4:
x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 -1 1 1

Bagging Round 5:
x 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1 1 1
y 1 1 1 -1 -1 -1 -1 1 1 1

16
Bagging Example
Bagging Round 6:
x 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1
y 1 -1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 7:
x 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1
y 1 -1 -1 -1 -1 1 1 1 1 1

Bagging Round 8:
x 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1
y 1 1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 9:
x 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1 1
y 1 1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 10:


x 0.1 0.1 0.1 0.1 0.3 0.3 0.8 0.8 0.9 0.9
y 1 1 1 1 1 1 1 1 1 1

17
Bagging Example
Summary of Training sets:
Round Split Point Left Class Right Class
1 0.35 1 -1
2 0.7 1 1
3 0.35 1 -1
4 0.3 1 -1
5 0.35 1 -1
6 0.75 -1 1
7 0.75 -1 1
8 0.75 -1 1
9 0.75 -1 1
10 0.05 1 1

18
Bagging Example
Assume test set is the same as the original data
Use majority vote to determine class of
ensemble classifier
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 1 1 1 -1 -1 -1 -1 -1 -1 -1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
4 1 1 1 -1 -1 -1 -1 -1 -1 -1
5 1 1 1 -1 -1 -1 -1 -1 -1 -1
6 -1 -1 -1 -1 -1 -1 -1 1 1 1
7 -1 -1 -1 -1 -1 -1 -1 1 1 1
8 -1 -1 -1 -1 -1 -1 -1 1 1 1
9 -1 -1 -1 -1 -1 -1 -1 1 1 1
10 1 1 1 1 1 1 1 1 1 1
Sum                     2   2   2  -6  -6  -6  -6   2   2   2
Predicted Class (Sign)  1   1   1  -1  -1  -1  -1   1   1   1

19
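For readers who want to reproduce this kind of experiment, here is a minimal sketch using scikit-learn decision stumps (DecisionTreeClassifier with max_depth=1) on the 1-dimensional data set above. Because the bootstrap samples are random, the individual stumps will generally differ from the ten rounds shown, and the combined prediction may differ slightly from the table above.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

rng = np.random.default_rng(0)
stumps = []
for _ in range(10):                                  # 10 bagging rounds
    idx = rng.integers(0, len(X), size=len(X))       # bootstrap sample
    stumps.append(DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx]))

votes = np.array([s.predict(X) for s in stumps]).sum(axis=0)
print(np.sign(votes))                                # majority-vote prediction per x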
Boosting
Originally developed by computational learning
theorists to guarantee performance improvements on
fitting training data for a weak learner that only needs
to generate a hypothesis with a training accuracy
greater than 0.5 (Schapire, 1990)
Revised to be a practical algorithm, AdaBoost, for
building ensembles that empirically improves
generalization performance (Freund & Schapire, 1996)
Examples are given weights. At each iteration, a new
hypothesis is learned and the examples are reweighted
to focus the system on examples that the most recently
learned classifier got wrong

22
Learning with Weighted Examples
Generic approach is to replicate examples in
the training set proportional to their weights
(e.g. 10 replicates of an example with a weight
of 0.01 and 100 for one with weight 0.1)
Most algorithms can be enhanced to efficiently
incorporate weights directly in the learning
algorithm so that the effect is the same (e.g.
implement the WeightedInstancesHandler
interface in WEKA).
For decision trees, when calculating information
gain, count example i by incrementing
the corresponding count by w_i rather than by 1

23
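A minimal sketch (illustrative, not WEKA's actual implementation) of the weighted-count idea: the class "counts" used in the entropy computation are accumulated as sums of example weights instead of sums of ones.

import numpy as np

def weighted_entropy(labels, weights):
    # Class "counts" are sums of example weights rather than raw counts.
    total = weights.sum()
    probs = np.array([weights[labels == c].sum() / total for c in np.unique(labels)])
    return -(probs * np.log2(probs)).sum()

y = np.array([1, 1, -1, -1, -1])
w = np.array([0.1, 0.1, 0.4, 0.2, 0.2])   # hypothetical example weights
print(weighted_entropy(y, w))              # entropy of the weighted class distribution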
Boosting
Records that are wrongly classified will have
their weights increased
Records that are classified correctly will have
their weights decreased
Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4

Example 4 is hard to classify


Its weight is increased, therefore it is more
likely to be chosen again in subsequent rounds

24
Boosting: Basic Algorithm
General Loop:
Set all examples to have equal uniform weights
For t from 1 to T do:
Learn a hypothesis, ht, from the weighted examples
Decrease the weights of examples ht classifies
correctly
Base (weak) learner must focus on correctly classifying
the most highly weighted examples while strongly
avoiding over-fitting
During testing, each of the T hypotheses gets a weighted
vote proportional to its accuracy on the training
data

25
Types of Boosting Algorithms
The underlying engine used for boosting algorithms
can be anything: a decision stump, a margin-
maximizing classification algorithm, etc. There are
many boosting algorithms which use these and other
types of engines, such as:
◦ AdaBoost (Adaptive Boosting)
◦ Gradient Tree Boosting
◦ GentleBoost
◦ LPBoost
◦ BrownBoost
◦ XGBoost
◦ CatBoost
◦ LightGBM

26
AdaBoost (Adaptive Boosting)
It works in a similar way to the method discussed above. It fits a
sequence of weak learners on repeatedly
re-weighted training data.
It starts by predicting on the original data set, giving equal weight
to each observation. If the first learner predicts an
observation incorrectly, that observation is given a higher
weight. Being an iterative process, it
continues to add learners until a limit is reached on the
number of models or the accuracy
Mostly, we use decision stumps with AdaBoost, but we can
use any machine learning algorithm as the base learner if
it accepts weights on the training data set
We can use AdaBoost algorithms for both classification and
regression problems

27
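As a quick practical illustration, here is a minimal sketch assuming scikit-learn is installed; AdaBoostClassifier boosts a depth-1 decision tree (a decision stump) as its default base learner, applied here to the 1-dimensional data set used in these slides.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier

X = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

# Boost decision stumps for a few rounds.
clf = AdaBoostClassifier(n_estimators=10).fit(X, y)
print(clf.predict(X))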
AdaBoost Algorithm

28
Example: Error and Classifier Weight
in AdaBoost
Base classifiers: C_1, C_2, …, C_T

Error rate (i = index of classifier, j = index of instance):

$$\varepsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \,\delta\!\left(C_i(x_j) \ne y_j\right)$$

Importance of a classifier:

$$\alpha_i = \frac{1}{2} \ln\!\left(\frac{1 - \varepsilon_i}{\varepsilon_i}\right)$$
30
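For instance (using numbers from the 1-dimensional example on the following slides): round 1 starts with all ten weights equal to 0.1 and the first stump misclassifies three examples, so ε_1 = (1/10)(3 × 0.1) = 0.03 and α_1 = ½ ln(0.97/0.03) ≈ 1.738, which matches the alpha reported for round 1 in the summary table. A classifier with ε_i = 0.5, i.e. no better than random guessing, would receive α_i = 0.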
Example: Data Instance Weight in
AdaBoost
Assume: N training data in D, T rounds, (x_j, y_j)
are the training data, and C_i, α_i are the classifier and
weight of the i-th round, respectively
Weight update on all training data in D:
1.
$$w_j^{(i+1)} = \frac{w_j^{(i)}}{Z_i} \times \begin{cases} \exp(-\alpha_i) & \text{if } C_i(x_j) = y_j \\ \exp(\alpha_i) & \text{if } C_i(x_j) \ne y_j \end{cases}$$
where Z_i is the normalization factor
2. Rescale the weights of all the examples so the total
sum weight remains 1

31
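Putting the two preceding slides together, here is a minimal sketch (illustrative names, ±1 labels) of one AdaBoost round: weighted error, classifier importance α, exponential weight update, and renormalization. It reproduces the round-1 numbers of the example that follows.

import numpy as np

def adaboost_round(w, y_true, y_pred):
    # One AdaBoost round following the formulas on the two preceding slides;
    # w are the current example weights (summing to 1).
    N = len(w)
    miss = (y_pred != y_true)
    eps = np.sum(w * miss) / N                       # error rate, as on slide 30
    alpha = 0.5 * np.log((1 - eps) / eps)            # importance of the classifier
    w_new = w * np.exp(np.where(miss, alpha, -alpha))
    w_new /= w_new.sum()                             # rescale so the weights sum to 1
    return w_new, alpha

# Round 1 of the example that follows: uniform weights, the first stump
# (split at 0.75, left class -1, right class 1) misclassifies x = 0.1, 0.2, 0.3.
w = np.full(10, 0.1)
y_true = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])
y_pred = np.array([-1, -1, -1, -1, -1, -1, -1, 1, 1, 1])
w_new, alpha = adaboost_round(w, y_true, y_pred)
print(round(alpha, 3))        # ≈ 1.738, as in the round summary
print(np.round(w_new, 3))     # misclassified examples ≈ 0.311, the rest ≈ 0.01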
AdaBoost Example
Consider 1-dimensional data set:

x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1


y 1 1 1 -1 -1 -1 -1 1 1 1

Classifier is a decision stump


◦ Decision rule: x ≤ k versus x > k
◦ Split point k is chosen based on entropy

[Diagram: stump tests x ≤ k; the True branch predicts y_left, the False branch predicts y_right]

32
AdaBoost Example
Training sets for the first 3 boosting rounds:
Boosting Round 1:
x 0.1 0.4 0.5 0.6 0.6 0.7 0.7 0.7 0.8 1
y 1 -1 -1 -1 -1 -1 -1 -1 1 1

Boosting Round 2:
x 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3
y 1 1 1 1 1 1 1 1 1 1

Boosting Round 3:
x 0.2 0.2 0.4 0.4 0.4 0.4 0.5 0.6 0.6 0.7
y 1 1 -1 -1 -1 -1 -1 -1 -1 -1
Summary:
Round Split Point Left Class Right Class alpha
1 0.75 -1 1 1.738
2 0.05 1 1 2.7784
3 0.3 1 -1 4.1195

33
AdaBoost Example
Weights
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
2 0.311 0.311 0.311 0.01 0.01 0.01 0.01 0.01 0.01 0.01
3 0.029 0.029 0.029 0.228 0.228 0.228 0.228 0.009 0.009 0.009

Classification

Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 -1 -1 -1 -1 -1 -1 -1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
Sum 5.16 5.16 5.16 -3.08 -3.08 -3.08 -3.08 0.397 0.397 0.397
Predicted Class (Sign)  1  1  1  -1  -1  -1  -1  1  1  1

34
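To verify the final row, here is a minimal sketch that combines the three boosted stumps using the weighted sum sign(Σ_i α_i C_i(x)); the split points, left/right classes, and α values are taken directly from the round summary above.

import numpy as np

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
alphas = [1.738, 2.7784, 4.1195]
# (split point, left class, right class) for rounds 1-3, from the summary table
stumps = [(0.75, -1, 1), (0.05, 1, 1), (0.3, 1, -1)]

def stump_predict(k, left, right, x):
    return np.where(x <= k, left, right)

weighted_sum = sum(a * stump_predict(k, l, r, x)
                   for a, (k, l, r) in zip(alphas, stumps))
print(np.round(weighted_sum, 2))   # 5.16 5.16 5.16 -3.08 ... 0.4 0.4 0.4
print(np.sign(weighted_sum))       # the ensemble's predicted class for each x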
