
Lecture 10 Ensemble Methods

This document provides an overview of ensemble learning techniques, including bagging, forests, and boosting. It discusses how combining multiple weak learners can create a stronger learner by exploiting the "wisdom of crowds" approach. Bagging trains models in parallel on bootstrap samples of the data. Forests extend bagging by also selecting random features at each node. Boosting trains models sequentially, increasing the weight of misclassified examples at each step. The goal of these ensemble methods is to combine diverse and independent weak learners to reduce variance and error.


Bagging

Forests
Boosting
Dropout

Lecture 10: Ensemble Learning


DD2421

Atsuto Maki

Autumn, 2020



Background: Ensemble Learning

We will describe and investigate algorithms that train weak
classifiers/regressors and combine them to construct a
classifier/regressor more powerful than any of the individual ones.

Such approaches are called ensemble learning, committee machines, etc.


Outline of the lecture

Wisdom of Crowds – why combine classifiers?

Bagging: static structure, parallel training


Forests: an extension of bagging
Boosting: static structure, serial training
(Example: face detection)
The Wisdom of Crowds

Crowd wiser than any individual


The collective knowledge of a diverse and independent
body of people typically exceeds the knowledge of any
single individual and can be harnessed by voting.

When? For which questions?

See The Wisdom of Crowds by James Surowiecki (2004) for this idea
applied to business.
What makes a crowd wise?

Four elements are required to form a wise crowd (J. Surowiecki):

Diversity of opinion. People in the crowd should have a
range of experiences, education and opinions.
Independence. A person's prediction is not influenced by
other people in the crowd.
Decentralization. People have specializations and
local knowledge.
Aggregation. There is a mechanism for aggregating
all predictions into one single prediction.

Combining classifiers

We will exploit Wisdom of Crowds ideas for specific tasks by
combining classifier predictions, aiming to combine
independent and diverse classifiers.

But we will use labelled training data

to identify the expert classifiers in the pool;
to identify complementary classifiers;
to indicate how best to combine them.


Ensemble method: Bagging
Bootstrap Aggregating
Use bootstrap replicates of the training set, created by
sampling with replacement.
On each replicate learn one model; then combine them all.
Binary classification example

(Figure: the true decision boundary and a training data set Si.)

Estimate the true decision boundary with a decision tree


trained from some labeled training set Si .
High-variance, low-bias classifiers

E.g. decision trees

High variance classifiers produce differing decision


boundaries which are highly dependent on the training
data.

Low bias classifiers produce decision boundaries which on


average are good approximations to the true decision
boundary.

Ensemble predictions using diverse high-variance, low-bias


classifiers reduce the variance of the ensemble classifier.
Estimated decision boundaries found using bootstrap
replicates:

(Figure: ten differing decision boundaries estimated from bootstrap
replicates S1, ..., S10, illustrating the instability of decision trees.)
Bagging - Bootstrap Aggregating

Input: Training data

S = {(x1, y1), . . . , (xm, ym)}

of inputs xj ∈ Rd and their labels or real values yj.

Iterate: for b = 1, . . . , B
1 Sample training examples, with replacement,
m′ times from S to create Sb (m′ ≤ m).
2 Use this bootstrap sample Sb to estimate the
regression or classification function fb.
Output: The bagging estimate for

Classification:

fbag(x) = argmax_{1≤k≤K} Σ_{b=1}^{B} Ind(fb(x) = k)

Regression:

fbag(x) = (1/B) Σ_{b=1}^{B} fb(x)

Note: Ind(x) = 1 if x = TRUE, otherwise Ind(x) = 0.
Bagging is a procedure to reduce the variance of our
classifier (when labelled training data is limited).

→ A powerful way of controlling overfitting.

Note: it only produces good results for high-variance, low-bias
classifiers.
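
As a concrete illustration (not part of the lecture), the following Python sketch implements the bagging loop above with scikit-learn decision trees as the high-variance, low-bias base learners; the function names and the assumption of integer class labels are mine.

```python
# Minimal bagging sketch: B bootstrap samples, one tree per sample,
# prediction by majority vote (classification case).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, B=100, m_prime=None, seed=0):
    rng = np.random.default_rng(seed)
    m = len(X)
    m_prime = m if m_prime is None else m_prime      # m' <= m
    models = []
    for _ in range(B):
        idx = rng.integers(0, m, size=m_prime)       # sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.stack([f_b.predict(X) for f_b in models]).astype(int)  # (B, n)
    # majority vote over the B estimates f_b(x), i.e. the class with most Ind-counts
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```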
Apply bagging to the original example

(Figure: ground truth, training set S1, the classifier trained from S1,
and the bagged classifier with B = 100.)

If we bag a high-bias, low-variance classifier - oriented
horizontal and vertical lines - we don't get any benefit.
A cartoon depiction of how bagging works

We train an '8' detector on a dataset containing the digits {6, 8, 9}.

Construct two different resampled datasets, each by
sampling with replacement.
A cartoon depiction of how bagging works
(cont.)

The first resampled dataset omits the 9 and repeats the 8.
→ The detector learns that a loop on top of the digit indicates an 8.
The second resampled dataset omits the 6 and repeats the 9.
→ The detector learns that a loop on the bottom indicates an 8.
Each rule is brittle.
If we average their output, then the detector is robust;
maximal confidence only when both loops of the 8 are
present.
See Chapter 7.11 of Deep Learning (2016) by Ian Goodfellow et al.,
www.deeplearningbook.org.
Ensemble method: Forest
Decision/Random/Randomized Forest

Bagging + Random feature selection at each node


Decision Forests / Random Forests

Two kinds of randomness are involved:

- Sampling training data (the same as in Bagging)
- Feature selection at each node

The trees are less correlated, i.e. there is even higher variance
(diversity) between the weak learners.

A classifier well suited to multi-class problems.
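
For completeness, a brief sketch (not from the slides) of how this looks with scikit-learn, where n_estimators controls the number of bagged trees and max_features the size of the random feature subset tried at each node; the toy dataset is only for illustration.

```python
# Random forest = bagging over trees + a random feature subset at each node.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
forest = RandomForestClassifier(
    n_estimators=200,        # number of bootstrap-trained trees (as in bagging)
    max_features="sqrt",     # random subset of features considered at each split
    random_state=0,
)
forest.fit(X, y)
print(forest.score(X, y))    # training accuracy of the ensemble
```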


Ensemble method: Boosting

Started from a question:


Can a set of weak learners create a single strong
classifier, where a weak learner performs only
slightly better than chance? (Kearns, 1988)

Loop:
- Apply learner to weighted samples
- Increase weights of misclassified examples
Example: Ensemble Prediction
Voting of oriented hyper-planes can define convex regions.
Green region is the true boundary.

(Figure: a high-bias classifier vs. a low-bias classifier.)

Low model complexity (small # of d.o.f.) ⇒ high bias
High model complexity (large # of d.o.f.) ⇒ low bias
Example (cont.)

A diverse and complementary set of high-bias classifiers,
with performance better than chance, combined by voting

fV(x) = sign( Σ_{t=1}^{T} ht(x) )

can produce a classifier with low bias.

ht ∈ H, where H is a family of possible weak classifier functions.


Ensemble Method: Boosting
Input: Training data S = {(x1, y1), . . . , (xm, ym)} of inputs
xi and their labels yi ∈ {−1, 1} or real values.
H: a family of possible weak classifier/regression functions.

Output: A strong classifier/regression function

fT(x) = sign( Σ_{t=1}^{T} αt ht(x) )   or   fT(x) = Σ_{t=1}^{T} αt ht(x)

a weighted sum of weak classifiers

ht ∈ H, t = 1, ..., T
αt: confidence/reliability

Ensemble Method: Boosting

Core ideas (we just consider the classification case here)


Performance of classifiers h1, . . . , ht helps define ht+1.
Remember: each ht ∈ H.
Maintain a weight wi^(t) for each training example in S.
Large wi^(t) ⇒ xi has greater influence on the choice of ht.
Iteration t: wi^(t) gets increased / decreased
if xi is wrongly / correctly classified by ht.
Binary classification example

(Figure: the true decision boundary and the training data.)

H is the set of all possible oriented vertical and horizontal


lines.
Example: boosting rounds

In each round we show the chosen weak classifier, the re-weighted
training points wi^(t+1), and the current strong classifier ft(x):

Round 1:  ε1 = 0.19,    α1 = 1.45
Round 2:  ε2 = 0.1512,  α2 = 1.725
Round 3:  ε3 = 0.2324,  α3 = 1.1946
Round 4:  ε4 = 0.2714,  α4 = 0.9874
Round 5:  ε5 = 0.2616,  α5 = 1.0375
Round 6:  ε6 = 0.2262,  α6 = 1.2298
Round 7:  ε7 = 0.2680,  α7 = 1.0049
Round 8:  ε8 = 0.3282,  α8 = 0.7165
Round 9:  ε9 = 0.3048,  α9 = 0.8246
Round 10: ε10 = 0.2943, α10 = 0.8744
Round 11: ε11 = 0.2876, α11 = 0.9071
...
Round 21: ε21 = 0.3491, α21 = 0.6232

(Figures omitted: the weak classifier, the re-weighted points and the
strong classifier at each round.)
Adaboost Algorithm (Freund & Schapire, 1997)

Given: Labeled training data

S = {(x1 , y1 ), . . . , (xm , ym )}

of inputs xj ∈ Rd and their labels


yj ∈ {−1, 1}.
A set/class H of T possible weak
classifiers.

Initialize: Introduce a weight, wj^(1), for each training sample.
Set wj^(1) = 1/m for each j.
Adaboost Algorithm (cont.)
Iterate: for t = 1, . . . , T
1 Train weak classifier ht ∈ H using S and w1^(t), . . . , wm^(t);
select the one that minimizes the training error:

εt = Σ_{j=1}^{m} wj^(t) Ind(yj ≠ ht(xj))

(sum of the weights for the misclassified samples)

2 Compute the reliability coefficient:

αt = loge((1 − εt) / εt)

εt must be less than 0.5. Break out of the loop if εt ≈ 0.5.

3 Update the weights using: wj^(t+1) = wj^(t) exp(−αt yj ht(xj))
Note the sign of yj ht(xj).
4 Normalize the weights so that they sum to 1.
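
To make the update rule concrete, here is a minimal Python sketch of the AdaBoost loop as written above (labels in {−1, +1}), using depth-1 decision trees as the weak classifiers; it is an illustration under those assumptions, not the lecture's reference code.

```python
# AdaBoost sketch following the steps above: weighted weak learner,
# error eps_t, reliability alpha_t, exponential re-weighting, normalization.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    m = len(X)
    w = np.full(m, 1.0 / m)                  # w_j^(1) = 1/m
    learners, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = np.sum(w * (pred != y))        # sum of weights of misclassified samples
        if eps >= 0.5 or eps == 0.0:         # eps_t must stay below 0.5; stop otherwise
            break
        alpha = np.log((1.0 - eps) / eps)    # reliability coefficient alpha_t
        w = w * np.exp(-alpha * y * pred)    # note the sign of y_j * h_t(x_j)
        w = w / w.sum()                      # normalize so the weights sum to 1
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    score = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.sign(score)                    # the strong classifier f_T(x)
```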
Dropout
A regularization method for Artificial Neural
Networks

(it can also be seen as an ensemble method)


Dropout: “ultimate” ensemble learning in ANN

Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov,
Dropout: A Simple Way to Prevent Neural Networks from
Overfitting. Journal of Machine Learning Research 15:
1929-1958, 2014.
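
A tiny numpy sketch of the idea (my own illustration of "inverted" dropout, not code from the paper): during training each unit is kept with probability p_keep and the surviving activations are rescaled, so at test time the unchanged full network approximates averaging over the ensemble of thinned networks.

```python
# Inverted dropout: train with a random binary mask, leave test time untouched.
import numpy as np

def dropout(activations, p_keep=0.5, training=True, seed=None):
    if not training:
        return activations                          # full network ~ ensemble average
    rng = np.random.default_rng(seed)
    mask = rng.random(activations.shape) < p_keep   # keep each unit with prob p_keep
    return activations * mask / p_keep              # rescale to preserve the expectation
```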
Summary
Summary: Ensemble Prediction

Can combine many weak classifiers/regressors into a


stronger classifier; voting, averaging, bagging
if weak classifiers/regressors are better than random.

if there is sufficient de-correlation (independence)


amongst the weak classifiers/regressors.
Summary: Ensemble Prediction & Learning

Can combine many (high-bias) weak classifiers/regressors


into a strong classifier; boosting
if weak classifiers/regressors are chosen and combined
using knowledge of how well they and others performed on
the task on training data.

The selection and combination encourages the weak


classifiers to be complementary, diverse and de-correlated.
Appendix

P. Viola, M. J. Jones, Robust real-time face detection.


International Journal of Computer Vision 57(2): 137-154, 2004.
Viola & Jones Face Detection

Most state-of-the-art face detectors on mobile phones,
digital cameras, etc. are/were based on this algorithm.
It is an example of a classifier constructed using the boosting
algorithm.
Viola & Jones: Training data

Positive training examples:


Image patches corresponding to
faces - (xi , 1).
Negative training examples:
Random image patches from
images not containing faces -
(xj , −1).

Note: All patches are re-scaled to have the same size.

(Figure: positive training examples.)
Viola & Jones: Weak classifier

Input: x  →  Apply filter: fj(x)  →  Output: h(x) = (fj(x) > θ), i.e. FACE or NON-FACE.

The filters used compute differences between sums of pixels in adjacent
rectangles. (These can be computed very quickly using integral images.)
Viola & Jones: Filters Considered
A huge “library” of possible Haar-like filters, f1, . . . , fn, with
n ≈ 16,000,000.
Viola & Jones: AdaBoost training

Recap: define weak classifier as


ht(x) =  1   if f_jt(x) > θt
        −1   otherwise

Use AdaBoost to efficiently choose the best weak


classifiers and to combine them.

Remember: a weak classifier corresponds to a filter type and a


threshold.
Viola & Jones: AdaBoost training (cont.)

For t = 1, . . . , T
  for each filter type j
    1 Apply filter f_j to each example.
    2 Sort the examples by their filter responses.
    3 Select the best threshold for this classifier: θ_tj.
    4 Keep a record of the error of this classifier: ε_tj.
  Select the filter-threshold combination (weak classifier j*)
  with minimum error. Then set jt = j*, εt = ε_tj* and θt = θ_tj*.
  Re-weight the examples according to the AdaBoost formulae.
Note: (There are many tricks to make this implementation more efficient.)
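
A hypothetical sketch of steps 1-4 for a single filter type j (the function name and the explicit sign/parity are my additions; the Viola-Jones paper includes a parity term that the slides omit): sort the filter responses once, then scan candidate thresholds and keep the one with the lowest weighted error.

```python
# Pick the best threshold for one filter, given the current AdaBoost weights w.
import numpy as np

def best_threshold(responses, y, w):
    """responses: f_j(x) for each example; y in {-1,+1}; w: current weights."""
    order = np.argsort(responses)                    # step 2: sort by filter response
    r, y, w = responses[order], y[order], w[order]
    best_err, best_theta, best_sign = np.inf, None, +1
    for i in range(len(r) - 1):
        theta = 0.5 * (r[i] + r[i + 1])              # candidate threshold between responses
        for sign in (+1, -1):                        # parity: which side counts as "face"
            pred = np.where(sign * r > sign * theta, 1, -1)
            err = np.sum(w * (pred != y))            # weighted error eps_tj
            if err < best_err:
                best_err, best_theta, best_sign = err, theta, sign
    return best_err, best_theta, best_sign
```

The quadratic scan above is only for clarity; with cumulative sums of the weights the same search runs in linear time after the sort, which is one of the efficiency tricks alluded to in the note.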
Viola & Jones: Sliding window

Remember: better classification rates are obtained if we use a
classifier fT with large T.

Given a new image, I, detect the faces in the image by:

for each plausible face size s
  for each possible patch centre c
    1 Extract the sub-patch of size s at c from I.
    2 Re-scale the patch to the size of the training patches.
    3 Apply the detector to the patch.
    4 Keep a record of s and c if the detector returns positive.

This is a lot of patches to be examined. If T is very large,
processing an image will be very slow!
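
A hypothetical sketch of the sliding-window loop above; the stride, the crude nearest-neighbour rescaling and the callable `detector` (the trained strong classifier f_T) are my own simplifications for illustration.

```python
# Slide a window of each plausible size over the image, rescale each patch
# to the training size and record positives.
import numpy as np

def _resize_nn(patch, size):
    """Crude nearest-neighbour rescale, standing in for a real image resize."""
    rows = np.linspace(0, patch.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, patch.shape[1] - 1, size).astype(int)
    return patch[np.ix_(rows, cols)]

def detect_faces(image, detector, sizes, stride=4, train_size=24):
    detections = []
    H, W = image.shape[:2]
    for s in sizes:                                  # each plausible face size
        for r in range(0, H - s + 1, stride):        # each possible patch position
            for c in range(0, W - s + 1, stride):
                patch = _resize_nn(image[r:r + s, c:c + s], train_size)
                if detector(patch) > 0:              # detector returns positive
                    detections.append((s, r, c))     # keep a record of size and position
    return detections
```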
Viola & Jones: Cascade of classifiers

But: only a tiny proportion of the patches will be faces, and
many of them will not look anything like a face.

Exploit this fact: introduce a cascade of increasingly
strong classifiers ("computational risk minimization").

(Figure: ROC curve, % detection vs. % false positives, and a cascade
diagram - an image sub-window passes through Classifier 1, Classifier 2,
Classifier 3, ...; a "false" output at any stage rejects it as NON-FACE,
and only sub-windows accepted by every stage are labelled FACE.)

Viola and Jones, Robust object detection using a boosted cascade of
simple features, CVPR 2001.
Viola & Jones: Cascade of classifiers

(Figure: cascaded classifier - an image sub-window passes through a
1-feature stage, a 5-feature stage and a 20-feature stage; the cumulative
false positive rates are 50%, 20% and 2%; rejected sub-windows are
labelled NON-FACE, accepted ones FACE.)

A 1-feature classifier achieves a 100% detection rate and
about a 50% false positive rate.
A 5-feature classifier achieves a 100% detection rate and a
40% false positive rate (20% cumulative) - using data from the
previous stage.
A 20-feature classifier achieves a 100% detection rate
with a 10% false positive rate (2% cumulative).
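
A hypothetical sketch of the cascade logic (the stage scoring functions and thresholds are placeholders): each stage is a small boosted classifier, and a sub-window is rejected as soon as any stage says NON-FACE, so the cheap early stages discard most patches and only promising ones reach the expensive later stages.

```python
# Cascade evaluation: early rejection keeps the average cost per patch low.
def cascade_classify(patch, stages, thresholds):
    """stages: scoring functions of increasing size (e.g. 1, 5, 20 features);
    thresholds: per-stage acceptance thresholds tuned for ~100% detection."""
    for stage, theta in zip(stages, thresholds):
        if stage(patch) <= theta:
            return False                  # NON-FACE: rejected, later stages never run
    return True                           # FACE: accepted by every stage
```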
Viola & Jones: Typical Results
