
Ensemble Learning

Hady W. Lauw
Photo by Felix Mittermeier from Pexels

IS712 Machine Learning


CLASSIFICATION AND REGRESSION TREES (CART)
Recursively partitioning the input space and defining a local model for each partition
Regression Tree
• Partition input space into regions
• Prediction is the mean response in each region
– Alternatively, fit a regression function locally

3
Classification Tree
• Partition input space into regions
• Prediction is the mode of class label distribution

4
Recursive Procedure to Grow a Tree

• Split function chooses the “best” feature j (among M features) and feature value t
(among the viable feature values of j) to split
$$(j^*, t^*) = \arg\min_{j \in \{1,\dots,M\}} \; \min_{t \in \mathcal{T}_j} \left[ \frac{|D_L|}{|D|}\, cost\big(D_L = \{(\boldsymbol{x}_i, y_i) : x_{ij} \le t\}\big) + \frac{|D_R|}{|D|}\, cost\big(D_R = \{(\boldsymbol{x}_i, y_i) : x_{ij} > t\}\big) \right]$$
5
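The greedy split search can be sketched as follows (a minimal illustration, assuming a NumPy feature matrix X, a response vector y, and a cost function such as the ones defined on the following slides; all names are hypothetical):

```python
import numpy as np

def best_split(X, y, cost):
    """Exhaustively search for the (feature, threshold) pair that minimizes the
    weighted cost of the two resulting partitions."""
    N, M = X.shape
    best = (None, None, np.inf)            # (j*, t*, weighted cost)
    for j in range(M):                     # every feature j in {1, ..., M}
        for t in np.unique(X[:, j]):       # every viable threshold for feature j
            left = X[:, j] <= t
            right = ~left
            if left.all() or right.all():  # skip degenerate splits
                continue
            weighted = (left.sum() / N) * cost(y[left]) + (right.sum() / N) * cost(y[right])
            if weighted < best[2]:
                best = (j, t, weighted)
    return best

# Example with the regression cost (sum of squared deviations from the mean):
# j_star, t_star, c = best_split(X, y, cost=lambda ys: ((ys - ys.mean()) ** 2).sum())
```

The double loop makes the exhaustive search over features and thresholds explicit; practical implementations typically sort each feature once and scan candidate thresholds incrementally.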
Is it worth splitting?

• Is there gain to be made from further splits?


– The distribution in each region may already be sufficiently homogeneous
– The gain may be too small
$$Gain = cost(D) - \left[ \frac{|D_L|}{|D|}\, cost\big(D_L = \{(\boldsymbol{x}_i, y_i) : x_{ij} \le t\}\big) + \frac{|D_R|}{|D|}\, cost\big(D_R = \{(\boldsymbol{x}_i, y_i) : x_{ij} > t\}\big) \right]$$

• Are there significant risks of overfitting?


– The tree may already be too deep
– The number of examples in a particular region may be too small

6
Regression Cost
• For a subset of data points D, quantify:
$$cost(D) = \sum_{i \in D} \big(y_i - f(\boldsymbol{x}_i, y_i)\big)^2$$

• In the simplest case, the prediction could just be the mean response
$$f(\boldsymbol{x}_i, y_i) = \frac{1}{|D|} \sum_{i \in D} y_i$$

• Alternatively, we can fit a regression function at each leaf node

7
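As a minimal sketch (the function name is assumed), the mean-prediction version of this cost is just the sum of squared deviations from the region mean:

```python
import numpy as np

def regression_cost(y):
    """Sum of squared deviations of the responses from their mean,
    i.e. the cost of predicting the mean response for this region."""
    return float(((y - y.mean()) ** 2).sum())

# regression_cost(np.array([1.0, 2.0, 3.0]))  # -> 2.0
```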
Classification Cost: Misclassification
• First, we estimate the class proportions within a subset of data points D:
$$\hat{\pi}_c = \frac{1}{|D|} \sum_{i \in D} \mathbf{1}(y_i = c)$$

• Predicted class is the mode of the class distribution


$$\hat{y} = \arg\max_{c \in \mathcal{C}} \hat{\pi}_c$$

• Misclassification rate
$$cost(D) = \frac{1}{|D|} \sum_{i \in D} \mathbf{1}(y_i \ne \hat{y}) = 1 - \hat{\pi}_{\hat{y}}$$

8
Classification Cost: Entropy
• Define entropy of class distribution
$$H(\hat{\boldsymbol{\pi}}) = -\sum_{c \in \mathcal{C}} \hat{\pi}_c \log \hat{\pi}_c$$

• Minimizing entropy is maximizing information gain


$$infoGain(X_j < t, Y) = H(Y) - H(Y \mid X_j < t)$$

9
Classification Cost: Gini Index
• Define the Gini index of the class distribution
$$\begin{aligned}
Gini(\hat{\boldsymbol{\pi}}) &= \sum_{c \in \mathcal{C}} \hat{\pi}_c (1 - \hat{\pi}_c) \\
&= \sum_{c \in \mathcal{C}} \hat{\pi}_c - \sum_{c \in \mathcal{C}} \hat{\pi}_c^2 \\
&= 1 - \sum_{c \in \mathcal{C}} \hat{\pi}_c^2
\end{aligned}$$

10
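The three classification costs can be sketched as follows (a minimal illustration with assumed names, computing each measure from the vector of class labels in a region):

```python
import numpy as np

def class_proportions(y):
    """Empirical class distribution pi_hat within a region."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def misclassification_rate(y):
    return float(1.0 - class_proportions(y).max())   # 1 - pi_hat of the majority class

def entropy(y):
    p = class_proportions(y)
    return float(-(p * np.log2(p)).sum())             # in bits (log base 2)

def gini(y):
    p = class_proportions(y)
    return float(1.0 - (p ** 2).sum())
```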
Classification Cost: Comparison

• Assume 2-class classification, where each class has 400 instances


Candidate   Split 1       Split 2       Misclassification Rate   Entropy   Gini Index
A           (300, 100)    (100, 300)    0.25                     0.81      0.375
B           (200, 400)    (199, 1)      0.25                     0.70      0.336

11
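The table entries can be reproduced with a short calculation (a sketch with assumed names, weighting each child region's impurity by its share of the instances):

```python
import numpy as np

def impurities(counts):
    """Misclassification rate, entropy (bits), and Gini index for a region
    described by its per-class counts."""
    p = np.array(counts) / sum(counts)
    return np.array([1 - p.max(), -(p * np.log2(p)).sum(), 1 - (p ** 2).sum()])

def weighted_impurities(regions):
    """Average each measure over the child regions, weighted by region size."""
    total = sum(sum(r) for r in regions)
    return sum((sum(r) / total) * impurities(r) for r in regions)

for name, regions in [("Candidate A", [(300, 100), (100, 300)]),
                      ("Candidate B", [(200, 400), (199, 1)])]:
    mis, ent, gin = weighted_impurities(regions)
    print(f"{name}: misclassification={mis:.2f}, entropy={ent:.2f}, gini={gin:.3f}")
# Both candidates tie on misclassification rate (0.25), but entropy and Gini
# prefer Candidate B, which isolates an almost pure region.
```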
Example: Iris Dataset

12
Example: Iris Dataset - Unpruned

13
Example: Iris Dataset - Pruned

14
BAGGING
Reducing variance via Bootstrap AGGregating
Random Tree Classifier
• Sample 𝑘 out of 𝑀 features randomly
– Heuristic: $k = \sqrt{M}$
• Build a full decision tree based only on the 𝑘 features
• This is a high-variance model

Classification tree on 2 out of 4 features in Iris dataset


16
Random Forest Classifier
• To lower the variance, we can “bag” many random trees

• Sample $L$ datasets from $D$ with replacement: $\{D_1, D_2, \dots, D_L\}$


• For each sampled dataset $D_l$:
– Sample $k$ out of $M$ features randomly
– Train a full classification tree $h_l(\boldsymbol{x})$ with the $k$ features
• The final classifier is the average of the trees
$$\hat{h}(\boldsymbol{x}) = \frac{1}{L} \sum_{l \in \{1,\dots,L\}} h_l(\boldsymbol{x})$$

17
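A minimal sketch of this procedure (all names are assumed; scikit-learn's RandomForestClassifier is the off-the-shelf alternative, though it resamples features at every split rather than once per tree):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, L=100, k=None, rng=None):
    """Train L full trees, each on a bootstrap sample of the data and a random
    subset of k features; return the trees with their feature subsets."""
    rng = np.random.default_rng(rng)
    N, M = X.shape
    k = k or max(1, int(np.sqrt(M)))             # heuristic: k = sqrt(M)
    forest = []
    for _ in range(L):
        rows = rng.integers(0, N, size=N)        # sample N rows with replacement
        cols = rng.choice(M, size=k, replace=False)
        tree = DecisionTreeClassifier().fit(X[np.ix_(rows, cols)], y[rows])
        forest.append((tree, cols))
    return forest

def predict_forest(forest, X):
    """Majority vote (mode) of the trees' predictions; assumes integer labels."""
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
```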
Sampling: without vs. with replacement

https://www.spss-tutorials.com/spss-sampling-basics/

18
Bagging: Bootstrap Aggregating
• We can apply bagging to other models as well

• Sample $L$ datasets from $D$ with replacement: $\{D_1, D_2, \dots, D_L\}$


• For each sampled dataset $D_l$:
– Sample $k$ out of $M$ features randomly
– Train a model $h_l(\boldsymbol{x})$ with the $k$ features
• The final model is the average of the predictions
$$\hat{h}(\boldsymbol{x}) = \frac{1}{L} \sum_{l \in \{1,\dots,L\}} h_l(\boldsymbol{x})$$

19
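For example (a sketch under assumed settings), scikit-learn's BaggingRegressor wraps an arbitrary base model with this bootstrap-and-average recipe:

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# A single deep tree: low bias, high variance.
single = DecisionTreeRegressor()
# Bag L = 100 such trees on bootstrap samples, each seeing a random subset of features.
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100,
                          max_features=0.5, bootstrap=True, random_state=0)

print(cross_val_score(single, X, y, cv=5).mean())   # usually a noticeably lower R^2
print(cross_val_score(bagged, X, y, cv=5).mean())   # averaging reduces the variance
```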
Bias-Variance Illustration

http://scott.fortmann-roe.com/docs/BiasVariance.html
20
Expected Loss for Regression

• Recall: assume a random noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is the source of residual error:


$$y = \hat{y}(\boldsymbol{x} \mid \boldsymbol{w}') + \epsilon$$
$$p(y \mid \boldsymbol{x}, \boldsymbol{w}', \sigma^2) = \mathcal{N}\big(y \mid \hat{y}(\boldsymbol{x} \mid \boldsymbol{w}'), \sigma^2\big)$$
$$\mathbb{E}[y \mid \boldsymbol{x}] = \int y \, p(y \mid \boldsymbol{x}, \boldsymbol{w}', \sigma^2) \, \mathrm{d}y = \hat{y}(\boldsymbol{x} \mid \boldsymbol{w}')$$
$$var[y \mid \boldsymbol{x}] = \mathbb{E}\big[(y - \mathbb{E}[y \mid \boldsymbol{x}])^2\big] = \mathbb{E}[\epsilon^2] = var[\epsilon] + \mathbb{E}[\epsilon]^2 = \sigma^2$$

• Suppose that instead of learning from the “complete” data, we take subsamples; then we expect some variation in the squared loss we would observe
– Let $y$ be the observed response variable
– Let $\hat{y}$ be the optimal function (the $\boldsymbol{x}$ and $\boldsymbol{w}'$ are omitted for simplicity of notation)
– Let $\tilde{y}$ be the function we learn from a particular sample
– We would like to characterize the expected squared loss $\mathbb{E}[(y - \tilde{y})^2]$ under the distribution of subsamples
21
Bias-Variance Decomposition for Regression
$$\begin{aligned}
\mathbb{E}[(y - \tilde{y})^2] &= \mathbb{E}[(\hat{y} + \epsilon - \tilde{y})^2] \\
&= \mathbb{E}\big[\big((\hat{y} - \mathbb{E}[\tilde{y}]) + \epsilon + (\mathbb{E}[\tilde{y}] - \tilde{y})\big)^2\big] \\
&= \mathbb{E}[(\hat{y} - \mathbb{E}[\tilde{y}])^2] + \mathbb{E}[\epsilon^2] + \mathbb{E}[(\mathbb{E}[\tilde{y}] - \tilde{y})^2] \\
&\quad + 2\,\mathbb{E}[\epsilon]\,\mathbb{E}[\hat{y} - \mathbb{E}[\tilde{y}]] + 2\,\mathbb{E}[\epsilon]\,\mathbb{E}[\mathbb{E}[\tilde{y}] - \tilde{y}] + 2\,\mathbb{E}[\hat{y} - \mathbb{E}[\tilde{y}]]\,\mathbb{E}[\mathbb{E}[\tilde{y}] - \tilde{y}] \\
&= \mathbb{E}[(\hat{y} - \mathbb{E}[\tilde{y}])^2] + \sigma^2 + \mathbb{E}[(\mathbb{E}[\tilde{y}] - \tilde{y})^2]
\end{aligned}$$
Note that $\mathbb{E}[\epsilon] = 0$ and $\mathbb{E}[\mathbb{E}[\tilde{y}] - \tilde{y}] = \mathbb{E}[\tilde{y}] - \mathbb{E}[\tilde{y}] = 0$, so all three cross terms vanish.

• Squared bias: $\mathbb{E}[(\hat{y} - \mathbb{E}[\tilde{y}])^2] = \mathbb{E}[(\hat{y} - \mathbb{E}[h_l(\boldsymbol{x})])^2]$

– Contribution to squared loss due to deviation of the learnt function from the optimal
• Variance: $\mathbb{E}[(\mathbb{E}[\tilde{y}] - \tilde{y})^2] = \mathbb{E}[(\mathbb{E}[h_l(\boldsymbol{x})] - h_l(\boldsymbol{x}))^2]$

– Contribution to squared loss due to sensitivity to different training subsamples


• Irreducible error: $\sigma^2 = \mathbb{E}[(\hat{y} - y)^2]$

– Contribution to squared loss due to random noise in the data

22
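The decomposition can be checked numerically (a minimal simulation sketch; the sine target, the noise level, and all names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3
f_opt = np.sin                                    # the "optimal" function y_hat(x)
x_test = np.linspace(0, 2 * np.pi, 50)

def fit_poly(degree, n=30):
    """Fit a polynomial to a fresh noisy subsample and predict on x_test."""
    x = rng.uniform(0, 2 * np.pi, n)
    y = f_opt(x) + rng.normal(0, sigma, n)
    return np.polyval(np.polyfit(x, y, degree), x_test)

for degree in [1, 3, 9]:
    preds = np.array([fit_poly(degree) for _ in range(500)])   # 500 subsamples
    mean_pred = preds.mean(axis=0)                             # E[y_tilde]
    bias_sq = ((f_opt(x_test) - mean_pred) ** 2).mean()        # E[(y_hat - E[y_tilde])^2]
    variance = ((preds - mean_pred) ** 2).mean()               # E[(E[y_tilde] - y_tilde)^2]
    print(f"degree={degree}: bias^2={bias_sq:.3f}, variance={variance:.3f}, "
          f"irreducible={sigma**2:.3f}")
# Higher-degree fits trade bias for variance; the noise term sigma^2 stays fixed.
```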
Bagging Reduces Variance
• Weak law of large numbers when samples are i.i.d:
$$\hat{h}(\boldsymbol{x}) = \frac{1}{L} \sum_{l \in \{1,\dots,L\}} h_l(\boldsymbol{x}) \;\to\; \mathbb{E}[h_l(\boldsymbol{x})] \quad \text{as } L \to \infty$$

• Variance: $\mathbb{E}[(\mathbb{E}[\tilde{y}] - \tilde{y})^2] = \mathbb{E}[(\mathbb{E}[h_l(\boldsymbol{x})] - h_l(\boldsymbol{x}))^2]$

• If we replace $h_l$ with $\hat{h}$, the variance reduces to 0 in the limit, provided the samples are indeed i.i.d.

• Bagging samples are unlikely to be i.i.d. (they are drawn from the same dataset), so the variance may not disappear completely, but it would likely still be reduced substantially

23
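A small simulation sketch of this effect (toy setup and names assumed): the variance of the bagged prediction at a fixed test point drops as L grows, but plateaus because the bootstrap trees are correlated:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

def bagged_variance(L, n=200, repeats=200):
    """Variance at a fixed test point of a bagged ensemble of L deep trees."""
    x0 = np.array([[0.5]])
    preds = []
    for _ in range(repeats):                       # repeat over fresh datasets
        x = rng.uniform(0, 1, (n, 1))
        y = np.sin(4 * x[:, 0]) + rng.normal(0, 0.3, n)
        ensemble = []
        for _ in range(L):                         # bootstrap L trees from this dataset
            idx = rng.integers(0, n, n)
            ensemble.append(DecisionTreeRegressor().fit(x[idx], y[idx]).predict(x0)[0])
        preds.append(np.mean(ensemble))
    return np.var(preds)

for L in [1, 10, 50]:
    print(L, round(bagged_variance(L), 4))         # variance drops as L grows, then plateaus
```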
Unbiased Estimate of Test Error
• Each bagging sample $D_l$ only involves a subset of the training data $D$

• A specific training instance $(\boldsymbol{x}_n, y_n)$ is part of some samples, but not of others
• Let $D^{-n} = \{D_l \mid (\boldsymbol{x}_n, y_n) \notin D_l\}$ be the samples that do not contain this instance
• Let $\hat{h}^{-n}(\boldsymbol{x}) = \frac{1}{|D^{-n}|} \sum_{D_l \in D^{-n}} h_l(\boldsymbol{x})$ be the average of the models trained on $D^{-n}$

• The out-of-bag error is the average such error across the 𝑁 instances in 𝐷
$$\epsilon_{OOB}(D) = \frac{1}{N} \sum_{n \in \{1,\dots,N\}} loss\big(\hat{h}^{-n}(\boldsymbol{x}_n), y_n\big)$$

24
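A minimal sketch (names assumed; X and y are NumPy arrays with integer class labels 0..C-1) of estimating the out-of-bag error for a hand-rolled bagged classifier:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_error(X, y, L=100, seed=0):
    """Train L bootstrap trees and estimate the test error from out-of-bag votes."""
    rng = np.random.default_rng(seed)
    N, C = len(X), len(np.unique(y))
    votes = np.zeros((N, C))                       # per-instance class vote counts
    for _ in range(L):
        idx = rng.integers(0, N, N)                # bootstrap sample D_l
        oob = np.setdiff1d(np.arange(N), idx)      # instances not in D_l
        if len(oob) == 0:
            continue
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        votes[oob, tree.predict(X[oob])] += 1      # only trees that did not see x_n vote on it
    covered = votes.sum(axis=1) > 0                # left out by at least one tree
    return np.mean(votes[covered].argmax(axis=1) != y[covered])
```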
BOOSTING
Reducing bias via iteratively improving weak learners
Boosting
• Consider a binary classification problem 𝑦 ∈ {−1, 1}

• A weak learner is a model for binary classification that has slightly better
performance than random guesses
– Example: a shallow classification tree

• Boosting seeks to create a strong learner from a weighted combination of


multiple weak learners

$$H(\boldsymbol{x}) = \mathrm{sign}\left( \sum_t \alpha_t h_t(\boldsymbol{x}) \right)$$

26
AdaBoost

• Training data $D$ has $N$ instances $\{(\boldsymbol{x}_n, y_n)\}_{n \in \{1,\dots,N\}}$


• Associate each instance $(\boldsymbol{x}_n, y_n)$ with a weight $w_n$
• Assume we can train a model $h_t$ that minimizes a weighted loss function
$$L_t = \sum_{n \in \{1,\dots,N\}} w_n^{(t)} \mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)$$
– $\mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)$ is an indicator function that yields 1 if the condition within holds, 0 otherwise
– An example model is a decision stump with a single feature split
• Initially, the weights of all instances are uniform: $w_n^{(1)} = \frac{1}{N}$
• Subsequently, weights of misclassified instances are adjusted

27
Algorithm
• For iteration 𝑡 from 1 to 𝑇
– Fit a classifier $h_t$ to the training data by minimizing the weighted loss function
$$L_t = \sum_{n \in \{1,\dots,N\}} w_n^{(t)} \mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)$$
– Evaluate the error of this iteration
$$\epsilon_t = \frac{\sum_{n \in \{1,\dots,N\}} w_n^{(t)} \mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)}{\sum_{n \in \{1,\dots,N\}} w_n^{(t)}}$$
– Evaluate the coefficient of this classifier
$$\alpha_t = \ln \frac{1 - \epsilon_t}{\epsilon_t}$$
– Update the data weight coefficients
$$w_n^{(t+1)} = w_n^{(t)} \exp\big(\alpha_t \mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)\big)$$
• The final prediction is given by:
$$H(\boldsymbol{x}_n) = \mathrm{sign}\left( \sum_{t \in \{1,\dots,T\}} \alpha_t h_t(\boldsymbol{x}_n) \right)$$
28
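A minimal sketch of this algorithm (names assumed; depth-1 trees from scikit-learn serve as the decision stumps), following the error, coefficient, and weight-update formulas above:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """AdaBoost with depth-1 decision stumps as weak learners; y must be in {-1, +1}."""
    N = len(X)
    w = np.full(N, 1.0 / N)                        # uniform initial weights
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (stump.predict(X) != y)             # indicator 1(h_t(x_n) != y_n)
        eps = w[miss].sum() / w.sum()              # weighted error of this iteration
        if eps == 0 or eps >= 0.5:                 # no longer a useful weak learner
            break
        alpha = np.log((1 - eps) / eps)            # classifier coefficient
        w = w * np.exp(alpha * miss)               # up-weight misclassified instances
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)
```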
Minimizes Sequential Exponential Error

• Sequential ensemble classifier:


$$H_{t'}(\boldsymbol{x}_n) = \frac{1}{2} \sum_{t \in \{1,\dots,t'\}} \alpha_t h_t(\boldsymbol{x}_n)$$

• Sequential exponential error:


$$\begin{aligned}
E &= \sum_{n \in \{1,\dots,N\}} \exp\big(-y_n H_{t'}(\boldsymbol{x}_n)\big) \\
&= \sum_{n \in \{1,\dots,N\}} \exp\Big(-y_n H_{t'-1}(\boldsymbol{x}_n) - \frac{1}{2} y_n \alpha_{t'} h_{t'}(\boldsymbol{x}_n)\Big) \\
&= \sum_{n \in \{1,\dots,N\}} w_n^{(t')} \exp\Big(-\frac{1}{2} y_n \alpha_{t'} h_{t'}(\boldsymbol{x}_n)\Big)
\end{aligned}$$
• Given $H_{t'-1}$, each $w_n^{(t')} = \exp\big(-y_n H_{t'-1}(\boldsymbol{x}_n)\big)$ is a constant, and we seek to find the minimizing $\alpha_{t'}$ and $h_{t'}$
29
Sequential Exponential Error (cont’d)

• Correctly classified instances: $C_{t'} = \{(\boldsymbol{x}_n, y_n) \mid y_n \cdot h_{t'}(\boldsymbol{x}_n) \ge 0\}$
• Wrongly classified instances: $\bar{C}_{t'} = \{(\boldsymbol{x}_n, y_n) \mid y_n \cdot h_{t'}(\boldsymbol{x}_n) < 0\}$
• Sequential exponential error:
$$\begin{aligned}
E &= \sum_{n \in \{1,\dots,N\}} w_n^{(t')} \exp\Big(-\frac{1}{2} y_n \alpha_{t'} h_{t'}(\boldsymbol{x}_n)\Big) \\
&= \exp\Big(-\frac{\alpha_{t'}}{2}\Big) \sum_{n \in C_{t'}} w_n^{(t')} + \exp\Big(\frac{\alpha_{t'}}{2}\Big) \sum_{n \in \bar{C}_{t'}} w_n^{(t')} \\
&= \Big(\exp\Big(\frac{\alpha_{t'}}{2}\Big) - \exp\Big(-\frac{\alpha_{t'}}{2}\Big)\Big) \sum_{n \in \{1,\dots,N\}} w_n^{(t')} \mathbf{1}(h_{t'}(\boldsymbol{x}_n) \ne y_n) + \exp\Big(-\frac{\alpha_{t'}}{2}\Big) \sum_{n \in \{1,\dots,N\}} w_n^{(t')}
\end{aligned}$$

• Minimizing the above with respect to $\alpha_{t'}$ gives us


$$\alpha_{t'} = \ln \frac{1 - \epsilon_{t'}}{\epsilon_{t'}}, \qquad \epsilon_{t'} = \frac{\sum_{n \in \{1,\dots,N\}} w_n^{(t')} \mathbf{1}(h_{t'}(\boldsymbol{x}_n) \ne y_n)}{\sum_{n \in \{1,\dots,N\}} w_n^{(t')}}$$
30
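Filling in the minimization step (a short derivation sketch, not spelled out on the slide): write $W = \sum_{n} w_n^{(t')}$ and $W_{err} = \sum_{n} w_n^{(t')} \mathbf{1}(h_{t'}(\boldsymbol{x}_n) \ne y_n)$, so that $\epsilon_{t'} = W_{err} / W$; then setting the derivative of $E$ to zero gives

$$\frac{\partial E}{\partial \alpha_{t'}} = \frac{1}{2}\Big(e^{\alpha_{t'}/2} + e^{-\alpha_{t'}/2}\Big) W_{err} - \frac{1}{2}\, e^{-\alpha_{t'}/2}\, W = 0
\;\Rightarrow\; e^{\alpha_{t'}} = \frac{W - W_{err}}{W_{err}} = \frac{1 - \epsilon_{t'}}{\epsilon_{t'}}
\;\Rightarrow\; \alpha_{t'} = \ln \frac{1 - \epsilon_{t'}}{\epsilon_{t'}}$$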
Illustration

31
Illustration (cont’d)

32
Interpretations of Boosting

• A form of L1 regularization
– Each weak learner is a decision stump that relies on a single feature
– Boosting “selects” among these weak learners (features) that work well

• Margin maximization
– By iteratively adjusting weights of misclassified instances, boosting seeks the classifier
that maximizes the margin

• Functional gradient descent


– The functions are the “parameters”
– GradientBoost is a generic algorithm for boosting that accommodates various loss
functions

33
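As an illustration of the functional-gradient-descent view (a minimal sketch with assumed names, using squared loss, whose negative gradient is simply the residual), each new weak learner is fit to the residuals of the current ensemble:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, T=100, lr=0.1):
    """Gradient boosting for squared loss: each stage fits a shallow tree to the
    residuals (the negative gradient of the loss) of the current ensemble."""
    init = y.mean()
    pred = np.full(len(y), init)                  # start from a constant model
    trees = []
    for _ in range(T):
        residual = y - pred                       # negative gradient of 0.5*(y - f)^2
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        pred += lr * tree.predict(X)              # take a small step in function space
        trees.append(tree)
    return init, trees

def gradient_boost_predict(init, trees, X, lr=0.1):
    return init + lr * sum(t.predict(X) for t in trees)
```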
Boosting Loss Functions

34
Conclusion
• Classification and Regression Trees (CART)
– A class of models that partitions the input space into regions and models each region locally

• Ensemble learning
– Aggregating the predictions of multiple models

• Bagging
– Trains multiple models from sub-samples of dataset
– Reduces variance of a high-variance learning algorithm without affecting bias

• Boosting
– Combining multiple weak learners into a strong learner
– Reduces bias by iteratively over-weighting misclassified instances
35
References

• [PRML] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
– Chapter 14 (Combining Models)

• [MLaPP] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
– Chapter 16 (Adaptive Basis Function Models)

36
