
Ensemble Learning

Hady W. Lauw
Photo by Felix Mittermeier from Pexels

IS712 Machine Learning


CLASSIFICATION AND REGRESSION TREES (CART)
Recursively partitioning the input space and defining a local model for each partition
Regression Tree
• Partition input space into regions
• Prediction is the mean response in each region
– Alternatively, fit a regression function locally

3
Classification Tree
• Partition input space into regions
• Prediction is the mode of class label distribution

4
Recursive Procedure to Grow a Tree

• Split function chooses the “best” feature j (among M features) and feature value t
(among the viable feature values of j) to split
$$(j^*, t^*) = \arg\min_{j \in \{1,\dots,M\}} \; \min_{t \in \mathcal{T}_j} \left[ \frac{|D_L|}{|D|}\, cost\big(D_L = \{(\boldsymbol{x}_i, y_i) : x_{ij} \le t\}\big) + \frac{|D_R|}{|D|}\, cost\big(D_R = \{(\boldsymbol{x}_i, y_i) : x_{ij} > t\}\big) \right]$$
5
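The greedy split search can be sketched as follows (a minimal illustration, assuming a NumPy feature matrix X, a response vector y, and a cost function such as the ones defined on the following slides; all names are hypothetical):

```python
import numpy as np

def best_split(X, y, cost):
    """Exhaustively search for the (feature, threshold) pair that minimizes the
    weighted cost of the two resulting partitions."""
    N, M = X.shape
    best = (None, None, np.inf)            # (j*, t*, weighted cost)
    for j in range(M):                     # every feature j in {1, ..., M}
        for t in np.unique(X[:, j]):       # every viable threshold for feature j
            left = X[:, j] <= t
            right = ~left
            if left.all() or right.all():  # skip degenerate splits
                continue
            weighted = (left.sum() / N) * cost(y[left]) + (right.sum() / N) * cost(y[right])
            if weighted < best[2]:
                best = (j, t, weighted)
    return best

# Example with the regression cost (sum of squared deviations from the mean):
# j_star, t_star, c = best_split(X, y, cost=lambda ys: ((ys - ys.mean()) ** 2).sum())
```

The double loop makes the exhaustive search over features and thresholds explicit; practical implementations typically sort each feature once and scan candidate thresholds incrementally.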
Is it worth splitting?

• Is there gain to be made from further splits?


– The distribution in each region may already be sufficiently homogeneous
– The gain may be too small
$$Gain = cost(D) - \left[ \frac{|D_L|}{|D|}\, cost\big(D_L = \{(\boldsymbol{x}_i, y_i) : x_{ij} \le t\}\big) + \frac{|D_R|}{|D|}\, cost\big(D_R = \{(\boldsymbol{x}_i, y_i) : x_{ij} > t\}\big) \right]$$

• Are there significant risks of overfitting?


– The tree may already be too deep
– The number of examples in a particular region may be too small

6
Regression Cost
• For a subset of data points D, quantify:
$$cost(D) = \sum_{i \in D} \big(y_i - f(\boldsymbol{x}_i, y_i)\big)^2$$

• In the simplest case, the prediction could just be the mean response
$$f(\boldsymbol{x}_i, y_i) = \frac{1}{|D|} \sum_{i \in D} y_i$$

• Alternatively, we can fit a regression function at each leaf node

7
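As a minimal sketch (the function name is assumed), the mean-prediction version of this cost is just the sum of squared deviations from the region mean:

```python
import numpy as np

def regression_cost(y):
    """Sum of squared deviations of the responses from their mean,
    i.e. the cost of predicting the mean response for this region."""
    return float(((y - y.mean()) ** 2).sum())

# regression_cost(np.array([1.0, 2.0, 3.0]))  # -> 2.0
```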
Classification Cost: Misclassification
• First, we estimate the class proportions within a subset of data points D:
$$\hat{\pi}_c = \frac{1}{|D|} \sum_{i \in D} \mathbf{1}(y_i = c)$$

• Predicted class is the mode of the class distribution


$$\hat{y} = \arg\max_{c \in \mathcal{C}} \hat{\pi}_c$$

• Misclassification rate
$$cost(D) = \frac{1}{|D|} \sum_{i \in D} \mathbf{1}(y_i \ne \hat{y}) = 1 - \hat{\pi}_{\hat{y}}$$

8
Classification Cost: Entropy
• Define entropy of class distribution
$$H(\hat{\boldsymbol{\pi}}) = -\sum_{c \in \mathcal{C}} \hat{\pi}_c \log \hat{\pi}_c$$

• Minimizing entropy is maximizing information gain


$$infoGain(X_j < t, Y) = H(Y) - H(Y \mid X_j < t)$$

9
Classification Cost: Gini Index
• Define the Gini index of the class distribution
$$\begin{aligned}
Gini(\hat{\boldsymbol{\pi}}) &= \sum_{c \in \mathcal{C}} \hat{\pi}_c (1 - \hat{\pi}_c) \\
&= \sum_{c \in \mathcal{C}} \hat{\pi}_c - \sum_{c \in \mathcal{C}} \hat{\pi}_c^2 \\
&= 1 - \sum_{c \in \mathcal{C}} \hat{\pi}_c^2
\end{aligned}$$

10
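The three classification costs can be sketched as follows (a minimal illustration with assumed names, computing each measure from the vector of class labels in a region):

```python
import numpy as np

def class_proportions(y):
    """Empirical class distribution pi_hat within a region."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def misclassification_rate(y):
    return float(1.0 - class_proportions(y).max())   # 1 - pi_hat of the majority class

def entropy(y):
    p = class_proportions(y)
    return float(-(p * np.log2(p)).sum())             # in bits (log base 2)

def gini(y):
    p = class_proportions(y)
    return float(1.0 - (p ** 2).sum())
```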
Classification Cost: Comparison

• Assume 2-class classification, where each class has 400 instances


Candidate   Split 1       Split 2       Misclassification Rate   Entropy   Gini Index
A           (300, 100)    (100, 300)    0.25                     0.81      0.375
B           (200, 400)    (199, 1)      0.25                     0.70      0.336

11
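The table entries can be reproduced with a short calculation (a sketch with assumed names, weighting each child region's impurity by its share of the instances):

```python
import numpy as np

def impurities(counts):
    """Misclassification rate, entropy (bits), and Gini index for a region
    described by its per-class counts."""
    p = np.array(counts) / sum(counts)
    return np.array([1 - p.max(), -(p * np.log2(p)).sum(), 1 - (p ** 2).sum()])

def weighted_impurities(regions):
    """Average each measure over the child regions, weighted by region size."""
    total = sum(sum(r) for r in regions)
    return sum((sum(r) / total) * impurities(r) for r in regions)

for name, regions in [("Candidate A", [(300, 100), (100, 300)]),
                      ("Candidate B", [(200, 400), (199, 1)])]:
    mis, ent, gin = weighted_impurities(regions)
    print(f"{name}: misclassification={mis:.2f}, entropy={ent:.2f}, gini={gin:.3f}")
# Both candidates tie on misclassification rate (0.25), but entropy and Gini
# prefer Candidate B, which isolates an almost pure region.
```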
Example: Iris Dataset

12
Example: Iris Dataset - Unpruned

13
Example: Iris Dataset - Pruned

14
BAGGING
Reducing variance via Bootstrap AGGregating
Random Tree Classifier
• Sample 𝑘 out of 𝑀 features randomly
– Heuristic: $k = \sqrt{M}$
• Build a full decision tree based only on the 𝑘 features
• This is a high-variance model

Classification tree on 2 out of 4 features in Iris dataset


16
Random Forest Classifier
• To lower the variance, we can “bag” many random trees

• Sample $L$ datasets from $D$ with replacement: $\{D_1, D_2, \dots, D_L\}$


• For each sampled dataset $D_l$:
– Sample $k$ out of $M$ features randomly
– Train a full classification tree $h_l(\boldsymbol{x})$ with the $k$ features
• The final classifier is the average of the trees
$$\hat{h}(\boldsymbol{x}) = \frac{1}{L} \sum_{l \in \{1,\dots,L\}} h_l(\boldsymbol{x})$$

17
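A minimal sketch of this procedure (all names are assumed; scikit-learn's RandomForestClassifier is the off-the-shelf alternative, though it resamples features at every split rather than once per tree):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, L=100, k=None, rng=None):
    """Train L full trees, each on a bootstrap sample of the data and a random
    subset of k features; return the trees with their feature subsets."""
    rng = np.random.default_rng(rng)
    N, M = X.shape
    k = k or max(1, int(np.sqrt(M)))             # heuristic: k = sqrt(M)
    forest = []
    for _ in range(L):
        rows = rng.integers(0, N, size=N)        # sample N rows with replacement
        cols = rng.choice(M, size=k, replace=False)
        tree = DecisionTreeClassifier().fit(X[np.ix_(rows, cols)], y[rows])
        forest.append((tree, cols))
    return forest

def predict_forest(forest, X):
    """Majority vote (mode) of the trees' predictions; assumes integer labels."""
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
```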
Sampling: without vs. with replacement

https://www.spss-tutorials.com/spss-sampling-basics/

18
Bagging: Bootstrap Aggregating
• We can apply bagging to other models as well

• Sample $L$ datasets from $D$ with replacement: $\{D_1, D_2, \dots, D_L\}$


• For each sampled dataset $D_l$:
– Sample $k$ out of $M$ features randomly
– Train a model $h_l(\boldsymbol{x})$ with the $k$ features
• The final model is the average of the predictions
$$\hat{h}(\boldsymbol{x}) = \frac{1}{L} \sum_{l \in \{1,\dots,L\}} h_l(\boldsymbol{x})$$

19
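For example (a sketch under assumed settings), scikit-learn's BaggingRegressor wraps an arbitrary base model with this bootstrap-and-average recipe:

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# A single deep tree: low bias, high variance.
single = DecisionTreeRegressor()
# Bag L = 100 such trees on bootstrap samples, each seeing a random subset of features.
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100,
                          max_features=0.5, bootstrap=True, random_state=0)

print(cross_val_score(single, X, y, cv=5).mean())   # usually a noticeably lower R^2
print(cross_val_score(bagged, X, y, cv=5).mean())   # averaging reduces the variance
```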
Bias-Variance Illustration

http://scott.fortmann-roe.com/docs/BiasVariance.html
20
Expected Loss for Regression

• Recall: assume a random noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is the source of residual error:


$$y = \hat{y}(\boldsymbol{x} \mid \boldsymbol{w}') + \epsilon$$
$$p(y \mid \boldsymbol{x}, \boldsymbol{w}', \sigma^2) = \mathcal{N}\big(y \mid \hat{y}(\boldsymbol{x} \mid \boldsymbol{w}'), \sigma^2\big)$$
$$\mathbb{E}[y \mid \boldsymbol{x}] = \int y \, p(y \mid \boldsymbol{x}, \boldsymbol{w}', \sigma^2) \, \mathrm{d}y = \hat{y}(\boldsymbol{x} \mid \boldsymbol{w}')$$
$$var[y \mid \boldsymbol{x}] = \mathbb{E}\big[(y - \mathbb{E}[y \mid \boldsymbol{x}])^2\big] = \mathbb{E}[\epsilon^2] = var[\epsilon] + \mathbb{E}[\epsilon]^2 = \sigma^2$$

• Suppose that instead of learning from the “complete” data, we take subsamples; then we expect some variation in the squared loss we would observe
– Let $y$ be the observed response variable
– Let $\hat{y}$ be the optimal function (the $\boldsymbol{x}$ and $\boldsymbol{w}'$ are omitted for simplicity of notation)
– Let $\tilde{y}$ be the function we learn from a particular sample
– We would like to characterize the expected squared loss $\mathbb{E}[(y - \tilde{y})^2]$ under the distribution of subsamples
21
Bias-Variance Decomposition for Regression
$$\begin{aligned}
\mathbb{E}[(y - \tilde{y})^2] &= \mathbb{E}[(\hat{y} + \epsilon - \tilde{y})^2] \\
&= \mathbb{E}\big[\big((\hat{y} - \mathbb{E}[\tilde{y}]) + \epsilon + (\mathbb{E}[\tilde{y}] - \tilde{y})\big)^2\big] \\
&= \mathbb{E}[(\hat{y} - \mathbb{E}[\tilde{y}])^2] + \mathbb{E}[\epsilon^2] + \mathbb{E}[(\mathbb{E}[\tilde{y}] - \tilde{y})^2] \\
&\quad + 2\,\mathbb{E}[\epsilon]\,\mathbb{E}[\hat{y} - \mathbb{E}[\tilde{y}]] + 2\,\mathbb{E}[\epsilon]\,\mathbb{E}[\mathbb{E}[\tilde{y}] - \tilde{y}] + 2\,\mathbb{E}[\hat{y} - \mathbb{E}[\tilde{y}]]\,\mathbb{E}[\mathbb{E}[\tilde{y}] - \tilde{y}] \\
&= \mathbb{E}[(\hat{y} - \mathbb{E}[\tilde{y}])^2] + \sigma^2 + \mathbb{E}[(\mathbb{E}[\tilde{y}] - \tilde{y})^2]
\end{aligned}$$
Note that $\mathbb{E}[\epsilon] = 0$ and $\mathbb{E}[\mathbb{E}[\tilde{y}] - \tilde{y}] = \mathbb{E}[\tilde{y}] - \mathbb{E}[\tilde{y}] = 0$, so all three cross terms vanish.

• Squared bias: $\mathbb{E}[(\hat{y} - \mathbb{E}[\tilde{y}])^2] = \mathbb{E}[(\hat{y} - \mathbb{E}[h_l(\boldsymbol{x})])^2]$

– Contribution to squared loss due to deviation of the learnt function from the optimal
• Variance: $\mathbb{E}[(\mathbb{E}[\tilde{y}] - \tilde{y})^2] = \mathbb{E}[(\mathbb{E}[h_l(\boldsymbol{x})] - h_l(\boldsymbol{x}))^2]$

– Contribution to squared loss due to sensitivity to different training subsamples


• Irreducible error: $\sigma^2 = \mathbb{E}[(\hat{y} - y)^2]$

– Contribution to squared loss due to random noise in the data

22
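The decomposition can be checked numerically (a minimal simulation sketch; the sine target, the noise level, and all names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3
f_opt = np.sin                                    # the "optimal" function y_hat(x)
x_test = np.linspace(0, 2 * np.pi, 50)

def fit_poly(degree, n=30):
    """Fit a polynomial to a fresh noisy subsample and predict on x_test."""
    x = rng.uniform(0, 2 * np.pi, n)
    y = f_opt(x) + rng.normal(0, sigma, n)
    return np.polyval(np.polyfit(x, y, degree), x_test)

for degree in [1, 3, 9]:
    preds = np.array([fit_poly(degree) for _ in range(500)])   # 500 subsamples
    mean_pred = preds.mean(axis=0)                             # E[y_tilde]
    bias_sq = ((f_opt(x_test) - mean_pred) ** 2).mean()        # E[(y_hat - E[y_tilde])^2]
    variance = ((preds - mean_pred) ** 2).mean()               # E[(E[y_tilde] - y_tilde)^2]
    print(f"degree={degree}: bias^2={bias_sq:.3f}, variance={variance:.3f}, "
          f"irreducible={sigma**2:.3f}")
# Higher-degree fits trade bias for variance; the noise term sigma^2 stays fixed.
```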
Bagging Reduces Variance
• Weak law of large numbers when samples are i.i.d:
$$\hat{h}(\boldsymbol{x}) = \frac{1}{L} \sum_{l \in \{1,\dots,L\}} h_l(\boldsymbol{x}) \;\to\; \mathbb{E}[h_l(\boldsymbol{x})] \quad \text{as } L \to \infty$$

• Variance: $\mathbb{E}[(\mathbb{E}[\tilde{y}] - \tilde{y})^2] = \mathbb{E}[(\mathbb{E}[h_l(\boldsymbol{x})] - h_l(\boldsymbol{x}))^2]$

• If we replace $h_l$ with $\hat{h}$, the variance reduces to 0 in the limit, provided the samples are indeed i.i.d.

• Bagging samples are unlikely to be i.i.d. (they are drawn from the same dataset), so the variance may not disappear completely, but it would likely still be reduced substantially

23
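A small simulation sketch of this effect (toy setup and names assumed): the variance of the bagged prediction at a fixed test point drops as L grows, but plateaus because the bootstrap trees are correlated:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

def bagged_variance(L, n=200, repeats=200):
    """Variance at a fixed test point of a bagged ensemble of L deep trees."""
    x0 = np.array([[0.5]])
    preds = []
    for _ in range(repeats):                       # repeat over fresh datasets
        x = rng.uniform(0, 1, (n, 1))
        y = np.sin(4 * x[:, 0]) + rng.normal(0, 0.3, n)
        ensemble = []
        for _ in range(L):                         # bootstrap L trees from this dataset
            idx = rng.integers(0, n, n)
            ensemble.append(DecisionTreeRegressor().fit(x[idx], y[idx]).predict(x0)[0])
        preds.append(np.mean(ensemble))
    return np.var(preds)

for L in [1, 10, 50]:
    print(L, round(bagged_variance(L), 4))         # variance drops as L grows, then plateaus
```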
Unbiased Estimate of Test Error
• Each bagging sample $D_l$ only involves a subset of the training data $D$

• A specific training instance $(\boldsymbol{x}_n, y_n)$ is part of some samples, but not of others
• Let $D^{-n} = \{D_l \mid (\boldsymbol{x}_n, y_n) \notin D_l\}$ be the samples that do not contain this instance
• Let $\hat{h}^{-n}(\boldsymbol{x}) = \frac{1}{|D^{-n}|} \sum_{D_l \in D^{-n}} h_l(\boldsymbol{x})$ be the average of the models trained on $D^{-n}$

• The out-of-bag error is the average such error across the 𝑁 instances in 𝐷
$$\epsilon_{OOB}(D) = \frac{1}{N} \sum_{n \in \{1,\dots,N\}} loss\big(\hat{h}^{-n}(\boldsymbol{x}_n), y_n\big)$$

24
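A minimal sketch (names assumed; X and y are NumPy arrays with integer class labels 0..C-1) of estimating the out-of-bag error for a hand-rolled bagged classifier:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_error(X, y, L=100, seed=0):
    """Train L bootstrap trees and estimate the test error from out-of-bag votes."""
    rng = np.random.default_rng(seed)
    N, C = len(X), len(np.unique(y))
    votes = np.zeros((N, C))                       # per-instance class vote counts
    for _ in range(L):
        idx = rng.integers(0, N, N)                # bootstrap sample D_l
        oob = np.setdiff1d(np.arange(N), idx)      # instances not in D_l
        if len(oob) == 0:
            continue
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        votes[oob, tree.predict(X[oob])] += 1      # only trees that did not see x_n vote on it
    covered = votes.sum(axis=1) > 0                # left out by at least one tree
    return np.mean(votes[covered].argmax(axis=1) != y[covered])
```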
BOOSTING
Reducing bias via iteratively improving weak learners
Boosting
• Consider a binary classification problem 𝑦 ∈ {−1, 1}

• A weak learner is a model for binary classification that has slightly better
performance than random guesses
– Example: a shallow classification tree

• Boosting seeks to create a strong learner from a weighted combination of


multiple weak learners

$$H(\boldsymbol{x}) = \mathrm{sign}\left( \sum_t \alpha_t h_t(\boldsymbol{x}) \right)$$

26
AdaBoost

• Training data $D$ has $N$ instances $\{(\boldsymbol{x}_n, y_n)\}_{n \in \{1,\dots,N\}}$


• Associate each instance $(\boldsymbol{x}_n, y_n)$ with a weight $w_n$
• Assume we can train a model $h_t$ that minimizes a weighted loss function
$$L_t = \sum_{n \in \{1,\dots,N\}} w_n^{(t)} \mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)$$
– $\mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)$ is an indicator function that yields 1 if the condition within holds, 0 otherwise
– An example model is a decision stump with a single feature split
• Initially, the weights of all instances are uniform: $w_n^{(1)} = \frac{1}{N}$
• Subsequently, weights of misclassified instances are adjusted

27
Algorithm
• For iteration 𝑡 from 1 to 𝑇
– Fit a classifier $h_t$ to the training data by minimizing the weighted loss function
$$L_t = \sum_{n \in \{1,\dots,N\}} w_n^{(t)} \mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)$$
– Evaluate the error of this iteration
$$\epsilon_t = \frac{\sum_{n \in \{1,\dots,N\}} w_n^{(t)} \mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)}{\sum_{n \in \{1,\dots,N\}} w_n^{(t)}}$$
– Evaluate the coefficient of this classifier
$$\alpha_t = \ln \frac{1 - \epsilon_t}{\epsilon_t}$$
– Update the data weight coefficients
$$w_n^{(t+1)} = w_n^{(t)} \exp\big(\alpha_t \mathbf{1}(h_t(\boldsymbol{x}_n) \ne y_n)\big)$$
• The final prediction is given by:
$$H(\boldsymbol{x}_n) = \mathrm{sign}\left( \sum_{t \in \{1,\dots,T\}} \alpha_t h_t(\boldsymbol{x}_n) \right)$$
28
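A minimal sketch of this algorithm (names assumed; depth-1 trees from scikit-learn serve as the decision stumps), following the error, coefficient, and weight-update formulas above:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """AdaBoost with depth-1 decision stumps as weak learners; y must be in {-1, +1}."""
    N = len(X)
    w = np.full(N, 1.0 / N)                        # uniform initial weights
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (stump.predict(X) != y)             # indicator 1(h_t(x_n) != y_n)
        eps = w[miss].sum() / w.sum()              # weighted error of this iteration
        if eps == 0 or eps >= 0.5:                 # no longer a useful weak learner
            break
        alpha = np.log((1 - eps) / eps)            # classifier coefficient
        w = w * np.exp(alpha * miss)               # up-weight misclassified instances
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)
```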
Minimizes Sequential Exponential Error

• Sequential ensemble classifier:


$$H_{t'}(\boldsymbol{x}_n) = \frac{1}{2} \sum_{t \in \{1,\dots,t'\}} \alpha_t h_t(\boldsymbol{x}_n)$$

• Sequential exponential error:


$$\begin{aligned}
E &= \sum_{n \in \{1,\dots,N\}} \exp\big(-y_n H_{t'}(\boldsymbol{x}_n)\big) \\
&= \sum_{n \in \{1,\dots,N\}} \exp\Big(-y_n H_{t'-1}(\boldsymbol{x}_n) - \frac{1}{2} y_n \alpha_{t'} h_{t'}(\boldsymbol{x}_n)\Big) \\
&= \sum_{n \in \{1,\dots,N\}} w_n^{(t')} \exp\Big(-\frac{1}{2} y_n \alpha_{t'} h_{t'}(\boldsymbol{x}_n)\Big)
\end{aligned}$$
• Given $H_{t'-1}$, each $w_n^{(t')} = \exp\big(-y_n H_{t'-1}(\boldsymbol{x}_n)\big)$ is a constant, and we seek to find the minimizing $\alpha_{t'}$ and $h_{t'}$
29
Sequential Exponential Error (cont’d)

• Correctly classified instances: $C_{t'} = \{(\boldsymbol{x}_n, y_n) \mid y_n \cdot h_{t'}(\boldsymbol{x}_n) \ge 0\}$
• Wrongly classified instances: $\bar{C}_{t'} = \{(\boldsymbol{x}_n, y_n) \mid y_n \cdot h_{t'}(\boldsymbol{x}_n) < 0\}$
• Sequential exponential error:
$$\begin{aligned}
E &= \sum_{n \in \{1,\dots,N\}} w_n^{(t')} \exp\Big(-\frac{1}{2} y_n \alpha_{t'} h_{t'}(\boldsymbol{x}_n)\Big) \\
&= \exp\Big(-\frac{\alpha_{t'}}{2}\Big) \sum_{n \in C_{t'}} w_n^{(t')} + \exp\Big(\frac{\alpha_{t'}}{2}\Big) \sum_{n \in \bar{C}_{t'}} w_n^{(t')} \\
&= \Big(\exp\Big(\frac{\alpha_{t'}}{2}\Big) - \exp\Big(-\frac{\alpha_{t'}}{2}\Big)\Big) \sum_{n \in \{1,\dots,N\}} w_n^{(t')} \mathbf{1}(h_{t'}(\boldsymbol{x}_n) \ne y_n) + \exp\Big(-\frac{\alpha_{t'}}{2}\Big) \sum_{n \in \{1,\dots,N\}} w_n^{(t')}
\end{aligned}$$

• Minimizing the above with respect to $\alpha_{t'}$ gives us


$$\alpha_{t'} = \ln \frac{1 - \epsilon_{t'}}{\epsilon_{t'}}, \qquad \epsilon_{t'} = \frac{\sum_{n \in \{1,\dots,N\}} w_n^{(t')} \mathbf{1}(h_{t'}(\boldsymbol{x}_n) \ne y_n)}{\sum_{n \in \{1,\dots,N\}} w_n^{(t')}}$$
30
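Filling in the minimization step (a short derivation sketch, not spelled out on the slide): write $W = \sum_{n} w_n^{(t')}$ and $W_{err} = \sum_{n} w_n^{(t')} \mathbf{1}(h_{t'}(\boldsymbol{x}_n) \ne y_n)$, so that $\epsilon_{t'} = W_{err} / W$; then setting the derivative of $E$ to zero gives

$$\frac{\partial E}{\partial \alpha_{t'}} = \frac{1}{2}\Big(e^{\alpha_{t'}/2} + e^{-\alpha_{t'}/2}\Big) W_{err} - \frac{1}{2}\, e^{-\alpha_{t'}/2}\, W = 0
\;\Rightarrow\; e^{\alpha_{t'}} = \frac{W - W_{err}}{W_{err}} = \frac{1 - \epsilon_{t'}}{\epsilon_{t'}}
\;\Rightarrow\; \alpha_{t'} = \ln \frac{1 - \epsilon_{t'}}{\epsilon_{t'}}$$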
Illustration

31
Illustration (cont’d)

32
Interpretations of Boosting

• A form of L1 regularization
– Each weak learner is a decision stump that relies on a single feature
– Boosting “selects” among these weak learners (features) that work well

• Margin maximization
– By iteratively adjusting weights of misclassified instances, boosting seeks the classifier
that maximizes the margin

• Functional gradient descent


– The functions are the “parameters”
– GradientBoost is a generic algorithm for boosting that accommodates various loss
functions

33
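As an illustration of the functional-gradient-descent view (a minimal sketch with assumed names, using squared loss, whose negative gradient is simply the residual), each new weak learner is fit to the residuals of the current ensemble:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, T=100, lr=0.1):
    """Gradient boosting for squared loss: each stage fits a shallow tree to the
    residuals (the negative gradient of the loss) of the current ensemble."""
    init = y.mean()
    pred = np.full(len(y), init)                  # start from a constant model
    trees = []
    for _ in range(T):
        residual = y - pred                       # negative gradient of 0.5*(y - f)^2
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        pred += lr * tree.predict(X)              # take a small step in function space
        trees.append(tree)
    return init, trees

def gradient_boost_predict(init, trees, X, lr=0.1):
    return init + lr * sum(t.predict(X) for t in trees)
```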
Boosting Loss Functions

34
Conclusion
• Classification and Regression Trees (CART)
– A class of models that partitions the input space into regions and models each region locally

• Ensemble learning
– Aggregating the predictions of multiple models

• Bagging
– Trains multiple models from sub-samples of dataset
– Reduces variance of a high-variance learning algorithm without affecting bias

• Boosting
– Combining multiple weak learners into a strong learner
– Reduces bias by iteratively over-weighting misclassified instances
35
References

• [PRML] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
– Chapter 14 (Combining Models)

• [MLaPP] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
– Chapter 16 (Adaptive Basis Function Models)

36
