CS-11-01
AIML CZG565
Ensemble Learning
• The content of these modules and the topics covered were planned by the course owner
Dr. Sugata, with grateful acknowledgement to the many others who made their course
materials freely available online
• We hereby acknowledge all the contributors for their material and inputs.
• We have provided source information wherever necessary
• Students are requested to refer to the textbook for the detailed content of the
presentation deck shared over Canvas
• We have reduced the slides from Canvas and modified the content flow to suit the
requirements of the course and for ease of class presentation
External: CS109 and CS229 Stanford lecture notes, Dr. Andrew Ng, and many others who
made their course materials freely available online
Course Plan
M5 Decision Tree
M8 Bayesian Learning
M9 Ensemble Learning
General ensemble procedure:
Original training data D
Step 1: Create multiple data sets D1, D2, ..., Dt-1, Dt
Step 2: Build multiple classifiers C1, C2, ..., Ct-1, Ct
Step 3: Combine the classifiers into C* (e.g., by averaging or majority voting)
• Conditions for an ensemble to do better than its base classifiers:
– The base classifiers should be (approximately) independent of each other
– All base classifiers perform better than random guessing (error rate < 0.5 for
binary classification)
• The ensemble (majority vote) then makes a wrong prediction only if more than half of
the base classifiers predict incorrectly
• Technique uses these subsets (bags) to get a fair idea of the distribution (complete
set).
• The size of subsets created for bagging may be less than the original set.
• Bootstrapping is a sampling technique in which we create subsets of observations
from the original dataset, with replacement.
• When you sample with replacement, the draws are independent: one draw does not
affect the outcome of the other. With seven items, you have a 1/7 chance of choosing
any particular item on the first draw and still a 1/7 chance on the second draw.
• When you sample without replacement, the draws are dependent (linked to each other).
You choose the first item with a 1/7 probability; since you don't replace it, only six
items remain, so each remaining item has a 1/6 chance of being chosen as the second item.
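A minimal sketch of the two sampling schemes, assuming a toy collection of seven items:

```python
import numpy as np

rng = np.random.default_rng(0)
items = np.arange(7)          # seven items, as in the example above

# Sampling WITH replacement (bootstrap-style): each draw is independent,
# so every item keeps a 1/7 chance on every draw and duplicates are possible.
with_replacement = rng.choice(items, size=7, replace=True)

# Sampling WITHOUT replacement: draws are dependent; once an item is taken
# it cannot be drawn again (1/7 on the first draw, 1/6 on the second, ...).
without_replacement = rng.choice(items, size=7, replace=False)

print(with_replacement)     # may contain repeated items
print(without_replacement)  # a permutation of the seven items
```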
Bagging (Bootstrap Aggregating)
• Multiple subsets are created from the original dataset, selecting observations with
replacement.
D={(x1,y1),...,(xN,yN)}
• Regression: average the predictions of the M base models,
$\hat{y} = \frac{1}{M} \sum_{m=1}^{M} G_m(x)$
• Classification: take a majority vote of the base classifiers.
Example (decision stumps trained on bootstrap samples of a one-dimensional data set):
Bagging Round 1:
x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9 x <= 0.35 y = 1
y 1 1 1 1 -1 -1 -1 -1 1 1 x > 0.35 y = -1
Bagging Round 2:
x 0.1 0.2 0.3 0.4 0.5 0.5 0.9 1 1 1 X < = 0.7 y= 1
y 1 1 1 -1 -1 -1 1 1 1 1 X > 0.7 y= 1
Bagging Round 3:
x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9 x <= 0.35 y = 1
y 1 1 1 -1 -1 -1 -1 -1 1 1 x > 0.35 y = -1
Bagging Round 4:
x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9 x <= 0.3 y = 1
y 1 1 1 -1 -1 -1 -1 -1 1 1 x > 0.3 y = -1
Bagging Round 5:
x 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1 1 1 x <= 0.35 y = 1
y 1 1 1 -1 -1 -1 -1 1 1 1 x > 0.35 y = -1
Bagging Round 6:
x 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1 x <= 0.75 y = -1
y 1 -1 -1 -1 -1 -1 -1 1 1 1 x > 0.75 y = 1
Bagging Round 7:
x 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1 x <= 0.75 y = -1
y 1 -1 -1 -1 -1 1 1 1 1 1 x > 0.75 y = 1
Bagging Round 8:
x 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1 x <= 0.75 y = -1
y 1 1 -1 -1 -1 -1 -1 1 1 1 x > 0.75 y = 1
Bagging Round 9:
x 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1 1 x <= 0.75 y = -1
y 1 1 -1 -1 -1 -1 -1 1 1 1 x > 0.75 y = 1
Sources: Introduction to Data Mining, 2nd Edition; Hastie et al., "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", Springer (2009)
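A minimal sketch of the three bagging steps on the ten-point data set above, assuming decision stumps (one-level trees, via scikit-learn) as base classifiers and a majority vote to combine them:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# The ten training points from the example above.
X = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

rng = np.random.default_rng(42)
n_rounds = 10
stumps = []

for _ in range(n_rounds):
    # Step 1: draw a bootstrap sample (with replacement) of the same size.
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: fit a decision stump (one-level tree) to the bootstrap sample.
    stump = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
    stumps.append(stump)

# Step 3: combine the classifiers by majority vote (sign of the summed votes;
# a tie would give 0 here).
votes = np.sum([s.predict(X) for s in stumps], axis=0)
y_pred = np.sign(votes)
print("Training accuracy of the bagged stumps:", np.mean(y_pred == y))
```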
Bagging - Sampling Process
• No cross-validation?
• Remember, in bootstrapping we sample with replacement, and therefore not all
observations are used in each bootstrap sample. It is observed that on average
about 1/3 of the observations are not used (the "out-of-bag" observations).
• If the base classifier is stable (insensitive to small perturbations of the
training set), bagging may not be able to improve its performance significantly.
• It may even degrade the classifier's performance, because the effective size
of each training set is about 37% smaller than the original data.
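A small simulation (sizes are arbitrary) confirming that roughly one third of the observations are left out of each bootstrap sample:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000                     # size of the original data set
trials = 100
oob_fractions = []

for _ in range(trials):
    sample = rng.integers(0, N, size=N)       # bootstrap sample indices
    n_unused = N - len(np.unique(sample))     # observations never drawn
    oob_fractions.append(n_unused / N)

# Converges to (1 - 1/N)^N -> e^{-1} ~ 0.368 as N grows.
print("average out-of-bag fraction:", np.mean(oob_fractions))
```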
Ensemble error illustration: with an ensemble of 11 independent base classifiers, each
having an error rate of 0.25, the majority vote is wrong only if 6 or more of them are wrong:
$\epsilon_{ens} = \sum_{k=6}^{11} \binom{11}{k} (0.25)^k (1 - 0.25)^{11-k} = 0.034$
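A one-line check of this number using Python's math.comb:

```python
from math import comb

# Ensemble of 11 base classifiers, each with independent error rate 0.25:
# the majority vote is wrong only if 6 or more base classifiers are wrong.
eps = sum(comb(11, k) * 0.25**k * 0.75**(11 - k) for k in range(6, 12))
print(round(eps, 3))   # 0.034
```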
• Easy to parallelize
Bagging – Limitation
• Each tree is identically distributed (i.d.)
• The expectation of the average of B such trees is the same as the
expectation of any one of them
• The bias of bagged trees is the same as that of the individual trees
An average of B i.i.d. random variables, each with variance σ², has variance σ²/B.
If the variables are only i.d. (identically distributed but not independent), with
pairwise correlation ρ, the variance of the average is:
$\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$
As B increases the second term disappears, but the first term remains.
Remember, we want i.i.d. trees so that the bias stays the same while the variance shrinks.
Other ideas?
We will still get correlated trees unless ... we randomly select a subset of the features!
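A small simulation (σ = 1 and ρ = 0.6 chosen arbitrarily) illustrating why the first term does not go away:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, rho, B, trials = 1.0, 0.6, 50, 20_000

# Draw B identically distributed variables with pairwise correlation rho
# via a shared component: x_b = sqrt(rho)*z + sqrt(1-rho)*e_b, so each x_b
# has variance sigma^2 = 1 and Corr(x_b, x_b') = rho.
z = rng.normal(size=(trials, 1))
e = rng.normal(size=(trials, B))
x = np.sqrt(rho) * z + np.sqrt(1 - rho) * e

avg = x.mean(axis=1)
print("empirical variance of the average :", avg.var())
print("theory rho*sigma^2 + (1-rho)*sigma^2/B:",
      rho * sigma**2 + (1 - rho) * sigma**2 / B)   # ~0.608, dominated by rho*sigma^2
```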
Bagging algorithms:
– Random forest
Boosting algorithms:
– AdaBoost
– Gradient Boosting
Random Forest
• Construct an ensemble of
decision trees by manipulating
training set as well as features
– Use bootstrap sample to train
every decision tree (similar to
Bagging)
– Use the following tree
induction algorithm:
• At every internal node of the
decision tree, randomly
sample m of the p attributes
and choose the split
criterion from among them
• Repeat this procedure
until all leaves are pure
(unpruned tree)
• Random Forest is an ensemble machine learning algorithm that follows the bagging
technique.
• Random Forest needs features that have at least some predictive power.
• The trees of the forest, and more importantly their predictions, need to be
uncorrelated (or at least have low correlations with each other).
• The algorithm can solve both types of problems, i.e., classification and regression.
• It has the power to handle large data sets with high dimensionality.
• It can handle thousands of input variables and identify the most significant
variables, so it is also considered one of the dimensionality reduction
methods.
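A minimal usage sketch with scikit-learn (the data set and hyperparameter values here are placeholders); max_features corresponds to the per-node attribute sampling described above, and oob_score gives the out-of-bag estimate without a separate cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data stands in for the actual training set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,      # B: number of bootstrapped trees
    max_features="sqrt",   # attributes sampled at each internal node
    bootstrap=True,        # one bootstrap sample per tree, as in bagging
    oob_score=True,        # out-of-bag estimate, no separate CV needed
    random_state=0,
)
rf.fit(X_train, y_train)
print("OOB accuracy :", rf.oob_score_)
print("Test accuracy:", rf.score(X_test, y_test))
```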
Random forests are popular. Leo Breiman and Adele Cutler maintain a random forest
website where the software is freely available, and of course it is included in every
ML/STAT package:
http://www.stat.berkeley.edu/~breiman/RandomForests/
For b = 1 to B:
(a) Draw a bootstrap sample of size N from the training data.
(b) Grow a random-forest tree Tb on the bootstrapped data by repeating the following
steps at each node, until the minimum node size is reached:
i. Select m variables at random from the p variables.
ii. Pick the best variable/split-point among the m.
iii. Split the node into two child nodes.
Output the ensemble of trees {Tb, b = 1, ..., B}.
• For regression, the default value for m is p/3 and the minimum node size is five.
• In practice the best values for these parameters will depend on the problem, and
they should be treated as tuning parameters.
• Like with bagging, we can use the OOB samples, and therefore an RF can be fit in one
sequence, with cross-validation effectively being performed along the way. Once the OOB
error stabilizes, the training can be terminated.
• Random forests "cannot overfit" the data with respect to the number of trees B, since
increasing B does not increase the flexibility of the model.
• The model outputs variable importance, which can be a very handy feature.
To measure the importance of variable j:
• Record the prediction accuracy on the OOB samples for each tree.
• Randomly permute the data for column j in the OOB samples and record the
accuracy again.
• The decrease in accuracy as a result of this permuting is averaged over all trees, and is
used as a measure of the importance of variable j in the random forest.
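scikit-learn implements the same idea in sklearn.inspection.permutation_importance; a brief sketch on synthetic data (here the permutation is scored on a held-out test set rather than the per-tree OOB samples):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Permute one column at a time and measure the drop in score.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
for j, imp in enumerate(result.importances_mean):
    print(f"variable {j}: importance {imp:.3f}")
```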
• When the number of variables is large but the fraction of relevant variables is
small, random forests are likely to perform poorly when m is small, because at
each split the chance that a relevant variable is selected can be small.
• For example, with 3 relevant and 100 not-so-relevant variables, the probability of
any of the relevant variables being selected at any split is only ~0.25 (with m = √p ≈ 10).
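A quick check of that probability, assuming m = 10 candidate variables are sampled per split out of p = 103:

```python
from math import comb

n_relevant, n_noise, m = 3, 100, 10
p_total = n_relevant + n_noise

# Probability that at least one of the 3 relevant variables is among
# the m randomly sampled candidates at a split.
p_none = comb(n_noise, m) / comb(p_total, m)
print(round(1 - p_none, 3))   # ~0.27, i.e. roughly 0.25
```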
Boosting
• What if a data point is incorrectly predicted by the first model, and then by the
next one (and probably by all models)? Will combining the predictions provide better
results? Such situations are taken care of by boosting.
• Records that are wrongly classified will have their weights increased in the next round
• Records that are classified correctly will have their weights decreased in the next round
Original Data      1   2   3   4   5   6   7   8   9   10
Boosting (Round 1) 7   3   2   8   7   9   4   10  6   3
Boosting (Round 2) 5   4   9   4   2   5   1   7   4   2
Boosting (Round 3) 4   4   8   10  4   5   4   6   3   4
(Record 4 is hard to classify; its weight keeps increasing, so it is sampled more and
more often in later rounds.)
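A minimal sketch of this weighted resampling, with hypothetical weights in which record 4 has been up-weighted after a misclassification:

```python
import numpy as np

rng = np.random.default_rng(0)
records = np.arange(1, 11)            # record ids 1..10

# Hypothetical weights after some boosting round: record 4 was misclassified,
# so its weight has been increased relative to the others.
weights = np.array([1, 1, 1, 5, 1, 1, 1, 1, 1, 1], dtype=float)
weights /= weights.sum()

# Next round's training sample: records are drawn in proportion to their weights.
next_round = rng.choice(records, size=10, replace=True, p=weights)
print(next_round)   # record 4 tends to appear several times
```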
• Errors are calculated using the actual and predicted values.
• The observations which are incorrectly predicted are given higher weights.
(Here, the three misclassified blue-plus points in the figure would be given higher weights.)
• Another model is then created and predictions are made on the dataset. (This model
tries to correct the errors of the previous model.)
• Weighted error rate of base classifier $C_i$:
$\epsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, \delta\big(C_i(x_j) \neq y_j\big)$
• Importance of a classifier:
$\alpha_i = \frac{1}{2} \ln\!\left(\frac{1 - \epsilon_i}{\epsilon_i}\right)$
https://en.wikipedia.org/wiki/AdaBoost#Choosing_αt
• Final classifier (weighted vote of the base classifiers):
$C^*(x) = \arg\max_{y} \sum_{j=1}^{T} \alpha_j \, \delta\big(C_j(x) = y\big)$
(Equations after Léon Bottou's lecture notes)
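A compact sketch of these rules (decision stumps via scikit-learn as base learners, with weights updated multiplicatively rather than by resampling), applied to the same ten-point data set used in the bagging example:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    """Minimal AdaBoost with decision stumps; y must be in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                              # start with uniform weights
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)        # weighted error eps_i
        if err >= 0.5:                                   # no better than random: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)            # importance alpha_i
        w *= np.exp(-alpha * y * pred)                   # raise weights of mistakes
        w /= w.sum()                                     # re-normalize
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(votes)            # final classifier: sign of the weighted vote

X = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])
stumps, alphas = adaboost_fit(X, y, T=3)
print(adaboost_predict(stumps, alphas, X))
```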
Example
Boosting Round 1:
x 0.1 0.4 0.5 0.6 0.6 0.7 0.7 0.7 0.8 1
y 1 -1 -1 -1 -1 -1 -1 -1 1 1
Boosting Round 2:
x 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3
y 1 1 1 1 1 1 1 1 1 1
Boosting Round 3:
x 0.2 0.2 0.4 0.4 0.4 0.4 0.5 0.6 0.6 0.7
y 1 1 -1 -1 -1 -1 -1 -1 -1 -1
Summary:
Round            x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1                 -1    -1    -1    -1    -1    -1    -1     1     1     1
2                  1     1     1     1     1     1     1     1     1     1
3                  1     1     1    -1    -1    -1    -1    -1    -1    -1
Sum              5.16  5.16  5.16 -3.08 -3.08 -3.08 -3.08 0.397 0.397 0.397
Predicted Class    1     1     1    -1    -1    -1    -1     1     1     1
The AdaBoost error function takes into account the fact that only the sign of the
final result is used; thus the sum can be far larger than 1 without increasing the
error.
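The classifier importances α1, α2, α3 implied by the Sum row can be recovered by solving a small linear system from the per-round predictions; a short check (values rounded as in the table):

```python
import numpy as np

# Per-round predictions at each of the ten x values (rows: rounds 1-3).
preds = np.array([
    [-1, -1, -1, -1, -1, -1, -1,  1,  1,  1],   # round 1
    [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],   # round 2
    [ 1,  1,  1, -1, -1, -1, -1, -1, -1, -1],   # round 3
])

# Recover alpha_1..alpha_3 from three columns of the Sum row
# (x = 0.1, 0.4, 0.8 give sums 5.16, -3.08, 0.397).
A = preds[:, [0, 3, 7]].T                 # sign pattern at those three points
alphas = np.linalg.solve(A, [5.16, -3.08, 0.397])
print(np.round(alphas, 2))                # ~ [1.74  2.78  4.12]

# Final classifier: sign of the alpha-weighted vote, matching the Predicted Class row.
print(np.sign(alphas @ preds))
```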
AdaBoost base learners
AdaBoost - Limitations
• Needs more data.
• The idea of gradient boosting originated in the observation by Leo Breiman that
boosting can be interpreted as an optimization algorithm on a suitable cost
function.
• The base predictor can be any machine learning algorithm, such as SVM, logistic
regression, KNN, or a decision tree, but the decision-tree version of gradient
boosting is by far the most popular.
• Both high-weight data points and gradients tell us how to improve our model.
• XGBoost (covered below) is a gradient boosting library designed to be highly
efficient, flexible and portable.
• You are given (x1, y1), (x2, y2), ..., (xn, yn), and the task is to fit a model F(x) to
minimize square loss.
• Suppose the model F is imperfect and we want to improve it by adding a model h.
• Rules:
– You are not allowed to remove anything from F or change any parameter in F.
– You may only add a new model h, so that the corrected prediction is F(x) + h(x),
ideally with F(x1) + h(x1) = y1, F(x2) + h(x2) = y2, ..., F(xn) + h(xn) = yn.
• Equivalently, h should fit the residuals: h(xi) ≈ yi − F(xi).
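A minimal sketch of this idea on synthetic data: F is a deliberately crude model (the mean), and a regression tree h is fit to the residuals y − F(x) without touching F:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# An intentionally imperfect model F: it just predicts the mean of y.
F = lambda X_: np.full(len(X_), y.mean())

# Fit h to the residuals y - F(x); F itself is left untouched.
residuals = y - F(X)
h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)

# The improved model is F(x) + h(x).
improved = F(X) + h.predict(X)
print("squared loss before:", np.mean((y - F(X)) ** 2))
print("squared loss after :", np.mean((y - improved) ** 2))
```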
for m = 1, . . . , M (this residual-fitting step is repeated for M boosting rounds)
Example (predicting weight; the initial prediction is the mean of the actual weights, 71.2):
Height  Age  Gender  Actual Weight  Predicted Weight 1
5.4     28   M       88             71.2
5.2     26   F       76             71.2
5       28   F       56             71.2
5.6     25   M       73             71.2
6       25   M       77             71.2
4       22   F       57             71.2
Combining the trees to make the new prediction: $F_m(x) = F_{m-1}(x) + \alpha \, h_m(x)$
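A sketch of one such round on the small weight table above (gender encoded numerically; the stump depth and the learning rate α are illustrative choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Height, Age, Gender (M=1, F=0) and the actual weights from the table.
X = np.array([[5.4, 28, 1],
              [5.2, 26, 0],
              [5.0, 28, 0],
              [5.6, 25, 1],
              [6.0, 25, 1],
              [4.0, 22, 0]])
y = np.array([88, 76, 56, 73, 77, 57], dtype=float)

F0 = np.full_like(y, y.mean())             # initial prediction: ~71.2 for every row
residuals = y - F0                          # e.g. 88 - 71.2 = 16.8 for the first row

h1 = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
alpha = 0.1                                 # learning rate (set to 1 for an unscaled update)
F1 = F0 + alpha * h1.predict(X)             # F_m(x) = F_{m-1}(x) + alpha * h_m(x)
print(np.round(F1, 1))
```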
Sebastian Raschka
Gradient Boosting -- Conceptual Overview
--> A Regression-based Example (house prices, in million US Dollars)
• Step 1: Initialize the prediction with the average target value,
$\hat{y}_1 = \frac{1}{n}\sum_{i=1}^{n} y^{(i)} = 0.5875$
• Step 2: Compute the residuals r = y − 0.5875 and fit a regression tree to them.
Example row: x1 = 5 (# Rooms), x2 = Boston (City), x3 = 30 (Age), y = 1.5 (Price),
so r = 1.5 − 0.5875 = 0.9125
[Figure: a regression tree fit to the residuals, splitting on Age >= 30 and # Rooms >= 10,
with leaf values 0.9125, −0.3375, −0.0875, −0.4125 and −0.4875]
• Step 3: Combine the tree from step 1 with the trees from step 2:
e.g., predict 0.5875 + α × (−0.4125),
where the learning rate α is between 0 and 1 (if α = 1: low bias but high variance)
Sebastian Raschka
Gradient Boosting -- Algorithm Overview
Step 1: Initialize the model with a constant prediction,
$h_0(x) = \arg\min_{\hat{y}} \sum_{i=1}^{n} L(y^{(i)}, \hat{y})$
(for the squared loss this is simply the mean of the targets, $\frac{1}{n}\sum_{i=1}^{n} y^{(i)}$)
Step 2: for t = 1 to T
A. Compute the pseudo residuals (of the t-th tree for the i-th example),
$r_{i,t} = -\left[\frac{\partial L(y^{(i)}, h(x^{(i)}))}{\partial h(x^{(i)})}\right]_{h(x)=h_{t-1}(x)}$ for i = 1 to n
B. Fit a tree to the $r_{i,t}$ values, and create terminal nodes $R_{j,t}$ for j = 1, ..., $J_t$
C. For each terminal node, compute the output value
$\hat{y}_{j,t} = \arg\min_{\hat{y}} \sum_{x^{(i)} \in R_{j,t}} L\big(y^{(i)}, h_{t-1}(x^{(i)}) + \hat{y}\big)$
D. Update the model: $h_t(x) = h_{t-1}(x) + \alpha\, \hat{y}_{j,t}$ for $x \in R_{j,t}$
The final model is the initial prediction plus all the scaled tree outputs,
$h_T(x) = h_0(x) + \alpha\, \hat{y}_{j,1} + \dots + \alpha\, \hat{y}_{j,T}$;
the idea is that we decrease the pseudo residuals by a small amount at each step.
Sebastian Raschka
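A compact sketch of this algorithm for the squared loss, where the pseudo residual reduces to y − h_{t−1}(x) and a fitted regression tree already returns the mean residual of each terminal node as its output value:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, T=100, alpha=0.1, max_depth=2):
    """Minimal gradient boosting for regression with squared loss."""
    h0 = y.mean()                        # Step 1: argmin of squared loss is the mean
    h = np.full(len(y), h0)
    trees = []
    for _ in range(T):
        residuals = y - h                # Step 2A: pseudo residuals for squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)  # 2B/2C
        h = h + alpha * tree.predict(X)  # Step 2D: move predictions a small step
        trees.append(tree)
    return h0, trees

def gradient_boost_predict(h0, trees, X, alpha=0.1):
    return h0 + alpha * sum(t.predict(X) for t in trees)

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=300)
h0, trees = gradient_boost_fit(X, y)
pred = gradient_boost_predict(h0, trees, X)
print("training MSE:", np.mean((y - pred) ** 2))
```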
XGBoost
https://arxiv.org/abs/1603.02754
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the
22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794). ACM.
Sebastian Raschka
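A brief usage sketch with the xgboost Python package's scikit-learn wrapper (the data set and hyperparameter values are placeholders):

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = xgb.XGBRegressor(
    n_estimators=200,     # number of boosted trees
    learning_rate=0.1,    # shrinkage (the alpha above)
    max_depth=3,          # depth of each tree
)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("test MSE:", np.mean((y_test - pred) ** 2))
```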