CS-11-01


Machine Learning

AIML CZG565
Ensemble Learning

BITS Pilani Course Faculty of M.Tech Cluster


BITS – CSIS - WILP
Pilani Campus
Machine Learning
Disclaimer and Acknowledgement

• The content of these modules and the context under each topic were planned by the course owner, Dr. Sugata, with grateful acknowledgement to the many others who made their course materials freely available online.
• We hereby acknowledge all the contributors for their material and inputs.
• We have provided source information wherever necessary.
• Students are requested to refer to the textbook for the detailed content of the presentation deck shared over Canvas.
• We have reduced the slides from Canvas and modified the content flow to suit the requirements of the course and for ease of class presentation.

Slide Source / Preparation / Review:


From BITS Pilani WILP: Profs. Sugata, Chetana, Monali, Rajavadhana, Seetha, Anita

External: CS109 and CS229 Stanford lecture notes, Dr. Andrew Ng, and many others who made their course materials freely available online
BITS Pilani, Pilani Campus
Course Plan

M1 Introduction & Mathematical Preliminaries

M2 Machine Learning Workflow

M3 Linear Models for Regression

M4 Linear Models for Classification

M5 Decision Tree

M6 Instance Based Learning

M7 Support Vector Machine

M8 Bayesian Learning

M9 Ensemble Learning

M10 Unsupervised Learning

M11 Machine Learning Model Evaluation/Comparison


BITS Pilani, Pilani Campus
Ensemble Learning
Ensemble Philosophy

• No Free Lunch Theorem: there is no algorithm that is always the most accurate.

• Each learning algorithm dictates a certain model that comes with a set of assumptions.
  – Each algorithm converges to a different solution and fails under different circumstances.

• The best-tuned learners could still miss some examples, and there could be other learners that work better on (maybe only) those!
  – In the absence of a single expert (a superior model), a committee (a combination of models) can do better!

• A committee can work in many ways ...

Note: a weak learner does only slightly better than random guessing.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Committee of Models

• Committee members are base learners!

• Major challenges in dealing with this committee:

  – Expertise of each of the members (does it help or not?)

  – Combining the results from the members for better performance

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Ensemble Methods

• Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

• Construct a set of classifiers from the training data.

• Predict the class label of test records by combining the predictions made by multiple classifiers.

• Tend to reduce problems related to over-fitting of the training data.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


General Approach

D: original training data

Step 1: Create multiple data sets D1, D2, ..., Dt-1, Dt
Step 2: Build multiple classifiers C1, C2, ..., Ct-1, Ct
Step 3: Combine the classifiers into C*

Introduction to Data Mining, 2nd Edition

BITS Pilani, Pilani Campus


Issue 1: On the Members (Base Learners)

• It does not help if all learners are good or bad at roughly the same thing
  – Need diverse learners

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Issue 1: On the Members (Base Learners)

• Use different algorithms
  – Different algorithms make different assumptions

• Use different hyperparameters
  – E.g., vary the structure of neural nets

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Issue 1: On the Members (Base Learners)

• Use different input representations
  – Uttered words + video information of the speakers' clips
  – Image + text annotations

• Use different training sets
  – Draw different random samples of the data
  – Partition the data in the input space and have learners specialize in those spaces (mixture of experts)

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Issue 2: Combining the Results of Base Learners

A simple combination scheme:

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Issue 2: Combining the Results of Base Learners

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Issue 2: Combining the Results of Base Learners

Avg

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


When Do Ensemble Methods Work?

• The ensemble classifier performs better than the base classifiers when the base error rate ε is smaller than 0.5.

• Necessary conditions for an ensemble classifier to perform better than a single classifier:

  – Base classifiers should be independent of each other

  – Base classifiers should do better than a classifier that performs random guessing

BITS Pilani, Pilani Campus


Necessary Conditions for Ensemble Methods

• Ensemble Methods work better than a single base classifier if:

– All base classifiers are independent of each other

– All base classifiers perform better than random guessing (error rate <
0.5 for binary classification)

Figure: classification error for an ensemble of 25 base classifiers, assuming their errors are uncorrelated.

Introduction to Data Mining, 2nd Edition

BITS Pilani, Pilani Campus


Why Ensemble Methods work?

• 25 base classifiers, each with error rate ε = 0.35.

• If the base classifiers are identical, then the ensemble will misclassify the same examples predicted incorrectly by the base classifiers (depicted by the dotted line).

• Assume the errors made by the classifiers are uncorrelated.

• The ensemble makes a wrong prediction only if more than half of the base classifiers make an error.

Figure: classification error for an ensemble of 25 base classifiers, assuming their errors are uncorrelated.

Introduction to Data Mining, 2nd Edition

BITS Pilani, Pilani Campus


Types of Ensemble Methods

• By manipulating training set

– Example: bagging, boosting, random forests

• By manipulating input features

– Example: random forests

Introduction to Data Mining, 2nd Edition

BITS Pilani, Pilani Campus


Bagging
Bootstrap Aggregating
Bagging (Bootstrap Aggregating)

• The technique uses these subsets (bags) to get a fair idea of the distribution of the complete set.

• The size of the subsets created for bagging may be less than that of the original set.

• Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, with replacement.

• When you sample with replacement, the draws are independent: one draw does not affect the outcome of another. With seven items, you have a 1/7 chance of choosing the first item and a 1/7 chance of choosing the second item.

• If the draws are dependent, i.e., linked to each other: when you choose the first item you have a 1/7 probability of picking an item; assuming you don't replace it, only six items remain, which gives a 1/6 chance of choosing the second item.
BITS Pilani, Pilani Campus
Bagging (Bootstrap Aggregating)

• Multiple subsets are created from the original dataset, selecting observations with
replacement.

• A base model (weak model) is created on each of these subsets.

• The models run in parallel and are independent of each other.


• The final predictions are determined by combining the predictions from all the
models.

BITS Pilani, Pilani Campus


Bagging at training time

Training set: D = {(x1, y1), ..., (xN, yN)}
For m = 1 to M: obtain a bootstrap sample Dm from the training data D (M subsets, sampled with replacement, each of size ≤ N).
Build a model Gm(x) from the bootstrap data Dm.
A base model (weak model) is created on each of these subsets.

BITS Pilani, Pilani Campus


Bagging at inference time

The models run in parallel and are independent of each other.

For a test sample, combine the predictions from all the models (e.g., a prediction with 75% confidence).

BITS Pilani, Pilani Campus


The Bagging Model

• Regression: average the outputs of the M base models,
  $\hat{y} = \frac{1}{M} \sum_{m=1}^{M} G_m(x)$

• Classification:
  – Vote over the classifier outputs $G_1(x), \dots, G_M(x)$
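A minimal sketch of these two combination rules (an illustration, not from the deck): scikit-learn regression trees are assumed as the base models Gm, each trained on a bootstrap sample; regression outputs are averaged, and classification labels in {-1, +1} are combined by taking the sign of the summed votes.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def bagging_fit(X, y, M=50):
    """Fit M base models G_1..G_M, each on a bootstrap sample of (X, y)."""
    models, n = [], len(X)
    for _ in range(M):
        idx = rng.integers(0, n, size=n)                  # bootstrap sample D_m (with replacement)
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagging_predict_regression(models, X):
    """Regression: y_hat = (1/M) * sum_m G_m(x)."""
    return np.mean([g.predict(X) for g in models], axis=0)

def bagging_predict_classification(models, X):
    """Classification with labels in {-1, +1}: majority vote = sign of summed votes.
    (Assumes the base models were fit as classifiers on +/-1 labels.)"""
    votes = np.stack([g.predict(X) for g in models])
    return np.sign(votes.sum(axis=0))
```

For classification one would fit `DecisionTreeClassifier` base models instead; the bootstrap loop is identical.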

BITS Pilani, Pilani Campus


Bagging Algorithm

BITS Pilani, Pilani Campus


Bagging Example

Consider a 1-dimensional data set:

Original Data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1

The classifier is a decision stump.
Decision rule: x ≤ k versus x > k (e.g., x ≤ 0.35 or x ≥ 0.75); the split point k is chosen based on entropy.

A decision stump is a decision tree with one internal node (the root) immediately connected to the terminal nodes (its leaves): if x ≤ k predict y_left, otherwise predict y_right. A decision stump makes a prediction based on the value of just a single input feature. Decision stumps are sometimes also called 1-rules.

BITS Pilani, Pilani Campus


Bagging Example

Bagging Round 1:
x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9    x <= 0.35 → y = 1; x > 0.35 → y = -1
y 1 1 1 1 -1 -1 -1 -1 1 1

Bagging Round 2:
x 0.1 0.2 0.3 0.4 0.5 0.5 0.9 1 1 1    x <= 0.7 → y = 1; x > 0.7 → y = 1
y 1 1 1 -1 -1 -1 1 1 1 1

Bagging Round 3:
x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9    x <= 0.35 → y = 1; x > 0.35 → y = -1
y 1 1 1 -1 -1 -1 -1 -1 1 1

Bagging Round 4:
x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9    x <= 0.3 → y = 1; x > 0.3 → y = -1
y 1 1 1 -1 -1 -1 -1 -1 1 1

Bagging Round 5:
x 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1 1 1    x <= 0.35 → y = 1; x > 0.35 → y = -1
y 1 1 1 -1 -1 -1 -1 1 1 1
Introduction to Data Mining, 2nd Edition

BITS Pilani, Pilani Campus


Bagging Example

Bagging Round 6:
x 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1    x <= 0.75 → y = -1; x > 0.75 → y = 1
y 1 -1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 7:
x 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1    x <= 0.75 → y = -1; x > 0.75 → y = 1
y 1 -1 -1 -1 -1 1 1 1 1 1

Bagging Round 8:
x 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1    x <= 0.75 → y = -1; x > 0.75 → y = 1
y 1 1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 9:
x 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1 1    x <= 0.75 → y = -1; x > 0.75 → y = 1
y 1 1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 10:
x 0.1 0.1 0.1 0.1 0.3 0.3 0.8 0.8 0.9 0.9    x <= 0.05 → y = 1; x > 0.05 → y = 1
y 1 1 1 1 1 1 1 1 1 1
Introduction to Data Mining, 2nd Edition

BITS Pilani, Pilani Campus


Bagging Example

Summary of Training sets:


Round Split Point Left Class Right Class
1 0.35 1 -1
2 0.7 1 1
3 0.35 1 -1
4 0.3 1 -1
5 0.35 1 -1
6 0.75 -1 1
7 0.75 -1 1
8 0.75 -1 1
9 0.75 -1 1
10 0.05 1 1

Introduction to Data Mining, 2nd Edition

BITS Pilani, Pilani Campus


Bagging Example

• Assume test set is the same as the original data


• Use majority vote to determine class of ensemble classifier
Original Data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 1 1 1 -1 -1 -1 -1 -1 -1 -1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
4 1 1 1 -1 -1 -1 -1 -1 -1 -1
5 1 1 1 -1 -1 -1 -1 -1 -1 -1
6 -1 -1 -1 -1 -1 -1 -1 1 1 1
7 -1 -1 -1 -1 -1 -1 -1 1 1 1
8 -1 -1 -1 -1 -1 -1 -1 1 1 1
9 -1 -1 -1 -1 -1 -1 -1 1 1 1
10 1 1 1 1 1 1 1 1 1 1
Sum 2 2 2 -6 -6 -6 -6 2 2 2
Predicted Class (sign) 1 1 1 -1 -1 -1 -1 1 1 1
Introduction to Data Mining, 2nd Edition
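As a quick check of this table, the short NumPy sketch below (not part of the original deck) applies the ten stumps from the summary table to the original x values, sums the ±1 votes, and takes the sign.

```python
import numpy as np

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])

# (split point k, left class for x <= k, right class for x > k), one per bagging round
stumps = [(0.35, 1, -1), (0.7, 1, 1), (0.35, 1, -1), (0.3, 1, -1), (0.35, 1, -1),
          (0.75, -1, 1), (0.75, -1, 1), (0.75, -1, 1), (0.75, -1, 1), (0.05, 1, 1)]

votes = np.array([[left if xi <= k else right for xi in x] for k, left, right in stumps])
total = votes.sum(axis=0)       # column sums:  2  2  2 -6 -6 -6 -6  2  2  2
prediction = np.sign(total)     # ensemble:     1  1  1 -1 -1 -1 -1  1  1  1
```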

BITS Pilani, Pilani Campus


Bagging as Decision Tree

Source: Hastie et al., "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", Springer (2009)
BITS Pilani, Pilani Campus
Bagging – Sampling Process

• No cross-validation?
• Remember, in bootstrapping we sample with replacement, and therefore not all observations are used for each bootstrap sample. On average, about one-third of the observations are not used.

• We call them out-of-bag (OOB) samples.

• We can predict the response for the i-th observation using each of the trees in which that observation was OOB, and do this for all n observations.

• Calculate the overall OOB MSE or classification error.
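A hedged scikit-learn sketch of this idea: `BaggingClassifier` with `oob_score=True` evaluates each observation only on the trees for which it was out of bag, so no separate cross-validation split is needed (the dataset and hyperparameters below are illustrative).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Default base estimator is a decision tree; oob_score=True requests the OOB estimate
bag = BaggingClassifier(n_estimators=100, oob_score=True, random_state=0).fit(X, y)

print(bag.oob_score_)            # OOB accuracy
print(1 - bag.oob_score_)        # OOB classification error
```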

BITS Pilani, Pilani Campus


Bagging – Variable Importance

• Bagging results in improved accuracy over prediction using a single tree.

• Unfortunately, the resulting model is difficult to interpret: bagging improves prediction accuracy at the expense of interpretability.

• Calculate the total amount by which the RSS or entropy is decreased due to splits over a given predictor, averaged over all trees.

• A visualization of these values for every feature in a sample use case is shown here.
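In scikit-learn, for example, this averaged impurity-decrease measure is exposed as `feature_importances_` on tree ensembles; a brief illustrative snippet (not from the deck):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Mean decrease in impurity per feature, averaged over all trees in the ensemble
for name, importance in zip(load_iris().feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```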

BITS Pilani, Pilani Campus


Bagging – Effect on Bias

• If a base classifier is stable, i.e., robust to minor perturbations in the training


set, then the error of the ensemble is primarily caused by bias in the base
classifier.

• In this situation, bagging may not be able to improve the performance of the
base classifiers significantly.

• It may even degrade the classifier's performance because the effective size
of each training set is about 37% smaller than the original data.

BITS Pilani, Pilani Campus


Additional Reading Material

Source Credit : Sebastian Raschka


BITS Pilani
Why Majority Voting Works – Proof

• Assume n independent classifiers with a base error rate ϵ. Here, independent means that the errors are uncorrelated.

• Assume a binary classification task.

• Assume the error rate is better than random guessing (i.e., lower than 0.5 for binary classification):

  ∀ϵi ∈ {ϵ1, ϵ2, . . . , ϵn}, ϵi < 0.5

• The probability that k classifiers predict the same (wrong) class label is given by the PMF of the binomial distribution:

  $P(k) = \binom{n}{k}\,\epsilon^{k}(1-\epsilon)^{n-k}$, and the ensemble is wrong when $k \geq \lceil n/2 \rceil$
BITS Pilani, Pilani Campus


Cont…

• The probability that we make a wrong prediction via the ensemble if k classifiers predict the same (wrong) class label is given by the PMF of the binomial distribution:

  $P(k) = \binom{n}{k}\,\epsilon^{k}(1-\epsilon)^{n-k}$, for $k \geq \lceil n/2 \rceil$

• Ensemble error: $\epsilon_{\mathrm{ens}} = \sum_{k=\lceil n/2 \rceil}^{n} \binom{n}{k}\,\epsilon^{k}(1-\epsilon)^{n-k}$

• E.g., if we consider the results of 11 classifiers with ϵ = 0.25, then at least 6 classifiers must have produced the same wrong label, i.e.,

  $\epsilon_{\mathrm{ens}} = \sum_{k=6}^{11} \binom{11}{k}\, 0.25^{k}\,(1-0.25)^{11-k} = 0.034$
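The ensemble error above is easy to verify numerically; the following small Python check (standard library only) reproduces the 0.034 figure for n = 11 classifiers with ε = 0.25.

```python
from math import comb, ceil

def ensemble_error(n, eps):
    """P(majority of n independent classifiers is wrong), each with error rate eps."""
    k_min = ceil(n / 2)  # at least ceil(n/2) classifiers must be wrong simultaneously
    return sum(comb(n, k) * eps**k * (1 - eps)**(n - k) for k in range(k_min, n + 1))

print(round(ensemble_error(11, 0.25), 3))  # 0.034
print(round(ensemble_error(25, 0.35), 3))  # error of the 25-classifier ensemble discussed earlier
```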

BITS Pilani, Pilani Campus


Bagging –Advantages

• Reduces overfitting (variance)

• Normally uses one type of classifier

• Decision trees are popular

• Easy to parallelize

Bagging – Limitation
• Each tree is identically distributed (i.d.)
• The expectation of the average of B such trees is the same as the
expectation of any one of them

• The bias of bagged trees is the same as that of the individual trees

• Results in a model that is i.d. and not i.i.d

BITS Pilani, Pilani Campus


Bagging – Limitation (Cont…)

An average of B i.i.d. random variables, each with variance σ², has variance σ²/B.
If the trees are only i.d. (identically distributed but not independent) with pairwise correlation ρ, then the variance of the average is ρσ² + ((1 − ρ)/B)σ².

As B increases the second term disappears, but the first term remains.

Why does bagging generate correlated trees?

Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors.
Then all bagged trees will select the strong predictor at the top of the tree, and therefore all trees will look similar.
How do we avoid this?
A restriction on the choice of predictors and the number of trees that can use the same predictor may help, but this leads to too much variance.

BITS Pilani, Pilani Campus


Bagging – Limitation (Cont…)

Remember, we want i.i.d. trees so that the bias stays the same and the variance is reduced. Other ideas?

What if we consider only a subset of the predictors at each split?

We will still get correlated trees unless ... we randomly select the subset!

BITS Pilani, Pilani Campus


Algorithms based on Bagging and Boosting

Bagging algorithms:

– Random forest

Boosting algorithms:

– AdaBoost

– Gradient Boosting

BITS Pilani, Pilani Campus


Random Forest

BITS Pilani
Random Forest

• An ensemble method specifically designed for decision tree classifiers
• A Random Forest grows many trees
  – An ensemble of unpruned decision trees (lower correlation across trees)
  – Each base classifier classifies a "new" vector of attributes from the original data
  – Final result on classifying a new instance: voting
  – The forest chooses the classification result having the most votes (over all the trees in the forest)

Image credit: https://medium.com

BITS Pilani, Pilani Campus


Random Forest Algorithm

• Construct an ensemble of
decision trees by manipulating
training set as well as features
– Use bootstrap sample to train
every decision tree (similar to
Bagging)
– Use the following tree
induction algorithm:
• At every internal node of
decision tree, randomly
sample p attributes for
selecting split criterion
• Repeat this procedure
until all leaves are pure
(unpruned tree)

BITS Pilani, Pilani Campus


Random Forest

• Trees that are trained on different sets of data (bagging)


• Trees use different features to make decisions.

Image credit: https://medium.com


BITS Pilani, Pilani Campus
Random Forest– Summary

• Random Forest is an ensemble machine learning algorithm that follows the bagging technique.

• The base estimators in random forest are decision trees.


• Random forest randomly selects a set of features which are used to decide the
best split at each node of the decision tree.

• Random subsets are created from the original dataset (bootstrapping).


• At each node in the decision tree, only a random set of features are considered
to decide the best split.

• A decision tree model is fitted on each of the subsets.


• The final prediction is calculated by averaging the predictions from all decision
trees.

BITS Pilani, Pilani Campus


Random Forest

• Random Forest needs features that have at least some predictive power.
• The trees of the forest, and more importantly their predictions, need to be uncorrelated (or at least have low correlations with each other).

• The algorithm can solve both types of problems, i.e., classification and regression.
• It has the power to handle large data sets with high dimensionality.
• It can handle thousands of input variables and identify the most significant ones, so it is also considered one of the dimensionality reduction methods.

BITS Pilani, Pilani Campus


Random Forest

Cheaper Feature Selection

BITS Pilani, Pilani Campus


Additional Reading Material

Random forests are popular. Leo Breiman and Adele Cutler maintain a random forest website where the software is freely available, and of course it is included in every ML/STAT package.

http://www.stat.berkeley.edu/~breiman/RandomForests/

Source Credit : Original Paper : https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf


BITS Pilani, Pilani Campus
Random Forest– Algorithm

For b = 1 to B:

(a) Draw a bootstrap sample Z∗ of size N from the training data.


(b) Grow a random-forest tree to the bootstrapped data, by recursively
repeating the following steps for each terminal node of the tree, until the
minimum node size nmin is reached.

i. Select m variables at random from the p variables.

ii. Pick the best variable/split-point among the m.


iii. Split the node into two child nodes.

Output the ensemble of trees.

To make a prediction at a new point x we do:


– For regression: average the results

– For classification: majority vote
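The following is a minimal from-scratch sketch of this procedure, assuming scikit-learn's `DecisionTreeClassifier` as the tree grower: `max_features=m` performs the random selection of m variables at each node, and the default unlimited depth approximates growing the tree until the minimum node size is reached.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, B=100, m="sqrt", seed=0):
    """Grow B trees, each on a bootstrap sample, considering only m random features per split."""
    rng = np.random.default_rng(seed)
    trees, n = [], len(X)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                 # (a) bootstrap sample Z* of size N
        tree = DecisionTreeClassifier(max_features=m,    # (b) i/ii: best split among m random variables
                                      min_samples_leaf=1)  # grow until the minimum node size is reached
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_random_forest(trees, X):
    """Classification: majority vote over all trees (assumes integer class labels >= 0)."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)   # shape (B, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```

For regression, the vote would simply be replaced by an average of the trees' predictions.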


BITS Pilani, Pilani Campus
Random Forest– Algorithm

The inventors make the following recommendations:

• For classification, the default value for m is √p and the minimum node size is one. If m = p, then it is simply bagging.

• For regression, the default value for m is p/3 and the minimum node size is five.
• In practice the best values for these parameters depend on the problem, and they should be treated as tuning parameters.

• As with bagging, we can use OOB samples, so a random forest can be fit in one sequence, with cross-validation effectively performed along the way. Once the OOB error stabilizes, the training can be terminated.
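In scikit-learn terms, these recommendations map roughly onto the constructor arguments below (a usage sketch; exact library defaults vary between versions).

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf = RandomForestClassifier(n_estimators=500,
                             max_features="sqrt",   # m = sqrt(p)
                             min_samples_leaf=1,    # minimum node size of one
                             oob_score=True, random_state=0)

reg = RandomForestRegressor(n_estimators=500,
                            max_features=1/3,       # m = p/3, given as a fraction of features
                            min_samples_leaf=5,     # minimum node size of five
                            oob_score=True, random_state=0)
```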

BITS Pilani, Pilani Campus


Random Forest– Algorithm

BITS Pilani, Pilani Campus


Random Forest– Advantages

• The algorithm can solve both types of problems, i.e., classification and regression.

• It has the power to handle large data sets with high dimensionality.

• It can handle thousands of input variables and identify the most significant ones, so it is also considered one of the dimensionality reduction methods.

• Random forests "cannot overfit" the data with respect to the number of trees, since increasing the number of trees B does not increase the flexibility of the model.

• The model outputs the importance of each variable, which can be a very handy feature (on some random data sets):

  – Record the prediction accuracy on the OOB samples for each tree.
  – Randomly permute the data for column j in the OOB samples and record the accuracy again.

  – The decrease in accuracy as a result of this permuting is averaged over all trees and is used as a measure of the importance of variable j in the random forest.

BITS Pilani, Pilani Campus


Random Forest– Disadvantages

• May over-fit data sets that are particularly noisy.


• Random Forest can feel like a black box approach for statistical modelers – you
have very little control on what the model does. You can at best – try different
parameters and random seeds!

• When the number of variables is large, but the fraction of relevant variables is
small, random forests are likely to perform poorly when m is small. Because at
each split the chance can be small that the relevant variables will be selected

• For example, with 3 relevant and 100 not so relevant variables the probability of
any of the relevant variables being selected at any split is ~0.25

BITS Pilani, Pilani Campus


Random Forest– Disadvantages

BITS Pilani, Pilani Campus


Boosting

BITS Pilani
Boosting

• What if a data point is incorrectly predicted by the first model, and then by the next (and probably by all models)? Will combining the predictions provide better results? Such situations are taken care of by boosting.

• Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model.

• The succeeding models are dependent on the previous model.

BITS Pilani, Pilani Campus


Boosting

Train predictors sequentially, each trying to correct its predecessor.

AdaBoost: an iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records.

• Initially, all N records are assigned equal weights (for being selected for training).

• Unlike bagging, the weights may change at the end of each boosting round.

BITS Pilani, Pilani Campus


Boosting

• Records that are wrongly classified will have their weights increased in the next round

• Records that are classified correctly will have their weights decreased in the next round

Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4

• Example 4 is hard to classify


• Its weight is increased, therefore it is more likely to be
chosen again in subsequent rounds

BITS Pilani, Pilani Campus


Boosting - Approach :

• A subset is created from the original dataset.

• Initially, all data points are given equal weights.

• A base model is created on this subset.

• This model is used to make predictions on the whole dataset.

• Errors are calculated using the actual values and predicted values.
• The observations which are incorrectly predicted, are given higher weights.
(Here, the three misclassified blue-plus points will be given higher weights)

• Another model is created and predictions are made on the dataset. (This model
tries to correct the errors from the previous model)

BITS Pilani, Pilani Campus


Boosting

• Multiple models are


created, each correcting the
errors of the previous
model.
• The final model (strong
learner) is the weighted
mean of all the models
(weak learners).

• Individual models would not


perform well on the entire
dataset, but they work well
for some part of the
dataset. Thus, each model
actually boosts the
performance of the
ensemble.

BITS Pilani, Pilani Campus


AdaBoost

• Adaptive boosting or AdaBoost is one of the


simplest boosting algorithms. Usually, decision
trees are used for modelling. Multiple sequential
models are created, each correcting the errors
from the last model.
• AdaBoost assigns weights to the observations
which are incorrectly predicted and the
subsequent model works to predict these values
correctly.

BITS Pilani, Pilani Campus


AdaBoost Algorithm

• Initially, all observations (n) in the dataset are


given equal weights (1/n).
• A model is built on a subset of data.
• Using this model, predictions are made on the
whole dataset.
• Errors are calculated by comparing the
predictions and actual values.
• While creating the next model, higher weights
are given to the data points which were
predicted incorrectly.

BITS Pilani, Pilani Campus


Adaboost Algorithm

• Weights can be determined using the error


value. For instance, higher the error more is
the weight assigned to the observation.
• This process is repeated until the error
function does not change, or the maximum
limit of the number of estimators is reached.

BITS Pilani, Pilani Campus


AdaBoost

Base classifiers: $C_1, C_2, \dots, C_T$, trained on $N$ input samples with weights $w_j$.

Error rate:
$\epsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \,\delta\!\big(C_i(x_j) \neq y_j\big)$

Importance of a classifier:
$\alpha_i = \frac{1}{2} \ln\!\left(\frac{1 - \epsilon_i}{\epsilon_i}\right)$
https://en.wikipedia.org/wiki/AdaBoost#Choosing_αt

BITS Pilani, Pilani Campus


AdaBoost: Weight Update

Weight update (Eqn. 5.88):
$w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} \exp(-\alpha_j) & \text{if } C_j(x_i) = y_i \\ \exp(\alpha_j) & \text{if } C_j(x_i) \neq y_i \end{cases}$
where $Z_j$ is the normalization factor.

Final classification:
$C^{*}(x) = \arg\max_{y} \sum_{j=1}^{T} \alpha_j \,\delta\!\big(C_j(x) = y\big)$

• Reduce weight if correctly classified else increase


• If any intermediate rounds produce error rate higher than
50%, the weights are reverted back to 1/n and the resampling
procedure is repeated
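A compact sketch of this training loop with decision stumps as base learners (scikit-learn trees, labels assumed in {-1, +1}); it follows the formulas above rather than any particular library's AdaBoost implementation, and simply resets the weights instead of resampling when a round's error reaches 0.5.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    """AdaBoost with decision stumps; y must be in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1.0 / n)                          # start with equal weights 1/n
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.sum(w * (pred != y)) / np.sum(w)    # weighted error rate
        if eps >= 0.5:                               # no better than random: revert weights, skip round
            w = np.full(n, 1.0 / n)
            continue
        alpha = 0.5 * np.log((1 - eps) / eps)        # importance of this classifier
        w = w * np.exp(-alpha * y * pred)            # shrink correct, grow incorrect weights
        w /= w.sum()                                 # normalize (the Z_j factor)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted vote: sign of the alpha-weighted sum of stump predictions."""
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)
```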

BITS Pilani, Pilani Campus


AdaBoost Algorithm

BITS Pilani, Pilani Campus


AdaBoost Algorithm

Note: α in the earlier slides is the same as β, the weight of the classifier.


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

AdaBoost Algorithm

Member classifiers with less error are given more weight in the final ensemble hypothesis.
The final prediction is a weighted combination of each member's prediction.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example

From Léon Bottou

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example

$\epsilon_i = \frac{1}{N}\sum_{j=1}^{N} w_j\,\delta\!\big(C_i(x_j)\neq y_j\big)$, $\quad \alpha_i = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_i}{\epsilon_i}\right)$

From Léon Bottou

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example

$\epsilon_i = \frac{1}{N}\sum_{j=1}^{N} w_j\,\delta\!\big(C_i(x_j)\neq y_j\big)$, $\quad \alpha_i = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_i}{\epsilon_i}\right)$

From Léon Bottou

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example

$\alpha_i = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_i}{\epsilon_i}\right)$

From Léon Bottou
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example

From Léon Bottou

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example

How do we combine the results now?

From Léon Bottou

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example

How do we combine the results now?

From Léon Bottou

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


AdaBoost Example

Training sets for the first 3 boosting rounds:

Boosting Round 1:
x 0.1 0.4 0.5 0.6 0.6 0.7 0.7 0.7 0.8 1
y 1 -1 -1 -1 -1 -1 -1 -1 1 1

Boosting Round 2:
x 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3
y 1 1 1 1 1 1 1 1 1 1

Boosting Round 3:
x 0.2 0.2 0.4 0.4 0.4 0.4 0.5 0.6 0.6 0.7
y 1 1 -1 -1 -1 -1 -1 -1 -1 -1

Summary:

Round Split Point Left Class Right Class alpha


1 0.75 -1 1 1.738
2 0.05 1 1 2.7784
3 0.3 1 -1 4.1195
AdaBoost Example

Weights:
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
2 0.311 0.311 0.311 0.01 0.01 0.01 0.01 0.01 0.01 0.01
3 0.029 0.029 0.029 0.228 0.228 0.228 0.228 0.009 0.009 0.009

Classification:
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 -1 -1 -1 -1 -1 -1 -1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
Sum 5.16 5.16 5.16 -3.08 -3.08 -3.08 -3.08 0.397 0.397 0.397
Predicted Class (sign) 1 1 1 -1 -1 -1 -1 1 1 1

The AdaBoost error function takes into account the fact that only the sign of the final result is used; thus the sum can be far larger than 1 without increasing the error.
AdaBoost base learners

BITS Pilani, Pilani Campus


AdaBoost in practice

BITS Pilani, Pilani Campus


AdaBoost - Advantages

• Fast and Simple to Program

• No parameter tuning is required (except T)

• No assumption is made on weak learners

AdaBoost - Limitations
• Need more data.

• Affected by the presence of noise

• Doesn’t work well in the presence of large number of outliers

BITS Pilani, Pilani Campus


Gradient Boosting

• The idea of gradient boosting originated in the observation by Leo Breiman that boosting can be interpreted as an optimization algorithm on a suitable cost function.

• It optimizes a cost function over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction.

• The predictor can be any machine learning algorithm, such as SVM, logistic regression, KNN, decision tree, etc., but the decision tree version of gradient boosting is the most popular.

• In gradient boosting, "shortcomings" are identified by gradients.

• Recall that, in AdaBoost, "shortcomings" are identified by high-weight data points.

• Both high-weight data points and gradients tell us how to improve our model.

BITS Pilani, Pilani Campus


XGBoost

• XGBoost (Extreme Gradient Boosting) uses the gradient boosting (GBM) framework at its core.
• It is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable.
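A minimal usage sketch, assuming the `xgboost` Python package and its scikit-learn-style wrapper (the hyperparameter values are illustrative, not recommendations from the deck).

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = xgb.XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=4,
                         reg_lambda=1.0,        # L2 regularization on leaf weights
                         subsample=0.8,         # row-based subsampling
                         colsample_bytree=0.8)  # column-based subsampling
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # R^2 on held-out data
```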

BITS Pilani, Pilani Campus


Gradient Boosting - Idea

• You are given (x1, y1),(x2, y2), ...,(xn, yn), and the task is to fit a model F(x) to
minimize square loss

• There are some mistakes:


F(x1) = 0.8, while y1 = 0.9,

F(x2) = 1.4 while y2 = 1.3...

How can you improve this model?

• Rules:
– You are not allowed to remove anything from F or change any parameter in
F.

– You can add an additional model (regression tree) h to F, so the new


prediction will be F(x) + h(x).

BITS Pilani, Pilani Campus


Gradient Boosting

You wish to improve the model such that
– F(x1) + h(x1) = y1
– F(x2) + h(x2) = y2
...
– F(xn) + h(xn) = yn

Or, equivalently:
h(x1) = y1 − F(x1), h(x2) = y2 − F(x2), ..., h(xn) = yn − F(xn)

Fit a regression tree h to the data (x1, y1 − F(x1)), (x2, y2 − F(x2)), ..., (xn, yn − F(xn)).

• Simple solution: the yi − F(xi) are called residuals.
• These are the parts that the existing model F cannot do well.
• The role of h is to compensate for the shortcomings of the existing model F.
• If the new model F + h is still not satisfactory, we can add another regression tree...

BITS Pilani, Pilani Campus


Gradient boosting: Summary

• Gradient boosting involves three elements:


• A loss function to be optimized
• For example, regression may use a squared error and classification may use
logarithmic loss.
• A weak learner to make predictions E.g Decision tree/Decision stump
• An additive model to add weak learners to minimize the loss function.
• Trees are added one at a time, and existing trees in the model are not
changed.
• A gradient descent procedure is used to minimize the loss when adding trees.
• Instead of parameters, we have weak learners
• After calculating the loss, to perform the gradient descent procedure, we must
add a tree to the model that reduces the loss (i.e. follow the gradient).

BITS Pilani, Pilani Campus


Gradient boosting algorithm

let F0 be a "dummy" constant model

for m = 1, . . . , M

    for each pair (xi, yi) in the training set

        compute the pseudo-residual R(yi, Fm−1(xi)) = negative gradient of the loss

    train a regression sub-model hm on the pseudo-residuals

    add hm to the ensemble: Fm(x) = Fm−1(x) + α · hm(x), where α is the learning rate

return the ensemble FM
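A minimal from-scratch sketch of this loop for squared loss, using scikit-learn regression trees as the sub-models; the learning rate α and the tree depth are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=100, alpha=0.1, max_depth=2):
    """Gradient boosting for regression with squared loss: pseudo-residuals are y - F(x)."""
    F0 = np.mean(y)                      # "dummy" constant model
    F = np.full(len(y), F0)
    trees = []
    for _ in range(M):
        residuals = y - F                # negative gradient of 0.5*(y - F)^2 w.r.t. F
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        F = F + alpha * h.predict(X)     # F_m(x) = F_{m-1}(x) + alpha * h_m(x)
        trees.append(h)
    return F0, trees

def gradient_boost_predict(F0, trees, X, alpha=0.1):
    """Evaluate the ensemble: constant model plus the scaled sum of all sub-models."""
    return F0 + alpha * sum(h.predict(X) for h in trees)
```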

BITS Pilani, Pilani Campus


Gradient boosting: Example

F0 is a "dummy" constant model: the average value is predicted.

Height Age Gender Actual Weight Predicted Weight (F0)
5.4 28 M 88 71.2
5.2 26 F 76 71.2
5 28 F 56 71.2
5.6 25 M 73 71.2
6 25 M 77 71.2
4 22 F 57 71.2

Example credit: https://medium.com/nerd-for-tech/gradient-boost-for-regression-explained-6561eec192cb
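The first boosting step on this toy table can be reproduced in a few lines (a sketch of the slide's arithmetic; the M=1/F=0 gender encoding is an assumption, and scikit-learn's tree may split slightly differently from the hand-drawn one).

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({"Height": [5.4, 5.2, 5.0, 5.6, 6.0, 4.0],
                   "Age":    [28, 26, 28, 25, 25, 22],
                   "Gender": [1, 0, 0, 1, 1, 0],           # assumed encoding: M=1, F=0
                   "Weight": [88, 76, 56, 73, 77, 57]})

F0 = df["Weight"].mean()                    # ~71.2, the constant model
residuals = df["Weight"] - F0               # pseudo-residuals that h1 must fit

h1 = DecisionTreeRegressor(max_leaf_nodes=4).fit(df[["Height", "Age", "Gender"]], residuals)
F1 = F0 + 0.1 * h1.predict(df[["Height", "Age", "Gender"]])   # learning rate 0.1
print(F0, F1)
```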


BITS Pilani, Pilani Campus
Iteration 1

Compute the pseudo-residual R(yi, Fm−1(xi)) = negative gradient of the loss.
Residual: h1(x) = y − F0(x)

A tree with a maximum of 4 leaf nodes (a hyperparameter of the decision tree) is built using Height, Age and Gender to predict the residuals (errors).

Example credit: https://medium.com/nerd-for-tech/gradient-boost-for-regression-explained-6561eec192cb


BITS Pilani, Pilani Campus
Iteration 1

Combining the trees to make the new prediction: Fm(x) = Fm−1(x) + α · hm(x)

Learning rate α = 0.1

F1(x) = F0(x) + 0.1 · h1(x)

Example credit: https://medium.com/nerd-for-tech/gradient-boost-for-regression-explained-6561eec192cb


BITS Pilani, Pilani Campus
Iteration 1
Residual: h2(x) = y − F1(x)

Example credit: https://medium.com/nerd-for-tech/gradient-boost-for-regression-explained-6561eec192cb


BITS Pilani, Pilani Campus
Iteration 2

Example credit: https://medium.com/nerd-for-tech/gradient-boost-for-regression-explained-6561eec192cb


BITS Pilani, Pilani Campus
Additional Reading Material

Source Credit : Sebastian Raschka


BITS Pilani
Gradient Boosting -- Conceptual Overview

• Step 1: Construct a base tree (just the root node)

• Step 2: Build next tree based on errors of the


previous tree

• Step 3: Combine tree from step 1 with trees from


step 2. Go back to step 2.

Sebastian Raschka
Gradient Boosting -- Conceptual Overview
--> A Regression-based Example
In million US Dollars

x1# Rooms x2=City x3=Age y=Price


5 Boston 30 1.5
10 Madison 20 0.5
6 Lansing 20 0.25
5 Waunakee 10 0.1

• Step 1: Construct a base tree (just the root node)


$\hat{y}_1 = \frac{1}{n} \sum_{i=1}^{n} y^{(i)} = 0.5875$

Sebastian Raschka
Gradient Boosting -- Conceptual Overview
--> A Regression-based Example

• Step 2: Build next tree based on errors of the


previous tree

First, compute the (pseudo-)residuals: $r_1 = y - \hat{y}_1$


In million US Dollars

x1# x2=City x3=Age y=Price r1=Res


5 Boston 30 1.5 1.5 - 0.5875 = 0.9125
10 Madison 20 0.5 0.5 - 0.5875 = -0.0875
6 Lansing 20 0.25 0.25 - 0.5875 = -0.3375
5 Waunake 10 0.1 0.1 - 0.5875 = -0.4875

Sebastian Raschka
Gradient Boosting -- Conceptual Overview
--> A Regression-based Example

• Step 2: Build next tree based on errors of the


previous tree
Then, create a tree based on x1, ..., xm to fit the residuals:

x1=#Rooms x2=City x3=Age y=Price r1=Residual
5 Boston 30 1.5 1.5 - 0.5875 = 0.9125
10 Madison 20 0.5 0.5 - 0.5875 = -0.0875
6 Lansing 20 0.25 0.25 - 0.5875 = -0.3375
5 Waunakee 10 0.1 0.1 - 0.5875 = -0.4875

Resulting tree: split on Age >= 30 (Yes → leaf 0.9125); otherwise split on #Rooms >= 10 (Yes → leaf -0.0875; No → leaf -0.4125, the average of -0.3375 and -0.4875).
Sebastian Raschka
Gradient Boosting -- Conceptual Overview
--> A Regression-based Example

• Step 3: Combine the tree from step 1 with the trees from step 2

x1=#Rooms x2=City x3=Age y=Price r=Residual
5 Boston 30 1.5 1.5 - 0.5875 = 0.9125
10 Madison 20 0.5 0.5 - 0.5875 = -0.0875
6 Lansing 20 0.25 0.25 - 0.5875 = -0.3375
5 Waunakee 10 0.1 0.1 - 0.5875 = -0.4875

Combined prediction = $\hat{y}_1 = \frac{1}{n}\sum_{i=1}^{n} y^{(i)} = 0.5875$ (tree from step 1) + the residual tree from step 2 (Age >= 30 → 0.9125; else #Rooms >= 10 → -0.0875; else -0.4125).

Sebastian Raschka
Gradient Boosting -- Conceptual Overview
--> A Regression-based Example
• Step 3: Combine the tree from step 1 with the trees from step 2

x1=#Rooms x2=City x3=Age y=Price r=Residual
5 Boston 30 1.5 1.5 - 0.5875 = 0.9125
10 Madison 20 0.5 0.5 - 0.5875 = -0.0875
6 Lansing 20 0.25 0.25 - 0.5875 = -0.3375
5 Waunakee 10 0.1 0.1 - 0.5875 = -0.4875

E.g., to predict Lansing: base prediction $\hat{y}_1 = 0.5875$ plus the residual-tree leaf for Lansing (Age < 30, #Rooms < 10 → −0.4125):
prediction = 0.5875 + α × (−0.4125),
where the learning rate α is between 0 and 1 (if α = 1, low bias but high variance).

Sebastian Raschka
Gradient Boosting -- Algorithm Overview

Step 0: Input data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ and a differentiable loss function $L(y^{(i)}, h(x^{(i)}))$

Step 1: Initialize the model $h_0(x) = \arg\min_{\hat{y}} \sum_{i=1}^{n} L(y^{(i)}, \hat{y})$

Step 2: for t = 1 to T
  A. Compute the pseudo-residuals $r_{i,t} = -\left[\frac{\partial L(y^{(i)}, h(x^{(i)}))}{\partial h(x^{(i)})}\right]_{h(x)=h_{t-1}(x)}$ for i = 1 to n
  B. Fit a tree to the $r_{i,t}$ values, and create terminal nodes $R_{j,t}$ for $j = 1, \dots, J_t$
  ...
Sebastian Raschka
Gradient Boosting -- Algorithm Overview
Step 2: for t = 1 to T
  A. Compute the pseudo-residuals $r_{i,t} = -\left[\frac{\partial L(y^{(i)}, h(x^{(i)}))}{\partial h(x^{(i)})}\right]_{h(x)=h_{t-1}(x)}$ for i = 1 to n
  B. Fit a tree to the $r_{i,t}$ values, and create terminal nodes $R_{j,t}$ for $j = 1, \dots, J_t$
  C. For $j = 1, \dots, J_t$, compute $\hat{y}_{j,t} = \arg\min_{\hat{y}} \sum_{x^{(i)} \in R_{j,t}} L(y^{(i)}, h_{t-1}(x^{(i)}) + \hat{y})$
  D. Update $h_t(x) = h_{t-1}(x) + \alpha \sum_{j=1}^{J_t} \hat{y}_{j,t}\, \mathbb{I}(x \in R_{j,t})$

Step 3: Return $h_T(x)$
Sebastian Raschka
Gradient Boosting -- Algorithm Overview Discussion

Step 0: Input data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ and a differentiable loss function $L(y^{(i)}, h(x^{(i)}))$

E.g., the sum-squared error in regression:
$L = \frac{1}{2}\big(y^{(i)} - h(x^{(i)})\big)^2$

$\frac{\partial}{\partial h(x^{(i)})} \, \frac{1}{2}\big(y^{(i)} - h(x^{(i)})\big)^2 = 2 \times \frac{1}{2}\big(y^{(i)} - h(x^{(i)})\big) \times (0 - 1) = -\big(y^{(i)} - h(x^{(i)})\big)$  [chain rule; the negative residual]

Sebastian Raschka
Gradient Boosting -- Algorithm Overview Discussion

Step 1: Initialize the model $h_0(x) = \arg\min_{\hat{y}} \sum_{i=1}^{n} L(y^{(i)}, \hat{y})$ (prediction vs. target)

This turns out to be the average (in regression): $\frac{1}{n}\sum_{i=1}^{n} y^{(i)}$

Sebastian Raschka
Gradient Boosting -- Algorithm Overview Discussion

Loop to make T trees (e.g., T = 100)

Step 2: for t = 1 to T
  A. Compute the pseudo-residual $r_{i,t} = -\left[\frac{\partial L(y^{(i)}, h(x^{(i)}))}{\partial h(x^{(i)})}\right]_{h(x)=h_{t-1}(x)}$ for i = 1 to n
  ($r_{i,t}$ is the pseudo-residual of the t-th tree and i-th example; the bracketed term is the derivative of the loss function.)
Sebastian Raschka
Gradient Boosting -- Algorithm Overview Discussion

Loop to make T trees (e.g., T = 100)

Step 2: for t = 1 to T
  A. Compute the pseudo-residual $r_{i,t} = -\left[\frac{\partial L(y^{(i)}, h(x^{(i)}))}{\partial h(x^{(i)})}\right]_{h(x)=h_{t-1}(x)}$ for i = 1 to n
  B. Fit a tree to the $r_{i,t}$ values, and create terminal nodes $R_{j,t}$ for $j = 1, \dots, J_t$
  (Use the features in the dataset to fit the tree; in the running example the three terminal nodes $R_{1,t}, R_{2,t}, R_{3,t}$ hold the leaf values -0.4125, -0.0875 and 0.9125.)

Sebastian Raschka
Gradient Boosting -- Algorithm Overview Discussion
Step 2: for t = 1 to T
  A. Compute the pseudo-residual $r_{i,t} = -\left[\frac{\partial L(y^{(i)}, h(x^{(i)}))}{\partial h(x^{(i)})}\right]_{h(x)=h_{t-1}(x)}$ for i = 1 to n
  B. Fit a tree to the $r_{i,t}$ values, and create terminal nodes $R_{j,t}$ for $j = 1, \dots, J_t$
  C. For $j = 1, \dots, J_t$, compute $\hat{y}_{j,t} = \arg\min_{\hat{y}} \sum_{x^{(i)} \in R_{j,t}} L(y^{(i)}, h_{t-1}(x^{(i)}) + \hat{y})$
  (Compute the output value for each leaf node, considering only the examples at that leaf node; this is like step 1, but with the previous prediction added.)
Sebastian Raschka
Gradient Boosting -- Algorithm Overview Discussion
Step 2: for t = 1 to T
  A. Compute the pseudo-residual $r_{i,t} = -\left[\frac{\partial L(y^{(i)}, h(x^{(i)}))}{\partial h(x^{(i)})}\right]_{h(x)=h_{t-1}(x)}$ for i = 1 to n
  B. Fit a tree to the $r_{i,t}$ values, and create terminal nodes $R_{j,t}$ for $j = 1, \dots, J_t$
  C. For $j = 1, \dots, J_t$, compute $\hat{y}_{j,t} = \arg\min_{\hat{y}} \sum_{x^{(i)} \in R_{j,t}} L(y^{(i)}, h_{t-1}(x^{(i)}) + \hat{y})$
  D. Update $h_t(x) = h_{t-1}(x) + \alpha \sum_{j=1}^{J_t} \hat{y}_{j,t}\, \mathbb{I}(x \in R_{j,t})$
  (α is the learning rate, between 0 and 1, usually 0.1; the summation is just in case examples end up in multiple nodes.)
Sebastian Raschka
Gradient Boosting -- Algorithm Overview Discussion

For prediction, combine all T trees, e.g.,

$h_0(x) = \arg\min_{\hat{y}} \sum_{i=1}^{n} L(y^{(i)}, \hat{y})$
$\;+\; \alpha\, \hat{y}_{j,t=1}$, where $\hat{y}_{j,t=1} = \arg\min_{\hat{y}} \sum_{x^{(i)} \in R_{j,1}} L(y^{(i)}, h_{0}(x^{(i)}) + \hat{y})$
...
$\;+\; \alpha\, \hat{y}_{j,T}$, where $\hat{y}_{j,T} = \arg\min_{\hat{y}} \sum_{x^{(i)} \in R_{j,T}} L(y^{(i)}, h_{T-1}(x^{(i)}) + \hat{y})$

Sebastian Raschka
Gradient Boosting -- Algorithm Overview Discussion

For prediction, combine all T trees, e.g.,

$h_0(x) = \arg\min_{\hat{y}} \sum_{i=1}^{n} L(y^{(i)}, \hat{y}) \;+\; \alpha\, \hat{y}_{j,t=1} \;+\; \dots \;+\; \alpha\, \hat{y}_{j,T}$

The idea is that we decrease the pseudo-residuals by a small amount at each step.

Sebastian Raschka
XGBoost
https://arxiv.org/abs/1603.02754
Chen, T., & Guestrin, C. (2016, August). Xgboost: A scalable tree boosting system. In Proceedings of the
22nd acm sigkdd international conference on knowledge discovery and data mining (pp. 785-794). ACM.

Summary and Main Points:


▪ scalable implementation of gradient boosting
▪ Improvements include: regularized loss, sparsity-aware algorithm,
weighted quantile sketch for approximate tree learning, caching of
access patterns, data compression, sharding
▪ Decision trees based on CART
▪ Regularization term for penalizing model (tree) complexity
▪ Uses second order approximation for optimizing the objective
▪ Options for column-based and row-based subsampling
▪ Single-machine version of XGBoost supports the exact greedy
algorithm

Sebastian Raschka
References

• Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, Vipin Kumar
• Bishop, Pattern Recognition and Machine Learning, Springer, 2006
• Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann, 2011 (The Morgan Kaufmann Series in Data Management Systems)
• A Gentle Introduction to Gradient Boosting, Cheng Li ([email protected]), College of Computer and Information Science, Northeastern University
• https://en.wikipedia.org/wiki/Gradient_boosting#:~:text=Gradient%20boosting%20is%20a%20machine,which%20are%20typically%20decision%20trees
• https://medium.com/nerd-for-tech/gradient-boost-for-regression-explained-6561eec192cb

Next Session Plan

• Unsupervised Learning
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
