CS-11-01


Machine Learning

AIML CZG565
Ensemble Learning

BITS Pilani Course Faculty of M.Tech Cluster


BITS – CSIS - WILP
Pilani Campus
Machine Learning
Disclaimer and Acknowledgement

• The content of these modules and the context under each topic were planned by the course owner, Dr. Sugata, with grateful acknowledgement to the many others who made their course materials freely available online.
• We hereby acknowledge all the contributors for their material and inputs.
• We have provided source information wherever necessary.
• Students are requested to refer to the textbook for the detailed content of the presentation deck shared over Canvas.
• We have reduced the slides from Canvas and modified the content flow to suit the requirements of the course and for ease of class presentation.

Slide Source / Preparation / Review:


From BITS Pilani WILP: Profs. Sugata, Chetana, Monali, Rajavadhana, Seetha, Anita

External: CS109 and CS229 Stanford lecture notes, Dr. Andrew Ng, and many others who made their course materials freely available online
BITS Pilani, Pilani Campus
Course Plan

M1 Introduction & Mathematical Preliminaries

M2 Machine Learning Workflow

M3 Linear Models for Regression

M4 Linear Models for Classification

M5 Decision Tree

M6 Instance Based Learning

M7 Support Vector Machine

M8 Bayesian Learning

M9 Ensemble Learning

M10 Unsupervised Learning

M11 Machine Learning Model Evaluation/Comparison


BITS Pilani, Pilani Campus
Ensemble Learning
Ensemble Philosophy

• No Free Lunch Theorem: there is no algorithm that is always the most accurate.

• Each learning algorithm dictates a certain model that comes with a set of assumptions.
  – Each algorithm converges to a different solution and fails under different circumstances.

• The best-tuned learners could still miss some examples, and there could be other learners that work better on (maybe only) those!
  – In the absence of a single expert (a superior model), a committee (a combination of models) can do better!

• A committee can work in many ways ...

Note: a weak learner does only slightly better than random guessing.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Committee of Models

• Committee members are base learners!

• Major challenges in dealing with this committee:

  – Expertise of each of the members (does it help or not?)

  – Combining the results from the members for better performance

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Ensemble Methods

• Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

• Construct a set of classifiers from the training data.

• Predict the class label of test records by combining the predictions made by multiple classifiers.

• Tend to reduce problems related to over-fitting of the training data.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


General Approach

D: original training data

Step 1: Create multiple data sets D1, D2, ..., Dt-1, Dt
Step 2: Build multiple classifiers C1, C2, ..., Ct-1, Ct
Step 3: Combine the classifiers into C*

Introduction to Data Mining, 2nd Edition

BITS Pilani, Pilani Campus


Issue 1: On the Members (Base Learners)

• It does not help if all learners are good or bad at roughly the same thing
  – Need diverse learners

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Issue 1: On the Members (Base Learners)

• Use different algorithms
  – Different algorithms make different assumptions

• Use different hyperparameters
  – E.g., vary the structure of neural nets

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Issue 1: On the Members (Base Learners)

• Use different input representations
  – Uttered words + video information of the speakers' clips
  – Image + text annotations

• Use different training sets
  – Draw different random samples of the data
  – Partition the data in the input space and have learners specialize in those spaces (mixture of experts)

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Issue 2: Combining the Results of Base Learners

A simple combination scheme:

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Issue 2: Combining the Results of Base Learners

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Issue 2: Combining the Results of Base Learners

Avg

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


When Do Ensemble Methods Work?

• The ensemble classifier performs better than the base classifiers when the base error rate ε is smaller than 0.5.

• Necessary conditions for an ensemble classifier to perform better than a single classifier:

  – Base classifiers should be independent of each other

  – Base classifiers should do better than a classifier that performs random guessing

BITS Pilani, Pilani Campus


Necessary Conditions for Ensemble Methods

• Ensemble Methods work better than a single base classifier if:

– All base classifiers are independent of each other

– All base classifiers perform better than random guessing (error rate <
0.5 for binary classification)

Figure: classification error for an ensemble of 25 base classifiers, assuming their errors are uncorrelated.

Introduction to Data Mining, 2nd Edition

BITS Pilani, Pilani Campus


Why Ensemble Methods work?

• 25 base classifiers, each with error rate ε = 0.35.

• If the base classifiers are identical, then the ensemble will misclassify the same examples predicted incorrectly by the base classifiers (depicted by the dotted line).

• Assume the errors made by the classifiers are uncorrelated.

• The ensemble makes a wrong prediction only if more than half of the base classifiers make an error.

Figure: classification error for an ensemble of 25 base classifiers, assuming their errors are uncorrelated.

Introduction to Data Mining, 2nd Edition

BITS Pilani, Pilani Campus


Types of Ensemble Methods

• By manipulating training set

– Example: bagging, boosting, random forests

• By manipulating input features

– Example: random forests

Introduction to Data Mining, 2nd Edition

BITS Pilani, Pilani Campus


Bagging
Bootstrap Aggregating
Bagging (Bootstrap Aggregating)

• The technique uses these subsets (bags) to get a fair idea of the distribution of the complete set.

• The size of the subsets created for bagging may be less than that of the original set.

• Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, with replacement.

• When you sample with replacement, the draws are independent: one draw does not affect the outcome of another. With seven items, you have a 1/7 chance of choosing the first item and a 1/7 chance of choosing the second item.

• If the draws are dependent, i.e., linked to each other: when you choose the first item you have a 1/7 probability of picking an item; assuming you don't replace it, only six items remain, which gives a 1/6 chance of choosing the second item.
BITS Pilani, Pilani Campus
Bagging (Bootstrap Aggregating)

• Multiple subsets are created from the original dataset, selecting observations with
replacement.

• A base model (weak model) is created on each of these subsets.

• The models run in parallel and are independent of each other.


• The final predictions are determined by combining the predictions from all the
models.

BITS Pilani, Pilani Campus


Bagging at training time

Training set: D = {(x1, y1), ..., (xN, yN)}
For m = 1 to M: obtain a bootstrap sample Dm from the training data D (M subsets, sampled with replacement, each of size ≤ N).
Build a model Gm(x) from the bootstrap data Dm.
A base model (weak model) is created on each of these subsets.

BITS Pilani, Pilani Campus


Bagging at inference time

The models run in parallel and are independent of each other.

For a test sample, combine the predictions from all the models (e.g., a prediction with 75% confidence).

BITS Pilani, Pilani Campus


The Bagging Model

• Regression: average the outputs of the M base models,
  $\hat{y} = \frac{1}{M} \sum_{m=1}^{M} G_m(x)$

• Classification:
  – Vote over the classifier outputs $G_1(x), \dots, G_M(x)$
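A minimal sketch of these two combination rules (an illustration, not from the deck): scikit-learn regression trees are assumed as the base models Gm, each trained on a bootstrap sample; regression outputs are averaged, and classification labels in {-1, +1} are combined by taking the sign of the summed votes.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def bagging_fit(X, y, M=50):
    """Fit M base models G_1..G_M, each on a bootstrap sample of (X, y)."""
    models, n = [], len(X)
    for _ in range(M):
        idx = rng.integers(0, n, size=n)                  # bootstrap sample D_m (with replacement)
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagging_predict_regression(models, X):
    """Regression: y_hat = (1/M) * sum_m G_m(x)."""
    return np.mean([g.predict(X) for g in models], axis=0)

def bagging_predict_classification(models, X):
    """Classification with labels in {-1, +1}: majority vote = sign of summed votes.
    (Assumes the base models were fit as classifiers on +/-1 labels.)"""
    votes = np.stack([g.predict(X) for g in models])
    return np.sign(votes.sum(axis=0))
```

For classification one would fit `DecisionTreeClassifier` base models instead; the bootstrap loop is identical.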

BITS Pilani, Pilani Campus


Bagging Algorithm

BITS Pilani, Pilani Campus


Bagging Example

Consider a 1-dimensional data set:

Original Data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1

The classifier is a decision stump.
Decision rule: x ≤ k versus x > k (e.g., x ≤ 0.35 or x ≥ 0.75); the split point k is chosen based on entropy.

A decision stump is a decision tree with one internal node (the root) immediately connected to the terminal nodes (its leaves): if x ≤ k predict y_left, otherwise predict y_right. A decision stump makes a prediction based on the value of just a single input feature. Decision stumps are sometimes also called 1-rules.

BITS Pilani, Pilani Campus


Bagging Example

Bagging Round 1:
x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9    x <= 0.35 → y = 1; x > 0.35 → y = -1
y 1 1 1 1 -1 -1 -1 -1 1 1

Bagging Round 2:
x 0.1 0.2 0.3 0.4 0.5 0.5 0.9 1 1 1    x <= 0.7 → y = 1; x > 0.7 → y = 1
y 1 1 1 -1 -1 -1 1 1 1 1

Bagging Round 3:
x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9    x <= 0.35 → y = 1; x > 0.35 → y = -1
y 1 1 1 -1 -1 -1 -1 -1 1 1

Bagging Round 4:
x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9    x <= 0.3 → y = 1; x > 0.3 → y = -1
y 1 1 1 -1 -1 -1 -1 -1 1 1

Bagging Round 5:
x 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1 1 1    x <= 0.35 → y = 1; x > 0.35 → y = -1
y 1 1 1 -1 -1 -1 -1 1 1 1
Introduction to Data Mining, 2nd Edition

BITS Pilani, Pilani Campus


Bagging Example

Bagging Round 6:
x 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1    x <= 0.75 → y = -1; x > 0.75 → y = 1
y 1 -1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 7:
x 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1    x <= 0.75 → y = -1; x > 0.75 → y = 1
y 1 -1 -1 -1 -1 1 1 1 1 1

Bagging Round 8:
x 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1    x <= 0.75 → y = -1; x > 0.75 → y = 1
y 1 1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 9:
x 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1 1    x <= 0.75 → y = -1; x > 0.75 → y = 1
y 1 1 -1 -1 -1 -1 -1 1 1 1

Bagging Round 10:
x 0.1 0.1 0.1 0.1 0.3 0.3 0.8 0.8 0.9 0.9    x <= 0.05 → y = 1; x > 0.05 → y = 1
y 1 1 1 1 1 1 1 1 1 1
Introduction to Data Mining, 2nd Edition

BITS Pilani, Pilani Campus


Bagging Example

Summary of Training sets:


Round Split Point Left Class Right Class
1 0.35 1 -1
2 0.7 1 1
3 0.35 1 -1
4 0.3 1 -1
5 0.35 1 -1
6 0.75 -1 1
7 0.75 -1 1
8 0.75 -1 1
9 0.75 -1 1
10 0.05 1 1

Introduction to Data Mining, 2nd Edition

BITS Pilani, Pilani Campus


Bagging Example

• Assume test set is the same as the original data


• Use majority vote to determine class of ensemble classifier
Original Data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 1 1 1 -1 -1 -1 -1 -1 -1 -1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
4 1 1 1 -1 -1 -1 -1 -1 -1 -1
5 1 1 1 -1 -1 -1 -1 -1 -1 -1
6 -1 -1 -1 -1 -1 -1 -1 1 1 1
7 -1 -1 -1 -1 -1 -1 -1 1 1 1
8 -1 -1 -1 -1 -1 -1 -1 1 1 1
9 -1 -1 -1 -1 -1 -1 -1 1 1 1
10 1 1 1 1 1 1 1 1 1 1
Sum 2 2 2 -6 -6 -6 -6 2 2 2
Predicted Class (sign) 1 1 1 -1 -1 -1 -1 1 1 1
Introduction to Data Mining, 2nd Edition
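As a quick check of this table, the short NumPy sketch below (not part of the original deck) applies the ten stumps from the summary table to the original x values, sums the ±1 votes, and takes the sign.

```python
import numpy as np

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])

# (split point k, left class for x <= k, right class for x > k), one per bagging round
stumps = [(0.35, 1, -1), (0.7, 1, 1), (0.35, 1, -1), (0.3, 1, -1), (0.35, 1, -1),
          (0.75, -1, 1), (0.75, -1, 1), (0.75, -1, 1), (0.75, -1, 1), (0.05, 1, 1)]

votes = np.array([[left if xi <= k else right for xi in x] for k, left, right in stumps])
total = votes.sum(axis=0)       # column sums:  2  2  2 -6 -6 -6 -6  2  2  2
prediction = np.sign(total)     # ensemble:     1  1  1 -1 -1 -1 -1  1  1  1
```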

BITS Pilani, Pilani Campus


Bagging as Decision Tree

Source: Hastie et al., "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", Springer (2009)
BITS Pilani, Pilani Campus
Bagging – Sampling Process

• No cross-validation?
• Remember, in bootstrapping we sample with replacement, and therefore not all observations are used for each bootstrap sample. On average, about one-third of the observations are not used.

• We call them out-of-bag (OOB) samples.

• We can predict the response for the i-th observation using each of the trees in which that observation was OOB, and do this for all n observations.

• Calculate the overall OOB MSE or classification error.
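A hedged scikit-learn sketch of this idea: `BaggingClassifier` with `oob_score=True` evaluates each observation only on the trees for which it was out of bag, so no separate cross-validation split is needed (the dataset and hyperparameters below are illustrative).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Default base estimator is a decision tree; oob_score=True requests the OOB estimate
bag = BaggingClassifier(n_estimators=100, oob_score=True, random_state=0).fit(X, y)

print(bag.oob_score_)            # OOB accuracy
print(1 - bag.oob_score_)        # OOB classification error
```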

BITS Pilani, Pilani Campus


Bagging – Variable Importance

• Bagging results in improved accuracy over prediction using a single tree.

• Unfortunately, the resulting model is difficult to interpret: bagging improves prediction accuracy at the expense of interpretability.

• Calculate the total amount by which the RSS or entropy is decreased due to splits over a given predictor, averaged over all trees.

• A visualization of these values for every feature in a sample use case is shown here.
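In scikit-learn, for example, this averaged impurity-decrease measure is exposed as `feature_importances_` on tree ensembles; a brief illustrative snippet (not from the deck):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Mean decrease in impurity per feature, averaged over all trees in the ensemble
for name, importance in zip(load_iris().feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```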

BITS Pilani, Pilani Campus


Bagging – Effect on Bias

• If a base classifier is stable, i.e., robust to minor perturbations in the training


set, then the error of the ensemble is primarily caused by bias in the base
classifier.

• In this situation, bagging may not be able to improve the performance of the
base classifiers significantly.

• It may even degrade the classifier's performance because the effective size
of each training set is about 37% smaller than the original data.

BITS Pilani, Pilani Campus


Additional Reading Material

Source Credit : Sebastian Raschka


BITS Pilani
Why Majority Voting Works – Proof

• Assume n independent classifiers with a base error rate ϵ. Here, independent means that the errors are uncorrelated.

• Assume a binary classification task.

• Assume the error rate is better than random guessing (i.e., lower than 0.5 for binary classification):

  ∀ϵi ∈ {ϵ1, ϵ2, . . . , ϵn}, ϵi < 0.5

• The probability that k classifiers predict the same (wrong) class label is given by the PMF of the binomial distribution:

  $P(k) = \binom{n}{k}\,\epsilon^{k}(1-\epsilon)^{n-k}$, and the ensemble is wrong when $k \geq \lceil n/2 \rceil$
BITS Pilani, Pilani Campus


Cont…

• The probability that we make a wrong prediction via the ensemble if k classifiers predict the same (wrong) class label is given by the PMF of the binomial distribution:

  $P(k) = \binom{n}{k}\,\epsilon^{k}(1-\epsilon)^{n-k}$, for $k \geq \lceil n/2 \rceil$

• Ensemble error: $\epsilon_{\mathrm{ens}} = \sum_{k=\lceil n/2 \rceil}^{n} \binom{n}{k}\,\epsilon^{k}(1-\epsilon)^{n-k}$

• E.g., if we consider the results of 11 classifiers with ϵ = 0.25, then at least 6 classifiers must have produced the same wrong label, i.e.,

  $\epsilon_{\mathrm{ens}} = \sum_{k=6}^{11} \binom{11}{k}\, 0.25^{k}\,(1-0.25)^{11-k} = 0.034$
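The ensemble error above is easy to verify numerically; the following small Python check (standard library only) reproduces the 0.034 figure for n = 11 classifiers with ε = 0.25.

```python
from math import comb, ceil

def ensemble_error(n, eps):
    """P(majority of n independent classifiers is wrong), each with error rate eps."""
    k_min = ceil(n / 2)  # at least ceil(n/2) classifiers must be wrong simultaneously
    return sum(comb(n, k) * eps**k * (1 - eps)**(n - k) for k in range(k_min, n + 1))

print(round(ensemble_error(11, 0.25), 3))  # 0.034
print(round(ensemble_error(25, 0.35), 3))  # error of the 25-classifier ensemble discussed earlier
```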

BITS Pilani, Pilani Campus


Bagging –Advantages

• Reduces overfitting (variance)

• Normally uses one type of classifier

• Decision trees are popular

• Easy to parallelize

Bagging – Limitation
• Each tree is identically distributed (i.d.)
• The expectation of the average of B such trees is the same as the
expectation of any one of them

• The bias of bagged trees is the same as that of the individual trees

• Results in a model that is i.d. and not i.i.d

BITS Pilani, Pilani Campus


Bagging – Limitation (Cont…)

An average of B i.i.d. random variables, each with variance σ², has variance σ²/B.
If the trees are only i.d. (identically distributed but not independent) with pairwise correlation ρ, then the variance of the average is ρσ² + ((1 − ρ)/B)σ².

As B increases the second term disappears, but the first term remains.

Why does bagging generate correlated trees?

Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors.
Then all bagged trees will select the strong predictor at the top of the tree, and therefore all trees will look similar.
How do we avoid this?
A restriction on the choice of predictors and the number of trees that can use the same predictor may help, but this leads to too much variance.

BITS Pilani, Pilani Campus


Bagging – Limitation (Cont…)

Remember, we want i.i.d. trees so that the bias stays the same and the variance is reduced. Other ideas?

What if we consider only a subset of the predictors at each split?

We will still get correlated trees unless ... we randomly select the subset!

BITS Pilani, Pilani Campus


Algorithms based on Bagging and Boosting

Bagging algorithms:

– Random forest

Boosting algorithms:

– AdaBoost

– Gradient Boosting

BITS Pilani, Pilani Campus


Random Forest

BITS Pilani
Random Forest

• An ensemble method specifically designed for decision tree classifiers
• A Random Forest grows many trees
  – An ensemble of unpruned decision trees (lower correlation across trees)
  – Each base classifier classifies a "new" vector of attributes from the original data
  – Final result on classifying a new instance: voting
  – The forest chooses the classification result having the most votes (over all the trees in the forest)

Image credit: https://medium.com

BITS Pilani, Pilani Campus


Random Forest Algorithm

• Construct an ensemble of
decision trees by manipulating
training set as well as features
– Use bootstrap sample to train
every decision tree (similar to
Bagging)
– Use the following tree
induction algorithm:
• At every internal node of
decision tree, randomly
sample p attributes for
selecting split criterion
• Repeat this procedure
until all leaves are pure
(unpruned tree)

BITS Pilani, Pilani Campus


Random Forest

• Trees that are trained on different sets of data (bagging)


• Trees use different features to make decisions.

Image credit: https://medium.com


BITS Pilani, Pilani Campus
Random Forest– Summary

• Random Forest is an ensemble machine learning algorithm that follows the bagging technique.

• The base estimators in random forest are decision trees.


• Random forest randomly selects a set of features which are used to decide the
best split at each node of the decision tree.

• Random subsets are created from the original dataset (bootstrapping).


• At each node in the decision tree, only a random set of features are considered
to decide the best split.

• A decision tree model is fitted on each of the subsets.


• The final prediction is calculated by averaging the predictions from all decision
trees.

BITS Pilani, Pilani Campus


Random Forest

• Random Forest needs features that have at least some predictive power.
• The trees of the forest, and more importantly their predictions, need to be uncorrelated (or at least have low correlations with each other).

• The algorithm can solve both types of problems, i.e., classification and regression.
• It has the power to handle large data sets with high dimensionality.
• It can handle thousands of input variables and identify the most significant ones, so it is also considered one of the dimensionality reduction methods.

BITS Pilani, Pilani Campus


Random Forest

Cheaper Feature Selection

BITS Pilani, Pilani Campus


Additional Reading Material

Random forests are popular. Leo Breiman and Adele Cutler maintain a random forest website where the software is freely available, and of course it is included in every ML/STAT package.

http://www.stat.berkeley.edu/~breiman/RandomForests/

Source Credit : Original Paper : https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf


BITS Pilani, Pilani Campus
Random Forest– Algorithm

For b = 1 to B:

(a) Draw a bootstrap sample Z∗ of size N from the training data.


(b) Grow a random-forest tree to the bootstrapped data, by recursively
repeating the following steps for each terminal node of the tree, until the
minimum node size nmin is reached.

i. Select m variables at random from the p variables.

ii. Pick the best variable/split-point among the m.


iii. Split the node into two child nodes.

Output the ensemble of trees.

To make a prediction at a new point x we do:


– For regression: average the results

– For classification: majority vote
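The following is a minimal from-scratch sketch of this procedure, assuming scikit-learn's `DecisionTreeClassifier` as the tree grower: `max_features=m` performs the random selection of m variables at each node, and the default unlimited depth approximates growing the tree until the minimum node size is reached.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, B=100, m="sqrt", seed=0):
    """Grow B trees, each on a bootstrap sample, considering only m random features per split."""
    rng = np.random.default_rng(seed)
    trees, n = [], len(X)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                 # (a) bootstrap sample Z* of size N
        tree = DecisionTreeClassifier(max_features=m,    # (b) i/ii: best split among m random variables
                                      min_samples_leaf=1)  # grow until the minimum node size is reached
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_random_forest(trees, X):
    """Classification: majority vote over all trees (assumes integer class labels >= 0)."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)   # shape (B, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```

For regression, the vote would simply be replaced by an average of the trees' predictions.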


BITS Pilani, Pilani Campus
Random Forest– Algorithm

The inventors make the following recommendations:

• For classification, the default value for m is √p and the minimum node size is one. If m = p, then it is simply bagging.

• For regression, the default value for m is p/3 and the minimum node size is five.
• In practice the best values for these parameters depend on the problem, and they should be treated as tuning parameters.

• As with bagging, we can use OOB samples, so a random forest can be fit in one sequence, with cross-validation effectively performed along the way. Once the OOB error stabilizes, the training can be terminated.
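In scikit-learn terms, these recommendations map roughly onto the constructor arguments below (a usage sketch; exact library defaults vary between versions).

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf = RandomForestClassifier(n_estimators=500,
                             max_features="sqrt",   # m = sqrt(p)
                             min_samples_leaf=1,    # minimum node size of one
                             oob_score=True, random_state=0)

reg = RandomForestRegressor(n_estimators=500,
                            max_features=1/3,       # m = p/3, given as a fraction of features
                            min_samples_leaf=5,     # minimum node size of five
                            oob_score=True, random_state=0)
```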

BITS Pilani, Pilani Campus


Random Forest– Algorithm

BITS Pilani, Pilani Campus


Random Forest– Advantages

• The algorithm can solve both types of problems, i.e., classification and regression.

• It has the power to handle large data sets with high dimensionality.

• It can handle thousands of input variables and identify the most significant ones, so it is also considered one of the dimensionality reduction methods.

• Random forests "cannot overfit" the data with respect to the number of trees, since increasing the number of trees B does not increase the flexibility of the model.

• The model outputs the importance of each variable, which can be a very handy feature (on some random data sets):

  – Record the prediction accuracy on the OOB samples for each tree.
  – Randomly permute the data for column j in the OOB samples and record the accuracy again.

  – The decrease in accuracy as a result of this permuting is averaged over all trees and is used as a measure of the importance of variable j in the random forest.

BITS Pilani, Pilani Campus


Random Forest– Disadvantages

• May over-fit data sets that are particularly noisy.


• Random Forest can feel like a black box approach for statistical modelers – you
have very little control on what the model does. You can at best – try different
parameters and random seeds!

• When the number of variables is large, but the fraction of relevant variables is
small, random forests are likely to perform poorly when m is small. Because at
each split the chance can be small that the relevant variables will be selected

• For example, with 3 relevant and 100 not so relevant variables the probability of
any of the relevant variables being selected at any split is ~0.25

BITS Pilani, Pilani Campus


Random Forest– Disadvantages

BITS Pilani, Pilani Campus


Boosting

BITS Pilani
Boosting

• What if a data point is incorrectly predicted by the first model, and then by the next (and probably by all models)? Will combining the predictions provide better results? Such situations are taken care of by boosting.

• Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model.

• The succeeding models are dependent on the previous model.

BITS Pilani, Pilani Campus


Boosting

Train predictors sequentially, each trying to correct its predecessor.

AdaBoost: an iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records.

• Initially, all N records are assigned equal weights (for being selected for training).

• Unlike bagging, the weights may change at the end of each boosting round.

BITS Pilani, Pilani Campus


Boosting

• Records that are wrongly classified will have their weights increased in the next round

• Records that are classified correctly will have their weights decreased in the next round

Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4

• Example 4 is hard to classify


• Its weight is increased, therefore it is more likely to be
chosen again in subsequent rounds

BITS Pilani, Pilani Campus


Boosting - Approach :

• A subset is created from the original dataset.

• Initially, all data points are given equal weights.

• A base model is created on this subset.

• This model is used to make predictions on the whole dataset.

• Errors are calculated using the actual values and predicted values.
• The observations which are incorrectly predicted, are given higher weights.
(Here, the three misclassified blue-plus points will be given higher weights)

• Another model is created and predictions are made on the dataset. (This model
tries to correct the errors from the previous model)

BITS Pilani, Pilani Campus


Boosting

• Multiple models are


created, each correcting the
errors of the previous
model.
• The final model (strong
learner) is the weighted
mean of all the models
(weak learners).

• Individual models would not


perform well on the entire
dataset, but they work well
for some part of the
dataset. Thus, each model
actually boosts the
performance of the
ensemble.

BITS Pilani, Pilani Campus


AdaBoost

• Adaptive boosting or AdaBoost is one of the


simplest boosting algorithms. Usually, decision
trees are used for modelling. Multiple sequential
models are created, each correcting the errors
from the last model.
• AdaBoost assigns weights to the observations
which are incorrectly predicted and the
subsequent model works to predict these values
correctly.

BITS Pilani, Pilani Campus


AdaBoost Algorithm

• Initially, all observations (n) in the dataset are


given equal weights (1/n).
• A model is built on a subset of data.
• Using this model, predictions are made on the
whole dataset.
• Errors are calculated by comparing the
predictions and actual values.
• While creating the next model, higher weights
are given to the data points which were
predicted incorrectly.

BITS Pilani, Pilani Campus


Adaboost Algorithm

• Weights can be determined using the error


value. For instance, higher the error more is
the weight assigned to the observation.
• This process is repeated until the error
function does not change, or the maximum
limit of the number of estimators is reached.

BITS Pilani, Pilani Campus


AdaBoost

Base classifiers: $C_1, C_2, \dots, C_T$, trained on $N$ input samples with weights $w_j$.

Error rate:
$\epsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \,\delta\!\big(C_i(x_j) \neq y_j\big)$

Importance of a classifier:
$\alpha_i = \frac{1}{2} \ln\!\left(\frac{1 - \epsilon_i}{\epsilon_i}\right)$
https://en.wikipedia.org/wiki/AdaBoost#Choosing_αt

BITS Pilani, Pilani Campus


AdaBoost: Weight Update

Weight update (Eqn. 5.88):
$w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} \exp(-\alpha_j) & \text{if } C_j(x_i) = y_i \\ \exp(\alpha_j) & \text{if } C_j(x_i) \neq y_i \end{cases}$
where $Z_j$ is the normalization factor.

Final classification:
$C^{*}(x) = \arg\max_{y} \sum_{j=1}^{T} \alpha_j \,\delta\!\big(C_j(x) = y\big)$

• Reduce weight if correctly classified else increase


• If any intermediate rounds produce error rate higher than
50%, the weights are reverted back to 1/n and the resampling
procedure is repeated
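A compact sketch of this training loop with decision stumps as base learners (scikit-learn trees, labels assumed in {-1, +1}); it follows the formulas above rather than any particular library's AdaBoost implementation, and simply resets the weights instead of resampling when a round's error reaches 0.5.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    """AdaBoost with decision stumps; y must be in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1.0 / n)                          # start with equal weights 1/n
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.sum(w * (pred != y)) / np.sum(w)    # weighted error rate
        if eps >= 0.5:                               # no better than random: revert weights, skip round
            w = np.full(n, 1.0 / n)
            continue
        alpha = 0.5 * np.log((1 - eps) / eps)        # importance of this classifier
        w = w * np.exp(-alpha * y * pred)            # shrink correct, grow incorrect weights
        w /= w.sum()                                 # normalize (the Z_j factor)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted vote: sign of the alpha-weighted sum of stump predictions."""
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)
```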

BITS Pilani, Pilani Campus


AdaBoost Algorithm

BITS Pilani, Pilani Campus


AdaBoost Algorithm

Note: α in the earlier slides is the same as β, the weight of the classifier.


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

AdaBoost Algorithm

Member classifiers with less error are given more weight in the final ensemble hypothesis.
The final prediction is a weighted combination of each member's prediction.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example

From Léon Bottou

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example

$\epsilon_i = \frac{1}{N}\sum_{j=1}^{N} w_j\,\delta\!\big(C_i(x_j)\neq y_j\big)$, $\quad \alpha_i = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_i}{\epsilon_i}\right)$

From Léon Bottou

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example

$\epsilon_i = \frac{1}{N}\sum_{j=1}^{N} w_j\,\delta\!\big(C_i(x_j)\neq y_j\big)$, $\quad \alpha_i = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_i}{\epsilon_i}\right)$

From Léon Bottou

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example

$\alpha_i = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_i}{\epsilon_i}\right)$

From Léon Bottou
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example

From Léon Bottou

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example

How do we combine the results now?

From Léon Bottou

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example

How do we combine the results now?

From Léon Bottou

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


AdaBoost Example

Training sets for the first 3 boosting rounds:

Boosting Round 1:
x 0.1 0.4 0.5 0.6 0.6 0.7 0.7 0.7 0.8 1
y 1 -1 -1 -1 -1 -1 -1 -1 1 1

Boosting Round 2:
x 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3
y 1 1 1 1 1 1 1 1 1 1

Boosting Round 3:
x 0.2 0.2 0.4 0.4 0.4 0.4 0.5 0.6 0.6 0.7
y 1 1 -1 -1 -1 -1 -1 -1 -1 -1

Summary:

Round Split Point Left Class Right Class alpha


1 0.75 -1 1 1.738
2 0.05 1 1 2.7784
3 0.3 1 -1 4.1195
AdaBoost Example

Weights:
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
2 0.311 0.311 0.311 0.01 0.01 0.01 0.01 0.01 0.01 0.01
3 0.029 0.029 0.029 0.228 0.228 0.228 0.228 0.009 0.009 0.009

Classification:
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 -1 -1 -1 -1 -1 -1 -1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
Sum 5.16 5.16 5.16 -3.08 -3.08 -3.08 -3.08 0.397 0.397 0.397
Predicted Class (sign) 1 1 1 -1 -1 -1 -1 1 1 1

The AdaBoost error function takes into account the fact that only the sign of the final result is used; thus the sum can be far larger than 1 without increasing the error.
AdaBoost base learners

BITS Pilani, Pilani Campus


AdaBoost in practice

BITS Pilani, Pilani Campus


AdaBoost - Advantages

• Fast and Simple to Program

• No parameter tuning is required (except T)

• No assumption is made on weak learners

AdaBoost - Limitations
• Need more data.

• Affected by the presence of noise

• Doesn’t work well in the presence of large number of outliers

BITS Pilani, Pilani Campus


Gradient Boosting

• The idea of gradient boosting originated in the observation by Leo Breiman that boosting can be interpreted as an optimization algorithm on a suitable cost function.

• It optimizes a cost function over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction.

• The predictor can be any machine learning algorithm, such as SVM, logistic regression, KNN, decision tree, etc., but the decision tree version of gradient boosting is the most popular.

• In gradient boosting, "shortcomings" are identified by gradients.

• Recall that, in AdaBoost, "shortcomings" are identified by high-weight data points.

• Both high-weight data points and gradients tell us how to improve our model.

BITS Pilani, Pilani Campus


XGBoost

• XGBoost (Extreme Gradient Boosting) uses the gradient boosting (GBM) framework at its core.
• It is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable.
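A minimal usage sketch, assuming the `xgboost` Python package and its scikit-learn-style wrapper (the hyperparameter values are illustrative, not recommendations from the deck).

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = xgb.XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=4,
                         reg_lambda=1.0,        # L2 regularization on leaf weights
                         subsample=0.8,         # row-based subsampling
                         colsample_bytree=0.8)  # column-based subsampling
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # R^2 on held-out data
```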

BITS Pilani, Pilani Campus


Gradient Boosting - Idea

• You are given (x1, y1),(x2, y2), ...,(xn, yn), and the task is to fit a model F(x) to
minimize square loss

• There are some mistakes:


F(x1) = 0.8, while y1 = 0.9,

F(x2) = 1.4 while y2 = 1.3...

How can you improve this model?

• Rules:
– You are not allowed to remove anything from F or change any parameter in
F.

– You can add an additional model (regression tree) h to F, so the new


prediction will be F(x) + h(x).

BITS Pilani, Pilani Campus


Gradient Boosting

You wish to improve the model such that
– F(x1) + h(x1) = y1
– F(x2) + h(x2) = y2
...
– F(xn) + h(xn) = yn

Or, equivalently:
h(x1) = y1 − F(x1), h(x2) = y2 − F(x2), ..., h(xn) = yn − F(xn)

Fit a regression tree h to the data (x1, y1 − F(x1)), (x2, y2 − F(x2)), ..., (xn, yn − F(xn)).

• Simple solution: the yi − F(xi) are called residuals.
• These are the parts that the existing model F cannot do well.
• The role of h is to compensate for the shortcomings of the existing model F.
• If the new model F + h is still not satisfactory, we can add another regression tree...

BITS Pilani, Pilani Campus


Gradient boosting: Summary

• Gradient boosting involves three elements:


• A loss function to be optimized
• For example, regression may use a squared error and classification may use
logarithmic loss.
• A weak learner to make predictions E.g Decision tree/Decision stump
• An additive model to add weak learners to minimize the loss function.
• Trees are added one at a time, and existing trees in the model are not
changed.
• A gradient descent procedure is used to minimize the loss when adding trees.
• Instead of parameters, we have weak learners
• After calculating the loss, to perform the gradient descent procedure, we must
add a tree to the model that reduces the loss (i.e. follow the gradient).

BITS Pilani, Pilani Campus


Gradient boosting algorithm

let F0 be a "dummy" constant model

for m = 1, . . . , M

    for each pair (xi, yi) in the training set

        compute the pseudo-residual R(yi, Fm−1(xi)) = negative gradient of the loss

    train a regression sub-model hm on the pseudo-residuals

    add hm to the ensemble: Fm(x) = Fm−1(x) + α · hm(x), where α is the learning rate

return the ensemble FM
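A minimal from-scratch sketch of this loop for squared loss, using scikit-learn regression trees as the sub-models; the learning rate α and the tree depth are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=100, alpha=0.1, max_depth=2):
    """Gradient boosting for regression with squared loss: pseudo-residuals are y - F(x)."""
    F0 = np.mean(y)                      # "dummy" constant model
    F = np.full(len(y), F0)
    trees = []
    for _ in range(M):
        residuals = y - F                # negative gradient of 0.5*(y - F)^2 w.r.t. F
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        F = F + alpha * h.predict(X)     # F_m(x) = F_{m-1}(x) + alpha * h_m(x)
        trees.append(h)
    return F0, trees

def gradient_boost_predict(F0, trees, X, alpha=0.1):
    """Evaluate the ensemble: constant model plus the scaled sum of all sub-models."""
    return F0 + alpha * sum(h.predict(X) for h in trees)
```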

BITS Pilani, Pilani Campus


Gradient boosting: Example

F0 is a "dummy" constant model: the average value is predicted.

Height Age Gender Actual Weight Predicted Weight (F0)
5.4 28 M 88 71.2
5.2 26 F 76 71.2
5 28 F 56 71.2
5.6 25 M 73 71.2
6 25 M 77 71.2
4 22 F 57 71.2

Example credit: https://medium.com/nerd-for-tech/gradient-boost-for-regression-explained-6561eec192cb
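The first boosting step on this toy table can be reproduced in a few lines (a sketch of the slide's arithmetic; the M=1/F=0 gender encoding is an assumption, and scikit-learn's tree may split slightly differently from the hand-drawn one).

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({"Height": [5.4, 5.2, 5.0, 5.6, 6.0, 4.0],
                   "Age":    [28, 26, 28, 25, 25, 22],
                   "Gender": [1, 0, 0, 1, 1, 0],           # assumed encoding: M=1, F=0
                   "Weight": [88, 76, 56, 73, 77, 57]})

F0 = df["Weight"].mean()                    # ~71.2, the constant model
residuals = df["Weight"] - F0               # pseudo-residuals that h1 must fit

h1 = DecisionTreeRegressor(max_leaf_nodes=4).fit(df[["Height", "Age", "Gender"]], residuals)
F1 = F0 + 0.1 * h1.predict(df[["Height", "Age", "Gender"]])   # learning rate 0.1
print(F0, F1)
```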


BITS Pilani, Pilani Campus
Iteration 1

Compute the pseudo-residual R(yi, Fm−1(xi)) = negative gradient of the loss.
Residual: h1(x) = y − F0(x)

A tree with a maximum of 4 leaf nodes (a hyperparameter of the decision tree) is built using Height, Age and Gender to predict the residuals (errors).

Example credit: https://medium.com/nerd-for-tech/gradient-boost-for-regression-explained-6561eec192cb


BITS Pilani, Pilani Campus
Iteration 1

Combining the trees to make the new prediction: Fm(x) = Fm−1(x) + α · hm(x)

Learning rate α = 0.1

F1(x) = F0(x) + 0.1 · h1(x)

Example credit: https://medium.com/nerd-for-tech/gradient-boost-for-regression-explained-6561eec192cb


BITS Pilani, Pilani Campus
Iteration 1
Residual: h2(x) = y − F1(x)

Example credit: https://medium.com/nerd-for-tech/gradient-boost-for-regression-explained-6561eec192cb


BITS Pilani, Pilani Campus
Iteration 2

Example credit: https://medium.com/nerd-for-tech/gradient-boost-for-regression-explained-6561eec192cb


BITS Pilani, Pilani Campus
Additional Reading Material

Source Credit : Sebastian Raschka


BITS Pilani
Gradient Boosting -- Conceptual Overview

• Step 1: Construct a base tree (just the root node)

• Step 2: Build next tree based on errors of the


previous tree

• Step 3: Combine tree from step 1 with trees from


step 2. Go back to step 2.

Sebastian Raschka
Gradient Boosting -- Conceptual Overview
--> A Regression-based Example
In million US Dollars

x1# Rooms x2=City x3=Age y=Price


5 Boston 30 1.5
10 Madison 20 0.5
6 Lansing 20 0.25
5 Waunakee 10 0.1

• Step 1: Construct a base tree (just the root node)


$\hat{y}_1 = \frac{1}{n} \sum_{i=1}^{n} y^{(i)} = 0.5875$

Sebastian Raschka
Gradient Boosting -- Conceptual Overview
--> A Regression-based Example

• Step 2: Build next tree based on errors of the


previous tree

First, compute the (pseudo-)residuals: $r_1 = y - \hat{y}_1$


In million US Dollars

x1# x2=City x3=Age y=Price r1=Res


5 Boston 30 1.5 1.5 - 0.5875 = 0.9125
10 Madison 20 0.5 0.5 - 0.5875 = -0.0875
6 Lansing 20 0.25 0.25 - 0.5875 = -0.3375
5 Waunake 10 0.1 0.1 - 0.5875 = -0.4875

Sebastian Raschka
Gradient Boosting -- Conceptual Overview
--> A Regression-based Example

• Step 2: Build next tree based on errors of the


previous tree
Then, create a tree based on x1, ..., xm to fit the residuals:

x1=#Rooms x2=City x3=Age y=Price r1=Residual
5 Boston 30 1.5 1.5 - 0.5875 = 0.9125
10 Madison 20 0.5 0.5 - 0.5875 = -0.0875
6 Lansing 20 0.25 0.25 - 0.5875 = -0.3375
5 Waunakee 10 0.1 0.1 - 0.5875 = -0.4875

Resulting tree: split on Age >= 30 (Yes → leaf 0.9125); otherwise split on #Rooms >= 10 (Yes → leaf -0.0875; No → leaf -0.4125, the average of -0.3375 and -0.4875).
Sebastian Raschka
Gradient Boosting -- Conceptual Overview
--> A Regression-based Example

• Step 3: Combine the tree from step 1 with the trees from step 2

x1=#Rooms x2=City x3=Age y=Price r=Residual
5 Boston 30 1.5 1.5 - 0.5875 = 0.9125
10 Madison 20 0.5 0.5 - 0.5875 = -0.0875
6 Lansing 20 0.25 0.25 - 0.5875 = -0.3375
5 Waunakee 10 0.1 0.1 - 0.5875 = -0.4875

Combined prediction = $\hat{y}_1 = \frac{1}{n}\sum_{i=1}^{n} y^{(i)} = 0.5875$ (tree from step 1) + the residual tree from step 2 (Age >= 30 → 0.9125; else #Rooms >= 10 → -0.0875; else -0.4125).

Sebastian Raschka
Gradient Boosting -- Conceptual Overview
--> A Regression-based Example
• Step 3: Combine the tree from step 1 with the trees from step 2

x1=#Rooms x2=City x3=Age y=Price r=Residual
5 Boston 30 1.5 1.5 - 0.5875 = 0.9125
10 Madison 20 0.5 0.5 - 0.5875 = -0.0875
6 Lansing 20 0.25 0.25 - 0.5875 = -0.3375
5 Waunakee 10 0.1 0.1 - 0.5875 = -0.4875

E.g., to predict Lansing: base prediction $\hat{y}_1 = 0.5875$ plus the residual-tree leaf for Lansing (Age < 30, #Rooms < 10 → −0.4125):
prediction = 0.5875 + α × (−0.4125),
where the learning rate α is between 0 and 1 (if α = 1, low bias but high variance).

Sebastian Raschka
Gradient Boosting -- Algorithm Overview

Step 0: Input data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ and a differentiable loss function $L(y^{(i)}, h(x^{(i)}))$

Step 1: Initialize the model $h_0(x) = \arg\min_{\hat{y}} \sum_{i=1}^{n} L(y^{(i)}, \hat{y})$

Step 2: for t = 1 to T
  A. Compute the pseudo-residuals $r_{i,t} = -\left[\frac{\partial L(y^{(i)}, h(x^{(i)}))}{\partial h(x^{(i)})}\right]_{h(x)=h_{t-1}(x)}$ for i = 1 to n
  B. Fit a tree to the $r_{i,t}$ values, and create terminal nodes $R_{j,t}$ for $j = 1, \dots, J_t$
  ...
Sebastian Raschka
Gradient Boosting -- Algorithm Overview
Step 2: for t = 1 to T
  A. Compute the pseudo-residuals $r_{i,t} = -\left[\frac{\partial L(y^{(i)}, h(x^{(i)}))}{\partial h(x^{(i)})}\right]_{h(x)=h_{t-1}(x)}$ for i = 1 to n
  B. Fit a tree to the $r_{i,t}$ values, and create terminal nodes $R_{j,t}$ for $j = 1, \dots, J_t$
  C. For $j = 1, \dots, J_t$, compute $\hat{y}_{j,t} = \arg\min_{\hat{y}} \sum_{x^{(i)} \in R_{j,t}} L(y^{(i)}, h_{t-1}(x^{(i)}) + \hat{y})$
  D. Update $h_t(x) = h_{t-1}(x) + \alpha \sum_{j=1}^{J_t} \hat{y}_{j,t}\, \mathbb{I}(x \in R_{j,t})$

Step 3: Return $h_T(x)$
Sebastian Raschka
Gradient Boosting -- Algorithm Overview Discussion

Step 0: Input data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ and a differentiable loss function $L(y^{(i)}, h(x^{(i)}))$

E.g., the sum-squared error in regression:
$L = \frac{1}{2}\big(y^{(i)} - h(x^{(i)})\big)^2$

$\frac{\partial}{\partial h(x^{(i)})} \, \frac{1}{2}\big(y^{(i)} - h(x^{(i)})\big)^2 = 2 \times \frac{1}{2}\big(y^{(i)} - h(x^{(i)})\big) \times (0 - 1) = -\big(y^{(i)} - h(x^{(i)})\big)$  [chain rule; the negative residual]

Sebastian Raschka
Gradient Boosting -- Algorithm Overview Discussion

Step 1: Initialize the model $h_0(x) = \arg\min_{\hat{y}} \sum_{i=1}^{n} L(y^{(i)}, \hat{y})$ (prediction vs. target)

This turns out to be the average (in regression): $\frac{1}{n}\sum_{i=1}^{n} y^{(i)}$

Sebastian Raschka
Gradient Boosting -- Algorithm Overview Discussion

Loop to make T trees (e.g., T = 100)

Step 2: for t = 1 to T
  A. Compute the pseudo-residual $r_{i,t} = -\left[\frac{\partial L(y^{(i)}, h(x^{(i)}))}{\partial h(x^{(i)})}\right]_{h(x)=h_{t-1}(x)}$ for i = 1 to n
  ($r_{i,t}$ is the pseudo-residual of the t-th tree and i-th example; the bracketed term is the derivative of the loss function.)
Sebastian Raschka
Gradient Boosting -- Algorithm Overview Discussion

Loop to make T trees (e.g., T = 100)

Step 2: for t = 1 to T
  A. Compute the pseudo-residual $r_{i,t} = -\left[\frac{\partial L(y^{(i)}, h(x^{(i)}))}{\partial h(x^{(i)})}\right]_{h(x)=h_{t-1}(x)}$ for i = 1 to n
  B. Fit a tree to the $r_{i,t}$ values, and create terminal nodes $R_{j,t}$ for $j = 1, \dots, J_t$
  (Use the features in the dataset to fit the tree; in the running example the three terminal nodes $R_{1,t}, R_{2,t}, R_{3,t}$ hold the leaf values -0.4125, -0.0875 and 0.9125.)

Sebastian Raschka
Gradient Boosting -- Algorithm Overview Discussion
Step 2: for t = 1 to T
  A. Compute the pseudo-residual $r_{i,t} = -\left[\frac{\partial L(y^{(i)}, h(x^{(i)}))}{\partial h(x^{(i)})}\right]_{h(x)=h_{t-1}(x)}$ for i = 1 to n
  B. Fit a tree to the $r_{i,t}$ values, and create terminal nodes $R_{j,t}$ for $j = 1, \dots, J_t$
  C. For $j = 1, \dots, J_t$, compute $\hat{y}_{j,t} = \arg\min_{\hat{y}} \sum_{x^{(i)} \in R_{j,t}} L(y^{(i)}, h_{t-1}(x^{(i)}) + \hat{y})$
  (Compute the output value for each leaf node, considering only the examples at that leaf node; this is like step 1, but with the previous prediction added.)
Sebastian Raschka
Gradient Boosting -- Algorithm Overview Discussion
Step 2: for t = 1 to T
  A. Compute the pseudo-residual $r_{i,t} = -\left[\frac{\partial L(y^{(i)}, h(x^{(i)}))}{\partial h(x^{(i)})}\right]_{h(x)=h_{t-1}(x)}$ for i = 1 to n
  B. Fit a tree to the $r_{i,t}$ values, and create terminal nodes $R_{j,t}$ for $j = 1, \dots, J_t$
  C. For $j = 1, \dots, J_t$, compute $\hat{y}_{j,t} = \arg\min_{\hat{y}} \sum_{x^{(i)} \in R_{j,t}} L(y^{(i)}, h_{t-1}(x^{(i)}) + \hat{y})$
  D. Update $h_t(x) = h_{t-1}(x) + \alpha \sum_{j=1}^{J_t} \hat{y}_{j,t}\, \mathbb{I}(x \in R_{j,t})$
  (α is the learning rate, between 0 and 1, usually 0.1; the summation is just in case examples end up in multiple nodes.)
Sebastian Raschka
Gradient Boosting -- Algorithm Overview Discussion

For prediction, combine all T trees, e.g.,

$h_0(x) = \arg\min_{\hat{y}} \sum_{i=1}^{n} L(y^{(i)}, \hat{y})$
$\;+\; \alpha\, \hat{y}_{j,t=1}$, where $\hat{y}_{j,t=1} = \arg\min_{\hat{y}} \sum_{x^{(i)} \in R_{j,1}} L(y^{(i)}, h_{0}(x^{(i)}) + \hat{y})$
...
$\;+\; \alpha\, \hat{y}_{j,T}$, where $\hat{y}_{j,T} = \arg\min_{\hat{y}} \sum_{x^{(i)} \in R_{j,T}} L(y^{(i)}, h_{T-1}(x^{(i)}) + \hat{y})$

Sebastian Raschka
Gradient Boosting -- Algorithm Overview Discussion

For prediction, combine all T trees, e.g.,

$h_0(x) = \arg\min_{\hat{y}} \sum_{i=1}^{n} L(y^{(i)}, \hat{y}) \;+\; \alpha\, \hat{y}_{j,t=1} \;+\; \dots \;+\; \alpha\, \hat{y}_{j,T}$

The idea is that we decrease the pseudo-residuals by a small amount at each step.

Sebastian Raschka
XGBoost
https://arxiv.org/abs/1603.02754
Chen, T., & Guestrin, C. (2016, August). Xgboost: A scalable tree boosting system. In Proceedings of the
22nd acm sigkdd international conference on knowledge discovery and data mining (pp. 785-794). ACM.

Summary and Main Points:


▪ scalable implementation of gradient boosting
▪ Improvements include: regularized loss, sparsity-aware algorithm,
weighted quantile sketch for approximate tree learning, caching of
access patterns, data compression, sharding
▪ Decision trees based on CART
▪ Regularization term for penalizing model (tree) complexity
▪ Uses second order approximation for optimizing the objective
▪ Options for column-based and row-based subsampling
▪ Single-machine version of XGBoost supports the exact greedy
algorithm

Sebastian Raschka
References

• Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, Vipin Kumar
• Bishop, Pattern Recognition and Machine Learning, Springer, 2006
• Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann, 2011 (The Morgan Kaufmann Series in Data Management Systems)
• A Gentle Introduction to Gradient Boosting, Cheng Li ([email protected]), College of Computer and Information Science, Northeastern University
• https://en.wikipedia.org/wiki/Gradient_boosting#:~:text=Gradient%20boosting%20is%20a%20machine,which%20are%20typically%20decision%20trees
• https://medium.com/nerd-for-tech/gradient-boost-for-regression-explained-6561eec192cb

Next Session Plan

• Unsupervised Learning
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
