2025 Ensemble Learning

Presenter

B.E., M.Tech., Ph.D.


Senior Associate Professor, Grade 2, School of Computing Science Engineering and Artificial Intelligence
Assistant Director, Centre for Innovation in Teaching and Learning
VIT Bhopal University

"Invitation to Co-Create: Let's Collaborate"

+919945379089
Email(s): [email protected],[email protected]

Google Scholar || SCOPUS || ORCID || LinkedIn || YouTube


Data + Math + Binding Language ⇒ Machine Learning
Introduction to Machine Learning
What is Classification in Machine Learning?

https://paulvanderlaken.com/2020/01/20/animated-machine-learning-classifiers/

What is Ensemble Learning?


Ensemble learning is a machine learning technique where multiple models, often referred to as "weak learners" or "base
models," are combined to improve overall predictive performance. The core idea is that by aggregating the outputs of
multiple models, the ensemble can achieve better generalization and accuracy than any individual model alone.

For Numerical Prediction ⇒ Averaging is the technique

For Categorical Prediction ⇒ Majority Voting is the technique
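As a minimal illustration of these two aggregation rules (a sketch with made-up predictions, not taken from the slides):

```python
import numpy as np
from collections import Counter

# Numerical prediction: average the outputs of several base regressors
regression_preds = [23.1, 24.8, 22.9]            # hypothetical outputs of 3 base models
ensemble_regression = np.mean(regression_preds)  # ~23.6

# Categorical prediction: majority vote over the predicted class labels
classification_preds = ["spam", "ham", "spam"]   # hypothetical outputs of 3 base models
ensemble_class = Counter(classification_preds).most_common(1)[0][0]  # "spam"

print(ensemble_regression, ensemble_class)
```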

Why is Ensemble Learning considered in ML?


● Ensemble methods aim to improve predictive performance by combining several models into
one more reliable model.

● Ensemble methods work best when the base models make different types of errors, allowing
the combined model to compensate for individual weaknesses.

● Predictions from individual models are combined using techniques like averaging, voting, or
stacking to make the final prediction.

● Ensembles are generally more robust in terms of overfitting and noise in the data.
● The most popular ensemble methods are bagging, boosting, and stacking.
● Ensemble methods are ideal for regression and classification, where they reduce bias and
variance to boost the accuracy of models.
Categories of Ensemble Methods
Parallel ensemble techniques – e.g., Random Forest

Sequential ensemble techniques – e.g., Adaptive Boosting (AdaBoost)

Ensembles may use homogeneous base learners (all of the same type) or heterogeneous base learners (of different types).

Bias and Variance


Bias: The error is due to overly simplistic assumptions in the model. High bias can cause the model to miss important
patterns, leading to underfitting.

Variance: The error is due to sensitivity to small fluctuations in the training data. High variance can cause the
model to overfit, capturing noise instead of the underlying pattern.

A model with underfitting: Straight line failing to capture the curve.

A model with overfitting: Complex wavy line that fits every point, including noise.

A model with good / best fit: Smooth curve capturing the main trend.

Total Error = Bias² + Variance + Irreducible Error.
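For squared-error loss, this decomposition can be written more formally as follows (a standard formulation added here for completeness, where f is the true function, f-hat the learned model, and σ² the noise variance):

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```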


Sampling with replacement is called bootstrapping.

i. Random Forest can be adapted to classification or numeric prediction problems.

ii. It classifies data by voting, whereas for numeric prediction it averages the trees' outputs.

Random forest is a supervised learning algorithm that is used for both classification and regression. However, it is mainly used for classification problems. As we know, a forest is made up of trees, and more trees means a more robust forest.

Similarly, the random forest algorithm creates decision trees on data samples, gets a prediction from each of them, and finally selects the best solution by voting. It is an ensemble method that is better than a single decision tree because it reduces over-fitting by averaging the results.

The random forest is a model made up of many decision trees, and it does more than simply average the predictions of individual trees.

This model uses two key concepts:

1. A random sampling of training data points when building trees

2. Random subsets of features considered when splitting nodes

For the training phase, each tree in a random forest learns from a random sample of the data points. The samples are drawn with replacement, known as bootstrapping, which means that some samples will be used multiple times in a single tree.

A subset of all the features is considered for splitting each node in each decision tree. Generally, this is set to sqrt(n_features) for classification. For example, if there are 16 features, only 4 features will be considered for splitting at each node in each tree.

⇒ The random forest itself is a bagging methodology or Ensemble.

⇒ Random Forest builds the structure based on the Gini Impurity.
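A minimal scikit-learn sketch of these ideas (illustrative only; the dataset and parameter values are assumptions, not from the slides):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bootstrap sampling of rows + a random subset of features at every split;
# trees split on Gini impurity by default (criterion="gini").
rf = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_features="sqrt",  # features considered at each split
    bootstrap=True,       # sample the training data with replacement
    random_state=42,
)
rf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```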


Numerical Example: Construct a Decision Tree by using the Gini Index as the criterion

We are going to use this data sample. Here, we have 5 columns, of which 4 columns contain continuous data and the 5th column contains the class labels.

Attributes A, B, C, and D can be considered predictors, and the class labels in column E can be considered the target variable. To construct a decision tree from this data, we have to convert the continuous data into categorical data. We have chosen some threshold values to categorize each attribute.

Gini Index for Var A

Var A has a value >= 5 for 12 records out of 16, and 4 records have a value < 5.
● For Var A >= 5 & class == positive: 5/12
● For Var A >= 5 & class == negative: 7/12
  o gini(5,7) = 1 - ((5/12)^2 + (7/12)^2) = 0.486
● For Var A < 5 & class == positive: 3/4
● For Var A < 5 & class == negative: 1/4
  o gini(3,1) = 1 - ((3/4)^2 + (1/4)^2) = 0.375

Weighting each branch's Gini index by its share of the records and summing:
Gini(A) = (12/16) × 0.486 + (4/16) × 0.375 = 0.458
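The same numbers can be reproduced with a small helper function (a sketch written for this example; the counts come from the table above):

```python
def gini(counts):
    """Gini impurity for a list of class counts, e.g. [5, 7]."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Var A: split at >= 5; counts are [positive, negative] in each branch
left, right = [5, 7], [3, 1]
n = sum(left) + sum(right)            # 16 records in total
weighted = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
print(round(gini(left), 3), round(gini(right), 3), round(weighted, 3))  # 0.486 0.375 0.458
```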

Gini Index for Var B

Var B has a value >= 3 for 12 records out of 16, and 4 records have a value < 3.

● For Var B >= 3 & class == positive: 8/12
● For Var B >= 3 & class == negative: 4/12
  o gini(8,4) = 1 - ((8/12)^2 + (4/12)^2) = 0.444
● For Var B < 3 & class == positive: 0/4
● For Var B < 3 & class == negative: 4/4
  o gini(0,4) = 1 - ((0/4)^2 + (4/4)^2) = 0

Gini(B) = (12/16) × 0.444 + (4/16) × 0 = 0.333

Gini Index for Var C

Var C has a value >= 4.2 for 6 records out of 16, and 10 records have a value < 4.2.

● For Var C >= 4.2 & class == positive: 0/6
● For Var C >= 4.2 & class == negative: 6/6
  o gini(0,6) = 1 - ((0/6)^2 + (6/6)^2) = 0
● For Var C < 4.2 & class == positive: 8/10
● For Var C < 4.2 & class == negative: 2/10
  o gini(8,2) = 1 - ((8/10)^2 + (2/10)^2) = 0.32

Gini(C) = (6/16) × 0 + (10/16) × 0.32 = 0.20

Gini Index for Var D

Var D has a value >= 1.4 for 5 records out of 16, and 11 records have a value < 1.4.

● For Var D >= 1.4 & class == positive: 0/5
● For Var D >= 1.4 & class == negative: 5/5
  o gini(0,5) = 1 - ((0/5)^2 + (5/5)^2) = 0
● For Var D < 1.4 & class == positive: 8/11
● For Var D < 1.4 & class == negative: 3/11
  o gini(8,3) = 1 - ((8/11)^2 + (3/11)^2) = 0.397

Gini(D) = (5/16) × 0 + (11/16) × 0.397 = 0.273
In the case of the Gini Index, we choose the attribute with the minimum value as the root node. Among the 4 attributes, C has the minimum weighted Gini Index value of 0.2.

Hence, we choose C as the root node.


In general, the more trees in the forest, the more robust the forest is. In the same way, in a random forest classifier, a higher number of trees generally gives more accurate results.

Model Validation Metrics


Confusion Matrix - A Meme Card
❖ Sensitivity is also called recall.

F1 Score: In most real-life classification problems an imbalanced class distribution exists, so the F1-score is a better metric to evaluate the model on.

Accuracy can be used when the class distribution is balanced, while the F1-score is a better metric when the classes are imbalanced, as in the case above.

Online Calculator for the Confusion Matrix & Other Metrics


http://onlineconfusionmatrix.com/
Main Types of Ensemble Methods

1. Bagging (Bootstrap Aggregating):


How it works?

○ Take random subsets of the data (with replacement) to create different datasets (called bootstraps).

○ Train the same type of model (e.g., decision trees) on each dataset.

○ Combine the results by averaging (for regression) or voting (for classification).

● Example: Random Forest

○ Random Forest is an extension of bagging that uses decision trees. It also introduces a random selection of features at each split to increase diversity (see the sketch below).
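A minimal bagging sketch with scikit-learn (the dataset and parameters are illustrative assumptions; the `estimator` keyword requires scikit-learn ≥ 1.2, older versions call it `base_estimator`):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: train the same base model on bootstrap samples and combine by voting
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base learner
    n_estimators=50,                     # number of bootstrap samples / models
    bootstrap=True,                      # sample the training data with replacement
    random_state=42,
)
print("CV accuracy:", cross_val_score(bagging, X, y, cv=5).mean())
```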

2. Boosting
How it works?

○ Train a model on the data.

○ Analyze where the model made errors.

○ Train a new model that focuses on correcting those errors.

○ Repeat this process multiple times, combining models in a weighted manner.

● The key difference from bagging is that models are trained sequentially, and each new model learns from the mistakes of the previous ones.

● Example: Gradient Boosting, AdaBoost

○ AdaBoost: Assigns higher weights to incorrectly predicted samples.

○ Gradient Boosting: Optimizes a loss function using gradient descent.
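A hedged sketch of both boosting variants in scikit-learn (dataset and hyperparameters are assumptions, not from the slides):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# AdaBoost: re-weights misclassified samples after each round
ada = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=42)

# Gradient Boosting: each new tree fits the gradient of the loss on the previous ensemble's errors
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

for name, model in [("AdaBoost", ada), ("GradientBoosting", gbm)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```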


3. Stacking (Stacked Generalization)
How it works?

1. Train multiple different models (e.g., decision tree, logistic regression, SVM) on the same dataset.

2. Use the predictions from these models as input features for a meta-model (e.g., logistic regression).

3. The meta-model learns to make the final predictions by combining the outputs of the base models.

● Key difference: Stacking uses multiple different types of models and combines them with a meta-model, unlike bagging and boosting, which typically use the same type of model.
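A minimal stacking sketch with scikit-learn, mirroring the three steps above (the base models and meta-model here are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Heterogeneous base models; their cross-validated predictions feed a logistic-regression meta-model
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=42)),
        ("svm", SVC(probability=True, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
print("CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```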

Which models should be ensembled?

Let us consider models A, B, and C with accuracies of 87%, 82%, and 72% respectively. Suppose A and B are highly correlated, while C is not correlated with either A or B. In this scenario, instead of combining models A and B, model C should be combined with model A or model B to reduce the generalization error.
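One quick way to check how correlated two models' predictions are before ensembling them (a sketch; the prediction vectors are hypothetical):

```python
import numpy as np

# Hypothetical 0/1 predictions of two candidate models on the same validation set
preds_a = np.array([1, 0, 1, 1, 0, 1, 0, 1])
preds_b = np.array([1, 0, 1, 0, 0, 1, 0, 1])

# Pearson correlation of the prediction vectors; a low correlation suggests the
# models make different errors and are better ensembling partners
corr = np.corrcoef(preds_a, preds_b)[0, 1]
print("Prediction correlation:", round(corr, 3))
```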
6 ways to increase the quality of the model
Sample Calculation
Let us consider n = 7 (number of samples).

The initial weight attached to each observation is 1/7.

A stump is a 1-level decision tree. The main idea is that at each step we want to find the best stump, i.e., the best data split.

For each feature, we build a stump and decide which one becomes the base learner. [This is done based on Gini index, entropy, or information gain.]

The Total Error (TE) of the chosen stump is the sum of the weights of the records it misclassifies. Here, with one misclassified record, TE = 1/7.

Performance of stump = ½ ln[(1 − TE)/TE]. [The difference between log and ln is that log usually denotes base 10, while ln denotes base e, the natural log.]

= ½ ln[(1 − 1/7)/(1/7)]

= ½ ln[6]

≈ 0.896
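The same number can be checked in a couple of lines (sketch):

```python
import math

TE = 1 / 7                                   # total error of the chosen stump
performance = 0.5 * math.log((1 - TE) / TE)  # 0.5 * ln(6)
print(round(performance, 3))                 # ≈ 0.896
```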

Now we can update the weights: for every wrongly classified record we need to increase the weight, and for every correctly classified record we need to reduce the weight.

Below is the way to calculate the new weight for the correctly classified records:

New Stage Weight = Previous Stage Weight × exp(−performance)

= 1/7 × exp(−0.896)

≈ 0.058

Below is the way to calculate the new weight for the incorrectly classified records:

New Stage Weight = Previous Stage Weight × exp(+performance)

= 1/7 × exp(0.896)

≈ 0.350, so the model will pay more attention to these records in the next round.

The sum of the initial weights is 1, but after the update the weights no longer sum to 1. To normalize the weights, we need to divide each record's weight by the sum of all the record weights.
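Putting the whole weight-update step together (a sketch that reproduces the numbers above; which record is misclassified is an assumption made for illustration):

```python
import numpy as np

n = 7
weights = np.full(n, 1 / n)                  # initial weights, all 1/7
misclassified = np.zeros(n, dtype=bool)
misclassified[3] = True                      # assume a single misclassified record

TE = weights[misclassified].sum()            # 1/7
performance = 0.5 * np.log((1 - TE) / TE)    # ≈ 0.896

# Increase weights of misclassified records, decrease those of correct ones
weights = np.where(misclassified,
                   weights * np.exp(performance),    # ≈ 0.350
                   weights * np.exp(-performance))   # ≈ 0.058

weights /= weights.sum()                     # normalize so the weights sum to 1 again
print(np.round(weights, 3))
```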

Finally, each boosting technique differs from the others based on its baseline classifier, split criterion, working principle, and weighting scheme.

AdaBoost is a boosting algorithm that increases accuracy by giving more weight to the targets that the model misclassifies.

The Gradient Boosting algorithm increases accuracy by minimizing the loss function (the error, i.e., the difference between the actual and predicted values) and using it as the target for the next decision tree that is built.

Scikit-learn's Gradient Boosting uses decision trees as the base learner by default.


Stacking (Regression & Classification)
Hands-on Notebook
https://drive.google.com/file/d/1MC17y0t--JwCbnJYF7D-q3MnsFclH9NV/view?usp=sharing
