UMl - Unit 3

The document discusses ensemble learning methods in machine learning, specifically focusing on bagging, boosting, and stacking. It explains how each technique improves model performance by addressing issues of bias and variance, with bagging reducing variance, boosting reducing bias, and stacking enhancing overall accuracy. The document also outlines the processes, advantages, and disadvantages of each method, helping readers understand when to use them effectively.

Ensemble Learning Methods:

Bagging, Boosting and Stacking

Introduction

Machine learning is great! But there’s one thing that makes it even better:

ensemble learning. Ensemble learning helps enhance the performance of

machine learning models. The concept behind it is simple. Multiple machine

learning models are combined to obtain a more accurate model.

Bagging, boosting and stacking are the three most popular ensemble learning

techniques. Each of these techniques offers a unique approach to improving

predictive accuracy. Each technique is used for a different purpose, with the

use of each depending on varying factors. Although each technique is

different, many of us find it hard to distinguish between them. Knowing when

or why we should use each technique is difficult.

In this blog, I’ll explain the difference between bagging, boosting and stacking.

I’ll explain their purposes, their processes, as well as their advantages and

disadvantages. So that by the end of this article, you will understand how

each technique works and which technique to use and when.


By understanding the differences, you’ll be able to choose the best method for

improving your model’s accuracy.

What is Ensemble Learning?


Ensemble learning is a machine learning technique combining multiple

individual models to create a stronger, more accurate predictive model. By

leveraging the diverse strengths of different models, ensemble learning aims

to mitigate errors, enhance performance, and increase the overall robustness

of predictions, leading to improved results across various tasks in machine

learning and data analysis.

How Did Ensemble Learning Come into Existence?


One of the first uses of ensemble methods was the bagging technique. This

technique was developed to overcome instability in decision trees. In fact, an

example of the bagging technique is the random forest algorithm. The random

forest is an ensemble of multiple decision trees. Decision trees tend to be

prone to overfitting. Because of this, a single decision tree can’t be relied on

for making predictions. To improve the prediction accuracy of decision trees,

bagging is employed to form a random forest. The resulting random forest has

a lower variance compared to the individual trees.

The success of bagging led to the development of other ensemble techniques

such as boosting, stacking, and many others. Today, these developments are

an important part of machine learning.


How Does Ensemble Learning Work?
Ensemble learning is a learning method that consists of combining multiple

machine learning models.

A problem in machine learning is that individual models tend to perform

poorly. In other words, they tend to have low prediction accuracy. To mitigate

this problem, we combine multiple models to get one with a better

performance.

The individual models that we combine are known as weak learners. We call

them weak learners because they either have a high bias or high variance.

Because they either have high bias or variance, weak learners cannot learn

efficiently and perform poorly.

High-bias and High-variance Models


● A high-bias model results from not learning the data well enough. Its predictions

fail to capture the underlying distribution of the data, so future predictions will be

systematically off the mark (underfitting).


● A high-variance model results from learning the data too well, including its noise. Its

predictions vary with each data point, so it cannot predict the next point

accurately (overfitting).

Both high bias and high variance models thus cannot generalize properly.

Thus, weak learners will either make incorrect generalizations or fail to

generalize altogether. Because of this, the predictions of weak learners

cannot be relied on by themselves.


As we know from the bias-variance trade-off, an underfit model has high bias

and low variance, whereas an overfit model has high variance and low bias. In

either case, there is no balance between bias and variance. For there to be a

balance, both the bias and variance need to be low. Ensemble learning tries

to balance this bias-variance trade-off by reducing either the bias or the

variance.

It aims to reduce the bias if we have a weak model with high bias and low

variance. Ensemble learning will aim to reduce the variance if we have a weak

model with high variance and low bias. This way, the resulting model will be

much more balanced, with low bias and variance. Thus, the resulting model

will be known as a strong learner. This model will be more generalized than

the weak learners. It will thus be able to make accurate predictions.

Monitoring Ensemble Learning Models


Ensemble learning improves a model’s performance in mainly three ways:
● By reducing the variance of weak learners

● By reducing the bias of weak learners,

● By improving the overall accuracy of strong learners.

Bagging is used to reduce the variance of weak learners. Boosting is used to

reduce the bias of weak learners. Stacking is used to improve the overall

accuracy of strong learners.

Reducing Variance with Bagging


We use bagging for combining weak learners of high variance. Bagging aims

to produce a model with lower variance than the individual weak models.

These weak learners are homogenous, meaning they are of the same type.

Bagging is also known as Bootstrap aggregating. It consists of two steps:

bootstrapping and aggregation.

Bootstrapping
It involves resampling subsets of data with replacement from an initial dataset.

In other words, subsets of data are taken from the initial dataset. These

subsets of data are called bootstrapped datasets or, simply, bootstraps.

Resampled ‘with replacement’ means an individual data point can be sampled

multiple times. Each bootstrap dataset is used to train a weak learner.
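
For illustration, a single bootstrap sample can be drawn with NumPy by sampling indices with replacement (a minimal sketch with toy data):

import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)                     # a toy "initial dataset" of 10 points

# Sample indices with replacement: some points appear several times, others not at all.
bootstrap_idx = rng.choice(len(data), size=len(data), replace=True)
bootstrap_sample = data[bootstrap_idx]
print(bootstrap_sample)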

Aggregating
The individual weak learners are trained independently from each other. Each

learner makes independent predictions. The results of those predictions are

aggregated at the end to get the overall prediction. The predictions are

aggregated using either max voting or averaging.

Max Voting
Max voting is commonly used for classification problems. It consists of taking the

mode of the predictions (the most frequently occurring prediction). It is called voting

because like in election voting, the premise is that ‘the majority rules’. Each

model makes a prediction. A prediction from each model counts as a single

‘vote’. The most occurring ‘vote’ is chosen as the representative for the

combined model.

Averaging
It is generally used for regression problems. It involves taking the average of

the predictions. The resulting average is used as the overall prediction for the

combined model.
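
For illustration, both aggregation schemes take only a line of Python each (a minimal sketch with made-up predictions):

from collections import Counter

import numpy as np

# Max voting (classification): take the most frequent prediction across models.
class_preds = ["cat", "dog", "cat", "cat", "dog"]          # one prediction per model
print(Counter(class_preds).most_common(1)[0][0])           # -> 'cat'

# Averaging (regression): take the mean of the models' predictions.
reg_preds = np.array([2.9, 3.1, 3.4, 2.8, 3.0])
print(reg_preds.mean())                                    # overall prediction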

Steps of Bagging
The steps of bagging are as follows (a short code sketch putting them together appears after the list):

● We have an initial training dataset containing n instances.

● We create m subsets of data from the training set. Each subset is formed by

sampling data points from the initial dataset with replacement, which means

that a specific data point can be sampled more than once.

● For each subset of data, we train the corresponding weak learners

independently. These models are homogeneous, meaning that they are

of the same type.

● Each model makes a prediction.

● The predictions are aggregated into a single prediction. For this, either

max voting or averaging is used.
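
Putting these steps together, here is a minimal from-scratch sketch of bagging with decision trees on a synthetic dataset (illustrative only; scikit-learn's BaggingClassifier, shown at the end of these notes, packages the same idea):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)   # binary labels 0/1

m = 10                                       # number of bootstrap subsets / weak learners
all_preds = []
for _ in range(m):
    # Bootstrap a subset (same size as the data, sampled with replacement).
    idx = rng.choice(len(X), size=len(X), replace=True)
    # Train a homogeneous weak learner on the subset.
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    # Each model makes a prediction.
    all_preds.append(tree.predict(X))

# Aggregate by max voting: the majority class for each sample (labels are 0/1).
majority = (np.mean(all_preds, axis=0) >= 0.5).astype(int)
print("ensemble training accuracy:", (majority == y).mean())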


Reducing Bias by Boosting
We use boosting for combining weak learners with high bias. Boosting aims to

produce a model with a lower bias than that of the individual models. Like in

bagging, the weak learners are homogeneous.

Boosting involves sequentially training weak learners. Here, each subsequent

learner improves the errors of previous learners in the sequence. A sample of

data is first taken from the initial dataset. This sample is used to train the first

model, and the model makes its prediction. The samples can either be

correctly or incorrectly predicted. The samples that are wrongly predicted are

reused for training the next model. In this way, subsequent models can

improve on the errors of previous models.

Unlike bagging, which aggregates prediction results at the end, boosting

aggregates the results at each step. They are aggregated using weighted

averaging.

Weighted averaging involves giving all models different weights depending on

their predictive power. In other words, it gives more weight to the model with

the highest predictive power. This is because the learner with the highest

predictive power is considered the most important.

Steps of Boosting
Boosting works with the following steps (a short code sketch follows the list):

● We sample m-number of subsets from an initial training dataset.

● Using the first subset, we train the first weak learner.

● We test the trained weak learner using the training data. As a result of

the testing, some data points will be incorrectly predicted.

● Each data point with the wrong prediction is sent into the second subset

of data, and this subset is updated.

● Using this updated subset, we train and test the second weak learner.

● We continue with the following subset until the total number of subsets

is reached.

● We now have the total prediction. The overall prediction has already

been aggregated at each step, so there is no need to calculate it.
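
As a concrete, minimal sketch of this sequential scheme, scikit-learn's AdaBoostClassifier (AdaBoost is discussed in detail later in these notes) trains weak learners one after another, reweighting the points the previous learners got wrong; the dataset here is synthetic:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 weak learners trained sequentially; the default weak learner is a depth-1
# decision tree (a "stump"), and misclassified samples are up-weighted each round.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X_train, y_train)
print("test accuracy:", boost.score(X_test, y_test))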


Improving Model Accuracy with Stacking
We use stacking to improve the prediction accuracy of strong learners.

Stacking aims to create a single robust model from multiple heterogeneous

strong learners.

Stacking differs from bagging and boosting in that:

● It combines strong learners

● It combines heterogeneous models

● It consists of creating a metamodel. A metamodel is a model trained on a

new dataset formed from the predictions of the base models.

Individual heterogeneous models are trained using an initial dataset. These

models make predictions and form a single new dataset using those

predictions. This new data set is used to train the metamodel, which makes

the final prediction. The prediction is combined using weighted averaging.

Because stacking combines strong learners, it can combine bagged or

boosted models.

Steps of Stacking
The steps of stacking are as follows (a short code sketch follows the list):

● We use initial training data to train m-number of algorithms.

● Using the output of each algorithm, we create a new training set.

● Using the new training set, we create a meta-model algorithm.

● Using the results of the meta-model, we make the final prediction. The

results are combined using weighted averaging.
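
As a minimal sketch (synthetic data, arbitrary choice of base learners), scikit-learn's StackingClassifier trains heterogeneous base models plus a meta-model on top of their predictions; note that here the meta-model is a logistic regression rather than a fixed weighted average:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Heterogeneous strong learners (these could themselves be bagged or boosted models).
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("knn", KNeighborsClassifier()),
    ("svc", SVC(random_state=0)),
]

# The meta-model is trained on the base learners' predictions.
stack = StackingClassifier(estimators=base_learners, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
print("test accuracy:", stack.score(X_test, y_test))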

When to use Bagging vs Boosting vs Stacking?


If you want to reduce the overfitting or variance of your model, you use

bagging and if you are looking to reduce underfitting or bias, you use

boosting. However, if you want to increase predictive accuracy, use stacking.

Bagging and boosting both work with homogeneous weak learners. Stacking

works using heterogeneous strong learners.

All three of these methods can work with either classification or regression

problems.
One disadvantage of boosting is that it is prone to variance or overfitting. It is

thus not advisable to use boosting for reducing variance. Boosting will do a

worse job in reducing variance as compared to bagging.

Conversely, it is not advisable to use bagging to

reduce bias or underfitting, because bagging largely retains the bias of its base

learners and does not help reduce it.

Stacked models have the advantage of better prediction accuracy than

bagging or boosting. But because they combine bagged or boosted models,

they have the disadvantage of needing much more time and computational

power. If you are looking for faster results, it’s advisable not to use stacking.

However, stacking is the way to go if you’re looking for high accuracy.


Boosting
Boosting is an ensemble modeling technique that attempts to build a strong
classifier from a number of weak classifiers. It is done by building a model
using weak models in series. First, a model is built from the training data. Then
a second model is built which tries to correct the errors present in the first
model. This procedure is continued and models are added until either the
complete training data set is predicted correctly or the maximum number of
models is reached.
Boosting Algorithms
There are several boosting algorithms. The original ones, proposed by Robert
Schapire and Yoav Freund were not adaptive and could not take full advantage
of the weak learners. Schapire and Freund then developed AdaBoost, an
adaptive boosting algorithm that won the prestigious Gödel Prize. AdaBoost
was the first really successful boosting algorithm developed for the purpose of
binary classification. AdaBoost is short for Adaptive Boosting and is a very
popular boosting technique that combines multiple “weak classifiers” into a
single “strong classifier”.
Algorithm:

1. Initialise the dataset and assign equal weight to each of the data points.

2. Provide this as input to the model and identify the wrongly classified

data points.

3. Increase the weights of the wrongly classified data points and decrease

the weights of the correctly classified data points, and then normalize the

weights of all data points.

4. if (got required results)

Goto step 5

else

Goto step 2

5. End
(Figure: an illustration of the intuition behind the boosting algorithm, showing the weak learners and the weighted dataset.)

To read more refer to this article: Boosting and AdaBoost in ML

Similarities Between Bagging and Boosting

Bagging and boosting are both commonly used methods and share the
universal similarity of being classified as ensemble methods. Here we will
explain the similarities between them.
1. Both are ensemble methods that obtain N learners from 1 learner.
2. Both generate several training data sets by random sampling.

3. Both make the final decision by averaging the N learners (or taking the

majority of them, i.e., majority voting).

4. Both are good at reducing variance and provide higher stability.

Boosting vs Bagging

● In boosting, we combine predictions that belong to different types; bagging is a method of combining the same type of prediction.

● The main aim of boosting is to decrease bias, not variance; the main aim of bagging is to decrease variance, not bias.

● In boosting, models are weighted according to their performance at every successive layer; in bagging, all the models have the same weight.

● In boosting, new models are influenced by the accuracy of previous models; in bagging, all the models are independent of each other.
Boosting in Machine Learning | Boosting and
AdaBoost

What is Boosting
Boosting is an ensemble modeling technique that attempts to build a strong
classifier from a number of weak classifiers. It is done by building a model
using weak models in series. First, a model is built from the training data. Then
a second model is built which tries to correct the errors present in the first
model. This procedure is continued and models are added until either the
complete training data set is predicted correctly or the maximum number of
models is reached.

Advantages of Boosting

● Improved Accuracy – Boosting can improve the accuracy of the model

by combining several weak models and averaging their outputs for

regression or voting over them for classification, which increases the

accuracy of the final model.

● Robustness to Overfitting – Boosting can reduce the risk of overfitting

by reweighting the inputs that are classified wrongly.

● Better handling of imbalanced data – Boosting can handle

imbalanced data by focusing more on the data points that are

misclassified.
● Better Interpretability – Boosting can increase the interpretability of

the model by breaking the model decision process into multiple

processes.

Training of Boosting Model

1. Initialise the dataset and assign equal weight to each of the data points.

2. Provide this as input to the model and identify the wrongly classified

data points.

3. Increase the weight of the wrongly classified data points.

4. if (got required results)

Goto step 5

else

Goto step 2

5. End
(Figure: training a boosting model.)

The Explanation for Training the Boosting Model:


The above diagram explains the AdaBoost algorithm in a very simple way. Let’s
try to understand it in a stepwise process:
● B1 consists of 10 data points of two types, namely

plus(+) and minus(-); 5 of them are plus(+) and the other 5 are

minus(-), and each one has been assigned equal weight initially. The

first model tries to classify the data points and generates a vertical

separator line, but it wrongly classifies 3 plus(+) points as minus(-).


● B2 consists of the 10 data points from the previous model in which the

3 wrongly classified plus(+) are weighted more so that the current

model tries more to classify these pluses(+) correctly. This model

generates a vertical separator line that correctly classifies the

previously wrongly classified pluses(+) but in this attempt, it wrongly

classifies three minuses(-).

● B3 consists of the 10 data points from the previous model in which the

3 wrongly classified minus(-) are weighted more so that the current

model tries more to classify these minuses(-) correctly. This model

generates a horizontal separator line that correctly classifies the

previously wrongly classified minuses(-).

● B4 combines together B1, B2, and B3 in order to build a strong

prediction model which is much better than any individual model used.
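
To see this stage-by-stage improvement on data (a hedged sketch on a synthetic dataset, not the toy B1-B4 example above), scikit-learn's AdaBoostClassifier exposes staged_score, which reports the combined model's accuracy after each additional stump:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=5, n_informative=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = AdaBoostClassifier(n_estimators=10, random_state=1).fit(X_train, y_train)

# Accuracy of the combined model after 1, 2, ..., 10 stumps: it generally improves
# as each new stump corrects errors made by the previous ones.
for i, acc in enumerate(model.staged_score(X_test, y_test), start=1):
    print(f"after {i} stump(s): test accuracy = {acc:.3f}")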

Types Of Boosting Algorithms

There are several types of boosting algorithms; some of the most famous and
useful models are described below (a short code sketch follows the list):
1. Gradient Boosting – It is a boosting technique that builds a final model

from the sum of several weak learning algorithms that were trained on

the same dataset. It operates on the idea of stagewise addition. The

first weak learner in the gradient boosting algorithm will not be trained

on the dataset; instead, it will simply return the mean of the relevant

column. The residual for the first weak learner algorithm’s output will
then be calculated and used as the output column or target column for

the next weak learning algorithm that will be trained. The second

weak learner will be trained using the same methodology, and the

residuals will be computed and utilized as an output column once

more for the third weak learner, and so on until we achieve zero

residuals. The dataset for gradient boosting must be in the form of

numerical or categorical data, and the loss function used to generate

the residuals must be differentiable at all times.

2. XGBoost – In addition to the gradient boosting technique, XGBoost is

another boosting machine learning approach. The full name of the

XGBoost algorithm is the eXtreme Gradient Boosting algorithm, which

is an extreme variation of the previous gradient boosting technique.

The key distinction between XGBoost and GradientBoosting is that

XGBoost applies a regularisation approach. It is a regularised version

of the current gradient-boosting technique. Because of this, XGBoost

often outperforms a standard gradient boosting method and is

also faster. Additionally, it works better when the

dataset contains both numerical and categorical variables.

3. Adaboost – AdaBoost is a boosting algorithm that also works on the

principle of the stagewise addition method, where multiple weak

learners are used to obtain a strong learner. Each weak learner is

assigned an alpha parameter (its amount of say) calculated from

that learner's errors; unlike gradient boosting and XGBoost, which

fit each new learner to the residuals, AdaBoost reweights the

training samples. The value of alpha is inversely related to the

error of the weak learner, so a more accurate learner gets a larger say.

4. CatBoost – The growth of decision trees inside CatBoost is the primary

distinction that sets it apart from and improves upon competitors. The

decision trees that are created in CatBoost are symmetric. As there is a

unique sort of approach for handling categorical datasets, CatBoost

works very well on categorical datasets compared to any other

algorithm in the field of machine learning. The categorical features in

CatBoost are encoded based on the output columns. As a result, the

output column’s weight will be taken into account while training or

encoding the categorical features, increasing its accuracy on

categorical datasets.
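
As a brief, hedged illustration of the stagewise-addition idea described above, here is scikit-learn's GradientBoostingClassifier on a synthetic dataset; the third-party xgboost and catboost libraries expose a very similar fit/predict interface:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stagewise addition: each shallow tree is fit to the residual errors of the
# current ensemble, and its contribution is scaled by the learning rate.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)
gb.fit(X_train, y_train)
print("test accuracy:", gb.score(X_test, y_test))

# If installed, the other libraries follow the same pattern, e.g.:
# from xgboost import XGBClassifier;  XGBClassifier().fit(X_train, y_train)
# from catboost import CatBoostClassifier;  CatBoostClassifier(verbose=0).fit(X_train, y_train)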

Disadvantages of Boosting Algorithms

Boosting algorithms also have some disadvantages these are:


● Boosting algorithms are vulnerable to outliers.

● It is difficult to use boosting algorithms for real-time applications.

● They are computationally expensive for large datasets.


AdaBoost Algorithm:
Introduction
Boosting is an ensemble modeling technique that was first presented by

Freund and Schapire in the year 1997. Since then, Boosting has been a

prevalent technique for tackling binary classification problems. These

algorithms improve the prediction power by converting a number of weak

learners to strong learners.

The principle behind boosting algorithms is that we first build a model on the

training dataset and then build a second model to rectify the errors present in

the first model. This procedure is continued until and unless the errors are

minimized and the dataset is predicted correctly. Boosting algorithms work in

a similar way, it combines multiple models (weak learners) to reach the final

output (strong learners).

Learning Objectives
● To understand what the AdaBoost algorithm is and how it works.

● To understand what stumps are.

● To find out how boosting algorithms help increase the accuracy of ML

models.




What Is the AdaBoost Algorithm?


There are many machine learning algorithms to choose from for your problem

statements. One of these algorithms for predictive modeling is called

AdaBoost.

AdaBoost algorithm, short for Adaptive Boosting, is a Boosting technique used

as an Ensemble Method in Machine Learning. It is called Adaptive Boosting

as the weights are re-assigned to each instance, with higher weights assigned

to incorrectly classified instances.


What this algorithm does is build a model and give equal weights to

all the data points. It then assigns higher weights to points that are wrongly

classified. Now all the points with higher weights are given more importance in

the next model. It will keep training models until and unless a lower error is

received.


Let’s take an example to understand this, suppose you built a decision tree

algorithm on the Titanic dataset, and from there, you get an accuracy of 80%.

After this, you apply a different algorithm and check the accuracy, and it

comes out to be 75% for KNN and 70% for Linear Regression.

We see the accuracy differs when we build a different model on the same

dataset. But what if we use combinations of all these algorithms to make the

final prediction? We’ll get more accurate results by taking the average of the

results from these models. We can increase the prediction power in this way.


Here we will be more focused on the mathematical intuition.


There is another ensemble learning algorithm called the gradient boosting

algorithm. In that algorithm, we try to reduce the error directly (by fitting residuals) instead of

reweighting the samples as in AdaBoost. But in this article, we will only be focusing on the mathematical

intuition of AdaBoost.

Understanding the Working of the AdaBoost Algorithm


Let’s understand what and how this algorithm works under the hood with the

following tutorial.

Step 1: Assigning Weights


Consider a small dataset of 5 data points whose

target column is binary, so it is a classification problem. First of all, these data

points will be assigned some weights. Initially, all the weights will be equal.

The formula to calculate the sample weights is:

w_i = 1 / N

where N is the total number of data points.

Here, since we have 5 data points, the sample weight assigned to each will be 1/5 = 0.2.

Step 2: Classify the Samples


We start by seeing how well “Gender” classifies the samples and will see how

the variables (Age, Income) classify the samples.

We’ll create a decision stump for each of the features and then calculate the

Gini Index of each tree. The tree with the lowest Gini Index will be our first

stump.

Here in our dataset, let’s say Gender has the lowest gini index, so it will be our

first stump.

Step 3: Calculate the Influence


We’ll now calculate the “Amount of Say” or “Importance” or “Influence” for this

classifier in classifying the data points, using this formula:

Amount of Say (alpha) = (1/2) * ln((1 - Total Error) / Total Error)

The total error is nothing but the summation of the sample weights of all the

misclassified data points.

Here in our dataset, let's assume there is 1 wrong output, so our total error will

be 1/5, and the alpha (performance of the stump) will be:

alpha = (1/2) * ln((1 - 1/5) / (1/5)) = (1/2) * ln(4) ≈ 0.69

Note: Total error will always be between 0 and 1.

0 indicates a perfect stump, and 1 indicates a horrible stump.

Plotting alpha against the total error, we can see that when there is no misclassification,

then we have no error (Total Error = 0), so the "amount of say (alpha)" will be

a large number.

When the classifier predicts half right and half wrong, then the Total Error =

0.5, and the importance (amount of say) of the classifier will be 0.

If all the samples have been incorrectly classified, then the error will be very

high (close to 1), and hence our alpha value will be a large negative number.

Step 4: Calculate TE and Performance


You must be wondering why it is necessary to calculate the TE and

performance of a stump. The answer is simple: we need to update

the weights, because if the same weights are applied to the next model, then

the output received will be the same as what was received in the first model.

The wrong predictions will be given more weight, whereas the weights of the

correct predictions will be decreased. Now when we build our next model

after updating the weights, more preference will be given to the points with

higher weights.

After finding the importance of the classifier and the total error, we finally need to

update the weights, and for this, we use the following formula:

New sample weight = old sample weight × e^(±alpha)

The sign of alpha is negative when the sample is correctly

classified, which decreases its weight.

The sign of alpha is positive when the sample is misclassified, which increases its weight.

There are four correctly classified samples and 1 wrong one. Here, the sample

weight of each data point is 1/5, and the amount of say/performance of the

stump of Gender is 0.69.

The new weight for each correctly classified sample is:

0.2 × e^(-0.69) ≈ 0.1004

For the wrongly classified sample, the updated weight will be:

0.2 × e^(0.69) ≈ 0.3988

Note:
Observe the sign of alpha when substituting the values: alpha is negative

when the data point is correctly classified, and this decreases the sample

weight from 0.2 to 0.1004. It is positive when there is a misclassification, and

this will increase the sample weight from 0.2 to 0.3988.


We know that the total sum of the sample weights must be equal to 1, but

here if we sum up all the new sample weights, we will get 0.8004. To bring

this sum equal to 1, we will normalize these weights by dividing all the weights

by the total sum of updated weights, which is 0.8004. So, after normalizing the

sample weights, we get this dataset, and now the sum is equal to 1.
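
The arithmetic in Steps 3 and 4 can be checked with a few lines of Python (a small verification sketch using the numbers from this example; tiny differences from the quoted 0.1004 / 0.3988 / 0.8004 are due to rounding alpha to 0.69):

import numpy as np

N = 5
total_error = 1 / N                                    # one of the five points is misclassified
alpha = 0.5 * np.log((1 - total_error) / total_error)  # amount of say of the stump
print(round(alpha, 2))                                 # ~0.69

a = 0.69                                               # rounded alpha, as used in the text
correct = 0.2 * np.exp(-a)                             # weight of each correctly classified point
wrong = 0.2 * np.exp(a)                                # weight of the misclassified point
new_weights = np.array([correct] * 4 + [wrong])
print(new_weights.round(4), new_weights.sum().round(4))   # the weights no longer sum to 1

new_weights /= new_weights.sum()                       # normalize so they sum to 1 again
print(new_weights.round(4), new_weights.sum())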

Step 5: Decrease Errors


Now, we need to make a new dataset to see if the errors decreased or not.

For this, we will remove the “sample weights” and “new sample weights”

columns and then, based on the “new sample weights,” divide our data points

into buckets.
Step 6: New Dataset
We are almost done. Now, what the algorithm does is select random

numbers between 0 and 1. Since incorrectly classified records have higher sample

weights, the probability of selecting those records is very high.

Suppose the 5 random numbers our algorithm takes are

0.38, 0.26, 0.98, 0.40, 0.55.

Now we will see where these random numbers fall in the buckets, and

according to that, we'll make our new dataset.


This comes out to be our new dataset, and we see the data point, which was

wrongly classified, has been selected 3 times because it has a higher weight.
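
This bucket-based selection is equivalent to weighted sampling with replacement; here is a minimal NumPy sketch (the normalized weights below are illustrative values in the spirit of the example):

import numpy as np

rng = np.random.default_rng(7)

# Normalized sample weights: the misclassified point (last entry) carries a much
# larger weight, so it is likely to be drawn several times.
weights = np.array([0.125, 0.125, 0.125, 0.125, 0.5])
indices = rng.choice(5, size=5, replace=True, p=weights)
print(indices)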

Step 7: Repeat Previous Steps


Now this acts as our new dataset, and we need to repeat all the above steps,

i.e.:

● Assign equal weights to all the data points.

● Find the stump that does the best job classifying the new collection of

samples by finding their Gini Index and selecting the one with the lowest

Gini index.

● Calculate the “Amount of Say” and “Total error” to update the previous

sample weights.

● Normalize the new sample weights.

Iterate through these steps until and unless a low training error is achieved.

Suppose, with respect to our dataset, we have constructed 3 decision trees

(DT1, DT2, DT3) in a sequential manner. If we send our test data now, it will

pass through all the decision trees, and finally, we will see which class has the

majority, and based on that, we will make predictions

for our test dataset.

Conclusion
You have finally mastered this algorithm if you understand each and every line

of this article.

We started by introducing what boosting is and what its various

types are, to make sure that you understand the AdaBoost classifier and where

exactly AdaBoost fits. We then applied straightforward math and saw how

every part of the formula worked.

In the next article, I will explain the Gradient Boosting and Extreme Gradient

Boosting (XGBoost) algorithms, which are a few more important boosting techniques to

enhance the prediction power.

If you want to know about the Python implementation (for beginners) of the

AdaBoost machine learning model from scratch, then visit the complete guide

from Analytics Vidhya. This article mentions the difference between bagging

and boosting, as well as the advantages and disadvantages of the AdaBoost

algorithm.

Key Takeaways

● In this article, we understood how boosting works.

● We understood the maths behind AdaBoost.

● We learned how weak learners are used as estimators to increase

accuracy.
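
Finally, here is a short scikit-learn example of bagging with decision trees as the base estimators. The dataset-loading lines are placeholders, so a synthetic dataset is generated below purely so the example runs end to end.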
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load your dataset or create a sample dataset
# df = pd.read_csv("your_dataset.csv")
# X = df.drop("target_column", axis=1)
# y = df["target_column"]
X, y = make_classification(n_samples=500, n_features=10, random_state=42)  # synthetic stand-in

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a base decision tree classifier
base_classifier = DecisionTreeClassifier(random_state=42)

# Create a BaggingClassifier with 5 base classifiers (decision trees)
bagging_classifier = BaggingClassifier(base_classifier, n_estimators=5, random_state=42)

# Train the BaggingClassifier on the training data
bagging_classifier.fit(X_train, y_train)

# Make predictions on the test set
predictions = bagging_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")
