UML - Unit 3
Introduction
Machine learning is great! But there’s one thing that makes it even better: ensemble learning.
Bagging, boosting and stacking are the three most popular ensemble learning techniques, and each of them can improve predictive accuracy. Each technique is used for a different purpose, with the use of each depending on varying factors.
In this blog, I’ll explain the difference between bagging, boosting and stacking. I’ll explain their purposes, their processes, as well as their advantages and disadvantages, so that by the end of this article, you will understand how ensemble learning works and when to use each technique.
A popular example of the bagging technique is the random forest algorithm. A random forest is formed from multiple decision trees: bagging is employed to form a random forest. The resulting random forest has a lower variance than any single decision tree. The success of bagging led to other ensemble techniques such as boosting, stacking, and many others. Today, these developments are an important part of machine learning.
Individual models on their own often perform poorly. In other words, they tend to have low prediction accuracy. To mitigate this problem, we combine multiple models into one with better performance.
The individual models that we combine are known as weak learners. We call
them weak learners because they either have a high bias or high variance.
Because they either have high bias or variance, weak learners cannot learn
efficiently and perform poorly.
● A high bias model results from not learning the data well enough. It misses
important relations between the features and the target.
● A high variance model results from learning the data too well. It varies
with each data point. Hence it is impossible to predict the next point
accurately.
Both high bias and high variance models thus cannot generalize properly.
As we know from the bias-variance trade-off, an underfit model has high bias
and low variance, whereas an overfit model has high variance and low bias. In
either case, there is no balance between bias and variance. For there to be a
balance, both the bias and variance need to be low. Ensemble learning tries
to achieve this balance by reducing either the bias or the variance.
It aims to reduce the bias if we have a weak model with high bias and low
variance. Ensemble learning will aim to reduce the variance if we have a weak
model with high variance and low bias. This way, the resulting model will be
much more balanced, with low bias and variance. Thus, the resulting model
will be known as a strong learner. This model will be more generalized than
the individual weak learners. Bagging is used to reduce the variance of weak
learners, boosting is used to reduce the bias of weak learners, and stacking is
used to improve the overall accuracy of the final predictions.
Bagging
We use bagging to combine weak learners of high variance. Bagging aims
to produce a model with lower variance than the individual weak models.
These weak learners are homogeneous, meaning they are of the same type.
Bagging is also known as bootstrap aggregating; it consists of two steps,
bootstrapping and aggregating.
Bootstrapping
Involves resampling subsets of data with replacement from an initial dataset.
In other words, subsets of data are taken from the initial dataset. These
subsets are called bootstrapped datasets, or simply bootstraps. Because the
sampling is done with replacement, a given data point can appear in several
subsets.
Aggregating
The individual weak learners are trained independently from each other. Each
learner makes its own predictions, and these predictions are
aggregated at the end to get the overall prediction. The predictions are
aggregated using either max voting or averaging.
Max Voting
It is commonly used for classification problems and consists of taking the
majority of the predictions. We call it max voting because, like in election
voting, the premise is that ‘the majority rules’. Each model makes a
prediction. A prediction from each model counts as a single ‘vote’. The most
occurring ‘vote’ is chosen as the representative for the combined model.
Averaging
It is generally used for regression problems. It involves taking the average of
the predictions. The resulting average is used as the overall prediction for the
combined model.
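As a quick illustration of these two aggregation rules, here is a minimal sketch in Python; the prediction arrays are made-up values, not outputs of real models.

import numpy as np

# Hypothetical class predictions from three weak learners on four samples
class_preds = np.array([[1, 0, 1, 1],   # learner 1
                        [1, 1, 1, 0],   # learner 2
                        [0, 0, 1, 1]])  # learner 3
# Hypothetical regression predictions from three weak learners on three samples
reg_preds = np.array([[2.1, 3.4, 0.9],
                      [1.9, 3.1, 1.2],
                      [2.4, 3.3, 1.0]])

# Max voting: the most frequent label per sample becomes the ensemble prediction
majority_vote = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, class_preds)
print(majority_vote)           # [1 0 1 1]

# Averaging: the mean prediction per sample becomes the ensemble prediction
print(reg_preds.mean(axis=0))  # approximately [2.13 3.27 1.03]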
Steps of Bagging
The steps of bagging are as follows:
● We have an initial dataset containing N data points. We take a subset of N
sample points from the initial dataset for each subset. Each subset is taken
with replacement. This means that a specific data point can be sampled more
than once.
● For each subset, a weak learner of the same type is trained independently,
and each of these models makes its own prediction.
● The predictions are aggregated into a single prediction. For this, either
max voting or averaging is used, as in the sketch below.
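A minimal from-scratch sketch of these steps, assuming decision trees as the homogeneous weak learners and a synthetic stand-in dataset; the sizes and number of models are arbitrary choices for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)  # stand-in data
n_models, n_samples = 25, X.shape[0]

learners = []
for _ in range(n_models):
    # Bootstrapping: sample N points with replacement from the initial dataset
    idx = rng.integers(0, n_samples, size=n_samples)
    learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # homogeneous weak learners

# Aggregating by max voting: each tree votes, the majority class wins (binary labels here)
votes = np.array([tree.predict(X) for tree in learners])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("training accuracy:", (ensemble_pred == y).mean())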
Boosting
We use boosting to combine weak learners with high bias. Boosting aims to
produce a model with a lower bias than that of the individual models. Like in
bagging, the weak learners are homogeneous.
Boosting trains the weak learners sequentially. A sample of
data is first taken from the initial dataset. This sample is used to train the first
model, and the model makes its prediction. The samples can either be
correctly or incorrectly predicted. The samples that are wrongly predicted are
reused for training the next model. In this way, subsequent models can
improve on the errors of previous models.
Unlike in bagging, boosting also aggregates the results at each step. They are
aggregated using weighted averaging.
Weighted averaging gives the models different weights depending on
their predictive power. In other words, it gives more weight to the model with
the highest predictive power. This is because the learner with the highest
predictive power is considered the most important.
Steps of Boosting
Boosting works with the following steps:
● We sample a number of subsets from the initial training dataset and train
the first weak learner on the first subset.
● We test the trained weak learner using the training data. As a result of
the testing, some data points will be incorrectly predicted.
● Each data point with the wrong prediction is sent into the second subset
of data, and this subset is updated.
● Using this updated subset, we train and test the second weak learner.
● We continue with the following subset until the total number of subsets
is reached.
● We now have the total prediction. The overall prediction has already
been aggregated at each step using weighted averaging, so no extra
combination step is needed. A short sketch of boosting in code follows this list.
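As a concrete sketch of these steps, scikit-learn's AdaBoostClassifier (one standard boosting implementation) trains shallow trees one after another and reweights the wrongly predicted points so later learners focus on them; the dataset and parameter values here are only illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=1)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Each new weak learner focuses on the samples the previous ones got wrong;
# the final prediction is a weighted combination of all the weak learners.
booster = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=1)
booster.fit(X_train, y_train)
print("test accuracy:", booster.score(X_test, y_test))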
Stacking
Stacking is used to improve the prediction accuracy of strong learners. Unlike
bagging and boosting, stacking typically combines heterogeneous models,
i.e., models of different types. The base
models make predictions and form a single new dataset using those
predictions. This new dataset is used to train the metamodel, which makes
the final prediction. The base models can themselves be strong learners, such
as bagged or boosted models.
Steps of Stacking
The steps of Stacking are as follows:
● We use the initial training data to train several base models.
● Using the output of each base model, we create a new dataset.
● Using this new dataset, we train the meta-model.
● Using the results of the meta-model, we make the final prediction, as in
the sketch below.
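A sketch of these stacking steps with scikit-learn's StackingClassifier, assuming two heterogeneous base models and a logistic-regression meta-model; all choices here are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=2)  # stand-in data

# The base models' cross-validated predictions become the new dataset
# on which the meta-model (final_estimator) is trained.
base_models = [("tree", DecisionTreeClassifier(random_state=2)),
               ("knn", KNeighborsClassifier())]
stack = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression())
stack.fit(X, y)
print("training accuracy:", stack.score(X, y))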
If you want to reduce the overfitting or variance of your model, you use
bagging, and if you are looking to reduce underfitting or bias, you use
boosting.
Bagging and boosting both work with homogeneous weak learners; stacking
works with heterogeneous models.
All three of these methods can work with either classification or regression
problems.
One disadvantage of boosting is that it is prone to variance or overfitting. It is
thus not advisable to use boosting for reducing variance; boosting will do a
much poorer job of reducing variance than bagging.
On the other hand, the converse is true. It is not advisable to use bagging to
reduce bias or underfitting, because bagging is itself more prone to bias and
does little to reduce it. Finally, since all of these techniques train many models
rather than one, they have the disadvantage of needing much more time and
computational power. If you are looking for faster results, it’s advisable not to
use stacking.
Boosting
Boosting is an ensemble modeling technique that attempts to build a strong
classifier from a number of weak classifiers. It is done by building a model by
using weak models in series. Firstly, a model is built from the training data. Then
the second model is built which tries to correct the errors present in the first
model. This procedure is continued and models are added until either the
complete training data set is predicted correctly or the maximum number of
models is added.
Boosting Algorithms
There are several boosting algorithms. The original ones, proposed by Robert
Schapire and Yoav Freund were not adaptive and could not take full advantage
of the weak learners. Schapire and Freund then developed AdaBoost, an
adaptive boosting algorithm that won the prestigious Gödel Prize. AdaBoost
was the first really successful boosting algorithm developed for the purpose of
binary classification. AdaBoost is short for Adaptive Boosting and is a very
popular boosting technique that combines multiple “weak classifiers” into a
single “strong classifier”.
Algorithm:
1. Initialise the dataset and assign equal weight to each of the data points.
2. Provide this as input to the model and identify the wrongly classified
data points.
3. Increase the weights of the wrongly classified data points, decrease the
weights of correctly classified data points, and then normalize the weights
of all the data points.
4. If the required results have been obtained, go to step 5; otherwise, go
back to step 2.
5. End
(Figure: an illustration presenting the intuition behind the boosting algorithm,
consisting of the parallel learners and the weighted dataset.)
Bagging and boosting are both commonly used methods and share the
universal similarity of being classified as ensemble methods. Here we will
explain the similarities between them.
1. Both are ensemble methods to get N learners from 1 learner.
2. Both generate several training data sets by random sampling.
3. Both make the final decision by averaging the N learners (or taking the
majority of them, i.e., majority voting).
Boosting vs Bagging
● In boosting, we combine predictions that belong to different types, whereas
bagging is a method of combining predictions of the same type.
● In boosting, new models are influenced by the accuracy of previous models,
whereas in bagging all the models are independent of each other.
Boosting and AdaBoost
Advantages of Boosting
● Better handling of imbalanced data – Boosting focuses more on the data
points that are misclassified, which helps it handle imbalanced data.
● Better Interpretability – Boosting can increase the interpretability of
the model by breaking its decision process into multiple simpler
processes.
Training a boosting model
● B1 consists of 10 data points of two types, plus(+) and minus(-): 5 of
which are plus(+) and the other 5 are minus(-), and each one has been
assigned equal weight initially. The first model tries to classify the data
points and generates a vertical separator line, but it wrongly classifies
some of the plus(+) points.
● B2 and B3 each consist of the 10 data points from the previous model, in
which the wrongly classified points are given higher weights, so that the
current model tries harder to classify those points correctly.
● The final model combines the weak learners from the previous boxes to
build a strong prediction model which is much better than any individual
model used.
There are several types of boosting algorithms. Some of the most famous and
useful models are:
1. Gradient Boosting – It is a boosting technique that builds a final model
from the sum of several weak learning algorithms that were trained on
the same dataset, added together stage by stage. The
first weak learner in the gradient boosting algorithm will not be trained
on the dataset; instead, it will simply return the mean of the relevant
column. The residual for the first weak learner’s output will
then be calculated and used as the output column or target column for
the next weak learning algorithm that will be trained. The second
weak learner will be trained using the same methodology, and the
residuals will be calculated and used as the target column once
more for the third weak learner, and so on until the residuals are
(close to) zero or a fixed number of learners has been added. A small
sketch of this procedure follows this list.
2. AdaBoost – In AdaBoost, a sequence of weak
learners is used for getting strong learners. The value of the alpha
parameter, the weight given to each weak learner, depends on that
learner’s error: more accurate learners get a larger say in the final
prediction.
3. XGBoost – Extreme Gradient Boosting builds on gradient boosting and
adds optimizations such as regularization and parallelized tree
construction, a
distinction that sets it apart from and improves upon competitors. Like
gradient boosting, it expects numerical inputs, so
categorical datasets need to be encoded before training.
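A minimal regression sketch of the residual-fitting idea described in item 1 above: start from the mean of the target column, repeatedly fit a small tree to the residuals, and add the scaled predictions together. The toy data, tree depth, and learning rate are illustrative choices, not part of the original text.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)  # toy regression target

prediction = np.full_like(y, y.mean())  # the "first weak learner": just the mean
learning_rate = 0.1

for _ in range(100):
    residuals = y - prediction                      # what is still unexplained
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # stagewise addition

print("training MSE:", np.mean((y - prediction) ** 2))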
Boosting was introduced by Freund and Schapire in the year 1997. Since then, boosting has been a
prevalent technique for tackling binary classification problems.
The principle behind boosting algorithms is that we first build a model on the
training dataset and then build a second model to rectify the errors present in
the first model. This procedure is continued until and unless the errors are
minimized and the dataset is predicted correctly. AdaBoost works in
a similar way: it combines multiple models (weak learners) to reach the final
output (a strong learner).
Learning Objectives
● To understand what the AdaBoost algorithm is and how it works.
● To find out how boosting algorithms help increase the accuracy of ML
models.
● To understand the step-by-step working and maths behind
AdaBoost.
AdaBoost is called Adaptive Boosting because the weights are re-assigned to
each instance, with higher weights assigned to incorrectly classified instances.
The algorithm first builds a model and gives equal weights to
all the data points. It then assigns higher weights to points that are wrongly
classified. Now all the points with higher weights are given more importance in
the next model. It will keep training models until and unless a lower error is
received.
Let’s take an example to understand this, suppose you built a decision tree
algorithm on the Titanic dataset, and from there, you get an accuracy of 80%.
After this, you apply different algorithms and check the accuracy, and it
comes out to be 75% for KNN and 70% for Linear Regression.
We see the accuracy differs when we build a different model on the same
dataset. But what if we use combinations of all these algorithms to make the
final prediction? We’ll get more accurate results by taking the average of the
results from these models. We can increase the prediction power in this way.
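One common way to do this in practice is scikit-learn's VotingClassifier, which averages the models' predicted class probabilities ('soft' voting). The snippet below uses synthetic stand-in data rather than the Titanic dataset, and logistic regression as the linear model, so its accuracy will differ from the 80/75/70% figures above.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=12, random_state=4)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

# Soft voting averages the predicted class probabilities of the three models
combo = VotingClassifier(
    estimators=[("dt", DecisionTreeClassifier(random_state=4)),
                ("knn", KNeighborsClassifier()),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="soft")
combo.fit(X_train, y_train)
print("combined test accuracy:", combo.score(X_test, y_test))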
This is the idea behind ensemble learning, and AdaBoost is one such technique,
which we will focus on in this article. Let’s now build the
intuition of AdaBoost and walk through how it works step by step in the
following tutorial.
Step 1: All the data points will be assigned some weights. Initially, all the
weights will be equal and are calculated as w = 1/N, where N is the total
number of data points.
Here, since we have 5 data points, the sample weights assigned will be 1/5.
Step 2: We’ll create a decision stump for each of the features and then calculate the
Gini Index of each tree. The tree with the lowest Gini Index will be our first
stump.
Here in our dataset, let’s say Gender has the lowest Gini Index, so it will be our
first stump.
Step 3: The total error is nothing but the summation of all the sample weights of
the misclassified data points. The “amount of say (alpha)” of a stump is then
calculated from the total error as alpha = 1/2 × ln((1 − Total Error) / Total Error).
When there is no misclassification, we have no error (Total Error = 0), so the
“amount of say (alpha)” will be a large number.
When the classifier predicts half right and half wrong, then the Total Error =
0.5, and the amount of say of the classifier will be 0.
If all the samples have been incorrectly classified, then the error will be very
high (approximately 1), and hence our alpha value will be a negative number.
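This behaviour follows directly from the amount-of-say formula above; a quick check for a few total-error values (the epsilon clipping is just to avoid division by zero at the extremes):

import math

def amount_of_say(total_error, eps=1e-10):
    # alpha = 1/2 * ln((1 - TE) / TE)
    total_error = min(max(total_error, eps), 1 - eps)
    return 0.5 * math.log((1 - total_error) / total_error)

for te in [0.0, 0.2, 0.5, 0.8, 1.0]:
    print(f"total error {te:.1f} -> amount of say {amount_of_say(te):+.2f}")
# 0.0 gives a large positive alpha, 0.5 gives 0, and 1.0 gives a large negative alpha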
Step 4: We need to update the weights because, if the same weights are applied
to the next model, the output received will be the same as what was received
in the first model. The wrong predictions will be given more weight, whereas the
correct predictions’ weights will be decreased. Now when we build our next model
after updating the weights, more preference will be given to the points with
higher weights.
After finding the amount of say of the classifier and the total error, we need to
finally update the weights, and for this, we use the following formula:
New sample weight = old sample weight × e^(± amount of say (alpha))
The amount of say (alpha) is taken with a negative sign when the sample is
correctly classified, and with a positive sign when the sample is misclassified.
There are four correctly classified samples and 1 wrong. Here, the sample
weight of each data point is 1/5, and the amount of say of the Gender stump
works out to 1/2 × ln((1 − 1/5)/(1/5)) ≈ 0.69.
Note: observe the sign of alpha when putting in the values. The alpha is
negative when the data point is correctly classified, which decreases the
sample weight (here from 0.2 to roughly 0.1), and it is positive when the data
point is misclassified, which increases the sample weight (here from 0.2 to
roughly 0.4). If we sum up all the new sample weights, we will get 0.8004. To bring
this sum equal to 1, we will normalize these weights by dividing all the weights
by the total sum of updated weights, which is 0.8004. So, after normalizing the
sample weights, we get this dataset, and now the sum is equal to 1.
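A small sketch of this update-and-normalize step for the five-point example (one misclassified point, alpha ≈ 0.69); the exact decimals depend on how alpha is rounded, which is why the text's sum of 0.8004 differs slightly from the 0.8 computed here.

import numpy as np

weights = np.full(5, 1 / 5)                           # initial sample weights
misclassified = np.array([False, False, False, False, True])
alpha = 0.5 * np.log((1 - 1 / 5) / (1 / 5))           # amount of say, about 0.69

# Misclassified points: weight * e^(+alpha); correct points: weight * e^(-alpha)
new_weights = weights * np.exp(np.where(misclassified, alpha, -alpha))
print(new_weights, new_weights.sum())                 # the sum is no longer 1

normalized = new_weights / new_weights.sum()          # divide by the total so the sum is 1
print(normalized, normalized.sum())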
Step 5: Now we need to form a new dataset to see whether the errors have
decreased. For this, we will remove the “sample weights” and “new sample weights”
columns and then, based on the “new sample weights,” divide our data points
into buckets (ranges of cumulative weight between 0 and 1).
Step 6: New Dataset
We are almost done. Now, what the algorithm does is select random
numbers between 0 and 1. Since incorrectly classified records have higher sample
weights, the probability of selecting those records is very high. Suppose the
five random numbers our algorithm draws are
0.38, 0.26, 0.98, 0.40, and 0.55.
Now we will see where these random numbers fall in the buckets, and based
on that we form our new dataset. The data point that was
wrongly classified has been selected 3 times because it has a higher weight.
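The buckets amount to sampling data points with probability proportional to their normalized weights; a sketch with numpy (the weight values below come from the normalized example above, and the draw is random, so the exact counts vary):

import numpy as np

rng = np.random.default_rng(5)
normalized_weights = np.array([0.125, 0.125, 0.125, 0.125, 0.5])  # from the example above

# Draw 5 indices with replacement; the heavily weighted (misclassified) point
# is likely to be picked several times, so the next stump focuses on it.
new_indices = rng.choice(5, size=5, p=normalized_weights)
print(new_indices)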
This now acts as our new dataset, and we need to repeat all of the above
steps, i.e.:
● Assign equal weights to all the data points.
● Find the stump that does the best job classifying the new collection of
samples by finding their Gini Index and selecting the one with the lowest
Gini index.
● Calculate the “Amount of Say” and “Total error” to update the previous
sample weights.
Iterate through these steps until and unless a low training error is achieved.
Suppose that, with respect to our dataset, we have constructed three stumps
(DT1, DT2, DT3) in this sequential manner. If we send our test data now, it will
pass through all the decision trees, and finally, we will see which class has the
majority; based on that, we will make the final prediction for each test point.
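The text above describes a plain majority vote over DT1, DT2 and DT3; in AdaBoost proper, each stump's vote is weighted by its amount of say. A minimal sketch of that weighted vote, with made-up stump outputs (labels encoded as -1/+1) and made-up alphas:

import numpy as np

# Hypothetical outputs of three trained stumps (DT1, DT2, DT3) on four test samples
stump_preds = np.array([[+1, -1, +1, +1],
                        [+1, +1, -1, +1],
                        [-1, -1, +1, +1]])
alphas = np.array([0.69, 0.40, 0.25])   # each stump's amount of say

# Weighted vote: each stump contributes its prediction scaled by its amount of say,
# and the sign of the weighted sum is the final class.
final_pred = np.sign(alphas @ stump_preds)
print(final_pred)   # [ 1. -1.  1.  1.]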
Conclusion
You have finally mastered this algorithm if you understand each and every line
of this article.
We started by introducing you to what boosting is and what its various
types are, to make sure that you understand the AdaBoost classifier and where
AdaBoost falls exactly. We then applied straightforward math and saw how
each part of the algorithm works under the hood.
In the next article, I will explain Gradient Descent and Xtreme Gradient
Boosting (XGBoost).
If you want to know about the Python implementation of the
AdaBoost machine learning model from scratch for beginners, then visit the
complete guide from Analytics Vidhya. That guide also covers the difference
between bagging and boosting alongside the AdaBoost
algorithm.
Key Takeaways
● Boosting combines many weak learners into a single strong learner, which
improves the model’s predictive accuracy.
● The short scikit-learn snippet below shows how one such ensemble (here,
a bagging classifier with decision trees) is put together in practice.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# df = pd.read_csv("your_dataset.csv")
# X = df.drop("target_column", axis=1)
# y = df["target_column"]
X, y = make_classification(n_samples=500, n_features=10, random_state=42)  # synthetic stand-in
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

base_classifier = DecisionTreeClassifier(random_state=42)
bagging_classifier = BaggingClassifier(base_classifier, n_estimators=50, random_state=42)
bagging_classifier.fit(X_train, y_train)
predictions = bagging_classifier.predict(X_test)

# Calculate accuracy
accuracy = (predictions == y_test).mean()
print(f"Accuracy: {accuracy}")