ML Mod 5.1
Let’s look at bagging and boosting in detail and understand the difference
between them.
Bagging
Bootstrap Aggregating, also known as bagging, is a machine learning ensemble
meta-algorithm designed to improve the stability and accuracy of machine
learning algorithms used in statistical classification and regression. It
decreases the variance and helps to avoid overfitting. It is usually applied
to decision tree methods. Bagging is a special case of the model averaging
approach.
Description of the Technique
Suppose a set D of d tuples. At each iteration i, a training set Di of d tuples is
selected from D via row sampling with replacement (i.e., the same tuple may
appear more than once in Di); this is the bootstrap. A classifier model Mi is then
learned from each training set Di. Each classifier Mi returns its class prediction,
and the bagged classifier M* counts the votes and assigns the class with the most
votes to X (the unknown sample).
Implementation Steps of Bagging
Step 1: Multiple subsets are created from the original data set with equal
tuples, selecting observations with replacement.
Step 2: A base (weak) model is created on each of these subsets.
Step 3: Each model is learned in parallel on its own training set and
independently of the others.
Step 4: The final predictions are determined by combining the predictions of all
the models (see the sketch below).
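As a rough illustration of these steps, here is a minimal sketch (assuming NumPy and scikit-learn are available; the dataset and model choices are arbitrary) that draws bootstrap samples Di, fits one classifier Mi per sample, and forms the bagged classifier M* by majority vote:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy dataset standing in for the set D of d tuples.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
d = len(X)

rng = np.random.default_rng(0)
models = []
for i in range(10):
    # Step 1: bootstrap sample Di -- d rows drawn with replacement from D.
    idx = rng.integers(0, d, size=d)
    # Steps 2-3: fit an independent base model Mi on its own sample.
    models.append(DecisionTreeClassifier(random_state=i).fit(X[idx], y[idx]))

# Step 4: the bagged classifier M* takes a majority vote over the Mi.
votes = np.stack([m.predict(X) for m in models])  # shape: (n_models, n_samples)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("training accuracy of the bagged vote:", (majority == y).mean())
```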
The Random Forest model uses bagging with decision trees, which individually have
high variance, as its base models. It uses random feature selection to grow each
tree, and several such random trees make up a Random Forest.
Boosting
Boosting is an ensemble modeling technique designed to create a strong
classifier by combining multiple weak classifiers. The process involves building
models sequentially, where each new model aims to correct the errors made by
the previous ones.
Initially, a model is built using the training data.
Weight adjustment: examples that the current model misclassifies are given higher
weights.
Training on weighted data: the subsequent model learns from the weighted
dataset, focusing its attention on the harder-to-learn examples (those with
higher weights).
Boosting Algorithms
There are several boosting algorithms. The original ones, proposed by Robert
Schapire and Yoav Freund, were not adaptive and could not take full advantage
of the weak learners. Schapire and Freund then developed AdaBoost, an
adaptive boosting algorithm that won the prestigious Gödel Prize. AdaBoost
(short for Adaptive Boosting) was the first really successful boosting algorithm
developed for binary classification; it is a very popular technique that combines
multiple “weak classifiers” into a single “strong classifier”.
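As a minimal sketch (using scikit-learn's AdaBoostClassifier; the dataset and parameter values here are arbitrary choices, not part of the original notes), AdaBoost combines many shallow "weak" trees into one strong classifier:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each boosting round fits a weak learner (a decision stump by default)
# on a re-weighted view of the data, and the rounds are then combined.
clf = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```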
[Figure: an illustration of the intuition behind the boosting algorithm, with weak learners trained on a weighted dataset.]
Both bagging and boosting make the final decision by aggregating the N learners:
averaging their outputs, or taking a majority vote among them.
Bagging mainly reduces variance, whereas boosting mainly reduces bias.
Random Forest
In prediction, the algorithm aggregates the results of all trees, either by voting
(for classification tasks) or by averaging (for regression tasks). This
collaborative decision-making process, supported by multiple trees and their
insights, provides stable and precise results. Random forests are widely used for
classification and regression tasks and are known for their ability to handle
complex data, reduce overfitting, and provide reliable forecasts in different
environments.
What is Bagging and Boosting?
Bagging is an ensemble learning method in which multiple weak models are
trained on different subsets of the training data. Each subset is sampled with
replacement, and the prediction is made by averaging the predictions of the weak
models for regression problems and by taking a majority vote for classification
problems.
Boosting trains multiple base models sequentially. In this method, each model
tries to correct the errors made by the previous models. Each model is trained
on a modified version of the dataset, in which the instances that were
misclassified by the previous models are given more weight. The final prediction
is made by weighted voting.
How the Random Forest Algorithm Works:
1. Step 1: Select K random data points from the training set.
2. Step 2: Build the decision trees associated with the selected data
points (subsets).
3. Step 3: Choose the number N of decision trees that you want to build.
4. Step 4: Repeat Steps 1 and 2 until N trees have been built.
5. Step 5: For new data points, find the predictions of each decision tree, and
assign the new data points to the category that wins the majority vote (see the
sketch below).
Random feature selection during training ensures that each tree focuses on
different aspects of the data, fostering a diverse set of predictors within the
ensemble.
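A minimal sketch of the steps above using scikit-learn's RandomForestClassifier (the dataset and the value of N are illustrative assumptions, not from the original notes):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators is the number N of trees; each tree is grown on its own
# bootstrap sample with random feature selection at every split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)

# New data points are assigned to the class that wins the majority vote.
print(forest.predict(X_te[:5]))
```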
5. Built-in Cross-Validation: Random Forest is like having a personal coach
that keeps you in check. As it trains each decision tree, it also sets aside a
secret group of cases (out-of-bag) for testing. This built-in validation
ensures your model doesn't just ace the training but also performs well on
new challenges.
6. Handling Missing Values: Life is full of uncertainties, just like datasets with
missing values. Random Forest is the friend who adapts to the situation,
making predictions using the information available. It doesn't get flustered
by missing pieces; instead, it focuses on what it can confidently tell us.
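For the out-of-bag validation described in point 5 above, a small sketch (assuming scikit-learn; the dataset and parameters are illustrative) of how the held-out OOB cases give a built-in validation score:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

# oob_score=True evaluates each tree on the rows left out of its bootstrap sample.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)
print("out-of-bag accuracy estimate:", forest.oob_score_)
```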
Variable Importance
Random Forest: provides a built-in mechanism for assessing variable importance,
aiding in feature selection and interpretation of influential factors.
Other algorithms: many lack an explicit feature importance assessment, making it
challenging to identify the variables that are crucial for predictions.
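A short sketch of this built-in variable-importance mechanism in scikit-learn's random forest (the dataset and parameters are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# feature_importances_ ranks how much each variable contributed to the splits.
for name, score in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```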
What is Cross-Validation?
Cross validation is a technique used in machine learning to evaluate the
performance of a model on unseen data. It involves dividing the available data
into multiple folds or subsets, using one of these folds as a validation set, and
training the model on the remaining folds. This process is repeated multiple
times, each time using a different fold as the validation set. Finally, the results
from each validation step are averaged to produce a more robust estimate of
the model’s performance. Cross validation is an important step in the machine
learning process and helps to ensure that the model selected for deployment is
robust and generalizes well to new data.
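As a minimal sketch of this procedure (using scikit-learn's cross_val_score; the estimator and fold count are illustrative assumptions), the per-fold scores are computed and then averaged:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Train on k-1 folds and validate on the remaining fold, k times in total.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("per-fold scores:", scores)
print("averaged estimate:", scores.mean())
```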
Types of Cross-Validation
There are several types of cross-validation techniques, including k-fold
cross-validation, leave-one-out cross-validation, holdout validation, and
stratified cross-validation. The choice of technique depends on the size and
nature of the data, as well as the specific requirements of the modeling problem.
In k-fold cross-validation, we split the dataset into k subsets (known as folds),
train on k-1 of the folds, and leave one fold out for the evaluation of the
trained model. We iterate k times, with a different fold reserved for testing
each time.
Total instances: 25
Value of k: 5

Iteration 1: training set = [5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24], testing set = [0 1 2 3 4]
Iteration 2: training set = [0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24], testing set = [5 6 7 8 9]
Iteration 3: training set = [0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24], testing set = [10 11 12 13 14]
Iteration 4: training set = [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24], testing set = [15 16 17 18 19]
Iteration 5: training set = [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19], testing set = [20 21 22 23 24]
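The split above can be reproduced with scikit-learn's KFold (a minimal sketch; shuffling is left off so the folds come out in the same contiguous order as the listing):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(25).reshape(-1, 1)  # 25 instances, indexed 0..24

kf = KFold(n_splits=5)  # k = 5, no shuffling
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Iteration {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```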
Advantages of cross-validation:
2. More “efficient” use of data as every observation is used for both training
and testing.
4. Data Efficient: Cross validation allows the use of all the available data for
both training and validation, making it a more data-efficient method
compared to traditional validation techniques.
Disadvantages:
1. Computationally Expensive: Cross validation can be computationally
expensive, especially when the number of folds is large or when the model
is complex and requires a long time to train.
Decision Tree:
A Decision tree is a flowchart-like tree structure, where each internal node
denotes a test on an attribute, each branch represents an outcome of the test,
and each leaf node (terminal node) holds a class label.
A tree can be “learned” by splitting the source set into subsets based on an
attribute value test. This process is repeated on each derived subset in a
recursive manner, called recursive partitioning. The recursion is complete
when all the tuples in the subset at a node have the same value of the target
variable, or when splitting no longer adds value to the predictions.
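A small sketch of such a learned tree using scikit-learn (dataset and depth are arbitrary assumptions); export_text prints the flowchart-like structure of attribute tests and leaf labels:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()

# Recursive partitioning: each internal node tests one attribute,
# each leaf holds a class label.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(data.data, data.target)
print(export_text(tree, feature_names=list(data.feature_names)))
```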
Bagging:
A Bagging classifier is an ensemble meta-estimator that fits base classifiers,
each on a random subset of the original dataset, and then aggregates their
individual predictions (either by voting or by averaging) to form a final
prediction. Such a meta-estimator can typically be used as a way to reduce the
variance of a black-box estimator (e.g., a decision tree) by introducing
randomization into its construction procedure and then making an ensemble
out of it.
Each base classifier is trained in parallel on a training set which is generated
by randomly drawing, with replacement, N examples (or data points) from the
original training dataset, where N is the size of the original training set. The
training sets for the base classifiers are independent of each other; many of the
original data points may be repeated in a resulting training set while others may
be left out.
[Figure: Bagging classifier]
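A minimal sketch of such a bagging meta-estimator with scikit-learn's BaggingClassifier (the base estimator defaults to a decision tree; the sample counts and parameters here are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each of the 25 base classifiers sees its own bootstrap sample of size N
# (drawn with replacement from the N training rows) and is fit independently.
bag = BaggingClassifier(n_estimators=25, bootstrap=True, random_state=0)
bag.fit(X_tr, y_tr)
print("test accuracy:", bag.score(X_te, y_te))
```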
Random Forest:
Every decision tree has high variance, but when we combine all of them in
parallel, the resulting variance is low, since each decision tree is trained on
its own particular sample of the data and the output therefore does not depend on
one decision tree but on multiple decision trees. In the case of a classification
problem, the final output is obtained by majority voting. In the case of a
regression problem, the final output is the mean of all the outputs. This part is
called Aggregation.
The basic idea behind this is to combine multiple decision trees in determining
the final output rather than relying on individual decision trees.
Random Forest has multiple decision trees as base learning models. We
randomly perform row sampling and feature sampling from the dataset forming
sample datasets for every model. This part is called Bootstrap.
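For the regression case, a brief sketch (scikit-learn, with arbitrary data and parameters): the forest's prediction is the mean of the individual trees' outputs, and max_features controls the feature sampling:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Bootstrap = row sampling; max_features = feature sampling at each split.
forest = RandomForestRegressor(n_estimators=50, max_features=0.5, random_state=0)
forest.fit(X, y)

# Aggregation: the forest prediction is the mean over the individual trees.
per_tree = np.stack([t.predict(X[:3]) for t in forest.estimators_])
print(per_tree.mean(axis=0))
print(forest.predict(X[:3]))  # effectively the same as the mean above
```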
Boosting:
Boosting is an ensemble modelling technique that attempts to build a strong
classifier from a number of weak classifiers. It is done by building a model
using weak models in series. First, a model is built from the training data.
Then a second model is built which tries to correct the errors present in the
first model. This procedure is continued, and models are added, until either the
complete training data set is predicted correctly or the maximum number of
models has been added.
[Figure: Boosting]
Gradient Boosting
Gradient Boosting is a popular boosting algorithm. In gradient boosting, each
predictor corrects its predecessor’s errors. In contrast to AdaBoost, the weights
of the training instances are not tweaked; instead, each predictor is trained
using the residual errors of its predecessor as labels.
There is a technique called the Gradient Boosted Trees whose base learner is
CART (Classification and Regression Trees).
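A toy sketch of this residual-fitting idea with two regression-tree stages, plus scikit-learn's GradientBoostingRegressor for the full algorithm (the data and tree depths are arbitrary assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, noise=5.0, random_state=0)

# Stage 1: fit a first tree to the targets.
t1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
# Stage 2: the next tree is trained on the residual errors of its predecessor.
residuals = y - t1.predict(X)
t2 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
two_stage_pred = t1.predict(X) + t2.predict(X)

# The full algorithm repeats this (with a learning rate) over many CART stages.
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1).fit(X, y)
print("two-stage fit vs full gradient boosting R^2:", gbr.score(X, y))
```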
XGBoost
XGBoost is an implementation of gradient-boosted decision trees. XGBoost
models frequently dominate Kaggle competitions.
In this algorithm, decision trees are created sequentially, and weights play an
important role. Each new tree is trained to correct the errors of the trees built
so far, so the training examples that the current ensemble predicts poorly have
more influence on the next tree. These individual trees are then ensembled to
give a strong and more precise model. XGBoost can work on regression,
classification, ranking, and user-defined prediction problems.
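A minimal sketch with the xgboost package's scikit-learn wrapper (assuming xgboost is installed; the dataset and hyperparameters are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # requires the xgboost package

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Trees are built sequentially; each new tree focuses on the examples
# the current ensemble still gets wrong.
model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```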
Advantages of XGBoost:
1. Performance: XGBoost has a strong track record of producing high-quality
results in various machine learning tasks, especially in Kaggle competitions,
where it has been a popular choice for winning solutions.
Disadvantages of XGBoost:
1. Computational Complexity: XGBoost can be computationally intensive,
especially when training large models, making it less suitable for resource-
constrained systems.