
Module 05

As we know, ensemble learning helps improve machine learning results by combining several models. This approach produces better predictive performance than any single model. The basic idea is to learn a set of classifiers (experts) and to allow them to vote. Bagging and Boosting are two types of ensemble learning. Both decrease the variance of a single estimate by combining several estimates from different models, so the result may be a model with higher stability. Let's understand these two terms at a glance.

1. Bagging: A homogeneous weak learners' model in which the learners are trained independently of each other in parallel, and their outputs are combined to determine the model average.

2. Boosting: Also a homogeneous weak learners' model, but it works differently from Bagging. Here the learners are trained sequentially and adaptively to improve the model predictions of a learning algorithm.

Let’s look at both of them in detail and understand the Difference between
Bagging and Boosting.

Bagging
Bootstrap Aggregating, also known as bagging, is a machine learning ensemble
meta-algorithm designed to improve the stability and accuracy of machine
learning algorithms used in statistical classification and regression. It
decreases the variance and helps to avoid overfitting. It is usually applied
to decision tree methods. Bagging is a special case of the model averaging
approach.
Description of the Technique
Given a set D of d tuples, at each iteration i a training set Di of d tuples is selected from D via row sampling with replacement (i.e., a bootstrap sample, so the same tuple may appear more than once). A classifier model Mi is then learned from each training set Di. Each classifier Mi returns its class prediction. The bagged classifier M* counts the votes and assigns the class with the most votes to the unknown sample X.
Implementation Steps of Bagging

Step 1: Multiple subsets of equal size are created from the original data set by selecting observations with replacement.

Step 2: A base model is created on each of these subsets.

Step 3: Each model is learned in parallel on its training set, independently of the others.

Step 4: The final prediction is determined by combining the predictions from all the models, as in the sketch below.
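As a rough, hedged illustration of these four steps, the sketch below uses scikit-learn's BaggingClassifier with decision trees as base models; the synthetic dataset and all hyperparameter values are assumptions made for the example, not part of the original text.

```python
# A minimal sketch of the bagging steps above, assuming scikit-learn is installed.
# (In scikit-learn versions before 1.2 the `estimator` argument is named `base_estimator`.)
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Steps 1-3: each of the 50 trees is fit on a bootstrap sample drawn with
# replacement from the training data, independently and in parallel.
bagger = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True,   # row sampling with replacement
    n_jobs=-1,        # train the base models in parallel
    random_state=0,
)
bagger.fit(X_train, y_train)

# Step 4: predictions are combined by majority vote across the 50 trees.
print("Bagged test accuracy:", bagger.score(X_test, y_test))
```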

[Figure] An illustration of the concept of bootstrap aggregating (bagging).


Example of Bagging

The Random Forest model uses bagging, combining decision tree models that individually have high variance. It performs random feature selection when growing the trees, and several such random trees together make a Random Forest.

To read more refer to this article: Bagging classifier

Boosting
Boosting is an ensemble modeling technique designed to create a strong
classifier by combining multiple weak classifiers. The process involves building
models sequentially, where each new model aims to correct the errors made by
the previous ones.

Initially, a model is built using the training data.

Subsequent models are then trained to address the mistakes of their predecessors.

Boosting assigns weights to the data points in the original dataset:

Higher weights: instances that were misclassified by the previous model receive higher weights.

Lower weights: instances that were correctly classified receive lower weights.

Training on weighted data: the subsequent model learns from the weighted dataset, focusing its attention on harder-to-learn examples (those with higher weights).

This iterative process continues until:

The entire training dataset is accurately predicted, or

A predefined maximum number of models is reached.

Boosting Algorithms

There are several boosting algorithms. The original ones, proposed by Robert Schapire and Yoav Freund, were not adaptive and could not take full advantage of the weak learners. Schapire and Freund later developed AdaBoost, an adaptive boosting algorithm that won the prestigious Gödel Prize. AdaBoost
was the first really successful boosting algorithm developed for the purpose of
binary classification. AdaBoost is short for Adaptive Boosting and is a very
popular boosting technique that combines multiple “weak classifiers” into a
single “strong classifier”.

Algorithm:

1. Initialise the dataset and assign an equal weight to each data point.

2. Provide this as input to the model and identify the wrongly classified data points.

3. Increase the weights of the wrongly classified data points, decrease the weights of the correctly classified data points, and then normalise the weights of all data points.

4. If the required results have been obtained, go to step 5; otherwise, go to step 2.

5. End.
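A brief, hedged sketch of this loop with scikit-learn's AdaBoostClassifier follows; the library performs the reweighting and normalisation internally, and the synthetic dataset and parameter values are illustrative assumptions only.

```python
# Minimal AdaBoost sketch, assuming scikit-learn. The reweighting of wrongly
# classified points (steps 2-4 above) is handled inside AdaBoostClassifier.
# (In scikit-learn versions before 1.2 `estimator` is named `base_estimator`.)
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# The weak classifier is a decision stump (a tree of depth 1).
stump = DecisionTreeClassifier(max_depth=1)

ada = AdaBoostClassifier(estimator=stump, n_estimators=100,
                         learning_rate=0.5, random_state=1)
ada.fit(X_train, y_train)
print("AdaBoost test accuracy:", ada.score(X_test, y_test))
```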

[Figure] An illustration of the intuition behind the boosting algorithm: learners trained sequentially on a weighted dataset.

To read more refer to this article: Boosting and AdaBoost in ML

Similarities Between Bagging and Boosting


Bagging and Boosting are both commonly used methods, and their fundamental similarity is that both are classified as ensemble methods. Here we will explain the similarities between them.

1. Both are ensemble methods to get N learners from 1 learner.

2. Both generate several training data sets by random sampling.

3. Both make the final decision by averaging the N learners (or by taking the majority of them, i.e., majority voting).

4. Both are good at reducing variance and provide higher stability.

Differences Between Bagging and Boosting


| S.No. | Bagging | Boosting |
|---|---|---|
| 1 | The simplest way of combining predictions that belong to the same type. | A way of combining predictions that belong to different types. |
| 2 | Aims to decrease variance, not bias. | Aims to decrease bias, not variance. |
| 3 | Each model receives equal weight. | Models are weighted according to their performance. |
| 4 | Each model is built independently. | New models are influenced by the performance of previously built models. |
| 5 | Different training data subsets are selected using row sampling with replacement and random sampling methods from the entire training dataset. | Models are trained iteratively, with each new model focusing on correcting the errors (misclassifications or high residuals) of the previous models. |
| 6 | Bagging tries to solve the overfitting problem. | Boosting tries to reduce bias. |
| 7 | If the classifier is unstable (high variance), then apply bagging. | If the classifier is stable and simple (high bias), then apply boosting. |
| 8 | Base classifiers are trained in parallel. | Base classifiers are trained sequentially. |
| 9 | Example: the Random Forest model uses bagging. | Example: AdaBoost uses boosting techniques. |

Random Forest

What is the Random Forest Algorithm?


Random Forest is a powerful tree-learning technique in machine learning. It works by creating a number of decision trees during the training phase. Each tree is constructed using a random subset of the data set and considers a random subset of features at each partition. This randomness introduces variability among individual trees, reducing the risk of overfitting and improving overall prediction performance.

In prediction, the algorithm aggregates the results of all trees, either by voting (for classification tasks) or by averaging (for regression tasks). This collaborative decision-making process, supported by multiple trees and their insights, provides stable and precise results. Random forests are widely used for classification and regression tasks and are known for their ability to handle complex data, reduce overfitting, and provide reliable forecasts in different environments.

[Figure] Random Forest algorithm.

What are Ensemble Learning models?


Ensemble learning models work just like a group of diverse experts teaming up to make decisions. Picture a group of friends with different skills working on a project: each friend excels in a particular area, and by combining their strengths they create a more robust solution than any individual could achieve alone.

Similarly, in ensemble learning, different models, often of the same type or of different types, team up to enhance predictive performance. It's all about leveraging the collective wisdom of the group to overcome individual limitations and make more informed decisions in various machine learning tasks. Some popular ensemble models include XGBoost, AdaBoost, LightGBM, Random Forest, Bagging, and Voting.

What is Bagging and Boosting?
Bagging is an ensemble learning method in which multiple weak models are trained on different subsets of the training data. Each subset is sampled with replacement, and the final prediction is made by averaging the predictions of the weak models for regression problems, or by taking the majority vote for classification problems.
Boosting trains multiple base models sequentially. In this method, each model tries to correct the errors made by the previous models. Each model is trained on a modified version of the dataset in which the instances that were misclassified by the previous models are given more weight. The final prediction is made by weighted voting.
Algorithm for How Random Forest Works:

1. Step 1: Select K random data points from the training set.

2. Step 2: Build the decision trees associated with the selected data points (subsets).

3. Step 3: Choose the number N of decision trees that you want to build.

4. Step 4: Repeat Steps 1 and 2 until N trees have been built.

5. Step 5: For new data points, find the prediction of each decision tree and assign the new data points to the category that wins the majority vote, as in the sketch below.
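A minimal sketch of these steps with scikit-learn's RandomForestClassifier is given below; the dataset is synthetic and the parameter values are assumptions chosen only to mirror the steps.

```python
# Rough Random Forest sketch, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

forest = RandomForestClassifier(
    n_estimators=100,     # N, the number of decision trees (Step 3)
    bootstrap=True,       # each tree sees a random bootstrap sample (Steps 1, 2, 4)
    max_features="sqrt",  # random subset of features considered at each split
    random_state=7,
)
forest.fit(X_train, y_train)

# Step 5: each tree votes and the majority class is assigned to new points.
print("Predictions for 5 new points:", forest.predict(X_test[:5]))
print("Test accuracy:", forest.score(X_test, y_test))
```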

How Does Random Forest Work?


The Random Forest algorithm works in several steps, which are discussed below.

Ensemble of Decision Trees: Random Forest leverages the power of ensemble learning by constructing an army of decision trees. These trees are like individual experts, each specializing in a particular aspect of the data. Importantly, they operate independently, minimizing the risk of the model being overly influenced by the nuances of a single tree.

Random Feature Selection: To ensure that each decision tree in the ensemble brings a unique perspective, Random Forest employs random feature selection. During the training of each tree, a random subset of features is chosen. This randomness ensures that each tree focuses on different aspects of the data, fostering a diverse set of predictors within the ensemble.

Bootstrap Aggregating or Bagging: The technique of bagging is a cornerstone of Random Forest's training strategy which involves creating multiple bootstrap samples from the original dataset, allowing instances to be sampled with replacement. This results in different subsets of data for each decision tree, introducing variability in the training process and making the model more robust.

Decision Making and Voting: When it comes to making predictions, each decision tree in the Random Forest casts its vote. For classification tasks, the final prediction is determined by the mode (most frequent prediction) across all the trees. In regression tasks, the average of the individual tree predictions is taken. This internal voting mechanism ensures a balanced and collective decision-making process.

Key Features of Random Forest


Some of the key features of Random Forest are discussed below.

1. High Predictive Accuracy: Imagine Random Forest as a team of decision-making wizards. Each wizard (decision tree) looks at a part of the problem, and together they weave their insights into a powerful prediction tapestry. This teamwork often results in a more accurate model than what a single wizard could achieve.

2. Resistance to Overfitting: Random Forest is like a cool-headed mentor guiding its apprentices (decision trees). Instead of letting each apprentice memorize every detail of their training, it encourages a more well-rounded understanding. This approach helps prevent getting too caught up with the training data, which makes the model less prone to overfitting.

3. Large Datasets Handling: Dealing with a mountain of data? Random Forest tackles it like a seasoned explorer with a team of helpers (decision trees). Each helper takes on a part of the dataset, ensuring that the expedition is not only thorough but also surprisingly quick.

4. Variable Importance Assessment: Think of Random Forest as a detective at a crime scene, figuring out which clues (features) matter the most. It assesses the importance of each clue in solving the case, helping you focus on the key elements that drive predictions (see the sketch after this list).

5. Built-in Cross-Validation: Random Forest is like having a personal coach that keeps you in check. As it trains each decision tree, it also sets aside a secret group of cases (out-of-bag samples) for testing. This built-in validation ensures your model doesn't just ace the training but also performs well on new challenges (see the sketch after this list).

6. Handling Missing Values: Life is full of uncertainties, just like datasets with missing values. Random Forest is the friend who adapts to the situation, making predictions using the information available. It doesn't get flustered by missing pieces; instead, it focuses on what it can confidently tell us.

7. Parallelization for Speed: Random Forest is your time-saving buddy. Picture each decision tree as a worker tackling a piece of a puzzle simultaneously. This parallel approach taps into the power of modern tech, making the whole process faster and more efficient for handling large-scale projects.
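To make items 4 and 5 above concrete, here is a small hedged sketch using scikit-learn's RandomForestClassifier; the out-of-bag score and feature importances shown are the library's built-in equivalents of the "secret test cases" and "clue assessment" described, and the synthetic data are assumptions for the example.

```python
# Hedged sketch of variable importance and the built-in out-of-bag estimate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=42)

# oob_score=True evaluates each tree on the samples left out of its bootstrap
# draw, giving a validation estimate without a separate test set.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
forest.fit(X, y)

print("Out-of-bag accuracy:", forest.oob_score_)
print("Feature importances:", forest.feature_importances_.round(3))
```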

Random Forest vs. Other Machine Learning Algorithms
Some of the key differences are discussed below.

| Feature | Random Forest | Other ML Algorithms |
|---|---|---|
| Ensemble Approach | Utilizes an ensemble of decision trees, combining their outputs for predictions, fostering robustness and accuracy. | Typically relies on a single model (e.g., linear regression, support vector machine) without the ensemble approach, potentially leading to less resilience against noise. |
| Overfitting Resistance | Resistant to overfitting due to the aggregation of diverse decision trees, preventing memorization of training data. | Some algorithms may be prone to overfitting, especially when dealing with complex datasets, as they may excessively adapt to training noise. |
| Handling of Missing Data | Exhibits resilience in handling missing values by leveraging available features for predictions, contributing to practicality in real-world scenarios. | Other algorithms may require imputation or elimination of missing data, potentially impacting model training and performance. |
| Variable Importance | Provides a built-in mechanism for assessing variable importance, aiding in feature selection and interpretation of influential factors. | Many algorithms may lack an explicit feature importance assessment, making it challenging to identify crucial variables for predictions. |
| Parallelization Potential | Capitalizes on parallelization, enabling the simultaneous training of decision trees, resulting in faster computation for large datasets. | Some algorithms may have limited parallelization capabilities, potentially leading to longer training times for extensive datasets. |

What is Cross-Validation?
Cross validation is a technique used in machine learning to evaluate the
performance of a model on unseen data. It involves dividing the available data
into multiple folds or subsets, using one of these folds as a validation set, and
training the model on the remaining folds. This process is repeated multiple
times, each time using a different fold as the validation set. Finally, the results
from each validation step are averaged to produce a more robust estimate of
the model’s performance. Cross validation is an important step in the machine
learning process and helps to ensure that the model selected for deployment is
robust and generalizes well to new data.

What is cross-validation used for?


The main purpose of cross validation is to prevent overfitting, which occurs
when a model is trained too well on the training data and performs poorly on
new, unseen data. By evaluating the model on multiple validation sets, cross
validation provides a more realistic estimate of the model’s generalization
performance, i.e., its ability to perform well on new, unseen data.

Types of Cross-Validation
There are several types of cross-validation techniques, including k-fold cross-validation, leave-one-out cross-validation, holdout validation, and stratified cross-validation. The choice of technique depends on the size and nature of the data, as well as the specific requirements of the modeling problem.

K-Fold Cross-Validation

In k-fold cross-validation, we split the dataset into k subsets (known as folds), train the model on k-1 of the subsets, and leave one subset out for evaluating the trained model. We iterate k times, with a different subset reserved for testing each time.

Note: A value of k = 10 is commonly suggested, since a lower value of k moves the procedure towards simple holdout validation, while a higher value of k approaches the leave-one-out (LOOCV) method.

Example of K Fold Cross Validation


The listing below shows an example of the training and evaluation subsets generated in k-fold cross-validation. Here we have 25 instances in total. In the first iteration we use the first 20 percent of the data for evaluation and the remaining 80 percent for training (instances [0-4] for testing and [5-24] for training), while in the second iteration we use the second subset of 20 percent for evaluation and the remaining four subsets for training (instances [5-9] for testing and [0-4] plus [10-24] for training), and so on.

Total instances: 25
Value of k: 5

Iteration 1: training set [5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24], testing set [0 1 2 3 4]
Iteration 2: training set [0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24], testing set [5 6 7 8 9]
Iteration 3: training set [0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24], testing set [10 11 12 13 14]
Iteration 4: training set [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24], testing set [15 16 17 18 19]
Iteration 5: training set [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19], testing set [20 21 22 23 24]
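The same fold assignment can be reproduced with scikit-learn, as in the hedged sketch below; the 25-instance synthetic dataset and the logistic-regression model are assumptions for illustration only.

```python
# Reproduce the 5-fold split above and score a model with cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=25, n_features=4, random_state=0)

kf = KFold(n_splits=5, shuffle=False)  # contiguous folds, as in the listing above
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Iteration {i}: test set {test_idx.tolist()}")

# Average accuracy over the five validation folds.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print("Mean cross-validated accuracy:", scores.mean())
```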

Comparison between cross-validation and the holdout method
Advantages of train/test split:

1. It runs K times faster than K-fold cross-validation, because K-fold cross-validation repeats the train/test split K times.

2. It is simpler to examine the detailed results of the testing process.

Advantages of cross-validation:

1. More accurate estimate of out-of-sample accuracy.

2. More “efficient” use of data as every observation is used for both training
and testing.

Advantages and Disadvantages of Cross


Validation
Advantages:
1. Overcoming Overfitting: Cross validation helps to prevent overfitting by
providing a more robust estimate of the model’s performance on unseen
data.

2. Model Selection: Cross validation can be used to compare different models


and select the one that performs the best on average.

3. Hyperparameter tuning: Cross validation can be used to optimize the


hyperparameters of a model, such as the regularization parameter, by
selecting the values that result in the best performance on the validation
set.

4. Data Efficient: Cross validation allows the use of all the available data for
both training and validation, making it a more data-efficient method
compared to traditional validation techniques.

Disadvantages:
1. Computationally Expensive: Cross validation can be computationally
expensive, especially when the number of folds is large or when the model
is complex and requires a long time to train.

2. Time-Consuming: Cross validation can be time-consuming, especially


when there are many hyperparameters to tune or when multiple models
need to be compared.

3. Bias-Variance Tradeoff: The choice of the number of folds in cross validation can affect the bias-variance tradeoff of the performance estimate, i.e., too few folds may result in high bias (each model is trained on less data), while too many folds may result in high variance of the estimate.

XGBoost is an optimized distributed gradient boosting library designed for efficient and scalable training of machine learning models. It is an ensemble learning method that combines the predictions of multiple weak models to produce a stronger prediction. XGBoost stands for "Extreme Gradient Boosting", and it has become one of the most popular and widely used machine learning algorithms due to its ability to handle large datasets and achieve state-of-the-art performance in many machine learning tasks such as classification and regression.
One of the key features of XGBoost is its efficient handling of missing values,
which allows it to handle real-world data with missing values without requiring
significant pre-processing. Additionally, XGBoost has built-in support for
parallel processing, making it possible to train models on large datasets in a
reasonable amount of time.
XGBoost can be used in a variety of applications, including Kaggle
competitions, recommendation systems, and click-through rate prediction,
among others. It is also highly customizable and allows for fine-tuning of
various model parameters to optimize performance.
XGBoost (Extreme Gradient Boosting) was proposed by researchers at the University of Washington. It is a library written in C++ that optimizes the training of gradient boosting models.
Before understanding XGBoost, we first need to understand trees, especially the decision tree:

Decision Tree:

A Decision tree is a flowchart-like tree structure, where each internal node
denotes a test on an attribute, each branch represents an outcome of the test,
and each leaf node (terminal node) holds a class label.
A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner, called recursive partitioning. The recursion is completed when all the tuples in a node's subset have the same value of the target variable, or when splitting no longer adds value to the predictions.
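As a small illustration (not from the original text), the hedged sketch below grows a depth-limited decision tree on scikit-learn's built-in Iris data and prints its attribute tests and leaf labels.

```python
# Minimal decision-tree sketch, assuming scikit-learn.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Recursive partitioning; here the recursion is additionally capped at depth 3.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Each internal node tests one attribute, each leaf holds a class label.
print(export_text(tree, feature_names=list(iris.feature_names)))
```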

Bagging:
A Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used to reduce the variance of a black-box estimator (e.g., a decision tree) by introducing randomization into its construction procedure and then making an ensemble out of it.

Each base classifier is trained in parallel on a training set generated by randomly drawing, with replacement, N examples (or data points) from the original training dataset, where N is the size of the original training set. The training sets of the base classifiers are independent of each other. Many of the original data points may be repeated in a resulting training set while others may be left out.

Bagging reduces overfitting (variance) by averaging or voting; however, this can lead to an increase in bias, which is compensated for by the reduction in variance.

[Figure] Bagging classifier.

Random Forest:
Every decision tree has high variance, but when we combine all of them in parallel the resultant variance is low, because each decision tree is trained on a particular sample of the data, and hence the output does not depend on one decision tree but on multiple decision trees. In the case of a classification problem, the final output is obtained using a majority-voting classifier. In the case of a regression problem, the final output is the mean of all the outputs. This part is called Aggregation.
The basic idea behind this is to combine multiple decision trees in determining
the final output rather than relying on individual decision trees.
Random Forest has multiple decision trees as base learning models. We
randomly perform row sampling and feature sampling from the dataset forming
sample datasets for every model. This part is called Bootstrap.

Boosting:
Boosting is an ensemble modelling technique that attempts to build a strong classifier from a number of weak classifiers. It is done by building a model from weak models in series. First, a model is built from the training data. Then a second model is built that tries to correct the errors present in the first model. This procedure is continued, and models are added, until either the complete training data set is predicted correctly or the maximum number of models is reached.

[Figure] Boosting.

Gradient Boosting
Gradient Boosting is a popular boosting algorithm. In gradient boosting, each predictor corrects its predecessor's errors. In contrast to AdaBoost, the weights of the training instances are not tweaked; instead, each predictor is trained using the residual errors of its predecessor as labels.
Gradient Boosted Trees is a technique whose base learner is CART (Classification and Regression Trees).
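The hedged sketch below hand-rolls this idea for squared error, fitting each new tree to the residuals of the current ensemble; it is illustrative only, and library implementations such as scikit-learn's GradientBoostingRegressor add shrinkage schedules, subsampling, and other refinements.

```python
# Hand-rolled sketch of the gradient-boosting idea for squared error:
# each new tree is fitted to the residuals of the current ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from a constant model
trees = []

for _ in range(100):
    residuals = y - prediction             # errors of the predecessor ensemble
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                 # residuals used as labels
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("Training MSE:", np.mean((y - prediction) ** 2))
```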

XGBoost
XGBoost is an implementation of gradient-boosted decision trees. XGBoost models dominate in many Kaggle competitions.
In this algorithm, decision trees are created in sequential form. Weights play an important role in XGBoost. Weights are assigned to all the independent variables, which are then fed into the decision tree that predicts results. The weights of variables predicted wrongly by the tree are increased, and these variables are then fed to the second decision tree. These individual classifiers/predictors are then ensembled to give a strong and more precise model. It can work on regression, classification, ranking, and user-defined prediction problems.
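A minimal hedged sketch with the xgboost Python package's scikit-learn wrapper is given below; the synthetic data and parameter values are assumptions for illustration, not tuned settings.

```python
# Minimal XGBoost sketch, assuming the xgboost Python package is installed.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

model = XGBClassifier(
    n_estimators=300,   # number of boosted trees built sequentially
    learning_rate=0.1,  # shrinkage applied to each tree's contribution
    max_depth=4,
    n_jobs=-1,          # parallel tree construction
)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```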

Advantages of XGBoost:
1. Performance: XGBoost has a strong track record of producing high-quality
results in various machine learning tasks, especially in Kaggle competitions,
where it has been a popular choice for winning solutions.

2. Scalability: XGBoost is designed for efficient and scalable training of


machine learning models, making it suitable for large datasets.

3. Customizability: XGBoost has a wide range of hyperparameters that can be


adjusted to optimize performance, making it highly customizable.

4. Handling of Missing Values: XGBoost has built-in support for handling


missing values, making it easy to work with real-world data that often has
missing values.

5. Interpretability: Unlike some machine learning algorithms that can be


difficult to interpret, XGBoost provides feature importances, allowing for a
better understanding of which variables are most important in making
predictions.

Disadvantages of XGBoost:
1. Computational Complexity: XGBoost can be computationally intensive,
especially when training large models, making it less suitable for resource-
constrained systems.

2. Overfitting: XGBoost can be prone to overfitting, especially when trained on


small datasets or when too many trees are used in the model.

3. Hyperparameter Tuning: XGBoost has many hyperparameters that can be


adjusted, making it important to properly tune the parameters to optimize
performance. However, finding the optimal set of parameters can be time-
consuming and requires expertise.

4. Memory Requirements: XGBoost can be memory-intensive, especially when


working with large datasets, making it less suitable for systems with limited
memory resources.

