ML-4th Unit
Ensemble Methods:
1) Bagging
2) Stacking
3) Boosting
4) Gradient Boosting
5) Random Forest
Bootstrapping and Cross Validation:
1. Bootstrapping:
o Purpose: Bootstrapping is primarily used to establish empirical
distribution functions for a wide range of statistics. It helps estimate
parameters and build ensemble models.
o Method:
▪ Selects samples with replacement from the original dataset.
▪ The bootstrapped data sets can be as large as the original
dataset.
▪ Contains repeated elements in every subset.
▪ Relies on random sampling.
o Advantages:
▪ Faster than cross-validation.
▪ Builds and validates models using samples of size N (rather than a
fraction of N).
o Limitations:
▪ Not as strong as cross-validation for model validation.
▪ More useful for building ensemble models or estimating parameters.
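To make the sampling-with-replacement idea concrete, here is a minimal Python sketch (the notes contain no code; NumPy and the toy data are assumptions for illustration):

import numpy as np

# Toy dataset of size N = 10 (illustrative values).
rng = np.random.default_rng(42)
data = np.arange(10)

n_bootstraps = 3
for b in range(n_bootstraps):
    # Draw N indices with replacement: each bootstrapped set is as large as the
    # original and typically contains repeated elements.
    idx = rng.integers(0, len(data), size=len(data))
    print(f"bootstrap sample {b}:", data[idx])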
2. Cross-Validation:
o Purpose: Cross-validation measures the generalization performance
of a model.
o Method:
▪ Divides the available data into multiple folds or subsets.
▪ Uses one fold as a validation set and trains the model on the
remaining folds.
▪ Repeats this process with different validation sets.
▪ Averages results to estimate model performance.
o Advantages:
▪ Prevents overfitting by evaluating on multiple validation sets.
▪ Provides a more realistic estimate of generalization
performance.
o Types:
▪ K-Fold Cross Validation: Divides data into k subsets.
▪ Leave-One-Out Cross Validation (LOOCV): Trains on all
but one data point.
o Comparison:
▪ Cross-validation resamples without replacement, while
bootstrapping resamples with replacement.
▪ Cross-validation produces smaller surrogate data sets.
BRIEFLY EXPLAIN:
There are some common methods that are used for cross-validation. These
methods are given below:
Validation Set Approach
In the validation set approach, we divide the input dataset into a training set
and a test (validation) set, each receiving 50% of the dataset.
A big disadvantage is that we use only 50% of the data to train the model, so the
model may miss important information in the dataset; it also tends to give an
underfitted model.
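A minimal sketch of the 50/50 validation set approach, assuming scikit-learn is available (the dataset and model are illustrative choices, not part of the notes):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# test_size=0.5 gives each subset 50% of the data, as described above.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))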
Leave-P-out cross-validation
In this approach, p data points are left out of the training data. If there are n
data points in the original input dataset, then n − p data points are used as the
training set and the p data points form the validation set. This complete process
is repeated for all possible combinations, and the average error is calculated to
judge the effectiveness of the model.
o In this approach, the bias is minimal, as all the data points are used.
o The process is repeated many times (n times when p = 1), so execution time is high.
o This approach leads to high variation in testing the effectiveness of the
model, as we iteratively validate against only a few data points.
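A small illustrative sketch of leave-p-out splitting using scikit-learn's LeavePOut (the tiny dataset and p = 2 are assumptions for demonstration):

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(8).reshape(4, 2)   # toy dataset with n = 4 points
lpo = LeavePOut(p=2)             # leave p = 2 points out each round

# Every combination of p points is used once as the validation set.
for train_idx, val_idx in lpo.split(X):
    print("train indices:", train_idx, "validation indices:", val_idx)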
K-Fold Cross-Validation
Let's take the example of 5-fold cross-validation. The dataset is grouped into
5 folds. On the 1st iteration, the first fold is reserved for testing the model, and
the rest are used to train it. On the 2nd iteration, the second fold is used to test
the model, and the rest are used to train it. This process continues until each
fold has been used as the test fold.
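A minimal sketch of the 5-fold splitting described above, assuming scikit-learn (the toy data are illustrative):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)                 # toy data with 10 samples
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Each fold serves as the test set exactly once; the rest is used for training.
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"fold {fold}: test indices {test_idx}")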
Stratified K-Fold Cross-Validation
This technique is similar to k-fold cross-validation, with small changes. It is
based on the concept of stratification: rearranging the data so that each fold or
group is a good representative of the complete dataset. It is one of the best
approaches for dealing with bias and variance.
It can be understood with the example of housing prices: the price of some
houses can be much higher than that of others. To handle such situations, a
stratified k-fold cross-validation technique is useful.
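A minimal sketch of stratified splitting, assuming scikit-learn; the imbalanced toy labels are an assumption chosen to show that each fold keeps roughly the same class proportions:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)        # imbalanced labels (illustrative)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # Each test fold preserves the 3:1 class ratio of the full dataset.
    print(f"fold {fold}: class counts in test fold =", np.bincount(y[test_idx]))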
Holdout Method
This method is the simplest cross-validation technique of all. In this method,
we hold out a subset of the data, train the model on the remaining part of the
dataset, and use the held-out subset to obtain prediction results.
The error that occurs in this process tells us how well our model will perform on
unknown data. Although this approach is simple to perform, it still suffers from
high variance and can sometimes produce misleading results.
o Train/test split: The input data is divided into two parts, a training set
and a test set, in a ratio such as 70:30 or 80:20. A major disadvantage is
that it yields a high-variance estimate.
o Training Data: The training data is used to train the model; the
dependent variable is known.
o Test Data: The test data is used to make predictions from the model
that has already been trained on the training data. It has the same
features as the training data but is not part of it.
o Cross-Validation dataset: It is used to overcome the disadvantage of the
train/test split by splitting the dataset into several train/test splits and
averaging the results. It can be used when we want to optimize a model,
trained on the training dataset, for the best performance. It is more
efficient than a single train/test split because every observation is used
for both training and testing.
Limitations of Cross-Validation
There are some limitations of the cross-validation technique, which are given
below:
o Under ideal conditions it provides optimum output, but with inconsistent
data it may produce drastically different results. This is one of the big
disadvantages of cross-validation, as there is no certainty about the type of
data in machine learning.
o In predictive modeling, the data evolves over time, which may create
differences between the training and validation sets. For example, if we
create a model for predicting stock market values and train it on the
previous 5 years of stock values, the realistic values for the next 5 years
may be drastically different, so it is difficult to expect correct output in
such situations.
Applications of Cross-Validation
In practice, cross-validation is used for comparing models, tuning hyper-parameters,
and estimating how well a model will generalize.
Performance Metrics
Performance metrics help us understand how well our model has performed on the
given data. In this way, we can improve the model's performance by tuning the
hyper-parameters. Every ML model aims to generalize well to unseen/new data,
and performance metrics help determine how well the model generalizes to a new
dataset.
o Accuracy
o Confusion Matrix
o Precision
o Recall
o F-Score
o AUC(Area Under the Curve)-ROC
I. Accuracy
The accuracy metric is one of the simplest classification metrics to implement.
It is the ratio of the number of correct predictions to the total number of
predictions:
Accuracy = (Number of correct predictions) / (Total number of predictions)
II. Confusion Matrix
The confusion matrix is simple to implement, but the terminology used in this
matrix can be confusing for beginners.
A typical confusion matrix for a binary classifier looks like the image below
(however, it can be extended to classifiers with more than two classes).
We can determine the following from the above matrix:
o In the matrix, the columns are for the predicted values, and the rows
specify the actual values. Both the actual and predicted values take one of
two possible classes, Yes or No. So, if we are predicting the presence of a
disease in a patient, the prediction column with Yes means the patient has
the disease, and with No, the patient doesn't have the disease.
o In this example, the total number of predictions is 165: the model
predicted Yes 110 times and No 55 times.
o However, in reality, there are 60 cases in which patients don't have the
disease and 105 cases in which they do.
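The notes give only the row and column totals (165 cases, 110 predicted Yes, 105 actually Yes); the cell counts below are assumed values chosen to be consistent with those totals, and scikit-learn is assumed for the computation:

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Assumed cell counts: TN = 50, FP = 10, FN = 5, TP = 100 (sums match the totals above).
tn, fp, fn, tp = 50, 10, 5, 100
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

print(confusion_matrix(y_true, y_pred))               # rows = actual, columns = predicted
print("accuracy:", accuracy_score(y_true, y_pred))    # (TP + TN) / total = 150 / 165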
III. Precision
The precision metric is used to overcome the limitation of accuracy. Precision
determines the proportion of positive predictions that were actually correct. It is
calculated as the ratio of true positives to the total number of positive predictions
(true positives plus false positives): Precision = TP / (TP + FP).
IV. Recall
Recall determines the proportion of actual positive cases that the model
identified correctly. It is calculated as Recall = TP / (TP + FN).
V. F-Scores
F-score or F1 Score is a metric to evaluate a binary classification model on the
basis of predictions that are made for the positive class. It is calculated with the
help of Precision and Recall. It is a type of single score that represents both
Precision and Recall. So, the F1 Score can be calculated as the harmonic mean
of precision and recall, assigning equal weight to each:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
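A small self-contained sketch computing precision, recall, and F1 with scikit-learn; the label vectors are assumed toy values:

from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 1, 0, 1]   # assumed ground-truth labels
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]   # assumed model predictions

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f = f1_score(y_true, y_pred)          # 2 * p * r / (p + r)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")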
VI. AUC-ROC
TPR or True Positive Rate is a synonym for Recall and can be calculated as
TPR = TP / (TP + FN).
AUC stands for Area Under the ROC Curve. As its name suggests, AUC measures
the two-dimensional area under the entire ROC curve.
AUC evaluates the performance across all thresholds and provides an aggregate
measure. The value of AUC ranges from 0 to 1: a model whose predictions are
100% wrong has an AUC of 0.0, whereas a model whose predictions are 100%
correct has an AUC of 1.0.
Mean Absolute Error (MAE) is one of the simplest metrics. It measures the
absolute difference between the actual and predicted values, where "absolute"
means taking the magnitude of the difference regardless of sign.
To understand MAE, consider Linear Regression, where the model draws a best-fit
line between the dependent and independent variables. To measure the error in
prediction, we calculate the difference between the actual and predicted values;
to obtain the error for the complete dataset, we take the mean of the absolute
differences over all data points:
MAE = (1/N) Σ |Y − Y'|
Here, Y is the actual outcome, Y' is the predicted outcome, and N is the total
number of data points.
MAE is much more robust to outliers. One limitation of MAE is that it is not
differentiable at zero, which complicates gradient-based optimization such as
Gradient Descent. To overcome this limitation, another metric can be used:
Mean Squared Error (MSE).
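A short sketch of the MAE formula on assumed toy values, with scikit-learn's helper shown for comparison:

import numpy as np
from sklearn.metrics import mean_absolute_error

y_actual = np.array([3.0, 5.0, 2.5, 7.0])   # assumed actual outcomes Y
y_pred   = np.array([2.5, 5.0, 4.0, 8.0])   # assumed predicted outcomes Y'

mae_manual = np.mean(np.abs(y_actual - y_pred))   # (1/N) * sum(|Y - Y'|) = 0.75
print(mae_manual, mean_absolute_error(y_actual, y_pred))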
Mean Squared error or MSE is one of the most suitable metrics for Regression
evaluation. It measures the average of the Squared difference between predicted
values and the actual value given by the model.
Since the errors in MSE are squared, it only takes non-negative values, and it is
usually positive and non-zero.
Moreover, because the differences are squared, even small errors contribute and
large errors are amplified, which can lead to over-estimating how bad the model is:
MSE = (1/N) Σ (Y − Y')²
Here, Y is the actual outcome, Y' is the predicted outcome, and N is the total
number of data points.
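The corresponding MSE sketch on the same assumed toy values:

import numpy as np
from sklearn.metrics import mean_squared_error

y_actual = np.array([3.0, 5.0, 2.5, 7.0])   # assumed actual outcomes Y
y_pred   = np.array([2.5, 5.0, 4.0, 8.0])   # assumed predicted outcomes Y'

mse_manual = np.mean((y_actual - y_pred) ** 2)   # (1/N) * sum((Y - Y')^2) = 0.875
print(mse_manual, mean_squared_error(y_actual, y_pred))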
The R-squared score will always be less than or equal to 1, regardless of whether
the values are large or small.
To overcome the fact that R² does not decrease when extra predictors are added,
adjusted R-squared is used, which always shows a value lower than or equal to R².
It adjusts for the number of predictors and only shows an improvement if there is a
real improvement:
Adjusted R² = 1 − [(1 − R²)(n − 1) / (n − p − 1)]
Here, n is the total number of data points and p is the number of predictors.
2. ROC Curve
What is AUC-ROC Curve?
AUC-ROC curve is a performance measurement metric of a classification model
at different threshold values. Firstly, let's understand ROC (Receiver Operating
Characteristic curve) curve.
ROC Curve
ROC or Receiver Operating Characteristic curve represents a probability graph
to show the performance of a classification model at different threshold levels.
The curve is plotted between two parameters:
TPR: The True Positive Rate, a synonym for Recall, calculated as
TPR = TP / (TP + FN).
FPR: The False Positive Rate, calculated as FPR = FP / (FP + TN).
Now, to efficiently summarize the performance across all threshold levels, we need
a single measure, which is AUC.
In the ROC curve, AUC computes the performance of the binary classifier across
different thresholds and provides an aggregate measure. The value of AUC ranges
from 0 to 1, which means an excellent model will have AUC near 1, and hence it
will show a good measure of Separability.
o AUC is used to measure how well the predictions are ranked instead of
giving their absolute values. Hence, we can say AUC is Scale-Invariant.
o It measures the quality of predictions of the model without considering the
selected classification threshold. It means AUC is classification-
threshold-invariant.
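A minimal sketch of computing the ROC curve and AUC with scikit-learn; the labels and predicted probabilities are assumed toy values:

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])                     # assumed labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3])   # assumed probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)    # TPR and FPR at each threshold
print("AUC:", roc_auc_score(y_true, y_score))        # aggregate measure across thresholds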
Although the AUC-ROC curve is typically used for binary classification problems,
we can also extend it to multi-class classification problems. For a multi-class
problem with N classes, we can plot N AUC-ROC curves using the One-vs-All
method.
For example, if we have three different classes, X, Y, and Z, then we can plot a
curve for X against Y & Z, a second plot for Y against X & Z, and the third plot
for Z against Y and X.
Applications of the AUC-ROC Curve
1. Classification of 3D models
The curve can be used to classify 3D models and separate them from normal
models. With a specified threshold level, the curve separates the 3D models
from the non-3D models.
2. Healthcare
The curve has various applications in the healthcare sector. It can be used
to detect cancer in patients; it does this using the false positive and true
positive rates, and its accuracy depends on the threshold value used for
the curve.
3. Binary Classification
The AUC-ROC curve is mainly used for binary classification problems to
evaluate their performance.
3. Validation splits
In machine learning, validation splits are used to assess the performance of a
model on a dataset that was not used during training. This helps to ensure that the
model generalizes well to unseen data. Here are the common types of validation
splits:
1. Train-Test Split
This is the most basic form of validation split, where the dataset is divided into
two parts: a training set and a testing set. Typically, 70-80% of the data is used
for training, and the remaining 20-30% is used for testing.
2. K-Fold Cross-Validation
In k-fold cross-validation, the dataset is divided into k subsets (or folds). The
model is trained and validated k times, each time using a different fold as the
validation set and the remaining k-1 folds as the training set. The results are then
averaged to produce a single performance metric.
5. Leave-P-Out Cross-Validation
Similar to LOOCV, but instead of leaving one instance out, p instances are left
out for validation. The process is repeated for all possible combinations of p
instances.
• Process: Train the model multiple times, each time with a different
combination of p instances left out for validation.
6. Repeated K-Fold Cross-Validation
This involves repeating the k-fold cross-validation process multiple times with
different random splits. This provides a more robust estimate of the model's
performance.
7. Holdout Validation
The dataset is split into three parts: training, validation, and testing sets. The
training set is used to train the model, the validation set is used to tune
hyperparameters and make decisions about the model, and the testing set is used
to evaluate the final model.
8. Time Series Cross-Validation
For time series data, traditional cross-validation methods are not appropriate
because they do not respect the temporal order of the data. Instead, time series
split methods, such as rolling or expanding windows, are used.
• Rolling Window: The training set has a fixed size and slides forward in
time along with the validation set.
• Expanding Window: The training set grows with each iteration,
incorporating more past data.
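A minimal sketch of an expanding-window time series split using scikit-learn's TimeSeriesSplit (the toy data and number of splits are assumptions):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(24).reshape(12, 2)     # 12 time-ordered observations (illustrative)
tscv = TimeSeriesSplit(n_splits=4)   # expanding window: the training set grows each fold

for fold, (train_idx, val_idx) in enumerate(tscv.split(X), start=1):
    # Validation indices always come after the training indices, preserving temporal order.
    print(f"fold {fold}: train size = {len(train_idx)}, validate on {val_idx}")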
9. Nested Cross-Validation
Nested cross-validation is used to avoid the bias that can result from optimizing
hyperparameters and evaluating the model on the same data. It consists of two
loops: an outer loop for model evaluation and an inner loop for hyperparameter
tuning.
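A compact sketch of nested cross-validation with scikit-learn: the inner GridSearchCV loop tunes a hyperparameter while the outer cross_val_score loop estimates performance (the dataset, model, and parameter grid are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: hyperparameter tuning; outer loop: unbiased performance estimate.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy:", outer_scores.mean())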
The choice of validation split depends on various factors such as the size of the
dataset, the nature of the problem (e.g., time series vs. independent observations),
and the computational resources available.
2ND HALF
Ensemble Methods:
Ensemble methods in machine learning refer to techniques that combine the
predictions of multiple individual models (often called base models or
learners) to improve overall predictive performance. Instead of relying on a
single model's prediction, ensemble methods leverage the wisdom of the
crowd principle, where the collective prediction of multiple models tends to
be more accurate and robust than that of any individual model.
Key characteristics of ensemble methods include the use of multiple base models,
diversity among those models, and aggregation of their predictions (for example,
by voting or averaging).
Ensemble methods are widely used in machine learning because they often
lead to improved predictive performance, better generalization to unseen data,
and increased robustness against noise and outliers in the data. They are
applied across various domains, including classification, regression, and
anomaly detection, among others.
Briefly Explain:
Ensemble means ‘a collection of things’ and in Machine Learning
terminology, Ensemble learning refers to the approach of combining multiple
ML models to produce a more accurate and robust prediction compared to any
individual model. It implements an ensemble of fast algorithms (classifiers)
such as decision trees for learning and allows them to vote.
1. Bagging
Bagging (or Bootstrap Aggregating) is a type of ensemble learning in which
multiple base models are trained independently and in parallel on different
subsets of the training data. Each subset is generated using bootstrap sampling,
in which data points are picked at random with replacement. In the case of a
bagging classifier, the final prediction is made by aggregating the predictions
of all base models using majority voting. In regression, the final prediction is
made by averaging the predictions of all base models, which is known as bagging
regression.
o Step 1: Multiple subsets are made from the original dataset by selecting
observations with replacement, so some tuples may repeat.
o Step 2: A base model is created on each subset.
o Step 3: Each model is trained in parallel on its own training subset,
independently of the others.
o Step 4: The final prediction is determined by combining the predictions
from all models, as shown in the sketch below.
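A minimal bagging sketch with scikit-learn's BaggingClassifier; the dataset and hyperparameters are illustrative assumptions (older scikit-learn versions use base_estimator instead of estimator):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree is trained on a bootstrap sample (with replacement);
# the final class is chosen by majority voting over all trees.
bag = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
print("test accuracy:", bag.score(X_test, y_test))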
Applications of Bagging:
1. IT:
Bagging can improve the precision and accuracy of IT systems, such as network
intrusion detection systems. Studies show how bagging can improve the accuracy
of network intrusion detection and reduce the rate of false positives.
2. Environment:
Ensemble techniques such as bagging have been applied in the field of remote
sensing, for example to map the types of wetlands within a coastal landscape.
3. Finance:
Bagging has also been combined with deep learning models in the finance industry,
automating essential tasks such as fraud detection, credit risk evaluation, and
option pricing. Research demonstrates how bagging, among other machine learning
techniques, has been used to assess loan default risk, and how it limits risk by
helping to prevent credit card fraud in banking and financial institutions.
4. Healthcare:
Bagging has been used to form medical data predictions. Studies show that ensemble
techniques have been used for various bioinformatics problems, such as gene and
protein selection, to identify a specific trait of interest. More specifically,
bagging has been used to predict the onset of diabetes based on various risk
predictors.
What are the Advantages and Disadvantages of Bagging?
Advantages of Bagging:
Variance reduction:
Bagging can reduce the variance of a learning algorithm, which is especially
helpful with high-dimensional data, where missing values can lead to higher
variance, making the model more prone to overfitting and preventing correct
generalization to new datasets.
Disadvantages of Bagging:
1. Less flexibility:
As a method, bagging works particularly well with algorithms that are less
stable. Algorithms that are more stable or that suffer from high bias do not
gain much, as there is less variation within the dataset of the model. As noted
in the hands-on guide for machine learning, "bagging a linear regression model
will effectively just return the original predictions for large enough b."
2. Computationally expensive:
Bagging slows down and becomes more computationally intensive as the number of
iterations grows. Thus, it is not well suited to real-time applications.
Clustered systems or large processing cores are ideal for quickly building
bagged ensembles on large test sets.
3. Loss of interpretability:
It is difficult to draw precise business insights through bagging because of the
averaging involved across predictions. While the output is more precise than any
individual data point, a more accurate or complete dataset may yield greater
precision within a single classification or regression model.
2. Boosting
Boosting is an ensemble modeling technique that attempts to build a strong
classifier from a number of weak classifiers. It does so by building models in
series from weak models. First, a model is built from the training data. Then a
second model is built that tries to correct the errors present in the first
model. This procedure continues, and models are added, until either the complete
training dataset is predicted correctly or the maximum number of models has been
added.
How Boosting Works (Steps)
1. Initialise the dataset and assign equal weight to each data point.
2. Provide this as input to the model and identify the wrongly classified
data points.
3. Increase the weight of the wrongly classified data points.
4. if (got required results)
Goto step 5
else
Goto step 2
5. End
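As a concrete illustration of the re-weighting procedure above, here is a hedged AdaBoost sketch with scikit-learn (the dataset and hyperparameters are assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new weak learner concentrates on the points the previous learners
# misclassified, by increasing those points' weights (steps 2-3 above).
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)
print("test accuracy:", ada.score(X_test, y_test))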
Types Of Boosting Algorithms
There are several types of boosting algorithms; some of the most famous and
useful models are AdaBoost (Adaptive Boosting), Gradient Boosting, and XGBoost
(Extreme Gradient Boosting).
Differences between Bagging and Boosting:
o Bagging: The main task is to decrease variance, not bias. Boosting: The main
task is to decrease bias, not variance.
o Bagging: Each model receives equal weight in the final aggregation. Boosting:
Models are weighted according to their performance.
o Bagging: Each model is built independently. Boosting: Each model is built
dependently, influenced by the previous models.
o Bagging: Training data subsets are selected using row sampling with
replacement (random sampling) from the whole training dataset. Boosting: Each
new subset emphasizes the examples that were misclassified by preceding models.
o Bagging: If the classifier is unstable (high variance), apply bagging.
Boosting: If the classifier is stable and simple (high bias), apply boosting.
o Bagging: The base classifiers work in parallel. Boosting: The base classifiers
work sequentially.
Similarities between Bagging and Boosting:
1. Both are ensemble techniques that obtain N learners from 1 learner.
2. Both generate several training data sets through random sampling.
3. Both make the final decision by averaging the N learners (or by taking the
majority of them, i.e., majority voting).
4. Both bagging and boosting are good at reducing variance and offer better
stability.
3. Stacking
Architecture of Stacking
The architecture of the stacking model is designed in such a way that it consists
of two or more base (learner) models and a meta-model that combines the
predictions of the base models. The base models are called level-0 models, and
the meta-model is known as the level-1 model. So, the stacking ensemble method
involves original (training) data, primary-level models, primary-level
predictions, a secondary-level model, and the final prediction. The basic
architecture of stacking can be represented as shown in the image below.
o Original data: This data is divided into n folds and serves as the training
and test data.
o Base models: These models are also referred to as level-0 models. These
models use training data and provide compiled predictions (level-0) as an
output.
o Level-0 Predictions: Each base model is trained on part of the training data
and produces its own predictions, which are known as level-0 predictions.
o Meta Model: The architecture of the stacking model consists of one meta-
model, which helps to best combine the predictions of the base models.
The meta-model is also known as the level-1 model.
o Level-1 Prediction: The meta-model learns how to best combine the
predictions of the base models and is trained on different predictions made
by individual base models, i.e., data not used to train the base models are
fed to the meta-model, predictions are made, and these predictions, along
with the expected outputs, provide the input and output pairs of the training
dataset used to fit the meta-model.
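A minimal stacking sketch with scikit-learn's StackingClassifier, mirroring the level-0 / level-1 structure described above (the base models, meta-model, dataset, and parameters are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Level-0 (base) models; their out-of-fold predictions become the
# level-0 predictions used to train the level-1 meta-model.
base_models = [("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
               ("svc", SVC(probability=True, random_state=0))]
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)   # internal CV produces the level-0 predictions
stack.fit(X_train, y_train)
print("test accuracy:", stack.score(X_test, y_test))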
Blending – Blending is similar to stacking, but instead of using out-of-fold
predictions, the meta-model is trained on the predictions that the base models
make on a separate holdout (validation) set.
4. Gradient Boosting
Gradient Boosting Machine (GBM) is one of the most popular forward learning
ensemble methods in machine learning. It is a powerful technique for building
predictive models for regression and classification tasks.
GBM helps us to obtain a predictive model in the form of an ensemble of weak
prediction models such as decision trees. When decision trees are used as the
weak learners, the resulting algorithm is called gradient-boosted trees.
It enables us to combine the predictions from various learner models and build a
final predictive model with more accurate predictions. GBM depends mainly on
three components:
o Loss function
o Weak learners
o Additive model
1. Loss function:
There is a big family of loss functions in machine learning that can be used
depending on the type of task being solved. The choice of loss function is driven
by the required characteristics of the conditional distribution, such as
robustness. When using a loss function in our task, we must specify the loss
function and the function to compute the corresponding negative gradient. Once we
have these two functions, they can be implemented into gradient boosting machines
easily. Several loss functions have already been proposed for GBM algorithms.
Based on the type of response variable y, loss function can be classified into
different types as follows:
1. Continuous response, y ∈ R:
o Gaussian L2 loss function
o Laplace L1 loss function
o Huber loss function, δ specified
o Quantile loss function, α specified
2. Categorical response, y ∈ {0, 1}:
o Binomial loss function
o Adaboost loss function
3. Other families of response variables:
o Loss functions for survival models
o Loss functions count data
o Custom loss functions
2. Weak Learner:
Weak learners are the base learner models that learn from past errors and help in
building a strong predictive model for boosting algorithms in machine learning.
Generally, decision trees work as the weak learners in boosting algorithms. The
ensemble prediction is the sum of the individual weak learners:
f(x) = Σ_{b=1}^{B} f_b(x)
where f_b(x) is the b-th weak learner (tree) and B is the total number of trees.
Trees are constructed greedily, choosing the best split points based on purity
scores such as Gini impurity or by minimizing the loss.
3. Additive Model:
The additive model refers to adding trees to the ensemble one at a time: only a
single tree is added in each step, so that the existing trees in the model are not
changed. Gradient descent is used when adding trees, so that each new tree reduces
the loss.
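A minimal gradient boosting sketch with scikit-learn's GradientBoostingRegressor, showing the loss function, weak learners (shallow trees), and additive fitting (the dataset and hyperparameters are assumptions):

from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trees are added one at a time; each new tree fits the negative gradient of the
# chosen loss, and learning_rate scales each tree's contribution to the additive model.
gbm = GradientBoostingRegressor(loss="squared_error", n_estimators=200,
                                learning_rate=0.05, max_depth=3, random_state=0)
gbm.fit(X_train, y_train)
print("test MSE:", mean_squared_error(y_test, gbm.predict(X_test)))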
Instead of level-wise growth, LightGBM prefers leaf-wise growth of the tree's
nodes. In LightGBM, the primary node is split into two secondary nodes, and it
then chooses one secondary node to split further; this choice depends on which of
the two nodes has the higher loss.
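A short hedged sketch of leaf-wise boosting, assuming the separate lightgbm package is installed (the dataset and parameters are illustrative):

import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# num_leaves bounds the leaf-wise tree growth described above.
clf = lgb.LGBMClassifier(num_leaves=31, n_estimators=200, learning_rate=0.05)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))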
5. Random Forest
Random Forest is a powerful tree-based learning technique in Machine Learning. It
works by creating a number of decision trees during the training phase. Each tree
is constructed using a random subset of the data, and a random subset of features
is considered at each split. This randomness introduces variability among
individual trees, reducing the risk of overfitting and improving overall
prediction performance. For prediction, the algorithm aggregates the results of
all trees, either by voting (for classification tasks) or by averaging (for
regression tasks). This collaborative decision-making process, supported by
multiple trees and their insights, provides stable and precise results.
How Does Random Forest Work?
The Random Forest algorithm works in several steps:
1. Draw a bootstrap sample (random sample with replacement) of the training data
for each tree.
2. Grow a decision tree on each sample, considering only a random subset of
features at every split.
3. Repeat until the desired number of trees is built.
4. Aggregate the predictions of all trees by majority voting (classification) or
averaging (regression).
A minimal code sketch of these steps is shown below.
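A minimal Random Forest sketch with scikit-learn, matching the steps above (the dataset and hyperparameters are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree sees a bootstrap sample of the rows and a random subset of features
# (max_features) at every split; class predictions are combined by majority voting.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
print("feature importances (first 5):", rf.feature_importances_[:5])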
Advantages of Random Forest:
1. Reduced Overfitting:
o By averaging multiple trees, Random Forest reduces the risk of
overfitting compared to a single decision tree.
2. Robustness:
o It is robust to outliers and noise in the data.
3. Feature Importance:
o Random Forest provides insights into feature importance, helping in
feature selection and understanding the data.
4. Versatility:
o It can be used for both classification and regression problems.
Applications:
Healthcare: Disease prediction, medical diagnosis, genomics.
Comparison with other algorithms:
o Handling of Missing Data: Random Forest exhibits resilience in handling missing
values by leveraging the available features for predictions, contributing to
practicality in real-world scenarios. Other algorithms may require imputation or
elimination of missing data, potentially impacting model training and
performance.
o Parallelization Potential: Random Forest capitalizes on parallelization,
enabling the simultaneous training of decision trees and resulting in faster
computation for large datasets. Some algorithms may have limited parallelization
capabilities, potentially leading to longer training times on extensive datasets.