
UNIT-IV

Ensemble Learning and Machine Learning in


Practice
Bias – Variance Tradeoff – Bagging and Boosting (Random forests, Adaboost, XG boost
inclusive) – Metrics & Error Correction. Class Imbalance – SMOTE – One Class SVM –
Optimization of hyper parameters.

BIAS
Bias in machine learning refers to the difference between a model’s predictions and the actual values it tries to predict. Models with high bias oversimplify the underlying data distribution (the rule or function that generated the data), resulting in high error on both the training data and the test data.

Bias is typically measured by evaluating the performance of a model on a training dataset. One common way to estimate bias is to use performance metrics such as mean squared error (MSE) or mean absolute error (MAE), which quantify the difference between the predicted and actual values on the training data.
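As a quick illustration, the following minimal sketch (assuming scikit-learn is installed; the synthetic data and the linear model are only illustrative) computes MSE and MAE on the training set:

# Minimal sketch: measuring training error with MSE and MAE (illustrative data and model).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = LinearRegression().fit(X, y)   # a simple, potentially high-bias model
y_pred = model.predict(X)              # predictions on the training data itself

print("Training MSE:", mean_squared_error(y, y_pred))  # persistently large values hint at high bias
print("Training MAE:", mean_absolute_error(y, y_pred))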

The level of bias in a model is heavily influenced by the quality and quantity of

training data involved. Using insufficient data will result in flawed predictions. At the same
time, it can also result from the choice of an inappropriate model.

High-bias model features


Underfitting. High-bias models often underfit the data, meaning they oversimplify the solution through over-generalization. As a result, the proposed distribution does not correspond to the actual distribution.

Low training accuracy. Because the model fails to capture the training data properly, it shows high training loss and low training accuracy.

Oversimplification. The oversimplified nature of high-bias models limits their ability


to identify complex features in the training data, making them inefficient for solving
complicated problems.

How to reduce high bias?


 Incorporating additional features from data to improve the model’s accuracy.
 Increasing the number of training iterations to allow the model to learn more
complex data.
 Avoiding high-bias algorithms such as linear regression, logistic regression,
discriminant analysis, etc. and instead using nonlinear algorithms such as k-
nearest neighbours, SVM, decision trees, etc.
 Decreasing regularization at various levels to help the model learn the training
set more effectively and prevent underfitting.

Variance
The variability of model predictions for a given data point, which tells us the spread of our predictions, is called the variance of the model. A model with high variance fits the training data with a very complex function and thus is not able to predict accurately on data it has not seen before. As a result, such models perform very well on training data but have high error rates on test data.
When a model has high variance, it is said to be overfitting the data. Overfitting means fitting the training set very accurately with a complex, high-order hypothesis, but this is not a good solution because the error on unseen data is high. While training a model, variance should therefore be kept low.

Variance errors are either low or high-variance errors.


Low variance: Low variance means that the model is less sensitive to changes in the training data and produces consistent estimates of the target function across different subsets of data from the same distribution. Combined with high bias, this is the case of underfitting, where the model fails to generalize on both training and test data.

High variance: High variance means that the model is very sensitive to changes in the
training data and can result in significant changes in the estimate of the target function when
trained on different subsets of data from the same distribution. This is the case of overfitting
when the model performs well on the training data but poorly on new, unseen test data. It fits the training data so closely that it fails to generalize to new data.

There can be four combinations between bias and variance.

 High Bias, Low Variance: A model with high bias and low variance is said to be underfitting.

 High Variance, Low Bias: A model with high variance and low bias is
said to be overfitting.

 High-Bias, High-Variance: A model has both high bias and high variance, which
means that the model is not able to capture the underlying patterns in the data (high
bias) and is also too sensitive to changes in the training data (high variance). As a

result, the model will produce inconsistent and inaccurate predictions on average.

 Low Bias, Low Variance: A model that has low bias and low variance means that
the model is able to capture the underlying patterns in the data (low bias) and is not
too sensitive to changes in the training data (low variance). This is the ideal scenario
for a machine learning model, as it is able to generalize well to new, unseen data and produce consistent and accurate predictions. In practice, however, this ideal is rarely fully achievable.

 Since the ideal case of low bias and low variance is rarely achievable in practice, we trade off between bias and variance to reach a balanced model.

Machine Learning Algorithm     Bias        Variance
Linear Regression              High Bias   Low Variance
Decision Tree                  Low Bias    High Variance
Random Forest                  Low Bias    High Variance (lower than a single decision tree)
Bagging                        Low Bias    High Variance (lower than a single decision tree)

Bias Variance Tradeoff


 If the algorithm is too simple (e.g., a hypothesis with a linear equation), it may be in a high-bias, low-variance condition and thus be error-prone. If the algorithm is too complex (e.g., a hypothesis with a high-degree equation), it may be in a high-variance, low-bias condition; in that case it will not perform well on new entries. The sweet spot between these two conditions is known as the Trade-off, or Bias-Variance Trade-off.

 This tradeoff in complexity is why there is a tradeoff between bias and variance: an algorithm cannot be more complex and less complex at the same time.

 We try to optimize the total error of the model by using the Bias-Variance Tradeoff.

 The best fit is given by the hypothesis at the tradeoff point on the error-versus-complexity curve, i.e., the region with the least value of total error.

 This point is chosen for training the algorithm because it gives low error on training as well as test data.
 What is the difference between bias-variance decomposition and bias-variance
tradeoff?
 Bias-variance decomposition and bias-variance tradeoff are closely related concepts.
 Bias-variance decomposition is a mathematical technique that divides
the generalization error in a predictive model into two components: bias and variance.
 In machine learning, as you try to minimize one component of the error (e.g., bias),
the other component (e.g., variance) tends to increase, and vice versa. Finding the
right balance of bias and variance is key to creating an effective and accurate model.
This is called the bias-variance tradeoff.
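For squared-error loss, the decomposition referred to above can be written as follows (a standard identity, shown here for reference):

\mathbb{E}\left[\big(y - \hat{f}(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}

Minimizing the generalization error therefore means balancing the squared-bias and variance terms, which is exactly the tradeoff described above.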

Bagging and Boosting (Random forests, Adaboost, XG boost inclusive)


Bagging and Boosting are two types of Ensemble Learning
Bagging
Bootstrap Aggregating, also known as bagging, is a machine learning ensemble meta-
algorithm designed to improve the stability and accuracy of machine learning algorithms used
in statistical classification and regression. It decreases the variance and helps to
avoid overfitting. It is usually applied to decision tree methods. Bagging is a special case of

the model averaging approach.

Description of the Technique


Suppose a set D of d tuples. At each iteration i, a training set Di of d tuples is sampled with replacement from D (a bootstrap sample, so the same tuple may appear more than once). A classifier model Mi is then learned on each training set Di. Each classifier Mi returns its class prediction, and the bagged classifier M* counts the votes and assigns the class with the most votes to the unknown sample X.

Implementation Steps of Bagging


Step 1: Multiple subsets are created from the original data set with equal tuples,
selecting observations with replacement.
Step 2: A base model is created on each of these subsets.
Step 3: Each model is learned in parallel with each training set and independent of
each other.
Step 4: The final predictions are determined by combining the predictions from all the
models.
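A minimal sketch of these steps with scikit-learn's BaggingClassifier follows (assuming scikit-learn is installed; the dataset and parameter values are illustrative, and in older scikit-learn versions the base-model argument is named base_estimator instead of estimator):

# Bagging sketch: bootstrap subsets + parallel training + majority-vote prediction.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # Step 2: one base model per subset
    n_estimators=50,                     # Step 1: 50 bootstrap subsets
    bootstrap=True,                      # rows sampled with replacement
    n_jobs=-1,                           # Step 3: models trained in parallel
    random_state=42,
)
bagging.fit(X_train, y_train)

y_pred = bagging.predict(X_test)         # Step 4: predictions combined by majority vote
print("Bagging accuracy:", accuracy_score(y_test, y_pred))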

Boosting
Boosting is an ensemble modeling technique designed to create a strong classifier by
combining multiple weak classifiers. The process involves building models sequentially,
where each new model aims to correct the errors made by the previous ones.

 Subsequent models are then trained to address the mistakes of their


predecessors.
 Boosting assigns weights to the data points in the original dataset.
 Higher weights: Instances that were misclassified by the previous model
receive higher weights.
 Lower weights: Instances that were correctly classified receive lower weights.
 Training on weighted data: The subsequent model learns from the weighted
dataset, focusing its attention on harder-to-learn examples (those with higher
weights).
 This iterative process continues until:
 The entire training dataset is accurately predicted, or
 A predefined maximum number of models is reached.
Boosting Algorithms

There are several boosting algorithms. The original ones, proposed by Robert
Schapire and Yoav Freund were not adaptive and could not take full advantage of the weak
learners. Schapire and Freund then developed AdaBoost, an adaptive boosting algorithm that
won the prestigious Gödel Prize. AdaBoost was the first really successful boosting algorithm
developed for the purpose of binary classification. AdaBoost is short for Adaptive Boosting
and is a very popular boosting technique that combines multiple “weak classifiers” into a
single “strong classifier”.

Algorithm:
 Initialise the dataset and assign equal weight to each of the data point.
 Provide this as input to the model and identify the wrongly classified data
points.
 Increase the weight of the wrongly classified data points and decrease the
weights of correctly classified data points. And then normalize the weights of
all data points.
 If the required results are obtained, go to step 5; otherwise go back to step 2.
 End
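For reference, one common way to write these weight updates (standard AdaBoost notation, with labels y_i in {-1, +1} and weak learner h_t at round t) is:

\varepsilon_t = \sum_{i:\; h_t(x_i) \neq y_i} w_i, \qquad
\alpha_t = \frac{1}{2}\ln\!\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right), \qquad
w_i \leftarrow \frac{w_i\,\exp\big(-\alpha_t\, y_i\, h_t(x_i)\big)}{Z_t},

where Z_t normalizes the weights to sum to 1, and the final classifier is H(x) = \operatorname{sign}\big(\sum_t \alpha_t h_t(x)\big).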
Random Forest
The Random Forest algorithm is a powerful tree-based learning technique in Machine Learning: it builds many decision trees and then combines their predictions by voting. Random Forests are widely used for classification and regression tasks.

 It is a type of classifier that uses many decision trees to make predictions.


 It takes different random parts of the dataset to train each tree and then it
combines the results by averaging them. This approach helps improve the
accuracy of predictions. Random Forest is based on ensemble learning.

Imagine asking a group of friends for advice on where to go for vacation. Each friend
gives their recommendation based on their unique perspective and preferences (decision trees
trained on different subsets of data). You then make your final decision by considering the
majority opinion or averaging their suggestions (ensemble prediction).
The process starts with a dataset of rows (samples) and their corresponding class labels.

 Then - Multiple Decision Trees are created from the training data. Each tree is trained
on a random subset of the data (with replacement) and a random subset of features.
This process is known as bagging or bootstrap aggregating.

 Each Decision Tree in the ensemble learns to make predictions independently.

 When presented with a new, unseen instance, each Decision Tree in the ensemble
makes a prediction.

The final prediction is made by combining the predictions of all the Decision Trees.
This is typically done through a majority vote (for classification) or averaging (for
regression).

Key Features of Random Forest

Handles Missing Data: Automatically handles missing values during training, eliminating the
need for manual imputation.

 Algorithm ranks features based on their importance in making predictions offering


valuable insights for feature selection and interpretability.

 Scales Well with Large and Complex Data without significant performance
degradation.

 Algorithm is versatile and can be applied to both classification tasks (e.g., predicting
categories) and regression tasks (e.g., predicting continuous values).

How Random Forest Algorithm Works?

 The random Forest algorithm works in several steps:

 Random Forest builds multiple decision trees using random samples of the data. Each
tree is trained on a different subset of the data which makes each tree unique.

 When creating each tree the algorithm randomly selects a subset of features or
variables to split the data rather than using all available features at a time. This adds
diversity to the trees.

 Each decision tree in the forest makes a prediction based on the data it was trained on. When making the final prediction, the random forest combines the results from all the trees.
 For classification tasks the final prediction is decided by a majority vote. This means
that the category predicted by most trees is the final prediction.

 For regression tasks the final prediction is the average of the predictions from all the
trees.

 The randomness in data samples and feature selection helps prevent the model from overfitting, making the predictions more accurate and reliable.
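A minimal usage sketch with scikit-learn's RandomForestClassifier (assuming scikit-learn is installed; the dataset and parameter values are illustrative):

# Random Forest sketch: many trees on bootstrap samples with random feature subsets.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of decision trees in the forest
    max_features="sqrt",  # random subset of features considered at each split
    random_state=0,
)
forest.fit(X_train, y_train)

y_pred = forest.predict(X_test)                     # majority vote across all trees
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Feature importances:", forest.feature_importances_)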

Advantages of Random Forest:

 It overcomes the problem of overfitting by averaging or combining the results of different decision trees.

 Random forests work well for a large range of data items than a single decision tree
does.

 Random forests are very flexible and possess very high accuracy.

Disadvantages of Random Forest:

 Complexity is the main disadvantage of Random forest algorithms.

 The Random Forest algorithm requires more computational resources and is more time-consuming compared with simpler algorithms.

 It is less intuitive when we have a large collection of decision trees.

Applications of Random Forest:

 Banking: Banking sector mostly uses this algorithm for the identification of loan risk.

 Medicine: With the help of this algorithm, disease trends and risks of the disease can
be identified.

 Marketing: Marketing trends can be identified using this algorithm.

Ada Boost algorithm in Machine Learning.

Machine learning algorithms have the notable ability to make predictions and decisions based on patterns in data. However, not all algorithms are created equal: some perform better on certain kinds of data, while others may struggle. AdaBoost, short for Adaptive Boosting, is a powerful ensemble learning algorithm that can enhance the performance of weak learners and create a strong classifier. Here we dive into the world of AdaBoost, exploring its principles, working mechanism, and practical applications.

Introduction to AdaBoost

AdaBoost is a boosting algorithm introduced by Yoav Freund and Robert Schapire in 1996. It belongs to a class of ensemble learning techniques that aim to improve the performance of machine learning models by combining the outputs of multiple weaker models, known as weak learners or base learners. The fundamental idea behind AdaBoost is to give greater weight to the training instances that are misclassified by the current model, thereby focusing on the samples that are hard to classify.

How AdaBoost Works


1. Weight Initialization

At the start, every training instance is assigned an equal weight. These weights determine the importance of each example in the learning process.

2. Model Training

A weak learner is trained on the dataset with the aim of minimizing classification error. A weak learner is usually a simple model, such as a decision stump (a one-level decision tree) or a small neural network.

3. Weighted Error Calculation

After the weak learner is trained, it is used to make predictions on the training dataset. The weighted error is then calculated by summing the weights of the misclassified instances. This step emphasizes the importance of the samples that are hard to classify.

4. Model Weight Calculation

The weight of the weak learner is calculated based on its performance in classifying the training data. Models that perform well are assigned higher weights, indicating that they are more reliable.

5. Update Instance Weights

The instance weights are updated to give more weight to the misclassified samples from the previous step. This adjustment focuses the learning process on the instances that the current model struggles with.

6. Repeat

Steps 2 through 5 are repeated for a predefined number of iterations or until a specified performance threshold is met.

7. Final Model Creation

The final strong model (also referred to as the ensemble) is created by combining the weighted outputs of all weak learners. Typically, the models with higher weights have a greater influence on the final decision.

8. Classification

To make predictions on new records, AdaBoost uses the final ensemble model. Each weak learner contributes its prediction, weighted by its importance, and the combined result is used to classify the input.
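A minimal sketch of this workflow with scikit-learn's AdaBoostClassifier (assuming scikit-learn is installed; in older versions the base-model argument is named base_estimator instead of estimator, and all parameter values here are illustrative):

# AdaBoost sketch: decision stumps trained sequentially on reweighted data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner: a decision stump
    n_estimators=100,                               # number of boosting rounds
    learning_rate=0.5,                              # shrinks each learner's contribution
    random_state=1,
)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", accuracy_score(y_test, ada.predict(X_test)))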

Key Concepts in AdaBoost

To gain a deeper understanding of AdaBoost, it is important to be acquainted with some key concepts associated with the algorithm:

1. Weak Learners

Weak learners are the individual models that make up the ensemble. These are typically models with accuracy only slightly better than random guessing. In the context of AdaBoost, weak learners are trained sequentially, with each new model focusing on the instances that previous models found difficult to classify.

2. Strong Classifier

The strong classifier, also known as the ensemble, is the final model created by combining the predictions of all weak learners. It carries the collective knowledge of all the models and is capable of making accurate predictions.

3. Weighted Voting

In AdaBoost, every weak learner contributes to the final prediction with a weight based on its performance. This weighted voting scheme ensures that the more accurate models have a greater say in the final decision.

4. Error Rate

The error rate measures how well a weak learner performs on the training data. It is used to calculate the weight assigned to each weak learner. Models with lower error rates are given higher weights.

5. Iterations

The number of iterations or rounds in AdaBoost is a hyperparameter that determines how many weak learners are trained. Increasing the number of iterations may result in a more complex ensemble, but it can also increase the risk of overfitting.

Advantages of AdaBoost

AdaBoost offers several advantages that make it a popular choice in machine learning:

1. Improved Accuracy

AdaBoost can significantly improve the accuracy of weak learners, even when using simple models. By concentrating on misclassified instances, it adapts to the difficult regions of the data distribution.

2. Versatility

AdaBoost can be used with a variety of base learners, making it a flexible algorithm that can be applied to different types of problems.

3. Feature Selection

It implicitly favours the most informative features, reducing the need for extensive feature engineering.

4. Resistance to Overfitting

AdaBoost tends to be less prone to overfitting than some other ensemble methods, thanks to its focus on misclassified instances.

XGboost inclusive

XGBoost is an optimized implementation of Gradient Boosting and is a type of ensemble learning method. Ensemble learning combines multiple weak models to form a stronger model. XGBoost uses decision trees as its base learners, combining them sequentially to improve the model’s performance. Each new tree is trained to correct the errors made by the previous trees, and this process is called boosting.

It has built-in parallel processing to train models on large datasets quickly. XGBoost
also supports customizations allowing users to adjust model parameters to optimize
performance based on the specific problem.

How XGBoost Works?

It builds decision trees sequentially with each tree attempting to correct the mistakes
made by the previous one. The process can be broken down as follows:

1. Start with a base learner: The first decision tree is trained on the data. In regression tasks this base model simply predicts the average of the target variable.

2. Calculate the errors: After training the first tree the errors between the predicted and
actual values are calculated.
3. Train the next tree: The next tree is trained on the errors of the previous tree. This step
attempts to correct the errors made by the first tree.

4. Repeat the process: This process continues with each new tree trying to correct the
errors of the previous trees until a stopping criterion is met.

5. Combine the predictions: The final prediction is the sum of the predictions from all
the trees.
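A minimal sketch of this loop using the xgboost package (assuming it is installed; the dataset and parameter values are illustrative):

# XGBoost sketch: boosted trees trained sequentially on the previous trees' errors.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)

model = XGBClassifier(
    n_estimators=200,     # number of boosting rounds (steps 3-4 repeated)
    learning_rate=0.1,    # how strongly each new tree corrects earlier errors
    max_depth=4,          # complexity of each base tree
    eval_metric="logloss",
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)  # step 5: summed tree predictions, thresholded into class labels
print("XGBoost accuracy:", accuracy_score(y_test, y_pred))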

Advantages of XGboost
 XGBoost is highly scalable and efficient, as it is designed to handle large datasets with millions or even billions of instances and features.

 XGBoost implements parallel processing techniques and utilizes hardware


optimization, such as GPU acceleration, to speed up the training process. This
scalability and efficiency make XGBoost suitable for big data applications and real-
time predictions.

 It provides a wide range of customizable parameters and regularization techniques,


allowing users to fine-tune the model according to their specific needs.

 XGBoost offers built-in feature importance analysis, which helps identify the most
influential features in the dataset. This information can be valuable for feature
selection, dimensionality reduction, and gaining insights into the underlying data
patterns.

 XGBoost has not only demonstrated exceptional performance but has also become a
go-to tool for data scientists and machine learning practitioners across various
languages. It has consistently outperformed other algorithms in Kaggle competitions,
showcasing its effectiveness in producing high-quality predictive models.

Disadvantages of XGBoost
 XGBoost can be computationally intensive, especially when training complex models, making it less suitable for resource-constrained systems.

 Despite its robustness, XGBoost can still be sensitive to noisy data or outliers,

necessitating careful data preprocessing for optimal performance.


 XGBoost is prone to overfitting on small datasets or when too many trees are used in
the model.

 While feature importance scores are available, the overall model can be challenging to
interpret compared to simpler methods like linear regression or decision trees. This
lack of transparency may be a drawback in fields like healthcare or finance where
interpretability is critical.

 XGBoost is a powerful and flexible tool that works well for many machine learning
tasks. Its ability to handle large datasets and deliver high accuracy makes it useful.

Metrics
Metrics & Error Correction in Ensemble Learning
Ensemble learning combines multiple models to improve performance, reduce variance,
and enhance generalization. To assess and refine ensemble models, we rely on metrics
and error correction techniques.
Error correction

Error correction in machine learning (ML) refers to techniques used to identify,


minimize, and handle errors that occur during model training, evaluation, and prediction.
Errors in ML can arise due to various reasons, such as noisy data, overfitting,
underfitting, biased models, and incorrect feature selection. Implementing proper error
correction methods can significantly enhance the performance and reliability of ML
models.

2. Types of Errors in Machine Learning


A) Bias-Variance Tradeoff Errors
Bias Error:

Occurs when the model is too simple and fails to capture the complexity of the data.

Leads to underfitting, where the model performs poorly on both training and test data.

Variance Error:
 Occurs when the model learns noise instead of patterns, making it highly sensitive to
training data.
 Leads to overfitting, where the model performs well on training data but poorly on
new data.
B) Classification Errors

 False Positives (FP): Model incorrectly classifies a negative sample as positive.

 False Negatives (FN): Model incorrectly classifies a positive sample as negative.

 Misclassification Rate: The percentage of incorrect predictions made by the model.

C) Regression Errors

 Mean Squared Error (MSE): Measures the average squared difference between
actual and predicted values.
 Mean Absolute Error (MAE): Measures the average absolute difference between
actual and predicted values.
 Root Mean Squared Error (RMSE): Square root of MSE, providing an
interpretable metric for error magnitude.
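For reference, with actual values y_i, predictions \hat{y}_i, and n samples, these metrics are defined as:

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^2, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\big|y_i - \hat{y}_i\big|, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}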
D) Data-Related Errors
 Missing Data: Gaps in dataset values can lead to biased predictions.

 Noisy Data: Incorrect or irrelevant information can degrade model performance.

 Outliers: Extreme values can distort model training and influence predictions.

3. Techniques for Error Correction in ML


A) Data Preprocessing Techniques

✔Handling Missing Data:
 Use mean/median/mode imputation to fill missing values.
 Use predictive modeling (e.g., kNN imputation) to estimate missing values.
 Drop rows/columns if missing data is minimal.
✔Handling Noisy Data:
 Apply data smoothing techniques such as moving averages.
 Use feature scaling (normalization or standardization).
 Remove inconsistent data points.

✔Detecting and Removing Outliers:


 Use Z-score method (standard deviations away from the mean).
 Use Interquartile Range (IQR) to identify outliers.
✔Feature Engineering and Selection:
 Remove irrelevant or redundant features.
 Use Principal Component Analysis (PCA) for dimensionality reduction.
 Perform one-hot encoding for categorical variables.

B) Model Optimization Techniques

✔Hyperparameter Tuning:
 Use Grid Search, Random Search, or Bayesian Optimization to find optimal
parameters.
 Adjust parameters such as learning rate, number of layers, and regularization strength.

✔Regularization Techniques:
 Apply L1 (Lasso) and L2 (Ridge) regularization to reduce overfitting.
✔Cross-Validation:
 Use k-fold cross-validation to improve generalization.
✔Handling Class Imbalance:
 Use SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic
samples.
 Use class weighting to assign higher importance to the minority class.
C) Error Analysis & Correction
✔Confusion Matrix Analysis:
 Evaluate False Positives (FP), False Negatives (FN), True Positives (TP), and True
Negatives (TN).

 Optimize for precision, recall, and F1-score based on the use case.

✔Resampling Techniques:
 Undersampling: Reduces majority class instances to balance the dataset.
 Oversampling: Increases minority class samples to prevent bias.
✔Anomaly Detection:
 Use algorithms like One-Class SVM, Isolation Forest, or Local Outlier Factor (LOF).
✔Boosting Techniques:
 Use ensemble methods like AdaBoost, Gradient Boosting, or XGBoost to iteratively
reduce errors.
Class imbalance
 Class imbalance is a machine learning issue that occurs when there is an uneven
distribution of data across different classes. This can lead to biased models that
misclassify the minority class.

Causes
 An irregular distribution of data between classes

 A disproportionate number of instances in one class compared to the other

 Effects: lower accuracy in object detection, difficulty training models, decreased model performance, poor generalization to unseen data, and inaccurate predictions for minority classes.

 Solutions

 Resampling: Modify the sample distribution by either over-sampling or under-


sampling

 Data augmentation: Create new training samples by applying transformations to


existing data

 Synthetic Minority Oversampling Technique (SMOTE): Generate synthetic instances


for the minority class

 Use a different performance metric: Consider using a different metric to evaluate


model performance

 Collect more data: Gather more data to address the imbalance.

SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) is a widely used method in


machine learning to address the challenge of imbalanced datasets. Imbalanced datasets
occur when certain classes are underrepresented, leading to biased models that may

overlook these minority classes. SMOTE mitigates this issue by generating synthetic
samples for the minority class, thereby balancing the dataset and improving model
performance.

How SMOTE Works


Selecting a Minority Sample: Randomly pick a sample from the minority class.
Identifying Nearest Neighbors: Determine the k-nearest neighbors (typically k = 5) of the selected sample within the minority class.
Generating Synthetic Samples: Randomly select one of the k nearest neighbors and create a synthetic sample at a random point along the line segment joining the original sample and that neighbor.
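A minimal sketch using the imbalanced-learn package (assuming it is installed; the synthetic dataset and parameter values are illustrative):

# SMOTE sketch: oversample the minority class by interpolating between neighbours.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced dataset: roughly 90% majority class, 10% minority class
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)
print("Class counts before SMOTE:", Counter(y))

smote = SMOTE(k_neighbors=5, random_state=42)  # k nearest neighbours used for interpolation
X_res, y_res = smote.fit_resample(X, y)        # adds synthetic minority samples
print("Class counts after SMOTE:", Counter(y_res))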

Types of SMOTE

Several variations of SMOTE have been developed to enhance its effectiveness:


a) Borderline-SMOTE

Focuses on generating synthetic samples near the decision boundary between classes.

Helps improve classification in complex datasets.

b) SMOTE-NC (Nominal Continuous SMOTE)

Designed for datasets with categorical features.Uses a modified distance measure to


handle categorical data.

c) SMOTE-Tomek Links

Combines SMOTE with Tomek links to remove overlapping samples, improving data
quality.

d) ADASYN (Adaptive Synthetic Sampling)

Similar to SMOTE but generates more synthetic samples for difficult-to-learn


instances.

When to Use SMOTE

SMOTE is particularly beneficial when dealing with highly imbalanced datasets


where the minority class is underrepresented, and model performance on this class is

critical. However, it's essential to assess its impact on your specific dataset and consider
combining it with other techniques, such as ensemble methods or cost-sensitive learning,
to achieve optimal results.

Advantages of SMOTE
 Reduces Overfitting: By generating new synthetic samples rather than duplicating
existing ones, SMOTE helps prevent models from overfitting to specific samples.

 Improves Model Performance: Balancing the dataset enables models to learn


decision boundaries more effectively, enhancing performance on the minority class.

 Enhances Minority Class Representation: Synthetic samples provide better


coverage of the minority class feature space, aiding the model in learning its
characteristics.

Limitations and Considerations


 Potential for Overlapping Classes: SMOTE may generate synthetic samples that
overlap with the majority class, potentially leading to misclassifications.

 Computational Complexity: The process of generating synthetic samples and


identifying nearest neighbors can be computationally intensive, especially for large
datasets.

 Assumption of Feature Space Linearity: SMOTE assumes that interpolating


between minority class samples is valid, which may not hold true in all cases.

One-Class Support Vector Machines

One-Class Support Vector Machine is a special variant of Support Vector


Machine that is primarily designed for outlier, anomaly, or novelty detection. The
objective behind using one-class SVM is to identify instances that deviate significantly
from the norm. Unlike other traditional Machine Learning models, one-class SVM is not
used to perform binary or multiclass classification tasks but to detect outliers or novelties
within the dataset. Some of the key working principles of one-class SVM are discussed below.

Outlier Boundary: One-Class SVM operates by defining a boundary around the


majority class (normal instances) in the feature space. This boundary is constructed to
encapsulate the normal data points, effectively creating a region of normalcy.

 Margin Maximization: This algorithm strives to maximize the margin around the
normal instances, allowing for a more robust separation between normal and
anomalous data points. This margin is crucial for accurately identifying outliers
during testing.

 High sensitivity: One-Class SVM has a built-in hyperparameter called "nu," representing an upper bound on the fraction of margin errors and support vectors. Fine-tuning this parameter influences the model's sensitivity to outliers.

How One-Class SVM Works?


One-Class Support Vector Machines (OCSVM) operate on a fascinating principle
inspired by the idea of isolating the norm from the abnormal in a dataset. Unlike
traditional Support Vector Machines (SVM), which are adept at handling binary and
multiclass classification problems, OCSVM specializes in the nuanced task of anomaly
detection.

Conceptual Foundation: OCSVM establishes itself on the premise that the majority of
real-world data is inherently normal. In most scenarios, outliers or anomalies are rare
occurrences that deviate significantly from the usual patterns. OCSVM's goal is to define
a boundary encapsulating the normal instances in the feature space, thereby creating a
region of familiarity.

Outlier Boundary Definition: The algorithm crafts a boundary around the normal
instances, often referred to as the "normalcy region." This boundary is strategically
positioned to maximize the margin around the normal data points, allowing for a clear
delineation between what is considered ordinary and what may be deemed unusual. Think
of it as drawing a protective circle around the typical instances to shield them from the
outliers or anomalies.

Margin Maximization: The heart of OCSVM lies in its commitment to maximizing the
margin between the normal instances and the boundary. A larger margin provides a robust
separation, enhancing the model's ability to discern anomalies during testing. This
emphasis on margin maximization is akin to creating a safety buffer around the normal
instances, fortifying the model against the influence of potential outliers or anomalies.

Training Process: During the training phase, OCSVM exclusively leverages the majority
class or normal instances. This unimodal focus distinguishes it from traditional SVMs,
which necessitate examples from both classes for effective training. By concentrating
solely on the norm, OCSVM tailors itself to scenarios where anomalies are sparse, and
labeled instances of anomalies are hard to come by. It comes with a
fantastic hyperparameter called 'nu'. This parameter acts as an upper bound on the fraction
of margin errors and support vectors allowed by the model. Tuning the nu parameter
enables practitioners to strike a balance between the desire for a stringent model that
minimizes false positives (normal instances misclassified as anomalies) and a more
lenient model that embraces a higher fraction of anomalies.

Testing and Anomaly Identification: Armed with the learned normalcy region, OCSVM
can swiftly identify anomalies during testing. Instances falling outside the defined
boundary are flagged as potential outliers. The model essentially acts as a vigilant
guardian, scrutinizing new data points and signaling if they exhibit behavior significantly
different from the norm.
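A minimal sketch of this train-on-normal / flag-outliers workflow with scikit-learn's OneClassSVM (assuming scikit-learn is installed; the data and the nu/gamma values are illustrative):

# One-Class SVM sketch: learn a boundary around normal data, flag points outside it.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.5 * rng.randn(200, 2)                       # training data: normal instances only
X_test = np.vstack([0.5 * rng.randn(20, 2),             # normal-looking test points
                    rng.uniform(-4, 4, size=(20, 2))])  # scattered potential outliers

oc_svm = OneClassSVM(kernel="rbf", gamma="auto", nu=0.05)  # nu bounds the fraction of margin errors
oc_svm.fit(X_train)

pred = oc_svm.predict(X_test)   # +1 = inside the normalcy region, -1 = flagged as anomaly
print("Flagged anomalies:", int((pred == -1).sum()), "out of", len(X_test))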

One-Class SVM in Anomaly Detection

In the domain of anomaly detection, One-Class Support Vector Machines (OCSVM)


serve as a robust and versatile tool designed to discern normal patterns from irregular
occurrences. Notably, OCSVM takes a distinctive approach by training exclusively on the
majority class, which represents normal instances, eliminating the need for labeled
anomalous data during training. This is particularly advantageous in real-world scenarios
where anomalies are rare, making it challenging to obtain sufficient labeled samples. The
core principle of OCSVM involves defining a boundary around the normal instances in
the feature space, achieved through a specified kernel function and a nuanced parameter
termed "nu." This parameter acts as an upper limit on the fraction of margin errors and
support vectors, enabling users to fine-tune the model's sensitivity to outliers. During the
testing phase, instances falling outside this learned boundary are flagged as potential
outliers, facilitating efficient anomaly identification. OCSVM's adaptability extends to
various applications, including fraud detection in financial transactions, fault monitoring
in industrial systems, and network intrusion detection. Its innate ability to capture
complex, non-linear relationships and its focus on the majority class make it a valuable
asset in safeguarding systems against unexpected events and ensuring robust anomaly
detection across diverse domains.

Use Cases of One-Class SVM


There are several real-world use-cases of One-Class SVM which are listed below
1. Detecting fraud in financial transactions: OCSVM excels at spotting the rare cases associated with fraudulent activity in financial transactions. Because it is trained only on normal transaction patterns, deviations from these learned patterns are immediately flagged during testing, so fraud can be treated as an anomaly.

2. Fault detection in industrial systems: Companies that rely on complex equipment can benefit from OCSVM's real-time monitoring for defects or anomalies. When applied to sensor data, OCSVM identifies abnormal behavior and flags potential faults. Early detection through OCSVM enables preventive maintenance, reduces downtime, and increases operational efficiency.

3. Network Intrusion Detection: OCSVM can play an important role in continuously


monitoring computer networks to protect against malicious activity. It helps identify
unusual network behaviors that may indicate a possible attack. OCSVM works well in
situations where most network traffic is normal and anomalies are very rare.

4. Quality Control in Manufacturing: Strict quality control is needed in


manufacturing to ensure fault-free products. OCSVM is applied on sensor data or
product characteristics to detect deviations from the perfect product. It helps to detect
defects early during production.

One-Class SVM Kernel Trick

 Linear Kernel: The linear kernel is the simplest form of a kernel and is equivalent to
performing a linear transformation. It is suitable when the relationship between the
features is approximately linear. The decision boundary in the higher-dimensional
space is a hyperplane.

 Polynomial Kernel: The polynomial kernel introduces non-linearity by considering


not just the dot product but also higher-order interactions between features. It is
characterized by a user-defined degree parameter (degree). A higher degree allows the model to capture more complex relationships, but at the same time it may increase the risk of overfitting.

 Sigmoid Kernel: The sigmoid kernel is particularly suitable for scenarios where the
data distribution is not well defined or exhibits sigmoidal patterns. It is often used in
neural network-inspired SVMs. The gamma and coef0 parameters govern the shape
and position of the decision boundary.

 Radial Basis Function (RBF) or Gaussian Kernel: The RBF kernel is versatile for
handling complex, non-linear relationships. It transforms data into a space where
intricate decision boundaries can be drawn. Well-suited when the exact form of
relationships is unknown or intricate.

 Precomputed Kernel: This kernel allows users to provide a precomputed kernel


matrix instead of the actual data. Useful when the kernel matrix is computed using a
custom kernel function or when using pairwise similarities between instances.

Hyper parameters Optimization methods

Hyperparameters are parameters that we set before training rather than learn from the data. Hyperparameters have a major impact on the accuracy and efficiency of the trained model, so they need to be set carefully to obtain good and efficient results.

Hyperparameters Optimization Technique

Exhaustive Search Methods

 Grid Search: In Grid Search, the possible values of hyperparameters are defined in
the set. Then these sets of possible values of hyperparameters are combined by using
Cartesian product and form a multidimensional grid. Then we try all the parameters in
the grid and select the hyperparameter setting with the best result.

 Random Search: This is another variant of Grid Search in which, instead of trying all the points in the grid, we try random points. This solves some of the problems of Grid Search, such as not having to expand the search space exponentially every time a new hyperparameter is added. A short sketch of both methods follows this list.
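A minimal sketch of both methods with scikit-learn (assuming it is installed; the model, search space, and cv value are illustrative):

# Grid Search tries every combination in the grid; Random Search samples a few points.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [2, 4, 8]}

grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)                                   # evaluates all 3 x 3 = 9 combinations
print("Grid Search best params:", grid.best_params_)

rand = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                          param_distributions=param_grid,
                          n_iter=5, cv=5, random_state=0)
rand.fit(X, y)                                   # evaluates only 5 random combinations
print("Random Search best params:", rand.best_params_)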

Drawback:

 Random Search and Grid Search are easy to implement and can run in parallel but
here are few drawbacks of these algorithm:

 If the hyperparameter search space is large, it takes a lot of time and computational
power to optimize the hyperparameter.

 There is no guarantee that these algorithms will find even a locally optimal configuration if the sampling is not done carefully.

Bayesian Optimization:

 Instead of guessing at random, Bayesian optimization uses the results of previous evaluations to choose the next hyperparameters to try. These results are used to build a probabilistic model that maps hyperparameters to a probability distribution over the score of the objective function:

 P(score (y) | hyperparameters (x))

1. This function is also called “surrogate” of objective function. It is much easier to


optimize than Objective function

2. Build a surrogate probability model of the objective function


3. Find the hyperparameters that perform best on the surrogate

4. Apply these hyperparameters to the original objective function

5. Update the surrogate model by using the new results

6. Repeat steps 2–5 until the chosen number of iterations is reached
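As an illustration of this loop, the following sketch uses the scikit-optimize package's gp_minimize, which maintains a Gaussian-process surrogate internally (assuming scikit-optimize and scikit-learn are installed; the search space, model, and n_calls value are illustrative):

# Bayesian optimization sketch: a surrogate model proposes hyperparameters to evaluate next.
from skopt import gp_minimize
from skopt.space import Integer, Real
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
space = [Integer(10, 200, name="n_estimators"),
         Real(0.1, 1.0, name="max_features")]

def objective(params):
    n_estimators, max_features = params
    clf = RandomForestClassifier(n_estimators=n_estimators,
                                 max_features=max_features, random_state=0)
    # gp_minimize minimizes, so return the negative cross-validated accuracy
    return -cross_val_score(clf, X, y, cv=3).mean()

result = gp_minimize(objective, space, n_calls=20, random_state=0)
print("Best accuracy:", -result.fun, "Best hyperparameters:", result.x)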

Sequential Model-Based Optimization:

Sequential Model-Based Optimization (SMBO) is a way of applying Bayesian optimization. Here "sequential" refers to running trials one after another, each time improving the hyperparameters by applying the Bayesian probability model (the surrogate).

There are 5 important components of SMBO:

1. A domain of hyperparameters over which to search.
2. An objective function which outputs a score that we want to optimize.
3. A surrogate model of the objective function.
4. A selection function to choose which hyperparameters to evaluate next; generally Expected Improvement is used.
5. A data structure containing the history of previous (score, hyperparameter) pairs used in earlier iterations.

There are many different versions of the SMBO hyperparameter optimization algorithm; the main difference between them is the surrogate function, for example Gaussian Processes, Random Forest regression, or the Tree Parzen Estimator. Below we discuss the Tree Parzen Estimator.

Tree Parzen Estimators:

Tree Parzen Estimators (TPE) use a tree structure to optimize hyperparameters. Many hyperparameters can be optimized with this method, such as the number of layers, the optimizer of the model, and the number of neurons in each layer. Instead of modelling P(y | x) directly, the Tree Parzen Estimator models P(x | y) and P(y), where y is an intermediate score that measures how good a hyperparameter setting is (such as the validation loss) and x is the hyperparameter.

In the first step of the Tree Parzen Estimator, we sample the validation loss by random search in order to initialize the algorithm. Then we divide the observations into two groups: the best performing one (e.g., the upper quartile) and the rest, taking y* as the splitting value for the two groups.

Then we model the probability of a hyperparameter falling in each of these groups:

P(x | y) = f(x) if y < y*, and P(x | y) = g(x) if y >= y*.

The two densities f and g are modelled using Parzen estimators (also known as kernel density estimators), which are a simple average of kernels centred on existing data points. P(y) is handled using the fact that p(y < y*) is simply the fraction that defines the percentile split between the two categories.

Using Bayes' rule (i.e., p(x, y) = p(y) P(x | y)), it can be shown that the expected improvement is equivalent to the ratio f(x)/g(x), so in this final step we try to maximize f(x)/g(x).

Drawback:
The biggest disadvantage of the Tree Parzen Estimator is that it selects hyperparameters independently of each other, which hurts efficiency and increases the computation required, because in most neural networks there are relationships between different hyperparameters.

Other Hyperparameter Estimation Algorithms:


Hyperband:
The underlying principle of this algorithm is that if a hyperparameter
configuration is destined to be the best after a large number of iterations, it is more
likely to perform in the top half of configurations after a small number of iterations.
Below is step-by-step implementation of Hyperband.

Randomly sample n hyperparameter sets from the search space. After k iterations, evaluate the validation loss of these hyperparameter sets.

Discard the lowest-performing half of the hyperparameter sets. Run the remaining ones for k more iterations, evaluate again, and discard the bottom half.

Repeat until only one hyperparameter set is left.


Drawbacks:
If the number of samples is large, some well-performing hyperparameter sets that need more time to converge may be discarded early in the optimization.

Population based Training (PBT):

Population Based Training (PBT) starts similarly to random search by training many models in parallel. But rather than the networks training independently, it uses information from the rest of the population to refine the hyperparameters and to direct computational resources to models which show promise. It takes its inspiration from genetic algorithms, where each member of the population, referred to as a worker, can exploit information from the rest of the population. For instance, a worker might copy the model parameters from a better-performing worker. It can also explore new hyperparameters by randomly perturbing the current values.

Bayesian Optimization and HyperBand (BOHB):


BOHB (Bayesian Optimization and HyperBand) is a combination of the Hyperband algorithm and Bayesian optimization. First, it uses the Hyperband capability to sample many configurations with a small budget, exploring the hyperparameter search space quickly and efficiently and obtaining promising configurations early. It then uses the predictive capability of the Bayesian optimizer to propose sets of hyperparameters that are close to the optimum. The algorithm can also be run in parallel (like Hyperband), which overcomes a strong drawback of Bayesian optimization.
