ML UNIT 4 Notes
BIAS
Bias in machine learning refers to the difference between a model’s predictions and the actual values it tries to predict. Models with high bias oversimplify the underlying data distribution (the true rule/function), resulting in high error on both the training data and the test data.
The level of bias in a model is heavily influenced by the quality and quantity of the training data. Using insufficient data will result in flawed predictions. At the same time, high bias can also result from the choice of an inappropriate (too simple) model.
Variance
The variability of model predictions for a given data point, which tells us the spread of our predictions, is called the variance of the model. A model with high variance fits the training data with a very complex function and is therefore not able to predict accurately on data it has not seen before. As a result, such models perform very well on training data but have high error rates on test data.
When a model has high variance, it is said to be overfitting the data. Overfitting means fitting the training set very accurately via a complex curve or high-order hypothesis, but this is not a good solution because the error on unseen data is high. While training a model, variance should therefore be kept low.
High variance: High variance means that the model is very sensitive to changes in the training data, so the estimate of the target function can change significantly when the model is trained on different subsets of data from the same distribution. This is the case of overfitting, where the model performs well on the training data but poorly on new, unseen test data: it fits the training data so closely that it fails to generalize to new data.
High Bias, Low Variance: A model with high bias and low variance is said to be
underfitting.
High Variance, Low Bias: A model with high variance and low bias is
said to be overfitting.
High Bias, High Variance: A model with both high bias and high variance is not able to capture the underlying patterns in the data (high bias) and is also too sensitive to changes in the training data (high variance). As a result, the model produces inconsistent and inaccurate predictions on average.
Low Bias, Low Variance: A model that has low bias and low variance means that
the model is able to capture the underlying patterns in the data (low bias) and is not
too sensitive to changes in the training data (low variance). This is the ideal scenario
for a machine learning model, as it is able to generalize well to new, unseen data and
produce consistent and accurate predictions. But in practice, it’s not possible.
Now we know that the ideal case is low bias and low variance, but in practice this is not achievable. So we trade off between bias and variance to reach a balanced model.
This tradeoff in complexity is why there is a tradeoff between bias and variance: an algorithm cannot be more complex and less complex at the same time. We try to optimize the total error of the model by using the bias-variance tradeoff.
The best fit is given by the hypothesis at the tradeoff point. [Figure: total error vs. model complexity, showing the region of the least value of total error.] This region marks the best point chosen for training the algorithm, giving low error on training as well as test data.
What is the difference between bias-variance decomposition and bias-variance
tradeoff?
Bias-variance decomposition and bias-variance tradeoff are closely related concepts.
Bias-variance decomposition is a mathematical technique that divides
the generalization error in a predictive model into two components: bias and variance.
In machine learning, as you try to minimize one component of the error (e.g., bias),
the other component (e.g., variance) tends to increase, and vice versa. Finding the
right balance of bias and variance is key to creating an effective and accurate model.
This is called the bias-variance tradeoff.
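For squared-error loss, the decomposition referred to above can be written as:

\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Irreducible error}}

so minimizing the total generalization error means balancing the bias and variance terms.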
Boosting
Boosting is an ensemble modeling technique designed to create a strong classifier by
combining multiple weak classifiers. The process involves building models sequentially,
where each new model aims to correct the errors made by the previous ones.
There are several boosting algorithms. The original ones, proposed by Robert Schapire and Yoav Freund, were not adaptive and could not take full advantage of the weak learners. Schapire and Freund then developed AdaBoost, an adaptive boosting algorithm that won the prestigious Gödel Prize. AdaBoost was the first really successful boosting algorithm
developed for the purpose of binary classification. AdaBoost is short for Adaptive Boosting
and is a very popular boosting technique that combines multiple “weak classifiers” into a
single “strong classifier”.
Algorithm:
1. Initialise the dataset and assign an equal weight to each data point.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weights of the wrongly classified data points and decrease the weights of the correctly classified data points. Then normalize the weights of all data points.
4. If the required results are obtained, go to step 5; otherwise, go to step 2.
5. End.
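As a concrete illustration of these steps, here is a minimal sketch using scikit-learn's AdaBoostClassifier, whose default base learner is a decision stump that gets re-weighted at every round. The dataset, parameter values, and variable names below are illustrative assumptions, not part of the notes.

# Minimal AdaBoost sketch (illustrative values, not from the notes).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Toy binary classification data (assumed for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# AdaBoost: the default weak learner is a decision stump; misclassified points
# get larger weights at every boosting round, as described in the steps above.
model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))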
Random Forest
Random Forest is a powerful tree-based learning technique in Machine Learning: it builds many decision trees to make predictions and then combines them by voting (or averaging) to produce the final prediction. Random forests are widely used for classification and regression tasks.
Imagine asking a group of friends for advice on where to go for vacation. Each friend
gives their recommendation based on their unique perspective and preferences (decision trees
trained on different subsets of data). You then make your final decision by considering the
majority opinion or averaging their suggestions (ensemble prediction).
The process starts with a dataset of rows (samples) and their corresponding class labels.
Multiple decision trees are then created from the training data. Each tree is trained on a random subset of the data (sampled with replacement) and a random subset of features. This process is known as bagging or bootstrap aggregating.
When presented with a new, unseen instance, each Decision Tree in the ensemble
makes a prediction.
The final prediction is made by combining the predictions of all the Decision Trees.
This is typically done through a majority vote (for classification) or averaging (for
regression).
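A minimal sketch of this bagging-plus-voting workflow with scikit-learn's RandomForestClassifier; the dataset and parameter values are assumed for illustration only.

# Random Forest: bootstrap sampling of data + random feature subsets + majority vote.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# n_estimators = number of trees; max_features controls the random feature subset per split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)            # each tree sees a bootstrap sample of the data

print("Test accuracy:", forest.score(X_test, y_test))  # final prediction = majority vote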
Handles Missing Data: Automatically handles missing values during training, eliminating the
need for manual imputation.
Scales Well with Large and Complex Data without significant performance
degradation.
Versatile: The algorithm can be applied to both classification tasks (e.g., predicting categories) and regression tasks (e.g., predicting continuous values).
Random Forest builds multiple decision trees using random samples of the data. Each
tree is trained on a different subset of the data which makes each tree unique.
When creating each tree the algorithm randomly selects a subset of features or
variables to split the data rather than using all available features at a time. This adds
diversity to the trees.
Each decision tree in the forest makes a prediction based on the data it was trained on. When making the final prediction, the random forest combines the results from all the trees.
For classification tasks the final prediction is decided by a majority vote. This means
that the category predicted by most trees is the final prediction.
For regression tasks the final prediction is the average of the predictions from all the
trees.
The randomness in data samples and feature selection helps to prevent the model from
overfitting making the predictions more accurate and reliable.
Random forests work well over a much larger range of data than a single decision tree does.
Random forests are very flexible and possess very high accuracy.
Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
Medicine: With the help of this algorithm, disease trends and risks of the disease can
be identified.
Machine learning algorithms have the notable ability to make predictions and decisions based on patterns in data. However, not all algorithms are created equal: some perform better on certain kinds of data, while others may struggle. AdaBoost, short for Adaptive Boosting, is a powerful ensemble learning algorithm that can boost the performance of weak learners and create a strong classifier. This section covers its principles, working mechanism, and practical applications.
Introduction to AdaBoost
AdaBoost is a boosting algorithm that was introduced by Yoav Freund and Robert Schapire in 1996. It belongs to the class of ensemble learning techniques that aim to improve the performance of machine learning models by combining the outputs of multiple weaker models, known as weak learners or base learners. The fundamental idea behind AdaBoost is to give greater weight to the training instances that are misclassified by the current model, thereby focusing on the samples that are hard to classify.
The algorithm proceeds as follows:
1. Weight Initialization
At the start, every training instance is assigned an equal weight. These weights determine the importance of each example in the learning process.
2. Model Training
A weak learner is trained on the dataset with the aim of minimizing classification error. A weak learner is usually a simple model, such as a decision stump (a one-level decision tree) or a small neural network.
3. Weighted Error Calculation
After the weak learner is trained, it is used to make predictions on the training dataset. The weighted error is then calculated by summing up the weights of the misclassified instances. This step emphasizes the importance of the samples that are hard to classify.
4. Learner Weight Calculation
Based on its weighted error, the weak learner itself is assigned a weight (importance); learners with lower error receive higher weights.
5. Instance Weight Update
The instance weights are updated to give more weight to the misclassified samples from the previous step. This adjustment focuses the learning process on the instances that the current model struggles with.
6. Repeat
Steps 2 through 5 are repeated for a predefined number of iterations or until a specified performance threshold is met.
7. Final Model
The final strong model (also referred to as the ensemble) is created by combining the weighted outputs of all weak learners. Typically, the models with higher weights have a greater influence on the final decision.
8. Classification
To make predictions on new records, AdaBoost uses the final ensemble model. Each weak learner contributes its prediction, weighted by its importance, and the combined result is used to classify the input.
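For reference, the standard AdaBoost update rules behind steps 3 to 5, for labels $y_i \in \{-1, +1\}$, weak learner $h_t$, and normalized instance weights $w_i$, are:

\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} w_i, \qquad
\alpha_t = \tfrac{1}{2}\ln\!\frac{1 - \epsilon_t}{\epsilon_t}, \qquad
w_i \leftarrow \frac{w_i\, e^{-\alpha_t\, y_i\, h_t(x_i)}}{Z_t}, \qquad
H(x) = \operatorname{sign}\!\Big(\sum_t \alpha_t\, h_t(x)\Big)

where $Z_t$ is a normalization constant so that the weights sum to 1.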
To gain a deeper understanding of AdaBoost, it is important to be familiar with some key concepts related to the algorithm:
1. Weak Learners
Weak learners are the individual models that make up the ensemble. These are typically models with accuracy only slightly better than random guessing. In the context of AdaBoost, weak learners are trained sequentially, with each new model focusing on the instances that previous models found difficult to classify.
2. Strong Classifier
The strong classifier, also known as the ensemble, is the final model created by combining the predictions of all weak learners. It has the collective knowledge of all the models and is capable of making accurate predictions.
3. Weighted Voting
In AdaBoost, every weak learner contributes to the final prediction with a weight based on its performance. This weighted voting system ensures that the more accurate models have a greater say in the final decision.
4. Error Rate
The error rate is a measure of how a weak learner performs on the training data. It is used to calculate the weight assigned to each weak learner. Models with lower error rates are given higher weights.
5. Iterations
The number of boosting rounds, i.e., how many weak learners are trained sequentially before the ensemble is finalized.
Advantages of AdaBoost
1. Improved Accuracy
By combining many weak learners, AdaBoost usually achieves much higher accuracy than any single base model.
2. Versatility
AdaBoost can be used with a variety of base learners, making it a flexible algorithm that can be applied to different kinds of problems.
3. Feature Selection
It implicitly selects the most informative features, reducing the need for extensive feature engineering.
4. Resistance to Overfitting
With simple base learners and a sensible number of iterations, AdaBoost tends to resist overfitting in practice.
XGBoost
XGBoost (Extreme Gradient Boosting) has built-in parallel processing to train models on large datasets quickly. XGBoost also supports customization, allowing users to adjust model parameters to optimize performance for the specific problem.
It builds decision trees sequentially with each tree attempting to correct the mistakes
made by the previous one. The process can be broken down as follows:
1. Start with a base learner: The first model (a decision tree) is trained on the data. In regression tasks this base model simply predicts the average of the target variable.
2. Calculate the errors: After training the first tree the errors between the predicted and
actual values are calculated.
3. Train the next tree: The next tree is trained on the errors of the previous tree. This step
attempts to correct the errors made by the first tree.
4. Repeat the process: This process continues with each new tree trying to correct the
errors of the previous trees until a stopping criterion is met.
5. Combine the predictions: The final prediction is the sum of the predictions from all
the trees.
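A minimal sketch of this sequential tree-building using the xgboost library for a regression task; the dataset and hyperparameter values below are illustrative assumptions.

# Gradient-boosted trees with XGBoost: each new tree corrects the errors of the ensemble so far.
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

model = xgb.XGBRegressor(
    n_estimators=200,     # number of sequential trees
    learning_rate=0.1,    # shrinks each tree's correction
    max_depth=4,          # depth of each individual tree
)
model.fit(X_train, y_train)

print("Test MSE:", mean_squared_error(y_test, model.predict(X_test)))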
Advantages of XGboost
XGBoost is highly scalable and efficient, as it is designed to handle large datasets with millions or even billions of instances and features.
XGBoost offers built-in feature importance analysis, which helps identify the most
influential features in the dataset. This information can be valuable for feature
selection, dimensionality reduction, and gaining insights into the underlying data
patterns.
XGBoost has not only demonstrated exceptional performance but has also become a
go-to tool for data scientists and machine learning practitioners across various
languages. It has consistently outperformed other algorithms in Kaggle competitions,
showcasing its effectiveness in producing high-quality predictive models.
Disadvantages of XGBoost
XGBoost can be computationally intensive especially when training complex models
making it less suitable for resource-constrained systems.
Despite its robustness, XGBoost can still be sensitive to noisy data or outliers.
While feature importance scores are available, the overall model can be challenging to
interpret compared to simpler methods like linear regression or decision trees. This
lack of transparency may be a drawback in fields like healthcare or finance where
interpretability is critical.
XGBoost is a powerful and flexible tool that works well for many machine learning
tasks. Its ability to handle large datasets and deliver high accuracy makes it useful.
Metrics & Error Correction in Ensemble Learning
Ensemble learning combines multiple models to improve performance, reduce variance,
and enhance generalization. To assess and refine ensemble models, we rely on metrics
and error correction techniques.
A) Bias & Variance Errors
Bias Error:
Occurs when the model is too simple and fails to capture the complexity of the data.
Leads to underfitting, where the model performs poorly on both training and test data.
Variance Error:
Occurs when the model learns noise instead of patterns, making it highly sensitive to
training data.
Leads to overfitting, where the model performs well on training data but poorly on
new data.
B) Classification Errors
Misclassifications such as false positives and false negatives, usually summarized with metrics like accuracy, precision, recall, and F1-score.
C) Regression Errors
Mean Squared Error (MSE): Measures the average squared difference between
actual and predicted values.
Mean Absolute Error (MAE): Measures the average absolute difference between
actual and predicted values.
Root Mean Squared Error (RMSE): Square root of MSE, providing an
interpretable metric for error magnitude.
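For reference, with actual values $y_i$, predictions $\hat{y}_i$, and $n$ samples, these metrics are:

\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert, \qquad
\text{RMSE} = \sqrt{\text{MSE}}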
D) Data-Related Errors
Missing Data: Gaps in dataset values can lead to biased predictions.
Outliers: Extreme values can distort model training and influence predictions.
Error Correction Techniques
✔Hyperparameter Tuning:
Use Grid Search, Random Search, or Bayesian Optimization to find optimal
parameters.
Adjust parameters such as learning rate, number of layers, and regularization strength.
✔Regularization Techniques:
Apply L1 (Lasso) and L2 (Ridge) regularization to reduce overfitting.
✔Cross-Validation:
Use k-fold cross-validation to improve generalization.
✔Handling Class Imbalance:
Use SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic
samples.
Use class weighting to assign higher importance to the minority class.
C) Error Analysis & Correction
✔Confusion Matrix Analysis:
Evaluate False Positives (FP), False Negatives (FN), True Positives (TP), and True
Negatives (TN).
Optimize for precision, recall, and F1-score based on the use case.
✔Resampling Techniques:
Undersampling: Reduces majority class instances to balance the dataset.
Oversampling: Increases minority class samples to prevent bias.
✔Anomaly Detection:
Use algorithms like One-Class SVM, Isolation Forest, or Local Outlier Factor (LOF).
✔Boosting Techniques:
Use ensemble methods like AdaBoost, Gradient Boosting, or XGBoost to iteratively
reduce errors.
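To illustrate the last two points together, here is a minimal sketch that trains a boosting model and inspects its confusion matrix and per-class metrics with scikit-learn; the data and settings are assumed for illustration.

# Evaluate a boosting model via its confusion matrix and precision/recall/F1.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Imbalanced toy data: roughly 90% class 0, 10% class 1 (assumed).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

clf = GradientBoostingClassifier(n_estimators=100, random_state=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))        # rows: actual, columns: predicted (TN/FP and FN/TP)
print(classification_report(y_test, y_pred))   # precision, recall, F1 per class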
Class imbalance
Class imbalance is a machine learning issue that occurs when there is an uneven
distribution of data across different classes. This can lead to biased models that
misclassify the minority class.
Causes
An irregular distribution of data between classes
Solutions
SMOTE
When a dataset is imbalanced, standard classifiers tend to be dominated by the majority class and overlook the minority classes. SMOTE mitigates this issue by generating synthetic samples for the minority class, thereby balancing the dataset and improving model performance.
Types of SMOTE
b) Borderline-SMOTE
Focuses on generating synthetic samples near the decision boundary between classes.
c) SMOTE-Tomek Links
Combines SMOTE with Tomek links to remove overlapping samples, improving data
quality.
SMOTE is especially useful when the minority class is rare but critical. However, it's essential to assess its impact on your specific dataset and consider combining it with other techniques, such as ensemble methods or cost-sensitive learning, to achieve optimal results.
Advantages of SMOTE
Reduces Overfitting: By generating new synthetic samples rather than duplicating
existing ones, SMOTE helps prevent models from overfitting to specific samples.
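A minimal SMOTE sketch using the imbalanced-learn library; the class ratio and settings are assumptions for illustration.

# Oversample the minority class with SMOTE before training (illustrative sketch).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Imbalanced toy dataset (assumed): roughly 95% majority, 5% minority.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=3)
print("Before:", Counter(y))

smote = SMOTE(random_state=3)          # generates synthetic minority samples by interpolation
X_res, y_res = smote.fit_resample(X, y)
print("After: ", Counter(y_res))       # classes are now balanced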
One-Class SVM (OCSVM)
Margin Maximization: This algorithm strives to maximize the margin around the normal instances, allowing for a more robust separation between normal and anomalous data points. This margin is crucial for accurately identifying outliers during testing.
Conceptual Foundation: OCSVM establishes itself on the premise that the majority of
real-world data is inherently normal. In most scenarios, outliers or anomalies are rare
occurrences that deviate significantly from the usual patterns. OCSVM's goal is to define
a boundary encapsulating the normal instances in the feature space, thereby creating a
region of familiarity.
Outlier Boundary Definition: The algorithm crafts a boundary around the normal
instances, often referred to as the "normalcy region." This boundary is strategically
positioned to maximize the margin around the normal data points, allowing for a clear
delineation between what is considered ordinary and what may be deemed unusual. Think
of it as drawing a protective circle around the typical instances to shield them from the
outliers or anomalies.
Margin Maximization: The heart of OCSVM lies in its commitment to maximizing the
margin between the normal instances and the boundary. A larger margin provides a robust
separation, enhancing the model's ability to discern anomalies during testing. This
emphasis on margin maximization is akin to creating a safety buffer around the normal
instances, fortifying the model against the influence of potential outliers or anomalies.
Training Process: During the training phase, OCSVM exclusively leverages the majority
class or normal instances. This unimodal focus distinguishes it from traditional SVMs,
which necessitate examples from both classes for effective training. By concentrating
solely on the norm, OCSVM tailors itself to scenarios where anomalies are sparse, and
labeled instances of anomalies are hard to come by. It comes with a
fantastic hyperparameter called 'nu'. This parameter acts as an upper bound on the fraction
of margin errors and support vectors allowed by the model. Tuning the nu parameter
enables practitioners to strike a balance between the desire for a stringent model that
minimizes false positives (normal instances misclassified as anomalies) and a more
lenient model that embraces a higher fraction of anomalies.
Testing and Anomaly Identification: Armed with the learned normalcy region, OCSVM
can swiftly identify anomalies during testing. Instances falling outside the defined
boundary are flagged as potential outliers. The model essentially acts as a vigilant
guardian, scrutinizing new data points and signaling if they exhibit behavior significantly
different from the norm.
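A minimal sketch of this train-on-normal-only workflow with scikit-learn's OneClassSVM; the data and parameter values are assumed for illustration.

# One-Class SVM: fit on normal data only, then flag points outside the learned boundary.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_normal = 0.3 * rng.randn(200, 2)               # training data: normal instances only (assumed)
X_test = np.r_[0.3 * rng.randn(20, 2),           # some normal-looking test points
               rng.uniform(low=-4, high=4, size=(20, 2))]  # some obvious outliers

# nu bounds the fraction of margin errors / support vectors; gamma shapes the RBF boundary.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
ocsvm.fit(X_normal)

pred = ocsvm.predict(X_test)                     # +1 = inside the normalcy region, -1 = anomaly
print("Flagged anomalies:", int((pred == -1).sum()), "of", len(X_test))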
OCSVM (like SVMs in general) can use different kernel functions to shape the decision boundary:
Linear Kernel: The linear kernel is the simplest form of kernel and is equivalent to performing a linear transformation. It is suitable when the relationship between the features is approximately linear. The decision boundary in the higher-dimensional space is a hyperplane.
Sigmoid Kernel: The sigmoid kernel is particularly suitable for scenarios where the
data distribution is not well defined or exhibits sigmoidal patterns. It is often used in
neural network-inspired SVMs. The gamma and coef0 parameters govern the shape
and position of the decision boundary.
Radial Basis Function (RBF) or Gaussian Kernel: The RBF kernel is versatile for
handling complex, non-linear relationships. It transforms data into a space where
intricate decision boundaries can be drawn. Well-suited when the exact form of
relationships is unknown or intricate.
Grid Search: In Grid Search, the possible values of each hyperparameter are defined as a set. These sets of possible values are combined using the Cartesian product to form a multidimensional grid. We then try all the parameter combinations in the grid and select the hyperparameter setting with the best result.
Random Search: This is a variant of Grid Search in which, instead of trying all the points in the grid, we try random points. This solves a couple of the problems of Grid Search, such as the need to expand the search space exponentially every time a new hyperparameter is added.
Drawback:
Random Search and Grid Search are easy to implement and can run in parallel, but there are a few drawbacks of these algorithms:
If the hyperparameter search space is large, it takes a lot of time and computational power to optimize the hyperparameters.
There is no guarantee that these algorithms will find a good optimum if the sampling is not done carefully.
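A minimal sketch of both approaches with scikit-learn; the model, grid values, and iteration counts are illustrative assumptions.

# Grid Search vs Random Search over a small hyperparameter space (illustrative sketch).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0)
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

# Grid Search: tries every combination in the Cartesian product of the grid.
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X, y)
print("Grid Search best:", grid.best_params_)

# Random Search: samples a fixed number of random points from the same space.
rand = RandomizedSearchCV(model, param_grid, n_iter=5, cv=5, random_state=0)
rand.fit(X, y)
print("Random Search best:", rand.best_params_)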
Bayesian Optimization:
Instead of modelling P(score(y) | hyperparameters(x)) directly, the Tree Parzen Estimator (TPE) models P(hyperparameters(x) | score(y)) and P(score(y)).
The two densities f and g are modelled using Parzen estimators (also known as kernel density estimators), which are a simple average of kernels centred on existing data points:

f(x) = p(x \mid y < y^{*}), \qquad g(x) = p(x \mid y \geq y^{*})

where y* is a threshold score. P(y) is handled through the quantile \gamma = p(y < y^{*}), which defines the percentile split between the two categories.
Using Bayes' rule (i.e. p(x, y) = p(y)\,p(x \mid y) = p(x)\,p(y \mid x)), it can be shown that the expected improvement is equivalent, up to a constant, to the ratio f(x)/g(x). In this final step we therefore try to maximize \frac{f(x)}{g(x)}.
Drawback:
The biggest disadvantage of the Tree Parzen Estimator is that it selects hyperparameters independently of each other, which hurts efficiency and increases the computation required, because in most neural networks there are relationships between the different hyperparameters.
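As an illustration, here is a minimal Tree Parzen Estimator run using the hyperopt library; the objective function and search space are toy assumptions standing in for a real cross-validated model score.

# Bayesian optimization with TPE via hyperopt (toy objective, illustrative sketch).
from hyperopt import fmin, tpe, hp

# Toy objective: minimise (x - 3)^2; in practice this would be a validation loss of a model.
def objective(x):
    return (x - 3.0) ** 2

space = hp.uniform("x", -10, 10)    # search space for the single hyperparameter x

# TPE repeatedly proposes points that maximize f(x)/g(x) as described above.
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100)
print("Best hyperparameter found:", best)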