Module 3 Modified

Presenter’s Name

Dr. Shital Bhatt


Associate Professor
School of Computational and Data Sciences

www.vidyashilpuniversity.com
Bias and Variance in Machine Learning
 There are various ways to evaluate a machine learning model. We can use MSE (Mean Squared Error) and Absolute Error for regression, and Precision, Recall, and the ROC (Receiver Operating Characteristic) curve for classification. In a similar way, bias and variance help us in parameter tuning and in deciding the better-fitted model among several we build.
 Bias is one type of error that occurs due to wrong assumptions about the data, such as assuming the data is linear when in reality it follows a complex function. Variance, on the other hand, is introduced by high sensitivity to variations in the training data; it too is a type of error, since we want our model to be robust against noise. There are two types of error in machine learning, reducible and irreducible, and bias and variance make up the reducible error.
What is Bias?
 Bias is the inability of a model to represent the data accurately, because of which some difference or error occurs between the model's predicted value and the actual value. These differences between actual (or expected) values and predicted values are known as bias error, or error due to bias. Bias is a systematic error that occurs due to wrong assumptions in the machine learning process.
 Let Y be the true value of a parameter, and let Y' be an estimator of Y based on a sample of data. Then the bias of the estimator Y' is given by:
 Bias(Y') = E(Y') − Y
 where E(Y') is the expected value of the estimator Y'. Bias measures how well the model fits the data.
• Low Bias: A low bias value means fewer assumptions are made about the form of the target function. In this case, the model will closely match the training dataset.
• High Bias: A high bias value means more assumptions are made about the target function. In this case, the model will not match the training dataset closely.
 A high-bias model is unable to capture the trend of the dataset. It is considered an underfitting model and has a high error rate, caused by an overly simplified algorithm.
 For example, a linear regression model may have a high bias if the data
has a non-linear relationship.
Ways to reduce high bias in Machine Learning:
• Use a more complex model: One of the main causes of high bias is an overly simplified model that cannot capture the complexity of the data. In such cases, we can make our model more complex, for example by increasing the number of hidden layers in a deep neural network, or by using a more expressive model such as polynomial regression for non-linear datasets, a CNN for image processing, or an RNN for sequence learning.
• Increase the number of features: Adding more features to the training dataset increases the complexity of the model and improves its ability to capture the underlying patterns in the data.
• Reduce regularization of the model: Regularization techniques such as L1 or L2 regularization help prevent overfitting and improve generalization, but if the model has high bias, reducing the strength of regularization, or removing it altogether, can improve performance.
• Increase the size of the training data: Increasing the size of the training data can help reduce bias by providing the model with more examples to learn from.
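 As a minimal sketch of the first remedy (synthetic data; numpy and scikit-learn assumed available): a straight line underfits quadratic data, while adding polynomial features removes the bias.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.5, 200)   # true relationship is quadratic

linear = LinearRegression().fit(X, y)          # too simple: high bias, underfits
quadratic = make_pipeline(PolynomialFeatures(degree=2),
                          LinearRegression()).fit(X, y)

print("linear    MSE:", mean_squared_error(y, linear.predict(X)))     # large
print("quadratic MSE:", mean_squared_error(y, quadratic.predict(X)))  # small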
What is Variance?
 Variance is the measure of spread in data from its mean position. In machine learning, variance is the amount by which the performance of a predictive model changes when it is trained on different subsets of the training data. More specifically, it is the variability of the model: how sensitive it is to a particular subset of the training dataset, i.e., how much its fit adjusts on a new subset of the training data.
 Let Y be the actual values of the target variable, and Y' the predicted values. The variance of a model can be measured as the expected value of the squared difference between the predicted values and the expected value of the predictions:
 Variance = E[(Y' − E[Y'])²]
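 A minimal sketch that estimates this quantity empirically (synthetic data, illustrative model choice): retrain the same model on bootstrap resamples and measure the spread of its predictions at a fixed point.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 200)

x0 = [[1.0]]                                   # fixed query point
preds = []
for _ in range(100):                           # 100 bootstrap resamples
    idx = rng.integers(0, len(X), len(X))
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])  # unpruned tree: high variance
    preds.append(tree.predict(x0)[0])

print("estimated Variance = E[(Y' - E[Y'])^2] at x0:", np.var(preds))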
 Variance errors are classified as either low variance or high variance.
• Low variance: Low variance means that the model is less sensitive to
changes in the training data and can produce consistent estimates of
the target function with different subsets of data from the same
distribution. This is the case of underfitting when the model fails to
generalize on both training and test data.
• High variance: High variance means that the model is very sensitive to
changes in the training data and can result in significant changes in the
estimate of the target function when trained on different subsets of data
from the same distribution. This is the case of overfitting when the
model performs well on the training data but poorly on new, unseen test
data: the model fits the training data so closely that it fails to generalize to new data.
Ways to Reduce Variance in Machine Learning:

• Cross-validation: By splitting the data into training and testing sets multiple times, cross-
validation can help identify if a model is overfitting or underfitting and can be used to tune
hyperparameters to reduce variance.
• Feature selection: Choosing only the relevant features decreases the model's complexity and can reduce the variance error.
• Regularization: We can use L1 or L2 regularization to reduce variance in machine learning models.
• Ensemble methods: These combine multiple models to improve generalization performance. Bagging, boosting, and stacking are common ensemble methods that can help reduce variance.
• Simplifying the model: Reducing the complexity of the model, such as decreasing the
number of parameters or layers in a neural network, can also help reduce variance and improve
generalization performance.
• Early stopping: Early stopping is a technique used to prevent overfitting by stopping the
training of the deep learning model when the performance on the validation set stops
improving.
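 A hedged sketch combining two of the remedies above, cross-validation and L2 regularization (synthetic data; alpha=1.0 is an arbitrary illustrative choice): Ridge's penalty on large coefficients stabilizes an otherwise high-variance degree-15 polynomial fit.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (60, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 60)

unregularized = make_pipeline(PolynomialFeatures(15), LinearRegression())
regularized = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1.0))

for name, model in [("degree-15 OLS  ", unregularized),
                    ("degree-15 Ridge", regularized)]:
    # cross-validation exposes the variance of the unregularized fit
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(name, "cross-validated MSE:", round(mse, 3))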
Different Combinations of Bias-Variance
• High Bias, Low Variance: A model with high bias and low variance is
said to be underfitting.
• High Variance, Low Bias: A model with high variance and low bias is
said to be overfitting.
• High-Bias, High-Variance: A model has both high bias and high
variance, which means that the model is not able to capture the
underlying patterns in the data (high bias) and is also too sensitive to
changes in the training data (high variance). As a result, the model will
produce inconsistent and inaccurate predictions on average.
• Low Bias, Low Variance: A model that has low bias and low variance
means that the model is able to capture the underlying patterns in the
data (low bias) and is not too sensitive to changes in the training data
(low variance). This is the ideal scenario for a machine learning model,
as it is able to generalize well to new, unseen data and produce consistent and accurate predictions. In practice, however, this ideal is rarely fully achievable.
 Since the ideal case of low bias and low variance is rarely attainable in practice, we trade off between bias and variance to achieve a balance between the two.
 A model with balanced bias and variance is said to have optimal
generalization performance. This means that the model is able to
capture the underlying patterns in the data without overfitting or
underfitting. The model is likely to be just complex enough to capture
the complexity of the data, but not too complex to overfit the training
data. This can happen when the model has been carefully tuned to
achieve a good balance between bias and variance, by adjusting the
hyperparameters and selecting an appropriate model architecture.
Bias Variance Tradeoff
 If the algorithm is too simple (e.g., a hypothesis with a linear equation), it tends toward high bias and low variance and is thus error-prone. If the algorithm is too complex (e.g., a hypothesis with a high-degree equation), it tends toward high variance and low bias, and new inputs will not be predicted well. The sweet spot between these two conditions is known as the Bias-Variance Trade-off. This trade-off in complexity is why an algorithm cannot be more complex and less complex at the same time. Plotted against model complexity, the total error is minimized at the point where the falling bias curve and the rising variance curve balance each other.
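 A minimal sketch of that picture (synthetic data): as polynomial degree grows, the cross-validated error first falls (bias shrinking) and then rises again (variance growing).

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (80, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 80)

for degree in [1, 2, 3, 5, 9, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree {degree:2d}: cross-validated MSE = {mse:.3f}")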
What is the difference between parameter and hyperparameter?

• Model parameters: These are the parameters that are estimated by the model from the given data, for example the weights of a deep neural network.
• Model hyperparameters: These are the parameters that cannot be
estimated by the model from the given data. These parameters are used
to estimate the model parameters. For example, the learning rate in
deep neural networks.
Model Parameters:
 Model parameters are configuration variables that are internal to the model, and the model learns them on its own: for example, the weights or coefficients of the independent variables in a linear regression model or an SVM, the weights and biases of a neural network, or the cluster centroids in clustering. Some key points about model parameters are as follows:
• They are used by the model for making predictions.
• They are learned by the model from the data itself.
• They are usually not set manually.
• They are part of the model and key to a machine learning algorithm.
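 A minimal scikit-learn sketch of the first two points: the learned parameters appear on the fitted model as trailing-underscore attributes (the toy data is generated from y = 2x + 1).

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])            # generated from y = 2x + 1

model = LinearRegression().fit(X, y)
print("learned coefficient (weight):", model.coef_)    # approximately [2.]
print("learned intercept (bias):", model.intercept_)   # approximately 1.0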
Model Hyperparameters:

 Hyperparameters are those parameters that are explicitly defined by the user to control the learning process. Some key points about model hyperparameters are as follows:
• These are usually defined manually by the machine learning engineer.
• One cannot know the exact best value for hyperparameters for the
given problem. The best value can be determined either by the rule of
thumb or by trial and error.
• Some examples of hyperparameters are the learning rate for training a neural network and K in the KNN algorithm.
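 By contrast with learned parameters, a hyperparameter such as k is supplied by the user before training ever starts; a minimal scikit-learn sketch:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)   # k is set by us, never learned
print(knn.get_params()["n_neighbors"])      # 5: fixed before fit() is called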
What is hyperparameter tuning and why is it important?

 Hyperparameter tuning (or hyperparameter optimization) is the process of determining the right combination of hyperparameters that maximizes model performance. It works by running multiple trials in a single training process; each trial is a complete execution of your training application with values for your chosen hyperparameters, set within the limits you specify. Once finished, this process gives you the set of hyperparameter values best suited for the model to produce optimal results.
 Hyperparameter tuning is the process of selecting the optimal values for a
machine learning model’s hyperparameters. Hyperparameters are settings
that control the learning process of the model, such as the learning rate,
the number of neurons in a neural network, or the kernel size in a support
vector machine. The goal of hyperparameter tuning is to find the values
that lead to the best performance on a given task.
 Manual hyperparameter tuning
 Manual hyperparameter tuning involves experimenting with different sets of hyperparameters by hand, i.e., each trial with a set of hyperparameters is performed by you.
 Advantages of manual hyperparameter optimization:
• Tuning hyperparameters manually means more control over the process.
• If you’re researching or studying tuning and how it affects the network
weights then doing it manually would make sense.
 Disadvantages of manual hyperparameter optimization:
• Manual tuning is a tedious process since there can be many trials and
keeping track can prove costly and time-consuming.
• This isn’t a very practical approach when there are a lot of
hyperparameters to consider.
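 Despite these drawbacks, a minimal manual-tuning loop is easy to write (a sketch using scikit-learn's built-in iris dataset; the candidate values of k are arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_score = None, 0.0
for k in [1, 3, 5, 7, 9]:                     # each trial is run "by hand"
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                            X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print("best k:", best_k, "with accuracy:", round(best_score, 3))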
 Automated hyperparameter tuning
 Automated hyperparameter tuning utilizes already existing algorithms to
automate the process. The steps you follow are:
• First, specify a set of hyperparameters and limits to those hyperparameters’
values (note: every algorithm requires this set to be a specific data structure,
e.g. dictionaries are common while working with algorithms).
• Then the algorithm does the heavy lifting for you. It runs those trials and
fetches you the best set of hyperparameters that will give optimal results.
 Hyperparameters commonly tuned in either way include:
• The k in the kNN (K-Nearest Neighbour) algorithm
• Learning rate for training a neural network
• Train-test split ratio
• Batch Size
• Number of Epochs
• Branches in Decision Tree
• Number of clusters in Clustering Algorithm
Types of Hyperparameters

 Model-Specific Hyperparameters: These control the structure or complexity of the model.
 Example: the depth of a decision tree; the number of neurons in a neural network layer.
 Algorithm-Specific Hyperparameters: These control how the algorithm optimizes the model.
 Example: the learning rate in gradient descent; the number of iterations for training.
Examples of Common Hyperparameters

 1. Learning Rate (`alpha`):
 - Determines the step size at each iteration while moving toward a minimum of the loss function.
 - Small learning rates converge slowly, while large learning rates may cause the model to overshoot the minimum.
 2. Number of Epochs:
 - The number of times the learning algorithm works through the entire training dataset.
 - Too few epochs can lead to underfitting, while too many can lead to overfitting.
 3. Batch Size:
 - Number of training samples used to compute the gradient during optimization.
 - Smaller batches allow for more frequent updates, while larger batches provide a more accurate estimate of the gradient.
 4. Regularization Parameter (`lambda` or `alpha`):
 - Penalizes large weights in models to avoid overfitting. Common regularization techniques include:
   - Lasso (L1 regularization).
   - Ridge (L2 regularization).
 5. Number of Hidden Layers/Neurons in Neural Networks:
 - Controls the depth and capacity of the network.
 - More layers can learn more complex features but may lead to overfitting.
 6. Max Depth for Decision Trees:
 - Limits the number of splits in decision trees.
 - A deeper tree can model more complex patterns but may lead to overfitting.
 7. Dropout Rate (Neural Networks):
 - Used to randomly drop units (along with their connections) during training to prevent overfitting.
 8. Number of Neighbors in K-Nearest Neighbors (KNN):
 - Controls how many neighbors are considered when predicting a class.

Hyperparameter Tuning

 1. Grid Search:
 - Try every possible combination of hyperparameter values from a predefined set. This can be computationally expensive, but it guarantees that the best combination within the search space will be found.
 2. Random Search:
 - Instead of searching all possible combinations, randomly sample the hyperparameter space a fixed number of times.
 3. Bayesian Optimization:
 - Tries to intelligently explore the hyperparameter space by predicting performance based on previous trials. It can be more efficient than grid search and random search.
 4. Gradient-Based Optimization:
 - Methods like gradient descent or its variants can also be applied to hyperparameter tuning by treating the hyperparameter space as a continuous function.
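 A hedged sketch of the first two methods using scikit-learn (the parameter grid below is an illustrative assumption): GridSearchCV tries all nine combinations, while RandomizedSearchCV samples only four of them.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [2, 4, None]}

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=5).fit(X, y)          # tries all 9 combos
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, n_iter=4, cv=5,
                          random_state=0).fit(X, y)      # samples 4 combos

print("grid search best:", grid.best_params_)
print("random search best:", rand.best_params_)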
Ensemble Methods
 Ensemble methods in machine learning are techniques that combine
predictions from multiple models to improve the performance,
accuracy, and robustness of predictive outcomes. They are widely
used because they help reduce issues like overfitting, enhance
prediction accuracy, and often provide more stable and
generalized results compared to single models. Here's a deep dive into the most common ensemble techniques: Bagging, Boosting, and Stacking, with a short illustrative sketch of each approach.
Types of Ensemble Methods

 The main types of ensemble methods are:
• Bagging (Bootstrap Aggregating): Creates multiple subsets of the
dataset, trains individual models on these subsets, and combines their
predictions.
• Boosting: Sequentially builds models that focus on correcting the errors
of previous models.
• Stacking: Combines different types of models, often in layers, to create
a more powerful prediction model.
 Bagging with Random Forest
 Bagging reduces variance by training multiple versions of a model on
random subsets of the data. A commonly used bagging method is the
Random Forest, which combines predictions from multiple decision trees,
reducing overfitting while maintaining flexibility.
• Random Forest trains multiple decision trees on random subsets of the data
and combines their predictions, reducing variance and improving accuracy.
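 A minimal Random Forest sketch (scikit-learn's built-in breast-cancer dataset is used purely for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample; predictions are combined
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))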
 Boosting with AdaBoost
 Boosting sequentially builds models that focus on the errors of previous
models. Each model tries to correct the mistakes of the previous one,
gradually improving the performance. AdaBoost (Adaptive Boosting) is a
popular boosting algorithm.
• In AdaBoost, a weak learner (a simple model, like a shallow decision tree) is
used in multiple iterations.
• Each new model is trained to give more weight to the misclassified
samples, focusing on harder cases.
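 A minimal AdaBoost sketch; by default, scikit-learn's AdaBoostClassifier uses depth-1 decision trees ("stumps") as the weak learner:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# each of the 50 rounds reweights the data toward previously misclassified samples
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print("test accuracy:", ada.score(X_test, y_test))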
 Gradient Boosting with XGBoost
 Gradient Boosting minimizes the error by training each new model on
the residuals (errors) of the previous model. XGBoost (Extreme Gradient
Boosting) is a powerful library optimized for speed and performance.
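 A hedged sketch using scikit-learn's built-in GradientBoostingClassifier instead of the xgboost library (xgboost exposes a similar XGBClassifier interface when installed):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# each new tree is fitted to the residual errors of the current ensemble
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                random_state=0).fit(X_train, y_train)
print("test accuracy:", gb.score(X_test, y_test))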
 Stacking with Multiple Classifiers
 Stacking combines predictions from different types of models. Unlike
bagging or boosting, stacking uses a meta-learner to make the final
predictions based on the outputs of the individual base models.
• Each base model learns independently on the data, providing diverse
perspectives.
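 A minimal stacking sketch: two diverse base models feed a logistic-regression meta-learner (the model choices are illustrative).

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svc", SVC(random_state=0))],       # diverse base models
    final_estimator=LogisticRegression(max_iter=1000),  # the meta-learner
).fit(X_train, y_train)
print("test accuracy:", stack.score(X_test, y_test))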
Performance Metrics for Regression
 1. Mean Absolute Error (MAE)
 The Mean Absolute Error is the average of the absolute differences between the
predicted and actual values. It gives an idea of the average error in the
predictions.
• Lower MAE indicates a more accurate model.
• MAE is less sensitive to outliers than squared-error metrics, since it weights all errors linearly rather than squaring them.

 2. Mean Squared Error (MSE)
 Mean Squared Error squares the differences between actual and predicted
values. Squaring penalizes larger errors more, making MSE sensitive to outliers.
• Lower MSE indicates better model performance.
• Because of the squaring effect, MSE emphasizes large errors more than MAE.
 3. Root Mean Squared Error (RMSE)
 RMSE is simply the square root of MSE. It maintains the same unit as the
target variable, making interpretation easier.

• Lower RMSE is better.
 4. Mean Absolute Percentage Error (MAPE)
 MAPE provides an error rate as a percentage. It’s useful when
interpreting errors relative to the size of the actual values.
• Lower MAPE indicates better performance.
 5. R-Squared (R²)
 R-Squared measures the proportion of variance in the target variable explained by the model. It typically ranges from 0 to 1, with 1 indicating a perfect model, though it can fall below 0 for very poor fits.
• Higher R² values indicate a better fit.
• If R² is 0, the model does no better than the mean prediction, and if it is negative, the model performs worse than predicting the mean.
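 All five metrics can be computed with scikit-learn; a minimal sketch (the numbers are made up for illustration, and mean_absolute_percentage_error assumes a reasonably recent scikit-learn version):

import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error, r2_score)

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
print("MAE :", mean_absolute_error(y_true, y_pred))
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))                               # same unit as y
print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))
print("R²  :", r2_score(y_true, y_pred))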
Classification Metrics
 1. Accuracy
 Accuracy measures the proportion of correctly classified instances out of the total
instances.

• Higher accuracy indicates better overall performance. However, if the dataset is imbalanced, accuracy alone may not be sufficient.
 2. Precision
 Precision is the ratio of true positives to all predicted positives, showing the
percentage of positive predictions that are actually correct.
 High precision is particularly useful when false positives have a high cost, such as in
marketing, where you want to avoid wasting resources on uninterested clients.
 3. Recall (Sensitivity)
 Recall measures the ability of a model to find all positive instances,
calculated as the ratio of true positives to all actual positives.

 High recall indicates that the model is effective at capturing positive instances. It's important in scenarios where missing positive cases is costly, like identifying potential customers.
 4. F1 Score
 The F1 Score is the harmonic mean of precision and recall. It provides a
balanced metric when there’s a trade-off between precision and recall.
 The F1 score is useful when you need a single metric that balances
precision and recall.
 5. Confusion Matrix
 The confusion matrix shows the counts of true positives, true negatives, false
positives, and false negatives.
• A confusion matrix provides insight into the types of errors the model makes. For
instance, false negatives indicate missed opportunities.
 6. ROC-AUC Score and ROC Curve
 The ROC-AUC (Receiver Operating Characteristic - Area Under Curve) score measures
the model's ability to distinguish between classes. The ROC curve shows the true
positive rate against the false positive rate at various threshold levels.
• ROC-AUC Score: A higher ROC-AUC score means the model is better at
distinguishing between the positive and negative classes.
• ROC Curve: The closer the curve is to the top-left corner, the better the model's
performance.
 Final Interpretation
• Accuracy gives an overall success rate, but alone may not be sufficient.
• Precision is useful when the cost of false positives is high.
• Recall is vital when you need to capture as many positives as possible.
• F1 Score is a balanced metric, helpful when you need to consider both
precision and recall.
• ROC-AUC Score provides insight into the model's ability to separate
classes.
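 A minimal sketch computing these metrics with scikit-learn (the labels and scores below are made up for illustration):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                   # actual classes
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                   # thresholded predictions
y_score = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]  # predicted probabilities

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))  # uses scores, not labels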
Clustering Evaluation Metrics
 Unlike supervised learning, clustering lacks a ground truth. Thus, we rely on
different metrics to evaluate the "quality" of clusters, often based on internal
criteria. Here are some commonly used clustering evaluation metrics:
 1. Silhouette Score
 The Silhouette Score measures how similar an object is to its own cluster
compared to other clusters. A higher score indicates that clusters are well-
separated and compact.
 The score for a single sample is s = (b − a) / max(a, b), where:
• a: The mean distance between a sample and all other points in the same
cluster.
• b: The mean distance between a sample and all points in the next nearest
cluster.
• A Silhouette Score closer to 1 indicates well-separated clusters, whereas a
score near 0 suggests overlapping clusters.
 2. Calinski-Harabasz Index
 The Calinski-Harabasz Index, or Variance Ratio Criterion, measures the
ratio of the sum of between-cluster dispersion to within-cluster
dispersion. A higher score indicates more distinct clusters.
• Higher values represent clusters that are dense and well-separated.
 3. Davies-Bouldin Score
 The Davies-Bouldin Score is the average similarity ratio of each cluster
with the cluster that is most similar to it. Lower values indicate better
clustering.
• Lower values of the Davies-Bouldin Score indicate well-separated
clusters. Values closer to zero are ideal.
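 A minimal sketch computing all three clustering metrics on a k-means result (synthetic blob data for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Silhouette Score :", silhouette_score(X, labels))        # higher is better
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels)) # higher is better
print("Davies-Bouldin   :", davies_bouldin_score(X, labels))    # lower is better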
Model Selection
 Model selection involves choosing the best model for a particular dataset and
task, often by evaluating different algorithms or parameter settings. We use
various techniques such as cross-validation, hyperparameter tuning, and
metrics comparison to help select the most suitable model.
 Model Selection Process Overview
1. Data Preprocessing: Clean and preprocess the dataset.
2. Train-Test Split: Split the data into training and test sets.
3. Baseline Models: Train and evaluate different baseline models.
4. Cross-Validation: Use cross-validation to get more robust evaluation metrics.
5. Hyperparameter Tuning: Use techniques like Grid Search or Randomized
Search to optimize model parameters.
6. Model Evaluation: Compare models using metrics like accuracy, precision,
recall, F1 score, and AUC-ROC.
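 A hedged sketch of steps 2 to 4 of this process (the dataset and the two baseline models are illustrative choices):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)  # step 2

baselines = [("logistic regression", LogisticRegression(max_iter=5000)),
             ("random forest", RandomForestClassifier(random_state=0))]

for name, model in baselines:               # steps 3-4: baselines + cross-validation
    cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()
    print(name, "cross-validated accuracy:", round(cv_acc, 3))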
Case: Credit Card Fraud Detection
 Objective: Build a model to classify whether a given transaction is
fraudulent or not based on features available in the dataset.
 Overview of the Process
1. Load and Preprocess the Data: Read the dataset and perform
necessary cleaning and preprocessing.
2. Exploratory Data Analysis (EDA): Understand the dataset better
through visualizations and statistics.
3. Model Selection: Train several models on the dataset.
4. Evaluate Models: Use different metrics to assess model performance.
5. Hyperparameter Tuning: Optimize model parameters for better
performance.
6. Final Evaluation: Evaluate the best model on a test set.
