
Types of Machine Learning Algorithms

Machine learning algorithms are broadly classified into four types based on how they learn from
data:



1. Supervised Learning
 The model learns from labeled data (input-output pairs).
 Used for prediction and classification tasks.
Examples:
 Regression Algorithms (for continuous output):
o Linear Regression
o Polynomial Regression
o Ridge & Lasso Regression
 Classification Algorithms (for categorical output):
o Logistic Regression
o Decision Trees
o Random Forest
o Support Vector Machine (SVM)
o K-Nearest Neighbors (KNN)
o Naïve Bayes
2. Unsupervised Learning
 The model learns from unlabeled data by identifying patterns and structures; it discovers possible patterns in the data on its own.
 Used for clustering and association rule learning.
Examples:
 Clustering Algorithms (grouping similar data):
o K-Means Clustering
o Hierarchical Clustering
o DBSCAN (Density-Based Spatial Clustering)
 Dimensionality Reduction Algorithms:
o Principal Component Analysis (PCA)
o t-SNE (t-Distributed Stochastic Neighbor Embedding)
o Autoencoders
 Association Rule Learning (finding relationships between items):
o Apriori Algorithm
o Eclat Algorithm
3. Semi-Supervised Learning
 Uses a mix of labeled and unlabeled data.
 Useful when obtaining labeled data is expensive or time-consuming.
 Often used in medical diagnosis and fraud detection.
Examples:
 Self-training
 Label propagation
 Semi-supervised Support Vector Machines
4. Reinforcement Learning
 The model learns through trial and error using a reward system.
 Used in robotics, gaming, and self-driving cars.
Examples:
 Q-Learning
 Deep Q Networks (DQN)
 Policy Gradient Methods
 Actor-Critic Methods (A3C, PPO)

Supervised Learning
 Supervised Learning is a type of machine learning where a model is trained on historical, labeled data and the algorithm learns to predict a value.
 Historical data means known data from the past (for example, the prices at which houses have been sold previously).
 The machine learning algorithm learns from labeled data (i.e., input-output pairs).
 Labeled data means the desired output is known.
 The model is trained using historical data, where each input (features) has a corresponding correct output (label).
 The goal is to learn a mapping function that can make accurate predictions on new, unseen data.

Key Points
 Uses labeled data
 Learns from historical data to make predictions.
 Used for classification and regression tasks.
 Performance is evaluated using metrics (e.g., accuracy, MSE, precision, recall).

 In supervised learning we collect and organize a dataset based on historical records.

 House Price Prediction: A regression model is trained on past house sale prices
(based on square footage, number of rooms, location) and predicts future house
prices.
Features                                   Label
Area (m²)    Bedrooms    Washrooms         Price
200          3           2                 500000
190          2           1                 450000
230          3           3                 650000
180          1           1                 400000
210          2           2                 550000
House Price Prediction Dataset
 The dataset has features (X) and a label (Y).
 The dataset is divided into four components, X Train, Y Train, X Test, and Y Test, for fair evaluation (a minimal sketch follows this list).
 If a new house comes on the market with a known area, number of bedrooms, and number of bathrooms, the model can predict its expected price based on historical data and learned patterns.
 For this model, the input is the house features and the output is the predicted selling price.
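
Below is a minimal sketch of this workflow using scikit-learn: the toy rows from the table above are split into training and test sets, a linear regression model is fitted, and a price is predicted for a new, hypothetical house.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Features: area (m^2), bedrooms, washrooms; label: price (toy rows from the table above)
X = np.array([[200, 3, 2], [190, 2, 1], [230, 3, 3], [180, 1, 1], [210, 2, 2]])
y = np.array([500000, 450000, 650000, 400000, 550000])

# Split into X Train / X Test / Y Train / Y Test for fair evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)              # learn the mapping from features to price

new_house = [[205, 3, 2]]                # a new house with known features (hypothetical)
print(model.predict(new_house))          # predicted selling price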

Supervised Model

Types of Supervised Learning Algorithms: Two main types of labels are used

1. Classification Algorithms (Predict categorical values):


 Predict an assigned category.

Example: Tumor Prediction (Cancerous or Benign) in Supervised Learning


In this classification problem, we use historical medical data to train a machine learning model to
predict whether a tumor is cancerous (malignant) or benign based on various features.
Handwriting Recognition Using Machine Learning (Supervised Learning Example)
Handwriting recognition is a classification problem where a model learns to recognize
handwritten characters or digits from images. The most common dataset for this task is MNIST
(Modified National Institute of Standards and Technology dataset), which contains 28×28
pixel grayscale images of handwritten digits (0-9).
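
A minimal sketch of this idea, using scikit-learn's built-in 8×8 digits dataset as a small stand-in for MNIST (the dataset choice and classifier here are illustrative, not a fixed recipe):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()                   # 8x8 digit images, flattened to 64 features
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = SVC()                              # Support Vector Machine classifier
clf.fit(X_train, y_train)                # train on labeled digit images
print("Test accuracy:", clf.score(X_test, y_test))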
 Decision Trees (Creates a tree structure for decision-making. Example: Loan approval system)
 Random Forest (Uses multiple decision trees to improve accuracy. Example: Disease diagnosis)
 Support Vector Machines (SVM) (Finds the best boundary between classes. Example: Handwritten digit recognition)
 K-Nearest Neighbors (KNN) (Classifies data based on nearby neighbors. Example: Recommender systems)
 Naïve Bayes (Uses probability theory for classification. Example: Sentiment analysis)
 Logistic Regression (Estimates probability and assigns classes. Example: Email spam detection)
2. Regression Algorithms (Predict continuous values): Regression is a supervised learning technique used to predict continuous values, such as house prices, stock prices, temperature, test scores, etc.
 Linear Regression (Fits a straight line to predict values. Example: House price prediction)
 Polynomial Regression (Fits a curve to the data)
 Ridge & Lasso Regression (Prevent overfitting using regularization. Example: Sales forecasting)

Unsupervised Learning
In Unsupervised Learning, there is no labeled output (such as house prices). Instead, the model identifies patterns or groups within the data.
For housing data, unsupervised learning can be used for the following (a clustering sketch follows this list):
1. Clustering (Grouping Similar Houses)
 K-Means Clustering: Groups houses into clusters based on features like area,
number of bedrooms, and location. For example, it can segment houses into
"Luxury," "Affordable," and "Mid-range" categories.
 Hierarchical Clustering: Builds a tree-like structure to show relationships
between different house categories.
 DBSCAN: Identifies housing price anomalies and clusters based on density.
2. Anomaly Detection (Identifying Outliers)
 Helps detect houses with abnormally high or low prices compared to similar
properties.
 Techniques: Isolation Forest, One-Class SVM, Autoencoders
3. Dimensionality Reduction (Feature Reduction)
 If there are many features (e.g., location, area, amenities), Principal
Component Analysis (PCA) can reduce complexity while preserving
important patterns.
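
A minimal sketch of the clustering idea above, using made-up housing features and scikit-learn's KMeans:

import numpy as np
from sklearn.cluster import KMeans

# Columns: area (m^2), bedrooms (made-up values)
houses = np.array([[80, 1], [95, 2], [150, 3], [160, 3], [300, 5], [320, 6]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(houses)      # cluster index for each house
print(labels)                            # which group each house falls into
print(kmeans.cluster_centers_)           # average area/bedrooms per cluster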

Domains where AI is used


1. Medical
2. Education
3. Insurance & Banking
4. Finance & marketing
5. Quality Control
6. Regulatory Compliance
7. Automotive & Manufacturing
8. Others

Python Libraries for Data Science

1. NumPy (Numerical Python)

NumPy is a fundamental library for numerical computing in Python. It provides support for
large, multi-dimensional arrays and matrices, along with a collection of mathematical functions.

Use Cases:

o Efficient handling of numerical data


o Vectorized operations (faster than regular Python lists)
o Linear algebra, random number generation
2. Pandas (Data Analysis Library)

Pandas is a data manipulation and analysis library that provides data structures like Series (1D)
and DataFrame (2D, similar to tables in databases).

Use Cases:

o Loading and manipulating structured data


o Handling missing values
o Data filtering, aggregation, and transformation
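
 Example: a minimal sketch (with made-up values) of loading structured data into a DataFrame, handling a missing value, and filtering/aggregating:

import pandas as pd

df = pd.DataFrame({
    "area": [200, 190, 230, None],                 # one missing value
    "bedrooms": [3, 2, 3, 1],
    "price": [500000, 450000, 650000, 400000],
})

df["area"] = df["area"].fillna(df["area"].mean())  # handle the missing value
big = df[df["bedrooms"] >= 3]                      # filtering
print(big["price"].mean())                         # aggregation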

3. Matplotlib (Data Visualization)

Matplotlib is a powerful plotting library for visualizing data through graphs, charts, and
histograms.

Use Cases:

o Line plots, bar charts, scatter plots, histograms


o Customizing plots (labels, colors, legends)
o Saving figures in different formats
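
 Example: a minimal sketch of a labelled line plot saved to a file:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

plt.plot(x, y, color="blue", label="y = x^2")   # line plot with a legend entry
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.savefig("squares.png")                      # save the figure to a file
plt.show()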

4. Seaborn (Statistical Data Visualization)

Seaborn is built on top of Matplotlib and provides more aesthetically pleasing and informative
statistical graphics.

Use Cases:

o Plotting distributions, relationships, and categorical data


o Heatmaps and correlation matrices
o Built-in dataset support
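
 Example: a minimal sketch using Seaborn's built-in "tips" dataset (downloaded on first use) for a distribution plot and a correlation heatmap:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")                      # built-in example dataset
sns.histplot(tips["total_bill"])                     # distribution of bill amounts
plt.show()

sns.heatmap(tips[["total_bill", "tip", "size"]].corr(), annot=True)  # correlation heatmap
plt.show()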
5. OpenCV (Computer Vision)

OpenCV is an open-source library for image processing and computer vision tasks.

Use Cases:

o Face and object detection


o Image transformations and filtering
o Real-time video processing
 Example: Loading and displaying an image
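
A minimal sketch of this example; "image.jpg" is a placeholder path:

import cv2

img = cv2.imread("image.jpg")                    # load the image (BGR format)
if img is None:
    raise FileNotFoundError("image.jpg not found")

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)     # simple transformation
cv2.imshow("Original", img)                      # display in a window
cv2.imshow("Grayscale", gray)
cv2.waitKey(0)                                   # wait for a key press
cv2.destroyAllWindows()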

6. Scikit-Learn (Machine Learning)

Scikit-Learn is the most widely used machine learning library for building and evaluating
models.

Use Cases:

o Implementing regression, classification, and clustering models


o Feature selection and dimensionality reduction
o Model evaluation and hyperparameter tuning
 Example: Training a simple regression model
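
A minimal sketch of this example on toy data (the target is roughly 2x):

from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]      # single feature
y = [2, 4, 6, 8]              # target is 2x

model = LinearRegression().fit(X, y)
print(model.predict([[5]]))   # approximately [10.]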

7. TensorFlow / PyTorch (Deep Learning)

TensorFlow and PyTorch are the most popular deep learning frameworks for building neural
networks.

 Use Cases:
o Image and text classification
o Building deep learning models (CNNs, RNNs, Transformers)
o Large-scale ML model training
 Example: Defining a simple neural network using PyTorch
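
A minimal sketch of this example: a small fully connected network with one hidden layer (the layer sizes are illustrative):

import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)    # 4 input features -> 8 hidden units
        self.fc2 = nn.Linear(8, 2)    # 8 hidden units -> 2 output classes

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

net = SimpleNet()
print(net(torch.rand(1, 4)))          # forward pass on one random input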

NumPy

 NumPy (Numerical Python) is a fundamental library for numerical computing in Python.
 It is a Python library for creating N-dimensional arrays (1D, 2D, 3D).
 It also provides mathematical functions, linear algebra operations, statistical distributions, trigonometric functions, and random number generation, making it essential for data science, machine learning, and scientific computing.
 It can quickly broadcast functions: broadcasting allows NumPy to perform operations between arrays of different shapes without explicit looping. It automatically expands smaller arrays to match the shape of larger ones, making operations efficient (a short sketch follows this list).
 NumPy arrays and Python lists both store collections of data, but NumPy arrays are faster, more memory-efficient, and optimized for numerical operations.
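
A short sketch of broadcasting and vectorized operations:

import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])     # shape (2, 3)
b = np.array([10, 20, 30])    # shape (3,)

print(a + b)                  # b is broadcast across each row of a
print(a * 2)                  # vectorized: no explicit Python loop needed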

Converting a 1D Python List to a NumPy Array

import numpy as np

# Python list
mylist = [1, 2, 3, 4, 5]
print(mylist)        # [1, 2, 3, 4, 5]
print(type(mylist))  # <class 'list'>

# Convert to a NumPy array
myarr = np.array(mylist)
print(myarr)         # [1 2 3 4 5]
print(type(myarr))   # <class 'numpy.ndarray'>

Converting a 2D List (Nested List) to a NumPy Array

import numpy as np

# Python nested list (2D list)
my_matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Convert to NumPy array
np_matrix = np.array(my_matrix)
print(np_matrix)

np.arange() – Creating NumPy Arrays with Ranges

 The np.arange() function in NumPy creates an array with evenly spaced values between a
start and stop value.
 It works similarly to Python’s range() but produces a NumPy array instead of a list.

Syntax

import numpy as np
np.arange(start, stop, step, dtype)

 start :(Optional) Starting value (default = 0)


 stop :The end value (exclusive)
 step :(Optional) Step size (default = 1)
 dtype :(Optional) Data type of the array

import numpy as np

arr = np.arange(3) # Creates [0, 1, 2]


print(arr)

Specifying a Start and Stop Value

arr = np.arange(0, 10) # Start from 0, stop before 10


print(arr)

output
[0 1 2 3 4 5 6 7 8 9]

Using a Custom Step Size

arr = np.arange(1, 10, 2) # Start=1, Stop=10, Step=2

print(arr)

output

[1 3 5 7 9]

Creating an Array with Floating-Point Values

arr = np.arange(1, 5, 0.5) # Start=1, Stop=5, Step=0.5

print(arr)

output [1. 1.5 2. 2.5 3. 3.5 4. 4.5]

Using dtype to Specify Data Type

arr = np.arange(1, 10, 2, dtype=float) # Force float values

print(arr)

output [1. 3. 5. 7. 9.]

np.zeros() – Create NumPy Arrays Filled with Zeros

 The np.zeros() function creates an array filled with zero values, useful for initializing
arrays in numerical computing, data science, and machine learning.
 You specify the shape of the array to create.

Syntax

import numpy as np
np.zeros(shape, dtype=float)
 shape → The shape of the array (integer for 1D, tuple for multi-dimensional).
 dtype → (Optional) The data type of the array elements (default = float).

Creating a 1D Zero Array

import numpy as np
arr = np.zeros(5) # Creates an array with 5 zeros
print(arr)

O/P
[0. 0. 0. 0. 0.]

Creating a 2D Zero Matrix

arr = np.zeros((3, 4)) # 3 rows, 4 columns


print(arr)

O/P

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

Specifying Data Type (dtype)

arr = np.zeros((2, 3), dtype=int) # Integer zeros


print(arr)

Output

[[0 0 0]
[0 0 0]]

Useful when working with integer-based computations.

Creating a 3D Zero Array

arr = np.zeros((2, 3, 4)) # 2 blocks, 3 rows, 4 columns


print(arr)

Output
[[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]]

When to Use np.zeros()

 Initializing arrays before computation (avoids uninitialized memory issues).


 Creating placeholders for machine learning models & deep learning weights.
 Allocating memory for matrices before filling them with actual values.

np.ones() – Create NumPy Arrays Filled with Ones

The np.ones() function creates an array filled with ones, useful for initializing arrays in
numerical computing, data science, and machine learning.

Syntax

import numpy as np
np.ones(shape, dtype=float)

Creating a 1D Ones Array

import numpy as np
arr = np.ones(4) # Creates an array with 4 ones
print(arr)

output
[1. 1. 1. 1.]

Creating a 2D Ones Matrix

arr = np.ones((3, 4)) # 3 rows, 4 columns


print(arr)

output

[[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]]
Specifying Data Type (dtype)

arr = np.ones((2, 3), dtype=int) # Integer ones


print(arr)

output

[[1 1 1]
[1 1 1]]

Creating a 3D Ones Array

arr = np.ones((2, 3, 4)) # 2 blocks, 3 rows, 4 columns


print(arr)

output

[[[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]]

[[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]]]

When to Use np.ones()?

 Initializing weight matrices in machine learning models.


 Creating mask arrays for computations.
 Generating feature matrices for standardization (e.g., adding a bias column of ones).
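
A small sketch of the bias-column use case above, prepending a column of ones to a feature matrix:

import numpy as np

X = np.array([[200.0, 3], [190.0, 2], [230.0, 3]])   # 3 samples, 2 features
bias = np.ones((X.shape[0], 1))                       # column of ones
X_with_bias = np.hstack([bias, X])                    # shape (3, 3)
print(X_with_bias)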

np.linspace() – Create Evenly Spaced Numbers in NumPy

import numpy as np

Syntax

np.linspace(start, stop, num=50, endpoint=True, dtype=None)

 start → The starting value of the sequence.


 stop → The ending value of the sequence.
 num → (Optional) Number of values to generate (default = 50).
 endpoint → (Optional) If True (default), stop is included. If False, stop is excluded.

 dtype → (Optional) Data type of the output array.


Creating a Basic Linspace Array

import numpy as np
arr = np.linspace(0, 10, 3) # 3 evenly spaced numbers between 0 and 10
print(arr)

output
[ 0. 5. 10. ]

Unlike np.arange(), you specify the number of elements instead of the step size

Excluding the Endpoint

arr = np.linspace(1, 10, num=5, endpoint=False)


print(arr)

output
[1. 2.8 4.6 6.4 8.2]

Here, 10 is not included, and spacing is adjusted accordingly.

Generating Integer Linspace Values

arr = np.linspace(1, 10, num=5, dtype=int) # Force integer values


print(arr)

output

[ 1 3 5 7 10]

By default, np.linspace() generates floats, but you can specify integers.

Getting Both Values & Step Size (retstep=True)

arr, step = np.linspace(1, 10, num=5, retstep=True)


print("Array:", arr)
print("Step size:", step)

output
Array: [ 1. 3.25 5.5 7.75 10. ]
Step size: 2.25

retstep=True returns the spacing between values.


np.eye() : Create an Identity Matrix in NumPy

 The np.eye() function creates an identity matrix, a square matrix with 1s on the diagonal
and 0s everywhere else.
 Identity matrices are widely used in linear algebra, machine learning, and deep learning.

Syntax

import numpy as np
np.eye(N, M=None, k=0, dtype=float)

 N → Number of rows.
 M → (Optional) Number of columns (default = N, creating a square matrix).
 k → (Optional) Diagonal offset (0 for main diagonal, positive for upper diagonals,
negative for lower diagonals).
 dtype → (Optional) Data type of the output matrix (default = float).

Creating a Square Identity Matrix

import numpy as np
arr = np.eye(4) # 4×4 identity matrix
print(arr)

output

[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]

Creating a Rectangular Identity-Like Matrix

arr = np.eye(3, 5) # 3 rows, 5 columns


print(arr)

output

[[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]]
Shifting the Diagonal (k parameter)

arr = np.eye(4, k=1) # Shift diagonal up by 1


print(arr)

output

[[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]
[0. 0. 0. 0.]]

Creating an Integer Identity Matrix

arr = np.eye(3, dtype=int)


print(arr)

output
[[1 0 0]
[0 1 0]
[0 0 1]]

np.eye() in Machine Learning

 Used in algorithms like Principal Component Analysis (PCA).


 Used for feature scaling and transformation matrices.
 Helpful in solving linear equations and regularization techniques (like L2 regularization
in Ridge Regression).
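
A small sketch of how np.eye() appears in L2 (ridge) regularization, using the closed-form solution w = (XᵀX + λI)⁻¹Xᵀy on toy data (the design matrix and λ value are illustrative):

import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # toy design matrix
y = np.array([1.0, 2.0, 3.0])
lam = 0.1                                             # regularization strength

I = np.eye(X.shape[1])                                # 2x2 identity matrix
w = np.linalg.inv(X.T @ X + lam * I) @ X.T @ y        # ridge weights
print(w)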
Topics covered next:
 Challenges of ML
 Assumptions in ML
 Underfitting
 Overfitting
 Bias-variance trade-off
 No Free Lunch Theorem
 Trade-off between flexibility and interpretability
Main Challenges of Machine Learning

 Is the model perfect?
When we train a model using a given dataset, it is difficult to say whether the model is perfect or not, because we evaluate the performance of the model using the training and testing datasets, and this only shows how well it performs on that specific data. The real test of the model’s accuracy happens when it is used on new, unseen data in the future. Only then can we determine if the model is truly working well.
 Which Model to choose
As we have seen, different Machine Learning tasks, like classification, have many
models available; it is difficult to select the right one. For example, in a house price
prediction problem, which is a regression task, there are many regression models to
choose from. However, selecting the best model is challenging because there is no single
criterion for determining the most suitable one.

 Sufficient/Insufficient Data
It is often unclear whether a dataset is sufficient. While we have data, we may not
know if it is enough to train a reliable model and make accurate predictions.

 What is data quality?


o Error in Data
o Missing Value
Data quality can be checked by looking at errors, missing values, and noise in the
dataset. If the data has too many issues and the model learns from it, the model may not
be able to learn the best patterns.
 How good are results?
If a model is trained on a given dataset, it is difficult to determine whether it is
good or not. This is because evaluation is based on the available dataset, but its
performance on future data remains uncertain. A model can only be considered good if it
performs well on unseen or future data.
 Are the data represented correctly?
o Are height and weight enough? Should we also look at age?
o How should we represent age: as a number or as categories like young, middle-aged, and old?
Data can be represented in different ways, making it difficult to determine whether it is in the correct format.
 How to ensure generalization
o Generalization is the ability of an ML algorithm to do well on future test data.
An ML model may perform well on the given data, but its performance on future data is uncertain. Generalization refers to how well the model performs on new, unseen data. Predicting the generalization error is difficult during training and testing, as it only becomes clear when the model encounters real-world data.

Examples of bad Data

 Insufficient quantity of training data


 Non representative training data
 Poor quality data
 Irrelevant Feature

Examples of Machine Learning Model Problems

 Overfitting the training data


 Underfitting the training data
Assumptions in ML

 IID (Independent and Identically Distributed data) assumption
o Independent (instances are independent of each other)
 Each data sample is drawn independently of the others.
 The outcome of one sample does not affect another.
 This ensures there are no dependencies or correlations between the samples in the dataset.
o Identically distributed (all instances of the dataset come from the same probability distribution)
 All data points come from the same underlying distribution.
 If the data is drawn from a Gaussian distribution (a Gaussian distribution, also known as the normal distribution, is a fundamental probability distribution in statistics and Machine Learning, characterized by a bell-shaped curve that is symmetric around the mean), all samples must follow that same Gaussian distribution.
o This assumption ensures the learning algorithm can generalize from training data to unseen data, as the test set is assumed to follow the same probability distribution.
o Test and training data are generated by the same probability distribution.

Validation and Test Data in Machine Learning

When training a Machine Learning model, we typically split the dataset into three parts (a minimal split sketch follows the lists below):

1. Training Set: Used to train the model by learning patterns from the data.
2. Validation Set: Used to fine-tune the model by selecting the best hyperparameters and
preventing overfitting.
3. Test Set: Used to evaluate the final performance of the trained model on completely
unseen data.
Differences between Validation and Test Data

1. Purpose: Validation data is used to tune the model and optimize hyperparameters, while
test data is used to evaluate the final model’s performance on unseen data.
2. When Used: The validation set is used during model training to improve performance,
whereas the test set is only used after training is complete to assess how well the model
generalizes.
3. Impact: The validation set helps in selecting the best model by preventing overfitting,
while the test set provides an unbiased estimate of the model’s real-world performance.
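
A minimal sketch of this three-way split, obtained by calling scikit-learn's train_test_split twice on toy data (the 60/20/20 proportions are illustrative):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)      # 10 toy samples, 2 features
y = np.arange(10)

# First hold out 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Then split the remaining 80% into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 6 2 2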

Underfitting the Training Data

Underfitting happens when a model is too simple to learn patterns in data, leading to poor
performance

1. High Training Error and High test error


When we train our model on the training data, there may be a high training error.
After training, when we make predictions, there may be many errors in the
predictions. If the model does not perform well on the training data, it will also
not perform well on the test data. Therefore, if a model is underfitted to the
training data, it will definitely not work well on the test data either.
2. It Occurs when
o Model is too simple to learn the underlying structure of the data.
Underfitting occurs when the model lacks the complexity needed
to capture patterns in the data. This happens when the model is too
simple relative to the problem it is trying to solve.
o We have a very small amount of data
Machine learning models require sufficient data to generalize well. If we train a model on too little data, it may not learn meaningful relationships, which may result in underfitting.
3. If a model is underfitting the training data, adding more training data will not help.
The model is too simple: underfitting happens when the model is too basic to capture the underlying patterns in the data. Even with more data, if the model is inherently too simple, it will still fail to learn the relationships effectively.
4. How to fix this problem

1. Use a More Powerful Model

A simple model may not be enough; select a more powerful model with more parameters (a short sketch follows this list), for example:

Polynomial Regression instead of Linear Regression
Deep Decision Trees instead of Shallow Trees
Neural Networks instead of Basic Machine Learning Models

2. Feature Selection & Engineering

 Remove irrelevant or highly correlated features that might be introducing noise.
 Use Principal Component Analysis (PCA) or Feature Selection techniques to reduce dimensionality.

3. Gather More Training Data

 If possible, collect more training data to help the model learn better general patterns
instead of memorizing specific examples.

4. Reduce Noise:

 Reduce noise by handling missing values, removing outliers, and eliminating irrelevant features using Recursive Feature Elimination (RFE) or Principal Component Analysis (PCA). Apply data smoothing (moving averages, Gaussian filtering), proper preprocessing (normalization, deduplication), and cleaning techniques for text/audio. Use robust models like Ridge regression and ensemble methods (Random Forest, Boosting) for better generalization.
 Noise : In machine learning, noise is unwanted or incorrect data that can
confuse the model and reduce accuracy. It comes from errors, missing
values, or irrelevant information. To reduce noise, we can remove outliers,
select important features, and clean the data properly.
 Outliers : An outlier is a data point that is very different from others in a
dataset, often due to errors or rare events. It can affect model accuracy and
is detected using methods like Z-score, IQR, or box plots.

5. Reduce the Constraints on the Model

Reducing constraints on the model, such as decreasing regularization, increasing model complexity, or adding more features, can help fix underfitting by allowing the model to learn patterns better and improve performance.
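
A minimal sketch of the "more powerful model" fix: plain Linear Regression underfits a quadratic target, while Polynomial Regression (degree 2) captures it (toy data, illustrative only):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(-3, 4).reshape(-1, 1)        # 7 points from -3 to 3
y = (X ** 2).ravel()                       # quadratic target

linear = LinearRegression().fit(X, y)
print("Linear R^2:", linear.score(X, y))   # near 0: a straight line underfits the curve

X_poly = PolynomialFeatures(degree=2).fit_transform(X)
poly = LinearRegression().fit(X_poly, y)
print("Polynomial R^2:", poly.score(X_poly, y))   # 1.0: the curve is captured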

Overfitting The Training data

Overfitting happens when a model learns too much from training data, including noise, making it
perform poorly on new data

1. The model performs well on training data, but it does not generalize well (high test error)
When a model performs well on training data but not on test data, it means the model has learned patterns, including noise, from the training set but fails to generalize to new data.
2. It happens when
1. The model is too complex relative to the amount of data (the data is simple but the model is complex)
2. The training data is noisy

3. Possible solutions are

 Simplify the model by selecting one with fewer parameters
 Reduce the number of attributes in the training data or constrain the model
 Gather more training data
 Reduce the noise in the training data

4. Constraining a model to make it simpler and reduce the risk of overfitting is called regularization

Regularization is the process of constraining a model to make it simpler and reduce the risk of overfitting by penalizing large coefficients or overly complex patterns. This helps improve the model’s ability to generalize to new data.
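
A minimal sketch of regularization with scikit-learn: Ridge regression penalizes large coefficients, producing a simpler (shrunk) model than plain Linear Regression (the toy data and alpha value are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                                   # 20 samples, 5 features
y = X @ np.array([3.0, 0.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=20)

print(LinearRegression().fit(X, y).coef_)   # unconstrained coefficients
print(Ridge(alpha=10.0).fit(X, y).coef_)    # shrunk (regularized) coefficients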

Bias-Variance Tradeoff

The Bias-Variance Tradeoff is the balance between bias (underfitting) and variance (overfitting)
in a machine learning model.

Prediction error (classified into two types)

 Bias Error
 Variance Error

There is a tradeoff between a model’s ability to minimize bias and variance

A model with high bias is too simple and underfits the data, while a model with high
variance is too complex and overfits. The goal is to find a balance where the model captures
patterns without memorizing noise, ensuring good generalization to new data. Techniques like
regularization, cross-validation, and ensemble methods help manage this tradeoff.

Understanding these errors helps avoid the mistakes of overfitting and underfitting

Understanding bias, variance, and the bias-variance tradeoff helps avoid the mistakes
of overfitting (high variance) and underfitting (high bias). A well-balanced model captures
important patterns without memorizing noise. Techniques like regularization, cross-validation,
and feature selection help achieve this balance.

High Variance

High Variance means a model learns too much from training data, including noise, and
performs poorly on new data (overfitting). It can be fixed using regularization, more data, or
ensemble methods.

Model is more data sensitive

A model that is more data sensitive has high variance, meaning it learns too much from
training data, including noise, and struggles to generalize to new data (overfitting). This can be
fixed using regularization, more training data, or ensemble methods.

Bias Error

Bias Error occurs when a model is too simple to capture patterns in the data, leading to
underfitting. It results in high errors on both training and test data. To reduce bias, use a more
complex model, add relevant features, or decrease regularization.

 Difference between the average prediction of our model and the correct value which
we are trying to predict
 Model with high bias pays very little attention to the training data and
oversimplifies the model
 It always leads to high error on training and test data

A high level of bias can lead to underfitting, which occurs when the algorithm is unable to capture the relevant relationships between features and the target output.

A linear algorithm often has high bias; a nonlinear model has less bias.


Variance (Changing Data set)

Variance in machine learning refers to how much a model's predictions change when trained on
different datasets. High variance means the model is too sensitive to small changes in data,
leading to overfitting, while low variance helps in better generalization.

 Variance indicates how much the estimate of the target function will change for a given data point if different training data were used.
Variance shows how much a model's predictions change when trained on different datasets. High variance means the model is too sensitive to the training data, leading to overfitting, while low variance improves generalization.

 A model with high variance pays a lot of attention to the training data and does not generalize to data it has not seen before.
A model with high variance focuses too much on the training data, learning even the noise, and fails to generalize to new data, leading to overfitting. This can be fixed using regularization, more training data, or ensemble methods.
 As a result, the model performs well on training data but has a high error on test data.
 Variance can lead to overfitting, in which small fluctuations in the training set are magnified.
