Unit 2: Supervised Learning: Regression
• If the given shape has four sides, and all the sides are equal, then it will
be labelled as a square.
• If the given shape has three sides, then it will be labelled as a triangle.
• If the given shape has six equal sides, then it will be labelled as a hexagon.
Supervised Machine Learning
• Now, after training, we test our model using the test set, and the
task of the model is to identify the shape.
• The slope of the line indicates how much the dependent variable
changes for a unit change in the independent variable.
Linear Regression: Linear Models
• The model gets the best regression fit line, ŷ = θ1 + θ2·x, by finding the
best θ1 and θ2 values:
• θ1: intercept
• θ2: coefficient of x
• Once we find the best θ1 and θ2 values, we get the best-fit
line. So when we are finally using our model for prediction,
it will predict the value of y for the input value of x.
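As a minimal sketch of this, fitting a line with scikit-learn's LinearRegression recovers θ1 as the intercept_ attribute and θ2 as coef_ (the data below is hypothetical, generated around y = 2 + 3x):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data scattered around y = 2 + 3x
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))
y = 2.0 + 3.0 * x[:, 0] + rng.normal(0, 1, size=50)

model = LinearRegression().fit(x, y)
print("theta1 (intercept):", model.intercept_)       # close to 2
print("theta2 (coefficient of x):", model.coef_[0])  # close to 3

# Prediction: the value of y for an input value of x
print("prediction at x = 4:", model.predict([[4.0]])[0])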
Linear Regression: Cost Function
• The cost function or the loss function is nothing but the error or
difference between the predicted value and the true value Y.
• It is the Mean Squared Error (MSE) between the predicted value and
the true value.
• To achieve the best-fit regression line, the model aims to predict the
target value such that the error difference between the predicted value
and the true value Y is minimum.
• So, it is very important to update the θ1 and θ2 values, to reach the best
value that minimizes the error between the predicted y value (pred) and
the true y value (y).
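Written out (the slide omits the formula, so this is the standard MSE formulation, stated here in the document's θ1, θ2 notation):

J(\theta_1, \theta_2) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2, \quad \text{where } \hat{y}_i = \theta_1 + \theta_2 x_i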
Polynomial Regression
• Polynomial regression makes use of a linear regression model to fit
complicated, non-linear functions and datasets.
Isotonic Regression
• Isotonic regression fits a non-decreasing function to the data by solving

minimize \sum_{i} (y_i - \hat{y}_i)^2  subject to  \hat{y}_i \le \hat{y}_j whenever x_i \le x_j,

• where x_i and y_i are the predictor and target variable for the i^{th} data
point, respectively, and \hat{y}_i are the fitted values.
# create an instance of the IsotonicRegression class
from sklearn.isotonic import IsotonicRegression
ir = IsotonicRegression()
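A minimal end-to-end sketch with this class (the data is hypothetical; fit_transform and predict are standard IsotonicRegression methods in scikit-learn):

import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical noisy but generally increasing data
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 0.5 * x + rng.normal(0, 1.5, size=20)

ir = IsotonicRegression()        # defaults to a non-decreasing fit
y_fit = ir.fit_transform(x, y)   # fitted values satisfy y_fit[i] <= y_fit[j] for x_i <= x_j

# The fitted step function can be evaluated at new points
print(ir.predict([2.5, 7.5]))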
• Binomial: In binomial logistic regression, there can be only two possible types
of the dependent variable, such as 0 or 1, Pass or Fail, etc.
Linear Regression vs. Logistic Regression:
      Linear Regression                         Logistic Regression
5.    Least square estimation method is        Maximum likelihood estimation method
      used for estimation of accuracy.         is used for estimation of accuracy.
6.    The output must be a continuous          The output must be a categorical value,
      value, such as price, age, etc.          such as 0 or 1, Yes or No, etc.
8.    There may be collinearity between        There should not be collinearity between
      the independent variables.               the independent variables.
Logistic Regression: Sigmoid Function
• The sigmoid function takes the input z and maps it to a probability
between 0 and 1, i.e. the predicted y:

σ(z) = 1 / (1 + e^(−z))
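A minimal sketch of the sigmoid in Python (the sample inputs are arbitrary):

import numpy as np

def sigmoid(z):
    # Map any real-valued input z to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

# Large negative z -> near 0; z = 0 -> 0.5; large positive z -> near 1
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # approx. [0.0067, 0.5, 0.9933]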
Logistic Regression
• Logistic Regression Equation
• The odds are the ratio of something occurring to something not
occurring. This differs from probability, which is the ratio of something
occurring to everything that could possibly occur. So the odds are

odds = p / (1 − p)

• Taking the log of the odds (the logit) gives the logistic regression
equation: log(p / (1 − p)) = θ1 + θ2·x.
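A minimal sketch with scikit-learn's LogisticRegression (the 1-D data is hypothetical; predict_proba returns the sigmoid-derived probability of each class):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: class 1 becomes more likely as x grows
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print("intercept:", clf.intercept_, "coefficient:", clf.coef_)
print("P(y = 1 | x = 2.2):", clf.predict_proba([[2.2]])[0, 1])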
• Depending on the problem, this can make SGD faster than batch
gradient descent.
a. Randomly shuffle the training dataset.
b. Iterate over each training example (or a small batch) in the shuffled order.
c. Compute the gradient of the cost function with respect to the model parameters using
the current training example (or batch).
d. Update the model parameters by taking a step in the direction of the negative gradient,
scaled by the learning rate.
e. Evaluate the convergence criteria, such as the change in the cost function between
iterations or the magnitude of the gradient (see the sketch after this list).
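A minimal sketch of steps a-e for linear regression with a squared-error loss (the data, fixed learning rate, and simple convergence check are all hypothetical choices):

import numpy as np

# Hypothetical data scattered around y = 2 + 3x
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * X + rng.normal(0, 1, size=100)

theta1, theta2 = 0.0, 0.0   # intercept and coefficient of x
lr, tol = 0.005, 1e-6       # learning rate and convergence tolerance
prev_cost = np.inf

for epoch in range(200):
    order = rng.permutation(len(X))        # a. shuffle the training data
    for i in order:                        # b. iterate in shuffled order
        error = theta1 + theta2 * X[i] - y[i]
        g1, g2 = error, error * X[i]       # c. gradient for this one example
        theta1 -= lr * g1                  # d. step along the negative gradient
        theta2 -= lr * g2
    cost = np.mean((theta1 + theta2 * X - y) ** 2)
    if abs(prev_cost - cost) < tol:        # e. convergence check on the cost
        break
    prev_cost = cost

print(theta1, theta2)   # close to 2 and 3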
Stochastic Gradient Descent
• Return Optimized Parameters: Once the convergence criteria are met or
the maximum number of iterations is reached, return the optimized model
parameters.
• In SGD, since only one sample from the dataset is chosen at random for
each iteration, the path taken by the algorithm to reach the minima is
usually noisier than your typical Gradient Descent algorithm.
• But that doesn’t matter much: the exact path taken by the algorithm is
unimportant as long as we reach the minimum, and SGD typically does so
with a significantly shorter training time.
• The path taken by Batch Gradient Descent is smooth and direct, while
SGD's noisier path still reaches the minimum with a significantly shorter
training time. [Figure omitted: Batch Gradient Descent path]
Advantages of Stochastic Gradient Descent
• Memory Efficiency: Since SGD updates the parameters for each training
example one at a time, it is memory-efficient and can handle large
datasets that cannot fit into memory.
• Avoidance of Local Minima: Due to the noisy updates in SGD, it has the
ability to escape from local minima and move toward the global minimum.
Disadvantages of Stochastic Gradient Descent
• Noisy updates: The updates in SGD are noisy and have a high variance,
which can make the optimization process less stable and lead to
oscillations around the minimum.
• Slow Convergence: SGD may require more iterations to converge to the
minimum since it updates the parameters for each training example one at
a time.
• Sensitivity to Learning Rate: The choice of learning rate can be critical in
SGD since using a high learning rate can cause the algorithm to overshoot
the minimum, while a low learning rate can make the algorithm converge
slowly.
• Less Accurate: Due to the noisy updates, SGD may not converge to the
exact global minimum and can result in a suboptimal solution. This can
be mitigated by using techniques such as learning rate scheduling and
momentum-based updates (sketched below).
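As a brief sketch of the momentum idea (the 0.9 coefficient is a common but arbitrary default here):

def sgd_momentum(theta, grads, lr=0.01, beta=0.9):
    # Apply momentum-based SGD updates for a stream of per-example gradients
    velocity = 0.0
    for g in grads:
        velocity = beta * velocity - lr * g   # decaying average of past gradients
        theta = theta + velocity              # step along the smoothed direction
    return theta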
Confusion Matrix
a. Accuracy
b. Precision
c. Recall
d. F1-Score
• An example confusion matrix for a binary problem:
[[4 2]
[1 3]]
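A matrix like this can be produced with scikit-learn's confusion_matrix; the label vectors below are hypothetical, chosen only to reproduce these counts (rows are actual classes, columns are predicted classes):

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]   # actual labels
y_pred = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]   # predicted labels

print(confusion_matrix(y_true, y_pred))
# [[4 2]
#  [1 3]]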
Calculate Accuracy, Error, Precision, Recall and F1 Score
for the following Confusion Matrix
                       Actual Positive    Actual Negative
Predicted Positive           10                 10
Predicted Negative           25                 55
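As a quick check, the metrics follow directly from the table (TP = 10, FP = 10, FN = 25, TN = 55):

TP, FP, FN, TN = 10, 10, 25, 55

accuracy = (TP + TN) / (TP + FP + FN + TN)          # 65 / 100 = 0.65
error = 1 - accuracy                                # 0.35
precision = TP / (TP + FP)                          # 10 / 20 = 0.50
recall = TP / (TP + FN)                             # 10 / 35 ≈ 0.286
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.364

print(accuracy, error, precision, recall, f1)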
ROC Curve
• This curve plots two parameters: True Positive Rate and False
Positive Rate.
• With a ROC curve, you’re trying to find a good model that optimizes the
trade-off between the False Positive Rate (FPR) and True Positive Rate
(TPR). What counts here is how much area is under the curve (Area Under
the Curve = AUC).
• An ideal curve fills 100% of the area (AUC = 1), which means that you’re
going to be able to distinguish between negative results and positive
results 100% of the time (which is almost impossible in real life).
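A minimal sketch of computing the ROC curve and its AUC with scikit-learn (the true labels and scores below are hypothetical classifier outputs):

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.5]  # predicted P(y = 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", roc_auc_score(y_true, scores))  # 1.0 would be a perfect classifier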