Unit 2: Supervised Learning: Regression
• If the given shape has four sides, and all the sides are equal, then it will
be labelled as a square.
• If the given shape has three sides, then it will be labelled as a triangle.
• If the given shape has six equal sides, then it will be labelled as a hexagon.
Supervised Machine Learning
• Now, after training, we test our model using the test set, and the
task of the model is to identify the shape.
• The slope of the line indicates how much the dependent variable
changes for a unit change in the independent variable.
Linear Regression: Linear Models
• The model gets the best regression fit line, ŷ = θ1 + θ2·x, by finding the
best θ1 and θ2 values:
• θ1: intercept
• θ2: coefficient of x
• Once we find the best θ1 and θ2 values, we get the best-fit
line. So when we are finally using our model for prediction,
it will predict the value of y for the input value of x.
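As a minimal sketch of this, fitting a line with scikit-learn's LinearRegression recovers θ1 as the intercept_ attribute and θ2 as coef_ (the data below is hypothetical, generated around y = 2 + 3x):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data scattered around y = 2 + 3x
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))
y = 2.0 + 3.0 * x[:, 0] + rng.normal(0, 1, size=50)

model = LinearRegression().fit(x, y)
print("theta1 (intercept):", model.intercept_)       # close to 2
print("theta2 (coefficient of x):", model.coef_[0])  # close to 3

# Prediction: the value of y for an input value of x
print("prediction at x = 4:", model.predict([[4.0]])[0])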
Linear Regression: Cost Function
• The cost function or the loss function is nothing but the error or
difference between the predicted value and the true value Y.
• It is the Mean Squared Error (MSE) between the predicted value and
the true value.
• To achieve the best-fit regression line, the model aims to predict the
target value such that the error difference between the predicted value
and the true value Y is minimum.
• So, it is very important to update the θ1 and θ2 values, to reach the best
value that minimizes the error between the predicted y value (pred) and
the true y value (y).
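Written out (the slide omits the formula, so this is the standard MSE formulation, stated here in the document's θ1, θ2 notation):

J(\theta_1, \theta_2) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2, \quad \text{where } \hat{y}_i = \theta_1 + \theta_2 x_i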
Polynomial Regression
• Polynomial regression makes use of a linear regression model to fit
complicated, non-linear functions and datasets.
Isotonic Regression
• Isotonic regression fits a non-decreasing function to the data by solving

minimize \sum_{i} (y_i - \hat{y}_i)^2  subject to  \hat{y}_i \le \hat{y}_j whenever x_i \le x_j,

• where x_i and y_i are the predictor and target variable for the i^{th} data
point, respectively, and \hat{y}_i are the fitted values.
# create an instance of the IsotonicRegression class
from sklearn.isotonic import IsotonicRegression
ir = IsotonicRegression()
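A minimal end-to-end sketch with this class (the data is hypothetical; fit_transform and predict are standard IsotonicRegression methods in scikit-learn):

import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical noisy but generally increasing data
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 0.5 * x + rng.normal(0, 1.5, size=20)

ir = IsotonicRegression()        # defaults to a non-decreasing fit
y_fit = ir.fit_transform(x, y)   # fitted values satisfy y_fit[i] <= y_fit[j] for x_i <= x_j

# The fitted step function can be evaluated at new points
print(ir.predict([2.5, 7.5]))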
• Binomial: In binomial logistic regression, there can be only two possible types
of the dependent variable, such as 0 or 1, Pass or Fail, etc.
Linear Regression vs. Logistic Regression:
      Linear Regression                         Logistic Regression
5.    Least square estimation method is        Maximum likelihood estimation method
      used for estimation of accuracy.         is used for estimation of accuracy.
6.    The output must be a continuous          The output must be a categorical value,
      value, such as price, age, etc.          such as 0 or 1, Yes or No, etc.
8.    There may be collinearity between        There should not be collinearity between
      the independent variables.               the independent variables.
Logistic Regression: Sigmoid Function
• The sigmoid function takes the input z and maps it to a probability
between 0 and 1, i.e. the predicted y:

σ(z) = 1 / (1 + e^(−z))
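A minimal sketch of the sigmoid in Python (the sample inputs are arbitrary):

import numpy as np

def sigmoid(z):
    # Map any real-valued input z to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

# Large negative z -> near 0; z = 0 -> 0.5; large positive z -> near 1
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # approx. [0.0067, 0.5, 0.9933]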
Logistic Regression
• Logistic Regression Equation
• The odds are the ratio of something occurring to something not
occurring. This differs from probability, which is the ratio of something
occurring to everything that could possibly occur. So the odds are

odds = p / (1 − p)

• Taking the log of the odds (the logit) gives the logistic regression
equation: log(p / (1 − p)) = θ1 + θ2·x.
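A minimal sketch with scikit-learn's LogisticRegression (the 1-D data is hypothetical; predict_proba returns the sigmoid-derived probability of each class):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: class 1 becomes more likely as x grows
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print("intercept:", clf.intercept_, "coefficient:", clf.coef_)
print("P(y = 1 | x = 2.2):", clf.predict_proba([[2.2]])[0, 1])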
• Depending on the problem, this can make SGD faster than batch
gradient descent.
a. Randomly shuffle the training dataset.
b. Iterate over each training example (or a small batch) in the shuffled order.
c. Compute the gradient of the cost function with respect to the model parameters using
the current training example (or batch).
d. Update the model parameters by taking a step in the direction of the negative gradient,
scaled by the learning rate.
e. Evaluate the convergence criteria, such as the change in the cost function between
iterations or the magnitude of the gradient (see the sketch after this list).
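A minimal sketch of steps a-e for linear regression with a squared-error loss (the data, fixed learning rate, and simple convergence check are all hypothetical choices):

import numpy as np

# Hypothetical data scattered around y = 2 + 3x
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * X + rng.normal(0, 1, size=100)

theta1, theta2 = 0.0, 0.0   # intercept and coefficient of x
lr, tol = 0.005, 1e-6       # learning rate and convergence tolerance
prev_cost = np.inf

for epoch in range(200):
    order = rng.permutation(len(X))        # a. shuffle the training data
    for i in order:                        # b. iterate in shuffled order
        error = theta1 + theta2 * X[i] - y[i]
        g1, g2 = error, error * X[i]       # c. gradient for this one example
        theta1 -= lr * g1                  # d. step along the negative gradient
        theta2 -= lr * g2
    cost = np.mean((theta1 + theta2 * X - y) ** 2)
    if abs(prev_cost - cost) < tol:        # e. convergence check on the cost
        break
    prev_cost = cost

print(theta1, theta2)   # close to 2 and 3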
Stochastic Gradient Descent
• Return Optimized Parameters: Once the convergence criteria are met or
the maximum number of iterations is reached, return the optimized model
parameters.
• In SGD, since only one sample from the dataset is chosen at random for
each iteration, the path taken by the algorithm to reach the minima is
usually noisier than your typical Gradient Descent algorithm.
• But that doesn’t matter much: the exact path taken by the algorithm is
unimportant as long as we reach the minimum, and SGD typically does so
with a significantly shorter training time.
• The path taken by Batch Gradient Descent is smooth and direct, while
SGD's noisier path still reaches the minimum with a significantly shorter
training time. [Figure omitted: Batch Gradient Descent path]
Advantages of Stochastic Gradient Descent
• Memory Efficiency: Since SGD updates the parameters for each training
example one at a time, it is memory-efficient and can handle large
datasets that cannot fit into memory.
• Avoidance of Local Minima: Due to the noisy updates in SGD, it has the
ability to escape from local minima and move toward the global minimum.
Disadvantages of Stochastic Gradient Descent
• Noisy updates: The updates in SGD are noisy and have a high variance,
which can make the optimization process less stable and lead to
oscillations around the minimum.
• Slow Convergence: SGD may require more iterations to converge to the
minimum since it updates the parameters for each training example one at
a time.
• Sensitivity to Learning Rate: The choice of learning rate can be critical in
SGD since using a high learning rate can cause the algorithm to overshoot
the minimum, while a low learning rate can make the algorithm converge
slowly.
• Less Accurate: Due to the noisy updates, SGD may not converge to the
exact global minimum and can result in a suboptimal solution. This can
be mitigated by using techniques such as learning rate scheduling and
momentum-based updates (sketched below).
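As a brief sketch of the momentum idea (the 0.9 coefficient is a common but arbitrary default here):

def sgd_momentum(theta, grads, lr=0.01, beta=0.9):
    # Apply momentum-based SGD updates for a stream of per-example gradients
    velocity = 0.0
    for g in grads:
        velocity = beta * velocity - lr * g   # decaying average of past gradients
        theta = theta + velocity              # step along the smoothed direction
    return theta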
Confusion Matrix
a. Accuracy
b. Precision
c. Recall
d. F1-Score
• An example confusion matrix for a binary problem:
[[4 2]
[1 3]]
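A matrix like this can be produced with scikit-learn's confusion_matrix; the label vectors below are hypothetical, chosen only to reproduce these counts (rows are actual classes, columns are predicted classes):

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]   # actual labels
y_pred = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]   # predicted labels

print(confusion_matrix(y_true, y_pred))
# [[4 2]
#  [1 3]]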
Calculate Accuracy, Error, Precision, Recall and F1 Score
for the following Confusion Matrix
                       Actual Positive    Actual Negative
Predicted Positive           10                 10
Predicted Negative           25                 55
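As a quick check, the metrics follow directly from the table (TP = 10, FP = 10, FN = 25, TN = 55):

TP, FP, FN, TN = 10, 10, 25, 55

accuracy = (TP + TN) / (TP + FP + FN + TN)          # 65 / 100 = 0.65
error = 1 - accuracy                                # 0.35
precision = TP / (TP + FP)                          # 10 / 20 = 0.50
recall = TP / (TP + FN)                             # 10 / 35 ≈ 0.286
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.364

print(accuracy, error, precision, recall, f1)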
ROC Curve
• This curve plots two parameters: True Positive Rate and False
Positive Rate.
• With a ROC curve, you’re trying to find a good model that optimizes the
trade-off between the False Positive Rate (FPR) and True Positive Rate
(TPR). What counts here is how much area is under the curve (Area Under
the Curve = AUC).
• An ideal curve fills 100% of the area (AUC = 1), which means that you’re
going to be able to distinguish between negative results and positive
results 100% of the time (which is almost impossible in real life).
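A minimal sketch of computing the ROC curve and its AUC with scikit-learn (the true labels and scores below are hypothetical classifier outputs):

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.5]  # predicted P(y = 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", roc_auc_score(y_true, scores))  # 1.0 would be a perfect classifier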