Unit - 3 - ML - 24

This document covers supervised learning with a focus on regression techniques, including bias, variance, underfitting, and overfitting. It explains linear regression, its types, and its mathematical representation, as well as regularization techniques such as Lasso and Ridge regression, and performance metrics such as MAE, RMSE, and R². It also discusses optimization algorithms such as Batch and Stochastic Gradient Descent.

Unit - 3

Supervised Learning: Regression (06 Hrs)


• Bias, Variance, Generalization, Underfitting, Overfitting
• Linear regression; Lasso regression, Ridge regression; Gradient descent algorithm
• Evaluation Metrics: MAE, RMSE, R²

Dr. Rupali Pawar


Bias and Variance
• Bias:
While making predictions, a difference occurs between the values predicted by the model and the actual/expected values; this difference is known as the bias error, or error due to bias.
Bias is the distance between the average prediction of the model and the truth.
• Variance:
Variance is the difference between the predictions of different models (trained on different samples of the data) for the same data point.
Variance errors are classified as either low variance or high variance.

Underfitting
• In the case of underfitting, the model is not able to learn enough from the training data; as a result, accuracy is reduced and the model produces unreliable predictions.
• An underfitted model has high bias and low variance.
How to avoid underfitting:
• By increasing the training time of the model.
• By increasing the number of features.

Overfitting and how to reduce overfitting

What is overfitting?
• Building a model that matches the training data "too closely", generating a complex model.

Why does it occur?
• Evaluating a model by testing it on the same data that was used to train it.
• Creating a model that is "too complex".

What is the impact of overfitting?
• The model will do well on the training data, but won't generalize to out-of-sample data, i.e. to test data.
• The model will have low bias, but high variance.

How to reduce overfitting
• Cross-validation: by splitting the data into training and testing sets multiple times, cross-validation can help identify whether a model is overfitting or underfitting, and can be used to tune hyperparameters to reduce variance (see the sketch after this list).
• Feature selection: choosing only the relevant features decreases the model's complexity and can reduce the variance error.
• Regularization: L1 or L2 regularization can be used to reduce variance in machine learning models.
• Ensemble methods: combining multiple models improves generalization performance. Bagging, boosting, and stacking are common ensemble methods that can help reduce variance and improve generalization.
• Simplifying the model: reducing the complexity of the model, such as decreasing the number of parameters or layers in a neural network, also helps reduce variance and improve generalization performance.
• Early stopping: early stopping prevents overfitting by stopping the training of a deep learning model when its performance on the validation set stops improving.
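As an illustrative sketch (not part of the original slides), the snippet below uses cross-validation to expose an over-complex model; the synthetic data, polynomial degrees, and number of folds are assumed values chosen only for demonstration.

```python
# Minimal sketch (not from the slides): detecting overfitting with cross-validation.
# The synthetic data, polynomial degrees, and number of folds are illustrative choices.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, size=30)

for degree in (1, 3, 15):                      # low, moderate, high model complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"degree={degree:2d}  mean CV R^2 = {scores.mean():.3f}")

# A high-degree model that fits the training data "too closely" will usually
# show a worse cross-validated R^2 than a moderate-degree model.
```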

Bias Variance Tradeoff

Linear Regression

• Linear regression is one of the easiest and most popular machine learning algorithms. It is a statistical method that is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc.
• The linear regression algorithm models a linear relationship between a target or dependent variable (y) and one or more independent variables (x), hence the name linear regression. Because the relationship is linear, the model describes how the value of the dependent variable changes with the value of the independent variable.
• The linear regression model provides a sloped straight line representing the relationship between the variables.


Types of Linear Regression
• Linear regression can be further divided into two types of algorithm:
• Simple Linear Regression: if a single independent variable is used to predict the value of a numerical dependent variable, then such a linear regression algorithm is called Simple Linear Regression.
• Multiple Linear Regression: if more than one independent variable is used to predict the value of a numerical dependent variable, then such a linear regression algorithm is called Multiple Linear Regression.
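As a quick illustration of the two types (not from the original slides), the sketch below fits one simple and one multiple linear regression model with scikit-learn; the tiny data set, meant to mimic hours studied and attendance versus marks, is invented.

```python
# Minimal sketch (invented data): simple vs. multiple linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple linear regression: one independent variable (e.g. hours studied -> marks).
X_simple = np.array([[1], [2], [3], [4], [5]])
y = np.array([35, 45, 50, 62, 70])
simple = LinearRegression().fit(X_simple, y)
print("simple:   intercept a0 =", simple.intercept_, " slope a1 =", simple.coef_)

# Multiple linear regression: two independent variables (e.g. hours studied, attendance %).
X_multi = np.array([[1, 60], [2, 65], [3, 70], [4, 80], [5, 90]])
multi = LinearRegression().fit(X_multi, y)
print("multiple: intercept =", multi.intercept_, " coefficients =", multi.coef_)
```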



Linear Regression

• The linear regression model provides a sloped straight line representing the relationship between the variables.
• Mathematical representation:
y = a0 + a1x + ε

y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error
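A minimal sketch (with invented data) of estimating a0 and a1 follows; it simply applies the standard ordinary least-squares formulas for the slope and intercept.

```python
# Minimal sketch (invented data): estimating a0 and a1 by ordinary least squares.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# a1 = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2),  a0 = y_mean - a1 * x_mean
a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()

y_pred = a0 + a1 * x                 # points on the fitted straight line
residuals = y - y_pred               # ε: the part of y the line does not explain
print(f"a0 = {a0:.3f}, a1 = {a1:.3f}")
```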



Linear Regression

• Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is termed a positive linear relationship.

• Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is called a negative linear relationship.


Best Fit Line
• E = Y − Y`
where E denotes the prediction error or residual error,
Y` denotes the predicted value, and
Y denotes the actual value.
• A line that fits the data "best" will be one for which the prediction errors (one for each data point) are as small as possible.
• The fitted line itself is Y` = A + bX, where Y` denotes the predicted value, b denotes the slope of the line, X denotes the independent variable, and A is the Y intercept.

Finding the best fit line
When working with a linear regression model, our main goal is to find the best fit line, which means the error between the predicted values and the actual values should be minimized. The best fit line will have the least error.
Different values of the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best fit line; to calculate them we use the cost function.
• Cost function: J(a0, a1) = (1/n) Σ (yi − (a0 + a1xi))², summed over the n data points, i.e. the mean squared error between the actual values yi and the predicted values a0 + a1xi.
• Different values of the weights or coefficients (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best fit line.
• The cost function optimizes the regression coefficients or weights; it measures how well a linear regression model is performing.



Gradient Descent Algorithm
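The worked derivation and plots from this slide are not reproduced here. As a minimal sketch, the code below applies batch gradient descent to the cost function J(a0, a1) defined above for simple linear regression; the data, learning rate, and iteration count are illustrative assumptions.

```python
# Minimal sketch: batch gradient descent for simple linear regression y = a0 + a1*x.
# The data, learning rate, and iteration count are illustrative assumptions.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

a0, a1 = 0.0, 0.0          # initial guesses for intercept and slope
alpha = 0.01               # learning rate
n = len(x)

for _ in range(5000):
    y_pred = a0 + a1 * x
    error = y_pred - y
    # Partial derivatives of J(a0, a1) = (1/n) * sum((y_pred - y)^2)
    grad_a0 = (2.0 / n) * np.sum(error)
    grad_a1 = (2.0 / n) * np.sum(error * x)
    a0 -= alpha * grad_a0
    a1 -= alpha * grad_a1

cost = np.mean((a0 + a1 * x - y) ** 2)
print(f"a0 = {a0:.3f}, a1 = {a1:.3f}, final cost J = {cost:.4f}")
```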
Linear Regression Numerical



Numerical 1 Method -1 SPPU-Nov_Dec_22



Numerical 1 Method 2



Numerical 2 Method -1 SPPU-Nov_24



Numerical 3 Method -1 SPPU-





Applications of Linear Regression

• Marks scored by students based on the number of hours studied (ideally): here the marks scored in the exam are the dependent variable and the number of hours studied is the independent variable.
• Predicting crop yields based on the amount of rainfall: yield is the dependent variable while the measure of precipitation is the independent variable.
• Predicting the salary of a person based on years of experience: experience becomes the independent variable while salary turns into the dependent variable.
• Weather forecasting.



Logistic Regression
• Logistic regression is one of the most popular machine learning algorithms, and it comes under the supervised learning technique. It is used for predicting a categorical dependent variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc.; but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.



Sigmoid / logistic function
• The sigmoid (logistic) function is a mathematical function used to map the predicted values to probabilities: hθ(x) = 1 / (1 + e^(−z)), where z = θ0 + θ1x.
• It maps any real value into another value within the range 0 to 1.
• The output of logistic regression must lie between 0 and 1 and cannot go beyond this limit, so it forms a curve like the "S" shape. This S-shaped curve is called the sigmoid function or the logistic function.
• In logistic regression we use the concept of a threshold value, which decides between the classes 0 and 1: values above the threshold tend to 1, and values below the threshold tend to 0.
Cost Function for Logistic Regression
• J(θ0, θ1) = (1/m) Σ [ −y log(hθ(x)) − (1 − y) log(1 − hθ(x)) ], summed over the m training examples
• for y = 0: cost = − log(1 − hθ(x))
• for y = 1: cost = − log(hθ(x))
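As an illustrative sketch (not from the slides), the snippet below implements the sigmoid hypothesis hθ(x), applies a 0.5 threshold, and evaluates the cross-entropy cost; the data and the parameter values θ0, θ1 are assumed for demonstration.

```python
# Minimal sketch: sigmoid hypothesis and logistic (cross-entropy) cost.
# The data and the parameter values theta0, theta1 are illustrative assumptions.
import numpy as np

def h(x, theta0, theta1):
    """hθ(x) = 1 / (1 + e^(-z)) with z = θ0 + θ1*x."""
    z = theta0 + theta1 * x
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 1, 1])                # categorical (0/1) target

theta0, theta1 = -4.0, 2.0                      # assumed parameter values
p = h(x, theta0, theta1)                        # probabilities in (0, 1)
labels = (p >= 0.5).astype(int)                 # threshold at 0.5

# J(θ0, θ1) = (1/m) * Σ [ -y*log(hθ(x)) - (1-y)*log(1 - hθ(x)) ]
J = np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))
print("probabilities:", np.round(p, 3), "labels:", labels, "cost J =", round(J, 4))
```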



Ridge and Lasso Regression
• Ridge and Lasso regression are simple techniques to reduce model complexity and prevent the over-fitting that may result from plain linear regression.
• Ridge regression, also known as L2 regularization, is used for multiple linear regression when the predictors are multicollinear. It adds a penalty proportional to the sum of the squared coefficients, shrinking all of them towards zero (but not exactly to zero).

• Lasso stands for Least Absolute Shrinkage and Selection Operator, and is also known as L1 regularization. It adds a penalty proportional to the sum of the absolute values of the coefficients, shrinking some of them exactly to zero, so it also performs feature selection.
• Lasso is applied when the model is overfitted or is facing computational challenges due to many features.
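A minimal scikit-learn sketch of the two penalties follows (not from the slides); the synthetic data and the regularization strength alpha are illustrative assumptions.

```python
# Minimal sketch: Ridge (L2) and Lasso (L1) regression with scikit-learn.
# Synthetic data and the regularization strength alpha are illustrative choices.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
# Only the first two features actually matter; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=50)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("Lasso", Lasso(alpha=0.1))]:
    model.fit(X, y)
    print(f"{name:6s} coefficients:", np.round(model.coef_, 3))

# Ridge shrinks all coefficients towards zero; Lasso can set some of them
# exactly to zero, effectively performing feature selection.
```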



Performance Metrics
• For linear regression we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values.
• For the linear equation y = a0 + a1x, MSE can be calculated as:
MSE = (1/N) Σ (Yi − (a1xi + a0))²

N = total number of observations
Yi = actual value
(a1xi + a0) = predicted value
• Other metrics are:
• R squared (R²)
• Adjusted R squared
• RMSE: Root Mean Squared Error
• MAE: Mean Absolute Error
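As a sketch (not from the slides), the listed metrics can be computed directly with scikit-learn; the actual and predicted values below are invented for illustration.

```python
# Minimal sketch (invented values): MSE, RMSE, MAE and R^2 for a regression model.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])   # actual values Yi
y_pred = np.array([2.8, 5.4, 7.1, 9.6, 10.5])   # predicted values a1*xi + a0

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R^2={r2:.3f}")
```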
Batch Gradient Descent and Stochastic Gradient
Descent

Batch Gradient Descent involves calculations over the full training set at
each step, which is very slow on very large training data. Thus, it
becomes very computationally expensive to do Batch GD. However, this
is great for convex or relatively smooth error manifolds. Also, Batch GD
scales well with the number of features.
Batch Gradient Descent and Stochastic
Gradient Descent
• Stochastic Gradient Descent tries to solve the main problem with Batch Gradient Descent, namely the use of the whole training data to calculate the gradients at each step. SGD is stochastic in nature, i.e. it picks a "random" instance of the training data at each step and then computes the gradient, making it much faster since there is far less data to manipulate at a single time, unlike Batch GD.
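To make the contrast concrete, here is a minimal sketch (not from the slides) of one Batch GD step versus one SGD epoch for the linear model y = a0 + a1x; the data and learning rate are assumed values.

```python
# Minimal sketch: one Batch GD step vs. one SGD epoch for y = a0 + a1*x.
# The data and learning rate are illustrative assumptions.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
alpha = 0.01

# Batch GD: one update per step, computed over the whole training set.
a0, a1 = 0.0, 0.0
error = (a0 + a1 * x) - y
a0 -= alpha * (2 / len(x)) * np.sum(error)
a1 -= alpha * (2 / len(x)) * np.sum(error * x)

# SGD: one update per training sample, visited in shuffled order each epoch.
b0, b1 = 0.0, 0.0
rng = np.random.default_rng(0)
for i in rng.permutation(len(x)):          # shuffle the training set for the epoch
    err_i = (b0 + b1 * x[i]) - y[i]
    b0 -= alpha * 2 * err_i
    b1 -= alpha * 2 * err_i * x[i]

print("batch step:", round(a0, 4), round(a1, 4), "| sgd epoch:", round(b0, 4), round(b1, 4))
```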



Batch Gradient Descent and Stochastic
Gradient Descent
• Batch gradient descent and stochastic gradient descent are both optimization algorithms used to
minimize the cost function in machine learning models, such as linear regression and neural
networks. The main differences between the two are:
• Data Processing Approach:
Batch gradient descent computes the gradient of the cost function with respect to the model
parameters using the entire training dataset in each iteration. Stochastic gradient descent, on the
other hand, computes the gradient using only a single training example or a small subset of examples
in each iteration.
• Convergence Speed:
Batch gradient descent takes longer to converge since it computes the gradient over the entire training dataset in each iteration. Stochastic gradient descent updates the model parameters after processing each example, which can lead to faster convergence.



• Convergence Accuracy:
Batch gradient descent is more accurate since it computes the gradient using the entire training dataset.
Stochastic gradient descent, on the other hand, can be less accurate since it computes the gradient using
a subset of examples, which can introduce more noise and variance in the gradient estimate.
• Computation and Memory Requirements:
Batch gradient descent requires more computation and memory since it needs to process the entire
training dataset in each iteration. Stochastic gradient descent, on the other hand, requires less
computation and memory since it only needs to process a single example or a small subset of examples
in each iteration.
• Optimization of Non-Convex Functions:
Stochastic gradient descent is more suitable for optimizing non-convex functions since it can escape
local minima and find the global minimum. Batch gradient descent, on the other hand, can get stuck in
local minima.



Batch Gradient Descent vs. Stochastic Gradient Descent

• Batch GD computes the gradient using the whole training set; SGD computes the gradient using a single training sample.
• Batch GD is a slow and computationally expensive algorithm; SGD is faster and less computationally expensive than Batch GD.
• Batch GD is not suggested for huge training samples; SGD can be used for large training samples.
• Batch GD is deterministic in nature; SGD is stochastic in nature.
• Batch GD gives the optimal solution, given sufficient time to converge; SGD gives a good solution, but not necessarily the optimal one.
• Batch GD requires no random shuffling of points; for SGD the data samples should be in a random order, which is why we shuffle the training set for every epoch.
• Batch GD can't escape shallow local minima easily; SGD can escape shallow local minima more easily.
• Batch GD converges slowly; SGD reaches convergence much faster.
• Batch GD updates the model parameters only after processing the entire training set; SGD updates the parameters after each data point.
• In Batch GD the learning rate is fixed during training; in SGD the learning rate can be adjusted dynamically.
• Batch GD may suffer from overfitting if the model is too complex for the dataset; SGD can help reduce overfitting by updating the model parameters more frequently.

