Linear Models-Gradient Descent, Regularization (Introduction)

The document discusses gradient descent, an essential optimization algorithm used in machine learning to minimize cost functions and improve model predictions. It outlines different types of gradient descent algorithms, including Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent, highlighting their advantages, disadvantages, and best use cases. The document emphasizes the importance of gradient descent in optimizing model parameters for accurate predictions.


Linear Models

Gradient descent is a widely used optimization algorithm that is fundamental to many popular machine
learning algorithms. Put simply, gradient descent is a method for finding the minimum value of a
function, which is a critical task in machine learning.

• Gradient descent is an iterative optimization algorithm that is widely used in machine learning.

• At a high level, gradient descent is a method for finding the minimum value of a function by iteratively
adjusting the function's parameters based on the gradient (i.e., the direction of the steepest descent).

• By using gradient descent to minimize the cost function of a machine learning model, we can find the
best set of model parameters for accurate predictions.

• This means that it helps us find the best values for our model’s parameters so that our model can make
accurate predictions; a minimal sketch of this iterative procedure follows below.
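As a concrete illustration, here is a minimal Python sketch of gradient descent minimizing a simple one-dimensional function. The function f(x) = (x − 3)², the starting point, the learning rate, and the iteration count are illustrative assumptions, not values taken from these slides.

```python
# Minimal gradient descent sketch: minimize f(x) = (x - 3)^2.
# The function, starting point, learning rate, and iteration count
# are illustrative choices, not values from the slides.

def f(x):
    return (x - 3) ** 2

def grad_f(x):
    # Analytical derivative of f: d/dx (x - 3)^2 = 2 * (x - 3)
    return 2 * (x - 3)

x = 0.0              # initial guess
learning_rate = 0.1
for step in range(50):
    x = x - learning_rate * grad_f(x)   # move against the gradient

print(x)  # converges toward the minimizer x = 3
```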



Why Is Gradient Descent Necessary for Machine Learning?

• Gradient descent is necessary for optimizing and fine-tuning the parameters of a machine learning
model. In fact, it is the backbone of many machine learning algorithms.

• Gradient descent aims to minimize a model's cost or loss function, which measures how well the
model performs.

• By reducing the cost function, the model becomes better at making predictions and can generalize
better to new data. The gradient descent process involves iteratively adjusting the model parameters
based on the gradient of the cost function.

• This means that the model parameters move in the direction of steepest descent, toward the minimum
point of the cost function, following the update rule shown below. With gradient descent, finding the
optimal set of parameters for a given model becomes much easier, and the model's performance improves as a result.
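The iterative adjustment described above can be written as the standard gradient descent update rule. The symbols here are chosen for illustration (the coefficient notation b_j follows the notation used later in these slides), with α the learning rate and J(b) the cost function:

    b_j := b_j − α · ∂J(b) / ∂b_j,   applied to every parameter b_j at each iteration.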



Different Types of Gradient Descent Algorithms

Several types of gradient descent algorithms can be used to optimize a cost function. The most
common types are Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD), and Mini-Batch
Gradient Descent (MBGD), each discussed in the following slides.



• The objective of all supervised machine learning algorithms is to best estimate a target
function (f) that maps input data (X) onto output variables (Y).

• Some machine learning algorithms have coefficients that characterize the algorithm's estimate
of the target function (f).

• Different algorithms have different representations and different coefficients, but many of
them require a process of optimization to find the set of coefficients that result in the best
estimate of the target function.

• Examples of algorithms with coefficients that can be optimized using gradient descent are:
• Linear Regression
• Logistic Regression.



• Gradient descent can be slow to run on very large datasets.

• One iteration of the gradient descent algorithm requires a prediction for each instance in the
training dataset, so it can take a long time when you have many millions of instances.

• When you have large amounts of data, you can use a variation of gradient descent called stochastic
gradient descent.

• Here, a few samples are selected randomly instead of the whole dataset for each iteration. In
gradient descent, the term “batch” denotes the total number of samples from the dataset that are
used to calculate the gradient for each iteration.



1. Batch Gradient Descent (BGD)

• Batch gradient descent (BGD) is used to find the error for each point in the training set and update
the model after evaluating all training examples.

• This procedure is known as a training epoch.

• In simple words, it is a greedy approach where we have to sum over all examples for each update.

• Computes the gradient of the cost function using the entire dataset before updating the parameters.



Batch Gradient Descent (slide figure)

• The gradients for all training examples (features X1, X2, …, Xn) are summed before taking one step (one epoch).
• The update involves the number of examples, the number of features, and the learning rate.
• Because every example must be processed before each step, it can take a long time to reach the bottom (the minimum).
Pros:
• Stable convergence
• Guarantees reaching the global minimum for convex functions

Cons:
• Computationally expensive for large datasets
• Requires significant memory

Example: Suppose we are training a linear regression model with the Mean Squared Error (MSE) loss function, J(b) = (1/m) · Σᵢ₌₁..m (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)², where ŷ⁽ⁱ⁾ is the model's prediction for the i-th of m examples.

Batch Gradient Descent will update the model weights only after computing the gradients over the entire dataset; a minimal sketch follows below.
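Here is a minimal NumPy sketch of this example: batch gradient descent for linear regression with an MSE cost, where the weights are updated only once per full pass over the dataset. The synthetic data, learning rate, and epoch count are illustrative assumptions, not values from the slides.

```python
import numpy as np

# Illustrative synthetic data: y ≈ 2*x + 1 plus a little noise (not from the slides).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.1 * rng.normal(size=100)

# Add a bias column so the intercept is learned as an ordinary weight.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
w = np.zeros(Xb.shape[1])

learning_rate = 0.1
m = Xb.shape[0]

for epoch in range(500):
    predictions = Xb @ w                   # predictions for ALL examples
    errors = predictions - y
    gradient = (2.0 / m) * Xb.T @ errors   # gradient of the MSE over the entire dataset
    w -= learning_rate * gradient          # one update per full pass (one epoch)

print(w)  # approximately [1.0, 2.0] (intercept, slope)
```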



2. Stochastic Gradient Descent (SGD)

• Stochastic gradient descent (SGD) is a type of gradient descent that processes one training example per iteration.
• In other words, it updates the model parameters after processing each individual example in the dataset, one at a time.
• Because it requires only one training example at a time, it is easier to fit in the available memory.

• However, it can lose some computational efficiency compared to batch gradient descent, because the frequent per-example updates cannot exploit vectorized computation over the whole dataset.

• Further, because each update is based on a single example, the gradient estimate is noisy.

• However, this noise can sometimes be helpful for escaping local minima and finding the global minimum.

In short, SGD updates the parameters using only one training example at a time.





Pros:
• Faster computation
• Can escape local minima due to noise in updates
• Good for online learning and real-time applications

Cons:
• High variance in updates, leading to instability
• Does not guarantee convergence to the global minimum

Example: In a logistic regression model for binary classification, SGD updates weights after
processing each individual sample rather than the entire dataset.
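A minimal NumPy sketch of this example follows: logistic regression trained with SGD, where the weights are updated after every individual sample. The synthetic data, learning rate, and epoch count are illustrative assumptions, not values from the slides.

```python
import numpy as np

# Illustrative synthetic binary classification data (not from the slides).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # label defined by a linear boundary

w = np.zeros(2)
b = 0.0
learning_rate = 0.05

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(10):
    # Randomly shuffle the dataset, then update after EACH example.
    for i in rng.permutation(len(X)):
        p = sigmoid(X[i] @ w + b)           # predicted probability for one sample
        error = p - y[i]                    # gradient of the log loss w.r.t. the logit
        w -= learning_rate * error * X[i]   # per-sample (stochastic) weight update
        b -= learning_rate * error

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(accuracy)
```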



Batch vs Stochastic Gradient Descent (slide figure)

• Batch Gradient Descent: sums the gradient over all examples (features X1, X2, …, Xn) before modifying the coefficients; the update involves the number of examples, the number of features, and the learning rate α.

• Stochastic Gradient Descent: randomly shuffle the dataset, repeat the steps for every example, and modify the coefficients at every step:

    b_j := b_j − α · ∂f(cᵢ) / ∂b_j

  where f(cᵢ) is the cost of the i-th training example.

3. Mini-Batch Gradient Descent (MBGD)

• Mini-batch gradient descent is a combination of batch gradient descent and stochastic
gradient descent.

• It divides the training dataset into small batches and then performs an update on each
batch separately.

• Splitting the training dataset into smaller batches strikes a balance between the computational
efficiency of batch gradient descent and the speed of stochastic gradient descent.

• Hence, we achieve a variant of gradient descent with higher computational efficiency
and a less noisy gradient estimate.



A compromise between BGD and SGD, where the dataset is divided into small batches, and
updates are performed on each batch:

    w := w − α · (1/b) · Σ ∂cost(w; x⁽ⁱ⁾, y⁽ⁱ⁾) / ∂w   (sum taken over the b examples in the current batch)

where b is the batch size and w denotes the model weights.

Pros:
• Reduces variance compared to SGD, leading to more stable convergence
• Computationally more efficient than BGD for large datasets

Cons:
• Still requires tuning of batch size
• Can be affected by suboptimal batch selection

Example: When training a deep learning model using TensorFlow or PyTorch, mini-batches (e.g., batch
size = 32, 64) are commonly used to balance computational efficiency and convergence speed.
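As a concrete sketch of this example, the minimal PyTorch training loop below uses a DataLoader with batch_size=32, so the weights are updated once per mini-batch. The synthetic data, model, and hyperparameters are illustrative assumptions, not values from the slides.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Illustrative synthetic regression data (not from the slides).
torch.manual_seed(0)
X = torch.randn(1000, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(1000, 1)

# Mini-batches of 32 examples; one parameter update per batch.
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(5):
    for xb, yb in loader:                # iterate over mini-batches
        optimizer.zero_grad()            # clear gradients from the previous batch
        loss = loss_fn(model(xb), yb)    # MSE on this mini-batch only
        loss.backward()                  # compute gradients for this batch
        optimizer.step()                 # update weights using the batch gradient
    print(epoch, loss.item())
```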
Summary

| Feature | Batch Gradient Descent (BGD) | Stochastic Gradient Descent (SGD) | Mini-Batch Gradient Descent (MBGD) |
| --- | --- | --- | --- |
| Update Frequency | After computing gradients for the entire dataset | After each training example | After a small batch of examples |
| Computational Efficiency | Slow (requires processing entire dataset per step) | Fast (updates after each example) | Moderate (updates after each batch) |
| Memory Requirement | High (stores entire dataset in memory) | Low (only one sample at a time) | Moderate (stores a small batch) |
| Convergence Speed | Slow | Fast, but noisy | Balanced |
| Stability | Stable, moves smoothly toward the minimum | Highly unstable, fluctuates a lot | More stable than SGD, but less than BGD |
| Likelihood of Getting Stuck in Local Minima | High for non-convex functions | Low due to randomness | Moderate |
| Best Use Cases | Small datasets, convex functions | Large datasets, online learning, real-time applications | Deep learning, large datasets |
| Example Use Case | Linear regression, simple ML models | Online ad click prediction, recommendation systems | Training deep neural networks (CNNs, RNNs) |


Regularization



