Linear Models-Gradient Descent, Regularization (Introduction)
• Gradient descent is an iterative optimization algorithm that is widely used in machine learning.
• At a high level, gradient descent is a method for finding the minimum value of a function by iteratively
adjusting the function's parameters based on the gradient (i.e., the direction of the steepest descent).
• By using gradient descent to minimize the cost function of a machine learning model, we can find the set of parameter values that allows the model to make accurate predictions.
• Gradient descent is necessary for optimizing and fine-tuning the parameters of a machine learning
model. In fact, it is the backbone of many machine learning algorithms.
• Gradient descent aims to minimize a model's cost or loss function, which measures how well the
model performs.
• By reducing the cost function, the model becomes better at making predictions and can generalize
better to new data. The gradient descent process involves iteratively adjusting the model parameters
based on the gradient of the cost function.
• This means the parameters move in the direction of steepest descent, toward the minimum point of the cost
function. Gradient descent thus gives a practical way to find a good set of parameters for a given model,
and the model's performance improves as a result.
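• In symbols, each iteration moves every parameter a small step against the gradient of the cost function J, with learning rate α (using the same notation as the per-example update rule shown later for stochastic gradient descent):
    bj := bj − α · ∂J/∂bj,   for every parameter bj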
Several types of gradient descent algorithms can be used to optimize a cost function; the most common are
batch, stochastic, and mini-batch gradient descent, each described below.
• Some machine learning algorithms have coefficients that characterize the algorithm's estimate
of the target function (f).
• Different algorithms have different representations and different coefficients, but many of
them require a process of optimization to find the set of coefficients that result in the best
estimate of the target function.
• Examples of algorithms with coefficients that can be optimized using gradient descent are:
• Linear Regression
• Logistic Regression.
• Because one iteration of the gradient descent algorithm requires a prediction for every instance in the
training dataset, it can take a long time when you have many millions of instances.
• When you have large amounts of data, you can use a variation of gradient descent called stochastic
gradient descent.
• In this variation, a few samples are selected randomly for each iteration instead of the whole dataset. In
gradient descent, the term “batch” denotes the total number of samples from the dataset used to
calculate the gradient in each iteration.
1. Batch Gradient Descent (BGD)
• Batch gradient descent (BGD) computes the error for each point in the training set and updates
the model only after evaluating all training examples.
• In simple words, each update requires summing over all training examples.
• It computes the gradient of the cost function using the entire dataset before updating the parameters.
Cons:
• Computationally expensive for large datasets
• Requires significant memory
Example: Suppose we are training a linear regression model with Mean Squared Error (MSE) loss function:
Batch Gradient Descent will update the model weights only after computing the gradients over the entire dataset.
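A minimal NumPy sketch of this setup (the function name, learning rate, and synthetic data below are illustrative, not taken from the course material); note that the gradient is computed over all m examples before a single weight update is made:

import numpy as np

# Sketch: batch gradient descent for linear regression with MSE loss.
# The gradient is computed over the ENTIRE dataset before each update.
def batch_gradient_descent(X, y, lr=0.05, epochs=500):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        y_pred = X @ w + b                  # predictions for all m examples
        error = y_pred - y
        grad_w = (2.0 / m) * (X.T @ error)  # dMSE/dw over the whole dataset
        grad_b = (2.0 / m) * error.sum()    # dMSE/db over the whole dataset
        w -= lr * grad_w                    # one update per full pass over the data
        b -= lr * grad_b
    return w, b

# Usage on synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.7 + rng.normal(scale=0.1, size=200)
w, b = batch_gradient_descent(X, y)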
2. Stochastic Gradient Descent (SGD)
• Stochastic gradient descent (SGD) is a type of gradient descent that processes one training example per iteration.
• In other words, within each training epoch it updates the model parameters after every individual example
in the dataset.
• Because it requires only one training example at a time, it is easy to fit in memory.
• However, it loses some computational efficiency compared to batch gradient descent, since the frequent
updates are noisy and more iterations may be needed to converge.
• On the other hand, this noise can sometimes help the algorithm escape local minima and reach the global minimum.
Cons:
• High variance in updates, leading to instability
• Does not guarantee convergence to the global minimum
Example: In a logistic regression model for binary classification, SGD updates weights after
processing each individual sample rather than the entire dataset.
Stochastic gradient descent update rule (m = number of examples, n = number of features, α = learning rate):
Randomly shuffle the dataset
Repeat for every example i = 1, ..., m:
    bj := bj − α · ∂/∂bj Cost(b; x(i), y(i))   for each coefficient j = 0, ..., n (modify the coefficient at every step)
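A rough NumPy sketch of the rule above (the function name, hyperparameters, and the assumption that y holds 0/1 labels are ours, not from the course material); the coefficients are modified after every individual example:

import numpy as np

# Sketch: stochastic gradient descent for binary logistic regression.
# The coefficients are updated after every individual training example.
def sgd_logistic_regression(X, y, lr=0.1, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        for i in rng.permutation(m):        # randomly shuffle the dataset
            z = X[i] @ w + b
            p = 1.0 / (1.0 + np.exp(-z))    # sigmoid prediction for one sample
            grad = p - y[i]                 # log-loss gradient w.r.t. z (y[i] is 0 or 1)
            w -= lr * grad * X[i]           # modify the coefficients at every step
            b -= lr * grad
    return w, b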
3. Mini-Batch Gradient Descent (MBGD)
• Mini-batch gradient descent combines batch gradient descent and stochastic gradient descent.
• It divides the training dataset into small batches and performs an update after each batch.
• Splitting the training dataset into smaller batches strikes a balance between the computational
efficiency of batch gradient descent and the speed of stochastic gradient descent.
• Hence, we obtain a form of gradient descent with higher computational efficiency and less noisy
gradient estimates.
Pros:
• Reduces variance compared to SGD, leading to more stable convergence
• Computationally more efficient than BGD for large datasets
Cons:
• Still requires tuning of the batch size
• Can be affected by suboptimal batch selection
Example: When training a deep learning model using TensorFlow or PyTorch, mini-batches (e.g., batch
size = 32, 64) are commonly used to balance computational efficiency and convergence speed.
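A minimal PyTorch sketch of this pattern (the data, model, and hyperparameters below are placeholders): the DataLoader serves shuffled mini-batches of 32 samples, and the optimizer takes one step per batch:

import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

# Sketch: mini-batch gradient descent in PyTorch.
# Placeholder data: 1000 samples, 10 features, a single linear output.
X = torch.randn(1000, 10)
y = torch.randn(1000, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:          # each xb, yb is one shuffled mini-batch of 32 samples
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()            # gradients computed from this mini-batch only
        optimizer.step()           # one parameter update per mini-batch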
Summary