SGD
SGD
G.R.A.D.I.E.N.T.
4b. Stochastic Gradient Descent (SGD) and conventional Gradient Descent (GD) are
both optimization algorithms used in training machine learning models, but they differ
in their approach to updating model parameters. Here’s when each one is typically
used:
Gradient Descent (GD):
Batch Processing: GD computes the gradient of the loss function with respect to the
model parameters over the entire training dataset, then updates the parameters once
per epoch (one full pass through the dataset).
Suitable for Small Datasets: GD works well when the dataset fits entirely in memory
and is not too large, since each parameter update is based on the exact average
gradient computed across the whole dataset.
Advantages: It typically converges to the global minimum (for convex problems) or a
good local minimum (for non-convex problems) more reliably, because every step
uses exact gradient information (see the sketch below).
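To make the GD update concrete, here is a minimal sketch of full-batch gradient
descent for a mean-squared-error linear model in NumPy. The function name,
learning rate, and epoch count are illustrative assumptions, not part of any particular
library.

```python
import numpy as np

def gradient_descent(X, y, lr=0.05, epochs=100):
    """Full-batch gradient descent on a mean-squared-error linear model.
    Exactly one parameter update per epoch, using the gradient averaged
    over the whole dataset."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        # Average gradient of the squared error over every training example
        grad = (2.0 / n_samples) * X.T @ (X @ w - y)
        w -= lr * grad  # one precise update per full pass over the data
    return w

# Tiny usage example with noise-free synthetic data (y = 3 * x)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 6.0, 9.0, 12.0])
print(gradient_descent(X, y))  # approaches [3.]
```

Because the whole dataset is touched before each update, the trajectory is smooth,
but each step costs a full pass over the data.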
Stochastic Gradient Descent (SGD):
Online Learning: SGD updates the model parameters incrementally, computing the
gradient from a single training example (or a small mini-batch) at a time rather than
from the full dataset.
Large Datasets: SGD is particularly useful for large datasets that cannot fit into
memory, since it makes iterative updates without loading the entire dataset into
memory at once.
Advantages: Each update is far cheaper than a full GD step, so SGD often makes
faster overall progress per unit of computation, even though the individual updates
are noisier (a sketch follows this block).
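For contrast with the batch version above, here is a minimal sketch of per-example
SGD for the same mean-squared-error objective; the names and hyperparameter
values are again illustrative assumptions.

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=20, seed=0):
    """Stochastic gradient descent: one parameter update per training
    example, using a noisy single-example gradient estimate."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        for i in rng.permutation(n_samples):  # visit examples in random order
            xi, yi = X[i], y[i]
            grad = 2.0 * xi * (xi @ w - yi)   # gradient from one example only
            w -= lr * grad                    # many cheap, noisy updates per epoch
    return w
```

Shuffling each epoch keeps the single-example gradients from following a fixed,
possibly biased order; in a true streaming setting the inner loop would simply
consume examples as they arrive.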
When to Use SGD vs. GD:
GD: Use GD when you have a small to moderate-sized dataset that can fit into
memory and when you want to ensure precise updates based on the entire dataset.
It's also suitable for situations where you want a smoother convergence trajectory
towards the minimum.
SGD: Use SGD when dealing with large datasets or when implementing online
learning, where the model must be updated continuously as new data arrives. It is
also beneficial when computational resources are limited, since each update is
cheap and can be made frequently.
Variants:
Mini-Batch SGD: A compromise between GD and SGD, where each update is
computed from a small batch of examples. It combines the benefits of both
approaches, reducing the variance of the parameter updates compared to pure SGD
while remaining computationally efficient (a sketch is given after this list).
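Under the same illustrative setup as the sketches above, mini-batch SGD only
changes how many examples feed each gradient estimate; the batch size below is an
assumed default and is commonly tuned in practice.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.05, epochs=20, batch_size=32, seed=0):
    """Mini-batch SGD: average the gradient over a small random batch,
    trading gradient noise against per-update cost."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)        # reshuffle once per epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient averaged over just this batch of examples
            grad = (2.0 / len(idx)) * Xb.T @ (Xb @ w - yb)
            w -= lr * grad
    return w
```

Setting batch_size to 1 recovers plain SGD, while setting it to the dataset size
recovers full-batch GD, which is why mini-batching is described as a compromise
between the two.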
In practice, the choice between GD and SGD (or its variants) depends on the specific
problem, the size of the dataset, computational resources available, and desired
convergence properties of the optimization process.
DOWNSIDES to SGD:
Due to its stochastic nature, SGD works with noisy gradient estimates, so the loss
fluctuates from update to update and the parameters tend to oscillate around a
minimum rather than settling on it exactly (a decaying learning rate is the usual
remedy). On non-convex problems it is also not guaranteed to reach the global
minimum of the loss function and often settles for a good local minimum, which may
or may not be optimal.