GD Compare

The document discusses various gradient descent algorithms including Batch Gradient Descent, Stochastic Gradient Descent, Mini-Batch Gradient Descent, Momentum-Based Gradient Descent, Nesterov Accelerated Gradient, Adagrad, RMSProp, and Adam. Each algorithm is described in terms of its working mechanism, advantages, disadvantages, and suitable use cases. The conclusion highlights the trade-offs between stability and speed across these methods, with Mini-Batch GD being the most commonly used and Adam noted for its robustness and fast convergence.

1. Batch Gradient Descent (Vanilla GD)

• How it works: Computes the gradient of the cost function with respect to the parameters over the entire training dataset and updates the parameters once per epoch (see the sketch below).

• Advantages:

o Stable convergence due to less noisy updates.

o Straight trajectory towards the minimum.

o Converges to the global minimum for convex cost functions (given a suitable learning rate).

• Disadvantages:

o Slow for large datasets because it requires a full pass through the data for each
update.

o Memory-intensive for large datasets.

o Can get stuck in local minima for non-convex functions.

• Use case: Suitable for small to medium-sized datasets.
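
To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent for a least-squares linear model. The data X, targets y, squared-error loss, and hyperparameter values are illustrative assumptions, not part of the document.

```python
import numpy as np

def batch_gd(X, y, lr=0.01, epochs=100):
    """Illustrative batch GD for linear regression (assumed mean-squared-error loss)."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / n  # gradient over the ENTIRE dataset
        w -= lr * grad                # exactly one parameter update per epoch
    return w
```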

2. Stochastic Gradient Descent (SGD)

• How it works: Updates the parameters for each training example, one at a time, so each individual update is much cheaper than in Batch GD (see the sketch below).

• Advantages:

o Faster updates and convergence compared to Batch GD.

o Can escape local minima due to the noisy updates.

o Memory-efficient as it processes one example at a time.

• Disadvantages:

o Noisy updates can lead to oscillations and slower convergence.

o May not converge to the global minimum due to the high variance in updates.

• Use case: Suitable for large datasets and online learning scenarios.
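
A minimal sketch of stochastic updates, reusing the assumed linear-regression setup from the batch example; the per-epoch shuffling and learning rate are illustrative choices.

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=10):
    """Illustrative SGD: one noisy update per training example."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):  # visit examples in random order
            grad = X[i] * (X[i] @ w - y[i])      # gradient of ONE example
            w -= lr * grad                       # immediate, noisy update
    return w
```
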
3. Mini-Batch Gradient Descent

• How it works: A compromise between Batch GD and SGD. It updates the parameters using a small batch of training examples (e.g., 32, 64, or 128) instead of the entire dataset or a single example (see the sketch below).

• Advantages:

o Less noisy updates compared to SGD, leading to more stable convergence.

o Faster than Batch GD as it processes smaller batches.

o Memory-efficient and can leverage vectorized operations for faster computation.

• Disadvantages:

o Requires tuning of the batch size.

o Still susceptible to local minima for non-convex functions.

• Use case: Most commonly used in practice, especially for training deep neural networks.
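
A minimal sketch under the same assumed linear-regression setup; batch_size is the hyperparameter that needs tuning, and 32 here is just a common default.

```python
import numpy as np

def minibatch_gd(X, y, lr=0.01, epochs=10, batch_size=32):
    """Illustrative mini-batch GD: one vectorized update per mini-batch."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]           # indices of one mini-batch
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)  # vectorized batch gradient
            w -= lr * grad
    return w
```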

4. Momentum-Based Gradient Descent

• How it works: Adds a momentum term to the update rule, which accumulates the gradients of past steps to accelerate convergence. The update is influenced by both the current gradient and the history of gradients (see the sketch below).

• Advantages:

o Accelerates convergence by accumulating momentum in the direction of consistent gradients.

o Helps escape local minima and flat regions.

o Smoother optimization path compared to standard GD.

• Disadvantages:

o Can overshoot the minimum if the momentum term is too high.

o May oscillate around the minimum.

• Use case: Useful for optimizing complex, non-convex loss functions.
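
One common formulation of the momentum update, sketched below; grad_fn is an assumed callable that returns the gradient at the current parameters, and beta = 0.9 is a typical (not prescribed) coefficient.

```python
import numpy as np

def momentum_gd(grad_fn, w, lr=0.01, beta=0.9, steps=100):
    """Illustrative momentum update (one common formulation)."""
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad_fn(w)  # accumulate past gradients
        w = w - lr * v             # step along the accumulated direction
    return w
```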


5. Nesterov Accelerated Gradient (NAG)

• How it works: A variant of Momentum-based GD that "looks ahead" by computing the gradient at the estimated future position (based on the momentum term) before making the update (see the sketch below).

• Advantages:

o Faster convergence compared to standard Momentum-based GD.

o Reduces oscillations and overshooting.

o Better handling of curvatures in the loss function.

• Disadvantages:

o More computationally expensive due to the "look-ahead" step.

o Sensitive to the learning rate.

• Use case: Suitable for optimizing complex loss functions with rapid changes in gradients.
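
A sketch of one common Nesterov formulation; the only change from the momentum sketch is that the gradient is evaluated at the look-ahead point. grad_fn and the hyperparameter values are assumptions.

```python
import numpy as np

def nesterov_gd(grad_fn, w, lr=0.01, beta=0.9, steps=100):
    """Illustrative Nesterov update: gradient taken at the look-ahead point."""
    v = np.zeros_like(w)
    for _ in range(steps):
        lookahead = w - lr * beta * v      # estimated future position
        v = beta * v + grad_fn(lookahead)  # gradient at the look-ahead point
        w = w - lr * v
    return w
```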

6. Adagrad (Adaptive Gradient Algorithm)

• How it works: Adapts the learning rate for each parameter based on its accumulated historical squared gradients: parameters with sparse (infrequent) updates keep larger effective learning rates, while parameters with dense (frequent) updates get smaller ones (see the sketch below).

• Advantages:

o Automatically adjusts the learning rate for each parameter.

o Works well with sparse data.

• Disadvantages:

o Learning rate can decay too aggressively, leading to very small updates over time.

o Less suitable for dense data, since the accumulated squared gradients can shrink the effective learning rate toward zero.

• Use case: Suitable for sparse datasets and problems with uneven feature distributions.
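
A minimal sketch of the Adagrad update; grad_fn and the constants are assumptions. Note that the accumulator only grows, which is what causes the aggressive learning-rate decay noted above.

```python
import numpy as np

def adagrad(grad_fn, w, lr=0.1, eps=1e-8, steps=100):
    """Illustrative Adagrad: per-parameter learning rates from accumulated squared gradients."""
    g_sum = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        g_sum += g ** 2                          # ever-growing accumulator
        w = w - lr * g / (np.sqrt(g_sum) + eps)  # per-parameter scaled step
    return w
```
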
7. RMSProp (Root Mean Square Propagation)

• How it works: Improves upon Adagrad by using an exponentially decaying average of squared gradients to scale the learning rate, preventing its aggressive decay (see the sketch below).

• Advantages:

o Prevents the rapid decay of the learning rate seen in Adagrad.

o Works well with both sparse and dense data.

• Disadvantages:

o Can still oscillate around the minimum if the learning rate is too high.

• Use case: Suitable for non-convex optimization problems and deep learning.
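
A minimal sketch of the RMSProp update; the decay rate rho = 0.9 and the other constants are conventional defaults, not values taken from the document.

```python
import numpy as np

def rmsprop(grad_fn, w, lr=0.001, rho=0.9, eps=1e-8, steps=100):
    """Illustrative RMSProp: decaying average of squared gradients scales each step."""
    avg_sq = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        avg_sq = rho * avg_sq + (1 - rho) * g ** 2  # exponentially decaying average
        w = w - lr * g / (np.sqrt(avg_sq) + eps)    # learning rate no longer decays to zero
    return w
```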

8. Adam (Adaptive Moment Estimation)

• How it works: Combines the ideas of Momentum-based GD and RMSProp. It uses both an exponentially decaying average of past gradients (momentum) and an exponentially decaying average of past squared gradients (adaptive learning rate) (see the sketch below).

• Advantages:

o Combines the benefits of Momentum and RMSProp.

o Works well with sparse gradients and noisy data.

o Generally converges faster than other variants.

• Disadvantages:

o Requires tuning of hyperparameters (e.g., β1, β2).

o Can still oscillate around the minimum due to the momentum term.

• Use case: Widely used in deep learning due to its robustness and fast convergence.
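
A minimal sketch of the Adam update with bias correction; beta1, beta2, and eps are the commonly used defaults, assumed here for illustration.

```python
import numpy as np

def adam(grad_fn, w, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    """Illustrative Adam: momentum (first moment) plus RMSProp-style scaling (second moment)."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g       # decaying average of gradients
        v = beta2 * v + (1 - beta2) * g ** 2  # decaying average of squared gradients
        m_hat = m / (1 - beta1 ** t)          # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w
```
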
Summary Table

Algorithm | Update Frequency | Advantages | Disadvantages | Use Case
Batch GD | Once per epoch | Stable convergence, less noisy updates | Slow for large datasets, memory-intensive | Small to medium datasets
SGD | Per training example | Fast updates, memory-efficient | Noisy updates, may not converge to the global minimum | Large datasets, online learning
Mini-Batch GD | Per mini-batch | Balances speed and stability, memory-efficient | Requires tuning of the batch size | Most common in practice, especially for deep learning
Momentum-Based GD | Per mini-batch | Accelerates convergence, helps escape local minima | Can overshoot, oscillates around the minimum | Complex, non-convex loss functions
Nesterov Accelerated GD | Per mini-batch | Faster convergence, reduces oscillations | More computationally expensive | Complex loss functions with rapid gradient changes
Adagrad | Per mini-batch | Adaptive learning rate, works well with sparse data | Learning rate decays aggressively, less suitable for dense data | Sparse datasets, uneven feature distributions
RMSProp | Per mini-batch | Prevents aggressive learning rate decay, works with sparse and dense data | Can oscillate around the minimum | Non-convex optimization, deep learning
Adam | Per mini-batch | Combines Momentum and RMSProp, fast convergence | Requires hyperparameter tuning, can oscillate | Widely used in deep learning

Conclusion

• Batch GD is stable but slow for large datasets.

• SGD is fast but noisy, making it less stable.

• Mini-Batch GD strikes a balance between speed and stability, making it the most commonly used variant.

• Momentum-Based GD and Nesterov Accelerated GD are useful for accelerating convergence and escaping local minima.

• Adagrad, RMSProp, and Adam are adaptive methods that adjust the learning rate dynamically, with Adam being the most popular due to its
robustness and fast convergence.
