GD Compare
1. Batch Gradient Descent
• How it works: Computes the gradient of the cost function with respect to the parameters for the entire training dataset and updates the parameters once per epoch (see the sketch after this list).
• Advantages:
o Produces a stable, low-variance gradient estimate and converges smoothly, especially on convex cost functions.
• Disadvantages:
o Slow for large datasets because it requires a full pass through the data for each update.
• Use case: Suitable for small to medium datasets where a full pass per update is affordable.
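As a concrete illustration, here is a minimal NumPy sketch of batch gradient descent on a linear-regression (mean-squared-error) cost; the function name, data shapes, learning rate, and epoch count are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def batch_gd(X, y, lr=0.1, epochs=100):
    """Batch GD: one parameter update per epoch, using the gradient over ALL examples."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / n   # full-dataset gradient of the MSE cost
        w -= lr * grad                 # a single update per full pass (epoch)
    return w
```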
2. Stochastic Gradient Descent (SGD)
• How it works: Updates the parameters for each training example one at a time, making each individual update much cheaper than in Batch GD (see the sketch after this list).
• Advantages:
o Fast, frequent updates, and the noise in the updates can help escape shallow local minima.
• Disadvantages:
o May not converge to the global minimum due to the high variance in updates.
• Use case: Suitable for large datasets and online learning scenarios.
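A minimal SGD sketch under the same illustrative linear-regression setup as above; names and hyperparameters are assumptions.

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=10):
    """SGD: one parameter update per individual training example."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in np.random.permutation(n):   # visit examples in random order
            grad = X[i] * (X[i] @ w - y[i])  # gradient from a single example (noisy)
            w -= lr * grad
    return w
```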
3. Mini-Batch Gradient Descent
• How it works: A compromise between Batch GD and SGD. It updates the parameters using a small batch of training examples (e.g., 32, 64, or 128) instead of the entire dataset or a single example (see the sketch after this list).
• Advantages:
o Balances the stability of Batch GD with the speed of SGD and makes good use of vectorized hardware.
• Disadvantages:
o Introduces the batch size as an additional hyperparameter, and updates are still noisier than full Batch GD.
• Use case: Most commonly used in practice, especially for training deep neural networks.
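A minimal mini-batch sketch for the same illustrative setup; the batch size and learning rate shown are assumptions.

```python
import numpy as np

def minibatch_gd(X, y, lr=0.05, epochs=20, batch_size=32):
    """Mini-Batch GD: one parameter update per small batch of examples."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)   # gradient over the mini-batch
            w -= lr * grad
    return w
```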
4. Momentum-Based Gradient Descent
• How it works: Adds a momentum term to the update rule, which accumulates the gradients of past steps to accelerate convergence. The update is influenced by both the current gradient and the history of gradients (see the sketch after this list).
• Advantages:
o Accelerates convergence and helps escape local minima.
• Disadvantages:
o Can overshoot and oscillate around the minimum.
• Use case: Suitable for complex, non-convex loss functions.
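A minimal sketch assuming a generic gradient function grad_fn(w) (a stand-in for whatever cost is being minimized) and an initial NumPy float array of parameters w; the momentum coefficient and learning rate are illustrative.

```python
import numpy as np

def momentum_gd(grad_fn, w, lr=0.01, beta=0.9, steps=100):
    """Momentum GD: the step follows a velocity that accumulates past gradients."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)        # gradient at the current parameters
        v = beta * v + g      # velocity = decayed gradient history + current gradient
        w = w - lr * v        # step along the accumulated velocity
    return w
```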
5. Nesterov Accelerated Gradient (NAG)
• How it works: A refinement of Momentum-Based GD that evaluates the gradient at the "look-ahead" position the momentum term is about to move the parameters to, rather than at the current position, which lets it correct the step before overshooting (see the sketch after this list).
• Advantages:
o Typically overshoots less and converges faster than plain momentum.
• Disadvantages:
o Slightly more complex to implement and still requires tuning the momentum coefficient.
• Use case: Suitable for optimizing complex loss functions with rapid changes in gradients.
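NAG has several equivalent formulations; the sketch below shows one common look-ahead variant, again assuming a generic grad_fn and illustrative hyperparameters.

```python
import numpy as np

def nesterov_gd(grad_fn, w, lr=0.01, beta=0.9, steps=100):
    """NAG: evaluate the gradient at the look-ahead point, not the current point."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w - lr * beta * v)   # gradient at the position momentum is heading to
        v = beta * v + g
        w = w - lr * v
    return w
```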
6. Adagrad (Adaptive Gradient Algorithm)
• How it works: Adapts the learning rate for each parameter based on the historical gradients. Parameters with sparse updates get higher effective learning rates, while those with dense updates get lower effective learning rates (see the sketch after this list).
• Advantages:
o Adapts the learning rate per parameter automatically and works well with sparse data.
• Disadvantages:
o Learning rate can decay too aggressively, leading to very small updates over time.
o Not well suited to dense data, since the accumulated squared gradients can shrink the learning rate toward zero.
• Use case: Suitable for sparse datasets and problems with uneven feature distributions.
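A minimal Adagrad sketch, assuming the same generic grad_fn and illustrative hyperparameters as above.

```python
import numpy as np

def adagrad(grad_fn, w, lr=0.1, eps=1e-8, steps=100):
    """Adagrad: per-parameter rates scaled by the accumulated squared gradients."""
    G = np.zeros_like(w)              # running sum of squared gradients, per parameter
    for _ in range(steps):
        g = grad_fn(w)
        G += g ** 2                   # only grows, so the effective rate only shrinks
        w = w - lr * g / (np.sqrt(G) + eps)
    return w
```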
7. RMSProp (Root Mean Square Propagation)
• How it works: Like Adagrad, it scales each parameter's learning rate by its gradient history, but it uses an exponentially decaying average of squared gradients instead of an ever-growing sum, so the learning rate does not decay to zero (see the sketch after this list).
• Advantages:
o Keeps the learning rate from decaying too aggressively and handles non-stationary objectives well.
• Disadvantages:
o Can still oscillate around the minimum if the learning rate is too high.
• Use case: Suitable for non-convex optimization problems and deep learning.
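A minimal RMSProp sketch under the same assumptions (generic grad_fn, illustrative hyperparameters).

```python
import numpy as np

def rmsprop(grad_fn, w, lr=0.001, beta=0.9, eps=1e-8, steps=100):
    """RMSProp: Adagrad with an exponentially decaying average of squared gradients."""
    s = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        s = beta * s + (1 - beta) * g ** 2   # decaying average, so old gradients fade out
        w = w - lr * g / (np.sqrt(s) + eps)
    return w
```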
8. Adam (Adaptive Moment Estimation)
• How it works: Combines the ideas of Momentum-Based GD and RMSProp. It uses both an exponentially decaying average of past gradients (momentum) and an exponentially decaying average of past squared gradients (adaptive learning rate) (see the sketch after this list).
• Advantages:
o Combines fast convergence from momentum with per-parameter adaptive learning rates, and usually works well with little tuning.
• Disadvantages:
o Can still oscillate around the minimum due to the momentum term.
• Use case: Widely used in deep learning due to its robustness and fast convergence.
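A minimal Adam sketch with the standard bias-corrected moment estimates; grad_fn and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def adam(grad_fn, w, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    """Adam: momentum (first moment) + RMSProp-style scaling (second moment), bias-corrected."""
    m = np.zeros_like(w)   # decaying average of gradients (momentum term)
    v = np.zeros_like(w)   # decaying average of squared gradients (adaptive rate)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction for zero initialization
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w
```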
Summary Table
Algorithm | Update Frequency | Advantages | Disadvantages | Use Case
Momentum-Based GD | Per mini-batch | Accelerates convergence, helps escape local minima | Can overshoot, oscillates around minimum | Complex, non-convex loss functions
Adagrad | Per mini-batch | Adaptive learning rate, works well with sparse data | Learning rate decays aggressively, not suitable for dense data | Sparse datasets, uneven feature distributions
Conclusion
• Mini-Batch GD strikes a balance between speed and stability, making it the most commonly used variant.
• Momentum-Based GD and Nesterov Accelerated GD are useful for accelerating convergence and escaping local minima.
• Adagrad, RMSProp, and Adam are adaptive methods that adjust the learning rate dynamically, with Adam being the most popular due to its
robustness and fast convergence.