GD Compare

The document discusses various gradient descent algorithms including Batch Gradient Descent, Stochastic Gradient Descent, Mini-Batch Gradient Descent, Momentum-Based Gradient Descent, Nesterov Accelerated Gradient, Adagrad, RMSProp, and Adam. Each algorithm is described in terms of its working mechanism, advantages, disadvantages, and suitable use cases. The conclusion highlights the trade-offs between stability and speed across these methods, with Mini-Batch GD being the most commonly used and Adam noted for its robustness and fast convergence.

1. Batch Gradient Descent (Vanilla GD)

• How it works: Computes the gradient of the cost function with respect to the parameters over the entire training dataset and updates the parameters once per epoch (see the sketch below).

• Advantages:

o Stable convergence due to less noisy updates.

o Straight trajectory towards the minimum.

o Converges to the global minimum for convex cost functions (given a suitable learning rate).

• Disadvantages:

o Slow for large datasets because it requires a full pass through the data for each
update.

o Memory-intensive for large datasets.

o Can get stuck in local minima for non-convex functions.

• Use case: Suitable for small to medium-sized datasets.
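
To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent for a least-squares linear model. The data X, targets y, squared-error loss, and hyperparameter values are illustrative assumptions, not part of the document.

```python
import numpy as np

def batch_gd(X, y, lr=0.01, epochs=100):
    """Illustrative batch GD for linear regression (assumed mean-squared-error loss)."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / n  # gradient over the ENTIRE dataset
        w -= lr * grad                # exactly one parameter update per epoch
    return w
```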

2. Stochastic Gradient Descent (SGD)

• How it works: Updates the parameters for each training example, one at a time, so each individual update is much cheaper than in Batch GD (see the sketch below).

• Advantages:

o Faster updates and convergence compared to Batch GD.

o Can escape local minima due to the noisy updates.

o Memory-efficient as it processes one example at a time.

• Disadvantages:

o Noisy updates can lead to oscillations and slower convergence.

o May not converge to the global minimum due to the high variance in updates.

• Use case: Suitable for large datasets and online learning scenarios.
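
A minimal sketch of stochastic updates, reusing the assumed linear-regression setup from the batch example; the per-epoch shuffling and learning rate are illustrative choices.

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=10):
    """Illustrative SGD: one noisy update per training example."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):  # visit examples in random order
            grad = X[i] * (X[i] @ w - y[i])      # gradient of ONE example
            w -= lr * grad                       # immediate, noisy update
    return w
```
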
3. Mini-Batch Gradient Descent

• How it works: A compromise between Batch GD and SGD. It updates the parameters using a small batch of training examples (e.g., 32, 64, or 128) instead of the entire dataset or a single example (see the sketch below).

• Advantages:

o Less noisy updates compared to SGD, leading to more stable convergence.

o Faster than Batch GD as it processes smaller batches.

o Memory-efficient and can leverage vectorized operations for faster computation.

• Disadvantages:

o Requires tuning of the batch size.

o Still susceptible to local minima for non-convex functions.

• Use case: Most commonly used in practice, especially for training deep neural networks.
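
A minimal sketch under the same assumed linear-regression setup; batch_size is the hyperparameter that needs tuning, and 32 here is just a common default.

```python
import numpy as np

def minibatch_gd(X, y, lr=0.01, epochs=10, batch_size=32):
    """Illustrative mini-batch GD: one vectorized update per mini-batch."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]           # indices of one mini-batch
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)  # vectorized batch gradient
            w -= lr * grad
    return w
```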

4. Momentum-Based Gradient Descent

• How it works: Adds a momentum term to the update rule, which accumulates the gradients of past steps to accelerate convergence. The update is influenced by both the current gradient and the history of gradients (see the sketch below).

• Advantages:

o Accelerates convergence by accumulating momentum in the direction of consistent gradients.

o Helps escape local minima and flat regions.

o Smoother optimization path compared to standard GD.

• Disadvantages:

o Can overshoot the minimum if the momentum term is too high.

o May oscillate around the minimum.

• Use case: Useful for optimizing complex, non-convex loss functions.
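
One common formulation of the momentum update, sketched below; grad_fn is an assumed callable that returns the gradient at the current parameters, and beta = 0.9 is a typical (not prescribed) coefficient.

```python
import numpy as np

def momentum_gd(grad_fn, w, lr=0.01, beta=0.9, steps=100):
    """Illustrative momentum update (one common formulation)."""
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad_fn(w)  # accumulate past gradients
        w = w - lr * v             # step along the accumulated direction
    return w
```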


5. Nesterov Accelerated Gradient (NAG)

• How it works: A variant of Momentum-based GD that "looks ahead" by computing the gradient at the estimated future position (based on the momentum term) before making the update (see the sketch below).

• Advantages:

o Faster convergence compared to standard Momentum-based GD.

o Reduces oscillations and overshooting.

o Better handling of curvatures in the loss function.

• Disadvantages:

o More computationally expensive due to the "look-ahead" step.

o Sensitive to the learning rate.

• Use case: Suitable for optimizing complex loss functions with rapid changes in gradients.
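
A sketch of one common Nesterov formulation; the only change from the momentum sketch is that the gradient is evaluated at the look-ahead point. grad_fn and the hyperparameter values are assumptions.

```python
import numpy as np

def nesterov_gd(grad_fn, w, lr=0.01, beta=0.9, steps=100):
    """Illustrative Nesterov update: gradient taken at the look-ahead point."""
    v = np.zeros_like(w)
    for _ in range(steps):
        lookahead = w - lr * beta * v      # estimated future position
        v = beta * v + grad_fn(lookahead)  # gradient at the look-ahead point
        w = w - lr * v
    return w
```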

6. Adagrad (Adaptive Gradient Algorithm)

• How it works: Adapts the learning rate for each parameter based on its accumulated historical squared gradients: parameters with sparse (infrequent) updates keep larger effective learning rates, while parameters with dense (frequent) updates get smaller ones (see the sketch below).

• Advantages:

o Automatically adjusts the learning rate for each parameter.

o Works well with sparse data.

• Disadvantages:

o Learning rate can decay too aggressively, leading to very small updates over time.

o Less suitable for dense data, since the accumulated squared gradients can shrink the effective learning rate toward zero.

• Use case: Suitable for sparse datasets and problems with uneven feature distributions.
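
A minimal sketch of the Adagrad update; grad_fn and the constants are assumptions. Note that the accumulator only grows, which is what causes the aggressive learning-rate decay noted above.

```python
import numpy as np

def adagrad(grad_fn, w, lr=0.1, eps=1e-8, steps=100):
    """Illustrative Adagrad: per-parameter learning rates from accumulated squared gradients."""
    g_sum = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        g_sum += g ** 2                          # ever-growing accumulator
        w = w - lr * g / (np.sqrt(g_sum) + eps)  # per-parameter scaled step
    return w
```
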
7. RMSProp (Root Mean Square Propagation)

• How it works: Improves upon Adagrad by using an exponentially decaying average of squared gradients to scale the learning rate, preventing its aggressive decay (see the sketch below).

• Advantages:

o Prevents the rapid decay of the learning rate seen in Adagrad.

o Works well with both sparse and dense data.

• Disadvantages:

o Can still oscillate around the minimum if the learning rate is too high.

• Use case: Suitable for non-convex optimization problems and deep learning.
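
A minimal sketch of the RMSProp update; the decay rate rho = 0.9 and the other constants are conventional defaults, not values taken from the document.

```python
import numpy as np

def rmsprop(grad_fn, w, lr=0.001, rho=0.9, eps=1e-8, steps=100):
    """Illustrative RMSProp: decaying average of squared gradients scales each step."""
    avg_sq = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        avg_sq = rho * avg_sq + (1 - rho) * g ** 2  # exponentially decaying average
        w = w - lr * g / (np.sqrt(avg_sq) + eps)    # learning rate no longer decays to zero
    return w
```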

8. Adam (Adaptive Moment Estimation)

• How it works: Combines the ideas of Momentum-based GD and RMSProp. It uses both an exponentially decaying average of past gradients (momentum) and an exponentially decaying average of past squared gradients (adaptive learning rate) (see the sketch below).

• Advantages:

o Combines the benefits of Momentum and RMSProp.

o Works well with sparse gradients and noisy data.

o Generally converges faster than other variants.

• Disadvantages:

o Requires tuning of hyperparameters (e.g., β1, β2).

o Can still oscillate around the minimum due to the momentum term.

• Use case: Widely used in deep learning due to its robustness and fast convergence.
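
A minimal sketch of the Adam update with bias correction; beta1, beta2, and eps are the commonly used defaults, assumed here for illustration.

```python
import numpy as np

def adam(grad_fn, w, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    """Illustrative Adam: momentum (first moment) plus RMSProp-style scaling (second moment)."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g       # decaying average of gradients
        v = beta2 * v + (1 - beta2) * g ** 2  # decaying average of squared gradients
        m_hat = m / (1 - beta1 ** t)          # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w
```
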
Summary Table

Algorithm | Update Frequency | Advantages | Disadvantages | Use Case
Batch GD | Once per epoch | Stable convergence, less noisy updates | Slow for large datasets, memory-intensive | Small to medium datasets
SGD | Per training example | Fast updates, memory-efficient | Noisy updates, may not converge to the global minimum | Large datasets, online learning
Mini-Batch GD | Per mini-batch | Balances speed and stability, memory-efficient | Requires tuning of the batch size | Most common in practice, especially for deep learning
Momentum-Based GD | Per mini-batch | Accelerates convergence, helps escape local minima | Can overshoot, oscillates around the minimum | Complex, non-convex loss functions
Nesterov Accelerated GD | Per mini-batch | Faster convergence, reduces oscillations | More computationally expensive | Complex loss functions with rapid gradient changes
Adagrad | Per mini-batch | Adaptive learning rate, works well with sparse data | Learning rate decays aggressively, less suitable for dense data | Sparse datasets, uneven feature distributions
RMSProp | Per mini-batch | Prevents aggressive learning rate decay, works with sparse and dense data | Can oscillate around the minimum | Non-convex optimization, deep learning
Adam | Per mini-batch | Combines Momentum and RMSProp, fast convergence | Requires hyperparameter tuning, can oscillate | Widely used in deep learning

Conclusion

• Batch GD is stable but slow for large datasets.

• SGD is fast but noisy, making it less stable.

• Mini-Batch GD strikes a balance between speed and stability, making it the most commonly used variant.

• Momentum-Based GD and Nesterov Accelerated GD are useful for accelerating convergence and escaping local minima.

• Adagrad, RMSProp, and Adam are adaptive methods that adjust the learning rate dynamically, with Adam being the most popular due to its
robustness and fast convergence.
