Optimizers
- **How it works** (Momentum):
- Momentum helps to smooth the updates and avoid oscillations.
- The key idea is to keep an exponentially weighted average of recent gradients and update
the parameters along that average rather than along the raw gradient.
- If gradients are consistently in the same direction, momentum builds up,
leading to faster movement in that direction.
- If gradients oscillate from step to step, the opposing contributions largely cancel out,
which damps the oscillations and reduces overshooting (see the sketch after this list).
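The momentum update rule itself is not written out above, so the sketch below assumes the common formulation \( v_t = \beta v_{t-1} + \nabla_{\theta} J(\theta) \), \( \theta = \theta - \alpha v_t \); the function and variable names are illustrative only.

```python
import numpy as np

def momentum_step(theta, velocity, grad, lr=0.05, beta=0.9):
    """One SGD-with-momentum step (assumed form: v = beta*v + g, theta -= lr*v)."""
    velocity = beta * velocity + grad   # decaying sum of past gradients
    theta = theta - lr * velocity       # step along the smoothed direction
    return theta, velocity

# Toy usage: minimize f(x) = x^2, whose gradient is 2x.
theta, velocity = np.array([5.0]), np.zeros(1)
for _ in range(100):
    theta, velocity = momentum_step(theta, velocity, grad=2 * theta)
print(theta)  # oscillates toward 0 as consistent gradients build up speed
```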
- **Update rule** (Adam):
\[
m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_{\theta} J(\theta)
\]
\[
v_t = \beta_2 v_{t-1} + (1 - \beta_2) \left( \nabla_{\theta} J(\theta) \right)^2
\]
\[
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
\]
\[
\theta = \theta - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\]
Where:
- \( m_t \) is the first moment (mean of the gradients)
- \( v_t \) is the second moment (the mean of the squared gradients, i.e., the uncentered variance)
- \( \beta_1 \) and \( \beta_2 \) are the exponential decay rates for the
moving averages (usually set to 0.9 and 0.999, respectively)
- \( \hat{m}_t \) and \( \hat{v}_t \) are bias-corrected estimates of the
first and second moments
- \( \alpha \) is the learning rate
- \( \epsilon \) is a small constant to avoid division by zero (usually \( 10^{-8} \))
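As a concrete reference, here is a minimal NumPy sketch of a single Adam step that follows the four equations above; the function name and the way the state \( (m, v, t) \) is passed around are illustrative choices rather than a fixed API.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for step t (t starts at 1), mirroring the equations above."""
    m = beta1 * m + (1 - beta1) * grad       # first moment: decaying mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2    # second moment: decaying mean of squared gradients
    m_hat = m / (1 - beta1**t)               # bias correction for the zero initialization
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage on f(x) = x^2 (gradient 2x):
theta = np.array([5.0])
m, v = np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, m, v, grad=2 * theta, t=t, lr=0.01)
print(theta)  # close to 0
```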
- **How it works**:
- Adam maintains two moving averages: one for the gradient (\( m_t \)) and one
for the squared gradient (\( v_t \)).
- These moving averages adapt the effective step size of each parameter based on the recent
magnitude of its gradients.
- The bias-correction terms (\( \hat{m}_t \) and \( \hat{v}_t \)) counteract the fact that
\( m_t \) and \( v_t \) are initialized at zero and are therefore biased toward zero during the
first steps (the short numeric check below makes this concrete).
- The combination of momentum (via \( m_t \)) and per-parameter adaptive step sizes (via
\( v_t \)) makes Adam a strong default optimizer for many deep learning problems.
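To make the bias-correction point concrete, the small check below (plain Python, illustrative only) prints the correction factors \( 1/(1 - \beta_1^t) \) and \( 1/(1 - \beta_2^t) \):

```python
beta1, beta2 = 0.9, 0.999
for t in (1, 10, 100, 1000, 10000):
    print(t, 1 / (1 - beta1**t), 1 / (1 - beta2**t))
# At t = 1 the raw m and v are scaled up by 10 and 1000 respectively;
# as t grows both factors approach 1 and the correction fades out.
```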
- **Update rule** (Adagrad):
\[
G_t = G_{t-1} + \left( \nabla_{\theta} J(\theta) \right)^2
\]
\[
\theta = \theta - \frac{\alpha}{\sqrt{G_t + \epsilon}} \nabla_{\theta} J(\theta)
\]
Where:
- \( G_t \) is the sum of squared gradients up to time step \( t \)
- \( \epsilon \) is a small constant to prevent division by zero
- \( \alpha \) is the learning rate (can be constant or decaying over time)
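As with Adam above, here is a minimal NumPy sketch of one Adagrad step following the two equations above; names are illustrative.

```python
import numpy as np

def adagrad_step(theta, G, grad, lr=0.01, eps=1e-8):
    """One Adagrad step: G accumulates squared gradients per parameter."""
    G = G + grad**2                               # ever-growing sum of squared gradients
    theta = theta - lr * grad / np.sqrt(G + eps)  # per-parameter scaled step
    return theta, G

# Toy usage on f(x) = x^2 (gradient 2x):
theta, G = np.array([5.0]), np.zeros(1)
for _ in range(500):
    theta, G = adagrad_step(theta, G, grad=2 * theta, lr=0.5)
print(theta)  # approaches 0, but with ever-shrinking steps
```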
- **How it works**:
- Adagrad adapts the learning rate for each parameter by accumulating the
squared gradients.
- The larger a parameter's accumulated squared gradients, the more its effective learning rate
shrinks.
- As a result, frequently and strongly updated parameters take smaller and smaller steps, while
rarely updated (sparse) parameters keep a comparatively large effective learning rate, which is
useful for sparse features.
- However, because \( G_t \) only ever grows, the effective learning rate decays monotonically
and can become vanishingly small later in training, effectively stalling learning (see the toy
calculation below).
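The toy calculation below (assuming a constant gradient of magnitude 1) shows this decay: the accumulator grows linearly with \( t \), so the effective step \( \alpha / \sqrt{G_t} \) shrinks like \( 1/\sqrt{t} \).

```python
import numpy as np

alpha, G = 0.1, 0.0
for t in range(1, 10001):
    G += 1.0**2                       # constant gradient g = 1 at every step
    if t in (1, 100, 10000):
        print(t, alpha / np.sqrt(G))  # 0.1, 0.01, 0.001: the step decays toward 0
```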
- **Update rule** (RMSprop):
\[
E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta) \left( \nabla_{\theta} J(\theta) \right)^2
\]
\[
\theta = \theta - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} \nabla_{\theta} J(\theta)
\]
Where:
- \( E[g^2]_t \) is the moving average of the squared gradients at time step \( t \)
- \( \beta \) is the smoothing constant (typically set to 0.9)
- \( \alpha \) is the learning rate
- \( \epsilon \) is a small constant to avoid division by zero
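A minimal NumPy sketch of one RMSprop step following the two equations above; names are illustrative.

```python
import numpy as np

def rmsprop_step(theta, avg_sq, grad, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSprop step: avg_sq is the decaying average E[g^2]_t."""
    avg_sq = beta * avg_sq + (1 - beta) * grad**2      # running average of squared gradients
    theta = theta - lr * grad / np.sqrt(avg_sq + eps)  # per-parameter scaled step
    return theta, avg_sq
```

The only difference from the Adagrad sketch is that the squared-gradient statistic decays instead of accumulating forever.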
- **How it works**:
- RMSprop computes a running average of the squared gradients, which helps to
adapt the learning rate.
- Because this average decays exponentially instead of growing without bound as in Adagrad, the
effective learning rate does not shrink monotonically toward zero, while the per-parameter
adaptivity is preserved (see the comparison below).
- RMSprop is often used for non-stationary objectives (e.g., training deep networks) because it
takes larger steps along directions with small, stable gradients and smaller steps along
directions with large gradients, improving convergence.
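The toy comparison below (again assuming a constant gradient of magnitude 1) contrasts the two denominators after 1000 steps: Adagrad's keeps growing, so its effective step keeps shrinking, while RMSprop's settles near 1 and the step stays roughly constant.

```python
import numpy as np

alpha, beta = 0.01, 0.9
G, avg_sq = 0.0, 0.0
for _ in range(1000):
    G += 1.0**2                                # Adagrad: unbounded sum
    avg_sq = beta * avg_sq + (1 - beta) * 1.0  # RMSprop: decaying average, settles near 1
print("adagrad step:", alpha / np.sqrt(G))       # ~0.0003 and still shrinking
print("rmsprop step:", alpha / np.sqrt(avg_sq))  # ~0.01 and roughly constant
```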