
### 1. **Gradient Descent with Momentum**:


Gradient Descent with Momentum is an improvement over basic Gradient Descent
(GD) that helps accelerate convergence, especially in ravines (regions where the
loss surface curves steeply in one dimension and gently in another).

- **Update rule**:
\[
v_t = \beta_1 v_{t-1} + (1 - \beta_1) \nabla_{\theta} J(\theta)
\]
\[
\theta = \theta - \alpha v_t
\]
Where:
- \( v_t \) is the velocity (momentum) at time step \( t \)
- \( \beta_1 \) is the momentum coefficient (usually close to 0.9)
- \( \nabla_{\theta} J(\theta) \) is the gradient of the loss with respect to
the parameters
- \( \alpha \) is the learning rate

- **How it works**:
- Momentum helps to smooth the updates and avoid oscillations.
  - The key idea is that it keeps an exponentially weighted average of recent
gradients and updates the parameters using that average rather than the raw gradient.
- If gradients are consistently in the same direction, momentum builds up,
leading to faster movement in that direction.
- If gradients oscillate, the momentum helps to average out and avoid
overshooting.
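
A minimal NumPy sketch of this update, following the equations above (the function name and array-based interface are illustrative, not from any library):

```python
import numpy as np

def momentum_step(theta, grad, v, alpha=0.01, beta1=0.9):
    """One gradient-descent-with-momentum update.

    theta : current parameters (NumPy array)
    grad  : gradient of the loss at theta
    v     : running velocity (same shape as theta), start from zeros
    """
    v = beta1 * v + (1 - beta1) * grad   # exponentially weighted average of gradients
    theta = theta - alpha * v            # step along the smoothed gradient
    return theta, v
```

For example, minimizing \( J(\theta) = \theta^2 \) would call `momentum_step(theta, 2 * theta, v)` in a loop, reusing the returned velocity each iteration.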

### 2. **Adam (Adaptive Moment Estimation)**:


Adam is a popular optimization algorithm that combines the advantages of both
**Momentum** and **RMSprop**. It computes adaptive learning rates for each
parameter based on both the first moment (mean) and second moment (uncentered
variance) of the gradients.

- **Update rule**:
\[
m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_{\theta} J(\theta)
\]
\[
  v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_{\theta} J(\theta))^2
\]
\[
  \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
\]
\[
\theta = \theta - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\]
Where:
- \( m_t \) is the first moment (mean of the gradients)
- \( v_t \) is the second moment (variance of the gradients)
- \( \beta_1 \) and \( \beta_2 \) are the exponential decay rates for the
moving averages (usually set to 0.9 and 0.999, respectively)
- \( \hat{m}_t \) and \( \hat{v}_t \) are bias-corrected estimates of the
first and second moments
- \( \alpha \) is the learning rate
  - \( \epsilon \) is a small constant to avoid division by zero (usually \( 10^{-8} \))
- **How it works**:
- Adam maintains two moving averages: one for the gradient (\( m_t \)) and one
for the squared gradient (\( v_t \)).
  - These moving averages adapt the effective learning rate for each parameter
to the magnitude and variability of its gradients.
  - The bias-corrected terms \( \hat{m}_t \) and \( \hat{v}_t \) compensate for the
zero initialization of \( m_t \) and \( v_t \), which would otherwise bias the estimates
toward zero during the first steps.
- The combination of momentum and adaptive learning rates makes Adam a very
powerful optimizer.
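
A rough NumPy sketch of one Adam step, following the equations above (the names and the explicit step counter `t` are illustrative choices, not a library API):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step count; m and v start from zeros."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2     # second moment: uncentered variance
    m_hat = m / (1 - beta1**t)                # bias correction for zero initialization
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```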

### 3. **Adagrad (Adaptive Gradient Algorithm)**:


Adagrad is an optimization algorithm that adapts the learning rate of each
parameter based on its historical gradients. Parameters with large accumulated
squared gradients receive smaller effective learning rates, while parameters with
small or infrequent gradients keep relatively larger ones.

- **Update rule**:
\[
  G_t = G_{t-1} + (\nabla_{\theta} J(\theta))^2
\]
\[
  \theta = \theta - \frac{\alpha}{\sqrt{G_t + \epsilon}} \nabla_{\theta} J(\theta)
\]
Where:
- \( G_t \) is the sum of squared gradients up to time step \( t \)
- \( \epsilon \) is a small constant to prevent division by zero
- \( \alpha \) is the learning rate (can be constant or decaying over time)

- **How it works**:
- Adagrad adapts the learning rate for each parameter by accumulating the
squared gradients.
  - The larger a parameter's accumulated squared gradients, the more its effective
learning rate shrinks.
  - Parameters that receive frequent, large gradients therefore take progressively
smaller steps, while rarely updated (sparse) parameters keep larger effective
learning rates, which makes Adagrad well suited to sparse features.
- However, a limitation of Adagrad is that it can result in excessively small
learning rates over time, especially for parameters that don't receive many
updates.
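
A minimal NumPy sketch of the Adagrad update above (illustrative, not a library implementation):

```python
import numpy as np

def adagrad_step(theta, grad, G, alpha=0.01, eps=1e-8):
    """One Adagrad update. G accumulates squared gradients and starts from zeros."""
    G = G + grad**2                                  # lifetime sum of squared gradients
    theta = theta - alpha * grad / np.sqrt(G + eps)  # per-parameter scaled step
    return theta, G
```

Because `G` only grows, the effective step size `alpha / sqrt(G + eps)` shrinks monotonically, which is exactly the limitation noted above.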

### 4. **RMSprop (Root Mean Square Propagation)**:


RMSprop is an adaptive learning rate algorithm that addresses Adagrad's rapidly
decaying learning rates. It keeps a moving average of the squared gradients instead
of an ever-growing sum, which maintains a more balanced adaptive learning rate.

- **Update rule**:
\[
  E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta) (\nabla_{\theta} J(\theta))^2
\]
\[
  \theta = \theta - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} \nabla_{\theta} J(\theta)
\]
Where:
  - \( E[g^2]_t \) is the moving average of the squared gradients at time step \( t \)
- \( \beta \) is the smoothing constant (typically set to 0.9)
- \( \alpha \) is the learning rate
- \( \epsilon \) is a small constant to avoid division by zero

- **How it works**:
- RMSprop computes a running average of the squared gradients, which helps to
adapt the learning rate.
- This helps prevent the learning rate from decaying too quickly, unlike in
Adagrad, while still benefiting from an adaptive learning rate.
- RMSprop is often used for non-stationary objectives (e.g., training deep
networks) because it helps the optimizer make faster progress in some directions
and slower in others, improving convergence.
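
A minimal NumPy sketch of the RMSprop update above (illustrative names, not a library API):

```python
import numpy as np

def rmsprop_step(theta, grad, Eg2, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSprop update. Eg2 is the moving average of squared gradients,
    starting from zeros."""
    Eg2 = beta * Eg2 + (1 - beta) * grad**2            # decaying average, unlike Adagrad's sum
    theta = theta - alpha * grad / np.sqrt(Eg2 + eps)  # per-parameter adaptive step
    return theta, Eg2
```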

### Summary of Key Differences:

- **Gradient Descent with Momentum**: Accelerates convergence by incorporating past
gradients into the update, reducing oscillations.
- **Adam**: Combines the benefits of Momentum and RMSprop with adaptive learning
rates for each parameter, offering fast convergence and stability.
- **Adagrad**: Adjusts learning rates for each parameter based on the sum of
squared gradients but can lead to a rapid decay in the learning rate.
- **RMSprop**: Similar to Adagrad but with a moving average for squared gradients
to prevent rapid decay, making it more suitable for online and non-stationary
settings.
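
In practice these optimizers are usually taken from a framework rather than hand-coded. As one hedged example, a typical PyTorch training step might look like the following (the model and batch here are placeholders, and the framework's exact update formulas can differ slightly from the textbook versions written above):

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
# Alternatives: torch.optim.SGD(..., momentum=0.9), torch.optim.Adagrad(...),
# torch.optim.RMSprop(...)

x, y = torch.randn(32, 10), torch.randn(32, 1)  # placeholder batch
loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```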
