Optimizers
- **How it works** (Momentum):
- Momentum helps to smooth the updates and avoid oscillations.
- The key idea is to keep an exponentially weighted average of recent gradients and update
the parameters along that average rather than along the raw gradient.
- If gradients are consistently in the same direction, momentum builds up,
leading to faster movement in that direction.
- If gradients oscillate from step to step, the opposing contributions largely cancel out,
which damps the oscillations and reduces overshooting (see the sketch after this list).
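The momentum update rule itself is not written out above, so the sketch below assumes the common formulation \( v_t = \beta v_{t-1} + \nabla_{\theta} J(\theta) \), \( \theta = \theta - \alpha v_t \); the function and variable names are illustrative only.

```python
import numpy as np

def momentum_step(theta, velocity, grad, lr=0.05, beta=0.9):
    """One SGD-with-momentum step (assumed form: v = beta*v + g, theta -= lr*v)."""
    velocity = beta * velocity + grad   # decaying sum of past gradients
    theta = theta - lr * velocity       # step along the smoothed direction
    return theta, velocity

# Toy usage: minimize f(x) = x^2, whose gradient is 2x.
theta, velocity = np.array([5.0]), np.zeros(1)
for _ in range(100):
    theta, velocity = momentum_step(theta, velocity, grad=2 * theta)
print(theta)  # oscillates toward 0 as consistent gradients build up speed
```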
- **Update rule** (Adam):
\[
m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_{\theta} J(\theta)
\]
\[
v_t = \beta_2 v_{t-1} + (1 - \beta_2) \left( \nabla_{\theta} J(\theta) \right)^2
\]
\[
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
\]
\[
\theta = \theta - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\]
Where:
- \( m_t \) is the first moment (mean of the gradients)
- \( v_t \) is the second moment (the mean of the squared gradients, i.e., the uncentered variance)
- \( \beta_1 \) and \( \beta_2 \) are the exponential decay rates for the
moving averages (usually set to 0.9 and 0.999, respectively)
- \( \hat{m}_t \) and \( \hat{v}_t \) are bias-corrected estimates of the
first and second moments
- \( \alpha \) is the learning rate
- \( \epsilon \) is a small constant to avoid division by zero (usually \( 10^{-8} \))
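As a concrete reference, here is a minimal NumPy sketch of a single Adam step that follows the four equations above; the function name and the way the state \( (m, v, t) \) is passed around are illustrative choices rather than a fixed API.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for step t (t starts at 1), mirroring the equations above."""
    m = beta1 * m + (1 - beta1) * grad       # first moment: decaying mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2    # second moment: decaying mean of squared gradients
    m_hat = m / (1 - beta1**t)               # bias correction for the zero initialization
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage on f(x) = x^2 (gradient 2x):
theta = np.array([5.0])
m, v = np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, m, v, grad=2 * theta, t=t, lr=0.01)
print(theta)  # close to 0
```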
- **How it works**:
- Adam maintains two moving averages: one for the gradient (\( m_t \)) and one
for the squared gradient (\( v_t \)).
- These moving averages adapt the effective step size of each parameter based on the recent
magnitude of its gradients.
- The bias-correction terms (\( \hat{m}_t \) and \( \hat{v}_t \)) counteract the fact that
\( m_t \) and \( v_t \) are initialized at zero and are therefore biased toward zero during the
first steps (the short numeric check below makes this concrete).
- The combination of momentum (via \( m_t \)) and per-parameter adaptive step sizes (via
\( v_t \)) makes Adam a strong default optimizer for many deep learning problems.
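To make the bias-correction point concrete, the small check below (plain Python, illustrative only) prints the correction factors \( 1/(1 - \beta_1^t) \) and \( 1/(1 - \beta_2^t) \):

```python
beta1, beta2 = 0.9, 0.999
for t in (1, 10, 100, 1000, 10000):
    print(t, 1 / (1 - beta1**t), 1 / (1 - beta2**t))
# At t = 1 the raw m and v are scaled up by 10 and 1000 respectively;
# as t grows both factors approach 1 and the correction fades out.
```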
- **Update rule** (Adagrad):
\[
G_t = G_{t-1} + \left( \nabla_{\theta} J(\theta) \right)^2
\]
\[
\theta = \theta - \frac{\alpha}{\sqrt{G_t + \epsilon}} \nabla_{\theta} J(\theta)
\]
Where:
- \( G_t \) is the sum of squared gradients up to time step \( t \)
- \( \epsilon \) is a small constant to prevent division by zero
- \( \alpha \) is the learning rate (can be constant or decaying over time)
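As with Adam above, here is a minimal NumPy sketch of one Adagrad step following the two equations above; names are illustrative.

```python
import numpy as np

def adagrad_step(theta, G, grad, lr=0.01, eps=1e-8):
    """One Adagrad step: G accumulates squared gradients per parameter."""
    G = G + grad**2                               # ever-growing sum of squared gradients
    theta = theta - lr * grad / np.sqrt(G + eps)  # per-parameter scaled step
    return theta, G

# Toy usage on f(x) = x^2 (gradient 2x):
theta, G = np.array([5.0]), np.zeros(1)
for _ in range(500):
    theta, G = adagrad_step(theta, G, grad=2 * theta, lr=0.5)
print(theta)  # approaches 0, but with ever-shrinking steps
```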
- **How it works**:
- Adagrad adapts the learning rate for each parameter by accumulating the
squared gradients.
- The larger a parameter's accumulated squared gradients, the more its effective learning rate
shrinks.
- As a result, frequently and strongly updated parameters take smaller and smaller steps, while
rarely updated (sparse) parameters keep a comparatively large effective learning rate, which is
useful for sparse features.
- However, because \( G_t \) only ever grows, the effective learning rate decays monotonically
and can become vanishingly small later in training, effectively stalling learning (see the toy
calculation below).
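The toy calculation below (assuming a constant gradient of magnitude 1) shows this decay: the accumulator grows linearly with \( t \), so the effective step \( \alpha / \sqrt{G_t} \) shrinks like \( 1/\sqrt{t} \).

```python
import numpy as np

alpha, G = 0.1, 0.0
for t in range(1, 10001):
    G += 1.0**2                       # constant gradient g = 1 at every step
    if t in (1, 100, 10000):
        print(t, alpha / np.sqrt(G))  # 0.1, 0.01, 0.001: the step decays toward 0
```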
- **Update rule** (RMSprop):
\[
E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta) \left( \nabla_{\theta} J(\theta) \right)^2
\]
\[
\theta = \theta - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} \nabla_{\theta} J(\theta)
\]
Where:
- \( E[g^2]_t \) is the moving average of the squared gradients at time step \( t \)
- \( \beta \) is the smoothing constant (typically set to 0.9)
- \( \alpha \) is the learning rate
- \( \epsilon \) is a small constant to avoid division by zero
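A minimal NumPy sketch of one RMSprop step following the two equations above; names are illustrative.

```python
import numpy as np

def rmsprop_step(theta, avg_sq, grad, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSprop step: avg_sq is the decaying average E[g^2]_t."""
    avg_sq = beta * avg_sq + (1 - beta) * grad**2      # running average of squared gradients
    theta = theta - lr * grad / np.sqrt(avg_sq + eps)  # per-parameter scaled step
    return theta, avg_sq
```

The only difference from the Adagrad sketch is that the squared-gradient statistic decays instead of accumulating forever.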
- **How it works**:
- RMSprop computes a running average of the squared gradients, which helps to
adapt the learning rate.
- Because this average decays exponentially instead of growing without bound as in Adagrad, the
effective learning rate does not shrink monotonically toward zero, while the per-parameter
adaptivity is preserved (see the comparison below).
- RMSprop is often used for non-stationary objectives (e.g., training deep networks) because it
takes larger steps along directions with small, stable gradients and smaller steps along
directions with large gradients, improving convergence.
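The toy comparison below (again assuming a constant gradient of magnitude 1) contrasts the two denominators after 1000 steps: Adagrad's keeps growing, so its effective step keeps shrinking, while RMSprop's settles near 1 and the step stays roughly constant.

```python
import numpy as np

alpha, beta = 0.01, 0.9
G, avg_sq = 0.0, 0.0
for _ in range(1000):
    G += 1.0**2                                # Adagrad: unbounded sum
    avg_sq = beta * avg_sq + (1 - beta) * 1.0  # RMSprop: decaying average, settles near 1
print("adagrad step:", alpha / np.sqrt(G))       # ~0.0003 and still shrinking
print("rmsprop step:", alpha / np.sqrt(avg_sq))  # ~0.01 and roughly constant
```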