
17 - Deep Learning Frameworks - Data Augmentation - Under-Fitting vs Over-Fitting - 21-08-2024


Optimization algorithms are used in machine learning and deep learning to minimize (or maximize) an objective function J(θ), which is typically a loss function the model minimizes during training. The objective function depends on the model parameters θ. Overviews of the optimization algorithms follow; all of them build on the basic gradient descent update θ ← θ - η∇J(θ), where η is the learning rate, sketched below.
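As a reference point, here is a minimal NumPy sketch of this basic update on a toy quadratic objective; the objective, variable names, and learning rate are illustrative assumptions rather than anything prescribed by these notes.

import numpy as np

# Toy objective J(theta) = ||theta||^2 / 2, whose gradient is simply theta.
def grad_J(theta):
    return theta

theta = np.array([3.0, -2.0])   # initial parameters (arbitrary)
eta = 0.1                       # fixed global learning rate (illustrative value)

for step in range(100):
    g = grad_J(theta)           # gradient of the objective at the current theta
    theta = theta - eta * g     # plain gradient descent update

print(theta)  # moves towards the minimizer [0, 0]

The adaptive optimizers below keep this overall loop but replace the fixed step eta with per-parameter, history-dependent step sizes.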
1. Adagrad
Adagrad (Adaptive Gradient Algorithm) is an optimization algorithm that adapts the learning
rate to the parameters. It performs smaller updates for parameters associated with frequently
occurring features and larger updates for parameters associated with infrequent features. It's
well-suited for dealing with sparse data.
Key Points:
 Learning Rate: The learning rate is adapted for each parameter individually.
 Squared Gradients: The algorithm accumulates the squared gradients in the
denominator; since every added term is positive, the accumulated sum keeps growing
during training, causing the learning rate to shrink and eventually become
infinitesimally small.
Use Adagrad for:
 Problems with sparse data: Adagrad can be particularly effective in scenarios where
some features occur infrequently since it adapts the learning rate for each parameter.
 Large-scale learning problems: Its per-parameter learning rate adjustment makes it
well-suited for dealing with high-dimensional data where the dataset is large.
However, due to its continuously decreasing learning rate, Adagrad might not perform well in
scenarios where you need a constant or increasing learning rate throughout training.
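To make the accumulated squared gradients in the denominator concrete, here is a minimal NumPy sketch of the Adagrad update on the same toy objective; the hyperparameter values and variable names are illustrative assumptions.

import numpy as np

def grad_J(theta):
    return theta  # gradient of the toy objective J(theta) = ||theta||^2 / 2

theta = np.array([3.0, -2.0])
eta = 0.5                  # base learning rate (illustrative)
eps = 1e-8                 # small constant for numerical stability
G = np.zeros_like(theta)   # running sum of squared gradients

for step in range(100):
    g = grad_J(theta)
    G = G + g ** 2                                # accumulate squared gradients (only ever grows)
    theta = theta - eta * g / (np.sqrt(G) + eps)  # per-parameter scaled update

Because G only grows, the effective step size eta / sqrt(G) shrinks throughout training, which is exactly the diminishing-learning-rate behaviour described above.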

2. RMSProp
RMSProp (Root Mean Square Propagation) is an optimization algorithm designed to resolve
Adagrad’s radically diminishing learning rates. It uses a moving average of squared gradients
to normalize the gradient itself.
This normalization balances the step size: the step is decreased for large gradients to avoid exploding updates and increased for small gradients to avoid vanishing updates.
Key Points:
 Decay Rate: The algorithm introduces a decay rate to control the moving average
window.
 Stabilized Learning Rate: The learning rate does not decrease to the point of becoming too small, as can happen with Adagrad.
Use RMSProp for:
 Non-stationary tasks: Since RMSProp adapts the learning rate based on recent
gradients, it performs well on problems where the statistics of the data change over
time.
 Recurrent Neural Networks (RNNs): RNNs often suffer from the vanishing or
exploding gradient problem, and RMSProp is effective at combatting this issue.
RMSProp is generally a good choice unless you have a reason to use a more sophisticated
optimizer like Adam, which combines the benefits of RMSProp with momentum.
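As a rough sketch of how the moving average of squared gradients replaces Adagrad's ever-growing sum, assuming the standard RMSProp formulation (the decay rate and other values are illustrative):

import numpy as np

def grad_J(theta):
    return theta  # toy gradient

theta = np.array([3.0, -2.0])
eta = 0.01
rho = 0.9                       # decay rate controlling the moving-average window
eps = 1e-8
avg_sq = np.zeros_like(theta)   # exponentially weighted average of squared gradients

for step in range(200):
    g = grad_J(theta)
    avg_sq = rho * avg_sq + (1 - rho) * g ** 2         # moving average instead of a running sum
    theta = theta - eta * g / (np.sqrt(avg_sq) + eps)  # normalized update

Because old squared gradients decay away, the denominator no longer grows without bound and the effective learning rate stays usable throughout training.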

3. Adadelta
Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically
decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts
the window of accumulated past gradients to some fixed size.
Key Points:
 Window of Accumulated Gradients: Adadelta limits the accumulation of squared
gradients to a fixed window size, implementing it as a moving average.
 No Need for a Default Learning Rate: The method derives its own adaptive learning rates that require no manual setting of a learning rate.
Use Adadelta for:
 Problems where you want to avoid manual tuning of the learning rate: Adadelta
adapts over time and doesn't require you to set a default learning rate.
 Situations similar to RMSProp, where you're dealing with non-stationary
objectives or training problems that require robustness to noisy gradient
information.
Adadelta is particularly useful if you find that RMSProp is still too aggressive with its
updates even after tuning its decay rate.
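Here is a minimal sketch of the Adadelta update, assuming the standard formulation with two moving averages (one over squared gradients, one over squared parameter updates); the names and constants are illustrative:

import numpy as np

def grad_J(theta):
    return theta  # toy gradient

theta = np.array([3.0, -2.0])
rho = 0.95                             # decay rate for both moving averages
eps = 1e-6
avg_sq_grad = np.zeros_like(theta)     # moving average of squared gradients
avg_sq_update = np.zeros_like(theta)   # moving average of squared parameter updates

for step in range(500):
    g = grad_J(theta)
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g ** 2
    # The step is scaled by RMS(previous updates) / RMS(gradients),
    # so no explicit learning rate has to be chosen.
    update = -np.sqrt(avg_sq_update + eps) / np.sqrt(avg_sq_grad + eps) * g
    avg_sq_update = rho * avg_sq_update + (1 - rho) * update ** 2
    theta = theta + update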

4. Adam
Adam (Adaptive Moment Estimation) combines the ideas of momentum and RMSProp. In
addition to storing an exponentially decaying average of past squared gradients like
RMSProp, Adam also keeps an exponentially decaying average of past gradients similar to
momentum.
Key Points:
 Moment Estimation: Adam maintains two different estimations: one for the gradients
(first moment) and one for the squared gradients (second moment).
 Bias Correction: To account for their initializations at zero, the moment estimates are
corrected to prevent bias towards zero.
 Hyperparameters: Adam includes hyperparameters (commonly denoted β1 and β2) for the decay rates of these moment estimates.
Use Adam for:
 Deep learning applications: Adam is very popular in the training of deep neural
networks, particularly because it combines the advantages of RMSProp and
momentum, making it effective for a wide range of tasks.
 Problems with noisy gradients or gradients with high variance, as the moment
estimation helps stabilize the updates.
 Large datasets and high-dimensional parameter spaces: Adam is computationally efficient and has low memory requirements, which is beneficial in these scenarios.
Adam is often the default choice for deep learning tasks unless it underperforms. In some cases, adaptive algorithms like Adam can converge to suboptimal solutions, and plain SGD with momentum may then outperform them.
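Putting the two moment estimates and the bias correction together, here is a minimal Adam sketch; the hyperparameter values are the commonly used defaults and, like the toy objective, are illustrative assumptions.

import numpy as np

def grad_J(theta):
    return theta  # toy gradient

theta = np.array([3.0, -2.0])
eta = 0.1
beta1, beta2 = 0.9, 0.999   # decay rates for the first and second moment estimates
eps = 1e-8
m = np.zeros_like(theta)    # first moment: decaying average of gradients
v = np.zeros_like(theta)    # second moment: decaying average of squared gradients

for t in range(1, 201):
    g = grad_J(theta)
    m = beta1 * m + (1 - beta1) * g          # update first moment
    v = beta2 * v + (1 - beta2) * g ** 2     # update second moment
    m_hat = m / (1 - beta1 ** t)             # bias correction: both moments start at zero
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)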

Summary
Adagrad: Good for sparse datasets; learning rate diminishes over time.
RMSProp: Modifies Adagrad to use a moving average of squared gradients; good for non-
stationary objectives.
Adadelta: An extension of Adagrad (closely related to RMSProp) that doesn't require a default learning rate; adapts over time.
Adam: Combines momentum with RMSProp-style adaptive learning rates and includes bias correction; efficient for large datasets and high-dimensional spaces.
These optimizers are generally used in deep learning, and each has its strengths and
weaknesses depending on the context (e.g., the type of data, the architecture of the neural
network, or the specific type of problem being solved). It's common practice to try multiple
optimizers and choose the one that works best for a particular application.

General Guidance
It's important to note that the choice of optimizer might depend heavily on the specific
problem as well as the specific model architecture. Furthermore, the optimal optimizer for a
given scenario might change as research advances and new techniques are developed.
Therefore, it is common for practitioners to try several different optimizers and perform
hyperparameter tuning to find the best solution for their particular problem. Additionally, the
use of learning rate schedules or warm-up phases can substantially impact the performance of
these optimizers.
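As an illustration of such a schedule, here is a small sketch of a linear warm-up followed by cosine decay; the schedule shape, function name, and constants are illustrative assumptions, not something prescribed by these notes.

import math

def lr_schedule(step, base_lr=0.001, warmup_steps=500, total_steps=10000):
    # Linear warm-up to base_lr, then cosine decay towards zero.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

# Example: inspect the learning rate at a few points during training.
for s in (0, 250, 500, 5000, 9999):
    print(s, lr_schedule(s))

A schedule like this can be combined with any of the optimizers above by recomputing the learning rate at every step.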
