
Introduction to Deep Learning

Lecture 4
Optimization

Vasileios Belagiannis
Chair of Multimedia Communications and Signal Processing
Friedrich-Alexander-Universität Erlangen-Nürnberg

24.11.2023



Recap

We have discussed deep neural networks. In detail, we have seen:
• Multilayer Perceptron (MLP).
• Back-propagation.
• Back-propagation with vectors.
• How to combine back-propagation with gradient descent.
• Neural network activation functions.
• Softmax activation.
• Cross entropy.



Optimization

• Optimization in machine learning (ML) differs from traditional optimization.
• In ML, we care about some performance metric on the test set.
• However, we do not optimize directly for this metric. It is often not tractable or differentiable (e.g. F1-score, PR-curve).
• Instead, we minimize another loss function with the hope of improving the performance metric as well.
• In contrast, in pure optimization the loss function is what we actually aim to minimize.



Optimization
• In supervised learning, the cost function is normally defined as:

J(w) = E_{(x,y)∼p̂_data} L(f(x; w), y)   (1)

where L is the loss function, f(x; w) is the prediction function, x is the input and y is the label. We assume a classification problem where the label corresponds to a scalar value. Finally, p̂_data is the empirical data distribution; it corresponds to the training data.
• Assuming access to the data-generating distribution p_data, the cost function can be re-written as:

J*(w) = E_{(x,y)∼p_data} L(f(x; w), y).   (2)

• This is an ideal cost function that is not realistic in practice due to the lack of access to p_data. It measures the expected generalization error, usually called the risk.



Empirical Risk Minimization
• In machine learning, the goal is to minimize the risk as defined in Eq. 2.
• Unfortunately, we do not know p_data and consequently it is not possible to minimize the risk.
• Instead, we minimize the empirical risk defined by Eq. 1. Given m training samples, it can be re-written as:

J(w) = E_{(x,y)∼p̂_data} L(f(x; w), y) = (1/m) Σ_{i=1}^{m} L(f(x^i; w), y^i).   (3)

• Training a machine learning algorithm thus amounts to minimizing the average error on the training set. This is known as empirical risk minimization (a minimal sketch of Eq. 3 follows below).
• In short, we minimize the empirical risk and hope that the risk is minimized as well.
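A minimal sketch of the empirical risk in Eq. 3, using NumPy with an illustrative linear model and squared-error loss (the function and variable names are assumptions for the example, not part of the lecture):

import numpy as np

def empirical_risk(predict, w, X, y, loss):
    # Eq. 3: average the per-sample loss over the m training samples
    return np.mean([loss(predict(x, w), t) for x, t in zip(X, y)])

# Illustrative model and loss (not from the lecture)
predict = lambda x, w: float(np.dot(w, x))
squared_error = lambda pred, target: (pred - target) ** 2

X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])
y = np.array([1.0, -0.5, 2.0])
w = np.array([0.1, 0.2])
print(empirical_risk(predict, w, X, y, squared_error))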



Empirical Risk Minimization (Overfitting)

• A major disadvantage of empirical risk minimization is that it can overfit to the training data. A deep neural network can actually memorize the training data [1].
? What is a factor that makes a neural network memorize?
• Besides, there are loss functions L for which empirical risk minimization is not feasible due to poor gradients. For instance, the Hinge loss and the 0-1 loss are two examples with poor gradients.
• As an example, the 0-1 loss:

L(i, j) = { 0 if i = j; 1 otherwise }   (4)

• To address the two problems, we work with surrogate loss functions.



Surrogate Loss
• We conclude that the loss we actually want to minimize (e.g. the 0-1 loss in a two-class problem) may not be the right choice for the neural network.
• The solution is to optimize a surrogate loss that has (hopefully) the same effect.
• The negative log-likelihood of the correct class, for example, is normally used as the loss for binary classification problems instead of the 0-1 loss.
• Minimizing the negative log-likelihood → estimating the conditional class probability ≡ least classification error (i.e. risk).
• Note that in some cases the surrogate loss has better properties than the actual loss (e.g. 0-1 loss vs. negative log-likelihood); a small numeric sketch follows below.
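A minimal sketch (an assumed example, not from the lecture) contrasting the gradients of the 0-1 loss and the negative log-likelihood for a single logistic-regression-style score; it illustrates why the surrogate provides usable gradients while the 0-1 loss does not:

import numpy as np

def zero_one_loss(score, label):
    # 1 if the thresholded prediction disagrees with the label in {0, 1}
    pred = 1 if score > 0 else 0
    return float(pred != label)

def nll_loss(score, label):
    # Negative log-likelihood of the correct class under a sigmoid model
    p = 1.0 / (1.0 + np.exp(-score))
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

def numeric_grad(loss, score, label, h=1e-5):
    # Central-difference approximation of d loss / d score
    return (loss(score + h, label) - loss(score - h, label)) / (2 * h)

score, label = -0.3, 1
print(numeric_grad(zero_one_loss, score, label))  # 0.0 almost everywhere: no signal
print(numeric_grad(nll_loss, score, label))       # non-zero: a useful gradient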



Optimization (Local Minimum)

• In pure optimization, the loss function itself is minimized; reaching its minimum is the criterion for completion.
• In machine learning algorithms, the optimization ends when we reach convergence.
? How do we observe convergence?
• There might still be gradients, but we stop once convergence has been observed. In pure optimization, we iterate until the gradient tends to zero (local minimum).



Gradient Descent

As we discussed, gradient descent is the main optimization algorithm for neural networks. Based on the training set {(x_1, y_1), . . . , (x_n, y_n)}, we learn the mapping function f : X → Y, parametrized by w. We define the loss function L, use back-propagation to calculate the gradients with respect to the parameters w, and gradient descent to update the parameters.


Gradient Descent

w∗ = arg min_w L(f(x; w), y)   (5)



Gradient Descent

We start from an initial point, i.e. set of parameter values, and take a
step opposite to the direction of the gradient. The learning rate η
regulates the influence of the gradient step. This is an iterative process,
i.e. forward pass → loss → back-propagation → gradient descent, until
convergence.


Gradient Descent

w ← w − η∇_w L(f(x; w), y)   (6)



Gradient Descent (Variants)
There are three variants of gradient descent:
? What are the variants of gradient descent?
• Batch gradient descent → The gradient of the loss L(w), w.r.t. the parameters w, is computed on the whole training set. It is defined as: w ← w − η∇_w L(f(x; w), y).
• Stochastic gradient descent → The parameter update is performed for each training sample separately. It is defined as: w ← w − η∇_w L(x^i, y^i, w).
• Mini-batch gradient descent → A gradient update is performed for every mini-batch of n samples. It is defined as: w ← w − η∇_w L(x^{(i:i+n)}, y^{(i:i+n)}, w).
The mini-batch version is usually referred to as stochastic gradient descent (SGD). We will also refer to mini-batch gradient descent as SGD. A minimal code sketch of the three variants follows below.
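A minimal sketch of the three update schemes in NumPy, assuming a generic grad(w, X_part, y_part) function that returns ∇_w L averaged over whatever samples it receives (all names here are illustrative):

import numpy as np

def batch_gd_step(w, X, y, grad, lr):
    # One update using the gradient over the whole training set
    return w - lr * grad(w, X, y)

def sgd_step(w, X, y, grad, lr, i):
    # One update using a single training sample i
    return w - lr * grad(w, X[i:i+1], y[i:i+1])

def minibatch_sgd_step(w, X, y, grad, lr, i, n):
    # One update using the mini-batch of n samples starting at index i
    return w - lr * grad(w, X[i:i+n], y[i:i+n])

In practice one loops over shuffled mini-batches for several epochs; only the slice of data passed to grad differs between the three variants.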



Challenges with Gradient Descent (Local Minima)

• Neural networks contain non-linearities. As a result, they have highly non-convex loss functions.
• Depending on the starting point of the minimization, there is a danger of getting stuck at a local minimum where the gradient is zero.
• Note that there are good local minima in deep neural networks.

Image source: https://blog.paperspace.com/intro-to-optimization-in-deep-learning-gradient-descent



Challenges with Gradient Descent

Saddle Points:
• A saddle point is the point between two “mountains”: a local minimum along one dimension and a local maximum along the other.
• The situation is similar to a local minimum: the optimization can get stuck.



Visualizing the Loss Landscape of Neural Nets (NIPS 2018)

Network without skip connections. Network with skip connections.

Filter normalization helps to visualize the loss function curvature [2]. Visualizer: http://www.telesens.co/loss-landscape-viz/viewer.html



Challenges with Gradient Descent (Learning Rate)
Choosing the right learning rate is not trivial. In particular:
• A low learning rate can lead to slow convergence, and the solution may not be optimal.
• A large learning rate hampers convergence: the loss function can fluctuate around the minimum or even diverge.
• The learning rate is the same for all parameters. However, the learned features should be treated differently based on their sparsity.



Challenges with Gradient Descent (Learning Rate)
Choosing the right learning rate is not trivial. In particular:
• Adjusting the learning rate during training is necessary: larger learning rates are required only at the beginning, so an update rule (schedule) is needed (a minimal schedule sketch follows after the Solution box below).

Solution
Local minima and saddle points can be addressed with mini-batch stochastic gradient descent and by adding noise to the labels. In addition, shuffling the training data at every training epoch* helps. Although these three tricks are useful and necessary, they do not always suffice to escape saddle points. Optimization algorithms have been proposed to address these issues; next we discuss a few important ones. Note that we focus on first-order algorithms.
* An epoch corresponds to iterating over the training set one time.
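A minimal sketch of two common learning-rate schedules (step decay and exponential decay); the constants and function names are illustrative defaults, not values from the lecture:

import math

def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=10):
    # Halve the base learning rate every `epochs_per_drop` epochs
    return lr0 * (drop ** (epoch // epochs_per_drop))

def exponential_decay(lr0, epoch, k=0.05):
    # Smooth exponential decay of the base learning rate
    return lr0 * math.exp(-k * epoch)

for epoch in (0, 10, 20, 50):
    print(epoch, step_decay(0.1, epoch), exponential_decay(0.1, epoch))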



Gradient Descent Variants



SGD with Momentum
• Momentum accelerates the convergence of gradient descent by reducing the oscillations [3].
• It accumulates the past gradients to moderate the direction of the current gradient.
• The aim of momentum is to build up over time and add more velocity to the parameter update.
• We introduce the variable u to represent the velocity, defined as:

u ← αu − η∇_w [(1/m) Σ_{i=1}^{m} L(f(x^i; w), y^i)].   (7)

• The velocity exponentially decays while accumulating the negative gradient of the mini-batch with m samples.
• The momentum coefficient α ∈ [0, 1] is usually set to 0.9, while the learning rate η depends on the network, task and data.
• Finally, the parameters are updated by:

w ← w + u.   (8)

A minimal code sketch of Eqs. 7-8 follows below.
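A minimal sketch of the update in Eqs. 7-8, assuming the mini-batch gradient is supplied by the caller (the toy loss and names are illustrative):

import numpy as np

def momentum_step(w, u, grad_batch, lr=0.01, alpha=0.9):
    # Eq. 7: decay the old velocity and add the scaled negative gradient
    u = alpha * u - lr * grad_batch
    # Eq. 8: move the parameters along the velocity
    w = w + u
    return w, u

# Illustrative usage with a toy quadratic loss L(w) = 0.5 * ||w||^2
w = np.array([5.0, -3.0])
u = np.zeros_like(w)
for _ in range(100):
    grad_batch = w              # gradient of the toy loss
    w, u = momentum_step(w, u, grad_batch)
print(w)  # close to the minimum at the origin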



SGD with Momentum (Illustration)
• In general, the velocity increases in dimensions where the gradients consistently point in the same direction and decreases in dimensions where the gradients point in different directions.
? How do you interpret the velocity when α = 0.9?

SGD without momentum. SGD with momentum.

Name Origin
The term momentum stems from physics and Newton's laws of motion. According to Newtonian mechanics, momentum equals mass times velocity. Here, we assume unit mass and thus the momentum is equal to the velocity.



Nesterov Accelerated Gradient (NAG)
• It is very similar to classic momentum [4, 5].
• The main difference is that the gradient is evaluated after applying the velocity.
• The velocity is given by:

u ← αu − η∇_w [(1/m) Σ_{i=1}^{m} L(f(x^i; w + αu), y^i)].   (9)

• The parameter update remains the same and is given by:

w ← w + u.   (10)

• The evaluation of the gradient at w + αu can be interpreted as a correction to the standard momentum; a minimal code sketch follows below.
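A minimal sketch of the Nesterov update in Eqs. 9-10, assuming a grad(w_lookahead) function that evaluates the mini-batch gradient at the supplied look-ahead parameters (toy loss and names are illustrative):

import numpy as np

def nesterov_step(w, u, grad, lr=0.01, alpha=0.9):
    # Eq. 9: evaluate the gradient at the look-ahead point w + alpha * u
    g = grad(w + alpha * u)
    u = alpha * u - lr * g
    # Eq. 10: same parameter update as classic momentum
    w = w + u
    return w, u

# Illustrative usage with the toy quadratic loss L(w) = 0.5 * ||w||^2
grad = lambda w: w
w = np.array([5.0, -3.0])
u = np.zeros_like(w)
for _ in range(100):
    w, u = nesterov_step(w, u, grad)
print(w)  # close to the minimum at the origin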



Nesterov Accelerated Gradient (Illustration)

Nesterov momentum.
Classic momentum.
2-Steps
Nesterov momentum first takes a step (red vector) based on the accumulated gradient and then estimates the gradient at the landed position to make the correction.



Comparison of Momentum Algorithms



Adaptive Learning Rate



Adagrad [6]
We often face situations where some parameters are more sensitive to gradient updates than others. To avoid performance drops and convergence difficulties, the influence of the gradient should not be the same for all parameters. The network parameters can be updated individually based on their sensitivity. Adaptive learning rate algorithms take care of updating the parameters individually.
Adagrad is a standard algorithm for adaptive learning rates. In detail:
• The learning rate is adapted to lower values for frequently updated parameters.
• The learning rate is adapted to higher values for infrequently updated parameters.
• Consider a network layer, i.e. a set of parameters, as features. The frequently updated features will have a lower learning rate and the features with less frequent updates will obtain a higher learning rate. The goal is to balance the learning process.
? How can we measure the update frequency for each parameter?



Adagrad [6]
The current gradient values are scaled by the reciprocal of the square root of the sum of all historical squared gradient values. Step by step:
1 Compute the gradient:

g = ∇_w [(1/m) Σ_{i=1}^{m} L(f(x^i; w), y^i)]   (11)

2 Accumulate the squared gradient:

G = G + g ⊗ g   (12)

3 Compute and apply the update:

w ← w − (η / √(ε + G)) g   (13)

A minimal code sketch is given after the observations below.



Adagrad [6] Observations
A few observations on Adagrad:
• G is a diagonal matrix. It could be treated as a full matrix as well, but the computational cost would be large even for a small number of parameters.
• Thus, the operation ⊗ corresponds to elementwise multiplication.
• The gradients are accumulated over time, at each training iteration.
• ε is a very small value added for stability reasons.
• The square root applies elementwise.
• The main advantage of Adagrad is the automatic tuning of the learning rate.
• A common base learning rate is 0.01.
• Note that the learning rate is constantly scaled by the accumulated gradients; it becomes smaller over time.
• This resembles a nice property of second-order methods, without the cost of using a second-order method such as Newton's method.
? What is the limitation of Adagrad?
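A minimal per-parameter sketch of Eqs. 11-13, storing the diagonal of G as a vector (toy loss and names are illustrative):

import numpy as np

def adagrad_step(w, G, g, lr=0.01, eps=1e-8):
    # Eq. 12: accumulate the elementwise squared gradient (diagonal of G)
    G = G + g * g
    # Eq. 13: scale the step by the inverse square root of the accumulation
    w = w - lr * g / np.sqrt(eps + G)
    return w, G

# Illustrative usage with the toy quadratic loss L(w) = 0.5 * ||w||^2
w = np.array([5.0, -3.0])
G = np.zeros_like(w)
for _ in range(500):
    g = w                      # Eq. 11 for the toy loss
    w, G = adagrad_step(w, G, g)
print(w)  # w shrinks toward the origin, but slowly: the growing G keeps
          # reducing the effective step, which hints at Adagrad's limitation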



Adagrad Training Loss



Adadelta [7]
Adadelta is an approach that adapts the learning rate of each parameter (i.e. per dimension) independently. It is an extension of Adagrad. The motivation is to avoid the insistent reduction of the learning rate, which would eventually reach zero. In addition, Adagrad still has a global learning rate that requires some tuning. Adadelta has a small computational overhead compared to SGD, similar to Adagrad.
• The idea is to accumulate past gradients over a time window.
• Furthermore, the accumulation is an exponentially decaying average of the squared gradients (over the time window). The moving average E[g²] depends only on its previous average and the current gradient. This is similar to SGD with momentum, where the decay constant γ controls the contribution of the previous average. This can be written as:

E[g²]_t = γE[g²]_{t−1} + (1 − γ)g²   (14)

where

g = ∇_w [(1/m) Σ_{i=1}^{m} L(f(x^i; w), y^i)]   (15)



Adadelta [7]
• The value of γ is usually 0.9, as with momentum.
• We can plug the moving average into the parameter update to obtain:

w ← w − (η / √(ε + E[g²])) g   (16)

• If we ignore the constant ε, we see that the denominator is the root mean square (RMS) of the gradient, given as:

RMS[g] = √(E[g²])   (17)

The parameter update is now:

w ← w − (η / (ε + RMS[g])) g   (18)



Adadelta [7] Units
An interesting observation is that the gradient update should have the same units as the parameters. Strictly, the parameters carry no physical units, so we refer to hypothetical units. SGD, momentum and Adagrad also fail to match these units.
• For example, the units in SGD and momentum relate to the gradient and not to the parameters (assuming L is unitless):

∇_w L = ∂L/∂w ∝ 1 / units of w   (19)

• To address this issue, the Adadelta algorithm is reformulated. The update term in Eq. 18 is:

Δw_t = − (η / (ε + RMS[g]_t)) g_t   (20)



Adadelta [7] Units
• By rearranging Newton's method in terms of the inverse of the second-order derivative, we get:

Δw = (∂L/∂w) / (∂²L/∂w²)  ⇒  1 / (∂²L/∂w²) = Δw / (∂L/∂w)   (21)

• The term Δw is not known. We can, however, assume a continuous (slowly varying) curvature and use the term from the last time step t − 1. We can now rewrite the update term as:

Δw_t = − (RMS[Δw]_{t−1} / RMS[g]_t) g_t   (22)

• In general, one can use second-order derivatives to obtain more accurate update steps.
? What do we call the matrix of second-order derivatives?



Adadelta [7] Observations
• Note that the global learning rate η has completely vanished. This is a great advantage of Adadelta compared to the previously seen optimization approaches.
• In addition, the (hypothetical) units now match.
• Adding the update term back to the parameter update, we obtain:

w ← w − (RMS[Δw]_{t−1} / RMS[g]_t) g_t   (23)

• In RMS, we also integrate the constant ε for numerical stability. This is particularly important for the first iteration.
• The optimization is robust to sudden gradient changes, since the denominator increases and thus the effective learning rate decreases.
A minimal code sketch of the full Adadelta update follows below.
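A minimal sketch of the full Adadelta update (Eqs. 14, 22-23), keeping running averages of both the squared gradients and the squared updates; following the note above, ε is folded into the RMS terms, and the toy loss and names are illustrative:

import numpy as np

def adadelta_step(w, Eg2, Edw2, g, gamma=0.9, eps=1e-6):
    # Eq. 14: decaying average of squared gradients
    Eg2 = gamma * Eg2 + (1 - gamma) * g * g
    # Eq. 22: update scaled by the ratio of RMS values (no global learning rate)
    dw = -np.sqrt(Edw2 + eps) / np.sqrt(Eg2 + eps) * g
    # Decaying average of squared updates, used at the next step
    Edw2 = gamma * Edw2 + (1 - gamma) * dw * dw
    # Eq. 23: apply the update
    w = w + dw
    return w, Eg2, Edw2

# Illustrative usage with the toy quadratic loss L(w) = 0.5 * ||w||^2
w = np.array([5.0, -3.0])
Eg2 = np.zeros_like(w)
Edw2 = np.zeros_like(w)
for _ in range(2000):
    g = w
    w, Eg2, Edw2 = adadelta_step(w, Eg2, Edw2, g)
print(w)  # w has moved toward the origin without any global learning rate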



Adadelta [7] Comparisons



RMSprop
RMSprop has the same motivation as Adadelta, namely to improve on Adagrad. Remember that Adagrad shrinks the learning rate based on all previous values of the squared gradient, so the effective learning rate may become very small before reaching a convex structure where the solution lies. Although the algorithm was developed independently¹, it follows the same idea as Adadelta. In detail:
1 Compute the gradient:

g = ∇_w [(1/m) Σ_{i=1}^{m} L(f(x^i; w), y^i)]   (24)

2 Accumulate the squared gradient as a decaying average:

G = αG + (1 − α) g ⊗ g   (25)

3 Compute and apply the update:

w ← w − (η / √(ε + G)) g   (26)

¹ https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf



RMSprop
• Note that a common value for α is 0.9, similar to SGD with momentum.
• The accumulation of the squared gradient is the same as in Adadelta; we again have a moving average. Using the expectation, it can also be written as:

E[g²]_t = γE[g²]_{t−1} + (1 − γ)g²   (27)

• The parameter update is then the same as Adadelta's Eq. 18:

Δw = − (η / (ε + RMS[g])) g   (28)

• RMSprop does not adopt the concept of working in the same (hypothetical) units as Adadelta does. This is the first difference between RMSprop and Adadelta.
• Also, RMSprop retains the learning rate η, with a recommended value of 0.001 (second difference from Adadelta).
A minimal code sketch of the RMSprop update (Eqs. 24-26) follows below.
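A minimal sketch of Eqs. 24-26; G stores the decaying average of the elementwise squared gradients, and the toy loss, names and the larger toy learning rate are illustrative (the recommended value remains 0.001):

import numpy as np

def rmsprop_step(w, G, g, lr=0.001, alpha=0.9, eps=1e-8):
    # Eq. 25: exponentially decaying average of squared gradients
    G = alpha * G + (1 - alpha) * g * g
    # Eq. 26: scale the gradient by the inverse RMS
    w = w - lr * g / np.sqrt(eps + G)
    return w, G

# Illustrative usage with the toy quadratic loss L(w) = 0.5 * ||w||^2
w = np.array([5.0, -3.0])
G = np.zeros_like(w)
for _ in range(2000):
    g = w
    w, G = rmsprop_step(w, G, g, lr=0.01)  # larger than 0.001 for this toy
print(w)  # close to the origin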



Visualization of Different Optimization Algorithms

Visualizations of different optimizers can be found at https://imgur.com/a/Hqolp.



Adam [8]
Up to now, we have seen how to make use of the past squared gradients. Adam additionally stores the past gradients, similar to the momentum algorithms (i.e. the velocity). There are two moving averages to compute.
• Adam is another first-order gradient-based optimization algorithm.
• It has slightly higher memory requirements compared to Adagrad, Adadelta and RMSprop.
• It is well-suited for noisy and/or sparse gradients.
• The Adam algorithm relies on two moments:

m_t = β_1 m_{t−1} + (1 − β_1) g   (29)
u_t = β_2 u_{t−1} + (1 − β_2) g²   (30)

where m_t is the mean (first moment) and u_t is the uncentered variance (second moment) of the gradients. The hyper-parameters β_1, β_2 ∈ [0, 1) control the exponential decay rates of m_t and u_t.



Adam [8]
• The two moments are initialized with zeros. This creates a bias towards zero estimates (especially during the first iterations and when the decay is slow, i.e. β_1 and β_2 are close to 1) that needs to be removed.
• The bias-corrected estimates of the moments are given by:

m̂_t = m_t / (1 − β_1^t)   (31)
û_t = u_t / (1 − β_2^t)   (32)

where m̂_t is the bias-corrected first-moment estimate and û_t the bias-corrected second-moment estimate.
• Adam → Adaptive moment estimation.
? Why is the bias important?
? How do we derive the bias-corrected estimates?



Adam [8] Bias-Corrected Estimates
We examine how to derive the correction 1 − β_1^t for the first moment.
• The update equation of Eq. 29 can be written as a function of all previous steps:

m_t = (1 − β_1) Σ_{i=1}^{t} β_1^{t−i} g_i   (33)

To derive the above equation, consider that:

m_0 = 0   (34)
m_1 = β_1 m_0 + (1 − β_1)g_1 = (1 − β_1)g_1   (35)
m_2 = β_1 m_1 + (1 − β_1)g_2 = β_1(1 − β_1)g_1 + (1 − β_1)g_2   (36)
m_3 = β_1 m_2 + (1 − β_1)g_3 = . . .   (37)

• By expanding m_t, we end up with the pattern of Eq. 33.



Adam [8] Bias-Corrected Estimates
• Now, we examine the expected value of m_t and how it relates to the true first moment.
• We take the expectation of both sides of Eq. 33:

E[m_t] = E[(1 − β_1) Σ_{i=1}^{t} β_1^{t−i} g_i]   (38)
       = E[g_t](1 − β_1) Σ_{i=1}^{t} β_1^{t−i} + ζ   (39)
       = E[g_t](1 − β_1^t) + ζ   (40)

where ζ is the approximation error. Note that (1 − β_1) Σ_{i=1}^{t} β_1^{t−i} = (1 − β_1)(1 + β_1 + · · · + β_1^{t−1}) = (1 − β_1)(1 − β_1^t)/(1 − β_1) = 1 − β_1^t.

? What is the value of (1 − β_1) Σ_{i=1}^{t} β_1^{t−i} after a few thousand iterations?



Adam [8] Bias-Corrected Estimates
It approaches 1.

# Numerically check that (1 - beta) * sum_{i=1}^{t} beta^(t - i)
# approaches 1 for large t (here beta = 0.99 and t ~ 10000).
sum_beta = 0.0
beta = 0.99

for i in range(1, 10000):
    sum_beta += beta ** (i - 1)

print(sum_beta)               # close to 1 / (1 - beta) = 100
print((1 - beta) * sum_beta)  # close to 1

• The true first moment E[g_t] is scaled by (1 − β_1^t). This happens due to the initialization of the m vector with zeros. To correct the introduced scale, we divide by (1 − β_1^t). This is how we obtain Eq. 31.
• The derivation is similar for the second moment.
• The final parameter update is given by:

w_t ← w_{t−1} − η m̂_t / (√(û_t) + ε)   (41)

where the recommended values are β_1 = 0.9, β_2 = 0.999 and ε = 10^{−8}.



Adam [8] Observations

• The algorithm has demonstrated faster convergence than SGD with momentum, Adagrad, RMSprop and Adadelta.
• However, there are many cases where it converges to a worse solution [9].
• For that reason, Adam has also been combined with Nesterov momentum (Nadam [10]) to give more stable solutions.
A minimal sketch of the full Adam update (Eqs. 29-32, 41) follows below.
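A minimal sketch of the Adam update; the toy loss and names are illustrative, and the toy learning rate is larger than the recommended 0.001 to keep the example short:

import numpy as np

def adam_step(w, m, u, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Eqs. 29-30: update the biased first and second moment estimates
    m = beta1 * m + (1 - beta1) * g
    u = beta2 * u + (1 - beta2) * g * g
    # Eqs. 31-32: bias correction (t starts at 1)
    m_hat = m / (1 - beta1 ** t)
    u_hat = u / (1 - beta2 ** t)
    # Eq. 41: parameter update
    w = w - lr * m_hat / (np.sqrt(u_hat) + eps)
    return w, m, u

# Illustrative usage with the toy quadratic loss L(w) = 0.5 * ||w||^2
w = np.array([5.0, -3.0])
m = np.zeros_like(w)
u = np.zeros_like(w)
for t in range(1, 1001):
    g = w
    w, m, u = adam_step(w, m, u, g, t, lr=0.01)
print(w)  # close to the origin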



Choice of Optimizer

• Sparse data → optimizer with an adaptive learning rate.
• Fast convergence → optimizer with an adaptive learning rate.
• Large amounts of training data → SGD with or without momentum.



Next Lecture

Regularization, Initialization, Normalization.



References I
[1] Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 233–242. JMLR.org, 2017.
[2] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss
landscape of neural nets. In Advances in Neural Information Processing Systems, pages
6391–6401, 2018.
[3] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR
Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[4] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Sov. Math. Dokl., 27(2), 1984.
[5] Ilya Sutskever, James Martens, George E Dahl, and Geoffrey E Hinton. On the importance of
initialization and momentum in deep learning. ICML (3), 28(1139-1147):5, 2013.
[6] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online
learning and stochastic optimization. Journal of Machine Learning Research,
12(Jul):2121–2159, 2011.
[7] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint
arXiv:1212.5701, 2012.
[8] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
[9] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The
marginal value of adaptive gradient methods in machine learning. In Advances in Neural
Information Processing Systems, pages 4148–4158, 2017.
[10] Timothy Dozat. Incorporating nesterov momentum into adam. 2016.

