
Introduction to Deep Learning

Lecture 4
Optimization

Vasileios Belagiannis
Chair of Multimedia Communications and Signal Processing
Friedrich-Alexander-Universität Erlangen-Nürnberg

24.11.2023



Recap

We have discussed deep neural networks. In detail, we have seen:
• Multilayer Perceptron (MLP).
• Back-propagation.
• Back-propagation with vectors.
• How to combine back-propagation with gradient descent.
• Neural network activation functions.
• Softmax activation.
• Cross entropy.



Optimization

• Optimization in machine learning (ML) differs from traditional optimization.
• In ML, we care about some performance metric on the test set.
• However, we do not optimize directly for this metric. It is often not tractable or differentiable (e.g. F1-score, PR-curve).
• Instead, we minimize another loss function with the hope of improving the performance metric as well.
• In contrast, in pure optimization the loss function is what we actually aim to minimize.



Optimization
• In supervised learning, the cost function is normally defined as:

J(w) = E_{(x,y)∼p̂_data} L(f(x; w), y)   (1)

where L is the loss function, f(x; w) is the prediction function, x is the input and y is the label. We assume a classification problem where the label corresponds to a scalar value. Finally, p̂_data is the empirical data distribution; it corresponds to the training data.
• Assuming access to the data-generating distribution p_data, the cost function can be re-written as:

J*(w) = E_{(x,y)∼p_data} L(f(x; w), y).   (2)

• This is an ideal cost function that is not realistic in practice due to the lack of access to p_data. It measures the expected generalization error, usually called the risk.



Empirical Risk Minimization
• In machine learning, the goal is to minimize the risk as defined in Eq. 2.
• Unfortunately, we do not know p_data and consequently it is not possible to minimize the risk.
• Instead, we minimize the empirical risk defined by Eq. 1. Given m training samples, it can be re-written as:

J(w) = E_{(x,y)∼p̂_data} L(f(x; w), y) = (1/m) Σ_{i=1}^{m} L(f(x^i; w), y^i).   (3)

• Training a machine learning algorithm thus amounts to minimizing the average error on the training set. This is known as empirical risk minimization (a minimal sketch of Eq. 3 follows below).
• In short, we minimize the empirical risk and hope that the risk is minimized as well.
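A minimal sketch of the empirical risk in Eq. 3, using NumPy with an illustrative linear model and squared-error loss (the function and variable names are assumptions for the example, not part of the lecture):

import numpy as np

def empirical_risk(predict, w, X, y, loss):
    # Eq. 3: average the per-sample loss over the m training samples
    return np.mean([loss(predict(x, w), t) for x, t in zip(X, y)])

# Illustrative model and loss (not from the lecture)
predict = lambda x, w: float(np.dot(w, x))
squared_error = lambda pred, target: (pred - target) ** 2

X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])
y = np.array([1.0, -0.5, 2.0])
w = np.array([0.1, 0.2])
print(empirical_risk(predict, w, X, y, squared_error))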



Empirical Risk Minimization (Overfitting)

• A major disadvantage of empirical risk minimization is that it can overfit to the training data. A deep neural network can actually memorize the training data [1].
? What is a factor that makes a neural network memorize?
• Besides, there are loss functions L for which empirical risk minimization is not feasible due to poor gradients. For instance, the Hinge loss and the 0-1 loss are two examples with poor gradients.
• As an example, the 0-1 loss:

L(i, j) = { 0 if i = j; 1 otherwise }   (4)

• To address the two problems, we work with surrogate loss functions.



Surrogate Loss
• We conclude that the loss we actually want to minimize (e.g. the 0-1 loss in a two-class problem) may not be the right choice for the neural network.
• The solution is to optimize a surrogate loss that has (hopefully) the same effect.
• The negative log-likelihood of the correct class, for example, is normally used as the loss for binary classification problems instead of the 0-1 loss.
• Minimizing the negative log-likelihood → estimating the conditional class probability ≡ least classification error (i.e. risk).
• Note that in some cases the surrogate loss has better properties than the actual loss (e.g. 0-1 loss vs. negative log-likelihood); a small numeric sketch follows below.
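A minimal sketch (an assumed example, not from the lecture) contrasting the gradients of the 0-1 loss and the negative log-likelihood for a single logistic-regression-style score; it illustrates why the surrogate provides usable gradients while the 0-1 loss does not:

import numpy as np

def zero_one_loss(score, label):
    # 1 if the thresholded prediction disagrees with the label in {0, 1}
    pred = 1 if score > 0 else 0
    return float(pred != label)

def nll_loss(score, label):
    # Negative log-likelihood of the correct class under a sigmoid model
    p = 1.0 / (1.0 + np.exp(-score))
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

def numeric_grad(loss, score, label, h=1e-5):
    # Central-difference approximation of d loss / d score
    return (loss(score + h, label) - loss(score - h, label)) / (2 * h)

score, label = -0.3, 1
print(numeric_grad(zero_one_loss, score, label))  # 0.0 almost everywhere: no signal
print(numeric_grad(nll_loss, score, label))       # non-zero: a useful gradient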



Optimization (Local Minimum)

• In pure optimization, the loss function itself is minimized; reaching its minimum is the criterion for completion.
• In machine learning algorithms, the optimization ends when we reach convergence.
? How do we observe convergence?
• There might still be gradients, but we stop once convergence has been observed. In pure optimization, we iterate until the gradient tends to zero (local minimum).



Gradient Descent

As we discussed, gradient descent is the main optimization algorithm for neural networks. Based on the training set {(x_1, y_1), . . . , (x_n, y_n)}, we learn the mapping function f : X → Y, parametrized by w. We define the loss function L, use back-propagation to calculate the gradients with respect to the parameters w, and gradient descent to update the parameters.


Gradient Descent

w∗ = arg min_w L(f(x; w), y)   (5)



Gradient Descent

We start from an initial point, i.e. set of parameter values, and take a
step opposite to the direction of the gradient. The learning rate η
regulates the influence of the gradient step. This is an iterative process,
i.e. forward pass → loss → back-propagation → gradient descent, until
convergence.


Gradient Descent

w ← w − η∇_w L(f(x; w), y)   (6)



Gradient Descent (Variants)
There are three variants of gradient descent:
? What are the variants of gradient descent?
• Batch gradient descent → The gradient of the loss L(w), w.r.t. the parameters w, is computed on the whole training set. It is defined as: w ← w − η∇_w L(f(x; w), y).
• Stochastic gradient descent → The parameter update is performed for each training sample separately. It is defined as: w ← w − η∇_w L(x^i, y^i, w).
• Mini-batch gradient descent → A gradient update is performed for every mini-batch of n samples. It is defined as: w ← w − η∇_w L(x^{(i:i+n)}, y^{(i:i+n)}, w).
The mini-batch version is usually referred to as stochastic gradient descent (SGD). We will also refer to mini-batch gradient descent as SGD. A minimal code sketch of the three variants follows below.
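A minimal sketch of the three update schemes in NumPy, assuming a generic grad(w, X_part, y_part) function that returns ∇_w L averaged over whatever samples it receives (all names here are illustrative):

import numpy as np

def batch_gd_step(w, X, y, grad, lr):
    # One update using the gradient over the whole training set
    return w - lr * grad(w, X, y)

def sgd_step(w, X, y, grad, lr, i):
    # One update using a single training sample i
    return w - lr * grad(w, X[i:i+1], y[i:i+1])

def minibatch_sgd_step(w, X, y, grad, lr, i, n):
    # One update using the mini-batch of n samples starting at index i
    return w - lr * grad(w, X[i:i+n], y[i:i+n])

In practice one loops over shuffled mini-batches for several epochs; only the slice of data passed to grad differs between the three variants.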



Challenges with Gradient Descent (Local Minima)

• Neural networks contain non-linearities. As a result, they have highly non-convex loss functions.
• Depending on the starting point of the minimization, there is a danger of getting stuck at a local minimum where the gradient is zero.
• Note that there are good local minima in deep neural networks.

Image source: https://blog.paperspace.com/intro-to-optimization-in-deep-learning-gradient-descent



Challenges with Gradient Descent

Saddle Points:
• A saddle point is the point between two “mountains”: a local minimum along one dimension and a local maximum along the other.
• The situation is similar to a local minimum: the optimization can get stuck.



Visualizing the Loss Landscape of Neural Nets (NIPS 2018)

Network without skip connections. Network with skip connections.

Filter normalization helps to visualize the loss function curvature [2]. Visualizer: http://www.telesens.co/loss-landscape-viz/viewer.html



Challenges with Gradient Descent (Learning Rate)
Choosing the right learning rate is not trivial. In particular:
• A low learning rate can lead to slow convergence, and the solution may not be optimal.
• A large learning rate hampers convergence: the loss function can fluctuate around the minimum or even diverge.
• The learning rate is the same for all parameters. However, the learned features should be treated differently based on their sparsity.



Challenges with Gradient Descent (Learning Rate)
Choosing the right learning rate is not trivial. In particular:
• Adjusting the learning rate during training is necessary: larger learning rates are required only at the beginning, so an update rule (schedule) is needed (a minimal schedule sketch follows after the Solution box below).

Solution
Local minima and saddle points can be addressed with mini-batch stochastic gradient descent and by adding noise to the labels. In addition, shuffling the training data at every training epoch* helps. Although these three tricks are useful and necessary, they do not always suffice to escape saddle points. Optimization algorithms have been proposed to address these issues; next we discuss a few important ones. Note that we focus on first-order algorithms.
* An epoch corresponds to iterating over the training set one time.
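A minimal sketch of two common learning-rate schedules (step decay and exponential decay); the constants and function names are illustrative defaults, not values from the lecture:

import math

def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=10):
    # Halve the base learning rate every `epochs_per_drop` epochs
    return lr0 * (drop ** (epoch // epochs_per_drop))

def exponential_decay(lr0, epoch, k=0.05):
    # Smooth exponential decay of the base learning rate
    return lr0 * math.exp(-k * epoch)

for epoch in (0, 10, 20, 50):
    print(epoch, step_decay(0.1, epoch), exponential_decay(0.1, epoch))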



Gradient Descent Variants



SGD with Momentum
• Momentum accelerates the convergence of gradient descent by reducing the oscillations [3].
• It accumulates the past gradients to moderate the direction of the current gradient.
• The aim of momentum is to build up over time and add more velocity to the parameter update.
• We introduce the variable u to represent the velocity, defined as:

u ← αu − η∇_w [(1/m) Σ_{i=1}^{m} L(f(x^i; w), y^i)].   (7)

• The velocity exponentially decays while accumulating the negative gradient of the mini-batch with m samples.
• The momentum coefficient α ∈ [0, 1] is usually set to 0.9, while the learning rate η depends on the network, task and data.
• Finally, the parameters are updated by:

w ← w + u.   (8)

A minimal code sketch of Eqs. 7-8 follows below.
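A minimal sketch of the update in Eqs. 7-8, assuming the mini-batch gradient is supplied by the caller (the toy loss and names are illustrative):

import numpy as np

def momentum_step(w, u, grad_batch, lr=0.01, alpha=0.9):
    # Eq. 7: decay the old velocity and add the scaled negative gradient
    u = alpha * u - lr * grad_batch
    # Eq. 8: move the parameters along the velocity
    w = w + u
    return w, u

# Illustrative usage with a toy quadratic loss L(w) = 0.5 * ||w||^2
w = np.array([5.0, -3.0])
u = np.zeros_like(w)
for _ in range(100):
    grad_batch = w              # gradient of the toy loss
    w, u = momentum_step(w, u, grad_batch)
print(w)  # close to the minimum at the origin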



SGD with Momentum (Illustration)
• In general, the velocity increases in dimensions where the gradients consistently point in the same direction and decreases in dimensions where the gradients point in different directions.
? How do you interpret the velocity when α = 0.9?

SGD without momentum. SGD with momentum.

Name Origin
The term momentum stems from physics and Newton's laws of motion. According to Newtonian mechanics, momentum equals mass times velocity. Here, we assume unit mass and thus the momentum is equal to the velocity.



Nesterov Accelerated Gradient (NAG)
• It is very similar to classic momentum [4, 5].
• The main difference is that the gradient is evaluated after applying the velocity.
• The velocity is given by:

u ← αu − η∇_w [(1/m) Σ_{i=1}^{m} L(f(x^i; w + αu), y^i)].   (9)

• The parameter update remains the same and is given by:

w ← w + u.   (10)

• The evaluation of the gradient at w + αu can be interpreted as a correction to the standard momentum; a minimal code sketch follows below.
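A minimal sketch of the Nesterov update in Eqs. 9-10, assuming a grad(w_lookahead) function that evaluates the mini-batch gradient at the supplied look-ahead parameters (toy loss and names are illustrative):

import numpy as np

def nesterov_step(w, u, grad, lr=0.01, alpha=0.9):
    # Eq. 9: evaluate the gradient at the look-ahead point w + alpha * u
    g = grad(w + alpha * u)
    u = alpha * u - lr * g
    # Eq. 10: same parameter update as classic momentum
    w = w + u
    return w, u

# Illustrative usage with the toy quadratic loss L(w) = 0.5 * ||w||^2
grad = lambda w: w
w = np.array([5.0, -3.0])
u = np.zeros_like(w)
for _ in range(100):
    w, u = nesterov_step(w, u, grad)
print(w)  # close to the minimum at the origin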



Nesterov Accelerated Gradient (Illustration)

Nesterov momentum.
Classic momentum.
2-Steps
Nesterov momentum first takes a step (red vector) based on the accumulated gradient and then estimates the gradient at the landed position to make the correction.



Comparison of Momentum Algorithms



Adaptive Learning Rate



Adagrad [6]
We often face situations where some parameters are more sensitive to gradient updates than others. To avoid performance drops and convergence difficulties, the influence of the gradient should not be the same for all parameters. The network parameters can be updated individually based on their sensitivity. Adaptive learning rate algorithms take care of updating the parameters individually.
Adagrad is a standard algorithm for adaptive learning rates. In detail:
• The learning rate is adapted to lower values for frequently updated parameters.
• The learning rate is adapted to higher values for infrequently updated parameters.
• Consider a network layer, i.e. a set of parameters, as features. The frequently updated features will have a lower learning rate and the features with less frequent updates will obtain a higher learning rate. The goal is to balance the learning process.
? How can we measure the update frequency for each parameter?



Adagrad [6]
The current gradient values are scaled by the reciprocal of the square root of the sum of all historical squared gradient values. Step by step:
1 Compute the gradient:

g = ∇_w [(1/m) Σ_{i=1}^{m} L(f(x^i; w), y^i)]   (11)

2 Accumulate the squared gradient:

G = G + g ⊗ g   (12)

3 Compute and apply the update:

w ← w − (η / √(ε + G)) g   (13)

A minimal code sketch is given after the observations below.



Adagrad [6] Observations
A few observations on Adagrad:
• G is a diagonal matrix. It could be treated as a full matrix as well, but the computational cost would be large even for a small number of parameters.
• Thus, the operation ⊗ corresponds to elementwise multiplication.
• The gradients are accumulated over time, at each training iteration.
• ε is a very small value added for stability reasons.
• The square root applies elementwise.
• The main advantage of Adagrad is the automatic tuning of the learning rate.
• A common base learning rate is 0.01.
• Note that the learning rate is constantly scaled by the accumulated gradients; it becomes smaller over time.
• This resembles a nice property of second-order methods, without the cost of using a second-order method such as Newton's method.
? What is the limitation of Adagrad?
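A minimal per-parameter sketch of Eqs. 11-13, storing the diagonal of G as a vector (toy loss and names are illustrative):

import numpy as np

def adagrad_step(w, G, g, lr=0.01, eps=1e-8):
    # Eq. 12: accumulate the elementwise squared gradient (diagonal of G)
    G = G + g * g
    # Eq. 13: scale the step by the inverse square root of the accumulation
    w = w - lr * g / np.sqrt(eps + G)
    return w, G

# Illustrative usage with the toy quadratic loss L(w) = 0.5 * ||w||^2
w = np.array([5.0, -3.0])
G = np.zeros_like(w)
for _ in range(500):
    g = w                      # Eq. 11 for the toy loss
    w, G = adagrad_step(w, G, g)
print(w)  # w shrinks toward the origin, but slowly: the growing G keeps
          # reducing the effective step, which hints at Adagrad's limitation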



Adagrad Training Loss



Adadelta [7]
Adadelta is an approach that adapts the learning rate of each parameter (i.e. per dimension) independently. It is an extension of Adagrad. The motivation is to avoid the insistent reduction of the learning rate, which would eventually reach zero. In addition, Adagrad still has a global learning rate that requires some tuning. Adadelta has a small computational overhead compared to SGD, similar to Adagrad.
• The idea is to accumulate past gradients over a time window.
• Furthermore, the accumulation is an exponentially decaying average of the squared gradients (over the time window). The moving average E[g²] depends only on its previous average and the current gradient. This is similar to SGD with momentum, where the decay constant γ controls the contribution of the previous average. This can be written as:

E[g²]_t = γE[g²]_{t−1} + (1 − γ)g²   (14)

where

g = ∇_w [(1/m) Σ_{i=1}^{m} L(f(x^i; w), y^i)]   (15)



Adadelta [7]
• The value of γ is usually 0.9, as with momentum.
• We can plug the moving average into the parameter update to obtain:

w ← w − (η / √(ε + E[g²])) g   (16)

• If we ignore the constant ε, we see that the denominator is the root mean square (RMS) of the gradient, given as:

RMS[g] = √(E[g²])   (17)

The parameter update is now:

w ← w − (η / (ε + RMS[g])) g   (18)



Adadelta [7] Units
An interesting observation is that the gradient update should have the same units as the parameters. Strictly, the parameters carry no physical units, so we refer to hypothetical units. SGD, momentum and Adagrad also fail to match these units.
• For example, the units in SGD and momentum relate to the gradient and not to the parameters (assuming L is unitless):

∇_w L = ∂L/∂w ∝ 1 / units of w   (19)

• To address this issue, the Adadelta algorithm is reformulated. The update term in Eq. 18 is:

Δw_t = − (η / (ε + RMS[g]_t)) g_t   (20)



Adadelta [7] Units
• By rearranging Newton's method in terms of the inverse of the second-order derivative, we get:

Δw = (∂L/∂w) / (∂²L/∂w²)  ⇒  1 / (∂²L/∂w²) = Δw / (∂L/∂w)   (21)

• The term Δw is not known. We can, however, assume a continuous (slowly varying) curvature and use the term from the last time step t − 1. We can now rewrite the update term as:

Δw_t = − (RMS[Δw]_{t−1} / RMS[g]_t) g_t   (22)

• In general, one can use second-order derivatives to obtain more accurate update steps.
? What do we call the matrix of second-order derivatives?



Adadelta [7] Observations
• Note that the global learning rate η has completely vanished. This is a great advantage of Adadelta compared to the previously seen optimization approaches.
• In addition, the (hypothetical) units now match.
• Adding the update term back to the parameter update, we obtain:

w ← w − (RMS[Δw]_{t−1} / RMS[g]_t) g_t   (23)

• In RMS, we also integrate the constant ε for numerical stability. This is particularly important for the first iteration.
• The optimization is robust to sudden gradient changes, since the denominator increases and thus the effective learning rate decreases.
A minimal code sketch of the full Adadelta update follows below.
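A minimal sketch of the full Adadelta update (Eqs. 14, 22-23), keeping running averages of both the squared gradients and the squared updates; following the note above, ε is folded into the RMS terms, and the toy loss and names are illustrative:

import numpy as np

def adadelta_step(w, Eg2, Edw2, g, gamma=0.9, eps=1e-6):
    # Eq. 14: decaying average of squared gradients
    Eg2 = gamma * Eg2 + (1 - gamma) * g * g
    # Eq. 22: update scaled by the ratio of RMS values (no global learning rate)
    dw = -np.sqrt(Edw2 + eps) / np.sqrt(Eg2 + eps) * g
    # Decaying average of squared updates, used at the next step
    Edw2 = gamma * Edw2 + (1 - gamma) * dw * dw
    # Eq. 23: apply the update
    w = w + dw
    return w, Eg2, Edw2

# Illustrative usage with the toy quadratic loss L(w) = 0.5 * ||w||^2
w = np.array([5.0, -3.0])
Eg2 = np.zeros_like(w)
Edw2 = np.zeros_like(w)
for _ in range(2000):
    g = w
    w, Eg2, Edw2 = adadelta_step(w, Eg2, Edw2, g)
print(w)  # w has moved toward the origin without any global learning rate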



Adadelta [7] Comparisons



RMSprop
RMSprop has the same motivation as Adadelta, namely to improve on Adagrad. Remember that Adagrad shrinks the learning rate based on all previous values of the squared gradient, so the effective learning rate may become very small before reaching a convex structure where the solution lies. Although the algorithm was developed independently¹, it follows the same idea as Adadelta. In detail:
1 Compute the gradient:

g = ∇_w [(1/m) Σ_{i=1}^{m} L(f(x^i; w), y^i)]   (24)

2 Accumulate the squared gradient as a decaying average:

G = αG + (1 − α) g ⊗ g   (25)

3 Compute and apply the update:

w ← w − (η / √(ε + G)) g   (26)

¹ https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf



RMSprop
• Note that a common value for α is 0.9, similar to SGD with momentum.
• The accumulation of the squared gradient is the same as in Adadelta; we again have a moving average. Using the expectation, it can also be written as:

E[g²]_t = γE[g²]_{t−1} + (1 − γ)g²   (27)

• The parameter update is then the same as Adadelta's Eq. 18:

Δw = − (η / (ε + RMS[g])) g   (28)

• RMSprop does not adopt the concept of working in the same (hypothetical) units as Adadelta does. This is the first difference between RMSprop and Adadelta.
• Also, RMSprop retains the learning rate η, with a recommended value of 0.001 (second difference from Adadelta).
A minimal code sketch of the RMSprop update (Eqs. 24-26) follows below.
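A minimal sketch of Eqs. 24-26; G stores the decaying average of the elementwise squared gradients, and the toy loss, names and the larger toy learning rate are illustrative (the recommended value remains 0.001):

import numpy as np

def rmsprop_step(w, G, g, lr=0.001, alpha=0.9, eps=1e-8):
    # Eq. 25: exponentially decaying average of squared gradients
    G = alpha * G + (1 - alpha) * g * g
    # Eq. 26: scale the gradient by the inverse RMS
    w = w - lr * g / np.sqrt(eps + G)
    return w, G

# Illustrative usage with the toy quadratic loss L(w) = 0.5 * ||w||^2
w = np.array([5.0, -3.0])
G = np.zeros_like(w)
for _ in range(2000):
    g = w
    w, G = rmsprop_step(w, G, g, lr=0.01)  # larger than 0.001 for this toy
print(w)  # close to the origin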



Visualization of Different Optimization Algorithms

Visualizations of different optimizers can be found at https://imgur.com/a/Hqolp.



Adam [8]
Up to now, we have seen how to make use of the past squared gradients. Adam additionally stores the past gradients, similar to the momentum algorithms (i.e. the velocity). There are two moving averages to compute.
• Adam is another first-order gradient-based optimization algorithm.
• It has slightly higher memory requirements compared to Adagrad, Adadelta and RMSprop.
• It is well-suited for noisy and/or sparse gradients.
• The Adam algorithm relies on two moments:

m_t = β_1 m_{t−1} + (1 − β_1) g   (29)
u_t = β_2 u_{t−1} + (1 − β_2) g²   (30)

where m_t is the mean (first moment) and u_t is the uncentered variance (second moment) of the gradients. The hyper-parameters β_1, β_2 ∈ [0, 1) control the exponential decay rates of m_t and u_t.



Adam [8]
• The two moments are initialized with zeros. This creates a bias towards zero estimates (especially during the first iterations and when the decay is slow, i.e. β_1 and β_2 are close to 1) that needs to be removed.
• The bias-corrected estimates of the moments are given by:

m̂_t = m_t / (1 − β_1^t)   (31)
û_t = u_t / (1 − β_2^t)   (32)

where m̂_t is the bias-corrected first-moment estimate and û_t the bias-corrected second-moment estimate.
• Adam → Adaptive moment estimation.
? Why is the bias important?
? How do we derive the bias-corrected estimates?



Adam [8] Bias-Corrected Estimates
We examine how to derive the correction 1 − β_1^t for the first moment.
• The update equation of Eq. 29 can be written as a function of all previous steps:

m_t = (1 − β_1) Σ_{i=1}^{t} β_1^{t−i} g_i   (33)

To derive the above equation, consider that:

m_0 = 0   (34)
m_1 = β_1 m_0 + (1 − β_1)g_1 = (1 − β_1)g_1   (35)
m_2 = β_1 m_1 + (1 − β_1)g_2 = β_1(1 − β_1)g_1 + (1 − β_1)g_2   (36)
m_3 = β_1 m_2 + (1 − β_1)g_3 = . . .   (37)

• By expanding m_t, we end up with the pattern of Eq. 33.



Adam [8] Bias-Corrected Estimates
• Now, we examine the expected value of m_t and how it relates to the true first moment.
• We take the expectation of both sides of Eq. 33:

E[m_t] = E[(1 − β_1) Σ_{i=1}^{t} β_1^{t−i} g_i]   (38)
       = E[g_t](1 − β_1) Σ_{i=1}^{t} β_1^{t−i} + ζ   (39)
       = E[g_t](1 − β_1^t) + ζ   (40)

where ζ is the approximation error. Note that (1 − β_1) Σ_{i=1}^{t} β_1^{t−i} = (1 − β_1)(1 + β_1 + · · · + β_1^{t−1}) = (1 − β_1)(1 − β_1^t)/(1 − β_1) = 1 − β_1^t.

? What is the value of (1 − β_1) Σ_{i=1}^{t} β_1^{t−i} after a few thousand iterations?



Adam [8] Bias-Corrected Estimates
It approaches 1.

# Numerically check that (1 - beta) * sum_{i=1}^{t} beta^(t - i)
# approaches 1 for large t (here beta = 0.99 and t ~ 10000).
sum_beta = 0.0
beta = 0.99

for i in range(1, 10000):
    sum_beta += beta ** (i - 1)

print(sum_beta)               # close to 1 / (1 - beta) = 100
print((1 - beta) * sum_beta)  # close to 1

• The true first moment E[g_t] is scaled by (1 − β_1^t). This happens due to the initialization of the m vector with zeros. To correct the introduced scale, we divide by (1 − β_1^t). This is how we obtain Eq. 31.
• The derivation is similar for the second moment.
• The final parameter update is given by:

w_t ← w_{t−1} − η m̂_t / (√(û_t) + ε)   (41)

where the recommended values are β_1 = 0.9, β_2 = 0.999 and ε = 10^{−8}.



Adam [8] Observations

• The algorithm has demonstrated faster convergence than SGD with momentum, Adagrad, RMSprop and Adadelta.
• However, there are many cases where it converges to a worse solution [9].
• For that reason, Adam has also been combined with Nesterov momentum (Nadam [10]) to give more stable solutions.
A minimal sketch of the full Adam update (Eqs. 29-32, 41) follows below.
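A minimal sketch of the Adam update; the toy loss and names are illustrative, and the toy learning rate is larger than the recommended 0.001 to keep the example short:

import numpy as np

def adam_step(w, m, u, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Eqs. 29-30: update the biased first and second moment estimates
    m = beta1 * m + (1 - beta1) * g
    u = beta2 * u + (1 - beta2) * g * g
    # Eqs. 31-32: bias correction (t starts at 1)
    m_hat = m / (1 - beta1 ** t)
    u_hat = u / (1 - beta2 ** t)
    # Eq. 41: parameter update
    w = w - lr * m_hat / (np.sqrt(u_hat) + eps)
    return w, m, u

# Illustrative usage with the toy quadratic loss L(w) = 0.5 * ||w||^2
w = np.array([5.0, -3.0])
m = np.zeros_like(w)
u = np.zeros_like(w)
for t in range(1, 1001):
    g = w
    w, m, u = adam_step(w, m, u, g, t, lr=0.01)
print(w)  # close to the origin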



Choice of Optimizer

• Sparse data → optimizer with an adaptive learning rate.
• Fast convergence → optimizer with an adaptive learning rate.
• Large amounts of training data → SGD with or without momentum.



Next Lecture

Regularization, Initialization, Normalization.



References I
[1] Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 233–242. JMLR.org, 2017.
[2] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss
landscape of neural nets. In Advances in Neural Information Processing Systems, pages
6391–6401, 2018.
[3] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR
Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[4] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Sov. Math. Dokl., 27(2), 1984.
[5] Ilya Sutskever, James Martens, George E Dahl, and Geoffrey E Hinton. On the importance of
initialization and momentum in deep learning. ICML (3), 28(1139-1147):5, 2013.
[6] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online
learning and stochastic optimization. Journal of Machine Learning Research,
12(Jul):2121–2159, 2011.
[7] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint
arXiv:1212.5701, 2012.
[8] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
[9] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The
marginal value of adaptive gradient methods in machine learning. In Advances in Neural
Information Processing Systems, pages 4148–4158, 2017.
[10] Timothy Dozat. Incorporating nesterov momentum into adam. 2016.

