4 Optimization
Lecture 4
Optimization
Vasileios Belagiannis
Chair of Multimedia Communications and Signal Processing
Friedrich-Alexander-Universität Erlangen-Nürnberg
24.11.2023
• This is an ideal cost function that is not realizable in practice, because we do not
have access to p_data. It measures the expected generalization error, which is
usually called the risk.
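For reference, the risk can be written as the expectation of the per-example loss under the data distribution; in standard notation (the loss L and model f(x; θ) are generic placeholders, not taken from the slides):
J*(θ) = E_(x,y)∼p_data [ L(f(x; θ), y) ]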
We start from an initial point, i.e. a set of parameter values, and take a
step opposite to the direction of the gradient. The learning rate η
regulates the size of the gradient step. This is an iterative process,
i.e. forward pass → loss → back-propagation → gradient descent step, repeated until
convergence.
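A minimal sketch of this loop in Python/NumPy (the toy quadratic loss, learning rate and iteration count are illustrative assumptions, not values from the lecture):

import numpy as np

def grad_fn(w):
    # gradient of the toy loss L(w) = ||w - 3||^2
    return 2.0 * (w - 3.0)

w = np.zeros(2)          # initial point, i.e. a set of parameter values
eta = 0.1                # learning rate: regulates the size of the gradient step
for step in range(100):  # in practice, forward pass + back-propagation yield g
    g = grad_fn(w)
    w = w - eta * g      # step opposite to the direction of the gradient
print(w)                 # approaches the minimizer [3., 3.]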
Saddle Points:
• It is the point between the two
“mountains”: a local minimum
along one dimension and a local
maximum along the other.
• The situation is similar to local
minima: the optimization gets
stuck.
Filter normalization helps to visualize the loss function curvature [2], Visualizer:
http://www.telesens.co/loss-landscape-viz/viewer.html
Solution
Local minima and saddle points can be addressed with mini-batch stochastic
gradient descent and by adding noise to the labels. In addition, shuffling the training
data at every training epochᵃ helps. Although these three tricks are useful and often
necessary, they do not always suffice to escape saddle points. Several optimization
algorithms have been proposed to address the aforementioned issues; next we
discuss a few important ones. Note that we focus on first-order algorithms.
ᵃ An epoch corresponds to iterating over the training set one time.
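A minimal sketch of mini-batch SGD with per-epoch shuffling (the toy linear-regression data, batch size and learning rate are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))                     # toy inputs
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=256)       # toy targets

w, eta, batch_size = np.zeros(5), 0.05, 32
for epoch in range(20):
    idx = rng.permutation(len(X))                 # shuffle the training data every epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]         # one mini-batch
        err = X[b] @ w - y[b]
        g = 2.0 * X[b].T @ err / len(b)           # stochastic (mini-batch) gradient
        w = w - eta * g                           # gradient descent step
print(w)                                          # close to true_w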
w ← w + u. (10)
[Figure: classic momentum vs. Nesterov momentum (2-step update)]
Nesterov’s momentum first takes a step (red vector) based on the accumulated
gradient, and then estimates the gradient at the landed position to make the
correction.
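A sketch contrasting the two variants on a toy quadratic loss (the momentum coefficient γ, learning rate and loss are assumptions for illustration):

import numpy as np

def grad_fn(w):
    return 2.0 * (w - 3.0)        # gradient of the toy loss ||w - 3||^2

eta, gamma = 0.1, 0.9             # learning rate and momentum coefficient

# Classic momentum: accumulate the gradient into u, then apply Eq. 10 (w <- w + u).
w, u = np.zeros(2), np.zeros(2)
for _ in range(200):
    u = gamma * u - eta * grad_fn(w)
    w = w + u

# Nesterov momentum: first take the step based on the accumulated gradient
# (look-ahead), then evaluate the gradient at the landed position for the correction.
wn, un = np.zeros(2), np.zeros(2)
for _ in range(200):
    un = gamma * un - eta * grad_fn(wn + gamma * un)
    wn = wn + un

print(w, wn)                      # both approach [3., 3.]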
G = G + g ⊗ g (12)
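A minimal AdaGrad sketch: the accumulator follows Eq. 12 (element-wise squared gradients); the per-parameter scaled update and the small ε are standard AdaGrad choices assumed here, since only the accumulator appears above:

import numpy as np

def grad_fn(w):
    return 2.0 * (w - 3.0)        # toy gradient, assumed for illustration

w = np.zeros(2)
G = np.zeros(2)                   # running sum of squared gradients (Eq. 12)
eta, eps = 0.5, 1e-8              # assumed learning rate and numerical constant
for _ in range(300):
    g = grad_fn(w)
    G = G + g * g                 # Eq. 12: element-wise accumulation of g ⊗ g
    w = w - eta * g / (np.sqrt(G) + eps)   # per-parameter scaled step
print(w)                          # approaches [3., 3.]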
mt = β1 mt−1 + (1 − β1) g (29)
m0 = 0 (34)
E[mt] = E[gt](1 − β1) Σ_{i=1}^{t} β1^{t−i} + ζ = E[gt](1 − β1^t) + ζ (39)
• The true first moment E[gt] is scaled by (1 − β1^t). This happens due
to the initialization of the m vector with zeros. To correct the
introduced scale, we divide by (1 − β1^t). This is how we
obtain Eq. 31.
• The derivation is similar for the second moment as well.
• The final parameter update is given by:
wt ← wt−1 − η m̂t / (√ût + ε) (41)
where the recommended values are β1 = 0.9, β2 = 0.999 and ε = 10⁻⁸.
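A minimal Adam sketch combining Eqs. 29, 31 and 41 with the recommended β1, β2 and ε (the toy gradient, learning rate and iteration count are assumptions for illustration):

import numpy as np

def grad_fn(w):
    return 2.0 * (w - 3.0)                   # toy gradient, assumed for illustration

w = np.zeros(2)
m = np.zeros(2)                              # first-moment estimate, m_0 = 0
u = np.zeros(2)                              # second-moment estimate, u_0 = 0
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 3001):
    g = grad_fn(w)
    m = beta1 * m + (1 - beta1) * g          # Eq. 29
    u = beta2 * u + (1 - beta2) * g * g      # running second moment of g
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment (Eq. 31)
    u_hat = u / (1 - beta2 ** t)             # bias-corrected second moment
    w = w - eta * m_hat / (np.sqrt(u_hat) + eps)   # Eq. 41
print(w)                                     # close to [3., 3.]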