
03 Optimization

The document discusses different optimization techniques for machine learning models, including vanilla gradient descent, stochastic gradient descent, mini-batch stochastic gradient descent, and momentum. It explains how momentum can provide faster convergence and reduced oscillation compared to vanilla stochastic gradient descent.

Optimization

Applied Deep Learning, March 9th, 2020
Yun-Nung (Vivian) Chen
HTTP://ADL.MIULAB.TW
Vanilla Gradient Descent
• Computes the gradient of the cost function w.r.t. the parameters θ for
the entire training dataset.
• As we need to calculate the gradients for the whole dataset to perform
just one update, batch gradient descent can be very slow and is
intractable for datasets that don't fit in memory.
• Batch gradient descent also doesn't allow us to update our model online,
i.e. with new examples on-the-fly.

\theta = \theta - \eta \cdot \nabla_\theta J(\theta)
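As a concrete illustration of the update rule above, here is a minimal NumPy sketch of batch gradient descent on a small least-squares problem; the objective, synthetic data, learning rate, and step count are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Batch gradient descent on J(theta) = 1/(2N) * ||X @ theta - y||^2 (synthetic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # the entire training dataset
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)

theta = np.zeros(3)
eta = 0.1                                     # learning rate
for step in range(200):
    grad = X.T @ (X @ theta - y) / len(y)     # gradient over the whole dataset
    theta = theta - eta * grad                # theta <- theta - eta * grad_theta J(theta)
print(theta)                                  # approaches [1.0, -2.0, 0.5]
```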
Stochastic Gradient Descent
• Stochastic gradient descent (SGD) in contrast performs a parameter
update for each training example.
• It is therefore usually much faster and can also be used to learn online.

\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i)}; y^{(i)})
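A minimal sketch of the per-example update above, on an assumed synthetic least-squares problem (data, learning rate, and epoch count are illustrative):

```python
import numpy as np

# Stochastic gradient descent: one parameter update per training example (x_i, y_i).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)

theta = np.zeros(3)
eta = 0.05                                    # learning rate
for epoch in range(20):
    for i in rng.permutation(len(y)):         # shuffle the examples each epoch
        grad_i = X[i] * (X[i] @ theta - y[i]) # gradient of the single-example loss
        theta = theta - eta * grad_i          # update after every example
print(theta)
```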
Mini-batch Stochastic Gradient Descent
• Mini-batch gradient descent finally takes the best of both worlds and
performs an update for every mini-batch of n training examples.
• This reduces the variance of the parameter updates, which can lead to more stable convergence.
• On modern hardware, 16 operations of size 1 are much slower than 1 operation of size 16 (parallelization on GPUs).

\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i:i+n)}; y^{(i:i+n)})
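A sketch of the mini-batch variant under the same illustrative setup, with an assumed mini-batch size of n = 16:

```python
import numpy as np

# Mini-batch SGD: one update per mini-batch of n examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)

theta = np.zeros(3)
eta, n = 0.1, 16                              # learning rate and mini-batch size
for epoch in range(30):
    order = rng.permutation(len(y))
    for start in range(0, len(y), n):
        idx = order[start:start + n]
        grad = X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)  # averaged over the batch
        theta = theta - eta * grad
print(theta)
```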
Local Minimum

[Figure: cost function J(θ) over θ, with successive parameter values θ0, θ1, θ2 descending toward a local minimum]
Challenges
• Choosing a proper learning rate can be difficult.
• Learning rate schedules try to adjust the learning rate during training by e.g. annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold.
• These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset's characteristics (a simple pre-defined schedule is sketched below).

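For illustration, a minimal step-decay schedule; the initial rate, decay factor, and decay interval are hypothetical, and they must be fixed in advance in exactly the sense criticized above.

```python
def step_decay_lr(initial_lr, epoch, decay_factor=0.5, decay_every=10):
    """Pre-defined annealing: multiply the rate by decay_factor every decay_every epochs."""
    return initial_lr * (decay_factor ** (epoch // decay_every))

for epoch in (0, 10, 20, 30):
    print(epoch, step_decay_lr(0.1, epoch))   # 0.1, 0.05, 0.025, 0.0125
```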
Beyond SGD

Momentum
v_t = \gamma v_{t-1} + \eta \cdot \nabla_\theta J(\theta)
\theta = \theta - v_t

[Figure: cost function J(θ) over θ, with momentum updates moving from θ0 toward θ1]
Momentum
• Momentum accumulates the gradients of the past mini-batch steps to determine the direction to go.
• This yields faster convergence and reduced oscillation.
SGD with Momentum
• Remember gradients from past time steps

v_t = \gamma v_{t-1} + \eta g_t
(\gamma v_{t-1}: momentum conserved from the previous step; \eta g_t: contribution of the current gradient)

• Intuition: Prevent instability resulting from sudden changes

\theta_{t+1} = \theta_t - v_t
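A minimal NumPy sketch of SGD with momentum following the two equations above; the quadratic objective, γ = 0.9, and the learning rate are illustrative assumptions.

```python
import numpy as np

# SGD with momentum: v_t = gamma * v_{t-1} + eta * g_t;  theta_{t+1} = theta_t - v_t
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)

theta, v = np.zeros(3), np.zeros(3)           # parameters and accumulated momentum
eta, gamma = 0.05, 0.9                        # learning rate and momentum coefficient
for step in range(200):
    g = X.T @ (X @ theta - y) / len(y)        # full-batch gradient, for simplicity
    v = gamma * v + eta * g                   # conserve past momentum, add current gradient
    theta = theta - v
print(theta)
```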
Adagrad
• It adapts the learning rate to the parameters, performing smaller updates
(i.e. low learning rates) for parameters associated with frequently occurring
features, and larger updates (i.e. high learning rates) for parameters associated
with infrequent features.
• For this reason, it is well-suited for dealing with sparse data.
• G_t is the accumulation of the previous squared gradient values.

G_t = G_{t-1} + g_t \odot g_t        (g_t \odot g_t: squared current gradient)

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \, g_t        (\epsilon: small constant)
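A minimal sketch of the Adagrad update above on the same illustrative least-squares setup (learning rate and step count are assumptions):

```python
import numpy as np

# Adagrad: per-parameter rates scaled by the accumulated squared gradients G_t.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)

theta, G = np.zeros(3), np.zeros(3)           # parameters and accumulated squared gradients
eta, eps = 0.5, 1e-8                          # learning rate and small constant
for step in range(500):
    g = X.T @ (X @ theta - y) / len(y)
    G = G + g * g                             # G_t = G_{t-1} + g_t * g_t (element-wise)
    theta = theta - eta / np.sqrt(G + eps) * g
print(theta)
```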
RMSProp
• Instead of inefficiently storing all previous squared gradients, the running sum is recursively defined as a decaying average of all past squared gradients (a rolling average).
• Resolves Adagrad's radically diminishing learning rates.
• Best choice for RNNs…?

E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, g_t
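A minimal sketch of the RMSProp update, with an assumed decay rate γ = 0.9 and illustrative data and learning rate:

```python
import numpy as np

# RMSProp: a decaying average E[g^2]_t replaces Adagrad's ever-growing sum.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)

theta, Eg2 = np.zeros(3), np.zeros(3)         # parameters and rolling average of g^2
eta, gamma, eps = 0.01, 0.9, 1e-8
for step in range(1000):
    g = X.T @ (X @ theta - y) / len(y)
    Eg2 = gamma * Eg2 + (1 - gamma) * g * g   # decaying average of squared gradients
    theta = theta - eta / np.sqrt(Eg2 + eps) * g
print(theta)
```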
Adam
• Most standard optimization option in NLP and beyond
• First moment + second moment (momentum + RMSProp)

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t

where \hat{m}_t = m_t / (1 - \beta_1^t) and \hat{v}_t = v_t / (1 - \beta_2^t) are the bias-corrected moment estimates.
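A minimal sketch of the Adam update with the commonly used defaults (β1 = 0.9, β2 = 0.999, ε = 1e-8); the data and learning rate are illustrative.

```python
import numpy as np

# Adam: first-moment (momentum) and second-moment (RMSProp-style) estimates with bias correction.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)

theta = np.zeros(3)
m, v = np.zeros(3), np.zeros(3)               # first and second moment estimates
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 1001):
    g = X.T @ (X @ theta - y) / len(y)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
print(theta)
```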
Miscellaneous

Adam is the best?
• Issue of non-convergence

Missing Global-Optima
• The solutions found by adaptive methods generalize worse (often significantly
worse) than SGD, even when these solutions have better training performance.
These results suggest that practitioners should reconsider the use of adaptive
methods to train neural networks.
Adam + SGD
• Earlier period: Adam for fast convergence
• Later period: SGD for gradually seeking the global optimum (a sketch of the switch follows below)
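A minimal sketch of this two-phase recipe, assuming PyTorch (not used in the slides); the model, dummy data, and the epoch split are hypothetical placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                      # hypothetical placeholder model
loss_fn = nn.MSELoss()
adam_epochs, sgd_epochs = 30, 10              # hypothetical split of the training budget

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(adam_epochs + sgd_epochs):
    if epoch == adam_epochs:                  # hand over to plain SGD for the last phase
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    x, y = torch.randn(32, 10), torch.randn(32, 1)   # dummy mini-batch
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```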
Back to the Data

References
• http://ruder.io/optimizing-gradient-descent/
• http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/Gradient%20Descent%20(v2).pdf
• http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/DNN%20tip.pdf
