
Optimizers

RMSProp, GD with Momentum, ADAM

© Nisheeth Joshi
RMSProp
• Root Mean Square Propagation
• Works on the principle of decay: it keeps an exponentially decaying average of the squared gradients
• Chooses a different effective learning rate for each parameter (this per-parameter adaptation is what sets it apart from plain gradient descent)
• Proposed by Geoffrey Hinton in 2012 during his Coursera lecture

Key Takeaway
• Rather than a single fixed learning rate (e.g., 0.01), each trainable parameter gets its own effective learning rate derived from a decaying average of its gradients
• This average is updated iteratively as a running average of the magnitude (square) of the gradients
• As a result, the weight change is not simply a step along the gradient: the gradient is rescaled element-wise by the running-average vector $v_t$
• This speeds up convergence

• Updates are larger in some directions and smaller in others

On iteration t:
• Compute $\delta w$ and $\delta b$ on the current GD step
• Keep an exponentially weighted average of the squared changes in w and b:

$S_{\delta w_t} = \beta\, S_{\delta w_{t-1}} + (1-\beta)\, \delta w^2$

$S_{\delta b_t} = \beta\, S_{\delta b_{t-1}} + (1-\beta)\, \delta b^2$

• Update the parameters, scaling each gradient by the root of its running average:

$w_{new} = w_{old} - \alpha\, \dfrac{\delta w}{\sqrt{S_{\delta w_t}} + \epsilon} \qquad b_{new} = b_{old} - \alpha\, \dfrac{\delta b}{\sqrt{S_{\delta b_t}} + \epsilon}$
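A minimal NumPy sketch of this update rule is shown below; the function name rmsprop_step and its argument layout are illustrative, not taken from the slides.

```python
import numpy as np

def rmsprop_step(w, b, dw, db, s_dw, s_db, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update for a weight array w and a bias array b.

    s_dw and s_db carry the exponentially weighted averages of the
    squared gradients from the previous iteration.
    """
    # Exponentially weighted average of the squared gradients (decay factor beta)
    s_dw = beta * s_dw + (1 - beta) * dw ** 2
    s_db = beta * s_db + (1 - beta) * db ** 2

    # Each gradient component is divided by the root of its own running average,
    # so every parameter effectively gets its own learning rate
    w = w - alpha * dw / (np.sqrt(s_dw) + eps)
    b = b - alpha * db / (np.sqrt(s_db) + eps)
    return w, b, s_dw, s_db
```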
RMSProp Takeaway
• RMSProp is very similar to Adagrad insofar as both use the
square of the gradient to scale coefficients.
• RMSProp shares with momentum the leaky averaging.
However, RMSProp uses the technique to adjust the coefficient-
wise preconditioner.
• The learning rate needs to be scheduled by the experimenter in
practice.
• The coefficient 𝛽 determines how long the history is when
adjusting the per-coordinate scale.

GD with Momentum
• BGD/MBGD/SGD take a long time to reach the global minimum,
• because the gradient steps are noisy even when they are correct on average

• Momentum is a technique to improve the convergence of gradient descent

[Figure: a small feed-forward network example with labelled weights and biases b1–b3, node values 60, 80 and 5, hidden activations g1 = 0.37 and g2 = 0.047, and output 24.95; its parameters are updated on the next slide.]
Vanilla gradient descent applies the same rule to every trainable parameter of the example network:

$W_i^+ = W_i - \alpha\, \dfrac{\partial\, cost}{\partial W_i}, \quad i = 1, \dots, 8$

$b_j^+ = b_j - \alpha\, \dfrac{\partial\, cost}{\partial b_j}, \quad j = 1, 2, 3$
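As a sketch, the whole slide collapses to one loop over the trainable parameters; the gd_step name and the dictionary layout below are illustrative.

```python
def gd_step(params, grads, alpha=0.01):
    """Vanilla gradient descent: apply the same rule to every parameter.

    params and grads are dicts keyed by parameter name, e.g. 'W1'...'W8'
    and 'b1'...'b3' from the example network; alpha is the learning rate.
    """
    return {name: value - alpha * grads[name] for name, value in params.items()}
```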
With momentum, the previous update is carried over into the new step:

$W_7^+ = W_7 - \alpha\, \dfrac{\partial\, cost}{\partial W_7} + \beta\, \Delta W_{7_{t-1}}$

where the gradient term plays the role of the acceleration and $\Delta W_{7_{t-1}}$, the previous update, plays the role of the velocity.
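A minimal sketch of this heavy-ball update for a single parameter; the momentum_step name and the prev_update state variable are illustrative.

```python
def momentum_step(w, dw, prev_update, alpha=0.01, beta=0.9):
    """Gradient descent with momentum (heavy-ball form) for one parameter w.

    prev_update is the previous step (the beta-weighted 'velocity' term),
    while the fresh gradient dw acts as the 'acceleration'.
    """
    update = -alpha * dw + beta * prev_update  # acceleration + velocity
    w = w + update
    return w, update  # the returned update feeds the next iteration
```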
Adam Optimizer
• One of the key components of Adam is that it uses exponential
weighted moving averages (also known as leaky averaging) to obtain
an estimate of both the momentum and also the second moment of
the gradient.
• That is, it combines the Gradient Descent with Momentum and RMSProp optimizers

Adam Optimization
Initialize:

$V_{\delta w} = 0, \quad S_{\delta w} = 0, \quad V_{\delta b} = 0, \quad S_{\delta b} = 0$

On iteration t, compute $\delta w$ and $\delta b$ using the current minibatch.

Momentum-like update with $\beta_1$:

$V_{\delta w_t} = \beta_1\, V_{\delta w_{t-1}} + (1-\beta_1)\, \delta w \qquad V_{\delta b_t} = \beta_1\, V_{\delta b_{t-1}} + (1-\beta_1)\, \delta b$

RMSProp-like update with $\beta_2$:

$S_{\delta w_t} = \beta_2\, S_{\delta w_{t-1}} + (1-\beta_2)\, \delta w^2 \qquad S_{\delta b_t} = \beta_2\, S_{\delta b_{t-1}} + (1-\beta_2)\, \delta b^2$
In a typical update of Adam we perform bias corrections:

$V_{\delta w}^{corrected} = \dfrac{V_{\delta w_t}}{1-\beta_1^t} \qquad V_{\delta b}^{corrected} = \dfrac{V_{\delta b_t}}{1-\beta_1^t}$

$S_{\delta w}^{corrected} = \dfrac{S_{\delta w_t}}{1-\beta_2^t} \qquad S_{\delta b}^{corrected} = \dfrac{S_{\delta b_t}}{1-\beta_2^t}$
Finally, a weight and bias update is performed:

$w^+ = w - \alpha\, \dfrac{V_{\delta w}^{corrected}}{\sqrt{S_{\delta w}^{corrected}} + \epsilon} \qquad b^+ = b - \alpha\, \dfrac{V_{\delta b}^{corrected}}{\sqrt{S_{\delta b}^{corrected}} + \epsilon}$
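Putting the three preceding slides together, here is a minimal NumPy sketch of a single Adam step for one parameter; adam_step and its state layout are illustrative, and the bias b would use the identical rule.

```python
import numpy as np

def adam_step(w, dw, v_dw, s_dw, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter w at iteration t (t starts at 1).

    v_dw is the momentum-like first-moment estimate, s_dw the
    RMSProp-like second-moment estimate; both are initialized to 0.
    """
    # Momentum-like update with beta1
    v_dw = beta1 * v_dw + (1 - beta1) * dw
    # RMSProp-like update with beta2
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2

    # Bias corrections (counteract the zero initialization of the moments)
    v_corr = v_dw / (1 - beta1 ** t)
    s_corr = s_dw / (1 - beta2 ** t)

    # Final parameter update
    w = w - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return w, v_dw, s_dw
```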
Hyperparameter Choice
• Learning rate α needs to be tuned.
• Common choice for β1 = 0.9 (momentum term)
• Common choice for β2 = 0.999 (RMSProp term)
• ε = 10⁻⁸

• ADAM – ADAptive Moment Estimation
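For example, the adam_step sketch above can be driven with the common β1, β2 and ε defaults and a hand-tuned learning rate on a toy one-dimensional cost; the toy cost and the loop below are illustrative.

```python
# Toy cost: cost(w) = (w - 3)**2, so d(cost)/dw = 2 * (w - 3)
w, v_dw, s_dw = 0.0, 0.0, 0.0            # parameter and moment estimates
for t in range(1, 501):                  # t must start at 1 for the bias correction
    dw = 2 * (w - 3)                     # gradient of the toy cost
    w, v_dw, s_dw = adam_step(w, dw, v_dw, s_dw, t,
                              alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8)
print(w)                                 # approaches the minimum at w = 3
```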

Adam – Key Takeaway
• Reviewing the design of Adam, its inspiration is clear.
• Momentum and scale are clearly visible in the state variables.
• Their rather peculiar definition forces us to debias terms (this
could be fixed by a slightly different initialization and update
condition).
• The combination of both terms is pretty straightforward, given
RMSProp.
• The explicit learning rate α allows us to control the step length to address issues of convergence.
