
Optimizer

Stochastic gradient descent is an optimization algorithm for training neural networks. It uses random samples from the training data instead of the full dataset to compute parameter updates. Momentum and Nesterov momentum are variants that add momentum terms to reduce oscillation and speed up convergence. AdaGrad, RMSProp, and Adam are adaptive learning rate methods where the learning rate is adjusted for each parameter based on recent gradient information to speed up learning for sparse data and non-stationary objectives.


Stochastic Gradient Descent

Compute gradient estimate:
    g ← (1/m) Σᵢ ∇θ L(f(x⁽ⁱ⁾; θ), y⁽ⁱ⁾)
Apply update:
    θ ← θ − εg
Note: reduced dependency on the full dataset.
Challenge: non-convex problem, slow convergence.
Uses randomly drawn samples, or random mini-batches, instead of the complete dataset at each update.
Depends only on the local gradient.
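The update above can be written as a minimal NumPy sketch; grad_fn below is a hypothetical callable that returns the gradient estimate g for a random mini-batch (already averaged over the m samples), and all names and values are illustrative rather than taken from the source.

    import numpy as np

    def sgd_step(theta, grad_fn, lr=0.01):
        """One SGD step: theta <- theta - lr * g, using a mini-batch gradient estimate."""
        g = grad_fn(theta)        # gradient estimate from a random mini-batch
        return theta - lr * g     # move against the local gradient

    # Illustrative usage: minimize f(theta) = ||theta||^2 with noisy gradients.
    rng = np.random.default_rng(0)
    theta = rng.normal(size=3)
    noisy_grad = lambda th: 2.0 * th + 0.1 * rng.normal(size=th.shape)
    for _ in range(200):
        theta = sgd_step(theta, noisy_grad, lr=0.1)
    print(theta)  # values close to the minimum at zero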

Momentum
Compute gradient estimate:
    g ← (1/m) Σᵢ ∇θ L(f(x⁽ⁱ⁾; θ), y⁽ⁱ⁾)
Compute velocity update:
    v ← αv − εg
Apply update:
    θ ← θ + v
Note: faster convergence, reduced oscillation.
Challenge: blindly follows slopes.
Faster near minima, avoids slow convergence.
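A minimal NumPy sketch of the momentum update, assuming the same hypothetical grad_fn as in the SGD sketch; the velocity v is optimizer state carried across steps and should start at zero.

    import numpy as np

    def momentum_step(theta, v, grad_fn, lr=0.01, alpha=0.9):
        """Momentum step: the velocity accumulates past gradients, reducing oscillation."""
        g = grad_fn(theta)        # gradient estimate on a mini-batch
        v = alpha * v - lr * g    # compute velocity update
        return theta + v, v       # apply update; return new parameters and velocity

    # Initialize the state once, e.g. v = np.zeros_like(theta), before the training loop.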

Nesterov momentum
Compute interim update:
    θ̃ ← θ + αv
Compute gradient (at the interim point):
    g ← (1/m) Σᵢ ∇θ̃ L(f(x⁽ⁱ⁾; θ̃), y⁽ⁱ⁾)
Compute velocity update:
    v ← αv − εg
Apply update:
    θ ← θ + v
Note: faster convergence; knows where it is going.
Challenge: not adaptive.
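The look-ahead can be sketched in the same style; the only change from the momentum sketch is that the gradient is evaluated at the interim point. Names remain illustrative.

    import numpy as np

    def nesterov_step(theta, v, grad_fn, lr=0.01, alpha=0.9):
        """Nesterov momentum: evaluate the gradient at the interim (look-ahead) point."""
        interim = theta + alpha * v   # compute interim update
        g = grad_fn(interim)          # gradient at the interim point
        v = alpha * v - lr * g        # compute velocity update
        return theta + v, v           # apply update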

AdaGrad
Compute gradient estimate:
    g ← (1/m) Σᵢ ∇θ L(f(x⁽ⁱ⁾; θ), y⁽ⁱ⁾)
Accumulate squared gradient:
    r ← r + g ⊙ g
Compute parameter update (division and square root applied element-wise):
    Δθ ← −(ε / (δ + √r)) ⊙ g
Apply update:
    θ ← θ + Δθ
Note: adaptive.
Challenge: the accumulator keeps growing, so the learning rate keeps shrinking.
Learning rate is adaptive and slows down near minima.
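A minimal NumPy sketch of AdaGrad under the same assumptions; r accumulates squared gradients across steps and must be carried between calls, starting at zero. Names and defaults are illustrative.

    import numpy as np

    def adagrad_step(theta, r, grad_fn, lr=0.01, delta=1e-7):
        """AdaGrad step: per-parameter learning rates shrink as squared gradients accumulate."""
        g = grad_fn(theta)                            # gradient estimate on a mini-batch
        r = r + g * g                                 # accumulate squared gradient
        dtheta = -(lr / (delta + np.sqrt(r))) * g     # element-wise scaled update
        return theta + dtheta, r                      # apply update; return new accumulator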

RMSProp
Compute gradient estimate:
    g ← (1/m) Σᵢ ∇θ L(f(x⁽ⁱ⁾; θ), y⁽ⁱ⁾)
Accumulate squared gradient:
    r ← ρr + (1 − ρ) g ⊙ g
Compute parameter update (division and square root applied element-wise):
    Δθ ← −(ε / (δ + √r)) ⊙ g
Apply update:
    θ ← θ + Δθ
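RMSProp differs from AdaGrad only in how the squared gradients are accumulated, which the sketch below shows; names and default hyperparameters are illustrative, not from the source.

    import numpy as np

    def rmsprop_step(theta, r, grad_fn, lr=0.001, rho=0.9, delta=1e-6):
        """RMSProp step: exponentially decaying average of squared gradients."""
        g = grad_fn(theta)                            # gradient estimate on a mini-batch
        r = rho * r + (1.0 - rho) * g * g             # decaying accumulation of squared gradient
        dtheta = -(lr / (delta + np.sqrt(r))) * g     # element-wise scaled update
        return theta + dtheta, r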

RMSProp with Nesterov momentum


Compute interim update:
    θ̃ ← θ + αv
Compute gradient (at the interim point):
    g ← (1/m) Σᵢ ∇θ̃ L(f(x⁽ⁱ⁾; θ̃), y⁽ⁱ⁾)
Accumulate squared gradient:
    r ← ρr + (1 − ρ) g ⊙ g
Compute velocity update:
    v ← αv − (ε / (δ + √r)) ⊙ g
Apply update:
    θ ← θ + v
Uses two knobs, the momentum coefficient α and the decay rate ρ, to adapt learning.
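Combining the two previous sketches gives this variant: a look-ahead gradient plus the decaying squared-gradient scaling. Both state variables, v and r, are carried across steps starting from zero; all names and defaults remain illustrative.

    import numpy as np

    def rmsprop_nesterov_step(theta, v, r, grad_fn,
                              lr=0.001, alpha=0.9, rho=0.9, delta=1e-6):
        """RMSProp with Nesterov momentum: look-ahead gradient, scaled by accumulated squares."""
        interim = theta + alpha * v                        # compute interim update
        g = grad_fn(interim)                               # gradient at the interim point
        r = rho * r + (1.0 - rho) * g * g                  # accumulate squared gradient
        v = alpha * v - (lr / (delta + np.sqrt(r))) * g    # compute velocity update
        return theta + v, v, r                             # apply update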

Adam
Compute gradient estimate:
    g ← (1/m) Σᵢ ∇θ L(f(x⁽ⁱ⁾; θ), y⁽ⁱ⁾)
Update time step:
    t ← t + 1
Update biased first moment estimate:
    s ← ρ₁s + (1 − ρ₁) g
Update biased second moment estimate:
    r ← ρ₂r + (1 − ρ₂) g ⊙ g
Correct bias in first moment:
    ŝ ← s / (1 − ρ₁ᵗ)
Correct bias in second moment:
    r̂ ← r / (1 − ρ₂ᵗ)
Compute parameter update (division and square root applied element-wise):
    Δθ ← −(ε / (δ + √r̂)) ⊙ ŝ
Apply update:
    θ ← θ + Δθ
Note: adaptive.
Uses the same rule for every step, with no special case for initialization.
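A minimal NumPy sketch of the Adam step under the same assumptions; s, r, and t are the optimizer state carried across calls (starting at zero), and the default hyperparameters are common illustrative choices rather than values from the source.

    import numpy as np

    def adam_step(theta, s, r, t, grad_fn,
                  lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
        """Adam step: bias-corrected first and second moment estimates of the gradient."""
        g = grad_fn(theta)                                   # gradient estimate on a mini-batch
        t = t + 1                                            # update time step
        s = rho1 * s + (1.0 - rho1) * g                      # biased first moment estimate
        r = rho2 * r + (1.0 - rho2) * g * g                  # biased second moment estimate
        s_hat = s / (1.0 - rho1 ** t)                        # bias-corrected first moment
        r_hat = r / (1.0 - rho2 ** t)                        # bias-corrected second moment
        dtheta = -(lr / (delta + np.sqrt(r_hat))) * s_hat    # element-wise scaled update
        return theta + dtheta, s, r, t                       # apply update; return new state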

Ref. Book: Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville, Sections 8.3 & 8.5.
