Unit II
Unit 2.2
Challenges with Gradient Descent
Local Minima In Error Surface
□ The primary challenge in optimizing deep
learning models is that we are forced to
use minimal local information to infer the
global structure of the error surface.
■ E.g. Let’s assume you’re an ant on the
continental United States. You’re dropped
randomly on the map, and your goal is to
find the lowest point on this surface.
■ How do you do it?
Local Minima In Error Surface
[Figure: a ball rolling over the error surface, with its velocity, acceleration, and friction labelled]
Momentum
https://distill.pub/2017/momentum/
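The update rule behind these pictures can be sketched in a few lines. The code below is a minimal illustration, not from the slides; the names grad, eta (learning rate), and gamma (momentum coefficient) are assumed for the sketch.

```python
import numpy as np

def momentum_gd(grad, w0, eta=0.1, gamma=0.9, steps=100):
    """Momentum-based gradient descent (illustrative sketch).

    update_t = gamma * update_{t-1} + eta * grad(w_t)
    w_{t+1}  = w_t - update_t
    """
    w = np.asarray(w0, dtype=float)
    update = np.zeros_like(w)
    for _ in range(steps):
        update = gamma * update + eta * grad(w)  # exponentially decaying sum of past gradients
        w = w - update                           # move by the accumulated update
    return w

# Example: minimize f(w) = w^2 (gradient 2w); w is driven towards the minimum at 0.
print(momentum_gd(lambda w: 2.0 * w, w0=5.0))
```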
Observations
□ Even in regions with gentle slopes, momentum-based gradient descent is able to take large steps because the momentum carries it along.
□ Is moving fast always good? Could there be a situation where momentum causes us to run past our goal? (See the sketch below.)
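To make the second question concrete, here is a small illustrative run (an assumed setup, not from the slides) on the one-dimensional function f(w) = w^2, comparing plain gradient descent (gamma = 0) with momentum (gamma = 0.9):

```python
def run(gamma, w=2.0, eta=0.3, steps=6):
    update, path = 0.0, [w]
    for _ in range(steps):
        update = gamma * update + eta * (2.0 * w)  # gradient of w^2 is 2w
        w = w - update
        path.append(round(w, 3))
    return path

print("vanilla :", run(gamma=0.0))  # approaches the minimum at 0 monotonically
print("momentum:", run(gamma=0.9))  # overshoots 0 and oscillates around it before settling
```

With momentum the iterate shoots past the minimum and oscillates around it, which is exactly the behaviour that the next method tries to dampen.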
Nesterov Accelerated Gradient Descent
□ Can we do something to reduce these oscillations?
□ Intuition
■ Look before you leap.
■ Recall that the momentum update is w_{t+1} = w_t − γ · update_{t−1} − η · ∇w_t, so irrespective of the gradient at w_t we are in any case going to move by at least γ · update_{t−1}.
■ Why not calculate the gradient at this partially updated value of w, i.e. at w_lookahead = w_t − γ · update_{t−1}, instead of at w_t?
Nesterov Accelerated Gradient Descent
□ Momentum first computes
the current gradient (small
blue vector)
□ then takes a big jump in the
direction of the updated
accumulated gradient (big
blue vector)
□ NAG first makes a big jump
in the direction of the
previous accumulated
gradient (brown vector)
□ then measures the gradient at that point
□ and makes a correction (red vector), which results in the complete NAG update (green vector); a code sketch follows below.
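A minimal sketch of this "jump first, then correct" rule, using the same illustrative names as before (grad, eta, gamma are assumptions for the example). The only change from momentum is that the gradient is evaluated at the look-ahead point w_t − γ · update_{t−1} rather than at w_t.

```python
import numpy as np

def nag(grad, w0, eta=0.1, gamma=0.9, steps=100):
    """Nesterov Accelerated Gradient descent (illustrative sketch)."""
    w = np.asarray(w0, dtype=float)
    update = np.zeros_like(w)
    for _ in range(steps):
        w_lookahead = w - gamma * update                   # big jump along the accumulated history
        update = gamma * update + eta * grad(w_lookahead)  # correction uses the gradient at the look-ahead point
        w = w - update                                     # complete NAG update
    return w

# Example: minimize f(w) = w^2 (gradient 2w).
print(nag(lambda w: 2.0 * w, w0=5.0))
```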
Nesterov Accelerated Gradient Descent
https://emiliendupont.github.io/2018/01/24/optimization-visualization/
□ Exponentially Weighted Averages of past gradients
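The bullet above refers to maintaining an exponentially weighted (decaying) average of past values, the quantity that momentum-style methods accumulate over past gradients. A minimal sketch, with an assumed decay factor beta:

```python
def exp_weighted_average(values, beta=0.9):
    """v_t = beta * v_{t-1} + (1 - beta) * x_t: recent values dominate,
    older values are down-weighted geometrically by beta."""
    v, averaged = 0.0, []
    for x in values:
        v = beta * v + (1.0 - beta) * x
        averaged.append(round(v, 4))
    return averaged

# Each output is a smoothed version of the inputs seen so far.
print(exp_weighted_average([1.0, 1.0, 1.0, 10.0, 1.0]))
```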