Lecture 8 Gradient Descent For Non-Convex Functions
Amit Sethi
Electrical Engineering, IIT Bombay
Learning outcomes for the lecture
[Figures: example surface plots of non-convex functions]
Image sources: http://math.etsu.edu/multicalc/prealpha/Chap2/Chap2-8/10-6-53.gif and http://pundit.pratt.duke.edu/piki/images/thumb/0/0a/SurfExp04.png/400px-SurfExp04.png
Eigenvalues of Hessian at critical points
• All eigenvalues of the Hessian positive: local minimum
• All eigenvalues negative: local maximum
• Eigenvalues of mixed sign: saddle point
[Figure: surfaces illustrating a local minimum, a saddle point, and the global minimum; source: http://i.stack.imgur.com/NsI2J.png]
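A quick numerical check of this classification (my own sketch, not from the lecture), using the toy function f(x, y) = x² − y², whose critical point at the origin is a saddle:

```python
import numpy as np

# f(x, y) = x**2 - y**2 has a critical point at the origin.
# Its Hessian there is the constant matrix [[2, 0], [0, -2]].
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)
print("Hessian eigenvalues:", eigvals)   # [-2.  2.]

# Classify the critical point from the signs of the eigenvalues.
if np.all(eigvals > 0):
    print("local minimum")
elif np.all(eigvals < 0):
    print("local maximum")
else:
    print("saddle point")                # mixed signs: this branch is taken
```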
A realistic picture
[Figure: a realistic non-convex loss surface with many local minima and local maxima]
“Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”, Dauphin et al., NIPS 2014
GD vs. Newton’s method
• Gradient descent is based on the first-order approximation f(θ* + Δθ) ≈ f(θ*) + ∇f^T Δθ, which gives the update Δθ = −η ∇f
• Newton’s method uses the second-order approximation f(θ* + Δθ) ≈ f(θ*) + ∇f^T Δθ + ½ Δθ^T H Δθ, which gives the update Δθ = −H⁻¹ ∇f
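A minimal sketch (my own illustration, assuming the simple quadratic f(θ) = ½ θᵀAθ) contrasting one gradient-descent step with one Newton step; on a quadratic, the Newton step lands exactly at the minimum:

```python
import numpy as np

# Quadratic test function f(theta) = 0.5 * theta^T A theta, with A positive definite.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])

def grad(theta):
    return A @ theta      # gradient of the quadratic

def hess(theta):
    return A              # Hessian is constant for a quadratic

theta = np.array([2.0, -1.0])
eta = 0.1

gd_step = theta - eta * grad(theta)                               # Delta_theta = -eta * grad
newton_step = theta - np.linalg.solve(hess(theta), grad(theta))   # Delta_theta = -H^{-1} grad

print("after one GD step:    ", gd_step)
print("after one Newton step:", newton_step)   # exactly the minimizer (the origin)
```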
Disadvantages of 2nd order methods
• Updates require O(d³) computation (to invert or factor the Hessian), or at least O(d²) (to store and multiply it), per step
• SGD (and its momentum variants) is far cheaper: w_{t+1} = w_t − η g(w_t), with the gradient g computed on a random subset (mini-batch) of the training data
[Figure: behaviour of the optimizers over training iterations]
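A minimal mini-batch SGD sketch (my own illustration; the data, model, and hyperparameters below are placeholders, assuming a linear least-squares loss):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data standing in for a real dataset.
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)

def grad(w, idx):
    """Mini-batch gradient of the mean squared error on the random subset idx."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

w = np.zeros(5)
eta = 0.1            # learning rate (assumed)
batch_size = 32      # size of the random subset (assumed)

for t in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)   # draw a random subset
    w = w - eta * grad(w, idx)                                  # w_{t+1} = w_t - eta * g(w_t)

print("learned weights:", w)
```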
Classical and Nesterov Momentum
• GD: w_{t+1} = w_t − η g(w_t)
• Classical momentum:
  v_{t+1} = α v_t − η g(w_t)
  w_{t+1} = w_t + v_{t+1}
• Nesterov momentum:
  v_{t+1} = α v_t − η g(w_t + α v_t)
  w_{t+1} = w_t + v_{t+1}
• Better course-correction for bad velocity
[Figure: vector diagrams of the classical and Nesterov momentum updates]
Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, 1983
Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013
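Both momentum variants in code (my own sketch on a toy quadratic; g, η and α are assumed placeholder choices):

```python
import numpy as np

# Toy objective f(w) = 0.5 * w^T A w, so its gradient is g(w) = A w.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
g = lambda w: A @ w

eta, alpha = 0.1, 0.9             # learning rate and momentum coefficient (assumed)
w_cm = np.array([2.0, -1.0])      # classical-momentum iterate
v_cm = np.zeros(2)
w_nag = np.array([2.0, -1.0])     # Nesterov-momentum iterate
v_nag = np.zeros(2)

for t in range(100):
    # Classical momentum: gradient evaluated at the current point w_t.
    v_cm = alpha * v_cm - eta * g(w_cm)
    w_cm = w_cm + v_cm

    # Nesterov momentum: gradient evaluated at the look-ahead point w_t + alpha * v_t.
    v_nag = alpha * v_nag - eta * g(w_nag + alpha * v_nag)
    w_nag = w_nag + v_nag

print("classical momentum:", w_cm)
print("Nesterov momentum: ", w_nag)
```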
AdaGrad, RMSProp, AdaDelta
• Scales the gradient by a running norm of all the previous gradients
• Per-dimension update:
  w_{t+1} = w_t − η g(w_t) / √( Σ_{i=1}^{t} g(w_i)² + ε )
• Automatically reduces learning rate with t
• Parameters with small gradients get relatively larger effective steps (they speed up)
• RMSProp and AdaDelta use a forgetting factor in the accumulated squared gradients so that the updates do not become too small
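A per-dimension sketch of AdaGrad and RMSProp side by side (my own illustration; η, ε and the forgetting factor ρ are assumed values):

```python
import numpy as np

# Toy objective f(w) = 0.5 * w^T A w with very different curvature per dimension.
A = np.diag([10.0, 0.1])
g = lambda w: A @ w

eta, eps, rho = 0.1, 1e-8, 0.9    # step size, stabilizer, forgetting factor (assumed)

# --- AdaGrad: accumulate all past squared gradients ---
w, acc = np.array([1.0, 1.0]), np.zeros(2)
for t in range(200):
    grad = g(w)
    acc += grad ** 2                         # running sum of squared gradients
    w = w - eta * grad / np.sqrt(acc + eps)  # per-dimension scaling of the step
print("AdaGrad:", w)

# --- RMSProp: exponential moving average of squared gradients ---
w, acc = np.array([1.0, 1.0]), np.zeros(2)
for t in range(200):
    grad = g(w)
    acc = rho * acc + (1 - rho) * grad ** 2  # forgetting factor keeps acc from growing forever
    w = w - eta * grad / np.sqrt(acc + eps)
print("RMSProp:", w)
```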
Adam optimizer combines AdaGrad and momentum
• Initialize
  • m_0 = 0
  • v_0 = 0
• Loop over t
  • g_t = ∇_w f_t(w_{t−1})  (get gradient)
  • m_t = β_1 m_{t−1} + (1 − β_1) g_t  (update first moment, biased)
Source: http://ruder.io/optimizing-gradient-descent/index.html
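The slide text is cut off after the first-moment update; the sketch below fills in the remaining steps of the standard Adam recipe (second moment, bias correction, parameter update) as my own illustration, with assumed hyperparameter values:

```python
import numpy as np

# Toy objective f(w) = 0.5 * w^T A w, so the gradient is A w.
A = np.diag([10.0, 0.1])
g = lambda w: A @ w

eta = 0.1                                 # step size chosen for this toy problem
beta1, beta2, eps = 0.9, 0.999, 1e-8      # Adam's usual moment/stability constants

w = np.array([1.0, 1.0])
m = np.zeros(2)   # first moment (momentum-like average of gradients)
v = np.zeros(2)   # second moment (AdaGrad/RMSProp-like average of squared gradients)

for t in range(1, 501):
    grad = g(w)                                    # g_t
    m = beta1 * m + (1 - beta1) * grad             # update first moment (biased)
    v = beta2 * v + (1 - beta2) * grad ** 2        # update second moment (biased)
    m_hat = m / (1 - beta1 ** t)                   # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)   # parameter update

print("Adam result:", w)
```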