Lecture 8 Gradient Descent For Non-Convex Functions

This document discusses advanced machine learning techniques for gradient descent on non-convex functions. It characterizes non-convex loss surfaces and describes issues with techniques like gradient descent and Newton's method. It then explains how stochastic gradient descent, momentum methods, and adaptive learning rate optimization techniques like Adam can help address these issues and help machine learning models escape saddle points and local minima to find better solutions.


Advanced Machine Learning

Gradient Descent for Non-Convex Functions

Amit Sethi
Electrical Engineering, IIT Bombay
Learning outcomes for the lecture

• Characterize non-convex loss surfaces with the Hessian

• List issues with non-convex surfaces

• Explain how certain optimization techniques help solve these issues
Contents
• Characterizing non-convex loss surfaces
• Issues with gradient descent
• Issues with Newton’s method
• Stochastic gradient descent to the rescue
• Momentum and its variants
• Saddle-free Newton
Why do we not get stuck in bad local minima?
• Local minima are close to global minima in terms of error

• Saddle points are much more likely than local minima in the high-error regions of the error surface (in high-dimensional weight space); see the sketch after this list

• SGD (and other techniques) allow you to escape the saddle points
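The following is not from the slides; it is a minimal NumPy sketch under the toy assumption that the Hessian at a random critical point behaves like a random symmetric matrix. It illustrates why "all eigenvalues positive" (a local minimum) becomes vanishingly rare as the dimension grows, so most critical points in high dimensions are saddles.

```python
import numpy as np

# Toy illustration (not from the slides): model the Hessian at a random
# critical point as a random symmetric matrix and check how often it is
# positive definite (i.e., the critical point is a local minimum).
rng = np.random.default_rng(0)

def fraction_minima(d, trials=2000):
    count = 0
    for _ in range(trials):
        a = rng.standard_normal((d, d))
        h = (a + a.T) / 2.0                      # symmetric "Hessian"
        if np.all(np.linalg.eigvalsh(h) > 0):    # all eigenvalues positive?
            count += 1
    return count / trials

for d in (2, 5, 10, 20):
    print(d, fraction_minima(d))
# The fraction of all-positive-eigenvalue matrices collapses rapidly with d,
# so most critical points in high dimensions are saddles, not minima.
```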
Error surfaces and saddle points

[Figures: example 3D error surfaces showing saddle points. Sources: http://math.etsu.edu/multicalc/prealpha/Chap2/Chap2-8/10-6-53.gif and http://pundit.pratt.duke.edu/piki/images/thumb/0/0a/SurfExp04.png/400px-SurfExp04.png]
Eigenvalues of Hessian at critical points
[Figure: eigenvalue signatures at different critical points of a loss surface, with labels for local minima, plateau, long furrow, and saddle point. Source: http://i.stack.imgur.com/NsI2J.png]

A realistic picture

[Figure: a realistic neural-network loss landscape with labeled saddle points, global minima, local minima, and local maxima. Image source: https://www.cs.umd.edu/~tomg/projects/landscapes/]


Is achieving a global minimum important?
• The global minimum for the training data may not be the global minimum for the validation or test data

• Local minima are often good enough

“The Loss Surfaces of Multilayer Networks” Choromanska et al. JMLR’15


Under certain assumptions, minima are theoretically also of high quality
• Results:
– Lowest critical values of the random loss form a band
– Probability of minima outside that band diminishes exponentially with the size of the network
– Empirical verification
• Assumptions:
– Fully-connected feed-forward neural network
– Variable independence
– Redundancy in network parametrization
– Uniformity

“The Loss Surfaces of Multilayer Networks” Choromanska et al. JMLR’15


Empirically, most minima are of high quality

“Identifying and attacking the saddle point problem in high-dimensional non-convex optimization” Dauphin et al., NIPS’14
GD vs. Newton’s method
• Gradient descent is based on a first-order approximation:
  $f(\theta^* + \Delta\theta) \approx f(\theta^*) + \nabla f^T \Delta\theta$, with update $\Delta\theta = -\eta\, \nabla f$

• Newton's method is based on a second-order approximation:
  $f(\theta^* + \Delta\theta) \approx f(\theta^*) + \nabla f^T \Delta\theta + \frac{1}{2}\, \Delta\theta^T H\, \Delta\theta$, with update $\Delta\theta = -H^{-1} \nabla f$

• Near a critical point, where $\nabla f = 0$, expressing $\Delta\theta$ in the eigenbasis of $H$ gives
  $f(\theta^* + \Delta\theta) = f(\theta^*) + \frac{1}{2} \sum_{i=1}^{n_\theta} \lambda_i\, \Delta v_i^2$
  where $\lambda_i$ are the eigenvalues of $H$ and $\Delta v_i$ is the component of $\Delta\theta$ along the $i$-th eigenvector

“Identifying and attacking the saddle point problem in high-dimensional non-convex optimization” Dauphin et al., NIPS’14
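Not from the slides: a minimal sketch of the two update rules on a toy 1-D non-convex function (the function and the step sizes are illustrative assumptions). It shows that gradient descent scales the raw gradient by a fixed learning rate, while Newton's method rescales it by the local curvature.

```python
import numpy as np

# Toy 1-D non-convex function f(t) = t**4 - 3*t**2 (illustrative choice).
f      = lambda t: t**4 - 3.0 * t**2
grad_f = lambda t: 4.0 * t**3 - 6.0 * t        # first derivative
hess_f = lambda t: 12.0 * t**2 - 6.0           # second derivative

theta, eta = 2.0, 0.05

gd_update     = theta - eta * grad_f(theta)            # delta = -eta * grad
newton_update = theta - grad_f(theta) / hess_f(theta)  # delta = -H^{-1} * grad

print(gd_update, newton_update)
# GD takes a step proportional to the gradient; Newton divides the gradient
# by the curvature, so its step size adapts to the shape of the surface.
```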
Disadvantages of 2nd order methods
• Updates require O(d³) (to invert the Hessian) or at least O(d²) (to form it) computation per step

• May not work well for non-convex surfaces

• Get attracted to saddle points (how? see the sketch below)

• Not very good for batch-updates
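To see how Newton's method is pulled toward saddle points, here is a small sketch (not from the slides) on the illustrative function f(x, y) = x² − y², which has a saddle at the origin:

```python
import numpy as np

# f(x, y) = x**2 - y**2 has a saddle at (0, 0): curvature +2 along x, -2 along y.
def grad(theta):
    x, y = theta
    return np.array([2.0 * x, -2.0 * y])

H = np.array([[2.0, 0.0],
              [0.0, -2.0]])   # constant Hessian for this quadratic

theta = np.array([1.0, 0.5])
eta = 0.1

gd_next     = theta - eta * grad(theta)               # gradient descent step
newton_next = theta - np.linalg.solve(H, grad(theta)) # Newton step

print("GD:    ", gd_next)      # x shrinks, but y grows: GD escapes along y
print("Newton:", newton_next)  # lands exactly on the saddle at (0, 0)
# Dividing by a negative eigenvalue flips the step along that eigenvector,
# so Newton is attracted to saddle points instead of escaping them.
```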


GD vs. SGD
• GD: $w_{t+1} = w_t - \eta\, g_{\text{all samples}}(w_t)$

• SGD: $w_{t+1} = w_t - \eta\, g_{\text{random mini-batch}}(w_t)$
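A minimal sketch (not from the slides) contrasting the two update rules on an illustrative least-squares problem; the data, learning rate, and batch size are assumptions made for the example.

```python
import numpy as np

# Least-squares toy problem: loss(w) = mean((X @ w - y) ** 2).
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.standard_normal(1000)

def gradient(w, Xb, yb):
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w_gd  = np.zeros(5)
w_sgd = np.zeros(5)
eta, batch_size = 0.1, 32

for t in range(200):
    # GD: gradient over all samples -> expensive per update, but exact
    w_gd -= eta * gradient(w_gd, X, y)

    # SGD: gradient over a random mini-batch -> cheap per update, but noisy
    idx = rng.choice(len(y), size=batch_size, replace=False)
    w_sgd -= eta * gradient(w_sgd, X[idx], y[idx])

print(w_gd)
print(w_sgd)   # close to w_gd, with extra mini-batch noise
```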


Compare GD with SGD
• GD requires more computations per update

• SGD is more noisy


SGD helps by changing the loss surface
• Different mini-batches (or samples) have their own loss surfaces
• The loss surface of the entire training set (shown dotted on the slide) may be different
• A local minimum of one loss surface may not be a local minimum of another
• This helps us escape local minima using stochastic or mini-batch gradient descent (see the sketch below)
• Mini-batch size depends on computational resource utilization
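A quick illustrative check (not from the slides): on a toy 1-D regression problem, the minimizer of the loss computed on one mini-batch differs from the minimizer on another mini-batch and from the full-data minimizer, i.e., each mini-batch defines its own loss surface.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1000)
y = 2.0 * x + 0.5 * rng.standard_normal(1000)   # 1-D linear model y ~ w * x

def best_w(xb, yb):
    # Closed-form minimizer of mean((w * xb - yb) ** 2)
    return np.dot(xb, yb) / np.dot(xb, xb)

idx_a = rng.choice(1000, size=32, replace=False)
idx_b = rng.choice(1000, size=32, replace=False)

print("full data:", best_w(x, y))
print("batch A  :", best_w(x[idx_a], y[idx_a]))
print("batch B  :", best_w(x[idx_b], y[idx_b]))
# Each mini-batch's loss surface has a slightly different minimum.
```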
Noise can be added in other ways to escape saddle points
• Random mini-batches (SGD)

• Add noise to the gradient or the update (see the sketch after this list)

• Add noise to the input
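A minimal sketch of the gradient-noise idea (not from the slides); the noise scale and the placeholder gradient values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sgd_step(w, grad, eta=0.1, sigma=0.01):
    # Add isotropic Gaussian noise to the gradient before taking the step.
    noise = sigma * rng.standard_normal(w.shape)
    return w - eta * (grad + noise)

w = np.array([1.0, 0.5])
g = np.array([2.0, -1.0])      # gradient at w (placeholder values)
w = noisy_sgd_step(w, g)
print(w)
# At an exact saddle the true gradient is zero, but the injected noise still
# perturbs the parameters, helping the iterate drift off the saddle.
```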


Learning rate scheduling
• High learning rates explore faster earlier
– But, they can lead to divergence or high final loss
• Low learning rates fine-tune better later
– But, they can be very slow to converge
• LR scheduling combines advantages of both
– Lots of schedules possible: linear, exponential, square-root, step-wise, cosine

[Figure: schematic of training loss vs. training iterations under different learning-rate schedules]
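As an illustration (not from the slides), here is a sketch of two of the schedules mentioned above; the base learning rate, drop interval, and decay factor are assumed values.

```python
import math

def step_lr(base_lr, step, drop_every=30, factor=0.1):
    """Step-wise decay: multiply the LR by `factor` every `drop_every` epochs."""
    return base_lr * (factor ** (step // drop_every))

def cosine_lr(base_lr, step, total_steps):
    """Cosine decay from base_lr down to 0 over `total_steps`."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

for epoch in range(0, 100, 20):
    print(epoch, step_lr(0.1, epoch), round(cosine_lr(0.1, epoch, 100), 4))
# Both start high (fast exploration) and end low (fine-tuning).
```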
Classical and Nesterov Momentum
• GD: $w_{t+1} = w_t - \eta\, g(w_t)$

• Classical momentum:
  $v_{t+1} = \alpha v_t - \eta\, g(w_t)$;  $w_{t+1} = w_t + v_{t+1}$

• Nesterov momentum:
  $v_{t+1} = \alpha v_t - \eta\, g(w_t + \alpha v_t)$;  $w_{t+1} = w_t + v_{t+1}$

[Diagram: each update shown as the vector sum of the momentum term $\alpha v_t$ and the gradient step]

• Nesterov evaluates the gradient at the look-ahead point $w_t + \alpha v_t$, giving better course-correction for a bad velocity
Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, 1983
Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013
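A minimal sketch of the two momentum updates above (not from the slides); `grad` stands for the gradient function of whatever loss is being minimized, and the quadratic example is an illustrative assumption.

```python
import numpy as np

def classical_momentum_step(w, v, grad, eta=0.1, alpha=0.9):
    v_next = alpha * v - eta * grad(w)               # gradient evaluated at w
    return w + v_next, v_next

def nesterov_momentum_step(w, v, grad, eta=0.1, alpha=0.9):
    v_next = alpha * v - eta * grad(w + alpha * v)   # gradient at look-ahead point
    return w + v_next, v_next

# Example on a simple quadratic bowl f(w) = 0.5 * ||w||^2, so grad(w) = w.
grad = lambda w: w
w_c = w_n = np.array([5.0, -3.0])
v_c = v_n = np.zeros(2)
for _ in range(50):
    w_c, v_c = classical_momentum_step(w_c, v_c, grad)
    w_n, v_n = nesterov_momentum_step(w_n, v_n, grad)
print(w_c, w_n)   # both head toward the minimum at the origin
```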
AdaGrad, RMSProp, AdaDelta
• Scales the gradient by a running norm of all the previous gradients
• Per-dimension update:
  $w_{t+1} = w_t - \eta\, \dfrac{g(w_t)}{\sqrt{\sum_{i=1}^{t} g(w_i)^2 + \varepsilon}}$
• Automatically reduces the effective learning rate with $t$
• Parameters with small gradients speed up
• RMSProp and AdaDelta use a forgetting factor in the squared-gradient accumulator so that the updates do not become too small
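A minimal per-parameter sketch of AdaGrad and RMSProp (not from the slides); the learning rate, forgetting factor, and the toy quadratic gradient are illustrative assumptions.

```python
import numpy as np

def adagrad_step(w, accum, grad, eta=0.01, eps=1e-8):
    accum = accum + grad ** 2                        # running sum of squared grads
    return w - eta * grad / (np.sqrt(accum) + eps), accum

def rmsprop_step(w, accum, grad, eta=0.01, rho=0.9, eps=1e-8):
    accum = rho * accum + (1.0 - rho) * grad ** 2    # forgetting factor rho
    return w - eta * grad / (np.sqrt(accum) + eps), accum

w_a = w_r = np.array([1.0, -2.0])
acc_a = acc_r = np.zeros(2)
grad = lambda w: 2.0 * w                             # gradient of ||w||^2
for _ in range(100):
    w_a, acc_a = adagrad_step(w_a, acc_a, grad(w_a))
    w_r, acc_r = rmsprop_step(w_r, acc_r, grad(w_r))
print(w_a, w_r)   # RMSProp keeps a larger effective step late in training
```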
Adam optimizer combines AdaGrad and momentum
• Initialize: $m_0 = 0$, $v_0 = 0$
• Loop over $t$:
• $g_t = \nabla_w f_t(w_{t-1})$ (get gradient)
• $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$ (update first moment, biased)
• $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$ (update second moment, biased)
• $\hat{m}_t = m_t / (1 - \beta_1^t)$ (correct bias in first moment)
• $\hat{v}_t = v_t / (1 - \beta_2^t)$ (correct bias in second moment)
• $w_t = w_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \varepsilon)$ (update parameters)

“ADAM: A method for stochastic optimization” Kingma and Ba, ICLR’15
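A minimal implementation sketch of the update rules listed above (not the authors' code); the hyperparameter defaults follow the common choices in the Adam paper, and the toy quadratic example is an assumption for demonstration.

```python
import numpy as np

def adam(grad, w0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)   # first moment
    v = np.zeros_like(w)   # second moment
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** t)        # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Example: minimize f(w) = ||w||^2 (gradient 2w); should approach the origin.
print(adam(lambda w: 2.0 * w, [1.0, -2.0], steps=5000))
```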


Visualizing optimizers

Source: http://ruder.io/optimizing-gradient-descent/index.html
