Lecture 8 Gradient Descent For Non-Convex Functions

This document discusses advanced machine learning techniques for gradient descent on non-convex functions. It characterizes non-convex loss surfaces and describes issues with techniques like gradient descent and Newton's method. It then explains how stochastic gradient descent, momentum methods, and adaptive learning rate optimization techniques like Adam can help address these issues and help machine learning models escape saddle points and local minima to find better solutions.


Advanced Machine Learning

Gradient Descent for Non-Convex Functions

Amit Sethi
Electrical Engineering, IIT Bombay
Learning outcomes for the lecture

• Characterize non-convex loss surfaces with the Hessian

• List issues with non-convex surfaces

• Explain how certain optimization techniques help solve these issues
Contents
• Characterizing non-convex loss surfaces
• Issues with gradient descent
• Issues with Newton’s method
• Stochastic gradient descent to the rescue
• Momentum and its variants
• Saddle-free Newton
Why do we not get stuck in bad local minima?
• Local minima are close to global minima in terms of error

• Saddle points are much more likely than local minima in the high-error regions of the error surface (in high-dimensional weight space); see the sketch after this list

• SGD (and other techniques) allow you to escape the saddle points
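The following is not from the slides; it is a minimal NumPy sketch under the toy assumption that the Hessian at a random critical point behaves like a random symmetric matrix. It illustrates why "all eigenvalues positive" (a local minimum) becomes vanishingly rare as the dimension grows, so most critical points in high dimensions are saddles.

```python
import numpy as np

# Toy illustration (not from the slides): model the Hessian at a random
# critical point as a random symmetric matrix and check how often it is
# positive definite (i.e., the critical point is a local minimum).
rng = np.random.default_rng(0)

def fraction_minima(d, trials=2000):
    count = 0
    for _ in range(trials):
        a = rng.standard_normal((d, d))
        h = (a + a.T) / 2.0                      # symmetric "Hessian"
        if np.all(np.linalg.eigvalsh(h) > 0):    # all eigenvalues positive?
            count += 1
    return count / trials

for d in (2, 5, 10, 20):
    print(d, fraction_minima(d))
# The fraction of all-positive-eigenvalue matrices collapses rapidly with d,
# so most critical points in high dimensions are saddles, not minima.
```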
Error surfaces and saddle points

[Figures: example 3D error surfaces showing saddle points. Sources: http://math.etsu.edu/multicalc/prealpha/Chap2/Chap2-8/10-6-53.gif and http://pundit.pratt.duke.edu/piki/images/thumb/0/0a/SurfExp04.png/400px-SurfExp04.png]
Eigenvalues of Hessian at critical points
[Figure: eigenvalue signatures at different critical points of a loss surface, with labels for local minima, plateau, long furrow, and saddle point. Source: http://i.stack.imgur.com/NsI2J.png]

A realistic picture

[Figure: a realistic neural-network loss landscape with labeled saddle points, global minima, local minima, and local maxima. Image source: https://www.cs.umd.edu/~tomg/projects/landscapes/]


Is achieving a global minimum important?
• The global minimum for the training data may not be the global minimum for the validation or test data

• Local minima are often good enough

“The Loss Surfaces of Multilayer Networks” Choromanska et al. JMLR’15


Under certain assumptions, minima are theoretically also of high quality
• Results:
– Lowest critical values of the random loss form a band
– Probability of minima outside that band diminishes exponentially with the size of the network
– Empirical verification
• Assumptions:
– Fully-connected feed-forward neural network
– Variable independence
– Redundancy in network parametrization
– Uniformity

“The Loss Surfaces of Multilayer Networks” Choromanska et al. JMLR’15


Empirically, most minima are of high quality

“Identifying and attacking the saddle point problem in high-dimensional non-convex optimization” Dauphin et al., NIPS’14
GD vs. Newton’s method
• Gradient descent is based on a first-order approximation:
  $f(\theta^* + \Delta\theta) \approx f(\theta^*) + \nabla f^T \Delta\theta$, with update $\Delta\theta = -\eta\, \nabla f$

• Newton's method is based on a second-order approximation:
  $f(\theta^* + \Delta\theta) \approx f(\theta^*) + \nabla f^T \Delta\theta + \frac{1}{2}\, \Delta\theta^T H\, \Delta\theta$, with update $\Delta\theta = -H^{-1} \nabla f$

• Near a critical point, where $\nabla f = 0$, expressing $\Delta\theta$ in the eigenbasis of $H$ gives
  $f(\theta^* + \Delta\theta) = f(\theta^*) + \frac{1}{2} \sum_{i=1}^{n_\theta} \lambda_i\, \Delta v_i^2$
  where $\lambda_i$ are the eigenvalues of $H$ and $\Delta v_i$ is the component of $\Delta\theta$ along the $i$-th eigenvector

“Identifying and attacking the saddle point problem in high-dimensional non-convex optimization” Dauphin et al., NIPS’14
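Not from the slides: a minimal sketch of the two update rules on a toy 1-D non-convex function (the function and the step sizes are illustrative assumptions). It shows that gradient descent scales the raw gradient by a fixed learning rate, while Newton's method rescales it by the local curvature.

```python
import numpy as np

# Toy 1-D non-convex function f(t) = t**4 - 3*t**2 (illustrative choice).
f      = lambda t: t**4 - 3.0 * t**2
grad_f = lambda t: 4.0 * t**3 - 6.0 * t        # first derivative
hess_f = lambda t: 12.0 * t**2 - 6.0           # second derivative

theta, eta = 2.0, 0.05

gd_update     = theta - eta * grad_f(theta)            # delta = -eta * grad
newton_update = theta - grad_f(theta) / hess_f(theta)  # delta = -H^{-1} * grad

print(gd_update, newton_update)
# GD takes a step proportional to the gradient; Newton divides the gradient
# by the curvature, so its step size adapts to the shape of the surface.
```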
Disadvantages of 2nd order methods
• Updates require O(d³) (to invert the Hessian) or at least O(d²) (to form it) computation per step

• May not work well for non-convex surfaces

• Get attracted to saddle points (how? see the sketch below)

• Not very good for batch-updates
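To see how Newton's method is pulled toward saddle points, here is a small sketch (not from the slides) on the illustrative function f(x, y) = x² − y², which has a saddle at the origin:

```python
import numpy as np

# f(x, y) = x**2 - y**2 has a saddle at (0, 0): curvature +2 along x, -2 along y.
def grad(theta):
    x, y = theta
    return np.array([2.0 * x, -2.0 * y])

H = np.array([[2.0, 0.0],
              [0.0, -2.0]])   # constant Hessian for this quadratic

theta = np.array([1.0, 0.5])
eta = 0.1

gd_next     = theta - eta * grad(theta)               # gradient descent step
newton_next = theta - np.linalg.solve(H, grad(theta)) # Newton step

print("GD:    ", gd_next)      # x shrinks, but y grows: GD escapes along y
print("Newton:", newton_next)  # lands exactly on the saddle at (0, 0)
# Dividing by a negative eigenvalue flips the step along that eigenvector,
# so Newton is attracted to saddle points instead of escaping them.
```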


GD vs. SGD
• GD: $w_{t+1} = w_t - \eta\, g_{\text{all samples}}(w_t)$

• SGD: $w_{t+1} = w_t - \eta\, g_{\text{random mini-batch}}(w_t)$
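A minimal sketch (not from the slides) contrasting the two update rules on an illustrative least-squares problem; the data, learning rate, and batch size are assumptions made for the example.

```python
import numpy as np

# Least-squares toy problem: loss(w) = mean((X @ w - y) ** 2).
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.standard_normal(1000)

def gradient(w, Xb, yb):
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w_gd  = np.zeros(5)
w_sgd = np.zeros(5)
eta, batch_size = 0.1, 32

for t in range(200):
    # GD: gradient over all samples -> expensive per update, but exact
    w_gd -= eta * gradient(w_gd, X, y)

    # SGD: gradient over a random mini-batch -> cheap per update, but noisy
    idx = rng.choice(len(y), size=batch_size, replace=False)
    w_sgd -= eta * gradient(w_sgd, X[idx], y[idx])

print(w_gd)
print(w_sgd)   # close to w_gd, with extra mini-batch noise
```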


Compare GD with SGD
• GD requires more computations per update

• SGD is more noisy


SGD helps by changing the loss surface
• Different mini-batches (or samples) have their own loss surfaces
• The loss surface of the entire training set (shown dotted on the slide) may be different
• A local minimum of one loss surface may not be a local minimum of another
• This helps us escape local minima using stochastic or mini-batch gradient descent (see the sketch below)
• Mini-batch size depends on computational resource utilization
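A quick illustrative check (not from the slides): on a toy 1-D regression problem, the minimizer of the loss computed on one mini-batch differs from the minimizer on another mini-batch and from the full-data minimizer, i.e., each mini-batch defines its own loss surface.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1000)
y = 2.0 * x + 0.5 * rng.standard_normal(1000)   # 1-D linear model y ~ w * x

def best_w(xb, yb):
    # Closed-form minimizer of mean((w * xb - yb) ** 2)
    return np.dot(xb, yb) / np.dot(xb, xb)

idx_a = rng.choice(1000, size=32, replace=False)
idx_b = rng.choice(1000, size=32, replace=False)

print("full data:", best_w(x, y))
print("batch A  :", best_w(x[idx_a], y[idx_a]))
print("batch B  :", best_w(x[idx_b], y[idx_b]))
# Each mini-batch's loss surface has a slightly different minimum.
```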
Noise can be added in other ways to escape saddle points
• Random mini-batches (SGD)

• Add noise to the gradient or the update (see the sketch after this list)

• Add noise to the input
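A minimal sketch of the gradient-noise idea (not from the slides); the noise scale and the placeholder gradient values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sgd_step(w, grad, eta=0.1, sigma=0.01):
    # Add isotropic Gaussian noise to the gradient before taking the step.
    noise = sigma * rng.standard_normal(w.shape)
    return w - eta * (grad + noise)

w = np.array([1.0, 0.5])
g = np.array([2.0, -1.0])      # gradient at w (placeholder values)
w = noisy_sgd_step(w, g)
print(w)
# At an exact saddle the true gradient is zero, but the injected noise still
# perturbs the parameters, helping the iterate drift off the saddle.
```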


Learning rate scheduling
• High learning rates explore faster earlier
– But, they can lead to divergence or high final loss
• Low learning rates fine-tune better later
– But, they can be very slow to converge
• LR scheduling combines advantages of both
– Lots of schedules possible: linear, exponential, square-root, step-wise, cosine

[Figure: schematic of training loss vs. training iterations under different learning-rate schedules]
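As an illustration (not from the slides), here is a sketch of two of the schedules mentioned above; the base learning rate, drop interval, and decay factor are assumed values.

```python
import math

def step_lr(base_lr, step, drop_every=30, factor=0.1):
    """Step-wise decay: multiply the LR by `factor` every `drop_every` epochs."""
    return base_lr * (factor ** (step // drop_every))

def cosine_lr(base_lr, step, total_steps):
    """Cosine decay from base_lr down to 0 over `total_steps`."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

for epoch in range(0, 100, 20):
    print(epoch, step_lr(0.1, epoch), round(cosine_lr(0.1, epoch, 100), 4))
# Both start high (fast exploration) and end low (fine-tuning).
```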
Classical and Nesterov Momentum
• GD: $w_{t+1} = w_t - \eta\, g(w_t)$

• Classical momentum:
  $v_{t+1} = \alpha v_t - \eta\, g(w_t)$;  $w_{t+1} = w_t + v_{t+1}$

• Nesterov momentum:
  $v_{t+1} = \alpha v_t - \eta\, g(w_t + \alpha v_t)$;  $w_{t+1} = w_t + v_{t+1}$

[Diagram: each update shown as the vector sum of the momentum term $\alpha v_t$ and the gradient step]

• Nesterov evaluates the gradient at the look-ahead point $w_t + \alpha v_t$, giving better course-correction for a bad velocity
Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, 1983
Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013
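A minimal sketch of the two momentum updates above (not from the slides); `grad` stands for the gradient function of whatever loss is being minimized, and the quadratic example is an illustrative assumption.

```python
import numpy as np

def classical_momentum_step(w, v, grad, eta=0.1, alpha=0.9):
    v_next = alpha * v - eta * grad(w)               # gradient evaluated at w
    return w + v_next, v_next

def nesterov_momentum_step(w, v, grad, eta=0.1, alpha=0.9):
    v_next = alpha * v - eta * grad(w + alpha * v)   # gradient at look-ahead point
    return w + v_next, v_next

# Example on a simple quadratic bowl f(w) = 0.5 * ||w||^2, so grad(w) = w.
grad = lambda w: w
w_c = w_n = np.array([5.0, -3.0])
v_c = v_n = np.zeros(2)
for _ in range(50):
    w_c, v_c = classical_momentum_step(w_c, v_c, grad)
    w_n, v_n = nesterov_momentum_step(w_n, v_n, grad)
print(w_c, w_n)   # both head toward the minimum at the origin
```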
AdaGrad, RMSProp, AdaDelta
• Scales the gradient by a running norm of all the previous gradients
• Per-dimension update:
  $w_{t+1} = w_t - \eta\, \dfrac{g(w_t)}{\sqrt{\sum_{i=1}^{t} g(w_i)^2 + \varepsilon}}$
• Automatically reduces the effective learning rate with $t$
• Parameters with small gradients speed up
• RMSProp and AdaDelta use a forgetting factor in the squared-gradient accumulator so that the updates do not become too small
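A minimal per-parameter sketch of AdaGrad and RMSProp (not from the slides); the learning rate, forgetting factor, and the toy quadratic gradient are illustrative assumptions.

```python
import numpy as np

def adagrad_step(w, accum, grad, eta=0.01, eps=1e-8):
    accum = accum + grad ** 2                        # running sum of squared grads
    return w - eta * grad / (np.sqrt(accum) + eps), accum

def rmsprop_step(w, accum, grad, eta=0.01, rho=0.9, eps=1e-8):
    accum = rho * accum + (1.0 - rho) * grad ** 2    # forgetting factor rho
    return w - eta * grad / (np.sqrt(accum) + eps), accum

w_a = w_r = np.array([1.0, -2.0])
acc_a = acc_r = np.zeros(2)
grad = lambda w: 2.0 * w                             # gradient of ||w||^2
for _ in range(100):
    w_a, acc_a = adagrad_step(w_a, acc_a, grad(w_a))
    w_r, acc_r = rmsprop_step(w_r, acc_r, grad(w_r))
print(w_a, w_r)   # RMSProp keeps a larger effective step late in training
```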
Adam optimizer combines AdaGrad and momentum
• Initialize: $m_0 = 0$, $v_0 = 0$
• Loop over $t$:
• $g_t = \nabla_w f_t(w_{t-1})$ (get gradient)
• $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$ (update first moment, biased)
• $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$ (update second moment, biased)
• $\hat{m}_t = m_t / (1 - \beta_1^t)$ (correct bias in first moment)
• $\hat{v}_t = v_t / (1 - \beta_2^t)$ (correct bias in second moment)
• $w_t = w_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \varepsilon)$ (update parameters)

“ADAM: A method for stochastic optimization” Kingma and Ba, ICLR’15
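A minimal implementation sketch of the update rules listed above (not the authors' code); the hyperparameter defaults follow the common choices in the Adam paper, and the toy quadratic example is an assumption for demonstration.

```python
import numpy as np

def adam(grad, w0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)   # first moment
    v = np.zeros_like(w)   # second moment
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** t)        # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Example: minimize f(w) = ||w||^2 (gradient 2w); should approach the origin.
print(adam(lambda w: 2.0 * w, [1.0, -2.0], steps=5000))
```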


Visualizing optimizers

Source: http://ruder.io/optimizing-gradient-descent/index.html
