Beyond Gradient Descent
Dr. Sanjay Chatterji
CS 837

The Challenges with Gradient Descent
• Deep neural networks are able to crack problems that were previously deemed intractable.
• Training deep neural networks end to end, however, is fraught with difficult challenges, and its success has required:
– massive labelled datasets (ImageNet, CIFAR, etc.)
– better hardware in the form of GPU acceleration
– several algorithmic discoveries
• More recently, breakthroughs in optimization methods have enabled us to directly train models in an end-to-end fashion.
• The primary challenge in optimizing deep learning models is that we are forced to use minimal local information to infer the global structure of the error surface.
• This is a hard problem because there is usually very little correspondence between local and global structure.
Organization of the chapter
• The next couple of sections will focus primarily on local minima and whether they pose hurdles for successfully training deep models.
• In subsequent sections, we will further explore the nonconvex error surfaces induced by deep models, why vanilla mini-batch gradient descent falls short, and how modern nonconvex optimizers overcome these pitfalls.

Local Minima in the Error Surfaces
• Imagine an ant dropped on a map of the continental United States whose goal is to find the lowest point on the surface.
• The surface of the US is extremely complex (a nonconvex surface).
• Even a mini-batch version of gradient descent won’t save us from a deep local minimum.

Couple of critical questions
• Theoretically, local minima pose a significant issue.
• But in practice, how common are local minima in the error surfaces of deep networks?
• In which scenarios are they actually problematic for training?
• In the following two sections, we’ll pick apart common misconceptions about local minima.

Model Identifiability
• The first source of local minima is tied to a concept commonly referred to as model identifiability.
• Deep neural networks have an infinite number of local minima, for two major reasons:
– Within a layer, any rearrangement (symmetry) of the neurons gives the same final output. For a network with l layers of n neurons each, that is a total of n!^l equivalent configurations.
– Because ReLU is a piecewise linear function, we can multiply a neuron’s incoming weights by any constant k > 0 while scaling its outgoing weights by 1/k without changing the behavior of the network.
• Both symmetries are illustrated in the sketch below.
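A minimal NumPy sketch of these two symmetries; the layer shapes, values, and helper name forward are illustrative assumptions, not code from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # hidden-by-input weights
b1 = rng.normal(size=4)        # hidden biases
W2 = rng.normal(size=(2, 4))   # output-by-hidden weights

def forward(x, W1, b1, W2):
    h = np.maximum(0.0, W1 @ x + b1)   # ReLU hidden layer
    return W2 @ h

x = rng.normal(size=3)
baseline = forward(x, W1, b1, W2)

# Symmetry 1: permute the hidden neurons (rows of W1 and b1, columns of W2).
perm = rng.permutation(4)
permuted = forward(x, W1[perm], b1[perm], W2[:, perm])

# Symmetry 2: scale one neuron's incoming weights by k and its outgoing weights by 1/k.
k = 3.7
W1s, b1s, W2s = W1.copy(), b1.copy(), W2.copy()
W1s[0] *= k
b1s[0] *= k
W2s[:, 0] /= k
rescaled = forward(x, W1s, b1s, W2s)

print(np.allclose(baseline, permuted), np.allclose(baseline, rescaled))  # True True
```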
Continued…
• Local minima that arise from non-identifiability are not inherently problematic.
• All non-identifiable configurations behave in an indistinguishable fashion.
• Local minima are only problematic when they are spurious: a configuration of weights whose error is higher than the error of the configuration at the global minimum.

How Pesky Are Spurious Local Minima in Deep Networks?
• For many years, deep learning practitioners blamed all of their troubles in training deep networks on spurious local minima.
• Recent studies indicate that most local minima have error rates and generalization characteristics similar to global minima.
• We might try to tackle this problem by plotting the value of the error function over time as we train a deep neural network.
• This strategy, however, doesn’t give us enough information about the error surface.
• Instead of analyzing the error function over time, Goodfellow et al. (2014) investigated the error surface along the line between a randomly initialized parameter vector θ_i and a successful final solution θ_f, using linear interpolation: θ_α = α · θ_f + (1 − α) · θ_i.

Continued..
• If we run this experiment over and over again, we find that there are no truly troublesome local minima that would get us stuck.
• Vary the value of α to see how the error surface changes as we traverse the line between the randomly initialized point and the final SGD solution (a sketch of this probe follows below).
• The true struggle of gradient descent isn’t the existence of troublesome local minima, but finding the appropriate direction to move in.
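A minimal NumPy sketch of the interpolation probe; the toy quadratic loss and the helper name interpolation_curve are illustrative assumptions, not the original experiment:

```python
import numpy as np

def interpolation_curve(loss_fn, theta_i, theta_f, num_points=25):
    """Evaluate the loss at theta_alpha = alpha * theta_f + (1 - alpha) * theta_i."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = [loss_fn(a * theta_f + (1.0 - a) * theta_i) for a in alphas]
    return alphas, np.array(losses)

# Toy usage with a quadratic "error surface"; in practice loss_fn would run the
# network on a mini-batch using the interpolated parameter vector.
loss_fn = lambda theta: float(np.sum(theta ** 2))
theta_i = np.random.default_rng(1).normal(size=10)   # random initialization
theta_f = np.zeros(10)                               # stand-in for the trained solution
alphas, losses = interpolation_curve(loss_fn, theta_i, theta_f)
print(losses.round(3))   # the loss decreases smoothly as alpha approaches 1
```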
Flat Regions in the Error Surface
• The gradient approaches zero in a peculiar flat region (at α = 1), which is not a local minimum.
• A zero gradient might slow down learning.
• More generally, given an arbitrary function, a point at which the gradient is the zero vector is called a critical point.
• These “flat” regions that are potentially pesky but not necessarily deadly are called saddle points.
• As a function gains more and more dimensions, saddle points become exponentially more likely than local minima.

Continued..
• In d-dimensional parameter space, we can slice through a critical point on d different axes.
• A critical point can only be a local minimum if it appears as a local minimum in every single one of the d one-dimensional subspaces.
• A critical point can come in one of three different flavors in a one-dimensional subspace, so:
– the probability that a random critical point of a random function is a local minimum is 1/3^d
– a random function with k critical points has an expected number of k/3^d local minima
– as the dimensionality of our parameter space increases, local minima become exponentially more rare.

When the Gradient Points in the Wrong Direction
• Upon analyzing the error surfaces of deep networks, it seems that the most critical challenge in optimizing deep networks is finding the correct trajectory to move in.
• The gradient, however, isn’t usually a very good indicator of that trajectory.
• Only when the contours are perfectly circular does the gradient always point in the direction of the local minimum.
• If the contours are extremely elliptical, the gradient can be as much as 90 degrees away from the correct direction!
• For every weight w_i in the parameter space, the gradient computes “how the value of the error changes as we change the value of w_i”: ∂E/∂w_i.
• It gives us the direction of steepest descent.
• Only if the contours are perfectly circular can we safely take a big step, because then the gradient doesn’t change direction as we move.

Gradient using second derivatives
• The gradient changes under our feet as we move in a certain direction, so we also want second derivatives.
• We can compile this information into a special matrix known as the Hessian matrix (H).
• Computing the Hessian matrix exactly is a difficult task.
• The sketch below uses the Hessian at a critical point to tell minima, maxima, and saddle points apart.
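A small illustrative sketch (not from the slides): at a critical point the eigenvalues of the Hessian reveal its flavor. All positive means a local minimum, all negative a local maximum, and mixed signs a saddle point.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-8):
    """Classify a critical point from the eigenvalues of its (symmetric) Hessian."""
    eig = np.linalg.eigvalsh(hessian)
    if np.all(eig > tol):
        return "local minimum"
    if np.all(eig < -tol):
        return "local maximum"
    return "saddle point (or flat direction)"

# The origin is a critical point of f(x, y) = x**2 - y**2, with Hessian diag(2, -2).
print(classify_critical_point(np.diag([2.0, -2.0])))   # saddle point (or flat direction)
```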
Momentum-Based Optimization
• The problem of an ill-conditioned Hessian matrix manifests itself in the form of gradients that fluctuate wildly.
• One popular mechanism for dealing with ill-conditioning bypasses the computation of the Hessian and instead focuses on how to cancel out these fluctuations over the duration of training.
• There are two major components that determine how a ball rolls down an error surface:
– Acceleration
– Motion

Momentum-Based Optimization
• Our goal, then, is to somehow generate an analog for velocity in our optimization algorithm.
• We can do this by keeping track of an exponentially weighted decay of past gradients.
• We use the momentum hyperparameter m to determine what fraction of the previous velocity to retain in the new update (see the sketch below).
• Momentum significantly reduces the volatility of updates. The larger the momentum, the less responsive we are to new updates.
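A minimal NumPy sketch of the momentum update; the toy ill-conditioned quadratic and the hyperparameter values are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def momentum_step(theta, velocity, grad, lr=0.01, m=0.9):
    velocity = m * velocity - lr * grad   # retain a fraction m of the old velocity
    theta = theta + velocity              # move along the accumulated direction
    return theta, velocity

# Toy usage on an ill-conditioned quadratic f(w) = 0.5 * (w[0]**2 + 100 * w[1]**2),
# where the raw gradient oscillates across the narrow valley.
scales = np.array([1.0, 100.0])
theta, velocity = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    grad = scales * theta                 # gradient of the quadratic
    theta, velocity = momentum_step(theta, velocity, grad, lr=0.01, m=0.9)
print(theta)   # close to the minimum at the origin
```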
A Brief View of Second-Order Methods
• Computing the Hessian exactly is a computationally difficult task.
• Momentum afforded us a significant speedup without having to compute the Hessian at all.
• Several second-order methods have been researched over the past several years that attempt to approximate the Hessian directly:
– Conjugate gradient descent (chooses a direction conjugate to the previous choice)
– The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm (approximates the inverse of the Hessian matrix)
Second-Order Methods
• The conjugate direction is chosen by using an indirect approximation of the Hessian to linearly combine the gradient and our previous direction.
• With a slight modification, this method generalizes to the nonconvex error surfaces we find in deep networks.
• BFGS has a significant memory footprint, but recent work has produced a more memory-efficient version known as L-BFGS (a small example follows below).
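As an illustration, SciPy ships off-the-shelf nonlinear conjugate gradient and L-BFGS optimizers; a hedged sketch on the same ill-conditioned quadratic used earlier, which is an illustrative stand-in rather than anything from the slides:

```python
import numpy as np
from scipy.optimize import minimize

def f(w):
    # A simple ill-conditioned quadratic standing in for a loss function.
    return 0.5 * (w[0] ** 2 + 100.0 * w[1] ** 2)

def grad(w):
    return np.array([w[0], 100.0 * w[1]])

w0 = np.array([1.0, 1.0])
cg = minimize(f, w0, jac=grad, method="CG")           # nonlinear conjugate gradient
lbfgs = minimize(f, w0, jac=grad, method="L-BFGS-B")  # limited-memory BFGS
print(cg.x, lbfgs.x)   # both end up near the minimum at the origin
```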
Learning Rate Adaptation
• One of the major breakthroughs in modern deep network optimization was the advent of learning rate adaptation.
• The basic concept behind learning rate adaptation is that the optimal learning rate is appropriately modified over the span of learning to achieve good convergence properties.
• Some popular adaptive learning rate algorithms:
– AdaGrad
– RMSProp
– Adam

AdaGrad—Accumulating Historical Gradients
• AdaGrad attempts to adapt the global learning rate over time using an accumulation of the historical gradients.
• Specifically, we keep track of a learning rate for each parameter.
• This learning rate is inversely scaled with respect to the root mean square of all the parameter’s historical gradients (the gradient accumulation vector r).
• Note that we add a tiny number δ (≈ 10^-7) to the denominator in order to prevent division by zero.

Simply using a naive accumulation of gradients isn’t sufficient
• This update mechanism means that the parameters with the largest gradients experience a rapid decrease in their learning rates, while parameters with smaller gradients only observe a small decrease in their learning rates.
• AdaGrad also has a tendency to cause a premature drop in the learning rate, and as a result doesn’t work particularly well for some deep models.
• While AdaGrad works well for simple convex functions, it isn’t designed to navigate the complex error surfaces of deep networks.
• Flat regions may force AdaGrad to decrease the learning rate before it reaches a minimum.

RMSProp—Exponentially Weighted Moving Average of Gradients
• Let’s bring back a concept we introduced earlier while discussing momentum to dampen fluctuations in the gradient.
• Compared to naive accumulation, exponentially weighted moving averages also enable us to “toss out” measurements that we made a long time ago.
• The decay factor ρ determines how long we keep old gradients: the smaller the decay factor, the shorter the effective window.
• Plugging this modification into AdaGrad gives rise to the RMSProp learning algorithm (both update rules are sketched below).
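A minimal NumPy sketch of the two per-parameter update rules. The symbols follow the text (r is the gradient accumulation vector, ρ the decay factor, δ ≈ 10^-7); the function names, learning rates, and toy usage are illustrative assumptions:

```python
import numpy as np

def adagrad_step(theta, r, grad, lr=0.01, delta=1e-7):
    """AdaGrad: accumulate squared gradients forever and scale each parameter's step."""
    r = r + grad ** 2
    theta = theta - lr * grad / (delta + np.sqrt(r))
    return theta, r

def rmsprop_step(theta, r, grad, lr=0.001, rho=0.9, delta=1e-7):
    """RMSProp: replace the accumulation with an exponentially weighted moving average."""
    r = rho * r + (1.0 - rho) * grad ** 2
    theta = theta - lr * grad / (delta + np.sqrt(r))
    return theta, r

# Toy usage on the same ill-conditioned quadratic used earlier; adagrad_step is
# called in exactly the same way.
scales = np.array([1.0, 100.0])
theta, r = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    theta, r = rmsprop_step(theta, r, scales * theta, lr=0.01)
print(theta)   # ends up within roughly lr of the minimum at the origin
```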
Adam—Combining Momentum and RMSProp
• Spiritually, we can think about Adam as a variant combination of RMSProp and momentum.
• As with momentum, we keep track of an exponentially weighted moving average of the gradient (the first moment of the gradient).
• Similarly to RMSProp, we maintain an exponentially weighted moving average of the squared historical gradients (the second moment of the gradient).

Bias in Adam
• However, these estimations are biased relative to the real moments because we start off by initializing both vectors to the zero vector.
• In order to remedy this bias, we derive a correction factor for both estimations (see the sketch below).
• Recently, Adam has gained popularity because of its corrective measures against the zero-initialization bias (a weakness of RMSProp) and its ability to combine the core concepts behind RMSProp with momentum more effectively.
• The default hyperparameter settings for Adam in TensorFlow generally perform quite well, and Adam is also generally robust to the choice of hyperparameters.
• The only exception is that the learning rate may need to be modified in certain cases from the default value of 0.001.
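A minimal NumPy sketch of the Adam update with its bias correction. The names m, v, beta1, and beta2 follow the usual convention and, along with the toy usage, are illustrative assumptions rather than code from the slides:

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, delta=1e-7):
    """One Adam update at time step t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad           # first moment: momentum-like average
    v = beta2 * v + (1.0 - beta2) * grad ** 2      # second moment: RMSProp-like average
    m_hat = m / (1.0 - beta1 ** t)                 # correct the zero-initialization bias
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + delta)
    return theta, m, v

# Toy usage on the ill-conditioned quadratic from the earlier sketches.
scales = np.array([1.0, 100.0])
theta, m, v = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, m, v, scales * theta, t)
print(theta)   # both coordinates approach the minimum at the origin
```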
The Philosophy Behind Optimizer Selection
• We’ve discussed several strategies that are used to make navigating the complex error surfaces of deep networks more tractable.
• These strategies have culminated in several optimization algorithms.
• While it would be awfully nice to know when to use which algorithm, there is very little consensus among expert practitioners.
• Currently, the most popular algorithms are mini-batch gradient descent, mini-batch gradient descent with momentum, RMSProp, RMSProp with momentum, Adam, and AdaDelta.

Thank You