
3. GRADIENT-BASED OPTIMIZATION

 Most deep learning algorithms involve optimization of some sort.


 Optimization refers to the task of either minimizing or maximizing some function f(x) by
altering x.
 We usually phrase most optimization problems in terms of minimizing f(x).
 Maximization may be accomplished via a minimization algorithm by minimizing −f(x).
 The function we want to minimize or maximize is called the objective function or
criterion.
 When we are minimizing it, we may also call it the cost function, loss function, or error
function.

For example, we might say x∗ = arg min f(x).

 Suppose we have a function y = f(x), where both x and y are real numbers.

 The derivative of this function is denoted as f′(x) or as dy/dx. The derivative f′(x) gives the slope of f(x) at the point x.

 In other words, it specifies how to scale a small change in the input in order to obtain the corresponding change in the output: f(x + ε) ≈ f(x) + εf′(x).

 The derivative is therefore useful for minimizing a function because it tells us how to
change x in order to make a small improvement in y.
For example:

 We know that f(x − ε sign(f′(x))) is less than f(x) for small enough ε.

 We can thus reduce f(x) by moving x in small steps with the opposite sign of the derivative, as in the sketch below.
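As a rough illustration (my own toy example, not from the notes), the following sketch applies this rule to the quadratic f(x) = x², whose derivative is f′(x) = 2x:

```python
import numpy as np

# Toy 1-D example: f(x) = x**2, with derivative f'(x) = 2x.
def f(x):
    return x ** 2

def f_prime(x):
    return 2.0 * x

x = 3.0       # arbitrary starting point
eps = 0.01    # small step size
for _ in range(1000):
    x = x - eps * np.sign(f_prime(x))   # step opposite the sign of the derivative
print(x, f(x))                          # x ends up near the minimizer x* = 0
```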

 When f′(x) = 0, the derivative provides no information about which direction to move. Points where f′(x) = 0 are known as critical points or stationary points.

 A local minimum is a point where f(x) is lower than at all neighboring points, so it is no longer possible to decrease f(x) by making infinitesimal steps. A local maximum is a point where f(x) is higher than at all neighboring points.
 The directional derivative in direction u (a unit vector) is the slope of the function f in
direction u.

 In other words, the directional derivative is the derivative of the function f(x + αu) with
respect to α, evaluated at α = 0.

 Using the chain rule, we can see that ∂f(x + αu)/∂α evaluates to uᵀ∇ₓf(x) when α = 0.

 To minimize f, we would like to find the direction in which f decreases the fastest. We can do this using the directional derivative:

min_{u : uᵀu = 1} uᵀ∇ₓf(x) = min_u ‖u‖₂ ‖∇ₓf(x)‖₂ cos θ,

where θ is the angle between u and the gradient.

 Substituting in ‖u‖₂ = 1 and ignoring factors that do not depend on u, this simplifies to min_u cos θ. This is minimized when u points in the opposite direction as the gradient. In other words, the gradient points directly uphill, and the negative gradient points directly downhill. We can decrease f by moving in the direction of the negative gradient. This is known as the method of steepest descent or gradient descent. Steepest descent proposes a new point

x′ = x − ε∇ₓf(x),

where ε is the learning rate, a positive scalar determining the size of the step. We can choose ε in several different ways. A popular approach is to set ε to a small constant.
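A minimal NumPy sketch of this constant-learning-rate update, assuming a hypothetical objective f(x) = ½‖x‖² (my own choice, not from the notes), whose gradient is simply x:

```python
import numpy as np

# Hypothetical objective: f(x) = 0.5 * ||x||^2, so grad_f(x) = x.
def grad_f(x):
    return x

x = np.array([2.0, -3.0])    # arbitrary starting point
eps = 0.1                    # learning rate: a small positive constant
for _ in range(100):
    x = x - eps * grad_f(x)  # step in the direction of the negative gradient
print(x)                     # close to the minimizer at the origin
```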
Beyond the Gradient: Jacobian and Hessian Matrices

 Sometimes we need to find all of the partial derivatives of a function whose input and output are both vectors.

 The matrix containing all such partial derivatives is known as a Jacobian matrix.

 Specifically, if we have a function f : ℝᵐ → ℝⁿ, then the Jacobian matrix J ∈ ℝⁿˣᵐ of f is defined such that Jᵢ,ⱼ = ∂f(x)ᵢ / ∂xⱼ.
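To illustrate the definition, here is a small sketch (with a vector-valued function of my own choosing) that approximates the Jacobian by central finite differences and compares it against the analytic result:

```python
import numpy as np

# Hypothetical map f: R^2 -> R^2, f(x) = (x0 * x1, x0 + x1**2).
def f(x):
    return np.array([x[0] * x[1], x[0] + x[1] ** 2])

def numerical_jacobian(f, x, h=1e-6):
    """Approximate J[i, j] = d f(x)[i] / d x[j] with central differences."""
    n, m = f(x).size, x.size
    J = np.zeros((n, m))
    for j in range(m):
        e = np.zeros(m)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

x = np.array([1.0, 2.0])
print(numerical_jacobian(f, x))                    # approximately [[2, 1], [1, 4]]
print(np.array([[x[1], x[0]], [1.0, 2 * x[1]]]))   # analytic Jacobian for comparison
```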

 We are also sometimes interested in a derivative of a derivative. This is known as a second derivative.

For example, for a function f : ℝⁿ → ℝ, the derivative with respect to xᵢ of the derivative of f with respect to xⱼ is denoted as ∂²f / ∂xᵢ ∂xⱼ.

 In a single dimension, we can denote d²f/dx² by f″(x).
 The second derivative tells us how the first derivative will change as we vary the input.

Suppose we have a quadratic function


o If such a function has a second derivative of zero, then there is no curvature.

o It is a perfectly flat line, and its value can be predicted using only the gradient.

o If the gradient is 1, then we can make a step of size ε along the negative gradient, and the cost function will decrease by ε.

o If the second derivative is negative, the function curves downward, so the cost function will actually decrease by more than ε.

o Finally, if the second derivative is positive, the function curves upward, so the cost function can decrease by less than ε.
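A quick numerical check of these three cases (the toy functions are my own, not from the notes), each with gradient 1 at x = 0 but a different second derivative:

```python
import numpy as np

eps = 0.1
# Three toy functions with gradient 1 at x = 0 and second derivative 0, -2, +2.
cases = {
    "zero curvature (f = x)":            lambda x: x,
    "negative curvature (f = x - x**2)": lambda x: x - x ** 2,
    "positive curvature (f = x + x**2)": lambda x: x + x ** 2,
}
for name, f in cases.items():
    decrease = f(0.0) - f(0.0 - eps)   # actual decrease from stepping -eps
    print(f"{name}: predicted {eps}, actual {decrease:.4f}")
```

Running this shows the decrease equal to ε, greater than ε, and less than ε, respectively, matching the three bullets above.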
When our function has multiple input dimensions, there are many second derivatives. These derivatives can be collected together into a matrix called the Hessian matrix. The Hessian matrix H(f)(x) is defined such that

H(f)(x)ᵢ,ⱼ = ∂²f(x) / ∂xᵢ ∂xⱼ.

 Most of the functions we encounter in the context of deep learning have a symmetric Hessian almost everywhere. Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors.

 The second derivative in a specific direction represented by a unit vector d is given by dᵀHd.

 When d is an eigenvector of H, the second derivative in that direction is given by the corresponding eigenvalue.

 For other directions of d the directional second derivative is a weighted average of all of
the eigenvalues, with weights between 0 and 1, and eigenvectors that have smaller angle
with d receiving more weight.

 The maximum eigenvalue determines the maximum second derivative and the minimum
eigenvalue determines the minimum second derivative.
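To make this concrete, here is a short sketch using a fixed symmetric matrix of my own choosing as a stand-in for a Hessian: along an eigenvector, dᵀHd equals the corresponding eigenvalue, and for any unit direction it stays between the minimum and maximum eigenvalues:

```python
import numpy as np

# Hypothetical symmetric Hessian of some function at some point.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(H)   # real eigenvalues, orthonormal eigenvectors

# Directional second derivative along an eigenvector equals its eigenvalue.
d = eigvecs[:, 0]
print(d @ H @ d, eigvals[0])

# For random unit directions, d^T H d lies between the min and max eigenvalues.
for _ in range(5):
    d = np.random.randn(2)
    d /= np.linalg.norm(d)
    print(eigvals.min() <= d @ H @ d <= eigvals.max())
```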

 The (directional) second derivative tells us how well we can expect a gradient descent step
to perform.
 We can make a second-order Taylor series approximation to the function f(x) around the current point x⁽⁰⁾:

f(x) ≈ f(x⁽⁰⁾) + (x − x⁽⁰⁾)ᵀg + ½ (x − x⁽⁰⁾)ᵀH(x − x⁽⁰⁾),

where g is the gradient and H is the Hessian at x⁽⁰⁾. If we use a learning rate of ε, then the new point x will be given by x⁽⁰⁾ − εg. Substituting this into the approximation, we obtain

f(x⁽⁰⁾ − εg) ≈ f(x⁽⁰⁾) − εgᵀg + ½ ε²gᵀHg.

 When gᵀHg is zero or negative, the Taylor series approximation predicts that increasing ε forever will decrease f forever.

 When gᵀHg is positive, solving for the optimal step size that decreases the Taylor series approximation of the function the most yields

ε∗ = gᵀg / (gᵀHg).
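A small numerical check (using a toy quadratic of my own choosing, not from the notes) that ε∗ minimizes the objective along the negative-gradient direction; for a quadratic the second-order approximation is exact, so ε∗ is exactly optimal along that line:

```python
import numpy as np

# Toy quadratic f(x) = 0.5 * x^T H x, so gradient g = H x and Hessian H.
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
def f(x):
    return 0.5 * x @ H @ x

x0 = np.array([1.0, 2.0])
g = H @ x0
eps_star = (g @ g) / (g @ H @ g)   # optimal step size from the Taylor approximation

# eps_star should give the lowest value among nearby step sizes.
for eps in [0.5 * eps_star, eps_star, 1.5 * eps_star]:
    print(eps, f(x0 - eps * g))
```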
Figure: A saddle point containing both positive and negative curvature.

 The function in this example is f(x) = x₁² − x₂². Along the axis corresponding to x₁, the function curves upward; along the axis corresponding to x₂, it curves downward.
Figure: Gradient descent fails to exploit the curvature information.

 The simplest method that uses curvature information from the Hessian to guide the search is known as Newton's method. Newton's method is based on using a second-order Taylor series expansion to approximate f(x) near some point x⁽⁰⁾:

f(x) ≈ f(x⁽⁰⁾) + (x − x⁽⁰⁾)ᵀ∇ₓf(x⁽⁰⁾) + ½ (x − x⁽⁰⁾)ᵀH(f)(x⁽⁰⁾)(x − x⁽⁰⁾).

Solving for the critical point of this approximation gives the Newton update x∗ = x⁽⁰⁾ − H(f)(x⁽⁰⁾)⁻¹ ∇ₓf(x⁽⁰⁾).
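A minimal sketch of a single Newton step (again on a toy positive-definite quadratic of my own), which for a quadratic jumps directly to the minimum:

```python
import numpy as np

# Toy quadratic f(x) = 0.5 * x^T H x - b^T x, with gradient H x - b and constant Hessian H.
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, -2.0])

x0 = np.zeros(2)
grad = H @ x0 - b
x_new = x0 - np.linalg.solve(H, grad)   # Newton update: x0 - H^{-1} grad

print(x_new)            # equals the exact minimizer H^{-1} b for this quadratic
print(H @ x_new - b)    # gradient at the new point is (numerically) zero
```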

 Optimization algorithms that use only the gradient, such as gradient descent, are called first-order optimization algorithms.

 Optimization algorithms that also use the Hessian matrix, such as Newton's method, are called second-order optimization algorithms.
 In the context of deep learning, we sometimes gain some guarantees by restricting
ourselves to functions that are either Lipschitz continuous or have Lipschitz continuous
derivatives.

 Convex optimization algorithms are able to provide many more guarantees by making
stronger restrictions.

 Convex optimization algorithms are applicable only to convex functions -- functions for
which the Hessian is positive semidefinite everywhere.

 Such functions are well-behaved because they lack saddle points and all of their local
minima are necessarily global minima.

 However, most problems in deep learning are difficult to express in terms of convex
optimization. Convex optimization is used only as a subroutine of some deep learning
algorithms.

 Ideas from the analysis of convex optimization algorithms can be useful for proving the
convergence of deep learning algorithms.

 However, in general, the importance of convex optimization is greatly diminished in the context of deep learning.
