
3. GRADIENT-BASED OPTIMIZATION

 Most deep learning algorithms involve optimization of some sort.


 Optimization refers to the task of either minimizing or maximizing some function f(x) by
altering x.
 We usually phrase most optimization problems in terms of minimizing f(x).
 Maximization may be accomplished via a minimization algorithm by minimizing −f(x).
 The function we want to minimize or maximize is called the objective function or
criterion.
 When we are minimizing it, we may also call it the cost function, loss function, or error
function.

For example, we might say x∗ = arg min f(x).

 Suppose we have a function y = f(x), where both x and y are real numbers.

 The derivative of this function is denoted as f′(x) or as dy/dx. The derivative f′(x) gives the slope of f(x) at the point x.

 In other words, it specifies how to scale a small change in the input in order to obtain the corresponding change in the output: f(x + ε) ≈ f(x) + εf′(x).

 The derivative is therefore useful for minimizing a function because it tells us how to
change x in order to make a small improvement in y.
For example:

 We know that f(x − ε sign(f′(x))) is less than f(x) for small enough ε.

 We can thus reduce f(x) by moving x in small steps with the opposite sign of the derivative, as in the sketch below.
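As a rough illustration (my own toy example, not from the notes), the following sketch applies this rule to the quadratic f(x) = x², whose derivative is f′(x) = 2x:

```python
import numpy as np

# Toy 1-D example: f(x) = x**2, with derivative f'(x) = 2x.
def f(x):
    return x ** 2

def f_prime(x):
    return 2.0 * x

x = 3.0       # arbitrary starting point
eps = 0.01    # small step size
for _ in range(1000):
    x = x - eps * np.sign(f_prime(x))   # step opposite the sign of the derivative
print(x, f(x))                          # x ends up near the minimizer x* = 0
```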

 When f′(x) = 0, the derivative provides no information about which direction to move. Points where f′(x) = 0 are known as critical points or stationary points.

 A local minimum is a point where f(x) is lower than at all neighboring points, so it is no longer possible to decrease f(x) by making infinitesimal steps. A local maximum is a point where f(x) is higher than at all neighboring points.
 The directional derivative in direction u (a unit vector) is the slope of the function f in
direction u.

 In other words, the directional derivative is the derivative of the function f(x + αu) with
respect to α, evaluated at α = 0.

 Using the chain rule, we can see that ∂f(x + αu)/∂α evaluates to uᵀ∇ₓf(x) when α = 0.

 To minimize f, we would like to find the direction in which f decreases the fastest. We can do this using the directional derivative:

min_{u : uᵀu = 1} uᵀ∇ₓf(x) = min_u ‖u‖₂ ‖∇ₓf(x)‖₂ cos θ,

where θ is the angle between u and the gradient.

 Substituting in ‖u‖₂ = 1 and ignoring factors that do not depend on u, this simplifies to min_u cos θ. This is minimized when u points in the opposite direction as the gradient. In other words, the gradient points directly uphill, and the negative gradient points directly downhill. We can decrease f by moving in the direction of the negative gradient. This is known as the method of steepest descent or gradient descent. Steepest descent proposes a new point

x′ = x − ε∇ₓf(x),

where ε is the learning rate, a positive scalar determining the size of the step. We can choose ε in several different ways. A popular approach is to set ε to a small constant.
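A minimal NumPy sketch of this constant-learning-rate update, assuming a hypothetical objective f(x) = ½‖x‖² (my own choice, not from the notes), whose gradient is simply x:

```python
import numpy as np

# Hypothetical objective: f(x) = 0.5 * ||x||^2, so grad_f(x) = x.
def grad_f(x):
    return x

x = np.array([2.0, -3.0])    # arbitrary starting point
eps = 0.1                    # learning rate: a small positive constant
for _ in range(100):
    x = x - eps * grad_f(x)  # step in the direction of the negative gradient
print(x)                     # close to the minimizer at the origin
```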
Beyond the Gradient: Jacobian and Hessian Matrices

 Sometimes we need to find all of the partial derivatives of a function whose input and output are both vectors.

 The matrix containing all such partial derivatives is known as a Jacobian matrix.

 Specifically, if we have a function f : ℝᵐ → ℝⁿ, then the Jacobian matrix J ∈ ℝⁿˣᵐ of f is defined such that Jᵢ,ⱼ = ∂f(x)ᵢ / ∂xⱼ.
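To illustrate the definition, here is a small sketch (with a vector-valued function of my own choosing) that approximates the Jacobian by central finite differences and compares it against the analytic result:

```python
import numpy as np

# Hypothetical map f: R^2 -> R^2, f(x) = (x0 * x1, x0 + x1**2).
def f(x):
    return np.array([x[0] * x[1], x[0] + x[1] ** 2])

def numerical_jacobian(f, x, h=1e-6):
    """Approximate J[i, j] = d f(x)[i] / d x[j] with central differences."""
    n, m = f(x).size, x.size
    J = np.zeros((n, m))
    for j in range(m):
        e = np.zeros(m)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

x = np.array([1.0, 2.0])
print(numerical_jacobian(f, x))                    # approximately [[2, 1], [1, 4]]
print(np.array([[x[1], x[0]], [1.0, 2 * x[1]]]))   # analytic Jacobian for comparison
```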

 We are also sometimes interested in a derivative of a derivative. This is known as a second derivative.

For example, for a function f : ℝⁿ → ℝ, the derivative with respect to xᵢ of the derivative of f with respect to xⱼ is denoted as ∂²f / ∂xᵢ ∂xⱼ.

 In a single dimension, we can denote d²f/dx² by f″(x).
 The second derivative tells us how the first derivative will change as we vary the input.

Suppose we have a quadratic function


o If such a function has a second derivative of zero, then there is no curvature.

o It is a perfectly flat line, and its value can be predicted using only the gradient.

o If the gradient is 1, then we can make a step of size ε along the negative gradient, and the cost function will decrease by ε.

o If the second derivative is negative, the function curves downward, so the cost function will actually decrease by more than ε.

o Finally, if the second derivative is positive, the function curves upward, so the cost function can decrease by less than ε.
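A quick numerical check of these three cases (the toy functions are my own, not from the notes), each with gradient 1 at x = 0 but a different second derivative:

```python
import numpy as np

eps = 0.1
# Three toy functions with gradient 1 at x = 0 and second derivative 0, -2, +2.
cases = {
    "zero curvature (f = x)":            lambda x: x,
    "negative curvature (f = x - x**2)": lambda x: x - x ** 2,
    "positive curvature (f = x + x**2)": lambda x: x + x ** 2,
}
for name, f in cases.items():
    decrease = f(0.0) - f(0.0 - eps)   # actual decrease from stepping -eps
    print(f"{name}: predicted {eps}, actual {decrease:.4f}")
```

Running this shows the decrease equal to ε, greater than ε, and less than ε, respectively, matching the three bullets above.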
When our function has multiple input dimensions, there are many second derivatives. These derivatives can be collected together into a matrix called the Hessian matrix. The Hessian matrix H(f)(x) is defined such that

H(f)(x)ᵢ,ⱼ = ∂²f(x) / ∂xᵢ ∂xⱼ.

 Most of the functions we encounter in the context of deep learning have a symmetric Hessian almost everywhere. Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors.

 The second derivative in a specific direction represented by a unit vector d is given by dᵀHd.

 When d is an eigenvector of H, the second derivative in that direction is given by the corresponding eigenvalue.

 For other directions of d the directional second derivative is a weighted average of all of
the eigenvalues, with weights between 0 and 1, and eigenvectors that have smaller angle
with d receiving more weight.

 The maximum eigenvalue determines the maximum second derivative and the minimum
eigenvalue determines the minimum second derivative.
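To make this concrete, here is a short sketch using a fixed symmetric matrix of my own choosing as a stand-in for a Hessian: along an eigenvector, dᵀHd equals the corresponding eigenvalue, and for any unit direction it stays between the minimum and maximum eigenvalues:

```python
import numpy as np

# Hypothetical symmetric Hessian of some function at some point.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(H)   # real eigenvalues, orthonormal eigenvectors

# Directional second derivative along an eigenvector equals its eigenvalue.
d = eigvecs[:, 0]
print(d @ H @ d, eigvals[0])

# For random unit directions, d^T H d lies between the min and max eigenvalues.
for _ in range(5):
    d = np.random.randn(2)
    d /= np.linalg.norm(d)
    print(eigvals.min() <= d @ H @ d <= eigvals.max())
```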

 The (directional) second derivative tells us how well we can expect a gradient descent step
to perform.
 We can make a second-order Taylor series approximation to the function f(x) around the current point x⁽⁰⁾:

f(x) ≈ f(x⁽⁰⁾) + (x − x⁽⁰⁾)ᵀg + ½ (x − x⁽⁰⁾)ᵀH(x − x⁽⁰⁾),

where g is the gradient and H is the Hessian at x⁽⁰⁾. If we use a learning rate of ε, then the new point x will be given by x⁽⁰⁾ − εg. Substituting this into the approximation, we obtain

f(x⁽⁰⁾ − εg) ≈ f(x⁽⁰⁾) − εgᵀg + ½ ε²gᵀHg.

 When gᵀHg is zero or negative, the Taylor series approximation predicts that increasing ε forever will decrease f forever.

 When gᵀHg is positive, solving for the optimal step size that decreases the Taylor series approximation of the function the most yields

ε∗ = gᵀg / (gᵀHg).
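A small numerical check (using a toy quadratic of my own choosing, not from the notes) that ε∗ minimizes the objective along the negative-gradient direction; for a quadratic the second-order approximation is exact, so ε∗ is exactly optimal along that line:

```python
import numpy as np

# Toy quadratic f(x) = 0.5 * x^T H x, so gradient g = H x and Hessian H.
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
def f(x):
    return 0.5 * x @ H @ x

x0 = np.array([1.0, 2.0])
g = H @ x0
eps_star = (g @ g) / (g @ H @ g)   # optimal step size from the Taylor approximation

# eps_star should give the lowest value among nearby step sizes.
for eps in [0.5 * eps_star, eps_star, 1.5 * eps_star]:
    print(eps, f(x0 - eps * g))
```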
Figure: A saddle point containing both positive and negative curvature.

 The function in this example is f(x) = x₁² − x₂². Along the axis corresponding to x₁, the function curves upward; along the axis corresponding to x₂, it curves downward.
Figure: Gradient descent fails to exploit the curvature information.

 The simplest method that uses curvature information from the Hessian to guide the search is known as Newton's method. Newton's method is based on using a second-order Taylor series expansion to approximate f(x) near some point x⁽⁰⁾:

f(x) ≈ f(x⁽⁰⁾) + (x − x⁽⁰⁾)ᵀ∇ₓf(x⁽⁰⁾) + ½ (x − x⁽⁰⁾)ᵀH(f)(x⁽⁰⁾)(x − x⁽⁰⁾).

Solving for the critical point of this approximation gives the Newton update x∗ = x⁽⁰⁾ − H(f)(x⁽⁰⁾)⁻¹ ∇ₓf(x⁽⁰⁾).
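A minimal sketch of a single Newton step (again on a toy positive-definite quadratic of my own), which for a quadratic jumps directly to the minimum:

```python
import numpy as np

# Toy quadratic f(x) = 0.5 * x^T H x - b^T x, with gradient H x - b and constant Hessian H.
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, -2.0])

x0 = np.zeros(2)
grad = H @ x0 - b
x_new = x0 - np.linalg.solve(H, grad)   # Newton update: x0 - H^{-1} grad

print(x_new)            # equals the exact minimizer H^{-1} b for this quadratic
print(H @ x_new - b)    # gradient at the new point is (numerically) zero
```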

 Optimization algorithms that use only the gradient, such as gradient descent, are called first-order optimization algorithms.

 Optimization algorithms that also use the Hessian matrix, such as Newton's method, are called second-order optimization algorithms.
 In the context of deep learning, we sometimes gain some guarantees by restricting
ourselves to functions that are either Lipschitz continuous or have Lipschitz continuous
derivatives.

 Convex optimization algorithms are able to provide many more guarantees by making
stronger restrictions.

 Convex optimization algorithms are applicable only to convex functions -- functions for
which the Hessian is positive semidefinite everywhere.

 Such functions are well-behaved because they lack saddle points and all of their local
minima are necessarily global minima.

 However, most problems in deep learning are difficult to express in terms of convex
optimization. Convex optimization is used only as a subroutine of some deep learning
algorithms.

 Ideas from the analysis of convex optimization algorithms can be useful for proving the
convergence of deep learning algorithms.

 However, in general, the importance of convex optimization is greatly diminished in the context of deep learning.
