Chapter 8 Lecture Notes

This chapter discusses search methods for real-valued functions using gradients, specifically focusing on the concept of level sets and the gradient descent algorithm. The method of steepest descent is introduced as a technique to minimize functions, where the step size is chosen to maximize the decrease in the objective function. Practical stopping criteria for the algorithm are also provided, along with examples illustrating the application of the steepest descent method to quadratic functions.


In this chapter we consider a class of search methods for real-valued functions on ℝn.

These methods use the gradient of the given function.


• A level set of a function f: ℝn → ℝ is the set of points x satisfying f(x) = c for some constant c.
○ Thus, a point x0 ∈ ℝn is on the level set corresponding to level c if f(x0) = c.
• In the case of functions of two real variables, f: ℝ2 → ℝ, the notion of a level set is illustrated in the following figure.

• The gradient of f at x0, denoted ∇f(x0), if it is not a zero vector, is orthogonal to the tangent vector to an arbitrary smooth curve passing
through x0 on the level set f(x) = c.
○ Thus, the direction of maximum rate of increase of a real-valued differentiable function at a point is orthogonal to the level set of the
function through that point.
• In other words, the gradient acts in such a direction that for a given small displacement, the function f increases more in the direction
of the gradient than in any other direction.
• To prove this statement, recall that 〈∇f(x), d〉, ||d|| = 1, is the rate of increase of f in the direction d at the point x.
• By the Cauchy-Schwarz inequality,
〈∇f(x), d〉 ≤ ||∇f(x)|| ||d|| = ||∇f(x)||,
with equality if and only if d = ∇f(x)/||∇f(x)|| (assuming ∇f(x) ≠ 0).
• Thus, the direction in which ∇f(x) points is the direction of maximum rate of increase of f at x.
○ The direction in which – ∇f(x) points is the direction of maximum rate of decrease of f at x.
• Hence, the direction of negative gradient is a good direction to search if we want to find a function minimizer.
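As a rough numerical illustration of the Cauchy-Schwarz argument above (the quadratic test function below is an assumption for the example, not part of the notes), sampling many unit directions d shows that 〈∇f(x), d〉 never exceeds ||∇f(x)|| and that the bound is attained along the gradient direction:

```python
import numpy as np

# Illustrative test function f(x) = x1^2 + 3*x2^2 and its gradient.
grad_f = lambda x: np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, 2.0])
g = grad_f(x)

# Rate of increase <grad f(x), d> along 1000 random unit directions d.
rng = np.random.default_rng(0)
rates = [g @ (d / np.linalg.norm(d)) for d in rng.standard_normal((1000, 2))]
print(max(rates), np.linalg.norm(g))   # max sampled rate is at most ||grad f(x)||
print(g @ (g / np.linalg.norm(g)))     # the bound ||grad f(x)|| is attained for d = g/||g||
```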

• Let x(0) be a starting point, and consider the point x(0) – α∇f(x(0)). Then, by Taylor’s theorem, we obtain
f(x(0) – α∇f(x(0))) = f(x(0)) – α||∇f(x(0))||² + o(α).
○ Thus, if ∇f(x(0)) ≠ 0, then for sufficiently small α > 0, we have
f(x(0) – α∇f(x(0))) < f(x(0)).
• This means that the point x(0) – α∇f(x(0)) is an improvement over the point x(0) if we are searching for a minimizer.

• To formulate an algorithm that implements this idea, suppose that we are given a point x(k).
• To find the next point x(k+1), we start at x(k) and move by an amount –αk∇f(x(k)), where αk is a positive scalar called the step size.
• This procedure leads to the following iterative algorithm:
x(k+1) = x(k) – αk∇f(x(k))
○ We refer to this as a gradient descent algorithm (aka gradient algorithm).
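A minimal Python sketch of this iteration with a fixed step size (the test function, step size, and iteration count are illustrative assumptions, not taken from the notes):

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, num_iters=200):
    """Fixed-step gradient algorithm: x(k+1) = x(k) - alpha * grad f(x(k))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - alpha * grad_f(x)
    return x

# Illustrative use on f(x1, x2) = x1^2 + 3*x2^2, whose minimizer is the origin.
grad_f = lambda x: np.array([2 * x[0], 6 * x[1]])
print(gradient_descent(grad_f, [5.0, -3.0]))   # approaches [0, 0]
```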


• The gradient varies as the search proceeds, tending to zero as we approach the minimizer.
• We have the option of either taking very small steps and reevaluating the gradient at every step, or taking large steps each time.
• The first approach results in a laborious method of reaching the minimizer, whereas the second approach may result in a more
zigzag path to the minimizer.
• The advantage of the second approach is possibly fewer gradient evaluations.
• Among the many different methods that use this philosophy, the most popular is the method of steepest descent, which we discuss next.
• A very popular example is applying a gradient method to the training of a class of neural networks.

The Method of Steepest Descent:


The method of steepest descent is a gradient algorithm where the step size αk is chosen to achieve the maximum amount of decrease of the objective function at each individual step. Specifically, αk is chosen to minimize ϕk(α) ≜ f(x(k) – α∇f(x(k))). In other words,
αk = arg min α≥0 f(x(k) – α∇f(x(k))).
• To summarize, the steepest descent algorithm proceeds as follows:
○ At each step, starting from the point x(k) we conduct a line search in the direction – ∇f(x(k)) until a minimizer, x(k+1), is found.
○ A typical sequence resulting from the method of steepest descent is depicted in the following figure
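The figure is not reproduced here; the following is a minimal Python sketch of the procedure just summarized, assuming SciPy's minimize_scalar for the one-dimensional line search and an illustrative test function:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, grad_f, x0, tol=1e-6, max_iters=1000):
    """Steepest descent: a line search along -grad f(x(k)) at every iteration."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:                 # practical stopping criterion
            break
        phi = lambda a, x=x, g=g: f(x - a * g)      # phi_k(alpha) = f(x(k) - alpha * grad f(x(k)))
        alpha = minimize_scalar(phi, bounds=(0.0, 10.0), method="bounded").x
        x = x - alpha * g
    return x

# Illustrative use on f(x1, x2) = x1^2/5 + x2^2.
f = lambda x: x[0] ** 2 / 5 + x[1] ** 2
grad_f = lambda x: np.array([2 * x[0] / 5, 2 * x[1]])
print(steepest_descent(f, grad_f, [5.0, 1.0]))      # converges toward [0, 0]
```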

• Note: The method of steepest descent moves in orthogonal steps, as stated in the following proposition.

Proposition 1
If {x(k)}, k = 0, 1, 2, …, is a steepest descent sequence for a given function f: ℝn → ℝ, then for each k the vector x(k+1) – x(k) is orthogonal to the vector x(k+2) – x(k+1).

• The proposition above implies that ∇f(x(k)) is parallel to the tangent plane to the level set {f(x) = f(x(k+1))} at x(k+1).
○ Note that as each new point is generated by the steepest descent algorithm, the corresponding value of the function f decreases, as stated below.

Proposition 2
If {x(k)}, k = 0, 1, 2, …, is a steepest descent sequence for f: ℝn → ℝ and if ∇f(x(k)) ≠ 0, then f(x(k+1)) < f(x(k)).
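A possible numerical check of both propositions on an illustrative quadratic (the function and starting point are assumptions; the exact step size used here anticipates the formula derived for quadratics later in these notes):

```python
import numpy as np

# f(x) = 0.5 * x'Qx with Q = diag(2/5, 2), i.e. f(x1, x2) = x1^2/5 + x2^2.
Q = np.diag([2.0 / 5.0, 2.0])
f = lambda x: 0.5 * x @ Q @ x

xs = [np.array([5.0, 1.0])]
for _ in range(5):
    g = Q @ xs[-1]                       # gradient at the current iterate
    alpha = (g @ g) / (g @ Q @ g)        # exact line-search step for a quadratic
    xs.append(xs[-1] - alpha * g)

steps = [xs[k + 1] - xs[k] for k in range(len(xs) - 1)]
print([float(steps[k] @ steps[k + 1]) for k in range(len(steps) - 1)])  # ~0: consecutive steps orthogonal (Prop. 1)
print([float(f(x)) for x in xs])                                        # strictly decreasing values (Prop. 2)
```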

• If for some k we have ∇f(x(k)) = 0, then the point x(k) satisfies the first-order necessary condition (FONC).
○ In this case, x(k+1) = x(k). We can use the above as the basis for a stopping (termination) criterion for the algorithm.
○ The condition ∇f(x(k+1)) = 0, however, is not directly suitable as a practical stopping criterion, because the numerical computation of the
gradient will rarely be identically equal to zero.
• A practical stopping criterion is to check if the norm ||∇f(x(k))|| of the gradient is less than a prespecified threshold, in which case we
stop.
○ Alternatively, we may compute the absolute difference |f(x(k+1)) – f(x(k))| between objective function values for every two successive iterations, and if the difference is less than some prespecified threshold ε > 0, then we stop; that is, we stop when
|f(x(k+1)) – f(x(k))| < ε.
○ Yet another alternative is to compute the norm ||x(k+1) – x(k)|| of the difference between two successive iterates, and we stop if the norm is less than a prespecified threshold:
||x(k+1) – x(k)|| < ε.
○ Alternatively, we may check “relative” values of the quantities above; for example,
|f(x(k+1)) – f(x(k))| / |f(x(k))| < ε  or  ||x(k+1) – x(k)|| / ||x(k)|| < ε.
Note: The two (relative) stopping criteria above are preferable to the previous (absolute) criteria because the relative criteria are “scale-independent.”
○ For example, scaling the objective function does not change the satisfaction of the criterion |f(x(k+1)) – f(x(k))|/|f(x(k))| < ε.
○ Similarly, scaling the decision variable does not change the satisfaction of the criterion ||x(k+1) – x(k)||/||x(k)|| < ε.
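A sketch of how these relative tests might be wired into a gradient iteration (the function, the fixed step size, and the threshold are illustrative assumptions; the small constants guarding the denominators are a practical safeguard, not part of the notes):

```python
import numpy as np

f = lambda x: x[0] ** 2 / 5 + x[1] ** 2
grad_f = lambda x: np.array([2 * x[0] / 5, 2 * x[1]])

eps = 1e-8
x = np.array([5.0, 1.0])
for k in range(100_000):
    x_new = x - 0.1 * grad_f(x)
    # Relative ("scale-independent") stopping tests.
    rel_f = abs(f(x_new) - f(x)) / max(abs(f(x)), 1e-12)
    rel_x = np.linalg.norm(x_new - x) / max(np.linalg.norm(x), 1e-12)
    x = x_new
    if rel_f < eps or rel_x < eps:
        break
print(k, x)
```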

Example 1) Use the method of steepest descent to find the minimizer of the following function
• Let us now see what the method of steepest descent does with a quadratic function of the form
f(x) = ½ x⊤Qx – b⊤x,
○ where Q ∈ ℝn×n is a symmetric positive definite matrix, b ∈ ℝn, and x ∈ ℝn. The unique minimizer of f can be found by setting the gradient of f to zero, where
∇f(x) = Qx – b, so that the minimizer is x* = Q⁻¹b.
○ There is no loss of generality in assuming Q to be a symmetric matrix.


• Therefore, if we are given a quadratic form x⊤Ax with A ≠ A⊤, then because the transposition of a scalar equals itself, we obtain
x⊤Ax = (x⊤Ax)⊤ = x⊤A⊤x, and hence x⊤Ax = ½ x⊤(A + A⊤)x, where Q = ½(A + A⊤) is symmetric.

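A quick numerical sanity check of this identity (the non-symmetric matrix and the vector below are randomly generated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))        # a non-symmetric matrix
Q = (A + A.T) / 2                      # the symmetric matrix with the same quadratic form
x = rng.standard_normal(3)
print(x @ A @ x, x @ Q @ x)            # equal up to floating-point rounding
```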
• The Hessian of f is F(x) = Q = Q⊤ > 0. To simplify the notation we write g(k) = ∇f(x(k)) = Qx(k) – b. Then, the steepest descent algorithm for the quadratic function can be represented as
x(k+1) = x(k) – αk g(k).
• In the quadratic case, we can find an explicit formula for αk.


○ Assume that g(k) ≠ 0, for if g(k) = 0, then x(k) = x* and the algorithm stops.
○ Because αk ≥ 0 is a minimizer of ϕk(α) = f(x(k) – αg(k)), we apply the FONC to ϕk(α) to obtain
ϕk′(α) = (x(k) – αg(k))⊤Q(–g(k)) – b⊤(–g(k)) = αg(k)⊤Qg(k) – g(k)⊤g(k) = 0,
which gives αk = (g(k)⊤g(k)) / (g(k)⊤Qg(k)).
• In summary, the method of steepest descent for the quadratic takes the form
x(k+1) = x(k) – ((g(k)⊤g(k)) / (g(k)⊤Qg(k))) g(k), where g(k) = Qx(k) – b.
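A compact Python sketch of this quadratic special case (the function name and tolerance are illustrative choices):

```python
import numpy as np

def steepest_descent_quadratic(Q, b, x0, tol=1e-10, max_iters=1000):
    """Steepest descent for f(x) = 0.5*x'Qx - b'x using the explicit step
    alpha_k = (g'g)/(g'Qg), where g = Qx - b."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = Q @ x - b
        if np.linalg.norm(g) < tol:    # stop once the gradient is numerically zero
            break
        x = x - (g @ g) / (g @ Q @ g) * g
    return x
```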
Example 2) Let f(x1, x2) = x1² + x2². Find the minimizer.

• What if f(x1, x2) = x1²/5 + x2²?
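A possible worked check of both parts of this example, reusing the steepest_descent_quadratic sketch above (the starting points are arbitrary). For f(x1, x2) = x1² + x2² we have Q = 2I and b = 0, and the exact line search reaches the minimizer (0, 0) in a single step; for f(x1, x2) = x1²/5 + x2² (Q = diag(2/5, 2)) the iterates zigzag but still converge to (0, 0):

```python
import numpy as np

print(steepest_descent_quadratic(np.diag([2.0, 2.0]), np.zeros(2), [3.0, -4.0]))   # -> [0, 0] in one step
print(steepest_descent_quadratic(np.diag([0.4, 2.0]), np.zeros(2), [3.0, -4.0]))   # -> [0, 0] after zigzagging
```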
