Chapter 6: Second-Order Methods
Knowing the function value and gradient for a design point can help determine
the direction to travel, but this first-order information does not directly help
determine how far to step to reach a local minimum. Second-order information,
on the other hand, allows us to make a quadratic approximation of the objective
function and approximate the right step size to reach a local minimum as shown
in figure 6.1. As we have seen with quadratic fit search in chapter 3, we can
analytically obtain the location where a quadratic approximation has a zero
gradient. We can then use that location as the next iteration to approach a local
minimum.
6.1 Newton's Method

In univariate optimization, the quadratic approximation about a point $x^{(k)}$ comes from the second-order Taylor expansion:

$$ q(x) = f(x^{(k)}) + (x - x^{(k)})\, f'(x^{(k)}) + \frac{(x - x^{(k)})^2}{2}\, f''(x^{(k)}) \tag{6.1} $$
Setting the derivative to zero and solving for the root yields the update equation
for Newton’s method:
$$ \frac{\partial}{\partial x} q(x) = f'(x^{(k)}) + (x - x^{(k)})\, f''(x^{(k)}) = 0 \tag{6.2} $$

$$ x^{(k+1)} = x^{(k)} - \frac{f'(x^{(k)})}{f''(x^{(k)})} \tag{6.3} $$
This update is shown in figure 6.2.
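To make the update rule concrete, the following is a minimal sketch of the univariate Newton iteration in equation (6.3), written in Julia. The function name `newton_univariate`, the tolerance, and the iteration cap are illustrative choices rather than anything taken from the text; the caller supplies the derivatives f′ and f′′.

```julia
# Minimal sketch of the univariate Newton update in equation (6.3).
# f′ and f′′ are callables for the first and second derivatives;
# the termination rule and defaults here are illustrative assumptions.
function newton_univariate(f′, f′′, x; ϵ=1e-8, k_max=100)
    for _ in 1 : k_max
        Δ = f′(x) / f′′(x)      # step obtained from the local quadratic model
        x -= Δ                  # x^(k+1) = x^(k) − f′(x^(k)) / f′′(x^(k))
        abs(Δ) < ϵ && break     # stop once the iterate stops moving
    end
    return x
end

# Example usage: minimize f(x) = x^4 − 3x^2 + x starting from x = 2.
x_min = newton_univariate(x -> 4x^3 - 6x + 1, x -> 12x^2 - 6, 2.0)
```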
The update rule in Newton’s method involves dividing by the second derivative.
The update is undefined if the second derivative is zero, which occurs when
the quadratic approximation is a line. Instability also occurs when the second
derivative is very close to zero, in which case the next iterate will lie very far from
the current design point, far from where the local quadratic approximation is
valid. Poor local approximations can lead to poor performance with Newton’s
method. Figure 6.3 shows three kinds of failure cases.
[Figure 6.3: Examples of failure cases with Newton's method.]
Newton’s method does tend to converge quickly when in a bowl-like region
that is sufficiently close to a local minimum. It has quadratic convergence, meaning
the difference between the minimum and the iterate is approximately squared
with every iteration. This rate of convergence holds for Newton’s method starting
from $x^{(1)}$ within a distance $\delta$ of a root $x^*$ if¹

• $f''(x) \neq 0$ for all points in $I$,
• $f'''(x)$ is continuous on $I$, and
• $\frac{1}{2} \left| \frac{f'''(x^{(1)})}{f''(x^{(1)})} \right| < c \left| \frac{f'''(x^*)}{f''(x^*)} \right|$ for some $c < \infty$,

where $I$ is the interval of points within distance $\delta$ of $x^*$.

¹ The final condition enforces sufficient closeness, ensuring that the function is sufficiently approximated by the Taylor expansion. J. Stoer and R. Bulirsch, Introduction to Numerical Analysis, 3rd ed. Springer, 2002.
Newton's method extends to multivariate optimization. The second-order Taylor expansion of $f$ about $\mathbf{x}^{(k)}$ is

$$ f(\mathbf{x}) \approx q(\mathbf{x}) = f(\mathbf{x}^{(k)}) + \left(\mathbf{g}^{(k)}\right)^\top \left(\mathbf{x} - \mathbf{x}^{(k)}\right) + \frac{1}{2}\left(\mathbf{x} - \mathbf{x}^{(k)}\right)^\top \mathbf{H}^{(k)} \left(\mathbf{x} - \mathbf{x}^{(k)}\right) \tag{6.4} $$

where $\mathbf{g}^{(k)}$ and $\mathbf{H}^{(k)}$ are the gradient and Hessian at $\mathbf{x}^{(k)}$, respectively.
We evaluate the gradient and set it to zero:
$$ \nabla q(\mathbf{x}) = \mathbf{g}^{(k)} + \mathbf{H}^{(k)} \left(\mathbf{x} - \mathbf{x}^{(k)}\right) = \mathbf{0} \tag{6.5} $$
We then solve for the next iterate, thereby obtaining Newton’s method in multi-
variate form:
$$ \mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - \left(\mathbf{H}^{(k)}\right)^{-1} \mathbf{g}^{(k)} \tag{6.6} $$
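As an illustration of equation (6.6), here is a sketch of a single multivariate Newton step in Julia. This is not the book's algorithm listing; ∇f and H are assumed to be user-supplied functions returning the gradient vector and Hessian matrix, and a linear solve is used instead of forming the inverse explicitly.

```julia
using LinearAlgebra

# Sketch of the multivariate Newton update in equation (6.6).
# ∇f returns the gradient and H the Hessian at a point (names are assumptions).
newton_step(∇f, H, x) = x - H(x) \ ∇f(x)   # solve H Δ = g rather than inverting H

# On a quadratic objective with positive definite Hessian, one step suffices:
A = [2.0 0.0; 0.0 4.0]
b = [1.0, -2.0]
∇f(x) = A*x - b         # gradient of f(x) = ½ xᵀAx − bᵀx
H(x)  = A
x2 = newton_step(∇f, H, [10.0, 10.0])   # equals A \ b, the global minimizer
```

Solving the linear system $\mathbf{H}\,\Delta = \mathbf{g}$ is generally preferable to forming the inverse explicitly, for both cost and numerical stability.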
If $f$ is quadratic and its Hessian is positive definite, then the update converges to the global minimum in one step. For general functions, Newton's method is often terminated once $\mathbf{x}$ ceases to change by more than a given tolerance.² Example 6.1 shows how Newton's method can be used to minimize a function.

² Termination conditions for descent methods are given in chapter 4.
In that example, the gradient at $\mathbf{x}^{(2)}$ is zero, so Newton's method converges after a single iteration. The Hessian is positive definite everywhere, so $\mathbf{x}^{(2)}$ is the global minimum.
Newton's method can also be used to supply a descent direction to line search or can be modified to use a step factor.³ Smaller steps toward the minimum or line searches along the descent direction can increase the method's robustness. The descent direction is:⁴

$$ \mathbf{d}^{(k)} = -\left(\mathbf{H}^{(k)}\right)^{-1} \mathbf{g}^{(k)} \tag{6.7} $$

³ See chapter 5.
⁴ The descent direction given by Newton's method is similar to the natural gradient or covariant gradient. S. Amari, "Natural Gradient Works Efficiently in Learning," Neural Computation, vol. 10, no. 2, pp. 251–276, 1998.
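The sketch below shows one way the Newton direction of equation (6.7) can be combined with a step factor. The backtracking rule used here is a generic stand-in rather than a specific line search from the earlier chapters, and all names are illustrative.

```julia
# Sketch: Newton direction d = −(H⁻¹)g from equation (6.7) with a simple
# backtracking step factor. The backtracking rule is a generic stand-in.
function newton_direction_step(f, ∇f, H, x; α=1.0, β=0.5, k_max=30)
    d = -(H(x) \ ∇f(x))             # Newton descent direction
    fx = f(x)
    for _ in 1 : k_max
        x′ = x + α*d
        f(x′) < fx && return x′     # accept the first improving step
        α *= β                      # otherwise shrink the step factor
    end
    return x                        # no improving step found; stay put
end
```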
6.2 Secant Method
Newton’s method for univariate function minimization requires the first and
second derivatives f ′ and f ′′ . In many cases, f ′ is known but the second derivative
is not. The secant method (algorithm 6.2) applies Newton’s method using estimates
of the second derivative and thus only requires f ′ . This property makes the secant
method more convenient to use in practice.
The secant method uses the last two iterates to approximate the second derivative:

$$ f''(x^{(k)}) \approx \frac{f'(x^{(k)}) - f'(x^{(k-1)})}{x^{(k)} - x^{(k-1)}} \tag{6.8} $$

This estimate is substituted into Newton's method:

$$ x^{(k+1)} \leftarrow x^{(k)} - \frac{x^{(k)} - x^{(k-1)}}{f'(x^{(k)}) - f'(x^{(k-1)})}\, f'(x^{(k)}) \tag{6.9} $$
The secant method requires an additional initial design point. It suffers from
the same problems as Newton’s method and may take more iterations to converge
due to approximating the second derivative.
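A minimal sketch of the secant update in equation (6.9) is given below. Only f′ is required; the two starting points x0 and x1 correspond to the additional initial design point mentioned above, and the tolerance and iteration cap are illustrative choices.

```julia
# Sketch of the secant method for univariate minimization, equation (6.9).
# Only the first derivative f′ is needed; x0 and x1 are the two initial points.
function secant_minimize(f′, x0, x1; ϵ=1e-8, k_max=100)
    g0 = f′(x0)
    for _ in 1 : k_max
        g1 = f′(x1)
        Δ = (x1 - x0) / (g1 - g0) * g1   # finite-difference estimate of f′/f′′
        x0, g0 = x1, g1                  # shift the window of iterates
        x1 -= Δ                          # Newton-like update with the estimate
        abs(Δ) < ϵ && break
    end
    return x1
end
```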
6.3 Quasi-Newton Methods

Quasi-Newton methods avoid computing the Hessian directly by maintaining an approximation $\mathbf{Q}$ of the inverse Hessian, which is used to form the descent direction. These methods typically set $\mathbf{Q}^{(1)}$ to the identity matrix, and they then apply updates to reflect information learned with each iteration. To simplify the equations for the various quasi-Newton methods, we define

$$ \boldsymbol{\gamma}^{(k+1)} \equiv \mathbf{g}^{(k+1)} - \mathbf{g}^{(k)} \qquad \boldsymbol{\delta}^{(k+1)} \equiv \mathbf{x}^{(k+1)} - \mathbf{x}^{(k)} $$
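These δ and γ quantities are what the quasi-Newton updates consume. As a concrete instance, the following sketches the standard BFGS update of the inverse Hessian approximation Q; this is the textbook formula written directly and is not claimed to match the book's algorithm listing line for line.

```julia
using LinearAlgebra

# Sketch of the standard BFGS update of the inverse Hessian approximation Q,
# in terms of δ = x^(k+1) − x^(k) and γ = g^(k+1) − g^(k).
function bfgs_update(Q, δ, γ)
    ρ = 1 / (γ' * δ)        # requires the curvature condition γ'δ > 0
    return (I - ρ*δ*γ') * Q * (I - ρ*γ*δ') + ρ*δ*δ'
end
```

The updated matrix can then be used to form the quasi-Newton descent direction $\mathbf{d} = -\mathbf{Q}\,\mathbf{g}$.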
BFGS does better than DFP with approximate line search but still uses an n × n dense matrix. For very large problems where space is a concern, the Limited-memory BFGS method (algorithm 6.5), or L-BFGS, can be used to approximate BFGS.⁷ L-BFGS stores the last m values for δ and γ rather than the full inverse Hessian, where i = 1 indexes the oldest value and i = m indexes the most recent.

⁷ J. Nocedal, "Updating Quasi-Newton Matrices with Limited Storage," Mathematics of Computation, vol. 35, no. 151, pp. 773–782, 1980.

The process for computing the descent direction $\mathbf{d}$ at $\mathbf{x}$ begins by computing $\mathbf{q}^{(m)} = \nabla f(\mathbf{x})$. The remaining vectors $\mathbf{q}^{(i)}$ for i from m − 1 down to 1 are computed using

$$ \mathbf{q}^{(i)} = \mathbf{q}^{(i+1)} - \frac{\boldsymbol{\delta}^{(i+1)\top} \mathbf{q}^{(i+1)}}{\boldsymbol{\gamma}^{(i+1)\top} \boldsymbol{\delta}^{(i+1)}}\, \boldsymbol{\gamma}^{(i+1)} \tag{6.15} $$
These vectors are used to compute another m + 1 vectors, starting with

$$ \mathbf{z}^{(0)} = \frac{\boldsymbol{\gamma}^{(m)} \odot \boldsymbol{\delta}^{(m)} \odot \mathbf{q}^{(m)}}{\boldsymbol{\gamma}^{(m)\top} \boldsymbol{\gamma}^{(m)}} \tag{6.16} $$
Computing the diagonal for the above expression and substituting the result into $\mathbf{z}^{(1)} = \mathbf{Q}^{(1)} \mathbf{q}^{(1)}$ results in the equation for $\mathbf{z}^{(1)}$.
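For comparison, here is a sketch of the widely used two-loop recursion for computing the L-BFGS descent direction from the stored δ and γ pairs. It follows the standard formulation from the literature, which initializes with a scaled identity rather than the diagonal initialization implied by equation (6.16), so it is a closely related variant rather than a line-for-line rendering of algorithm 6.5.

```julia
# Sketch of the standard two-loop recursion for the L-BFGS descent direction.
# δs and γs hold the last m steps and gradient changes (index 1 = oldest,
# index m = most recent), matching the indexing convention in the text.
function lbfgs_direction(g, δs, γs)
    m = length(δs)
    m == 0 && return -copy(g)         # no history yet: fall back to steepest descent
    q = copy(g)
    α = zeros(m)
    for i in m : -1 : 1               # backward pass through the history
        α[i] = (δs[i]' * q) / (γs[i]' * δs[i])
        q -= α[i] * γs[i]
    end
    # scaled-identity initialization of the inverse Hessian approximation
    z = (δs[m]' * γs[m]) / (γs[m]' * γs[m]) * q
    for i in 1 : m                    # forward pass
        β = (γs[i]' * z) / (γs[i]' * δs[i])
        z += δs[i] * (α[i] - β)
    end
    return -z                         # descent direction
end
```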
The quasi-Newton methods discussed in this section are compared in figure 6.4.
They often perform quite similarly.
6.4 Summary

• Incorporating second-order information in descent methods often speeds convergence.
• Newton's method uses a quadratic approximation built from the gradient and Hessian to determine the next iterate.
• The secant method and quasi-Newton methods approximate Newton's method when second derivatives are unavailable or too expensive to compute.
6.5 Exercises
Exercise 6.2. When finding roots in one dimension, when would we use Newton’s
method instead of the bisection method?
2. Plot $f'$ versus $x$. Overlay the progression of each method, drawing lines from $(x^{(i)}, f'(x^{(i)}))$ to $(x^{(i+1)}, 0)$ to $(x^{(i+1)}, f'(x^{(i+1)}))$ for each transition.
Exercise 6.6. Give an example of a sequence of points x (1) , x (2) , . . . and a function
f such that f ( x (1) ) > f ( x (2) ) > · · · and yet the sequence does not converge to a
local minimum. Assume f is bounded from below.
Exercise 6.8. Give an example where the BFGS update does not exist. What would
you do in this case?
Exercise 6.10. In this problem we will derive the optimization problem from
which the Davidon-Fletcher-Powell update is obtained. Start with a quadratic
approximation at x(k) :
$$ f^{(k)}(\mathbf{x}) = y^{(k)} + \mathbf{g}^{(k)\top} \left(\mathbf{x} - \mathbf{x}^{(k)}\right) + \frac{1}{2}\left(\mathbf{x} - \mathbf{x}^{(k)}\right)^\top \mathbf{H}^{(k)} \left(\mathbf{x} - \mathbf{x}^{(k)}\right) $$
where $y^{(k)}$, $\mathbf{g}^{(k)}$, and $\mathbf{H}^{(k)}$ are the objective function value, the true gradient, and a positive definite Hessian approximation at $\mathbf{x}^{(k)}$.
The next iterate is chosen using line search to obtain:

$$ \mathbf{x}^{(k+1)} \leftarrow \mathbf{x}^{(k)} - \alpha^{(k)} \left(\mathbf{H}^{(k)}\right)^{-1} \mathbf{g}^{(k)} $$

A new quadratic approximation $f^{(k+1)}$ is then formed at $\mathbf{x}^{(k+1)}$, and its Hessian approximation $\mathbf{H}^{(k+1)}$ is required to produce a gradient that matches the true gradient at the previous iterate:

$$ \nabla f^{(k+1)}(\mathbf{x}^{(k)}) = \mathbf{g}^{(k)} $$
Finally, assuming that the curvature condition is enforced, explain why one then solves the following optimization problem to obtain $\mathbf{H}^{(k+1)}$:¹⁰

$$ \begin{aligned} \underset{\mathbf{H}}{\text{minimize}} \quad & \left\| \mathbf{H} - \mathbf{H}^{(k)} \right\| \\ \text{subject to} \quad & \mathbf{H} = \mathbf{H}^\top \\ & \mathbf{H} \boldsymbol{\delta}^{(k+1)} = \boldsymbol{\gamma}^{(k+1)} \end{aligned} $$

where $\left\| \mathbf{H} - \mathbf{H}^{(k)} \right\|$ is a matrix norm that defines a distance between $\mathbf{H}$ and $\mathbf{H}^{(k)}$.

¹⁰ The Davidon-Fletcher-Powell update is obtained by solving such an optimization problem to obtain an analytical solution and then finding the corresponding update equation for the inverse Hessian approximation.
© 2019 Massachusetts Institute of Technology, shared under a Creative Commons CC-BY-NC-ND license.