
6 Second-Order Methods

The previous chapter focused on optimization methods that involve first-order
approximations of the objective function using the gradient. This chapter focuses
on leveraging second-order approximations that use the second derivative in
univariate optimization or the Hessian in multivariate optimization to direct the
search. This additional information can help improve the local model used for
informing the selection of directions and step lengths in descent algorithms.

6.1 Newton’s Method

Knowing the function value and gradient for a design point can help determine
the direction to travel, but this first-order information does not directly help
determine how far to step to reach a local minimum. Second-order information,
on the other hand, allows us to make a quadratic approximation of the objective
function and approximate the right step size to reach a local minimum as shown
in figure 6.1. As we have seen with quadratic fit search in chapter 3, we can
analytically obtain the location where a quadratic approximation has a zero
gradient. We can then use that location as the next iteration to approach a local
minimum.
In univariate optimization, the quadratic approximation about a point x^(k)
comes from the second-order Taylor expansion:

    q(x) = f(x^(k)) + (x − x^(k)) f′(x^(k)) + ((x − x^(k))² / 2) f′′(x^(k))    (6.1)

Figure 6.1. A comparison of first-order and second-order approximations.
Bowl-shaped quadratic approximations have unique locations where the
derivative is zero.

Setting the derivative to zero and solving for the root yields the update equation
for Newton's method:

    (∂/∂x) q(x) = f′(x^(k)) + (x − x^(k)) f′′(x^(k)) = 0    (6.2)

    x^(k+1) = x^(k) − f′(x^(k)) / f′′(x^(k))    (6.3)

This update is shown in figure 6.2.

Figure 6.2. Newton's method can be interpreted as a root-finding method
applied to f′ that iteratively improves a univariate design point by taking
the tangent line at (x, f′(x)), finding the intersection with the x-axis, and
using that x value as the next design point.

The update rule in Newton’s method involves dividing by the second derivative.
The update is undefined if the second derivative is zero, which occurs when
the quadratic approximation is a line. Instability also occurs when the second
derivative is very close to zero, in which case the next iterate will lie very far from
the current design point, far from where the local quadratic approximation is
valid. Poor local approximations can lead to poor performance with Newton’s
method. Figure 6.3 shows three kinds of failure cases.

© 2019 Massachusetts Institute of Technology, shared under a Creative Commons CC-BY-NC-ND license.
2022-05-22 00:25:57-07:00, revision 47fd495, comments to [email protected]

Figure 6.3. Examples of failure cases with Newton's method: oscillation,
overshoot, and a negative second derivative f′′.
Newton's method does tend to converge quickly when in a bowl-like region
that is sufficiently close to a local minimum. It has quadratic convergence, meaning
the difference between the minimum and the iterate is approximately squared
with every iteration. This rate of convergence holds for Newton's method starting
from x^(1) within a distance δ of a root x* if¹

• f′′(x) ≠ 0 for all points in I,

• f′′′(x) is continuous on I, and

• (1/2) |f′′′(x^(1)) / f′′(x^(1))| < c |f′′′(x*) / f′′(x*)| for some c < ∞

for an interval I = [x* − δ, x* + δ]. The final condition guards against overshoot.

¹ The final condition enforces sufficient closeness, ensuring that the function is
sufficiently approximated by the Taylor expansion. J. Stoer and R. Bulirsch,
Introduction to Numerical Analysis, 3rd ed. Springer, 2002.
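The quadratic convergence described above can be observed numerically. The sketch below applies the univariate update of equation (6.3) to f(x) = x² + exp(x), whose derivative 2x + exp(x) has a single root at the minimizer; the function, starting point, and iteration count are illustrative choices, not from the text.

```julia
# Univariate Newton's method, equation (6.3): x ← x − f′(x)/f′′(x).
# Returns the final iterate along with |f′| after each step, a proxy for
# the distance to the minimizer.
function newtons_method_1d(f′, f′′, x, k_max)
    errs = Float64[]
    for k in 1 : k_max
        x -= f′(x) / f′′(x)
        push!(errs, abs(f′(x)))
    end
    return x, errs
end

# Minimize f(x) = x² + exp(x); f′(x) = 2x + exp(x), f′′(x) = 2 + exp(x).
x, errs = newtons_method_1d(x -> 2x + exp(x), x -> 2 + exp(x), 0.0, 5)
```

The recorded values of |f′| shrink roughly quadratically: each entry is on the order of the square of the previous one, until the floating-point precision floor is reached.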


Newton's method can be extended to multivariate optimization (algorithm 6.1).
The multivariate second-order Taylor expansion at x^(k) is:

    f(x) ≈ q(x) = f(x^(k)) + (g^(k))⊤ (x − x^(k)) + (1/2) (x − x^(k))⊤ H^(k) (x − x^(k))    (6.4)

where g^(k) and H^(k) are the gradient and Hessian at x^(k), respectively.
We evaluate the gradient and set it to zero:

    ∇q(x) = g^(k) + H^(k) (x − x^(k)) = 0    (6.5)

We then solve for the next iterate, thereby obtaining Newton's method in multivariate form:

    x^(k+1) = x^(k) − (H^(k))⁻¹ g^(k)    (6.6)


If f is quadratic and its Hessian is positive definite, then the update converges
to the global minimum in one step. For general functions, Newton's method
is often terminated once x ceases to change by more than a given tolerance.²
Example 6.1 shows how Newton's method can be used to minimize a function.

² Termination conditions for descent methods are given in chapter 4.

Example 6.1. Newton's method used to minimize Booth's function; see appendix B.2.

With x^(1) = [9, 8], we will use Newton's method to minimize Booth's function:

    f(x) = (x₁ + 2x₂ − 7)² + (2x₁ + x₂ − 5)²

The gradient of Booth's function is:

    ∇f(x) = [10x₁ + 8x₂ − 34, 8x₁ + 10x₂ − 38]

The Hessian of Booth's function is:

    H(x) = [10   8
             8  10]

The first iteration of Newton's method yields:

    x^(2) = x^(1) − (H^(1))⁻¹ g^(1)
          = [9, 8] − [10  8; 8  10]⁻¹ [10·9 + 8·8 − 34, 8·9 + 10·8 − 38]
          = [9, 8] − [10  8; 8  10]⁻¹ [120, 114]
          = [1, 3]

The gradient at x^(2) is zero, so we have converged after a single iteration. The
Hessian is positive definite everywhere, so x^(2) is the global minimum.

Newton's method can also be used to supply a descent direction to line search
or can be modified to use a step factor.³ Smaller steps toward the minimum or
line searches along the descent direction can increase the method's robustness.
The descent direction is:⁴

    d^(k) = −(H^(k))⁻¹ g^(k)    (6.7)

³ See chapter 5.
⁴ The descent direction given by Newton's method is similar to the natural
gradient or covariant gradient. S. Amari, "Natural Gradient Works Efficiently
in Learning," Neural Computation, vol. 10, no. 2, pp. 251–276, 1998.


Algorithm 6.1. Newton's method, which takes the gradient of the function ∇f,
the Hessian of the objective function H, an initial point x, a step size tolerance ϵ,
and a maximum number of iterations k_max.

    function newtons_method(∇f, H, x, ϵ, k_max)
        k, Δ = 1, fill(Inf, length(x))
        while norm(Δ) > ϵ && k ≤ k_max
            Δ = H(x) \ ∇f(x)
            x -= Δ
            k += 1
        end
        return x
    end
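Restating algorithm 6.1 so the snippet runs standalone (norm comes from Julia's LinearAlgebra standard library), we can reproduce the single-step convergence on Booth's function from example 6.1; the tolerance and iteration cap below are arbitrary choices.

```julia
using LinearAlgebra  # provides norm

function newtons_method(∇f, H, x, ϵ, k_max)
    k, Δ = 1, fill(Inf, length(x))
    while norm(Δ) > ϵ && k ≤ k_max
        Δ = H(x) \ ∇f(x)   # solve H Δ = ∇f rather than inverting H
        x -= Δ
        k += 1
    end
    return x
end

# Gradient and Hessian of Booth's function, from example 6.1.
∇f(x) = [10x[1] + 8x[2] - 34, 8x[1] + 10x[2] - 38]
H(x) = [10.0 8.0; 8.0 10.0]

newtons_method(∇f, H, [9.0, 8.0], 1e-10, 100)   # converges to [1, 3] in one step
```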

6.2 Secant Method

Newton’s method for univariate function minimization requires the first and
second derivatives f ′ and f ′′ . In many cases, f ′ is known but the second derivative
is not. The secant method (algorithm 6.2) applies Newton’s method using estimates
of the second derivative and thus only requires f ′ . This property makes the secant
method more convenient to use in practice.
The secant method uses the last two iterates to approximate the second derivative:

    f′′(x^(k)) ≈ (f′(x^(k)) − f′(x^(k−1))) / (x^(k) − x^(k−1))    (6.8)

This estimate is substituted into Newton's method:

    x^(k+1) ← x^(k) − ((x^(k) − x^(k−1)) / (f′(x^(k)) − f′(x^(k−1)))) f′(x^(k))    (6.9)
The secant method requires an additional initial design point. It suffers from
the same problems as Newton’s method and may take more iterations to converge
due to approximating the second derivative.

6.3 Quasi-Newton Methods

Just as the secant method approximates f′′ in the univariate case, quasi-Newton
methods approximate the inverse Hessian. Quasi-Newton method updates have
the form:

    x^(k+1) ← x^(k) − α^(k) Q^(k) g^(k)    (6.10)

where α^(k) is a scalar step factor and Q^(k) approximates the inverse of the Hessian
at x^(k).


Algorithm 6.2. The secant method for univariate function minimization. The
inputs are the first derivative f′ of the target function, two initial points x0
and x1, and the desired tolerance ϵ. The final x-coordinate is returned.

    function secant_method(f′, x0, x1, ϵ)
        g0 = f′(x0)
        Δ = Inf
        while abs(Δ) > ϵ
            g1 = f′(x1)
            Δ = (x1 - x0)/(g1 - g0)*g1
            x0, x1, g0 = x1, x1 - Δ, g1
        end
        return x1
    end
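Restating algorithm 6.2 so the snippet runs standalone, the secant method can minimize f(x) = x² + x⁴ (the function revisited in exercise 6.5) using only its first derivative; the starting points and tolerance below are illustrative choices.

```julia
function secant_method(f′, x0, x1, ϵ)
    g0 = f′(x0)
    Δ = Inf
    while abs(Δ) > ϵ
        g1 = f′(x1)
        Δ = (x1 - x0)/(g1 - g0)*g1   # secant estimate of the Newton step
        x0, x1, g0 = x1, x1 - Δ, g1
    end
    return x1
end

# Minimize f(x) = x² + x⁴ with f′(x) = 2x + 4x³; the minimizer is x = 0.
x_min = secant_method(x -> 2x + 4x^3, -4.0, -3.0, 1e-10)
```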

These methods typically set Q^(1) to the identity matrix, and they then apply up-
dates to reflect information learned with each iteration. To simplify the equations
for the various quasi-Newton methods, we define the following:

    γ^(k+1) ≡ g^(k+1) − g^(k)    (6.11)

    δ^(k+1) ≡ x^(k+1) − x^(k)    (6.12)

The Davidon-Fletcher-Powell (DFP) method (algorithm 6.3) uses:⁵

    Q ← Q − (Q γ γ⊤ Q) / (γ⊤ Q γ) + (δ δ⊤) / (δ⊤ γ)    (6.13)

where all terms on the right-hand side are evaluated at the same iteration.
The update for Q in the DFP method has three properties:

1. Q remains symmetric and positive definite.

2. If f(x) = (1/2) x⊤ A x + b⊤ x + c, then Q = A⁻¹. Thus the DFP method has the
   same convergence properties as the conjugate gradient method.

3. For high-dimensional problems, storing and updating Q can be significant
   compared to other methods like the conjugate gradient method.

⁵ The original concept was presented in a technical report, W. C. Davidon,
"Variable Metric Method for Minimization," Argonne National Laboratory, Tech.
Rep. ANL-5990, 1959. It was later published: W. C. Davidon, "Variable Metric
Method for Minimization," SIAM Journal on Optimization, vol. 1, no. 1, pp. 1–17,
1991. The method was modified by R. Fletcher and M. J. D. Powell, "A Rapidly
Convergent Descent Method for Minimization," The Computer Journal, vol. 6,
no. 2, pp. 163–168, 1963.

An alternative to DFP, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method
(algorithm 6.4), uses:⁶

    Q ← Q − (δ γ⊤ Q + Q γ δ⊤) / (δ⊤ γ) + (1 + (γ⊤ Q γ) / (δ⊤ γ)) (δ δ⊤) / (δ⊤ γ)    (6.14)

⁶ R. Fletcher, Practical Methods of Optimization, 2nd ed. Wiley, 1987.


Algorithm 6.3. The Davidon-Fletcher-Powell descent method.

    mutable struct DFP <: DescentMethod
        Q
    end
    function init!(M::DFP, f, ∇f, x)
        m = length(x)
        M.Q = Matrix(1.0I, m, m)
        return M
    end
    function step!(M::DFP, f, ∇f, x)
        Q, g = M.Q, ∇f(x)
        x′ = line_search(f, x, -Q*g)
        g′ = ∇f(x′)
        δ = x′ - x
        γ = g′ - g
        Q[:] = Q - Q*γ*γ'*Q/(γ'*Q*γ) + δ*δ'/(δ'*γ)
        return x′
    end

Algorithm 6.4. The Broyden-Fletcher-Goldfarb-Shanno descent method.

    mutable struct BFGS <: DescentMethod
        Q
    end
    function init!(M::BFGS, f, ∇f, x)
        m = length(x)
        M.Q = Matrix(1.0I, m, m)
        return M
    end
    function step!(M::BFGS, f, ∇f, x)
        Q, g = M.Q, ∇f(x)
        x′ = line_search(f, x, -Q*g)
        g′ = ∇f(x′)
        δ = x′ - x
        γ = g′ - g
        Q[:] = Q - (δ*γ'*Q + Q*γ*δ')/(δ'*γ) +
               (1 + (γ'*Q*γ)/(δ'*γ))[1]*(δ*δ')/(δ'*γ)
        return x′
    end
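On a quadratic objective with exact line searches, quasi-Newton updates such as BFGS terminate in n steps with Q equal to the inverse Hessian, mirroring property 2 of the DFP update. The sketch below checks this with the BFGS update from equation (6.14); the quadratic, its Hessian, and the starting point are illustrative choices, and the closed-form step length stands in for the line_search routine assumed by algorithm 6.4.

```julia
using LinearAlgebra

# BFGS inverse-Hessian update, equation (6.14).
bfgs_update(Q, δ, γ) =
    Q - (δ*γ'*Q + Q*γ*δ')/(δ'*γ) + (1 + (γ'*Q*γ)/(δ'*γ))*(δ*δ')/(δ'*γ)

# Minimize f(x) = ½x'Ax − b'x, with gradient ∇f(x) = Ax − b, using an
# exact line search, which is available in closed form for quadratics.
function bfgs_quadratic(A, b, x)
    Q, g = Matrix(1.0I, length(b), length(b)), A*x - b
    for k in 1 : length(b)
        d = -Q*g
        α = -(g'*d)/(d'*A*d)     # exact minimizer of f along d
        x′ = x + α*d
        g′ = A*x′ - b
        Q = bfgs_update(Q, x′ - x, g′ - g)
        x, g = x′, g′
    end
    return x, Q
end

A, b = [2.0 0.0; 0.0 10.0], [2.0, 10.0]   # minimum at [1, 1]
x, Q = bfgs_quadratic(A, b, [0.0, 0.0])   # x ≈ [1, 1] and Q ≈ inv(A)
```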

© 2019 Massachusetts Institute of Technology, shared under a Creative Commons CC-BY-NC-ND license.
2022-05-22 00:25:57-07:00, revision 47fd495, comments to [email protected]
94 c ha p te r 6 . se c on d -ord e r me thod s

BFGS does better than DFP with approximate line search but still uses an
n × n dense matrix. For very large problems where space is a concern, the Limited-
memory BFGS method (algorithm 6.5), or L-BFGS, can be used to approximate
BFGS.⁷ L-BFGS stores the last m values for δ and γ rather than the full inverse
Hessian, where i = 1 indexes the oldest value and i = m indexes the most recent.
The process for computing the descent direction d at x begins by computing
q^(m) = ∇f(x). The remaining vectors q^(i) for i from m − 1 down to 1 are computed
using

    q^(i) = q^(i+1) − (((δ^(i+1))⊤ q^(i+1)) / ((γ^(i+1))⊤ δ^(i+1))) γ^(i+1)    (6.15)

These vectors are used to compute another m + 1 vectors, starting with

    z^(0) = (γ^(m) ⊙ δ^(m) ⊙ q^(m)) / ((γ^(m))⊤ γ^(m))    (6.16)

⁷ J. Nocedal, "Updating Quasi-Newton Matrices with Limited Storage,"
Mathematics of Computation, vol. 35, no. 151, pp. 773–782, 1980.

and proceeding with z^(i) for i from 1 to m according to

    z^(i) = z^(i−1) + δ^(i−1) ( ((δ^(i−1))⊤ q^(i−1)) / ((γ^(i−1))⊤ δ^(i−1)) − ((γ^(i−1))⊤ z^(i−1)) / ((γ^(i−1))⊤ δ^(i−1)) )    (6.17)

The descent direction is d = −z^(m).
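The two-loop recursion of equations (6.15) to (6.17) can be sketched directly. The history pairs below are illustrative values taken from the quadratic with Hessian A = diag(2, 10), for which γ = Aδ; with a full history on a quadratic, the recursion recovers the Newton direction −A⁻¹g.

```julia
using LinearAlgebra  # provides the dot product ⋅

# Two-loop recursion of equations (6.15)–(6.17). δs and γs hold the m
# stored pairs, oldest first; g is the current gradient.
function lbfgs_direction(δs, γs, g)
    m = length(δs)
    qs = Vector{Vector{Float64}}(undef, m)
    q = copy(g)
    for i in m : -1 : 1                        # equation (6.15)
        qs[i] = copy(q)
        q -= (δs[i]⋅q)/(γs[i]⋅δs[i]) * γs[i]
    end
    z = (γs[m] .* δs[m] .* q) / (γs[m]⋅γs[m])  # equation (6.16)
    for i in 1 : m                             # equation (6.17)
        z += δs[i]*(δs[i]⋅qs[i] - γs[i]⋅z)/(γs[i]⋅δs[i])
    end
    return -z                                  # descent direction d = −z⁽ᵐ⁾
end

# History from the quadratic with Hessian A = diag(2, 10), where γ = Aδ.
δs = [[1.0, 0.0], [0.0, 1.0]]
γs = [[2.0, 0.0], [0.0, 10.0]]
d = lbfgs_direction(δs, γs, [2.0, 10.0])   # → [-1.0, -1.0], the Newton step
```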


For minimization, the inverse Hessian Q must remain positive definite. The
initial Hessian is often set to the diagonal of

    Q^(1) = (γ^(1) (δ^(1))⊤) / ((γ^(1))⊤ γ^(1))    (6.18)

Computing the diagonal for the above expression and substituting the result into
z^(1) = Q^(1) q^(1) results in the equation for z^(1).
The quasi-Newton methods discussed in this section are compared in figure 6.4.
They often perform quite similarly.


Figure 6.4. Several quasi-Newton methods (DFP, BFGS, and L-BFGS with
m = 1, 2, 3) compared on the Rosenbrock function; see appendix B.6. All
methods have nearly identical updates, with L-BFGS noticeably deviating
only when its history, m, is 1.

6.4 Summary

• Incorporating second-order information in descent methods often speeds convergence.

• Newton's method is a root-finding method that leverages second-order information
  to quickly descend to a local minimum.

• The secant method and quasi-Newton methods approximate Newton's method
  when the second-order information is not directly available.

6.5 Exercises

Exercise 6.1. What advantage does second-order information provide about
convergence that first-order information lacks?

Exercise 6.2. When finding roots in one dimension, when would we use Newton’s
method instead of the bisection method?

Exercise 6.3. Apply Newton's method to f(x) = x² from a starting point of your
choice. How many steps do we need to converge?


Algorithm 6.5. The Limited-memory BFGS descent method, which avoids storing
the approximate inverse Hessian. The parameter m determines the history size.
The LimitedMemoryBFGS type also stores the step differences δs, the gradient
changes γs, and storage vectors qs.

    mutable struct LimitedMemoryBFGS <: DescentMethod
        m
        δs
        γs
        qs
    end
    function init!(M::LimitedMemoryBFGS, f, ∇f, x)
        M.δs = []
        M.γs = []
        M.qs = []
        return M
    end
    function step!(M::LimitedMemoryBFGS, f, ∇f, x)
        δs, γs, qs, g = M.δs, M.γs, M.qs, ∇f(x)
        m = length(δs)
        if m > 0
            q = g
            for i in m : -1 : 1
                qs[i] = copy(q)
                q -= (δs[i]⋅q)/(γs[i]⋅δs[i])*γs[i]
            end
            z = (γs[m] .* δs[m] .* q) / (γs[m]⋅γs[m])
            for i in 1 : m
                z += δs[i]*(δs[i]⋅qs[i] - γs[i]⋅z)/(γs[i]⋅δs[i])
            end
            x′ = line_search(f, x, -z)
        else
            x′ = line_search(f, x, -g)
        end
        g′ = ∇f(x′)
        push!(δs, x′ - x); push!(γs, g′ - g)
        push!(qs, zeros(length(x)))
        while length(δs) > M.m
            popfirst!(δs); popfirst!(γs); popfirst!(qs)
        end
        return x′
    end


Exercise 6.4. Apply Newton's method to f(x) = (1/2) x⊤ H x starting from x^(1) =
[1, 1]. What have you observed? Use H as follows:

    H = [1     0
         0  1000]    (6.19)

Next, apply gradient descent to the same optimization problem by stepping
with the unnormalized gradient. Do two steps of the algorithm. What have you
observed? Finally, apply the conjugate gradient method. How many steps do you
need to converge?

Exercise 6.5. Compare Newton's method and the secant method on f(x) = x² + x⁴,
with x^(1) = −3 and x^(0) = −4. Run each method for 10 iterations. Make two
plots:

1. Plot f vs. the iteration for each method.

2. Plot f′ vs. x. Overlay the progression of each method, drawing lines from
   (x^(i), f′(x^(i))) to (x^(i+1), 0) to (x^(i+1), f′(x^(i+1))) for each transition.

What can we conclude about this comparison?

Exercise 6.6. Give an example of a sequence of points x (1) , x (2) , . . . and a function
f such that f ( x (1) ) > f ( x (2) ) > · · · and yet the sequence does not converge to a
local minimum. Assume f is bounded from below.

Exercise 6.7. What is the advantage of a quasi-Newton method over Newton's
method?

Exercise 6.8. Give an example where the BFGS update does not exist. What would
you do in this case?

Exercise 6.9. Suppose we have a function f(x) = (x₁ + 1)² + (x₂ + 3)² + 4. If we
start at the origin, what is the resulting point after one step of Newton's method?

Exercise 6.10. In this problem we will derive the optimization problem from
which the Davidon-Fletcher-Powell update is obtained. Start with a quadratic
approximation at x^(k):

    f^(k)(x) = y^(k) + (g^(k))⊤ (x − x^(k)) + (1/2) (x − x^(k))⊤ H^(k) (x − x^(k))


where y^(k), g^(k), and H^(k) are the objective function value, the true gradient, and
a positive definite Hessian approximation at x^(k).
The next iterate is chosen using line search to obtain:

    x^(k+1) ← x^(k) − α^(k) (H^(k))⁻¹ g^(k)

We can construct a new quadratic approximation f^(k+1) at x^(k+1). The approxi-
mation should enforce that the local function evaluation is correct:

    f^(k+1)(x^(k+1)) = y^(k+1)

and that the local gradient is correct:

    ∇f^(k+1)(x^(k+1)) = g^(k+1)

and that the previous gradient is correct:

    ∇f^(k+1)(x^(k)) = g^(k)

Show that updating the Hessian approximation to obtain H^(k+1) requires:⁸

    H^(k+1) δ^(k+1) = γ^(k+1)

Then, show that in order for H^(k+1) to be positive definite, we require:⁹

    (δ^(k+1))⊤ γ^(k+1) > 0

⁸ This condition is called the secant equation. The vectors δ and γ are defined in
equation (6.11).
⁹ This condition is called the curvature condition. It can be enforced using the
Wolfe conditions during line search.

Finally, assuming that the curvature condition is enforced, explain why one
then solves the following optimization problem to obtain H^(k+1):¹⁰

    minimize over H:   ‖H − H^(k)‖
    subject to:        H = H⊤
                       H δ^(k+1) = γ^(k+1)

where ‖H − H^(k)‖ is a matrix norm that defines a distance between H and H^(k).

¹⁰ The Davidon-Fletcher-Powell update is obtained by solving such an
optimization problem to obtain an analytical solution and then finding the
corresponding update equation for the inverse Hessian approximation.
