
Chapter 8

Unconstrained Optimization

In previous chapters, we have chosen to take a largely variational approach to deriving standard
algorithms for computational linear algebra. That is, we define an objective function, possibly with
constraints, and pose our algorithms as a minimization or maximization problem. A sampling
from our previous discussion is listed below:

Problem | Objective | Constraints
Least-squares | E(~x) = ‖A~x − ~b‖² | None
Project ~b onto ~a | E(c) = ‖c~a − ~b‖ | None
Eigenvectors of symmetric matrix | E(~x) = ~x⊤A~x | ‖~x‖ = 1
Pseudoinverse | E(~x) = ‖~x‖² | A⊤A~x = A⊤~b
Principal components analysis | E(C) = ‖X − CC⊤X‖_Fro | C⊤C = I_{d×d}
Broyden step | E(J_k) = ‖J_k − J_{k−1}‖²_Fro | J_k(~x_k − ~x_{k−1}) = f(~x_k) − f(~x_{k−1})

Obviously the formulation of problems in this fashion is a powerful and general approach. For
this reason, it is valuable to design algorithms that function in the absence of a special form for the
energy E, in the same way that we developed strategies for finding roots of f without knowing
the form of f a priori.

8.1 Unconstrained Optimization: Motivation


In this chapter, we will consider unconstrained problems, that is, problems that can be posed as
minimizing or maximizing a function f : Rn → R without any requirements on the input. It is
not difficult to encounter such problems in practice; we list a few examples below.
Example 8.1 (Nonlinear least-squares). Suppose we are given a number of pairs ( xi , yi ) such that
f ( xi ) ≈ yi , and we wish to find the best approximating f within a particular class. For instance, we
may expect that f is exponential, in which case we should be able to write f(x) = ce^{ax} for some c and some
a; our job is to find these parameters. One simple strategy might be to attempt to minimize the following
energy:

$$E(a, c) = \sum_i \left(y_i - c e^{a x_i}\right)^2.$$
This form for E is not quadratic in a, so our linear least-squares methods do not apply.
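To make this concrete, here is a minimal sketch (not part of the original text) that minimizes this energy numerically with SciPy's general-purpose minimize routine; the synthetic data, starting guess, and parameter values are invented purely for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data: y_i ≈ c * exp(a * x_i) with a = 0.5, c = 2, plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 20)
y = 2.0 * np.exp(0.5 * x) + 0.05 * rng.standard_normal(x.size)

def energy(params):
    a, c = params
    return np.sum((y - c * np.exp(a * x)) ** 2)

# Minimize E(a, c) starting from a rough initial guess.
result = minimize(energy, x0=[0.0, 1.0])
print(result.x)  # recovered (a, c), close to (0.5, 2.0)
```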

Example 8.2 (Maximum likelihood estimation). In machine learning, the problem of parameter esti-
mation involves examining the results of a randomized experiment and trying to summarize them using
a probability distribution of a particular form. For example, we might measure the height of every student
in a class, yielding a list of heights hi for each student i. If we have a lot of students, we might model the
distribution of student heights using a normal distribution:
$$g(h; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(h-\mu)^2/2\sigma^2},$$
where µ is the mean of the distribution and σ is the standard deviation.
Under this normal distribution, the likelihood that we observe height hi for student i is given by
g(hi ; µ, σ), and under the (reasonable) assumption that the height of student i is probabilistically inde-
pendent of that of student j, the probability of observing the entire set of heights observed is given by the
product
$$P(\{h_1, \ldots, h_n\}; \mu, \sigma) = \prod_i g(h_i; \mu, \sigma).$$
A common method for estimating the parameters µ and σ of g is to maximize P viewed as a function of µ
and σ with { hi } fixed; this is called the maximum-likelihood estimate of µ and σ. In practice, we usually
optimize the log likelihood `(µ, σ) ≡ log P({ h1 , . . . , hn }; µ, σ); this function has the same maxima but
enjoys better numerical and mathematical properties.
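As an illustration (a sketch, not from the original text), the following code maximizes the log likelihood numerically by minimizing its negation; the height data are made up, and for the normal distribution the numerical answer can be checked against the closed-form estimates, namely the sample mean and the (population) standard deviation:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical height data (in cm); any sample would do.
h = np.array([165.0, 172.0, 158.0, 180.0, 169.0, 175.0, 162.0, 171.0])

def neg_log_likelihood(params):
    mu, sigma = params
    # log of the product of normal densities = sum of log densities
    return -np.sum(-np.log(sigma * np.sqrt(2 * np.pi))
                   - (h - mu) ** 2 / (2 * sigma ** 2))

result = minimize(neg_log_likelihood, x0=[160.0, 10.0])
print(result.x)             # numerical MLE (mu, sigma)
print(h.mean(), h.std())    # closed-form MLE for the normal, for comparison
```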
Example 8.3 (Geometric problems). Many geometry problems encountered in graphics and vision do
not reduce to least-squares energies. For instance, suppose we have a number of points ~x1 , . . . , ~xk ∈ R3 . If
we wish to cluster these points, we might wish to summarize them with a single ~x minimizing:

$$E(\vec{x}) \equiv \sum_i \|\vec{x} - \vec{x}_i\|_2.$$

The ~x ∈ R3 minimizing E is known as the geometric median of {~x1 , . . . , ~xk }. Notice that the norm of the
difference ~x − ~xi in E is not squared, so the energy is no longer quadratic in the components of ~x.
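A minimal numerical sketch (assuming NumPy and SciPy; the points below are invented) minimizes E directly with a general-purpose solver, starting from the centroid. E is convex but not differentiable at the data points themselves, so specialized algorithms exist; this is only an illustration:

```python
import numpy as np
from scipy.optimize import minimize

# A few points in R^3 (made up for illustration).
pts = np.array([[0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 2.0, 0.0],
                [0.0, 0.0, 3.0]])

def energy(x):
    # Sum of (unsquared) distances from x to each point.
    return np.sum(np.linalg.norm(pts - x, axis=1))

# Start from the centroid of the points.
result = minimize(energy, x0=pts.mean(axis=0))
print(result.x)  # approximate geometric median
```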
Example 8.4 (Physical equilibria, adapted from CITE). Suppose we attach an object to a set of springs;
each spring is anchored at point ~xi ∈ R3 and has natural length Li and constant k i . In the absence of
gravity, if our object is located at position ~p ∈ R3 , the network of springs has potential energy
$$E(\vec{p}) = \frac{1}{2}\sum_i k_i \left(\|\vec{p} - \vec{x}_i\|_2 - L_i\right)^2.$$

Equilibria of this system are given by minima of E and reflect points ~p at which the spring forces are all
balanced. Such spring systems are also used to visualize graphs G = (V, E): each vertex in V is assigned a
position, and a spring connects the positions of every pair of vertices joined by an edge in E.
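A small sketch of this energy (the anchor points, rest lengths, and spring constants below are hypothetical) finds an equilibrium by numerical minimization:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical spring network: anchor points, natural lengths, and spring constants.
anchors = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
L = np.array([1.0, 1.0, 1.0])
k = np.array([1.0, 2.0, 1.0])

def potential(p):
    d = np.linalg.norm(anchors - p, axis=1)    # distance to each anchor
    return 0.5 * np.sum(k * (d - L) ** 2)      # spring potential energy

equilibrium = minimize(potential, x0=np.array([0.5, 0.5, 0.5]))
print(equilibrium.x)  # position where the spring forces balance
```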

8.2 Optimality
Before discussing how to minimize or maximize a function, we should be clear what it is we are
looking for; notice that maximizing f is the same as minimizing − f , so the minimization problem
is sufficient for our consideration. For a particular f : Rn → R and ~x ∗ ∈ Rn , we need to derive
optimality conditions that verify that ~x ∗ has the lowest possible value f (~x ∗ ).
Of course, ideally we would like to find global optima of f :

Figure 8.1: A function f ( x ) with multiple optima.

Definition 8.1 (Global minimum). The point ~x ∗ ∈ Rn is a global minimum of f : Rn → R if


f (~x ∗ ) ≤ f (~x ) for all ~x ∈ Rn .
Finding a global minimum of f without any information about the structure of f effectively
requires searching in the dark. For instance, suppose an optimization algorithm identifies the local
minimum near x = −1 in the function in Figure 8.1. It is nearly impossible to realize that there is
a second, lower minimum near x = 1 simply by guessing x values; for all we know, there may
be a third, even lower minimum of f at x = 1000!
Thus, in many cases we satisfy ourselves by finding a local minimum:
Definition 8.2 (Local minimum). The point ~x ∗ ∈ Rn is a local minimum of f : Rn → R if f (~x ∗ ) ≤
f (~x ) for all ~x ∈ Rn satisfying k~x − ~x ∗ k < ε for some ε > 0.
This definition requires that ~x ∗ attains the smallest value in some neighborhood defined by the
radius ε. Notice that local optimization algorithms suffer a severe limitation: they cannot guarantee
that they find the lowest possible value of f, as in Figure 8.1 when the left local minimum is
reached. Many strategies, heuristic and otherwise, are applied to explore the landscape of possible
~x values to gain confidence that a local minimum attains the best possible value.

8.2.1 Differential Optimality


A familiar story from single- and multi-variable calculus is that finding potential minima and
maxima of a function f : Rn → R is more straightforward when f is differentiable. Recall that the
gradient vector ∇ f = (∂ f/∂x1 , . . . , ∂ f/∂xn ) points in the direction in which f increases the most; the
vector −∇ f points in the direction of greatest decrease. One way to see this is to recall that near a
point ~x0 ∈ Rn , f looks like the linear function

f (~x ) ≈ f (~x0 ) + ∇ f (~x0 ) · (~x − ~x0 ).

If we take ~x − ~x0 = α∇ f (~x0 ), then we find:

f (~x0 + α∇ f (~x0 )) ≈ f (~x0 ) + αk∇ f (~x0 )k2

When k∇ f (~x0 )k > 0, the sign of α determines whether f increases or decreases.


It is not difficult to formalize the above argument to show that if ~x0 is a local minimum, then
we must have ∇ f (~x0 ) = ~0. Notice this condition is necessary but not sufficient: maxima and saddle

Figure 8.2: Critical points can take many forms; here we show a local minimum, a saddle point,
and a local maximum.


Figure 8.3: A function with many stationary points.

points also have ∇ f (~x0 ) = ~0, as illustrated in Figure 8.2. Even so, this observation about minima
of differentiable functions yields a common strategy for minimization:
1. Find points ~xi satisfying ∇ f (~xi ) = ~0.

2. Check which of these points is a local minimum as opposed to a maximum or saddle point.
Given their important role in this strategy, we give the points we seek a special name:
Definition 8.3 (Stationary point). A stationary point of f : Rn → R is a point ~x ∈ Rn satisfying
∇ f (~x ) = ~0.
That is, our strategy for minimization can be to find stationary points of f and then eliminate those
that are not minima.
It is important to keep in mind when we can expect our strategies for minimization to succeed.
In most cases, such as those shown in Figure 8.2, the stationary points of f are isolated, meaning we
can write them in a discrete list {~x0 , ~x1 , . . .}. A degenerate case, however, is shown in Figure 8.3;
here, the entire interval [−1/2, 1/2] is composed of stationary points, making it impossible to con-
sider them one at a time. For the most part, we will ignore such issues as degenerate cases, but
will return to them when we consider the conditioning of the minimization problem.
Suppose we identify a point ~x ∈ Rn as a stationary point of f and now wish to check whether it is
a local minimum. If f is twice-differentiable, one strategy we can employ is to write its Hessian
matrix:
$$H_f(\vec{x}) = \begin{pmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{pmatrix}$$

We can add another term to our Taylor expansion of f to see the role of H f :

$$f(\vec{x}) \approx f(\vec{x}_0) + \nabla f(\vec{x}_0) \cdot (\vec{x} - \vec{x}_0) + \frac{1}{2}(\vec{x} - \vec{x}_0)^\top H_f(\vec{x}_0)\,(\vec{x} - \vec{x}_0)$$
If we substitute a stationary point ~x ∗ , then by definition we know:

$$f(\vec{x}) \approx f(\vec{x}^*) + \frac{1}{2}(\vec{x} - \vec{x}^*)^\top H_f(\vec{x}^*)\,(\vec{x} - \vec{x}^*)$$
If H f is positive definite, then this expression shows f (~x ) ≥ f (~x ∗ ), and thus ~x ∗ is a local minimum.
More generally, one of a few situations can occur:

• If H f is positive definite, then ~x ∗ is a local minimum of f .

• If H f is negative definite, then ~x ∗ is a local maximum of f .

• If H f is indefinite, then ~x ∗ is a saddle point of f .

• If H f is not invertible, then oddities such as the function in Figure 8.3 can occur.

Checking if a matrix is positive definite can be accomplished by checking if its Cholesky factor-
ization exists or—more slowly—by checking that all its eigenvalues are positive. Thus, when the
Hessian of f is known we can check stationary points for optimality using the list above; many
optimization algorithms including the ones we will discuss simply ignore the final case and notify
the user, since it is relatively unlikely.
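As a concrete sketch of this classification step (not from the original text), the following helpers examine a given Hessian using its eigenvalues, with a Cholesky-based shortcut for the positive definite case:

```python
import numpy as np

def classify_stationary_point(H, tol=1e-10):
    """Classify a stationary point from its (symmetric) Hessian H."""
    eigvals = np.linalg.eigvalsh(H)           # eigenvalues of a symmetric matrix
    if np.min(np.abs(eigvals)) < tol:
        return "degenerate (H is not invertible)"
    if np.all(eigvals > 0):
        return "local minimum"
    if np.all(eigvals < 0):
        return "local maximum"
    return "saddle point"

def is_positive_definite(H):
    """Faster check: Cholesky factorization succeeds exactly when H is positive definite."""
    try:
        np.linalg.cholesky(H)
        return True
    except np.linalg.LinAlgError:
        return False

print(classify_stationary_point(np.diag([2.0, 3.0])))    # local minimum
print(classify_stationary_point(np.diag([2.0, -3.0])))   # saddle point
print(is_positive_definite(np.diag([2.0, 3.0])))         # True
```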

8.2.2 Optimality via Function Properties


Occasionally, if we know more about f : Rn → R, we can provide optimality conditions that are
stronger or easier to check than the ones above.
One property of f that has strong implications for optimization is convexity, illustrated in Figure NUMBER:

Definition 8.4 (Convex). A function f : Rn → R is convex when for all ~x, ~y ∈ Rn and α ∈ (0, 1) the
following relationship holds:

f ((1 − α)~x + α~y) ≤ (1 − α) f (~x ) + α f (~y).

When the inequality is strict, the function is strictly convex.

Convexity implies that if you connect two points in Rn with a line segment, the values of f along the
segment lie on or below the values obtained by linearly interpolating f between the endpoints.
Convex functions enjoy many strong properties, the most basic of which is the following:

Figure 8.4: A quasiconvex function.

Proposition 8.1. A local minimum of a convex function f : Rn → R is necessarily a global minimum.

Proof. Take ~x to be such a local minimum and suppose there exists ~x∗ ≠ ~x with f(~x∗) < f(~x).
Then, for α ∈ (0, 1),

$$f(\vec{x} + \alpha(\vec{x}^* - \vec{x})) \le (1-\alpha)f(\vec{x}) + \alpha f(\vec{x}^*) \quad \text{by convexity}$$
$$< f(\vec{x}) \quad \text{since } f(\vec{x}^*) < f(\vec{x})$$

But taking α → 0 shows that ~x cannot possibly be a local minimum.

This proposition and related observations show that it is possible to check if you have reached a
global minimum of a convex function simply by applying first-order optimality. Thus, it is valuable
to check by hand if a function being optimized happens to be convex, a situation occurring sur-
prisingly often in scientific computing; one sufficient condition that can be easier to check when f
is twice differentiable is that H f is positive definite everywhere.
Other optimization techniques have guarantees under other assumptions about f . For exam-
ple, one weaker version of convexity is quasi-convexity, which requires that for all ~x, ~y ∈ Rn and α ∈ (0, 1),

$$f((1 - \alpha)\vec{x} + \alpha\vec{y}) \le \max(f(\vec{x}), f(\vec{y})).$$

An example of a quasiconvex function is shown in Figure 8.4; although it does not have the char-
acteristic “bowl” shape of a convex function, it does have a unique optimum.

8.3 One-Dimensional Strategies


As in the last chapter, we will start with one-dimensional optimization of f : R → R and then
expand our strategies to more general functions f : Rn → R.

8.3.1 Newton’s Method


Our principal strategy for minimizing differentiable functions f : Rn → R will be to find sta-
tionary points ~x ∗ satisfying ∇ f (~x ∗ ) = 0. Assuming we can check whether stationary points are
maxima, minima, or saddle points as a post-processing step, we will focus on the problem of
finding the stationary points ~x ∗ .

To this end, suppose f : R → R is differentiable. Then, as in our derivation of Newton’s
method for root-finding, we can approximate:

$$f(x) \approx f(x_k) + f'(x_k)(x - x_k) + \frac{1}{2}f''(x_k)(x - x_k)^2.$$

The approximation on the right-hand side is a parabola whose vertex is located at x_k − f'(x_k)/f''(x_k).
Of course, in reality f is not necessarily a parabola, so Newton's method simply iterates the formula

$$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}.$$
This technique is easily analyzed given the work we have already put into understanding Newton's
method for root-finding in the previous chapter. In particular, an alternative way to derive
the formula above is to apply Newton's root-finding iteration to f'(x), since stationary points satisfy f'(x) = 0.
Thus, in most cases Newton's method for optimization exhibits quadratic convergence, provided
the initial guess x0 is sufficiently close to x∗.
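A minimal sketch of this iteration, assuming we can evaluate f' and f'' (the test function below is an arbitrary choice):

```python
def newton_minimize(fprime, fprime2, x0, tol=1e-10, max_iters=100):
    """Newton's method for 1D optimization: root-finding on f'."""
    x = x0
    for _ in range(max_iters):
        step = fprime(x) / fprime2(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: f(x) = x^4 - 3x^2 + x has a local minimum found from x0 = 1.5.
x_star = newton_minimize(lambda x: 4*x**3 - 6*x + 1,    # f'
                         lambda x: 12*x**2 - 6,         # f''
                         x0=1.5)
print(x_star)
```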
A natural question to ask is whether the secant method can be applied in an analogous way.
Our derivation of Newton's method above finds roots of f', so the secant method could be used to
eliminate the evaluation of f'' but not f'; situations in which we know f' but not f'' are relatively
rare. A more suitable parallel is to replace the line segments used to approximate f in the secant
method with parabolas. This strategy, known as successive parabolic interpolation, also minimizes a
quadratic approximation of f at each iteration, but rather than using f(x_k), f'(x_k), and f''(x_k) to
construct the approximation, it uses f(x_k), f(x_{k−1}), and f(x_{k−2}). The derivation of this technique
is relatively straightforward, and it converges superlinearly.
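A sketch of successive parabolic interpolation follows; the vertex formula is the standard one for the parabola through three samples, and no safeguards against degenerate (collinear or repeated) samples are included:

```python
def parabolic_interpolation_minimize(f, x0, x1, x2, tol=1e-10, max_iters=100):
    """Successive parabolic interpolation: fit a parabola through the three most
    recent samples and move to its vertex."""
    xs = [x0, x1, x2]
    fs = [f(x0), f(x1), f(x2)]
    for _ in range(max_iters):
        a, b, c = xs[-3], xs[-2], xs[-1]
        fa, fb, fc = fs[-3], fs[-2], fs[-1]
        # Vertex of the parabola interpolating (a, fa), (b, fb), (c, fc).
        num = (b - a)**2 * (fb - fc) - (b - c)**2 * (fb - fa)
        den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
        x_new = b - 0.5 * num / den
        xs.append(x_new)
        fs.append(f(x_new))
        if abs(x_new - c) < tol:
            break
    return xs[-1]

# Example: minimize f(x) = (x - 2)^2 + 1 from three nearby starting samples.
print(parabolic_interpolation_minimize(lambda x: (x - 2)**2 + 1, 0.0, 0.5, 1.0))
```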

8.3.2 Golden Section Search


We skipped over bisection in our parallel of single-variable root-finding techniques. There are
many reasons for this omission. Our motivation for bisection was that it employed only the weak-
est assumption on f needed to find roots: continuity. The Intermediate Value Theorem does not
apply to minima in any intuitive way, however, so it appears such a straightforward approach
does not exist.
It is valuable, however, to have at least one minimization strategy available that does not re-
quire differentiability of f as an underlying assumption; after all, there are non-differentiable func-
tions that have clear minima, like f ( x ) ≡ | x | at x = 0. To this end, one alternative assumption
might be that f is unimodal:

Definition 8.5 (Unimodal). A function f : [ a, b] → R is unimodal if there exists x ∗ ∈ [ a, b] such
that f is decreasing for x ∈ [ a, x ∗ ] and increasing for x ∈ [ x ∗ , b].

In other words, a unimodal function decreases for some time, and then begins increasing; no
localized minima are allowed. Notice that functions like | x | are not differentiable but still are
unimodal.
Suppose we have two values x0 and x1 such that a < x0 < x1 < b. We can make two observa-
tions that will help us formulate an optimization technique:

• If f ( x0 ) ≥ f ( x1 ), then we know that f ( x ) ≥ f ( x1 ) for all x ∈ [ a, x0 ]. Thus, the interval [ a, x0 ]


can be discarded in our search for a minimum of f .

• If f ( x1 ) ≥ f ( x0 ), then we know that f ( x ) ≥ f ( x0 ) for all x ∈ [ x1 , b], and thus we can discard
[ x1 , b ].
This structure suggests a potential strategy for minimization beginning with the interval [ a, b] and
iteratively removing pieces according to the rules above.
One important detail remains, however. Our convergence guarantee for the bisection algo-
rithm came from the fact that we could remove half of the interval in question in each iteration.
We could proceed in a similar fashion, removing a third of the interval each time; this requires two
evaluations of f during each iteration at new x0 and x1 locations. If evaluating f is expensive,
however, we may wish to reuse information from previous iterations to avoid at least one of those
two evaluations.
For now, take a = 0 and b = 1; the strategies we derive below will work more generally by shifting
and scaling. In the absence of more information about f , we might as well make a symmetric
choice x0 = α and x1 = 1 − α for some α ∈ (0, 1/2). Suppose our iteration removes the rightmost
interval [ x1 , b]. Then, the search interval becomes [0, 1 − α], and we know f (α) from the previous
iteration. The next iteration will divide [0, 1 − α] such that x0 = α(1 − α) and x1 = (1 − α)². If we
wish to reuse f (α) from the previous iteration, we could set (1 − α)² = α, yielding:
$$\alpha = \frac{1}{2}\left(3 - \sqrt{5}\right), \qquad 1 - \alpha = \frac{1}{2}\left(\sqrt{5} - 1\right).$$
The value of 1 − α ≡ τ above is the reciprocal of the golden ratio! It allows for the reuse of one of the function
evaluations from the previous iteration; a symmetric argument shows that the same choice of α
works if we had removed the left interval instead of the right one.
The golden section search algorithm makes use of this construction (CITE):

1. Take τ ≡ (√5 − 1)/2, and initialize a and b so that f is unimodal on [ a, b].
2. Make an initial subdivision x0 = a + (1 − τ )(b − a) and x1 = a + τ (b − a).
3. Initialize f 0 = f ( x0 ) and f 1 = f ( x1 ).
4. Iterate until b − a is sufficiently small:
(a) If f 0 ≥ f 1 , then remove the interval [ a, x0 ] as follows:
• Move left side: a ← x0
• Reuse previous iteration: x0 ← x1 , f 0 ← f 1
• Generate new sample: x1 ← a + τ (b − a), f 1 ← f ( x1 )
(b) If f 1 > f 0 , then remove the interval [ x1 , b] as follows:
• Move right side: b ← x1
• Reuse previous iteration: x1 ← x0 , f 1 ← f 0
• Generate new sample: x0 ← a + (1 − τ )(b − a), f 0 ← f ( x0 )
This algorithm clearly converges unconditionally and linearly. When f is not globally unimodal,
it can be difficult to find [ a, b] such that f is unimodal on that interval, limiting the applications of
this technique somewhat; generally [ a, b] is guessed by attempting to bracket a local minimum of
f.
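A compact sketch of the algorithm above (assuming f is unimodal on the supplied interval):

```python
import math

def golden_section_search(f, a, b, tol=1e-8):
    """Golden section search; assumes f is unimodal on [a, b]."""
    tau = (math.sqrt(5.0) - 1.0) / 2.0
    x0 = a + (1 - tau) * (b - a)
    x1 = a + tau * (b - a)
    f0, f1 = f(x0), f(x1)
    while b - a > tol:
        if f0 >= f1:
            # Discard [a, x0]; reuse the sample at x1.
            a = x0
            x0, f0 = x1, f1
            x1 = a + tau * (b - a)
            f1 = f(x1)
        else:
            # Discard [x1, b]; reuse the sample at x0.
            b = x1
            x1, f1 = x0, f0
            x0 = a + (1 - tau) * (b - a)
            f0 = f(x0)
    return 0.5 * (a + b)

print(golden_section_search(abs, -1.0, 2.0))  # ≈ 0, the minimum of |x|
```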

8.4 Multivariable Strategies
We continue our parallel with root-finding by expanding the discussion to multivariable problems.
As with root-finding, multivariable problems are considerably more difficult than problems in a
single variable, but they appear so often in practice that they are worth careful consideration.
Here, we will consider only the case that f : Rn → R is differentiable. Optimization methods
in the spirit of golden section search for non-differentiable functions have limited applicability
and are difficult to formulate.

8.4.1 Gradient Descent


Recall from our previous discussion that ∇ f (~x ) points in the direction of “steepest ascent” of f at
~x; similarly, the vector −∇ f (~x ) is the direction of “steepest descent.” If nothing else, this definition
guarantees that when ∇ f (~x ) ≠ ~0, for small α > 0 we must have

f (~x − α∇ f (~x )) ≤ f (~x ).

Suppose our current estimate of the location of the minimum of f is ~xk . Then, we might wish
to choose ~xk+1 so that f (~xk+1 ) < f (~xk ) for an iterative minimization strategy. One way to simplify
the search for ~xk+1 would be to use one of our one-dimensional algorithms from §8.3 on a simpler
problem. In particular, consider the function gk (t) ≡ f (~xk − t∇ f (~xk )), which restricts f to the line
through ~xk parallel to ∇ f (~xk ). Thanks to our discussion of the gradient, we know that small t will
yield a decrease in f .
The gradient descent algorithm iteratively solves these one-dimensional problems to improve
our estimate of ~xk :

1. Choose an initial estimate ~x0

2. Iterate until convergence of ~xk :

(a) Take gk (t) ≡ f (~xk − t∇ f (~xk ))


(b) Use a one-dimensional algorithm to find t∗ minimizing gk over all t ≥ 0 (“line search”)
(c) Take ~xk+1 ≡ ~xk − t∗ ∇ f (~xk )

Each iteration of gradient descent decreases f (~xk ), so the objective values converge. The algorithm
only terminates when ∇ f (~xk ) ≈ ~0, showing that gradient descent approaches a stationary point,
in practice usually a local minimum; convergence is slow for most functions f , however. The line
search can be replaced by a method that merely decreases the objective by a non-negligible, if
suboptimal, amount, although guaranteeing convergence is more difficult in this case.
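A sketch of this procedure follows, using SciPy's bounded scalar minimizer as the line search; the search range for t and the test objective are arbitrary choices made for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gradient_descent(f, grad_f, x0, tol=1e-6, max_iters=1000):
    """Gradient descent with a one-dimensional line search at each step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        # Line search: minimize g_k(t) = f(x - t * grad f(x)) over t >= 0.
        t_star = minimize_scalar(lambda t: f(x - t * g),
                                 bounds=(0.0, 10.0), method="bounded").x
        x = x - t_star * g
    return x

# Example: f(x, y) = (x - 1)^2 + 10 (y + 2)^2.
f = lambda x: (x[0] - 1)**2 + 10 * (x[1] + 2)**2
grad_f = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
print(gradient_descent(f, grad_f, [0.0, 0.0]))  # ≈ [1, -2]
```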

8.4.2 Newton’s Method


Paralleling our derivation of the single-variable case, we can write a Taylor series approximation
of f : Rn → R using its Hessian H f :

$$f(\vec{x}) \approx f(\vec{x}_k) + \nabla f(\vec{x}_k)^\top (\vec{x} - \vec{x}_k) + \frac{1}{2}(\vec{x} - \vec{x}_k)^\top H_f(\vec{x}_k)\,(\vec{x} - \vec{x}_k)$$

Differentiating with respect to ~x and setting the result equal to zero yields the following iterative
scheme:

$$\vec{x}_{k+1} = \vec{x}_k - [H_f(\vec{x}_k)]^{-1}\nabla f(\vec{x}_k)$$
It is easy to double check that this expression is a generalization of that in §8.3.1, and once again it
converges quadratically when ~x0 is near a minimum.
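A minimal sketch of the iteration, solving a linear system at each step rather than forming the inverse explicitly; the test objective below is a hypothetical smooth, strictly convex function chosen so that pure Newton steps converge from the given start:

```python
import numpy as np

def newton_method(grad_f, hess_f, x0, tol=1e-10, max_iters=50):
    """Newton's method for minimization: x_{k+1} = x_k - H_f(x_k)^{-1} grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve H_f(x) * step = g instead of explicitly inverting the Hessian.
        x = x - np.linalg.solve(hess_f(x), g)
    return x

# Example objective: f(x, y) = log(e^x + e^y) + x^2 + y^2, smooth and strictly convex.
def grad_f(v):
    x, y = v
    s = np.exp(x) + np.exp(y)
    return np.array([np.exp(x) / s + 2 * x, np.exp(y) / s + 2 * y])

def hess_f(v):
    x, y = v
    s = np.exp(x) + np.exp(y)
    p = np.exp(x) * np.exp(y) / s**2
    return np.array([[p + 2, -p], [-p, p + 2]])

print(newton_method(grad_f, hess_f, [2.0, -3.0]))  # ≈ [-0.25, -0.25]
```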
Newton’s method can be more efficient than gradient descent depending on the optimization
objective f . Recall that each iteration of gradient descent potentially requires many evaluations of
f during the line search procedure. On the other hand, we must evaluate and invert the Hessian
H f during each iteration of Newton’s method. Notice that these factors do not affect the number of
iterations but do affect runtime: this is a tradeoff that may not be obvious via traditional analysis.
It is intuitive why Newton’s method converges quickly when it is near an optimum. In partic-
ular, gradient descent has no knowledge of H f ; it proceeds analogously to walking downhill by
looking only at your feet. By using H f , Newton’s method has a larger picture of the shape of f
nearby.
When H f is not positive definite, however, the objective locally might look like a saddle or peak
rather than a bowl. In this case, jumping to an approximate stationary point might not make sense.
Thus, adaptive techniques might check if H f is positive definite before applying a Newton step; if
it is not positive definite, the methods can revert to gradient descent to find a better approximation
of the minimum. Alternatively, they can modify H f by, e.g., projecting onto the closest positive
definite matrix.

8.4.3 Optimization without Hessians: BFGS


Newton’s method can be difficult to apply to complicated functions f : Rn → R. The second
derivative of f might be considerably more involved than the form of f , and H f changes with
each iteration, making it difficult to reuse work from previous iterations. Additionally, H f has
size n × n, so storing H f requires O(n2 ) space, which can be unacceptable.
As in our discussion of root-finding, techniques for minimization that imitate Newton’s method
but use approximate derivatives are called quasi-Newton methods. Often they can have similarly
strong convergence properties without the need for explicit re-evaluation and even factorization
of the Hessian at each iteration. In our discussion, we will follow the development of (CITE NO-
CEDAL AND WRIGHT).
Suppose we wish to minimize f : Rn → R using an iterative scheme. Near the current estimate
~xk of the root, we might estimate f with a quadratic model:

$$f(\vec{x}_k + \delta\vec{x}) \approx f(\vec{x}_k) + \nabla f(\vec{x}_k) \cdot \delta\vec{x} + \frac{1}{2}(\delta\vec{x})^\top B_k\,(\delta\vec{x}).$$
Notice that we have asked that our approximation agrees with f to first order at ~xk ; as in Broyden’s
method for root-finding, however, we will allow our estimate of the Hessian Bk to vary.
This quadratic model is minimized by taking δ~x = − Bk−1 ∇ f (~xk ). In case kδ~x k2 is large and we
do not wish to take such a considerable step, we will allow ourselves to scale this difference by a
step size αk , yielding
~xk+1 = ~xk − αk Bk−1 ∇ f (~xk ).
Our goal is to find a reasonable estimate Bk+1 by updating Bk , so that we can repeat this process.

The Hessian of f is nothing more than the derivative of ∇ f , so we can write a secant-style
condition on Bk+1 :
Bk+1 (~xk+1 − ~xk ) = ∇ f (~xk+1 ) − ∇ f (~xk ).
We will substitute ~sk ≡ ~xk+1 − ~xk and ~yk ≡ ∇ f (~xk+1 ) − ∇ f (~xk ), yielding an equivalent condition
Bk+1~sk = ~yk .
Given the optimization at hand, we wish for Bk to have two properties:
• Bk should be a symmetric matrix, like the Hessian H f .

• Bk should be positive (semi-)definite, so that we are seeking minima rather than maxima or
saddle points.
The symmetry condition is enough to eliminate the possibility of using the Broyden estimate we
developed in the previous chapter.
The positive definite constraint implicitly puts a condition on the relationship between ~s_k and
~y_k. In particular, premultiplying the relationship B_{k+1}~s_k = ~y_k by ~s_k^⊤ shows ~s_k^⊤ B_{k+1} ~s_k = ~s_k^⊤ ~y_k. For
B_{k+1} to be positive definite, we must then have ~s_k · ~y_k > 0. This observation can guide our choice
of α_k; it is easy to see that it holds for sufficiently small α_k > 0.
Assume that ~sk and ~yk satisfy our compatibility condition. With this in place, we can write
down a Broyden-style optimization leading to a possible approximation Bk+1 :

$$\begin{aligned} \operatorname{minimize}_{B_{k+1}}\quad & \|B_{k+1} - B_k\| \\ \text{such that}\quad & B_{k+1}^\top = B_{k+1} \\ & B_{k+1}\vec{s}_k = \vec{y}_k \end{aligned}$$

For an appropriate choice of norm ‖ · ‖, this optimization yields the well-known DFP (Davidon-
Fletcher-Powell) iterative scheme.
Rather than work out the details of the DFP scheme, we move on to a more popular method
known as the BFGS (Broyden-Fletcher-Goldfarb-Shanno) formula, which appears in many mod-
ern systems. Notice that, ignoring our choice of α_k for now, our second-order approximation
was minimized by taking δ~x = −B_k^{-1}∇f(~x_k). Thus, in the end the behavior of our iterative scheme
is dictated by the inverse matrix B_k^{-1}. Asking that ‖B_{k+1} − B_k‖ be small can still allow relatively
large differences between the action of B_k^{-1} and that of B_{k+1}^{-1}!
With this observation in mind, the BFGS scheme makes a small alteration to the above derivation.
Rather than computing B_k at each iteration, we can compute its inverse H_k ≡ B_k^{-1} directly.
Now our condition B_{k+1}~s_k = ~y_k gets reversed to ~s_k = H_{k+1}~y_k; the condition that B_k is symmetric is
the same as asking that H_k is symmetric. We solve an optimization

$$\begin{aligned} \operatorname{minimize}_{H_{k+1}}\quad & \|H_{k+1} - H_k\| \\ \text{such that}\quad & H_{k+1}^\top = H_{k+1} \\ & \vec{s}_k = H_{k+1}\vec{y}_k \end{aligned}$$
This construction has the nice side benefit of not requiring matrix inversion to compute δ~x =
− Hk ∇ f (~xk ).
To derive a formula for Hk+1 , we must decide on a matrix norm k · k. As with our previous
discussion, the Frobenius norm looks closest to least-squares optimization, making it likely we can
generate a closed-form expression for Hk+1 rather than having to solve the minimization above as
a subroutine of BFGS optimization.

The Frobenius norm, however, has one serious drawback for Hessian matrices. Recall that the
Hessian matrix has entries (H_f)_{ij} = ∂²f/∂x_i∂x_j. Often the quantities x_i for different i can have different
units; e.g., consider maximizing the profit (in dollars) made by selling a cheeseburger of radius
r (in inches) and price p (in dollars), leading to f : (inches, dollars) → dollars. Squaring these
different quantities and adding them up does not make sense.
Suppose we find a symmetric positive definite matrix W so that W~sk = ~yk ; we will check in
the exercises that such a matrix exists. Such a matrix takes the units of ~sk = ~xk+1 − ~xk to those
of ~y_k = ∇f(~x_{k+1}) − ∇f(~x_k). Taking inspiration from our expression ‖A‖²_Fro = Tr(A^⊤A), we can
define a weighted Frobenius norm of a matrix A as

$$\|A\|_W^2 \equiv \mathrm{Tr}(A^\top W^\top A W)$$

It is straightforward to check that this expression has consistent units when applied to our optimization
for H_{k+1}. When both W and A are symmetric with columns ~w_i and ~a_i, respectively, expanding
the expression above shows:

$$\|A\|_W^2 = \sum_{ij} (\vec{w}_i \cdot \vec{a}_j)(\vec{w}_j \cdot \vec{a}_i).$$

This choice of norm combined with the choice of W yields a particularly clean formula for H_{k+1}
given H_k, ~s_k, and ~y_k:

$$H_{k+1} = (I_{n\times n} - \rho_k\,\vec{s}_k\vec{y}_k^\top)\, H_k\, (I_{n\times n} - \rho_k\,\vec{y}_k\vec{s}_k^\top) + \rho_k\,\vec{s}_k\vec{s}_k^\top,$$

where ρ_k ≡ 1/(~y_k · ~s_k). We show in the Appendix to this chapter how to derive this formula.
The BFGS algorithm avoids the need to compute and invert a Hessian matrix for f , but it still
requires O(n2 ) storage for Hk . A useful variant known as L-BFGS (“Limited-Memory BFGS”)
avoids this issue by keeping a limited history of vectors ~yk and ~sk and applying Hk by expanding
its formula recursively. This approach actually can have better numerical properties despite its
compact use of space; in particular, old vectors ~yk and ~sk may no longer be relevant and should be
ignored.
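To tie the pieces together, here is a sketch of a BFGS iteration built around the update formula above. For simplicity it uses a basic backtracking line search rather than a full Wolfe-condition line search, and it skips the update whenever ~s_k · ~y_k ≤ 0; both choices are simplifications relative to production implementations:

```python
import numpy as np

def bfgs(f, grad_f, x0, tol=1e-8, max_iters=200):
    """Sketch of BFGS: maintain an approximation H_k to the inverse Hessian and
    update it with the rank-two formula from this section."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)                              # initial H_0 = B_0^{-1}
    g = grad_f(x)
    for _ in range(max_iters):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                             # quasi-Newton search direction
        alpha = 1.0                            # backtracking (Armijo) line search
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        x_new = x + alpha * p
        g_new = grad_f(x_new)
        s, y = x_new - x, g_new - g
        sy = s @ y
        if sy > 1e-12:                         # apply the update only when s . y > 0
            rho = 1.0 / sy
            I = np.eye(n)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Example: a smooth nonquadratic objective (hypothetical, for illustration).
f = lambda x: (x[0] - 1)**2 + 5 * (x[1] + 2)**2 + 0.1 * x[0]**4
grad_f = lambda x: np.array([2 * (x[0] - 1) + 0.4 * x[0]**3, 10 * (x[1] + 2)])
print(bfgs(f, grad_f, [0.0, 0.0]))             # ≈ [0.87, -2]
```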

8.5 Problems
List of ideas:
• Derive Gauss-Newton

• Stochastic methods, AdaGrad

• VSCG algorithm

• Wolfe conditions for gradient descent; plug into BFGS

• Sherman-Morrison-Woodbury formula for Bk for BFGS

• Prove BFGS converges; show existence of a matrix W

• (Generalized) reduced gradient algorithm

• Condition number for optimization

Appendix: Derivation of BFGS Update¹
Our optimization for Hk+1 has the following Lagrange multiplier expression (for ease of notation
we take Hk+1 ≡ H and Hk = H ∗ ):

$$\Lambda \equiv \sum_{ij} (\vec{w}_i \cdot (\vec{h}_j - \vec{h}_j^*))(\vec{w}_j \cdot (\vec{h}_i - \vec{h}_i^*)) - \sum_{i<j} \alpha_{ij}(H_{ij} - H_{ji}) - \vec{\lambda}^\top(H\vec{y}_k - \vec{s}_k)$$
$$= \sum_{ij} (\vec{w}_i \cdot (\vec{h}_j - \vec{h}_j^*))(\vec{w}_j \cdot (\vec{h}_i - \vec{h}_i^*)) - \sum_{ij} \alpha_{ij} H_{ij} - \vec{\lambda}^\top(H\vec{y}_k - \vec{s}_k) \qquad \text{if we assume } \alpha_{ij} = -\alpha_{ji}$$

Taking derivatives to find critical points shows (for ~y ≡ ~yk ,~s ≡ ~sk ):

$$\begin{aligned}
0 = \frac{\partial \Lambda}{\partial H_{ij}} &= \sum_\ell 2 w_{i\ell}\,(\vec{w}_j \cdot (\vec{h}_\ell - \vec{h}_\ell^*)) - \alpha_{ij} - \lambda_i y_j \\
&= 2\sum_\ell w_{i\ell}\,(W^\top(H - H^*))_{j\ell} - \alpha_{ij} - \lambda_i y_j \\
&= 2\sum_\ell (W^\top(H - H^*))_{j\ell}\, w_{\ell i} - \alpha_{ij} - \lambda_i y_j && \text{by symmetry of } W \\
&= 2(W^\top(H - H^*)W)_{ji} - \alpha_{ij} - \lambda_i y_j \\
&= 2(W(H - H^*)W)_{ij} - \alpha_{ij} - \lambda_i y_j && \text{by symmetry of } W \text{ and } H
\end{aligned}$$

So, in matrix form we have the following list of facts:

$$0 = 2W(H - H^*)W - A - \vec{\lambda}\vec{y}^\top, \quad \text{where } A_{ij} = \alpha_{ij}$$
$$A^\top = -A, \qquad W^\top = W, \qquad H^\top = H, \qquad (H^*)^\top = H^*$$
$$H\vec{y} = \vec{s}, \qquad W\vec{s} = \vec{y}$$

We can achieve a pair of relationships using transposition combined with symmetry of H and W
and antisymmetry of A:

$$0 = 2W(H - H^*)W - A - \vec{\lambda}\vec{y}^\top$$
$$0 = 2W(H - H^*)W + A - \vec{y}\vec{\lambda}^\top$$
$$\Longrightarrow\; 0 = 4W(H - H^*)W - \vec{\lambda}\vec{y}^\top - \vec{y}\vec{\lambda}^\top$$

Post-multiplying this relationship by ~s shows:

$$\vec{0} = 4(\vec{y} - WH^*\vec{y}) - \vec{\lambda}(\vec{y}\cdot\vec{s}) - \vec{y}(\vec{\lambda}\cdot\vec{s})$$

Now, take the dot product with ~s:

$$0 = 4(\vec{y}\cdot\vec{s}) - 4(\vec{y}^\top H^*\vec{y}) - 2(\vec{y}\cdot\vec{s})(\vec{\lambda}\cdot\vec{s})$$

This shows:
$$\vec{\lambda}\cdot\vec{s} = 2\rho\,\vec{y}^\top(\vec{s} - H^*\vec{y}), \qquad \text{for } \rho \equiv 1/(\vec{y}\cdot\vec{s})$$
¹ Special thanks to Tao Du for debugging several parts of this derivation.

Now, we substitute this into our vector equality:

$$\begin{aligned}
\vec{0} &= 4(\vec{y} - WH^*\vec{y}) - \vec{\lambda}(\vec{y}\cdot\vec{s}) - \vec{y}(\vec{\lambda}\cdot\vec{s}) && \text{from before} \\
&= 4(\vec{y} - WH^*\vec{y}) - \vec{\lambda}(\vec{y}\cdot\vec{s}) - \vec{y}\left[2\rho\,\vec{y}^\top(\vec{s} - H^*\vec{y})\right] && \text{from our simplification} \\
\Longrightarrow \vec{\lambda} &= 4\rho(\vec{y} - WH^*\vec{y}) - 2\rho^2\,\vec{y}^\top(\vec{s} - H^*\vec{y})\,\vec{y}
\end{aligned}$$

Post-multiplying by ~y> shows:

$$\vec{\lambda}\vec{y}^\top = 4\rho(\vec{y} - WH^*\vec{y})\vec{y}^\top - 2\rho^2\,\vec{y}^\top(\vec{s} - H^*\vec{y})\,\vec{y}\vec{y}^\top$$

Taking the transpose,

$$\vec{y}\vec{\lambda}^\top = 4\rho\,\vec{y}(\vec{y}^\top - \vec{y}^\top H^* W) - 2\rho^2\,\vec{y}^\top(\vec{s} - H^*\vec{y})\,\vec{y}\vec{y}^\top$$

Combining these results and dividing by four shows:

$$\frac{1}{4}(\vec{\lambda}\vec{y}^\top + \vec{y}\vec{\lambda}^\top) = \rho(2\vec{y}\vec{y}^\top - WH^*\vec{y}\vec{y}^\top - \vec{y}\vec{y}^\top H^* W) - \rho^2\,\vec{y}^\top(\vec{s} - H^*\vec{y})\,\vec{y}\vec{y}^\top$$
Now, we will pre- and post-multiply by W^{-1}. Since W~s = ~y, we can equivalently write ~s = W^{-1}~y;
furthermore, by symmetry of W we then know ~y^⊤ W^{-1} = ~s^⊤. Applying these identities to the
expression above shows:

$$\begin{aligned}
\frac{1}{4}W^{-1}(\vec{\lambda}\vec{y}^\top + \vec{y}\vec{\lambda}^\top)W^{-1} &= 2\rho\,\vec{s}\vec{s}^\top - \rho H^*\vec{y}\vec{s}^\top - \rho\,\vec{s}\vec{y}^\top H^* - \rho^2(\vec{y}^\top\vec{s})\,\vec{s}\vec{s}^\top + \rho^2(\vec{y}^\top H^*\vec{y})\,\vec{s}\vec{s}^\top \\
&= 2\rho\,\vec{s}\vec{s}^\top - \rho H^*\vec{y}\vec{s}^\top - \rho\,\vec{s}\vec{y}^\top H^* - \rho\,\vec{s}\vec{s}^\top + \rho^2(\vec{y}^\top H^*\vec{y})\,\vec{s}\vec{s}^\top && \text{by definition of } \rho \\
&= \rho\,\vec{s}\vec{s}^\top - \rho H^*\vec{y}\vec{s}^\top - \rho\,\vec{s}\vec{y}^\top H^* + \rho^2(\vec{y}^\top H^*\vec{y})\,\vec{s}\vec{s}^\top
\end{aligned}$$

Finally, we can conclude our derivation of the BFGS step as follows:

$$\begin{aligned}
0 &= 4W(H - H^*)W - \vec{\lambda}\vec{y}^\top - \vec{y}\vec{\lambda}^\top && \text{from before} \\
\Longrightarrow H &= \frac{1}{4}W^{-1}(\vec{\lambda}\vec{y}^\top + \vec{y}\vec{\lambda}^\top)W^{-1} + H^* \\
&= \rho\,\vec{s}\vec{s}^\top - \rho H^*\vec{y}\vec{s}^\top - \rho\,\vec{s}\vec{y}^\top H^* + \rho^2(\vec{y}^\top H^*\vec{y})\,\vec{s}\vec{s}^\top + H^* && \text{from the last paragraph} \\
&= H^*(I - \rho\,\vec{y}\vec{s}^\top) + \rho\,\vec{s}\vec{s}^\top - \rho\,\vec{s}\vec{y}^\top H^* + (\rho\,\vec{s}\vec{y}^\top)H^*(\rho\,\vec{y}\vec{s}^\top) \\
&= H^*(I - \rho\,\vec{y}\vec{s}^\top) + \rho\,\vec{s}\vec{s}^\top - \rho\,\vec{s}\vec{y}^\top H^*(I - \rho\,\vec{y}\vec{s}^\top) \\
&= \rho\,\vec{s}\vec{s}^\top + (I - \rho\,\vec{s}\vec{y}^\top)H^*(I - \rho\,\vec{y}\vec{s}^\top)
\end{aligned}$$

This final expression is exactly the BFGS step introduced in the chapter.
