
Chapter 8

Unconstrained Optimization

In previous chapters, we have chosen to take a largely variational approach to deriving standard
algorithms for computational linear algebra. That is, we define an objective function, possibly with
constraints, and pose our algorithms as a minimization or maximization problem. A sampling
from our previous discussion is listed below:

Problem | Objective | Constraints
Least-squares | E(~x) = ‖A~x − ~b‖² | None
Project ~b onto ~a | E(c) = ‖c~a − ~b‖ | None
Eigenvectors of symmetric matrix | E(~x) = ~x⊤A~x | ‖~x‖ = 1
Pseudoinverse | E(~x) = ‖~x‖² | A⊤A~x = A⊤~b
Principal components analysis | E(C) = ‖X − CC⊤X‖_Fro | C⊤C = I_{d×d}
Broyden step | E(J_k) = ‖J_k − J_{k−1}‖²_Fro | J_k(~x_k − ~x_{k−1}) = f(~x_k) − f(~x_{k−1})

Obviously the formulation of problems in this fashion is a powerful and general approach. For
this reason, it is valuable to design algorithms that function in the absence of a special form for the
energy E, in the same way that we developed strategies for finding roots of f without knowing
the form of f a priori.

8.1 Unconstrained Optimization: Motivation


In this chapter, we will consider unconstrained problems, that is, problems that can be posed as
minimizing or maximizing a function f : Rn → R without any requirements on the input. It is
not difficult to encounter such problems in practice; we list a few examples below.
Example 8.1 (Nonlinear least-squares). Suppose we are given a number of pairs ( xi , yi ) such that
f ( xi ) ≈ yi , and we wish to find the best approximating f within a particular class. For instance, we
may expect that f is exponential, in which case we should be able to write f(x) = ce^{ax} for some c and some
a; our job is to find these parameters. One simple strategy might be to attempt to minimize the following
energy:

$$E(a, c) = \sum_i \left(y_i - c e^{a x_i}\right)^2.$$
This form for E is not quadratic in a, so our linear least-squares methods do not apply.
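To make this concrete, here is a minimal sketch (not part of the original text) that minimizes this energy numerically with SciPy's general-purpose minimize routine; the synthetic data, starting guess, and parameter values are invented purely for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data: y_i ≈ c * exp(a * x_i) with a = 0.5, c = 2, plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 20)
y = 2.0 * np.exp(0.5 * x) + 0.05 * rng.standard_normal(x.size)

def energy(params):
    a, c = params
    return np.sum((y - c * np.exp(a * x)) ** 2)

# Minimize E(a, c) starting from a rough initial guess.
result = minimize(energy, x0=[0.0, 1.0])
print(result.x)  # recovered (a, c), close to (0.5, 2.0)
```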

Example 8.2 (Maximum likelihood estimation). In machine learning, the problem of parameter esti-
mation involves examining the results of a randomized experiment and trying to summarize them using
a probability distribution of a particular form. For example, we might measure the height of every student
in a class, yielding a list of heights hi for each student i. If we have a lot of students, we might model the
distribution of student heights using a normal distribution:
$$g(h; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(h-\mu)^2/2\sigma^2},$$
where µ is the mean of the distribution and σ is the standard deviation.
Under this normal distribution, the likelihood that we observe height hi for student i is given by
g(hi ; µ, σ), and under the (reasonable) assumption that the height of student i is probabilistically inde-
pendent of that of student j, the probability of observing the entire set of heights observed is given by the
product
$$P(\{h_1, \ldots, h_n\}; \mu, \sigma) = \prod_i g(h_i; \mu, \sigma).$$
A common method for estimating the parameters µ and σ of g is to maximize P viewed as a function of µ
and σ with { hi } fixed; this is called the maximum-likelihood estimate of µ and σ. In practice, we usually
optimize the log likelihood `(µ, σ) ≡ log P({ h1 , . . . , hn }; µ, σ); this function has the same maxima but
enjoys better numerical and mathematical properties.
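As an illustration (a sketch, not from the original text), the following code maximizes the log likelihood numerically by minimizing its negation; the height data are made up, and for the normal distribution the numerical answer can be checked against the closed-form estimates, namely the sample mean and the (population) standard deviation:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical height data (in cm); any sample would do.
h = np.array([165.0, 172.0, 158.0, 180.0, 169.0, 175.0, 162.0, 171.0])

def neg_log_likelihood(params):
    mu, sigma = params
    # log of the product of normal densities = sum of log densities
    return -np.sum(-np.log(sigma * np.sqrt(2 * np.pi))
                   - (h - mu) ** 2 / (2 * sigma ** 2))

result = minimize(neg_log_likelihood, x0=[160.0, 10.0])
print(result.x)             # numerical MLE (mu, sigma)
print(h.mean(), h.std())    # closed-form MLE for the normal, for comparison
```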
Example 8.3 (Geometric problems). Many geometry problems encountered in graphics and vision do
not reduce to least-squares energies. For instance, suppose we have a number of points ~x1 , . . . , ~xk ∈ R3 . If
we wish to cluster these points, we might wish to summarize them with a single ~x minimizing:

$$E(\vec{x}) \equiv \sum_i \|\vec{x} - \vec{x}_i\|_2.$$

The ~x ∈ R3 minimizing E is known as the geometric median of {~x1 , . . . , ~xk }. Notice that the norm of the
difference ~x − ~xi in E is not squared, so the energy is no longer quadratic in the components of ~x.
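A minimal numerical sketch (assuming NumPy and SciPy; the points below are invented) minimizes E directly with a general-purpose solver, starting from the centroid. E is convex but not differentiable at the data points themselves, so specialized algorithms exist; this is only an illustration:

```python
import numpy as np
from scipy.optimize import minimize

# A few points in R^3 (made up for illustration).
pts = np.array([[0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 2.0, 0.0],
                [0.0, 0.0, 3.0]])

def energy(x):
    # Sum of (unsquared) distances from x to each point.
    return np.sum(np.linalg.norm(pts - x, axis=1))

# Start from the centroid of the points.
result = minimize(energy, x0=pts.mean(axis=0))
print(result.x)  # approximate geometric median
```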
Example 8.4 (Physical equilibria, adapted from CITE). Suppose we attach an object to a set of springs;
each spring is anchored at point ~xi ∈ R3 and has natural length Li and constant k i . In the absence of
gravity, if our object is located at position ~p ∈ R3 , the network of springs has potential energy
$$E(\vec{p}) = \frac{1}{2}\sum_i k_i \left(\|\vec{p} - \vec{x}_i\|_2 - L_i\right)^2.$$

Equilibria of this system are given by minima of E and reflect points ~p at which the spring forces are all
balanced. Such spring systems are also used to visualize graphs G = (V, E): each vertex in V is assigned a
position, and a spring connects the positions of every pair of vertices joined by an edge in E.
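A small sketch of this energy (the anchor points, rest lengths, and spring constants below are hypothetical) finds an equilibrium by numerical minimization:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical spring network: anchor points, natural lengths, and spring constants.
anchors = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
L = np.array([1.0, 1.0, 1.0])
k = np.array([1.0, 2.0, 1.0])

def potential(p):
    d = np.linalg.norm(anchors - p, axis=1)    # distance to each anchor
    return 0.5 * np.sum(k * (d - L) ** 2)      # spring potential energy

equilibrium = minimize(potential, x0=np.array([0.5, 0.5, 0.5]))
print(equilibrium.x)  # position where the spring forces balance
```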

8.2 Optimality
Before discussing how to minimize or maximize a function, we should be clear what it is we are
looking for; notice that maximizing f is the same as minimizing − f , so the minimization problem
is sufficient for our consideration. For a particular f : Rn → R and ~x ∗ ∈ Rn , we need to derive
optimality conditions that verify that ~x ∗ has the lowest possible value f (~x ∗ ).
Of course, ideally we would like to find global optima of f :

Figure 8.1: A function f ( x ) with multiple optima.

Definition 8.1 (Global minimum). The point ~x ∗ ∈ Rn is a global minimum of f : Rn → R if


f (~x ∗ ) ≤ f (~x ) for all ~x ∈ Rn .
Finding a global minimum of f without any information about the structure of f effectively
requires searching in the dark. For instance, suppose an optimization algorithm identifies the local
minimum near x = −1 in the function in Figure 8.1. It is nearly impossible to realize that there is
a second, lower minimum near x = 1 simply by guessing x values; for all we know, there may
be a third, even lower minimum of f at x = 1000!
Thus, in many cases we satisfy ourselves by finding a local minimum:
Definition 8.2 (Local minimum). The point ~x ∗ ∈ Rn is a local minimum of f : Rn → R if f (~x ∗ ) ≤
f (~x ) for all ~x ∈ Rn satisfying k~x − ~x ∗ k < ε for some ε > 0.
This definition requires that ~x ∗ attains the smallest value in some neighborhood defined by the
radius ε. Notice that local optimization algorithms suffer a severe limitation: they cannot guarantee
that they find the lowest possible value of f, as in Figure 8.1 when the left local minimum is
reached. Many strategies, heuristic and otherwise, are applied to explore the landscape of possible
~x values to gain confidence that a local minimum attains the best possible value.

8.2.1 Differential Optimality


A familiar story from single- and multi-variable calculus is that finding potential minima and
maxima of a function f : Rn → R is more straightforward when f is differentiable. Recall that the
gradient vector ∇ f = (∂ f/∂x1 , . . . , ∂ f/∂xn ) points in the direction in which f increases the most; the
vector −∇ f points in the direction of greatest decrease. One way to see this is to recall that near a
point ~x0 ∈ Rn , f looks like the linear function

f (~x ) ≈ f (~x0 ) + ∇ f (~x0 ) · (~x − ~x0 ).

If we take ~x − ~x0 = α∇ f (~x0 ), then we find:

f (~x0 + α∇ f (~x0 )) ≈ f (~x0 ) + αk∇ f (~x0 )k2

When k∇ f (~x0 )k > 0, the sign of α determines whether f increases or decreases.


It is not difficult to formalize the above argument to show that if ~x0 is a local minimum, then
we must have ∇ f (~x0 ) = ~0. Notice this condition is necessary but not sufficient: maxima and saddle

Figure 8.2: Critical points can take many forms; here we show a local minimum, a saddle point,
and a local maximum.


Figure 8.3: A function with many stationary points.

points also have ∇ f (~x0 ) = ~0, as illustrated in Figure 8.2. Even so, this observation about minima
of differentiable functions yields a common strategy for minimization:
1. Find points ~xi satisfying ∇ f (~xi ) = ~0.

2. Check which of these points is a local minimum as opposed to a maximum or saddle point.
Given their important role in this strategy, we give the points we seek a special name:
Definition 8.3 (Stationary point). A stationary point of f : Rn → R is a point ~x ∈ Rn satisfying
∇ f (~x ) = ~0.
That is, our strategy for minimization can be to find stationary points of f and then eliminate those
that are not minima.
It is important to keep in mind when we can expect our strategies for minimization to succeed.
In most cases, such as those shown in Figure 8.2, the stationary points of f are isolated, meaning we
can write them in a discrete list {~x0 , ~x1 , . . .}. A degenerate case, however, is shown in Figure 8.3;
here, the entire interval [−1/2, 1/2] is composed of stationary points, making it impossible to con-
sider them one at a time. For the most part, we will ignore such issues as degenerate cases, but
will return to them when we consider the conditioning of the minimization problem.
Suppose we identify a point ~x ∈ Rn as a stationary point of f and now wish to check whether it is
a local minimum. If f is twice-differentiable, one strategy we can employ is to write its Hessian
matrix:
$$H_f(\vec{x}) = \begin{pmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{pmatrix}$$

We can add another term to our Taylor expansion of f to see the role of H f :

$$f(\vec{x}) \approx f(\vec{x}_0) + \nabla f(\vec{x}_0) \cdot (\vec{x} - \vec{x}_0) + \frac{1}{2}(\vec{x} - \vec{x}_0)^\top H_f(\vec{x}_0)\,(\vec{x} - \vec{x}_0)$$
If we substitute a stationary point ~x ∗ , then by definition we know:

$$f(\vec{x}) \approx f(\vec{x}^*) + \frac{1}{2}(\vec{x} - \vec{x}^*)^\top H_f(\vec{x}^*)\,(\vec{x} - \vec{x}^*)$$
If H f is positive definite, then this expression shows f (~x ) ≥ f (~x ∗ ), and thus ~x ∗ is a local minimum.
More generally, one of a few situations can occur:

• If H f is positive definite, then ~x ∗ is a local minimum of f .

• If H f is negative definite, then ~x ∗ is a local maximum of f .

• If H f is indefinite, then ~x ∗ is a saddle point of f .

• If H f is not invertible, then oddities such as the function in Figure 8.3 can occur.

Checking if a matrix is positive definite can be accomplished by checking if its Cholesky factor-
ization exists or—more slowly—by checking that all its eigenvalues are positive. Thus, when the
Hessian of f is known we can check stationary points for optimality using the list above; many
optimization algorithms including the ones we will discuss simply ignore the final case and notify
the user, since it is relatively unlikely.
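As a concrete sketch of this classification step (not from the original text), the following helpers examine a given Hessian using its eigenvalues, with a Cholesky-based shortcut for the positive definite case:

```python
import numpy as np

def classify_stationary_point(H, tol=1e-10):
    """Classify a stationary point from its (symmetric) Hessian H."""
    eigvals = np.linalg.eigvalsh(H)           # eigenvalues of a symmetric matrix
    if np.min(np.abs(eigvals)) < tol:
        return "degenerate (H is not invertible)"
    if np.all(eigvals > 0):
        return "local minimum"
    if np.all(eigvals < 0):
        return "local maximum"
    return "saddle point"

def is_positive_definite(H):
    """Faster check: Cholesky factorization succeeds exactly when H is positive definite."""
    try:
        np.linalg.cholesky(H)
        return True
    except np.linalg.LinAlgError:
        return False

print(classify_stationary_point(np.diag([2.0, 3.0])))    # local minimum
print(classify_stationary_point(np.diag([2.0, -3.0])))   # saddle point
print(is_positive_definite(np.diag([2.0, 3.0])))         # True
```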

8.2.2 Optimality via Function Properties


Occasionally, if we know more about f : Rn → R, we can provide optimality conditions that are
stronger or easier to check than the ones above.
One property of f that has strong implications for optimization is convexity, illustrated in Figure NUMBER:

Definition 8.4 (Convex). A function f : Rn → R is convex when for all ~x, ~y ∈ Rn and α ∈ (0, 1) the
following relationship holds:

f ((1 − α)~x + α~y) ≤ (1 − α) f (~x ) + α f (~y).

When the inequality is strict, the function is strictly convex.

Convexity implies that if you connect two points in Rn with a line segment, the values of f along the
segment lie on or below the values obtained by linearly interpolating f between the endpoints.
Convex functions enjoy many strong properties, the most basic of which is the following:

Figure 8.4: A quasiconvex function.

Proposition 8.1. A local minimum of a convex function f : Rn → R is necessarily a global minimum.

Proof. Take ~x to be such a local minimum and suppose there exists ~x∗ ≠ ~x with f(~x∗) < f(~x).
Then, for α ∈ (0, 1),

$$f(\vec{x} + \alpha(\vec{x}^* - \vec{x})) \le (1-\alpha)f(\vec{x}) + \alpha f(\vec{x}^*) \quad \text{by convexity}$$
$$< f(\vec{x}) \quad \text{since } f(\vec{x}^*) < f(\vec{x})$$

But taking α → 0 shows that ~x cannot possibly be a local minimum.

This proposition and related observations show that it is possible to check if you have reached a
global minimum of a convex function simply by applying first-order optimality. Thus, it is valuable
to check by hand if a function being optimized happens to be convex, a situation occurring sur-
prisingly often in scientific computing; one sufficient condition that can be easier to check when f
is twice differentiable is that H f is positive definite everywhere.
Other optimization techniques have guarantees under other assumptions about f . For exam-
ple, one weaker version of convexity is quasi-convexity, which requires that for all ~x, ~y ∈ Rn and α ∈ (0, 1),

$$f((1 - \alpha)\vec{x} + \alpha\vec{y}) \le \max(f(\vec{x}), f(\vec{y})).$$

An example of a quasiconvex function is shown in Figure 8.4; although it does not have the char-
acteristic “bowl” shape of a convex function, it does have a unique optimum.

8.3 One-Dimensional Strategies


As in the last chapter, we will start with one-dimensional optimization of f : R → R and then
expand our strategies to more general functions f : Rn → R.

8.3.1 Newton’s Method


Our principal strategy for minimizing differentiable functions f : Rn → R will be to find sta-
tionary points ~x ∗ satisfying ∇ f (~x ∗ ) = 0. Assuming we can check whether stationary points are
maxima, minima, or saddle points as a post-processing step, we will focus on the problem of
finding the stationary points ~x ∗ .

To this end, suppose f : R → R is differentiable. Then, as in our derivation of Newton’s
method for root-finding, we can approximate:

$$f(x) \approx f(x_k) + f'(x_k)(x - x_k) + \frac{1}{2}f''(x_k)(x - x_k)^2.$$

The approximation on the right-hand side is a parabola whose vertex is located at x_k − f'(x_k)/f''(x_k).
Of course, in reality f is not necessarily a parabola, so Newton's method simply iterates the formula

$$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}.$$
This technique is easily analyzed given the work we have already put into understanding Newton's
method for root-finding in the previous chapter. In particular, an alternative way to derive
the formula above is to apply Newton's root-finding iteration to f'(x), since stationary points satisfy f'(x) = 0.
Thus, in most cases Newton's method for optimization exhibits quadratic convergence, provided
the initial guess x0 is sufficiently close to x∗.
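A minimal sketch of this iteration, assuming we can evaluate f' and f'' (the test function below is an arbitrary choice):

```python
def newton_minimize(fprime, fprime2, x0, tol=1e-10, max_iters=100):
    """Newton's method for 1D optimization: root-finding on f'."""
    x = x0
    for _ in range(max_iters):
        step = fprime(x) / fprime2(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: f(x) = x^4 - 3x^2 + x has a local minimum found from x0 = 1.5.
x_star = newton_minimize(lambda x: 4*x**3 - 6*x + 1,    # f'
                         lambda x: 12*x**2 - 6,         # f''
                         x0=1.5)
print(x_star)
```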
A natural question to ask is whether the secant method can be applied in an analogous way.
Our derivation of Newton's method above finds roots of f', so the secant method could be used to
eliminate the evaluation of f'' but not f'; situations in which we know f' but not f'' are relatively
rare. A more suitable parallel is to replace the line segments used to approximate f in the secant
method with parabolas. This strategy, known as successive parabolic interpolation, also minimizes a
quadratic approximation of f at each iteration, but rather than using f(x_k), f'(x_k), and f''(x_k) to
construct the approximation, it uses f(x_k), f(x_{k−1}), and f(x_{k−2}). The derivation of this technique
is relatively straightforward, and it converges superlinearly.
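A sketch of successive parabolic interpolation follows; the vertex formula is the standard one for the parabola through three samples, and no safeguards against degenerate (collinear or repeated) samples are included:

```python
def parabolic_interpolation_minimize(f, x0, x1, x2, tol=1e-10, max_iters=100):
    """Successive parabolic interpolation: fit a parabola through the three most
    recent samples and move to its vertex."""
    xs = [x0, x1, x2]
    fs = [f(x0), f(x1), f(x2)]
    for _ in range(max_iters):
        a, b, c = xs[-3], xs[-2], xs[-1]
        fa, fb, fc = fs[-3], fs[-2], fs[-1]
        # Vertex of the parabola interpolating (a, fa), (b, fb), (c, fc).
        num = (b - a)**2 * (fb - fc) - (b - c)**2 * (fb - fa)
        den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
        x_new = b - 0.5 * num / den
        xs.append(x_new)
        fs.append(f(x_new))
        if abs(x_new - c) < tol:
            break
    return xs[-1]

# Example: minimize f(x) = (x - 2)^2 + 1 from three nearby starting samples.
print(parabolic_interpolation_minimize(lambda x: (x - 2)**2 + 1, 0.0, 0.5, 1.0))
```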

8.3.2 Golden Section Search


We skipped over bisection in our parallel of single-variable root-finding techniques. There are
many reasons for this omission. Our motivation for bisection was that it employed only the weak-
est assumption on f needed to find roots: continuity. The Intermediate Value Theorem does not
apply to minima in any intuitive way, however, so it appears such a straightforward approach
does not exist.
It is valuable, however, to have at least one minimization strategy available that does not re-
quire differentiability of f as an underlying assumption; after all, there are non-differentiable func-
tions that have clear minima, like f ( x ) ≡ | x | at x = 0. To this end, one alternative assumption
might be that f is unimodal:

Definition 8.5 (Unimodal). A function f : [ a, b] → R is unimodal if there exists x ∗ ∈ [ a, b] such
that f is decreasing for x ∈ [ a, x ∗ ] and increasing for x ∈ [ x ∗ , b].

In other words, a unimodal function decreases for some time, and then begins increasing; no
localized minima are allowed. Notice that functions like | x | are not differentiable but still are
unimodal.
Suppose we have two values x0 and x1 such that a < x0 < x1 < b. We can make two observa-
tions that will help us formulate an optimization technique:

• If f ( x0 ) ≥ f ( x1 ), then we know that f ( x ) ≥ f ( x1 ) for all x ∈ [ a, x0 ]. Thus, the interval [ a, x0 ]


can be discarded in our search for a minimum of f .

• If f ( x1 ) ≥ f ( x0 ), then we know that f ( x ) ≥ f ( x0 ) for all x ∈ [ x1 , b], and thus we can discard
[ x1 , b ].
This structure suggests a potential strategy for minimization beginning with the interval [ a, b] and
iteratively removing pieces according to the rules above.
One important detail remains, however. Our convergence guarantee for the bisection algo-
rithm came from the fact that we could remove half of the interval in question in each iteration.
We could proceed in a similar fashion, removing a third of the interval each time; this requires two
evaluations of f during each iteration at new x0 and x1 locations. If evaluating f is expensive,
however, we may wish to reuse information from previous iterations to avoid at least one of those
two evaluations.
For now, take a = 0 and b = 1; the strategies we derive below will work more generally by shifting
and scaling. In the absence of more information about f , we might as well make a symmetric
choice x0 = α and x1 = 1 − α for some α ∈ (0, 1/2). Suppose our iteration removes the rightmost
interval [ x1 , b]. Then, the search interval becomes [0, 1 − α], and we know f (α) from the previous
iteration. The next iteration will divide [0, 1 − α] such that x0 = α(1 − α) and x1 = (1 − α)². If we
wish to reuse f (α) from the previous iteration, we could set (1 − α)² = α, yielding:
$$\alpha = \frac{1}{2}\left(3 - \sqrt{5}\right), \qquad 1 - \alpha = \frac{1}{2}\left(\sqrt{5} - 1\right).$$
The value of 1 − α ≡ τ above is the reciprocal of the golden ratio! It allows for the reuse of one of the function
evaluations from the previous iteration; a symmetric argument shows that the same choice of α
works if we had removed the left interval instead of the right one.
The golden section search algorithm makes use of this construction (CITE):

1. Take τ ≡ (√5 − 1)/2, and initialize a and b so that f is unimodal on [ a, b].
2. Make an initial subdivision x0 = a + (1 − τ )(b − a) and x1 = a + τ (b − a).
3. Initialize f 0 = f ( x0 ) and f 1 = f ( x1 ).
4. Iterate until b − a is sufficiently small:
(a) If f 0 ≥ f 1 , then remove the interval [ a, x0 ] as follows:
• Move left side: a ← x0
• Reuse previous iteration: x0 ← x1 , f 0 ← f 1
• Generate new sample: x1 ← a + τ (b − a), f 1 ← f ( x1 )
(b) If f 1 > f 0 , then remove the interval [ x1 , b] as follows:
• Move right side: b ← x1
• Reuse previous iteration: x1 ← x0 , f 1 ← f 0
• Generate new sample: x0 ← a + (1 − τ )(b − a), f 0 ← f ( x0 )
This algorithm clearly converges unconditionally and linearly. When f is not globally unimodal,
it can be difficult to find [ a, b] such that f is unimodal on that interval, limiting the applications of
this technique somewhat; generally [ a, b] is guessed by attempting to bracket a local minimum of
f.
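A compact sketch of the algorithm above (assuming f is unimodal on the supplied interval):

```python
import math

def golden_section_search(f, a, b, tol=1e-8):
    """Golden section search; assumes f is unimodal on [a, b]."""
    tau = (math.sqrt(5.0) - 1.0) / 2.0
    x0 = a + (1 - tau) * (b - a)
    x1 = a + tau * (b - a)
    f0, f1 = f(x0), f(x1)
    while b - a > tol:
        if f0 >= f1:
            # Discard [a, x0]; reuse the sample at x1.
            a = x0
            x0, f0 = x1, f1
            x1 = a + tau * (b - a)
            f1 = f(x1)
        else:
            # Discard [x1, b]; reuse the sample at x0.
            b = x1
            x1, f1 = x0, f0
            x0 = a + (1 - tau) * (b - a)
            f0 = f(x0)
    return 0.5 * (a + b)

print(golden_section_search(abs, -1.0, 2.0))  # ≈ 0, the minimum of |x|
```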

8.4 Multivariable Strategies
We continue our parallel with root-finding by expanding the discussion to multivariable problems.
As with root-finding, multivariable problems are considerably more difficult than problems in a
single variable, but they appear so often in practice that they are worth careful consideration.
Here, we will consider only the case that f : Rn → R is differentiable. Optimization methods
in the spirit of golden section search for non-differentiable functions have limited applicability
and are difficult to formulate.

8.4.1 Gradient Descent


Recall from our previous discussion that ∇ f (~x ) points in the direction of “steepest ascent” of f at
~x; similarly, the vector −∇ f (~x ) is the direction of “steepest descent.” If nothing else, this definition
guarantees that when ∇ f (~x ) ≠ ~0, for small α > 0 we must have

f (~x − α∇ f (~x )) ≤ f (~x ).

Suppose our current estimate of the location of the minimum of f is ~xk . Then, we might wish
to choose ~xk+1 so that f (~xk+1 ) < f (~xk ) for an iterative minimization strategy. One way to simplify
the search for ~xk+1 would be to use one of our one-dimensional algorithms from §8.3 on a simpler
problem. In particular, consider the function gk (t) ≡ f (~xk − t∇ f (~xk )), which restricts f to the line
through ~xk parallel to ∇ f (~xk ). Thanks to our discussion of the gradient, we know that small t will
yield a decrease in f .
The gradient descent algorithm iteratively solves these one-dimensional problems to improve
our estimate of ~xk :

1. Choose an initial estimate ~x0

2. Iterate until convergence of ~xk :

(a) Take gk (t) ≡ f (~xk − t∇ f (~xk ))


(b) Use a one-dimensional algorithm to find t∗ minimizing gk over all t ≥ 0 (“line search”)
(c) Take ~xk+1 ≡ ~xk − t∗ ∇ f (~xk )

Each iteration of gradient descent decreases f (~xk ), so the objective values converge. The algorithm
only terminates when ∇ f (~xk ) ≈ ~0, showing that gradient descent approaches a stationary point,
in practice usually a local minimum; convergence is slow for most functions f , however. The line
search can be replaced by a method that merely decreases the objective by a non-negligible, if
suboptimal, amount, although guaranteeing convergence is more difficult in this case.
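A sketch of this procedure follows, using SciPy's bounded scalar minimizer as the line search; the search range for t and the test objective are arbitrary choices made for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gradient_descent(f, grad_f, x0, tol=1e-6, max_iters=1000):
    """Gradient descent with a one-dimensional line search at each step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        # Line search: minimize g_k(t) = f(x - t * grad f(x)) over t >= 0.
        t_star = minimize_scalar(lambda t: f(x - t * g),
                                 bounds=(0.0, 10.0), method="bounded").x
        x = x - t_star * g
    return x

# Example: f(x, y) = (x - 1)^2 + 10 (y + 2)^2.
f = lambda x: (x[0] - 1)**2 + 10 * (x[1] + 2)**2
grad_f = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
print(gradient_descent(f, grad_f, [0.0, 0.0]))  # ≈ [1, -2]
```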

8.4.2 Newton’s Method


Paralleling our derivation of the single-variable case, we can write a Taylor series approximation
of f : Rn → R using its Hessian H f :

$$f(\vec{x}) \approx f(\vec{x}_k) + \nabla f(\vec{x}_k)^\top (\vec{x} - \vec{x}_k) + \frac{1}{2}(\vec{x} - \vec{x}_k)^\top H_f(\vec{x}_k)\,(\vec{x} - \vec{x}_k)$$

Differentiating with respect to ~x and setting the result equal to zero yields the following iterative
scheme:

$$\vec{x}_{k+1} = \vec{x}_k - [H_f(\vec{x}_k)]^{-1}\nabla f(\vec{x}_k)$$
It is easy to double check that this expression is a generalization of that in §8.3.1, and once again it
converges quadratically when ~x0 is near a minimum.
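A minimal sketch of the iteration, solving a linear system at each step rather than forming the inverse explicitly; the test objective below is a hypothetical smooth, strictly convex function chosen so that pure Newton steps converge from the given start:

```python
import numpy as np

def newton_method(grad_f, hess_f, x0, tol=1e-10, max_iters=50):
    """Newton's method for minimization: x_{k+1} = x_k - H_f(x_k)^{-1} grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve H_f(x) * step = g instead of explicitly inverting the Hessian.
        x = x - np.linalg.solve(hess_f(x), g)
    return x

# Example objective: f(x, y) = log(e^x + e^y) + x^2 + y^2, smooth and strictly convex.
def grad_f(v):
    x, y = v
    s = np.exp(x) + np.exp(y)
    return np.array([np.exp(x) / s + 2 * x, np.exp(y) / s + 2 * y])

def hess_f(v):
    x, y = v
    s = np.exp(x) + np.exp(y)
    p = np.exp(x) * np.exp(y) / s**2
    return np.array([[p + 2, -p], [-p, p + 2]])

print(newton_method(grad_f, hess_f, [2.0, -3.0]))  # ≈ [-0.25, -0.25]
```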
Newton’s method can be more efficient than gradient descent depending on the optimization
objective f . Recall that each iteration of gradient descent potentially requires many evaluations of
f during the line search procedure. On the other hand, we must evaluate and invert the Hessian
H f during each iteration of Newton’s method. Notice that these factors do not affect the number of
iterations but do affect runtime: this is a tradeoff that may not be obvious via traditional analysis.
It is intuitive why Newton’s method converges quickly when it is near an optimum. In partic-
ular, gradient descent has no knowledge of H f ; it proceeds analogously to walking downhill by
looking only at your feet. By using H f , Newton’s method has a larger picture of the shape of f
nearby.
When H f is not positive definite, however, the objective locally might look like a saddle or peak
rather than a bowl. In this case, jumping to an approximate stationary point might not make sense.
Thus, adaptive techniques might check if H f is positive definite before applying a Newton step; if
it is not positive definite, the methods can revert to gradient descent to find a better approximation
of the minimum. Alternatively, they can modify H f by, e.g., projecting onto the closest positive
definite matrix.

8.4.3 Optimization without Hessians: BFGS


Newton’s method can be difficult to apply to complicated functions f : Rn → R. The second
derivative of f might be considerably more involved than the form of f , and H f changes with
each iteration, making it difficult to reuse work from previous iterations. Additionally, H f has
size n × n, so storing H f requires O(n2 ) space, which can be unacceptable.
As in our discussion of root-finding, techniques for minimization that imitate Newton’s method
but use approximate derivatives are called quasi-Newton methods. Often they can have similarly
strong convergence properties without the need for explicit re-evaluation and even factorization
of the Hessian at each iteration. In our discussion, we will follow the development of (CITE NO-
CEDAL AND WRIGHT).
Suppose we wish to minimize f : Rn → R using an iterative scheme. Near the current estimate
~xk of the root, we might estimate f with a quadratic model:

$$f(\vec{x}_k + \delta\vec{x}) \approx f(\vec{x}_k) + \nabla f(\vec{x}_k) \cdot \delta\vec{x} + \frac{1}{2}(\delta\vec{x})^\top B_k\,(\delta\vec{x}).$$
Notice that we have asked that our approximation agrees with f to first order at ~xk ; as in Broyden’s
method for root-finding, however, we will allow our estimate of the Hessian Bk to vary.
This quadratic model is minimized by taking δ~x = − Bk−1 ∇ f (~xk ). In case kδ~x k2 is large and we
do not wish to take such a considerable step, we will allow ourselves to scale this difference by a
step size αk , yielding
~xk+1 = ~xk − αk Bk−1 ∇ f (~xk ).
Our goal is to find a reasonable estimate Bk+1 by updating Bk , so that we can repeat this process.

The Hessian of f is nothing more than the derivative of ∇ f , so we can write a secant-style
condition on Bk+1 :
Bk+1 (~xk+1 − ~xk ) = ∇ f (~xk+1 ) − ∇ f (~xk ).
We will substitute ~sk ≡ ~xk+1 − ~xk and ~yk ≡ ∇ f (~xk+1 ) − ∇ f (~xk ), yielding an equivalent condition
Bk+1~sk = ~yk .
Given the optimization at hand, we wish for Bk to have two properties:
• Bk should be a symmetric matrix, like the Hessian H f .

• Bk should be positive (semi-)definite, so that we are seeking minima rather than maxima or
saddle points.
The symmetry condition is enough to eliminate the possibility of using the Broyden estimate we
developed in the previous chapter.
The positive definite constraint implicitly puts a condition on the relationship between ~s_k and
~y_k. In particular, premultiplying the relationship B_{k+1}~s_k = ~y_k by ~s_k^⊤ shows ~s_k^⊤ B_{k+1} ~s_k = ~s_k^⊤ ~y_k. For
B_{k+1} to be positive definite, we must then have ~s_k · ~y_k > 0. This observation can guide our choice
of α_k; it is easy to see that it holds for sufficiently small α_k > 0.
Assume that ~sk and ~yk satisfy our compatibility condition. With this in place, we can write
down a Broyden-style optimization leading to a possible approximation Bk+1 :

$$\begin{aligned} \operatorname{minimize}_{B_{k+1}}\quad & \|B_{k+1} - B_k\| \\ \text{such that}\quad & B_{k+1}^\top = B_{k+1} \\ & B_{k+1}\vec{s}_k = \vec{y}_k \end{aligned}$$

For an appropriate choice of norm ‖ · ‖, this optimization yields the well-known DFP (Davidon-
Fletcher-Powell) iterative scheme.
Rather than work out the details of the DFP scheme, we move on to a more popular method
known as the BFGS (Broyden-Fletcher-Goldfarb-Shanno) formula, which appears in many mod-
ern systems. Notice that, ignoring our choice of α_k for now, our second-order approximation
was minimized by taking δ~x = −B_k^{-1}∇f(~x_k). Thus, in the end the behavior of our iterative scheme
is dictated by the inverse matrix B_k^{-1}. Asking that ‖B_{k+1} − B_k‖ be small can still allow relatively
large differences between the action of B_k^{-1} and that of B_{k+1}^{-1}!
With this observation in mind, the BFGS scheme makes a small alteration to the above derivation.
Rather than computing B_k at each iteration, we can compute its inverse H_k ≡ B_k^{-1} directly.
Now our condition B_{k+1}~s_k = ~y_k gets reversed to ~s_k = H_{k+1}~y_k; the condition that B_k is symmetric is
the same as asking that H_k is symmetric. We solve an optimization

$$\begin{aligned} \operatorname{minimize}_{H_{k+1}}\quad & \|H_{k+1} - H_k\| \\ \text{such that}\quad & H_{k+1}^\top = H_{k+1} \\ & \vec{s}_k = H_{k+1}\vec{y}_k \end{aligned}$$
This construction has the nice side benefit of not requiring matrix inversion to compute δ~x =
− Hk ∇ f (~xk ).
To derive a formula for Hk+1 , we must decide on a matrix norm k · k. As with our previous
discussion, the Frobenius norm looks closest to least-squares optimization, making it likely we can
generate a closed-form expression for Hk+1 rather than having to solve the minimization above as
a subroutine of BFGS optimization.

The Frobenius norm, however, has one serious drawback for Hessian matrices. Recall that the
Hessian matrix has entries (H_f)_{ij} = ∂²f/∂x_i∂x_j. Often the quantities x_i for different i can have different
units; e.g., consider maximizing the profit (in dollars) made by selling a cheeseburger of radius
r (in inches) and price p (in dollars), leading to f : (inches, dollars) → dollars. Squaring these
different quantities and adding them up does not make sense.
Suppose we find a symmetric positive definite matrix W so that W~sk = ~yk ; we will check in
the exercises that such a matrix exists. Such a matrix takes the units of ~sk = ~xk+1 − ~xk to those
of ~y_k = ∇f(~x_{k+1}) − ∇f(~x_k). Taking inspiration from our expression ‖A‖²_Fro = Tr(A^⊤A), we can
define a weighted Frobenius norm of a matrix A as

$$\|A\|_W^2 \equiv \mathrm{Tr}(A^\top W^\top A W)$$

It is straightforward to check that this expression has consistent units when applied to our optimization
for H_{k+1}. When both W and A are symmetric with columns ~w_i and ~a_i, respectively, expanding
the expression above shows:

$$\|A\|_W^2 = \sum_{ij} (\vec{w}_i \cdot \vec{a}_j)(\vec{w}_j \cdot \vec{a}_i).$$

This choice of norm combined with the choice of W yields a particularly clean formula for H_{k+1}
given H_k, ~s_k, and ~y_k:

$$H_{k+1} = (I_{n\times n} - \rho_k\,\vec{s}_k\vec{y}_k^\top)\, H_k\, (I_{n\times n} - \rho_k\,\vec{y}_k\vec{s}_k^\top) + \rho_k\,\vec{s}_k\vec{s}_k^\top,$$

where ρ_k ≡ 1/(~y_k · ~s_k). We show in the Appendix to this chapter how to derive this formula.
The BFGS algorithm avoids the need to compute and invert a Hessian matrix for f , but it still
requires O(n2 ) storage for Hk . A useful variant known as L-BFGS (“Limited-Memory BFGS”)
avoids this issue by keeping a limited history of vectors ~yk and ~sk and applying Hk by expanding
its formula recursively. This approach actually can have better numerical properties despite its
compact use of space; in particular, old vectors ~yk and ~sk may no longer be relevant and should be
ignored.
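To tie the pieces together, here is a sketch of a BFGS iteration built around the update formula above. For simplicity it uses a basic backtracking line search rather than a full Wolfe-condition line search, and it skips the update whenever ~s_k · ~y_k ≤ 0; both choices are simplifications relative to production implementations:

```python
import numpy as np

def bfgs(f, grad_f, x0, tol=1e-8, max_iters=200):
    """Sketch of BFGS: maintain an approximation H_k to the inverse Hessian and
    update it with the rank-two formula from this section."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)                              # initial H_0 = B_0^{-1}
    g = grad_f(x)
    for _ in range(max_iters):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                             # quasi-Newton search direction
        alpha = 1.0                            # backtracking (Armijo) line search
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        x_new = x + alpha * p
        g_new = grad_f(x_new)
        s, y = x_new - x, g_new - g
        sy = s @ y
        if sy > 1e-12:                         # apply the update only when s . y > 0
            rho = 1.0 / sy
            I = np.eye(n)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Example: a smooth nonquadratic objective (hypothetical, for illustration).
f = lambda x: (x[0] - 1)**2 + 5 * (x[1] + 2)**2 + 0.1 * x[0]**4
grad_f = lambda x: np.array([2 * (x[0] - 1) + 0.4 * x[0]**3, 10 * (x[1] + 2)])
print(bfgs(f, grad_f, [0.0, 0.0]))             # ≈ [0.87, -2]
```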

8.5 Problems
List of ideas:
• Derive Gauss-Newton

• Stochastic methods, AdaGrad

• VSCG algorithm

• Wolfe conditions for gradient descent; plug into BFGS

• Sherman-Morrison-Woodbury formula for Bk for BFGS

• Prove BFGS converges; show existence of a matrix W

• (Generalized) reduced gradient algorithm

• Condition number for optimization

Appendix: Derivation of BFGS Update¹
Our optimization for Hk+1 has the following Lagrange multiplier expression (for ease of notation
we take Hk+1 ≡ H and Hk = H ∗ ):

$$\Lambda \equiv \sum_{ij} (\vec{w}_i \cdot (\vec{h}_j - \vec{h}_j^*))(\vec{w}_j \cdot (\vec{h}_i - \vec{h}_i^*)) - \sum_{i<j} \alpha_{ij}(H_{ij} - H_{ji}) - \vec{\lambda}^\top(H\vec{y}_k - \vec{s}_k)$$
$$= \sum_{ij} (\vec{w}_i \cdot (\vec{h}_j - \vec{h}_j^*))(\vec{w}_j \cdot (\vec{h}_i - \vec{h}_i^*)) - \sum_{ij} \alpha_{ij} H_{ij} - \vec{\lambda}^\top(H\vec{y}_k - \vec{s}_k) \qquad \text{if we assume } \alpha_{ij} = -\alpha_{ji}$$

Taking derivatives to find critical points shows (for ~y ≡ ~yk ,~s ≡ ~sk ):

$$\begin{aligned}
0 = \frac{\partial \Lambda}{\partial H_{ij}} &= \sum_\ell 2 w_{i\ell}\,(\vec{w}_j \cdot (\vec{h}_\ell - \vec{h}_\ell^*)) - \alpha_{ij} - \lambda_i y_j \\
&= 2\sum_\ell w_{i\ell}\,(W^\top(H - H^*))_{j\ell} - \alpha_{ij} - \lambda_i y_j \\
&= 2\sum_\ell (W^\top(H - H^*))_{j\ell}\, w_{\ell i} - \alpha_{ij} - \lambda_i y_j && \text{by symmetry of } W \\
&= 2(W^\top(H - H^*)W)_{ji} - \alpha_{ij} - \lambda_i y_j \\
&= 2(W(H - H^*)W)_{ij} - \alpha_{ij} - \lambda_i y_j && \text{by symmetry of } W \text{ and } H
\end{aligned}$$

So, in matrix form we have the following list of facts:

$$0 = 2W(H - H^*)W - A - \vec{\lambda}\vec{y}^\top, \quad \text{where } A_{ij} = \alpha_{ij}$$
$$A^\top = -A, \qquad W^\top = W, \qquad H^\top = H, \qquad (H^*)^\top = H^*$$
$$H\vec{y} = \vec{s}, \qquad W\vec{s} = \vec{y}$$

We can achieve a pair of relationships using transposition combined with symmetry of H and W
and antisymmetry of A:

$$0 = 2W(H - H^*)W - A - \vec{\lambda}\vec{y}^\top$$
$$0 = 2W(H - H^*)W + A - \vec{y}\vec{\lambda}^\top$$
$$\Longrightarrow\; 0 = 4W(H - H^*)W - \vec{\lambda}\vec{y}^\top - \vec{y}\vec{\lambda}^\top$$

Post-multiplying this relationship by ~s shows:

$$\vec{0} = 4(\vec{y} - WH^*\vec{y}) - \vec{\lambda}(\vec{y}\cdot\vec{s}) - \vec{y}(\vec{\lambda}\cdot\vec{s})$$

Now, take the dot product with ~s:

$$0 = 4(\vec{y}\cdot\vec{s}) - 4(\vec{y}^\top H^*\vec{y}) - 2(\vec{y}\cdot\vec{s})(\vec{\lambda}\cdot\vec{s})$$

This shows:
$$\vec{\lambda}\cdot\vec{s} = 2\rho\,\vec{y}^\top(\vec{s} - H^*\vec{y}), \qquad \text{for } \rho \equiv 1/(\vec{y}\cdot\vec{s})$$
¹ Special thanks to Tao Du for debugging several parts of this derivation.

Now, we substitute this into our vector equality:

$$\begin{aligned}
\vec{0} &= 4(\vec{y} - WH^*\vec{y}) - \vec{\lambda}(\vec{y}\cdot\vec{s}) - \vec{y}(\vec{\lambda}\cdot\vec{s}) && \text{from before} \\
&= 4(\vec{y} - WH^*\vec{y}) - \vec{\lambda}(\vec{y}\cdot\vec{s}) - \vec{y}\left[2\rho\,\vec{y}^\top(\vec{s} - H^*\vec{y})\right] && \text{from our simplification} \\
\Longrightarrow \vec{\lambda} &= 4\rho(\vec{y} - WH^*\vec{y}) - 2\rho^2\,\vec{y}^\top(\vec{s} - H^*\vec{y})\,\vec{y}
\end{aligned}$$

Post-multiplying by ~y> shows:

$$\vec{\lambda}\vec{y}^\top = 4\rho(\vec{y} - WH^*\vec{y})\vec{y}^\top - 2\rho^2\,\vec{y}^\top(\vec{s} - H^*\vec{y})\,\vec{y}\vec{y}^\top$$

Taking the transpose,

$$\vec{y}\vec{\lambda}^\top = 4\rho\,\vec{y}(\vec{y}^\top - \vec{y}^\top H^* W) - 2\rho^2\,\vec{y}^\top(\vec{s} - H^*\vec{y})\,\vec{y}\vec{y}^\top$$

Combining these results and dividing by four shows:

$$\frac{1}{4}(\vec{\lambda}\vec{y}^\top + \vec{y}\vec{\lambda}^\top) = \rho(2\vec{y}\vec{y}^\top - WH^*\vec{y}\vec{y}^\top - \vec{y}\vec{y}^\top H^* W) - \rho^2\,\vec{y}^\top(\vec{s} - H^*\vec{y})\,\vec{y}\vec{y}^\top$$
Now, we will pre- and post-multiply by W^{-1}. Since W~s = ~y, we can equivalently write ~s = W^{-1}~y;
furthermore, by symmetry of W we then know ~y^⊤ W^{-1} = ~s^⊤. Applying these identities to the
expression above shows:

$$\begin{aligned}
\frac{1}{4}W^{-1}(\vec{\lambda}\vec{y}^\top + \vec{y}\vec{\lambda}^\top)W^{-1} &= 2\rho\,\vec{s}\vec{s}^\top - \rho H^*\vec{y}\vec{s}^\top - \rho\,\vec{s}\vec{y}^\top H^* - \rho^2(\vec{y}^\top\vec{s})\,\vec{s}\vec{s}^\top + \rho^2(\vec{y}^\top H^*\vec{y})\,\vec{s}\vec{s}^\top \\
&= 2\rho\,\vec{s}\vec{s}^\top - \rho H^*\vec{y}\vec{s}^\top - \rho\,\vec{s}\vec{y}^\top H^* - \rho\,\vec{s}\vec{s}^\top + \rho^2(\vec{y}^\top H^*\vec{y})\,\vec{s}\vec{s}^\top && \text{by definition of } \rho \\
&= \rho\,\vec{s}\vec{s}^\top - \rho H^*\vec{y}\vec{s}^\top - \rho\,\vec{s}\vec{y}^\top H^* + \rho^2(\vec{y}^\top H^*\vec{y})\,\vec{s}\vec{s}^\top
\end{aligned}$$

Finally, we can conclude our derivation of the BFGS step as follows:

$$\begin{aligned}
0 &= 4W(H - H^*)W - \vec{\lambda}\vec{y}^\top - \vec{y}\vec{\lambda}^\top && \text{from before} \\
\Longrightarrow H &= \frac{1}{4}W^{-1}(\vec{\lambda}\vec{y}^\top + \vec{y}\vec{\lambda}^\top)W^{-1} + H^* \\
&= \rho\,\vec{s}\vec{s}^\top - \rho H^*\vec{y}\vec{s}^\top - \rho\,\vec{s}\vec{y}^\top H^* + \rho^2(\vec{y}^\top H^*\vec{y})\,\vec{s}\vec{s}^\top + H^* && \text{from the last paragraph} \\
&= H^*(I - \rho\,\vec{y}\vec{s}^\top) + \rho\,\vec{s}\vec{s}^\top - \rho\,\vec{s}\vec{y}^\top H^* + (\rho\,\vec{s}\vec{y}^\top)H^*(\rho\,\vec{y}\vec{s}^\top) \\
&= H^*(I - \rho\,\vec{y}\vec{s}^\top) + \rho\,\vec{s}\vec{s}^\top - \rho\,\vec{s}\vec{y}^\top H^*(I - \rho\,\vec{y}\vec{s}^\top) \\
&= \rho\,\vec{s}\vec{s}^\top + (I - \rho\,\vec{s}\vec{y}^\top)H^*(I - \rho\,\vec{y}\vec{s}^\top)
\end{aligned}$$

This final expression is exactly the BFGS step introduced in the chapter.
