Bms Basic NLP 120609
Nonlinear Optimization
Part I: Unconstrained and box-constrained problems
Humboldt-University of Berlin
Department of Mathematics
John-von-Neumann Haus, Room 2.426
Rudower Chaussee 25, Berlin-Adlershof
[email protected]
Contents
Acknowledgement i
Chapter 1. Introduction 1
1. Preface and motivation 1
2. Notions of solutions 3
4. Global convergence 61
4.1. Trust-Region method 61
4.2. Global convergence of the trust region algorithm 64
4.2.1. Superlinear convergence 68
Chapter 8. Quasi-Newton methods 73
1. Update rules 73
2. Local convergence theory 76
3. Global convergence 81
4. Numerical aspects 81
4.1. Memory-efficient updating 81
4.2. Positive definiteness 82
5. Further Quasi-Newton formulae 82
Chapter 9. Box-constrained problems 85
1. Necessary conditions 85
2. Sufficient conditions 87
3. Projected gradient method 89
4. Superlinearly convergent methods 93
4.1. Projected Newton method 95
Bibliography 97
Acknowledgement
These lecture notes grew out of several courses which I held at the Karl-Franzens University
of Graz and the University of the Philippines in Manila, respectively.
For the careful typesetting of all the proofs in my original manuscript (and for the tedious task
of deciphering my hand-writing) I would like to express my sincere thanks to Mag. Cornelia
Kulmer. Mag. Ian Kopacka was invaluable for the input he gave and for tracing typos in an
earlier version of the script.
These lecture notes are largely based on the monographs listed in the bibliography.
CHAPTER 1
Introduction
Let us assume that at time t = 0 the displacement is y(0) = y0 and furthermore ẏ(0) = 0,
then the following initial conditions hold true:
(1.5) y(0) = y0 , ẏ(0) = 0
In the following we will concentrate on the time interval [0, T]. Let {y^j}, j = 1, …, N, be measurements of the spring's deviation at the time instances t_j = (j − 1)T/(N − 1). The objective is to determine
the spring constant k and the damping factor c with the help of measurements.
Let x = (c, k)> . To emphasize the dependence of y(t) on x, we also write y(x; t). Following
the motivation at the beginning of the example, we try to solve the following unconstrained
non-linear minimization problem:
(1.6)    min_{x ∈ R²} f(x) := (1/2) ∑_{j=1}^{N} |y(x; t_j) − y^j|².
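The least-squares fit (1.6) can be sketched numerically. The sketch below assumes the oscillator model ÿ + cẏ + ky = 0 with unit mass and an underdamped closed-form solution; the function names and the sample values of c, k, T, N are illustrative assumptions, not part of the original example.

```python
import numpy as np

def y_model(x, t, y0=1.0):
    """Closed-form displacement of the damped oscillator
    y'' + c y' + k y = 0, y(0) = y0, y'(0) = 0 (underdamped case).
    x = (c, k); the mass is normalized to 1 (an assumption of this sketch)."""
    c, k = x
    gamma = c / 2.0
    omega = np.sqrt(k - gamma**2)          # requires k > c^2 / 4
    return np.exp(-gamma * t) * (np.cos(omega * t)
                                 + (gamma / omega) * np.sin(omega * t)) * y0

def f(x, t, y_meas):
    """Least-squares objective (1.6)."""
    r = y_model(x, t) - y_meas
    return 0.5 * np.sum(r**2)

# synthetic experiment: "true" parameters c = 0.3, k = 2.0, N = 50 samples on [0, T]
T, N = 10.0, 50
t = np.linspace(0.0, T, N)                  # t_j = (j-1) T / (N-1)
y_meas = y_model((0.3, 2.0), t)

print(f((0.3, 2.0), t, y_meas))             # vanishes at the true parameters
print(f((0.5, 1.5), t, y_meas) > 0)         # any other guess gives f > 0
```

A minimizer of f then recovers (c, k) from the measurements, which is exactly the task posed above.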
If X is a set of discrete points, one refers to (1.1) as a discrete (or combinatorial) optimization
problem; otherwise the optimization problem is called continuous.
Occasionally, f is not differentiable; in this case we call (1.1) a nondifferentiable optimization problem (this would also be the case if one of the c_i's in X₁ or X₂ were non-differentiable, even if f itself were differentiable).
Remark 1.3. For instance, replacing the objective function f (x) in example 1 by
    g(x) = ∑_{j=1}^{N} |y(x; t_j) − y^j|,
we obtain a nondifferentiable optimization problem.
2. Notions of solutions
In the following definition we introduce our basic notions of optimality.
Definition 1.1. Let f : X → R with X ⊂ Rn . The point x∗ ∈ X is called a
(i) (strict) global minimizer of f (on X), if and only if
f (x∗ ) ≤ f (x) (f (x∗ ) < f (x)) for all x ∈ X \ {x∗ }.
The optimal objective value f (x∗ ) is called a (strict) global minimum;
(ii) (strict) local minimizer of f (on X), if there exists a neighborhood U of x∗ such that
f (x∗ ) ≤ f (x) (f (x∗ ) < f (x)) for all x ∈ (X ∩ U ) \ {x∗ },
The optimal objective value f (x∗ ) is called a (strict) local minimum.
Remark 1.4. The point x∗ is a ((strict) global, (strict) local) maximizer of f (on X), if and
only if x∗ is a ((strict) global, (strict) local) minimizer of −f (on X).
In the following the gradient of f at x is denoted by

    ∇f(x) = (∂f/∂x₁(x), …, ∂f/∂xₙ(x))^T.
Definition 1.2. Let X ⊂ Rn be an open set and f : X → R be a continuously differentiable
function. The point x∗ ∈ X is called a stationary point of f , if
∇f (x∗ ) = 0
holds true.
CHAPTER 2
Optimality conditions
This chapter deals with necessary and sufficient conditions for characterizing minimizers
(under certain differentiability assumptions on f ).
1. General case
If f does not possess any structure nor properties apart from differentiability, then we can
only make statements about local minimizers, in general.
Theorem 2.1. Let X ⊂ R^n be an open set and f : X → R a continuously differentiable function.
If x∗ ∈ X is a local minimizer of f (on X), then
(2.1) ∇f (x∗ ) = 0,
i.e., x∗ is a stationary point.
Proof. We prove the statement by contradiction. Let us assume that x∗ is a local minimizer for which ∇f(x∗) = 0 does not hold, i.e., ∇f(x∗) ≠ 0. Then there exists d ∈ R^n with ∇f(x∗)^T d < 0 (choose for instance d = −∇f(x∗)).
By assumption, f is continuously differentiable. Consequently, the directional derivative of f
at x∗ in direction d exists:
    f′(x∗; d) = lim_{α↓0} (f(x∗ + αd) − f(x∗))/α = ∇f(x∗)^T d < 0.
Due to the continuity of the derivative there exists ᾱ > 0 satisfying x∗ + αd ∈ X (X is open)
and
    (f(x∗ + αd) − f(x∗))/α < 0
for all 0 < α ≤ ᾱ. Therefore, it holds that
f (x∗ + αd) < f (x∗ ) ∀0 < α ≤ ᾱ,
which contradicts the assumption that x∗ is a local minimizer of f in X.
Remark 2.1. (1) Since Theorem 2.1 only uses first order derivatives and assumes x∗ to
be a (local) minimizer, it specifies a first order necessary condition.
(2) The condition ∇f(x∗) = 0 is not sufficient for a local minimum; consider, e.g., f(x) = −x² with x∗ = 0.
As preparation for the next Theorem 2.2 we need the following lemma about the continuity
of the smallest eigenvalue of a matrix.
Lemma 2.1. Let S^n be the vector space of symmetric n × n matrices. For A ∈ S^n let λ(A) ∈ R be the smallest eigenvalue of A. Then the following estimate holds true:

    |λ(A) − λ(B)| ≤ ‖A − B‖ for all A, B ∈ S^n.
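Lemma 2.1 lends itself to a quick numerical sanity check. The sketch below assumes the spectral norm (the matrix norm induced by the Euclidean vector norm); matrix size and sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def lam_min(A):
    # smallest eigenvalue of a symmetric matrix
    return np.linalg.eigvalsh(A).min()

# random symmetric test pairs
for _ in range(100):
    M, K = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
    A, B = (M + M.T) / 2, (K + K.T) / 2
    # spectral norm of A - B
    gap = np.linalg.norm(A - B, 2)
    assert abs(lam_min(A) - lam_min(B)) <= gap + 1e-12
print("Lemma 2.1 verified on 100 random symmetric pairs")
```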
6 BMS Basic Course: Nonlinear Optimization
Note that the vector norm and the matrix norm are denoted by the same symbol, i.e. ‖·‖.
If f is twice continuously differentiable it follows from Lemma 2.1 and from the continuity
of ∇2 f ∈ Rn×n (the Hessian of f ), that ∇2 f (x) is positive definite in a neighborhood of
x∗ if ∇2 f (x∗ ) is positive definite. An analogous statement holds true if ∇2 f (x∗ ) is negative
definite.
Theorem 2.2. Let X ⊂ Rn be open and f : X → R be twice continuously differentiable.
If x∗ ∈ X is a local minimizer of f (on X), then ∇f (x∗ ) = 0 and the Hessian ∇2 f (x∗ ) is
positive semi-definite.
Proof. The statement that ∇f (x∗ ) = 0 holds true, follows from Theorem 2.1. Therefore,
we only have to consider the positive-semi-definiteness of the Hessian of f at x∗ . Again we
prove the statement by means of contradiction. Let us assume that x∗ is a local minimizer of
f , but ∇2 f (x∗ ) is not positive semi-definite. Then there exists d ∈ Rn such that
(2.2)    d^T ∇²f(x∗) d < 0.
Applying Taylor’s theorem, we obtain for sufficiently small α > 0:
(2.3)    f(x∗ + αd) = f(x∗) + α∇f(x∗)^T d + (α²/2) d^T ∇²f(ξ(α)) d = f(x∗) + (α²/2) d^T ∇²f(ξ(α)) d,
where we used ∇f (x∗ ) = 0 and the existence of ϑ = ϑ(α) ∈ (0, 1) with ξ(α) = x∗ + ϑαd ∈ X.
By Lemma 2.1 and (2.2) there exists ᾱ > 0, such that
    d^T ∇²f(ξ(α)) d < 0  ∀ 0 < α ≤ ᾱ.
Now, (2.3) yields
f (x∗ + αd) < f (x∗ ) ∀0 < α ≤ ᾱ,
which contradicts the assumption that x∗ is a local minimizer of f on X.
Remark 2.2. (1) The conditions of Theorem 2.1 and Theorem 2.2 are not sufficient for
local minimality; consider, e.g., f(x) = x₁² − x₂⁴ with x∗ = (0, 0)^T.
(2) As Theorem 2.2 involves second order derivatives and assumes x∗ to be a (local)
minimizer, it defines second order necessary conditions.
The subsequent theorem specifies second order sufficient conditions: If conditions (a) and (b)
of Theorem 2.3 are satisfied at the point x∗ , then x∗ is a strict local minimizer of f on X.
Theorem 2.3. Let X ⊂ Rn be open and f : X → R twice continuously differentiable. If
(a) ∇f (x∗ ) = 0 and
(b) ∇2 f (x∗ ) is positive definite,
then x∗ is a strict local minimizer of f (on X).
Proof. Assumption (b) ensures that λ(∇2 f (x∗ )) > 0, i.e., the smallest eigenvalue of the
Hessian of f in x∗ is positive. Therefore it holds:
    d^T ∇²f(x∗) d ≥ µ d^T d = µ‖d‖²  ∀ d ∈ R^n,
for 0 < µ ≤ λ(∇2 f (x∗ )). From Taylor’s Theorem we obtain for all d sufficiently close to 0:
x∗ + d ∈ X and
    f(x∗ + d) = f(x∗) + ∇f(x∗)^T d + (1/2) d^T ∇²f(ξ(d)) d
Prof. Dr. Michael Hintermüller 7
with ξ(d) = x∗ + ϑd for ϑ = ϑ(d) ∈ (0, 1). Applying (a) and the Cauchy-Schwarz inequality,
we obtain
    f(x∗ + d) = f(x∗) + (1/2) d^T ∇²f(x∗) d + (1/2) d^T (∇²f(ξ(d)) − ∇²f(x∗)) d
             ≥ f(x∗) + (1/2)(µ − ‖∇²f(ξ(d)) − ∇²f(x∗)‖) ‖d‖².
Given that ∇²f is continuous, we are able to choose d small enough such that ‖∇²f(ξ(d)) − ∇²f(x∗)‖ ≤ µ/2 holds true. Thus,

    f(x∗ + d) ≥ f(x∗) + (µ/4)‖d‖² > f(x∗)

for all sufficiently small d ∈ R^n with d ≠ 0. Hence x∗ is a strict local minimizer of f on X.
Remark 2.3. (1) Conditions (a) and (b) in Theorem 2.3 are not necessary for the local
minimality of x∗; consider f(x) = x₁² + x₂⁴ with x∗ = (0, 0)^T. To some extent there
is a “gap” between necessary and sufficient conditions.
(2) Given (a) of Theorem 2.3 in the case of an indefinite Hessian ∇2 f (x∗ ), we refer to
x∗ as a saddle point.
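The conditions of Theorems 2.1–2.3 and Remark 2.3 suggest a simple second-order classification test. The following is an illustrative sketch; the function name and the tolerance are assumptions, and the "inconclusive" branch corresponds exactly to the "gap" between necessary and sufficient conditions described above.

```python
import numpy as np

def classify_stationary_point(grad, hess, x, tol=1e-8):
    """Second-order test at a candidate x (Theorems 2.1-2.3, Remark 2.3).
    Returns 'not stationary', 'strict local min', 'saddle point', or
    'inconclusive' (semi-definite Hessian: the necessary condition holds,
    the sufficient one does not)."""
    if np.linalg.norm(grad(x)) > tol:
        return "not stationary"
    lam = np.linalg.eigvalsh(hess(x))
    if lam.min() > tol:
        return "strict local min"        # (a) + (b) of Theorem 2.3
    if lam.min() < -tol and lam.max() > tol:
        return "saddle point"            # indefinite Hessian, Remark 2.3 (2)
    return "inconclusive"

# f(x) = x1^2 - x2^4: the gradient vanishes at 0 and the Hessian diag(2, 0)
# is positive semi-definite, yet 0 is no local minimizer -- the test is
# rightly inconclusive, as in Remark 2.2 (1).
g = lambda x: np.array([2 * x[0], -4 * x[1]**3])
H = lambda x: np.diag([2.0, -12 * x[1]**2])
print(classify_stationary_point(g, H, np.zeros(2)))   # inconclusive
```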
2. Convex functions
Convex functions are of particular importance for optimization. For a convex function f
we are able to show that the first order necessary conditions are also sufficient for local
optimality (see Theorem 2.6) . In the following we will introduce procedures that approximate
a complicated non-linear minimization problem by a sequence of convex problems. Apart
from global properties, these convex problems offer a simple way of computing solutions or
approximations.
Definition 2.1. (1) A set X ⊂ Rn is called convex, if for all x, y ∈ X and all λ ∈ (0, 1)
λx + (1 − λ)y ∈ X,
i.e. the segment [x, y] lies completely in X.
(2) Let X ⊂ R^n be convex. A function f : X → R is called
(i) (strictly) convex (in X), if for all x, y ∈ X and for all λ ∈ (0, 1) the following
holds true:
    f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)
    (f(λx + (1 − λ)y) < λf(x) + (1 − λ)f(y) for x ≠ y).
Geometrically, the (strict) convexity of f means that the line segment between
f (x) and f (y) is located (strictly) above the graph of f .
(ii) uniformly convex (on X), if there exists µ > 0 with
    f(λx + (1 − λ)y) + µλ(1 − λ)‖x − y‖² ≤ λf(x) + (1 − λ)f(y)
for all x, y ∈ X and all λ ∈ (0, 1). (In that case, f is sometimes called uniformly convex with modulus µ.)
By definition, every uniformly convex function is also strictly convex and every strictly convex
function is also convex. The converse is not true in general!
that f is strictly convex. Since a strictly convex function is also convex, we already have the
result from (a). For z := ½(x + y), the inequality of (2.4) yields

    ∇f(y)^T (x − y) = 2∇f(y)^T (z − y) ≤ 2(f(z) − f(y)).
If x 6= y holds true then the strict convexity of f implies that 2f (z) < f (x) + f (y). Thanks
to the relation above, we deduce that
    ∇f(y)^T (x − y) < f(x) − f(y),
which corresponds to (2.5).
Taking into account the quadratic term, the proof of (c) is completely analogous to (a).
Now we provide a characterization of twice continuously differentiable (strictly, uniformly)
convex functions, enabling us to read off the convexity qualities of f from the definiteness of
the Hessian of f .
Theorem 2.4. Let X ⊂ R^n be an open, convex set and f : X → R be twice continuously differentiable. Then the following statements hold true:
(a) f is convex (on X) if and only if ∇2 f (x) is positive semi-definite for all x ∈ X .
(b) If ∇2 f (x) is positive definite for all x ∈ X, then f is strictly convex (on X).
(c) f is uniformly convex (on X) if and only if ∇2 f (x) is uniformly positive definite on
X, i.e., if there exists µ > 0 such that
    d^T ∇²f(x) d ≥ µ‖d‖²

for all x ∈ X and all d ∈ R^n.
Proof. We start with (a) and assume that f is convex (on X). Due to the assumption
that f is twice continuously differentiable, the application of Taylor’s Theorem yields the
following equation:
(2.7)    f(y) = f(x) + ∇f(x)^T (y − x) + (1/2)(y − x)^T ∇²f(x)(y − x) + r(y − x)
for all y ∈ X sufficiently close to x. The remainder term has the following property: r(y − x)/‖y − x‖² → 0 for y → x. Now we choose y = x + αd, where d ∈ R^n is arbitrary and α > 0 is sufficiently small. Lemma 2.2 (a) yields
    0 ≤ (α²/2) d^T ∇²f(x) d + r(αd).
Dividing by α²/2 and considering the limit for α ↓ 0 we obtain

    0 ≤ d^T ∇²f(x) d.
Since x ∈ X and d ∈ Rn were chosen arbitrarily, the statement holds true. Conversely: Given
that f is twice continuously differentiable with ∇2 f (x) positive semi-definite for all x ∈ X,
the subsequent equation follows from Taylor’s theorem and the mean value theorem
(2.8)    f(y) = f(x) + ∇f(x)^T (y − x) + (1/2) ∫₀¹ (y − x)^T ∇²f(x + τ(y − x))(y − x) dτ.
The positive semi-definiteness of ∇2 f yields
    f(y) ≥ f(x) + ∇f(x)^T (y − x)
for y, x ∈ X. Then the convexity of f on X follows from Theorem 2.2 (a).
The proof of (b) works analogously to the second part of proof (a). Now let us turn to the
verification of (c). We assume that f is uniformly convex. Then, analogously to (a), we
obtain (2.7). Theorem 2.2 (c) with y = x + αd, where d ∈ Rn and α > 0 is sufficiently small,
provides
    µα²‖d‖² ≤ (α²/2) d^T ∇²f(x) d + r(αd).
Dividing by α² and considering the limit for α ↓ 0 gives

    µ‖d‖² ≤ (1/2) d^T ∇²f(x) d

for an arbitrary d ∈ R^n, which proves the assertion. Conversely: Let ∇²f be uniformly positive definite (with modulus µ > 0). Then the assertion follows from relation (2.8),
    ∫₀¹ (y − x)^T ∇²f(x + τ(y − x))(y − x) dτ ≥ µ‖x − y‖²
and Theorem 2.2 (c).
Note that statement (b) of Theorem 2.4 cannot be reversed in general; consider, e.g., f(x) = x⁴ on R.
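The counterexample f(x) = x⁴ can be checked directly: f″(x) = 12x² vanishes at x = 0, so the Hessian is not positive definite everywhere, yet the strict convexity inequality of Definition 2.1 holds. The random sampling below is an illustrative sketch (ranges and tolerances are arbitrary choices).

```python
import numpy as np

# f(x) = x^4: f''(0) = 0, so Theorem 2.4 (b) does not apply, yet f is
# strictly convex, i.e. f(lx + (1-l)y) < l f(x) + (1-l) f(y) for x != y.
f = lambda x: x**4

rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = rng.uniform(-2, 2, size=2)
    lam = rng.uniform(0.01, 0.99)
    if abs(x - y) > 0.1:   # keep the points well separated for a robust check
        assert f(lam * x + (1 - lam) * y) < lam * f(x) + (1 - lam) * f(y)
print("x^4 satisfies the strict convexity inequality although f''(0) = 0")
```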
The following lemma deals with the level sets of uniformly convex functions. In the con-
text of general continuously differentiable functions, the statement of Lemma 2.3 is of local
importance, i.e., in a neighborhood of the local minimizer x∗ of f (in X).
Lemma 2.3. Let f : Rn → R be continuously differentiable and x0 ∈ Rn arbitrary. Further
assume that the level set
L(x0 ) := {x ∈ Rn : f (x) ≤ f (x0 )}
is convex and that f is uniformly convex in L(x0 ). Then the set L(x0 ) is compact.
Proof. First note that due to our construction L(x₀) ≠ ∅ holds true. Let x ∈ L(x₀). From the uniform convexity of f in L(x₀) we obtain (with λ = ½)

    (µ/4)‖x − x₀‖² + f(½(x + x₀)) ≤ ½(f(x) + f(x₀)),

where µ > 0. With the aid of Theorem 2.2 (a) we get the following estimate:

    (µ/4)‖x − x₀‖² ≤ ½(f(x) − f(x₀)) − (f(½(x + x₀)) − f(x₀))
                   ≤ −(f(½(x + x₀)) − f(x₀))
                   ≤ −½∇f(x₀)^T (x − x₀) ≤ ½‖∇f(x₀)‖ ‖x − x₀‖.
From this we infer

    ‖x − x₀‖ ≤ (2/µ)‖∇f(x₀)‖  ∀ x ∈ L(x₀),
or, in other words, the boundedness of L(x₀). Given that f is continuous, we find that L(x₀) is closed. Closedness and boundedness yield the compactness of L(x₀), which ends the proof.
Now we have gathered all ingredients for proving the following theorem, which illustrates why (strictly, uniformly) convex functions are of fundamental importance in optimization.
The central result of this section is given in Theorem 2.6. It proves that the necessary
condition ∇f (x∗ ) = 0 is also sufficient for x∗ being a global minimizer of the convex function
f.
Theorem 2.6. Let f : Rn → R be a continuously differentiable and convex function, and let
x∗ ∈ Rn be a stationary point of f . Then x∗ is a global minimizer of f in Rn .
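Theorem 2.6 can be illustrated on a convex quadratic, whose unique stationary point is obtained by solving a linear system. The matrix and vector below are arbitrary sample data, chosen so that A is symmetric positive definite.

```python
import numpy as np

# Convex quadratic f(x) = 0.5 x'Ax - b'x with A symmetric positive definite.
# Its stationary point solves Ax = b; by Theorem 2.6 it is the global
# minimizer of f on all of R^n.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

x_star = np.linalg.solve(A, b)
assert np.linalg.norm(grad(x_star)) < 1e-12     # x_star is stationary

# spot-check global optimality against random competitors
rng = np.random.default_rng(2)
pts = rng.standard_normal((1000, 2)) * 10
assert all(f(x_star) <= f(p) for p in pts)
print("the stationary point of the convex quadratic is its global minimizer")
```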
In general, only exceptional cases allow the explicit calculation of (local) solutions of the
minimization problem
(3.1) min f (x), x ∈ Rn .
In practice, iterative methods are applied for computing approximate (local) minimizers.
After a convergence analysis, these methods are normally represented in algorithmic form
and implemented on a computer.
For this reason, we now consider descent methods for finding solutions of problem (3.1), in
which f : Rn → R is a continuously differentiable function. The fundamental idea of the
methods in this chapter is as follows:
(i) At a point x ∈ Rn , one chooses a direction d ∈ Rn in which the function value
decreases (descent method).
(ii) Starting at x, one proceeds along this direction d as long as the function value of f
reduces sufficiently (step size strategy).
These steps will be formalized.
Definition 3.1. Let f : Rn → R and x ∈ Rn . The vector d ∈ Rn is called a descent direction
of f at x, if there exists an ᾱ > 0 such that
f (x + αd) < f (x) for all α ∈ (0, ᾱ].
Let us assume that f is continuously differentiable at x ∈ Rn . Then
(3.2) ∇f (x)> d < 0
is sufficient to show that d ∈ Rn is a descent direction of f in x. To see this, we define
ϕ(α) := f (x + αd). The continuous differentiability of f implies
(3.3) ϕ(α) = ϕ(0) + αϕ0 (0) + r(α),
where r(α)/α → 0 for α → 0+ . We have
    ϕ(0) = f(x) and ϕ′(0) = ∇f(x)^T d.
Transforming (3.3) and dividing by α yields

    (ϕ(α) − ϕ(0))/α = ∇f(x)^T d + r(α)/α.
Since r(α)/α → 0 for α → 0+ and ∇f (x)> d < 0 (by assumption (3.2)), the existence of an
ᾱ > 0 from Definition 3.1 is proven. Thus, d is a descent direction of f in x.
Remark 3.1. (1) Condition (3.2) indicates that the angle between d and the negative gradient of f at x is less than 90°.
(2) The criterion (3.2) is not necessary for d to be a descent direction of f at x. Consider,
for instance, the case where x is a strict local maximizer. Then all directions d ∈ Rn
would be descent directions of f in x, but (3.2) does not hold.
Examples of possible descent directions d = d(x) are:

    d = −∇f(x) (direction of steepest descent),
    d = −M∇f(x) with M ∈ S^n positive definite (gradient-related descent direction).
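Both choices satisfy the sufficient condition (3.2), and thus yield descent for small steps. The sketch below checks this on a Rosenbrock-type test function; the function, the starting point, and the particular matrix M are illustrative assumptions.

```python
import numpy as np

# Rosenbrock test function (a common choice for such demonstrations)
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])

x = np.array([-1.2, 1.0])
g = grad(x)

# the two gradient-related choices from the text
d_steepest = -g                       # steepest descent
M = np.diag([0.5, 2.0])               # some symmetric positive definite M
d_related = -M @ g

for d in (d_steepest, d_related):
    assert g @ d < 0                  # sufficient condition (3.2)
    alpha = 1e-6                      # small step along d
    assert f(x + alpha * d) < f(x)    # the function value indeed decreases
print("both directions satisfy (3.2) and yield descent for small alpha")
```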
1. Globally convergent descent methods
Now let us consider a general descent method. For the time being, we do neither specify the
exact choice of the descent direction nor the conditions on the step size along this direction.
Below, in Lemma 3.1 we introduce abstract conditions which ensure the convergence (with its
meaning still to be made precise) of the subsequent algorithm. In the following paragraphs
we then specify methods to determine an appropriate step size, and furthermore we address
the choice of the descent direction.
(b)

    f(x^k + α_k d^k) ≤ f(x^k) − Θ₂ (∇f(x^k)^T d^k / ‖d^k‖)²   (sufficient decrease)

with α_k > 0 for all k ∈ N.
Then every accumulation point of the sequence {xk } is a stationary point of f .
Proof. According to the assumption, every step size αk fulfills the sufficient decrease
condition (b). Using (a) in (b) results in
(3.4)    f(x^{k+1}) ≤ f(x^k) − Θ₁²Θ₂ ‖∇f(x^k)‖².
The relation (3.4) ensures that the sequence of function values {f(x^k)} decreases monotonically. Let x∗ be an accumulation point of the sequence {x^k}. Due to the continuity of f, {f(x^k)} converges to f(x∗) along a subsequence. The monotonicity ensures that {f(x^k)} itself converges to f(x∗); in particular it holds true:

    f(x^{k+1}) − f(x^k) → 0 for k → ∞.
The inequality (3.4) implies that
k∇f (xk )k → 0 for k → ∞.
Thus, every accumulation point x∗ of {xk } is a stationary point of f .
Remark 3.2. If ηk denotes the angle between dk and −∇f (xk ), then part (a) of Lemma 3.1
means that
    cos(η_k) = −∇f(x^k)^T d^k / (‖∇f(x^k)‖ ‖d^k‖)
is bounded away from 0; in other words, the angle is uniformly smaller than 90°. A famous example of a descent direction fulfilling the angle condition from Theorem 3.1 is d^k = −∇f(x^k), the direction of steepest descent of f at x^k.
Additionally assuming that f is uniformly convex on the convex level set L(x0 ), it is possible
to substitute the angle condition of Lemma 3.1 by the weaker Zoutendijk condition, i.e.

    ∑_{k=0}^∞ δ_k = ∞  with  δ_k = (∇f(x^k)^T d^k / (‖∇f(x^k)‖ ‖d^k‖))².

The Zoutendijk condition ensures that the angle between d^k and −∇f(x^k) tends sufficiently slowly to 90°.
method. In the following, we consider three important representatives of practicable step size strategies, all of which settle for an approximate minimization of f(x^k + αd^k) w.r.t. α > 0.
2.1. Armijo rule. The subsequent strategy does not fit directly into the framework of Theorem 3.1, because the sufficient decrease condition may be violated. However, it allows us to point out essential aspects of an approximate minimization of f(x^k + αd^k). In addition, the Armijo rule is an important building block of alternative step size strategies which do fulfill the sufficient decrease condition.
In order to keep the subsequent explanations concrete, we consider gradient-related descent directions of the form

    d = d(x) = −M∇f(x),  M ∈ S^n positive definite.
It has to be mentioned that the analysis can be done in a more general way, i.e. for descent
directions in the sense of Definition 3.1.
Let σ ∈ (0, 1) be fixed. The Armijo rule is a condition which ensures a sufficient descent in
the following sense:
(3.5)    f(x + αd) ≤ f(x) + σα∇f(x)^T d.
This requirement can be interpreted as a restriction on the step size α. The meaning of (3.5)
can be illustrated: The solid line in Figure 1 represents the graph of f (x + αd) for α ≥ 0.
[Figure 1: graph of f(x + αd) (solid) together with the line f(x) + σα∇f(x)^T d (dashed) over α > 0, with marked points a, b, c.]
The dashed line represents the half-line f(x) + σα∇f(x)^T d. In our example, the condition (3.5) is fulfilled for α ∈ [0, a] ∪ [b, c]. Note that due to the requirement ∇f(x)^T d < 0 and the (Lipschitz-)continuous differentiability of f, the existence of a > 0 is ensured. The fact that even α ∈ [b, c] fulfills the Armijo condition has occurred accidentally in our example.
For the actual calculation of α, one checks (3.5) sequentially for, e.g.,

(3.6)    α = β^l,  l = 0, 1, 2, …,

where β ∈ (0, 1) is fixed. One begins with α^(0) = β⁰ = 1 and stops the test if (3.5) holds true; otherwise l is incremented (resp. the α-value is decreased) and (3.5) is checked once again.
In the following algorithm, (3.6) is generalized: If α(l) does not fulfill the Armijo condition
(3.5), then α(l+1) is chosen such that,
(3.7) α(l+1) ∈ [να(l) , να(l) ]
with 0 < ν ≤ ν < 1.
As a result of the step size strategy in Algorithm 3.2, the actual step size cannot be smaller than 2λ(1 − σ)/(Lκ̄) multiplied by the factor ν. This proves (3.11).
Since Algorithm 3.2 chooses α^(0) = 1 and α^(l+1) ≤ ν̄ α^(l), α_k will be found after at most m reductions, where m ∈ N fulfills the following relation:

    ν̄^m < 2λ(1 − σ)/(Lκ̄) ≤ 2λ_{ks}(1 − σ)/(Lκ((M^k)^{−1})).

A simple calculation leads to (3.12).
Abandoning the assumption that ∇f is Lipschitz-continuous, there might be subsequences
{αk(l) } such that αk(l) → 0. Then the statement of Lemma 3.2 does not hold true any
longer. In this case, let us assume that x^{k(l)} → x∗ with ∇f(x∗) ≠ 0. Further we assume that α_{k(l)} = β^{j_{k(l)}}. Then the following holds true:

    (f(x^{k(l)} + β^{j_{k(l)}−1} d^{k(l)}) − f(x^{k(l)})) / β^{j_{k(l)}−1} > σ∇f(x^{k(l)})^T d^{k(l)},
and after a transition to a further subsequence, we can observe that
    0 ≥ ∇f(x∗)^T d∗ ≥ σ∇f(x∗)^T d∗,
which is a contradiction, as σ ∈ (0, 1) was assumed. This shows, despite the weakened assumptions of Lemma 3.2, the convergence of Algorithm 3.1 with a step size choice according to Algorithm 3.2. However, upper and lower bounds on {α_k} can then no longer be specified.
Theorem 3.2. Let f : Rn → R be continuously differentiable and let {M k } fulfill the assump-
tion of Lemma 3.2. Then either {f (xk )} is unbounded from below or
(3.13)    lim_{k→∞} ∇f(x^k) = 0,
[Figure: graph of f(x + αd) together with the lines f(x) + σα∇f(x)^T d and f(x) + µα∇f(x)^T d over α > 0, with marked points a, b, c, d.]
Now we discuss another step size choice based on polynomial models of the function ϕ(α).
Our discussion is based on quadratic models. However, models of higher order are often
applied in practice.
Armijo step size algorithm based on polynomial models. Apart from the simple backtracking method, there are strategies which apply polynomial models of ϕ(α) (= f(x + αd)).
¹ Superscripts denote the index in the iterative methods used to determine α, resp. α_k in the k-th iteration of the minimization algorithm for f.
    α̂_min = −ϕ′(0)(α^(l))² / (2(ϕ(α^(l)) − ϕ(0) − ϕ′(0)α^(l))) > 0.
Now, we can choose the next trial step size α(l+1) according to
This choice ensures the required property α^(l+1) ∈ [ν α^(l), ν̄ α^(l)]. In addition, (3.17) ensures that
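The quadratic-interpolation step above can be sketched as follows. The safeguard interval from (3.7) is enforced by clamping; the concrete safeguard values nu_lo, nu_hi are hypothetical choices, not taken from the text.

```python
def quadratic_trial_step(phi0, dphi0, alpha_l, phi_alpha_l,
                         nu_lo=0.1, nu_hi=0.9):
    """Minimizer of the quadratic model interpolating phi(0), phi'(0) and
    phi(alpha_l), safeguarded into [nu_lo*alpha_l, nu_hi*alpha_l] as in (3.7).
    nu_lo, nu_hi are hypothetical safeguard parameters."""
    denom = 2.0 * (phi_alpha_l - phi0 - dphi0 * alpha_l)
    # denom > 0 whenever the Armijo test failed at alpha_l, hence alpha_hat > 0
    alpha_hat = -dphi0 * alpha_l**2 / denom
    return min(max(alpha_hat, nu_lo * alpha_l), nu_hi * alpha_l)

# usage: phi(a) = (a - 0.3)^2, so phi(0) = 0.09, phi'(0) = -0.6, phi(1) = 0.49;
# the quadratic model is exact here and the step lands near the minimizer 0.3
alpha_next = quadratic_trial_step(0.09, -0.6, 1.0, 0.49)
print(alpha_next)   # close to 0.3
```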
2.2. Wolfe-Powell rule. Let σ ∈ (0, ½) and ρ ∈ [σ, 1) be fixed. The Wolfe-Powell conditions are: For x, d ∈ R^n with ∇f(x)^T d < 0 determine a step size α > 0 such that
Just like the Armijo-Goldstein rule, the choice of σ < ½ enables us to accept the exact minimizer of a quadratic function as a Wolfe-Powell step size. The condition (3.21) ideally implies that the graph of ϕ(α) = f(x + αd) at α > 0 does not descend as "steeply" as at α = 0. This claim is motivated by the fact that
    ϕ′(α̂) = ∇f(x + α̂d)^T d = 0
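A checker for the two Wolfe-Powell conditions can be sketched as follows. Conditions (3.20)/(3.21) themselves are not printed in this excerpt; the form used below (sufficient decrease plus the curvature condition ϕ′(α) ≥ ρϕ′(0)) is the standard one and matches the surrounding discussion, but should be read as an assumption.

```python
import numpy as np

def wolfe_powell_ok(f, grad, x, d, alpha, sigma=1e-4, rho=0.9):
    """Check the Wolfe-Powell step size conditions: sufficient decrease
    and the curvature condition phi'(alpha) >= rho * phi'(0)."""
    slope0 = grad(x) @ d
    decrease = f(x + alpha * d) <= f(x) + sigma * alpha * slope0     # (3.20)
    curvature = grad(x + alpha * d) @ d >= rho * slope0              # (3.21)
    return decrease and curvature

# usage on phi(alpha) = f(x + alpha*d) for f(x) = x^2, x = 1, d = -1
f = lambda x: x @ x
grad = lambda x: 2 * x
x, d = np.array([1.0]), np.array([-1.0])
print(wolfe_powell_ok(f, grad, x, d, alpha=1.0))    # True: exact minimizer
print(wolfe_powell_ok(f, grad, x, d, alpha=0.01))   # False: step too short
```

Note that the short step α = 0.01 is rejected by the curvature condition even though it satisfies the decrease condition, which is precisely the role of (3.21).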
[Figure: graph of f(x + αd) together with the line f(x) + σα∇f(x)^T d and the slope ρ∇f(x)^T d, over α > 0, with marked points a, b, c, d.]
Now we want to prove that for given x and d, the set of Wolfe-Powell step sizes is non-empty
and that for step sizes which fulfill (3.20) and (3.21), the sufficient decrease condition of
Theorem 3.1 is satisfied.
be the set of Wolfe-Powell step sizes at x in direction d. Then the following statements hold
true:
(a) If f is bounded from below, then SWP (x, d) 6= ∅, i.e. the Wolfe-Powell step size
strategy is well-defined.
If the descent directions dk in the general descent method, cf. Algorithm 3.1, are chosen such
that the angle condition of Theorem 3.1 is fulfilled, then we can easily infer from Theorem 3.3
(and Theorem 3.1) that every accumulation point of the sequence {xk } is a stationary point.
It remains to examine the numerical realization of the Wolfe-Powell-rule.
Before specifying the corresponding step size algorithm, we consider the following lemma
which is going to be used for the determination of an appropriate starting point for the
numerical determination of a Wolfe-Powell step size in the subsequent algorithm.
Lemma 3.3. Let σ < ρ (cf. Theorem 3.3), ϕ′(0) < 0 and Φ(α) := ϕ(α) − ϕ(0) − σαϕ′(0). If [a, b] denotes an interval with the properties

(3.22)    Φ(a) ≤ 0,  Φ(b) ≥ 0,  Φ′(a) < 0,

then [a, b] contains a point ᾱ with

    Φ(ᾱ) < 0,  Φ′(ᾱ) = 0.

Moreover, ᾱ is an interior point of an interval I such that for all α ∈ I there holds

    Φ(α) ≤ 0 and ϕ′(α) ≥ ρϕ′(0),

i.e. I ⊂ SWP(x, d).
Proof. According to the assumption, Φ(a) ≤ 0, Φ′(a) < 0 and Φ(b) ≥ 0 hold true. Therefore there is at least one point ξ ∈ (a, b) with Φ′(ξ) ≥ ε for sufficiently small ε > 0. If there were no such point, we would have Φ′(α) ≤ 0 for all α ∈ [a, b]. Since Φ′(a) < 0 was assumed, the continuous differentiability of Φ would then give Φ(b) < 0, a contradiction. Now let ξ̂ be the smallest element ξ in (a, b) satisfying Φ′(ξ) ≥ ε. Given that Φ′ is continuous, there exists a ξ₀ ∈ (a, ξ̂) with Φ′(ξ₀) = 0 by Bolzano's root theorem. Let ξ₀ be the smallest element having this property. Then Φ(ξ₀) < 0 holds true. If that were not the case, there would be a ξ₁ ∈ (a, ξ₀) with Φ(a) = Φ(ξ₁) ≤ 0 due to the continuity of Φ. Rolle's theorem, however, ensures the existence of a ξ₀′ ∈ (a, ξ₁) with Φ′(ξ₀′) = 0, which contradicts the choice of ξ₀. To conclude the first part of the proof, we set ᾱ = ξ₀.
For the proof of the second part, note that

    Φ′(α) = ϕ′(α) − σϕ′(0) = ∇f(x + αd)^T d − σ∇f(x)^T d
          = ∇f(x + αd)^T d − ρ∇f(x)^T d + (ρ − σ)ϕ′(0)

holds true. This implies

(3.23)    ∇f(x + ᾱd)^T d > ∇f(x + ᾱd)^T d + (ρ − σ)ϕ′(0) = ρ∇f(x)^T d.
As Φ(ᾱ) < 0, there exists a neighborhood [ᾱ − r₀, ᾱ + r₀], r₀ > 0, s.t. Φ(α) ≤ 0 holds true for all α ∈ [ᾱ − r₀, ᾱ + r₀]. Due to the continuity of Φ′, ρ > σ and ϕ′(0) < 0, for 0 < ε ≤ ½(σ − ρ)ϕ′(0) there exists r_ε > 0 such that

    ϕ′(α) = ∇f(x + αd)^T d > ∇f(x + ᾱd)^T d − ε ≥ ρ∇f(x)^T d = ρϕ′(0)

for all α ∈ [ᾱ − r_ε, ᾱ + r_ε]. Choosing r = min(r₀, r_ε) > 0, it follows that

    Φ(α) ≤ 0 and ϕ′(α) ≥ ρϕ′(0)

for all α ∈ I := [ᾱ − r, ᾱ + r] ⊂ SWP(x, d).
Lemma 3.3 is crucial for the following algorithm.
Concerning the choice of α^(0), we remark that from the second iteration of the descent method Algorithm 3.1 on, i.e. k ≥ 2, α^(0) can be chosen as α_{k−1}. If a lower bound f̲ of the function values of f on R^n is known, we can infer immediately

    0 ≥ Φ(α) = ϕ(α) − ϕ(0) − σα∇f(x)^T d ≥ f̲ − ϕ(0) − σαϕ′(0)

from Φ(α) ≤ 0. Rearranging yields

    α ≤ (f̲ − ϕ(0)) / (σϕ′(0)) =: ᾱ.
Thus it is reasonable to choose α^(0) ∈ (0, ᾱ] in this case. With regard to Newton or quasi-Newton methods, the step size α = 1 is of particular importance. Hence it is advisable to choose α^(0) = min{1, ᾱ} for these methods.
Now we demonstrate that Algorithm 3.3 terminates successfully after finitely many steps,
under certain conditions.
Theorem 3.4. Let f : R^n → R be continuously differentiable and bounded from below. Furthermore let σ ∈ (0, ½) and ρ ∈ (½, 1) be fixed. Then Algorithm 3.3 terminates after finitely many steps at "RETURN 1" or "RETURN 2" with α ∈ SWP(x, d).
Proof. First we consider the case where Algorithm 3.3 terminates at ”RETURN 1”.
Then it is obvious that the Wolfe-Powell conditions are fulfilled.
In the next step we show that the loop in the first part of the algorithm (switch to (A.1)) is
finite, i.e. after finitely many steps we continue with (B.0) or we terminate with ”RETURN
1”. Let us assume that the loop was not finite, resp. the algorithm would return infinitely
often to the position (A.1). In this case it holds α(i) = γ α(0) and Φ(α(i) ) < 0 for all i ∈ N.
But the last inequality implies
f (x + α(i) d) < f (x) + σ α(i) ∇f (x)> d ∀i ∈ N.
By assumption, γ > 1 and ϕ′(0) < 0. Hence we would obtain f(x + α^(i) d) ↓ −∞, which contradicts the boundedness of f from below. Hence, in the first part of the algorithm, the loop (switch to (A.1)) has to terminate with "RETURN 1" after finitely many steps, or it has to jump to position (B.0).
Now, assume one has reached (B.0). In that case, the interval [a, b] has the property of Lemma 3.3 and ϕ′(a) < ρϕ′(0). If the algorithm terminates with "RETURN 2", then the Wolfe-Powell conditions are satisfied. It remains to prove that "RETURN 2" can be reached after finitely many trials (switch to (B.1)). First we prove by induction that the interval [α₁^(j), α₂^(j)] has, for every j ∈ N, the property (3.22) (with a = α₁^(j) and b = α₂^(j)) and fulfills

    ϕ′(α₁^(j)) < ρϕ′(0).
• j = 0: The assertion follows from the conditions which allow one to arrive at (B.0).
• j → j + 1: We assume that [α₁^(j), α₂^(j)] fulfills (3.22) (with a = α₁^(j) and b = α₂^(j)) and ϕ′(α₁^(j)) < ρϕ′(0).
If Φ(α^(j)) ≥ 0, then setting α₁^(j+1) := α₁^(j), α₂^(j+1) := α^(j) gives:

    Φ(α₁^(j+1)) = Φ(α₁^(j)) ≤ 0,
    Φ(α₂^(j+1)) = Φ(α^(j)) ≥ 0,
    Φ′(α₁^(j+1)) = Φ′(α₁^(j)) < 0.

In case Φ(α^(j)) < 0 and ϕ′(α^(j)) < ρϕ′(0) (otherwise the algorithm will terminate with "RETURN 2"), we have α₁^(j+1) = α^(j) and α₂^(j+1) = α₂^(j) and

    Φ(α₁^(j+1)) = Φ(α^(j)) < 0,
    Φ(α₂^(j+1)) = Φ(α₂^(j)) ≥ 0,
    Φ′(α₁^(j+1)) = Φ′(α^(j)) = ϕ′(α^(j)) − σϕ′(0) < 0,

since ρ > σ > 0.
In both cases, [α₁^(j+1), α₂^(j+1)] has the desired property.
Finally we show that the loop (switch to (B.1)) is finite. Let us assume that this were not true. Then the intervals [α₁^(j), α₂^(j)] would "shrink" to a point α∗. This results from the fact that

    0 < α₂^(j+1) − α₁^(j+1) ≤ max{1 − τ₁, 1 − τ₂} (α₂^(j) − α₁^(j))

and max{1 − τ₁, 1 − τ₂} < 1 hold true. Lemma 3.3 would yield that for each j ∈ N there exists α̂^(j) ∈ (α₁^(j), α₂^(j)) such that

    Φ(α̂^(j)) < 0 and Φ′(α̂^(j)) = 0

would be fulfilled. Because of α̂^(j) → α∗ for j → ∞, it follows that Φ′(α∗) = 0 and also
(3.24)    ϕ′(α∗) = σϕ′(0) > ρϕ′(0),

since by assumption 0 < σ < ρ and ϕ′(0) < 0. On the other hand, ϕ′(α₁^(j)) < ρϕ′(0) and the continuity of ϕ′ would imply ϕ′(α∗) ≤ ρϕ′(0). This, however, would contradict (3.24). Hence, the loop (switch to (B.1)) has to terminate after finitely many iterations with "RETURN 2".
The freedom of choice w.r.t. α^(j) in (B.1) can again be used to apply quadratic or cubic polynomial models.
2.3. Strong Wolfe-Powell rule. Let σ ∈ (0, 1/2) and ρ ∈ [σ, 1) be fixed. The Strong Wolfe-
Powell rule requires: For x, d ∈ R^n with ∇f(x)^T d < 0 determine a step size α > 0 with
(3.25) f(x + αd) ≤ f(x) + σα∇f(x)^T d,
(3.26) |∇f(x + αd)^T d| ≤ −ρ∇f(x)^T d.
In comparison to the Wolfe-Powell rule, condition (3.26) requires not only that the graph of
ϕ(α) (= f(x + αd)) does not decrease as steeply at α > 0 as it does at α = 0, but also that the
graph does not increase too steeply. A step size which fulfills (3.26) for a very small ρ (and
thus also for a very small σ) is near to a stationary point of ϕ(·).
For the set of Strong Wolfe-Powell step sizes in x in direction d, i.e.
S_SWP(x, d) := {α > 0 : (3.25) and (3.26) are fulfilled},
an analogous statement to Theorem 3.3 holds true. Furthermore, we can prove an analogous
result to Lemma 3.3, in which the third condition in (3.22) has to be modified. The corre-
sponding step size algorithm is structured similarly to Algorithm 3.3.
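As a concrete illustration, the two tests (3.25) and (3.26) can be checked for a candidate step size along the ray ϕ(α) = f(x + αd). The following is a minimal sketch of our own; the function names and the default values σ = 10⁻⁴ and ρ = 0.9 are illustrative choices, not prescribed by the text.

```python
def strong_wolfe_ok(phi, dphi, alpha, sigma=1e-4, rho=0.9):
    """Check the strong Wolfe-Powell conditions for a trial step size alpha.

    phi(a)  = f(x + a d)            (function value along the ray),
    dphi(a) = grad f(x + a d)^T d   (directional derivative);
    requires dphi(0) < 0, i.e. d is a descent direction.
    """
    armijo = phi(alpha) <= phi(0) + sigma * alpha * dphi(0)   # condition (3.25)
    curvature = abs(dphi(alpha)) <= -rho * dphi(0)            # condition (3.26)
    return armijo and curvature
```

For ϕ(α) = (α − 1)² − 1, the exact minimizer α = 1 passes both tests, while a tiny step passes (3.25) but violates the curvature condition (3.26), exactly the behavior (3.26) is designed to rule out.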
3. Practical aspects
The algorithms in section 2 of this chapter are idealized. In numerical practice it has to
be taken into account that the accuracy in the evaluation of functions and derivatives is
machine- and problem-dependent. If these "inaccuracies" are not taken into account in the
conditions of Paragraphs 2.1–2.3, dead loops are likely to occur. In the best case, error
bounds ε(α), ε(0) ≥ 0 and ε̂(α), ε̂(0) ≥ 0 are known for the function values ϕ(α), ϕ(0) and the
derivatives ϕ'(α), ϕ'(0). Then it would be possible to modify resp. attenuate the condition
ϕ(α) ≤ ϕ(0) + σαϕ'(0) in the following way:
ϕ(α) ≤ ϕ(0) + σα(ϕ'(0) + ε̂(0)) + ε(α) + ε(0).
28 BMS Basic Course: Nonlinear Optimization
In most cases, such error bounds are not available. Then a possible approach consists in the
application of error bounds of the following form:
ε(α) := ε (1 + |ϕ(α)|)
(or also ε(α) := ε |ϕ(α)| for sufficiently large |ϕ(α)|), where ε ≥ ε_M. The value ε_M > 0
corresponds to the machine precision (or the relative accuracy in computations). If the
analytic form of the derivative is implemented, then one may choose ε̂(α) = ε̂(0) = 0 (and ε
slightly enlarged).
Furthermore, the step size algorithm has to be terminated whenever the interval [α_1^(j), α_2^(j)]
gets "too small", i.e., when α_2^(j) − α_1^(j) > 0 becomes small. A practicable criterion uses a
tolerance level of the form
∆ := ε (1 + α_2^(j)) for α_2^(j) < +∞
(or ∆ := ε α_2^(j)).
Since Theorem 3.4 assumes the boundedness of f from below, a lower bound to ϕ should be
employed in the step size strategy, terminating the algorithm when ϕ drops below this bound.
CHAPTER 4
Rate of convergence
(3) Between Q- and R-convergence resp. the Q- and R-factor, the following relations
hold true:
O_Q{x^k} ≤ O_R{x^k} and R_1{x^k} ≤ Q_1{x^k}.
It is often convenient to use the Landau symbols O and o for describing the convergence
behavior.
Definition 4.3. Let f, g : R^n → R^m and x* ∈ R^n. We write
(a) f(x) = O(g(x)) for x → x* if and only if there exist a uniform constant λ > 0 and a
neighborhood U of x* such that for all x ∈ U \ {x*} the following relation holds true:
‖f(x)‖ ≤ λ‖g(x)‖.
(b) f(x) = o(g(x)) for x → x* if and only if for all ε > 0 there exists a neighborhood U
of x* such that for all x ∈ U \ {x*} we have
‖f(x)‖ ≤ ε‖g(x)‖.
Remark 4.3. If lim_k x^k = x*, then {x^k} converges to x* (at least)
(1) Q-superlinearly, if ‖x^{k+1} − x*‖ = o(‖x^k − x*‖);
(2) Q-quadratically, if ‖x^{k+1} − x*‖ = O(‖x^k − x*‖²).
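These characterizations can be observed numerically. The sketch below is our own illustration, not from the text: it runs Newton's method for x² = 2 and checks that the error ratios ‖x^{k+1} − x*‖ / ‖x^k − x*‖² stay bounded by a uniform constant, as Q-quadratic convergence requires.

```python
import math

# Newton's method for x^2 = 2: x_{k+1} = (x_k + 2/x_k)/2, x* = sqrt(2).
x_star = math.sqrt(2.0)
x = 2.0
errors = []
for _ in range(5):
    errors.append(abs(x - x_star))
    x = 0.5 * (x + 2.0 / x)

# Q-quadratic convergence: e_{k+1} <= C * e_k^2 with a uniform constant C.
ratios = [errors[k + 1] / errors[k] ** 2 for k in range(4)]
```

The ratios settle near 1/(2√2) ≈ 0.354, the asymptotic constant |f''(x*)/(2f'(x*))| for this particular iteration.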
2. Characterizations
The aim is to specify an alternative characterization for (Q-)superlinear and (Q-)quadratic
convergence of a sequence {xk }. For that purpose we need some auxiliary results which will
also be applied in the subsequent chapters.
Lemma 4.1. Let f : R^n → R and {x^k} ⊂ R^n with lim_{k→∞} x^k = x* ∈ R^n. Then the following
assertions hold true:
(a) If f is twice continuously differentiable, then
‖∇f(x^k) − ∇f(x*) − ∇²f(x^k)(x^k − x*)‖ = o(‖x^k − x*‖).
(b) If, in addition, ∇²f is locally Lipschitz-continuous, then
‖∇f(x^k) − ∇f(x*) − ∇²f(x^k)(x^k − x*)‖ = O(‖x^k − x*‖²).
Proof. (a) The triangle inequality yields
‖∇f(x^k) − ∇f(x*) − ∇²f(x^k)(x^k − x*)‖ ≤
≤ ‖∇f(x^k) − ∇f(x*) − ∇²f(x*)(x^k − x*)‖ + ‖∇²f(x*) − ∇²f(x^k)‖ · ‖x^k − x*‖.
‖∇f(x^{k+1})‖ = ‖∇f(x^{k+1}) − ∇f(x^k) − ∇²f(x*)(x^{k+1} − x^k) + ∇f(x^k) + ∇²f(x*)(x^{k+1} − x^k)‖
 = ‖∫_0^1 (∇²f(x^k + τ(x^{k+1} − x^k)) − ∇²f(x*))(x^{k+1} − x^k) dτ + ∇f(x^k) + ∇²f(x*)(x^{k+1} − x^k)‖
 ≤ ∫_0^1 ‖∇²f(x^k + τ(x^{k+1} − x^k)) − ∇²f(x*)‖ dτ · ‖x^{k+1} − x^k‖ + ‖∇f(x^k) + ∇²f(x*)(x^{k+1} − x^k)‖.
By assumption (c), the continuity of ∇²f(·) and lim_{k→∞} x^k = x* we infer the existence of a
zero sequence (ε_k) ⊂ R with
Concerning the general descent method (cf. Algorithm 3.1), we still have to make a reasonable
choice of the descent direction dk .
(5.2) is equivalent to
f(x^{k+1}) − f(x*) ≤ ((κ − 1)/(κ + 1))² (f(x^k) − f(x*)).
Furthermore, λ_min x^T x ≤ x^T Qx ≤ λ_max x^T x implies
‖x^k − x*‖ ≤ √κ ((κ − 1)/(κ + 1))^k ‖x^0 − x*‖.
Evidently the rate of convergence is very slow if κ is very large (zig-zagging effect).
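The zig-zagging can be reproduced numerically. Below is a small, self-contained illustration (the quadratic and starting point are our own choices): for f(x) = ½ xᵀQx with Q = diag(1, κ) and the classical worst-case start (κ, 1), exact-line-search steepest descent contracts the iterates by exactly (κ − 1)/(κ + 1) per step.

```python
import numpy as np

kappa = 100.0
Q = np.diag([1.0, kappa])          # Hessian with condition number κ = 100
x = np.array([kappa, 1.0])         # classical worst-case starting point

ratios = []
for _ in range(20):
    g = Q @ x                      # gradient of f(x) = 0.5 x^T Q x
    alpha = (g @ g) / (g @ Q @ g)  # exact line search along -g
    x_new = x - alpha * g
    ratios.append(np.linalg.norm(x_new) / np.linalg.norm(x))
    x = x_new
```

Every ratio equals (κ − 1)/(κ + 1) ≈ 0.98, so more than a hundred iterations are needed per decimal digit of accuracy.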
2. Gradient-related methods
A possible remedy for the slow convergence of the method of steepest descent consists in
choosing
d^k = −H^{-1} ∇f(x^k),
where H ∈ S^n is positive definite. Additionally, the matrix H should be chosen such that
0 < λ_max(H^{-1}Q)/λ_min(H^{-1}Q) < λ_max(Q)/λ_min(Q)
and Hd^k = −∇f(x^k) is simple to solve.
Definition 5.1. Let f : R^n → R be continuously differentiable and {x^k} ⊂ R^n. A sequence
{d^k} ⊂ R^n is called gradient-related w.r.t. f and {x^k} if, for every subsequence {x^{k(l)}}
converging to a nonstationary point of f, there exist c > 0 and ε > 0 such that
(a) ‖d^{k(l)}‖ ≤ c for all l ∈ N,
(b) ∇f(x^{k(l)})^T d^{k(l)} ≤ −ε for all sufficiently large l ∈ N.
Remark 5.1. (1) Let {H^k} ⊂ S^n be a sequence of positive definite matrices which
fulfill
c_1‖x‖² ≤ x^T H^k x ≤ c_2‖x‖² for all x ∈ R^n, k ∈ N,
with constants c_2 ≥ c_1 > 0. Then {d^k} given by
H^k d^k = −∇f(x^k) for all k ∈ N
is gradient-related.
(2) For Algorithm 3.1 with gradient-related search directions and Armijo step size strat-
egy, an analogous statement to Theorem 5.1 holds true.
(3) Sometimes the choice
H^k = diag(h_ii^k) with h_ii^k = ∂²f(x^k)/∂x_i²
results in a significant improvement of the rate of convergence.
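A hedged sketch of such a diagonally scaled direction (the function name and the safeguard bounds c1, c2 are our own): clipping the diagonal entries enforces the uniform positive definiteness required in part (1) of the remark, so the resulting directions stay gradient-related.

```python
import numpy as np

def scaled_direction(grad, hess_diag, c1=1e-8, c2=1e8):
    # clip the diagonal entries into [c1, c2] so that
    # c1 ||x||^2 <= x^T H^k x <= c2 ||x||^2 holds uniformly in k
    d = np.clip(hess_diag, c1, c2)
    return -grad / d
```

For grad = (2, −4) and Hessian diagonal (1, 4) this gives the direction (−2, 1), which is a descent direction since gradᵀd < 0.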
CHAPTER 6
Conjugate gradient methods
We first derive the conjugate gradient method for minimizing strictly convex quadratic func-
tions. Then we transfer the technique to minimization problems of general nonlinear functions.
In this context we consider the Fletcher-Reeves and the Polak-Ribière variants of the conju-
gate gradient (CG) method. The two versions differ in the update strategy of a scalar which
has an impact on the determination of the search direction and the line search algorithm.
While the original Polak-Ribière method requires an impractical step size strategy in order
to be analyzed successfully, we will briefly elaborate on a modified Polak-Ribière variant of
the conjugate gradient method, which is based on an implementable step size strategy.
We infer
(6.6) (g^{k+1})^T d^k = 0 for k = 0, 1, ..., n − 1.
The application of (6.2) for i ≠ j implies
(g^{i+1} − g^i)^T d^j = (Ax^{i+1} − Ax^i)^T d^j = α_i (d^i)^T A d^j = 0.
Together with (6.6) this yields
(g^{k+1})^T d^j = (g^{j+1})^T d^j + Σ_{i=j+1}^{k} (g^{i+1} − g^i)^T d^j = 0
for j = 0, ..., k. This proves (6.5). Due to (6.2), the vectors d^0, ..., d^{n−1} are pairwise orthog-
onal w.r.t. the scalar product
⟨u, v⟩_A := u^T A v.
Consequently these vectors are also linearly independent. From (6.5), it follows immediately
that g^n = 0 or, equivalently, x^n = x*, the solution of problem (6.1).
Remark 6.1. Vectors d^0, d^1, ..., d^{n−1} with the property (6.2) are called A-conjugate or A-
orthogonal.
We continue by defining a strategy to determine A-conjugate directions d^0, d^1, ..., d^{n−1}: We
begin with
d^0 = −∇f(x^0) = −g^0.
Assume we already know l + 1 vectors d^0, ..., d^l with
(6.7) (d^i)^T A d^j = 0 for i, j = 0, ..., l with i ≠ j.
According to Lemma 6.1, (6.4) and (6.5) hold true for k = 0, ..., l. We suppose that g^{l+1} ≠ 0,
otherwise we have already found the solution. We make the ansatz
(6.8) d^{l+1} := −g^{l+1} + Σ_{i=0}^{l} β_i^l d^i.
Algorithm 6.1 (CG method).
input: x^0 ∈ R^n
begin
  set g^0 := Ax^0 − b, d^0 := −g^0, k := 0; choose ε ≥ 0.
  while ‖g^k‖ > ε
  begin
    set
      α_k := ‖g^k‖² / ((d^k)^T A d^k)
      x^{k+1} := x^k + α_k d^k
      g^{k+1} := g^k + α_k A d^k
      β_k := ‖g^{k+1}‖² / ‖g^k‖²
      d^{k+1} := −g^{k+1} + β_k d^k
      k := k + 1
  end
end
Remark 6.2. (1) The main complexity of Algorithm 6.1 consists of the calculation of
Ad^k. Given that the product is needed twice, one should store it as z^k := Ad^k.
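A direct NumPy transcription of Algorithm 6.1, storing z^k := Ad^k as the remark suggests (the function name and the small test matrix are our own):

```python
import numpy as np

def cg(A, b, x0, eps=1e-10):
    """Algorithm 6.1: CG for Ax = b with A symmetric positive definite."""
    x = x0.astype(float)
    g = A @ x - b                      # g^0 := A x^0 - b
    d = -g                             # d^0 := -g^0
    for _ in range(len(b)):            # at most n steps in exact arithmetic
        if np.linalg.norm(g) <= eps:
            break
        z = A @ d                      # z^k := A d^k, needed twice
        alpha = (g @ g) / (d @ z)
        x = x + alpha * d
        g_new = g + alpha * z
        beta = (g_new @ g_new) / (g @ g)
        d = -g_new + beta * d
        g = g_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = cg(A, b, np.zeros(2))
```

For this 2×2 system the iteration terminates (up to rounding) after two steps, as Lemma 6.1 predicts.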
      α_k := ((g^k)^T W^{-1} g^k) / ((d^k)^T A d^k)
      x^{k+1} := x^k + α_k d^k
      g^{k+1} := g^k + α_k A d^k
      β_k := ((g^{k+1})^T W^{-1} g^{k+1}) / ((g^k)^T W^{-1} g^k)
      d^{k+1} := −W^{-1} g^{k+1} + β_k d^k
      k := k + 1
  end
end
2. Nonlinear functions
2.1. Fletcher-Reeves method. The elementary structure of Algorithm 6.1 is the mo-
tivation for the following variant of the conjugate gradient method for the minimization of
continuously differentiable but not necessarily quadratic functions.
Note that we now require ρ ∈ (σ, 1/2), which is more restrictive than the condition ρ ∈ [σ, 1)
introduced in Paragraph 2.3. The current restriction is due to the convergence analysis of
the method. Firstly it can be shown that Algorithm 6.3 is well-defined for a continuously
differentiable function f : Rn → R which is bounded from below. Moreover we have the
following convergence property.
If L(x^0) is convex and f is twice continuously differentiable and uniformly convex on L(x^0), then
the sequence {xk } generated by the Fletcher-Reeves method (Algorithm 6.3) converges to the
unique global minimizer of f .
2.2. Polak-Ribière method and modifications. In numerical practice it is often ob-
served that variants of the following nonlinear CG-method (Polak-Ribière method ) have better
convergence behavior.
input: x^0 ∈ R^n.
begin
  set d^0 := −∇f(x^0), k := 0; choose ε ≥ 0.
  while ‖∇f(x^k)‖ > ε
  begin
    determine α_k such that
    (6.13) α_k = min{α > 0 : ∇f(x^k + αd^k)^T d^k = 0}.
    set
      x^{k+1} := x^k + α_k d^k
      β_k^{PR} := ((∇f(x^{k+1}) − ∇f(x^k))^T ∇f(x^{k+1})) / ‖∇f(x^k)‖²
      d^{k+1} := −∇f(x^{k+1}) + β_k^{PR} d^k
      k := k + 1
  end
end
The Fletcher-Reeves and the Polak-Ribière method differ in the strategy for determining α_k
and in the choice of β_k:
• The step size choice (6.13) in the Polak-Ribière method is impractical, but necessary
for the convergence analysis. However, in numerical practice the strong Wolfe-Powell
rule (here with small ρ), which is also applied in the Fletcher-Reeves algorithm,
yields satisfactory results. There also exist so-called modified Polak-Ribière methods,
which work with an implementable step size strategy. In one instance, one computes
α_k such that x^{k+1} = x^k + α_k d^k and d^{k+1} = −∇f(x^{k+1}) + β_k^{PR} d^k satisfy the following
conditions:
(6.14) f(x^{k+1}) ≤ f(x^k) − σ α_k² ‖d^k‖², and
(6.15) −δ_2 ‖∇f(x^{k+1})‖² ≤ ∇f(x^{k+1})^T d^{k+1} ≤ −δ_1 ‖∇f(x^{k+1})‖²,
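The two β update formulas side by side (a minimal sketch; the function names and the "PR+" remark reflect common practice rather than this text):

```python
import numpy as np

def beta_fr(g_new, g_old):
    """Fletcher-Reeves update: ||g^{k+1}||^2 / ||g^k||^2."""
    return (g_new @ g_new) / (g_old @ g_old)

def beta_pr(g_new, g_old):
    """Polak-Ribiere update: (g^{k+1} - g^k)^T g^{k+1} / ||g^k||^2.

    In practice this is often clipped at 0 ("PR+") to restore
    convergence guarantees.
    """
    return ((g_new - g_old) @ g_new) / (g_old @ g_old)
```

When successive gradients are nearly equal (little progress), β_PR is close to zero and the method approximately restarts with a steepest descent step, whereas β_FR does not; this is one commonly cited reason for the often better practical behavior of the Polak-Ribière variant.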
Newton’s method
From now on we assume that f and its local minimizers x∗ fulfill the following conditions:
1. f is twice continuously differentiable with k∇2 f (x) − ∇2 f (y)k ≤ γkx − yk
in a neighborhood of x∗ with
(A)
2. ∇f (x∗ ) = 0,
3. ∇2 f (x∗ ) positive definite.
To simplify notation, we will frequently denote the current iteration point by x_a and the new
iterate by x_+.
We consider the following quadratic model of f near x_a:
m_a(x) = f(x_a) + ∇f(x_a)^T (x − x_a) + (1/2)(x − x_a)^T ∇²f(x_a)(x − x_a).
If ∇²f(x_a) is positive definite, then we define x_+ as the minimizer of m_a(x):
0 = ∇m_a(x_+) = ∇f(x_a) + ∇²f(x_a)(x_+ − x_a).
Rearranging yields the iteration rule of Newton's method, i.e.
x_+ = x_a − (∇²f(x_a))^{-1} ∇f(x_a).
Naturally, the inverse (∇²f(x_a))^{-1} is not computed. Merely the linear system
∇²f(x_a) d = −∇f(x_a)
is solved, and we set x_+ = x_a + d. If x_a is far away from a local minimizer, then ∇²f(x_a)
might possess negative eigenvalues. Then x+ can be a local maximizer or a saddle point of
ma . In order to cope with this, we have to introduce certain modifications. But for the time
being we assume that xa is sufficiently close to a local minimizer.
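A minimal sketch of the local iteration just described (the test function f(x, y) = x⁴ + y² is our own, hypothetical example): the Newton system is solved; the inverse Hessian is never formed.

```python
import numpy as np

def newton_step(grad, hess, x):
    # solve grad^2 f(x) d = -grad f(x) instead of inverting the Hessian
    d = np.linalg.solve(hess(x), -grad(x))
    return x + d

# f(x, y) = x^4 + y^2 (illustrative only)
grad = lambda x: np.array([4.0 * x[0] ** 3, 2.0 * x[1]])
hess = lambda x: np.array([[12.0 * x[0] ** 2, 0.0], [0.0, 2.0]])

x = np.array([1.0, 1.0])
for _ in range(10):
    x = newton_step(grad, hess, x)
```

Along the quadratic coordinate the minimizer is hit in one step; along the quartic coordinate the Hessian is singular at the minimizer, so convergence degrades to Q-linear with factor 2/3, a reminder that assumption (A).3 matters for the fast local rate.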
In what follows we will often make use of the following result:
Lemma 7.1. Let (A) be satisfied. Then there exists δ > 0 such that for all x ∈ B(δ) := {y :
‖y − x*‖ < δ} it holds that
‖∇²f(x)‖ ≤ 2‖∇²f(x*)‖,
‖(∇²f(x))^{-1}‖ ≤ 2‖(∇²f(x*))^{-1}‖,
‖x − x*‖ / (2‖(∇²f(x*))^{-1}‖) ≤ ‖∇f(x)‖ ≤ 2‖∇²f(x*)‖ ‖x − x*‖.
This enables us to prove local convergence of Newton’s method.
Theorem 7.1. Let (A) be satisfied. Then there exist constants K > 0 and δ > 0 (independent
of x_a and x_+) such that, for x_a ∈ B(δ), the Newton step
x_+ = x_a − (∇²f(x_a))^{-1} ∇f(x_a)
For the estimates above one uses third order Taylor expansions and intermediate values ξ_1, ξ_2.
The minimizer h* of err_0(h) = h² + ε_f/h fulfills
err_0'(h*) = 2h* − ε_f/(h*)² = 0,
yielding
h* = ε_f^{1/3} and err_0(h*) = O(ε_f^{2/3}).
Thus the gradient error is of the order
ε_g = O(ε_f^{2/3}).
For the approximation of the Hessian we obtain
ε_H = O(ε_g^{2/3}) = O(ε_f^{4/9}),
which is significantly better than in the case of forward differences.
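A quick numerical check of the h* ~ ε_f^{1/3} rule for central differences (the function, the point, and the comparison step sizes are our own choices):

```python
import math

eps_f = 2.0 ** -52                      # double-precision machine epsilon
f, df = math.sin, math.cos
x = 1.0

def central_diff_error(h):
    return abs((f(x + h) - f(x - h)) / (2.0 * h) - df(x))

h_star = eps_f ** (1.0 / 3.0)           # ~ 6.1e-6
```

The error at h* is near ε_f^{2/3} ≈ 4·10⁻¹¹, while both a much larger step (dominated by truncation error) and a much smaller one (dominated by rounding error) do visibly worse.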
Naturally, one can only expect convergence of an iterative scheme if ε_g → 0 during the
iteration. This is illustrated by the following result. Errors denoted by ε_f(·), ε_g(·) and ε_H(·)
have to be understood as scalar-, vector- and matrix-valued, respectively. Inequalities like
ε_H(·) < ε are to be read elementwise.
Theorem 7.3. Let (A) be satisfied. Then there exist constants K̄ > 0, δ > 0 and ε > 0 such
that for x_a ∈ B(δ) and ε_H(x_a) < ε the step
x_+ = x_a − (∇²f(x_a) + ε_H(x_a))^{-1} (∇f(x_a) + ε_g(x_a))
is well-defined, i.e. ∇²f(x_a) + ε_H(x_a) is nonsingular, and satisfies
‖x_+ − x*‖ ≤ K̄ (‖x_a − x*‖² + ‖ε_H(x_a)‖ ‖x_a − x*‖ + ‖ε_g(x_a)‖).
Proof. Let δ be chosen such that Lemma 7.1 and Theorem 7.1 hold true. Define
x_+^N = x_a − (∇²f(x_a))^{-1} ∇f(x_a)
and note that
x_+ = x_+^N + ((∇²f(x_a))^{-1} − (∇²f(x_a) + ε_H(x_a))^{-1}) ∇f(x_a)
      − (∇²f(x_a) + ε_H(x_a))^{-1} ε_g(x_a).
Lemma 7.1 and Theorem 7.1 imply
‖x_+ − x*‖ ≤ ‖x_+^N − x*‖ + ‖((∇²f(x_a))^{-1} − (∇²f(x_a) + ε_H(x_a))^{-1}) ∇f(x_a)‖
      + ‖(∇²f(x_a) + ε_H(x_a))^{-1}‖ · ‖ε_g(x_a)‖
 ≤ K‖x_a − x*‖² + ‖(∇²f(x_a))^{-1} − (∇²f(x_a) + ε_H(x_a))^{-1}‖ · ‖∇f(x_a)‖
(7.2)  + ‖(∇²f(x_a) + ε_H(x_a))^{-1}‖ · ‖ε_g(x_a)‖
 ≤ K̃‖x_a − x*‖² + 2‖(∇²f(x_a))^{-1} − (∇²f(x_a) + ε_H(x_a))^{-1}‖ · ‖∇²f(x*)‖ · ‖x_a − x*‖
      + ‖(∇²f(x_a) + ε_H(x_a))^{-1}‖ · ‖ε_g(x_a)‖,
where the last step uses ‖∇f(x_a)‖ ≤ 2‖∇²f(x*)‖ ‖x_a − x*‖ from Lemma 7.1.
For ‖ε_H(x_a)‖ ≤ ‖(∇²f(x*))^{-1}‖^{-1}/4, Lemma 7.1 yields
‖ε_H(x_a)‖ ≤ ‖(∇²f(x_a))^{-1}‖^{-1}/2.
Setting B = ∇²f(x_a) + ε_H(x_a) and A = (∇²f(x_a))^{-1}, one obtains:
(7.3) ‖I − BA‖ = ‖I − (∇²f(x_a) + ε_H(x_a))(∇²f(x_a))^{-1}‖ ≤ ‖ε_H(x_a)‖ · ‖(∇²f(x_a))^{-1}‖ ≤ 1/2.
The Banach Lemma implies that B = ∇²f(x_a) + ε_H(x_a) is nonsingular and additionally:
‖(∇²f(x_a) + ε_H(x_a))^{-1}‖ ≤ ‖(∇²f(x_a))^{-1}‖ / (1 − ‖ε_H(x_a)(∇²f(x_a))^{-1}‖)
 ≤ ‖(∇²f(x_a))^{-1}‖ / (1/2) = 2‖(∇²f(x_a))^{-1}‖
(7.4) ≤ 4‖(∇²f(x*))^{-1}‖.
Thus we have by (7.4), (7.3) and Lemma 7.1:
‖(∇²f(x_a))^{-1} − (∇²f(x_a) + ε_H(x_a))^{-1}‖
 ≤ ‖(∇²f(x_a) + ε_H(x_a))^{-1}‖ · ‖I − (∇²f(x_a) + ε_H(x_a))(∇²f(x_a))^{-1}‖
 ≤ 4‖(∇²f(x*))^{-1}‖ · ‖ε_H(x_a)‖ · ‖(∇²f(x_a))^{-1}‖,
where (7.4) bounds the first factor, (7.3) the second, and Lemma 7.1 gives
‖(∇²f(x_a))^{-1}‖ ≤ 2‖(∇²f(x*))^{-1}‖.
Theorem 7.4. Let (A) be satisfied. Then there exist K > 0 and δ > 0 such that for x^0 ∈ B(δ)
the sequence {x^k} generated by (7.5) converges Q-linearly to x* and satisfies
‖x^{k+1} − x*‖ ≤ K‖x^0 − x*‖ ‖x^k − x*‖.
Proof. Let δ be chosen such that Theorem 7.3 holds true. Then (7.6) implies
‖x^{k+1} − x*‖ ≤ K(‖x^k − x*‖² + γ(‖x^0 − x*‖ + ‖x^k − x*‖)‖x^k − x*‖)
 = K(‖x^k − x*‖(1 + γ) + γ‖x^0 − x*‖)‖x^k − x*‖
 ≤ K(1 + 2γ)δ‖x^k − x*‖.
Decreasing δ until K(1 + 2γ)δ =: η < 1 yields the assertion.
The Shamanskii method is a generalization of (7.5). In this variant, the Hessian is
updated only at every (m + 1)-st iteration by the Hessian at the current iterate. For m = 0 one
obtains Newton's method and for m = ∞ iteration (7.5). We state the following result.
Theorem 7.5. Let (A) be satisfied and m ≥ 0 be given. Then there exist constants K ≥ 0
and δ > 0 such that the Shamanskii method converges Q-superlinearly to x* for all x^0 ∈ B(δ).
Appropriate stopping criteria. We have already mentioned that the stopping criterion
(7.7) ‖∇f(x^k)‖ ≤ τ_r ‖∇f(x^0)‖ + τ_a
with τ_r ∈ (0, 1) and 1 ≫ τ_a > 0 is adequate, whereas testing the difference between two
consecutive function values (as sole stopping criterion) is not reasonable. Consider e.g.
f(x^k) = −Σ_{j=1}^{k} j^{-1}.
If x* is a local minimizer of f and f(x*) = 0, then the problem min f(x) is called a null-
residuum problem. In case f(x*) is small, i.e. the data fitting is good, then one refers to a
problem with small residuum.
Let R'(x) ∈ R^{M×n} be the Jacobian of R; then it holds that
∇f(x) = R'(x)^T R(x) ∈ R^n.
The necessary condition for a local minimizer x* is given by
(7.9) 0 = ∇f(x*) = R'(x*)^T R(x*).
For an underdetermined problem with rank(R'(x*)) = M, (7.9) does imply R(x*) = 0.
However, for M > n this is not the case. The Hessian of f is given by
∇²f(x) = R'(x)^T R'(x) + Σ_{i=1}^{M} r_i(x) ∇²r_i(x).
Observe that for the computation of ∇²f(x), the M Hessians ∇²r_i(x) have to be evaluated.
2.1. Gauss-Newton iteration. Let us assume that min f(x) is a null-residuum prob-
lem. Then it holds that
∇²f(x*) = R'(x*)^T R'(x*), as r_i(x*) = 0 for all i.
This suggests using R'(x)^T R'(x) as an approximation of the Hessian of f, which converges
to ∇²f(x*) for x → x*. In case of small residuals r_i(x*), R'(x)^T R'(x) typically represents a
good Hessian approximation at x near x*.
With the help of this Hessian approximation, we construct the following quadratic model:
m_a(x) = f(x_a) + R(x_a)^T R'(x_a)(x − x_a) + (1/2)(x − x_a)^T R'(x_a)^T R'(x_a)(x − x_a).
Assuming that R'(x_a)^T R'(x_a) has full rank, there exists a unique minimizer x_+ of m_a(x)
which satisfies
0 = R'(x_a)^T R(x_a) + R'(x_a)^T R'(x_a)(x_+ − x_a).
In the following we will consider over- and underdetermined problems separately. In any case
we make the following assumption:
Assumption 7.1. The point x* is a local minimizer of min ‖R(x)‖², R' is Lipschitz con-
tinuous at x*, and R'(x*) has full rank. The last assumption means that
• R'(x*) is nonsingular for M = n;
• R'(x*) has full column rank for M > n;
• R'(x*) has full row rank for M < n.
2.2. Overdetermined problems. The Gauss-Newton method is given by the iteration
rule
x^{k+1} = x^k − (R'(x^k)^T R'(x^k))^{-1} R'(x^k)^T R(x^k), x^0 ∈ R^n given.
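A hedged sketch of this iteration (the exponential model and the synthetic data are invented for illustration; the least-squares solve plays the role of the normal equations):

```python
import numpy as np

def gauss_newton(R, J, x0, iters=20):
    x = x0.astype(float)
    for _ in range(iters):
        # d solves R'(x)^T R'(x) d = -R'(x)^T R(x) via linear least squares
        d, *_ = np.linalg.lstsq(J(x), -R(x), rcond=None)
        x = x + d
    return x

# fit y = exp(c t) to noise-free data: a null-residuum problem, M = 5 > n = 1
t = np.linspace(0.0, 1.0, 5)
y = np.exp(0.5 * t)
R = lambda x: np.exp(x[0] * t) - y                   # residual vector
J = lambda x: (t * np.exp(x[0] * t)).reshape(-1, 1)  # Jacobian, 5x1
c = gauss_newton(R, J, np.array([0.0]))
```

Since the residuum vanishes at the solution, the iterates converge rapidly to c = 0.5, consistent with the discussion above.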
We have the following result.
Theorem 7.6. Let M > n and let Assumption 7.1 be satisfied. Then there exist K > 0 and δ > 0
such that the Gauss-Newton step
(7.10) x_+ = x_a − (R'(x_a)^T R'(x_a))^{-1} R'(x_a)^T R(x_a)
Thus we have seen that the Gauss-Newton method converges even for problems with large
residuum, provided that R''(x*) is sufficiently small.
2.3. Underdetermined problems. At first we consider the following underdetermined
linear least-squares problem
min ‖Ax − b‖², A ∈ R^{M×n}, M < n.
It can be demonstrated that there is no unique minimizer, but a unique minimizer with
minimal norm. This special solution can be expressed with the help of the singular value
decomposition of A, which is given by
A = U Σ V^T
with Σ = diag(σ_i) ∈ R^{M×n} a diagonal matrix whose diagonal entries are called singular
values. It holds that σ_i ≥ 0 and σ_i = 0 for i > M. The columns of U ∈ R^{M×M} and V ∈ R^{n×n}
are called left and right singular vectors. The matrices U and V are orthogonal.
The solution with minimal norm is given by
x = A† b,
where A† = V Σ† U^T, Σ† = diag(σ_i†) and
σ_i† = 1/σ_i for σ_i ≠ 0, σ_i† = 0 for σ_i = 0.
The matrix A† is called the Moore-Penrose inverse of A. If A is a nonsingular square
matrix, then it holds that A† = A^{-1}. The singular value decomposition also exists for
M > n, and, if A has full column rank, one obtains A† = (A^T A)^{-1} A^T. In addition, A† A is
a projection onto the image of A† and AA† is a projection onto the image of A, i.e.
A† A A† = A†, (A† A)^T = A† A and A A† A = A, (A A†)^T = A A†.
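A small numerical illustration (the matrix and right-hand side are our own): NumPy's pinv computes A⁺ = VΣ⁺Uᵀ via the SVD, and A⁺b is the minimal-norm solution of an underdetermined system.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0]])     # M = 1 < n = 3: underdetermined
b = np.array([14.0])

x = np.linalg.pinv(A) @ b           # minimal-norm solution A^+ b
```

Here x = (1, 2, 3): among all solutions of Ax = b it has the smallest Euclidean norm, since it lies in the row space of A.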
The solution with minimal norm of
min (1/2)‖R(x_a) + R'(x_a)(x − x_a)‖²
in case of underdetermined problems is
x_+ = x_a − R'(x_a)† R(x_a),
which corresponds to the Gauss-Newton iteration for the associated nonlinear least-squares
problem. In the linear case, i.e. R(x) = Ax − b, it follows that
x_+ = x_a − A†(Ax_a − b) = (I − A† A)x_a + A† b.
Let e_a = x_a − A† b and e_+ = x_+ − A† b; then A† A A† b = A† b implies
e_+ = (I − A† A)e_a.
This does not ensure that x_+ = A† b, i.e., that x_+ is the solution with minimal norm, but it does
imply that x_+ solves the problem and that the method terminates after one step. Let Z = {x : R(x) = 0}.
Theorem 7.7. Let M < n and Assumption 7.1 be fulfilled for z* ∈ Z. Then there exists
δ > 0 such that the Gauss-Newton iteration
x^{k+1} = x^k − R'(x^k)† R(x^k)
is well-defined for ‖x^0 − z*‖ ≤ δ and converges R-quadratically to z* ∈ Z.
• if η_k ≤ K_η ‖∇f(x^k)‖^p for K_η > 0, then the rate of convergence is Q-superlinear with
Q-order 1 + p.
3.1. Implementation of the Newton-CG method. As already mentioned, in the
Newton-CG method the Newton direction
∇²f(x^k) d^k = −∇f(x^k)
is determined with the help of the CG method. In addition we assume that D_h²f(x; d) is a
sufficiently exact and readily available approximation of the Hessian-vector product ∇²f(x)d.
The quantity h can be interpreted, for example, as the step size of a difference approximation
of the second derivative of f in the direction d. We now specify a variant of the preconditioned
CG method which terminates with an error message if ∇²f(x) is singular (w.r.t. d), i.e.
d^T ∇²f(x)d = 0, or if d turns out to be a direction of negative curvature, i.e. d^T ∇²f(x)d < 0.
Later we will see that the case of negative curvature can also lead to meaningful search
directions.
Algorithm 7.1.
input: W ∈ S^n positive definite, η ∈ R_0^+, x ∈ R^n.
begin
  set d^0 := 0, r^0 := ∇f(x), p^0 := −W^{-1} r^0, l := 0.
  while ‖r^l‖ > η‖∇f(x)‖
  begin
    w^l := D_h²f(x; p^l)
    if (p^l)^T w^l = 0 then RETURN("indefiniteness")
    if (p^l)^T w^l < 0 then RETURN("negative curvature")
    set
      α_l := ((r^l)^T W^{-1} r^l) / ((p^l)^T w^l)
      d^{l+1} := d^l + α_l p^l
      r^{l+1} := r^l + α_l w^l
      β_{l+1} := ((r^{l+1})^T W^{-1} r^{l+1}) / ((r^l)^T W^{-1} r^l)
      p^{l+1} := −W^{-1} r^{l+1} + β_{l+1} p^l
      l := l + 1
  end
end
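A hedged Python transcription of Algorithm 7.1 with W = I and exact Hessian-vector products H @ p in place of D_h²f(x; p) (the merged curvature test and the status strings are our simplifications):

```python
import numpy as np

def truncated_cg(H, g, eta=0.1, max_iter=100):
    """Approximately solve H d = -g; stop on zero or negative curvature."""
    d = np.zeros_like(g)
    r = g.copy()                     # residual of H d + g
    p = -r
    for _ in range(max_iter):
        if np.linalg.norm(r) <= eta * np.linalg.norm(g):
            break
        w = H @ p                    # Hessian-vector product
        curv = p @ w
        if curv <= 0.0:              # indefiniteness or negative curvature
            return d, "negative curvature"
        alpha = (r @ r) / curv
        d = d + alpha * p
        r_new = r + alpha * w
        beta = (r_new @ r_new) / (r @ r)
        p = -r_new + beta * p
        r = r_new
    return d, "ok"
```

For a positive definite H the routine returns (an inexact) Newton direction; for an indefinite H it exits early with the partial step accumulated so far, mirroring the RETURN branches of the algorithm.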
In the implementation of the Newton-CG Algorithm, the preconditioner W and the error
bound η from Theorem 7.9 will be adjusted at each Newton iteration. This idea is imple-
mented in the following Newton-CG method:
Remark 7.1. (1) In Algorithm 7.2 we have made use of a rather simple stopping cri-
terion. Naturally, combined criteria (see the discussion in section 1 of this chapter)
can be applied as well.
(2) The algorithm terminates whenever "indefiniteness" occurs in the CG iteration. Possible
modifications in case that d^k is a direction of negative curvature will be discussed
later on.
(3) We also stop the Newton iteration as soon as the full step xk+1 = xk + dk does not
contribute to a decrease in f .
4. Global convergence
Regarding Newton methods we have only considered local convergence (results) so far. Until
now, we have always assumed that x0 is sufficiently close to a local solution x∗ . Now we will
introduce globalization approaches which allow for a relaxation of the choice of the starting
point.
If one ensures that ∇²f(x^k) or a corresponding Hessian approximation satisfies
c_1‖d‖² ≤ d^T ∇²f(x^k) d ≤ c_2‖d‖² for all k ∈ N and all d ∈ R^n
(0 < c_1 ≤ c_2), then d^k defined as the solution of ∇²f(x^k)d = −∇f(x^k) is a gradient-related
search direction. Inserting this into the general descent method 3.1 yields (cf. chapter 5.2)
the global convergence of the global Newton method which means that the method converges
to a stationary point regardless of the choice of the starting point x0 . Furthermore it can be
shown that in the vicinity of x∗ , αk = 1 will be accepted as the step size if one initiates the
step size algorithm with α(0) = 1.
Here, we focus on another way to globalize Newton's method. Given a current approxima-
tion of a solution, the next iterate is confined to a sufficiently small neighborhood of the
current iterate; owing to this strategy, the approach is called the trust-region method (or
trust region globalization).
4.1. Trust-Region method. A major drawback of the general descent method 3.1 with
one of the step size strategies of chapter 3.2 is the necessity of ensuring that {∇2 f (xk )} ⊂ S n
is positive definite. Trust region methods deal with this problem in a suitable way and
solve it algorithmically. Roughly speaking, these methods realize a smooth transition from
the method of steepest descent to Newton’s method. In this way the global convergence
property of steepest descent is combined with the fast local convergence of Newton’s method
(Theorem 7.2).
The idea can be described as follows: Let ma (x) be a quadratic model of f in a neighborhood
of xa , which is given by
m_a(x) = f(x_a) + ∇f(x_a)^T (x − x_a) + (1/2)(x − x_a)^T ∇²f(x_a)(x − x_a),
and let ∆ be the radius of a ball about xa where we “trust” the model ma to represent f
well. The quantity ∆ is called the trust region radius, and one refers to
T (∆) = {x : kx − xa k ≤ ∆}
as the trust region.
Given xa , the next iterate x+ is chosen as an approximate minimizer of ma in T (∆). The
associated trust region subproblem is defined as
(7.15) min m_a(x_a + d) s.t. ‖d‖ ≤ ∆.
We denote the solution of (7.15) by dv (trial step) and the associated trial solution as xv =
xa + dv . Then we have to decide whether the step is acceptable or whether the trust region
radius needs to be changed. Usually, both options are checked simultaneously. For the former
one verifies whether the quadratic model is a good approximation of f in T (∆). For this
purpose we define
ared = f (xa ) − f (xv ) (actual reduction)
and also
pred = ma (xa ) − ma (xv ) (predicted reduction).
Note that (with H_a = ∇²f(x_a))
pred = m_a(x_a) − m_a(x_v) = −∇f(x_a)^T (x_v − x_a) − (1/2)(x_v − x_a)^T H_a (x_v − x_a)
     = −∇f(x_a)^T d_v − (1/2) d_v^T H_a d_v.
In the following algorithm we need the parameters
µ_0 ≤ µ ≤ µ̄
to decide whether we reject the trial step (ared/pred < µ_0) and/or reduce ∆ (ared/pred < µ),
whether we increase ∆ (ared/pred > µ̄) or leave the trust region radius unchanged. The
reduction resp. the increase of ∆ is realized by multiplication by 0 < ω < 1 < ω̄. Further
let C > 1 be fixed.
Algorithm 7.3.
input: x_a ∈ R^n, x_v ∈ R^n, ∆ ∈ R^+.
begin
  z^0 := x_a, z_v^0 := x_v, ∆̂^(0) := ∆, l := 0.
  while z^l = x_a
  begin
    ared^(l) := f(x_a) − f(z_v^l), d_v^l := z_v^l − x_a, pred^(l) := −∇f(x_a)^T d_v^l − (1/2)(d_v^l)^T H_a d_v^l
    if ared^(l)/pred^(l) < µ_0 then
      z^(l+1) := x_a, ∆̂^(l+1) := ω ∆̂^(l)
      if l ≥ 1 and ∆̂^(l) > ∆̂^(l−1) then
        z^(l+1) := z_v^(l−1), ∆̂^(l+1) := ∆̂^(l−1)
      end
    elseif µ_0 ≤ ared^(l)/pred^(l) ≤ µ then
      z^(l+1) := z_v^l, ∆̂^(l+1) := ω ∆̂^(l)
    elseif µ ≤ ared^(l)/pred^(l) ≤ µ̄ then
      z^(l+1) := z_v^l
    elseif µ̄ ≤ ared^(l)/pred^(l) then
      if ‖d_v^l‖ = ∆̂^(l) ≤ C‖∇f(x_a)‖ then
        z^(l+1) := x_a, ∆̂^(l+1) := ω̄ ∆̂^(l)
        compute the solution d_v^(l+1) of the trust region subproblem with radius ∆̂^(l+1)
        z_v^(l+1) := x_a + d_v^(l+1)
      else
        z^(l+1) := z_v^l
      end
    end
    l := l + 1
  end
  x_+ := z^l, ∆_+ := ∆̂^(l)
end
4.2. Global convergence of the trust region algorithm. Theoretically, the trust
region subproblem can be solved exactly. It turns out that even a relatively inaccurate solution
of the trust region subproblem suffices to prove global and locally superlinear convergence.
For the proof we invoke the following assumption.
Assumption 7.2.
(1) There exists σ > 0 such that
(7.16) pred = f(x_a) − m_a(x_v) ≥ σ‖∇f(x_a)‖ min{‖d_v‖, ‖∇f(x_a)‖}.
(2) There exists M > 0 such that
‖d_v‖ ≥ ‖∇f(x_a)‖/M or ‖d_v‖ = ∆_a.
We obtain the following global convergence result.
Theorem 7.10. Let ∇f be Lipschitz continuous with modulus L. Let {x^k} be the sequence
generated by Algorithm 7.4, and further assume that the solutions of the trust region subprob-
lems fulfill Assumption 7.2. Moreover, suppose that the matrices {H^k} are bounded. Then
either f is unbounded from below, or ∇f(x^k) = 0 for some finite k, or
lim_{k→∞} ‖∇f(x^k)‖ = 0.
Proof. Assume that ∇f(x^k) ≠ 0 for all k and f is bounded from below; otherwise the asser-
tion is immediate. We show that in case the step is accepted (and, hence, the radius is not
further enlarged), there exists M_T ∈ (0, 1) such that
(7.17) ‖d_v^k‖ ≥ M_T ‖∇f(x^k)‖.
Assume for the moment that (7.17) holds true. Since d_v^k is accepted, Algorithm 7.3 and
Assumption 7.2 yield
ared_k ≥ µ_0 pred_k ≥ µ_0 σ‖∇f(x^k)‖ min{‖d_v^k‖, ‖∇f(x^k)‖}.
Applying (7.17), we obtain
(7.18) ared_k ≥ µ_0 σ‖∇f(x^k)‖² min{M_T, 1} = µ_0 σ M_T ‖∇f(x^k)‖².
Since {f(x^k)} is monotonically decreasing and f is bounded from below, it follows that
lim_{k→∞} ared_k = 0. Thus, (7.18) implies lim_{k→∞} ‖∇f(x^k)‖ = 0.
It remains to prove (7.17). First note that for ‖d_v^k‖ < ∆_k, Assumption 7.2(2) gives
‖d_v^k‖ ≥ ‖∇f(x^k)‖/M.
The case
(7.19) ‖d_v^k‖ = ∆_k and ‖d_v^k‖ < ‖∇f(x^k)‖
remains. In fact, if (7.19) does not hold true, then (7.17) follows with M_T = min{1, 1/M}.
Provided that (7.19) is satisfied and d_v^k is accepted, we show that
(7.20) ‖d_v^k‖ = ∆_k ≥ (2σ min{1 − µ̄, (1 − µ_0) ω̄^{-2}} / (M + L)) ‖∇f(x^k)‖.
Then the assertion follows with
M_T = min{1, 1/M, 2σ min{1 − µ̄, (1 − µ_0) ω̄^{-2}} / (M + L)}.
≥ 1 − ((M + L) ω̄ ‖d_v^{k,+}‖) / (2σ‖∇f(x^k)‖) ≥ 1 − ((M + L) ω̄² ‖d_v^k‖) / (2σ‖∇f(x^k)‖) ≥ µ_0
owing to (7.22). Hence, the enlargement of the radius produces an acceptable step which
would be taken instead of dkv . Thus, (7.20) has to hold true.
Next we study the computation of the trial step d_v resp. the trial point x_v = x_a + d_v. It
suffices to compute approximate solutions of the trust region subproblem (7.15) such that
Assumption 7.2 is satisfied. To this end, a simple idea is based on fixing the direction to the
steepest descent direction under the trust region constraint. Let x_a be the current iterate
and ∆_a the current trust region radius. Then the trial point x_v := x_v(α) is defined via the
minimizer α_a of
min_{α≥0} Ψ_a(α) := m_a(x_a − α∇f(x_a)) s.t. x_v(α) := x_a − α∇f(x_a) ∈ T(∆_a).
It holds that
Ψ_a(α) = m_a(x_a − α∇f(x_a)) = f(x_a) − α‖∇f(x_a)‖² + (α²/2) ∇f(x_a)^T H_a ∇f(x_a),
Ψ_a'(α) = −‖∇f(x_a)‖² + α ∇f(x_a)^T H_a ∇f(x_a).
For determining α_a we have to distinguish between the following cases:
(1) ∇f(x_a)^T H_a ∇f(x_a) ≤ 0. Obviously the trust-region constraint becomes active. It
holds that
‖x_v(α_a) − x_a‖ = α_a‖∇f(x_a)‖ = ∆_a,
which yields α_a = ∆_a/‖∇f(x_a)‖.
(2) ∇f(x_a)^T H_a ∇f(x_a) > 0. In this case, we have
Ψ_a'(α̂_a) = 0 ⟹ α̂_a = ‖∇f(x_a)‖² / (∇f(x_a)^T H_a ∇f(x_a)).
If ‖x_v(α̂_a) − x_a‖ ≤ ∆_a is fulfilled, then we accept α̂_a as α_a; otherwise the trust region
constraint becomes active and analogously to (1) it follows that α_a = ∆_a/‖∇f(x_a)‖.
To summarize, we have
(7.24) α_a := ∆_a/‖∇f(x_a)‖ if ∇f(x_a)^T H_a ∇f(x_a) ≤ 0,
       α_a := min{∆_a/‖∇f(x_a)‖, ‖∇f(x_a)‖²/(∇f(x_a)^T H_a ∇f(x_a))} if ∇f(x_a)^T H_a ∇f(x_a) > 0.
The minimizer of the quadratic model m_a in the direction of the negative gradient is called the
Cauchy point and will be denoted by x_a^CP.¹ The Cauchy point has the following properties
which will prove to be useful for the global convergence result.
(1) ∇f(x_a)^T H_a ∇f(x_a) ≤ 0. Then it follows that
f(x_a) − m_a(x_a^CP) = α_a‖∇f(x_a)‖² − (α_a²/2) ∇f(x_a)^T H_a ∇f(x_a)
 ≥ ∆_a‖∇f(x_a)‖ = ‖d_v‖ · ‖∇f(x_a)‖.
(2) ∇f(x_a)^T H_a ∇f(x_a) > 0. Depending on where the minimum in (7.24) is attained, we
have the following situations:
¹When using the iteration index k, the Cauchy point will be written as x_CP^k.
    (i) αa = ∆a / k∇f(xa)k. Then it holds that
            αa ≤ α̂a = k∇f(xa)k² / (∇f(xa)> Ha ∇f(xa))
        and thus
            αa ∇f(xa)> Ha ∇f(xa) ≤ k∇f(xa)k².
        This implies
            f(xa) − ma(xCP_a) = ∆a k∇f(xa)k − (α²a/2) ∇f(xa)> Ha ∇f(xa)
                              ≥ ∆a k∇f(xa)k − (αa/2) k∇f(xa)k²
                              = ∆a k∇f(xa)k − (∆a/2) k∇f(xa)k = (∆a/2) k∇f(xa)k
                              = (1/2) kdv k · k∇f(xa)k.
    (ii) The second case occurs if αa = α̂a = k∇f(xa)k² / (∇f(xa)> Ha ∇f(xa)). Then we have
            f(xa) − ma(xCP_a) = α̂a k∇f(xa)k² − (α̂²a/2) ∇f(xa)> Ha ∇f(xa) = (α̂a/2) k∇f(xa)k².
Remark 7.2. Weakening the assumptions of Theorem 7.11, one can still show
lim inf_{k→∞} k∇f(xk)k = 0.
68 BMS Basic Course: Nonlinear Optimization
4.2.1. Superlinear convergence. The idea of always fixing the direction −∇f(xa) for the
approximate solution of the trust region subproblem and determining the step size αa such
that xv(αa) − xa ∈ T(∆a) often leads to a very slow (only Q-linear) rate of convergence
(comparable to the method of steepest descent). For this reason, we discuss a technique
which locally realizes the transition to the Newton direction. For this purpose, at xa we
define the Newton point
        xN_a = xa − Ha⁻¹ ∇f(xa).
If Ha ∈ Sn is positive definite, then xN_a is the global minimizer of the usual quadratic model
of f at xa. In case Ha possesses directions of negative curvature, the quadratic model has
no finite minimizer. The Newton point, however, remains meaningful.
Now we consider a special approximate solution of the trust region subproblem, which finally
yields a locally superlinearly convergent method. This is achieved by minimizing ma along a
piecewise linear path P ∈ T(∆), which is called the dogleg path. Its classical variant makes
use of three points: xa, xN_a and x̂CP_a, the global minimizer of the quadratic model in the
direction of steepest descent. Note that x̂CP_a only exists if ∇f(xa)> Ha ∇f(xa) > 0 is
satisfied. Whenever x̂CP_a exists and fulfills

(7.26)    (xN_a − x̂CP_a)> (x̂CP_a − xa) ≥ 0,

then we define xN_a as the last node on the path. If Ha ∈ Sn is positive definite, then it can
be shown that (7.26) is fulfilled. In case (7.26) is violated, xN_a is not used. A closer
examination of (7.26) yields

    0 ≤ (xN_a − x̂CP_a)> (x̂CP_a − xa) = (xN_a − xa + xa − x̂CP_a)> (x̂CP_a − xa)
      = (xN_a − xa)> (x̂CP_a − xa) − kx̂CP_a − xa k².
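For positive definite Ha, where (7.26) holds automatically, the dogleg step can be sketched as follows (a minimal illustration with invented names; it returns the step d rather than the point xa + d):

```python
import numpy as np

def dogleg_step(g, H, delta):
    """Classical dogleg step d with ||d|| <= delta, assuming H is symmetric
    positive definite (then condition (7.26) is fulfilled automatically)."""
    d_newton = -np.linalg.solve(H, g)
    if np.linalg.norm(d_newton) <= delta:
        return d_newton                          # Newton point inside the region
    gHg = g @ H @ g
    d_cauchy = -(g @ g) / gHg * g                # unconstrained Cauchy step
    if np.linalg.norm(d_cauchy) >= delta:
        return -delta / np.linalg.norm(g) * g    # boundary point along -g
    # walk from the Cauchy point towards the Newton point until ||d|| = delta
    p = d_newton - d_cauchy
    a, b = p @ p, 2 * d_cauchy @ p
    c = d_cauchy @ d_cauchy - delta**2
    lam = (-b + np.sqrt(b**2 - 4 * a * c)) / (2 * a)
    return d_cauchy + lam * p
```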
for λ > 0. Hence, φ is strictly monotonically increasing, which concludes the proof.
Now we show that the quadratic model decreases monotonically along the "dogleg" path.
Lemma 7.4. Let the assumptions of Lemma 7.3 be satisfied. Then the local quadratic model
    ma(x) = f(xa) + ∇f(xa)> (x − xa) + (1/2)(x − xa)> Ha (x − xa)
is monotonically decreasing along P.
Proof. Let x̂CP_a (≠ xa) be the minimizer of ma in the direction −∇f(xa). Thus, ma is
strictly monotonically decreasing from xa to x̂CP_a. Hence, only the segment from x̂CP_a
to xN_a remains to be discussed. For this purpose, let
    ψ(λ) = ma(xa + (1 − λ) d̂CP_a + λ dN_a)
         = f(xa) + ∇f(xa)> ((1 − λ) d̂CP_a + λ dN_a)
           + (1/2) ((1 − λ) d̂CP_a + λ dN_a)> Ha ((1 − λ) d̂CP_a + λ dN_a).
Then
    ψ′(λ) = 2α̂a (1 − λ) k∇f(xa)k² − (1 − λ) α̂²a ∇f(xa)> Ha ∇f(xa) + (1 − λ) ∇f(xa)> dN_a
          = 2α̂a (1 − λ) k∇f(xa)k² − (1 − λ) α̂a k∇f(xa)k² + (1 − λ) ∇f(xa)> (−Ha⁻¹ ∇f(xa))
          = (1 − λ) ∇f(xa)> (α̂a ∇f(xa) − Ha⁻¹ ∇f(xa))
          = ((1 − λ)/α̂a) (xa − x̂CP_a)> (xa − Ha⁻¹ ∇f(xa) − (xa − α̂a ∇f(xa)))
          = ((1 − λ)/α̂a) (xa − x̂CP_a)> (xN_a − x̂CP_a) ≤ 0.
We have shown that the trust region subproblem possesses a unique solution. Now we can
prove the following global convergence theorem.
Theorem 7.12. Let ∇f be Lipschitz continuous with modulus L. Let {xk} be generated by
Algorithm 7.4, where the solutions of the trust region subproblem are given by (D). Furthermore
we assume that the sequence of matrices {Hk} is bounded. Then either {f(xk)} is unbounded
from below, or ∇f(xk) = 0 for a finite k, or
    lim_{k→∞} k∇f(xk)k = 0.
Proof. We have to prove that the conditions of Assumption 7.2 are satisfied.
Condition 2: Let kHk k ≤ M. In case kdk k ≤ ∆k, the definition of xD yields (7.26) and
xv = xN_k. Further we have
    kdk k = kxk − xN_k k = k(Hk)⁻¹ ∇f(xk)k ≥ k∇f(xk)k / M.
Condition 1: We distinguish the different cases for the determination of xD. If xD = xCP_a,
then either kdCP_a k = ∆a or (7.26) does not hold true. First we consider the case
∇f(xa)> Ha ∇f(xa) ≤ 0. Then kdCP_a k = ∆a and αa = ∆a / k∇f(xa)k. Therefore it holds that
    predk = αa k∇f(xa)k² − (α²a/2) ∇f(xa)> Ha ∇f(xa)
          = ∆a k∇f(xa)k − (∆²a / (2 k∇f(xa)k²)) ∇f(xa)> Ha ∇f(xa) ≥ ∆a k∇f(xa)k = kdv k · k∇f(xa)k.
Condition 1 holds true for σ = 1. The following cases can be checked similarly:
    ∇f(xa)> Ha ∇f(xa) > 0  ∧  ( kdCP_a k = ∆a  ∨  kdCP_a k < ∆a )   ⇒   σ = 1/2.
Finally, there is the case where (7.26) holds true whereas xD ≠ xCP_a. In this situation we
have
    predk ≥ ma(xa) − ma(xCP_a) ≥ k∇f(xa)k² / M.
Now we can apply Theorem 7.10 to obtain the assertion.
Finally, we establish fast local convergence.
Theorem 7.13. Let ∇f be Lipschitz continuous with modulus L. Let {xk } be generated by
Algorithm 7.4, where the solutions of the trust region subproblems are given by (D). Further
assume that Hk = ∇²f(xk) and {Hk} is bounded, f is bounded from below and x∗ is a
minimizer of f satisfying (A). If limk xk = x∗, then xk converges Q-quadratically to x∗.
Proof. Since x∗ is the limit of {xk}, there exists a δ > 0 such that for sufficiently large
k it holds that
    kxk − x∗k ≤ δ,    kHk k ≤ 2 k∇²f(x∗)k,    k(Hk)⁻¹k ≤ 2 k(∇²f(x∗))⁻¹k.
Furthermore, let δ be chosen such that the assertions of Theorem 7.1 hold true. If Hk ∈ Sn
is positive definite, then (Hk)⁻¹ ∈ Sn is positive definite as well and (7.26) is fulfilled. Hence,
the "dogleg" path starts at xa and runs through xCP_a to xN_a. For sufficiently small ρ, it
holds that
    k(Hk)⁻¹ ∇f(xk)k ≤ 2 kxk − x∗k ≤ 2ρ.
From this we infer (see the proof of Theorem 7.14):
    predk ≥ (1/2) kdk_v k · k∇f(xk)k.
Moreover, we have
    aredk = −∇f(xk)> dk_v − ∫₀¹ (∇f(xk + τ dk_v) − ∇f(xk))> dk_v dτ
          = predk + (1/2) (dk_v)> ∇²f(xk) dk_v − ∫₀¹ (∇f(xk + τ dk_v) − ∇f(xk))> dk_v dτ
          = predk + O(kdk_v k · k∇f(xk)k ρ).
Hence, aredk / predk = 1 − O(ρ). For sufficiently small ρ, the trust region radius will be
enlarged until xN_a is located in the trust region, and xN_a will be accepted.
Remark 7.3. Some trust region methods realize the inexact Newton idea. The local rate of
convergence can be derived analogously to Theorem 7.9.
CHAPTER 8
Quasi-Newton methods
Unlike Newton’s method, quasi-Newton methods do not make use of second order derivatives
of f . Rather, they approximate the second order derivatives iteratively with the help of
first order derivatives. Here, we consider quasi-Newton methods, which essentially work
like Newton methods with line search, but ∇2 f (xk ) is approximated by a positive definite
matrix H k . At each iteration, H k is updated in an appropriate way. The general algorithmic
structure is as follows:
For the initial matrix H 0 choose a (symmetric) positive definite matrix. A standard choice
is given by H 0 = I, but sometimes better scaling might be necessary. The benefits of quasi-
Newton methods are (amongst others):
• Only first order derivatives are required.
• H k is always positive definite such that dk is a descent direction for f at xk and our
line search framework is applicable.
• Some variants require only O(n2 ) multiplications per iteration (instead of O(n3 ) like
Newton’s method).
The last point is related to quasi-Newton variants, which approximate (∇2 f (xk ))−1 directly,
thus sparing the cost of solving the linear system to determine dk .
As we will see soon, positive definiteness is not guaranteed for all quasi-Newton methods.
In the case where {Hk} consists only of positive definite matrices, one speaks of a
"variable metric" method.
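The general structure just described can be sketched as follows (a schematic illustration, not the algorithm box of the course; the Armijo parameters and the name `update` are invented, and `update` may be any of the rules derived in the next section):

```python
import numpy as np

def quasi_newton(f, grad, x0, update, H0=None, tol=1e-8, max_iter=200):
    """Generic quasi-Newton loop: d = -H^{-1} grad f, Armijo backtracking
    line search, then H is refreshed by the supplied update rule."""
    x = np.asarray(x0, dtype=float)
    H = np.eye(len(x)) if H0 is None else H0.copy()
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        d = -np.linalg.solve(H, g)           # descent direction for pos. def. H
        alpha = 1.0
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * (g @ d):
            alpha *= 0.5                     # backtrack until Armijo holds
        s = alpha * d
        y = grad(x + s) - g
        x = x + s
        H = update(H, s, y)                  # e.g. the BFGS formula (8.5)
    return x
```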
1. Update rules
In this section we will discuss several update rules for the Hessian approximation H. Let
    sa = αa da (= −αa Ha⁻¹ ∇f(xa)),
    ya = ∇f(x+) − ∇f(xa)   with   x+ = xa + sa.
It holds that
    ya = ∇f(x+) − ∇f(xa) = ∇f(xa) + ∇²f(xa)(x+ − xa) + O(kx+ − xa k) − ∇f(xa)
       = ∇²f(xa) sa + O(ksa k).
Therefore we require
(8.1)    H+ sa = ya.
This condition is called the quasi-Newton condition (or secant condition). A simple ansatz
for H+ in (8.1) is
H+ = Ha + αuu> , α ∈ R, u ∈ Rn ,
which is referred to as the symmetric rank-1-update. Inserting this update into (8.1) yields
Ha sa + αu(u> sa ) = ya .
Thus, u is proportional to ya − Ha sa . We set u = ya − Ha sa (as the length can be adjusted
by α) which implies αu> sa = 1. This results in the symmetric rank-1-formula
(8.2)    H+ = Ha + (ya − Ha sa)(ya − Ha sa)> / ((ya − Ha sa)> sa).
Unfortunately this formula has a few drawbacks. In particular, positive definiteness may be
lost (even if H0 is chosen positive definite), and numerical problems appear whenever
ya − Ha sa ≈ 0 or (ya − Ha sa)> sa ≈ 0.
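In practice, (8.2) is therefore combined with a safeguard that simply skips the update when the denominator is small. A sketch (the threshold `r` is an illustrative choice, not from the text):

```python
import numpy as np

def sr1_update(H, s, y, r=1e-8):
    """Symmetric rank-1 update (8.2); the step is skipped when the
    denominator (y - Hs)^T s is too small relative to the factors."""
    v = y - H @ s
    denom = v @ s
    if abs(denom) < r * np.linalg.norm(v) * np.linalg.norm(s):
        return H                      # skip: formula (8.2) would be unstable
    return H + np.outer(v, v) / denom
```

When the update is applied, the secant condition H+ sa = ya holds by construction.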
The non-symmetric ansatz
H+ = Ha + αuv > , α ∈ R, u, v ∈ Rn ,
with v := sa inserted into (8.1) yields
    Ha sa + αu (sa> sa) = ya.
Hence, u is proportional to ya − Ha sa. Setting u = ya − Ha sa immediately implies
α (sa> sa) = 1. We obtain the non-symmetric rank-1-formula
(8.3)    H+ = Ha + (ya − Ha sa) sa> / (sa> sa).
The fact that positive definiteness cannot be guaranteed and the absence of symmetry
represent crucial disadvantages of (8.3).
More flexible update formulae can be derived by applying symmetric rank-2-updates, i.e.
H+ = Ha + αuu> + βvv > , α, β ∈ R, u, v ∈ Rn .
Inserting this into (8.1) yields
(8.4) Ha sa + αuu> sa + βvv > sa = ya .
The vectors u and v are no longer uniquely determined. In view of (8.4), it is adequate to
choose
u = ya and v = Ha sa .
Then we obtain
αya ya> sa + β(Ha sa )(Ha sa )> sa = ya − Ha sa ,
which implies
α(ya> sa ) = 1 and β(s> a Ha sa ) = −1.
Thus,
    α = 1 / (ya> sa)   and   β = −1 / (sa> Ha sa),
and finally
(8.5)    H+ = Ha + ya ya> / (ya> sa) − (Ha sa)(Ha sa)> / (sa> Ha sa).
The update rule (8.5) is called the BFGS-formula (named after Broyden-Fletcher-Goldfarb-
Shanno).
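Formula (8.5) translates into a one-line update (illustrative sketch; NumPy assumed):

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS rank-2 update (8.5):
    H+ = H + y y^T/(y^T s) - (Hs)(Hs)^T/(s^T H s)."""
    Hs = H @ s
    return H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)
```

One checks directly that the secant condition H+ sa = ya holds, and, in view of Lemma 8.1, positive definiteness is preserved whenever ya> sa > 0.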
One may also approximate ∇²f(xk)⁻¹ by Bk. In this case the quasi-Newton condition reads
    B+ ya = sa.
An analogous rank-2 derivation with the roles of s and y interchanged yields
(8.6)    B+ = Ba + sa sa> / (ya> sa) − (Ba ya)(Ba ya)> / (ya> Ba ya).
This formula is called the DFP-formula (after Davidon-Fletcher-Powell). Owing to the
relations B ↔ H and y ↔ s, (8.5) and (8.6) are considered as dual to each other. Numerical
practice shows that the BFGS-method is often superior to the DFP-method. In the following,
we therefore will focus on the update according to (8.5).
First of all, note that if H0 ∈ Sn, then also Hk ∈ Sn for all k due to the structure of (8.5).
Positive definiteness is considered in the following lemma.
Lemma 8.1. Let Ha ∈ S n be positive definite, ya> sa > 0 and H+ determined according to
(8.5). Then H+ ∈ S n is positive definite.
and also
    (z> Ha sa)² ≤ kHa^{1/2} zk² · kHa^{1/2} sa k² = (z> Ha z) · (sa> Ha sa).
The condition ya> sa > 0 is realistic. For quadratic problems with positive definite Hessian G,
it holds that
ya> sa = (∇f (x+ ) − ∇f (xa ))> (x+ − xa ) = (x+ − xa )> G(x+ − xa ) > 0.
For general problems, ya> sa > 0 is ensured by the Wolfe-Powell step size strategy.
it holds that
    ya> sa > 0.
Furthermore we have that the BFGS-update H+ of Ha satisfies
    F+ = H+⁻¹ − I = (I − wa wa>) Fa (I − wa wa>) + Da
with wa = sa / ksa k, Da ∈ Rn×n and kDa k ≤ KD ksa k with KD > 0.
Basically, Lemma 8.2 indicates that the approximation stays close to the exact Hessian if
the initial values are "good" enough. This property is essential for proving local superlinear
convergence.
Corollary 8.1. Under the assumptions of Lemma 8.2, it holds that
kF+ k ≤ kFa k + KD ksa k ≤ kFa k + KD (kxa − x∗ k + kx+ − x∗ k).
The second inequality in the assertion of Corollary 8.1 follows directly from sa = x+ − xa .
The first inequality can be obtained by expanding the representation of F+ from Lemma 8.2
and estimating the subsequent expression using kDa k ≤ KD ksa k.
At this point, we can prove local q-linear convergence.
Theorem 8.2. Let (A) be satisfied and σ ∈ (0, 1) be given. Then there exists δl such that for
(8.7) kx0 − x∗ k ≤ δl and k(H 0 )−1 − ∇2 f (x∗ )−1 k ≤ δl
the BFGS-iteration is well-defined and converges q-linearly to x∗ . The q-factor is bounded by
σ.
Note that in general δl is directly proportional to σ.
Proof. For sufficiently small δ̂ and
(8.8)    kxa − x∗k ≤ δ̂   and   kFa k = kHa⁻¹ − Ik ≤ δ̂,
(A) yields
    kx+ − x∗k ≤ kFa k kxa − x∗k + O(kxa − x∗k²) ≤ δ̂ kxa − x∗k + O(kxa − x∗k²).
Let δ̂ be small enough such that
    kx+ − x∗k ≤ σ kxa − x∗k < kxa − x∗k ≤ δ̂.
Choose δl such that (8.8) holds true for the entire iteration, provided that the initial value
satisfies (8.7). We choose
(8.9)    δl = (δ∗/2) (1 + KD (1 + σ)/(1 − σ))⁻¹ < δ∗/2
with KD from Lemma 8.2. In case kI − H0k < δl with δl < 1/2, we infer
    kF0 k = k(H0)⁻¹ − Ik ≤ k(H0)⁻¹k kI − H0k = k(I − (I − H0))⁻¹k kI − H0k
          ≤ kI − H0k / (1 − kI − H0k) ≤ δl / (1 − δl) ≤ 2δl ≤ δ∗.
Corollary 8.1 yields
    kF1 k ≤ kF0 k + KD (1 + σ) kx0 − x∗k.
It remains to prove that (8.7) and (8.9) imply
    kFk k < δ∗   ∀k.
For this purpose we proceed inductively. Let kFk k < δ∗ and kxj+1 − x∗k ≤ σ kxj − x∗k
∀j ≤ k. Then Corollary 8.1 implies
We now derive some useful relations. Assumption (A) also ensures ∇f(xa) ≠ 0 for xa
(xa ≠ x∗) sufficiently close to x∗. Moreover, it holds that
(8.10)    ∇f(xa) = ∫₀¹ ∇²f(x∗ + τ(xa − x∗)) (xa − x∗) dτ = (I + R1)(xa − x∗)
with
    R1 = ∫₀¹ (∇²f(x∗ + τ(xa − x∗)) − I) dτ.
Thus we obtain kR1 k ≤ (γ/2) kxa − x∗k, as well as
where
    R2 = ∫₀¹ (∇²f(xa + τ sa) − I) dτ.
Theorem 8.3. Suppose the standard assumption (A) is fulfilled. Let {Hk}_{k∈N} be a sequence
of nonsingular matrices with kHk k ≤ M for all k ∈ N. Further let x0 be given and {xk}_{k=1}^∞
be defined by
    xk+1 = xk − (Hk)⁻¹ ∇f(xk).
If xk converges q-linearly to x∗, xk ≠ x∗ for all k ∈ N, and (8.15) is met, then xk converges
q-superlinearly to x∗.
Proof. Equations (8.13) and (8.14) yield
    Fk sk = ((Hk)⁻¹ − I)(yk − R2 sk) = Fk yk + O(ksk k²).
It holds that xk → x∗ and, hence, sk → 0. The Dennis-Moré condition (8.15) can be rewritten
as
(8.16)    lim_{k→∞} kFk yk k / ksk k = 0.
Since (Hk)⁻¹ ∇f(xk) = −sk and sk = yk + O(ksk k²) (owing to (8.13)), it follows that
    Fk yk = ((Hk)⁻¹ − I)(∇f(xk+1) − ∇f(xk))
          = (Hk)⁻¹ ∇f(xk+1) + sk − yk = (Hk)⁻¹ ∇f(xk+1) + O(ksk k²)
          = (Hk)⁻¹ (xk+1 − x∗) + O(kxk − x∗k² + ksk k²) = (Hk)⁻¹ (xk+1 − x∗) + O(kxk − x∗k²).
Thus we have:
    kFk yk k / kxk − x∗k = k(Hk)⁻¹ (xk+1 − x∗)k / kxk − x∗k + O(kxk − x∗k)
                         ≥ M⁻¹ kxk+1 − x∗k / kxk − x∗k + O(kxk − x∗k) → 0,
which yields the q-superlinear convergence of xk to x∗.
Now we can state the proof of Theorem 8.1.
Proof of Theorem 8.1. Assume (8.7) of Theorem 8.2 is fulfilled with δl such that (8.2)
holds true for σ ∈ (0, 1). It follows immediately that
(8.17)    Σ_{k=0}^∞ ksk k < ∞.
Let kAk²_F = Σ_{i,j=1}^n (A)²_ij = trace(A> A) denote the squared Frobenius norm of the
matrix A. For v ∈ Rn with kvk ≤ 1, it holds that
    kA(I − vv>)k²_F ≤ kAk²_F − kAvk²,
as
    kA(I − vv>)k²_F = trace((I − vv>)> A> A (I − vv>))
                    = trace((A> A − vv> A> A)(I − vv>))
                    = trace(A> A − vv> A> A − A> A vv> + vv> A> A vv>),
trace(vv> A> A) = kAvk² and trace(vv> A> A vv>) ≤ trace(vv>) trace(A> A vv>) = kvk² kAvk²
≤ kAvk². Moreover, k(I − vv>) Ak²_F ≤ kAk²_F. Thus, Lemma 8.2 implies
    kFk+1 k²_F ≤ kFk k²_F − kFk wk k² + O(ksk k) = (1 − Θ²k) kFk k²_F + O(ksk k)
with
    wk = sk / ksk k,    Θk = kFk wk k / kFk kF   if Fk ≠ 0,    Θk = 1   if Fk = 0.
With (8.17), we have for k ≥ 0:
    Σ_{k=0}^∞ Θ²k kFk k²_F ≤ Σ_{k=0}^∞ (kFk k²_F − kFk+1 k²_F) + O(1)
                           = kF0 k²_F − kFk+1 k²_F + O(1) < ∞.
Hence, Θk kFk kF → 0 and finally
    Θk kFk kF = kFk wk k   if Fk ≠ 0,    Θk kFk kF = 0   if Fk = 0.
Moreover,
    kFk wk k = kFk sk k / ksk k.
Thus, the Dennis-Moré condition is met.
3. Global convergence
Under the assumption that there exist some constants 0 < c1 < c2 < +∞ with
c1 kxk2 ≤ x> H k x ≤ c2 kxk2 ∀x ∈ Rn ∀k ∈ N,
dk = −(H k )−1 ∇f (xk ) is a gradient-related search direction. If the Armijo step size strategy
is applied, a statement analogous to Theorem 5.1 holds true. Note that for a local minimizer
x∗ one cannot expect in general to have xk → x∗ for x0 close to x∗ . In view of the local
theory, the situation where x0 is situated close to x∗ , but H 0 is not close to ∇2 f (x∗ ) is not
better than the case where x0 is not sufficiently close to x∗ .
Theorem 8.4. Let D := {x : f(x) ≤ f(x0)} be convex, f twice continuously differentiable in
D and the spectrum σ(∇²f(x)) ⊂ [c1, c2] for all x ∈ D. If H0 ∈ Sn is positive definite, then
the BFGS-method with Armijo step size strategy converges q-superlinearly to x∗.
Similar statements hold true for the (strict) Wolfe-Powell step size strategy.
4. Numerical aspects
4.1. Memory-efficient updating. The BFGS-method always uses the current Hessian
approximation to calculate the new approximation with the aid of the rank-2-update. In
general, one expects that the performance of the approximation is improving in the course of
the iteration. Indeed, one can show that for quadratic problems, adequate initialization and
an exact step size strategy, the exact Hessian is perfectly approximated after at most n (for
∇2 f (x∗ ) ∈ Rn×n ) steps. Nonetheless, for general problems the situation is more complicated
and due to numerical reasons, a reset of H k to a well-scaled, positive definite matrix is
occasionally implemented. However, for high-dimensional problems storing the BFGS-matrix
is undesirable because of memory restrictions. The so-called limited-memory BFGS method
takes this issue into account. In this method, from the point of view of the current iterate
xk , the preceding pairs {(y l , sl )}, k − m ≤ l ≤ k, are stored and the BFGS-formula is realized
iteratively at each new iteration. Here, m ∈ N is a fixed number.
We will now specify a strategy which approximates the inverse Hessian without occupying too
much memory space. Let sa = x+ − xa, ya = ∇f(x+) − ∇f(xa) and H+ be computed from
Ha by the BFGS-formula. Then, for positive definite Ha ∈ Sn and ya> sa ≠ 0, it holds that
H+ is nonsingular and
(8.18)    H+⁻¹ = (I − sa ya> / (ya> sa)) Ha⁻¹ (I − ya sa> / (ya> sa)) + sa sa> / (ya> sa).
Rearranging yields
    H+⁻¹ = Ha⁻¹ + β0 sa sa> + γ0 ((Ha⁻¹ ya) sa> + sa (Ha⁻¹ ya)>)
with coefficients
    β0 = (ya> sa + ya> Ha⁻¹ ya) / (ya> sa)²    and    γ0 = −1 / (ya> sa).
From the relation
    Ha⁻¹ ya = Ha⁻¹ ∇f(x+) − Ha⁻¹ ∇f(xa) = Ha⁻¹ ∇f(x+) + sa / αa,
it follows that
(8.19)    H+⁻¹ = Ha⁻¹ + β1 sa sa> + γ0 (sa (Ha⁻¹ ∇f(x+))> + (Ha⁻¹ ∇f(x+)) sa>)
with β1 = β0 + 2γ0 / αa.
For the new direction d+ this yields
with coefficients
    βa = β1 − 2γ0 Aa / Ca    and    γa = γ0 / (Ca α+).
Finally, this leads to
    (Hk+1)⁻¹ = (H0)⁻¹ + Σ_{l=0}^k (βl sl (sl)> + γl (sl (sl+1)> + sl+1 (sl)>)).
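In practice, such recursive representations of the inverse are commonly realized by the so-called two-loop recursion, which applies (Hk)⁻¹ to a vector directly from the stored pairs without ever forming a matrix. A sketch (names illustrative; `gamma` scales the initial matrix (H0)⁻¹ = γI):

```python
import numpy as np

def lbfgs_direction(g, pairs, gamma=1.0):
    """Two-loop recursion: computes (H^k)^{-1} g from the stored pairs
    [(s_l, y_l), ...] (oldest first), with initial matrix (H^0)^{-1} = gamma*I."""
    q = g.copy()
    stack = []
    for s, y in reversed(pairs):            # first loop: newest pair first
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        stack.append((a, rho, s, y))
        q = q - a * y
    r = gamma * q                           # apply the initial inverse
    for a, rho, s, y in reversed(stack):    # second loop: oldest pair first
        b = rho * (y @ r)
        r = r + (a - b) * s
    return r
```

For a single stored pair this reproduces (8.18) applied to g; the memory cost is O(mn) for m stored pairs instead of O(n²).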
4.2. Positive definiteness. In Lemma 8.1 we have already seen that ya> sa > 0 has to
be assumed in order to guarantee the positive definiteness of H+ ∈ Sn. A simple strategy
which ensures this property consists of the following choice:
    H+ = H+^BFGS   if ya> sa > 0,
    H+ = R         if ya> sa ≤ 0,
with R ∈ Sn positive definite. Often one employs R = I.
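This choice sits naturally on top of the BFGS formula (8.5); a sketch with R = I as default (names illustrative):

```python
import numpy as np

def safeguarded_bfgs(H, s, y, R=None):
    """Keep the approximation positive definite: apply the BFGS formula (8.5)
    only when the curvature condition y^T s > 0 holds, otherwise reset to R."""
    if y @ s <= 0:
        return np.eye(len(s)) if R is None else R
    Hs = H @ s
    return H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)
```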
A formula which is applied especially in connection with trust region methods is the
PSB-formula (Powell-symmetric-Broyden). This formula preserves symmetry; however,
positive definiteness generally gets lost:
    H+ = Ha + ((ya − Ha sa) sa> + sa (ya − Ha sa)>) / (sa> sa) − (sa> (ya − Ha sa)) sa sa> / (sa> sa)².
CHAPTER 9
Box-constrained problems
1. Necessary conditions
The first and second order necessary conditions in the scalar case X ⊂ R are as follows.
Theorem 9.1. Let f be twice continuously differentiable on [a, b], −∞ < a < b < +∞. Let
x∗ be a local minimizer of f in [a, b]. Then it holds that
f 0 (x∗ )(x − x∗ ) ≥ 0 ∀x ∈ [a, b]
and
f 00 (x∗ )(x∗ − a)(b − x∗ ) ≥ 0.
This theorem may serve as a basis for the corresponding conditions in the multidimensional
case.
The point x∗ ∈ X is called a stationary point for (9.1), if
(9.2) ∇f (x∗ )> (x − x∗ ) ≥ 0 ∀x ∈ X.
At the same time, (9.2) represents the first order necessary condition.
In the following we use the term “solution” for “local minimizer”.
Theorem 9.2. Let f be twice continuously differentiable in X and x∗ a solution of (9.1).
Then x∗ is a stationary point for (9.1).
Proof. Let y ∈ X. Since X is convex, we have z(t) = x∗ + t(y − x∗ ) ∈ X for t ∈ [0, 1].
Consider
φ(t) = f (z(t)).
It holds that φ has a local minimum at t = 0. Now Theorem 9.1 implies
    0 ≤ φ′(t)|_{t=0} = ∇f(z(t))> (y − x∗)|_{t=0} = ∇f(x∗)> (y − x∗),
which is (9.2).
¹Of course, this is due to our assumption −∞ < Li < Ui < +∞. However, if Li = −∞ or Ui = +∞, then
we would have to ensure, as in the unconstrained case, the existence of a solution, for instance by assuming
certain convexity properties of f.
Now assume that ∇²φ(x∗_{I(x∗)}) has a negative eigenvalue λ∗. Let u∗ denote a corresponding
eigenvector. Then it holds with z := x∗_{I(x∗)} + tu∗ that
    φ(z) = φ(x∗_{I(x∗)}) + t ∇φ(x∗_{I(x∗)})> u∗ + (t²/2) u∗> ∇²φ(x∗_{I(x∗)}) u∗ + O(t³)
         = φ(x∗_{I(x∗)}) + (t²/2) λ∗ ku∗k² + O(t³) < φ(x∗_{I(x∗)})
for t sufficiently small. Note that the latter represents a contradiction to the optimality of
x∗. Thus, ∇²φ(x∗_{I(x∗)}) is positive semidefinite.
Let P : Rn → X denote the projection onto X, which is given componentwise by
    P(x)i = Li   if xi < Li,
    P(x)i = xi   if Li ≤ xi ≤ Ui,
    P(x)i = Ui   if xi > Ui.
Furthermore define
(9.3) x(α) = P (x − α∇f (x)).
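Both P and (9.3) are componentwise one-liners; a sketch (NumPy assumed; names illustrative):

```python
import numpy as np

def project(x, L, U):
    """Componentwise projection P onto the box X = {x : L <= x <= U}."""
    return np.minimum(np.maximum(x, L), U)

def x_alpha(x, grad, L, U, alpha):
    """The projected trial point (9.3): x(alpha) = P(x - alpha * grad f(x))."""
    return project(x - alpha * grad(x), L, U)
```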
Obviously it holds for A(x∗) = ∅ that x∗(α) = P(x∗) = x∗. Also, since x(α) is the projection
of x − α∇f(x) onto X, we have
    kx(α) − x + α∇f(x)k ≤ ky − x + α∇f(x)k   ∀y ∈ X.
Thus
    ψ(λ) = (1/2) k(1 − λ) x(α) + λy − x + α∇f(x)k²
has a local minimizer in λ = 0, i.e.:
    0 ≤ ψ′(0) = ((1 − λ) x(α) + λy − x + α∇f(x))> (y − x(α)) |_{λ=0}.
2. Sufficient conditions
When formulating the sufficient condition, we make use of the notion of a non-degenerate
stationary point.
Definition 9.1. A point x∗ ∈ X is a non-degenerate stationary point for (9.1), if x∗ is a
stationary point and
∇f (x∗ )i 6= 0 ∀i ∈ A(x∗ ).
If x∗ is a local minimizer of (9.1), then x∗ is called a non-degenerate local minimizer.
    φ″(0) = (x − x∗)> P_{I(x∗)} ∇²f(x∗) P_{I(x∗)} (x − x∗) = (x − x∗)> ∇²_R f(x∗)(x − x∗).
Thus, φ′(0) = 0 and φ″(0) > 0.
The stopping criterion will be specified later. In any case, one should fix an upper bound kmax
for the maximum number of iterations and terminate the algorithm if this bound is exceeded.
Now we want to focus on the stopping criterion. Evidently, k∇f (xk )k ≤ τr k∇f (x0 )k + τa is in
general not appropriate. We start by studying the active and inactive sets of adjacent points.
Lemma 9.2. Let f be twice continuously differentiable on X, and let x∗ be a non-degenerate
stationary point for (9.1). Let α ∈ (0, 1]. Then it holds for x sufficiently close to x∗ that
(1) A(x) ⊂ A(x∗ ) and xi = x∗i ∀i ∈ A(x).
(2) A(x(α)) = A(x∗ ) and x(α)i = x∗i ∀i ∈ A(x∗ ).
Proof. Let
    δ1 = min_{i∈I(x∗)} min{Ui − x∗i , x∗i − Li}.
If i ∈ I(x∗) and kx − x∗k < δ1, then Li < xi < Ui. Moreover, I(x∗) ⊂ I(x) and thus
A(x) ⊂ A(x∗), which proves (1).
Let A(α) and I(α) be the active resp. inactive indices for x(α). Let A(x∗) ≠ ∅ and
i ∈ A(x∗). According to Lemma 9.1 and the continuity of ∇f there exists a constant δ2 > 0
with
    kx − x∗k < δ2   ⇒   ∇f(x∗ + (x − x∗))i (x − x∗)i ≥ (σ/2) (x − x∗)i.
For
    δ3 < min(σ/2, δ2)   and   kx − x∗k < δ3
it follows that i ∈ A(α) and x(α)i = x∗i. Hence, A(x∗) ⊂ A(α). On the other hand, the
definition of P implies
    kP(x) − P(y)k ≤ kx − yk   ∀x, y ∈ Rn.
By continuity of ∇2 f , ∇f is Lipschitz continuous on X. Let L denote the corresponding
Lipschitz constant. We have
x∗ = x∗ (α) = P (x∗ − α∇f (x∗ ))
and therefore
(9.8)    kx∗ − x(α)k = kP(x∗ − α∇f(x∗)) − P(x − α∇f(x))k
                     ≤ kx∗ − xk + α k∇f(x∗) − ∇f(x)k ≤ (1 + Lα) kx − x∗k.
If i ∈ A(α) ∩ I(x∗), then it holds that
(9.9)    kx∗ − x(α)k ≥ δ1 = min_{i∈I(x∗)} min{Ui − x∗i , x∗i − Li}.
If now kx − x∗k < δ4 := min{δ3, δ1/(1 + L)}, then (9.8) implies that (9.9) cannot be satisfied.
Now we can prove the equivalence of kx − x∗k and kx − x(1)k, which will lead to a suitable
stopping criterion.
Theorem 9.6. Let f be twice continuously differentiable on X, and x∗ a non-degenerate
stationary point for (9.1). Further assume that the second order necessary condition holds at
x∗ . Then there exist δ > 0 and K > 0 such that for kx − x∗ k ≤ δ and A(x) = A(x∗ ) it holds
that
(9.10) K −1 kx − x∗ k ≤ kx − x(1)k ≤ Kkx − x∗ k.
Proof. We have
kx − x(1)k = kx − x∗ − (x(1) − x∗ (1))k
≤ kx − x∗ k + kP (x − ∇f (x)) − P (x∗ − ∇f (x∗ ))k
≤ 2kx − x∗ k + k∇f (x) − ∇f (x∗ )k ≤ (2 + L)kx − x∗ k.
This implies the inequality on the right hand side of (9.10). Choose δ1 such that kx−x∗ k < δ1
implies that Lemma 9.2 holds for α = 1. One has
    (x − x(1))i = ∇f(x)i      for i ∈ I(x∗),
    (x − x(1))i = (x − x∗)i   for i ∈ A(x∗).
It remains to consider i ∈ I(x∗). The sufficient conditions yield the existence of a µ > 0 with
    u> P_{I(x∗)} ∇²f(x∗) P_{I(x∗)} u ≥ µ kP_{I(x∗)} uk²   ∀u ∈ Rn.
Thus, there exists another constant δ2 such that for kx − x∗k < δ2:
    u> P_{I(x∗)} ∇²f(x) P_{I(x∗)} u ≥ (µ/2) kP_{I(x∗)} uk²   ∀u ∈ Rn.
At first we demonstrate that the Armijo step sizes are bounded away from 0.
Theorem 9.7. Let ∇f be Lipschitz continuous with modulus L and let x ∈ X. Then (9.7) is
satisfied for all α with
    0 < α ≤ 2(1 − σ)/L.
Proof. Let y := x − x(α). Then it holds that
    f(x − y) − f(x) = f(x(α)) − f(x) = −∫₀¹ ∇f(x − τy)> y dτ.
By definition of y,
    f(x(α)) = f(x) + ∇f(x)> (x(α) − x) − ∫₀¹ (∇f(x − τy) − ∇f(x))> y dτ.
Thus:
    α (f(x) − f(x(α))) = α ∇f(x)> (x − x(α)) + α ∫₀¹ (∇f(x − τy) − ∇f(x))> y dτ.
It holds that
    k∫₀¹ (∇f(x − τy) − ∇f(x))> y dτk ≤ (L/2) kx − x(α)k²,
which implies
    α (f(x) − f(x(α))) ≥ α ∇f(x)> (x − x(α)) − (αL/2) kx − x(α)k².
Using (9.5), we infer
    α (f(x) − f(x(α))) ≥ (1 − αL/2) kx − x(α)k².
Now,
    f(x(α)) − f(x) ≤ (L/2 − 1/α) kx − x(α)k²,
where
    L/2 − 1/α ≤ −σ/α   ⇔   α ≤ 2(1 − σ)/L.
According to this, the step size strategy terminates successfully if
    β^m ≤ 2(1 − σ)/L < β^{m−1}.
Furthermore, we observe that
    α = 2β(1 − σ)/L > 0
is a (uniform) lower bound on the step sizes αk.
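Putting the pieces together, the projected gradient method with Armijo backtracking can be sketched as follows. The sufficient decrease test is taken in the hedged form f(x) − f(x(α)) ≥ (σ/α) kx − x(α)k², which is the inequality established in the proof of Theorem 9.7; all parameter values and names are illustrative:

```python
import numpy as np

def projected_gradient(f, grad, x0, L, U, sigma=1e-4, beta=0.5,
                       tol=1e-8, max_iter=500):
    """Projected gradient method on the box [L, U]: backtrack over
    alpha = beta^m, stop via the measure ||x - x(1)|| (cf. (9.10))."""
    P = lambda z: np.minimum(np.maximum(z, L), U)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(x - P(x - g)) <= tol:   # stationarity measure
            break
        alpha = 1.0
        xa = P(x - alpha * g)
        # sufficient decrease: f(x) - f(x(alpha)) >= (sigma/alpha)||x - x(alpha)||^2
        while f(x) - f(xa) < (sigma / alpha) * np.linalg.norm(x - xa)**2:
            alpha *= beta
            xa = P(x - alpha * g)
        x = xa
    return x
```

By Theorem 9.7 the inner backtracking loop terminates after finitely many trials.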
For the projected gradient method, the following convergence result can now be shown:
Theorem 9.8. Let ∇f be Lipschitz continuous with modulus L. Let {xk} be generated by
Algorithm 9.1. Then every accumulation point of {xk} is a stationary point for (9.1).
Proof. Owing to the Armijo step size strategy, {f (xk )} is monotonically decreasing. In
addition, {f (xk )} is bounded from below on X. Hence there exists a limit point f ∗ ∈ R.
Conditions (9.7) and (9.10) imply
    kxk − xk+1 k² ≤ (αk/σ) (f(xk) − f(xk+1)) ≤ (1/σ) (f(xk) − f(xk+1)) → 0 as k → ∞.
For all y ∈ X it holds that
    ∇f(xk)> (xk − y) = ∇f(xk)> (xk+1 − y) + ∇f(xk)> (xk − xk+1)
                     ≤ (1/αk) (xk − xk+1)> (xk+1 − y) + ∇f(xk)> (xk − xk+1)
by (9.4), and
(9.11)    ∇f(xk)> (xk − y) ≤ kxk − xk+1 k ((1/αk) kxk+1 − yk + k∇f(xk)k)
                           ≤ kxk − xk+1 k ((1/α) kxk+1 − yk + k∇f(xk)k).
Let {xk(l) } be a subsequence converging to x∗ , then (9.11) implies
The method which realizes these ideas is referred to as scaled projected gradient method.
Thus:
PI(xa ) ∇f (xa ) = PI(xa ) ∇2 f (xa )PI(xa ) (xa − x∗ ) + PI(xa ) E1
= PA(xa ) (xa − x∗ ) + PI(xa ) ∇2 f (xa )PI(xa ) (xa − x∗ ) + PI(xa ) E1
= ∇2R f (xa )(xa − x∗ ) + PI(xa ) E1 .
By definition of ∇2R f , we have
PI(xa ) (∇2R f (xa ))−1 ∇f (xa ) = xa − x∗ + E2
with kE2 k ≤ K2 kxa − x∗ k2 , K2 > 0. As PI(xa ) (P (w)) = P (PI(xa ) w) for all w ∈ Rn , it holds
PI(xa ) x+ = PI(xa ) P (xa − (∇2R f (xa ))−1 ∇f (xa ))
= P (PI(xa ) (xa − (∇2R f (xa ))−1 ∇f (xa )))
= P (x∗ − E2 ).
Thus kx+ − x∗ k ≤ K2 kxa − x∗ k2 .
Remark 9.1. It should be mentioned that there exist projected variants of the BFGS method.
In order to take account of the box-constraints, the update formula has to be slightly mod-
ified by means of the projections PI(x) and PA(x) . Moreover, under the assumptions of
Theorem 9.11 and a sufficiently good initial approximation of the reduced Hessian, local
superlinear convergence of the projected BFGS method can be proven.
Bibliography
[1] D. Bertsekas, Nonlinear Programming, Athena Scientific Publisher, Belmont, Massachusetts, 1995.
[2] J. F. Bonnans, J. C. Gilbert, C. Lemaréchal, C. Sagastizábal, Optimisation Numérique, Mathématiques
& Applications 27, Springer-Verlag, Berlin, 1997.
[3] A. R. Conn, N. I. M. Gould, P. L. Toint, Trust-Region Methods, SIAM, Philadelphia, 2000.
[4] J. E. Dennis, R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equa-
tions, SIAM, Philadelphia, 1996.
[5] R. Fletcher, Practical Methods of Optimization I + II, Wiley & Sons Publisher, New York, 1980.
[6] C. Geiger, C. Kanzow, Numerische Verfahren zur Lösung unrestringierter Optimierungsaufgaben,
Springer-Verlag, Berlin, 1999.
[7] P. E. Gill, W. Murray, M. Wright, Practical Optimization, Academic Press, San Diego, 1981.
[8] F. Jarre, J. Stoer, Optimierung, Springer-Verlag, Berlin, 2004.
[9] C. T. Kelley, Iterative Methods for Optimization, Frontiers in Applied Mathematics, SIAM, Philadelphia,
1999.
[10] P. Spellucci, Numerische Verfahren der nichtlinearen Optimierung, Birkhäuser-Verlag, Basel, 1993.