
Berlin School of Mathematics – Basic Course

Nonlinear Optimization
Part I: Unconstrained and box-constrained problems

Prof. Dr. Michael Hintermüller

Humboldt-University of Berlin
Department of Mathematics
John-von-Neumann Haus, Room 2.426
Rudower Chaussee 25, Berlin-Adlershof
[email protected]
Contents

Acknowledgement

Chapter 1. Introduction
1. Preface and motivation
2. Notions of solutions

Chapter 2. Optimality conditions
1. General case
2. Convex functions

Chapter 3. General descent methods and step size strategies
1. Globally convergent descent methods
2. Step size strategies and algorithms
2.1. Armijo rule
2.2. Wolfe-Powell rule
2.3. Strong Wolfe-Powell rule
3. Practical aspects

Chapter 4. Rate of convergence
1. Q-convergence and R-convergence
2. Characterizations

Chapter 5. Gradient based methods
1. The method of steepest descent
2. Gradient-related methods

Chapter 6. Conjugate gradient method
1. Quadratic minimization problems
2. Nonlinear functions
2.1. Fletcher-Reeves method
2.2. Polak-Ribière method and modifications

Chapter 7. Newton’s method
1. Inaccuracies in function, gradient and Hessian evaluation
2. Nonlinear least-squares problems
2.1. Gauss-Newton iteration
2.2. Overdetermined problems
2.3. Underdetermined problems
3. Inexact Newton methods
3.1. Implementation of the Newton-CG method
4. Global convergence
4.1. Trust-region method
4.2. Global convergence of the trust-region algorithm
4.2.1. Superlinear convergence

Chapter 8. Quasi-Newton methods
1. Update rules
2. Local convergence theory
3. Global convergence
4. Numerical aspects
4.1. Memory-efficient updating
4.2. Positive definiteness
5. Further Quasi-Newton formulae

Chapter 9. Box-constrained problems
1. Necessary conditions
2. Sufficient conditions
3. Projected gradient method
4. Superlinearly convergent methods
4.1. Projected Newton method

Bibliography
Acknowledgement

These lecture notes grew out of several courses which I held at the Karl-Franzens University of Graz and the University of the Philippines in Manila, respectively.
For the careful typesetting of all the proofs in my original manuscript (and for the tedious task of deciphering my handwriting) I would like to express my sincere thanks to Mag. Cornelia Kulmer. Mag. Ian Kopacka was invaluable for the input he gave and for tracing typos in an earlier version of the script.
These lecture notes are largely based on the monographs listed in the bibliography.
CHAPTER 1

Introduction

1. Preface and motivation


The following task is known as a finite dimensional minimization problem:

(1.1)  Let X ⊂ Rn be an arbitrary set and f : X → R a continuous function. The problem is to find an x∗ ∈ X such that

f (x∗ ) ≤ f (x) for all x ∈ X.

Using a more compact notation we write

min f (x) s.t. x ∈ X,

where “s.t.” stands for “subject to”, or

(1.2)  min_{x∈X} f (x).
If X = Rn , then (1.1) (resp. (1.2)) is called unconstrained, otherwise constrained. In general, X is called the feasible set and f the objective function.
Remark 1.1. Maximizing f (x) for x ∈ X is equivalent to minimizing −f (x) s.t. x ∈ X.
Therefore we can restrict ourselves to problems of the form (1.1) or (1.2), resp.
As a mathematical model for numerous problems in, e.g., physics, medicine, economics and engineering science, problem (1.1) is of great importance.
Example 1. In many cases one is interested in computing certain parameters by means of
observation (measurements) of the corresponding system. Typically, the difference between
measured data and data based on the computations should be minimal.
Let M be a mass point with mass m which is attached to a spring aligned with the vertical y-axis. If the spring is relaxed, M is located at the origin (equilibrium position). If M is displaced, the (compressed or expanded) spring exerts a restoring force K which drives M back towards its equilibrium position. For small displacements y, the force is accurately modeled by Hooke’s law K = −k̂y, where k̂ is a positive spring constant. If the position of M at time t is denoted by y(t), then (neglecting damping and friction), according to Newton’s law:
(1.3) mÿ = −k̂y,
which is called the undamped harmonic oscillator equation. In most cases, friction and damping forces are proportional to the velocity of M , i.e., they take the form −rẏ with fixed r > 0.
Together with (1.3) we obtain
mÿ + rẏ + k̂y = 0.
Setting c := r/m, k := k̂/m we get
(1.4) ÿ + cẏ + ky = 0.

Let us assume that at time t = 0 the displacement is y(0) = y0 and furthermore ẏ(0) = 0,
then the following initial conditions hold true:
(1.5) y(0) = y0 , ẏ(0) = 0
In the following we will concentrate on the time interval [0, T ]. Let {y^j}_{j=1}^{N} be measurements of the spring’s deflection at the time instances t_j = (j − 1)T /(N − 1). The objective is to determine the spring constant k and the damping factor c from these measurements.
Let x = (c, k)> . To emphasize the dependence of y(t) on x, we also write y(x; t). Following
the motivation at the beginning of the example, we try to solve the following unconstrained
non-linear minimization problem:
(1.6)  min_{x∈R²} f (x) := (1/2) Σ_{j=1}^{N} |y(x; t_j ) − y^j |².

It should be mentioned that y is differentiable with respect to x if c² − 4k ≠ 0 holds true.


Problem (1.6) aims at minimizing the sum of the squares of the errors (“nonlinear least
squares problem”).
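The objective (1.6) is straightforward to evaluate numerically. The following sketch (our own illustration, not part of the original notes) uses the closed-form solution of (1.4)–(1.5) in the underdamped case c² − 4k < 0; all function names are ours:

```python
import numpy as np

def y_model(x, t, y0=1.0):
    """Underdamped solution of y'' + c y' + k y = 0 with y(0) = y0, y'(0) = 0.

    x = (c, k); assumes c**2 - 4*k < 0 (underdamped case)."""
    c, k = x
    w = np.sqrt(k - c**2 / 4.0)  # damped angular frequency
    return y0 * np.exp(-c * t / 2.0) * (np.cos(w * t) + c / (2.0 * w) * np.sin(w * t))

def f(x, t, y_meas):
    """Objective (1.6): half the sum of squared residuals."""
    r = y_model(x, t) - y_meas
    return 0.5 * np.sum(r**2)
```

For noise-free synthetic data generated with the true parameters, f vanishes at the true x and is positive elsewhere, so a numerical minimizer of f recovers (c, k).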
A further simple example, taken from economics, is the following:
Example 2. In a company the following fictional relation between the output quantity x and the corresponding total costs is found, using “least squares” estimation:
K(x) = Kv (x) + Kf (x).
Here Kv (x) denotes the variable costs for output quantity x. Moreover, there are fixed costs (lease and rental charges, ...) in the amount of Kf (x) = c, c > 0. Normally one is looking for an x∗ minimizing the total costs K(x), i.e.,

(1.7) x∗ = argmin{Kf (x) + Kv (x) : x ∈ R} = argmin{Kv (x) : x ∈ R},


which is equivalent to finding an x∗ solving
min Kv (x).
x∈R
For general problems one cannot expect the set argmin{Kf (x) + Kv (x) : x ∈ R} to contain
just one element. However, if Kv is uniformly convex (see section 2), then the first equality
in (1.7) is justified.
In case that X ≠ Rn is the feasible set, it can often be written in the form
X = X1 ∩ X2 ∩ X3
with sets
X1 = {x ∈ Rn : ci (x) = 0, i ∈ I1 },
X2 = {x ∈ Rn : ci (x) ≤ 0, i ∈ I2 },
X3 = {x ∈ Rn : xi ∈ Z, i ∈ I3 }.
Here I1 , I2 , I3 ⊂ N are finite index sets. The sets X1 , X2 and X3 are called equality-, inequality-
and integer constraints.
Remark 1.2. In Example 1 we have
X = {x ∈ R2 : xi ≥ 0, i ∈ {1, 2}}.

If X is a set of discrete points, one refers to (1.1) as a discrete (or combinatorial) optimization
problem; otherwise the optimization problem is called continuous.
Occasionally f is not differentiable; in this case we call (1.1) a nondifferentiable optimization problem (this is also the case if one of the ci ’s in X1 or X2 is non-differentiable, even if f is differentiable).
Remark 1.3. For instance, replacing the objective function f (x) in Example 1 by

g(x) = Σ_{j=1}^{N} |y(x; t_j ) − y^j |,

we obtain a nondifferentiable optimization problem.

2. Notions of solutions
In the following definition we introduce our basic notions of optimality.
Definition 1.1. Let f : X → R with X ⊂ Rn . The point x∗ ∈ X is called a
(i) (strict) global minimizer of f (on X), if and only if
f (x∗ ) ≤ f (x) (f (x∗ ) < f (x)) for all x ∈ X \ {x∗ }.
The optimal objective value f (x∗ ) is called a (strict) global minimum;
(ii) (strict) local minimizer of f (on X), if there exists a neighborhood U of x∗ such that
f (x∗ ) ≤ f (x) (f (x∗ ) < f (x)) for all x ∈ (X ∩ U ) \ {x∗ },
The optimal objective value f (x∗ ) is called a (strict) local minimum.
Remark 1.4. The point x∗ is a ((strict) global, (strict) local) maximizer of f (on X), if and
only if x∗ is a ((strict) global, (strict) local) minimizer of −f (on X).
In the following the gradient of f at x is denoted by

∇f (x) = (∂f /∂x1 (x), . . . , ∂f /∂xn (x))> .
Definition 1.2. Let X ⊂ Rn be an open set and f : X → R be a continuously differentiable
function. The point x∗ ∈ X is called a stationary point of f , if
∇f (x∗ ) = 0
holds true.
CHAPTER 2

Optimality conditions

This chapter deals with necessary and sufficient conditions for characterizing minimizers
(under certain differentiability assumptions on f ).

1. General case
If f possesses no structure or properties apart from differentiability, then in general we can only make statements about local minimizers.
Theorem 2.1. Let X ⊂ Rn be an open set and f : X → R a continuously differentiable function.
If x∗ ∈ X is a local minimizer of f (on X), then
(2.1) ∇f (x∗ ) = 0,
i.e., x∗ is a stationary point.
Proof. We prove the statement by means of contradiction. Let us assume that x∗ is a local minimizer for which ∇f (x∗ ) = 0 does not hold true, i.e., ∇f (x∗ ) ≠ 0. Then there exists d ∈ Rn with ∇f (x∗ )> d < 0 (choose for instance d = −∇f (x∗ )).
By assumption, f is continuously differentiable. Consequently, the directional derivative of f
at x∗ in direction d exists:
f ′(x∗ ; d) = lim_{α↓0} [f (x∗ + αd) − f (x∗ )]/α = ∇f (x∗ )> d < 0.
Due to the continuity of the derivative there exists ᾱ > 0 satisfying x∗ + αd ∈ X (X is open)
and
[f (x∗ + αd) − f (x∗ )]/α < 0
for all 0 < α ≤ ᾱ. Therefore, it holds that
f (x∗ + αd) < f (x∗ ) ∀0 < α ≤ ᾱ,
which contradicts the assumption that x∗ is a local minimizer of f in X. 
Remark 2.1. (1) Since Theorem 2.1 only uses first order derivatives and assumes x∗ to
be a (local) minimizer, it specifies a first order necessary condition.
(2) The condition ∇f (x∗ ) = 0 is not sufficient for a local minimum; consider, e.g., f (x) = −x² with x∗ = 0.
As preparation for the next Theorem 2.2 we need the following lemma about the continuity
of the smallest eigenvalue of a matrix.
Lemma 2.1. Let Sn be the vector space of symmetric n × n matrices. For A ∈ Sn let λ(A) ∈ R denote the smallest eigenvalue of A. Then the following estimate holds true:
|λ(A) − λ(B)| ≤ ‖A − B‖ for all A, B ∈ Sn .

Note that the vector norm and the matrix norm are denoted by the same symbol, i.e., ‖ · ‖.
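Lemma 2.1 is easy to test numerically (a sanity check, not a proof; here we take ‖ · ‖ to be the spectral norm, which is one admissible choice, and the function names are ours):

```python
import numpy as np

def eigmin(A):
    """Smallest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(A).min()

def lemma21_gap(A, B):
    """Returns ||A - B|| - |lambda(A) - lambda(B)| in the spectral norm.

    Lemma 2.1 asserts that this quantity is nonnegative."""
    return np.linalg.norm(A - B, 2) - abs(eigmin(A) - eigmin(B))
```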
If f is twice continuously differentiable it follows from Lemma 2.1 and from the continuity
of ∇2 f ∈ Rn×n (the Hessian of f ), that ∇2 f (x) is positive definite in a neighborhood of
x∗ if ∇2 f (x∗ ) is positive definite. An analogous statement holds true if ∇2 f (x∗ ) is negative definite.
Theorem 2.2. Let X ⊂ Rn be open and f : X → R be twice continuously differentiable.
If x∗ ∈ X is a local minimizer of f (on X), then ∇f (x∗ ) = 0 and the Hessian ∇2 f (x∗ ) is
positive semi-definite.
Proof. The statement that ∇f (x∗ ) = 0 holds true, follows from Theorem 2.1. Therefore,
we only have to consider the positive-semi-definiteness of the Hessian of f at x∗ . Again we
prove the statement by means of contradiction. Let us assume that x∗ is a local minimizer of
f , but ∇2 f (x∗ ) is not positive semi-definite. Then there exists d ∈ Rn such that
(2.2) d> ∇2 f (x∗ )d < 0.
Applying Taylor’s theorem, we obtain for sufficiently small α > 0:
(2.3)  f (x∗ + αd) = f (x∗ ) + α∇f (x∗ )> d + (α²/2) d> ∇2 f (ξ(α))d = f (x∗ ) + (α²/2) d> ∇2 f (ξ(α))d,
where we used ∇f (x∗ ) = 0 and the existence of ϑ = ϑ(α) ∈ (0, 1) with ξ(α) = x∗ + ϑαd ∈ X.
By Lemma 2.1 and (2.2) there exists ᾱ > 0, such that
d> ∇2 f (ξ(α))d < 0 ∀0 < α ≤ ᾱ.
Now, (2.3) yields
f (x∗ + αd) < f (x∗ ) ∀0 < α ≤ ᾱ,
which contradicts the assumption that x∗ is a local minimizer of f on X. 
Remark 2.2. (1) The conditions of Theorem 2.1 and Theorem 2.2 are not sufficient for local minimality; consider, e.g., f (x) = x₁² − x₂⁴ with x∗ = (0, 0)> .
(2) As Theorem 2.2 involves second order derivatives and assumes x∗ to be a (local)
minimizer, it defines second order necessary conditions.
The subsequent theorem specifies second order sufficient conditions: If conditions (a) and (b)
of Theorem 2.3 are satisfied at the point x∗ , then x∗ is a strict local minimizer of f on X.
Theorem 2.3. Let X ⊂ Rn be open and f : X → R twice continuously differentiable. If
(a) ∇f (x∗ ) = 0 and
(b) ∇2 f (x∗ ) is positive definite,
then x∗ is a strict local minimizer of f (on X).
Proof. Assumption (b) ensures that λ(∇2 f (x∗ )) > 0, i.e., the smallest eigenvalue of the Hessian of f at x∗ is positive. Therefore it holds:

d> ∇2 f (x∗ )d ≥ µ d> d = µ‖d‖²  ∀d ∈ Rn ,

for 0 < µ ≤ λ(∇2 f (x∗ )). From Taylor’s theorem we obtain for all d sufficiently close to 0: x∗ + d ∈ X and

f (x∗ + d) = f (x∗ ) + ∇f (x∗ )> d + (1/2) d> ∇2 f (ξ(d))d

with ξ(d) = x∗ + ϑd for ϑ = ϑ(d) ∈ (0, 1). Applying (a) and the Cauchy-Schwarz inequality, we obtain

f (x∗ + d) = f (x∗ ) + (1/2) d> ∇2 f (x∗ )d + (1/2) d> (∇2 f (ξ(d)) − ∇2 f (x∗ ))d
           ≥ f (x∗ ) + (1/2) (µ − ‖∇2 f (ξ(d)) − ∇2 f (x∗ )‖) ‖d‖².

Given that ∇2 f is continuous, we are able to choose d small enough, such that ‖∇2 f (ξ(d)) − ∇2 f (x∗ )‖ ≤ µ/2 holds true. Thus,

f (x∗ + d) ≥ f (x∗ ) + (µ/4)‖d‖² > f (x∗ )

for all sufficiently small d ∈ Rn with d ≠ 0. Hence x∗ is a strict local minimizer of f on X. 

Remark 2.3. (1) Conditions (a) and (b) in Theorem 2.3 are not necessary for the local minimality of x∗ ; consider f (x) = x₁² + x₂⁴ with x∗ = (0, 0)> . To some extent there is a “gap” between necessary and sufficient conditions.
(2) Given (a) of Theorem 2.3 in the case of an indefinite Hessian ∇2 f (x∗ ), we refer to
x∗ as a saddle point.
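Theorems 2.1–2.3 suggest a simple numerical test for candidate minimizers: check that the gradient (nearly) vanishes and that the Hessian is positive definite. The following sketch uses central finite differences; the helper names and tolerances are our own choices:

```python
import numpy as np

def num_grad(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def num_hess(f, x, h=1e-4):
    """Central-difference approximation of the Hessian of f at x (symmetrized)."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return 0.5 * (H + H.T)

def check_second_order_sufficient(f, x, tol=1e-4):
    """True if grad f(x) is (numerically) zero and the Hessian is positive definite."""
    g = num_grad(f, x)
    lam_min = np.linalg.eigvalsh(num_hess(f, x)).min()
    return np.linalg.norm(g) < tol and lam_min > tol
```

Note that such a test can only confirm the sufficient conditions up to discretization and rounding error; by Remark 2.3 it may reject genuine minimizers that fall into the “gap”.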

2. Convex functions
Convex functions are of particular importance for optimization. For a convex function f
we are able to show that the first order necessary conditions are also sufficient for local
optimality (see Theorem 2.6) . In the following we will introduce procedures that approximate
a complicated non-linear minimization problem by a sequence of convex problems. Apart
from global properties, these convex problems offer a simple way of computing solutions or
approximations.
Definition 2.1. (1) A set X ⊂ Rn is called convex, if for all x, y ∈ X and all λ ∈ (0, 1)
λx + (1 − λ)y ∈ X,
i.e., the segment [x, y] lies completely in X.
(2) Let X ⊂ Rn be convex. A function f : X → R is called
(i) (strictly) convex (on X), if for all x, y ∈ X and for all λ ∈ (0, 1) the following holds true:
f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y)
(f (λx + (1 − λ)y) < λf (x) + (1 − λ)f (y) whenever x ≠ y).
Geometrically, the (strict) convexity of f means that the line segment connecting (x, f (x)) and (y, f (y)) lies (strictly) above the graph of f .
(ii) uniformly convex (on X), if there exists µ > 0 with
f (λx + (1 − λ)y) + µλ(1 − λ)‖x − y‖² ≤ λf (x) + (1 − λ)f (y)
for all x, y ∈ X and all λ ∈ (0, 1). (In that case, f is sometimes called uniformly convex with modulus µ.)
By definition, every uniformly convex function is also strictly convex and every strictly convex
function is also convex. The converse is not true in general!

Remark 2.4. Let f : Rn → R be a quadratic function, i.e.,

f (x) = (1/2) x> Ax + b> x + c

with A ∈ Sn , b ∈ Rn and c ∈ R. Then the following statements hold true:
(a) f is convex ⇐⇒ A is positive semi-definite.
(b) f is strictly convex ⇐⇒ f is uniformly convex ⇐⇒ A is positive definite.
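Remark 2.4 reduces the convexity of a quadratic to an eigenvalue computation. A small sketch (the classification thresholds and the function name are our own choices):

```python
import numpy as np

def quadratic_convexity(A, tol=1e-10):
    """Classify f(x) = 0.5 x^T A x + b^T x + c for symmetric A (Remark 2.4).

    Returns 'uniformly convex' if A is positive definite,
    'convex' if A is merely positive semi-definite, else 'not convex'."""
    lam_min = np.linalg.eigvalsh(A).min()
    if lam_min > tol:
        return 'uniformly convex'
    if lam_min > -tol:
        return 'convex'
    return 'not convex'
```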
If the (strictly, uniformly) convex function f is continuously differentiable, the following characterizations, relating the graph of the function to its tangent hyperplanes, hold true.
Lemma 2.2. Let X ⊂ Rn be open and convex and f : X → R continuously differentiable. Then the following assertions hold true:
(a) f is convex (on X) if and only if for all x, y ∈ X there holds:
(2.4) f (x) ≥ f (y) + ∇f (y)> (x − y).
(b) f is strictly convex (on X) if and only if for all x, y ∈ X with x ≠ y there holds:
(2.5) f (x) > f (y) + ∇f (y)> (x − y).
(c) f is uniformly convex (on X) if and only if there exists µ > 0 such that
(2.6) f (x) ≥ f (y) + ∇f (y)> (x − y) + µ‖x − y‖²
for all x, y ∈ X.
Proof. We prove (a) and assume that (2.4) is satisfied. If x, y ∈ X and λ ∈ [0, 1] are
arbitrarily chosen and z := λx + (1 − λ)y, then the following holds true:
f (x) ≥ f (z) + ∇f (z)> (x − z),
f (y) ≥ f (z) + ∇f (z)> (y − z).
Multiplying the first inequality by λ and the second by (1 − λ) and adding the two yields
λf (x) + (1 − λ)f (y) − f (z) ≥ ∇f (z)> (λx + (1 − λ)y − z) = 0.
Hence
f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y)
which proves the convexity of f . Conversely, for all x, y ∈ X and λ ∈ (0, 1), the convexity of
f provides
f (λx + (1 − λ)y) = f (y + λ(x − y)) ≤ λf (x) + (1 − λ)f (y)
and

[f (y + λ(x − y)) − f (y)]/λ ≤ f (x) − f (y).
Since f is continuously differentiable, taking the limit λ ↓ 0 in the inequality above yields condition (2.4), i.e.
∇f (y)> (x − y) ≤ f (x) − f (y).
Now, we turn to (b). The verification of the strict convexity of f under (2.5) is analogous to
(a). However, the reverse direction cannot be shown by applying the same arguments as in
(a), since the application of the limit does not ensure the strict inequality. Hence, we assume

that f is strictly convex. Since a strictly convex function is also convex, we already have the result from (a). For z := (1/2)(x + y), the inequality of (2.4) yields
∇f (y)> (x − y) = 2∇f (y)> (z − y) ≤ 2 (f (z) − f (y)).
If x ≠ y holds true, then the strict convexity of f implies that 2f (z) < f (x) + f (y). Thanks to the relation above, we deduce that
∇f (y)> (x − y) < f (x) − f (y),
which corresponds to (2.5).
Taking into account the quadratic term, the proof of (c) is completely analogous to (a). 
Now we provide a characterization of twice continuously differentiable (strictly, uniformly)
convex functions, enabling us to read off the convexity qualities of f from the definiteness of
the Hessian of f .
Theorem 2.4. Let X ⊂ Rn be an open, convex set and f : X → R twice continuously differentiable. Then the following statements hold true:
(a) f is convex (on X) if and only if ∇2 f (x) is positive semi-definite for all x ∈ X.
(b) If ∇2 f (x) is positive definite for all x ∈ X, then f is strictly convex (on X).
(c) f is uniformly convex (on X) if and only if ∇2 f (x) is uniformly positive definite on X, i.e., if there exists µ > 0 such that
d> ∇2 f (x)d ≥ µ‖d‖²
for all x ∈ X and all d ∈ Rn .
Proof. We start with (a) and assume that f is convex (on X). Due to the assumption that f is twice continuously differentiable, the application of Taylor’s theorem yields the following equation:

(2.7)  f (y) = f (x) + ∇f (x)> (y − x) + (1/2)(y − x)> ∇2 f (x)(y − x) + r(y − x)

for all y ∈ X sufficiently close to x. The remainder term has the following property: r(y − x)/‖y − x‖² → 0 for y → x. Now we choose y = x + αd, where d ∈ Rn is arbitrary and α > 0 is sufficiently small. Lemma 2.2 (a) yields

0 ≤ (α²/2) d> ∇2 f (x)d + r(αd).

Dividing by α²/2 and considering the limit for α ↓ 0 we obtain

0 ≤ d> ∇2 f (x)d.

Since x ∈ X and d ∈ Rn were chosen arbitrarily, the statement holds true. Conversely: Given that f is twice continuously differentiable with ∇2 f (x) positive semi-definite for all x ∈ X, the subsequent equation follows from Taylor’s theorem with integral remainder:

(2.8)  f (y) = f (x) + ∇f (x)> (y − x) + ∫₀¹ (1 − τ )(y − x)> ∇2 f (x + τ (y − x))(y − x) dτ.

The positive semi-definiteness of ∇2 f yields

f (y) ≥ f (x) + ∇f (x)> (y − x)

for y, x ∈ X. Then the convexity of f on X follows from Lemma 2.2 (a).

The proof of (b) works analogously to the second part of the proof of (a). Now let us turn to the verification of (c). We assume that f is uniformly convex. Then, analogously to (a), we obtain (2.7). Lemma 2.2 (c) with y = x + αd, where d ∈ Rn and α > 0 is sufficiently small, provides

µα²‖d‖² ≤ (α²/2) d> ∇2 f (x)d + r(αd).

Dividing by α² and considering the limit for α ↓ 0 gives

µ‖d‖² ≤ (1/2) d> ∇2 f (x)d

for arbitrary d ∈ Rn , which proves the assertion. Conversely: Let ∇2 f be uniformly positive definite (with modulus µ > 0). Then the assertion follows from relation (2.8), the estimate

∫₀¹ (1 − τ )(y − x)> ∇2 f (x + τ (y − x))(y − x) dτ ≥ (µ/2)‖x − y‖²,

and Lemma 2.2 (c) (with modulus µ/2). 
Note that statement (b) of Theorem 2.4 cannot be reversed in general; consider, e.g., f (x) = x⁴ on R.
The following lemma deals with the level sets of uniformly convex functions. In the con-
text of general continuously differentiable functions, the statement of Lemma 2.3 is of local
importance, i.e., in a neighborhood of the local minimizer x∗ of f (in X).
Lemma 2.3. Let f : Rn → R be continuously differentiable and x0 ∈ Rn arbitrary. Further
assume that the level set
L(x0 ) := {x ∈ Rn : f (x) ≤ f (x0 )}
is convex and that f is uniformly convex in L(x0 ). Then the set L(x0 ) is compact.
Proof. First note that due to our construction L(x0 ) ≠ ∅ holds true. Let x ∈ L(x0 ). From the uniform convexity of f in L(x0 ) we obtain (with λ = 1/2)

(µ/4)‖x − x0 ‖² + f ((x + x0 )/2) ≤ (1/2)(f (x) + f (x0 )),

where µ > 0. With the aid of Lemma 2.2 (a) we get the following estimate:

(µ/4)‖x − x0 ‖² ≤ (1/2)(f (x) − f (x0 )) − (f ((x + x0 )/2) − f (x0 ))
              ≤ −(f ((x + x0 )/2) − f (x0 ))
              ≤ −(1/2)∇f (x0 )> (x − x0 ) ≤ (1/2)‖∇f (x0 )‖ ‖x − x0 ‖.

From this we infer

‖x − x0 ‖ ≤ (2/µ)‖∇f (x0 )‖  ∀x ∈ L(x0 ),

or, in other words, the boundedness of L(x0 ). Given that f is continuous, we find that L(x0 ) is closed. Closedness and boundedness yield the compactness of L(x0 ), which ends the proof. 
Now we have gathered all ingredients for proving the following theorem, which illustrates why (strictly, uniformly) convex functions are of fundamental importance in optimization.

Theorem 2.5. Let f : Rn → R be continuously differentiable and X ⊂ Rn be convex. Consider the optimization problem
(2.9) min f (x) s.t. x ∈ X.
Then the following statements hold true:
(a) If f is convex on X, the solution set of (2.9) is convex (possibly empty).
(b) If f is strictly convex on X, (2.9) has at most one solution.
(c) If f is uniformly convex (on X) and X is non-empty and closed, then (2.9) has
exactly one solution.
Proof. (a) Let x1 , x2 ∈ X be two solutions of (2.9). Then f (x1 ) = f (x2 ) = min_{x∈X} f (x) holds true. For λ ∈ [0, 1] the convexity of X ensures that also λx1 + (1 − λ)x2 ∈ X. Moreover we have

f (λx1 + (1 − λ)x2 ) ≤ λf (x1 ) + (1 − λ)f (x2 ) = min_{x∈X} f (x).

Therefore f attains a minimum at λx1 + (1 − λ)x2 . Given that λ ∈ [0, 1] was arbitrary, this proves the convexity of the solution set.
(b) Let us assume that (2.9) has two solutions x1 ≠ x2 . Due to the strict convexity of f on X, we find for λx1 + (1 − λ)x2 ∈ X with λ ∈ (0, 1) that

f (λx1 + (1 − λ)x2 ) < λf (x1 ) + (1 − λ)f (x2 ) = min_{x∈X} f (x)

holds true, which is a contradiction.


(c) Let x0 ∈ X be chosen arbitrarily. Lemma 2.3 ensures the compactness of L(x0 ) and also
of X ∩ L(x0 ). The continuous function f attains a minimum on the compact set X ∩ L(x0 ).
Since every uniformly convex function is also strictly convex, (b) holds true. Consequently,
the minimizer is unique. 
Remark 2.5. (1) Even for a strictly convex f , problem (2.9) need not have a solution; consider for instance f (x) = exp(x) with X = R.
(2) The requirement that X be closed in Theorem 2.5 cannot be dismissed! Consider f (x) = x² on X = (0, 1].
A further immediate consequence is given in the following lemma.
Lemma 2.4. Let f : Rn → R be continuously differentiable, x0 ∈ Rn , and let the level set
L(x0 ) be convex and f be uniformly convex on L(x0 ). Further suppose that x∗ ∈ Rn is the
unique global minimizer of f . Then there exists µ > 0 with
µ‖x − x∗ ‖² ≤ f (x) − f (x∗ ) for all x ∈ L(x0 ).
Proof. The statement follows immediately from Lemma 2.2 (c) and ∇f (x∗ ) = 0. 

The central result of this section is given in Theorem 2.6. It proves that the necessary
condition ∇f (x∗ ) = 0 is also sufficient for x∗ being a global minimizer of the convex function
f.
Theorem 2.6. Let f : Rn → R be a continuously differentiable and convex function, and let
x∗ ∈ Rn be a stationary point of f . Then x∗ is a global minimizer of f in Rn .

Proof. Lemma 2.2 (a) yields

f (x) − f (x∗ ) ≥ ∇f (x∗ )> (x − x∗ ) = 0 ∀x ∈ Rn ,

where we have used ∇f (x∗ ) = 0. As an immediate consequence we obtain

f (x∗ ) ≤ f (x) ∀x ∈ Rn ,

i.e., x∗ is a global minimizer of f . 
CHAPTER 3

General descent methods and step size strategies

In general, only exceptional cases allow the explicit calculation of (local) solutions of the
minimization problem
(3.1) min f (x), x ∈ Rn .
In practice, iterative methods are applied for computing approximate (local) minimizers.
After a convergence analysis, these methods are normally represented in algorithmic form
and implemented on a computer.
For this reason, we now consider descent methods for finding solutions of problem (3.1), in
which f : Rn → R is a continuously differentiable function. The fundamental idea of the
methods in this chapter is as follows:
(i) At a point x ∈ Rn , one chooses a direction d ∈ Rn in which the function value
decreases (descent method).
(ii) Starting at x, one proceeds along this direction d as long as the function value of f
reduces sufficiently (step size strategy).
These steps will be formalized.
Definition 3.1. Let f : Rn → R and x ∈ Rn . The vector d ∈ Rn is called a descent direction
of f at x, if there exists an ᾱ > 0 such that
f (x + αd) < f (x) for all α ∈ (0, ᾱ].
Let us assume that f is continuously differentiable at x ∈ Rn . Then
(3.2) ∇f (x)> d < 0
is sufficient for d ∈ Rn to be a descent direction of f at x. To see this, we define
ϕ(α) := f (x + αd). The continuous differentiability of f implies
(3.3) ϕ(α) = ϕ(0) + αϕ0 (0) + r(α),
where r(α)/α → 0 for α → 0+ . We have
ϕ(0) = f (x) and ϕ0 (0) = ∇f (x)> d.
Rearranging (3.3) and dividing by α yields

[ϕ(α) − ϕ(0)]/α = ∇f (x)> d + r(α)/α.
Since r(α)/α → 0 for α → 0+ and ∇f (x)> d < 0 (by assumption (3.2)), the existence of an
ᾱ > 0 from Definition 3.1 is proven. Thus, d is a descent direction of f in x.
Remark 3.1. (1) Condition (3.2) indicates that the angle between d and the negative gradient of f at x is less than 90°.

(2) The criterion (3.2) is not necessary for d to be a descent direction of f at x. Consider,
for instance, the case where x is a strict local maximizer. Then all directions d ∈ Rn
would be descent directions of f in x, but (3.2) does not hold.
Examples of possible descent directions d = d(x) are:
d = −∇f (x) (direction of steepest descent),
d = −M ∇f (x) with M ∈ Sn positive definite (gradient-related descent direction).
1. Globally convergent descent methods
Now let us consider a general descent method. For the time being, we specify neither the exact choice of the descent direction nor the conditions on the step size along this direction. Below, in Theorem 3.1 we introduce abstract conditions which ensure the convergence (with its meaning still to be made precise) of the subsequent algorithm. In the following paragraphs we then specify methods to determine an appropriate step size, and furthermore we address the choice of the descent direction.

Algorithm 3.1 (General descent method).

input: f : Rn → R, starting point x0 ∈ Rn .

begin
  k := 0
  while convergence criterion is not fulfilled do
  begin
    specify a descent direction dk of f at xk .
    determine a step size αk > 0 with f (xk + αk dk ) < f (xk ).
    set xk+1 := xk + αk dk , k := k + 1.
  end
end
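Algorithm 3.1 translates almost literally into code. The sketch below fills the two unspecified slots with concrete and deliberately simple choices of our own — steepest descent for dk and a halving line search that merely enforces f (xk + αk dk ) < f (xk ); the step size strategies of Section 2 are more refined:

```python
import numpy as np

def descent_method(f, grad, x0, tol=1e-8, max_iter=10_000):
    """General descent method (Algorithm 3.1), as a sketch.

    Concrete choices (assumptions, not fixed by the algorithm itself):
    d_k = -grad f(x_k) (steepest descent) and a simple halving
    line search enforcing f(x_k + a d_k) < f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:     # convergence criterion
            break
        d = -g                          # descent direction
        a = 1.0
        while f(x + a * d) >= f(x):     # step size with simple decrease
            a *= 0.5
            if a < 1e-16:               # safeguard against stalling
                return x
        x = x + a * d
    return x
```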
For the moment, we do not discuss appropriate stopping rules in our theoretical considerations on convergence. Moreover, we assume that an infinite sequence {xk } is generated. But, of course, a practical implementation does not work without a numerically meaningful stopping criterion, and, in fact, we are going to deal with this issue in Section 3 of this chapter.
Note that the calculation of the new iterate is often called line search, as one searches for the new iterate along the search direction (line). Algorithms aiming at the determination of a suitable step size αk along the search direction are called line search algorithms or step size algorithms.
The global convergence of Algorithm 3.1 is the topic of the following theorem. Here global
convergence refers to the fact that the algorithm converges for an arbitrarily chosen initial
value x0 ∈ Rn . In this sense, global convergence must not be confused with the convergence
of the sequence {xk } (or a subsequence) to a global minimizer of f !
Theorem 3.1. Let f : Rn → R be continuously differentiable and {xk } a sequence generated by Algorithm 3.1. Assume there exist constants Θ1 > 0 and Θ2 > 0 (independent of {xk } and {dk }) such that
(a)
−∇f (xk )> dk ≥ Θ1 ‖∇f (xk )‖ ‖dk ‖ for all k ∈ N (angle condition);

(b)
f (xk + αk dk ) ≤ f (xk ) − Θ2 (∇f (xk )> dk / ‖dk ‖)²  (sufficient decrease)
with αk > 0 for all k ∈ N.
Then every accumulation point of the sequence {xk } is a stationary point of f .
Proof. According to the assumption, every step size αk fulfills the sufficient decrease condition (b). Using (a) in (b) results in

(3.4)  f (xk+1 ) ≤ f (xk ) − Θ1² Θ2 ‖∇f (xk )‖².

The relation (3.4) ensures that the sequence of function values {f (xk )} decreases monotonically. Let x∗ be an accumulation point of the sequence {xk }. Due to the continuity of f , {f (xk )} converges to f (x∗ ) along a subsequence. The monotonicity ensures that {f (xk )} itself converges to f (x∗ ); in particular it holds true:

f (xk+1 ) − f (xk ) → 0 for k → ∞.

The inequality (3.4) implies that

‖∇f (xk )‖ → 0 for k → ∞.

Thus, every accumulation point x∗ of {xk } is a stationary point of f . 
Remark 3.2. If ηk denotes the angle between dk and −∇f (xk ), then part (a) of Theorem 3.1 means that

cos(ηk ) = −∇f (xk )> dk / (‖∇f (xk )‖ ‖dk ‖)

is bounded away from 0; or, in other words, the angle is uniformly smaller than 90°. A famous example of a descent direction fulfilling the angle condition from Theorem 3.1 is dk = −∇f (xk ), the direction of steepest descent of f at xk .
Additionally assuming that f is uniformly convex on the convex level set L(x0 ), it is possible to substitute the angle condition of Theorem 3.1 by the weaker Zoutendijk condition, i.e.

Σ_{k=0}^{∞} δk = ∞ with δk = (∇f (xk )> dk / (‖∇f (xk )‖ ‖dk ‖))².

The Zoutendijk condition ensures that the angle between dk and −∇f (xk ) tends sufficiently slowly to 90°.

2. Step size strategies and algorithms


The general descent method (Algorithm 3.1) offers quite some freedom in the choice of the
descent direction dk and the step size αk > 0. The obvious minimization rule, i.e. αk := αk^min
with
f (xk + αk^min dk ) = min_{α>0} f (xk + αdk ),
is well-defined, provided that L(x0 ) is compact and ∇f Lipschitz-continuous on L(x0 ). Cer-
tainly, this rule is in general impracticable due to the tremendous effort necessary (at every
iteration k there is one exact (!) univariate minimization required). Fortunately, we can
abandon the exact univariate minimization without endangering the convergence of the descent
method. In the following, we consider three important representatives of practicable step size
strategies, all of which settle for an approximate minimization of f (xk + αdk ) w.r.t. α > 0.

2.1. Armijo rule. The subsequent strategy does not fit directly into the framework of
Theorem 3.1, because the sufficient decrease condition may be violated. However, it allows to
point out essential aspects of an approximate minimization of f (xk + αdk ). In addition, the
Armijo rule is an important element of alternative step size strategies, fulfilling the sufficient
decrease condition.
In order to keep the subsequent exposition simple, we consider gradient-related descent
directions of the form
d = d(x) = −M ∇f (x), M ∈ S n positive definite.
It has to be mentioned that the analysis can be done in a more general way, i.e. for descent
directions in the sense of Definition 3.1.
Let σ ∈ (0, 1) be fixed. The Armijo rule is a condition which ensures a sufficient descent in
the following sense:
(3.5) f (x + αd) ≤ f (x) + σα∇f (x)> d
This requirement can be interpreted as a restriction on the step size α. The meaning of (3.5)
can be illustrated: The solid line in Figure 1 represents the graph of f (x + αd) for α ≥ 0.

Figure 1. Illustration of the Armijo rule

The dashed line represents the half-line f (x) + σα∇f (x)> d. In our example, the condition
(3.5) is fulfilled for α ∈ [0, a] ∪ [b, c]. Note that due to the requirement ∇f (x)> d < 0 and the
(Lipschitz-)continuous differentiability of f , the existence of a > 0 is ensured. The fact that
even α ∈ [b, c] fulfills the Armijo condition has occurred accidentally in our example.
For the actual calculation of α, one checks (3.5) sequentially for e.g.
(3.6) α = β^l , l = 0, 1, 2, . . . ,
where β ∈ (0, 1) is fixed. One begins with α(0) = β^0 = 1 and stops the test if (3.5) holds true;
otherwise l is incremented (resp. the α-value is decreased) and (3.5) is checked once
again.

In the following algorithm, (3.6) is generalized: If α(l) does not fulfill the Armijo condition
(3.5), then α(l+1) is chosen such that
(3.7) α(l+1) ∈ [ν α(l) , ν̄ α(l) ]
with 0 < ν ≤ ν̄ < 1.

Algorithm 3.2 (Armijo step size strategy).

input: descent direction d.


begin
l := 0
α(0) := 1
while (3.5) is not fulfilled do
begin
determine α(l+1) ∈ [ν α(l) , ν̄ α(l) ].
set l := l + 1.
end
αk := α(l)
end
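With the simple backtracking choice ν = ν̄ = β in (3.7), Algorithm 3.2 can be sketched in a few lines. The following code is an illustrative sketch only; the function names and the parameter defaults (σ = 10⁻⁴, β = 0.5) are our assumptions, not taken from the text.

```python
import numpy as np

def armijo_step(f, grad_f, x, d, sigma=1e-4, beta=0.5, max_iter=50):
    """Backtracking Armijo rule (Algorithm 3.2 with nu = nu_bar = beta):
    returns alpha with f(x + alpha*d) <= f(x) + sigma*alpha*grad_f(x)^T d."""
    fx = f(x)
    slope = grad_f(x) @ d            # directional derivative, must be < 0
    assert slope < 0, "d is not a descent direction"
    alpha = 1.0                      # alpha^(0) = 1
    for _ in range(max_iter):
        if f(x + alpha * d) <= fx + sigma * alpha * slope:
            return alpha             # sufficient decrease (3.5) holds
        alpha *= beta                # alpha^(l+1) = beta * alpha^(l)
    raise RuntimeError("Armijo rule: no acceptable step size found")

# Example: steepest descent direction for f(x) = x^T x at x = (1, -2)
f = lambda x: x @ x
grad = lambda x: 2 * x
x = np.array([1.0, -2.0])
alpha = armijo_step(f, grad, x, -grad(x))
print(alpha)   # -> 0.5
```

For this quadratic example the full step α = 1 overshoots, so exactly one reduction is performed.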
First of all we analyze Algorithm 3.2 and then Algorithm 3.1 with the step size determination
according to Algorithm 3.2.
Lemma 3.1. Let f : Rn → R be continuously differentiable with Lipschitz-continuous gradient
∇f , where L denotes the Lipschitz constant. Let σ ∈ (0, 1), x ∈ Rn and M ∈ S n be positive
definite. Furthermore, let λs and λg with λs ≤ λg be the smallest and the largest eigenvalue
of M −1 , respectively.
If ∇f (x) ≠ 0, then (3.5) is fulfilled for all α with
(3.8) 0 < α ≤ 2λs (1 − σ)/(Lκ(M −1 )),
with κ(M −1 ) = λg /λs the (spectral) condition number of M −1 .
Proof. It holds
f (x + αd) − f (x) = ∫₀¹ ∇f (x + τ αd)> d dτ.
We infer that
f (x + αd) = f (x) + α∇f (x)> d + α ∫₀¹ (∇f (x + τ αd) − ∇f (x))> d dτ.
By the Lipschitz-continuity of ∇f , this implies
(3.9) f (x + αd) ≤ f (x) + α∇f (x)> d + (Lα^2 /2) kdk^2 .
The positive definiteness of M implies λg^{-1} kzk^2 ≤ z> M z ≤ λs^{-1} kzk^2 for all z ∈ Rn . Now,
kdk^2 = kM ∇f (x)k^2 = ∇f (x)> M^2 ∇f (x) ≤ λs^{-2} k∇f (x)k^2
≤ λg λs^{-2} ∇f (x)> M ∇f (x) = −λg λs^{-2} ∇f (x)> d
= −κ(M −1 ) λs^{-1} ∇f (x)> d.

Using inequality (3.9), we obtain
f (x + αd) ≤ f (x) + α (1 − (Lα/2) κ(M −1 ) λs^{-1}) ∇f (x)> d.
This implies that (3.5) is fulfilled if
σ ≤ 1 − (Lα/2) κ(M −1 ) λs^{-1} ,
which is equivalent to
α ≤ 2λs (1 − σ)/(Lκ(M −1 )). □
Proving merely the finite termination of the Armijo step size strategy does not require
Lipschitz-continuity of ∇f . Without this additional assumption, however, no bound
like (3.8) can be expected. We will come back to this point later on.
Assuming that f has the same properties as in Lemma 3.1, we prove that
(3.10) αk ≥ α > 0 for all k ∈ N.
In order to allow for a more general choice of gradient-related directions, let us assume that
{M k } is a sequence of symmetric positive definite matrices and that
dk = −M k ∇f (xk ).
Lemma 3.2. Let f : Rn → R be continuously differentiable with Lipschitz-continuous gradient
∇f , where L denotes the Lipschitz constant. Let {xk } be the iteration sequence generated
by Algorithm 3.1 with a step size choice according to Algorithm 3.2. Further let {M k } be a
sequence of symmetric positive definite matrices such that there exist 0 < λ ≤ λ̄ < +∞ with
λ ≤ λs^(k) ≤ λg^(k) ≤ λ̄ for all k ∈ N,
where λs^(k) and λg^(k) denote the smallest resp. the largest eigenvalue of (M k )−1 .
Then the step size αk fulfills the inequality
(3.11) αk ≥ α := 2ν λ(1 − σ)/(Lκ̄) for all k ∈ N
with κ̄ = λ̄/λ. Furthermore, in every iteration of Algorithm 3.1 there will be at most
(3.12) m ≤ log( 2λ(1 − σ)/(Lκ̄) ) / log(ν̄), m ∈ N,
step size reductions necessary.
Proof. Lemma 3.1 proves that Algorithm 3.2 terminates as soon as
α ≤ 2λs^(k) (1 − σ)/(Lκ((M k )−1 ))
or even before that. Given that λs^(k) ≥ λ > 0 and κ((M k )−1 ) = λg^(k) /λs^(k) ≤ λ̄/λ = κ̄ hold true,
we have
2λs^(k) (1 − σ)/(Lκ((M k )−1 )) ≥ 2λ(1 − σ)/(Lκ̄) > 0 ∀k ∈ N.
As a result of the step size strategy in Algorithm 3.2, the actual step size cannot be smaller
than 2λ(1 − σ)/(Lκ̄) multiplied by the factor ν. This proves (3.11).
Since Algorithm 3.2 chooses α(0) = 1 and α(l+1) ≤ ν̄α(l) , αk will be found after at most m
reductions, where m ∈ N fulfills the following relation:
ν̄^m < 2λ(1 − σ)/(Lκ̄) ≤ 2λs^(k) (1 − σ)/(Lκ((M k )−1 )).
A simple calculation leads to (3.12). 
Abandoning the assumption that ∇f is Lipschitz-continuous, there might be subsequences
{αk(l) } such that αk(l) → 0. Then the statement of Lemma 3.2 no longer holds true.
In this case, let us assume that xk(l) → x∗ with ∇f (x∗ ) ≠ 0. Further we assume that
αk(l) = β^{jk(l)} . Then the following holds true:
( f (xk(l) + β^{jk(l)−1} dk(l) ) − f (xk(l) ) ) / β^{jk(l)−1} > σ∇f (xk(l) )> dk(l) ,
and after a transition to a further subsequence, we can observe that
0 ≥ ∇f (x∗ )> d∗ ≥ σ∇f (x∗ )> d∗ ,
which is a contradiction, as σ ∈ (0, 1) was assumed. This shows the convergence of Algorithm 3.1
with a step size choice according to Algorithm 3.2 even under assumptions weaker than those
of Lemma 3.2; however, upper resp. lower bounds on {αk } are no longer available.
Theorem 3.2. Let f : Rn → R be continuously differentiable and let {M k } fulfill the assump-
tion of Lemma 3.2. Then either {f (xk )} is unbounded from below or
(3.13) lim_{k→∞} ∇f (xk ) = 0,

and thus every accumulation point of {xk } is a stationary point of f in Rn . In particular, it
holds true that if {f (xk )} is bounded from below and lim_{l→∞} xk(l) = x∗ , then ∇f (x∗ ) = 0.
Remark 3.3. (1) In general there is no guarantee for the existence of a unique accumu-
lation point.
(2) The following variation of the Armijo rule can be analyzed with Lemma 3.1, Lemma 3.2
and Theorem 3.2: Let r > 0 be a scaling factor. Determine
(3.14) α = max{rβ^l : l = 0, 1, 2, . . .},
such that (3.5) is fulfilled.
(3) The determination of the step size α according to (3.6) or (3.14) is called backtracking
and fulfills (3.7).
(4) In several cases the restriction of the Armijo rule by backtracking, i.e. only permitting
a reduction of the step size after the initial choice of α(0) , is a drawback. In contrast,
the following strategy is more flexible: In addition to the Armijo rule (3.5) one tests
the subsequent condition
(3.15) f (x + αd) ≥ f (x) + µα∇f (x)> d, with 0 < σ < 1/2 < µ < 1.
The step size strategy (3.5)+(3.15) is called Armijo-Goldstein rule. The Armijo
condition (3.5) implies that α shall not be too large, whereas the Goldstein condition
(3.15) requires that α shall not be too small. As illustrated in Figure 2, the step sizes
in [a, b] ∪ [c, d] fulfill the Armijo-Goldstein rule; compare to Figure 1.

Figure 2. Illustration of the Armijo-Goldstein rule


A typical approach to find α fulfilling the conditions (3.5) and (3.15) simultaneously
is as follows: One sets α1^(0) := 0, α2^(0) := +∞ and α(0) > 0.¹ In the l-th iteration
we have an α(l) and an interval [α1^(l) , α2^(l) ] with the properties that α1^(l) < α2^(l) and
α(l) ∈ [α1^(l) , α2^(l) ]. If (3.5) is violated for α = α(l) , then one sets α2^(l+1) := α(l) and
α1^(l+1) := α1^(l) (reduction of the step size). If (3.15) is violated for α = α(l) , then one sets
α1^(l+1) := α(l) and α2^(l+1) := α2^(l) (enlargement of the step size). The new trial step size
α(l+1) will be chosen s.t.
α(l+1) ∈ [α1^(l+1) + τ ∆(l+1) , α2^(l+1) − τ ∆(l+1) ],
where ∆(l+1) = α2^(l+1) − α1^(l+1) denotes the interval length and 0 < τ ≪ 1 is fixed
(bisection!). It is obvious that this approach only makes sense if α2^(l+1) < +∞ holds
true. If that is not (yet) the case, one chooses α(l+1) ≥ ξ max(α1^(l) , ε), where ξ > 1
and ε > 0 are fixed. From a numerical point of view, one not only has to make sure that
the step size algorithm terminates when meeting the Armijo-Goldstein conditions (3.5)
and (3.15), but also when ∆(l+1) is relatively small.
Finally we note that (3.15) together with (3.5) and backtracking does not make
sense in general.

Now we discuss another step size choice based on polynomial models of the function ϕ(α).
Our discussion is based on quadratic models. However, models of higher order are often
applied in practice.
Armijo step size algorithm based on polynomial models. Apart from the simple
backtracking method, there are strategies which apply polynomial models of ϕ(α) (= f (x + αd)).

¹Superscripts denote the index in iterative methods for determining α resp. αk in the k-th iteration of the
minimization algorithm for f .

In every iteration of Algorithm 3.2 we have the following data at hand:

ϕ(0) = f (x), ϕ(α(l) ) = f (x + α(l) d), ϕ0 (0) = ∇f (x)> d < 0.


With the help of this data, we create a quadratic model of ϕ(α). The ansatz
q(α) = a + bα + cα2 , a, b, c ∈ R,
with the conditions
q(0) = ϕ(0), q(α(l) ) = ϕ(α(l) ), q 0 (0) = ϕ0 (0)
leads to
(3.16) q(α) = ϕ(0) + ϕ0 (0)α + (1/(α(l) )^2 )(ϕ(α(l) ) − ϕ(0) − ϕ0 (0)α(l) ) α^2 .
Since the quadratic model (3.16) is only computed when the Armijo rule (3.5) is violated for
α = α(l) , it follows that

ϕ(α(l) ) − ϕ(0) − ϕ0 (0)α(l) > ϕ(α(l) ) − ϕ(0) − σϕ0 (0)α(l) > 0.


Thus, a global minimizer α̂min of q(α) exists:
α̂min = −ϕ0 (0)(α(l) )^2 / ( 2(ϕ(α(l) ) − ϕ(0) − ϕ0 (0)α(l) ) ) > 0.

Now, we can choose the next trial step size α(l+1) according to
(3.17) α(l+1) := ν α(l) , if α̂min < ν α(l) ;
       α(l+1) := α̂min , if ν α(l) ≤ α̂min ≤ ν̄ α(l) ;
       α(l+1) := ν̄ α(l) , if α̂min > ν̄ α(l) .
This choice ensures the required property α(l+1) ∈ [ν α(l) , ν̄ α(l) ]. In addition, (3.17) ensures that
ν^{l+1} α(0) ≤ ν α(l) ≤ α(l+1) ≤ ν̄ α(l) ≤ ν̄^{l+1} α(0) for all l ∈ N,
where lim_{l→∞} ν^l = lim_{l→∞} ν̄^l = 0 because of 0 < ν ≤ ν̄ < 1.
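The safeguarded quadratic-model update (3.16)-(3.17) condenses to a one-step formula. The sketch below is illustrative; the function name and default safeguard values ν = 0.1, ν̄ = 0.9 are our assumptions.

```python
def next_trial_step(phi0, dphi0, alpha_l, phi_alpha, nu_lo=0.1, nu_hi=0.9):
    """Safeguarded quadratic-model update in the spirit of (3.16)-(3.17):
    given phi(0), phi'(0) < 0, a rejected trial alpha^(l) and phi(alpha^(l)),
    return the next trial step alpha^(l+1)."""
    # curvature coefficient of q(alpha); positive since the Armijo test failed
    c = (phi_alpha - phi0 - dphi0 * alpha_l) / alpha_l**2
    alpha_min = -dphi0 / (2.0 * c)          # global minimizer of q
    # safeguard: clip into [nu_lo * alpha^(l), nu_hi * alpha^(l)]
    return min(max(alpha_min, nu_lo * alpha_l), nu_hi * alpha_l)

# phi(a) = (1 - a)^2 with rejected trial alpha^(l) = 4 (phi(4) = 9):
alpha_next = next_trial_step(phi0=1.0, dphi0=-2.0, alpha_l=4.0, phi_alpha=9.0)
print(alpha_next)   # -> 1.0 (the exact minimizer of phi)
```

Because ϕ itself is quadratic in this example, the model minimizer coincides with the exact univariate minimizer.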
Remark 3.4. (1) Cubic polynomial models for ϕ(α) can be obtained by using

(3.18) ϕ(0), ϕ(α(l) ), ϕ0 (0), ϕ0 (α(l) )


or
(3.19) ϕ(0), ϕ(α(l) ), ϕ0 (0), ϕ(α(l−1) ) for l ≥ 1.
If the determination of the derivatives of f is expensive, one prefers (3.19).
(2) As we have already mentioned above, in the case of the Armijo-Goldstein step size
strategy, i.e. (3.5) and (3.15) together with the bisection idea of Remark 3.3 (4), α2^(l) = +∞
is possible. Methods based on polynomial models have to take this into account
when determining coefficients and choosing the step size.

2.2. Wolfe-Powell rule. Let σ ∈ (0, 1/2) and ρ ∈ [σ, 1) be fixed. The Wolfe-Powell
conditions are: For x, d ∈ Rn with ∇f (x)> d < 0 determine a step size α > 0 such that

(3.20) f (x + αd) ≤ f (x) + σα∇f (x)> d,


(3.21) ∇f (x + αd)> d ≥ ρ∇f (x)> d.

Just like for the Armijo-Goldstein rule, the choice of σ < 1/2 enables us to accept the exact
minimizer of a quadratic function as a Wolfe-Powell step size. The condition (3.21) ideally
implies that the graph of ϕ(α) = f (x + αd) at α > 0 does not descend as "steeply" as at
α = 0. This claim is motivated by the fact that

ϕ0 (α̂) = ∇f (x + α̂d)> d = 0

is satisfied at a (local) minimizer α̂ of ϕ. Similar to condition (3.15) of the Armijo-Goldstein


line search, (3.21) prevents α from getting too small. Figure 3 illustrates the conditions (3.20)
and (3.21). The first condition yields - as before - a restriction on the extent of the step size.
The second condition (3.21) ensures that points with ∇f (x + αd)> d = 0 are always located
in the acceptable step size set ([a, b] ∪ [c, d] in our example).

Figure 3. Illustration of the Wolfe-Powell rule

Now we want to prove that for given x and d, the set of Wolfe-Powell step sizes is non-empty
and that for step sizes which fulfill (3.20) and (3.21), the sufficient decrease condition of
Theorem 3.1 is satisfied.

Theorem 3.3. Let f : Rn → R be continuously differentiable, σ ∈ (0, 1/2), ρ ∈ [σ, 1) and
x0 ∈ Rn be fixed. For x ∈ L(x0 ) and d ∈ Rn with ∇f (x)> d < 0 let

SWP (x, d) := {α > 0 : (3.20) and (3.21) are fulfilled}

be the set of Wolfe-Powell step sizes at x in direction d. Then the following statements hold
true:
(a) If f is bounded from below, then SWP (x, d) 6= ∅, i.e. the Wolfe-Powell step size
strategy is well-defined.

(b) If, in addition, ∇f is Lipschitz-continuous on L(x0 ), then there is a constant Θ > 0
(independent of x and d) with
f (x + αd) ≤ f (x) − Θ (∇f (x)> d / kdk)^2 for all α ∈ SWP (x, d).
Proof. Define ψ(α) := f (x) + σα∇f (x)> d. We have to show that there exists a step
size α > 0, such that
ϕ(α) ≤ ψ(α) and ϕ0 (α) ≥ ρϕ0 (0).
Note that ψ(0) = ϕ(0) and
ψ 0 (α) = σ ∇f (x)> d > ∇f (x)> d
hold true. Therefore the graph of ϕ is located below the graph of ψ for sufficiently small
α > 0. Let α∗ be the smallest step size α > 0 with ϕ(α) = ψ(α). The existence of α∗ is
guaranteed, since f is assumed to be bounded from below and ψ(α) → −∞ holds true for
α → ∞. Obviously,
ϕ0 (α∗ ) ≥ ψ 0 (α∗ )
holds true. We distinguish between two cases:
(1) ϕ0 (α∗ ) < 0 holds true. The relation between the derivatives of ϕ and ψ in α = α∗
yields
−ϕ0 (α∗ ) ≤ −ψ 0 (α∗ ) = −σ ∇f (x)> d = −σ ϕ0 (0) ≤ −ρ ϕ0 (0),
since 0 < σ ≤ ρ. Due to ϕ(α∗ ) = ψ(α∗ ), we have α∗ ∈ SWP (x, d).
(2) Now let ϕ0 (α∗ ) ≥ 0. Since ϕ0 (0) < 0 there exists α∗∗ ∈ (0, α∗ ] with ϕ0 (α∗∗ ) = 0 (by
continuity). Since α∗∗ ≤ α∗ holds true, the condition ϕ(α) ≤ ψ(α) is fulfilled for
α = α∗∗ . By definition of α∗∗ , ϕ0 (α∗∗ ) = 0, thus
0 = ϕ0 (α∗∗ ) ≥ ρ ϕ0 (0).
Consequently α = α∗∗ ∈ SWP (x, d).
This proves assertion (a).
Next we establish (b). Let α ∈ SWP (x, d). Then f (x+αd) ≤ f (x) holds true and in particular
x + αd ∈ L(x0 ). From the Wolfe-Powell rule, it follows
(ρ − 1)∇f (x)> d ≤ (∇f (x + αd) − ∇f (x))> d.
Moreover,
(ρ − 1)∇f (x)> d ≤ k∇f (x + αd) − ∇f (x)k kdk ≤ Lαkdk2 ,
where L > 0 denotes the Lipschitz-constant of ∇f on L(x0 ). We obtain
α ≥ (ρ − 1)∇f (x)> d / (Lkdk^2 )
for the Wolfe-Powell step size and consequently
f (x + αd) ≤ f (x) + σα∇f (x)> d ≤ f (x) − ((1 − ρ)σ/L) (∇f (x)> d / kdk)^2 . □


If the descent directions dk in the general descent method, cf. Algorithm 3.1, are chosen such
that the angle condition of Theorem 3.1 is fulfilled, then we can easily infer from Theorem 3.3
(and Theorem 3.1) that every accumulation point of the sequence {xk } is a stationary point.
It remains to examine the numerical realization of the Wolfe-Powell-rule.
Before specifying the corresponding step size algorithm, we consider the following lemma
which is going to be used for the determination of an appropriate starting point for the
numerical determination of a Wolfe-Powell step size in the subsequent algorithm.
Lemma 3.3. Let σ < ρ (cf. Theorem 3.3), ϕ0 (0) < 0 and Φ(α) := ϕ(α) − ϕ(0) − σαϕ0 (0). If
[a, b] denotes an interval with the properties
(3.22) Φ(a) ≤ 0, Φ(b) ≥ 0, Φ0 (a) < 0,
then [a, b] contains a point ᾱ with
Φ(ᾱ) < 0, Φ0 (ᾱ) = 0.
ᾱ is an interior point of an interval I such that for all α ∈ I there holds:
Φ(α) ≤ 0 and ϕ0 (α) ≥ ρϕ0 (0),
i.e. I ⊂ SWP (x, d).
Proof. According to the assumption, Φ(a) ≤ 0, Φ0 (a) < 0 and Φ(b) ≥ 0 hold true.
Therefore there is at least one point ξ ∈ (a, b) with Φ0 (ξ) ≥ ε for some sufficiently small ε > 0.
If there was no such point, it would hold that Φ0 (α) ≤ 0 for all α ∈ [a, b]. Since Φ0 (a) < 0
was assumed, then from the continuous differentiability of Φ we would get that Φ(b) < 0, a
contradiction. Now let ξ̂ be the smallest element ξ in (a, b) satisfying Φ0 (ξ) ≥ ε. Given that
Φ0 is continuous, there exists a ξ0 ∈ (a, ξ̂) with Φ0 (ξ0 ) = 0 by Bolzano's Root Theorem. Let
ξ0 be the smallest element having this property. Then Φ(ξ0 ) < 0 holds true. If that was
not the case, there would be a ξ1 ∈ (a, ξ0 ) with Φ(a) = Φ(ξ1 ) ≤ 0 due to the continuity of
Φ. Rolle's Theorem however ensures the existence of a ξ0′ ∈ (a, ξ1 ) with Φ0 (ξ0′ ) = 0 which
contradicts the choice of ξ0 . To conclude the first part of the proof, we set ᾱ = ξ0 .
For the proof of the second part, note that
Φ0 (α) = ϕ0 (α) − σ ϕ0 (0) = ∇f (x + αd)> d − σ ∇f (x)> d
       = ∇f (x + αd)> d − ρ ∇f (x)> d + (ρ − σ)ϕ0 (0)
holds true. This implies
(3.23) ∇f (x + ᾱd)> d > ∇f (x + ᾱd)> d + (ρ − σ)ϕ0 (0) = ρ ∇f (x)> d.
As Φ(ᾱ) < 0, there exists a neighborhood [ᾱ − r0 , ᾱ + r0 ], r0 > 0, s.t. Φ(α) ≤ 0 holds
true for all α ∈ [ᾱ − r0 , ᾱ + r0 ]. Due to the continuity of Φ0 , ρ > σ and ϕ0 (0) < 0, for
0 < ε ≤ (1/2)(σ − ρ)ϕ0 (0) there exists rε > 0 such that
ϕ0 (α) = ∇f (x + αd)> d > ∇f (x + ᾱd)> d − ε ≥ ρ ∇f (x)> d = ρ ϕ0 (0)
for all α ∈ [ᾱ − rε , ᾱ + rε ]. Choosing r = min(r0 , rε ) > 0, it follows that
Φ(α) ≤ 0 and ϕ0 (α) ≥ ρϕ0 (0)
for all α ∈ I := [ᾱ − r, ᾱ + r] ⊂ SWP (x, d). □
Lemma 3.3 is crucial for the following algorithm.

Algorithm 3.3 (Wolfe-Powell step size algorithm).

input: descent direction d ∈ Rn .


begin
choose α(0) > 0, γ > 1, i := 0
(A.1) if Φ(α(i) ) ≥ 0 then
begin
a := 0, b := α(i)
goto (B.0)
end
else
if ϕ0 (α(i) ) ≥ ρϕ0 (0) then α := α(i) , RETURN 1 end
if ϕ0 (α(i) ) < ρϕ0 (0) then
begin
α(i+1) := γα(i) , i := i + 1
goto (A.1)
end
end
end
(B.0) choose τ1 , τ2 ∈ (0, 1/2], j := 0, α1^(0) := a, α2^(0) := b, ∆(0) := α2^(0) − α1^(0)
(B.1) choose α(j) ∈ [α1^(j) + τ1 ∆(j) , α2^(j) − τ2 ∆(j) ]
if Φ(α(j) ) ≥ 0 then
begin
α1^(j+1) := α1^(j) , α2^(j+1) := α(j) , ∆(j+1) := α2^(j+1) − α1^(j+1) , j := j + 1
goto (B.1)
end
else
if ϕ0 (α(j) ) ≥ ρϕ0 (0) then α := α(j) , RETURN 2 end
if ϕ0 (α(j) ) < ρϕ0 (0) then
begin
α1^(j+1) := α(j) , α2^(j+1) := α2^(j) , ∆(j+1) := α2^(j+1) − α1^(j+1) , j := j + 1
goto (B.1)
end
end
end
end
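The two-phase structure of Algorithm 3.3 (phase A: expand until a bracket is found or a step is accepted; phase B: shrink the bracket) can be sketched compactly. This is a simplified illustration, not the author's code: it fixes the simplest admissible choices (midpoint for α(j), i.e. τ1 = τ2 = 1/2), and all names and defaults are assumptions.

```python
def wolfe_powell_step(phi, dphi, sigma=1e-4, rho=0.9, gamma=2.0,
                      alpha0=1.0, max_iter=100):
    """Wolfe-Powell line search in the spirit of Algorithm 3.3.
    phi(a) = f(x + a*d), dphi(a) = grad f(x + a*d)^T d with dphi(0) < 0.
    Returns alpha with
      phi(alpha) <= phi(0) + sigma*alpha*dphi(0)   (3.20)
      dphi(alpha) >= rho*dphi(0)                   (3.21)."""
    phi0, dphi0 = phi(0.0), dphi(0.0)
    Phi = lambda a: phi(a) - phi0 - sigma * a * dphi0
    a, lo, hi = alpha0, 0.0, None
    # Phase A: expand until a bracket [lo, hi] is found or a step is accepted
    for _ in range(max_iter):
        if Phi(a) >= 0:
            lo, hi = 0.0, a        # bracket found, go to phase B
            break
        if dphi(a) >= rho * dphi0:
            return a               # RETURN 1
        a *= gamma                 # enlarge the trial step
    if hi is None:
        raise RuntimeError("phase A failed (f unbounded below?)")
    # Phase B: shrink the bracket (midpoint choice, tau1 = tau2 = 1/2)
    for _ in range(max_iter):
        a = 0.5 * (lo + hi)
        if Phi(a) >= 0:
            hi = a
        elif dphi(a) < rho * dphi0:
            lo = a
        else:
            return a               # RETURN 2
    raise RuntimeError("no Wolfe-Powell step found")

# f(x) = x^2 at x = 1, d = -1: phi(a) = (1-a)^2, dphi(a) = -2*(1-a)
alpha = wolfe_powell_step(lambda a: (1 - a) ** 2, lambda a: -2 * (1 - a))
print(alpha)   # -> 1.0
```

For this quadratic ϕ the first trial α = 1 already satisfies both (3.20) and (3.21), so phase A terminates at "RETURN 1".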

Concerning the choice of α(0) , we remark that from the second iteration of the descent method
Algorithm 3.1 on, i.e. k ≥ 2, α(0) can be chosen as αk−1 . If a lower bound f of the function
values of f on Rn is known, we can infer immediately
0 ≥ Φ(α) = ϕ(α) − ϕ(0) − σα∇f (x)> d ≥ f − ϕ(0) − σαϕ0 (0)
from Φ(α) ≤ 0. Rearranging yields
α ≤ (f − ϕ(0))/(σϕ0 (0)) =: ᾱ.

Thus it is reasonable to choose α(0) ∈ (0, ᾱ] in this case. With regard to Newton or quasi-
Newton methods, the step size α = 1 is of particular importance. Hence it is advisable
to choose α(0) = min{1, ᾱ} for these methods.
Now we demonstrate that Algorithm 3.3 terminates successfully after finitely many steps,
under certain conditions.
Theorem 3.4. Let f : Rn → R be continuously differentiable and bounded from below. Fur-
thermore let σ ∈ (0, 1/2) and ρ ∈ (1/2, 1) be fixed. Then Algorithm 3.3 terminates after finitely
many steps at “RETURN 1” or “RETURN 2” with α ∈ SWP (x, d).
Proof. First we consider the case where Algorithm 3.3 terminates at ”RETURN 1”.
Then it is obvious that the Wolfe-Powell conditions are fulfilled.
In the next step we show that the loop in the first part of the algorithm (switch to (A.1)) is
finite, i.e. after finitely many steps we continue with (B.0) or we terminate with ”RETURN
1”. Let us assume that the loop was not finite, resp. the algorithm would return infinitely
often to the position (A.1). In this case it holds α(i) = γ^i α(0) and Φ(α(i) ) < 0 for all i ∈ N.
But the last inequality implies
f (x + α(i) d) < f (x) + σ α(i) ∇f (x)> d ∀i ∈ N.
By assumption, γ > 1 and ϕ0 (0) < 0. Hence we would obtain f (x + α(i) d) ↓ −∞, which con-
tradicts the boundedness of f from below. Hence the loop (switch to (A.1)) has to terminate
with ”RETURN 1” in the first part of the algorithm after finitely many steps or it has to
jump to position (B.0).
Now, assume one has reached (B.0). In that case, the interval [a, b] has the property of
Lemma 3.3 and ϕ0 (a) < ρϕ0 (0). If the algorithm terminates with ”RETURN 2”, then the
Wolfe-Powell conditions are satisfied. It remains to prove that ”RETURN 2” can be reached
after finitely many trials (switch to (B.1)). First we prove by induction that the interval
[α1^(j) , α2^(j) ] has, for every j ∈ N, the property (3.22) (with a = α1^(j) and b = α2^(j) ) and fulfills
ϕ0 (α1^(j) ) < ρϕ0 (0).
• j = 0. The assertion follows from the conditions which allow to arrive at (B.0).
• j → j + 1. We assume that [α1^(j) , α2^(j) ] fulfills (3.22) (with a = α1^(j) and b = α2^(j) ) and
ϕ0 (α1^(j) ) < ρϕ0 (0).
If Φ(α(j) ) ≥ 0, then setting α1^(j+1) := α1^(j) , α2^(j+1) := α(j) gives:
Φ(α1^(j+1) ) = Φ(α1^(j) ) ≤ 0,
Φ(α2^(j+1) ) = Φ(α(j) ) ≥ 0,
Φ0 (α1^(j+1) ) = Φ0 (α1^(j) ) < 0.
In case Φ(α(j) ) < 0 and ϕ0 (α(j) ) < ρϕ0 (0) (otherwise the algorithm will terminate
with "RETURN 2"), we have α1^(j+1) = α(j) and α2^(j+1) = α2^(j) and
Φ(α1^(j+1) ) = Φ(α(j) ) < 0,
Φ(α2^(j+1) ) = Φ(α2^(j) ) ≥ 0,
Φ0 (α1^(j+1) ) = Φ0 (α(j) ) = ϕ0 (α(j) ) − σ ϕ0 (0) < 0,
since ρ > σ > 0.

In both cases, [α1^(j) , α2^(j) ] has the desired property.
Finally we show that the loop (switch to (B.1)) is finite. Let us assume that this was not
true. Then the intervals [α1^(j) , α2^(j) ] would "shrink" to a point α∗ . This results from the fact
that
0 < α2^(j+1) − α1^(j+1) ≤ max{1 − τ1 , 1 − τ2 } (α2^(j) − α1^(j) )
and max{1 − τ1 , 1 − τ2 } < 1 hold true. Lemma 3.3 would yield that for each j ∈ N there
exists α̂(j) ∈ (α1^(j) , α2^(j) ), such that
Φ(α̂(j) ) < 0 and Φ0 (α̂(j) ) = 0
would be fulfilled.
would be fulfilled. Because of α̂(j) → α∗ for j → ∞, it follows Φ0 (α∗ ) = 0 and also
(3.24) ϕ0 (α∗ ) = σϕ0 (0) > ρϕ0 (0),
since by assumption 0 < σ < ρ and ϕ0 (0) < 0. On the other hand, ϕ0 (α1^(j) ) < ρϕ0 (0) and the
continuity of ϕ0 would imply ϕ0 (α∗ ) ≤ ρϕ0 (0). This, however, would contradict (3.24). Hence,
the loop (switch to (B.1)) has to terminate after finitely many iterations with ”RETURN
2”. 
The freedom of choice w.r.t. α(j) in (B.1) can again be used to apply quadratic or cubic
polynomial models.

2.3. Strong Wolfe-Powell rule. Let σ ∈ (0, 1/2) and ρ ∈ [σ, 1) be fixed. The Strong
Wolfe-Powell rule requires: For x, d ∈ Rn with ∇f (x)> d < 0 determine a step size α > 0 with
(3.25) f (x + αd) ≤ f (x) + σα∇f (x)> d,
(3.26) |∇f (x + αd)> d| ≤ −ρ∇f (x)> d.
In comparison to the Wolfe-Powell rule, condition (3.26) requires not only that the graph of
ϕ(α)(= f (x + αd)) in α > 0 does not decrease as steeply as in α = 0, but also that the graph
does not increase too steeply. A step size, which fulfills (3.26) for a very small ρ (and thus
also for a very small σ), is near to a stationary point of ϕ(·).
For the set of Strong Wolfe-Powell step sizes in x in direction d, i.e.
SSWP (x, d) := {α > 0 : (3.25) + (3.26) are fulfilled}
an analogue statement to Theorem 3.3 holds true. Furthermore we can prove an analogous
result to Lemma 3.3, in which the third condition in (3.22) has to be modified. The corre-
sponding step size algorithm is structured similarly to Algorithm 3.3.

3. Practical aspects
The algorithms in Section 2 of this chapter are idealized. In numerical practice it has to
be taken into account that the accuracy in the evaluation of functions and derivatives is
machine- and problem-dependent. If these "inaccuracies" are not taken into account in the
conditions of Paragraphs 2.1–2.3, dead loops are likely to occur. In the best case, error
bounds ε(α), ε(0) ≥ 0 and ε̂(α), ε̂(0) ≥ 0 are known for the function values ϕ(α), ϕ(0) and the
derivatives ϕ0 (α), ϕ0 (0). Then it would be possible to modify resp. attenuate the condition
ϕ(α) ≤ ϕ(0) + σαϕ0 (0) in the following way:
ϕ(α) ≤ ϕ(0) + σα(ϕ0 (0) + ε̂(0)) + ε(α) + ε(0).


Further, ϕ0 (α) ≥ ρϕ0 (0) could be implemented in the form
ϕ0 (α) ≥ ρ(ϕ0 (0) − ε̂(0)) − ε̂(α).


In most cases, such error bounds are not available. Then a possible approach consists in the
application of error bounds of the following form:
ε(α) := ε (1 + |ϕ(α)|)
(or also ε(α) := ε|ϕ(α)| for sufficiently large |ϕ(α)|), where ε ≥ εM . The value εM > 0
corresponds to the machine precision (or the relative accuracy in computations). If the
analytic form of the derivative is implemented, then one may choose ε̂(α) = ε̂(0) = 0 (and ε
slightly enlarged).
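An attenuated Armijo test of this kind can be written down directly. The sketch below assumes exact derivatives (ε̂ = 0) and error bounds ε(a) = ε(1 + |ϕ(a)|); the function name and the defaults are ours, not from the text.

```python
def armijo_ok_noisy(phi_alpha, phi0, dphi0, alpha, sigma, eps=1e-12):
    """Armijo test attenuated by error bounds of the form
    eps(a) = eps*(1 + |phi(a)|), assuming exact derivatives (eps_hat = 0);
    a sketch of the attenuated condition discussed above."""
    tol = eps * (1 + abs(phi_alpha)) + eps * (1 + abs(phi0))
    return phi_alpha <= phi0 + sigma * alpha * dphi0 + tol

# A trial value well inside the Armijo region is accepted ...
print(armijo_ok_noisy(0.9, 1.0, -2.0, 0.1, 0.1))   # -> True
# ... while a clear violation is still rejected:
print(armijo_ok_noisy(2.0, 1.0, -2.0, 0.1, 0.1))   # -> False
```

The tolerance only changes decisions near the boundary of the acceptance set, which is exactly where rounding errors would otherwise cause dead loops.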
Furthermore the step size algorithm has to be terminated whenever the interval [α1^(j) , α2^(j) ]
gets "too small", i.e., when α2^(j) − α1^(j) > 0 becomes small. A practicable criterion uses a
tolerance level of the form
∆ := ε (1 + α2^(j) ) for α2^(j) < +∞
(or ∆ := ε α2^(j) ).
Since Theorem 3.4 assumes the boundedness of f from below, a lower bound for ϕ should be
employed in the step size strategy, terminating the algorithm when ϕ drops below this bound.
CHAPTER 4

Rate of convergence

For the realization of a numerical method to solve


min f (x), x ∈ Rn ,
not only the convergence of iterates xk to a solution (or possibly only a stationary point) is
of importance, but also “how fast” this convergence takes place.

1. Q-convergence and R-convergence


First we discuss classical approaches to characterize the rate of convergence.
Definition 4.1. (a) For the sequence {xk } ⊂ Rn with limit x∗ ∈ Rn and p ∈ [1, +∞)
we refer to
Qp {xk } :=  lim sup_{k→∞} kxk+1 − x∗ k / kxk − x∗ k^p ,  if xk ≠ x∗ ∀k ≥ k0 ,
             0,  if xk = x∗ ∀k ≥ k0 ,
             +∞,  otherwise,
as the quotient-convergence factor (Q-factor) of {xk }.


(b) We refer to
OQ {xk } := inf{p ∈ [1, +∞) : Qp {xk } = +∞}
as the Q-convergence order (Q-order) of the sequence {xk }.
We summarize some important properties in the following remark.
Remark 4.1. (1) The Q-factor depends on the applied norm, but the Q-order does not.
(2) There always exists a value p0 ∈ [1, +∞) such that
Qp {xk } = 0 for p ∈ [1, p0 ) and Qp {xk } = +∞ for p ∈ (p0 , +∞).
(3) The Q-orders 1 and 2 are of particular importance. The following notions have
become popular:

Q1 {xk } = 0 : Q-superlinear convergence,
0 < Q1 {xk } < 1 : Q-linear convergence,
Q2 {xk } = 0 : Q-superquadratic convergence,
0 < Q2 {xk } < +∞ : Q-quadratic convergence.
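These notions can be checked numerically on model sequences; the short illustration below assumes nothing beyond Definition 4.1 (the helper name is ours).

```python
def q_factor(seq, x_star, p):
    """Empirical quotient |x_{k+1} - x*| / |x_k - x*|^p for the tail of seq."""
    errs = [abs(x - x_star) for x in seq]
    return errs[-1] / errs[-2] ** p

lin = [0.5 ** k for k in range(10)]      # x_{k+1} = 0.5 * x_k (x* = 0)
quad = [0.5]
for _ in range(4):
    quad.append(quad[-1] ** 2)           # x_{k+1} = x_k^2 (x* = 0)

print(q_factor(lin, 0.0, 1))    # -> 0.5 : Q-linear with factor 1/2
print(q_factor(quad, 0.0, 2))   # -> 1.0 : Q-quadratic with factor 1
```

Both quotients are exact here because all iterates are powers of two.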


The implementation of the criterion
(4.1) kxk − x∗ k ≤ ε
for Q-superlinearly convergent iteration sequences requires the knowledge of x∗ , which, how-
ever, is unrealistic! The following result allows to replace (4.1) by the practical criterion
(4.2) kxk+1 − xk k ≤ ε.
Theorem 4.1. Any sequence {xk } ⊂ Rn with limk xk = x∗ satisfies
|1 − kxk+1 − xk k / kxk − x∗ k| ≤ kxk+1 − x∗ k / kxk − x∗ k for xk ≠ x∗ .
If {xk } converges Q-superlinearly to x∗ and xk ≠ x∗ for k ≥ k0 , then
lim_{k→∞} kxk+1 − xk k / kxk − x∗ k = 1.

Proof. Using the triangle inequality we obtain
kx∗ − xk k − kxk+1 − x∗ k ≤ kxk+1 − xk k ≤ kx∗ − xk k + kxk+1 − x∗ k.
Dividing both sides by kxk − x∗ k, the assertions follow immediately. □
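The practical relevance of Theorem 4.1 is easy to observe numerically; a small check on a model sequence (assuming nothing beyond the theorem's statement):

```python
# For the Q-superlinear model sequence x_{k+1} = x_k**2 (x* = 0), the
# computable quotient |x_{k+1} - x_k| / |x_k - x*| tends to 1:
x = 0.5
ratios = []
for _ in range(6):
    x_next = x ** 2
    ratios.append(abs(x_next - x) / abs(x - 0.0))   # equals 1 - x here
    x = x_next
print(ratios[0], ratios[-1])   # first ratio 0.5, last one very close to 1
```

Hence, for fast sequences, the computable quantity kxk+1 − xk k is an asymptotically exact substitute for the unknown error kxk − x∗ k.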


In addition to Q-convergence, the notion of R-convergence plays a decisive role.
Definition 4.2. (a) For a sequence {xk } ⊂ Rn with limit x∗ ∈ Rn and p ∈ [1, +∞) we
refer to
Rp {xk } :=  lim sup_{k→∞} kxk − x∗ k^{1/k} ,   if p = 1,
             lim sup_{k→∞} kxk − x∗ k^{1/p^k} , if p > 1,
as the root-convergence factor (R-factor) of {xk }.
(b) We refer to
OR {xk } := inf{p ∈ [1, +∞) : Rp {xk } = 1}
as the R-convergence order (R-order) of the sequence {xk }.
Remark 4.2. (1) In contrast to the Q-factor, the R-factor is independent of the applied
norm due to the norm equivalence in Rn . To see this, let k · ka and k · kb be two
norms in Rn , necessarily satisfying c1 kxkb ≤ kxka ≤ c2 kxkb for all x ∈ Rn , where
c1 , c2 are positive constants. Furthermore let {γk }, γk > 0 for all k ∈ N, be a zero
sequence. Then
lim sup_{k→∞} (kxk − x∗ ka )^{γk} ≤ lim_{k→∞} c2^{γk} · lim sup_{k→∞} (kxk − x∗ kb )^{γk}
= lim sup_{k→∞} (kxk − x∗ kb )^{γk} ≤ lim_{k→∞} c1^{−γk} · lim sup_{k→∞} (kxk − x∗ ka )^{γk}
= lim sup_{k→∞} (kxk − x∗ ka )^{γk} ,
since c2^{γk} → 1 and c1^{−γk} → 1 for k → ∞.

(2) There always exists a p0 ∈ [1, +∞) such that
Rp {xk } = 0 for p ∈ [1, p0 ) and Rp {xk } = 1 for p ∈ (p0 , +∞).

(3) Between Q- and R-convergence resp. the Q- and R-factor, the following relations
hold true:
OQ {xk } ≤ OR {xk } and R1 {xk } ≤ Q1 {xk }.
It is often convenient to use the Landau symbols O and o for describing the convergence
behavior.
Definition 4.3. Let f, g : Rn → Rm and x∗ ∈ Rn . We write
(a) f (x) = O(g(x)) for x → x∗ if and only if there exist a uniform constant λ > 0 and a
neighborhood U of x∗ such that for all x ∈ U \ {x∗ } the following relation holds true:
kf (x)k ≤ λkg(x)k.
(b) f (x) = o(g(x)) for x → x∗ if and only if for all ε > 0 there exists a neighborhood U
of x∗ such that for all x ∈ U \ {x∗ } we have
kf (x)k ≤ εkg(x)k.
Remark 4.3. If limk xk = x∗ , then {xk } converges to x∗ (at least)
(1) Q-superlinearly, if kxk+1 − x∗ k = o(kxk − x∗ k);
(2) Q-quadratically, if kxk+1 − x∗ k = O(kxk − x∗ k2 ).

2. Characterizations
The aim is to specify an alternative characterization for (Q-)superlinear and (Q-)quadratic
convergence of a sequence {xk }. For that purpose we need some auxiliary results which will
also be applied in the subsequent chapters.
Lemma 4.1. Let f : Rn → R and {xk } ⊂ Rn with limk→∞ xk = x∗ ∈ Rn . Then the following
assertions hold true:
(a) If f is twice continuously differentiable, then
k∇f (xk ) − ∇f (x∗ ) − ∇2 f (xk )(xk − x∗ )k = o(kxk − x∗ k).
(b) If, in addition, ∇2 f is locally Lipschitz-continuous, then
k∇f (xk ) − ∇f (x∗ ) − ∇2 f (xk )(xk − x∗ )k = O(kxk − x∗ k2 ).
Proof. (a) The triangle inequality yields
k∇f (xk ) − ∇f (x∗ ) − ∇2 f (xk )(xk − x∗ )k ≤
≤ k∇f (xk ) − ∇f (x∗ ) − ∇2 f (x∗ )(xk − x∗ )k + k∇2 f (x∗ ) − ∇2 f (xk )k · kxk − x∗ k.
Since f ∈ C 2 by assumption, we have
k∇f (xk ) − ∇f (x∗ ) − ∇2 f (x∗ )(xk − x∗ )k = o(kxk − x∗ k)
as well as
k∇2 f (xk ) − ∇2 f (x∗ )k −→ 0 for k → ∞,
and thus assertion (a).

(b) The Mean Value Theorem yields
k∇f (xk ) − ∇f (x∗ ) − ∇2 f (xk )(xk − x∗ )k =
= k ∫₀¹ ∇2 f (x∗ + τ (xk − x∗ ))(xk − x∗ ) dτ − ∇2 f (xk )(xk − x∗ )k
≤ ∫₀¹ k∇2 f (x∗ + τ (xk − x∗ )) − ∇2 f (xk )k · kxk − x∗ k dτ
≤ Lkxk − x∗ k ∫₀¹ k(τ − 1)(xk − x∗ )k dτ
= (L/2) kxk − x∗ k^2 = O(kxk − x∗ k^2 ). □
The following lemma ensures that, for sufficiently small ε > 0,
(4.3) k∇f (xk )k ≤ ε
represents a reasonable (and frequently applied) stopping criterion.
Lemma 4.2. Let f : Rn → R be twice continuously differentiable, {xk } ⊂ Rn with limk xk =
x∗ , ∇f (x∗ ) = 0 and ∇2 f (x∗ ) nonsingular. Then there exist an index k0 ∈ N and a constant
β > 0 such that
k∇f (xk )k ≥ βkxk − x∗ k for all k ≥ k0 .
Proof. Since f ∈ C 2 we get
k∇f (xk ) − ∇f (x∗ ) − ∇2 f (x∗ )(xk − x∗ )k = o(kxk − x∗ k).
Hence, for every ε > 0 there exists an index k0 (ε) ∈ N with
k∇f (xk ) − ∇f (x∗ ) − ∇2 f (x∗ )(xk − x∗ )k ≤ εkxk − x∗ k ∀k ≥ k0 (ε).
We assume, without loss of generality, that ε < 1/k∇2 f (x∗ )−1 k. Consequently, for all k ≥ k0 (ε) it
holds that
k∇f (xk )k = k∇f (xk ) − ∇f (x∗ ) − ∇2 f (x∗ )(xk − x∗ ) + ∇2 f (x∗ )(xk − x∗ )k
  ≥ k∇2 f (x∗ )(xk − x∗ )k − k∇f (xk ) − ∇f (x∗ ) − ∇2 f (x∗ )(xk − x∗ )k
  ≥ kxk − x∗ k/k∇2 f (x∗ )−1 k − εkxk − x∗ k
  = βkxk − x∗ k with β = 1/k∇2 f (x∗ )−1 k − ε > 0.
Here we used
kxk − x∗ k = k∇2 f (x∗ )−1 ∇2 f (x∗ )(xk − x∗ )k ≤ k∇2 f (x∗ )−1 k · k∇2 f (x∗ )(xk − x∗ )k. □
In numerical practice (4.3) is usually implemented as a relative criterion. This aspect will be
addressed later.
Now we can characterize superlinear convergence of a sequence {xk } as follows.
Prof. Dr. Michael Hintermüller 33

Theorem 4.2. Let f : Rn → R be twice continuously differentiable, {xk } ⊂ Rn with limk xk =
x∗ ∈ Rn , xk 6= x∗ for all k ∈ N and ∇2 f (x∗ ) nonsingular. Then the following assertions are
equivalent:
(a) {xk } converges superlinearly to x∗ and ∇f (x∗ ) = 0.
(b) k∇f (xk ) + ∇2 f (xk )(xk+1 − xk )k = o(kxk+1 − xk k).
(c) k∇f (xk ) + ∇2 f (x∗ )(xk+1 − xk )k = o(kxk+1 − xk k).
Proof. (c) =⇒ (a): The Mean Value Theorem yields
k∇f (xk+1 )k = k∇f (xk+1 ) − ∇f (xk ) − ∇2 f (x∗ )(xk+1 − xk ) + ∇f (xk ) + ∇2 f (x∗ )(xk+1 − xk )k
  = k ∫₀¹ (∇2 f (xk + τ (xk+1 − xk )) − ∇2 f (x∗ ))(xk+1 − xk ) dτ + ∇f (xk ) + ∇2 f (x∗ )(xk+1 − xk )k
  ≤ ∫₀¹ k∇2 f (xk + τ (xk+1 − xk )) − ∇2 f (x∗ )k dτ · kxk+1 − xk k + k∇f (xk ) + ∇2 f (x∗ )(xk+1 − xk )k.
By assumption (c), the continuity of ∇2 f (·) and limk→∞ xk = x∗ we infer the existence of a
zero sequence (εk ) ⊂ R with
(4.4) k∇f (xk+1 )k ≤ εk kxk+1 − xk k.
Hence ∇f (xk+1 ) → 0 and consequently ∇f (x∗ ) = 0. Lemma 4.2 ensures the existence of
β > 0 with
k∇f (xk+1 )k ≥ βkxk+1 − x∗ k
for all sufficiently large k ∈ N. We infer (using (4.4) and the triangle inequality) that
βkxk+1 − x∗ k ≤ εk kxk+1 − xk k ≤ εk (kxk+1 − x∗ k + kxk − x∗ k).
Now we obtain for sufficiently large k ∈ N
kxk+1 − x∗ k/kxk − x∗ k ≤ εk /(β − εk )
and thus superlinear convergence of {xk } to x∗ .
(a) =⇒ (c): By assumption f is twice continuously differentiable, hence ∇f is locally Lipschitz-
continuous. Since xk −→ x∗ , there exists a constant L > 0 with
k∇f (xk+1 ) − ∇f (x∗ )k ≤ Lkxk+1 − x∗ k ∀k ∈ N sufficiently large.
Thus ∇f (x∗ ) = 0 implies
k∇f (xk+1 )k = k∇f (xk+1 ) − ∇f (x∗ )k ≤ L · (kxk+1 − x∗ k/kxk − x∗ k) · (kxk − x∗ k/kxk+1 − xk k) · kxk+1 − xk k.
Since, by assumption, {xk } converges superlinearly to x∗ there exists a zero sequence {εk } ⊂
R+ with
k∇f (xk+1 )k ≤ εk kxk+1 − xk k (use Theorem 4.1).

Moreover it holds that
k∇f (xk ) + ∇2 f (x∗ )(xk+1 − xk )k
  ≤ k∇f (xk+1 )k + ∫₀¹ k∇2 f (xk + τ (xk+1 − xk )) − ∇2 f (x∗ )k dτ · kxk+1 − xk k
  ≤ ( εk + ∫₀¹ k∇2 f (xk + τ (xk+1 − xk )) − ∇2 f (x∗ )k dτ ) kxk+1 − xk k.
Since xk converges to x∗ we have xk + τ (xk+1 − xk ) −→ x∗ uniformly in τ ∈ [0, 1]. Together
with the continuity of ∇2 f (·), this yields
∫₀¹ k∇2 f (xk + τ (xk+1 − xk )) − ∇2 f (x∗ )k dτ −→ 0 for k → ∞,
which implies (c). The equivalence (b) ⇐⇒ (c) is easy to verify. □
A simple, but very important consequence of Theorem 4.2 is associated with gradient-related
methods. For this purpose let {H k } ⊂ Rn×n be a sequence of nonsingular matrices and let
the sequence {xk } be defined by
xk+1 := xk − (H k )−1 ∇f (xk ), k = 0, 1, 2, . . . .
Suppose {xk } converges to x∗ and ∇2 f (x∗ ) is nonsingular; then the following assertions are
equivalent:
(a) {xk } converges superlinearly to x∗ and ∇f (x∗ ) = 0.
(b) k(∇2 f (xk ) − H k )(xk+1 − xk )k = o(kxk+1 − xk k).
(c) k(∇2 f (x∗ ) − H k )(xk+1 − xk )k = o(kxk+1 − xk k).
Remark 4.4. The assertion of Theorem 4.2 even holds true if “superlinear” and “o(kxk+1 −
xk k)” are replaced by “quadratically” and “O(kxk+1 − xk k2 )”.
CHAPTER 5

Gradient based methods

Concerning the general descent method (cf. Algorithm 3.1), we still have to make a reasonable
choice of the descent direction dk .

1. The method of steepest descent


An obvious way to choose d (also with respect to the angle condition from Theorem 3.1 (a))
is by solving
(5.1) min ∇f (x)> d s.t. kdk = 1.
The aim is to determine a direction d along which f in x decreases the most (steepest descent).
Obviously, the solution of (5.1) satisfies
0 ≤ |∇f (x)> d| ≤ k∇f (x)k.
The choice
d = −∇f (x)/k∇f (x)k
yields ∇f (x)> d = −k∇f (x)k and thus solves (5.1). Applying the Wolfe-Powell step size
strategy, it follows immediately from Theorem 3.3 and Theorem 3.1, that every accumulation
point of the sequence {xk } is a stationary point of f . An analogous statement holds true for
the strict Wolfe-Powell rule. As the Armijo condition does not necessarily satisfy assertion (b)
from Theorem 3.1, we would like to indicate the proof. Thereby we assume (for simplicity) a
“backtracking” strategy. To begin with, we consider the following lemma.
Lemma 5.1. Let f : Rn → R be continuously differentiable, x, d ∈ Rn , {xk }, {dk } ⊂ Rn with
limk xk = x and limk dk = d as well as {αk } ⊂ R++ with limk αk = 0. Then it holds that
limk→∞ (f (xk + αk dk ) − f (xk ))/αk = ∇f (x)> d.
Proof. Due to the mean-value theorem, for all k there exists a vector ξ k ∈ Rn on the
line segment joining xk and xk + αk dk with
f (xk + αk dk ) − f (xk ) = αk ∇f (ξ k )> dk .
Since ξ k → x as αk → 0 and f ∈ C 1 , it follows
∇f (ξ k )> dk −→ ∇f (x)> d.


This lemma is useful for the following convergence theorem.



Theorem 5.1. If f : Rn → R is continuously differentiable, then every accumulation point
of a sequence {xk } generated by Algorithm 3.1 with Armijo step size strategy and dk =
−∇f (xk )/k∇f (xk )k is a stationary point of f .
Proof. Let x∗ ∈ Rn be an accumulation point of {xk }, and let {xk(l) } be a subsequence
converging to x∗ . Assume ∇f (x∗ ) 6= 0. Since {f (xk )} is monotonically decreasing and the
subsequence {f (xk(l) )} converges to f (x∗ ), the entire sequence {f (xk )} converges to f (x∗ ).
Thus,
f (xk+1 ) − f (xk ) → 0 for k → ∞.
From ∇f (xk(l) ) → ∇f (x∗ ) 6= 0 it follows that αk(l) → 0, where αk(l) = β mk(l) , mk(l) ∈ N
denoting the uniquely determined exponent from the Armijo rule. Consequently it holds that
f (xk(l) + β mk(l) −1 dk(l) ) > f (xk(l) ) + σβ mk(l) −1 ∇f (xk(l) )> dk(l)
for sufficiently large l. This implies
(f (xk(l) + β mk(l) −1 dk(l) ) − f (xk(l) ))/β mk(l) −1 > σ∇f (xk(l) )> dk(l) .
For l → ∞, it follows from β mk(l) −1 → 0 and Lemma 5.1 that
−k∇f (x∗ )k ≥ −σk∇f (x∗ )k.
Since ∇f (x∗ ) 6= 0 and σ ∈ (0, 1) this yields a contradiction. □
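A minimal implementation of the method analyzed in Theorem 5.1 (normalized steepest descent with Armijo backtracking) might look as follows; the parameter values and the quadratic test function in the usage note are illustrative choices, not prescribed by the notes:

```python
import numpy as np

def steepest_descent_armijo(f, grad, x0, sigma=0.3, beta=0.5,
                            eps=1e-6, max_iter=5000):
    """Algorithm 3.1 with d^k = -grad f(x^k)/||grad f(x^k)|| and
    Armijo backtracking starting from alpha = 1."""
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        ng = np.linalg.norm(g)
        if ng <= eps:
            break
        d = -g / ng
        alpha = 1.0
        # halve alpha until the Armijo condition is satisfied
        while f(x + alpha * d) > f(x) + sigma * alpha * (g @ d):
            alpha *= beta
        x = x + alpha * d
    return x
```

For instance, `steepest_descent_armijo(lambda x: 0.5*x @ Q @ x, lambda x: Q @ x, x0)` minimizes a quadratic with positive definite Q.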
The rate of convergence of the steepest descent method can be very slow. We want to illustrate
this with the help of quadratic functions: Let
f (x) = (1/2) x> Qx + c> x + γ
with Q ∈ S n positive definite, c ∈ Rn and γ ∈ R. Let xk , dk be given. We have
ϕ(α) := (1/2)(xk + αdk )> Q(xk + αdk ) + c> (xk + αdk ) + γ
      = (α2 /2)(dk )> Qdk + α((dk )> Qxk + c> dk ) + (1/2)(xk )> Qxk + c> xk + γ.
The quadratic nature of ϕ(α) = f (xk + αdk ) allows to explicitly determine the minimizing
step size αk , i.e.
f (xk + αk dk ) = min{f (xk + αdk ) : α ≥ 0}.
The step size αk fulfills ϕ0 (αk ) = 0 with
αk = −(Qxk + c)> dk /((dk )> Qdk ) = −∇f (xk )> dk /((dk )> Qdk ) = ∇f (xk )> ∇f (xk )/(∇f (xk )> Q∇f (xk )),
where we use dk = −∇f (xk ) (without scaling by k∇f (xk )k−1 ). It can be shown that
(5.2) f (xk+1 ) − f (x∗ ) ≤ ((λmax − λmin )/(λmax + λmin ))2 (f (xk ) − f (x∗ )),
where λmax ≥ λmin > 0 are the largest and smallest eigenvalue of Q, respectively, and x∗ is the
global minimizer of f . Let κ = λmax /λmin be the spectral condition number of the matrix Q. Then

(5.2) is equivalent to
f (xk+1 ) − f (x∗ ) ≤ ((κ − 1)/(κ + 1))2 (f (xk ) − f (x∗ )).
Furthermore, λmin x> x ≤ x> Qx ≤ λmax x> x implies
kxk − x∗ k ≤ √κ ((κ − 1)/(κ + 1))k kx0 − x∗ k.
Evidently the rate of convergence is very slow if κ is very large (zig-zagging effect).
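The bound (5.2) is attained essentially exactly for a suitably chosen starting point, and the resulting slow progress can be observed directly (matrix and starting point below are illustrative):

```python
import numpy as np

Q = np.diag([1.0, 100.0])             # lambda_min = 1, lambda_max = 100, kappa = 100
f = lambda x: 0.5 * x @ Q @ x         # c = 0, gamma = 0, minimizer x* = 0
x = np.array([100.0, 1.0])            # worst-case starting point for this Q

vals = [f(x)]
for _ in range(50):
    g = Q @ x                         # d^k = -g^k
    alpha = (g @ g) / (g @ Q @ g)     # exact (minimizing) step size
    x = x - alpha * g
    vals.append(f(x))

ratio = vals[1] / vals[0]             # observed reduction factor per step
bound = ((100 - 1) / (100 + 1)) ** 2  # ((kappa-1)/(kappa+1))^2, about 0.9608
```

Even after 50 exact line searches the function value has only dropped by a factor of roughly 0.9608^50, about 0.135; the iterates zig-zag between the two eigendirections.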

2. Gradient-related methods
A possible remedy for the slow convergence of the method of steepest descent consists in
choosing
dk = −H −1 ∇f (xk ),
where H ∈ S n is positive definite. Additionally, the matrix H should be chosen such that
0 < λmax (H −1 Q)/λmin (H −1 Q) < λmax (Q)/λmin (Q)
and Hdk = −∇f (xk ) is simple to solve.
Definition 5.1. Let f : Rn → R be continuously differentiable and {xk } ⊂ Rn . A sequence
{dk } ⊂ Rn is called gradient-related w.r.t. f and {xk }, if for every subsequence {xk(l) }
converging to a nonstationary point of f , there exist c > 0 and ε > 0 such that
(a) kdk(l) k ≤ c for all l ∈ N,
(b) ∇f (xk(l) )> dk(l) ≤ −ε for all sufficiently large l ∈ N.
Remark 5.1. (1) Let {H k } ⊂ S n be a sequence of positive definite matrices, which
fulfill
c1 kxk2 ≤ x> H k x ≤ c2 kxk2 for all x ∈ Rn , k ∈ N,
with constants c2 ≥ c1 > 0. Then {dk } given by
H k dk = −∇f (xk ) for all k ∈ N
is gradient-related.
(2) For Algorithm 3.1 with gradient-related search directions and Armijo step size strat-
egy an analogous statement to Theorem 5.1 holds true.
(3) Sometimes the choice
H k = diag(hkii ) with hkii = ∂ 2 f (xk )/∂x2i
results in a significant improvement of the rate of convergence.
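For a quadratic with diagonal Hessian, the diagonal choice of H k in item (3) even reproduces Newton's method and yields the minimizer in a single step; a toy illustration (data chosen for illustration only):

```python
import numpy as np

Q = np.diag([1.0, 100.0])     # f(x) = 0.5 x^T Q x, minimizer x* = 0
x = np.array([100.0, 1.0])

h = Q.diagonal()              # h_ii = second partial derivatives of f
d = -(Q @ x) / h              # solve H^k d^k = -grad f(x^k), H^k diagonal
x_new = x + d                 # a single unit step reaches x* here
```

For non-diagonal Hessians the diagonal scaling is only an approximation, but it often corrects badly scaled variables cheaply.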
CHAPTER 6

Conjugate gradient method

We first derive the conjugate gradient method for minimizing strictly convex quadratic func-
tions. Then we transfer the technique to minimization problems of general nonlinear functions.
In this context we consider the Fletcher-Reeves and the Polak-Ribière variants of the conju-
gate gradient (CG) method. The two versions differ in the update strategy of a scalar which
has an impact on the determination of the search direction and the line search algorithm.
While the original Polak-Ribière method requires an impractical step size strategy in order
to be analyzed successfully, we will briefly elaborate on a modified Polak-Ribière variant of
the conjugate gradient method, which is based on an implementable step size strategy.

1. Quadratic minimization problems


As we have already seen in section 1 of chapter 5, the method of steepest descent may be
very slow even if an exact line search is performed. Still utilizing gradient information only,
in this section our goal is to devise a strategy which overcomes (to some extent) this difficulty
of the steepest descent method and which is computationally efficient. In fact, often one is
confronted with minimizing
f (x) = (1/2) x> Ax − b> x + c with A ∈ S n pos. def.,
where A is a very large matrix often possessing special structure. The last aspect frequently
allows an efficient computation of the matrix-vector product Ax: For instance if A is a dense
matrix being the product of sparse matrices, i.e.,
A = Ql · Ql−1 · · · Q1 , l ∈ N,
then it is easy to calculate the product Ax iteratively according to
y i+1 = Qi+1 y i , y 0 = x
with y l = Ax, exploiting the structure of the Qi . In the subsequent chapter 8 we will study
a strategy which approximates the Hessian ∇2 f by means of special difference methods.
In every iteration k of the resulting method (Quasi-Newton method) one determines the
search direction dk as the solution of H k d = −∇f (xk ), where H k is an approximation of
∇2 f (xk ). Then a possible choice of H k+1 yielding a positive definite symmetric approximation
of ∇2 f (xk ) is given by
H k+1 = H k + y k (y k )> /((sk )> y k ) − H k sk (sk )> H k /((sk )> H k sk ),
where y k = ∇f (xk+1 ) − ∇f (xk ) and sk = xk+1 − xk . If indeed one performs the multiplica-
tions on the right hand side, then one usually obtains a dense matrix H k+1 , even if ∇2 f (xk+1 )
is sparse. In case of very large Hessians this would directly lead to storage problems. Al-
ternatively one could only store the vectors (sk , y k ) in every iteration k and determine the

matrix-vector product H k+1 d iteratively as a vector-vector product according to the update


formula. In this way one can avoid the storage problems mentioned above.
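The idea of applying H k+1 to a vector purely through stored pairs (sk , y k ) can be sketched recursively, starting from H 0 = I. This naive recursion recomputes H k−1 sk−1 at every level, so it only illustrates the storage-free evaluation and is not an efficient implementation:

```python
import numpy as np

def apply_H(pairs, d, k=None):
    """Compute H^k d for the rank-two update
    H^{k+1} = H^k + y y^T/(s^T y) - H^k s s^T H^k/(s^T H^k s),
    using only the stored pairs (s^j, y^j), j < k, and H^0 = I."""
    if k is None:
        k = len(pairs)
    if k == 0:
        return np.array(d, dtype=float)
    s, y = pairs[k - 1]
    Hd = apply_H(pairs, d, k - 1)    # H^{k-1} d
    Hs = apply_H(pairs, s, k - 1)    # H^{k-1} s^{k-1}
    # symmetry of H^{k-1} gives s^T H^{k-1} d = (H^{k-1} s)^T d
    return Hd + y * (y @ d) / (s @ y) - Hs * (Hs @ d) / (s @ Hs)
```

A practical implementation would cache the vectors H j sj once per update instead of recursing.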
The conjugate gradient method introduced in this section, when applied to quadratic functions
with positive definite Hessians A, terminates after finitely many steps and requires only
matrix-vector products involving A. Hence it is not necessary to store A as an array.
The necessary and sufficient condition for the minimizer x∗ of
f (x) = (1/2) x> Ax − b> x + c with A ∈ S n pos. def.
is
Ax∗ = b.
For a given initial value x0 we now specify a method which iteratively constructs a solution
to
(6.1) Ax = b.
The following result motivates our strategy.
Lemma 6.1. Let f (x) = (1/2) x> Ax − b> x + c with A ∈ S n positive definite, b ∈ Rn , c ∈ R and
x0 ∈ Rn . Further let d0 , d1 , . . . , dn−1 ∈ Rn be different from the null vector with
(6.2) (di )> Adj = 0, ∀i, j = 0, 1, . . . , n − 1, i 6= j.
Then the method of successive one-dimensional minimization along the directions d0 , d1 , . . . , dn−1 ,
i.e. the computation of the sequence {xk } according to
xk+1 = xk + αk dk
with
(6.3) f (xk + αk dk ) = min{f (xk + αdk ) : α ∈ R}, k = 0, 1, . . . , n − 1
yields, after at most n steps, the minimizer xn = x∗ of f . Furthermore, for k = 0, . . . , n − 1
with g k := Axk − b = ∇f (xk ) it holds that
(6.4) αk = −(g k )> dk /((dk )> Adk )
and
(6.5) (g k+1 )> dj = 0, j = 0, 1, . . . , k.
Proof. We have
f (xk + αdk ) = (α2 /2)(dk )> Adk + α((dk )> Qxk + c> dk )
hmm
From (6.3) it follows that the minimizer αk fulfills
αk (dk )> Adk + (g k )> dk = 0.
This proves (6.4). Moreover, we obtain
0 = αk (dk )> Adk + (g k )> dk
= (αk (dk )> A + (xk )> A − b> )dk
= (A(xk + αk dk ) − b)> dk = (Axk+1 − b)> dk .

We infer
(6.6) (g k+1 )> dk = 0 for k = 0, 1, ..., n − 1.
The application of (6.2) for i 6= j implies
(g i+1 − g i )> dj = (Axi+1 − Axi )> dj = αi (di )> Adj = 0.
Together with (6.6) this yields
(g k+1 )> dj = (g j+1 )> dj + Σ_{i=j+1}^{k} (g i+1 − g i )> dj = 0
for j = 0, ..., k. This proves (6.5). Due to (6.2), the vectors d0 , . . . , dn−1 are pairwise orthog-
onal w.r.t. the scalar product
< u, v >A := u> Av.
Consequently these vectors are also linearly independent. From (6.5), it follows immediately
that g n = 0 or equivalently xn = x∗ , the solution of the problem (6.1). 
Remark 6.1. Vectors d0 , d1 , . . . , dn−1 with the property (6.2) are called A-conjugate or A-
orthogonal.
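A-conjugate directions can be generated from an arbitrary basis by Gram-Schmidt orthogonalization with respect to the inner product ⟨u, v⟩A = u> Av; a sketch (the function name and the random test matrix in the test are illustrative):

```python
import numpy as np

def a_conjugate(A, V):
    """A-orthogonalize the columns of V (assumed linearly independent)
    w.r.t. the inner product <u, v>_A = u^T A v."""
    D = []
    for v in V.T:
        d = np.array(v, dtype=float)
        for dj in D:
            d = d - (dj @ A @ d) / (dj @ A @ dj) * dj   # modified Gram-Schmidt
        D.append(d)
    return np.column_stack(D)
```

For a symmetric positive definite A, `a_conjugate(A, np.eye(len(A)))` returns directions with D> AD diagonal, as required in (6.2).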
We continue by defining a strategy to determine A-conjugate directions d0 , d1 , . . . , dn−1 : We
begin with
d0 = −∇f (x0 ) = −g 0 .
Assume we already know l + 1 vectors d0 , . . . , dl with
(6.7) (di )> Adj = 0 for i, j = 0, . . . , l with i 6= j.
According to Lemma 6.1, (6.4) and (6.5) hold true for k = 0, . . . , l. We suppose that g l+1 6= 0,
otherwise we have already found the solution. We make the ansatz
(6.8) dl+1 := −g l+1 + Σ_{i=0}^{l} βil di
(because of (6.5), g l+1 is linearly independent of d0 , . . . , dl ).
We want to achieve (dl+1 )> Adj = 0 for j = 0, . . . , l.
It holds that
(dl+1 )> Adj = ( −g l+1 + Σ_{i=0}^{l} βil di )> Adj
             = −(g l+1 )> Adj + Σ_{i=0}^{l} βil (di )> Adj
             = −(g l+1 )> Adj + βjl (dj )> Adj ,
since by (6.7), (di )> Adj = 0 for i, j = 0, . . . , l with i 6= j. This implies
(6.9) βjl = (g l+1 )> Adj /((dj )> Adj ) for j = 0, . . . , l.

We study further properties. In fact, multiplying (6.8) by (g l+1 )> , we obtain
(g l+1 )> ( −g l+1 + Σ_{i=0}^{l} βil di ) = −kg l+1 k2 + Σ_{i=0}^{l} βil (g l+1 )> di = −kg l+1 k2 < 0.
Obviously dl+1 is a descent direction for f at xl+1 . Then, by (6.4),
αl+1 = −(g l+1 )> dl+1 /((dl+1 )> Adl+1 ) > 0.
As a result of the construction of dk , k = 0, . . . , l, it holds that
(g k )> dk = −kg k k2 < 0 and αk > 0 for k = 0, . . . , l.
A further orthogonality relation can be obtained in the following way: by (6.8), g j = Σ_{i=0}^{j−1} βij−1 di − dj , so that, using (6.5),
(6.10) (g l+1 )> g j = (g l+1 )> ( Σ_{i=0}^{j−1} βij−1 di − dj )
(6.11)              = Σ_{i=0}^{j−1} βij−1 (g l+1 )> di − (g l+1 )> dj = 0.

Thus, for the left hand side in (6.9) we get
(g l+1 )> Adj = (1/αj )(g l+1 )> (g j+1 − g j ) = 0 for j = 0, . . . , l − 1,
where we used g j+1 − g j = Axj+1 − Axj = αj Adj and (6.11). Hence,
βjl = 0 for j = 0, . . . , l − 1
and
βll = (g l+1 )> Adl /((dl )> Adl ) = (1/αl )(g l+1 )> (g l+1 − g l )/((dl )> Adl )
    = kg l+1 k2 /((dl )> (g l+1 − g l )) = kg l+1 k2 /((−dl )> g l ) = kg l+1 k2 /kg l k2 =: βl .
Consequently (6.8) is reduced to
dl+1 = −g l+1 + βl dl .
Due to g k = Axk − b we also have:
g k+1 − g k = Axk+1 − Axk = αk Adk .
Therefore g k can be updated at each step without requiring a further matrix-vector product.
The product Adk was already necessary to determine αk . To spare the evaluation of the scalar
product, one can rearrange (6.4) by means of (g k )> dk = −kg k k2 to
αk = kg k k2 /((dk )> Adk ).
The CG-algorithm is as follows:

Algorithm 6.1 (CG-algorithm for quadratic functions).

input: x0 ∈ Rn
begin
set g 0 := Ax0 − b, d0 := −g 0 , k := 0; choose ε ≥ 0.
while kg k k > ε
begin
set
αk := kg k k2 /((dk )> Adk )
xk+1 := xk + αk dk
g k+1 := g k + αk Adk
βk := kg k+1 k2 /kg k k2
dk+1 := −g k+1 + βk dk
k := k + 1
end
end
end
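A direct transcription of Algorithm 6.1 (the max_iter safeguard and the random test problem in the test are additions for illustration):

```python
import numpy as np

def cg(A, b, x0, eps=1e-10, max_iter=None):
    """Algorithm 6.1: CG for f(x) = 0.5 x^T A x - b^T x, A s.p.d."""
    n = len(b)
    if max_iter is None:
        max_iter = 2 * n              # safeguard against rounding errors
    x = np.array(x0, dtype=float)
    g = A @ x - b
    d = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:
            break
        z = A @ d                     # the only matrix-vector product per step
        alpha = (g @ g) / (d @ z)
        x = x + alpha * d
        g_new = g + alpha * z
        beta = (g_new @ g_new) / (g @ g)
        d = -g_new + beta * d
        g = g_new
    return x
```

Note that A enters only through the product Ad, so A never needs to be stored explicitly if a routine for the product is available.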

Remark 6.2. (1) The main computational effort of Algorithm 6.1 is the calculation of
Adk . Given that the product is needed twice, one should store it as z k := Adk .

(2) Because of (6.11) βk can be calculated according to
(6.12) βk = (g k+1 − g k )> g k+1 /kg k k2 .
In this case, however, an additional scalar product is required. But from a numerical
point of view (6.12) is often more appropriate. The reason is that the directions dk
quickly lose their A-conjugacy due to numerical errors. Consequently the descent
property of the direction dk might get lost and αk might become very small. Approx-
imately evaluating Adk might even imply a negative αk . Therefore it is recommended
to choose
βk = max{0, (g k+1 − g k )> g k+1 /kg k k2 },
αk = max{0, kg k k2 /((dk )> Adk )}.
If αk ≈ 0 or βk ≈ 0 due to error influences, then xk+1 = xk + αk dk and g k+1 =
g k + αk Adk yield
xk+1 ≈ xk and g k+1 ≈ g k .
The choice of the next direction dk+1 is dominated by −g k+1 . Thus
dk+1 ≈ −g k+1

basically corresponds to the direction of steepest descent in xk+1 . In this sense a
kind of automatic restart is carried out.
(3) Even if (6.12) is applied, it is recommended to execute a restart from time to time.
The CG-method, which is theoretically a direct method, i.e. terminating after finitely
many steps with the exact solution, can be numerically regarded as an iterative
method.
(4) As mentioned before, the CG-method finds the exact solution after at most n steps.
Moreover it can be shown: If A possesses m (≤ n) different eigenvalues, then the CG-
method terminates after m steps with the exact solution. In addition, the method
terminates after m steps if b can be represented as a linear combination of at most
m eigenvectors of A and if x0 = 0 is used.
Let κ = λmax (A)/λmin (A); then the iteration sequence {xk } of the CG-method satisfies
kxk − x∗ k ≤ 2√κ ((√κ − 1)/(√κ + 1))k kx0 − x∗ k.
Obviously, the closer κ approaches 1, the faster the method converges. This leads to the
concept of preconditioning.
For an efficient preconditioning, our aim is to find a transformation of A, such that as many
eigenvalues (of the transformed matrix) as possible are 1 (or cluster near 1). For this purpose,
let W ∈ S n be positive definite. The solution of Ax = b can be found by solving the system
W −1/2 AW −1/2 y = W −1/2 b
and using x = W −1/2 y. The matrix W −1/2 AW −1/2 =: R has the same eigenvalues as W −1 A,
since W −1/2 RW 1/2 = W −1 A. The task of determining W such that as many eigenvalues
as possible are 1 and all others are close to 1 corresponds to making the condition
number of W −1 A as small as possible. The matrix W is called preconditioning matrix or
simply preconditioner. We note that in practice one only employs W , but not W 1/2 .

Algorithm 6.2 (Preconditioned CG-algorithm for quadratic functions).

input: x0 ∈ Rn , W ∈ S n positive definite.


begin
set g 0 := Ax0 − b, d0 := −W −1 g 0 , k := 0; choose ε ≥ 0.
while kg k k > ε
begin
set
αk := (g k )> W −1 g k /((dk )> Adk )
xk+1 := xk + αk dk
g k+1 := g k + αk Adk
βk := (g k+1 )> W −1 g k+1 /((g k )> W −1 g k )
dk+1 := −W −1 g k+1 + βk dk
k := k + 1

end
end
end

Naturally, in numerical practice one does not evaluate W −1 . Merely


Wd = g
is solved. This demands a certain efficiency when solving the system and often requires a
compromise between “κ(W −1 A) preferably close to 1” and simple solvability of W d = g.
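Algorithm 6.2 in code, with W entering only through a solve of W d = g; the Jacobi choice W = diag(A) used in the test is one simple illustrative preconditioner:

```python
import numpy as np

def pcg(A, b, x0, w_solve, eps=1e-10, max_iter=1000):
    """Preconditioned CG; w_solve(g) returns the solution d of W d = g."""
    x = np.array(x0, dtype=float)
    g = A @ x - b
    z = w_solve(g)
    d = -z
    gz = g @ z                        # (g^k)^T W^{-1} g^k
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:
            break
        Ad = A @ d
        alpha = gz / (d @ Ad)
        x = x + alpha * d
        g = g + alpha * Ad
        z = w_solve(g)
        gz_new = g @ z
        beta = gz_new / gz
        d = -z + beta * d
        gz = gz_new
    return x
```

Passing `w_solve` as a function keeps the compromise explicit: any cheap approximate solver for W d = g can be plugged in.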

2. Nonlinear functions
2.1. Fletcher-Reeves method. The elementary structure of Algorithm 6.1 is the mo-
tivation for the following variant of the conjugate gradient method for the minimization of
continuously differentiable but not necessarily quadratic functions.

Algorithm 6.3 (Fletcher-Reeves method).

input: x0 ∈ Rn , 0 < σ < ρ < 1/2 .


begin
set d0 := −∇f (x0 ), k := 0; choose ε ≥ 0.
while k∇f (xk )k > ε
begin
determine αk s.t. the strong Wolfe-Powell rule is satisfied, i.e.
f (xk + αk dk ) ≤ f (xk ) + σαk ∇f (xk )> dk and
|∇f (xk + αk dk )> dk | ≤ −ρ∇f (xk )> dk .
set
xk+1 := xk + αk dk
βkF R := k∇f (xk+1 )k2 /k∇f (xk )k2
dk+1 := −∇f (xk+1 ) + βkF R dk
k := k + 1
end
end
end

Note that we now require ρ ∈ (σ, 1/2) which is more restrictive than the condition ρ ∈ [σ, 1)
introduced in chapter 2.3. The current restriction is due to the convergence analysis of
the method. Firstly it can be shown that Algorithm 6.3 is well-defined for a continuously
differentiable function f : Rn → R which is bounded from below. Moreover we have the
following convergence property.

Theorem 6.1. Let f : Rn → R be continuously differentiable, bounded from below and ∇f


Lipschitz continuous on L(x0 ) := {x ∈ Rn : f (x) ≤ f (x0 )}. Then it holds that
lim inf k→∞ k∇f (xk )k = 0.

If L(x0 )
is convex, f twice continuously differentiable and uniformly convex on L(x0 ), then
the sequence {xk } generated by the Fletcher-Reeves method (Algorithm 6.3) converges to the
unique global minimizer of f .
2.2. Polak-Ribière method and modifications. In numerical practice it is often ob-
served that variants of the following nonlinear CG-method (Polak-Ribière method ) have better
convergence behavior.

Algorithm 6.4 (Polak-Ribière method).

input: x0 ∈ Rn .
begin
set d0 := −∇f (x0 ), k := 0; choose ε ≥ 0.
while k∇f (xk )k > ε
begin
determine αk such that
(6.13) αk = min{α > 0 | ∇f (xk + αdk )> dk = 0}.
set
xk+1 := xk + αk dk
βkP R := (∇f (xk+1 ) − ∇f (xk ))> ∇f (xk+1 )/k∇f (xk )k2
dk+1 := −∇f (xk+1 ) + βkP R dk
k := k + 1
end
end
end

The Fletcher-Reeves and the Polak-Ribière-method differ in the strategy for determining αk
and for the choice of βk :
• The step size choice (6.13) in the Polak-Ribière method is impractical, but necessary
for the convergence analysis. However in numerical practice the strong Wolfe-Powell
rule (here with small ρ), which is also applied in the Fletcher-Reeves algorithm,
yields satisfying results. There also exist so-called modified Polak-Ribière methods,
which work with an implementable step size strategy. In one instance, one computes
αk such that xk+1 = xk + αk dk and dk+1 = −∇f (xk+1 ) + βkP R dk satisfy the following
conditions:
(6.14) f (xk+1 ) ≤ f (xk ) − σαk2 kdk k2 and
(6.15) − δ2 k∇f (xk+1 )k2 ≤ ∇f (xk+1 )> dk+1 ≤ −δ1 k∇f (xk+1 )k2 ,

where αk = max{ρk β ℓ : ℓ = 0, 1, 2, . . .} with ρk := |∇f (xk )> dk |/kdk k2 s.t. (6.14)
and (6.15) are satisfied. The parameter restrictions are: β ∈ (0, 1), σ ∈ (0, 1) and
0 < δ1 < 1 < δ2 .
Under stronger assumptions compared to the ones for the Fletcher-Reeves method,
it can be shown that a sequence {xk } generated by the modified Polak-Ribière
method satisfies:
limk→∞ k∇f (xk )k = 0.
Provided that the level set L(x0 ) is convex and f uniformly convex on L(x0 ), then
it holds again that the sequence {xk } converges to the uniquely determined global
minimizer of f .
• Concerning the choice of βkF R resp. βkP R note that in the case where dk yields
hardly any progress in the minimization of f , owing to bad descent- and conjugation
properties, the Polak-Ribière method tends to perform better than the Fletcher-
Reeves method. This can be seen in the following way: In the just mentioned
situation we can expect xk+1 to be close to xk , because αk is very small. Hence
∇f (xk+1 ) will also be close to ∇f (xk ), even if k∇f (xk+1 )k is still relatively large.
In such cases 0 ≤ |βkP R |  βkF R can be expected. If βkP R is close to 0, then it holds
that dk+1 ≈ −∇f (xk+1 ) and the method almost carries out a restart.
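A sketch of a practical Polak-Ribière variant using the common β ← max{0, β PR } restart safeguard and a plain Armijo backtracking line search. This is a simplification of the step size rules (6.13)-(6.15) above, so it is an illustrative variant rather than the exact method analyzed in the notes:

```python
import numpy as np

def nonlinear_cg_pr(f, grad, x0, eps=1e-8, sigma=1e-4, max_iter=1000):
    x = np.array(x0, dtype=float)
    g = grad(x)
    d = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:
            break
        if g @ d >= 0:                # safeguard: fall back to steepest descent
            d = -g
        alpha = 1.0
        while f(x + alpha * d) > f(x) + sigma * alpha * (g @ d):
            alpha *= 0.5              # Armijo backtracking
        x_new = x + alpha * d
        g_new = grad(x_new)
        beta = max(0.0, (g_new - g) @ g_new / (g @ g))   # "PR+" update
        d = -g_new + beta * d
        x, g = x_new, g_new
    return x
```

When β is truncated to 0 the direction reduces to −∇f (xk+1 ), i.e. exactly the automatic restart behavior discussed above.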
CHAPTER 7

Newton’s method

From now on we assume that f and its local minimizers x∗ fulfill the following conditions:
(A) 1. f is twice continuously differentiable with k∇2 f (x) − ∇2 f (y)k ≤ γkx − yk
in a neighborhood of x∗ ,
2. ∇f (x∗ ) = 0,
3. ∇2 f (x∗ ) positive definite.
To simplify notations we will frequently denote the current iteration point by xa and the new
iterate by x+ .
We consider the following quadratic model of f near xa :
ma (x) = f (xa ) + ∇f (xa )> (x − xa ) + (1/2)(x − xa )> ∇2 f (xa )(x − xa ).
If ∇2 f (xa ) is positive definite, then we define x+ as the minimizer of ma (x):
0 = ∇ma (x+ ) = ∇f (xa ) + ∇2 f (xa )(x+ − xa ).
Rearranging yields the iteration rule of Newton’s method, i.e.
x+ = xa − (∇2 f (xa ))−1 ∇f (xa ).
Naturally the inverse (∇2 f (xa ))−1 is not computed. Merely
∇2 f (xa )d = −∇f (xa )
is solved, and we set x+ = xa + d. If xa is far away from a local minimizer, then ∇2 f (xa )
might possess negative eigenvalues. Then x+ can be a local maximizer or a saddle point of
ma . In order to cope with this, we have to introduce certain modifications. But for the time
being we assume that xa is sufficiently close to a local minimizer.
In what follows we will often make use of the following result:
Lemma 7.1. Let (A) be satisfied. Then there exists δ > 0 such that for all x ∈ B(δ) := {y :
ky − x∗ k < δ} it holds that
k∇2 f (x)k ≤ 2k∇2 f (x∗ )k,
k(∇2 f (x))−1 k ≤ 2k(∇2 f (x∗ ))−1 k,
kx − x∗ k/(2k(∇2 f (x∗ ))−1 k) ≤ k∇f (x)k ≤ 2k∇2 f (x∗ )k kx − x∗ k.
This enables us to prove local convergence of Newton’s method.
Theorem 7.1. Let (A) be satisfied. Then there exist constants K > 0 and δ > 0 (independent
of xa and x+ ) such that, for xa ∈ B(δ), the Newton step
x+ = xa − (∇2 f (xa ))−1 ∇f (xa )

satisfies the following estimate:
kx+ − x∗ k ≤ Kkxa − x∗ k2 .
Proof. Let δ > 0 be chosen sufficiently small such that the assertion of Lemma 7.1 holds
true. Then it holds that
x+ − x∗ = xa − x∗ − (∇2 f (xa ))−1 ∇f (xa )
        = (∇2 f (xa ))−1 (∇2 f (xa )(xa − x∗ ) − ∇f (xa ))
        = (∇2 f (xa ))−1 (∇2 f (xa )(xa − x∗ ) − ∇f (x∗ ) − ∫₀¹ ∇2 f (x∗ + t(xa − x∗ ))(xa − x∗ ) dt)
        = (∇2 f (xa ))−1 ∫₀¹ (∇2 f (xa ) − ∇2 f (x∗ + t(xa − x∗ )))(xa − x∗ ) dt.
Thus, we have
kx+ − x∗ k ≤ k(∇2 f (xa ))−1 k · ∫₀¹ k∇2 f (xa ) − ∇2 f (x∗ + t(xa − x∗ ))k dt · kxa − x∗ k
          ≤ 2k(∇2 f (x∗ ))−1 k · γ ∫₀¹ kxa − x∗ − t(xa − x∗ )k dt · kxa − x∗ k
          = 2γk(∇2 f (x∗ ))−1 k · kxa − x∗ k2 · ∫₀¹ (1 − t) dt
          = 2γk(∇2 f (x∗ ))−1 k · (1/2) · kxa − x∗ k2
          = γk(∇2 f (x∗ ))−1 k · kxa − x∗ k2 .
Setting K := γk(∇2 f (x∗ ))−1 k proves the assertion. □
Theorem 7.1 immediately implies the local Q-quadratic convergence of Newton’s method.
Theorem 7.2. Let (A) be satisfied. Then there exists δ > 0 such that Newton’s method
xk+1 = xk − (∇2 f (xk ))−1 ∇f (xk )
for x0 ∈ B(δ) converges Q-quadratically to x∗ .
Proof. Let δ > 0 be sufficiently small such that the assertion of Theorem 7.1 holds true.
If necessary reduce δ to guarantee Kδ =: µ < 1. For k ≥ 0 and xk ∈ B(δ), Theorem 7.1
yields
(7.1) kxk+1 − x∗ k ≤ Kkxk − x∗ k2 ≤ µkxk − x∗ k < kxk − x∗ k < δ.
Thus xk+1 ∈ B(µδ) ⊂ B(δ). As x0 ∈ B(δ), this implies that Newton’s method is well-defined
and {xk } ⊂ B(δ). Now, (7.1) yields the Q-quadratic convergence of xk → x∗ . 
A canonical stopping criterion for Newton’s method (as well as for the gradient based methods
of section 5 and the CG-method of section 1) consists of a relative and an absolute error
bound. Let τr ∈ (0, 1) be a desired reduction in the gradient norm and τa , with 1 ≫ τa > 0,
an absolute error bound, then the algorithm terminates as soon as
k∇f (xk )k ≤ τr k∇f (x0 )k + τa
holds true.
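Newton's method with exactly this stopping rule, sketched for a small smooth test problem. The function f (x) = Σi (cosh(xi ) − 1), with ∇f (x) = sinh(x), ∇2 f (x) = diag(cosh(x)) and minimizer x∗ = 0, is an illustrative choice:

```python
import numpy as np

def newton(grad, hess, x0, tau_r=1e-12, tau_a=1e-12, max_iter=50):
    x = np.array(x0, dtype=float)
    g0_norm = np.linalg.norm(grad(x))
    for _ in range(max_iter):
        g = grad(x)
        # relative + absolute stopping criterion
        if np.linalg.norm(g) <= tau_r * g0_norm + tau_a:
            break
        d = np.linalg.solve(hess(x), -g)   # never form the inverse
        x = x + d
    return x

x = newton(np.sinh, lambda x: np.diag(np.cosh(x)), 0.5 * np.ones(3))
```

From this starting point only a handful of iterations are needed, reflecting the local Q-quadratic convergence of Theorem 7.2.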

1. Inaccuracies in function, gradient and Hessian evaluation


We discuss the error influence by means of a one-dimensional problem. For that purpose
assume that f can be approximately evaluated, i.e.
f̃ (·) = f (·) + εf (·) with an error level ε̄f ≥ 0 and |εf (·)| ≤ ε̄f .
Determining the derivatives numerically, e.g. by forward differences
Dh+ f (x) = (f̃ (x + h) − f̃ (x))/h,
results in
kDh+ f (x) − f 0 (x)k = k(f̃ (x + h) − f̃ (x))/h − f 0 (x)k
  = k(f (x + h) + εf (x + h) − f (x) − εf (x))/h − f 0 (x)k
  ≤ k(f (x + h) − f (x))/h − f 0 (x)k + 2ε̄f /h
  = (1/2)kf 00 (ξ)hk + 2ε̄f /h = O(h + ε̄f /h).
Here ξ lies on the line segment joining x and x + h. The minimizer h∗ of the error function
err+ (h) = h + ε̄f /h fulfills
err0+ (h∗ ) = 1 − ε̄f /(h∗ )2 = 0.
This implies
h∗ = √ε̄f and err+ (h∗ ) = 2√ε̄f .
For the error in the gradient we obtain
εg = O(h∗ ) = O(√ε̄f ).
Applying once again forward differences to calculate the Hessian, we obtain that the error εH
is of order
εH = O(√εg ) = O(ε̄f^{1/4} ).
This implies that Hessian matrices computed by two numerical differentiations are relatively
inaccurate. Even if we assume that ε̄f ≈ 10−16 (order of machine accuracy!) the error in the
approximation of the Hessian is εH ≈ 10−4 !
The application of central (or symmetric) differences yields better results, i.e.
Dh0 f (x) = (f̃ (x + h) − f̃ (x − h))/(2h).
It holds that
kDh0 f (x) − f 0 (x)k = k(f (x + h) + εf (x + h) − f (x − h) − εf (x − h))/(2h) − f 0 (x)k
  ≤ k(f (x + h) − f (x − h))/(2h) − f 0 (x)k + ε̄f /h
  = (1/12)(kf 000 (ξ1 )k + kf 000 (ξ2 )k)h2 + ε̄f /h = O(h2 + ε̄f /h).

For the estimates above one uses third order Taylor expansions and intermediate values ξ1 , ξ2 .
The minimizer h∗ of err0 (h) = h2 + ε̄f /h fulfills
err00 (h∗ ) = 2h∗ − ε̄f /(h∗ )2 = 0,
yielding
h∗ = (ε̄f /2)^{1/3} and err0 (h∗ ) = O(ε̄f^{2/3} ).
Thus the gradient error is of the order
εg = O(ε̄f^{2/3} ).
For the approximation of the Hessian we obtain
εH = O(εg^{2/3} ) = O(ε̄f^{4/9} ),
which is significantly better than in the case of forward differences.
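The two optimal step sizes are easy to observe in double precision, where the evaluation error level is of the order of the machine accuracy (the test function exp and the evaluation point are illustrative):

```python
import numpy as np

f = np.exp                       # f'(x) = exp(x), so the exact derivative is known
x = 1.0
eps_f = np.finfo(float).eps      # evaluation error level, about 2.2e-16

h_fwd = np.sqrt(eps_f)           # optimal step for forward differences
h_ctr = eps_f ** (1.0 / 3.0)     # optimal order of step for central differences

err_fwd = abs((f(x + h_fwd) - f(x)) / h_fwd - f(x))
err_ctr = abs((f(x + h_ctr) - f(x - h_ctr)) / (2.0 * h_ctr) - f(x))
```

Here err_fwd comes out around 1e-8 and err_ctr around 1e-10, in line with the orders derived above.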
Naturally, one can only expect convergence of an iterative scheme if εg → 0 during the
iteration. This is illustrated by the following result. Errors denoted by εf (·), εg (·) and εH (·)
have to be understood scalar-, vector- and matrix-valued, respectively. Inequalities like
εH (·) < ε are to be read elementwise.
Theorem 7.3. Let (A) be satisfied. Then there exist constants K̄ > 0, δ > 0 and ε > 0 such
that for xa ∈ B(δ) and εH (xa ) < ε it holds that
x+ = xa − (∇2 f (xa ) + εH (xa ))−1 (∇f (xa ) + εg (xa ))
is well-defined, i.e. ∇2 f (xa ) + εH (xa ) is nonsingular, and satisfies
kx+ − x∗ k ≤ K̄ (kxa − x∗ k2 + kεH (xa )k kxa − x∗ k + kεg (xa )k).

Proof. Let δ be chosen such that Lemma 7.1 and Theorem 7.1 hold true.
Define
−1
xN 2
+ = xa − (∇ f (xa )) ∇f (xa )
and note that
−1
x+ = xN 2
+ + ((∇ f (xa )) − (∇2 f (xa ) + H (xa ))−1 )∇f (xa )
− (∇2 f (xa ) + H (xa ))−1 g (xa ).
Lemma 7.1 and Theorem 7.1 imply
kx+ − x∗ k ≤ k xN ∗ 2
+ − x k + k((∇ f (xa ))
−1
− (∇2 f (xa ) + H (xa ))−1 )∇f (xa )
+ k(∇2 f (xa ) + H (xa ))−1 k · kg (xa )k
≤ Kkxa − x∗ k2 + k(∇2 f (xa ))−1 − (∇2 f (xa ) + H (xa ))−1 k · k∇f (xa )k
(7.2) + k(∇2 f (xa ) + H (xa ))−1 k · kg (xa )k
≤ Kkxa − x∗ k2 + 2k(∇2 f (xa ))−1 − (∇2 f (xa ) + H (xa ))−1 k · k∇2 f (x∗ )k·
−1
· kxN 2
+ − xa k + k(∇ f (xa ) + H (xa )) k · kg (xa )k

≤K̃kxa − x∗ k2 + 2 · k(∇2 f (xa ))−1 − (∇2 f (xa ) + H (xa ))−1 k · k∇2 f (x∗ )k
· kxa − x∗ k + k(∇2 f (xa ) + H (xa ))−1 k · kg (xa )k.
Prof. Dr. Michael Hintermüller 53

The last inequality holds due to
kxN+ − xa k ≤ kxN+ − x∗ k + kxa − x∗ k ≤ Kkxa − x∗ k2 + kxa − x∗ k.
For kεH (xa )k ≤ k(∇2 f (x∗ ))−1 k−1 /4, Lemma 7.1 yields
kεH (xa )k ≤ k(∇2 f (xa ))−1 k−1 /2.
Setting B = ∇2 f (xa ) + εH (xa ) and A = (∇2 f (xa ))−1 , one obtains:
(7.3) kI − BAk = kI − (∇2 f (xa ) + εH (xa ))(∇2 f (xa ))−1 k ≤ kεH (xa )k · k(∇2 f (xa ))−1 k ≤ 1/2.
The Banach Lemma implies that B = ∇2 f (xa ) + εH (xa ) is nonsingular and additionally:
k(∇2 f (xa ) + εH (xa ))−1 k ≤ k(∇2 f (xa ))−1 k / (1 − kεH (xa )(∇2 f (xa ))−1 k)
≤ k(∇2 f (xa ))−1 k / (1/2) = 2k(∇2 f (xa ))−1 k
(7.4) ≤ 4k(∇2 f (x∗ ))−1 k.
Thus we have by (7.4), (7.3) and Lemma 7.1:
k(∇2 f (xa ))−1 − (∇2 f (xa ) + εH (xa ))−1 k
≤ k(∇2 f (xa ) + εH (xa ))−1 k · kI − (∇2 f (xa ) + εH (xa ))(∇2 f (xa ))−1 k
≤ 4k(∇2 f (x∗ ))−1 k · kεH (xa )k · 2k(∇2 f (x∗ ))−1 k
= 8k(∇2 f (x∗ ))−1 k2 · kεH (xa )k.


Combining this with (7.2), we infer that
kx+ − x∗ k ≤ K̃ (kxa − x∗ k2 + 16k(∇2 f (x∗ ))−1 k2 k∇2 f (x∗ )k · kεH (xa )k · kxa − x∗ k) + 4k(∇2 f (x∗ ))−1 k · kεg (xa )k.

The interpretation of Theorem 7.3 is as follows: the error εg of the gradient evaluation
limits the attainable accuracy of Newton's method, and the error εH of the Hessian evaluation
reduces the rate of convergence. In addition, Theorem 7.3 gives a hint on how the individual
errors should behave in order to obtain superlinear convergence.
Now we want to discuss some variants of Newton’s method. The evaluation and factorization
of the Hessian of f can be very expensive. If x0 is sufficiently close to x∗ , then the following
iteration rule reduces this effort considerably:
(7.5) xk+1 = xk − (∇2 f (x0 ))−1 ∇f (xk ), k = 0, 1, . . . .
Here we have εH (xk ) = ∇2 f (x0 ) − ∇2 f (xk ) and
(7.6) kεH (xk )k = k∇2 f (x0 ) − ∇2 f (xk )k ≤ γkx0 − xk k ≤ γ(kx0 − x∗ k + kxk − x∗ k).
The convergence of method (7.5) follows from Theorem 7.3 with εg = 0 and εH = O(kx0 − x∗ k).

Theorem 7.4. Let (A) be satisfied. Then there exist K > 0 and δ > 0 such that for x0 ∈ B(δ)
it holds that the sequence {xk } generated by (7.5) converges q-linearly to x∗ and satisfies
kxk+1 − x∗ k ≤ Kkx0 − x∗ k kxk − x∗ k.
Proof. Let δ be chosen such that Theorem 7.3 holds true. Then (7.6) implies
kxk+1 − x∗ k ≤ K(kxk − x∗ k2 + γ(kx0 − x∗ k + kxk − x∗ k)kxk − x∗ k)
= K((1 + γ)kxk − x∗ k + γkx0 − x∗ k)kxk − x∗ k
≤ K(1 + 2γ)δkxk − x∗ k.
Decreasing δ until K(1 + 2γ)δ < 1 yields the assertion. 
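A one-dimensional illustration of iteration (7.5): for the hypothetical example f(x) = exp(x) − x (an assumption made for illustration, not taken from the text) with minimizer x∗ = 0, the Hessian f′′(x) = exp(x) is evaluated only once at x0 and then kept fixed:

```python
import math

grad = lambda x: math.exp(x) - 1.0   # f'(x) for f(x) = exp(x) - x
hess = lambda x: math.exp(x)         # f''(x)

x0 = 0.5
H0 = hess(x0)   # Hessian evaluated (and "factorized") once at x0
x = x0
for k in range(50):
    x = x - grad(x) / H0   # iteration (7.5)
print(x)  # converges q-linearly to x* = 0
```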
The Shamanskii method is a generalization of (7.5). In this variant, the Hessian is
updated only at every (m + 1)-st iteration by the Hessian at the current iterate. For m = 0 one
obtains Newton's method and for m = ∞ iteration (7.5). We state the following result.
Theorem 7.5. Let (A) be satisfied and m ≥ 0 be given. Then there exist constants K ≥ 0
and δ > 0 such that the Shamanskii method converges q-superlinearly to x∗ for all x0 ∈ B(δ).

Appropriate stopping criteria. We have already mentioned that the stopping criterion
(7.7) k∇f (xk )k ≤ τr k∇f (x0 )k + τa
with τr ∈ (0, 1) and 1 ≫ τa > 0 is adequate, whereas testing the difference between two
consecutive function values (as sole stopping criterion) is not reasonable. Consider e.g.
f (xk ) = − Σ_{j=1}^{k} j^{−1} .
Then it holds: f (xk ) → −∞ for k → ∞ and f (xk+1 ) − f (xk ) → 0.


Very often one is not only interested in a sufficiently small gradient norm, but also in the
proximity of the current iterates to a stationary point (or a local minimizer). It turns out
that when designing a corresponding stopping criterion we have to take special care of the
rate of convergence of the method. In view of Lemma 4.2 we state the following result about
the relation between relative errors (in x) and relative gradients.
Lemma 7.2. Let (A) be satisfied. Let δ > 0 be chosen such that Lemma 7.1 is satisfied for
all x ∈ B(δ). Then it holds for all x, x0 ∈ B(δ) that
kx − x∗ k / (4κ(∇2 f (x∗ )) kx0 − x∗ k) ≤ k∇f (x)k / k∇f (x0 )k ≤ 4κ(∇2 f (x∗ )) kx − x∗ k / kx0 − x∗ k.
Lemma 7.2 demonstrates that the relative error in x, i.e. kx − x∗ k/kx0 − x∗ k corresponds
–except for a constant factor–to the relative gradient i.e. k∇f (x)k/k∇f (x0 )k.
In case of Newton’s method (quadratically convergent) the length of the search direction d
can be considered as a sufficiently exact error estimator for the order of kx+ − x∗ k, since
(7.8) kxa − x∗ k = kdk + O(kxa − x∗ k2 ).
If one aspires to terminate the algorithm as soon as kx+ − x∗ k is of the same order as the
bound τs > 0, then it should be checked whether
kdk = O(√τs ).

Moreover, relation (7.8) yields
kx+ − x∗ k = O(kxa − x∗ k2 ) = O(τs ).
In case of superlinearly convergent methods it only holds that
kxa − x∗ k = kdk + o(kxa − x∗ k).
Here kxa − x∗ k ≤ τs in general only implies kx+ − x∗ k < τs .
In q-linearly convergent methods one should be careful when applying kdk ≤ τs . One has
(kxa − x∗ k − kdk)/kxa − x∗ k ≤ kxa − x∗ + dk/kxa − x∗ k = kx+ − x∗ k/kxa − x∗ k.
A termination due to the smallness of kdk is only allowed in case of very fast linearly convergent methods. Suppose we have a good estimate ρ for the Q-factor, i.e.
kx+ − x∗ k ≤ ρkxa − x∗ k;
then it follows that
(1 − ρ)kxa − x∗ k ≤ kxa − x∗ k − kx+ − x∗ k ≤ kxa − x+ k = kdk.
Thus it holds that
kx+ − x∗ k ≤ ρkxa − x∗ k ≤ (ρ/(1 − ρ))kdk.
In this case one applies the stopping criterion
kdk ≤ ((1 − ρ)/ρ) τs ,
yielding kx+ − x∗ k ≤ τs .
In fact one frequently implements more than one stopping criterion. It is conceivable to
combine (7.7) with an aforementioned reasonable criterion for the smallness of kdk k = kxk+1 −
xk k, i.e.
kxk+1 − xk k ≤ τx (1 + kxk k) =: τs with τx ∈ (0, 1).
Additionally, one can test the smallness of the difference of consecutive function values (but
this criterion should never be applied alone):
|f (xk+1 ) − f (xk )| ≤ τf (1 + |f (xk )|) with τf ∈ (0, 1).
A typical choice of τf is given by τf = τx2 .
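A sketch of such a combined test (the threshold values are illustrative defaults, not prescribed in the text):

```python
import numpy as np

def stop(grad_k, grad_0, x_k, x_prev, f_k, f_prev,
         tau_r=1e-6, tau_a=1e-8, tau_x=1e-8):
    """Combine the gradient test (7.7) with the step-size and
    function-decrease tests; tau_f = tau_x**2 as suggested above."""
    tau_f = tau_x ** 2
    small_grad = np.linalg.norm(grad_k) <= tau_r * np.linalg.norm(grad_0) + tau_a
    small_step = np.linalg.norm(x_k - x_prev) <= tau_x * (1.0 + np.linalg.norm(x_k))
    small_decr = abs(f_k - f_prev) <= tau_f * (1.0 + abs(f_k))
    # the function-value test is never applied alone; all three must hold
    return bool(small_grad and small_step and small_decr)

print(stop(np.zeros(2), np.ones(2), np.ones(2), np.ones(2), 1.0, 1.0))  # True
print(stop(np.ones(2), np.ones(2), np.ones(2), np.ones(2), 1.0, 1.0))   # False
```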

2. Nonlinear least-squares problems


A nonlinear least-squares problem is a minimization task with an objective function of the
form
f (x) = (1/2) Σ_{i=1}^{M} kri (x)k2 = (1/2) R(x)> R(x),
where the vector R = (r1 , r2 , . . . , rM )> is called the residuum. Problems of this kind typically
appear in data fitting (regression). Thereby M represents the number of observations (data)
and n is the number of parameters which have to be determined. The problem is called
overdetermined, if M > n, and underdetermined for M < n. If M = n, then the problem
reduces to solving a nonlinear equation.

If x∗ is a local minimizer of f and f (x∗ ) = 0, then the problem min f (x) is called a null-
residuum problem. In case f (x∗ ) is small, i.e. the data-fitting is good, then one refers to a
problem with small residuum.
Let R0 ∈ RM ×n be the Jacobian of R, then it holds that
∇f (x) = R0 (x)> R(x) ∈ Rn .
The necessary condition for a local minimizer x∗ is given by
(7.9) 0 = ∇f (x∗ ) = R0 (x∗ )> R(x∗ ).
For an underdetermined problem with rank (R0 (x∗ )) = M , (7.9) does imply R(x∗ ) = 0.
However for M > n this is not the case. The Hessian of f is given by
∇2 f (x) = R0 (x)> R0 (x) + Σ_{i=1}^{M} ri (x)∇2 ri (x).

Observe that for the computation of ∇2 f (x), the M Hessians ∇2 ri (x) have to be evaluated.
2.1. Gauss-Newton iteration. Let us assume that min f (x) is a null-residuum prob-
lem. Then it holds that
∇2 f (x∗ ) = R0 (x∗ )> R0 (x∗ ), as ri (x∗ ) = 0 ∀i.
This suggests to use R0 (x)> R0 (x) as an approximation of the Hessian of f , which converges
to ∇2 f (x∗ ) for x → x∗ . In case of small residuals ri (x∗ ), R0 (x)> R0 (x) typically represents a
good Hessian approximation at x near x∗ .
With the help of this Hessian approximation, we construct the following quadratic model:
ma (x) = f (xa ) + R(xa )> R0 (xa )(x − xa ) + (1/2)(x − xa )> R0 (xa )> R0 (xa )(x − xa ).
Assuming that R0 (xa )> R0 (xa ) has full rank, there exists a unique minimizer x+ of ma (x)
which satisfies
0 = R0 (xa )> R(xa ) + R0 (xa )> R0 (xa )(x+ − xa ).
In the following we will consider over- and underdetermined problems separately. In any case
we make the following assumption:
Assumption 7.1. The point x∗ is a local minimizer of min kR(x)k2 , R0 (x) is Lipschitz con-
tinuous at x∗ , and R0 (x∗ )> R0 (x∗ ) has full rank. The last assumption means that
• R0 (x∗ ) is nonsingular for M = n;
• R0 (x∗ ) has a full column rank for M > n;
• R0 (x∗ ) has a full row rank for M < n.
2.2. Overdetermined problems. The Gauss-Newton method is given by the iteration rule
xk+1 = xk − (R0 (xk )> R0 (xk ))−1 R0 (xk )> R(xk ), x0 ∈ Rn given.
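A minimal sketch of this iteration for a hypothetical one-parameter null-residuum fit yi = exp(x∗ ti ) (the model, data and starting guess are illustrative assumptions, not from the text):

```python
import numpy as np

t = np.array([1.0, 2.0, 3.0])
y = np.exp(0.5 * t)                 # exact data, so R(x*) = 0 at x* = 0.5

R  = lambda x: np.exp(x * t) - y             # residual vector
Rp = lambda x: (t * np.exp(x * t))[:, None]  # Jacobian R'(x), here 3x1

x = 0.3  # starting guess
for k in range(30):
    J = Rp(x)
    # Gauss-Newton step: solve R'(x)^T R'(x) d = -R'(x)^T R(x)
    d = np.linalg.solve(J.T @ J, -J.T @ R(x))
    x = x + d.item()
print(x)  # ~0.5
```

Since the residuum vanishes at the solution, the local rate is Q-quadratic, in accordance with Theorem 7.6 below.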
We have the following result.
Theorem 7.6. Let M > n and let Assumption 7.1 be satisfied. Then there exist K > 0 and δ > 0
such that the Gauss-Newton step
(7.10) x+ = xa − (R0 (xa )> R0 (xa ))−1 R0 (xa )> R(xa )

fulfills the following estimate for xa ∈ B(δ):


kx+ − x∗ k ≤ K(kxa − x∗ k2 + kR(x∗ )k kxa − x∗ k).
Proof. Let δ be chosen such that kx − x∗ k < δ implies rank(R0 (x)> R0 (x)) = n. Moreover
let γ be the Lipschitz constant of R0 near x∗ . From (7.10) we infer
x+ − x∗ = xa − x∗ − (R0 (xa )> R0 (xa ))−1 R0 (xa )> R(xa )
= (R0 (xa )> R0 (xa ))−1 R0 (xa )> (R0 (xa )(xa − x∗ ) − R(xa )).
It holds that
R0 (xa )(xa − x∗ ) − R(xa ) = R0 (xa )(xa − x∗ ) − R(x∗ ) + R(x∗ ) − R(xa )
as well as
kR(x∗ ) − R(xa ) − R0 (xa )(x∗ − xa )k
= kR(xa ) + ∫0^1 R0 (xa + τ (x∗ − xa ))(x∗ − xa ) dτ − R0 (xa )(x∗ − xa ) − R(xa )k
≤ ∫0^1 kR0 (xa + τ (x∗ − xa )) − R0 (xa )k dτ kx∗ − xa k ≤ (γ/2)kxa − x∗ k2 .
The first order necessary conditions yield R0 (x∗ )> R(x∗ ) = 0 and thus
−R0 (xa )> R(x∗ ) = (R0 (x∗ ) − R0 (xa ))> R(x∗ ).
This gives
kx+ − x∗ k ≤ k(R0 (xa )> R0 (xa ))−1 k · kR0 (xa )> (R(x∗ ) − [R(x∗ ) − R(xa ) − R0 (xa )(x∗ − xa )])k
≤ k(R0 (xa )> R0 (xa ))−1 k · [k(R0 (x∗ ) − R0 (xa ))> R(x∗ )k + (γ/2)kR0 (xa )k · kx∗ − xa k2 ]
(7.11) ≤ k(R0 (xa )> R0 (xa ))−1 k · γkx∗ − xa k · [kR(x∗ )k + (kR0 (xa )k/2)kx∗ − xa k].
The choice
K = γ max x∈B(δ) k(R0 (x)> R0 (x))−1 k · (1 + kR0 (x)k/2)
proves the assertion. 
Theorem 7.6 shows that the local rate of convergence is Q-quadratic in case of R(x∗ ) = 0. In
addition we observe that for R(x∗ ) 6= 0 not even linear convergence follows immediately. The
proof shows that linear convergence requires KkR(x∗ )k < 1.
A more subtle estimate in the proof of Theorem 7.6 can be obtained by using
R0 (xa )> R(x∗ ) = (R0 (x∗ ) + R00 (x∗ )(xa − x∗ ) + O(kxa − x∗ k2 ))> R(x∗ )
= (xa − x∗ )> R00 (x∗ )> R(x∗ ) + O(kxa − x∗ k2 ).
We have tacitly introduced the tensor R00 and applied R0 (x∗ )> R(x∗ ) = 0. We obtain the
estimate
k(R0 (x∗ ) − R0 (xa ))> R(x∗ )k ≤ k∇2 f (x∗ ) − R0 (x∗ )> R0 (x∗ )k kxa − x∗ k + O(kxa − x∗ k2 ).

Thus we have seen that the Gauss-Newton method converges even for problems with large
residuum, provided that R00 (x∗ ) is sufficiently small.
2.3. Underdetermined problems. At first we consider the following underdetermined
linear least-squares problem
min kAx − bk2 , A ∈ RM ×n , M < n.
It can be demonstrated that there is no unique minimizer, but a unique minimizer with
minimal norm. This special solution can be expressed with the help of the singular value
decomposition of A, which is given by
A = U ΣV >
with Σ = diag(σi ) ∈ RM ×n a diagonal matrix whose diagonal entries are called singular
values. It holds that σi ≥ 0 and σi = 0 for i > M . The columns of U ∈ RM ×M and V ∈ Rn×n
are called left and right singular vectors. The matrices U and V are orthogonal.
The solution with minimal norm is given by
x = A† b,
where A† = V Σ† U > , Σ† = diag(σi† ) and
σi† = σi−1 for σi 6= 0, and σi† = 0 for σi = 0.
The matrix A† is called the Moore-Penrose inverse of A. If A is a nonsingular square
matrix, then it holds that A† = A−1 . The singular value decomposition also exists for
M > n, and—if A has full column rank—one obtains A† = (A> A)−1 A> . In addition it holds
that A† A is a projection onto the image of A† and AA† is a projection onto the image of A,
i.e.
A† AA† = A† , (A† A)> = A† A and AA† A = A, (AA† )> = AA† .
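These identities can be verified numerically; `numpy.linalg.pinv` computes A† via the singular value decomposition (the random test matrix is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))   # underdetermined: M = 3 < n = 5
Ad = np.linalg.pinv(A)            # Moore-Penrose inverse via the SVD

assert np.allclose(Ad @ A @ Ad, Ad)
assert np.allclose(A @ Ad @ A, A)
assert np.allclose((Ad @ A).T, Ad @ A)  # projection onto the image of A^dagger
assert np.allclose((A @ Ad).T, A @ Ad)  # projection onto the image of A

# x = A^dagger b is the least-squares solution with minimal norm:
b = rng.standard_normal(3)
x = Ad @ b
z = x + (np.eye(5) - Ad @ A) @ rng.standard_normal(5)  # another solution
assert np.allclose(A @ z, A @ x)                       # same residual as x
assert np.linalg.norm(x) <= np.linalg.norm(z)          # but x has smaller norm
```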
The solution with minimal norm of
min (1/2)kR(xa ) + R0 (xa )(x − xa )k2
in case of underdetermined problems is
x+ = xa − R0 (xa )† R(xa ),
which corresponds to the Gauss-Newton iteration for the associated nonlinear least-squares
problem. In the linear case, i.e. R(x) = Ax − b, it follows
x+ = xa − A† (Axa − b) = (I − A† A)xa + A† b.
Let ea = xa − A† b and e+ = x+ − A† b; then A† AA† b = A† b implies
e+ = (I − A† A)ea .
This does not ensure that x+ = A† b, i.e. that x+ is the solution with minimal norm, but it does imply that
x+ solves the problem and that the method terminates after one step. Let Z = {x : R(x) = 0}.
Theorem 7.7. Let M < n and Assumption 7.1 be fulfilled at z ∗ ∈ Z. Then there exists
δ > 0 such that the Gauss-Newton iteration
xk+1 = xk − R0 (xk )† R(xk )
is well-defined for kx0 − z ∗ k ≤ δ and converges R-quadratically to z ∗ ∈ Z.

3. Inexact Newton methods


Inexact Newton methods use an approximate Newton step d̃ which satisfies
(7.12) k∇2 f (xa )d̃ + ∇f (xa )k ≤ ηa k∇f (xa )k.
We refer to d̃ as an inexact Newton step. In our context we consider Newton methods with
iterative solvers for
(7.13) ∇2 f (xa )d̃ = −∇f (xa ).
In particular we know that ∇2 f (xa ) is positive definite for xa near x∗ . Therefore the CG
method of chapter 1 is appropriate for the iterative solution of (7.13). The resulting overall
algorithm is called the Newton-CG method.
Theorem 7.8. Let (A) be fulfilled. Then there exist constants KI ≥ 0, δ > 0 such that for
xa ∈ B(δ) with d̃ from (7.12) and x+ = xa + d̃ it holds that
kx+ − x∗ k ≤ KI (kxa − x∗ k + ηa )kxa − x∗ k.
Proof. Let δ be chosen such that Lemma 7.1 and Theorem 7.1 hold true. Let r =
−∇2 f (xa )d̃ − ∇f (xa ). Then one obtains
d̃ + (∇2 f (xa ))−1 ∇f (xa ) = −(∇2 f (xa ))−1 r
as well as the equation
(7.14) x+ − x∗ = xa − x∗ + d̃ = xa − x∗ − (∇2 f (xa ))−1 ∇f (xa ) − (∇2 f (xa ))−1 r.
From (7.12) and Lemma 7.1 it follows that
kd̃ + (∇2 f (xa ))−1 ∇f (xa )k = k(∇2 f (xa ))−1 (∇2 f (xa )d̃ + ∇f (xa ))k
≤ k(∇2 f (xa ))−1 k · k∇2 f (xa )d̃ + ∇f (xa )k
≤ k(∇2 f (xa ))−1 k · ηa k∇f (xa )k
≤ 4k(∇2 f (x∗ ))−1 k · k∇2 f (x∗ )k ηa kxa − x∗ k = 4κ(∇2 f (x∗ ))ηa kxa − x∗ k.
Theorem 7.1 and (7.14) yield
kx+ − x∗ k ≤ kxa − x∗ − (∇2 f (xa ))−1 ∇f (xa )k + kd̃ + (∇2 f (xa ))−1 ∇f (xa )k
≤ Kkxa − x∗ k2 + 4κ(∇2 f (x∗ ))ηa kxa − x∗ k.
Setting
KI = K + 4κ(∇2 f (x∗ ))
proves the assertion. 
Theorem 7.8 also contains a rule on how to control ηa in order to achieve fast convergence.
Theorem 7.9. Let (A) be satisfied. Then there exist δ > 0 and η̄ > 0 such that the inexact
Newton iteration xk+1 = xk + d˜k with
k∇2 f (xk )d˜k + ∇f (xk )k ≤ ηk k∇f (xk )k
converges Q-linearly to x∗ for x0 ∈ B(δ) and {ηk } ⊂ [0, η̄]. Furthermore it holds that
• if ηk → 0, then the rate of convergence is Q-superlinear;

• if ηk ≤ Kη k∇f (xk )kp for Kη > 0, then the rate of convergence is Q-superlinear with
Q-order 1 + p.
3.1. Implementation of the Newton-CG method. As already mentioned, in the
Newton-CG method, the Newton-direction
∇2 f (xk )dk = −∇f (xk )
is determined with the help of the CG method. In addition we assume that Dh2 f (x; d) is a
sufficiently exact and readily available approximation of the Hessian-vector product ∇2 f (x)d. The
quantity h can be interpreted for example as the step size of a difference approximation of
the second derivative of f in the direction d. We now specify a variant of the preconditioned
CG method, which terminates with an error message if ∇2 f (x) is singular (w.r.t. d), i.e.
d> ∇2 f (x)d = 0, or if d turns out to be a direction of negative curvature, i.e. d> ∇2 f (x)d < 0.
Later we will see that the case of a negative curvature can also lead to meaningful search
directions.

Algorithm 7.1.
input: W ∈ S n positive definite, η ∈ R+0 , x ∈ Rn .
begin
set d0 := 0, r0 := ∇f (x), p0 := −W −1 r0 , l := 0.
while krl k > ηk∇f (x)k
begin
wl := Dh2 f (x; pl )
if (pl )> wl = 0 then RETURN(“indefiniteness”)
if (pl )> wl < 0 then RETURN(“negative curvature”)
set
αl := (rl )> W −1 rl / (pl )> wl
dl+1 := dl + αl pl
rl+1 := rl + αl wl
βl+1 := (rl+1 )> W −1 rl+1 / (rl )> W −1 rl
pl+1 := −W −1 rl+1 + βl+1 pl
l := l + 1
end
end
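A Python sketch of Algorithm 7.1, in which an exact Hessian-vector product stands in for Dh2 f (x; pl ); the quadratic test problems are illustrative assumptions:

```python
import numpy as np

def tr_cg(hessvec, g, W, eta, maxit=200):
    """Preconditioned CG loop of Algorithm 7.1; g = grad f(x), and
    hessvec(p) plays the role of Dh2 f(x; p). Returns the step d,
    or a string flagging zero or negative curvature."""
    d = np.zeros_like(g)
    r = g.copy()
    Winv_r = np.linalg.solve(W, r)
    p = -Winv_r
    for _ in range(maxit):
        if np.linalg.norm(r) <= eta * np.linalg.norm(g):
            break
        w = hessvec(p)
        curv = p @ w
        if curv == 0.0:
            return "indefiniteness"
        if curv < 0.0:
            return "negative curvature"
        alpha = (r @ Winv_r) / curv
        d = d + alpha * p
        r_new = r + alpha * w
        Winv_rnew = np.linalg.solve(W, r_new)
        beta = (r_new @ Winv_rnew) / (r @ Winv_r)
        p = -Winv_rnew + beta * p
        r, Winv_r = r_new, Winv_rnew
    return d

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # positive definite Hessian
g = np.array([1.0, 2.0])
d = tr_cg(lambda p: A @ p, g, np.eye(2), eta=1e-12)
print(np.allclose(A @ d, -g))            # d solves the Newton system

B = np.diag([1.0, -1.0])                 # indefinite Hessian
print(tr_cg(lambda p: B @ p, g, np.eye(2), eta=1e-12))
```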

In the implementation of the Newton-CG Algorithm, the preconditioner W and the error
bound η from Theorem 7.9 will be adjusted at each Newton iteration. This idea is imple-
mented in the following Newton-CG method:

Algorithm 7.2 (Newton-CG method).
input: x0 ∈ Rn .
begin
r0 := k∇f (x0 )k, k := 0.
while k∇f (xk )k > τr r0 + τa
begin
choose ηk , W k ∈ S n positive definite
calculate dk with Algorithm 7.1 with input W k , ηk , xk
if “indefiniteness” then STOP with error message
set xk+1 := xk + dk
if f (xk+1 ) ≥ f (xk ) then STOP with error message
k := k + 1
end
end

Remark 7.1. (1) In Algorithm 7.2 we have made use of a rather simple stopping cri-
terion. Naturally, combined criteria (see the discussion in section 1 of this chapter)
can be applied as well.
(2) The algorithm terminates whenever “indefiniteness” occurs in the CG iteration. Pos-
sible modifications in case that dk is a direction of negative curvature will be discussed
later on.
(3) We also stop the Newton iteration as soon as the full step xk+1 = xk + dk does not
contribute to a decrease in f .

4. Global convergence
Regarding Newton methods we have only considered local convergence (results) so far. Until
now, we have always assumed that x0 is sufficiently close to a local solution x∗ . Now we will
introduce globalization approaches which allow for a relaxation of the choice of the starting
point.
If one ensures that ∇2 f (xk ) or a corresponding Hessian approximation satisfies
c1 kdk2 ≤ d> ∇2 f (xk )d ≤ c2 kdk2 ∀k ∈ N ∀d ∈ Rn ,
(0 < c1 ≤ c2 ), then dk defined as the solution of ∇2 f (xk )d = −∇f (xk ) is a gradient-related
search direction. Inserting this into the general descent method 3.1 yields (cf. chapter 5.2)
the global convergence of the global Newton method which means that the method converges
to a stationary point regardless of the choice of the starting point x0 . Furthermore it can be
shown that in the vicinity of x∗ , αk = 1 will be accepted as the step size if one initiates the
step size algorithm with α(0) = 1.
Here, we focus on another way to globalize Newton's method. Given a current approximation
of a solution, the next iterate is confined to a sufficiently small neighborhood of the current
iterate; this strategy is called the trust region method (or trust region globalization).

4.1. Trust-Region method. A major drawback of the general descent method 3.1 with
one of the step size strategies of chapter 3.2 is the necessity of ensuring that {∇2 f (xk )} ⊂ S n
is positive definite. Trust region methods deal with this problem in a suitable way and
solve it algorithmically. Roughly speaking, these methods realize a smooth transition from
the method of steepest descent to Newton’s method. In this way the global convergence
property of steepest descent is combined with the fast local convergence of Newton’s method
(Theorem 7.2).

The idea can be described as follows: Let ma (x) be a quadratic model of f in a neighborhood
of xa , which is given by
ma (x) = f (xa ) + ∇f (xa )> (x − xa ) + (1/2)(x − xa )> ∇2 f (xa )(x − xa ),
and let ∆ be the radius of a ball about xa where we “trust” the model ma to represent f
well. The quantity ∆ is called the trust region radius, and one refers to
T (∆) = {x : kx − xa k ≤ ∆}
as the trust region.
Given xa , the next iterate x+ is chosen as an approximate minimizer of ma in T (∆). The
associated trust region subproblem is defined as
(7.15) min ma (xa + d) s.t. kdk ≤ ∆.
We denote the solution of (7.15) by dv (trial step) and the associated trial solution as xv =
xa + dv . Then we have to decide whether the step is acceptable or whether the trust region
radius needs to be changed. Usually, both options are checked simultaneously. For the former
one verifies whether the quadratic model is a good approximation of f in T (∆). For this
purpose we define
ared = f (xa ) − f (xv ) (actual reduction)
and also
pred = ma (xa ) − ma (xv ) (predicted reduction).
Note that (with Ha = ∇2 f (xa ))
pred = ma (xa ) − ma (xv ) = −∇f (xa )> (xv − xa ) − (1/2)(xv − xa )> Ha (xv − xa )
= −∇f (xa )> dv − (1/2)dv> Ha dv .
In the following algorithm we need the parameters
0 < µ0 ≤ µ ≤ µ̄
to decide whether we reject the trial step (ared/pred < µ0 ) and/or reduce ∆ (ared/pred < µ),
whether we increase ∆ (ared/pred > µ̄) or leave the trust region radius unchanged. The
reduction resp. the increase of ∆ are realized by multiplication by ω < 1 resp. ω̄ > 1. Further
let C > 1 be fixed.

Algorithm 7.3.
input: xa ∈ Rn , xv ∈ Rn , ∆ ∈ R+ .
begin
z 0 := xa , zv0 := xv , ∆̂(0) := ∆, l := 0.
while z l = xa
begin
ared(l) := f (xa ) − f (zvl ), dlv := zvl − xa , pred(l) := −∇f (xa )> dlv − (1/2)(dlv )> Ha dlv
if ared(l) /pred(l) < µ0 then
z l+1 := xa , ∆̂(l+1) := ω ∆̂(l)
if l ≥ 1 and ∆̂(l) > ∆̂(l−1) then
z l+1 := zvl−1 , ∆̂(l+1) := ∆̂(l−1)
else
compute the solution dl+1v of the trust region subproblem with radius ∆̂(l+1)
zvl+1 := xa + dl+1v
end
elseif µ0 ≤ ared(l) /pred(l) ≤ µ then
z l+1 := zvl , ∆̂(l+1) := ω ∆̂(l)
elseif µ ≤ ared(l) /pred(l) ≤ µ̄ then
z l+1 := zvl
elseif µ̄ ≤ ared(l) /pred(l) then
if kdlv k = ∆̂(l) ≤ Ck∇f (xa )k then
z l+1 := xa , ∆̂(l+1) := ω̄ ∆̂(l)
compute the solution dl+1v of the trust region subproblem with radius ∆̂(l+1)
zvl+1 := xa + dl+1v
else
z l+1 := zvl
end
end
l := l + 1
end
x+ := z l , ∆+ := ∆̂(l)
end

In Algorithm 7.3 we require
∆̂(l) ≤ Ck∇f (xa )k,
which bounds the trust region radius from above. The while-loop in Algorithm 7.3 is comparable
to the loops of the step size algorithms and should terminate after finitely many
iterations. Algorithm 7.3 now fits into a general trust region paradigm.

Algorithm 7.4 (Trust region framework).
input: x0 ∈ Rn , ∆0 ∈ R+ .
begin
k := 0, r0 := k∇f (x0 )k
while k∇f (xk )k > τr r0 + τa
begin
compute an approximation H k of the Hessian ∇2 f (xk )
compute dkv as the solution of
min f (xk ) + ∇f (xk )> d + (1/2)d> H k d s.t. kdk ≤ ∆k
compute (xk+1 , ∆k+1 ) by Algorithm 7.3 with input xk , xkv := xk + dkv , ∆k
k := k + 1
end
end
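The following sketch instantiates the framework for a hypothetical convex test function; the subproblem is minimized only along the steepest descent direction, and Algorithm 7.3 is replaced by simplified acceptance/radius rules, so all constants and the test function are illustrative assumptions:

```python
import numpy as np

f    = lambda x: 0.5 * x @ x + 0.25 * (x @ x) ** 2   # minimizer x* = 0
grad = lambda x: (1.0 + x @ x) * x
hess = lambda x: (1.0 + x @ x) * np.eye(2) + 2.0 * np.outer(x, x)

x, Delta = np.array([2.0, -1.0]), 1.0
for k in range(100):
    g, H = grad(x), hess(x)
    if np.linalg.norm(g) <= 1e-10:
        break
    # minimize the quadratic model along -g inside the trust region
    gHg = g @ H @ g
    alpha = Delta / np.linalg.norm(g)
    if gHg > 0.0:
        alpha = min(alpha, (g @ g) / gHg)
    d = -alpha * g
    ared = f(x) - f(x + d)               # actual reduction
    pred = -g @ d - 0.5 * d @ H @ d      # predicted reduction
    if ared / pred < 0.25:               # poor model fit: reject and shrink
        Delta *= 0.5
    else:
        x = x + d                        # accept the trial step
        if ared / pred > 0.75:           # very good model fit: enlarge
            Delta *= 2.0

print(np.linalg.norm(grad(x)))  # ~0: a stationary point has been reached
```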

4.2. Global convergence of the trust region algorithm. Theoretically, the trust
region subproblem can be solved exactly. It turns out that even a relatively inaccurate solution
of the trust region subproblem suffices to prove global and locally superlinear convergence.
For the proof we invoke the following assumption.
Assumption 7.2.
(1) There exists σ > 0 such that
(7.16) pred = ma (xa ) − ma (xv ) ≥ σk∇f (xa )k min{kdv k, k∇f (xa )k}.
(2) There exists M > 0 such that
kdv k ≥ k∇f (xa )k/M or kdv k = ∆a .
We obtain the following global convergence result.
Theorem 7.10. Let ∇f be Lipschitz continuous with modulus L. Let {xk } be the sequence
generated by Algorithm 7.4, and further assume that the solutions of the trust region subproblems
fulfill Assumption 7.2. Moreover suppose that the matrices {H k } are bounded. Then
either f is unbounded from below, or ∇f (xk ) = 0 for some finite k, or
lim k→∞ k∇f (xk )k = 0.

Proof. Assume that ∇f (xk ) 6= 0 ∀k and f is bounded from below; otherwise the asser-
tion is immediate. We show that in case the step is accepted (and, hence, the radius is not
further enlarged), there exists MT ∈ (0, 1) such that
(7.17) kdkv k ≥ MT k∇f (xk )k.
Assume for the moment that (7.17) holds true. Since dkv is accepted, Algorithm 7.3 and
Assumption 7.2 yield
aredk ≥ µ0 predk ≥ µ0 σk∇f (xk )k min{kdkv k, k∇f (xk )k}.
Applying (7.17), we obtain
(7.18) aredk ≥ µ0 σk∇f (xk )k2 min{MT , 1} = µ0 σMT k∇f (xk )k2 .
Since {f (xk )} is monotonically decreasing and f is bounded from below, it follows that
limk→∞ aredk = 0. Thus, (7.18) implies limk→∞ k∇f (xk )k = 0.
It remains to prove (7.17). First note that for kdkv k < ∆k , Assumption 7.2(2) yields
kdkv k ≥ k∇f (xk )k/M .
The case
(7.19) kdkv k = ∆k and kdkv k < k∇f (xk )k
remains. In fact, if (7.19) does not hold true, then (7.17) follows with MT = min{1, 1/M }.
Provided that (7.19) is satisfied and dkv is accepted, we show that
(7.20) kdkv k = ∆k ≥ (2σ min{1 − µ̄, (1 − µ0 )ω̄ −2 }/(M + L)) k∇f (xk )k.
Then the assertion follows with
MT = min{1, 1/M , 2σ min{1 − µ̄, (1 − µ0 )ω̄ −2 }/(M + L)}.

Let M of Assumption 7.2 be chosen sufficiently large such that
(7.21) kH k k ≤ M ∀k ∈ N.
We prove (7.20) by showing that the trust region radius is enlarged and the step associated
with the larger radius is accepted if (7.19) is fulfilled and (7.20) does not hold true. For this
purpose, let dkv be a trial step such that kdkv k < k∇f (xk )k and
(7.22) kdkv k < (2σ min{1 − µ̄, (1 − µ0 )ω̄ −2 }/(M + L)) k∇f (xk )k.
The Lipschitz continuity of ∇f and (7.21) yield
aredk = f (xk ) − f (xk + dkv ) = −∇f (xk )> dkv − ∫0^1 (∇f (xk + τ dkv ) − ∇f (xk ))> dkv dτ
= −∇f (xk )> dkv − (1/2)(dkv )> H k dkv + (1/2)(dkv )> H k dkv − ∫0^1 (∇f (xk + τ dkv ) − ∇f (xk ))> dkv dτ
= predk + (1/2)(dkv )> H k dkv − ∫0^1 (∇f (xk + τ dkv ) − ∇f (xk ))> dkv dτ
≥ predk − (M /2)kdkv k2 − Lkdkv k2 ∫0^1 τ dτ
= predk − (1/2)(M + L)kdkv k2 .
Assumption 7.2(1) implies
(7.23) aredk /predk ≥ 1 − (M + L)kdkv k2 /(2 predk ) ≥ 1 − (M + L)kdkv k2 /(2σk∇f (xk )k min{kdkv k, k∇f (xk )k}).
As kdkv k < k∇f (xk )k due to (7.19), it holds that
min{kdkv k, k∇f (xk )k} = kdkv k,
and thus we obtain, cf. (7.22),
aredk /predk ≥ 1 − (M + L)kdkv k/(2σk∇f (xk )k) > 1 − min{1 − µ̄, (1 − µ0 )ω̄ −2 } ≥ µ̄.
Thus, an enlargement step is carried out by setting ∆+k = ω̄∆k and replacing dkv by dk,+v ,
the minimizer of the quadratic model with radius ∆+k . Then, (7.23) is still fulfilled and it
follows that
kdk,+v k ≤ ω̄kdkv k < ω̄k∇f (xk )k.
Consequently,
min{k∇f (xk )k, kdk,+v k} > kdk,+v k/ω̄ (since ω̄ > 1).
Thus,
ared+k /pred+k ≥ 1 − (M + L)kdk,+v k2 /(2σk∇f (xk )k min{k∇f (xk )k, kdk,+v k})
≥ 1 − (M + L)ω̄kdk,+v k/(2σk∇f (xk )k) ≥ 1 − (M + L)ω̄2 kdkv k/(2σk∇f (xk )k) ≥ µ0

owing to (7.22). Hence, the enlargement of the radius produces an acceptable step which
would be taken instead of dkv . Thus, (7.20) has to hold true. 

Next we study the computation of the trial step dv resp. the trial points xv = xa + dv . It
suffices to compute approximate solutions of the trust region subproblem (7.15) such that
Assumption 7.2 is satisfied. To this end, a simple idea is based on fixing the direction to the
steepest descent direction under the trust region constraint. Let xa be the current iterate
and ∆a be the current trust region radius. Then the trial point xv := xv (αa ) is defined via the
minimizer αa of
min α≥0 Ψa (α) := ma (xa − α∇f (xa )) s.t. xv (α) := xa − α∇f (xa ) ∈ T (∆a ).

It holds that
Ψa (α) = ma (xa − α∇f (xa )) = f (xa ) − αk∇f (xa )k2 + (α2 /2)∇f (xa )> Ha ∇f (xa ),
Ψ′a (α) = −k∇f (xa )k2 + α∇f (xa )> Ha ∇f (xa ).
For determining αa we have to distinguish between the following cases:
(1) ∇f (xa )> Ha ∇f (xa ) ≤ 0. Obviously the trust region constraint becomes active. It
holds that
kxv (αa ) − xa k = αa k∇f (xa )k = ∆a ,
which yields αa = ∆a /k∇f (xa )k.
(2) ∇f (xa )> Ha ∇f (xa ) > 0. In this case, we have
Ψ′a (α̂a ) = 0 ⇒ α̂a = k∇f (xa )k2 /(∇f (xa )> Ha ∇f (xa )).
If kxv (α̂a ) − xa k ≤ ∆a is fulfilled, then we accept α̂a as αa ; otherwise the trust region
constraint becomes active and analogously to (1) it follows that αa = ∆a /k∇f (xa )k.
To summarize, we have
(7.24) αa := ∆a /k∇f (xa )k, if ∇f (xa )> Ha ∇f (xa ) ≤ 0;
αa := min{∆a /k∇f (xa )k, k∇f (xa )k2 /(∇f (xa )> Ha ∇f (xa ))}, if ∇f (xa )> Ha ∇f (xa ) > 0.

The minimizer of the quadratic model ma in the direction of the negative gradient is called the
Cauchy point and will be denoted by xCPa .1 The Cauchy point has the following properties,
which will prove to be useful for the global convergence result.
(1) ∇f (xa )> Ha ∇f (xa ) ≤ 0. Then it follows that
f (xa ) − ma (xCPa ) = αa k∇f (xa )k2 − (αa2 /2)∇f (xa )> Ha ∇f (xa )
≥ ∆a k∇f (xa )k = kdv k · k∇f (xa )k.

(2) ∇f (xa )> Ha ∇f (xa ) > 0. Depending on where the minimum in (7.24) is attained, we
have the following situations:

1 When using the iteration index k, the Cauchy point will be written as xkCP .

(i) αa = ∆a /k∇f (xa )k. Then it holds that
αa ≤ α̂a = k∇f (xa )k2 /(∇f (xa )> Ha ∇f (xa ))
and thus
αa ∇f (xa )> Ha ∇f (xa ) ≤ k∇f (xa )k2 .
This implies
f (xa ) − ma (xCPa ) = ∆a k∇f (xa )k − (αa2 /2)∇f (xa )> Ha ∇f (xa )
≥ ∆a k∇f (xa )k − (αa /2)k∇f (xa )k2
= ∆a k∇f (xa )k − (∆a /2)k∇f (xa )k = (∆a /2)k∇f (xa )k
= (1/2)kdv k · k∇f (xa )k.
(ii) The second case occurs if αa = α̂a = k∇f (xa )k2 /(∇f (xa )> Ha ∇f (xa )). Then we have
f (xa ) − ma (xCPa ) = k∇f (xa )k4 /(∇f (xa )> Ha ∇f (xa )) − (1/2)k∇f (xa )k4 /(∇f (xa )> Ha ∇f (xa ))
= (k∇f (xa )k/2) · k∇f (xa )k3 /(∇f (xa )> Ha ∇f (xa )) = (k∇f (xa )k/2) α̂a k∇f (xa )k
= (k∇f (xa )k/2) · kdv k.
Cases (1) and (2) yield
(7.25) f (xa ) − ma (xCPa ) ≥ (1/2)k∇f (xa )k min{kdv k, k∇f (xa )k}.
Therefore the Cauchy point xCPa fulfills condition (1) of Assumption 7.2. The second condition
of Assumption 7.2 is satisfied too, because in case of kdv k < ∆a the definition of xCPa implies
dv = xv (αa ) − xa = xv (α̂a ) − xa = −α̂a ∇f (xa ) = −(k∇f (xa )k2 /(∇f (xa )> Ha ∇f (xa ))) ∇f (xa ).
If kHa k ≤ M , then
kdv k ≥ k∇f (xa )k/M .
The second case, i.e. kdv k = ∆a , is trivial. Therefore the global convergence result of
Theorem 7.10 immediately yields the next result.
Theorem 7.11. Let ∇f be Lipschitz continuous with modulus L. Let {xk } be generated by
Algorithm 7.4 with xkv = xkCP and (7.24). Furthermore let the sequence of matrices {H k } be
bounded. Then either f is unbounded from below, or ∇f (xk ) = 0 for some finite k, or
lim k→∞ ∇f (xk ) = 0.

Remark 7.2. Weakening the assumptions of Theorem 7.11, one can still show that
lim inf k→∞ k∇f (xk )k = 0.
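The Cauchy decrease (7.25) can also be checked numerically. The sketch below implements the step size (7.24) and verifies the bound for random symmetric (possibly indefinite) matrices Ha; the random instances are illustrative assumptions:

```python
import numpy as np

def cauchy_alpha(g, H, Delta):
    """Step size alpha_a from (7.24); g = grad f(x_a), H = H_a."""
    gHg = g @ H @ g
    a = Delta / np.linalg.norm(g)
    if gHg > 0.0:
        a = min(a, (g @ g) / gHg)
    return a

rng = np.random.default_rng(1)
for trial in range(100):
    g = rng.standard_normal(4)
    M = rng.standard_normal((4, 4))
    H = 0.5 * (M + M.T)                  # symmetric, possibly indefinite
    Delta = rng.uniform(0.1, 2.0)
    a = cauchy_alpha(g, H, Delta)
    d = -a * g                           # Cauchy step d_v
    pred = a * (g @ g) - 0.5 * a * a * (g @ H @ g)  # f(x_a) - m_a(x_a^CP)
    gn = np.linalg.norm(g)
    # inequality (7.25) with sigma = 1/2 (small slack for rounding):
    assert pred >= 0.5 * gn * min(np.linalg.norm(d), gn) - 1e-12
print("Cauchy decrease (7.25) verified on 100 random instances")
```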

4.2.1. Superlinear convergence. The idea of always fixing the direction −∇f (xa ) for the
approximate solution of the trust region subproblem and determining the step size αa such
that xv (αa ) ∈ T (∆a ) often leads to a very slow (only Q-linear) rate of convergence
(comparable to the method of steepest descent). For this reason, we discuss a technique
which locally realizes the transition to the Newton direction. For this purpose, at xa we
define the Newton point
xNa = xa − Ha−1 ∇f (xa ).
If Ha ∈ S n is positive definite, then xNa is the global minimizer of the usual quadratic model
of f at xa . In case Ha possesses directions of negative curvature, the quadratic
model has no finite minimizer. The Newton point, however, remains meaningful.
Now we consider a special approximate solution of the trust region subproblem, which finally
yields a locally superlinearly convergent method. This is achieved by minimizing ma along a
piecewise linear path P ⊂ T (∆), which is called the dogleg path. Its classical variant makes
use of three points: xa , xNa and x̂CPa , the global minimizer of the quadratic model in the
direction of steepest descent. The point x̂CPa only exists if ∇f (xa )> Ha ∇f (xa ) > 0 is
satisfied. Whenever x̂CPa exists and fulfills
(7.26) (xNa − x̂CPa )> (x̂CPa − xa ) ≥ 0,
then we define xNa as the last node on the path. If Ha ∈ S n is positive definite, then it can
be shown that (7.26) is fulfilled. In case (7.26) is violated, xNa is not used.
examination of (7.26) yields
CP > CP CP > CP
0 ≤(xN N
a − x̂a ) (x̂a − xa ) = (xa − xa + xa − x̂a ) (x̂a − xa )
> CP
=(xN CP 2
a − xa ) (x̂a − xa ) − kx̂a − xa k .

Assuming that x̂CP


a 6= xa , this immediately implies
> CP > N
0 < kx̂CP 2 N
a − xa k ≤ (xa − xa ) (x̂a − xa ) = −α̂a ∇f (xa ) (xa − xa ).

Since α̂a > 0, we obtain


∇f (xa )> (xN
a − xa ) > 0.
The classical trial solution xD (∆a ) is computed according to
(D) xD (∆a ) = xCPa , if kxa − xCPa k = ∆a or x̂CPa exists and (7.26) is not fulfilled;
xD (∆a ) = xNa , if kxa − xCPa k < kxNa − xa k ≤ ∆a and (7.26) holds true;
xD (∆a ) = y D (∆a ), else.
Here, y D (∆a ) is the uniquely determined point between xCPa and xNa which fulfills
ky D (∆a ) − xa k = ∆a .
Typical properties of the "dogleg" method are the following:
• There do not exist two points on $P$ which have the same distance to $x_a$. Thus $P$
  can be parameterized by $x_a(s)$ with $s = \|x_a(s) - x_a\|$.
• $m_a(s)$ is a monotonically decreasing function of $s$.
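The computation of the classical dogleg trial point can be sketched in code. The
following is a minimal sketch (not part of the lecture notes) assuming $H_a$ is symmetric
positive definite, so that (7.26) holds and the path from $x_a$ via the Cauchy-type point
to $x_a^N$ is valid; all names are illustrative.

```python
import numpy as np

def dogleg_step(g, H, delta):
    """Dogleg trial step for min g^T d + 0.5 d^T H d subject to ||d|| <= delta.
    Assumes H is symmetric positive definite (so condition (7.26) holds)."""
    d_n = -np.linalg.solve(H, g)                 # Newton step d_a^N
    if np.linalg.norm(d_n) <= delta:
        return d_n                               # Newton point inside the region
    gHg = g @ H @ g                              # > 0 by positive definiteness
    d_cp = -(g @ g / gHg) * g                    # unconstrained minimizer along -g
    if np.linalg.norm(d_cp) >= delta:
        return -(delta / np.linalg.norm(g)) * g  # truncated steepest descent
    # unique lam with ||(1-lam) d_cp + lam d_n|| = delta (cf. Lemma 7.3)
    p = d_n - d_cp
    a, b, c = p @ p, 2.0 * (d_cp @ p), d_cp @ d_cp - delta ** 2
    lam = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return d_cp + lam * p

H = np.array([[2.0, 0.0], [0.0, 10.0]])
g = np.array([1.0, 1.0])
d = dogleg_step(g, H, delta=0.3)
```

In this example the Newton step is too long and the Cauchy-type step is too short, so the
returned point lies on the interpolated segment and sits exactly on the boundary of the
trust region.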
Lemma 7.3. Let $x_a$, $H_a$ and $\Delta_a$ be given, where $H_a$ is nonsingular,
\[ d_a^N = -H_a^{-1}\nabla f(x_a) \quad\text{and}\quad x_a^N = x_a + d_a^N. \]
We assume that $\nabla f(x_a)^\top H_a \nabla f(x_a) > 0$ and
\[ \hat{d}_a^{CP} = \hat{x}_a^{CP} - x_a
   = -\frac{\|\nabla f(x_a)\|^2}{\nabla f(x_a)^\top H_a \nabla f(x_a)}\,\nabla f(x_a)
   \neq d_a^N. \]
Let $P$ be the piecewise linear path from $x_a$ via $\hat{x}_a^{CP}$ to $x_a^N$. If
\[ (7.27)\qquad (d_a^N - \hat{d}_a^{CP})^\top \hat{d}_a^{CP} \ge 0, \]
then, for arbitrary $\delta \le \|d_a^N\|$, there exists a unique point
$x(\delta) \in P$ such that $\|x(\delta) - x_a\| = \delta$.

Proof. On the segment from $x_a$ to $\hat{x}_a^{CP}$ the assertion holds trivially.
Consequently, only the segment from $\hat{x}_a^{CP}$ to $x_a^N$ remains to be discussed.
We have to show that
\[ \phi(\lambda) = \tfrac{1}{2}\|(1-\lambda)\hat{d}_a^{CP} + \lambda d_a^N\|^2 \]
increases strictly monotonically for $\lambda \in (0,1)$. We start by assuming that
(7.27) holds with strict inequality. Then we have
\[ \|d_a^N\| \cdot \|\hat{d}_a^{CP}\| \ge (d_a^N)^\top \hat{d}_a^{CP}
   \overset{(7.27)}{>} \|\hat{d}_a^{CP}\|^2 \]
and therefore $\|d_a^N\| > \|\hat{d}_a^{CP}\|$. Thus, we obtain with (7.27)
\begin{align*}
\phi'(\lambda) &= (d_a^N - \hat{d}_a^{CP})^\top
   \big((1-\lambda)\hat{d}_a^{CP} + \lambda d_a^N\big)
 = -(1-\lambda)\|\hat{d}_a^{CP}\|^2 + (1-\lambda)(d_a^N)^\top \hat{d}_a^{CP}
   - \lambda(\hat{d}_a^{CP})^\top d_a^N + \lambda\|d_a^N\|^2 \\
 &\overset{(7.27)}{>} \lambda\big(\|d_a^N\|^2 - (\hat{d}_a^{CP})^\top d_a^N\big)
 \ge \lambda\big(\|d_a^N\| - \|\hat{d}_a^{CP}\|\big)\|d_a^N\| > 0.
\end{align*}
Hence, $\phi$ is strictly monotonically increasing.
If (7.27) holds with equality, then $d_a^N \neq \hat{d}_a^{CP}$ by assumption.
Rearranging terms, we find that
\[ \phi'(\lambda) = \lambda\|\hat{d}_a^{CP} - d_a^N\|^2
   + (d_a^N - \hat{d}_a^{CP})^\top \hat{d}_a^{CP}
   = \lambda\|\hat{d}_a^{CP} - d_a^N\|^2 > 0 \]
for $\lambda > 0$. Hence, $\phi$ is strictly monotonically increasing, which concludes
the proof. □

Now we show that the quadratic model decreases monotonically along the "dogleg" path.
Lemma 7.4. Let the assumptions of Lemma 7.3 be satisfied. Then the local quadratic model
\[ m_a(x) = f(x_a) + \nabla f(x_a)^\top (x - x_a)
   + \tfrac{1}{2}(x - x_a)^\top H_a (x - x_a) \]
is monotonically decreasing along $P$.
Proof. Let $\hat{x}_a^{CP}\,(\neq x_a)$ be the minimizer of $m_a$ in the direction
$-\nabla f(x_a)$. Thus, $m_a$ is strictly monotonically decreasing from $x_a$ to
$\hat{x}_a^{CP}$. Hence, only the segment from $\hat{x}_a^{CP}$ to $x_a^N$ remains to be
discussed. For this purpose, let
\begin{align*}
\psi(\lambda) &= m_a\big(x_a + (1-\lambda)\hat{d}_a^{CP} + \lambda d_a^N\big)
 = f(x_a) + \nabla f(x_a)^\top \big((1-\lambda)\hat{d}_a^{CP} + \lambda d_a^N\big) \\
 &\quad + \tfrac{1}{2}\big((1-\lambda)\hat{d}_a^{CP} + \lambda d_a^N\big)^\top H_a
   \big((1-\lambda)\hat{d}_a^{CP} + \lambda d_a^N\big).
\end{align*}
The relations $H_a d_a^N = -\nabla f(x_a)$ and
$\hat{d}_a^{CP} = -\hat{\alpha}_a\nabla f(x_a)$ imply
\begin{align*}
\psi(\lambda) &= f(x_a) - \hat{\alpha}_a(1-\lambda)\|\nabla f(x_a)\|^2
 - \lambda\nabla f(x_a)^\top H_a^{-1}\nabla f(x_a)
 + \tfrac{1}{2}(1-\lambda)^2\hat{\alpha}_a^2\,\nabla f(x_a)^\top H_a\nabla f(x_a) \\
 &\quad + \hat{\alpha}_a(\lambda - \lambda^2)\|\nabla f(x_a)\|^2
 + \tfrac{1}{2}\lambda^2\,\nabla f(x_a)^\top H_a^{-1}\nabla f(x_a) \\
 &= f(x_a) - \hat{\alpha}_a(1-\lambda)^2\|\nabla f(x_a)\|^2
 + \tfrac{1}{2}(1-\lambda)^2\hat{\alpha}_a^2\,\nabla f(x_a)^\top H_a\nabla f(x_a)
 + \lambda\Big(1 - \frac{\lambda}{2}\Big)\nabla f(x_a)^\top d_a^N.
\end{align*}
Using $\hat{\alpha}_a = \frac{\|\nabla f(x_a)\|^2}{\nabla f(x_a)^\top H_a\nabla f(x_a)}$,
we obtain
\begin{align*}
\psi'(\lambda) &= 2\hat{\alpha}_a(1-\lambda)\|\nabla f(x_a)\|^2
 - (1-\lambda)\hat{\alpha}_a^2\,\nabla f(x_a)^\top H_a\nabla f(x_a)
 + (1-\lambda)\nabla f(x_a)^\top d_a^N \\
 &= 2\hat{\alpha}_a(1-\lambda)\|\nabla f(x_a)\|^2
 - (1-\lambda)\hat{\alpha}_a\|\nabla f(x_a)\|^2
 + (1-\lambda)\nabla f(x_a)^\top\big(-H_a^{-1}\nabla f(x_a)\big) \\
 &= (1-\lambda)\nabla f(x_a)^\top
   \big(\hat{\alpha}_a\nabla f(x_a) - H_a^{-1}\nabla f(x_a)\big) \\
 &= \frac{1-\lambda}{\hat{\alpha}_a}\,(x_a - \hat{x}_a^{CP})^\top
   \big(x_a - H_a^{-1}\nabla f(x_a) - (x_a - \hat{\alpha}_a\nabla f(x_a))\big) \\
 &= \frac{1-\lambda}{\hat{\alpha}_a}\,(x_a - \hat{x}_a^{CP})^\top
   (x_a^N - \hat{x}_a^{CP}) \le 0,
\end{align*}
where the final inequality follows from (7.27), since
$x_a - \hat{x}_a^{CP} = -\hat{d}_a^{CP}$ and
$x_a^N - \hat{x}_a^{CP} = d_a^N - \hat{d}_a^{CP}$.

We have thus shown that the dogleg strategy yields a well-defined trial solution of the
trust region subproblem. Now we can prove the following global convergence theorem.
Theorem 7.12. Let $\nabla f$ be Lipschitz continuous with modulus $L$. Let $\{x^k\}$ be
generated by Algorithm 7.4, where the solutions of the trust region subproblem are given
by (D). Furthermore, we assume that the sequence of matrices $\{H^k\}$ is bounded. Then
either $\{f(x^k)\}$ is unbounded from below, or $\nabla f(x^k) = 0$ for some finite $k$,
or
\[ \lim_{k\to\infty} \|\nabla f(x^k)\| = 0. \]

Proof. We have to prove that the conditions of Assumption 7.2 are satisfied.
Condition 2: Let $\|H^k\| \le M$. In case $\|d^k\| \le \Delta_k$, the definition of
$x^D$ yields (7.26) and $x_v = x_N^k$. Further we have
\[ \|d^k\| = \|x^k - x_N^k\| = \|(H^k)^{-1}\nabla f(x^k)\|
   \ge \frac{\|\nabla f(x^k)\|}{M}. \]
Condition 1: We distinguish the different cases in the determination of $x^D$. If
$x^D = x_a^{CP}$, then either $\|d_a^{CP}\| = \Delta_a$ or (7.26) does not hold true.
First we consider the case $\nabla f(x_a)^\top H_a \nabla f(x_a) \le 0$. Then
$\|d_a^{CP}\| = \Delta_a$ and $\alpha_a = \frac{\Delta_a}{\|\nabla f(x_a)\|}$. Therefore
it holds that
\begin{align*}
\mathrm{pred}_k &= \alpha_a\|\nabla f(x_a)\|^2
 - \frac{\alpha_a^2}{2}\,\nabla f(x_a)^\top H_a\nabla f(x_a) \\
 &= \Delta_a\|\nabla f(x_a)\|
 - \Delta_a^2\,\frac{\nabla f(x_a)^\top H_a\nabla f(x_a)}{2\|\nabla f(x_a)\|^2}
 \ge \Delta_a\|\nabla f(x_a)\| = \|d_v\| \cdot \|\nabla f(x_a)\|,
\end{align*}
i.e., Condition 1 holds true with $\sigma = 1$. The following cases can be checked
similarly:
\[ \nabla f(x_a)^\top H_a\nabla f(x_a) > 0 \;\wedge\;
   \big(\|d_a^{CP}\| = \Delta_a \;\vee\; \|d_a^{CP}\| < \Delta_a\big)
   \;\Rightarrow\; \sigma = \tfrac{1}{2}. \]
Finally, there is the case where (7.26) holds true whereas $x^D \neq x_a^{CP}$. In this
situation we have
\[ \mathrm{pred}_k \ge m_a(x_a) - m_a(x_a^{CP}) \ge \frac{\|\nabla f(x_a)\|^2}{M}. \]
Now we can apply Theorem 7.10 to obtain the assertion. □
Finally, we establish fast local convergence.
Theorem 7.13. Let $\nabla f$ be Lipschitz continuous with modulus $L$. Let $\{x^k\}$ be
generated by Algorithm 7.4, where the solutions of the trust region subproblems are given
by (D). Further assume that $H^k = \nabla^2 f(x^k)$ and $\{H^k\}$ is bounded, $f$ is
bounded from below and $x^*$ is a minimizer of $f$ satisfying (A). If
$\lim_k x^k = x^*$, then $x^k$ converges Q-quadratically to $x^*$.
Proof. Since $x^*$ is the limit of $\{x^k\}$, there exists a $\delta > 0$ such that for
sufficiently large $k$ it holds that
\[ \|x^k - x^*\| \le \delta, \qquad \|H^k\| \le 2\|\nabla^2 f(x^*)\|, \qquad
   \|(H^k)^{-1}\| \le 2\|(\nabla^2 f(x^*))^{-1}\|. \]
Furthermore, let $\delta$ be chosen such that the assertions of Theorem 7.1 hold true.
If $H^k \in S^n$ is positive definite, then $(H^k)^{-1} \in S^n$ is positive definite as
well and (7.26) is fulfilled. Hence, the "dogleg" path starts at $x_a$ and runs through
$x_a^{CP}$ to $x_a^N$. For sufficiently small $\rho$ it holds that
\[ \|(H^k)^{-1}\nabla f(x^k)\| \le 2\|x^k - x^*\| \le 2\rho. \]
From this we infer (see proof of Theorem 7.14):
\[ \mathrm{pred}_k \ge \tfrac{1}{2}\|d_v^k\| \cdot \|\nabla f(x^k)\|. \]
Moreover, we have
\begin{align*}
\mathrm{ared}_k &= -\nabla f(x^k)^\top d_v^k
 - \int_0^1 \big(\nabla f(x^k + \tau d_v^k) - \nabla f(x^k)\big)^\top d_v^k\,d\tau \\
 &= \mathrm{pred}_k + \tfrac{1}{2}(d_v^k)^\top\nabla^2 f(x^k) d_v^k
 - \int_0^1 \big(\nabla f(x^k + \tau d_v^k) - \nabla f(x^k)\big)^\top d_v^k\,d\tau \\
 &= \mathrm{pred}_k + O\big(\|d_v^k\| \cdot \|\nabla f(x^k)\|\,\rho\big).
\end{align*}
Hence, $\frac{\mathrm{ared}_k}{\mathrm{pred}_k} = 1 - O(\rho)$. For sufficiently small
$\rho$, the trust region radius will be enlarged until $x_a^N$ is located in the trust
region, and $x_a^N$ will be accepted. □
Remark 7.3. Some trust region methods realize the inexact Newton idea. The local rate of
convergence can be derived analogously to Theorem 7.9.
CHAPTER 8

Quasi-Newton methods

Unlike Newton’s method, quasi-Newton methods do not make use of second order derivatives
of f . Rather, they approximate the second order derivatives iteratively with the help of
first order derivatives. Here, we consider quasi-Newton methods, which essentially work
like Newton methods with line search, but ∇2 f (xk ) is approximated by a positive definite
matrix H k . At each iteration, H k is updated in an appropriate way. The general algorithmic
structure is as follows:

(1) Set $d^k = -(H^k)^{-1}\nabla f(x^k)$.
(2) Determine $x^{k+1} = x^k + \alpha_k d^k$ by a step size strategy.
(3) Use $x^k$, $x^{k+1}$ and $H^k$ to update $H^k$ to $H^{k+1}$.

For the initial matrix H 0 choose a (symmetric) positive definite matrix. A standard choice
is given by H 0 = I, but sometimes better scaling might be necessary. The benefits of quasi-
Newton methods are (amongst others):
• Only first order derivatives are required.
• $H^k$ is always positive definite, so that $d^k$ is a descent direction for $f$ at
  $x^k$ and our line search framework is applicable.
• Some variants require only $O(n^2)$ multiplications per iteration (instead of
  $O(n^3)$ as for Newton's method).
The last point is related to quasi-Newton variants, which approximate (∇2 f (xk ))−1 directly,
thus sparing the cost of solving the linear system to determine dk .
As we will soon see, positive definiteness is not guaranteed for all quasi-Newton update
rules. If $\{H^k\}$ consists only of positive definite matrices, one speaks of a
"variable metric" method.
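The generic structure above can be written down directly; the update of step (3) is
passed in as a callback, while the concrete update rules are derived in the next
section. This is a minimal sketch with an Armijo backtracking line search; all names and
parameter values are illustrative, not prescribed by the text.

```python
import numpy as np

def quasi_newton(f, grad, x0, update, sigma=1e-4, beta=0.5, tol=1e-8, kmax=200):
    """Generic driver for steps (1)-(3): direction, Armijo step size, update.
    `update(H, s, y)` returns the next Hessian approximation H^{k+1}."""
    x = np.asarray(x0, dtype=float)
    H = np.eye(x.size)                          # standard choice H^0 = I
    for _ in range(kmax):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        d = -np.linalg.solve(H, g)              # step (1)
        alpha = 1.0                             # step (2): Armijo backtracking
        while f(x + alpha * d) > f(x) + sigma * alpha * (g @ d):
            alpha *= beta
        s = alpha * d
        y = grad(x + s) - g
        x = x + s
        H = update(H, s, y)                     # step (3)
    return x

# baseline: keeping H^k = I reduces the scheme to steepest descent
f = lambda x: (x[0] - 1.0) ** 2 + 5.0 * (x[1] + 2.0) ** 2
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 10.0 * (x[1] + 2.0)])
x_star = quasi_newton(f, grad, np.zeros(2), update=lambda H, s, y: H)
```

Plugging in, e.g., the BFGS rule (8.5) for `update` turns this driver into the
quasi-Newton method analyzed below.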

1. Update rules
In this section we will discuss several update rules for the Hessian approximation $H$.
Let
\[ s_a = \alpha_a d_a \;(= -\alpha_a H_a^{-1}\nabla f(x_a)), \qquad
   y_a = \nabla f(x_+) - \nabla f(x_a) \quad\text{with } x_+ = x_a + s_a. \]
It holds that
\[ y_a = \nabla f(x_+) - \nabla f(x_a)
   = \nabla f(x_a) + \nabla^2 f(x_a)(x_+ - x_a) + o(\|x_+ - x_a\|) - \nabla f(x_a)
   = \nabla^2 f(x_a) s_a + o(\|s_a\|). \]
Therefore we require
\[ (8.1)\qquad H_+ s_a = y_a. \]

This condition is called the quasi-Newton condition (or secant condition). A simple
ansatz for $H_+$ in (8.1) is
\[ H_+ = H_a + \alpha u u^\top, \qquad \alpha \in \mathbb{R},\; u \in \mathbb{R}^n, \]
which is referred to as a symmetric rank-1 update. Inserting this update into (8.1)
yields
\[ H_a s_a + \alpha u (u^\top s_a) = y_a. \]
Thus, $u$ is proportional to $y_a - H_a s_a$. We set $u = y_a - H_a s_a$ (as the length
can be adjusted by $\alpha$), which implies $\alpha u^\top s_a = 1$. This results in the
symmetric rank-1 formula
\[ (8.2)\qquad H_+ = H_a
   + \frac{(y_a - H_a s_a)(y_a - H_a s_a)^\top}{(y_a - H_a s_a)^\top s_a}. \]
Unfortunately, this formula has a few drawbacks. In particular, positive definiteness
gets lost (even if $H^0$ is chosen positive definite), and numerical problems appear
whenever $y_a - H_a s_a \approx 0$ resp. $(y_a - H_a s_a)^\top s_a \approx 0$.
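As a sketch of how (8.2) is used in practice, the following implements the symmetric
rank-1 update together with a common safeguard against the numerical problems just
mentioned; the skipping threshold `r` is a standard heuristic, not part of the text.

```python
import numpy as np

def sr1_update(H, s, y, r=1e-8):
    """Symmetric rank-1 update (8.2). The update is skipped (H returned unchanged)
    when the denominator (y - H s)^T s is dangerously small."""
    v = y - H @ s
    denom = v @ s
    if abs(denom) < r * np.linalg.norm(v) * np.linalg.norm(s):
        return H                      # safeguard: keep the old approximation
    return H + np.outer(v, v) / denom

# the update reproduces the secant condition H_+ s_a = y_a:
H = np.eye(2)
s = np.array([1.0, 0.0])
y = np.array([2.0, 1.0])
H_plus = sr1_update(H, s, y)
```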
The non-symmetric ansatz
\[ H_+ = H_a + \alpha u v^\top, \qquad \alpha \in \mathbb{R},\; u, v \in \mathbb{R}^n, \]
with $v := s_a$, inserted into (8.1) yields
\[ H_a s_a + \alpha u (s_a^\top s_a) = y_a. \]
Hence, $u$ is proportional to $y_a - H_a s_a$, which with $u = y_a - H_a s_a$
immediately implies $\alpha (s_a^\top s_a) = 1$. We obtain the non-symmetric rank-1
formula
\[ (8.3)\qquad H_+ = H_a + \frac{(y_a - H_a s_a) s_a^\top}{s_a^\top s_a}. \]
The fact that positive definiteness cannot be guaranteed and the absence of symmetry
represent crucial disadvantages of (8.3).
More flexible update formulae can be derived by applying symmetric rank-2 updates, i.e.,
\[ H_+ = H_a + \alpha u u^\top + \beta v v^\top, \qquad
   \alpha, \beta \in \mathbb{R},\; u, v \in \mathbb{R}^n. \]
Inserting this into (8.1) yields
\[ (8.4)\qquad H_a s_a + \alpha u u^\top s_a + \beta v v^\top s_a = y_a. \]
The vectors $u$ and $v$ are no longer uniquely determined. In view of (8.4), it is
adequate to choose
\[ u = y_a \quad\text{and}\quad v = H_a s_a. \]
Then we obtain
\[ \alpha y_a (y_a^\top s_a) + \beta (H_a s_a)(H_a s_a)^\top s_a = y_a - H_a s_a, \]
which implies
\[ \alpha (y_a^\top s_a) = 1 \quad\text{and}\quad \beta (s_a^\top H_a s_a) = -1. \]
Thus,
\[ \alpha = \frac{1}{y_a^\top s_a} \quad\text{and}\quad
   \beta = -\frac{1}{s_a^\top H_a s_a}, \]
and finally
\[ (8.5)\qquad H_+ = H_a + \frac{y_a y_a^\top}{y_a^\top s_a}
   - \frac{(H_a s_a)(H_a s_a)^\top}{s_a^\top H_a s_a}. \]
The update rule (8.5) is called the BFGS formula (named after
Broyden-Fletcher-Goldfarb-Shanno).
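Formula (8.5) translates directly into code. The small checks below confirm the secant
condition $H_+ s_a = y_a$ and, for $y_a^\top s_a > 0$, the preservation of symmetry and
positive definiteness (cf. Lemma 8.1 below); the example data are arbitrary.

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS formula (8.5): H_+ = H + y y^T/(y^T s) - (H s)(H s)^T/(s^T H s)."""
    Hs = H @ s
    return H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)

# sanity checks on a small example with y^T s = 3 > 0:
H = np.eye(2)
s = np.array([1.0, 0.0])
y = np.array([3.0, 1.0])
H_plus = bfgs_update(H, s, y)
```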
One may also approximate $\nabla^2 f(x^k)^{-1}$ by $B^k$. In this case the quasi-Newton
condition reads
\[ B_+ y_a = s_a. \]
Applying a symmetric rank-2 update analogous to (8.4) with $u = s_a$ and $v = B_a y_a$,
one obtains
\[ (8.6)\qquad B_+ = B_a + \frac{s_a s_a^\top}{s_a^\top y_a}
   - \frac{(B_a y_a)(B_a y_a)^\top}{y_a^\top B_a y_a}. \]
This formula is called the DFP formula (after Davidon-Fletcher-Powell). Owing to the
relations $B \leftrightarrow H$ and $y \leftrightarrow s$, (8.5) and (8.6) are
considered dual to each other. Numerical practice shows that the BFGS method is often
superior to the DFP method. In the following, we will therefore focus on the update
according to (8.5).
First of all, note that if $H^0 \in S^n$, then also $H^k \in S^n$ for all $k$, due to
the structure of (8.5). Positive definiteness is considered in the following lemma.

Lemma 8.1. Let $H_a \in S^n$ be positive definite, $y_a^\top s_a > 0$ and $H_+$ be
determined according to (8.5). Then $H_+ \in S^n$ is positive definite.

Proof. Positive definiteness of $H_a$ and $y_a^\top s_a > 0$ yield for all $z \neq 0$:
\begin{align*}
z^\top H_+ z &= z^\top H_a z + \frac{z^\top y_a y_a^\top z}{y_a^\top s_a}
 - \frac{z^\top H_a s_a \cdot (H_a s_a)^\top z}{s_a^\top H_a s_a} \\
 &= \frac{(z^\top y_a)^2}{y_a^\top s_a} + z^\top H_a z
 - \frac{(z^\top H_a s_a)^2}{s_a^\top H_a s_a}.
\end{align*}
Since $H_a \in S^n$ is positive definite, there exists $H_a^{1/2}$ with
$H_a = H_a^{1/2} \cdot H_a^{1/2}$. Thus, by the Cauchy-Schwarz inequality,
\[ |z^\top H_a s_a| = |(H_a^{1/2} z)^\top (H_a^{1/2} s_a)|
   \le \|H_a^{1/2} z\| \cdot \|H_a^{1/2} s_a\| \]
and also
\[ (z^\top H_a s_a)^2 \le \|H_a^{1/2} z\|^2 \cdot \|H_a^{1/2} s_a\|^2
   = (z^\top H_a z) \cdot (s_a^\top H_a s_a). \]
Equality only holds if $z = 0$ or $s_a = 0$ (but neither is relevant in our situation),
or $z = \kappa s_a$, $\kappa \in \mathbb{R}$. Hence, if $z \neq \kappa s_a$, then
$\frac{(z^\top H_a s_a)^2}{s_a^\top H_a s_a} < z^\top H_a z$; if $z = \kappa s_a$ with
$\kappa \neq 0$, then
$\frac{(z^\top y_a)^2}{y_a^\top s_a}
 = \frac{\kappa^2 (s_a^\top y_a)^2}{y_a^\top s_a} > 0$.
Altogether we obtain $z^\top H_+ z > 0$. As $z \neq 0$ was arbitrarily chosen, the
assertion is proven. □

The condition $y_a^\top s_a > 0$ is realistic. For quadratic problems with positive
definite Hessian $G$, it holds that
\[ y_a^\top s_a = (\nabla f(x_+) - \nabla f(x_a))^\top (x_+ - x_a)
   = (x_+ - x_a)^\top G (x_+ - x_a) > 0. \]
For general problems, $y_a^\top s_a > 0$ is ensured by the Wolfe-Powell step size
strategy.
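The quadratic-case computation is easy to verify numerically; the indefinite
counterexample below is an added illustration (not from the text) of how the curvature
condition can fail when the Hessian has negative curvature.

```python
import numpy as np

# positive definite quadratic: f(x) = 0.5 x^T G x, so y_a = G s_a
G = np.array([[3.0, 1.0], [1.0, 2.0]])      # symmetric positive definite
s = np.array([1.0, -2.0])
y = G @ s
curvature = y @ s                            # = s^T G s, positive here

# for an indefinite "Hessian" the condition can fail along a negative direction:
G_ind = np.array([[1.0, 0.0], [0.0, -4.0]])
s2 = np.array([0.0, 1.0])
curvature_ind = (G_ind @ s2) @ s2            # = -4, so y^T s <= 0
```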

2. Local convergence theory


Before discussing local convergence properties, we demonstrate that Newton’s method resp.
the BFGS-method are invariant under affine transformations. For this purpose, let A be a
nonsingular n × n-matrix, b ∈ Rn and y = Ax + b resp. x = A−1 (y − b). The chain rule
implies
∂ X ∂yk ∂ X ∂
= = Aki .
∂xi ∂xi ∂yk ∂yk
k k
It follows ”∇x = A> ∇ y” and
∇x f = A> ∇y f,
∇2x f = A> ∇2y f A.
Let Hax and Hay denote the current BFGS-matrices corresponding to the differentiation w.r.t.
x resp. y. Suppose that
(Hax )−1 = A−1 (Hay )−1 A−> , ya = Axa + b,
and that in both cases the same step size αa is chosen. Then the BFGS -method is invariant
under affine transformations T (x) = Ax + b. To see this, consider
x+ = xa − αa (Hax )−1 ∇x f (xa )
and w.r.t. y,
y+ =ya − αa (Hay )−1 ∇y f (ya )
=Axa + b − αa A(Hax )−1 A> A−> ∇x f (xa )
=A(xa − αa (Hax )−1 ∇x f (xa )) + b = Ax+ + b.
An analogous argument can be used for Newton’s method. As a consequence, from now
on we may suppose that ∇2 f (x∗ ) = I holds true (this is ensured by the transformation
T (x) = (∇2 f (x∗ ))−1/2 ).
We next specify the central convergence theorem. Its proof is based on some auxiliary results,
which we establish in the remainder of this section.
Theorem 8.1. Let (A) be satisfied. Then there exists $\delta > 0$ such that for
\[ \|x^0 - x^*\| \le \delta \quad\text{and}\quad
   \|H^0 - \nabla^2 f(x^*)\| \le \delta, \]
the BFGS method is well-defined and converges q-superlinearly to $x^*$.
As announced, the proof of the above result requires several auxiliary results. The
error in the approximation of the inverse Hessian is denoted by
\[ F = H^{-1} - \nabla^2 f(x^*)^{-1} = H^{-1} - I. \]
We state without proof the following lemma.
Lemma 8.2. Let (A) be satisfied. If $H_a \in S^n$ is positive definite and
\[ x_+ = x_a - H_a^{-1}\nabla f(x_a), \]
then there exists $\delta_0$ such that for
\[ 0 < \|x_a - x^*\| \le \delta_0 \quad\text{and}\quad \|F_a\| \le \delta_0, \]
it holds that $y_a^\top s_a > 0$. Furthermore, the BFGS update $H_+$ of $H_a$ satisfies
\[ F_+ = H_+^{-1} - I = (I - w_a w_a^\top) F_a (I - w_a w_a^\top) + D_a \]
with $w_a = \frac{s_a}{\|s_a\|}$, $D_a \in \mathbb{R}^{n\times n}$ and
$\|D_a\| \le K_D\|s_a\|$ with $K_D > 0$.
Basically, Lemma 8.2 indicates that the approximation stays close to the exact Hessian
if the initial values are "good" enough. This property is fundamental for proving local
superlinear convergence.
Corollary 8.1. Under the assumptions of Lemma 8.2, it holds that
\[ \|F_+\| \le \|F_a\| + K_D\|s_a\|
   \le \|F_a\| + K_D(\|x_a - x^*\| + \|x_+ - x^*\|). \]
The second inequality in the assertion of Corollary 8.1 follows directly from
$s_a = x_+ - x_a$. The first inequality can be obtained by expanding the representation
of $F_+$ from Lemma 8.2 and estimating the resulting expression using
$\|D_a\| \le K_D\|s_a\|$.
At this point, we can prove local q-linear convergence.
Theorem 8.2. Let (A) be satisfied and $\sigma \in (0,1)$ be given. Then there exists
$\delta_l$ such that for
\[ (8.7)\qquad \|x^0 - x^*\| \le \delta_l \quad\text{and}\quad
   \|(H^0)^{-1} - \nabla^2 f(x^*)^{-1}\| \le \delta_l, \]
the BFGS iteration is well-defined and converges q-linearly to $x^*$. The q-factor is
bounded by $\sigma$.
Note that in general $\delta_l$ is directly proportional to $\sigma$.
Proof. For sufficiently small $\hat{\delta}$ and
\[ (8.8)\qquad \|x_a - x^*\| \le \hat{\delta} \quad\text{and}\quad
   \|F_a\| = \|H_a^{-1} - I\| \le \hat{\delta}, \]
(A) yields
\[ \|x_+ - x^*\| \le \|F_a\|\,\|x_a - x^*\| + O(\|x_a - x^*\|^2)
   \le \hat{\delta}\|x_a - x^*\| + O(\|x_a - x^*\|^2). \]
Let $\hat{\delta}$ be small enough such that
\[ \|x_+ - x^*\| \le \sigma\|x_a - x^*\| < \|x_a - x^*\| \le \hat{\delta}. \]
Choose $\delta_l$ such that (8.8) holds true for the entire iteration, provided that the
initial value satisfies (8.7). We choose
\[ (8.9)\qquad \delta_l = \frac{\delta^*}{2}
   \Big(1 + \frac{K_D(1+\sigma)}{1-\sigma}\Big)^{-1} < \frac{\delta^*}{2} \]
with $K_D$ from Lemma 8.2. In case $\|I - H^0\| < \delta_l$ with
$\delta_l < \frac{1}{2}$, we infer
\begin{align*}
\|F^0\| &= \|(H^0)^{-1} - I\| \le \|(H^0)^{-1}\|\,\|I - H^0\|
 = \|(I - (I - H^0))^{-1}\|\,\|I - H^0\| \\
 &\le \frac{1}{1 - \|I - H^0\|}\,\|I - H^0\|
 \le \frac{\delta_l}{1 - \delta_l} \le 2\delta_l \le \delta^*.
\end{align*}
Corollary 8.1 yields
\[ \|F^1\| \le \|F^0\| + K_D(1+\sigma)\|x^0 - x^*\|. \]
It remains to prove that (8.7) and (8.9) imply
\[ \|F^k\| < \delta^* \qquad \forall k. \]
For this purpose we proceed inductively. Let $\|F^k\| < \delta^*$ and
$\|x^{j+1} - x^*\| \le \sigma\|x^j - x^*\|$ for all $j \le k$. Then Corollary 8.1
implies
\begin{align*}
\|F^{k+1}\| &\le \|F^k\| + K_D(\|x^k - x^*\| + \|x^{k+1} - x^*\|)
 \le \|F^k\| + K_D(1+\sigma)\|x^k - x^*\| \\
 &\le \|F^k\| + K_D(1+\sigma)\sigma^k\|x^0 - x^*\|
 \le \|F^k\| + K_D(1+\sigma)\sigma^k\delta_l \\
 &\le \|F^0\| + \delta_l K_D(1+\sigma)\sum_{j=0}^{k}\sigma^j
 \le \delta_l\Big(1 + \frac{K_D(1+\sigma)}{1-\sigma}\Big) < \delta^*.
\end{align*}
□
We now derive some useful relations. Assumption (A) also ensures
$\nabla f(x_a) \neq 0$ for $x_a$ ($x_a \neq x^*$) sufficiently close to $x^*$.
Moreover, it holds that
\[ (8.10)\qquad \nabla f(x_a)
   = \int_0^1 \nabla^2 f(x^* + \tau(x_a - x^*))(x_a - x^*)\,d\tau
   = (I + R_1)(x_a - x^*) \]
with
\[ R_1 = \int_0^1 \big(\nabla^2 f(x^* + \tau(x_a - x^*)) - I\big)\,d\tau. \]
Thus we obtain $\|R_1\| \le \frac{\gamma}{2}\|x_a - x^*\|$ as well as
\[ (8.11)\qquad s_a = -H_a^{-1}\nabla f(x_a) = -(I + F_a)(I + R_1)(x_a - x^*). \]
In case $\|F_a\| \le \delta_0$ and $\|x_a - x^*\| \le \delta_0$ (cf. Lemma 8.2), we have
\[ \|x_a - x^*\|(1 - \delta_0)\Big(1 - \frac{\gamma\delta_0}{2}\Big) \le \|s_a\|
   \le \|x_a - x^*\|(1 + \delta_0)\Big(1 + \frac{\gamma\delta_0}{2}\Big) \]
and consequently
\[ (8.12)\qquad 0 < \tfrac{1}{2}\|x_a - x^*\| \le \|s_a\| \le 2\|x_a - x^*\| \]
for $\delta_0 \le \min\big(\tfrac{1}{4}, \tfrac{1}{2\gamma}\big)$. Moreover,
\[ (8.13)\qquad y_a = \nabla f(x_+) - \nabla f(x_a)
   = \int_0^1 \nabla^2 f(x_a + \tau s_a)\, s_a\,d\tau
   = s_a + \int_0^1 \big(\nabla^2 f(x_a + \tau s_a) - I\big) s_a\,d\tau
   = s_a + R_2 s_a, \]
where
\[ R_2 = \int_0^1 \big(\nabla^2 f(x_a + \tau s_a) - I\big)\,d\tau. \]
Using $I = \nabla^2 f(x^*)$ and the Lipschitz continuity of $\nabla^2 f$, we infer
\begin{align*}
\|R_2\| &\le \int_0^1 \gamma\|x_a + \tau s_a - x^*\|\,d\tau
 \le \gamma\|x_a - x^*\| + \frac{\gamma}{2}\|s_a\| \\
(8.14)\qquad &\le 2\gamma\|s_a\| + \frac{\gamma}{2}\|s_a\| = \frac{5\gamma}{2}\|s_a\|.
\end{align*}
In the proof of the following theorem, the Dennis-Moré condition is applied. It represents a
necessary and sufficient condition for superlinear convergence of quasi-Newton methods. The
Dennis-Moré condition reads
\[ (8.15)\qquad \lim_{k\to\infty}\frac{\|F^k s^k\|}{\|s^k\|} = 0. \]
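The Dennis-Moré ratio $\|F^k s^k\|/\|s^k\|$ can be monitored along a BFGS run. The
following sketch (not from the text) does this for a simple quadratic with Hessian $A$,
in which case $F^k = (H^k)^{-1} - A^{-1}$; all parameter values are illustrative.

```python
import numpy as np

def bfgs_update(H, s, y):
    Hs = H @ s
    return H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)

# quadratic test problem: the Hessian is A everywhere and the minimizer is x* = 0
A = np.array([[2.0, 0.0], [0.0, 10.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
H = np.eye(2)
ratios = []                                    # Dennis-More ratios ||F^k s^k||/||s^k||
for _ in range(30):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    d = -np.linalg.solve(H, g)
    alpha = 1.0                                # Armijo backtracking line search
    while f(x + alpha * d) > f(x) + 1e-4 * alpha * (g @ d):
        alpha *= 0.5
    s = alpha * d
    F = np.linalg.inv(H) - np.linalg.inv(A)    # F^k = (H^k)^{-1} - (Hessian at x*)^{-1}
    ratios.append(np.linalg.norm(F @ s) / np.linalg.norm(s))
    y = grad(x + s) - g
    x = x + s
    H = bfgs_update(H, s, y)
```

One should observe that the ratio decays along the iteration, in accordance with the
superlinear convergence characterized by (8.15).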

Theorem 8.3. Suppose the standard assumption (A) is fulfilled. Let
$\{H^k\}_{k\in\mathbb{N}}$ be a sequence of nonsingular matrices with $\|H^k\| \le M$
for all $k \in \mathbb{N}$. Further, let $x^0$ be given and $\{x^k\}_{k=1}^{\infty}$ be
defined by
\[ x^{k+1} = x^k - (H^k)^{-1}\nabla f(x^k). \]
If $x^k$ converges q-linearly to $x^*$, $x^k \neq x^*$ for all $k \in \mathbb{N}$, and
(8.15) is met, then $x^k$ converges q-superlinearly to $x^*$.
Proof. The equations (8.13) and (8.14) yield
\[ F^k s^k = \big((H^k)^{-1} - I\big)(y^k - R_2 s^k) = F^k y^k + O(\|s^k\|^2). \]
It holds that $x^k \to x^*$ and, hence, $s^k \to 0$. The Dennis-Moré condition (8.15)
can therefore be rewritten as
\[ (8.16)\qquad \lim_{k\to\infty}\frac{\|F^k y^k\|}{\|s^k\|} = 0. \]
Now let $\sigma$ denote the q-factor of $\{x^k\}$. Then it holds that
\[ (1-\sigma)\|x_a - x^*\| \le \|s_a\| \le (1+\sigma)\|x_a - x^*\|, \]
as $\|s_a\| = \|x_+ - x_a\|$ and
\[ \|x_+ - x_a\| = \|x_+ - x^* + x^* - x_a\| \le \|x_+ - x^*\| + \|x_a - x^*\|
   \le \sigma\|x_a - x^*\| + \|x_a - x^*\|, \]
resp.
\[ \|x_+ - x_a\| \ge \|x_a - x^*\| - \|x_+ - x^*\| \ge (1-\sigma)\|x_a - x^*\|. \]
Therefore (8.16) is equivalent to
\[ \lim_{k\to\infty}\frac{\|F^k y^k\|}{\|x^k - x^*\|} = 0. \]
Since $(H^k)^{-1}\nabla f(x^k) = -s^k$ and $s^k = y^k + O(\|s^k\|^2)$ (owing to (8.13)),
it follows that
\begin{align*}
F^k y^k &= \big((H^k)^{-1} - I\big)\big(\nabla f(x^{k+1}) - \nabla f(x^k)\big)
 = (H^k)^{-1}\nabla f(x^{k+1}) + s^k - y^k
 = (H^k)^{-1}\nabla f(x^{k+1}) + O(\|s^k\|^2) \\
 &= (H^k)^{-1}(x^{k+1} - x^*) + O(\|x^k - x^*\|^2 + \|s^k\|^2)
 = (H^k)^{-1}(x^{k+1} - x^*) + O(\|x^k - x^*\|^2).
\end{align*}
Thus we have
\[ \frac{\|F^k y^k\|}{\|x^k - x^*\|}
   = \frac{\|(H^k)^{-1}(x^{k+1} - x^*)\|}{\|x^k - x^*\|} + O(\|x^k - x^*\|)
   \ge M^{-1}\frac{\|x^{k+1} - x^*\|}{\|x^k - x^*\|} + O(\|x^k - x^*\|) \to 0, \]
which yields the q-superlinear convergence of $x^k$ to $x^*$. □
Now we can state the proof of Theorem 8.1.
Proof of Theorem 8.1. Assume (8.7) of Theorem 8.2 is fulfilled with $\delta_l$ such that
the assertion of Theorem 8.2 holds true for $\sigma \in (0,1)$. It follows immediately
that
\[ (8.17)\qquad \sum_{k=0}^{\infty}\|s^k\| < \infty. \]
Let $\|A\|_F^2 = \sum_{i,j=1}^{n}(A)_{ij}^2 = \mathrm{trace}(A^\top A)$ be the Frobenius
norm of the matrix $A$. For $v \in \mathbb{R}^n$ with $\|v\| \le 1$, it holds that
\[ \|A(I - vv^\top)\|_F^2 \le \|A\|_F^2 - \|Av\|^2, \]
as
\begin{align*}
\|A(I - vv^\top)\|_F^2 &= \mathrm{trace}\big((I - vv^\top)^\top A^\top A(I - vv^\top)\big)
 = \mathrm{trace}\big((A^\top A - vv^\top A^\top A)(I - vv^\top)\big) \\
 &= \mathrm{trace}\big(A^\top A - vv^\top A^\top A - A^\top A vv^\top
   + vv^\top A^\top A vv^\top\big),
\end{align*}
$\mathrm{trace}(vv^\top A^\top A) = \|Av\|^2$ and
$\mathrm{trace}(vv^\top A^\top A vv^\top)
 \le \mathrm{trace}(vv^\top)\,\mathrm{trace}(A^\top A vv^\top)
 = \|v\|^2\|Av\|^2 \le \|Av\|^2$.
Moreover, $\|(I - vv^\top)A\|_F^2 \le \|A\|_F^2$. Thus, Lemma 8.2 implies
\[ \|F^{k+1}\|_F^2 \le \|F^k\|_F^2 - \|F^k w^k\|^2 + O(\|s^k\|)
   = (1 - \Theta_k^2)\|F^k\|_F^2 + O(\|s^k\|) \]
with
\[ w^k = \frac{s^k}{\|s^k\|}, \qquad
   \Theta_k = \begin{cases}
     \dfrac{\|F^k w^k\|}{\|F^k\|_F} & \text{if } F^k \neq 0,\\[1ex]
     1 & \text{if } F^k = 0.
   \end{cases} \]
With (8.17), we have
\[ \sum_{k=0}^{\infty}\Theta_k^2\|F^k\|_F^2
   \le \sum_{k=0}^{\infty}\big(\|F^k\|_F^2 - \|F^{k+1}\|_F^2\big) + O(1) < \infty. \]
Hence, $\Theta_k\|F^k\|_F \to 0$ and, since
\[ \Theta_k\|F^k\|_F = \begin{cases}
     \|F^k w^k\| & \text{if } F^k \neq 0,\\
     0 & \text{if } F^k = 0,
   \end{cases}
   \qquad\text{with}\qquad \|F^k w^k\| = \frac{\|F^k s^k\|}{\|s^k\|}, \]
the Dennis-Moré condition is met. □

3. Global convergence
Under the assumption that there exist constants $0 < c_1 < c_2 < +\infty$ with
\[ c_1\|x\|^2 \le x^\top H^k x \le c_2\|x\|^2
   \qquad \forall x \in \mathbb{R}^n,\; \forall k \in \mathbb{N}, \]
$d^k = -(H^k)^{-1}\nabla f(x^k)$ is a gradient-related search direction. If the Armijo
step size strategy is applied, a statement analogous to Theorem 5.1 holds true. Note
that for a local minimizer $x^*$ one cannot in general expect $x^k \to x^*$ for $x^0$
close to $x^*$. In view of the local theory, the situation where $x^0$ is close to
$x^*$ but $H^0$ is not close to $\nabla^2 f(x^*)$ is no better than the case where
$x^0$ is not sufficiently close to $x^*$.
Theorem 8.4. Let $D := \{x : f(x) \le f(x^0)\}$ be convex, $f$ twice continuously
differentiable in $D$ and the spectrum $\sigma(\nabla^2 f(x)) \subset [c_1, c_2]$ for
all $x \in D$. If $H^0 \in S^n$ is positive definite, then the BFGS method with Armijo
step size strategy converges q-superlinearly to $x^*$.
Similar statements hold true for the (strict) Wolfe-Powell step size strategy.

4. Numerical aspects
4.1. Memory-efficient updating. The BFGS-method always uses the current Hessian
approximation to calculate the new approximation with the aid of the rank-2-update. In
general, one expects that the performance of the approximation is improving in the course of
the iteration. Indeed, one can show that for quadratic problems, adequate initialization and
an exact step size strategy, the exact Hessian is perfectly approximated after at most n (for
∇2 f (x∗ ) ∈ Rn×n ) steps. Nonetheless, for general problems the situation is more complicated
and due to numerical reasons, a reset of H k to a well-scaled, positive definite matrix is
occasionally implemented. However, for high-dimensional problems storing the BFGS matrix
is undesirable because of memory restrictions. The so-called limited-memory BFGS method
takes this issue into account: from the point of view of the current iterate $x^k$, only
the preceding pairs $\{(y^l, s^l)\}$, $k - m \le l \le k$, are stored and the BFGS
formula is realized iteratively at each new iteration. Here, $m \in \mathbb{N}$ is a
fixed number.
We will now specify a strategy which approximates the inverse Hessian without occupying
too much memory space. Let $s_a = x_+ - x_a$, $y_a = \nabla f(x_+) - \nabla f(x_a)$ and
$H_+$ be computed from $H_a$ by the BFGS formula. Then, for positive definite
$H_a \in S^n$ and $y_a^\top s_a \neq 0$, it holds that $H_+$ is nonsingular and
\[ (8.18)\qquad H_+^{-1}
   = \Big(I - \frac{s_a y_a^\top}{y_a^\top s_a}\Big) H_a^{-1}
     \Big(I - \frac{y_a s_a^\top}{y_a^\top s_a}\Big)
   + \frac{s_a s_a^\top}{y_a^\top s_a}. \]
Rearranging yields
\[ H_+^{-1} = H_a^{-1} + \beta_0\, s_a s_a^\top
   + \gamma_0\big((H_a^{-1} y_a) s_a^\top + s_a (H_a^{-1} y_a)^\top\big) \]
with coefficients
\[ \beta_0 = \frac{y_a^\top s_a + y_a^\top H_a^{-1} y_a}{(y_a^\top s_a)^2}
   \quad\text{and}\quad \gamma_0 = -\frac{1}{y_a^\top s_a}. \]
From the relation
\[ H_a^{-1} y_a = H_a^{-1}\nabla f(x_+) - H_a^{-1}\nabla f(x_a)
   = H_a^{-1}\nabla f(x_+) + \frac{s_a}{\alpha_a}, \]
it follows that
\[ (8.19)\qquad H_+^{-1} = H_a^{-1} + \beta_1\, s_a s_a^\top
   + \gamma_0\big(s_a (H_a^{-1}\nabla f(x_+))^\top
   + (H_a^{-1}\nabla f(x_+)) s_a^\top\big) \]
with $\beta_1 = \beta_0 + \frac{2\gamma_0}{\alpha_a}$.
For the new direction $d_+$ this yields
\begin{align*}
d_+ = -H_+^{-1}\nabla f(x_+)
 &= -\Big(I - \frac{s_a y_a^\top}{y_a^\top s_a}\Big) H_a^{-1}
    \Big(I - \frac{y_a s_a^\top}{y_a^\top s_a}\Big)\nabla f(x_+)
    - \frac{s_a\, s_a^\top\nabla f(x_+)}{y_a^\top s_a} \\
 &= A_a s_a + C_a H_a^{-1}\nabla f(x_+)
\end{align*}
with
\begin{align*}
A_a &= \frac{y_a^\top H_a^{-1}
   \big(I - \frac{y_a s_a^\top}{y_a^\top s_a}\big)\nabla f(x_+)}{y_a^\top s_a}
 + \Big(\frac{1}{\alpha_a} - 1\Big)\frac{s_a^\top\nabla f(x_+)}{y_a^\top s_a}, \\
C_a &= -1 + \frac{s_a^\top\nabla f(x_+)}{y_a^\top s_a}.
\end{align*}
Thus, we can determine $d_+$ and consequently $\alpha_+$, $s_+$ just by means of
$H_a^{-1}\nabla f(x_+)$ (and $s_a$). Obviously, $H_+$ itself is not required.
Furthermore, we do not need to store the vectors $\{y_a\}$: as $C_a \neq 0$, it holds
that
\[ -H_a^{-1}\nabla f(x_+) = -\frac{s_+}{\alpha_+ C_a} + \frac{A_a s_a}{C_a}. \]
Inserting this into (8.19) yields
\[ H_+^{-1} = H_a^{-1} + \beta_a\, s_a s_a^\top
   + \gamma_a\big(s_a s_+^\top + s_+ s_a^\top\big) \]
with coefficients
\[ \beta_a = \beta_1 - 2\gamma_0\frac{A_a}{C_a} \quad\text{and}\quad
   \gamma_a = \frac{\gamma_0}{C_a\alpha_+}. \]
Finally, this leads to
\[ (H^{k+1})^{-1} = (H^0)^{-1} + \sum_{l=0}^{k}
   \Big(\beta_l\, s^l (s^l)^\top
   + \gamma_l\big(s^l (s^{l+1})^\top + s^{l+1}(s^l)^\top\big)\Big). \]
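In practice, the limited-memory idea is usually realized by the standard two-loop
recursion, which applies $(H^k)^{-1}$ to a vector directly from the stored pairs
$(s^l, y^l)$; this is a common alternative realization, not the exact recursion derived
above. For a single stored pair it coincides with the explicit inverse formula (8.18)
(with $H_a = I$), which the example below checks numerically.

```python
import numpy as np

def lbfgs_direction(g, pairs, gamma=1.0):
    """Two-loop recursion: returns (H^k)^{-1} g from stored pairs (s^l, y^l),
    oldest first, with initial inverse approximation gamma * I."""
    alphas = []
    q = g.copy()
    for s, y in reversed(pairs):               # newest pair first
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q = q - a * y
    r = gamma * q
    for (s, y), a in zip(pairs, reversed(alphas)):   # oldest pair first
        beta = (y @ r) / (y @ s)
        r = r + (a - beta) * s
    return r       # the quasi-Newton direction is then -r

# check against (8.18) with one stored pair and H_a = I:
s = np.array([1.0, 0.5])
y = np.array([2.0, 1.5])
g = np.array([0.3, -1.0])
rho = 1.0 / (y @ s)
I = np.eye(2)
B = (I - rho * np.outer(s, y)) @ I @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
r_loop = lbfgs_direction(g, [(s, y)])
```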

4.2. Positive definiteness. In Lemma 8.1 we have already seen that
$y_a^\top s_a > 0$ has to be assumed in order to guarantee the positive definiteness of
$H_+ \in S^n$. A simple strategy which ensures this property consists of the following
choice:
\[ H_+ = \begin{cases} H_+^{BFGS} & \text{if } y_a^\top s_a > 0,\\
   R & \text{if } y_a^\top s_a \le 0, \end{cases} \]
with $R \in S^n$ positive definite. Often one employs $R = I$.

5. Further Quasi-Newton formulae

We now indicate two other interesting update rules for the Hessian.
The direct DFP formula is
\[ H_+ = H_a
   + \frac{(y_a - H_a s_a) y_a^\top + y_a (y_a - H_a s_a)^\top}{y_a^\top s_a}
   - \frac{(y_a - H_a s_a)^\top s_a}{(y_a^\top s_a)^2}\, y_a y_a^\top. \]
This updating strategy can be analyzed similarly to the BFGS method. However, many
numerical observations favor the standard BFGS formula.

A formula which is applied especially in connection with trust region methods is the
PSB formula (Powell-symmetric-Broyden). This formula preserves symmetry; however,
positive definiteness generally gets lost:
\[ H_+ = H_a
   + \frac{(y_a - H_a s_a) s_a^\top + s_a (y_a - H_a s_a)^\top}{s_a^\top s_a}
   - \frac{s_a^\top (y_a - H_a s_a)}{(s_a^\top s_a)^2}\, s_a s_a^\top. \]
CHAPTER 9

Box-constrained problems

Let $X = \{x \in \mathbb{R}^n : L_i \le x_i \le U_i\}$ with
$-\infty < L_i < U_i < +\infty$ for $i = 1, \dots, n$. We consider the following
problem: find a local minimizer $x^* \in X$ of $f$, i.e.,
\[ f(x^*) \le f(x) \qquad \forall x \in X \cap \{z : \|z - x^*\| \le \epsilon\} \]
for an $\epsilon > 0$. Given that $X$ is compact¹, there always exists a solution to
\[ (9.1)\qquad \min f(x) \quad\text{s.t.}\quad x \in X. \]
In case $x_i = L_i$ or $x_i = U_i$, the index $i$ is called active; otherwise $i$ is
called inactive. We collect the active indices in the active set $A(x)$ and the
inactive indices in the inactive set $I(x)$.

1. Necessary conditions
The first and second order necessary conditions in the scalar case X ⊂ R are as follows.
Theorem 9.1. Let $f$ be twice continuously differentiable on $[a,b]$,
$-\infty < a < b < +\infty$, and let $x^*$ be a local minimizer of $f$ in $[a,b]$. Then
it holds that
\[ f'(x^*)(x - x^*) \ge 0 \qquad \forall x \in [a,b] \]
and
\[ f''(x^*)(x^* - a)(b - x^*) \ge 0. \]
This theorem may serve as a basis for the corresponding conditions in the multidimensional
case.
The point $x^* \in X$ is called a stationary point for (9.1) if
\[ (9.2)\qquad \nabla f(x^*)^\top (x - x^*) \ge 0 \qquad \forall x \in X. \]
At the same time, (9.2) represents the first order necessary condition.
In the following we use the term “solution” for “local minimizer”.
Theorem 9.2. Let f be twice continuously differentiable in X and x∗ a solution of (9.1).
Then x∗ is a stationary point for (9.1).
Proof. Let $y \in X$. Since $X$ is convex, we have $z(t) = x^* + t(y - x^*) \in X$ for
$t \in [0,1]$. Consider $\phi(t) = f(z(t))$; then $\phi$ has a local minimum at
$t = 0$. Now Theorem 9.1 implies
\[ 0 \le \phi'(t)\big|_{t=0} = \nabla f(z(0))^\top (y - x^*)
   = \nabla f(x^*)^\top (y - x^*), \]

which completes the proof. □

¹Of course, this is due to our assumption $-\infty < L_i < U_i < +\infty$. However, if
$L_i = -\infty$ or $U_i = +\infty$, then we would have to ensure, as in the
unconstrained case, the existence of a solution, for instance by assuming certain
convexity properties of $f$.


In order to gain information on second order derivatives, we consider the following
situation in $\mathbb{R}^2$. Let $X = [0,1]^2$. If $x^* \in (0,1)^2$, then it follows
analogously to the unconstrained case that $\nabla^2 f(x^*)$ is positive semidefinite.
Assume $x^* = (0, x_2^*)$ is a solution with $0 < x_2^* < 1$. Consider
$\phi(t) = f(0, t)$ for $t \in [0,1]$. It necessarily holds that
\[ 0 \le \phi''(x_2^*) = \frac{\partial^2 f}{\partial x_2^2}(0, x_2^*). \]
However, we cannot make a statement about $\frac{\partial^2 f}{\partial x_1^2}$.
We will now introduce the reduced Hessian of $f$. Let $f$ be twice continuously
differentiable; then the reduced Hessian $\nabla_R^2 f(x)$ of $f$ is defined as
\[ (\nabla_R^2 f(x))_{ij} = \begin{cases}
   \delta_{ij} & \text{for } i \in A(x) \text{ or } j \in A(x),\\
   (\nabla^2 f(x))_{ij} & \text{else}.
\end{cases} \]
Here, $\delta_{ij}$ denotes the Kronecker delta. Now we are able to formulate the
second order necessary condition.
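The definition of the reduced Hessian translates directly into code; the following is a
small illustrative sketch (names are not from the text).

```python
import numpy as np

def reduced_hessian(hess, active):
    """Reduced Hessian: identity rows/columns on the active set A(x),
    the true second derivatives on the inactive block."""
    n = hess.shape[0]
    HR = np.array(hess, dtype=float, copy=True)
    for i in range(n):
        for j in range(n):
            if i in active or j in active:
                HR[i, j] = 1.0 if i == j else 0.0
    return HR

hess = np.array([[4.0, 1.0], [1.0, 3.0]])
HR = reduced_hessian(hess, active={0})       # index 0 active, index 1 inactive
```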
Theorem 9.3. Let $f$ be twice continuously differentiable in $X$ and $x^*$ a solution
to (9.1). Then $\nabla_R^2 f(x^*)$ is positive semidefinite.
Proof. W.l.o.g. we consider the following partition of $x^*$:
\[ x^* = (x^*_{A(x^*)}, x^*_{I(x^*)}), \]
with $x^*_{A(x^*)} \in \mathbb{R}^{|A(x^*)|}$ consisting of the components
$i \in A(x^*)$; $x^*_{I(x^*)}$ analogously. Thus, with
$\phi(\xi) := f(x^*_{A(x^*)}, \xi)$, $\xi \in \mathbb{R}^{|I(x^*)|}$, it holds that
\[ \nabla_R^2 f(x^*) = \begin{pmatrix} I & 0 \\
   0 & \nabla^2\phi(x^*_{I(x^*)}) \end{pmatrix}. \]
Now assume that $\nabla^2\phi(x^*_{I(x^*)})$ has a negative eigenvalue $\lambda^*$.
Let $u^*$ denote a corresponding eigenvector. Then it holds with
$z := x^*_{I(x^*)} + t u^*$ that
\begin{align*}
\phi(z) &= \phi(x^*_{I(x^*)}) + t\,\nabla\phi(x^*_{I(x^*)})^\top u^*
 + \frac{t^2}{2}\, u^{*\top}\nabla^2\phi(x^*_{I(x^*)}) u^* + O(t^3) \\
 &= \phi(x^*_{I(x^*)}) + \frac{t^2}{2}\,\lambda^*\|u^*\|^2 + O(t^3)
 < \phi(x^*_{I(x^*)})
\end{align*}
for $t$ sufficiently small, where $\nabla\phi(x^*_{I(x^*)})^\top u^* = 0$ since the
inactive components of $x^*$ form an unconstrained local minimizer of $\phi$. The
latter inequality contradicts the optimality of $x^*$. Thus,
$\nabla^2\phi(x^*_{I(x^*)})$ is positive semidefinite. □
Let $P : \mathbb{R}^n \to X$ denote the projection onto $X$, which is given by
\[ P(x)_i = \begin{cases} L_i & \text{if } x_i < L_i,\\
   x_i & \text{if } L_i \le x_i \le U_i,\\
   U_i & \text{if } x_i > U_i. \end{cases} \]
Furthermore, define
\[ (9.3)\qquad x(\alpha) = P(x - \alpha\nabla f(x)). \]
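The projection $P$ and the point $x(\alpha)$ of (9.3) are straightforward to implement
componentwise; a minimal sketch (the test function is an illustrative assumption):

```python
import numpy as np

def project(x, L, U):
    """Componentwise projection P onto the box X = {L <= x <= U}."""
    return np.minimum(np.maximum(x, L), U)

def x_alpha(x, alpha, grad, L, U):
    """Projected gradient point x(alpha) = P(x - alpha * grad f(x)), cf. (9.3)."""
    return project(x - alpha * grad(x), L, U)

p = project(np.array([-1.0, 0.5, 3.0]), 0.0, 1.0)      # clips to [0, 1]
grad = lambda z: 2.0 * z                               # f(z) = ||z||^2 (illustration)
xa = x_alpha(np.array([0.5, 0.5]), 1.0, grad, 0.0, 1.0)
```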
Obviously, for a stationary point $x^*$ with $A(x^*) = \emptyset$ we have
$\nabla f(x^*) = 0$ and hence $x^*(\alpha) = P(x^*) = x^*$. Also, by the
characterization of the projection, we have
\[ \|x(\alpha) - x + \alpha\nabla f(x)\| \le \|y - x + \alpha\nabla f(x)\|
   \qquad \forall y \in X. \]
Thus,
\[ \psi(\lambda) = \tfrac{1}{2}\|(1-\lambda)x(\alpha) + \lambda y - x
   + \alpha\nabla f(x)\|^2 \]
has a local minimizer in $\lambda = 0$, i.e.,
\[ (9.4)\qquad 0 \le \psi'(0)
   = \big((1-\lambda)x(\alpha) + \lambda y - x
     + \alpha\nabla f(x)\big)^\top\big|_{\lambda=0}\,(y - x(\alpha))
   = (x(\alpha) - x + \alpha\nabla f(x))^\top (y - x(\alpha)). \]
Furthermore, it holds for $y = x$ that
\[ (9.5)\qquad \|x(\alpha) - x\|^2 \le \alpha\nabla f(x)^\top (x - x(\alpha))
   \qquad \forall \alpha \ge 0. \]
With the help of these auxiliary results, we can prove the following variant of the first order
necessary conditions.
Theorem 9.4. Let $f$ be continuously differentiable on $X$. A point $x^* \in X$ is
stationary for (9.1) if and only if
\[ x^* = P(x^* - \alpha\nabla f(x^*)) \qquad \forall \alpha \ge 0. \]
Proof. Let $x^*$ be a stationary point and
$x^*(\alpha) = P(x^* - \alpha\nabla f(x^*)) \in X$. (9.5) implies
\[ (9.6)\qquad 0 \le \|x^*(\alpha) - x^*\|^2
   \le \alpha\nabla f(x^*)^\top (x^* - x^*(\alpha)) \qquad \forall \alpha \ge 0. \]
As $x^*$ is stationary, it holds that
\[ \nabla f(x^*)^\top (x^* - x^*(\alpha)) \le 0, \]
whereby, using (9.6), it follows that $x^* = x^*(\alpha)$ for all $\alpha \ge 0$.
Now let $x^* = x^*(\alpha)$ for all $\alpha \ge 0$, i.e.,
\[ x^* = P(x^* - \alpha\nabla f(x^*)) \qquad \forall \alpha > 0. \]
For $i \in A(x^*)$ it can be inferred that
\begin{align*}
x_i^* = L_i &\;\Rightarrow\; \nabla f(x^*)_i \ge 0 \;\Rightarrow\;
 \nabla f(x^*)_i(\xi - x_i^*) \ge 0 \quad \forall \xi : L_i \le \xi \le U_i,\\
x_i^* = U_i &\;\Rightarrow\; \nabla f(x^*)_i \le 0 \;\Rightarrow\;
 \nabla f(x^*)_i(\xi - x_i^*) \ge 0 \quad \forall \xi : L_i \le \xi \le U_i.
\end{align*}
For $i \in I(x^*)$ it follows that $\nabla f(x^*)_i = 0$, whence
$\nabla f(x^*)_i(\xi - x_i^*) = 0$ for all $\xi : L_i \le \xi \le U_i$. Finally, we
obtain
\[ \nabla f(x^*)^\top (x - x^*) \ge 0 \qquad \forall x \in X, \]
thus $x^*$ is a stationary point. □

2. Sufficient conditions
When formulating the sufficient condition, we make use of the notion of a non-degenerate
stationary point.
Definition 9.1. A point $x^* \in X$ is a non-degenerate stationary point for (9.1) if
$x^*$ is a stationary point and
\[ \nabla f(x^*)_i \neq 0 \qquad \forall i \in A(x^*). \]
If $x^*$ is a local minimizer of (9.1), then $x^*$ is called a non-degenerate local
minimizer.

A non-degenerate stationary point also fulfills, for each $i$, either
\[ (x_i = L_i \,\dot{\vee}\, x_i = U_i) \;\wedge\; \nabla f(x)_i \neq 0 \]
or
\[ L_i < x_i < U_i \;\wedge\; \nabla f(x)_i = 0. \]
One also refers to this situation as strict complementarity.
Let $M$ be an index set; then we define
\[ (P_M x)_i = \begin{cases} x_i & \text{for } i \in M,\\ 0 & \text{else}. \end{cases} \]
By means of this projection, we obtain the following useful relation.
By means of this projection, we obtain the following useful relation.
Lemma 9.1. Let $x^*$ be a non-degenerate stationary point and let
$A(x^*) \neq \emptyset$. Then there exists a constant $\gamma > 0$ such that
\[ \nabla f(x^*)^\top (x - x^*) = \nabla f(x^*)^\top P_{A(x^*)}(x - x^*)
   \ge \gamma\|P_{A(x^*)}(x - x^*)\| \qquad \forall x \in X. \]
Proof. For $i \in A(x^*)$, non-degeneracy and stationarity of $x^*$ imply that there
exists a $\gamma > 0$ such that either
\[ x_i^* = L_i \quad\text{and}\quad \nabla f(x^*)_i \ge \gamma \]
or
\[ x_i^* = U_i \quad\text{and}\quad \nabla f(x^*)_i \le -\gamma. \]
For $x \in X$ it follows for all $i \in A(x^*)$ that
\[ (\nabla f(x^*))_i (x - x^*)_i \ge \gamma|(x - x^*)_i|. \]
Since $\|\cdot\|_1 \ge \|\cdot\|_2$ and $(\nabla f(x^*))_i = 0$ for $i \in I(x^*)$, it
holds that
\[ \sum_i (\nabla f(x^*))_i (x - x^*)_i = \nabla f(x^*)^\top (x - x^*)
   = \nabla f(x^*)^\top P_{A(x^*)}(x - x^*)
   \ge \gamma\|P_{A(x^*)}(x - x^*)\|. \]
□
Now we can state the sufficient conditions.
Theorem 9.5. Let $x^* \in X$ be a non-degenerate stationary point for (9.1). Let $f$ be
twice continuously differentiable in a neighborhood of $x^*$. If the reduced Hessian
$\nabla_R^2 f(x^*)$ is positive definite, then $x^*$ is a local minimizer of (9.1).
Proof. Let x ∈ X. Define φ(α) = f(x∗ + α(x − x∗)). If we can show that either
φ′(0) > 0
or
φ′(0) = 0 and φ″(0) > 0,
then f(x∗ + α(x − x∗)) > f(x∗) for all sufficiently small α > 0, i.e., x∗ is a local minimizer of f along each feasible direction x − x∗, x ∈ X. It holds that
φ′(0) = ∇f(x∗)⊤(x − x∗) = ∇f(x∗)⊤(PA(x∗)(x − x∗) + PI(x∗)(x − x∗)).
Since x∗ is a stationary point, it follows that ∇f(x∗)⊤PI(x∗)(x − x∗) = 0. In case PA(x∗)(x − x∗) ≠ 0, non-degeneracy of x∗ implies
∇f(x∗)⊤PA(x∗)(x − x∗) > 0,
and thus φ′(0) > 0.
Prof. Dr. Michael Hintermüller 89

If PA(x∗)(x − x∗) = 0, then we can deduce
φ″(0) = (x − x∗)⊤PI(x∗)∇²f(x∗)PI(x∗)(x − x∗) = (x − x∗)⊤∇2R f(x∗)(x − x∗).
Thus, φ′(0) = 0 and φ″(0) > 0. □

3. Projected gradient method


The projected gradient method is a natural extension of steepest descent to box-constrained
problems. Hence, it shows similar advantages and disadvantages.
Let xa denote the current iterate. In the projected gradient approach, the new iterate x+ is
given by
x+ = P (xa − α∇f (xa )).
Here, α is a step size computed e.g. by means of an Armijo step size strategy. For the
application of step size strategies, the expected descent has to be specified. Obviously, the
corresponding quantities from the unconstrained case are no longer adequate. In the case of
an Armijo step size strategy, we now apply the following condition:
(9.7) f(x(α)) − f(x) ≤ −(σ/α)‖x − x(α)‖², 0 < σ < 1,
with x(α) = P (x − α∇f (x)).

Algorithm 9.1 (Projected gradient method).


input: x0 ∈ Rn , 0 < σ < 1, 0 < β < 1.
begin
k := 0
while ”stopping criterion not fulfilled”
begin
find m ∈ N as small as possible such that (9.7) is met for αk = β^m.
xk+1 = xk (αk )
end
end
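As an illustration, Algorithm 9.1 can be sketched in NumPy as follows. The function names are ours, and the stopping test uses the quantity ‖x − x(1)‖ (in a purely absolute form), anticipating the criterion discussed in the text:

```python
import numpy as np

def projected_gradient(f, grad, x0, L, U, sigma=1e-4, beta=0.5,
                       tol=1e-8, kmax=500):
    """Sketch of Algorithm 9.1 with the Armijo condition (9.7)."""
    P = lambda z: np.minimum(np.maximum(z, L), U)
    x = P(np.asarray(x0, dtype=float))
    for _ in range(kmax):
        g = grad(x)
        if np.linalg.norm(x - P(x - g)) <= tol:  # ||x - x(1)|| small
            break
        alpha = 1.0
        while True:  # backtracking: alpha = beta^m with m as small as possible
            xa = P(x - alpha * g)  # x(alpha)
            # (9.7): f(x(alpha)) - f(x) <= -(sigma/alpha) ||x - x(alpha)||^2
            if f(xa) - f(x) <= -(sigma / alpha) * np.dot(x - xa, x - xa):
                break
            alpha *= beta
        x = xa
    return x
```

On f(x) = ½‖x − c‖² over a box, for example, x(1) = P(c) for every feasible x, so the method reaches the solution P(c) in a single accepted step.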

The stopping criterion will be specified later. In any case, one should fix an upper bound kmax
for the maximum number of iterations and terminate the algorithm if this bound is exceeded.
Now we want to focus on the stopping criterion. Evidently, k∇f (xk )k ≤ τr k∇f (x0 )k + τa is in
general not appropriate. We start by studying the active and inactive sets of adjacent points.
Lemma 9.2. Let f be twice continuously differentiable on X, and let x∗ be a non-degenerate
stationary point for (9.1). Let α ∈ (0, 1]. Then it holds for x sufficiently close to x∗ that
(1) A(x) ⊂ A(x∗ ) and xi = x∗i ∀i ∈ A(x).
(2) A(x(α)) = A(x∗ ) and x(α)i = x∗i ∀i ∈ A(x∗ ).
Proof. Let
δ1 = min{Ui − x∗i, x∗i − Li : i ∈ I(x∗)}.
If i ∈ I(x∗) and ‖x − x∗‖ < δ1, then Li < xi < Ui. Moreover, I(x∗) ⊂ I(x) and thus A(x) ⊂ A(x∗), which proves (1).

Let A(α) and I(α) denote the active and inactive index sets of x(α), respectively. Let i ∈ A(x∗) ≠ ∅. According to Lemma 9.1 and the continuity of ∇f there exists a constant δ2 > 0 with
‖x − x∗‖ < δ2 ⇒ (∇f(x∗ + (x − x∗)))i (x − x∗)i ≥ (σ/2)(x − x∗)i.
For
δ3 < min(σ/2, δ2) ∧ ‖x − x∗‖ < δ3
it follows: i ∈ A(α) ∧ x(α)i = x∗i . Hence, A(x∗ ) ⊂ A(α). On the other hand, the definition
of P implies
‖P(x) − P(y)‖ ≤ ‖x − y‖ ∀x, y ∈ Rn.
By continuity of ∇2 f , ∇f is Lipschitz continuous on X. Let L denote the corresponding
Lipschitz constant. We have
x∗ = x∗ (α) = P (x∗ − α∇f (x∗ ))
and therefore
(9.8) ‖x∗ − x(α)‖ = ‖P(x∗ − α∇f(x∗)) − P(x − α∇f(x))‖ ≤ ‖x∗ − x‖ + α‖∇f(x∗) − ∇f(x)‖ ≤ (1 + Lα)‖x − x∗‖.
If i ∈ A(α) ∩ I(x∗), then it holds:
(9.9) ‖x∗ − x(α)‖ ≥ δ1 = min{Ui − x∗i, x∗i − Li : i ∈ I(x∗)}.
If now ‖x − x∗‖ < δ4 := min{δ3, δ1/(1 + L)}, then (9.8) implies that (9.9) cannot be satisfied. □
Now we can prove the equivalence of ‖x − x∗‖ and ‖x − x(1)‖, which will lead to a suitable
stopping criterion.
Theorem 9.6. Let f be twice continuously differentiable on X, and x∗ a non-degenerate
stationary point for (9.1). Further assume that the second order sufficient condition holds at
x∗ . Then there exist δ > 0 and K > 0 such that for kx − x∗ k ≤ δ and A(x) = A(x∗ ) it holds
that
(9.10) K⁻¹‖x − x∗‖ ≤ ‖x − x(1)‖ ≤ K‖x − x∗‖.
Proof. We have
‖x − x(1)‖ = ‖x − x∗ − (x(1) − x∗(1))‖
≤ ‖x − x∗‖ + ‖P(x − ∇f(x)) − P(x∗ − ∇f(x∗))‖
≤ 2‖x − x∗‖ + ‖∇f(x) − ∇f(x∗)‖ ≤ (2 + L)‖x − x∗‖.
This implies the inequality on the right-hand side of (9.10). Choose δ1 such that ‖x − x∗‖ < δ1 implies that Lemma 9.2 holds for α = 1. One has
(x − x(1))i = ∇f(x)i for i ∈ I(x∗),
(x − x(1))i = (x − x∗)i for i ∈ A(x∗).
It remains to consider i ∈ I(x∗). The sufficient conditions yield the existence of a µ > 0 with
u⊤PI(x∗)∇²f(x∗)PI(x∗)u ≥ µ‖PI(x∗)u‖² ∀u ∈ Rn.
Thus, there exists another constant δ2 such that for ‖x − x∗‖ < δ2:
u⊤PI(x∗)∇²f(x)PI(x∗)u ≥ (µ/2)‖PI(x∗)u‖² ∀u ∈ Rn.

Since x − x∗ = PI(x∗)(x − x∗), we conclude with the Cauchy–Schwarz inequality:
‖PI(x∗)(x − x(1))‖ ‖x − x∗‖ ≥ (x − x∗)⊤PI(x∗)(x − x(1))
= ∫₀¹ (x − x∗)⊤PI(x∗)∇²f(x∗ + t(x − x∗))(x − x∗) dt
= ∫₀¹ (x − x∗)⊤PI(x∗)∇²f(x∗ + t(x − x∗))PI(x∗)(x − x∗) dt
≥ (µ/2)‖PI(x∗)(x − x∗)‖² = (µ/2)‖x − x∗‖².
This implies ‖x − x(1)‖ ≥ min(1, µ/2)‖x − x∗‖. By choosing
K = max(2 + L, 1, 2/µ),
the assertion is proven. □
In summary, we obtain the following stopping criterion:
‖xk − xk(1)‖ ≤ τr‖x0 − x0(1)‖ + τa.
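In code, this criterion is a one-liner; the following sketch (our own naming) evaluates the relative/absolute test for given tolerances τr and τa:

```python
import numpy as np

def pg_stop(x_k, x0, grad, L, U, tau_r=1e-6, tau_a=1e-10):
    """Stopping test ||x_k - x_k(1)|| <= tau_r ||x0 - x0(1)|| + tau_a."""
    P = lambda z: np.minimum(np.maximum(z, L), U)
    crit = lambda x: np.linalg.norm(x - P(x - grad(x)))  # ||x - x(1)||
    return crit(x_k) <= tau_r * crit(x0) + tau_a
```

At a stationary point x∗ the quantity ‖x∗ − x∗(1)‖ vanishes, so the test is eventually satisfied along a convergent sequence.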
Convergence analysis.

At first we demonstrate that the Armijo step sizes are bounded away from 0.
Theorem 9.7. Let ∇f be Lipschitz continuous with modulus L. Let x ∈ X. Then, (9.7) is
satisfied for all α with
0 < α ≤ 2(1 − σ)/L.
Proof. Let y := x − x(α). Then it holds that
f(x − y) − f(x) = f(x(α)) − f(x) = −∫₀¹ ∇f(x − τy)⊤y dτ.
By definition of y,
f(x(α)) = f(x) + ∇f(x)⊤(x(α) − x) − ∫₀¹ (∇f(x − τy) − ∇f(x))⊤y dτ.
Thus:
α(f(x) − f(x(α))) = α∇f(x)⊤(x − x(α)) + α∫₀¹ (∇f(x − τy) − ∇f(x))⊤y dτ.
It holds that
|∫₀¹ (∇f(x − τy) − ∇f(x))⊤y dτ| ≤ (L/2)‖x − x(α)‖²,
which implies
α(f(x) − f(x(α))) ≥ α∇f(x)⊤(x − x(α)) − (αL/2)‖x − x(α)‖².
Using (9.5), we infer
α(f(x) − f(x(α))) ≥ (1 − αL/2)‖x − x(α)‖².
Now,
f(x(α)) − f(x) ≤ (L/2 − 1/α)‖x − x(α)‖²,

where
L/2 − 1/α ≤ −σ/α ⇔ α ≤ 2(1 − σ)/L. □

According to this, the step size strategy terminates successfully, if
β^m ≤ 2(1 − σ)/L < β^(m−1).
Furthermore, we observe that
α = 2β(1 − σ)/L > 0
is a (uniform) lower bound on the step sizes αk .
For the projected gradient method, the following convergence result can now be shown:
Theorem 9.8. Let ∇f be Lipschitz-continuous with modulus L. Let {xk } be generated by
Algorithm 9.1. Then every accumulation point of {xk} is a stationary point for (9.1).
Proof. Owing to the Armijo step size strategy, {f (xk )} is monotonically decreasing. In
addition, {f (xk )} is bounded from below on X. Hence {f (xk )} converges to a limit f ∗ ∈ R.
Conditions (9.7) and (9.10) imply
(σ/αk)‖xk − xk+1‖² ≤ f(xk) − f(xk+1) → 0 (k → ∞),
and since αk ≤ 1, also ‖xk − xk+1‖ → 0.
For all y ∈ X it holds that
∇f(xk)⊤(xk − y) = ∇f(xk)⊤(xk+1 − y) + ∇f(xk)⊤(xk − xk+1)
≤ (1/αk)(xk − xk+1)⊤(xk+1 − y) + ∇f(xk)⊤(xk − xk+1)   (by (9.4))
and
∇f(xk)⊤(xk − y) ≤ ‖xk − xk+1‖((1/αk)‖xk+1 − y‖ + ‖∇f(xk)‖)
(9.11) ≤ ‖xk − xk+1‖((1/α)‖xk+1 − y‖ + ‖∇f(xk)‖).
Let {xk(l) } be a subsequence converging to x∗ , then (9.11) implies

∇f(x∗)⊤(x∗ − y) ≤ 0 ∀y ∈ X.



The projected gradient method has the interesting property of identifying the active set
after finitely many steps. At this point, non-degeneracy of the local minimizer represents an
essential assumption.
Theorem 9.9. Let ∇f be Lipschitz continuous. If {xk} converges to a non-degenerate local minimizer x∗ of (9.1), then there exists an index k0 ∈ N such that A(xk) = A(x∗) for all k ≥ k0.
Proof. Choose α sufficiently small such that Lemma 9.2 holds. Choosing k0 such that
‖xk − x∗‖ < δ4 ∀k ≥ k0 − 1,
where δ4 denotes the constant from the proof of Lemma 9.2, proves the assertion. □

4. Superlinearly convergent methods


The theory presented in the preceding section cannot be extended to iterations of the form
x+ = P(xa − αHa⁻¹∇f(xa))
with positive definite Ha ∈ Sn. This can be shown by means of a rather simple counterexample: for instance, it can easily happen that x+ = xa for all α ≥ 0, although xa is not a local minimizer.
A possible remedy consists in the introduction of the ε-active set
Aε(x) = {i : Ui − ε ≤ xi ∨ xi ≤ Li + ε},
with 0 ≤ ε < min{(1/2)(Ui − Li) : i = 1, . . . , n} =: ε̄. By Iε(x) we denote the complement of Aε(x). The magnitude ε may well be varied depending on xa; then we write εa.
As a model for the reduced Hessian, we use
R(xa, εa, Ha) = PAεa(xa) I PAεa(xa) + PIεa(xa) Ha PIεa(xa),
i.e.,
R(xa, εa, Ha)ij = δij if i ∈ Aεa(xa) or j ∈ Aεa(xa), and R(xa, εa, Ha)ij = (Ha)ij else.
It holds: ∇2R f(xa) = R(xa, 0, Ha). For 0 ≤ ε < ε̄ and positive definite Ha ∈ Sn, we define
xH,ε(α) = P(x − αR(x, ε, Ha)⁻¹∇f(x)).
In view of the step size strategy, the following lemma proves very useful.
Lemma 9.3. Let x ∈ X, 0 ≤ ε < ε̄, and let Ha ∈ Sn be positive definite. Further let ∇f be Lipschitz continuous on X with modulus L. Then there exists ᾱH,ε > 0 such that
(9.12) f(xH,ε(α)) − f(x) ≤ −σ∇f(x)⊤(x − xH,ε(α)) ∀α ∈ [0, ᾱH,ε].
Proof. It holds that
∇f(x)⊤(x − xH,ε(α)) = (PAε(x)∇f(x))⊤(x − xH,ε(α)) + (PIε(x)∇f(x))⊤(x − xH,ε(α)).
We have (xH,ε(α))i = (x(α))i for all i ∈ Aε(x). Consider α with
(9.13) α < ᾱ1 = ε / maxx∈X ‖∇f(x)‖∞.
Note that A(x) ⊂ Aε(x). Thus: Aε(x) = A(x) ∪ (I(x) ∩ Aε(x)). For i ∈ A(x), (9.13) implies either
(x − x(α))i = α∇f(x)i
or
(x − x(α))i = 0.
In both cases it holds that
(x − x(α))i ∇f(x)i ≥ 0.
For i ∈ I(x) ∩ Aε(x) and (x − x(α))i ≠ α∇f(x)i, we have i ∈ A(x(α)) and consequently
(x − x(α))i ∇f(x)i ≥ 0. Altogether, we obtain
(9.14) (PAε(x)∇f(x))⊤(x − x(α)) ≥ 0.
Now consider α with
α ≤ ᾱ2 = ε / maxx∈X ‖R(x, ε, Ha)⁻¹∇f(x)‖∞.

Then every i ∈ Iε(x) is inactive for xH,ε(α) and x(α). Thus it follows:
(PIε(x)∇f(x))⊤(x − xH,ε(α)) = α(PIε(x)∇f(x))⊤Ha⁻¹(PIε(x)∇f(x))
≥ (1/(λmin α))‖PIε(x)(x − x(α))‖²
= (1/λmin)(PIε(x)∇f(x))⊤(x − x(α)),
where λmin > 0 denotes the smallest eigenvalue of Ha. Moreover,
∇f(x)⊤(x − xH,ε(α)) = (PAε(x)∇f(x))⊤(x − xH,ε(α)) + (PIε(x)∇f(x))⊤(x − xH,ε(α))
≥ (PAε(x)∇f(x))⊤(x − x(α)) + (1/λmin)(PIε(x)∇f(x))⊤(x − x(α))
≥ min(1, 1/λmin)∇f(x)⊤(x − x(α))
≥ (min(1, 1/λmin)/α)‖x − x(α)‖²   (by (9.5)).
Now,
f(xH,ε(α)) − f(x) ≤ −∇f(x)⊤(x − xH,ε(α)) + L‖x − xH,ε(α)‖²,
and finally
f(xH,ε(α)) − f(x) ≤ −(1 − Lα max(1, λmin))∇f(x)⊤(x − xH,ε(α)).
Thus (9.12) holds for
α ≤ ᾱ3 = (1 − σ)/(L max(1, λmin)).
Choose ᾱH,ε = min(ᾱ1, ᾱ2, ᾱ3). □

The method realizing these ideas is referred to as the scaled projected gradient method.

Algorithm 9.2 (Scaled projected gradient method).


input: x0 ∈ Rn, 0 < σ < 1, 0 < β < 1.
begin
k := 0
while ‖xk − xk(1)‖ > τr‖x0 − x0(1)‖ + τa
begin
determine εk and a positive definite H k ∈ Sn
solve R(xk, εk, H k)d = −∇f(xk)
find m ∈ N as small as possible such that (9.12) holds for αk = β^m.
xk+1 = P(xk + αk d)
end
end
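The reduced-Hessian model R(x, ε, H) is easy to form explicitly. The following NumPy sketch (our own naming, not from the text) builds it from the ε-active set and performs one scaled step:

```python
import numpy as np

def eps_active(x, L, U, eps):
    """Boolean mask of the eps-active set A_eps(x)."""
    return (x <= L + eps) | (x >= U - eps)

def reduced_model(x, L, U, eps, H):
    """R(x, eps, H): R_ij = delta_ij if i or j is eps-active, (H)_ij else."""
    inactive = ~eps_active(x, L, U, eps)
    # both indices eps-inactive -> take H_ij, otherwise identity entry
    return np.where(np.outer(inactive, inactive), H, np.eye(len(x)))

def scaled_step(x, g, L, U, eps, H, alpha=1.0):
    """One step x(alpha) = P(x - alpha * R(x, eps, H)^{-1} g)."""
    d = np.linalg.solve(reduced_model(x, L, U, eps, H), g)
    return np.minimum(np.maximum(x - alpha * d, L), U)
```

The `np.where(np.outer(...))` construction mirrors the entrywise definition of R: the ε-active block is replaced by the identity, so active components simply take projected gradient steps.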

We state the following global convergence result.



Theorem 9.10. Let ∇f be Lipschitz continuous with modulus L. Let {H k} ⊂ Sn be uniformly positive definite and bounded. Further we suppose the existence of ε and ε̄ with 0 < ε ≤ εk ≤ ε̄ for all k. Then it holds:
limk→∞ ‖xk − xk(1)‖ = 0,
i.e. every accumulation point of {xk} is a stationary point for (9.1).
Furthermore we can deduce that for subsequences {xk(l)} with xk(l) → x∗, it holds that x∗ = x∗(1). If x∗ is non-degenerate, then we have A(xk) = A(x∗) for all sufficiently large k.

4.1. Projected Newton method. If x0 is located sufficiently close to a non-degenerate local minimizer x∗ of (9.1) and H k = ∇2R f(xk), then one refers to the iteration rule
xk+1 = P(xk − α(H k)⁻¹∇f(xk))
as the projected Newton method. Since x0 is sufficiently close to x∗, α = 1 is always accepted. If εk is chosen according to
εk = min(‖xk − xk(1)‖, ε̄),
then the projected Newton method converges locally Q-quadratically to x∗.
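A single projected Newton step with this εk rule can be sketched as follows (our own naming; the value ε̄ = 0.1 is an arbitrary illustrative choice):

```python
import numpy as np

def projected_newton_step(x, grad, hess, L, U, eps_bar=0.1):
    """One projected Newton step with eps = min(||x - x(1)||, eps_bar)."""
    P = lambda z: np.minimum(np.maximum(z, L), U)
    g = grad(x)
    eps = min(np.linalg.norm(x - P(x - g)), eps_bar)  # eps_k rule
    inactive = (x > L + eps) & (x < U - eps)
    H = hess(x)
    # R(x, eps, H): identity on the eps-active block, Hessian on the rest
    R = np.where(np.outer(inactive, inactive), H, np.eye(len(x)))
    return P(x - np.linalg.solve(R, g))
```

For a strictly convex quadratic with a non-degenerate solution on the boundary of the box, a single step from a nearby starting point already lands on the minimizer, illustrating the fast local convergence.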
Theorem 9.11. Let x∗ be a non-degenerate local minimizer of (9.1). If x0 is sufficiently close to x∗, A(x0) = A(x∗), and εk = min(‖xk − xk(1)‖, ε̄), then the projected Newton method converges Q-quadratically to x∗.
Proof. By assumption, we have A(xa) = A(x+) = A(x∗), hence
PA(xa)(xa − x∗) = PA(x+)(x+ − x∗) = 0.
Let δ∗ = min{|x∗i − Ui|, |x∗i − Li| : i ∈ I(x∗)} > 0. Consider x ∈ Rn with ‖x − x∗‖ ≤ δ∗/K, with the constant K from Theorem 9.6. Theorem 9.6 implies εa < δ∗ and ‖xa − x∗‖ ≤ δ∗. For i ∈ Aεa(xa), it holds that i ∈ A(xa) = A(x∗) and thus
Aεa(xa) = A(xa) = A(x∗).
Consequently,
R(xa, εa, ∇2R f(xa)) = ∇2R f(xa).
For ‖xa − x∗‖ sufficiently small, we obtain:
x+ = P(xa − (∇2R f(xa))⁻¹∇f(xa)).
We have
∇f(xa) = ∇f(x∗) + ∇²f(xa)(xa − x∗) + E1
with
E1 = ∫₀¹ (∇²f(x∗ + t(xa − x∗)) − ∇²f(xa))(xa − x∗) dt.
Hence ‖E1‖ ≤ K1‖xa − x∗‖² for a K1 > 0. The necessary condition yields
PI(x)∇f(x∗) = PI(x∗)∇f(x∗) = 0
for x sufficiently close to x∗. Since I(xa) = I(x∗), it follows that
xa − x∗ = PI(xa)(xa − x∗) ∧ PA(xa)(xa − x∗) = 0.

Thus:
PI(xa)∇f(xa) = PI(xa)∇²f(xa)PI(xa)(xa − x∗) + PI(xa)E1
= PA(xa)(xa − x∗) + PI(xa)∇²f(xa)PI(xa)(xa − x∗) + PI(xa)E1
= ∇2R f(xa)(xa − x∗) + PI(xa)E1.
By definition of ∇2R f, we have
PI(xa)(∇2R f(xa))⁻¹∇f(xa) = xa − x∗ + E2
with ‖E2‖ ≤ K2‖xa − x∗‖², K2 > 0. As PI(xa)(P(w)) = P(PI(xa)w) for all w ∈ Rn, it holds
PI(xa)x+ = PI(xa)P(xa − (∇2R f(xa))⁻¹∇f(xa))
= P(PI(xa)(xa − (∇2R f(xa))⁻¹∇f(xa)))
= P(x∗ − E2).
Thus ‖x+ − x∗‖ ≤ K2‖xa − x∗‖². □
Remark 9.1. It should be mentioned that there exist projected variants of the BFGS method.
In order to take account of the box-constraints, the update formula has to be slightly mod-
ified by means of the projections PI(x) and PA(x) . Moreover, under the assumptions of
Theorem 9.11 and a sufficiently good initial approximation of the reduced Hessian, local
superlinear convergence of the projected BFGS method can be proven.
Bibliography

[1] D. Bertsekas, Nonlinear Programming, Athena Scientific Publisher, Belmont, Massachusetts, 1995.
[2] J. F. Bonnans, J. C. Gilbert, C. Lemaréchal, C. Sagastizábal, Optimisation Numérique, Mathématiques
& Applications 27, Springer-Verlag, Berlin, 1997.
[3] A. R. Conn, N. I. M. Gould, P. L. Toint, Trust-Region Methods, SIAM, Philadelphia, 2000.
[4] J. E. Dennis, R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations, SIAM, Philadelphia, 1996.
[5] R. Fletcher, Practical Methods of Optimization I + II, Wiley & Sons Publisher, New York, 1980.
[6] C. Geiger, C. Kanzow, Numerische Verfahren zur Lösung unrestringierter Optimierungsaufgaben,
Springer-Verlag, Berlin, 1999.
[7] P. E. Gill, W. Murray, M. Wright, Practical Optimization, Academic Press, San Diego, 1981.
[8] F. Jarre, J. Stoer, Optimierung, Springer-Verlag, Berlin, 2004.
[9] C. T. Kelley, Iterative Methods for Optimization, Frontiers in Applied Mathematics, SIAM, Philadelphia,
1999.
[10] P. Spellucci, Numerische Verfahren der nichtlinearen Optimierung, Birkhäuser-Verlag, Basel, 1993.

