NLO Notes
University of Edinburgh
Year 2016-2017
Chapter 1
Introduction
Example 1.1
A soft drinks company needs to decide the shape of the cylindrical can for a new
product. The total volume will be a fixed value V , but the length L and the diame-
ter D must be chosen so that the total cost of the design is minimized. Each area unit
of the side has a cost of cs monetary units and each area unit of the bottom and the
top has a cost of ct monetary units. What is the design at minimum cost?
Solution:
The bottom is a circle of radius D/2. Thus, its area is π(D/2)² = πD²/4, and the joint area of bottom and top is twice this amount: πD²/2.
The side is a rectangle of height L and width the length of the circle of the base (πD). Thus, its area is πDL.
The volume of the can is πD²L/4. Therefore, the problem to solve is:
Min. ct πD²/2 + cs πDL
s.t. πD²L/4 = V,
D, L > 0.
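To complete the example, one can eliminate L using the volume constraint and minimize over D; the closed-form step below is our own completion of the argument, not part of the original statement:

\[
L = \frac{4V}{\pi D^2}
\;\Rightarrow\;
C(D) = c_t \frac{\pi D^2}{2} + \frac{4 c_s V}{D},
\qquad
C'(D) = c_t \pi D - \frac{4 c_s V}{D^2} = 0
\;\Rightarrow\;
D^3 = \frac{4 c_s V}{\pi c_t},
\qquad
\frac{L}{D} = \frac{4V}{\pi D^3} = \frac{c_t}{c_s}.
\]

In particular, when cs = ct the optimal can is as tall as it is wide (L = D).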
Example 1.2
An electric company must supply, during one day (we normalize the daily timetable to the interval [0, 1]), a power demand given by a curve h(t). The company may meet these requirements by using two owned turbines T1
and T2 . There is also the possibility of purchasing power from an external turbine T3
that belongs to a central energy grid. Associated with the owned turbines, there is
a cost of bi monetary units per day that the turbine is working and there is a cost
of ci monetary units per megawatt produced, i = 1, 2. The price of power purchased
from the grid is c3 monetary units per megawatt.
Due to the configuration of the power network, electricity must be produced first
by turbine T1 , until it is decided to switch to turbine T2 . That is, at that moment,
turbine T1 does not work any more and electricity is produced exclusively by the
second turbine. This happens until it is decided to stop turbine T2 and use exclusively
turbine T3 . How must the company plan for the electricity production?
[Figure: power demand h(t) (megawatts) over the daytime interval [0, 1], with the switching times marked on the horizontal axis.]
Solution:
Let ti be the time that turbine Ti is working, i = 1, 2.
The energy produced is the area below the curve. Therefore, the objective function
is
F (t1 , t2 ) = b1 t1 + b2 t2 + c1 ∫₀^{t1} h(t) dt + c2 ∫_{t1}^{t1+t2} h(t) dt + c3 ∫_{t1+t2}^{1} h(t) dt.
Besides, it is clear that the times cannot be negative and that we cannot go beyond
the timetable that we are considering. Thus, we have the following constraints:
t1 , t2 ≥ 0, t1 + t2 ≤ 1.
Since, in general, linear problems are much better understood and can be solved much
more efficiently than nonlinear problems, a possible approach to solve a nonlinear
problem is to use a linear approximation. However, it may happen that such an
approximation cannot be obtained or that it is not good enough to solve the problem
with enough accuracy. Therefore we must study and develop theory and algorithms
for nonlinear problems.
Moreover, some properties from linear optimization do not hold any more. For
example, we know that in a linear problem (linear objective function and linear
constraints), every local minimum is global. This is not necessarily true for a non-
linear problem: as we can see in Figure 1.1, x1 = 1 and x2 = 3 are both local minima,
but x1 is not global.
[Figure 1.1: a one-dimensional function with local minima at x1 = 1 and x2 = 3, where x1 is not a global minimum.]
Definition 1.4
A point x∗ is a:
1. Global minimum if f (x∗ ) ≤ f (x) ∀x ∈ Ω.
2. Local minimum if there are ε > 0 and a neighborhood Nε such that f (x∗ ) ≤ f (x) ∀x ∈ Nε , where Nε = {x ∈ Ω / ‖x − x∗ ‖ < ε}.
3. Strict local minimum if there are ε > 0 and a neighborhood Nε such that f (x∗ ) < f (x) ∀x ∈ Nε , x ≠ x∗ , where Nε = {x ∈ Ω / ‖x − x∗ ‖ < ε}.
In Figure 1.2 we see an example of a 2-dimensional function. Point (0,0) is a local
maximum but not global. Points (0,1) and (0,-1) are global minima.
[Figure 1.2: surface and contour plots of the function, showing the local maximum at (0, 0) and the global minima at (0, 1) and (0, −1).]
All the points in the interval [−1, 1] are global minima, but none is strict.
[Figure: a one-dimensional function that is constant on [−1, 1].]
Quite often, we assume that f is a smooth function, that is, a function that has derivatives of all orders at all the points of Ω, which we write as f ∈ C∞(Ω). Or, at least, that f is k times differentiable, with the k-th derivative continuous: f ∈ C k (Ω).
Unless otherwise stated, we will assume in this chapter that Ω is an open set.
First, we establish that certain directions allow us to improve the value of f , that is,
they lead to smaller values.
Proposition 2.2
Let f ∈ C 1 (Ω), x∗ ∈ Ω, and s ∈ Rn . If ∇f (x∗ )t s < 0, then there is a value λ̄ > 0
such that f (x∗ + λs) < f (x∗ ) ∀λ ∈ (0, λ̄).
Proof:
Since ∇f is continuous and ∇f (x∗ )t s < 0, there is λ̄ > 0 such that ∇f (x∗ + λs)t s < 0 ∀λ ∈ (0, λ̄).
For any λ ∈ (0, λ̄), using Taylor's theorem we have that there is some ελ ∈ (0, 1) such that
f (x∗ + λs) = f (x∗ ) + ∇f (x∗ + ελ λs)t λs.
Or, equivalently,
f (x∗ + λs) = f (x∗ ) + λ∇f (x∗ + λε s)t s
for some λε ∈ (0, λ). Since λε < λ̄, the second term is negative. Therefore,
f (x∗ + λs) < f (x∗ ) ∀λ ∈ (0, λ̄).
Definition 2.3
Given f ∈ C 1 (Ω), a vector s ∈ Rn is said to be a descent direction for f at
point x∗ ∈ Ω if ∇f (x∗ )t s < 0.
By changing the sign from “less than” to “greater than” in the previous result and
definition, we can define the notion of ascent direction.
Example 2.4
Let us consider f (x1 , x2 ) = (x1² + x2² − 1)² + (x2² − 1)². See Figure 1.2 for a contour plot of the function.
If x∗ = (1, 1) and s = (−1, 0), then ∇f (x∗ )t s = (4, 4)t (−1, 0) = −4 < 0. This means
that (-1,0) is a descent direction and we can reduce the value of f if we move in that
direction.
On the other hand, we can see that s̃ = (1, 0) is not a descent direction because
(4, 4)t (1, 0) = 4 > 0.
Therefore, the only situation in which there is no ascent or descent direction is when the gradient is zero.
Definition 2.6
Given f ∈ C 1 (Ω), a point x∗ ∈ Ω is a stationary point if ∇f (x∗ ) = 0.
Example 2.7
For f (x1 , x2 ) = (x1² + x2² − 1)² + (x2² − 1)², the gradient is ∇f (x1 , x2 ) = 4 (x1 (x1² + x2² − 1), x2 (x1² + 2x2² − 2))t , so the stationary points are the solutions of
x1 (x1² + x2² − 1) = 0,
x2 (x1² + 2x2² − 2) = 0.
If x1 = 0, then we have in the second equation that 0 = x2 (2x2² − 2) = 2x2 (x2² − 1), for which there are 3 solutions: x2 = 0, x2 = −1, and x2 = 1. Therefore, we have 3 stationary points: (0, 0), (0, −1), and (0, 1).
If x1 ≠ 0, then x1² + x2² − 1 = 0 and the second equation becomes x2 (x2² − 1) = 0. If x2² − 1 = 0, then x1 = 0 and we obtain some of the stationary points that we had already found; thus x2 = 0 and x1 = ±1.
Therefore, function f has 5 stationary points: (0, 0), (0, −1), (0, 1), (−1, 0), and (1, 0).
There are several more results on positive semidefinite matrices. However, it is not within the scope of this course to cover them all and we are going to review only those which are most useful to us. More specifically, with the definitions that we
have seen, it is not easy to check if a matrix is positive semidefinite. So, we need a
different characterization.
Lemma 2.9
Let A be an n × n symmetric matrix.
1. A is positive semidefinite if, and only if, all its eigenvalues are nonnegative.
2. A is positive definite if, and only if, all its eigenvalues are positive.
3. A is negative semidefinite if, and only if, all its eigenvalues are nonpositive.
4. A is negative definite if, and only if, all its eigenvalues are negative.
5. A is indefinite if it has at least one negative eigenvalue and at least one positive
eigenvalue.
Although the characterization is important, it is more useful in practice to use a
test based on the concept of leading principal minor. Given an n × n symmetric
matrix A, the k-th leading principal minor, 1 ≤ k ≤ n, is the determinant of the
upper left k × k submatrix and we denote it by Mk (A).
Theorem 2.10 (Leading Principal Minors Criterion)
Let A be an n × n symmetric matrix.
1. A is positive definite if, and only if, M1 (A) > 0, M2 (A) > 0,. . . , Mn (A) > 0.
2. A is negative definite if, and only if, all its leading principal minors of odd order
are negative and all its leading principal minors of even order are positive.
Example 2.11
Let
A = [4 2 3; 2 3 2; 3 2 4],   B = [2 2 2; 2 2 2; 2 2 −1],   and   C = [−4 1 1; 1 −4 1; 1 1 −4].
Applying the criterion: M1 (A) = 4 > 0, M2 (A) = 8 > 0, and M3 (A) = det A = 13 > 0, so A is positive definite. For C, M1 (C) = −4 < 0, M2 (C) = 15 > 0, and M3 (C) = −50 < 0, so C is negative definite. For B, M2 (B) = 0, so the criterion does not decide.
Note that the theorem does not provide information concerning positive or negative
semidefiniteness, which is harder to check.
In Example 2.11, matrix B is indefinite because it has both negative and positive elements on its diagonal.
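Both tests are easy to run numerically. The following snippet (ours, using NumPy; not part of the notes) applies Lemma 2.9 and Theorem 2.10 to the matrices of Example 2.11:

import numpy as np

def leading_principal_minors(A):
    """Return the determinants M_1(A), ..., M_n(A) of the upper-left blocks."""
    n = A.shape[0]
    return [np.linalg.det(A[:k, :k]) for k in range(1, n + 1)]

A = np.array([[4, 2, 3], [2, 3, 2], [3, 2, 4]], dtype=float)
B = np.array([[2, 2, 2], [2, 2, 2], [2, 2, -1]], dtype=float)
C = np.array([[-4, 1, 1], [1, -4, 1], [1, 1, -4]], dtype=float)

for name, M in [("A", A), ("B", B), ("C", C)]:
    eigs = np.linalg.eigvalsh(M)             # Lemma 2.9: classify via eigenvalues
    minors = leading_principal_minors(M)     # Theorem 2.10: leading principal minors
    print(name, "eigenvalues:", np.round(eigs, 3), "minors:", np.round(minors, 3))
# A: minors 4, 8, 13 (all positive)   -> positive definite.
# B: M_2(B) = 0, eigenvalues of mixed sign -> indefinite.
# C: minors -4, 15, -50 (alternating) -> negative definite.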
Given a fixed value λ ∈ (0, λ̄), using Taylor's theorem, we know that there is some ελ ∈ (0, 1) such that
f (x∗ + λp) = f (x∗ ) + ∇f (x∗ )t λp + ½ (λp)t ∇²f (x∗ + ελ λp) (λp).
Or, equivalently,
f (x∗ + λp) = f (x∗ ) + λ∇f (x∗ )t p + ½ λ² pt ∇²f (x∗ + λε p) p
for some λε ∈ (0, λ).
Therefore, f (x∗ + λp) < f (x∗ ) ∀λ ∈ (0, λ̄) and x∗ cannot be a minimum.
Note, however, that the previous condition is not sufficient: f (x) = x3 satisfies that
f 0 (0) = f 00 (0) = 0, but x = 0 is neither a local minimum nor a local maximum.
The good news is that there is a condition that, if met, tells us that we have found
a local minimum or maximum.
Theorem 2.14 (Sufficient Second-Order Optimality Condition)
Let f ∈ C 2 (Ω) and x∗ ∈ Ω.
1. If ∇f (x∗ ) = 0 and ∇2 f (x∗ ) is positive definite, then x∗ is a strict local minimum.
2. If ∇f (x∗ ) = 0 and ∇2 f (x∗ ) is negative definite, then x∗ is a strict local maximum.
Proof:
Since ∇²f is continuous and positive definite at x∗ , there is ε > 0 such that ∇²f (x) is positive definite for all x in the ball B = {y ∈ Rn / ‖y − x∗ ‖ < ε}.
Let p ∈ Rn , p ≠ 0, be such that ‖p‖ < ε. Then, x∗ + p ∈ B and, using that ∇f (x∗ ) = 0, by Taylor's formula we have that
f (x∗ + p) = f (x∗ ) + ∇f (x∗ )t p + ½ pt ∇²f (x∗ + λp) p = f (x∗ ) + ½ pt ∇²f (x∗ + λp) p
for some λ ∈ (0, 1). As x∗ + λp ∈ B, then pt ∇²f (x∗ + λp) p > 0 and f (x∗ + p) > f (x∗ ). Therefore, x∗ is a strict local minimum.
However, the conditions of the previous theorem are not necessary. A point can
be a local minimum without the Hessian matrix being positive definite. Consider
f (x) = x4 at x = 0.
Example 2.15
In the function of Figure 1.1, f 00 (x) = 4x3 − 15x2 + 10x + 5. We had seen that
x = 1 and x = 3 are stationary points. Now, since f 00 (1) = 4 and f 00 (3) = 8, we can
guarantee that both points are strict local minima.
Example 2.16
In the function of Example 2.7,
∇²f (x1 , x2 ) = 4 [3x1² + x2² − 1, 2x1 x2 ; 2x1 x2 , x1² + 6x2² − 2].
We had seen that the stationary points are (0, 0), (−1, 0), (1, 0), (0, 1), and (0, −1).
We have that:
• ∇²f (0, 0) = 4 [−1 0; 0 −2] ≺ 0. Thus, (0, 0) is a strict local maximum.
• ∇²f (−1, 0) = ∇²f (1, 0) = 4 [2 0; 0 −1], which is an indefinite matrix. So, these points are neither local minima nor maxima.
• ∇²f (0, −1) = ∇²f (0, 1) = 4 [0 0; 0 4] ⪰ 0. We do not have enough information to decide whether we have here a local minimum or maximum (or neither).
Definition 2.17
A stationary point that is neither a local maximum nor a local minimum is a saddle point.
Recall that f : Rn → R is convex if f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y) for all x, y ∈ Rn and all λ ∈ [0, 1]. This means that, for any two points, the segment joining their images lies above the graph of the function.
[Figure: a convex function; the chord joining any two points of the graph lies above the graph.]
The gradient inequality: if f ∈ C¹(Rn ), then f is convex if, and only if, f (y) ≥ f (x) + ∇f (x)t (y − x) ∀x, y ∈ Rn .
Proof:
“⇒” If f is convex, for all λ ∈ (0, 1] we have that
f (λy + (1 − λ)x) ≤ λf (y) + (1 − λ)f (x);
f (x + λ(y − x)) − f (x) ≤ λf (y) − λf (x);
[f (x + λ(y − x)) − f (x)] / λ ≤ f (y) − f (x).
Taking limits when λ tends to 0:
lim_{λ→0} [f (x + λ(y − x)) − f (x)] / λ ≤ f (y) − f (x).
Now,
lim_{λ→0} [f (x + λ(y − x)) − f (x)] / λ = ∇f (x)t (y − x),
because the left-hand side is the definition of the directional derivative along vector y − x and the equality is true for any differentiable function. Therefore, f (y) ≥ f (x) + ∇f (x)t (y − x).
“⇐” Suppose that the gradient inequality is true. For all x, y ∈ Rn , and λ ∈ [0, 1],
let aλ = λx + (1 − λ)y. We have that
f (x) ≥ f (aλ ) + ∇f (aλ )t (x − aλ ),
and
f (y) ≥ f (aλ ) + ∇f (aλ )t (y − aλ ).
If we multiply these inequalities by λ and 1 − λ, respectively, and add them together:
λf (x) + (1 − λ)f (y) ≥ f (aλ ) + ∇f (aλ )t (λx + (1 − λ)y − aλ ) = f (λx + (1 − λ)y).
Therefore, f is convex.
The previous result states that, if f is convex, then its first order Taylor’s approxi-
mation underestimates f .
[Figure: the graph of y = f (x) lies above its tangent y = f (x0 ) + ∇f (x0 )(x − x0 ) at any point x0 .]
The gradient inequality is used in the proof of the following important result:
Theorem 2.20
Let f : Ω → R be convex and x∗ ∈ Ω.
1. If x∗ is a local minimum, then x∗ is also a global minimum.
2. If f ∈ C 1 (Ω) and x∗ is a stationary point, then x∗ is also a global minimum.
Proof:
1. Assume that x∗ is not a global minimum. Then, there is y ∈ Ω, y 6= x∗ , such that
f (y) < f (x∗ ). Since f is convex, for all λ ∈ (0, 1] we have that
f (λy + (1 − λ)x∗ ) ≤ λf (y) + (1 − λ)f (x∗ ) < λf (x∗ ) + (1 − λ)f (x∗ ) = f (x∗ ).
On the other hand, because x∗ is a local minimum, there is ε > 0 small enough
such that
f (x∗ ) ≤ f (x) ∀x / ‖x∗ − x‖ < ε.
Now, we define xλ = λy + (1 − λ)x∗ and we take λ = ε / (2‖x∗ − y‖). We have that
‖x∗ − xλ ‖ = ‖x∗ − λy − (1 − λ)x∗ ‖ = ‖λx∗ − λy‖ = λ‖x∗ − y‖ = ε/2.
This means that f (x∗ ) ≤ f (xλ ), but we have seen that f (xλ ) < f (x∗ ), which is a
contradiction.
2. Since x∗ is a stationary point, ∇f (x∗ ) = 0. Using the gradient inequality we have
that
f (y) ≥ f (x∗ ) + ∇f (x∗ )t (y − x∗ ) = f (x∗ ) ∀y ∈ Ω.
Thus, x∗ is a global minimum.
In general, convexity is not easy to check. However, if the function is twice differen-
tiable, we have a characterization of convexity.
Proposition 2.21
A function f ∈ C 2 (Rn ) is convex if, and only if, ∇2 f (x) is positive semidefinite for
all x ∈ Rn .
Example 2.22
Let us study the minima of the function f (x1 , x2 ) = x1⁴ + x2⁴ + x1² + x2².
The gradient is ∇f (x1 , x2 ) = (4x1³ + 2x1 , 4x2³ + 2x2 )t and the only stationary point is x∗ = (0, 0).
The Hessian matrix is ∇²f (x1 , x2 ) = [12x1² + 2, 0; 0, 12x2² + 2]. Since its eigenvalues, 12x1² + 2 > 0 and 12x2² + 2 > 0, are positive for every x, the Hessian is positive semidefinite everywhere and f is convex. Therefore, x∗ = (0, 0) is a global minimum.
Chapter 3
Line Search Methods in Unconstrained Optimization
In a line search strategy, our algorithm chooses a direction dk and searches along this
direction from the current iterate xk for a new iterate with a smaller function value.
That is, we solve the following one-variable problem to decide the step length α along dk that we will move:
Min. f (xk + αdk )
s.t. α > 0.
An exact solution of this problem is usually expensive and unnecessary. So, quite
often this problem is solved approximately to a level of accuracy that is satisfactory
enough and the following iterate is generated. At the new point, a new search
direction is generated and a new step length is computed, and so on.
In a trust region strategy, a model function mk is defined so that its behavior near the
current iterate xk is similar to the one of the objective function f . Since mk may not
be a general good approximation of f , we want to stay in a “small” region around xk .
We solve the following problem to find a direction d so that the next iterate is xk + d:
Min. mk (xk + d)
s.t. xk + d is in the trust region.
Usually, a value δ > 0 is chosen and the trust region is set to {x ∈ Rn / ‖x − xk ‖ < δ}.
As we can see, broadly speaking, the difference between line search and trust region
methods is that in a line search method, we choose first the direction and then
decide the step length, while in a trust region method, we set first a maximum step
length and then look for a direction. In this chapter, we are going to study the first
technique whereas the second one will be studied in Chapter 5.
Recall that the problem that we are solving is
Min. f (x)
s.t. x ∈ Rn ,
and that, at each iterate xk , a line search method moves along a descent direction dk , that is, one with
∇f (xk )t dk < 0.
When we use a line search method, we need to decide a starting point and, at each
iteration, a descent direction and a step size. The starting point is usually chosen
arbitrarily (unless there is some additional information about where a local minimum
could lie). Next we will see how to deal with the other two issues.
[Figure 3.1: the first iterates x0 , x1 , x2 , x3 of a line search method on a contour plot.]
If we do not want to use a constant step size for every step (among other reasons,
because it is not clear what step size we should use), then we can try to solve exactly
the previous problem and we will be using an exact line search strategy. However,
this problem is usually difficult for nonlinear functions. Instead, inexact methods
are more commonly used in practice, methods that do not solve the problem exactly
but get a “good” α instead.
Since dk is a descent direction, there will be a certain αt for which the inequality of
Step 2 holds.
However, we want not only to improve the value of f , but that this decrease is
“big” enough. Otherwise, we could end up having very small improvements and the
algorithm would not converge. A rule used in practice is the Armijo condition:
f (xk + αk dk ) ≤ f (xk ) + c1 αk ∇f (xk )t dk ,
for a constant c1 ∈ (0, 1) chosen beforehand.
What the rule says is that the decrease should be proportional to the step size αk
and the directional derivative ∇f (xk )t dk . If we define the one-dimensional functions
Φ(α) := f (xk + αdk ) and `(α) := f (xk ) + c1 α∇f (xk )t dk , then we accept those values
of α for which Φ(α) is below the linear function `(α) (see Figure 3.2).
[Figure 3.2: Φ(α) = f (xk + αdk ) and the line ℓ(α) = f (xk ) + c1 α∇f (xk )t dk ; the acceptable step sizes are those where Φ lies below ℓ.]
But we need something more than just the Armijo condition for the next iterate to be good enough, because this condition is satisfied for any value of α sufficiently small.
This is why a second condition, called curvature condition, is added to guarantee
that α is large enough:
∇f (xk + αk dk )t dk ≥ c2 ∇f (xk )t dk ,
for another constant c2 ∈ (c1 , 1) chosen beforehand. Since the left-hand side is Φ0 (α)
and the right-hand side is c2 Φ0 (0), what we are requiring is that the slope of Φ
at αk is not too negative, that is, we are not too close to the starting point xk (see
Figure 3.3). Usual values of c2 are 0.9 for Newton and quasi-Newton methods and
0.1 for a nonlinear conjugate gradient method (techniques which we will study in
some weeks).
[Figure 3.3: the curvature condition; the acceptable values of α are those where the slope of Φ is at least c2 Φ′(0).]
Now it is natural to wonder if we can always find a step size that satisfies the Wolfe
conditions. The answer is affirmative under some mild conditions.
Proposition 3.3
Let f ∈ C 1 (Rn ), 0 < c1 < c2 < 1, and let dk be a descent direction for f at xk ∈ Rn .
If f is bounded below along the ray {xk + αdk / α > 0}, then there exist intervals
of step lengths that satisfy Wolfe conditions.
Note that, since now we are not using parameter c2 , we write β instead of c1 .
It is easy to see that, if we use backtracking, it is always possible to find (under mild
conditions) a step size α that satisfies the Armijo condition. We are going to need
Lipschitz continuity.
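As a concrete sketch of backtracking (our code; the function and parameter names are not from the notes):

import numpy as np

def backtracking_armijo(f, grad_f, x, d, alpha0=1.0, beta=1e-4, tau=0.5):
    """Shrink alpha by a factor tau until the Armijo condition holds.

    d must be a descent direction at x, i.e. grad_f(x) @ d < 0.
    """
    fx = f(x)
    slope = grad_f(x) @ d       # directional derivative; negative by assumption
    alpha = alpha0
    while f(x + alpha * d) > fx + beta * alpha * slope:
        alpha *= tau            # backtrack
    return alpha

By Corollary 3.8 below, this loop terminates after finitely many reductions when ∇f is Lipschitz continuous.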
Definition 3.5
A function f : Rn → Rm is Lipschitz continuous if there is L > 0 such that
‖f (x) − f (y)‖ ≤ L‖x − y‖ ∀x, y ∈ Rn .
It can be proved that f ∈ C 2 (Rn ) has gradient Lipschitz continuous with Lipschitz
constant L if, and only if, ||∇2 f (x)|| ≤ L for all x ∈ Rn .
Lemma 3.6
Let f ∈ C¹(Rn ) and assume that ∇f is Lipschitz continuous with Lipschitz constant L. Then, for all x, p ∈ Rn ,
f (x + p) ≤ f (x) + ∇f (x)t p + L‖p‖²/2.
Proof:
Given v ∈ Rn and λ ∈ R, if we define g(λ) := f (x + λv), we have that g′(λ) = ∇f (x + λv)t v and that f (x + v) − f (x) = ∫₀¹ g′(λ) dλ. Therefore,
f (x + p) − f (x) − ∇f (x)t p = ∫₀¹ [∇f (x + λp) − ∇f (x)]t p dλ
≤ ∫₀¹ ‖∇f (x + λp) − ∇f (x)‖ ‖p‖ dλ ≤ ∫₀¹ L‖λp‖ ‖p‖ dλ = L‖p‖² ∫₀¹ λ dλ = L‖p‖²/2.
When the gradient is Lipschitz continuous, there is an interval of α that satisfies the
Armijo condition.
Lemma 3.7
Let f ∈ C¹(Rn ), β ∈ (0, 1), x ∈ Rn , and let d ∈ Rn be a descent direction for f at x. If ∇f is Lipschitz continuous with Lipschitz constant L, then the Armijo condition f (x + αd) ≤ f (x) + αβ∇f (x)t d is satisfied for every α ∈ (0, ω], where ω = 2(β − 1)∇f (x)t d / (L‖d‖²) > 0.
Proof:
For any α ∈ (0, ω], using Lemma 3.6,
f (x + αd) ≤ f (x) + α∇f (x)t d + Lα²‖d‖²/2
≤ f (x) + α∇f (x)t d + (Lα/2) · (2(β − 1)∇f (x)t d / (L‖d‖²)) · ‖d‖² = f (x) + αβ∇f (x)t d.
Moreover, as we can see next, if we apply backtracking and use the Armijo rule, we
will find a step size in a finite number of iterations.
Corollary 3.8
Let β, τ ∈ (0, 1), f ∈ C 1 (Rn ), xk ∈ Rn , and let dk ∈ Rn be a descent direction
for f at xk . If ∇f is Lipschitz continuous with Lipschitz constant L, then the
step size search generated by backtracking Armijo (BA) terminates with a step size αk ≥ min{α0 , τ ωk }, where α0 is the initial value that starts the backtracking search and ωk = 2(β − 1)∇f (xk )t dk / (L‖dk ‖²).
Proof:
If α0 satisfies the Armijo condition, then αk := α0 . Otherwise, we multiply this value repeatedly by τ , obtaining αt = τ t α0 , until we have that αt ≤ ωk and αt−1 > ωk . This last inequality implies that τ ωk < τ αt−1 = αt =: αk .
If we use backtracking with the Armijo condition, then we have the following con-
vergence result for the General Line Search Method (we write for short BA-GLSM):
K1 = {k / αk = α0 }, K2 = {k / αk < α0 }.
Next, we will study a popular choice of descent direction: the steepest descent
method.
In order to see this, we define Φ(α) := f (xk + αdk ), where the direction dk has norm one and ∇f (xk ) ≠ 0. We are interested in minimizing the rate of change of function f at point xk along direction dk (i.e., getting a value as negative as possible), that is, minimizing
Φ′(0) = ∇f (xk )t dk = ‖∇f (xk )‖ ‖dk ‖ cos θ = ‖∇f (xk )‖ cos θ,
where θ is the angle formed by ∇f (xk ) and dk . Clearly, the minimum is achieved for θ = π. Therefore, the direction (among those with unitary norm) that gives the largest decrease is
dk = −∇f (xk ) / ‖∇f (xk )‖.
Besides, it is obvious that dk = −∇f (xk )/‖∇f (xk )‖ is a descent direction because ∇f (xk )t dk = −‖∇f (xk )‖ < 0.
This approach can be combined with the Armijo condition. Moreover, applying
Theorem 3.9 we have a convergence result for the Backtracking Armijo Steepest
Descent Method (BA-SDM).
Theorem 3.12 (Convergence for BA-SDM)
Let f ∈ C 1 (Rn ). If ∇f is Lipschitz continuous, then the iterates generated by
BA-SDM have three possibilities:
1. ∇f (xk̃ ) = 0 for some k̃ ≥ 0.
2. lim f (xk ) = −∞.
k→+∞
3. lim ||∇f (xk )|| = 0.
k→+∞
The theorem states that, if f is bounded below, then the steepest descent method
is globally convergent. The analogous result when using Wolfe conditions is an
immediate consequence of Theorem 3.10. However, it must be noted that in practice,
its convergence is usually quite slow.
Example 3.13
Let us apply the steepest descent method (with backtracking Armijo) to find a stationary point of the function f (x, y) = x⁴ + 2x³ + 2x² + y² − 2xy.
The starting point is x0 = (1, 1), where f (x0 ) = 4 and ∇f (x0 ) = (12, 0). We need a step size α0 such that (taking, say, c1 = 10⁻⁴ and backtracking from α = 1 with τ = 0.5, which is consistent with the numbers below)
f (1 − 12α0 , 1) ≤ 4 − 0.0144α0 .
In this case, we obtain α0 = 0.125.
Step 4: The new point is x1 := x0 − α0 ∇f (x0 ) = (−0.5, 1).
Step 5: ∇f (x1 ) = (−3, 3). Since ‖∇f (x1 )‖ > ε, we must generate a new point.
Step 6: We need to compute a step size α1 such that
f (−0.5 + 3α1 , 1 − 3α1 ) ≤ 2.3125 − 0.0018α1 ,
which yields α1 = 0.25.
We can summarize in the following table the information of the iterations that we have done, and we can see these iterations graphically in Figure 3.4:
(x, y)            ∇f                   f
(1, 1)            (12, 0)              4
(−0.5, 1)         (−3, 3)              2.3125
(0.25, 0.25)      (0.9375, 0)          0.0977
(0.0156, 0.25)    (−0.4362, 0.4688)    0.0552
[Figure 3.4: the iterates of the method on a contour plot of f .]
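The table can be reproduced with a short script (ours; the parameters c1 = 10⁻⁴, τ = 0.5, and the initial trial step α = 1 are our reading of the numbers in the example):

import numpy as np

def f(v):
    x, y = v
    return x**4 + 2*x**3 + 2*x**2 + y**2 - 2*x*y

def grad(v):
    x, y = v
    return np.array([4*x**3 + 6*x**2 + 4*x - 2*y, 2*y - 2*x])

x = np.array([1.0, 1.0])
for k in range(4):
    g = grad(x)
    d = -g                                   # steepest descent direction
    alpha = 1.0
    while f(x + alpha * d) > f(x) + 1e-4 * alpha * (g @ d):
        alpha *= 0.5                         # backtracking Armijo
    print(k, x, g, round(f(x), 4), alpha)
    x = x + alpha * d
# Reproduces the table: (1, 1) -> (-0.5, 1) -> (0.25, 0.25) -> (0.0156, 0.25).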
We say that a sequence {xk } converging to x∗ does so linearly if there is a constant r ∈ (0, 1) such that ‖xk+1 − x∗ ‖ ≤ r ‖xk − x∗ ‖ for all k sufficiently large. This means that the distance to the limit x∗ decreases at each iteration by at least a constant factor r. The convergence is superlinear if the ratio ‖xk+1 − x∗ ‖/‖xk − x∗ ‖ tends to 0, and it is sublinear if it tends to 1.
There is quadratic convergence if there is M > 0 (we do not require that M < 1) such that
lim_{k→+∞} ‖xk+1 − x∗ ‖ / ‖xk − x∗ ‖² = M.
Example 3.15
Let us consider the following sequences:
an = 1/n,   bn = 1/2ⁿ,   cn = 1/nⁿ,   dn = 1/2^(2ⁿ).
All of them converge to 0, but an does it sublinearly, bn linearly (with rate 0.5),
cn superlinearly, and dn quadratically.
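A quick numerical illustration of the four rates (our snippet, not from the notes):

# Successive ratios for the sequences of Example 3.15.
for n in range(1, 6):
    a_ratio = (1 / (n + 1)) / (1 / n)                    # -> 1   (sublinear)
    b_ratio = 2.0**-(n + 1) / 2.0**-n                    # = 0.5  (linear)
    c_ratio = (n + 1)**-(n + 1.0) / n**-float(n)         # -> 0   (superlinear)
    d_ratio = 2.0**-(2**(n + 1)) / (2.0**-(2**n))**2     # = 1    (quadratic, M = 1)
    print(f"{a_ratio:.3f} {b_ratio:.3f} {c_ratio:.3e} {d_ratio:.3f}")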
The following result tells us that the convergence rate of the steepest descent method
is linear if we use exact line search. In general, convergence does not improve if
inexact line search is used instead.
Theorem 3.16
Let f ∈ C 2 (Rn ) and assume that the iterates of the steepest descent method with
exact line search converge to a point x∗ for which the Hessian matrix ∇2 f (x∗ ) is
positive definite. If r is a scalar such that
r ∈ ((λn − λ1 ) / (λn + λ1 ), 1),
where 0 < λ1 ≤ · · · ≤ λn are the eigenvalues of ∇²f (x∗ ), then
[f (xk+1 ) − f (x∗ )] / [f (xk ) − f (x∗ )] ≤ r²
for all k sufficiently large.
3.2.2. Scaling
The steepest descent method is sensitive to scaling. Therefore, depending on the scale that we use, the convergence can be faster or slower.
Example 3.17
Let us study the convergence of the steepest descent method with exact line search
for the quadratic function
q(x1 , x2 ) = ½ (a x1² + x2²) = ½ xt [a 0; 0 1] x,
where a > 0 is a fixed parameter.
It is easy to see that x∗ = (0, 0) is the only global minimum. Nevertheless, we are
going to analyze the convergence rate.
Since we have a quadratic form, it is easy to see (it is left as an exercise) that the optimal step size when we use exact line search is
α∗ = ‖∇q(x)‖² / (∇q(x)t B∇q(x)),
where B = [a 0; 0 1] and ∇q(x) = Bx. Thus, ∇q(x) = (ax1 , x2 )t and α∗ = (a²x1² + x2²) / (a³x1² + x2²).
So:
1. x0 = (1, a).
2. x1 = x0 − α0 ∇q(x0 ) = (1, a) − (2/(1 + a))(a, a) = ((1 − a)/(1 + a)) (1, −a).
3. x2 = x1 − α1 ∇q(x1 ) = ((1 − a)/(1 + a)) (1, −a) − ((1 − a)/(1 + a)) (2/(1 + a)) (a, −a) = ((1 − a)/(1 + a))² (1, a).
In general, it can be seen that xk = ((1 − a)/(1 + a))ᵏ (1, (−1)ᵏ a). Therefore, lim_{k→+∞} xk = (0, 0) and
lim_{k→+∞} ‖xk+1 − (0, 0)‖ / ‖xk − (0, 0)‖ = lim_{k→+∞} |(1 − a)/(1 + a)| ‖(1, (−1)k+1 a)‖ / ‖(1, (−1)k a)‖ = |(1 − a)/(1 + a)|.
Since |(1 − a)/(1 + a)| < 1, there is linear convergence.
However, lim_{a→+∞} |(1 − a)/(1 + a)| = 1. This means that, the larger a is, the slower the convergence will be. Indeed, ∇²q = [a 0; 0 1], which means that the eigenvalues of ∇²q(0) are λ1 = 1 < λ2 = a (assuming now a > 1). So, in the previous theorem, r ∈ ((a − 1)/(a + 1), 1), which can be very close to 1.
In Figure 3.5, we can see up to 25 iterations of the steepest descent method as we change the value of a.
[Figure 3.5: iterates of the method for (a) a = 1, (b) a = 2, (c) a = 5, and (d) a = 20; the zigzagging worsens as a grows.]
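The closed form for xk is easy to check numerically; the sketch below (our code) runs exact-line-search steepest descent on q and compares with the formula:

import numpy as np

a = 5.0
B = np.diag([a, 1.0])
x = np.array([1.0, a])                       # x0 = (1, a)
for k in range(1, 6):
    g = B @ x                                # gradient of q(x) = 0.5 x^t B x
    x = x - ((g @ g) / (g @ B @ g)) * g      # optimal (exact line search) step
    pred = ((1 - a) / (1 + a))**k * np.array([1.0, (-1.0)**k * a])
    print(k, x, pred)                        # the two vectors coincide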
In practice, the slow convergence may mean convergence to a wrong point due to the slow-progressing iterates and the accumulation of round-off errors.
Chapter 4
Newton's Method and Quasi-Newton Methods
In the previous chapter, we have studied methods that use information from the
gradient of the function (first-order methods). In this chapter, we are going to
see that if we use information from the Hessian matrix (second-order methods),
then we can obtain much better algorithms. In particular, we will start with one of the best-known algorithms in mathematics, due to its easy implementation and good performance: Newton's method.
Unless otherwise stated, in this chapter we assume that all the functions are in C 2 (Rn ).
Suppose first that we want to solve the equation f (x) = 0 for a function f ∈ C¹(R). Given an iterate xk , Taylor's approximation gives 0 = f (x∗ ) ≈ f (xk ) + f ′(xk )(x∗ − xk ). Since on the left-hand side we have that f (x∗ ) = 0, we generate a new point xk+1 by solving f (xk ) + f ′(xk )(x − xk ) = 0.
If f ′(xk ) ≠ 0, then
xk+1 = xk − f (xk ) / f ′(xk ).
We have thus the following algorithm:
Geometrically, what we are doing is considering the tangent line to function f (x)
at point (xk , f (xk )). Then, xk+1 is the point where this tangent line intersects the
horizontal axis. See Figure 4.1.
[Figure 4.1: the tangent line to f at (xk , f (xk )) crosses the horizontal axis at xk+1 .]
Example 4.2
Let us solve the equation x³ − 1 = 0 starting at x0 = 7.
Since f ′(x) = 3x², then x − f (x)/f ′(x) = x − (x³ − 1)/(3x²) = (2x)/3 + 1/(3x²), and we have the following iterations:
k    xk      f (xk )    f ′(xk )
0    7.00    342.00     147.00
1    4.67    101.07     65.52
2    3.13    29.69      29.41
3    2.12    8.55       13.50
4    1.49    2.30       6.64
5    1.14    0.49       3.92
6    1.02    0.05       3.10
7    1.00    0.00       3.00
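The table is generated by the following snippet (our transcription of the iteration):

def newton_root(f, fprime, x, tol=1e-2, max_iter=20):
    """Newton's method for a single equation f(x) = 0."""
    for k in range(max_iter):
        print(k, round(x, 2), round(f(x), 2), round(fprime(x), 2))
        if abs(f(x)) < tol:
            return x
        x = x - f(x) / fprime(x)      # tangent-line step
    return x

newton_root(lambda x: x**3 - 1, lambda x: 3 * x**2, 7.0)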
If xk+1 is such that f (xk ) + J(xk )(xk+1 − xk ) = 0 and J(xk ) is nonsingular, then xk+1 = xk − J(xk )⁻¹ f (xk ).
Example 4.4
Let us solve the system
2x1 + x2 − 1 = 0,
2x1 + x22 − 3 = 0.
We define f1 (x) := 2x1 + x2 − 1 and f2 (x) := 2x1 + x2² − 3. It is easy to see that if we start at x0 = (0, 0), we converge to (1, −1). If we start at x0 = (5, 5), we converge to (−0.5, 2), which is a different solution.
In order to study the convergence of the method, we will use the following alternative way of writing Taylor's theorem: if f ∈ C¹(Rn , Rn ) with Jacobian J, then
f (y) − f (x) = ∫₀¹ J(x + t(y − x)) (y − x) dt.
Definition 4.6
A function f : Rn → Rm is locally Lipschitz continuous if for every x̄ ∈ Rn there is a neighborhood of x̄ in which f is Lipschitz continuous, that is, there are ε, L > 0 such that
‖f (x) − f (y)‖ ≤ L‖x − y‖ for all x, y with ‖x − x̄‖ < ε and ‖y − x̄‖ < ε.
Provided that we start “close” to a solution, Newton’s method converges very fast.
Theorem 4.7
Let f ∈ C 1 (Rn , Rn ) and let x∗ ∈ Rn be such that f (x∗ ) = 0 and the Jacobian of f
at x∗ , J(x∗ ), is nonsingular. Assume also that the Jacobian of f is locally Lipschitz
continuous in a neighborhood of x∗ and that we generate the iterates xk+1 := xk −
J −1 (xk )f (xk ). If the starting point x0 is sufficiently close to x∗ , then:
1. The sequence of iterates {xk } converges quadratically to x∗ .
2. The sequence of norms {‖f (xk )‖} converges quadratically to zero.
Proof:
1. Since J(x∗ ) is nonsingular, then ‖J⁻¹(x∗ )‖ > 0. (The matrix norm that we are considering is the one induced by the Euclidean norm: given A, ‖A‖₂ := √(λn (At A)), where λn (At A) is the largest eigenvalue of At A.)
As J is continuous, there is δ > 0 such that ‖J⁻¹(x)‖ ≤ 2‖J⁻¹(x∗ )‖ if ‖x − x∗ ‖ < δ.
On the other hand, as f (x∗ ) = 0, we have that
f (xk ) = f (xk ) − f (x∗ ) = ∫₀¹ J(x∗ + t(xk − x∗ )) (xk − x∗ ) dt.
Therefore,
‖J(xk )(xk − x∗ ) − (f (xk ) − f (x∗ ))‖ = ‖J(xk )(xk − x∗ ) − ∫₀¹ J(x∗ + t(xk − x∗ ))(xk − x∗ ) dt‖
= ‖∫₀¹ [J(xk ) − J(x∗ + t(xk − x∗ ))] (xk − x∗ ) dt‖ ≤ ∫₀¹ ‖J(xk ) − J(x∗ + t(xk − x∗ ))‖ ‖xk − x∗ ‖ dt
≤ ∫₀¹ L(1 − t)‖xk − x∗ ‖² dt = (L/2)‖xk − x∗ ‖²,
where the Lipschitz bound is valid when ‖xk − x∗ ‖ < ε. Hence,
‖xk+1 − x∗ ‖ = ‖J⁻¹(xk ) [J(xk )(xk − x∗ ) − f (xk )]‖ ≤ 2‖J⁻¹(x∗ )‖ (L/2)‖xk − x∗ ‖² = L̃ ‖xk − x∗ ‖²,
with L̃ = L‖J⁻¹(x∗ )‖. Note that this value is positive because we have shown earlier that ‖J⁻¹(x∗ )‖ > 0.
Assume finally that we choose x0 such that ‖x0 − x∗ ‖ < min{ε, δ, 1/(2L̃)}. Then
‖x1 − x∗ ‖ ≤ L̃‖x0 − x∗ ‖² < ‖x0 − x∗ ‖/2 < (1/2) min{ε, δ, 1/(2L̃)}.
Moreover, provided that we choose x0 such that ‖x0 − x∗ ‖ < min{ε, δ, 1/(2L̃)}, this inequality is true for all k ≥ 1 and so are the previous chains of inequalities.
Now, it is easy to see that ‖xk − x∗ ‖ ≤ (L̃‖x0 − x∗ ‖)^(2ᵏ) / L̃ < 1/(2^(2ᵏ) L̃), which means that {xk } converges to x∗ . Moreover, the convergence is quadratic because
lim_{k→+∞} ‖xk+1 − x∗ ‖ / ‖xk − x∗ ‖² ≤ L̃.
In this case, we have that the Jacobian matrix of ∇f is the Hessian ∇²f . Thus, given a point xk ∈ Rn for which ∇²f (xk ) is nonsingular, the next iterate is
xk+1 := xk − [∇²f (xk )]⁻¹ ∇f (xk ).
If, moreover, ∇²f (xk ) is positive definite for all k, then we have a descent direction. More specifically, we have a steepest descent method where the inverse of the Hessian is used for scaling.
Example 4.10
Let us consider the poorly scaled function f (x1 , x2 ) = 100x41 + 0.01x42 whose global
minimum is clearly x∗ = (0, 0).
However, it must be noted that Newton’s method has also some disadvantages:
• An iteration is not well defined if ∇2 f (xk ) is singular.
• Newton’s direction may not be a descent direction if ∇2 f (xk ) is not positive defi-
nite.
• It is not globally convergent: It may fail if we are “far” from the stationary point.
Finally, as in many nonlinear optimization algorithms, we should not forget that, in
case of convergence, it finds a stationary point, with no guarantee that this point is
a local minimum (it may be a local maximum or a saddle point).
Consider a function f with f ′(x) = −x⁵ + x³ + 4x and f ″(x) = −5x⁴ + 3x² + 4 (for instance, f (x) = −x⁶/6 + x⁴/4 + 2x²), and let us start at x0 = 1. We have that f ′(x0 ) = 4 and f ″(x0 ) = 2. So, x1 = x0 − f ′(x0 )/f ″(x0 ) = 1 − 4/2 = −1.
Now, f ′(x1 ) = −4 and f ″(x1 ) = 2. So, x2 = x1 − f ′(x1 )/f ″(x1 ) = −1 + 4/2 = 1 = x0 . As we can see, we have entered into a loop and the points are not even stationary points.
But the function has indeed a local minimum at x = 0 as we can see in Figure 4.3.
[Figure 4.3: the function has a local minimum at x = 0, while Newton's method cycles between the points x1 = −1 and x2 = 1.]
Note finally that in Newton’s method, we are just requiring in Theorem 4.9 that
∇2 f (x∗ ) be nonsingular. However, the stationary point could be a saddle point or a
local maximum because we have no guarantee that Newton’s direction be a descent
direction.
Example 4.12
Let f (x) = −x2 . It is very easy to see that Newton’s method is globally convergent
because x1 = 0 for any starting point x0 . However, clearly x∗ = 0 is a global
maximum.
Note that due to the properties that we are requiring for the Hessian matrix, we have that dk = −[∇²f (xk )]⁻¹ ∇f (xk ). Therefore, dk is a descent direction for all k.
Example 4.16
Let us consider the function f (x1 , x2 ) = √(1 + x1²) + √(1 + x2²), whose global minimum is clearly x∗ = (0, 0).
If we apply Newton’s method with x0 = (2, 2), the sequence of iterates diverges.
However, if the damped Newton’s method is applied with the usual parameters, the
new sequence of iterates converges in 4 iterations.
When ∇²f (xk ) is not positive definite, Newton's direction may not be a descent direction. For this reason, it is usual to solve instead the system (∇²f (xk ) + M k ) dk = −∇f (xk ), where M k is chosen so that ∇²f (xk ) + M k is positive definite.
There are several possible ways of modifying ∇2 f (xk ) in the literature to produce
a modified Newton method, with no agreement on a best technique. Here we
provide two alternatives:
1. Add a multiple of the identity.
We take M k := τ In , where τ = max{0, δ − λmin (∇²f (xk ))} for some small δ > 0. This guarantees that λmin (∇²f (xk ) + M k ) ≥ δ.
However, if we are solving a high dimension problem, computing the eigenvalues
of the Hessian is computationally expensive. Therefore, λmin must be estimated
instead (a task that may be hard by itself). Moreover, this approach has the
disadvantage of giving too much importance to a single large negative eigenvalue.
2. Modified Cholesky factorization.
Every symmetric positive definite matrix A can be written as A = LLt , where L is a lower triangular matrix with all the elements of its diagonal positive. This decomposition is called the Cholesky factorization.
Compute Cholesky factorization ∇2 f (xk ) = Lk (Lk )t , where Lk is lower triangular.
During the factorization, if there is a risk that it fails, modify ∇2 f (xk ) in order
to prevent it.
Example 4.17
Let us assume that at a certain iterate the Hessian matrix is [3 0; 0 −1], which is not positive definite. We can fix this by adding, for example, the matrix [1.1 0; 0 1.1].
Although details will not be provided here, under similar conditions to the ones
given in Theorem 4.14, it is possible to guarantee the convergence of this modified
Newton’s method.
Instead of using the Hessian matrix ∇2 f (xk ), we would like to have a “good” ap-
proximation B k . Some desirable properties are:
1. The next matrix B k+1 can be computed using already computed values: ∇f (xk+1 ),
∇f (xk ), . . ., ∇f (x0 ), B k , dk .
This convex quadratic form mk has a unique minimum which is pk = −(B k )−1 ∇f (xk ).
Since B k is positive definite, then pk is a descent direction and we generate the next
iterate as xk+1 := xk + αk pk , where the step length αk is chosen so that Wolfe
conditions are satisfied:
f (xk + αk pk ) ≤ f (xk ) + c1 αk ∇f (xk )t pk ,
∇f (xk + αk pk )t pk ≥ c2 ∇f (xk )t pk ,
with 0 < c1 < c2 < 1.
As we can see, this iteration is done as in Newton’s method but we use matrix B k
instead of the Hessian matrix. Thus, we can think of B k as an approximation of the
true Hessian.
Since B k+1 is required to be positive definite, the secant equation B k+1 dk = y k (where dk = xk+1 − xk and y k = ∇f (xk+1 ) − ∇f (xk )) has a solution only if dk and y k satisfy the curvature condition (dk )t y k > 0.
This is very easy to see: multiply the secant equation by (dk )t on the left. It must be said that this curvature condition does not always hold. However, it can be seen that, if the Wolfe conditions hold, then the curvature condition is satisfied.
When this curvature condition is satisfied, the system B k+1 dk = y k has a solution.
Indeed, it has an infinite number of solutions because there are n(n + 1)/2 unknowns
with only n constraints plus n inequalities (requiring positive definiteness). In order
to obtain a unique solution, B k+1 is required to be the closest matrix to B k under a
certain matrix norm. That is, it is the solution of
Min. kB − B k k
s.t. B = Bt,
Bdk = y k ,
B ∈ Rn×n .
Different matrix norms lead to different solutions and, thus, to different quasi-Newton methods. Particularly, there is a norm (the weighted Frobenius norm; we will skip the details) for which the solution is
B k+1 := (In − ρk y k (dk )t ) B k (In − ρk dk (y k )t ) + ρk y k (y k )t ,
where ρk = 1/((y k )t dk ). This expression is known as the DFP update formula in honor of Davidon, who discovered it in 1959, and of Fletcher and Powell, who popularized it in 1963.
Actually, in order to calculate the descent direction, we need the inverse of this matrix. So, if we define H k := (B k )⁻¹ (that is, the approximation of the inverse of the Hessian), then
H k+1 := H k − H k y k (y k )t H k / ((y k )t H k y k ) + dk (dk )t / ((y k )t dk ).
Details on how this is obtained, using the Sherman-Morrison-Woodbury formula, are omitted.
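In code, the DFP update of H is a direct transcription (our sketch; recall dk = xk+1 − xk and yk = ∇f (xk+1 ) − ∇f (xk )):

import numpy as np

def dfp_inverse_update(H, d, y):
    """DFP update of H, the approximation of the inverse Hessian.

    Requires the curvature condition d @ y > 0 (guaranteed, e.g.,
    by a Wolfe line search).
    """
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(d, d) / (y @ d)

One checks directly that the updated matrix satisfies H k+1 y k = dk .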
However, the DFP formula, although quite effective, was soon improved by another expression (the well-known BFGS update), which is obtained by following the same argument as before but imposing the conditions on H k instead of on B k . The secant equation is now written as H k+1 y k = dk .
The final issue is how to choose a starting H 0 , but there is no best answer for this.
For example, it can be the identity matrix.
We now consider problems whose objective function has the special form f (x) = ½ Σ_{j=1}^m rj (x)², where rj ∈ C∞(Rn ) for all j. Each function rj is called a residual and measures the discrepancy between a model that we are fitting and an observed behaviour of the system. Besides, we will assume that m, the number of data points, is much larger than n, the number of parameters; x1 , . . . , xn are the unknown parameters of the model.
Example 4.20 (Least Squares)
We have a sample of data {(x1 , y1 ), (x2 , y2 ), . . . , (xm , ym )} and we would like to fit a curve of the form
φ(α1 , α2 ; x) = α1 (1 − e^(−α2 x) ).
The discrepancy can be measured, for example, by the sum of the squares of the residuals yj − φ(α1 , α2 ; xj ). Computing a solution (α1∗ , α2∗ ) that minimizes this error function is an unconstrained problem.
[Figure: the sample data and the fitted curve φ(α1 , α2 ; x).]
Given a model, if we assume that the discrepancies between the model and the observations are independent, identically distributed, and normal, then the maximum likelihood estimate is obtained by minimizing the sum of squares of the residuals. If the vector of residuals is r(x) = (r1 (x), r2 (x), . . . , rm (x))t , then the sum of squares of the residuals can be written as f (x) = ½‖r(x)‖².
Let J(x) be the m × n Jacobian matrix of r(x), with entries Jij (x) = ∂ri (x)/∂xj , and let Ji• (x) and J•j (x) be the i-th row and the j-th column of J(x), respectively.
If we calculate the derivatives needed to solve the least squares problem, then we have that
∂f (x)/∂xj = ½ Σ_{i=1}^m 2ri (x) ∂ri (x)/∂xj = J•j (x)t r(x),
and, therefore,
∇f (x) = (J•1 (x)t r(x), J•2 (x)t r(x), . . . , J•n (x)t r(x))t = J(x)t r(x).
With some extra work, it is possible to obtain an explicit expression for the Hes-
sian of f (the details are omitted here) and, putting all this together, we have the
following:
• ∇f (x) = J(x)t r(x).
• ∇²f (x) = J(x)t J(x) + Σ_{j=1}^m rj (x) ∇²rj (x).
The advantage of this second expression is that the first part of ∇²f (x) can be computed using only first-order derivatives, something that is potentially good from a computational point of view. Since quite often the residuals rj (x) are close to zero near the solutions, algorithms for nonlinear least squares usually take advantage of this special structure.
Example 4.21
Suppose that we want to fit a straight line φ(x1 , x2 ; t) = x1 + x2 t to the data points (−1, 3), (0, 2), (1, 0), and (2, 4), where the residual at each point is the difference between the model prediction and the observed value. We have that:
• r1 (x) = x1 − x2 − 3,
• r2 (x) = x1 − 2,
• r3 (x) = x1 + x2 ,
• r4 (x) = x1 + 2x2 − 4.
Thus, writing r(x) = Ax + b,
A = [1 −1; 1 0; 1 1; 1 2],   b = (−3, −2, 0, −4)t ,   At A = [4 2; 2 6],   −At b = (9, 5)t .
The solution of
[4 2; 2 6] (x1 , x2 )t = (9, 5)t
is x∗ = (11/5, 1/10) = (2.2, 0.1). So, the fitted model is φ(t) = 2.2 + 0.1t.
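The computation can be verified in a couple of lines (our snippet):

import numpy as np

# Residuals r(x) = A x + b from the example above.
A = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
b = np.array([-3.0, -2.0, 0.0, -4.0])

x_star = np.linalg.solve(A.T @ A, -A.T @ b)    # normal equations
print(x_star)                                  # [2.2 0.1]
print(np.linalg.lstsq(A, -b, rcond=None)[0])   # same solution via least squares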
Instead of using the exact value of the Hessian matrix, we approximate it using only the first term J(x)t J(x). This presents several advantages over Newton's method:
1. We do not need to calculate second order derivatives, which can mean an important saving in computational effort.
2. In many applications, the second term Σj rj (x)∇²rj (x) is dominated by J(x)t J(x) (at least, when we are close to an optimum point x∗ ), which means that J(x)t J(x) is a good approximation to ∇²f (x) in this neighborhood.
3. If J(x) has full rank, then J(x)t J(x) is positive definite and the solution of J(x)t J(x)d = −J(x)t r(x) is a descent direction, which is called the Gauss-Newton direction:
dGN := −(J(x)t J(x))⁻¹ J(x)t r(x).
Indeed, ∇f (x)t dGN = −(dGN )t J(x)t J(x) dGN = −‖J(x)dGN ‖² ≤ 0. The last inequality is strict unless J(x)dGN = 0, but then dGN = 0 because J(x) has full rank. Therefore, 0 = J(x)t r(x) = ∇f (x) and we have found a stationary point of f .
Gauss-Newton method can be combined with a line search strategy and, under cer-
tain conditions, local convergence can be guaranteed.
Theorem 4.22
Let us assume the following:
1. Each residual function rj is Lipschitz continuously differentiable in a neighbor-
hood N of the level set {x / f (x) ≤ f (x0 )}, where x0 is the starting point of
Gauss-Newton method.
2. The Jacobian J(x) of the residual vector satisfies the uniform full-rank condition
‖J(x)v‖ ≥ γ‖v‖
for all x in N , for a certain γ > 0.
3. The iterates {xk } are generated by the Gauss-Newton method with step lengths αk
that satisfy Wolfe conditions.
Then:
1. lim_{k→+∞} J(xk )t r(xk ) = 0, that is, the iterates converge to a stationary point x∗ .
2. Moreover, if Σ_{j=1}^m rj (x∗ ) ∇²rj (x∗ ) = 0, then the convergence rate is quadratic.
Nevertheless, it must be noted that the Gauss-Newton method does not necessarily converge and that, if it does, the convergence may be slower than quadratic.
Example 4.23
Let us suppose that the residual function is r(x) = (x + 1, 0.1x² + x − 1) and that we are minimizing the total sum of the squares of the residuals, that is, f (x) = ½‖r(x)‖².
If now we consider the residual function s(x) = (x, 0.1x² + x) instead, Gauss-Newton converges to x∗ = 0 in 3 iterations and Newton does it in 4 iterations. In this case, s(x∗ ) = (0, 0), which guarantees that the last condition of the previous theorem holds and, thus, the convergence is quadratic.
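A generic full-step Gauss-Newton loop that replays this case (our code; the starting point x0 = 1 is our own choice, since the example does not state one):

import numpy as np

def gauss_newton(r, J, x, tol=1e-6, max_iter=50):
    """Full-step Gauss-Newton for min 0.5 * ||r(x)||^2."""
    for k in range(max_iter):
        g = J(x).T @ r(x)                       # gradient J^t r
        if np.linalg.norm(g) < tol:
            return x, k
        d = np.linalg.solve(J(x).T @ J(x), -g)  # Gauss-Newton direction
        x = x + d
    return x, max_iter

# Residuals s(x) = (x, 0.1 x^2 + x); x is a scalar stored in a length-1 array.
s = lambda x: np.array([x[0], 0.1 * x[0]**2 + x[0]])
Js = lambda x: np.array([[1.0], [0.2 * x[0] + 1.0]])
print(gauss_newton(s, Js, np.array([1.0])))     # converges to 0 in a few steps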
numerical issues prevent us from reaching that solution for a tolerance of ε = 10−6 .
In this case t(0) = (1, −1) and t00 (0) = (0, 2), so, again the mentioned property does
not hold. Moreover, now the first part of ∇2 f (x∗ ) is 2 and the second part is -2.
This means that the weight of the second part is not negligible, which is bad for
convergence. Moreover, if we did not impose Wolfe conditions, then we would have
no convergence.
Chapter 5
Trust-Region Methods
A trust-region method defines a region around the current iterate and uses a
model that is “trusted” to be a good local representation of the objective function.
Then, the step chosen is the approximate minimum of the model in this area. If
the step is not acceptable, then the trust region is shrunk. The size of the trust
region tends to be variable. If the algorithm is doing well, the trust region may be
increased. Otherwise, it is reduced in order to have a better local approximation of
the true objective function.
First, we need to decide the radius of the trust region. Given a step pk , we define the ratio
ρk := [f (xk ) − f (xk + pk )] / [mk (0) − mk (pk )].
Note that the numerator f (xk ) − f (xk + pk ) is the actual function decrease, while the
denominator mk (0) − mk (pk ) is the estimated function decrease. Since pk is obtained
by minimizing mk over a region that includes p = 0, the estimated decrease is always
nonnegative. Now:
• If ρk is close to 1, then mk is a good approximation. We update xk+1 := xk + pk
and ∆k+1 ≥ ∆k , that is, we may even consider increasing the size of the trust-
region.
• If ρk is positive but much smaller than 1, or if it is negative, then we have a poor approximation. We shrink the trust region.
A generic trust-region method is as follows:
Algorithm 5.1 (Generic Trust-Region Method, GTRM)
Step 1: Choose ∆̂ > 0, which will be an upper bound on the step lengths. Choose ∆0 ∈ (0, ∆̂), η ∈ [0, 1/4), a tolerance ε > 0, and a starting point x0 ∈ Rn . Let k := 0.
Step 2: If ‖∇f (xk )‖ < ε, STOP.
Step 3: Solve (approximately) problem (TRk ) and let pk be its solution.
Step 4: Evaluate ρk .
• If ρk < 1/4, then ∆k+1 = ∆k /4. The approximation is poor and we shrink the trust region.
• Otherwise:
– If ρk > 3/4 and ‖pk ‖ = ∆k , then ∆k+1 = min{2∆k , ∆̂}. The approximation is very good and we enlarge the trust region.
– Otherwise, ∆k+1 = ∆k .
Step 5: If ρk > η, then xk+1 := xk + pk . Go to Step 2.
Otherwise, xk+1 := xk . The approximation is not good and we repeat the search in a smaller region (see the previous step). Go to Step 3.
Note that this is a very remarkable and rare result because it provides a characteri-
zation for the optimal solutions of a nonconvex quadratic optimization problem.
First, we solve the linear version of (TR) in order to obtain a promising direction:
pS := arg min f (x) + ∇f (x)t p
      s.t. ‖p‖ ≤ ∆, p ∈ Rn .
Then we calculate a scalar τ that minimizes m(τ pS ) in the trust region:
τ := arg min m(τ pS )
     s.t. ‖τ pS ‖ ≤ ∆, τ ≥ 0.
We define pC := τ pS . The Cauchy point is x + pC .
We can now obtain the Cauchy point explicitly. Note first that pS is the solution of a linear problem with a constraint on the norm of the solution. It is easy to see that
pS = −(∆/‖∇f (x)‖) ∇f (x).
(We are assuming that ∇f (x) ≠ 0.) Now, in order to minimize m(τ pS ), we distinguish whether ∇f (x)t B∇f (x) is positive or negative.
• Case 1: ∇f (x)t B∇f (x) ≤ 0.
m(τ pS ) decreases with τ . Thus, the optimal solution is τ = 1, which is the largest
value such that τ pS is still in the trust region.
• Case 2: ∇f (x)t B∇f (x) > 0.
Now m(τ pS ) is a strictly convex quadratic in τ whose unconstrained minimizer is τ = ‖∇f (x)‖³/(∆ ∇f (x)t B∇f (x)), which we truncate at 1 so that τ pS stays in the trust region.
Therefore,
pC = −(∆/‖∇f (x)‖) ∇f (x), if ∇f (x)t B∇f (x) ≤ 0,
pC = −min{∆/‖∇f (x)‖, ‖∇f (x)‖²/(∇f (x)t B∇f (x))} ∇f (x), if ∇f (x)t B∇f (x) > 0.
The Cauchy point provides sufficient reduction in mk so that we have global con-
vergence (see Theorem 5.3, no proof will be given here). However, in practice, after
having calculated the Cauchy point, we try to improve it (although how this can be
done will not be explained in this course). The reason is that the plain use of the
Cauchy point is just the steepest descent method with a particular step size. And we
know that the steepest descent method converges only linearly (even for an optimal
step size).
Theorem 5.3
Let η ∈ (0, 1/4) in Algorithm GTRM. Suppose that ‖B k ‖ ≤ β for some constant β, that f is bounded below on the level set S = {x ∈ Rn / f (x) ≤ f (x0 )} and is Lipschitz continuously differentiable in S(R0 ) = {x ∈ Rn / ‖x − y‖ < R0 for some y ∈ S} for some R0 > 0. If the sequence of approximate solutions of (TRk ) satisfies the following inequalities:
1. mk (0) − mk (pk ) ≥ c1 ‖∇f (xk )‖ min{∆k , ‖∇f (xk )‖/‖B k ‖} for some constant c1 ∈ (0, 1),
2. ‖pk ‖ ≤ γ∆k for some constant γ ≥ 1,
then lim_{k→∞} ∇f (xk ) = 0.
Chapter 6
Conjugate Gradient Methods
The conjugate gradient method is a technique for solving linear systems of equations. It is an alternative to Gaussian elimination that is used to solve large systems. If we are solving Ax = b, with A a (square) symmetric positive definite matrix, we know that x = A⁻¹b. However, the complexity of this calculation is O(n³), which is computationally very expensive for large problems. Conjugate gradient methods allow us to do better than that.
There is a version for nonlinear problems that was introduced by Fletcher and Reeves
in the 1960s. Here we will only study the linear conjugate gradient method. For this
reason, we will drop the word “linear”.
Definition 6.1
A set of nonzero vectors {p0 , p1 , . . . , p` } is said to be conjugate with respect to the
symmetric positive definite matrix A if
(pi )t Apj = 0 ∀i 6= j.
Example 6.2
Let
A = [3 0 1; 0 4 2; 1 2 3],   p0 = (1, 0, 0),   p1 = (1, 0, −3),   and   p2 = (1, 4, −3).
It is easy to check that these three vectors are conjugate with respect to the positive
definite matrix A.
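Checking conjugacy is immediate (our snippet):

import numpy as np

A = np.array([[3.0, 0.0, 1.0], [0.0, 4.0, 2.0], [1.0, 2.0, 3.0]])
P = [np.array([1.0, 0.0, 0.0]),
     np.array([1.0, 0.0, -3.0]),
     np.array([1.0, 4.0, -3.0])]

for i in range(3):
    for j in range(i + 1, 3):
        print(i, j, P[i] @ A @ P[j])   # all three products are 0.0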
Given a set of conjugate directions {p0 , p1 , . . . , pn−1 }, the conjugate direction method generates iterates
xk+1 := xk + αk pk ,
where αk is the step size that exactly minimizes the quadratic f (x) = ½xt Ax − bt x along pk (the unique minimum of f is the solution x∗ of Ax = b). Since conjugate directions are linearly independent, we can write x∗ − x0 = Σ_{i=0}^{n−1} σ i pi for some scalars σ i .
If we multiply on the left by (pk )t A, due to the conjugacy property, we have that
(pk )t A(x∗ − x0 ) = Σ_{i=0}^{n−1} σ i (pk )t Api = σ k (pk )t Apk ;
σ k = (pk )t A(x∗ − x0 ) / ((pk )t Apk ).
Now, we will show that σ k = αk for all k.
Now, we will show that σ k = αk for all k.
We have that
xk = xk−1 +αk−1 pk−1 = xk−2 +αk−2 pk−2 +αk−1 pk−1 = . . . = x0 +α0 p0 +. . .+αk−1 pk−1 .
Thus, since xk − x0 is a linear combination of p0 , . . . , pk−1 , conjugacy gives (pk )t A(xk − x0 ) = 0, and, writing rk := Axk − b = A(xk − x∗ ),
(pk )t A(x∗ − x0 ) = (pk )t A(x∗ − xk ) = −(pk )t rk .
Therefore,
σ k = −(pk )t rk / ((pk )t Apk ) = αk .
A property of any sequence of points generated with the conjugate direction method is that (rk )t pi = 0, i = 0, 1, . . . , k − 1. We will skip the proof of this result.
In the conjugate gradient method, each new direction is generated as pk = −rk + β k pk−1 . If we multiply on the left by (pk−1 )t A and impose that (pk−1 )t Apk = 0, it is easy to see that
β k = (rk )t Apk−1 / ((pk−1 )t Apk−1 ).
If we choose p0 as the steepest descent direction (that is, −∇f (x0 )), then we obtain
the so-called conjugate gradient method.
Algorithm 6.4 (Conjugate Gradient Method, Preliminary Version)
Step 1: Choose x0 ∈ Rn . Let r0 := Ax0 − b, p0 := −r0 , k := 0.
Step 2: If rk ≠ 0, then:
• αk := −(rk )t pk / ((pk )t Apk ),
• xk+1 := xk + αk pk ,
• rk+1 := Axk+1 − b,
• β k+1 := (rk+1 )t Apk / ((pk )t Apk ),
• pk+1 := −rk+1 + β k+1 pk ,
• k := k + 1.
Repeat Step 2.
Moreover, it can be seen (we will skip the proof) that not only is pk conjugate
with pk−1 with respect to A, but also with p0 , p1 , . . . , pk−2 .
Proposition 6.5
Suppose that the k-th iterate generated by the conjugate gradient method is not the
solution point x∗ . The following properties hold:
1. (rk )t ri = 0, i = 0, 1, . . . , k − 1.
2. (pk )t Api = 0, i = 0, 1, . . . , k − 1.
3. The sequence {xk } converges to x∗ in at most n steps.
There is a standard version of the conjugate gradient method that is slightly different in order to do fewer multiplications. By using that in any conjugate direction algorithm it holds that (rk )t pi = 0, i = 0, 1, . . . , k − 1, we have that
αk = −(rk )t pk / ((pk )t Apk ) = −(rk )t (−rk + β k pk−1 ) / ((pk )t Apk ) = (rk )t rk / ((pk )t Apk ).
Now, we observe that rk+1 − rk = A(xk+1 − xk ) = αk Apk . So, using now that (rk )t ri = 0, i = 0, 1, . . . , k − 1, we have that
β k+1 = (rk+1 )t Apk / ((pk )t Apk ) = [(rk+1 )t (rk+1 − rk )/αk ] / [(pk )t (rk+1 − rk )/αk ] = −(rk+1 )t rk+1 / ((pk )t rk )
= −(rk+1 )t rk+1 / ((−rk + β k pk−1 )t rk ) = (rk+1 )t rk+1 / ((rk )t rk ).
By writing these scalars in this way, we obtain the standard version of the conjugate gradient method. Observe how this new version has fewer multiplications than the previous one.
Algorithm 6.6 (Conjugate Gradient Method)
Step 1: Choose x0 ∈ Rn . Let r0 := Ax0 − b, p0 := −r0 , k := 0.
Step 2: If rk ≠ 0, then:
• αk := (rk )t rk / ((pk )t Apk ),
• xk+1 := xk + αk pk ,
• rk+1 := rk + αk Apk ,
• β k+1 := (rk+1 )t rk+1 / ((rk )t rk ),
• pk+1 := −rk+1 + β k+1 pk ,
• k := k + 1.
Repeat Step 2.
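Algorithm 6.6 translates almost line by line into code (our transcription):

import numpy as np

def conjugate_gradient(A, b, x, tol=1e-10):
    """Standard CG for A x = b with A symmetric positive definite."""
    r = A @ x - b
    p = -r
    k = 0
    while np.linalg.norm(r) > tol:
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r + alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = -r_new + beta * p
        r = r_new
        k += 1
    return x, k            # k <= n in exact arithmetic (Proposition 6.5)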
As can be seen, it is not necessary to keep the vectors x, r, and p for more than the last two iterations, which is an important saving in the computational implementation. The cost per iteration is O(n²). Thus, in theory it is an O(n³) method. However, if we have a "good" distribution of the eigenvalues of A (e.g., repeated eigenvalues, or a few large eigenvalues with the others clustered around 1), then we need far fewer than n iterations. This method is more efficient for large systems; for small systems, Gaussian elimination performs better.
Chapter 7
Constrained Optimization
In most problems, there are some requirements that any solution must meet. In
this case, we are dealing with a constrained problem and these requirements are
the constraints. Usually, an equality constraint can be handled mathematically
more easily than an inequality. However, an equality is more difficult to be satisfied
numerically because of its restrictiveness. This contributes to make constrained
problems much more challenging than unconstrained problems.
Ω is the feasible set and we will assume that it is defined by a finite number of
constraints. Each of these constraints can be either an inequality (g(x) ≤ 0) or an
equality (h(x) = 0). Thus, the problem we are solving is as follows:
Min. f (x)
(CP) s.t. gi (x) ≤ 0, i ∈ I := {1, 2, . . . , m},
hj (x) = 0, j ∈ E := {m + 1, m + 2, . . . , m + p}.
Definition 7.2
A point x∗ ∈ Rn is a local minimum of f in Ω if there is a neighborhood N of x∗
such that f (x) ≥ f (x∗ ) for all x ∈ N ∩ Ω.
In the same way that we established necessary and sufficient conditions for local
minima in unconstrained problems, we will derive similar results for problems with
constraints. We begin by defining an important set.
Definition 7.3
Given a feasible point x∗ of (CP ), its active set A(x∗ ) is the set of constraints
satisfied with equality, that is,
A(x∗ ) := {i ∈ I / gi (x∗ ) = 0} ∪ E.
Example 7.4
In the problem
Min. (x1 − 1)2 + (x2 − 2)2
s.t. x1 + x2 ≤ 2,
x21 − x2 = 0,
the inequality constraint is active for (1, 1) but inactive for (0, 0).
We say that (x∗ , λ∗ , µ∗ ) is a KKT point if:
1. ∇f (x∗ ) + Σ_{i∈I} λ∗i ∇gi (x∗ ) + Σ_{j∈E} µ∗j ∇hj (x∗ ) = 0. (Or, using the Lagrangian function, ∇x L(x∗ , λ∗ , µ∗ ) = 0.)
2. gi (x∗ ) ≤ 0 ∀i ∈ I, hj (x∗ ) = 0 ∀j ∈ E.
3. λ∗i ≥ 0 ∀i ∈ I.
4. λ∗i gi (x∗ ) = 0 ∀i ∈ I.
These four conditions are known as the Karush-Kuhn-Tucker conditions. The
first condition is the stationarity condition, the second is the primal feasibility
condition, the third is the dual feasibility condition, and the fourth is the com-
plementary slackness condition. The values λ∗i and µ∗j are known as Lagrange
multipliers.
Example 7.6
In Example 7.4, we have that f (x) = (x1 − 1)² + (x2 − 2)², g(x) = x1 + x2 − 2, and h(x) = x1² − x2 . The KKT conditions are:
• Stationarity: 2(x1 − 1) + λ + 2µx1 = 0 and 2(x2 − 2) + λ − µ = 0.
• Primal feasibility:
x1 + x2 ≤ 2,
x1² − x2 = 0.
• Dual feasibility: λ ≥ 0.
• Complementary slackness: λ(x1 + x2 − 2) = 0.
After some not too difficult calculations that we will skip, we obtain that there are three KKT points:
• x∗ = (−1, 1) for λ = 0 and µ = −2.
• x∗ = ((1 − √3)/2, (2 − √3)/2) for λ = 0 and µ = −(2 + √3).
• x∗ = (1, 1) for λ = 4/3 and µ = −2/3.
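These points can be verified numerically; the following snippet (ours) checks stationarity, feasibility, and complementary slackness for the three of them:

import numpy as np

grad_f = lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] - 2)])
grad_g = lambda x: np.array([1.0, 1.0])
grad_h = lambda x: np.array([2 * x[0], -1.0])
g = lambda x: x[0] + x[1] - 2
h = lambda x: x[0]**2 - x[1]

s3 = np.sqrt(3.0)
kkt_points = [(np.array([-1.0, 1.0]), 0.0, -2.0),
              (np.array([(1 - s3) / 2, (2 - s3) / 2]), 0.0, -(2 + s3)),
              (np.array([1.0, 1.0]), 4 / 3, -2 / 3)]

for x, lam, mu in kkt_points:
    stationarity = grad_f(x) + lam * grad_g(x) + mu * grad_h(x)
    print(np.round(stationarity, 12),            # ~ (0, 0)
          g(x) <= 1e-12, abs(h(x)) < 1e-12,      # primal feasibility
          lam >= 0, abs(lam * g(x)) < 1e-12)     # dual feasibility, slackness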
The following result establishes conditions that every local minimum must satisfy. Moreover, it can be proved that the LICQ guarantees that the Lagrange multiplier vector is unique.
Now, before we can state a condition that uses second-order derivatives information,
we need another definition.
Definition 7.9
Given a KKT point (x∗ , λ∗ , µ∗ ), the critical cone for this point is the set of directions d ∈ Rn such that ∇hj (x∗ )t d = 0 for all j ∈ E, ∇gi (x∗ )t d = 0 for all i ∈ A(x∗ ) ∩ I with λ∗i > 0, and ∇gi (x∗ )t d ≤ 0 for all i ∈ A(x∗ ) ∩ I with λ∗i = 0.
With the help of the critical cone, we can narrow down even further which points
could potentially be local minima.
Finally, a sufficient condition is established. Note that here the linear independence
constraint qualification is not required.
Example 7.12
In the previous example, we have seen that (x∗ , λ∗ , µ∗ ) = ((1, 1), 4/3, −2/3) is a KKT point. The Hessian of the Lagrangian is
∇²xx L(x, λ, µ) = [2 + 2µ 0; 0 2],
and at µ∗ = −2/3 this matrix is [2/3 0; 0 2], which is positive definite. In particular, the previous theorem holds and, therefore, x∗ = (1, 1) is a strict local minimum.
Example 7.15
If the constraints are
2x1 + x2 ≤ 1,
x21 ≤ 1,
then the first constraint is linear and the second is nonlinear, but involves a convex
function. It is easy to see that GSCQ is satisfied by considering x̂ = (0, 0).
If LICQ is replaced by SCQ or GSCQ, then Theorem 7.8 is still true (although for
GSCQ we require that f be a convex function).
Theorem 7.16
If x∗ is a minimum of (CP) and either:
1. SCQ is satisfied, or
2. GSCQ is satisfied and f is convex,
Then there is a vector of Lagrange multipliers λ∗ (respectively, (λ∗ , µ∗ )) such that
(x∗ , λ∗ ) (respectively, (x∗ , λ∗ , µ∗ )) is a KKT point.
Note: Besides the conditions of SCQ/GSCQ, we require the nonlinear functions of the constraints to be continuously differentiable because this is needed in the definition of KKT point.
In some special (but very general) cases, the KKT conditions are necessary without requiring any constraint qualification.
Proposition 7.17
If all the constraints in (CP) are linear and x∗ is a local minimum, then there is a
Lagrange multiplier vector (λ∗ , µ∗ ) such that (x∗ , λ∗ , µ∗ ) is a KKT point.
Therefore, if SCQ or GSCQ is satisfied, then x∗ is a global minimum if, and only if,
there is a vector of Lagrange multipliers so that there is an associated KKT point.
Finally, let us study what happens with linear problems. After all, they are a par-
ticular case in nonlinear optimization.
Linear Programming
In the case of a linear problem, we are solving
Min. ct x
(P) s.t. Ax = b,
x ≥ 0,
with A an m × n full row rank matrix. This is a convex problem with linear con-
straints. Thus, x∗ is a (global) minimum if, and only if, it is a KKT point. Since
f (x) = ct x, g(x) = −x, and h(x) = Ax − b, then the KKT conditions for a general
point x are:
c − λ + At µ = 0,
x ≥ 0,
Ax = b,
λ ≥ 0,
λi xi = 0, i = 1, . . . , n.
If we now consider the dual problem of (P), with µ ∈ Rm the dual variables for
the equality constraints and λ the dual variables for the inequality constraints, this
dual (D) is:
Max. bt µ
(D) s.t. At µ + λ = c,
λ ≥ 0.
The KKT conditions for (D) are the same as for the primal problem. Therefore, x∗ is an optimal solution to (P) and (λ∗ , µ∗ ) is an optimal solution to (D) if, and only if, (x∗ , λ∗ , µ∗ ) is a KKT point, in which case ct x∗ = bt µ∗ .
Chapter 8
Interior Point Methods
In this chapter we will study interior point methods for solving linear and convex
quadratic problems. When solving the linear problem
Min. ct x
(P) s.t. Ax = b,
x ≥ 0,
with A an m × n full row rank matrix, the simplex method searches sequentially from vertex to vertex along the boundary of the feasible region until it finds an optimal solution. The idea behind interior point methods is exactly the opposite: the search is performed along a path that is in the interior of the feasible region until it converges to an optimal solution.
The worst-case complexity of the simplex method is exponential, while that of interior point methods is polynomial. However, each iteration of an interior point method is more expensive in computational terms.
Here we will study what is called a primal-dual method, the name coming from the simultaneous use of the primal and the dual formulations of the problem. If the dual of (P) is written as
Max. bt µ
(D) s.t. At µ + s = c,
s ≥ 0,
then both (P) and (D) have the same (necessary and sufficient for optimality) KKT
conditions:
c − s + At µ = 0,
x ≥ 0,
Ax = b,
s ≥ 0,
si xi = 0, i = 1, . . . , n.
If we write −µ instead of µ (we can do this because it is a free variable) and rearrange slightly how we display these conditions, then we have that:
At µ + s = c,
Ax = b,
xi si = 0, i = 1, . . . , n,
(x, s) ≥ 0.
What are known as primal-dual methods find solutions (x∗ , s∗ , µ∗ ) of this system by
applying variants of Newton’s method to the system of the three equalities and then
modifying the search direction so that the inequality is satisfied strictly.
In order to obtain a primal-dual interior point method, the previous optimality conditions are written as follows:
F (x, s, µ) := (At µ + s − c, Ax − b, XSe)t = 0,
(x, s) ≥ 0,
where X = diag(x1 , . . . , xn ), S = diag(s1 , . . . , sn ), and e = (1, . . . , 1)t .
The goal is to obtain iterates (xk , sk , µk ) that satisfy xk > 0 and sk > 0.
The value
d := xt s / n
can be seen as a measure of how desirable the current point is (the closer to zero, the better) and it is known as the duality measure.
We will use Newton's method to solve F (x, s, µ) = 0 and to obtain a search direction (∆x, ∆s, ∆µ). Thus, we need to solve
J(x, s, µ) (∆x, ∆s, ∆µ)t = −F (x, s, µ),
where J(x, s, µ) is the Jacobian matrix of F .
However, if we take a full step for the solution of this system, it is likely that we will violate the condition (x, s) > 0. Therefore, we choose a smaller step size α ∈ (0, 1) and the new iterate is
(x, s, µ) + α(∆x, ∆s, ∆µ).
A less aggressive option is to just seek to reduce the value of the products xi si by trying to find a point for which xi si = σd, where d is the current duality measure and σ ∈ (0, 1) is the reduction factor that we would like to achieve. The modified system of equations is
[0 In At ; A 0 0; S X 0] (∆x, ∆s, ∆µ)t = (−At µ − s + c, −Ax + b, −XSe + σde)t .
σ is called the centering parameter.
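Putting the pieces together, one Newton step of the method might be coded as follows (a sketch with our own variable names; the damping rule at the end is a common fraction-to-the-boundary heuristic, not taken from the notes):

import numpy as np

def primal_dual_step(A, b, c, x, s, mu, sigma=0.1):
    """One centered Newton step on the perturbed KKT conditions."""
    m, n = A.shape
    d = (x @ s) / n                              # duality measure
    X, S, e = np.diag(x), np.diag(s), np.ones(n)
    K = np.block([[np.zeros((n, n)), np.eye(n),        A.T],
                  [A,                np.zeros((m, n)), np.zeros((m, m))],
                  [S,                X,                np.zeros((n, m))]])
    rhs = np.concatenate([-(A.T @ mu + s - c),
                          -(A @ x - b),
                          -X @ S @ e + sigma * d * e])
    delta = np.linalg.solve(K, rhs)
    dx, ds, dmu = delta[:n], delta[n:2 * n], delta[2 * n:]
    # Keep (x, s) strictly positive by damping the step.
    alpha = min([1.0]
                + [0.99 * x[i] / -dx[i] for i in range(n) if dx[i] < 0]
                + [0.99 * s[i] / -ds[i] for i in range(n) if ds[i] < 0])
    return x + alpha * dx, s + alpha * ds, mu + alpha * dmu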
The (primal-dual) feasible set and the strictly feasible set are
F := {(x, s, µ) / Ax = b, At µ + s = c, (x, s) ≥ 0},
F 0 := {(x, s, µ) / Ax = b, At µ + s = c, (x, s) > 0}.
The central path C is an arc of strictly feasible points and it is very important in
primal-dual interior point algorithms. It is parameterized by a scalar τ > 0 and a
point (xτ , sτ , µτ ) ∈ C satisfies that
At µ + s = c,
Ax = b,
xi si = τ, i = 1, . . . , n,
(x, s) > 0.
Note that these are the KKT conditions where the value 0 has been changed to τ in the right-hand side of the third equation. We define
C := {(xτ , sτ , µτ ) / τ > 0}.
It can be shown that each (xτ , sτ , µτ ) exists and is unique for its τ if, and only if, F 0 is nonempty.
Additionally, it is easy to see that if we consider the strictly convex problem with logarithmic barrier with parameter τ > 0
Min. ct x − τ Σ_{i=1}^n ln xi
s.t. Ax = b,
its KKT conditions (which characterize its global minimum) are exactly the first
three equations defining (xτ , sτ , µτ ), where we define si := τ /xi . It is also obvious
that x > 0 in any optimal solution (and, thus, s > 0).
Theorem 8.2
If the sequence {(xk , sk , µk )} is generated with LSPFPDM, then there is a constant δ > 0, independent of n, such that
dk+1 ≤ (1 − δ/n) dk for all k ≥ 0.
Therefore, lim_{k→+∞} dk = 0. Moreover, dk ≤ εd0 in O(n log 1/ε) steps.
The same ideas apply to the convex quadratic problem Min. ½xt Qx + ct x s.t. Ax = b, x ≥ 0, with Q positive semidefinite. Since we have a convex problem, its solutions are characterized by the KKT conditions
At µ + s − Qx = c,
Ax = b,
(x, s) ≥ 0,
xi si = 0, i = 1, . . . , n.
If we now apply Newton's method to calculate the search direction, we must solve
[−Q In At ; A 0 0; S X 0] (∆x, ∆s, ∆µ)t = (−At µ − s + Qx + c, −Ax + b, −XSe + σde)t .