
Nonlinear Optimization

University of Edinburgh

Year 2016-2017

Lecturer: Sergio García Quiles

January 16, 2017


Contents

Course details

1 Introduction

2 Basics of Nonlinear Optimization
2.1 First-Order Optimality Conditions
2.2 Positive Semidefinite Matrices
2.3 Second-Order Optimality Conditions
2.4 Convexity and Global Optimality Conditions

3 Line Search Methods in Unconstrained Optimization
3.1 Step Length
3.2 Steepest Descent Method

4 Newton's Method and Quasi-Newton Methods
4.1 Newton's Method for Root-Finding
4.2 Newton's Method in Optimization
4.3 Quasi-Newton Methods
4.4 Least Squares

5 Trust-Region Methods for Unconstrained Optimization
5.1 Exact Solution of the Trust-Region Problem
5.2 The Cauchy Point

6 Conjugate Gradient Methods
6.1 The Conjugate Direction Method
6.2 The Conjugate Gradient Method

7 Constrained Optimization
7.1 Karush-Kuhn-Tucker Conditions
7.2 Optimality Conditions

8 Interior Point Methods
8.1 Primal-Dual Algorithms
8.2 The Central Path
8.3 Convex Quadratic Problems


Course Details

• Lectures: Wednesday, 9:00-10:50, JCMB 6201.


• Tutorials: Wednesdays of Weeks 2, 4, 6, 8, and 10, 13:10-14:00, JCMB 4325C.
• Assessment: 80% exam, 20% continuous assessment.
• Continuous assessment: there will be two assignments on Weeks 3 and 8 to be
handed in one week later. Each will be worth 10% of the final mark.
• Bibliography: Numerical Optimization, Jorge Nocedal and Stephen J. Wright,
Springer (1st or 2nd edition). The book is available at the library and can also be
downloaded electronically.
Chapter 1

Introduction

Optimization problems appear in many different areas: transportation, portfolio


selection, engineering, chemistry, physics, etc. It happens quite often that these
problems are nonlinear as we can see in the following examples.

Example 1.1
A soft drinks company needs to decide the shape of the cylindrical can for a new
product. The total volume will be a fixed value V, but the length L and the diameter D
must be chosen so that the total cost is minimized. Each area unit of the side has a
cost of cs monetary units and each area unit of the bottom and the top has a cost
of ct monetary units. What is the design of minimum cost?

Solution:
The bottom is a circle of radius D/2. Thus, its area is π(D/2)^2, and the joint area of
bottom and top is twice this amount: 2π(D/2)^2 = πD^2/2.

The side is a rectangle of height L and width the length of the circle of the base (πD).
Thus, its area is πDL.

The volume of the can is πD^2 L/4.

Since neither D nor L can be negative, the problem to be solved is

Min.  cs πDL + ct πD^2/2
s.t.  πD^2 L/4 = V,
      D, L ≥ 0.
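Although the notes leave the model here, this particular problem can be solved in closed form (this remark is ours, not part of the original text). The constraint gives L = 4V/(πD^2), so the cost as a function of D alone is c(D) = 4 cs V/D + ct πD^2/2. Setting c′(D) = −4 cs V/D^2 + ct πD = 0 yields D = (4 cs V/(π ct))^{1/3}, and combining the two relations gives L/D = ct/cs: the optimal can is tall and thin when the top and bottom are expensive relative to the side.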
Example 1.2
An electric company must supply power during the next year according to some
power requirements that have been predicted as a curve h(t) using historical in-
formation, where h(t) is the demand in megawatts at time t. In order to make
calculations easier, the working day starts at t = 0 and ends at t = 1.

The electric company may meet these requirements by using two owned turbines T1
and T2 . There is also the possibility of purchasing power from an external turbine T3
that belongs to a central energy grid. Associated with the owned turbines, there is
a cost of bi monetary units per day that the turbine is working and there is a cost
of ci monetary units per megawatt produced, i = 1, 2. The price of power purchased
from the grid is c3 monetary units per megawatt.

Due to the configuration of the power network, electricity must be produced first
by turbine T1 , until it is decided to switch to turbine T2 . That is, at that moment,
turbine T1 does not work any more and electricity is produced exclusively by the
second turbine. This happens until it is decided to stop turbine T2 and use exclusively
turbine T3 . How should the company plan the electricity production?

[Figure: the predicted demand curve h(t) (power in megawatts) over the working day, with the switching times marked on the time axis.]

Solution:
Let ti be the time that turbine Ti is working, i = 1, 2.

The energy produced is the area below the curve on each interval. Therefore, the
objective function (the total cost) is

F(t1, t2) = b1 t1 + b2 t2 + c1 ∫_0^{t1} h(t) dt + c2 ∫_{t1}^{t1+t2} h(t) dt + c3 ∫_{t1+t2}^{1} h(t) dt.

Besides, it is clear that the times cannot be negative and that we cannot go beyond
the timetable that we are considering. Thus, we have the following constraints:

t1, t2 ≥ 0,  t1 + t2 ≤ 1.

Since, in general, linear problems are much better understood and can be solved much
more efficiently than nonlinear problems, a possible approach to solve a nonlinear
problem is to use a linear approximation. However, it may happen that such an
approximation cannot be obtained or that it is not good enough to solve the problem
with enough accuracy. Therefore we must study and develop theory and algorithms
for nonlinear problems.

Moreover, some properties from linear optimization do not hold any more. For
example, we know that in a linear problem (linear objective function and linear
constraints), every local minimum is global. This is not necessarily true for a non-
linear problem: as we can see in Figure 1.1, x1 = 1 and x2 = 3 are both local minima,
but x1 is not global.
Figure 1.1: f(x) = x^5/5 − 5x^4/4 + 5x^3/3 + 5x^2/2 − 6x.

Since Nonlinear Optimization is a very wide area and it is impossible to cover it


completely in an 11-week course, we will study some selected topics that will provide
a good overview of the area.

In this course we will study how to solve problems of the form


Min. f (x)
s.t. x ∈ Ω ⊆ Rn .
Function f : Rn → R is the objective function. Ω is the feasible region and is
described by a set of constraints. Any point x ∈ Ω is a feasible solution. If Ω = ∅,
the problem is infeasible. If Ω = Rn , we have an unconstrained problem.

The following result tells us that it is the same to minimize f or to maximize −f .


Therefore, we will just study minimization problems unless otherwise stated.
Proposition 1.3
The problems min_{x∈Ω} f(x) and −max_{x∈Ω} {−f(x)} have the same optimal value
and the same optimal solution(s):
1. min_{x∈Ω} f(x) = −max_{x∈Ω} {−f(x)}.
2. arg min_{x∈Ω} f(x) = arg max_{x∈Ω} {−f(x)}.

Definition 1.4
A point x* is a:
1. Global minimum if f(x*) ≤ f(x) ∀x ∈ Ω.
2. Local minimum if there are ε > 0 and a neighborhood N_ε such that f(x*) ≤
f(x) ∀x ∈ N_ε, where N_ε = {x ∈ Ω / ‖x − x*‖ < ε}.
3. Strict local minimum if there are ε > 0 and a neighborhood N_ε such that
f(x*) < f(x) ∀x ∈ N_ε, x ≠ x*, where N_ε = {x ∈ Ω / ‖x − x*‖ < ε}.
In Figure 1.2 we see an example of a 2-dimensional function. Point (0,0) is a local
maximum but not global. Points (0,1) and (0,-1) are global minima.
Figure 1.2: f(x1, x2) = (x1^2 + x2^2 − 1)^2 + (x2^2 − 1)^2.

It is easy to find examples of functions with multiple global minima.


Example 1.5
Let us consider the function

f(x) = x^2 if |x| ≥ 1,   f(x) = 1 if |x| ≤ 1.

All the points in the interval [−1, 1] are global minima, but none is strict.

Figure 1.3: A function with multiple global minima.


When we are minimizing, we are interested in finding a global minimum. However,
most of the results in Nonlinear Optimization can only guarantee that a certain point
is a local minimum. Finding global minima is the goal of Global Optimization, but
it is much harder to achieve.
Chapter 2

Basics of Nonlinear Optimization

We are interested in solving a problem of the form


Min. f (x)
s.t. x ∈ Ω ⊆ Rn .
Particularly, if Ω = Rn , we have an unconstrained problem.

Quite often, we assume that f is a smooth function, that is, a function that has
derivatives of all orders at all the points in Ω, which we write as f ∈ C^∞(Ω). Or, at
least, that f is k times differentiable, with the k-th derivative continuous: f ∈ C^k(Ω).

Unless otherwise stated, we will assume in this chapter that Ω is an open set.

2.1. First-Order Optimality Conditions


Given a point x*, we need to decide whether or not it is a minimum of function f.
A well-known result for a one-dimensional differentiable function f: (a, b) → R is
that, if x* is a minimum or a maximum, then f′(x*) = 0. For a multidimensional
function, this result is extended using the gradient. Before proving the result, we
need to recall Taylor's theorem and introduce the notion of descent direction.
Theorem 2.1 (Taylor's Theorem)
Let f ∈ C^1(Ω). If x ∈ Ω, p ∈ Rn, and [x, x + p] ⊆ Ω, then there is λ ∈ (0, 1) such
that

f(x + p) = f(x) + ∇f(x + λp)^t p.

If f ∈ C^2(Ω), then there is λ ∈ (0, 1) such that

f(x + p) = f(x) + ∇f(x)^t p + (1/2) p^t ∇^2 f(x + λp) p.

First, we establish that certain directions allow us to improve the value of f , that is,
they lead to smaller values.

Proposition 2.2
Let f ∈ C^1(Ω), x* ∈ Ω, and s ∈ Rn. If ∇f(x*)^t s < 0, then there is a value λ̄ > 0
such that f(x* + λs) < f(x*) ∀λ ∈ (0, λ̄).
Proof:
Since ∇f is continuous, there is λ̄ > 0 such that

x* + λs ∈ Ω and ∇f(x* + λs)^t s < 0 ∀λ ∈ (0, λ̄).

For any λ ∈ (0, λ̄), using Taylor's theorem we have that there is some ε_λ ∈ (0, 1)
such that
f(x* + λs) = f(x*) + ∇f(x* + ε_λ λs)^t λs.
Or, equivalently,
f(x* + λs) = f(x*) + λ ∇f(x* + λ_ε s)^t s
for λ_ε := ε_λ λ ∈ (0, λ).

As λ_ε < λ < λ̄, then ∇f(x* + λ_ε s)^t s < 0 and, thus,

f(x* + λs) = f(x*) + λ ∇f(x* + λ_ε s)^t s < f(x*).

Therefore,
f(x* + λs) < f(x*) ∀λ ∈ (0, λ̄). □

Definition 2.3
Given f ∈ C^1(Ω), a vector s ∈ Rn is said to be a descent direction for f at
point x* ∈ Ω if ∇f(x*)^t s < 0.

By changing the sign from “less than” to “greater than” in the previous result and
definition, we can define the notion of ascent direction.

Example 2.4
Let us consider f(x1, x2) = (x1^2 + x2^2 − 1)^2 + (x2^2 − 1)^2. See Figure 1.2 for a contour
plot of the function.

The gradient of this function is

∇f(x1, x2) = ( 4x1(x1^2 + x2^2 − 1), 4x2(x1^2 + 2x2^2 − 2) ).

If x* = (1, 1) and s = (−1, 0), then ∇f(x*)^t s = (4, 4)^t (−1, 0) = −4 < 0. This means
that (−1, 0) is a descent direction and we can reduce the value of f if we move in that
direction.

On the other hand, we can see that s̃ = (1, 0) is not a descent direction because
(4, 4)^t (1, 0) = 4 > 0.
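This check is mechanical enough to automate. Below is a minimal sketch in Python (ours, not part of the original notes; the function names are arbitrary) that tests Definition 2.3 numerically for the gradient of Example 2.4:

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x1, x2) = (x1^2 + x2^2 - 1)^2 + (x2^2 - 1)^2 from Example 2.4.
    x1, x2 = x
    return np.array([4 * x1 * (x1**2 + x2**2 - 1),
                     4 * x2 * (x1**2 + 2 * x2**2 - 2)])

def is_descent_direction(grad, x, s):
    # s is a descent direction for f at x iff grad(x)^t s < 0 (Definition 2.3).
    return float(grad(x) @ s) < 0

x = np.array([1.0, 1.0])
print(is_descent_direction(grad_f, x, np.array([-1.0, 0.0])))  # True:  (4,4)^t(-1,0) = -4 < 0
print(is_descent_direction(grad_f, x, np.array([1.0, 0.0])))   # False: (4,4)^t(1,0)  =  4 > 0
```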
Chapter 2. Basics of Nonlinear Optimization 9

Therefore, the only way not to find any ascent or descent direction is for the
gradient to be zero.

Proposition 2.5 (Necessary First-Order Optimality Condition)


Given f ∈ C 1 (Ω) and x∗ ∈ Ω, if x∗ is a local minimum or a local maximum, then
∇f (x∗ ) = 0.
Proof:
Suppose that ∇f(x*) ≠ 0. Then s = −∇f(x*) is a descent direction because

∇f(x*)^t s = −‖∇f(x*)‖^2 < 0,

which means that x* is not a local minimum.

A similar proof with s = ∇f(x*) shows that x* is not a local maximum. □

Definition 2.6
Given f ∈ C 1 (Ω), a point x∗ ∈ Ω is a stationary point if ∇f (x∗ ) = 0.

Therefore, every local minimum or maximum is a stationary point.

Example 2.7
For f(x1, x2) = (x1^2 + x2^2 − 1)^2 + (x2^2 − 1)^2, its gradient is

∇f(x1, x2) = ( 4x1(x1^2 + x2^2 − 1), 4x2(x1^2 + 2x2^2 − 2) ).

In order to obtain the stationary points, we need to solve the system

x1(x1^2 + x2^2 − 1) = 0,
x2(x1^2 + 2x2^2 − 2) = 0.

In the first equation, we have two possibilities: either x1 = 0 or x1^2 + x2^2 − 1 = 0.

If x1 = 0, then we have in the second equation that 0 = x2(2x2^2 − 2) = 2x2(x2^2 − 1),
for which there are 3 solutions: x2 = 0, x2 = −1, and x2 = 1. Therefore, we have
3 stationary points: (0, 0), (0, −1), and (0, 1).

If x1^2 + x2^2 − 1 = 0, when we substitute in the second equation, we obtain that
x2(x2^2 − 1) = 0, which gives us two possibilities: either x2 = 0 or x2^2 − 1 = 0.

If x2 = 0, then x1^2 − 1 = 0 and x1 = ±1. We obtain 2 new stationary points: (−1, 0)
and (1, 0).

If x2^2 − 1 = 0, then x1 = 0 and we obtain some of the stationary points that we had
already found.

Therefore, function f has 5 stationary points: (0, 0), (0, −1), (0, 1), (−1, 0), and
(1, 0).
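For a polynomial system like this one, the stationary-point computation can also be delegated to a computer algebra system. A minimal sketch with SymPy (our illustration, not part of the original notes):

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2', real=True)
f = (x1**2 + x2**2 - 1)**2 + (x2**2 - 1)**2

# Solve the system grad f = 0 symbolically.
grad = [sp.diff(f, v) for v in (x1, x2)]
print(sp.solve(grad, [x1, x2], dict=True))
# Should list the five stationary points found above:
# (0, 0), (0, -1), (0, 1), (-1, 0), (1, 0).
```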

2.2. Positive Semidefinite Matrices


Since not all stationary points are minima, we need to study higher order derivatives
to obtain a sufficient condition of optimality. Before that, we need to remember what
a positive semidefinite matrix is.
Definition 2.8
An n × n symmetric matrix A is:
1. Positive semidefinite if p^t A p ≥ 0 ∀p ∈ Rn. We denote it by A ⪰ 0.
2. Positive definite if p^t A p > 0 ∀p ∈ Rn, p ≠ 0. We denote it by A ≻ 0.
3. Negative semidefinite if p^t A p ≤ 0 ∀p ∈ Rn. We denote it by A ⪯ 0.
4. Negative definite if p^t A p < 0 ∀p ∈ Rn, p ≠ 0. We denote it by A ≺ 0.
5. Indefinite if there are p1, p2 ∈ Rn such that p1^t A p1 < 0 and p2^t A p2 > 0.

Clearly, if a matrix A is positive semidefinite (respectively, positive definite), then −A
is negative semidefinite (respectively, negative definite).

There are several more results on positive semidefinite matrices. However, it is
beyond the scope of this course to cover them all, and we are going to review only
those which are most useful to us. More specifically, with the definitions that we
have seen, it is not easy to check if a matrix is positive semidefinite. So, we need a
different characterization.
Lemma 2.9
Let A be an n × n symmetric matrix.
1. A is positive semidefinite if, and only if, all its eigenvalues are nonnegative.
2. A is positive definite if, and only if, all its eigenvalues are positive.
3. A is negative semidefinite if, and only if, all its eigenvalues are nonpositive.
4. A is negative definite if, and only if, all its eigenvalues are negative.
5. A is indefinite if it has at least one negative eigenvalue and at least one positive
eigenvalue.
Although the characterization is important, it is more useful in practice to use a
test based on the concept of leading principal minor. Given an n × n symmetric
matrix A, the k-th leading principal minor, 1 ≤ k ≤ n, is the determinant of the
upper left k × k submatrix and we denote it by Mk (A).
Theorem 2.10 (Leading Principal Minors Criterion)
Let A be an n × n symmetric matrix.
1. A is positive definite if, and only if, M1 (A) > 0, M2 (A) > 0,. . . , Mn (A) > 0.
2. A is negative definite if, and only if, all its leading principal minors of odd order
are negative and all its leading principal minors of even order are positive.
Example 2.11
Let

A = ( 4 2 3        B = ( 2 2 2         C = ( −4  1  1
      2 3 2              2 2 2                1 −4  1
      3 2 4 ),           2 2 −1 ),            1  1 −4 ).

A is positive definite because M1(A) = 4, M2(A) = 8, and M3(A) = 13.

B is neither positive definite nor negative definite because M2(B) = 0.

C is negative definite because M1(C) = −4, M2(C) = 15, and M3(C) = −50.

Note that the theorem does not provide information concerning positive or negative
semidefiniteness, which is harder to check.

One last useful result is the following:


Proposition 2.12
Let A be an n × n symmetric matrix.
1. If A is positive definite, then the diagonal elements of A are positive.
2. If A is positive semidefinite, then the diagonal elements of A are nonnegative.
3. If A is negative definite, then the diagonal elements of A are negative.
4. If A is negative semidefinite, then the diagonal elements of A are nonpositive.
5. If A has both positive and negative elements in the diagonal, then it is indefinite.

In Example 2.11, matrix B is indefinite because it has negative and positive elements
in the diagonal.
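Both tests are easy to carry out numerically. A small sketch with NumPy (ours, not part of the original notes; names are arbitrary) that computes the leading principal minors and classifies the matrices of Example 2.11 via Lemma 2.9:

```python
import numpy as np

def leading_principal_minors(A):
    # M_k(A): determinant of the upper-left k x k submatrix, k = 1, ..., n.
    n = A.shape[0]
    return [np.linalg.det(A[:k, :k]) for k in range(1, n + 1)]

def classify(A, tol=1e-12):
    # Classify a symmetric matrix by the signs of its eigenvalues (Lemma 2.9).
    lam = np.linalg.eigvalsh(A)
    if np.all(lam > tol):    return "positive definite"
    if np.all(lam >= -tol):  return "positive semidefinite"
    if np.all(lam < -tol):   return "negative definite"
    if np.all(lam <= tol):   return "negative semidefinite"
    return "indefinite"

A = np.array([[4, 2, 3], [2, 3, 2], [3, 2, 4]], dtype=float)
B = np.array([[2, 2, 2], [2, 2, 2], [2, 2, -1]], dtype=float)
C = np.array([[-4, 1, 1], [1, -4, 1], [1, 1, -4]], dtype=float)

for name, M in [("A", A), ("B", B), ("C", C)]:
    print(name, np.round(leading_principal_minors(M), 6), classify(M))
# A: minors [4, 8, 13] -> positive definite
# B: indefinite (mixed-sign eigenvalues)
# C: minors [-4, 15, -50] -> negative definite
```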

2.3. Second-Order Optimality Conditions


We will use information from the Hessian matrix to obtain information about minima
and maxima. All local minima and maxima must satisfy a certain property (in
addition to the gradient being zero).

Proposition 2.13 (Necessary Second-Order Optimality Condition)


Let f ∈ C 2 (Ω) and x∗ ∈ Ω.
1. If x∗ is a local minimum, then ∇2 f (x∗ ) is positive semidefinite.
2. If x∗ is a local maximum, then ∇2 f (x∗ ) is negative semidefinite.
Proof:
We only prove the first result. Assume that ∇^2 f(x*) is not positive semidefinite.
Then, there is p ∈ Rn, p ≠ 0, such that p^t ∇^2 f(x*) p < 0. Since ∇^2 f is continuous,
there is λ̄ > 0 such that p^t ∇^2 f(x* + λp) p < 0 ∀λ ∈ (0, λ̄).

Given a fixed value λ ∈ (0, λ̄), using Taylor's theorem, we know that there is some
ε_λ ∈ (0, 1) such that

f(x* + λp) = f(x*) + ∇f(x*)^t λp + (1/2)(λp)^t ∇^2 f(x* + ε_λ λp) λp.

Or, equivalently,

f(x* + λp) = f(x*) + λ ∇f(x*)^t p + (1/2) λ^2 p^t ∇^2 f(x* + λ_ε p) p

for λ_ε := ε_λ λ ∈ (0, λ).

As ∇f(x*) = 0 (because x* is a local minimum) and p^t ∇^2 f(x* + λ_ε p) p < 0 (because
0 < λ_ε < λ < λ̄ and ∇^2 f is continuous), we have that f(x* + λp) < f(x*).

Therefore, f(x* + λp) < f(x*) ∀λ ∈ (0, λ̄) and x* cannot be a local minimum. □

Note, however, that the previous condition is not sufficient: f(x) = x^3 satisfies
f′(0) = f″(0) = 0, but x = 0 is neither a local minimum nor a local maximum.

The good news is that there is a condition that, if met, tells us that we have found
a local minimum or maximum.
Theorem 2.14 (Sufficient Second-Order Optimality Condition)
Let f ∈ C 2 (Ω) and x∗ ∈ Ω.
1. If ∇f (x∗ ) = 0 and ∇2 f (x∗ ) is positive definite, then x∗ is a strict local minimum.
2. If ∇f (x∗ ) = 0 and ∇2 f (x∗ ) is negative definite, then x∗ is a strict local maximum.
Proof:
Since ∇^2 f is continuous and positive definite at x*, there is ε > 0 such that ∇^2 f(x)
is positive definite for all x in the ball B = {y ∈ Rn / ‖y − x*‖ < ε}.

Let p ∈ Rn be such that ‖p‖ < ε. Then, x* + p ∈ B and, using that ∇f(x*) = 0, by
Taylor's formula we have that

f(x* + p) = f(x*) + ∇f(x*)^t p + (1/2) p^t ∇^2 f(x* + λp) p = f(x*) + (1/2) p^t ∇^2 f(x* + λp) p

for some λ ∈ (0, 1). As x* + λp ∈ B, then p^t ∇^2 f(x* + λp) p > 0 and f(x* + p) > f(x*).
Therefore, x* is a strict local minimum. □

However, the conditions of the previous theorem are not necessary. A point can
be a local minimum without the Hessian matrix being positive definite. Consider
f(x) = x^4 at x = 0.

Example 2.15
In the function of Figure 1.1, f″(x) = 4x^3 − 15x^2 + 10x + 5. We had seen that
x = 1 and x = 3 are stationary points. Now, since f″(1) = 4 > 0 and f″(3) = 8 > 0, we
can guarantee that both points are strict local minima.
Example 2.16
In the function of Example 2.7,

∇^2 f(x1, x2) = 4 ( 3x1^2 + x2^2 − 1   2x1x2 ;  2x1x2   x1^2 + 6x2^2 − 2 ).

We had seen that the stationary points are (0, 0), (−1, 0), (1, 0), (0, 1), and (0, −1).
We have that:

• ∇^2 f(0, 0) = 4 ( −1 0 ; 0 −2 ) ≺ 0. Thus, (0, 0) is a strict local maximum.

• ∇^2 f(−1, 0) = ∇^2 f(1, 0) = 4 ( 2 0 ; 0 −1 ), which is an indefinite matrix. So, these
points are neither local minima nor maxima.

• ∇^2 f(0, −1) = ∇^2 f(0, 1) = 4 ( 0 0 ; 0 4 ) ⪰ 0. We do not have enough information
to decide whether we have here a local minimum or maximum (or neither).

Definition 2.17
A point that is neither a local maximum nor a local minimum is a saddle point.

In the previous example, (−1, 0) and (1, 0) are saddle points.

2.4. Convexity and Global Optimality Conditions


Unfortunately, in general, we cannot give sufficient conditions for global minima.
However, we are going to see that if the function is convex, then we can guarantee
that a local minimum is global.

Throughout this section we will assume that Ω is an open convex set.


Definition 2.18
A function f : Ω → R is convex if

f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y) ∀x, y ∈ Ω, ∀λ ∈ [0, 1].

This means that for any two points, the segment joining their images is above the
graph of the function.
Figure 2.1: Convexity.

An important result is the following:

Proposition 2.19 (Gradient Inequality)

A function f ∈ C^1(Ω) is convex if, and only if, it satisfies

f(y) ≥ f(x) + ∇f(x)^t (y − x) ∀x, y ∈ Ω.

Proof:
“⇒” If f is convex, for all λ ∈ (0, 1] we have that

f(λy + (1 − λ)x) ≤ λf(y) + (1 − λ)f(x);
f(x + λ(y − x)) − f(x) ≤ λf(y) − λf(x);
[f(x + λ(y − x)) − f(x)] / λ ≤ f(y) − f(x).

Taking limits when λ tends to 0:

lim_{λ→0} [f(x + λ(y − x)) − f(x)] / λ ≤ f(y) − f(x).

Now,

lim_{λ→0} [f(x + λ(y − x)) − f(x)] / λ = ∇f(x)^t (y − x),

because the left-hand side is the definition of the directional derivative along vector y − x
and the equality is true for any differentiable function.

“⇐” Suppose that the gradient inequality is true. For all x, y ∈ Ω and λ ∈ [0, 1],
let a_λ = λx + (1 − λ)y. We have that

f(x) ≥ f(a_λ) + ∇f(a_λ)^t (x − a_λ),

and

f(y) ≥ f(a_λ) + ∇f(a_λ)^t (y − a_λ).

If we multiply these inequalities by λ and 1 − λ, respectively, and add them together:

λf(x) + (1 − λ)f(y) ≥ f(a_λ) + ∇f(a_λ)^t (λx + (1 − λ)y − a_λ) = f(λx + (1 − λ)y).

Therefore, f is convex. □

The previous result states that, if f is convex, then its first-order Taylor
approximation underestimates f.

Figure 2.2: Gradient inequality: the tangent line y = f(x0) + ∇f(x0)^t (x − x0) lies below the graph y = f(x).

The gradient inequality is used in the proof of the following important result:

Theorem 2.20
Let f : Ω → R be convex and x∗ ∈ Ω.
1. If x∗ is a local minimum, then x∗ is also a global minimum.
2. If f ∈ C 1 (Ω) and x∗ is a stationary point, then x∗ is also a global minimum.
Proof:
1. Assume that x* is not a global minimum. Then, there is y ∈ Ω, y ≠ x*, such that
f(y) < f(x*). Since f is convex, for all λ ∈ (0, 1] we have that

f(λy + (1 − λ)x*) ≤ λf(y) + (1 − λ)f(x*) < λf(x*) + (1 − λ)f(x*) = f(x*).

On the other hand, because x* is a local minimum, there is ε > 0 small enough
such that
f(x*) ≤ f(x) ∀x / ‖x* − x‖ < ε.

Now, we define x_λ = λy + (1 − λ)x* and we take λ = ε / (2‖x* − y‖). We have that

‖x* − x_λ‖ = ‖x* − λy − (1 − λ)x*‖ = ‖λx* − λy‖ = λ‖x* − y‖ = ε/2.

This means that f(x*) ≤ f(x_λ), but we have seen that f(x_λ) < f(x*), which is a
contradiction.
2. Since x* is a stationary point, ∇f(x*) = 0. Using the gradient inequality we have
that
f(y) ≥ f(x*) + ∇f(x*)^t (y − x*) = f(x*) ∀y ∈ Ω.
Thus, x* is a global minimum. □

In general, convexity is not easy to check. However, if the function is twice
differentiable, we have a characterization of convexity.
Proposition 2.21
A function f ∈ C 2 (Rn ) is convex if, and only if, ∇2 f (x) is positive semidefinite for
all x ∈ Rn .
Example 2.22
Let us study the minima of the function f(x1, x2) = x1^4 + x2^4 + x1^2 + x2^2.

The gradient is ∇f(x1, x2) = (4x1^3 + 2x1, 4x2^3 + 2x2)^t and the only stationary point
is x* = (0, 0).

The Hessian matrix is ∇^2 f(x1, x2) = ( 12x1^2 + 2   0 ;  0   12x2^2 + 2 ). Since its
eigenvalues, 12x1^2 + 2 > 0 and 12x2^2 + 2 > 0, are positive for all (x1, x2), f is convex.
Therefore, x* = (0, 0) is a global minimum.
Chapter 3

Line Search Methods in Unconstrained Optimization

In general, it is very difficult to solve nonlinear optimization problems explicitly.


Although we know that, if we are optimizing a differentiable function f over an
open set Ω, all the local minima and maxima are solutions to the system of equa-
tions ∇f (x) = 0, solving this system is usually hard because nonlinear equations are
involved.

Therefore, we need algorithms. We choose a starting point x^0 and generate a sequence
of iterates {x^k}_{k=1}^∞ that terminates either when no more progress can be made or when
the solution has been approximated with enough accuracy. Information on function f
is used to generate iterate x^k from previous iterates x^0, x^1, ..., x^{k−1}. There are two
main strategies for generating iterates: line search methods and trust-region methods.

In a line search strategy, the algorithm chooses a direction d^k and searches along this
direction from the current iterate x^k for a new iterate with a smaller function value.
That is, we solve the following one-variable problem to decide the step length α
that we will move along d^k:

Min.  f(x^k + α d^k)
s.t.  α > 0.

An exact solution of this problem is usually expensive and unnecessary. So, quite
often this problem is solved approximately to a level of accuracy that is satisfactory
enough and the following iterate is generated. At the new point, a new search
direction is generated and a new step length is computed, and so on.

In a trust-region strategy, a model function m_k is defined so that its behavior near the
current iterate x^k is similar to that of the objective function f. Since m_k may not
be a good approximation of f far from x^k, we want to stay in a “small” region around x^k.
We solve the following problem to find a direction d so that the next iterate is x^k + d:

Min.  m_k(x^k + d)
s.t.  x^k + d is in the trust region.

Usually, a value δ > 0 is chosen and the trust region is set to {x ∈ Rn / ‖x − x^k‖ < δ}.

As we can see, broadly speaking, the difference between line search and trust region
methods is that in a line search method, we choose first the direction and then
decide the step length, while in a trust region method, we set first a maximum step
length and then look for a direction. In this chapter, we are going to study the first
technique whereas the second one will be studied in Chapter 5.

Therefore, in this chapter we will study how to solve the problem

Min. f (x)
s.t. x ∈ Rn ,

using a line search method.

In general, if f ∈ C^1(Rn), a generic line search method is as follows:

Algorithm 3.1 (General Line Search Method, GLSM)

Step 1: Choose a tolerance level ε > 0 and a starting point x^0 ∈ Rn. Let k := 0.
Step 2: If ‖∇f(x^k)‖ < ε, then STOP.
Step 3: Compute a descent search direction d^k ∈ Rn:

∇f(x^k)^t d^k < 0.

Step 4: Compute a step length α_k > 0 along d^k such that

f(x^k + α_k d^k) < f(x^k).

Step 5: Update x^{k+1} := x^k + α_k d^k and k := k + 1. Go to Step 2.

We can see some iterations of this method in Figure 3.1.

When we use a line search method, we need to decide a starting point and, at each
iteration, a descent direction and a step size. The starting point is usually chosen
arbitrarily (unless there is some additional information about where a local minimum
could lie). Next we will see how to deal with the other two issues.

Figure 3.1: Line search method.

3.1. Step Length


Given a descent direction d^k, if we are to obtain the best step length α_k, then we
must solve the problem

Min.  f(x^k + α d^k)
s.t.  α > 0.

If we do not want to use a constant step size for every step (among other reasons,
because it is not clear what step size we should use), then we can try to solve exactly
the previous problem and we will be using an exact line search strategy. However,
this problem is usually difficult for nonlinear functions. Instead, inexact methods
are more commonly used in practice, methods that do not solve the problem exactly
but get a “good” α instead.

A possible inexact strategy is to use backtracking: we take an initial value α0 and


decrease it by a factor τ until we obtain a better value of f .

Algorithm 3.2 (Backtracking for Step Size Selection)

Step 1: Choose α_0 > 0 and τ ∈ (0, 1). Let t := 0.
Step 2: If f(x^k + α_t d^k) < f(x^k), then α_k := α_t and STOP.
Step 3: α_{t+1} := τ α_t, t := t + 1. Go to Step 2.

For example, if α_0 := 1 and τ = 0.5, then α_1 = 0.5, α_2 = 0.25, α_3 = 0.125, etc.



Since dk is a descent direction, there will be a certain αt for which the inequality of
Step 2 holds.

However, we want not only to improve the value of f, but also that this decrease be
“big” enough. Otherwise, we could end up having very small improvements and the
algorithm would not converge. A rule used in practice is the Armijo condition:

f(x^k + α_k d^k) ≤ f(x^k) + c1 α_k ∇f(x^k)^t d^k,

for a certain c1 ∈ (0, 1) chosen beforehand. In practice, c1 is chosen with a small
value (for example, 10^{−3} or 10^{−4}).

What the rule says is that the decrease should be proportional to the step size α_k
and the directional derivative ∇f(x^k)^t d^k. If we define the one-dimensional functions
Φ(α) := f(x^k + α d^k) and ℓ(α) := f(x^k) + c1 α ∇f(x^k)^t d^k, then we accept those values
of α for which Φ(α) is below the linear function ℓ(α) (see Figure 3.2).

Figure 3.2: Armijo condition.

But we need something more than just the Armijo condition for the next iterate to be
good enough, because the Armijo condition is satisfied by any value of α sufficiently small.
This is why a second condition, called the curvature condition, is added to guarantee
that α is large enough:

∇f(x^k + α_k d^k)^t d^k ≥ c2 ∇f(x^k)^t d^k,

for another constant c2 ∈ (c1, 1) chosen beforehand. Since the left-hand side is Φ′(α_k)
and the right-hand side is c2 Φ′(0), what we are requiring is that the slope of Φ
at α_k is not too negative, that is, we are not too close to the starting point x^k (see
Figure 3.3). Usual values of c2 are 0.9 for Newton and quasi-Newton methods and
0.1 for a nonlinear conjugate gradient method (techniques which we will study in
some weeks).

Figure 3.3: Curvature condition.

The combination of the two conditions is known as the Wolfe conditions:

f(x^k + α_k d^k) ≤ f(x^k) + c1 α_k ∇f(x^k)^t d^k,
∇f(x^k + α_k d^k)^t d^k ≥ c2 ∇f(x^k)^t d^k,

with 0 < c1 < c2 < 1.

Now it is natural to wonder if we can always find a step size that satisfies the Wolfe
conditions. The answer is affirmative under some mild conditions.

Proposition 3.3
Let f ∈ C^1(Rn), 0 < c1 < c2 < 1, and let d^k be a descent direction for f at x^k ∈ Rn.
If f is bounded below along the ray {x^k + α d^k / α > 0}, then there exist intervals
of step lengths that satisfy the Wolfe conditions.
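In practice, a Wolfe line search is rarely coded from scratch; for instance, SciPy's scipy.optimize.line_search returns a step size satisfying the (strong) Wolfe conditions. A minimal sketch, with a quadratic test function of our own (not from the notes):

```python
import numpy as np
from scipy.optimize import line_search

f = lambda x: float(x @ x)     # f(x) = ||x||^2, a simple test function
grad = lambda x: 2 * x

xk = np.array([1.0, 1.0])
dk = -grad(xk)                 # a descent direction: grad(xk)^t dk = -8 < 0

# c1 and c2 play the roles of the constants in the Wolfe conditions above.
alpha, *_ = line_search(f, grad, xk, dk, c1=1e-4, c2=0.9)
print(alpha, xk + alpha * dk)  # here alpha = 0.5 lands on the minimizer (0, 0)
```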

If we do backtracking to obtain the step size, we have the following algorithm:

Algorithm 3.4 (Armijo Condition with Backtracking, BA)

Step 1: Choose α_0 > 0, β ∈ (0, 1), and τ ∈ (0, 1). Let t := 0.
Step 2: If f(x^k + α_t d^k) ≤ f(x^k) + β α_t ∇f(x^k)^t d^k, then α_k := α_t and STOP.
Step 3: α_{t+1} := τ α_t, t := t + 1. Go to Step 2.

Note that, since now we are not using parameter c2 , we write β instead of c1 .
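A direct transcription of Algorithm 3.4 into Python might look as follows (a sketch of ours, not part of the original notes):

```python
import numpy as np

def backtracking_armijo(f, grad_f, x, d, alpha0=1.0, beta=1e-4, tau=0.5):
    # Algorithm 3.4: shrink alpha by tau until the Armijo condition holds.
    # Assumes d is a descent direction at x, so termination is guaranteed
    # under Lipschitz continuity of the gradient (cf. Corollary 3.8 below).
    alpha = alpha0
    slope = float(grad_f(x) @ d)   # directional derivative, negative by assumption
    while f(x + alpha * d) > f(x) + beta * alpha * slope:
        alpha *= tau
    return alpha
```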

It is easy to see that, if we use backtracking, it is always possible to find (under mild
conditions) a step size α that satisfies the Armijo condition. We are going to need
Lipschitz continuity.

Definition 3.5
A function f : Rn → Rm is Lipschitz continuous if there is L > 0 such that

||f (x) − f (y)|| ≤ L||x − y|| ∀x, y ∈ Rn .

It can be proved that f ∈ C^2(Rn) has a Lipschitz continuous gradient with Lipschitz
constant L if, and only if, ‖∇^2 f(x)‖ ≤ L for all x ∈ Rn.

The following result is used in the proof of Lemma 3.7.


Proposition 3.6
Let f ∈ C^1(Rn) and let ∇f be Lipschitz continuous with Lipschitz constant L. If
p ∈ Rn, then

f(x + p) ≤ f(x) + ∇f(x)^t p + L‖p‖^2 / 2.

Proof:
Given v ∈ Rn and λ ∈ R, if we define g(λ) := f(x + λv), we have that g′(λ) =
∇f(x + λv)^t v and that f(x + v) − f(x) = ∫_0^1 g′(λ) dλ. Therefore,

f(x + p) − f(x) − ∇f(x)^t p = ∫_0^1 [∇f(x + λp) − ∇f(x)]^t p dλ
≤ ∫_0^1 ‖∇f(x + λp) − ∇f(x)‖ ‖p‖ dλ ≤ ∫_0^1 L‖λp‖ ‖p‖ dλ = L‖p‖^2 ∫_0^1 λ dλ = L‖p‖^2 / 2. □

When the gradient is Lipschitz continuous, there is an interval of values of α that
satisfy the Armijo condition.

Lemma 3.7
Let f ∈ C^1(Rn), β ∈ (0, 1), x ∈ Rn, and let d ∈ Rn be a descent direction for f at x.
If ∇f is Lipschitz continuous with Lipschitz constant L, then the Armijo condition

f(x + αd) ≤ f(x) + αβ ∇f(x)^t d

is satisfied for all α ∈ [0, ω], with ω = 2(β − 1)∇f(x)^t d / (L‖d‖^2).

Proof:
By Proposition 3.6, if α ∈ [0, ω], we have that

f(x + αd) ≤ f(x) + α∇f(x)^t d + Lα^2‖d‖^2 / 2
≤ f(x) + α∇f(x)^t d + (Lα/2) · [2(β − 1)∇f(x)^t d / (L‖d‖^2)] · ‖d‖^2 = f(x) + αβ ∇f(x)^t d. □

Moreover, as we can see next, if we apply backtracking and use the Armijo rule, we
will find a step size in a finite number of iterations.

Corollary 3.8
Let β, τ ∈ (0, 1), f ∈ C^1(Rn), x^k ∈ Rn, and let d^k ∈ Rn be a descent direction
for f at x^k. If ∇f is Lipschitz continuous with Lipschitz constant L, then the
step size search generated by backtracking Armijo (BA) terminates with a step size
α_k ≥ min{α_0, τω_k}, where α_0 is the initial value that starts the backtracking search
and ω_k = 2(β − 1)∇f(x^k)^t d^k / (L‖d^k‖^2).

Proof:
If α_0 satisfies the Armijo condition, then α_k := α_0. Otherwise, we multiply this
value t times by τ, α_t = τ^t α_0, until we have that α_t ≤ ω_k and α_{t−1} > ω_k. This last
inequality implies that τω_k < τα_{t−1} = α_t =: α_k. □

If we use backtracking with the Armijo condition, then we have the following
convergence result for the General Line Search Method (we write BA-GLSM for short):

Theorem 3.9 (Convergence for BA-GLSM)

Let f ∈ C^1(Rn). If ∇f is Lipschitz continuous on Rn, then the iterates generated
by BA-GLSM satisfy one of three possibilities:
1. ∇f(x^{k̃}) = 0 for some k̃ ≥ 0.
2. lim_{k→+∞} f(x^k) = −∞.
3. lim_{k→+∞} |∇f(x^k)^t d^k| · min{1, 1/‖d^k‖} = 0.
Proof:
Assume that neither of the first two possibilities holds. Using the Armijo condition
for each iteration, we have that

f(x^{j+1}) ≤ f(x^j) + β α_j ∇f(x^j)^t d^j,  j = 0, 1, ..., k.

Adding together all these inequalities:

f(x^{k+1}) ≤ f(x^0) + β Σ_{j=0}^{k} α_j ∇f(x^j)^t d^j.

Since f(x^k) does not tend to −∞, the sum is bounded:

−∞ < [f(x^{k+1}) − f(x^0)] / β ≤ Σ_{j=0}^{+∞} α_j ∇f(x^j)^t d^j < +∞,

where the last inequality holds because ∇f(x^j)^t d^j < 0 ∀j. Therefore,

lim_{j→+∞} α_j ∇f(x^j)^t d^j = 0.

Finally, we partition N as K1 ∪ K2 with

K1 = {k / α_k = α_0},  K2 = {k / α_k < α_0}.

If we restrict to k ∈ K1, we have that

0 = lim_{j→+∞, j∈K1} α_j ∇f(x^j)^t d^j = α_0 lim_{j→+∞, j∈K1} ∇f(x^j)^t d^j;

lim_{j→+∞, j∈K1} ∇f(x^j)^t d^j = 0.

If we restrict to k ∈ K2, then α_k < α_0, which means (by Corollary 3.8) that

α_k ≥ τω_k = 2τ(β − 1)∇f(x^k)^t d^k / (L‖d^k‖^2).

Thus,

α_k ∇f(x^k)^t d^k ≤ 2τ(β − 1)[∇f(x^k)^t d^k]^2 / (L‖d^k‖^2) < 0.

So, since the left-hand side tends to 0 when k tends to infinity, we can conclude that

lim_{k→+∞, k∈K2} ∇f(x^k)^t d^k / ‖d^k‖ = 0.

Since K1 ∪ K2 = N, this concludes the proof. □

Therefore, if we do not find a stationary point in a finite number of steps (the first
possibility does not hold) and f is bounded below (so the second possibility does not
hold), then the third possibility holds. Note that this condition means that, if the
descent directions do not tend to become orthogonal to the gradient and the
search directions do not tend to the null vector, then BA-GLSM converges (that is,
‖∇f(x^k)‖ converges to zero). In order to see this, remember that for any two vectors
v1, v2 ∈ Rn which form an angle θ it holds that v1^t v2 = ‖v1‖‖v2‖cos θ. Taking this
into account, we can write the third condition of the previous theorem as follows:

0 = lim_{k→+∞} |∇f(x^k)^t d^k| min{1, 1/‖d^k‖} = lim_{k→+∞} ‖∇f(x^k)‖‖d^k‖|cos θ_k| min{1, 1/‖d^k‖}
  = lim_{k→+∞} min{ ‖∇f(x^k)‖‖d^k‖|cos θ_k|, ‖∇f(x^k)‖|cos θ_k| }.

There exists a similar result for the Wolfe conditions.

Theorem 3.10
Let f ∈ C^1(N), where N is an open set containing the level set L = {x ∈ Rn / f(x) ≤
f(x^0)} and x^0 is the starting point of the General Line Search Method GLSM
(Algorithm 3.1) in which the step sizes satisfy the Wolfe conditions. If f is bounded
below in Rn and ∇f is Lipschitz continuous over N, then

lim_{k→∞} |cos θ_k| ‖∇f(x^k)‖ = 0.

Next, we will study a popular choice of descent direction: the steepest descent
method.

3.2. Steepest Descent Method


In the steepest descent method (also known as the gradient descent method), we
look for the direction along which f decreases most rapidly, which is the opposite of
the gradient.

In order to see this, we define Φ(α) := f(x^k + α d^k), where the direction d^k has norm
one and ∇f(x^k) ≠ 0. We are interested in minimizing the rate of change of function f
at point x^k along direction d^k (i.e., getting a value as negative as possible), that is,
minimizing

Φ′(0) = ∇f(x^k)^t d^k = ‖∇f(x^k)‖‖d^k‖cos θ = ‖∇f(x^k)‖cos θ,

where θ is the angle formed by ∇f(x^k) and d^k. Clearly, the minimum is achieved
for θ = π. Therefore, the direction (among those with unitary norm) that gives the
largest decrease is

d^k = −∇f(x^k) / ‖∇f(x^k)‖.

Besides, it is obvious that d^k = −∇f(x^k)/‖∇f(x^k)‖ is a descent direction because
∇f(x^k)^t d^k = −‖∇f(x^k)‖ < 0.

We have therefore the following line search approach:


Algorithm 3.11 (Steepest Descent Method)
Step 1: Choose a tolerance level ε > 0 and a starting point x^0 ∈ Rn. Let k := 0.
Step 2: If ‖∇f(x^k)‖ < ε, then STOP.
Step 3: Compute a step length α_k > 0 along −∇f(x^k) such that

f(x^k − α_k ∇f(x^k)) < f(x^k).

Step 4: Update x^{k+1} := x^k − α_k ∇f(x^k) and k := k + 1. Go to Step 2.

This approach can be combined with the Armijo condition. Moreover, applying
Theorem 3.9 we have a convergence result for the Backtracking Armijo Steepest
Descent Method (BA-SDM).
Theorem 3.12 (Convergence for BA-SDM)
Let f ∈ C^1(Rn). If ∇f is Lipschitz continuous, then the iterates generated by
BA-SDM satisfy one of three possibilities:
1. ∇f(x^{k̃}) = 0 for some k̃ ≥ 0.
2. lim_{k→+∞} f(x^k) = −∞.
3. lim_{k→+∞} ‖∇f(x^k)‖ = 0.

The theorem states that, if f is bounded below, then the steepest descent method
is globally convergent. The analogous result when using Wolfe conditions is an
immediate consequence of Theorem 3.10. However, it must be noted that in practice,
its convergence is usually quite slow.

Example 3.13
Let us apply the steepest descent method (with backtracking Armijo) to find a
stationary point of the function f(x, y) = x^4 + 2x^3 + 2x^2 + y^2 − 2xy.

The gradient is ∇f(x, y) = (4x^3 + 6x^2 + 4x − 2y, 2y − 2x).

Step 1: We start with x^0 := (1, 1). We set ε := 10^{−6}. In order to use the Armijo
condition, we set α_0 := 1, τ := 0.5, and β := 10^{−4}.
Step 2: ∇f(x^0) = (12, 0). Since ‖∇f(x^0)‖ > ε, we must generate a new point.
Step 3: We need to compute a step size α_0 such that

f(x^0 − α_0 ∇f(x^0)) ≤ f(x^0) − β α_0 ‖∇f(x^0)‖^2;
f(1 − 12α_0, 1) ≤ 4 − 0.0144 α_0.

In this case, we obtain α_0 = 0.125.
Step 4: The new point is x^1 := x^0 − α_0 ∇f(x^0) = (−0.5, 1).
Step 5: ∇f(x^1) = (−3, 3). Since ‖∇f(x^1)‖ > ε, we must generate a new point.
Step 6: We need to compute a step size α_1 such that

f(x^1 − α_1 ∇f(x^1)) ≤ f(x^1) − β α_1 ‖∇f(x^1)‖^2;
f(−0.5 + 3α_1, 1 − 3α_1) ≤ 2.3125 − 0.0018 α_1.

In this case, α_1 = 0.25.
Step 7: The new point is x^2 := x^1 − α_1 ∇f(x^1) = (0.25, 0.25).
Step 8: ∇f(x^2) = (0.9375, 0). Since ‖∇f(x^2)‖ > ε, we must generate a new point.
Step 9: We need to compute a step size α_2 such that

f(x^2 − α_2 ∇f(x^2)) ≤ f(x^2) − β α_2 ‖∇f(x^2)‖^2;
f(0.25 − 0.9375α_2, 0.25) ≤ 0.09765625 − 0.0000878906 α_2.

In this case, α_2 = 0.25.
Step 10: The new point is x^3 := x^2 − α_2 ∇f(x^2) = (0.015625, 0.25).
Step 11: ∇f(x^3) = (−0.436202, 0.46875). Since ‖∇f(x^3)‖ > ε, we must generate
a new point.

The algorithm continues until the tolerance limit is reached, that is, until we are
sufficiently close to a stationary point.

We can summarize the information of the iterations that we have done in the
following table, and these iterations (and some more) can be seen graphically in
Figure 3.4.

(x, y)             ∇f                  f
(1, 1)             (12, 0)             4
(−0.5, 1)          (−3, 3)             2.3125
(0.25, 0.25)       (0.9375, 0)         0.0977
(0.0156, 0.25)     (−0.4362, 0.4688)   0.0552


Figure 3.4: Steepest descent method.
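The whole run can be reproduced in a few lines of code. The sketch below (ours, not part of the original notes) combines Algorithm 3.11 with backtracking Armijo, uses the same parameters as above (α_0 = 1, τ = 0.5, β = 10^{−4}, ε = 10^{−6}), and prints the first three iterates, which match the table:

```python
import numpy as np

def f(x):
    x1, x2 = x
    return x1**4 + 2 * x1**3 + 2 * x1**2 + x2**2 - 2 * x1 * x2

def grad_f(x):
    x1, x2 = x
    return np.array([4 * x1**3 + 6 * x1**2 + 4 * x1 - 2 * x2, 2 * x2 - 2 * x1])

def steepest_descent(f, grad_f, x, eps=1e-6, beta=1e-4, tau=0.5, max_iter=10_000):
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:
            break
        d = -g                              # steepest descent direction
        alpha, slope = 1.0, float(g @ d)    # backtracking Armijo (Algorithm 3.4)
        while f(x + alpha * d) > f(x) + beta * alpha * slope:
            alpha *= tau
        x = x + alpha * d
        if k < 3:
            print(x, f(x))   # (-0.5, 1), (0.25, 0.25), (0.015625, 0.25), as in the table
    return x

print(steepest_descent(f, grad_f, np.array([1.0, 1.0])))  # should approach (0, 0)
```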

3.2.1. Convergence Rate


Once we have an algorithm, we would like the iterates to terminate as soon as
possible. Therefore, it is important to consider the convergence rate.

Definition 3.14
Let {x^k} be a sequence in Rn that converges to a point x*. It is said that there is
linear convergence if there is a constant r ∈ (0, 1), called the rate of convergence,
such that

lim_{k→+∞} ‖x^{k+1} − x*‖ / ‖x^k − x*‖ = r.

This means that the distance to the limit x* eventually decreases at each iteration by
at least a constant factor r. The convergence is superlinear if r = 0 and it is sublinear
if r = 1.

There is quadratic convergence if there is M > 0 (we do not require that M < 1)
such that

lim_{k→+∞} ‖x^{k+1} − x*‖ / ‖x^k − x*‖^2 = M.

Example 3.15
Let us consider the following sequences:

a_n = 1/n,  b_n = 1/2^n,  c_n = 1/n^n,  d_n = 1/2^{2^n}.

All of them converge to 0, but a_n does it sublinearly, b_n linearly (with rate 0.5),
c_n superlinearly, and d_n quadratically.
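These rates can be observed by printing the quotients from Definition 3.14; a quick numerical check of ours (not part of the original notes):

```python
# Quotients x_{n+1}/x_n and x_{n+1}/x_n^2 for the sequences of Example 3.15, at n = 5.
for name, x in [("a_n = 1/n",       lambda n: 1.0 / n),
                ("b_n = 1/2^n",     lambda n: 2.0**-n),
                ("c_n = 1/n^n",     lambda n: float(n)**-n),
                ("d_n = 1/2^(2^n)", lambda n: 2.0**-(2**n))]:
    n = 5
    print(name, x(n + 1) / x(n), x(n + 1) / x(n)**2)
# a_n: first quotient tends to 1 (sublinear); b_n: it equals 0.5 (linear);
# c_n: it tends to 0 (superlinear); d_n: the second quotient equals 1 (quadratic).
```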

The following result tells us that the convergence rate of the steepest descent method
is linear if we use exact line search. In general, convergence does not improve if
inexact line search is used instead.

Theorem 3.16
Let f ∈ C^2(Rn) and assume that the iterates of the steepest descent method with
exact line search converge to a point x* for which the Hessian matrix ∇^2 f(x*) is
positive definite. If r is a scalar such that

r ∈ ( (λ_n − λ_1)/(λ_n + λ_1), 1 ),

where λ_1 ≤ λ_2 ≤ ... ≤ λ_n are the eigenvalues of ∇^2 f(x*), then

[f(x^{k+1}) − f(x*)] / [f(x^k) − f(x*)] ≤ r^2

for all k sufficiently large.

3.2.2. Scaling
The steepest descent method is sensitive to scaling. Therefore, depending on the
scale that we use, the convergence can be faster or slower.
Example 3.17
Let us study the convergence of the steepest descent method with exact line search
for the quadratic function

q(x1, x2) = (1/2)(a x1^2 + x2^2) = (1/2) x^t B x,  with B = ( a 0 ; 0 1 ),

where a > 0 and we start at point x^0 = (1, a).

It is easy to see that x* = (0, 0) is the only global minimum. Nevertheless, we are
going to analyze the convergence rate.

Since we have a quadratic form, it is easy to see (it is left as an exercise) that the
optimal step size when we use exact line search is

α* = ‖∇q(x)‖^2 / (∇q(x)^t B ∇q(x)),

where ∇q(x) = Bx = (a x1, x2)^t, so α* = (a^2 x1^2 + x2^2) / (a^3 x1^2 + x2^2). So:

1. x^0 = (1, a).
2. x^1 = x^0 − α_0 ∇q(x^0) = (1, a) − (2/(1+a))(a, a) = ( (1−a)/(1+a), a(a−1)/(1+a) ) = ((1−a)/(1+a)) (1, −a).
3. x^2 = x^1 − α_1 ∇q(x^1) = ((1−a)/(1+a)) (1, −a) − ((1−a)/(1+a)) (2/(1+a)) (a, −a) = ((1−a)/(1+a))^2 (1, a).

In general, it can be seen that x^k = ((1−a)/(1+a))^k (1, (−1)^k a). Therefore,
lim_{k→+∞} x^k = (0, 0) and

lim_{k→+∞} ‖x^{k+1} − (0, 0)‖ / ‖x^k − (0, 0)‖
  = lim_{k→+∞} |(1−a)/(1+a)|^{k+1} ‖(1, (−1)^{k+1} a)‖ / ( |(1−a)/(1+a)|^k ‖(1, (−1)^k a)‖ )
  = |1 − a| / (1 + a).

Since |1 − a|/(1 + a) < 1, there is linear convergence.

However, lim_{a→+∞} |1 − a|/(1 + a) = 1. This means that, the larger a is, the slower
the convergence will be. Indeed, ∇^2 q = B, which means that the eigenvalues of ∇^2 q
are λ_1 = 1 < λ_2 = a (assuming now a > 1). So, in the previous theorem,
r ∈ ( (a−1)/(a+1), 1 ), which can be very close to 1.
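The closed form for the iterates is easy to verify numerically. A small sketch of ours (not part of the original notes), here with a = 5:

```python
import numpy as np

a = 5.0
B = np.diag([a, 1.0])
x = np.array([1.0, a])                  # starting point x^0 = (1, a)

for k in range(5):
    g = B @ x                           # gradient of q(x) = 0.5 x^t B x
    alpha = (g @ g) / (g @ B @ g)       # exact line search step for a quadratic
    x = x - alpha * g
    predicted = ((1 - a) / (1 + a))**(k + 1) * np.array([1.0, (-1.0)**(k + 1) * a])
    print(np.allclose(x, predicted))    # True: matches the closed form above
```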

In Figure 3.5, we can see up to 25 iterations of the steepest descent method as we
change the value of a.

Figure 3.5: Scaling problem for the steepest descent method (panels (a)-(d) correspond to a = 1, 2, 5, 20).

In practice, the slow convergence may mean convergence to a wrong point due to
the slow-progressing iterates and the accumulation of round-off errors.
Chapter 4

Newton's Method and Quasi-Newton Methods

In the previous chapter, we studied methods that use information from the
gradient of the function (first-order methods). In this chapter, we are going to
see that if we use information from the Hessian matrix (second-order methods),
then we can obtain much better algorithms. In particular, we will start with one
of the best-known algorithms in mathematics due to its easy implementation and
good performance: Newton's method.

Unless otherwise stated, in this chapter we assume that all the functions are in C^2(Rn).

4.1. Newton’s Method for Root-Finding


The immediate application of Newton’s method is for solving equations. Let us start
with one-dimensional functions and later we will study the extension to more general
functions.

4.1.1. Single Variable


Given a function f: R → R, we look for a point x* ∈ R such that f(x*) = 0.

Let x^k be an estimate of x*. If f ∈ C^2(R), using Taylor's theorem we have the
following linear approximation:

f(x) ≈ f(x^k) + f′(x^k)(x − x^k).

Since on the left-hand side we have that f(x*) = 0, we generate a new point x^{k+1}
Since on the left-hand side we have that f (x∗ ) = 0, we generate a new point xk+1

such that the right-hand side is also null:

f(x^k) + f′(x^k)(x^{k+1} − x^k) = 0.

If f′(x^k) ≠ 0, then

x^{k+1} = x^k − f(x^k) / f′(x^k).
We have thus the following algorithm:

Algorithm 4.1 (Unidimensional Newton's Method for Root-Finding)

Step 1: Choose ε > 0 and x^0 ∈ R. Let k := 0.
Step 2: If |f(x^k)| < ε, STOP.
Step 3: Update x^{k+1} := x^k − f(x^k)/f′(x^k), k := k + 1. Go to Step 2.

Geometrically, what we are doing is considering the tangent line to function f(x)
at the point (x^k, f(x^k)). Then, x^{k+1} is the point where this tangent line intersects
the horizontal axis. See Figure 4.1.

Figure 4.1: Geometrical interpretation of Newton's method.

Example 4.2
Let us solve the equation x^3 − 1 = 0 starting at x^0 = 7.

Since f′(x) = 3x^2, then x − f(x)/f′(x) = x − (x^3 − 1)/(3x^2) = (1/3)(2x + 1/x^2), and
we have the following iterations:

k    x^k     f(x^k)    f′(x^k)
0    7.00    342.00    147.00
1    4.67    101.07     65.52
2    3.13     29.69     29.41
3    2.12      8.55     13.50
4    1.49      2.30      6.64
5    1.14      0.49      3.92
6    1.02      0.05      3.10
7    1.00      0.00      3.00

As we can see, there is fast convergence to x* = 1, which is the solution of the
equation.
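Algorithm 4.1 is only a few lines of code. A sketch of ours (not part of the original notes) that reproduces Example 4.2:

```python
def newton_root(f, fprime, x, eps=1e-10, max_iter=50):
    # Algorithm 4.1: iterate x <- x - f(x)/f'(x) until |f(x)| < eps.
    for _ in range(max_iter):
        if abs(f(x)) < eps:
            break
        x = x - f(x) / fprime(x)
    return x

# The equation of Example 4.2: x^3 - 1 = 0, starting at x^0 = 7.
print(newton_root(lambda x: x**3 - 1, lambda x: 3 * x**2, 7.0))  # -> 1.0
```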

4.1.2. Several Variables


Given a function f: Rn → Rn, f = (f1, ..., fn), for which we would like to solve
the equation system f(x) = 0, it is easy to generalize the unidimensional Newton's
method. In order to do that, we need to consider the first-order Taylor approximation
of f at a point x^k ∈ Rn:

f(x) ≈ f(x^k) + J(x^k)(x − x^k),

where J(x^k) is the Jacobian matrix of f at x^k, that is, J(x^k)_{ij} = ∂f_i/∂x_j (x^k),
1 ≤ i, j ≤ n.

If x^{k+1} is such that f(x^k) + J(x^k)(x^{k+1} − x^k) = 0 and J(x^k) is nonsingular, then

x^{k+1} := x^k − J^{−1}(x^k) f(x^k).

We have an algorithm identical to the one given for single-dimensional functions:

Algorithm 4.3 (Multidimensional Newton's Method for Root-Finding)

Step 1: Choose ε > 0 and x^0 ∈ Rn. Let k := 0.
Step 2: If ‖f(x^k)‖ < ε, STOP.
Step 3: Solve the system J(x^k) w^k = −f(x^k).
Step 4: Update x^{k+1} := x^k + w^k and k := k + 1. Go to Step 2.

Example 4.4
Let us solve the system

2x1 + x2 − 1 = 0,
2x1 + x2^2 − 3 = 0.

We define f1(x) := 2x1 + x2 − 1 and f2(x) := 2x1 + x2^2 − 3. It is easy to see that if
we start at x^0 = (0, 0), we converge to (1, −1). If we start at x^0 = (5, 5), we converge
to (−0.5, 2), which is a different solution.

Starting at x^0 = (0, 0):              Starting at x^0 = (5, 5):

k    x1^k    x2^k    ‖f(x^k)‖          k    x1^k    x2^k    ‖f(x^k)‖
0    0.00    0.00    3.16              0    5.00    5.00    34.93
1    1.50   −2.00    4.00              1   −1.00    3.00     4.00
2    1.10   −1.20    0.64              2   −0.60    2.20     0.64
3    1.01   −1.01    0.04              3   −0.51    2.01     0.04
4    1.00   −1.00    0.00              4   −0.50    2.00     0.00

We can see that in both cases the convergence is very fast.
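A sketch of Algorithm 4.3 (ours, not from the original notes) that reproduces both runs of Example 4.4:

```python
import numpy as np

def newton_system(F, J, x, eps=1e-10, max_iter=50):
    # Algorithm 4.3: solve J(x^k) w^k = -F(x^k), then set x^{k+1} = x^k + w^k.
    for _ in range(max_iter):
        if np.linalg.norm(F(x)) < eps:
            break
        x = x + np.linalg.solve(J(x), -F(x))
    return x

F = lambda x: np.array([2 * x[0] + x[1] - 1, 2 * x[0] + x[1]**2 - 3])
J = lambda x: np.array([[2.0, 1.0], [2.0, 2.0 * x[1]]])   # Jacobian of F

print(newton_system(F, J, np.array([0.0, 0.0])))  # -> ( 1.0, -1.0)
print(newton_system(F, J, np.array([5.0, 5.0])))  # -> (-0.5,  2.0)
```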

In order to study the convergence of the method, we will use the following alternative
way of writing Taylor's theorem.

Theorem 4.5 (Taylor's Theorem)

If f ∈ C^1(Rn, Rm), then

f(x + p) = f(x) + ∫_0^1 J(x + tp) p dt.

We also need the following definition:

Definition 4.6
A function f: Rn → Rm is locally Lipschitz continuous if for every x ∈ Rn there
is a neighborhood in which f is Lipschitz continuous, that is, there are ε, L > 0 such
that

‖f(x) − f(y)‖ ≤ L‖x − y‖ ∀x, y / ‖x − y‖ < ε.

Provided that we start “close” to a solution, Newton’s method converges very fast.

Theorem 4.7
Let f ∈ C^1(Rn, Rn) and let x* ∈ Rn be such that f(x*) = 0 and the Jacobian of f
at x*, J(x*), is nonsingular. Assume also that the Jacobian of f is locally Lipschitz
continuous in a neighborhood of x* and that we generate the iterates x^{k+1} := x^k −
J^{−1}(x^k) f(x^k). If the starting point x^0 is sufficiently close to x*, then:
1. The sequence of iterates {x^k} converges quadratically to x*.
2. The sequence of norms {‖f(x^k)‖} converges quadratically to zero.
Proof:
1. Since J(x*) is nonsingular, ‖J^{−1}(x*)‖ > 0. (The matrix norm that we are
considering is the spectral norm: given A, ‖A‖_2 := √λ_max(A^t A), where λ_max(A^t A)
is the largest eigenvalue of A^t A.)

As J is continuous, there is δ > 0 such that ‖J^{−1}(x)‖ ≤ 2‖J^{−1}(x*)‖ if ‖x − x*‖ < δ.

On the other hand, as f(x*) = 0, we have that

x^{k+1} − x* = x^k − J^{−1}(x^k) f(x^k) − x* = (x^k − x*) − J^{−1}(x^k)(f(x^k) − f(x*))
             = J^{−1}(x^k) [ J(x^k)(x^k − x*) − (f(x^k) − f(x*)) ].

Now, using the version of Taylor's theorem provided by Theorem 4.5,

f(x^k) = f(x*) + ∫_0^1 J(x* + t(x^k − x*)) (x^k − x*) dt.

Therefore,

‖J(x^k)(x^k − x*) − (f(x^k) − f(x*))‖
  = ‖ J(x^k)(x^k − x*) − ∫_0^1 J(x* + t(x^k − x*)) (x^k − x*) dt ‖
  = ‖ ∫_0^1 [ J(x^k) − J(x* + t(x^k − x*)) ] (x^k − x*) dt ‖
  ≤ ∫_0^1 ‖ J(x^k) − J(x* + t(x^k − x*)) ‖ ‖x^k − x*‖ dt = (*)

Now, ‖x^k − (x* + t(x^k − x*))‖ = ‖(1 − t)(x^k − x*)‖ = (1 − t)‖x^k − x*‖ ≤ ‖x^k − x*‖.

Let ε be the radius of a ball B(x*; ε) in which J(x) is Lipschitz continuous and
let L be the Lipschitz constant. Note that these values exist because J is locally
Lipschitz continuous.

If x^k is so close to x* that ‖x^k − x*‖ < ε, then we have that

(*) ≤ ∫_0^1 L(1 − t)‖x^k − x*‖^2 dt = (L/2)‖x^k − x*‖^2.

Now, if we assume that ‖x^k − x*‖ < min{ε, δ}, then

‖x^{k+1} − x*‖ ≤ 2‖J^{−1}(x*)‖ · (L/2)‖x^k − x*‖^2 = L̃‖x^k − x*‖^2,

with L̃ = L‖J^{−1}(x*)‖. Note that this value is positive because we have shown
earlier that ‖J^{−1}(x*)‖ > 0.

Assume finally that we choose x^0 such that ‖x^0 − x*‖ < min{ε, δ, 1/(2L̃)}. Then

‖x^1 − x*‖ ≤ L̃‖x^0 − x*‖^2 < ‖x^0 − x*‖ / 2 < (1/2) min{ε, δ, 1/(2L̃)}.

Moreover, provided that we choose x^0 such that ‖x^0 − x*‖ < min{ε, δ, 1/(2L̃)}, this
inequality is true for all k ≥ 1 and so are the previous chains of inequalities.

Now, it is easy to see that ‖x^k − x*‖ ≤ (L̃‖x^0 − x*‖)^{2^k} / L̃ < 1/(2^{2^k} L̃), which means
that {x^k} converges to x*. Moreover, the convergence is quadratic because

lim_{k→+∞} ‖x^{k+1} − x*‖ / ‖x^k − x*‖^2 ≤ L̃.
2. As f(x^k) + J(x^k)(x^{k+1} − x^k) = 0, we have that

‖f(x^{k+1})‖ = ‖ f(x^{k+1}) − f(x^k) − J(x^k)(x^{k+1} − x^k) ‖
  = ‖ ∫_0^1 J(x^k + t(x^{k+1} − x^k)) (x^{k+1} − x^k) dt − J(x^k)(x^{k+1} − x^k) ‖
  = ‖ ∫_0^1 [ J(x^k + t(x^{k+1} − x^k)) − J(x^k) ] (x^{k+1} − x^k) dt ‖
  ≤ ∫_0^1 ‖ J(x^k + t(x^{k+1} − x^k)) − J(x^k) ‖ ‖x^{k+1} − x^k‖ dt = (**)

Now, since {x^k} converges to x*, we have that {x^k + t(x^{k+1} − x^k)} also converges
to x*. So x^k, x^k + t(x^{k+1} − x^k) ∈ B(x*; ε) for k sufficiently large, and, using that
J is Lipschitz continuous in this neighborhood:

(**) ≤ ∫_0^1 tL‖x^{k+1} − x^k‖^2 dt = (L/2)‖x^{k+1} − x^k‖^2 = (L/2)‖J^{−1}(x^k) f(x^k)‖^2
     ≤ (L/2)‖J^{−1}(x^k)‖^2 ‖f(x^k)‖^2 ≤ 2L‖J^{−1}(x*)‖^2 ‖f(x^k)‖^2.

Therefore, lim_{k→+∞} ‖f(x^{k+1})‖ / ‖f(x^k)‖^2 ≤ 2L‖J^{−1}(x*)‖^2 and there is quadratic
convergence. □

4.2. Newton's Method in Optimization

If we have an optimization problem with no constraints, a necessary condition for a
point x* ∈ Rn to be a local minimum is that ∇f(x*) = 0. Therefore, we can apply
the previous method to solve the equation system ∇f(x) = 0, because the roots of
this system are the stationary points.

In this case, we have that the Jacobian matrix of ∇f is the Hessian ∇^2 f. Thus,
given a point x^k ∈ Rn for which ∇^2 f(x^k) is nonsingular, the next iterate is

x^{k+1} := x^k − [∇^2 f(x^k)]^{−1} ∇f(x^k).

Looking at this expression as an iteration of a line search method, we have that
the Newton direction is −[∇^2 f(x^k)]^{−1} ∇f(x^k) and the step size is 1. If ∇^2 f(x^k) is
positive definite for all k, then we have a descent direction. More specifically, we
have a steepest descent method where the inverse of the Hessian is used for scaling.

Algorithm 4.8 (Newton's Method for Unconstrained Optimization)

Step 1: Choose ε > 0 and x^0 ∈ Rn. Let k := 0.
Step 2: If ‖∇f(x^k)‖ < ε, STOP.
Step 3: Solve the system ∇^2 f(x^k) d^k = −∇f(x^k).
Step 4: Update x^{k+1} := x^k + d^k and k := k + 1. Go to Step 2.

The following convergence result is an immediate consequence of Theorem 4.7:

Theorem 4.9
Let f ∈ C^2(Rn) and let x* ∈ Rn be a stationary point of f such that the Hessian
of f at x*, ∇^2 f(x*), is nonsingular. Assume also that ∇^2 f is locally Lipschitz
continuous in a neighborhood of x* and that we generate iterates x^{k+1} := x^k −
[∇^2 f(x^k)]^{−1} ∇f(x^k). If the starting point x^0 is sufficiently close to x*, then:
1. The sequence of iterates {x^k} converges quadratically to x*.
2. The sequence of norms {‖∇f(x^k)‖} converges quadratically to zero.

Newton's method has some good properties:

• It is scale-invariant with respect to linear transformations of the variables.
• It has a fast convergence rate.

Example 4.10
Let us consider the poorly scaled function f(x1, x2) = 100x1^4 + 0.01x2^4, whose global
minimum is clearly x* = (0, 0).

If we apply the steepest descent method with Armijo backtracking with α_0 = 1,
β = 10^{−4}, τ = 0.5, and ε = 10^{−6}, we “converge” to x* in 678 iterations. As we can
see in Figure 4.2a, we are still far from the optimal point.

On the other hand, if we apply Newton's method, convergence is obtained in just
17 iterations and, as it can be seen in Figure 4.2b, we actually converge.
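Algorithm 4.8 is equally short in code. The sketch below (ours, not part of the original notes) applies it to the function of Example 4.10; since the Hessian is diagonal here, each Newton step reduces to x ← (2/3)x, and the test ‖∇f(x^k)‖ < 10^{−6} is met after 17 iterations, as stated above:

```python
import numpy as np

def newton_opt(grad, hess, x, eps=1e-6, max_iter=1000):
    # Algorithm 4.8: solve hess(x^k) d^k = -grad(x^k) and take a full step.
    for k in range(max_iter):
        if np.linalg.norm(grad(x)) < eps:
            return x, k
        x = x + np.linalg.solve(hess(x), -grad(x))
    return x, max_iter

# The poorly scaled function of Example 4.10: f(x) = 100 x1^4 + 0.01 x2^4.
grad = lambda x: np.array([400 * x[0]**3, 0.04 * x[1]**3])
hess = lambda x: np.diag([1200 * x[0]**2, 0.12 * x[1]**2])

x, k = newton_opt(grad, hess, np.array([1.0, 1.0]))
print(x, k)   # expected: k = 17 iterations, each Newton step being x <- (2/3) x
```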

However, it must be noted that Newton's method also has some disadvantages:

• An iteration is not well defined if ∇^2 f(x^k) is singular.
• The Newton direction may not be a descent direction if ∇^2 f(x^k) is not positive
definite.
• It is not globally convergent: it may fail if we start “far” from a stationary point.

Finally, as in many nonlinear optimization algorithms, we should not forget that, in
case of convergence, it finds a stationary point, with no guarantee that this point is
a local minimum (it may be a local maximum or a saddle point).

The following example shows a case in which convergence fails.


Example 4.11
Let f(x) = −x^6/6 + x^4/4 + 2x^2 and let us start Newton's method at x^0 = 1.

Since f′(x) = −x^5 + x^3 + 4x and f″(x) = −5x^4 + 3x^2 + 4, we have that f′(x^0) = 4
and f″(x^0) = 2. So, x^1 = x^0 − f′(x^0)/f″(x^0) = 1 − 4/2 = −1.

Now, f′(x^1) = −4 and f″(x^1) = 2. So, x^2 = x^1 − f′(x^1)/f″(x^1) = −1 + 4/2 = 1 = x^0.
As we can see, we have entered into a loop and the points visited are not even
stationary points. But the function does have a local minimum at x = 0, as we can
see in Figure 4.3.

[Figure 4.2: Steepest descent versus Newton’s method on a poorly scaled function.
(a) Steepest descent method. (b) Newton’s method.]



[Figure 4.3: Loop in Newton’s method.]

Note finally that, in Newton’s method, we are only requiring in Theorem 4.9 that
∇²f(x∗) be nonsingular. Hence the stationary point could be a saddle point or a
local maximum, because we have no guarantee that Newton’s direction is a descent
direction.
Example 4.12
Let f (x) = −x2 . It is very easy to see that Newton’s method is globally convergent
because x1 = 0 for any starting point x0 . However, clearly x∗ = 0 is a global
maximum.

4.2.1. Damped Newton’s Method


The standard Newton’s method uses a step size αk = 1. If we incorporate the Armijo
condition, we obtain the damped Newton’s method, which has better convergence
properties under some specific circumstances: we need ∇²f(xk) to be positive definite
for all k and its eigenvalues to be bounded below and above by fixed values (the same
for all k).

Algorithm 4.13 (Damped Newton’s Method, DNM)


Step 1: Choose ε > 0, x0 ∈ Rn , α0 > 0, β > 0 and τ ∈ (0, 1). Let k := 0.
Step 2: If k∇f (xk )k < ε, STOP.
Step 3: Solve the system ∇2 f (xk )dk = −∇f (xk ).
Step 3a: Let t:=0.
Step 3b: If f(xk + αt dk) ≤ f(xk) + βαt ∇f(xk)ᵗdk, then αk := αt and go to Step 4.
Step 3c: αt+1 := τ αt , t := t + 1. Go to Step 3b.

Step 4: Update xk+1 := xk + αk dk and k := k + 1. Go to Step 2.
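
A minimal Python (NumPy) sketch of Algorithm 4.13 follows; the names are our own,
and the test function anticipates Example 4.16 below, with the starting point (2, 2)
used there.

import numpy as np

def damped_newton(f, grad, hess, x0, eps=1e-6, alpha0=1.0, beta=1e-4, tau=0.5):
    # Algorithm 4.13: Newton direction plus Armijo backtracking (Steps 3a-3c).
    x = np.asarray(x0, dtype=float)
    while np.linalg.norm(grad(x)) >= eps:        # Step 2
        g = grad(x)
        d = np.linalg.solve(hess(x), -g)         # Step 3: Newton direction
        alpha = alpha0
        while f(x + alpha * d) > f(x) + beta * alpha * (g @ d):
            alpha *= tau                         # Step 3c: shrink the step
        x = x + alpha * d                        # Step 4
    return x

# f(x1, x2) = sqrt(1 + x1^2) + sqrt(1 + x2^2), as in Example 4.16 below.
f = lambda x: np.sqrt(1 + x[0]**2) + np.sqrt(1 + x[1]**2)
grad = lambda x: np.array([x[0] / np.sqrt(1 + x[0]**2),
                           x[1] / np.sqrt(1 + x[1]**2)])
hess = lambda x: np.diag([(1 + x[0]**2)**(-1.5), (1 + x[1]**2)**(-1.5)])
x_min = damped_newton(f, grad, hess, [2.0, 2.0])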

Note that, due to the properties that we are requiring of the Hessian matrix, we
have that dk = −(∇²f(xk))⁻¹ ∇f(xk). Therefore, dk is a descent direction for all k.

Theorem 4.14 (Convergence for DNM)


Let f ∈ C 2 (Rn ) with its gradient Lipschitz continuous. Assume that ∇2 f (xk ) is
positive definite for all k and let λkmin and λkmax be its smallest and largest eigenvalues,
respectively. If there are λ1 , λ2 > 0 such that λkmin ≥ λ1 and λkmax ≤ λ2 for all k,
then the iterates generated by DNM have three possibilities:
1. ∇f (xk̃ ) = 0 for some k̃ ≥ 0.
2. lim f (xk ) = −∞.
k→+∞
3. lim k∇f (xk )k = 0.
k→+∞

Moreover, if we restrict the choice of α0 and β, then there is quadratic convergence.


Theorem 4.15
Let f ∈ C 2 (Rn ) with Hessian matrix ∇2 f Lipschitz continuous. Assume that the
iterates of DNM are generated with α0 = 1 and β < 1/2 and that they converge to a
point x∗ such that ∇2 f (x∗ ) is positive definite. If ∇2 f (xk ) is positive definite for
k sufficiently large, then:
1. αk = 1 for k sufficiently large.
2. The convergence of the sequence {xk } to x∗ is quadratic.

Example 4.16
Let us consider the function f(x1, x2) = √(1 + x1²) + √(1 + x2²), whose global minimum
is clearly x∗ = (0, 0).

If we apply Newton’s method with x0 = (2, 2), the sequence of iterates diverges.
However, if the damped Newton’s method is applied with the usual parameters, the
new sequence of iterates converges in 4 iterations.

4.2.2. Modified Newton’s Method


If we are far from a local minimum, it may happen that the Hessian matrix at the
current iterate is not positive definite. When ∇²f(xk) is not positive definite, Newton’s
direction defined by solving

∇2 f (xk )dk = −∇f (xk )

may not be a descent direction. For this reason, it is usual to solve instead

    (∇²f(xk) + M k) dk = −∇f(xk),

where M k is chosen so that:



1. ∇2 f (xk ) + M k is “sufficiently” positive definite.


2. M k := 0 when ∇2 f (xk ) is “sufficiently” positive definite.
By “sufficiently positive definite” we can mean, for example, that the smallest eigen-
value is at least δ > 0 for a certain δ that we have decided beforehand.

There are several possible ways of modifying ∇2 f (xk ) in the literature to produce
a modified Newton method, with no agreement on a best technique. Here we
provide two alternatives:
1. Add a multiple of the identity.
We take M k := τ In, where τ = max{0, δ − λmin(∇²f(xk))}. This guarantees
that λmin(∇²f(xk) + M k) ≥ δ. (A short sketch in code is given after this list.)
However, if we are solving a high-dimensional problem, computing the eigenvalues
of the Hessian is computationally expensive. Therefore, λmin must be estimated
instead (a task that may be hard in itself). Moreover, this approach has the
disadvantage of giving too much importance to a single large negative eigenvalue.
2. Modified Cholesky factorization.
Every symmetric positive definite matrix A can be written as A = LLᵗ, where L
is a lower triangular matrix with all the elements of its diagonal positive. This
decomposition is known as the Cholesky factorization.
Compute the Cholesky factorization ∇²f(xk) = Lk(Lk)ᵗ, where Lk is lower trian-
gular. During the factorization, if there is a risk that it will fail, modify ∇²f(xk)
in order to prevent it.
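
As a rough sketch of the first alternative (the names and the choice δ = 10⁻³ are
ours, and the eigenvalue computation is shown explicitly even though in practice
it would be estimated):

import numpy as np

def modified_newton_direction(H, g, delta=1e-3):
    # Add tau*I, with tau = max{0, delta - lambda_min(H)}, so that the
    # shifted Hessian has smallest eigenvalue at least delta, and then
    # solve for the (descent) direction.
    lam_min = np.linalg.eigvalsh(H).min()        # expensive in high dimension
    tau = max(0.0, delta - lam_min)
    return np.linalg.solve(H + tau * np.eye(len(g)), -g)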

Example 4.17
Let us assume that at a certain iterate the Hessian matrix is

    [ 3   0 ]
    [ 0  −1 ],

which is not positive definite (λmin = −1). We can fix this by adding, for example,
the matrix

    [ 1.1   0  ]
    [ 0    1.1 ]

(which corresponds to taking δ = 0.1, so that τ = max{0, 0.1 − (−1)} = 1.1).

Although details will not be provided here, under similar conditions to the ones
given in Theorem 4.14, it is possible to guarantee the convergence of this modified
Newton’s method.

4.3. Quasi-Newton Methods


Quasi-Newton methods are based on Newton’s method, but do not require the com-
putation of the Hessian matrix, which can be very expensive computationally. In-
stead, they just use gradient vectors. Nevertheless, they have a faster convergence
than the steepest descent method.

Instead of using the Hessian matrix ∇2 f (xk ), we would like to have a “good” ap-
proximation B k . Some desirable properties are:
1. The next matrix B k+1 can be computed using already computed values: ∇f (xk+1 ),
∇f (xk ), . . ., ∇f (x0 ), B k , dk .

2. B k+1 is symmetric and/or nonsingular and/or positive definite.


3. B k+1 is “close” to B k (a “cheap” update of B k ).

The BFGS Method


The Broyden-Fletcher-Goldfarb-Shanno method (BFGS) is among the most popular
algorithms for unconstrained nonlinear optimization. In order to derive this algo-
rithm, we need to remember that Newton’s method uses a quadratic approximation
of f around each iterate. With this idea, we begin by considering the following
quadratic form that approximates f at xk + p, that is, in a neighborhood of the
current iterate xk:

    mk(p) := f(xk) + ∇f(xk)ᵗ p + (1/2) pᵗ Bk p.
We assume that B k is a symmetric positive definite matrix that is updated at every
iteration. Note also that mk (0) = f (xk ) and that ∇mk (0) = ∇f (xk ).

This convex quadratic form mk has a unique minimizer, pk = −(Bk)⁻¹ ∇f(xk).
Since Bk is positive definite, pk is a descent direction and we generate the next
iterate as xk+1 := xk + αk pk, where the step length αk is chosen so that the Wolfe
conditions are satisfied:
f (xk + αk pk ) ≤ f (xk ) + c1 αk ∇f (xk )t pk ,
∇f (xk + αk pk )t pk ≥ c2 ∇f (xk )t pk ,

with 0 < c1 < c2 < 1.

As we can see, this iteration is done as in Newton’s method but we use matrix B k
instead of the Hessian matrix. Thus, we can think of B k as an approximation of the
true Hessian.

Instead of computing a totally new B k at each iteration, Davidon proposed to update


it using the curvature measured in the previous step. That is, suppose that we have
generated xk+1 and that we would like to construct a new quadratic approximation
    mk+1(p) := f(xk+1) + ∇f(xk+1)ᵗ p + (1/2) pᵗ Bk+1 p.
The question that remains is: What conditions should we require of B k+1 ? Something
reasonable is that the gradient of mk+1 match the gradient of f at the two latest
iterates xk+1 and xk . Since mk+1 (p) is an approximation of f (xk+1 + p), we have
that xk+1 is obtained for p = 0 and that xk is obtained for p = −αk pk (because
xk+1 = xk + αk pk ). In the first case, the equality is satisfied automatically because
∇mk+1 (0) = ∇f (xk+1 ). Thus, it suffices to impose that

∇mk+1 (−αk pk ) = ∇f (xk );

∇f (xk+1 ) − αk B k+1 pk = ∇f (xk );



αk B k+1 pk = ∇f (xk+1 ) − ∇f (xk ).


If we define dk := xk+1 − xk = αk pk and y k := ∇f (xk+1 ) − ∇f (xk ), then this can be
rewritten as
B k+1 dk = y k ,
which is a formula known as secant equation.

Since Bk is positive definite, the secant equation has a solution only if dk and yk
satisfy the curvature condition

    (dk)ᵗ yk > 0.

To see this, just multiply the secant equation by (dk)ᵗ on the left. It must be
said that this curvature condition does not always hold. However, it can be seen
that, if the Wolfe conditions hold, then the curvature condition is satisfied.

When this curvature condition is satisfied, the system Bk+1 dk = yk has a solution.
Indeed, it has an infinite number of solutions: a symmetric matrix has n(n + 1)/2
unknowns, while the secant equation imposes only n conditions (plus the positive
definiteness requirement). In order to obtain a unique solution, Bk+1 is required to
be the closest matrix to Bk under a certain matrix norm. That is, it is the solution of
Min. kB − B k k
s.t. B = Bt,
Bdk = y k ,
B ∈ Rn×n .
Different matrix norms lead to different solutions and, thus, to different quasi-Newton
methods. In particular, there is a norm (the weighted Frobenius norm; we will skip
the details) for which the solution is

    Bk+1 := (In − ρk yk (dk)ᵗ) Bk (In − ρk dk (yk)ᵗ) + ρk yk (yk)ᵗ,

where ρk = 1/((yk)ᵗ dk). This expression is known as the DFP update formula in
honor of Davidon, who discovered it in 1959, and of Fletcher and Powell, who
popularized it in 1963.

Actually, in order to calculate the descent direction, we need the inverse of this
matrix. So, if we define Hk := (Bk)⁻¹ (that is, the approximation of the inverse of
the Hessian), then

    Hk+1 := Hk − (Hk yk (yk)ᵗ Hk) / ((yk)ᵗ Hk yk) + (dk (dk)ᵗ) / ((yk)ᵗ dk).

Details on how this is obtained, using the Sherman-Morrison-Woodbury formula,
are omitted.

However, the DFP formula, although quite effective, was soon improved by another
expression, which is obtained by following the same argument as before but im-
posing the conditions on Hk instead of on Bk. The secant equation is now written

Hk+1 yk = dk and we obtain the so-called BFGS updating formula

    Hk+1 := (In − ρk dk (yk)ᵗ) Hk (In − ρk yk (dk)ᵗ) + ρk dk (dk)ᵗ,

where ρk is defined as before.

The final issue is how to choose the starting matrix H0, but there is no single best
answer for this. For example, it can be taken to be the identity matrix.

In short, we have arrived at the following algorithm:


Algorithm 4.18 (BFGS Method)
Step 1: Choose a starting point x0 and a tolerance level ε > 0. If k∇f (x0 )k < ε,
STOP.
Step 2: Choose an approximation matrix H 0 of the inverse of the Hessian matrix.
Let k := 0.
Step 3: Compute the search direction pk := −H k ∇f (xk ).
Step 4: Let xk+1 := xk + αk pk , where αk is computed from a line search procedure
to satisfy the Wolfe conditions.
Step 5: If k∇f (xk+1 )k < ε, STOP.
Step 6: Compute H k+1 by means of the BFGS updating formula. Let k := k + 1
and go to Step 3.
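
Below is a minimal Python (NumPy) sketch of Algorithm 4.18 with H0 = In; the
names are our own, and for brevity a simple Armijo backtracking stands in for a
full Wolfe line search, so the code skips the update whenever the curvature condition
fails.

import numpy as np

def bfgs(f, grad, x0, eps=1e-6, max_iter=500):
    n = len(x0)
    x, H = np.asarray(x0, dtype=float), np.eye(n)    # Steps 1-2: H0 = I
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:
            return x, k
        p = -H @ g                                   # Step 3: search direction
        alpha = 1.0                                  # Step 4: backtracking
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        d = alpha * p                                # d^k = x^{k+1} - x^k
        x_new = x + d
        y = grad(x_new) - g                          # y^k
        if d @ y > 1e-12:                            # curvature condition
            rho = 1.0 / (d @ y)
            V = np.eye(n) - rho * np.outer(d, y)
            H = V @ H @ V.T + rho * np.outer(d, d)   # Step 6: BFGS update
        x = x_new
    return x, max_iter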

The BFGS method has some interesting properties:


1. Its rate of convergence is superlinear.
2. Newton’s method converges faster, but it is more expensive computationally.
3. If H k is positive definite, then H k+1 is positive definite.
Example 4.19
Let us consider the function f(x1, x2) = √(1 + x1²) + √(1 + x2²), which we already used
in Example 4.16 and whose global minimum is clearly x∗ = (0, 0).

The damped Newton’s method converged in 4 iterations when starting at (2, 2). If
we start BFGS at the same point and use MATLAB’s implementation, the algorithm
converges in 5 iterations. It is not as fast as the damped Newton’s method, but
remember that BFGS uses information only from the gradient, whereas the other
method also uses the Hessian.

4.4. Least Squares


The Least Squares Problem is one of the most studied problems in Nonlinear
Optimization because of its many areas of application (Chemistry or Finance, for
example). It consists in minimizing a function of the form

    f(x) = (1/2) Σ_{j=1}^{m} rj(x)²,

where rj ∈ C∞(Rn) for all j. Each function rj is called a residual and measures the
discrepancy between a model that we are fitting and an observed behaviour of the
system. Besides, we will assume that m, the number of observations, is much larger
than n, the number of parameters; x1, . . . , xn are the unknown parameters of the
model.
Example 4.20 (Least Squares Fitting)
We have a sample of data {(x1, y1), (x2, y2), . . . , (xm, ym)} and we would like to fit a
curve of the form

    φ(α1, α2; x) = α1 (1 − e^(−α2 x)).
If we define the residuals


rj (α) = yj − φ(α; xj ),
then the total sum of the squared errors is

f (α) = r1 (α)2 + r2 (α)2 + . . . + rm (α)2 .

Computing a solution (α1∗ , α2∗ ) that minimizes this error function is an unconstrained
problem.
[Figure 4.4: Nonlinear regression.]

Given a model, if we assume that the discrepancies between the model and the
observations are independent, identically distributed, and normal, then the maximum
likelihood estimate is obtained by minimizing the sum of squares of the residuals. If
the vector of residuals is r(x) = (r1(x), r2(x), . . . , rm(x))ᵗ, then the objective can be
written as f(x) = (1/2) kr(x)k².

Now, let J(x) be the Jacobian of the function r at the point x, that is,

                                                   [ ∇r1(x)ᵗ ]
    J(x) = ( ∂ri/∂xj ) i=1,...,m, j=1,...,n   =    [ ∇r2(x)ᵗ ]
                                                   [    ⋮    ]
                                                   [ ∇rm(x)ᵗ ],

and let Ji•(x) and J•j(x) be the i-th row and the j-th column of J(x), respectively.

If we calculate the derivatives needed to solve the least squares problem, then we
have that

    ∂f(x)/∂xj = (1/2) Σ_{i=1}^{m} 2 ri(x) ∂ri(x)/∂xj = J•j(x)ᵗ r(x),

and, therefore,

    ∇f(x) = ( J•1(x)ᵗ r(x), J•2(x)ᵗ r(x), . . . , J•n(x)ᵗ r(x) )ᵗ = J(x)ᵗ r(x).
With some extra work, it is possible to obtain an explicit expression for the Hes-
sian of f (the details are omitted here) and, putting all this together, we have the
following:
• ∇f(x) = J(x)ᵗ r(x).
• ∇²f(x) = J(x)ᵗ J(x) + Σ_{j=1}^{m} rj(x) ∇²rj(x).
The advantage of the second expression is that the first part of ∇²f(x) can be
computed using only first-order derivatives, something that is potentially good from
a computational point of view. Since quite often the residuals rj(x) are close to
zero near the solutions, algorithms for nonlinear least squares usually take advantage
of this special structure.

4.4.1. Linear Least Squares


The easiest case, but also quite a common one, is that in which the residual is a
linear function, r(x) = Ax + b. Then the sum of squares to be minimized is

    f(x) = (1/2) kAx + bk²,

where A is an m × n real matrix and b = r(0) ∈ Rm. In this case:
• ∇f(x) = J(x)ᵗ r(x) = Aᵗ(Ax + b).
• ∇²f(x) = AᵗA.
Note that in this case, that is, when the residuals are linear, f is convex on the
whole of Rn because AᵗA is positive semidefinite. In particular, x∗ ∈ Rn is a global
minimum if, and only if, it solves Aᵗ(Ax + b) = 0. Equivalently,

    AᵗAx = −Aᵗb.
This system receives the name of normal equations and can be solved efficiently
using specialized matrix calculation techniques (for example, Cholesky factorization).
Example 4.21
Given the set of four observations {(ti , yi )} = {(−1, 3), (0, 2), (1, 0), (2, 4)}, let us fit
a model of the form y(t) = x1 + x2 t, where x1 , x2 ∈ R must be determined so that
the residual sum of squares be minimum.

We have that:
• r1 (x) = x1 − x2 − 3,
• r2 (x) = x1 − 2,
• r3 (x) = x1 + x2 ,
• r4 (x) = x1 + 2x2 − 4.
Thus,

        [ 1  −1 ]        [ −3 ]
    A = [ 1   0 ],   b = [ −2 ],   AᵗA = [ 4  2 ],   −Aᵗb = [ 9 ].
        [ 1   1 ]        [  0 ]          [ 2  6 ]           [ 5 ]
        [ 1   2 ]        [ −4 ]

The solution of

    [ 4  2 ] [ x1 ]   [ 9 ]
    [ 2  6 ] [ x2 ] = [ 5 ]

is x∗ = (11/5, 1/10) = (2.2, 0.1). So, the fitted model is

    y(t) = 2.2 + 0.1t.
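
This example can be double-checked in a few lines of Python (a sketch; for large or
ill-conditioned problems one would prefer a QR-based routine such as
numpy.linalg.lstsq to forming AᵗA explicitly):

import numpy as np

# Data of Example 4.21: residual r(x) = Ax + b.
A = np.array([[1., -1.], [1., 0.], [1., 1.], [1., 2.]])
b = np.array([-3., -2., 0., -4.])

# Normal equations A^t A x = -A^t b (a Cholesky solver would also work).
x = np.linalg.solve(A.T @ A, -A.T @ b)   # -> [2.2, 0.1]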

4.4.2. Nonlinear Least Squares


When the residuals are nonlinear, we can use a modification of Newton’s method
called the Gauss-Newton method. Let us remember that in Newton’s method we
solve ∇²f(x) d = −∇f(x) in order to obtain the search direction. In our case, we
now have that

    ∇²f(x) = J(x)ᵗ J(x) + Σ_{j=1}^{m} rj(x) ∇²rj(x).

Instead of using the exact value of the Hessian matrix, we approximate it using only
the first term J(x)ᵗ J(x). This presents several advantages over Newton’s method:
1. We do not need to calculate second-order derivatives, which can mean an impor-
tant saving in computational effort.
2. In many applications, the second term Σj rj(x) ∇²rj(x) is dominated by J(x)ᵗ J(x)
(at least, when we are close to an optimum point x∗), which means that J(x)ᵗ J(x)
is a good approximation to ∇²f(x) in this neighborhood.
3. If J(x) has full rank, then J(x)ᵗ J(x) is positive definite and the solution of
J(x)ᵗ J(x) d = −J(x)ᵗ r(x) is a descent direction, which is called the Gauss-
Newton direction,

    dGN := −(J(x)ᵗ J(x))⁻¹ J(x)ᵗ r(x).


In order to see this, just note that

    ∇f(x)ᵗ dGN = (J(x)ᵗ r(x))ᵗ dGN = −(J(x)ᵗ J(x) dGN)ᵗ dGN
               = −dGNᵗ J(x)ᵗ J(x) dGN = −[J(x) dGN]ᵗ J(x) dGN = −kJ(x) dGNk² ≤ 0.



The last inequality is strict unless J(x)dGN = 0, but then dGN = 0 because J(x)
has full rank. Therefore, 0 = J(x)t r(x) = ∇f (x) and we have found a stationary
point of f .
Gauss-Newton method can be combined with a line search strategy and, under cer-
tain conditions, local convergence can be guaranteed.
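
A minimal Python (NumPy) sketch of the plain Gauss-Newton iteration follows (names
ours; the full step size 1 is used, so adding a line search as just discussed would make
it more robust). The test residual is the first function of Example 4.23 below, with a
starting point of our choosing.

import numpy as np

def gauss_newton(r, J, x0, eps=1e-6, max_iter=100):
    # Solve J^t J d = -J^t r for the Gauss-Newton direction; stop when
    # the gradient J^t r of f(x) = 0.5*||r(x)||^2 is small.
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        Jx, rx = J(x), r(x)
        g = Jx.T @ rx                            # gradient of f
        if np.linalg.norm(g) < eps:
            return x, k
        x = x + np.linalg.solve(Jx.T @ Jx, -g)   # full Gauss-Newton step
    return x, max_iter

# r(x) = (x + 1, 0.1 x^2 + x - 1), as in Example 4.23 below.
r = lambda x: np.array([x[0] + 1, 0.1 * x[0]**2 + x[0] - 1])
J = lambda x: np.array([[1.0], [0.2 * x[0] + 1]])
x_min, iters = gauss_newton(r, J, [2.0])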
Theorem 4.22
Let us assume the following:
1. Each residual function rj is Lipschitz continuously differentiable in a neighbor-
hood N of the level set {x / f(x) ≤ f(x0)}, where x0 is the starting point of the
Gauss-Newton method.
2. The Jacobian J(x) of the residual vector satisfies the uniform full-rank condition

    kJ(x)vk ≥ γkvk

for all x in N, for a certain γ > 0.
3. The iterates {xk} are generated by the Gauss-Newton method with step lengths αk
that satisfy the Wolfe conditions.
Then:
1. lim_{k→+∞} J(xk)ᵗ r(xk) = 0, that is, the iterates converge to a stationary point x∗.
2. Moreover, if Σ_{j=1}^{m} rj(x∗) ∇²rj(x∗) = 0, then the convergence rate is quadratic.

Nevertheless, it must be noted that the Gauss-Newton method does not necessarily
converge and that, if it does, the convergence may be slower than quadratic.
Example 4.23
Let us suppose that the residual function is r(x) = (x + 1, 0.1x² + x − 1) and that we
are minimizing the total sum of the squares of the residuals, that is, f(x) = (1/2) kr(x)k².

If we apply Newton’s method to minimize f, we obtain x∗ = 0. We converge in
4 iterations (remember that the algorithm converges quadratically). On the other
hand, Gauss-Newton converges in 7 iterations. In this case, we can see that r(x∗) =
(1, −1) and that r′′(x∗) = (0, 0.2), so the second-order term Σj rj(x∗) r′′j(x∗) = −0.2
is nonzero: the last condition of the previous theorem does not hold and convergence
is slower than quadratic. Anyway, convergence is still good because Jᵗ(x∗)J(x∗) = 2,
which is much larger than 0.2. So, the second term is dominated by the first term.

If we now consider the residual function s(x) = (x, 0.1x² + x), Gauss-Newton
converges to x∗ = 0 in 3 iterations and Newton does so in 4 iterations. In this
case, s(x∗) = (0, 0), which guarantees that the last condition of the previous theorem
holds and, thus, the convergence is quadratic.

Finally, if we take the residual function t(x) = (x + 1, x² + x − 1), Gauss-Newton
converges to x̄ = 0.000560 in 74 steps. However, the true minimum is x∗ = 0;
numerical issues prevent us from reaching that solution for a tolerance of ε = 10−6.
In this case, t(0) = (1, −1) and t′′(0) = (0, 2), so again the mentioned property does
not hold. Moreover, now the first part of ∇²f(x∗) is 2 and the second part is −2.
This means that the weight of the second part is not negligible, which is bad for
convergence. Moreover, if we did not impose the Wolfe conditions, then we would
have no convergence.
Chapter 5

Trust-Region Methods for Unconstrained Optimization

A trust-region method defines a region around the current iterate and uses a
model that is “trusted” to be a good local representation of the objective function.
Then, the step chosen is the approximate minimum of the model in this area. If
the step is not acceptable, then the trust region is shrunk. The size of the trust
region tends to be variable. If the algorithm is doing well, the trust region may be
increased. Otherwise, it is reduced in order to have a better local approximation of
the true objective function.

In particular, we will assume that we approximate f at the point xk by a model based
on its quadratic Taylor expansion. That is, since

    f(xk + p) = f(xk) + ∇f(xk)ᵗ p + (1/2) pᵗ ∇²f(xk + tp) p

for some t ∈ (0, 1), we define

    mk(p) := f(xk) + ∇f(xk)ᵗ p + (1/2) pᵗ Bk p,

where Bk is some symmetric matrix. This way, the difference between f(xk + p)
and mk(p) is O(kpk²), which is small if p is small. In the special case Bk = ∇²f(xk),
the error is O(kpk³), which leads to what is known as the trust-region Newton
method.

In order to obtain the next step, we solve

            Min. mk(p)
    (TRk)   s.t. kpk ≤ ∆k,
                 p ∈ Rn,


where ∆k is the radius of the trust-region. This is a quadratic problem, although


not necessarily convex because we are not requiring that B k be positive semidefinite.
Anyway, in order to obtain a good algorithmic performance, it is enough to solve
this problem (convex or not) approximately.

First, we need to decide the radius of the trust region. Given a step pk, we define
the ratio

    ρk := (f(xk) − f(xk + pk)) / (mk(0) − mk(pk)).
Note that the numerator f (xk ) − f (xk + pk ) is the actual function decrease, while the
denominator mk (0) − mk (pk ) is the estimated function decrease. Since pk is obtained
by minimizing mk over a region that includes p = 0, the estimated decrease is always
nonnegative. Now:
• If ρk is close to 1, then mk is a good approximation. We update xk+1 := xk + pk
and ∆k+1 ≥ ∆k , that is, we may even consider increasing the size of the trust-
region.
• If ρk is positive but much smaller than 1, or if it is negative, then we have a poor
approximation. We shrink the trust region.
A generic trust-region method is as follows:
Algorithm 5.1 (Generic Trust-Region Method, GTRM)
Step 1: Choose ∆̂ > 0, which will be an upper bound on the step lengths.
Choose ∆0 ∈ (0, ∆̂), η ∈ [0, 1/4), a tolerance ε > 0 and a starting point x0 ∈ Rn.
Let k := 0.
Step 2: If k∇f(xk)k < ε, STOP.
Step 3: Solve (approximately) problem (TRk) and let pk be its solution.
Step 4: Evaluate ρk.
• If ρk < 1/4, then ∆k+1 = (1/4)∆k. The approximation is poor and we shrink the
trust region.
• Otherwise:
  – If ρk > 3/4 and kpkk = ∆k, then ∆k+1 = min{2∆k, ∆̂}. The approximation is
  very good and we enlarge the trust region.
  – Otherwise, ∆k+1 = ∆k.
Step 5: If ρk > η, then xk+1 := xk + pk. Go to Step 2.
Otherwise, xk+1 := xk. The approximation is not good and we repeat the search
in a smaller region (see the previous step). Go to Step 3.
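
A minimal Python (NumPy) sketch of Algorithm 5.1 follows (the names and the
initial radius are ours); solve_tr stands for any approximate solver of (TRk), for
example the Cauchy point derived in Section 5.2.

import numpy as np

def trust_region(f, grad, hess, x0, solve_tr,
                 delta_hat=10.0, eta=0.15, eps=1e-6):
    # Algorithm 5.1 (GTRM). solve_tr(g, B, delta) returns an approximate
    # solution p of the trust-region subproblem.
    x = np.asarray(x0, dtype=float)
    delta = delta_hat / 2                             # Delta^0 in (0, Delta_hat)
    while np.linalg.norm(grad(x)) >= eps:             # Step 2
        g, B = grad(x), hess(x)
        p = solve_tr(g, B, delta)                     # Step 3
        model_decrease = -(g @ p + 0.5 * p @ B @ p)   # m_k(0) - m_k(p)
        rho = (f(x) - f(x + p)) / model_decrease      # Step 4
        if rho < 0.25:
            delta = 0.25 * delta                      # shrink the region
        elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
            delta = min(2 * delta, delta_hat)         # enlarge the region
        if rho > eta:                                 # Step 5: accept the step
            x = x + p
    return x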

Now, it remains to see how we solve problem (TRk ).

Dropping the superindex k, we need to study how to solve the problem

           Min. m(p) = f(x) + ∇f(x)ᵗ p + (1/2) pᵗ B p,
    (TR)   s.t. kpk ≤ ∆,
                p ∈ Rn,

where B is a symmetric matrix.



5.1. Exact Solution of the Trust-Region Problem


When the trust-region problem is not too large, it may be worth solving it exactly.
The following result provides a characterization of the solution.
Theorem 5.2
The vector p∗ is a solution of (TR) if, and only if, p∗ is feasible and there is a
scalar λ ≥ 0 such that the following conditions are satisfied:
1. (B + λIn )p∗ = −∇f (x).
2. λ(∆ − kp∗ k) = 0.
3. B + λIn is positive semidefinite.

Note that this is a very remarkable and rare result because it provides a characteri-
zation for the optimal solutions of a nonconvex quadratic optimization problem.

5.2. The Cauchy Point


For the purposes of convergence, it is not necessary to solve (TR) exactly. It is enough
if there is a sufficient reduction in the model. We know that line search methods
have good theoretical convergence properties (they can be globally convergent with
conditions on the step size that are not too demanding). In the case of trust-region
methods, this is also true. So, we will solve (TR) approximately with this in mind.

First, we solve the linear version of (TR) in order to obtain a promising direction:
pS := arg min f (x) + ∇f (x)t p
s.t. kpk ≤ ∆,
p ∈ Rn .
Then we calculate a scalar τ that minimizes m(τ pS ) in the trust region:
τ := arg min m(τ pS )
s.t. kτ pS k ≤ ∆,
τ ≥ 0.
We define pC := τ pS . The Cauchy point is x + pC .

We can now obtain the Cauchy point explicitly. Note first that pS is the solution of
a linear problem with a constraint on the norm of the solution. It is easy to see that

    pS = −(∆ / k∇f(x)k) ∇f(x).

(We are assuming that ∇f(x) ≠ 0.) Now, in order to minimize m(τ pS), we distin-
guish whether ∇f(x)ᵗ B ∇f(x) is positive or negative.
• Case 1: ∇f (x)t B∇f (x) ≤ 0.
m(τ pS ) decreases with τ . Thus, the optimal solution is τ = 1, which is the largest
value such that τ pS is still in the trust region.

• Case 2: ∇f(x)ᵗ B ∇f(x) > 0.
In this case, m(τ pS) is a convex quadratic form in τ. So, the optimal point is the
minimum of 1 (the border of the feasible region) and k∇f(x)k³ / (∆ ∇f(x)ᵗ B ∇f(x))
(the global minimum of the quadratic form).
All this can be checked easily by calculating the first derivative of m(τ pS).

Therefore,

    pC = −(∆ / k∇f(x)k) ∇f(x)                                      if ∇f(x)ᵗ B ∇f(x) ≤ 0,

    pC = −min{ ∆ / k∇f(x)k , k∇f(x)k² / (∇f(x)ᵗ B ∇f(x)) } ∇f(x)   if ∇f(x)ᵗ B ∇f(x) > 0.
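
The explicit formula for pC translates directly into code; a short sketch (names ours):

import numpy as np

def cauchy_point_step(g, B, delta):
    # Explicit Cauchy step p^C for m(p) = f + g^t p + 0.5 p^t B p, ||p|| <= delta,
    # where g = grad f(x).
    gBg = g @ B @ g
    if gBg <= 0:                                      # Case 1: go to the boundary
        coeff = delta / np.linalg.norm(g)
    else:                                             # Case 2
        coeff = min(delta / np.linalg.norm(g), (g @ g) / gBg)
    return -coeff * g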

The Cauchy point provides sufficient reduction in mk so that we have global con-
vergence (see Theorem 5.3, no proof will be given here). However, in practice, after
having calculated the Cauchy point, we try to improve it (although how this can be
done will not be explained in this course). The reason is that the plain use of the
Cauchy point is just the steepest descent method with a particular step size. And we
know that the steepest descent method converges only linearly (even for an optimal
step size).

Theorem 5.3
Let η ∈ (0, 1/4) in Algorithm GTRM. Suppose that kBkk ≤ β for some constant β, that
f is bounded below on the level set S = {x ∈ Rn / f(x) ≤ f(x0)} and is Lipschitz
continuously differentiable in S(R0) = {x ∈ Rn / kx − yk < R0 for some y ∈ S} for
some R0 > 0. If the approximate solutions of (TRk) satisfy the following inequalities:
1. mk(0) − mk(pk) ≥ c1 k∇f(xk)k min{∆k, k∇f(xk)k / kBkk} for some constant c1 ∈ (0, 1),
2. kpkk ≤ γ∆k for some constant γ ≥ 1,
then lim_{k→∞} ∇f(xk) = 0.

The first inequality is satisfied if the approximate solutions pk achieve a reduction


that is at least some fixed fraction of the reduction achieved by the Cauchy point
(details are omitted here). The second condition just allows for a generalization of
the trust-region method. If we just want to apply GTRM, we take γ = 1.
Chapter 6

Conjugate Gradient Methods

The conjugate gradient method is a technique for solving linear systems of equa-
tions. It is an alternative to Gaussian elimination that is used to solve large systems.
If we are solving Ax = b, with A a (square) symmetric positive definite matrix, we
know that x = A⁻¹b. However, the complexity of this calculation is O(n³), which is
computationally very expensive for large problems. Conjugate gradient methods
allow us to do better than that.

There is a version for nonlinear problems that was introduced by Fletcher and Reeves
in the 1960s. Here we will only study the linear conjugate gradient method. For this
reason, we will drop the word “linear”.

6.1. The Conjugate Direction Method


To begin with, we observe that the solution of Ax = b, with A a symmetric positive
definite matrix, is the unique global minimum of the function f(x) = (1/2) xᵗAx − bᵗx.
This means that we can see conjugate gradient methods either as a technique for
solving linear systems or as one for minimizing convex quadratic functions.

The notion of conjugacy is also important in this algorithm.

Definition 6.1
A set of nonzero vectors {p0 , p1 , . . . , p` } is said to be conjugate with respect to the
symmetric positive definite matrix A if

(pi )t Apj = 0 ∀i 6= j.

Example 6.2
Let

        [ 3  0  1 ]
    A = [ 0  4  2 ],   p0 = (1, 0, 0),  p1 = (1, 0, −3),  and  p2 = (1, 4, −3).
        [ 1  2  3 ]

It is easy to check that these three vectors are conjugate with respect to the positive
definite matrix A.

It is easy to see that if a set of vectors is conjugate, then it is linearly independent.

Given a starting point x0 ∈ Rn and a set of conjugate directions {p0 , p1 , . . . , pn−1 },


we generate the following sequence:

xk+1 := xk + αk pk ,

where αk is the one-dimensional minimizer of f(xk + αpk). It is easy to see (exercise 3
of tutorial 2) that

    αk := −(rk)ᵗ pk / ((pk)ᵗ A pk),

where rk := ∇f(xk) = Axk − b. This algorithm is known as the conjugate direction
method and converges in at most n steps.
Theorem 6.3
For any x0 ∈ Rn , the sequence {xk } generated by the conjugate direction method
converges to the solution x∗ of Ax = b in at most n steps.
Proof:
Since {p0, p1, . . . , pn−1} is a linearly independent set, these vectors span the whole
space Rn. Therefore, for the point x∗ − x0 ∈ Rn there are scalars σi ∈ R such that

    x∗ − x0 = Σ_{i=0}^{n−1} σi pi.

If we multiply on the left by (pk)ᵗ A, due to the conjugacy property, we have that

    (pk)ᵗ A (x∗ − x0) = Σ_{i=0}^{n−1} σi (pk)ᵗ A pi = σk (pk)ᵗ A pk;

    σk = (pk)ᵗ A (x∗ − x0) / ((pk)ᵗ A pk).
Now, we will show that σ k = αk for all k.

We have that

xk = xk−1 +αk−1 pk−1 = xk−2 +αk−2 pk−2 +αk−1 pk−1 = . . . = x0 +α0 p0 +. . .+αk−1 pk−1 .

If we multiply on the left by (pk )t A, we have that

(pk )t Axk = (pk )t Ax0 .

Thus,

(pk )t A(x∗ − x0 ) = (pk )t A(x∗ − xk ) = (pk )t (b − Axk ) = −(pk )t rk ,

and

    σk = −(pk)ᵗ rk / ((pk)ᵗ A pk) = αk.

A property of any sequence of points generated with the conjugate direction method
is that (rk )t pi = 0, i = 0, 1, . . . , k − 1. We will skip the proof of this result.

6.2. The Conjugate Gradient Method


Among the many different ways of generating a conjugate set of vectors, we are
going to choose one where pk is generated using only the previous vector pk−1 . In
particular, we write
pk := −rk + β k pk−1
and now we look for a scalar β k such that pk and pk−1 are conjugates with respect
to A.

If we multiply on the left by (pk−1)ᵗ A and impose that (pk−1)ᵗ A pk = 0, it is easy to
see that

    βk = (rk)ᵗ A pk−1 / ((pk−1)ᵗ A pk−1).
If we choose p0 as the steepest descent direction (that is, −∇f (x0 )), then we obtain
the so-called conjugate gradient method.
Algorithm 6.4 (Conjugate Gradient Method, Preliminary Version)
Step 1: Choose x0 ∈ Rn. Let r0 := Ax0 − b, p0 := −r0, k := 0.
Step 2: If rk ≠ 0, then:
• αk := −(rk)ᵗ pk / ((pk)ᵗ A pk),
• xk+1 := xk + αk pk,
• rk+1 := Axk+1 − b,
• βk+1 := (rk+1)ᵗ A pk / ((pk)ᵗ A pk),
• pk+1 := −rk+1 + βk+1 pk,
• k := k + 1.
Repeat Step 2.

Moreover, it can be seen (we will skip the proof) that not only is pk conjugate
with pk−1 with respect to A, but also with p0 , p1 , . . . , pk−2 .

Proposition 6.5
Suppose that the k-th iterate generated by the conjugate gradient method is not the
solution point x∗ . The following properties hold:
1. (rk )t ri = 0, i = 0, 1, . . . , k − 1.
2. (pk )t Api = 0, i = 0, 1, . . . , k − 1.
3. The sequence {xk } converges to x∗ in at most n steps.

There is a standard version of the conjugate gradient method that is slightly different,
in order to do fewer multiplications. By using that in any conjugate direction algorithm
it holds that (rk)ᵗ pi = 0, i = 0, 1, . . . , k − 1, we have that

    αk = −(rk)ᵗ pk / ((pk)ᵗ A pk) = −(rk)ᵗ(−rk + βk pk−1) / ((pk)ᵗ A pk)
       = (rk)ᵗ rk / ((pk)ᵗ A pk).

Now, we observe that rk+1 − rk = A(xk+1 − xk) = αk A pk. So, using now that
(rk)ᵗ ri = 0, i = 0, 1, . . . , k − 1, we have that

    βk+1 = (rk+1)ᵗ A pk / ((pk)ᵗ A pk) = ((rk+1)ᵗ(rk+1 − rk)/αk) / ((pk)ᵗ(rk+1 − rk)/αk)
         = −(rk+1)ᵗ rk+1 / ((pk)ᵗ rk) = −(rk+1)ᵗ rk+1 / ((−rk + βk pk−1)ᵗ rk)
         = (rk+1)ᵗ rk+1 / ((rk)ᵗ rk).

By writing these scalars in this way, we obtain the standard version of the conjugate
gradient method. Observe how this new version has fewer multiplications than the
previous one.
Algorithm 6.6 (Conjugate Gradient Method)
Step 1: Choose x0 ∈ Rn. Let r0 := Ax0 − b, p0 := −r0, k := 0.
Step 2: If rk ≠ 0, then:
• αk := (rk)ᵗ rk / ((pk)ᵗ A pk),
• xk+1 := xk + αk pk,
• rk+1 := rk + αk A pk,
• βk+1 := (rk+1)ᵗ rk+1 / ((rk)ᵗ rk),
• pk+1 := −rk+1 + βk+1 pk,
• k := k + 1.
Repeat Step 2.
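
Algorithm 6.6 takes only a few lines of Python (NumPy); this is a sketch where an
absolute tolerance replaces the exact test rk ≠ 0:

import numpy as np

def conjugate_gradient(A, b, x0, eps=1e-10):
    # Algorithm 6.6: standard conjugate gradient for Ax = b, with A
    # symmetric positive definite. Only the latest x, r, p are stored.
    x = np.asarray(x0, dtype=float)
    r = A @ x - b
    p = -r
    while np.linalg.norm(r) > eps:
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r + alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = -r_new + beta * p
        r = r_new
    return x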

As can be seen, it is not necessary to keep the values of the vectors x, r, and p
for more than the last two iterations, which is an important saving in the compu-
tational implementation. The cost per iteration is O(n²); thus, in theory it is an
O(n³) method. However, if we have a “good” distribution of the eigenvalues of A
(e.g., repeated eigenvalues, or some large eigenvalues with the others clustered
around 1), then we need far fewer than n iterations. This method is more efficient
for large systems; for small systems, Gaussian elimination performs better.
Chapter 7

Constrained Optimization

In most problems, there are some requirements that any solution must meet. In
this case, we are dealing with a constrained problem and these requirements are
the constraints. Usually, an equality constraint can be handled mathematically
more easily than an inequality. However, an equality is more difficult to satisfy
numerically because of its restrictiveness. This contributes to making constrained
problems much more challenging than unconstrained problems.

The applications of constrained optimization are many. Moreover, some nonsmooth


unconstrained problems can be solved using constrained optimization.
Example 7.1
Let g1, g2 be two differentiable functions. The following unconstrained problem is
nonsmooth because it involves a maximum:
Min f (x) := max{g1 (x), g2 (x)}
s.t. x ∈ Rn .
Just think of g1 (x) = x and g2 (x) = x2 . Clearly, the maximum function is not
differentiable at x = 0 and x = 1. However, by adding a new variable, we can write
the problem as a constrained smooth problem:
Min w
s.t. w ≥ g1 (x),
w ≥ g2 (x),
x ∈ Rn .

Now the problem to solve is


Min. f (x)
s.t. x ∈ Ω ⊆ Rn .

Ω is the feasible set and we will assume that it is defined by a finite number of
constraints. Each of these constraints can be either an inequality (g(x) ≤ 0) or an
equality (h(x) = 0). Thus, the problem we are solving is as follows:

Min. f (x)
(CP) s.t. gi (x) ≤ 0, i ∈ I := {1, 2, . . . , m},
hj (x) = 0, j ∈ E := {m + 1, m + 2, . . . , m + p}.

Definition 7.2
A point x∗ ∈ Rn is a local minimum of f in Ω if there is a neighborhood N of x∗
such that f (x) ≥ f (x∗ ) for all x ∈ N ∩ Ω.

In the same way that we established necessary and sufficient conditions for local
minima in unconstrained problems, we will derive similar results for problems with
constraints. We begin by defining an important set.

Definition 7.3
Given a feasible point x∗ of (CP ), its active set A(x∗ ) is the set of constraints
satisfied with equality, that is,

A(x∗ ) := {i ∈ I / gi (x∗ ) = 0} ∪ E.

A constraint gi (x) ≤ 0 is said to be an active constraint for x∗ if gi (x∗ ) = 0.

Example 7.4
In the problem
Min. (x1 − 1)2 + (x2 − 2)2
s.t. x1 + x2 ≤ 2,
x21 − x2 = 0,
the inequality constraint is active for (1, 1) but inactive for (0, 0).

7.1. Karush-Kuhn-Tucker Conditions


The following Lagrangian function plays an important role in constrained opti-
mization:

    L(x, λ, µ) := f(x) + Σ_{i∈I} λi gi(x) + Σ_{j∈E} µj hj(x).

Another important element is the concept of Karush-Kuhn-Tucker point.

Definition 7.5 (KKT point)


Consider (CP) and assume that f , gi and hi are continuously differentiable for all i.
A point x∗ ∈ Rn is a Karush-Kuhn-Tucker (KKT) point if there are scalars
λ∗i, µ∗j ∈ R such that:

1. ∇f(x∗) + Σ_{i∈I} λ∗i ∇gi(x∗) + Σ_{j∈E} µ∗j ∇hj(x∗) = 0. (Or, using the Lagrangian
function, ∇x L(x∗, λ∗, µ∗) = 0.)

2. gi(x∗) ≤ 0 ∀i ∈ I,
   hj(x∗) = 0 ∀j ∈ E.

3. λ∗i ≥ 0 ∀i ∈ I.

4. λ∗i gi(x∗) = 0 ∀i ∈ I.
These four conditions are known as the Karush-Kuhn-Tucker conditions. The
first condition is the stationarity condition, the second is the primal feasibility
condition, the third is the dual feasibility condition, and the fourth is the com-
plementary slackness condition. The values λ∗i and µ∗j are known as Lagrange
multipliers.

An immediate consequence of this definition is that if (x∗ , λ∗ , µ∗ ) is a KKT point


and constraint gi (x) ≤ 0 is not active, then λ∗i = 0.

Example 7.6
In Example 7.4, we have that f (x) = (x1 − 1)2 + (x2 − 2)2 , g(x) = x1 + x2 − 2, and
h(x) = x21 − x2 .

The KKT conditions for a generic point (x1 , x2 ) are:


• Stationarity:
∇f(x) + λ∇g(x) + µ∇h(x) = 0;

(2x1 − 2, 2x2 − 4) + λ(1, 1) + µ(2x1 , −1) = (0, 0);



2x1 − 2 + λ + 2µx1 = 0,
2x2 − 4 + λ − µ = 0.

• Primal feasibility:

x1 + x2 ≤ 2,
x21 − x2 = 0.

• Dual feasibility: λ ≥ 0.
• Complementary slackness: λ(x1 + x2 − 2) = 0.

After some not too difficult calculations, which we will skip, we obtain that there are
three KKT points:
• x∗ = (−1, 1) for λ = 0 and µ = −2.
• x∗ = ((1 − √3)/2, (2 − √3)/2) for λ = 0 and µ = −(2 + √3).
• x∗ = (1, 1) for λ = 4/3 and µ = −2/3.
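
These calculations are easy to double-check numerically; a small sketch in Python
for the point x∗ = (1, 1) (the values are taken from the example above):

import numpy as np

x, lam, mu = np.array([1.0, 1.0]), 4.0 / 3.0, -2.0 / 3.0

grad_f = np.array([2 * x[0] - 2, 2 * x[1] - 4])
grad_g = np.array([1.0, 1.0])
grad_h = np.array([2 * x[0], -1.0])

stationarity = grad_f + lam * grad_g + mu * grad_h              # -> [0., 0.]
primal_feasible = (x[0] + x[1] - 2 <= 0, x[0]**2 - x[1] == 0)   # (True, True)
complementarity = lam * (x[0] + x[1] - 2)                       # -> 0.0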

7.2. Optimality Conditions


As we are going to see, KKT points are important in order to characterize a minimum
in a constrained problem. Before that, we need a technical definition.

Definition 7.7 (Linear Independence Constraint Qualification)


Given the point x and the active set A(x) in (CP), we say that the linear inde-
pendence constraint qualification (LICQ) holds if the set of gradients

{∇gi (x)}i∈A(x)∩I ∪ {∇hj (x)}j∈E

is linearly independent. In some references in the literature, when this property
holds, it is said that x∗ is a regular point.

The following result establishes conditions that every local minimum must satisfy.

Theorem 7.8 (First-Order Necessary Conditions)


Assume that x∗ is a local minimum of (CP), that the functions f , gi and hj are
continuously differentiable and that the LICQ holds at x∗ . Then there is a Lagrange
multiplier vector (λ∗ , µ∗ ) such that (x∗ , λ∗ , µ∗ ) is a KKT point.

Moreover, it can be proved that the LICQ guarantees that the Lagrange multiplier
vector is unique.

Now, before we can state a condition that uses second-order derivatives information,
we need another definition.
Definition 7.9
Given a KKT point (x∗ , λ∗ , µ∗ ), the critical cone for this point is the set

C(x∗ , λ∗ ) = {w / ∇hj (x∗ )t w = 0 ∀j ∈ E,


∇gi (x∗ )t w = 0 ∀i ∈ A(x∗ ) ∩ I with λ∗i > 0,
∇gi (x∗ )t w ≤ 0 ∀i ∈ A(x∗ ) ∩ I with λ∗i = 0}.

With the help of the critical cone, we can narrow down even further which points
could potentially be local minima.

Theorem 7.10 (Second-Order Necessary Conditions)


Assume that x∗ is a local minimum of (CP), that the functions f , gi and hj are
twice continuously differentiable and that the LICQ holds at x∗ . Let (λ∗ , µ∗ ) be the
Lagrange multiplier vector for which the KKT conditions are satisfied. Then

wt ∇2xx L(x∗ , λ∗ , µ∗ )w ≥ 0 ∀w ∈ C(x∗ , λ∗ ).

Finally, a sufficient condition is established. Note that here the linear independence
constraint qualification is not required.

Theorem 7.11 (Second-Order Sufficient Conditions)


Let x∗ be a feasible point of (CP) and assume that there is a Lagrange multiplier
vector (λ∗ , µ∗ ) such that (x∗ , λ∗ , µ∗ ) is a KKT point. If

wt ∇2xx L(x∗ , λ∗ , µ∗ )w > 0 ∀w ∈ C(x∗ , λ∗ ), w 6= 0,

then x∗ is a strict local minimum of (CP).

Example 7.12
In the previous example, we have seen that

∇x L(x, λ, µ) = (2x1 − 2 + λ + 2µx1 , 2x2 − 4 + λ − µ).

 
Therefore,

    ∇²xx L(x, λ, µ) = [ 2 + 2µ   0 ]
                      [ 0        2 ].

For x∗ = (1, 1), λ∗ = 4/3 and µ∗ = −2/3, we have that

    ∇²xx L(x∗, λ∗, µ∗) = [ 2/3   0 ]
                         [ 0     2 ],

which is positive definite. In particular, the previous theorem holds and, therefore,
x∗ = (1, 1) is a strict local minimum.

Other Constraint Qualifications


There are other alternative conditions that we can use instead of LICQ so that KKT
conditions are necessary for local minima. In this case, convexity and/or linearity
will play an important role.

Definition 7.13 (Slater’s Constraint Qualification)


Consider (CP) and assume that gi is convex for all i ∈ I and that the only constraints
are inequalities, that is, E = ∅. We say that Slater’s Constraint Qualification
(SCQ) is satisfied if there is a point x̂ such that gi (x̂) < 0 ∀i ∈ I.

Definition 7.14 (Generalized Slater’s Constraint Qualification)


Consider (CP) and assume that the inequalities are either convex or affine functions
and that all the equalities are affine functions, that is:
1. I = Ic ∪ I` , with gi convex for all i ∈ Ic and gi affine for all i ∈ I` .
2. hj is affine for all j ∈ E.
We say that the Generalized Slater’s Constraint Qualification (GSCQ) is
satisfied if there is a point x̂ such that:
1. gi (x̂) < 0 ∀i ∈ Ic .
2. gi (x̂) ≤ 0 ∀i ∈ I` .
3. hj (x̂) = 0 ∀j ∈ E.

Example 7.15
If the constraints are
2x1 + x2 ≤ 1,
x21 ≤ 1,
then the first constraint is linear and the second is nonlinear, but involves a convex
function. It is easy to see that GSCQ is satisfied by considering x̂ = (0, 0).

If LICQ is replaced by SCQ or GSCQ, then Theorem 7.8 is still true (although for
GSCQ we require that f be a convex function).
Theorem 7.16
If x∗ is a minimum of (CP) and either:
1. SCQ is satisfied, or
2. GSCQ is satisfied and f is convex,
then there is a vector of Lagrange multipliers λ∗ (respectively, (λ∗, µ∗)) such that
(x∗, λ∗) (respectively, (x∗, λ∗, µ∗)) is a KKT point.
Note: Besides the conditions of SCQ/GSCQ, we require the nonlinear functions in
the constraints to be continuously differentiable because this is needed in the defini-
tion of a KKT point.

In some special (but very general) cases, the KKT conditions are necessary without
requiring any constraint qualification.
Proposition 7.17
If all the constraints in (CP) are linear and x∗ is a local minimum, then there is a
Lagrange multiplier vector (λ∗ , µ∗ ) such that (x∗ , λ∗ , µ∗ ) is a KKT point.

If there is convexity, KKT conditions are also sufficient.


Theorem 7.18
If (x∗, λ∗, µ∗) is a KKT point for (CP), with f and gi convex for all i and all the
constraints hj affine, then x∗ is a global minimum of (CP).

Therefore, if SCQ or GSCQ is satisfied, then x∗ is a global minimum if, and only if,
there is a vector of Lagrange multipliers so that there is an associated KKT point.

Finally, let us study what happens with linear problems. After all, they are a par-
ticular case of nonlinear optimization.

Linear Programming
In the case of a linear problem, we are solving
Min. ct x
(P) s.t. Ax = b,
x ≥ 0,

with A an m × n full row rank matrix. This is a convex problem with linear con-
straints. Thus, x∗ is a (global) minimum if, and only if, it is a KKT point. Since
f(x) = cᵗx, g(x) = −x, and h(x) = Ax − b, the KKT conditions for a general
point x are:

    c − λ + Aᵗµ = 0,
    x ≥ 0,
    Ax = b,
    λ ≥ 0,
    λi xi = 0,  i = 1, . . . , n.

If we now consider the dual problem of (P), with µ ∈ Rm the dual variables for
the equality constraints and λ the dual variables for the inequality constraints, this
dual (D) is:
Max. bt µ
(D) s.t. At µ + λ = c,
λ ≥ 0.
The KKT conditions for (D) are the same as for the primal problem. Therefore,
x∗ is an optimal solution to (P) and (λ∗, µ∗) is an optimal solution to (D) if, and
only if, (x∗, λ∗, µ∗) is a KKT point, in which case cᵗx∗ = bᵗµ∗.
Chapter 8

Interior Point Methods

In this chapter we will study interior point methods for solving linear and convex
quadratic problems. When solving the linear problem

Min. ct x
(P) s.t. Ax = b,
x ≥ 0,

with A an m × n full row rank matrix, the simplex method searches sequentially
from vertex to vertex along the boundary of the feasible region until it finds an
optimal solution. The idea behind interior point methods is exactly the opposite:
the search is performed along a path that lies in the interior of the feasible region
until it converges to an optimal solution.

The worst-case complexity of the simplex method is exponential, while that of in-
terior point methods is polynomial. However, each iteration of an interior point
method is computationally more expensive.

Here we will study what is called a primal-dual method, the name coming from the
simultaneous use of the primal and the dual formulation of the problem.

8.1. Primal-Dual Algorithms


We have seen that if we consider the dual linear problem of (P)

Max. bt µ
(D) s.t. At µ + s = c,
s ≥ 0,

then both (P) and (D) have the same (necessary and sufficient for optimality) KKT
conditions:

    c − s + Aᵗµ = 0,
    x ≥ 0,
    Ax = b,
    s ≥ 0,
    si xi = 0,  i = 1, . . . , n.

If we write −µ instead of µ (we can do this because it is a free variable) and rearrange
slightly how we display these conditions, then we have that:

    Aᵗµ + s = c,
    Ax = b,
    xi si = 0,  i = 1, . . . , n,
    (x, s) ≥ 0.

What are known as primal-dual methods find solutions (x∗ , s∗ , µ∗ ) of this system by
applying variants of Newton’s method to the system of the three equalities and then
modifying the search direction so that the inequality is satisfied strictly.

In order to obtain a primal-dual interior point method, the previous optimality con-
ditions are written as follows:

                   [ Aᵗµ + s − c ]
    F(x, s, µ) :=  [ Ax − b      ]  = 0,
                   [ XSe         ]

    (x, s) ≥ 0,

where

    X = diag{x1, . . . , xn},   S = diag{s1, . . . , sn},   and   e = (1, . . . , 1)ᵗ.

The goal is to obtain iterates (xk , sk , µk ) that satisfy xk > 0 and sk > 0.

The value

    d := xᵗs / n

can be seen as a measure of how desirable the solution is, and it is known as the
duality measure.

We will use Newton’s method to solve F(x, s, µ) = 0 and to obtain a search direc-
tion (∆x, ∆s, ∆µ). Thus, we need to solve

                 [ ∆x ]
    J(x, s, µ)   [ ∆s ]  = −F(x, s, µ),
                 [ ∆µ ]

where J is the Jacobian of F. Therefore, we must solve

    [ 0   In  Aᵗ ] [ ∆x ]   [ −Aᵗµ − s + c ]
    [ A   0   0  ] [ ∆s ] = [ −Ax + b      ].
    [ S   X   0  ] [ ∆µ ]   [ −XSe         ]

However, if we take a full step along the solution of this system, it is likely that we
will violate the condition (x, s) > 0. Therefore, we choose a smaller step size α < 1
and the new iterate is

    (x, s, µ) + α(∆x, ∆s, ∆µ).

A less aggressive option is to just seek to reduce the value of the products xi si by
trying to find a point for which xi si = σd, where d is the current duality measure
and σ ∈ (0, 1) is the reduction factor that we would like to achieve. The modified
system of equations is

    [ 0   In  Aᵗ ] [ ∆x ]   [ −Aᵗµ − s + c ]
    [ A   0   0  ] [ ∆s ] = [ −Ax + b      ].
    [ S   X   0  ] [ ∆µ ]   [ −XSe + σde   ]

σ is called the centering parameter.

To summarize, we have the following algorithm:

Algorithm 8.1 (General Primal-Dual Path-Following Method)


Step 1: Choose (x0 , s0 , µ0 ) such that Ax0 = b, At µ0 + s0 = c, and (x0 , s0 ) > 0.
Choose ε > 0 and 0 < σmin < σmax < 1. Let k := 0.
Step 2: If (xk )t sk < ε, then STOP.
Choose σ k ∈ [σmin , σmax ] and solve
    
    [ 0   In  Aᵗ ] [ ∆xk ]   [ −Aᵗµk − sk + c     ]
    [ A   0   0  ] [ ∆sk ] = [ −Axk + b           ],
    [ Sk  Xk  0  ] [ ∆µk ]   [ −Xk Sk e + σk dk e ]

where dk := (xk)ᵗ sk / n.
Update
(xk+1 , sk+1 , µk+1 ) = (xk , sk , µk ) + αk (∆xk , ∆sk , ∆µk ),
where αk is chosen so that (xk+1 , sk+1 ) > 0.
Repeat Step 2.

Different choices of σ k and αk give rise to methods with different properties.
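
A minimal Python (NumPy) sketch of a single iteration of Algorithm 8.1 follows;
the fixed σ and the fraction-to-the-boundary rule with factor η are choices of ours.

import numpy as np

def ipm_step(A, b, c, x, s, mu, sigma=0.1, eta=0.99):
    # One iteration of Algorithm 8.1: Newton step on the perturbed KKT
    # system, then a step length keeping (x, s) strictly positive.
    m, n = A.shape
    d = (x @ s) / n                                   # duality measure
    K = np.block([
        [np.zeros((n, n)), np.eye(n),        A.T],
        [A,                np.zeros((m, n)), np.zeros((m, m))],
        [np.diag(s),       np.diag(x),       np.zeros((n, m))],
    ])
    rhs = np.concatenate([c - A.T @ mu - s, b - A @ x, -x * s + sigma * d])
    delta = np.linalg.solve(K, rhs)
    dx, ds, dmu = delta[:n], delta[n:2 * n], delta[2 * n:]
    # Largest alpha in (0, 1] keeping (x, s) > 0, damped by eta.
    ratios = [-xi / di for xi, di in zip(x, dx) if di < 0]
    ratios += [-si / di for si, di in zip(s, ds) if di < 0]
    alpha = min(1.0, eta * min(ratios)) if ratios else 1.0
    return x + alpha * dx, s + alpha * ds, mu + alpha * dmu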

8.2. The Central Path


We define the primal-dual feasible set F and the strictly feasible set F0 as

    F  := {(x, s, µ) / Ax = b, Aᵗµ + s = c, (x, s) ≥ 0},
    F0 := {(x, s, µ) / Ax = b, Aᵗµ + s = c, (x, s) > 0}.




The central path C is an arc of strictly feasible points and it is very important in
primal-dual interior point algorithms. It is parameterized by a scalar τ > 0 and a
point (xτ , sτ , µτ ) ∈ C satisfies that

At µ + s = c,
Ax = b,
xi si = τ, i = 1, . . . , n,
(x, s) > 0.

Note that these are the KKT conditions where value 0 has been changed to τ in the
right-hand side of the third equation. We define

C := {(xτ , sτ , µτ )}τ >0 .

It can be shown that each (xτ , sτ , µτ ) is unique for that τ if, and only if, F 0 is
nonempty.

Additionally, it is easy to see that if we consider the strictly convex problem with
logarithmic barrier with parameter τ > 0

    Min. cᵗx − τ Σ_{i=1}^{n} ln xi
    s.t. Ax = b,

its KKT conditions (which characterize its global minimum) are exactly the first
three equations defining (xτ, sτ, µτ), where we define si := τ/xi. It is also obvious
that x > 0 in any optimal solution (and, thus, s > 0).

If (xτ , sτ , µτ ) converges when τ tends to zero, it is clear that it converges to a primal-


dual solution (x∗ , s∗ , µ∗ ) of the linear problem. Therefore, the central path follows
a route that goes towards a solution while keeping x and s positive and making the
products xi si decrease towards zero.

A Long-Step Path-Following Algorithm


We can prevent the iterates from getting too close to the boundary of the nonnegative
orthant by defining the following neighborhood of the central path:

N−∞ (γ) := {(x, s, µ) ∈ F 0 / xi si ≥ γd, i = 1, . . . , n},

for some γ ∈ (0, 1] (for example, γ = 0.001).

Integrating this in the general Algorithm 8.1, we compute αk so that it is the


largest α ∈ (0, 1] such that (xk+1 , sk+1 , µk+1 ) ∈ N−∞ (γ). This way, we obtain what
we call a Long-Step Path-Following Primal-Dual Method (LSPFPDM).

Theorem 8.2
If the sequence {(xk, sk, µk)} is generated with LSPFPDM, then there is a con-
stant δ > 0, independent of n, such that

    dk+1 ≤ (1 − δ/n) dk    for all k ≥ 0.

Therefore, lim_{k→+∞} dk = 0. Moreover, dk ≤ ε d0 in O(n log(1/ε)) steps.

8.3. Convex Quadratic Problems


Now we are going to see that the previous methodology can be easily extended to
solve the convex quadratic problem

            Min. cᵗx + (1/2) xᵗQx
    (CQP)   s.t. Ax = b,
                 x ≥ 0,

where Q is symmetric positive semidefinite and A is an m × n full row rank matrix.

Since we have a convex problem, its solutions are characterized by the KKT condi-
tions

    Aᵗµ + s − Qx = c,
    Ax = b,
    (x, s) ≥ 0,
    xi si = 0,  i = 1, . . . , n.

As in the linear case, we develop a path-following primal-dual method by considering
the perturbed KKT conditions

                   [ Aᵗµ + s − Qx − c ]
    F(x, s, µ) :=  [ Ax − b           ]  = 0,
                   [ XSe − σde        ]

    (x, s) ≥ 0,

where d := xᵗs/n can be seen as the value that controls the distance to optimality
and σ ∈ (0, 1) is chosen arbitrarily. Actually, similarly to what happened in the
linear case, these equations are the KKT conditions for a certain quadratic problem
with logarithmic barrier.

If we now apply Newton’s method to calculate the search direction, we must solve

    [ −Q  In  Aᵗ ] [ ∆x ]   [ −Aᵗµ − s + Qx + c ]
    [ A   0   0  ] [ ∆s ] = [ −Ax + b           ].
    [ S   X   0  ] [ ∆µ ]   [ −XSe + σde        ]

The next iterate is


(x, s, µ) + α(∆x, ∆s, ∆µ),
where α is chosen so that (x, s) > 0.
