
Chapter 1

Introduction

1.1 General Problem Statement

Optimization considers problems of the form

min f (x) objective function


s. t. hi (x) = 0 i ∈ E subject to equality constraints
(OP )
gi (x) ≤ 0 i ∈ I inequality constraints
x∈Ω for variable x in ground set Ω.

In this, (OP) stands for optimization problem or program, min reads "minimize" (it describes the task; a minimum value need not exist) and s. t. is short for "subject to". There are a number of similar problem description conventions, like "f(x) → min s. t. ...", but the given one seems most widespread. In this course we mainly consider rather friendly finite dimensional optimization problems.
• Ω ⊆ Rn (sometimes also Cn); in advanced courses on optimal control (e. g. optimal acceleration and steering over time) or PDE optimization (e. g. optimal shapes or temperatures) the goal is to find optimal functions, in which case Ω specifies the corresponding function space.
• f, gi, hi : Rn → R "sufficiently smooth" (e. g. ∈ C²(Rn) or convex).
• The index sets E, I will be finite throughout this course; the case of infinite index sets combined with finite dimensional ground sets is studied in the field of "semi-infinite optimization".
Definition 1.1 Consider the optimization problem (OP).
• X = {x ∈ Ω : hj(x) = 0, j ∈ E, gi(x) ≤ 0, i ∈ I} is the feasible set (Menge der zulässigen Punkte / der zulässigen Lösungen oder zulässige Menge).
• A problem with X = ∅ is called infeasible (unzulässig).


• A point x ∈ X is a feasible point or feasible (zulässiger Punkt / zulässige Lösung / zulässig).
• An x∗ ∈ X is a (global) optimum (globales Optimum / (globale) Optimallösung) if

    f(x∗) ≤ f(x)   ∀x ∈ X.

• An x∗ ∈ X is a local optimum (lokales Optimum / lokale Optimallösung) if there is a neighborhood U(x∗) with

    f(x∗) ≤ f(x)   ∀x ∈ X ∩ U(x∗).

• The optimal value (Optimalwert) f∗ = inf{f(x) : x ∈ X} is ∞ if X = ∅. If f∗ = −∞, the problem is unbounded (unbeschränkt).

sketch: a 1d function with several local optima and the global optimum

Typical subjects in optimization are
• Do optima exist and how can they be recognized? → optimality conditions
• Can they be determined or approximated algorithmically?
• How efficient are the algorithms?
Regarding existence, the following result from basic analysis courses is essential.

Theorem On a compact set every continuous function attains its minimum and maximum.

On the other hand this result is of little help in actually locating a minimum, even in the case of rather simple continuous functions. To see this it suffices to imagine a function that is constant except for a tiny spike located at some unknown position, as illustrated below.

sketch: a constant function with a tiny spike at an unknown position

Therefore there is no hope for an efficient algorithmic optimization method in full generality. In consequence, the study of optimization problems is restricted to special classes of problems satisfying additional requirements on the properties of f, gi, hi and Ω.
While we will start with optimality conditions and line search methods for unconstrained smooth optimization problems (f ∈ C², I = E = ∅), it is worth first developing a wider perspective on the aims and scope of this course and of optimization in general. It will also allow us to quickly experiment with a wide range of different optimization applications in the exercises.

1.2 Convex Optimization

While convex functions may be nondifferentiable in some places, many convex optimization problems can be solved quite efficiently, which makes convex optimization a powerful tool for modeling and solving a wide range of problems in practice. The requirements on the functions and the ground set read
• f, gi are convex functions,
• hi are affine/linear functions (i. e., of the form ai⊤x + βi),
• Ω is a convex set.
It is helpful to investigate some of the main properties right away.

Definition 1.2 A set C ⊆ Rn is convex (cvx) if for any two elements x, y ∈ C the straight line segment [x, y] is also contained in C, i. e.,

    x, y ∈ C  ⇒  αx + (1 − α)y ∈ C   ∀α ∈ [0, 1].

Example
• ∅, Rn and its affine subspaces are convex.
• For a ∈ Rn, β ∈ R the halfspace Ha,β := {x ∈ Rn : a⊤x ≤ β} is convex.

Definition 1.3
• A function f : Rn → R̄ := R ∪ {∞} is convex if

    f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y)   ∀α ∈ (0, 1), ∀x, y ∈ Rn.

• It is strictly convex if

    f(αx + (1 − α)y) < αf(x) + (1 − α)f(y)   ∀α ∈ (0, 1), ∀x, y ∈ Rn, x ̸= y.

• It is proper convex if f is convex and ∃x ∈ Rn with f(x) < ∞.
• A function f : Rn → R ∪ {−∞} is concave if −f is convex.
• The epigraph of a function f : Rn → R̄ is the set epi f := {(x, r) ∈ Rn+1 : r ≥ f(x)}.
• The level set of a function f : Rn → R̄ to level r is the set Sr(f) := {x ∈ Rn : f(x) ≤ r}.

sketch: epi f and level set Sr(f) for two functions; in the right panel f takes the value ∞ outside its domain

Exercise The following functions are convex:
• affine functions f(x) = a⊤x + β,
• f(x) = x⊤Qx + a⊤x + β with Q positive semidefinite (strictly convex for Q positive definite),
• f(x) = max{a1⊤x + β1, a2⊤x + β2}.
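As a quick numerical sanity check of the second item (a minimal sketch assuming Python with numpy; the data and sample size are arbitrary illustrative choices, not part of the notes), one can test the defining inequality at random points:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    M = rng.standard_normal((n, n))
    Q = M @ M.T                      # Q = M M^T is positive semidefinite
    a, beta = rng.standard_normal(n), 1.5

    def f(x):                        # the quadratic from the exercise
        return x @ Q @ x + a @ x + beta

    for _ in range(1000):
        x, y = rng.standard_normal(n), rng.standard_normal(n)
        al = rng.uniform(0, 1)
        # convexity: f(al x + (1-al) y) <= al f(x) + (1-al) f(y), up to rounding
        assert f(al * x + (1 - al) * y) <= al * f(x) + (1 - al) * f(y) + 1e-9

Of course this only samples the inequality; the proof is the exercise.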

Observation 1.4 f : Rn → R̄ is cvx ⇔ epi f is cvx.



Proof: ⇒: Let (x, r), (y, p) ∈ epi f, then f(x) ≤ r, f(y) ≤ p. For α ∈ (0, 1)

    f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) ≤ αr + (1 − α)p,

thus α(x, r) + (1 − α)(y, p) = (αx + (1 − α)y, αr + (1 − α)p) ∈ epi f.
⇐: Given (x, f(x)) and (y, f(y)), for α ∈ (0, 1) there holds (αx + (1 − α)y, αf(x) + (1 − α)f(y)) ∈ epi f, hence f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y). □

Observation 1.5 The intersection of a family of convex sets is convex. Formally, given cvx Ci ⊆ Rn for i ∈ I, then ∩i∈I Ci is cvx.

Proof: Let x, y ∈ C = ∩i∈I Ci, then x, y ∈ Ci for all i ∈ I, thus [x, y] ⊆ Ci for all i ∈ I, hence [x, y] ⊆ C. □

Observation 1.6 The supremum of a family of convex functions is a convex function. Formally, given cvx fi : Rn → R̄ for i ∈ I, the function f : Rn → R̄, x ↦ supi∈I fi(x) is cvx.

Proof: Check that epi f = ∩i∈I epi fi and use Obs 1.4 and 1.5. □

Observation 1.7 Level sets of convex functions are convex.

Proof: Form Sr′ = epi f ∩ {(x, r) : x ∈ Rn} (the second set is cvx). By Obs 1.4 and 1.5 this Sr′ is cvx and therefore so is Sr(f) = {x : (x, r) ∈ Sr′} (Ex). □
Note, however, that functions with convex level sets are not necessarily convex (these are called quasiconvex; draw some examples).
For f, gi cvx, hi affine and Ω cvx the optimization problem

    min  f(x)
    s.t. hi(x) = 0, i ∈ E
         gi(x) ≤ 0, i ∈ I
         x ∈ Ω

has feasible set

    X = Ω ∩ ∩i∈E {x : hi(x) = 0} ∩ ∩i∈I {x : gi(x) ≤ 0},

where each {x : hi(x) = 0} is an affine subspace (cvx) and each {x : gi(x) ≤ 0} = S0(gi) is cvx. Because each single set is cvx, Obs 1.5 implies the convexity of X (even in the case of infinitely many constraints).
An important reason for the attractiveness of convex optimization is that there is no need to distinguish local from global optima.

Theorem 1.8 For a convex optimization problem any local optimum is a global optimum.

Proof: Assume, for contradiction, that x̄ ∈ X is a local minimum for neighborhood U(x̄) and x∗ is a feasible point with f(x∗) < f(x̄). Because X is convex, we have [x̄, x∗] ⊆ X. For each x ∈ (x̄, x∗) ∩ U(x̄) there is an α ∈ (0, 1) with x = αx̄ + (1 − α)x∗ and by convexity of f,

    f(x) ≤ αf(x̄) + (1 − α)f(x∗) < αf(x̄) + (1 − α)f(x̄) = f(x̄)

in contradiction to the local optimality of x̄. □


In particular, the set of optimal solutions is always convex. Indeed, let X be the feasible set and f∗ = inf{f(x) : x ∈ X} the optimal value, then the set of optimal solutions Sf∗(f) ∩ X is the intersection of two convex sets and therefore convex (use Obs 1.5 and 1.7). Note, however, that this set may be empty even for finite f∗. Consider, e. g., e^x for x ∈ R or also e^x for x ∈ (0, 1]. Both examples share the property that the set of objective values is not closed at the lower end. The proof of existence of optima typically requires the closedness of the epigraph and of the feasible set.

1.3 Linear Optimization

This lecture series will put a special focus on an important special case of convex optimization where closedness poses no problem. Linear Optimization or Linear Programming requires all f, gi, hi to be affine, i. e., of the form a⊤x + β, and Ω = Rn+ := {x ∈ Rn : xi ≥ 0, i = 1, . . . , n}. Instead of x ∈ Rn+ we usually write x ≥ 0, which is to be interpreted as componentwise nonnegativity. As we will see, all linear programs (LP) can be brought into the following normal form,

    min  c⊤x
    s.t. Ax = b,        LP in normal form
         x ≥ 0.
Convex optimization offers a multitude of algorithmic possibilities, not only for important special cases. Several of these allow one to study aspects of algorithmic complexity; some of this will appear in this course. Already the seemingly very special field of linear optimization allows for a wide spectrum of applications. Let us start with a first glimpse of these and some related theoretical questions.
Example The production of (simplified) Mozart-Balls and Mozart-Coins requires respective quantities of marzipan, nougat and dark chocolate (in irrelevant units) as in the table below.

                      Mozart-Balls   Mozart-Coins   available quantity
    marzipan               1              1                 6
    nougat                 2              1                11
    dark chocolate         1              2                 9
    selling price          5              4

How much should be produced of each in order to maximize profit?

With decision variables
    x1 . . . number of Mozart-Balls
    x2 . . . number of Mozart-Coins
the problem reads

    max  5x1 + 4x2
    s.t. x1 + x2 ≤ 6,
         2x1 + x2 ≤ 11,
         x1 + 2x2 ≤ 9,
         x1 ≥ 0, x2 ≥ 0.

sketch: feasible region in the (x1, x2)-plane with objective direction c
Clearly, the optimal solution is attained in a vertex.

How can this solution be computed precisely (instead of reading it off)? Determine the intersection point of the two straight lines

    x1 + x2 = 6
    2x1 + x2 = 11     ⇒   x1 = 5, x2 = 1.
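The same numbers are quickly reproduced numerically (a sketch assuming Python with scipy; merely an illustration, not part of the notes):

    from scipy.optimize import linprog

    # linprog minimizes, so maximize 5 x1 + 4 x2 via the negated objective
    res = linprog(c=[-5, -4],
                  A_ub=[[1, 1], [2, 1], [1, 2]],
                  b_ub=[6, 11, 9],
                  bounds=[(0, None), (0, None)])
    print(res.x, -res.fun)           # expected: x = (5, 1) with value 29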

It would be good to have some kind of proof or checkable certificate that this is indeed the optimal solution.
It is also worth pointing out that the integrality of the solution is a happy coincidence. Enforcing integrality is, in general, a provably difficult task and is treated in the field of (mixed) integer programming. ♡
The graphical approach used in the example works reasonably well for two variables. Realistic problems comprise up to several million variables and constraints and can only be treated by algebraic methods. In this lecture series we will come to study two quite different state-of-the-art approaches, the simplex method and interior point methods.
For the time being it will be sufficient to understand how general linear programming problems can be brought into certain standard forms that are required for implementing and solving models by commercial software or on internet resources.

In general, the variables in a linear program are subsumed in a vector x ∈ Rn. Usually one requires

    xi ≥ 0 for i = 1, . . . , n   ⇔:   x ≥ 0, x ∈ Rn+.

These sign constraints are already linear side constraints. General linear inequalities are of the form

    a1x1 + · · · + anxn ≤ β   ⇔   a⊤x ≤ β.

Each such inequality describes a specific halfspace Ha,β := {x ∈ Rn : a⊤x ≤ β}.

Geometrically this can nicely be illustrated in R2 by making use of a⊤x = ∥a∥∥x∥ cos ∠(a, x).

sketch: the hyperplane {x : a⊤x = β} with normal direction a separating the halfspaces {x : a⊤x ≤ β} and {x : a⊤x ≥ β}

What are the correct choices of a and β for the sign constraints xi ≥ 0?
The linear inequality constraints are collected in a matrix representation (the indices within ai now refer to the constraint and no longer to components of a),

    a1⊤x ≤ b1, . . . , am⊤x ≤ bm   ⇝   Ax ≤ b,

with A having rows a1⊤, . . . , am⊤ and b = (b1, . . . , bm)⊤.

This leads to the linear program in canonical form

    max  c⊤x
    s.t. Ax ≤ b,
         x ≥ 0.

sketch: feasible polyhedron bounded by a1, a2, a3 and the sign constraints x1 ≥ 0, x2 ≥ 0

Note, the feasible set X = {x : Ax ≤ b, x ≥ 0} is the intersection of finitely many halfspaces and may be visualized as a polyhedron. This form is often preferred for geometric illustrations.
Example Writing the Mozart problem in canonical form results in the following data:

    A = [1 1; 2 1; 1 2],   b = (6, 11, 9)⊤,   c = (5, 4)⊤.

How to transform a problem with equality and ≥-inequality constraints into this form?

    a⊤x ≥ β   ⇝   (−a)⊤x ≤ −β,
    a⊤x = β   ⇝   a⊤x ≤ β together with a⊤x ≥ β.

• negative variables: replace xi ≤ 0 by x̄i = −xi ≥ 0.
• free variables (xi without sign constraints): use two variables instead, xi ⇝ xi⁺ − xi⁻ with xi⁺ ≥ 0, xi⁻ ≥ 0.
• minimization problem: replace min c⊤x by −max(−c)⊤x.


Any optimization problem with linear objective function and linear equality and inequality constraints can be represented as a linear program in canonical form.
Algorithmically it is more convenient to work exclusively with equality constraints and sign constraints on the variables. The transformation of the canonical form into this form is achieved by introducing slack variables as follows,

    max  c⊤x
    s.t. Ax + s = b       with slack variables si = bi − ai⊤x ≥ 0.
         x ≥ 0, s ≥ 0

Interpreted geometrically, the slack si gives the "distance" / slack of the current point x to the i-th inequality. For si > 0 there is still some room; for si = 0 the point is right on the boundary, i. e., the inequality is satisfied with equality, it is tight or active.
Putting Ā = [A I], x̄ = (x; s), c̄ = (c; 0), the program may be written compactly as

    max  c̄⊤x̄
    s.t. Āx̄ = b
         x̄ ≥ 0.

It remains to switch the optimization direction in order to see that any linear program can be represented equivalently in the initial normal form.
A linear program in normal or standard form reads

    min  c⊤x
    s.t. Ax = b,
         x ≥ 0.

Geometrically the feasible set may now be viewed as the intersection of the cone of nonnegative vectors with an affine subspace, e. g. in the case of three variables and one equality constraint,

sketch: the nonnegative orthant in R3 intersected with an affine plane

Example In normal form the Mozart problem requires three new slack variables x3, x4, x5. The data reads

    A = [1 1 1 0 0; 2 1 0 1 0; 1 2 0 0 1],   b = (6, 11, 9)⊤,   c = (−5, −4, 0, 0, 0)⊤.

Before, the optimal solution (x1∗, x2∗) resulted from intersecting the first and second inequality constraints. How does this translate to the normal form setting?

    (x1, x2) lies on constraint 1  ⇒  slack variable x3 = 0,
    (x1, x2) lies on constraint 2  ⇒  x4 = 0.

It remains to solve

    [1 1 0; 2 1 0; 1 2 1] (x1, x2, x5)⊤ = (6, 11, 9)⊤.

In general, in order to compute a vertex, choose n − m slack variables that are set to zero ⇒ the solution lies on the corresponding equalities. The solution of the remaining system determines the slack to the remaining constraints.

How to interpret x1 as a slack? It is simply the slack of the constraint x1 ≥ 0; in the same vein x2 is the slack of x2 ≥ 0. ♡
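A minimal numerical sketch of this vertex computation (assuming Python with numpy; the variable ordering follows the example):

    import numpy as np

    # set the slacks x3 = x4 = 0 and solve the remaining system for (x1, x2, x5)
    B = np.array([[1., 1., 0.],
                  [2., 1., 0.],
                  [1., 2., 1.]])
    x1, x2, x5 = np.linalg.solve(B, np.array([6., 11., 9.]))
    print(x1, x2, x5)    # expected: 5.0 1.0 2.0, i.e., slack 2 remains
                         # in the third (dark chocolate) constraint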
As indicated above, the basic set x ≥ 0 may be interpreted as a conic ground set: x ∈ Rn+ lies in the cone of nonnegative vectors. The concept of linear optimization can now be extended in a straightforward manner to linear programs over convex cones,

    min  c⊤x
    s.t. Ax = b,      with K ⊆ Rn a convex cone.
         x ∈ K,

In this course we will devote some time to this interpretation. In particular we will consider this setting for the second order cone (the value of the first coordinate is at least the Euclidean norm of the remaining entries) and for the cone of (symmetric) positive semidefinite matrices.

1.4 Duality and Robust Optimization

In linear and, more generally, convex optimization there is always a second "dual" problem to each optimization problem that is solved simultaneously with the original "primal" problem. The solution to this dual problem, typically available for free, has high practical relevance as it provides important information on the sensitivity of the primal optimal solution with respect to changes in the data describing the constraints. In many cases it also generates an easily checkable certificate of optimality. Duality theory is one of the central topics of this lecture series and this section gives a first view of some basic elements and benefits within the setting of linear programming.
The dual problem may be interpreted as identifying or deriving a valid linear constraint that restricts the objective value the most. Given an inequality system Ax ≤ b with A ∈ Rm×n having rows a1⊤, . . . , am⊤ and b ∈ Rm, new valid inequalities can be obtained by taking nonnegative linear combinations of the inequalities ai⊤x ≤ bi together with the trivial inequality 0⊤x ≤ 1,

    a1⊤x ≤ b1       | · y1 ≥ 0
      ...
    am⊤x ≤ bm       | · ym ≥ 0
    0⊤x  ≤ 1        | · γ  ≥ 0
    ---------------------------------
    (∑mi=1 yiai)⊤x ≤ ∑mi=1 yibi + γ.

Later theory will show that this indeed generates all linear inequalities that are valid for {x : Ax ≤ b}, and that all relevant ones do not require the right hand side shift by γ.

Let us apply this now to a linear program in canonical form,

    max  c⊤x
    s.t. Ax ≤ b
         x ≥ 0.

sketch: combined inequalities 1·a1 + 1·a3 (i. e., y = (1, 0, 1)⊤) and 1·a2 + 1·a3 (i. e., y = (0, 1, 1)⊤)

For any y ∈ Rm+, put a⊤ = y⊤A, β = y⊤b; then any x satisfying Ax ≤ b also satisfies a⊤x ≤ β. Now, if a ≥ c (componentwise as usual), then any x ∈ X satisfies x ≥ 0 and therefore also c⊤x ≤ a⊤x ≤ β. In summary,

    Any y ≥ 0 with A⊤y ≥ c yields an upper bound b⊤y
    on the optimal objective value of the primal program.


Exercise If x is free (x ∈ Rn without sign constraints), what are the
requirements on the relation between a and c so that an analogous argument
gives rise to an upper bound?



The dual program in canonical form asks for the y ≥ 0 that gives rise to the smallest upper bound,

    min  b⊤y
    s.t. A⊤y ≥ c
         y ≥ 0.
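For the Mozart problem this dual can be solved numerically as well (a sketch assuming Python with scipy; the expected numbers follow from the example and are easily checked by hand):

    from scipy.optimize import linprog

    # dual: min b'y  s.t.  A'y >= c, y >= 0;  rewrite A'y >= c as -A'y <= -c
    res = linprog(c=[6, 11, 9],
                  A_ub=[[-1, -2, -1], [-1, -1, -2]],
                  b_ub=[-5, -4],
                  bounds=[(0, None)] * 3)
    print(res.x, res.fun)            # expected: y = (3, 1, 0) with b'y = 29

The dual optimal value coincides with the primal one, so this y certifies optimality of x = (5, 1).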

Where does the name "dual" come from?
In linear algebra a vector space V has as its dual vector space V∗ the vector space of its linear functionals, V∗ = {f : V → R : f linear}. For V = Rn any such f is of the form f(x) = a1x1 + · · · + anxn, thus (Rn)∗ = {x ↦ a⊤x : a ∈ Rn} may be identified with Rn. The linear constraints are linear functionals. Via y ≥ 0 we form nonnegative combinations of these functionals from the dual space and obtain a new element of the dual space, so the dual program actually optimizes objects in the dual space.
In the case of linear programs the dual of the dual is again the original primal problem.
For the primal problem in normal form the same ideas allow to derive the dual in normal form. In contrast to the canonical form, we may take arbitrary linear combinations of the equality constraints to obtain new valid equality constraints, so y is free. Furthermore, the primal minimizes c⊤x over x ≥ 0, thus a lower bound is obtained whenever A⊤y ≤ c. For this latter inequality system it is customary to introduce slack variables z ≥ 0 in order to arrive at the primal dual pair of linear programs in normal form

    (P)  min  c⊤x             (D)  max  b⊤y
         s.t. Ax = b               s.t. A⊤y + z = c
              x ≥ 0                     y ∈ Rm, z ≥ 0

Exercise How to derive the dual if =, ≤ and ≥ constraints appear at the


same time?



Weak Duality

refers to the fact that the objective value of any dual feasible point provides (in normal form) a lower bound on the objective value of any primal feasible point, and vice versa. This holds in convex optimization in general and will be surprisingly easy to prove. To give a first flavor of this for linear programs in normal form, denote the feasible sets by

    X = {x ≥ 0 : Ax = b},    Z = {(y, z) : A⊤y + z = c, z ≥ 0}.

The construction of the dual proves

    inf{c⊤x : x ∈ X} ≥ sup{b⊤y : (y, z) ∈ Z}     (weak duality).

There is an alternative short proof as well:

Proof: Let x ∈ X, (y, z) ∈ Z, then by primal and dual feasibility

    c⊤x = (A⊤y + z)⊤x = y⊤(Ax) + z⊤x = y⊤b + z⊤x ≥ y⊤b,

since z ≥ 0 and x ≥ 0. □


Note that in this latter proof equality holds throughout whenever z⊤x = 0. By the nonnegativity of both vectors this amounts to each coordinate being zero in at least one of z and x, so for each inequality constraint the slack variable on the primal or the dual side (or both) must be zero. This will be called the complementarity property. Whenever a complementary feasible primal dual pair of solutions is found, this proves optimality of both sides immediately.
An important question is therefore: Under which conditions can we be sure that primal and dual solutions with the same optimal value exist (strong duality holds), or is it possible that p∗ = inf{c⊤x : x ∈ X} > sup{b⊤y : (y, z) ∈ Z} =: d∗, i. e., that a duality gap p∗ − d∗ > 0 occurs?

We shall see that for LP there is no duality gap whenever at least one side is feasible, but this may already fail to hold for linear programs over the second order or the positive semidefinite cone.
Why is this important?
On the one hand, the optimality certificate and sensitivity information are at their best only if strong duality holds. But there is also an entire modeling technique of high practical relevance that relies on strong duality.

Robust Optimization

Robust Optimization refers to a rather recent modeling approach for coping with uncertainties in given data. The main idea is quickly illustrated for the following minimal LP with one linear inequality,

    min  c⊤x
    s.t. â⊤x ≥ b       where â is not known exactly.
         x ≥ 0

All relevant values for â are, however, known to lie in a well described uncertainty set

    â ∈ {a ∈ Rn : Da − d ∈ K}    for some convex cone K.

To stay in the linear programming world, let K = Rm+. We would like to find the best x that is feasible for all possible choices of â (this is a so called worst case scenario). Thus, we consider an x feasible if inf{x⊤a : Da − d ≥ 0} ≥ b. For any given x, the worst a (which is now the optimization variable) is found by solving the optimization problem

    min{x⊤a : Da ≥ d}     (its value must be ≥ b for x to be feasible).

Introducing a dual variable λ ≥ 0 for Da ≥ d, strong duality yields

    min{x⊤a : Da ≥ d} = max{d⊤λ : D⊤λ = x, λ ≥ 0},

so x is feasible  ⇔  ∃λ ≥ 0 with D⊤λ = x and d⊤λ ≥ b. In the presence of strong duality we may therefore replace the (semi-infinite) constraint

    â⊤x ≥ b    ∀â ∈ {a : Da ≥ d}

by much simpler linear constraints and arrive at the "robust counterpart"

    min  c⊤x
    s.t. d⊤λ ≥ b
         D⊤λ = x
         x ≥ 0, λ ≥ 0.

The optimal solution of this program is valid for any realization of â in the uncertainty set.
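As an illustration (a sketch assuming Python with scipy; the box uncertainty set and all numbers are made up for this example and are not from the notes), the robust counterpart is an ordinary LP in the variables (x, λ):

    import numpy as np
    from scipy.optimize import linprog

    # uncertainty set {a : D a >= d} = the box [1,2] x [1,2]
    D = np.array([[1., 0.], [0., 1.], [-1., 0.], [0., -1.]])
    d = np.array([1., 1., -2., -2.])
    c, b = np.array([1., 1.]), 1.0

    # variables v = (x, lam) in R^{2+4}: min c'x  s.t.  D'lam = x, d'lam >= b
    A_eq = np.hstack([-np.eye(2), D.T])                      # D'lam - x = 0
    A_ub = np.hstack([np.zeros((1, 2)), -d.reshape(1, 4)])   # -d'lam <= -b
    res = linprog(c=np.append(c, np.zeros(4)),
                  A_ub=A_ub, b_ub=[-b],
                  A_eq=A_eq, b_eq=np.zeros(2),
                  bounds=[(0, None)] * 6)
    print(res.fun)   # worst case a is the corner (1,1), so the robust
                     # constraint is x1 + x2 >= 1 and the optimal value is 1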
Chapter 2

Smooth Unconstrained
Optimization

2.1 The Setting of Nonlinear Optimization

In Nonlinear Optimization the typical assumption is that all functions are "sufficiently smooth" but in general not convex,

    f, gi, hi ∈ C¹ or ∈ C²,

i. e., they are assumed to be sufficiently often continuously differentiable as required for the analysis or algorithm at hand. The ground set is Ω = Rn unless explicitly stated otherwise.
In Nonlinear Optimization one discerns
• unconstrained [nonlinear] optimization (freie/unrestringierte nichtlin-
eare Optimierung): E = I = ∅, Ω = Rn

sketch 1d-function with local minima


• constrained [nonlinear] optimization (restringierte nichtlineare Optimierung): |E| + |I| > 0, Ω = Rn or Ω some simple set like a box (an interval). In nonlinear constrained optimization feasible sets may quickly get arbitrarily bad (e. g., several feasible regions that are not connected), so it may be challenging to even locate some feasible solution. Typically problem specific properties help to render the problem manageable.

sketch feasible regions enclosed by level lines

This chapter only deals with unconstrained optimization problems of the form

    min_{x∈Rn} f(x)    with f ∈ C²(Rn).

Most practical problems in nonlinear optimization have many local optima. The usual algorithmic challenge is to generate a sequence of points that hopefully converges to a local optimum. Asymptotically it should exhibit "fast" local convergence like Newton's method.
Finding a global optimum is in general an almost impossible task. It either requires special properties like convexity or sophisticated domain decomposition and bounding procedures in order to exclude the existence of better solutions in other regions. Methods for finding global optima in this general setting form a special field of research called global optimization and will in general work for small dimensions only.
In linear programming the problem is easily communicated to an algorithm
by specifying the coefficient matrices. In nonlinear optimization the first
challenge is to come up with a general model for giving access to the function.
The most general and most common model is to view f as given by an oracle,
i. e., a user specified subroutine that returns for a point x the function value
f (x) or maybe also some derivative information. In particular, the following
kinds of oracles are discerned.
• 0-order oracle: Given x̄, return the function value f (x̄).

• 1st-order oracle: Given x̄, return f(x̄) and ∇f(x̄) (the gradient),

    ∇f(x̄) = (∂f/∂x1(x̄), . . . , ∂f/∂xn(x̄))⊤.

  sketch: 1d derivative and 2d tangent plane

  The gradient points into the direction of steepest ascent of f at x̄; it is the steeper the larger its norm. The gradient gives access to the directional derivative,

    f′(x; d) := lim_{t↘0} (f(x + td) − f(x))/t,     f′(x̄; d) = ∇f(x̄)⊤d.

  Geometrically it describes, together with f(x̄), the tangent plane to f at x̄, so it generates a local linear/first-order model of f at x̄,

    f(x̄) + ∇f(x̄)⊤(x − x̄).

  The error term of this model is of first order,

    f(x) = f(x̄) + ∇f(x̄)⊤(x − x̄) + o(∥x − x̄∥).

  It is recommended to visualize the gradient and the model by means of some numerical tools in order to have a solid geometric understanding of these objects (a small numerical check is sketched after this list).
• 2nd-order oracle: Given x̄, return f(x̄), ∇f(x̄), and ∇²f(x̄) (the Hessian),

    ∇²f(x̄) = [∂²f/(∂xi∂xj)(x̄)]_{i,j=1,...,n}.

  This matrix is symmetric if f ∈ C² and it describes the local curvature of f at x̄. In particular, together with f(x̄) and ∇f(x̄) it gives rise to the quadratic function having the same tangent plane and curvature at x̄ as f; this is the quadratic/second-order model of f at x̄,

    f(x̄) + ∇f(x̄)⊤(x − x̄) + ½(x − x̄)⊤∇²f(x̄)(x − x̄).

  The error of this model is of second order,

    f(x) = f(x̄) + ∇f(x̄)⊤(x − x̄) + ½(x − x̄)⊤∇²f(x̄)(x − x̄) + o(∥x − x̄∥²).
Almost all arguments in nonlinear optimization rely on some form of Taylor approximation. The following version is not the original form of Taylor's theorem but a combination with the mean value theorem. We will refer to it by "Taylor" anyway.

Theorem ("Taylor's Theorem") Let x, p ∈ Rn and f : Rn → R.
If f ∈ C¹(Rn), then f(x + p) = f(x) + ∇f(x + tp)⊤p for some t ∈ (0, 1).
If f ∈ C²(Rn), there holds

    ∇f(x + p) = ∇f(x) + ∫₀¹ ∇²f(x + tp) p dt

and f(x + p) = f(x) + ∇f(x)⊤p + ½ p⊤∇²f(x + tp) p for some t ∈ (0, 1).

For a proof see, e. g., Heuser, Chapter XX.

2.2 Optimality Conditions

Theorem 2.1 (first-order necessary (optimality) conditions)
Let x∗ be a local minimum and let f be continuously differentiable on a neighborhood U(x∗), then ∇f(x∗) = 0. [The tangent plane has slope 0.]

Proof: Assume, for contradiction, ∇f(x∗) ̸= 0. Put p = −∇f(x∗) and thus p⊤∇f(x∗) = −∥∇f(x∗)∥² < 0. By continuity of ∇f there holds p⊤∇f(x∗ + tp) < 0 for all t ∈ (0, α) for all sufficiently small α > 0. Now Taylor with suitable t ∈ (0, 1) implies

    f(x∗ + αp) = f(x∗) + α p⊤∇f(x∗ + tαp) < f(x∗),

since the last summand is negative, contradicting local optimality. □

Definition 2.2 An x̄ ∈ Rn with ∇f(x̄) = 0 is a stationary point (of f).


Attention! Stationarity is necessary but in general not sufficient.

sketch: a function with a stationary point that is no local minimum

Theorem 2.3 (second-order necessary (optimality) conditions)
Let x∗ be a local minimum of f and let ∇²f be continuous on a neighborhood U(x∗), then ∇f(x∗) = 0 and ∇²f(x∗) ⪰ 0 (positive semidefinite).

Proof: Th 2.1 implies ∇f(x∗) = 0 and the second part follows by the same argument: Suppose, for contradiction, ∇²f(x∗) ̸⪰ 0, i. e., ∃p ∈ Rn with p⊤∇²f(x∗)p < 0. For all sufficiently small α > 0 there holds p⊤∇²f(x∗ + tαp)p < 0 for all t ∈ (0, 1), so by Taylor there is a suitable t ∈ (0, 1) with

    f(x∗ + αp) = f(x∗) + α p⊤∇f(x∗) + (α²/2) p⊤∇²f(x∗ + tαp)p < f(x∗),

where the second summand is 0 and the third is negative. □
Again, this is not sufficient in general. Consider, e. g.,

    f(x1, x2) = x1² − x2⁴,   ∇f(x) = (2x1, −4x2³)⊤,   ∇f(0) = (0, 0)⊤,   ∇²f(0) = [2 0; 0 0] ⪰ 0,

yet 0 is not a local minimum (move along the x2-axis). Note, the proofs show how to arrive at better points whenever the necessary conditions fail to hold.
Theorem 2.4 (sufficient (optimality) conditions)
Let ∇²f be continuous on an open neighborhood U(x∗). If ∇f(x∗) = 0 and ∇²f(x∗) is positive definite (∇²f(x∗) ≻ 0), then x∗ is a local minimum.

Proof: Because ∇²f is continuous on U(x∗), there exists r > 0 with ∇²f(x) ≻ 0 for all x ∈ Br(x∗) := {x : ∥x − x∗∥ ≤ r}. By Taylor there exists for all x∗ + p ∈ Br(x∗) with p ̸= 0 a suitable t ∈ (0, 1) with

    f(x∗ + p) = f(x∗) + p⊤∇f(x∗) + ½ p⊤∇²f(x∗ + tp)p > f(x∗),

since the second summand is 0 and the third is positive. □
Note, the sufficient conditions are in general not necessary. What is a good example?
Because a positive definite Hessian characterizes a locally strictly convex function, the sufficient conditions just correspond to stationarity combined with strict local convexity. For (globally) convex differentiable functions the situation is, in fact, much better.

Theorem 2.5 If f is convex and differentiable, any stationary point x∗ is a global minimum.

Proof: Let x∗ be a stationary point of f and x̄ ∈ Rn, then by convexity

    f(x∗ + t(x̄ − x∗)) ≤ (1 − t)f(x∗) + t f(x̄)    for all t ∈ [0, 1].

Thus,

    0 = ∇f(x∗)⊤(x̄ − x∗) = lim_{t↘0} (f(x∗ + t(x̄ − x∗)) − f(x∗))/t ≤ f(x̄) − f(x∗),

therefore f(x∗) ≤ f(x̄). □
For differentiable convex functions the necessary conditions are thus also sufficient.

2.3 Newton’s method

In smooth unconstrained optimization the basic algorithmic framework follows this idea:
• Apply some descent method in order to get close to a local minimum.
• Hope that there the function is locally convex (true if the sufficient conditions hold) and that the quadratic model is a good approximation in the vicinity of the local minimum.
• Determine iteratively the minimum of the current local quadratic model and step there.
The last part is Newton's method.
Theorem 2.6 (Newton's method)
Let f ∈ C², let x∗ satisfy the sufficient optimality conditions of Th 2.4, and let ∇²f be Lipschitz continuous in a neighborhood U(x∗), i. e.,

    ∃L > 0 : ∥∇²f(x) − ∇²f(x̄)∥ ≤ L∥x − x̄∥    for all x, x̄ ∈ U(x∗).

For any starting point x0 close enough to x∗ the sequence generated by

    xk+1 = xk − ∇²f(xk)⁻¹∇f(xk)     ("Newton step")

satisfies
(i) xk converges to x∗, i. e., xk → x∗,
(ii) the rate of convergence is quadratic, i. e.,

    ∃c > 0 ∃K ∀k ≥ K : ∥xk+1 − x∗∥ ≤ c∥xk − x∗∥²,

(iii) the sequence of the norms of the gradients ∥∇f(xk)∥ converges quadratically to zero.

Before embarking on the proof it is worth considering the geometric interpretation.
For xk sufficiently close to x∗ the Hessian ∇²f(xk) is positive definite. In consequence the quadratic model

    q(x) = f(xk) + ∇f(xk)⊤(x − xk) + ½(x − xk)⊤∇²f(xk)(x − xk)

is strictly convex and has a unique minimum; by Th 2.5 this is the stationary point satisfying

    0 = ∇q(x) = ∇f(xk) + ∇²f(xk)(x − xk)   ⇒   x = xk − ∇²f(xk)⁻¹∇f(xk).

This is exactly the description of the Newton step. Convergence will be fast if the quadratic model is a good approximation, which is true if the curvature changes only slightly. The latter is ensured by the Lipschitz condition on the second derivative.
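A minimal sketch of the iteration (assuming Python with numpy; the test function f(x) = x1² + x1⁴ + x2² + x2⁴ with minimizer x∗ = 0 and the starting point are arbitrary illustrative choices):

    import numpy as np

    def grad(x):
        return 2 * x + 4 * x ** 3                  # gradient of sum(xi^2 + xi^4)

    def hess(x):
        return np.diag(2 + 12 * x ** 2)            # positive definite near 0

    x = np.array([1.0, 1.0])
    for k in range(6):
        x = x - np.linalg.solve(hess(x), grad(x))  # Newton step
        print(k, np.linalg.norm(x))                # distance to x* = 0 roughly
                                                   # squares in each iteration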
An alternative important interpretation of Newton's method is its true origin as an iterative method for solving nonlinear equations:

    given F : Rn → Rn, find x with F(x) = 0.

For this, form a linear model of F = (f1, . . . , fn)⊤ at the current iterate xk with the help of the Jacobian JF (whose rows are the ∇fi⊤), so that coordinate i holds the linear model of fi at xk, and equate it to zero,

    F(xk) + JF(xk)(x − xk) = 0.

If JF is regular (typically assumed by some "regularity condition" like in the implicit function theorem), the root of this linear approximation (a linear system) is

    x = xk − JF(xk)⁻¹F(xk).

sketch 1d search for zeros

Applied to optimization, the function is the gradient, F = ∇f. Its zeros are the stationary points and JF = ∇²f.

Note, Newton's method searches for and typically converges to a nearby stationary point. This need not be a minimum but might be a maximum or also a saddle point!
Proof: Let f and x∗ satisfy the conditions of the theorem. It will be convenient to use the following abbreviations,

    ∇fk = ∇f(xk),  ∇²fk = ∇²f(xk),  ∇f∗ = ∇f(x∗) = 0,  ∇²f∗ = ∇²f(x∗).

The progress of the Newton step may be written as

    xk+1 − x∗ = xk − x∗ − ∇²fk⁻¹∇fk = ∇²fk⁻¹ [∇²fk(xk − x∗) − (∇fk − ∇f∗)],    (2.1)

where the term ∇f∗ = 0 was inserted.
Note, up to sign the bracketed expression equals

    ∇f∗ − (∇fk + ∇²fk(x∗ − xk)),

i. e., it is the error made when approximating the gradient at x∗ by the local linear model of the derivative at xk; the brackets just hold the linearization error of the gradient at x∗.

sketch: 1d linearization error

For estimating this error, compare it to the precise process described by integration,

    ∇f∗ = ∇fk + ∫₀¹ ∇²f(xk + t(x∗ − xk))(x∗ − xk) dt.

After rearranging terms and switching signs the norm of the bracketed expression reads

    ∥∇²fk(xk − x∗) − (∇fk − ∇f∗)∥
      = ∥∫₀¹ [∇²fk − ∇²f(xk + t(x∗ − xk))](xk − x∗) dt∥
      ≤ ∫₀¹ ∥∇²fk − ∇²f(xk + t(x∗ − xk))∥ · ∥xk − x∗∥ dt
      ≤ ∫₀¹ Lt ∥xk − x∗∥ · ∥xk − x∗∥ dt = ½ L∥xk − x∗∥²,

where the Lipschitz condition was used for xk ∈ U(x∗).

For bounding the effect of ∇²fk⁻¹ on the brackets in (2.1), exploit ∇²f∗ ≻ 0 and the Lipschitz continuity of ∇²f to see

    ∃r > 0 : Br(x∗) ⊆ U(x∗)  and  ∥∇²f(x)⁻¹∥ ≤ 2∥∇²f∗⁻¹∥ for all x ∈ Br(x∗).

Thus, for xk ∈ Br(x∗), we may bound (2.1) by

    ∥xk+1 − x∗∥ ≤ L∥∇²f∗⁻¹∥ · ∥xk − x∗∥².

For any starting point x0 satisfying ∥x0 − x∗∥ < min{1, r, 1/(L∥∇²f∗⁻¹∥)} this proves xk → x∗ with quadratic convergence, establishing (i) and (ii).
The proof of (iii) relies on the same estimate of the linearization error,

    ∥∇fk+1∥ = ∥∇fk+1 − (∇fk + ∇²fk(xk+1 − xk))∥     (the subtracted term is 0 by definition of the Newton step)
      = ∥∫₀¹ [∇²f(xk + t(xk+1 − xk)) − ∇²fk](xk+1 − xk) dt∥
      ≤ ∫₀¹ ∥∇²f(xk + t(xk+1 − xk)) − ∇²fk∥ · ∥xk+1 − xk∥ dt
      ≤ ½ L∥xk+1 − xk∥² = ½ L∥∇²fk⁻¹∇fk∥² ≤ ½ L∥∇²fk⁻¹∥² · ∥∇fk∥².

By using the same r as above, for all xk ∈ Br(x∗) there holds ½ L∥∇²fk⁻¹∥² ≤ 2L∥∇²f∗⁻¹∥². The latter gives the constant factor proving quadratic convergence in the limit. □
Note that the proof does not rely on any finiteness properties of the vector space, so the same proof also works in Banach spaces.
Let us emphasize two important properties of Newton's method:
• convergence is only guaranteed locally; within the region of quadratic convergence, the number of correct digits doubles in every iteration.
• as it searches for some stationary point, the step direction p = xk+1 − xk may not even lead downwards but may well go for a local maximum instead.
In order to obtain a minimization algorithm, Newton's method has to be combined with some method that ensures some global form of descent. The two main approaches for this are line search methods and trust region methods. In this course we only deal with the most important aspects of the former.

2.4 Line Search Methods

Definition 2.7 A step direction p is a descent direction (for f in x) if its directional derivative satisfies f′(x; p) < 0.

In our current setting f is sufficiently smooth and given by a first or second-order oracle, thus the directional derivative is available via f′(x; p) = ∇f(x)⊤p and

    p is a descent direction   ⇔   ∇f(x)⊤p < 0.

The negative gradient −∇f(x) is always a descent direction if it is not zero; it is called the "steepest descent direction".
It is important, however, to understand that the gradient heavily depends on the underlying scalar product. Indeed, the gradient represents the linear map of the derivative with respect to the canonical scalar product, ⟨x, ∇f(x̄)⟩ = ∇f(x̄)⊤x. A different scalar product results in a different coordinate representation, so geometrically the steepest descent direction depends on the choice of the scalar product. It is worth understanding better the influence of the scalar product and its associated norm on the geometry of the linear and second-order model.
Consider a general inner product and its induced norm given by a positive definite H ≻ 0,

    ⟨x, y⟩H := y⊤Hx,    ∥x∥H := ⟨x, x⟩H^(1/2) = (x⊤Hx)^(1/2).
For H = I we obtain the canonical inner product with linear [and second] order model

    f(x̄) + ⟨x − x̄, ∇f(x̄)⟩ [+ ½⟨x − x̄, ∇²f(x̄)(x − x̄)⟩].

In order to obtain the same linear model with respect to a general inner product ⟨·, ·⟩H for some H ≻ 0, the gradient has to be transformed to H⁻¹∇f(x̄), because

    ∇f(x̄)⊤x = (H⁻¹∇f(x̄))⊤Hx = ⟨x, H⁻¹∇f(x̄)⟩H.

This results in the first [and second] order model

    f(x̄) + ⟨x − x̄, H⁻¹∇f(x̄)⟩H [+ ½⟨x − x̄, H⁻¹∇²f(x̄)(x − x̄)⟩H].

The rate at which the quality of the linear model deteriorates in a normalized direction x(α) = x̄ + α p/∥p∥H is given by

    f(x̄ + α p/∥p∥H) − (f(x̄) + α⟨p/∥p∥H, H⁻¹∇f(x̄)⟩H)
      ≈ (α²/2) ⟨p/∥p∥H, H⁻¹∇²f(x̄) p/∥p∥H⟩H.

For H = I this results in the usual curvature (p/∥p∥)⊤∇²f(x̄)(p/∥p∥), which is most transparent in the eigenvalue decomposition of the Hessian (symmetric by smoothness) ∇²f(x̄) = PΛP⊤ with diagonal matrix Λ = Diag(λ1, . . . , λn) holding the eigenvalues λ1 ≥ · · · ≥ λn and orthogonal P ∈ Rn×n with its columns vi = P•,i holding an orthogonal basis of corresponding eigenvectors. For steps in an eigenvector direction vi the error behaves like

    f(x̄ + α vi/∥vi∥) − (f(x̄) + α∇f(x̄)⊤vi/∥vi∥) ≈ (α²/2) (vi/∥vi∥)⊤∇²f(x̄)(vi/∥vi∥) = (α²/2) λi

(upwards/convex for λi > 0, downwards/concave for λi < 0). So if |λi| is big, the quality of the linear model deteriorates quickly in the direction vi; if |λi| is small it stays good for large step sizes α.
Suppose now that ∇²f(x̄) ≻ 0 (locally strictly convex case). The norm for which the linear model locally approximates all directions equally well is obtained for H = ∇²f(x̄). Indeed, then H⁻¹∇²f(x̄) = I and the error term simplifies to (α²/2)⟨p/∥p∥H, p/∥p∥H⟩H = α²/2, independent of p. The scaling effect on the eigenvector directions may be understood by observing ∥vi∥H = (vi⊤Hvi)^(1/2) = λi^(1/2) ∥vi∥.

Thus, for ∇²f(x̄) ≻ 0 (f locally strictly convex) the Newton direction p = −∇²f(x̄)⁻¹∇f(x̄) turns out to be the steepest descent direction with respect to this "best" local norm H(x̄) := ∇²f(x̄). One can prove (we will not do this here) that getting close to local quadratic convergence requires getting close to the Newton direction. All line search methods aim for this by using a step direction of the form pk = −Hk⁻¹∇fk where Hk ≻ 0 should approximate the Hessian in some way (in the remainder we use the abbreviations fk = f(xk), ∇fk = ∇f(xk), etc. whenever convenient).
A typical line search framework is of the following form:
1. Compute a step direction pk = −Hk⁻¹∇fk for some Hk ≻ 0 (pk is a descent direction since ∇fk⊤pk = −∇fk⊤Hk⁻¹∇fk < 0 whenever ∇fk ̸= 0).
2. Line search: choose a step length αk > 0 giving an approximate minimizer along the half ray xk + αpk (a rough approximation guaranteeing some "sufficient decrease" condition suffices!).
3. Put xk+1 = xk + αk pk, use the step information for updating to Hk+1 and repeat.
Examples for the choice of Hk are
(a) Hk = I, the "steepest descent method", which typically has rather poor convergence properties.
(b) Hk = ∇²fk, Newton's method with line search, which may fail to produce descent directions if ∇²fk ̸≻ 0.
(c) Hk = ∇²fk + λI ≻ 0, referred to as modified Newton methods.
(d) Quasi-Newton methods start with H0 = I and use derivative information to update Hk so as to approximate the positive definite part of ∇²fk.
These methods will not be discussed here. The focus is on establishing sufficient conditions for "global convergence", which will turn out to be a much weaker concept than the name suggests.

Choosing the step length α

Given a descent direction pk the task is to solve the one dimensional line search problem

    min_{α>0} Φ(α) := f(xk + αpk).

sketch: line search function Φ

Identifying αk ∈ Argmin_{α>0} Φ(α) (where Argmin denotes the set of minimizing arguments) is called exact line search and is used only if a direct analytic expression is known. Otherwise it is computationally too expensive and rather meaningless in view of the little knowledge the choice of the step direction is based on. So a rough approximation ensuring some "sufficient decrease" is all that is needed.
Sufficient decrease refers to making sure that a significant share of the possible progress is achieved. The aim of this section is to formalize this rather vague notion.
It is not sufficient to stop whenever f(xk + αpk) < f(xk). Even in practice this often results in stalling at points that do not even satisfy the necessary optimality conditions.

sketch: convergence failure
The only information assumed available besides the step direction pk is that of a first-order oracle evaluation, i. e., xk, fk and ∇fk with ∇fk⊤pk < 0. The local rate of decrease is thus ∇fk⊤pk, and a step of length α should also produce a proportional decrease. This is the

sketch: available knowledge

Armijo condition:   f(xk + αk pk) ≤ fk + c1 αk ∇fk⊤pk    for given 0 < c1 < 1.

For smooth f, however, this is easily satisfied for an arbitrarily small step. There is no point in stopping if the current decrease still predicts good progress. It will be safe to stop if the current decrease has flattened out or worsened sufficiently, as required in the

curvature condition:   ∇f(xk + αk pk)⊤pk ≥ c2 ∇fk⊤pk    for given c2 ∈ (c1, 1).


Note the requirement c1 < c2, which makes sure that c1∇fk⊤pk > c2∇fk⊤pk (by ∇fk⊤pk < 0), so that c2 flattens out the descent less than c1.

sketch: Armijo and curvature conditions, acceptable intervals

Armijo and curvature conditions together are known as the

Wolfe conditions for fixed 0 < c1 < c2 < 1:

    f(xk + αk pk) ≤ fk + c1 αk ∇fk⊤pk
    ∇f(xk + αk pk)⊤pk ≥ c2 ∇fk⊤pk

Stopping the search whenever these are satisfied will exhibit sufficient decrease. There are other similar variants that frequently appear in the literature, e. g., the

Strong Wolfe conditions for fixed 0 < c1 < c2 < 1:

    f(xk + αk pk) ≤ fk + c1 αk ∇fk⊤pk
    |∇f(xk + αk pk)⊤pk| ≤ c2 |∇fk⊤pk|

or the Goldstein conditions requiring fk + (1 − c)αk∇fk⊤pk ≤ f(xk + αk pk) ≤ fk + cαk∇fk⊤pk for fixed 0 < c < ½. Our focus will be on the Wolfe conditions.
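A minimal sketch of a line search returning a Wolfe point (assuming Python with numpy; the doubling/bisection scheme and the constants are one simple common choice, not the only one):

    import numpy as np

    def wolfe_search(f, grad, x, p, c1=1e-4, c2=0.9, max_iter=50):
        phi0, dphi0 = f(x), grad(x) @ p          # Phi(0) and Phi'(0) < 0
        lo, hi, alpha = 0.0, np.inf, 1.0
        for _ in range(max_iter):
            if f(x + alpha * p) > phi0 + c1 * alpha * dphi0:
                hi = alpha                       # Armijo fails: step too long
            elif grad(x + alpha * p) @ p < c2 * dphi0:
                lo = alpha                       # curvature fails: too short
            else:
                return alpha                     # both Wolfe conditions hold
            alpha = 2 * alpha if hi == np.inf else 0.5 * (lo + hi)
        return alpha

    f = lambda x: 0.5 * x @ x                    # toy quadratic
    g = lambda x: x
    x0 = np.array([4.0, -2.0])
    print(wolfe_search(f, g, x0, -g(x0)))        # a step in the Wolfe interval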

Lemma 2.8 Suppose that f : Rn → R is continuously differentiable. Let pk be a descent direction at xk and assume that f is bounded from below along the half ray {xk + αpk : α > 0}. Then, if 0 < c1 < c2 < 1, there exist intervals of step lengths satisfying the Wolfe (and the strong Wolfe) conditions.

Proof: Because Φ(α) = f(xk + αpk) is bounded from below for all α > 0, the affine function fk + αc1∇fk⊤pk must intersect the graph of Φ for some α > 0.

sketch: Armijo intersections, α′, mean value theorem

Let α′ be the minimal α > 0 with this property, f(xk + α′pk) = fk + α′c1∇fk⊤pk. Then the Armijo condition holds for α ∈ (0, α′). By the mean value theorem there is an α′′ ∈ (0, α′) satisfying

    f(xk + α′pk) − fk = α′ ∇f(xk + α′′pk)⊤pk,

and by the choice of α′ the left hand side equals α′c1∇fk⊤pk. Thus

    ∇f(xk + α′′pk)⊤pk = c1∇fk⊤pk > c2∇fk⊤pk.

Therefore there is a neighborhood of α′′ (and in it an interval) that satisfies the Wolfe conditions. For strong Wolfe note that by ∇fk⊤pk < 0 also ∇f(xk + α′′pk)⊤pk < 0, hence |∇f(xk + α′′pk)⊤pk| < c2|∇fk⊤pk|. □

Some hints for line search implementations

Here the αi refer to the internal iterative process of searching for some α satisfying the Wolfe conditions for Φ(α) = f(xk + αpk).
For Newton type methods, the size of the step pk is typically meaningful. In this case one uses the initial step length α0 = 1 whenever possible and checks Armijo only, because the Newton process typically ensures a sufficient minimal step length. If Armijo fails, one may use a simple backtracking scheme like repeatedly setting αi+1 = (9/10)αi until Armijo holds.
If nothing is known about the step direction, a reasonable very first step size is almost impossible to predict. In the hope that a step of norm 1 has a reasonable meaning in the problem formulation, one often starts with α0 = 1/∥p∥. If previous iterations exist, one either uses the last successful step size or employs interpolation exploiting previous function values.
Given the information Φ(0), Φ′(0) = ∇fk⊤pk, Φ(α0) and typically also Φ′(α0) = ∇f(xk + α0pk)⊤pk (if evaluating the gradient is not too expensive) with α0 not yet satisfying Wolfe, the search progresses in two phases,

1. bracketing phase: find an interval [α̲, ᾱ] that contains good points,
2. selection phase: zoom into the interval until an approximate minimizer is found.
For each phase cubic interpolation is employed to shift the boundaries of the interval accordingly. Cubic interpolation approximates the function Φ by a cubic polynomial

    q(α) = q0 + q1α + q2α² + q3α³,    q′(α) = q1 + 2q2α + 3q3α²,

where the coefficients are determined so that function values and derivatives coincide at α̲ and ᾱ, i. e., the coefficients are computed from the equations

    q(α̲) = Φ(α̲),   q′(α̲) = Φ′(α̲),
    q(ᾱ) = Φ(ᾱ),   q′(ᾱ) = Φ′(ᾱ).

The next candidate is selected with respect to q depending on the phase. In the selection phase the minimizer of q may be at the boundary or inside, and the next candidate point is chosen somewhat away from the boundaries of the interval so as to reduce the size of the search interval by at least a constant factor in each iteration. How to manipulate the interval boundaries depends on the resulting function value and directional derivative at the candidate in a natural way, but requires some care concerning degeneracies and numerical difficulties.
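A sketch of the interpolation step (assuming Python with numpy; degeneracy handling is reduced to a simple fallback on the interval midpoint):

    import numpy as np

    def cubic_candidate(al, phi_l, dphi_l, ah, phi_h, dphi_h):
        # fit q(a) = q0 + q1 a + q2 a^2 + q3 a^3 to values/derivatives at al, ah
        M = np.array([[1, al, al**2, al**3],
                      [0, 1, 2*al, 3*al**2],
                      [1, ah, ah**2, ah**3],
                      [0, 1, 2*ah, 3*ah**2]], dtype=float)
        q0, q1, q2, q3 = np.linalg.solve(M, [phi_l, dphi_l, phi_h, dphi_h])
        # stationary points of q: roots of q'(a) = q1 + 2 q2 a + 3 q3 a^2,
        # keep only real ones with q''(a) > 0 (local minimizers)
        cand = [r.real for r in np.roots([3*q3, 2*q2, q1])
                if abs(r.imag) < 1e-12 and 2*q2 + 6*q3*r.real > 0]
        return min(cand, key=lambda a: q0 + a*(q1 + a*(q2 + a*q3))) \
               if cand else 0.5 * (al + ah)

    # interpolating Phi(a) = (a - 1)^2 recovers its minimizer a = 1:
    print(cubic_candidate(0.0, 1.0, -2.0, 2.0, 1.0, 2.0))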

Convergence of line search methods

In this section f is assumed continuously differentiable and bounded below, so that the existence of Wolfe points is ensured throughout. With this, choosing the steepest descent direction pk = −∇fk quickly gives rise to a convergent algorithm, but the aim is to prove convergence for a more general setting.
It is sufficient to require that the angle θk between pk and −∇fk is bounded away from orthogonality,

    cos θk = −∇fk⊤pk / (∥∇fk∥∥pk∥) ≥ δ > 0.     (2.2)

It will turn out that this condition allows to show convergence of several schemes under mild conditions. The main result seems rather technical at first.

Theorem 2.9 (Zoutendijk) Let f be bounded below on Rn and continuously differentiable in an open set U that contains the level set S := {x : f(x) ≤ f(x0)}, where x0 is the starting point. Let ∇f be Lipschitz continuous on U, i. e., ∥∇f(x) − ∇f(x̄)∥ ≤ L∥x − x̄∥ for all x, x̄ ∈ U. Let xk+1 = xk + αk pk be any iterative algorithm with pk a descent direction and αk satisfying the Wolfe conditions. Then there holds the Zoutendijk condition

    ∑_{k≥0} cos²θk ∥∇fk∥² < ∞.

Combining the Zoutendijk condition with (2.2) yields δ² ∑_{k≥0} ∥∇fk∥² < ∞, which implies ∇fk → 0. So whenever the step directions do not approach orthogonality to steepest descent and xk → x̄, this x̄ must be a stationary point by continuity. In optimization a point sequence is called globally convergent if ∇fk → 0. Mind, however, that convergence refers to the gradients converging to zero. A globally convergent sequence may well rush off to infinity; consider e. g. minimizing e^x.
Before embarking on the proof, let us sketch the intuitive idea. Due to the Lipschitz condition the gradients cannot change too fast. The curvature condition, however, requires a significant change in the directional derivative, which yields a lower bound on the step length in terms of ∇fk⊤pk/∥pk∥². Together with Armijo this implies a lower bound on the decrease relative to (∇fk⊤pk)²/∥pk∥². But f is bounded below, so either the directions have to go bad or ∥∇fk∥ has to go to zero.
Proof: Subtracting ∇fk⊤pk < 0 on both sides of the curvature condition ∇fk+1⊤pk ≥ c2∇fk⊤pk yields the middle inequality in

    L∥αk pk∥∥pk∥ ≥ (∇fk+1 − ∇fk)⊤pk ≥ (c2 − 1)∇fk⊤pk > 0,

where the left inequality combines Cauchy-Schwarz with the Lipschitz condition. Hence

    αk ≥ ((c2 − 1)/L) · ∇fk⊤pk/∥pk∥².

Plugging this lower bound into Armijo gives a minimal decrease,

    fk+1 ≤ fk + c1αk∇fk⊤pk ≤ fk + c1((c2 − 1)/L) · (∇fk⊤pk)²/∥pk∥² = fk − c cos²θk ∥∇fk∥²,

with the constant c := c1(1 − c2)/L > 0, where (2.2) was used in the form (∇fk⊤pk)² = cos²θk ∥∇fk∥²∥pk∥². Summing this inequality over k′ = 0, . . . , k and canceling identical terms results in

    fk+1 ≤ f0 − c ∑ki=0 cos²θi ∥∇fi∥².

Because f is bounded below, the sum must remain bounded. □



Because the steepest descent choice pk = −∇fk satisfies (2.2) with δ = 1, the steepest descent method is globally convergent.
For Newton like methods with step direction pk = −Hk⁻¹∇fk for some Hk ≻ 0, it suffices to require ∥Hk∥∥Hk⁻¹∥ ≤ M for all k and some positive constant M. Before proving this, note that for Hk ≻ 0 the norm ∥Hk∥ = λmax(Hk) is the least upper bound norm, so ∥Hk∥∥Hk⁻¹∥ = λmax(Hk)/λmin(Hk) = κ(Hk) is the condition number of Hk, and the requirement is that this condition number is uniformly bounded. For uniformly bounded condition numbers we obtain

    cos θk = ∇fk⊤Hk⁻¹∇fk / (∥∇fk∥∥Hk⁻¹∇fk∥)
           ≥ (∥∇fk∥²/∥Hk∥) / (∥∇fk∥ · ∥Hk⁻¹∥∥∇fk∥) = 1/(∥Hk∥∥Hk⁻¹∥) ≥ 1/M,

using ∇fk⊤Hk⁻¹∇fk ≥ λmin(Hk⁻¹)∥∇fk∥² = ∥∇fk∥²/∥Hk∥.

In order to illustrate the advantage of Newton like methods over steepest descent, it suffices to consider a convex quadratic function f(x) = ½x⊤Qx. While this appears to be a very special case, it is still quite representative, because by the sufficient optimality conditions all strict local minima look like this in the limit, and limiting convergence properties are all we are interested in for now.
The disadvantage of steepest descent is that it strongly depends on the eigenvalue distribution of Q. In fact it even suffices to consider diagonal matrices Q ≻ 0 in the two dimensional case to illustrate the problem. For Q = I all level sets are circles around the origin.

sketch level sets for Q=I and scaled Q

At any point the steepest descent direction is orthogonal to the level lines (why?), and for circles this direction points straight at the origin. So when taking the starting point x̄ = (1, 1)⊤, steepest descent with line search will find the optimum in one step.
Now consider scaling the first coordinate axis by 1000 (in applications this could correspond to changing the unit of this axis from millimeters to meters), i. e., replace x1 by 1000 x̃1. The equivalent change in Q leads to Q̃ = Diag(1000², 1), and the starting point changes to x̃ = (1/1000, 1)⊤. The level sets are now very narrow ellipses corresponding to the high curvature along the first coordinate, and the steepest descent direction at x̃ reads p̃ = −Q̃x̃ = −(1000, 1)⊤. Even doing an exact line search in this direction (given the level lines, where is the line search optimum?) will go somewhat to the negative side of the first coordinate, and the method starts zigzagging in a very slow manner towards the origin.
Finally, consider Newton's method instead. In the first case the direction reads p = −Q⁻¹Qx̄ = −x̄, in the scaled case it reads p̃ = −Q̃⁻¹Q̃x̃ = −x̃, so in each case it converges in one step. This should be no surprise as the Newton step goes to the minimum of the quadratic model, which coincides with the function. In the exercises, however, you will prove that Newton's method is scale invariant, meaning that it generates exactly the same sequence of steps independent of the choice of the basis. This holds for all kinds of ellipsoids and not just for axis aligned ones.
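This behavior is easy to reproduce (a sketch assuming Python with numpy; the steepest descent starting point is chosen to mix both eigendirections so the zigzagging is most pronounced, an illustrative choice not taken from the text):

    import numpy as np

    Q = np.diag([1000.0 ** 2, 1.0])
    x = np.array([1e-6, 1.0])
    for k in range(5):                      # steepest descent, exact line search
        g = Q @ x
        alpha = (g @ g) / (g @ (Q @ g))     # exact step length for a quadratic
        x = x - alpha * g                   # x1 flips sign each step: zigzag
        print(k, 0.5 * x @ Q @ x)           # f barely decreases per step

    x = np.array([1e-3, 1.0])               # the scaled start from the text
    x = x - np.linalg.solve(Q, Q @ x)       # Newton direction p = -x
    print(x)                                # [0. 0.] after a single step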
Chapter 3

Convex Analysis

The main goals are necessary and sufficient optimality conditions for convex
optimization problems as well as laying the foundation for convex optimization
algorithms. We will see that most convex properties have beautiful geometric
interpretations.

3.1 Convex Sets

We first recall the basic definitions of convex sets and introduce convex cones.
Definition 3.1
• A set C ⊆ Rn is convex if for x, y ∈ C we have [x, y] = {αx + (1 −
α)y : α ∈ [0, 1]} ⊆ C.
• A set C ⊆ Rn is a convex cone, if for x, y ∈ C the open half ray
R++ (x + y) := {λ(x + y) : λ > 0} ⊆ C.

Examples of important convex sets, some of which have appeared already, are:
• the hyperplane Hs,r= := {x ∈ Rn : ⟨s, x⟩ = r} for given s ∈ Rn, r ∈ R,
• the halfspace Hs,r := {x ∈ Rn : ⟨s, x⟩ ≤ r} and the open halfspace Hs,r< := {x ∈ Rn : ⟨s, x⟩ < r},
• the unit simplex in Rk,

    △k := {α ∈ Rk : ∑ki=1 αi = 1, αi ≥ 0, i = 1, . . . , k},

sketch: simplex

• the nonnegative orthant Rn+ = {x ∈ Rn : xi ≥ 0, i = 1, . . . , n}, which is a (closed) convex cone.
The next observation collects basic convexity preserving operations on convex
sets.

Observation 3.2
(i) Let (Cj)j∈J be a family of convex sets in Rn, then ∩j∈J Cj is convex.
(ii) Let Ci ⊆ Rni be convex (or convex cones) for i = 1, . . . , k, then C1 × · · · × Ck is convex (or a convex cone) in Rn1+···+nk.
(iii) Let A : Rn → Rm be an affine map. For C ⊆ Rn convex the image A(C) is convex in Rm, and for D ⊆ Rm the preimage A⁻¹(D) is convex in Rn.
(iv) If C1, C2 ⊆ Rn are convex, then the Minkowski-sum C1 + C2 is convex, and for α1, α2 ∈ R the set α1C1 + α2C2 := {α1x1 + α2x2 : x1 ∈ C1, x2 ∈ C2} is convex.
(v) If C ⊆ Rn is convex, then so are its interior int C and its closure cl C.

Proof: In the exercises. □

Convex Combinations and Convex Hull

The following concepts are introduced in the Linear Algebra course.
• Let x1, . . . , xk ∈ Rn, α1, . . . , αk ∈ R, then ∑ki=1 αixi is a linear combination of the xi.
• A (linear) subspace of Rn contains all linear combinations of its elements.
• The intersection of linear subspaces is again a linear subspace.
• For S ⊆ Rn the linear hull span S is the set of all linear combinations of elements of S, or equivalently the smallest subspace containing S, or equivalently the intersection of all subspaces that contain S.
The same concepts exist for affine subspaces (shifted linear subspaces).
• Let x1, . . . , xk ∈ Rn, α1, . . . , αk ∈ R with ∑ki=1 αi = 1, then ∑ki=1 αixi is an affine combination of the xi.
• An affine subspace of Rn contains all affine combinations of its elements.
• The intersection of affine subspaces is again an affine subspace.
• For S ⊆ Rn the affine hull aff S is the set of all affine combinations of elements in S, or equivalently the smallest affine subspace containing S, or equivalently the intersection of all affine subspaces that contain S.
• k + 1 points x0, . . . , xk are affinely independent if the vectors {x1 − x0, x2 − x0, . . . , xk − x0} are linearly independent.
CHAPTER 3. CONVEX ANALYSIS 35

• For affinely independent x0 , . . . , xk each xP∈ aff{x0 , . . . , P


xk } has a
unique representation as affine combination ki=0 αi xi with ki=0 αi =
1.
These concepts are also extended to convex sets and convex cones.

Definition 3.3
Pn
• Let xP n
1 , . . . , xk ∈ R , α1 , . . . , αk ≥ 0 with i=1 αi = 1 (⇔ α ∈ △k ),
then ni=1 αi xi is a convex combination of the xi .
• For S ⊆ Rn the convex hull
\
conv S := C
S⊆C cvx

is the intersection of all convex sets that contain S.


• Let x1 , . . . , xk ∈ Rn , λ1 , . . . , λk ≥ 0, then ni=1 λi xi is a conic combi-
P
nation (or nonnegative linear combination) of the xi .
• For S ⊆ Rn the conic hull
k
X
cone S := {x : ∃k ∈ N, x1 , . . . , xk ∈ S, λ1 , . . . , λk ≥ 0 with x = λ i xi }
i=1

is the set of all conic combinations of elements in S.


Attention: In general this is not the smallest convex cone that contains
S! △

Observation 3.4
(i) A set C ⊆ Rn is convex ⇔ C contains every convex combination of
its elements.
(ii) Let S ⊆ Rn , then
conv S = {x ∈ Rn : ∃k ∈ N, x1 , . . . , xk ∈ S, α ∈ △k with x = ki=1 αi xi }.
P
“The convex hull is the set of all convex combinations of its elements.”
(iii) Let S ⊆ Rn , then cone S = R+ (conv S) = conv (R+ S) where R+ S =
{λs : λ ∈ R+ , s ∈ S}.

Proof: For (i) and (ii) see the exercises.


Pk
(iii): As 0 P
is in all sets, we may assume that x = i=1 λi si ∈ cone S with
λ ≥ 0 and λi > 0, then
k k k
λ λ
P i si = Pi (
X X X X X
x= λi si = ( λj ) · λj )si ∈ conv (R+ S).
i=1 j i=1 j λj i=1 j λj j
| {z } | {z } | {z } | {z }
∈R+ =1 =1 ∈R+ S
| {z }
∈conv S □
CHAPTER 3. CONVEX ANALYSIS 36

While k ∈ N may seem in-


timidating, in fact, every
x ∈ conv S can be repre-
sented by a convex combi-
nation of relatively few ele-
ments.
sketch 2D conv of points and cone

Theorem 3.5 (Carathéodory; conic version)


Let S ⊆ Rn . Each x ∈ cone S can be represented as a conic combination of
linear independent vectors si ∈ S.

Pk This holds for S = ∅, so assume S ̸= ∅. It holds for x = 0, so let


Proof:
x = i=1 λi si ∈ cone S with λi > 0, i = 1, . . . , k. If the si are lin. indep., we
are done. Whenever they are not, we can eliminate one si as follows (and
repeatPthis, if necessary). Because the si are lin. dep., there exits β ∈ Rk \{0}
with ki=1 βi si = 0. W. l. o. g. there is at least one βj < 0 (otherwise use
−βj . Determine t > 0 so that λ′ := λ + tβ ≥ 0 and for some index ı̄ there
holds λı̄ = 0. Then
X k
X k
X k
X
λ′i si = (λi + tβi )si = λi si + t βi si = x.
i∈{1,...,k}\{ı̄} i=1 i=1 i=1
| {z }
=0 □
Each x ∈ cone S can be written as conic combination of at most n elements
of S. In this the feasible choices of elements depend on x.
So what about conv S?
For going back and forth between conic and set versions, a useful trick is
“homogenization”. Convert the set C in Rn into a cone K in Rn+1 by adding
a scaling coordinate, K = cone (C × {1}).

sketch 2D set C and 3D cone K

“Every convex set can be represented as the intersection of a convex cone


with a hyperplane.”
CHAPTER 3. CONVEX ANALYSIS 37

Theorem 3.6 (Carathéodory, general version)


Let S ⊆ Rn . Each x ∈ conv S can be represented as convex combination of
at most n + 1 elements of S.

Proof: Let C = conv S, K = cone (S × {1}).


h
def
X
( x1 ) ∈ K ⇔ ∃λ1 , . . . , λh ∈ R+ ( x1 ) = λi ( s1i )
i=1
h
X h
X
⇔ x= λi si with λi ≥ 0 and λi = 1 ⇔ x ∈ conv S.
i=1 i=1
By Th 3.5 we may assume h ≤ n + 1. □

Closure and Relative Interior

We have already pointed out the importance of closed sets in optimization


for attainment reasons. It will turn out that dangerous situations typically
involve some sequences that go to infinity.
Definition 3.7
• For S ⊆ Rn the closed convex hull conv S is the intersection of all
closed convex sets that contain S.
• For S ⊆ Rn the closed conic hull cone S is the intersection of all closed
convex cones that contain S.
• The dimension dim C of a convex set C is the dimension of aff C, i. e.,
the dimension of the smallest affine subspace that contains C.
• The relative interior rint C of a convex set C is the set of all point
x ∈ C, for which there is with respect to aff C an (open) neighborhood
of C containing x, so for C ⊆ Rn
rint C := {x ∈ C : ∃ε > 0 Bε (x) ∩ aff C ⊆ C}.
• The relative boundary rbd C of a convex set C is the boundary of C
with respect to aff C, i. e., rbd C := cl C \ rint C.

Example For this depicted C (solid lines


represent closed, dashed lines [relatively]
open parts),
• dim C = 2,
• int C = ∅,
• rint C is the shaded part,
• bd C = cl C = conv C,
• rbd C is the line bounding C.

sketch C within hyperplane in 3D space
CHAPTER 3. CONVEX ANALYSIS 38

Exercise Show dim △k = dim rint △k .



Theorem 3.8 Let S ⊆ Rn . If S is compact/bounded, then conv S is com-
pact/bounded.

Ph First let S be bounded by S ⊆ BM (0) for some M > 0. For


Proof:
αi xi ∈ conv S we may
x = i=1P assume h = n + 1 by Th 3.6.
⇒ ∥x∥ ≤ n+1
Pn+1
α
i=1 i ∥xi ∥ ≤ M i=1 αi = M , so conv S is bounded.
Let now S also be closed (bounded and closed and thus compact), then
it remains
P to show that conv S is closed. Consider a convergent sequence
(xk = n+1 k) k
i=1 i i k∈N ∈ conv S with α ∈ △n+1 for k ∈ N converging to
α x
xk → x ∈ Rn .
K
Because △n+1 is compact there is a subsequence K0 with αk →0 ᾱ ∈ △n+1 ,
and because S is compact, there are successive subsequences Ki of Ki−1 for
K Kn+1
i = 1, . . . , n + 1 with xki →i x̄i ∈ S so that xk → x = n+1
P
i=1 ᾱi x̄i ∈ conv S.

If S is closed but unbounded,
conv S is not necessarily closed.
sketch closed S with conv S not closed

Observation 3.9
(i) For S ⊆ Rn there holds conv S = cl conv S, cone S = cl cone S.
(ii) For bounded S ⊆ Rn , cl conv S = conv cl S, thus conv S = conv cl S.
(iii) For compact S ⊆ Rn with 0 ∈ / conv S, cone S = R+ (conv S)[= cone S].

Proof: (i): Exercise.


(ii): ⊇: Recall (A ⊆ B ⇒ cl A ⊆ cl B). So S ⊆ conv S gives cl S ⊆ cl conv S.
def. conv (i)
The latter is convex by Obs 3.2(v), hence conv cl S ⊆ cl conv S =
conv S.
def. conv
⊆: By Th 3.8 conv cl S is closed (and convex), so conv S ⊆ conv cl S.
(iii): By Th 3.8 C := conv S is compact and by assumption 0 ∈/ C.
Obs 3.4(iii) asserts R+ C = cone S ⊆ cone S. We show that R+ C is closed.
Let λk xk ∈ R+ C for k ∈ N and λk xk → x.
K
Because C is compact, there is a subsequence K with xk → x̄ ∈ C, so x̄ ̸= 0.
K x K
By λk ∥xxkk ∥ → ∥x̄∥ there holds λk → ∥x∥
∥x̄∥ =: λ̄ ≥ 0.
K
With this, λk xk → λ̄x̄ = x ∈ R+ C. □

sketch examples that in (iii) 0 ∈


/ conv S is needed, sufficient but not necessary
CHAPTER 3. CONVEX ANALYSIS 39

Theorem 3.10 Let C ⊆ Rn be convex. If C ̸= ∅ then rint C =


̸ ∅ and
dim(rint C) = dim C.

Proof: For dim aff C = k ≥ 0 the set C contains k + 1 affinely independent


points x0 , . . . , xk . The claim holds for k = 0, so let k > 0.
Consider the simplex △ = conv {x0 , . . . , xk+1 } ⊆ C. We show that its
barycenter x̄ = ki=0 k+1 1
P
xi ∈ rint △ (⊆ rint C).
Pk
Observe aff C = {x̄ + i=1 ξi (xi − x̄) : ξ ∈ Rk }. Because the set
S = {x̄ + ki=1 ξi (xi − x̄) : ∥ξ∥∞ ≤ 12 k+1 1
P
} ⊃ Bε (x̄) ∩ aff(C) for ε > 0 small
enough, it suffices to show S ⊆ △. Let x ∈ S, then x = ki=0 αi xi with
P


≤ 21 
z }| { 


X k 


1
α0 = k+1 (1 − ξj ) ≥0 




j=1 

 k k k
k X
k+1
X X
1
X and αi = k+1 (1− ξj )+ ξj = 1.
αi = k+1 (1 − ξj ) + ξi ≥ 0 
 i=0 j=1 j=1
j=1 


| {z } 

≤ 21




| {z } 

1 
≥2

So x ∈ △, hence x̄ ∈ rint C. Furthermore, dim(rint C) ≥ dim △ = dim aff C.



For any point x in the closure of a convex set
C and any point y ∈ rint C the half open line
segment (x, y] is fully contained in the relative
interior of C.
sketch (x, y] for x ∈ bd C, y ∈ rint C

Lemma 3.11 Let C ⊆ Rn be convex, x ∈ cl C and y ∈ rint C. There holds

(x, y] := {αx + (1 − α)y : 0 ≤ α < 1} ⊆ rint C.

Proof: in the Exercises. □

Observation 3.12 Let C ⊆ Rn be convex. The three sets rint C, C and


cl C have the same affine hull (thus the same dimension), the same relative
interior and the some closure (thus the same relative boundary).

Proof: That the affine hull is the same follows by Th 3.10 (also for cl C,
because aff C is closed). All other claims follow by Lem 3.11. For proving,
e. g., that rint C and C have the same closure we have to show cl C ⊆ cl rint C.
[?]For C = ∅ this holds, so let x ∈ cl C and y ∈ rint C (̸= ∅ by Th 3.10). Then
CHAPTER 3. CONVEX ANALYSIS 40

xk = (1 − k1 )x + k1 y ∈ rint C by Lem 3.11 and thus limk→∞ xk = x ∈ cl rint C.



One might at first expect that the relative interior / the closure of the
intersection of convex sets is simply the intersection of their relative interiors
/ their closures. Unfortunately, this is not true in general. It can be shown,
however, by a somewhat technical proof that no problems arise for sets whose
relative interiors intersect.

sketch examples for rint (C1 ∩ C2 ) ̸= rint (C1 ) ∩ rint (C2 ) and cl (C1 ∩ C2 ) ̸= cl (C1 ) ∩ cl (C2 )

In optimization such anomalies may cause significant difficulties, e. g., when-


ever feasible sets arise as intersections of level sets that just touch at their
boundaries. Regularity assumptions like the Slater point assumption intro-
duced later ensure friendly intersection properties and are indispensable in
order to obtain managable optimality conditions.
In the same vein one needs to be careful whenever taking affine images or
preimages of convex sets. Indeed, there are easy examples, where the affine
image of a closed convex set is no longer closed. In convex optimization
this fact is a further frequent source of difficulties for attainment and strong
duality.
In spite of the conceptual importance, the proof of the optimality conditions
will not require these relations, so we refrain from a detailed discussion in
this course.

Projections onto closed sets

Definition 3.13 Let ∅ = ̸ C ⊆ Rn be closed and convex, let x ∈ Rn . The


projection of x onto C is
1
pC (x) = argminy∈C ∥y − x∥2 .
2
In this argmin refers to the unique minimizing argument.

Note, pC (x) is well defined, because f (y) = 12 ∥y − x∥2 is strictly convex


proving uniqueness of the optimal solution. Existence follows because with
r = ∥y − x∥ for some y ∈ C ̸= ∅ the optimal solution is found in Br (x) ∩ C
which is a compact set (Br (x) and C are closed) and f is continuous.
CHAPTER 3. CONVEX ANALYSIS 41

sketch projection ȳ of x onto C, ball and angle for some y ∈ C

̸ C ⊆ Rn closed, convex, ȳ ∈ C, x ∈ Rn .
Theorem 3.14 Let ∅ =

ȳ = pC (x) ⇔ ∀y ∈ C ⟨x − ȳ, y − ȳ⟩ ≤ 0

Proof: ⇒: Let ȳ = pC (x). There holds for y ∈ C and all α ∈ (0, 1]


2
1
2 ∥x−ȳ∥ ≤ 21 ∥x−(ȳ+α(y−ȳ))∥2 = 12 ∥x−ȳ∥2 −α ⟨x − ȳ, y − ȳ⟩+α2 12 ∥y−ȳ∥2 .

Thus
0 ≤ −α ⟨x − ȳ, y − ȳ⟩ + α2 12 ∥y − ȳ∥2
or
⟨x − ȳ, y − ȳ⟩ ≤ α 12 ∥y − ȳ∥2 → 0 for α → 0.

⇐: For x = ȳ ∈ C the statement holds, so let x ̸= ȳ with ⟨x − ȳ, y − ȳ⟩ ≤ 0


for all y ∈ C. Then

0 ≥ ⟨x − ȳ, y − x + x − ȳ⟩ = ∥x − ȳ∥2 + ⟨x − ȳ, y − x⟩


C.S.
≥ ∥x − ȳ∥2 − ∥x − ȳ∥ · ∥y − x∥.

Dividing by ∥x − ȳ∥ proves ∥x − ȳ∥ ≤ ∥y − x∥ for all y ∈ C. □


A bit later we shall also need the following consequence of this.
̸ C ⊆ Rn be convex and closed. For all (x1 , x2 ) ∈
Observation 3.15 Let ∅ =
n n
R × R there holds

∥pC (x1 ) − pC (x2 )∥2 ≤ ⟨pC (x1 ) − pC (x2 ), x1 − x2 ⟩ .

Proof: Apply Th 3.14 for the following two choices to obtain


x , ȳ , y
x1 , pC (x1 ), pC (x2 ) ⇒ ⟨pC (x2 ) − pC (x1 ), x1 − pC (x1 )⟩ ≤ 0,
x2 , pC (x2 ), pC (x1 ) ⇒ ⟨pC (x1 ) − pC (x2 ), x2 − pC (x2 )⟩ ≤ 0,
+ : ⟨pC (x1 ) − pC (x2 ), x2 − x1 + pC (x1 ) − pC (x2 )⟩ ≤ 0.

Apply Cauchy-Schwarz on the right to see ∥pC (x1 ) − pC (x2 )∥ ≤ ∥x1 − x2 ∥.
The distance of the projected points is not bigger than that of the points.
Thus, projection is a nonexpansive map.
Next we consider a characterization of the projection onto convex cones.
CHAPTER 3. CONVEX ANALYSIS 42

Definition 3.16 Let K ⊆ Rn be a convex cone (not nec. closed). The polar
cone is
K ◦ = {s ∈ Rn : ⟨s, x⟩ ≤ 0 ∀x ∈ K}.

The dual nature becomes apparent in the fol-


lowing two verbal descriptions:
K ◦ contains all points that lie in the nonposi-
tive halfspace Hx,0 for all x ∈ K.
K ◦ consists of all normal vectors s that contain
K in their nonpositive halfspace Hs,0 .
sketch open K with K ◦ and ȳ = pK (x)
In fact, the polar cone S ◦ can be declared for any set S ⊆ Rn in the same
way, but we will mainly need it for convex cones.
Note, K ◦ is always convex and closed. One may prove this by observing
that it is the intersection of a family of closed convex half spaces. For an
alternative proof let (sk )k∈N → s with sk ∈ K ◦ , then for any x ∈ K continuity
of the inner product implies ⟨s, x⟩ ≤ supk∈N ⟨sk , x⟩ ≤ 0, so s ∈ K ◦ .
In order to be sure that projections exist, the cone to project on needs to
be closed. For closed convex cones, projections have a particularly simple
characterization.

Theorem 3.17 Let K ⊆ Rn be a closed convex cone and x ∈ Rn .

ȳ = pK (x) ⇔ ȳ ∈ K, x − ȳ ∈ K ◦ , ⟨x − ȳ, ȳ⟩ = 0.

Proof: ⇒: By Th 3.14, ⟨x − ȳ, y − ȳ⟩ ≤ 0 for all y ∈ K. This also holds for
y = αȳ for α ≥ 0, so

for all α ≥ 0 (α − 1) ⟨x − ȳ, ȳ⟩ ≤ 0, hence ⟨x − ȳ, ȳ⟩ = 0.

⇐: Let x ∈ Rn and ȳ satisfy the requirements. For y ∈ K


1 1
∥x − y∥2 = ∥x − ȳ + ȳ − y∥2
2 2
1 1
= ∥x − ȳ∥2 + ⟨x − ȳ, ȳ − y⟩ + ∥ȳ − y∥2
2 | {z } 2
⟨x−ȳ,ȳ⟩=0
= − ⟨x − ȳ, y⟩
| {z }
≤0 by (x−ȳ)∈K ◦

1
≥ ∥x − ȳ∥2 .
2

Exercise Prove pK (x) + pK ◦ (x) = x for x ∈ Rn .


CHAPTER 3. CONVEX ANALYSIS 43

Separation of Convex Sets

Definition 3.18 Let M1 , M2 ⊆ Rn (not necessarily convex).


• M1 and M2 are separable if

∃s ∈ Rn \ {0}, r ∈ R ∀x ∈ M1 , y ∈ M2 ⟨s, x⟩ ≤ r ≤ ⟨s, y⟩

• They are properly separable if they are separable so that

∃x̄ ∈ M1 , ȳ ∈ M2 ⟨s, x̄⟩ < ⟨s, ȳ⟩ .

• They are strongly separable if

∃s ∈ Rn \ {0}, r ∈ R sup ⟨s, x⟩ < r < inf ⟨s, y⟩ .


x∈M1 y∈M2

sketch separation examples

In the convex case the basic result concerns separating a point from a convex
closed set.
Theorem 3.19 (convex separation)
Let ∅ ̸= C ⊆ Rn be convex and
closed and let x ∈
/ C.

∃s ∈ Rn ⟨s, x⟩ > sup ⟨s, y⟩ .


y∈C
sketch construction

Proof: Put s := x − pC (x) (̸= 0). By Th 3.14 there holds for all y ∈ C

0 ≥ ⟨x − pC (x), y − pC (x)⟩
= ⟨s, y − x + s⟩ = ⟨s, y⟩ − ⟨s, x⟩ + ∥s∥2 ,

thus ⟨s, x⟩ − ∥s∥2 ≥ ⟨s, y⟩ for all y ∈ C. □


This can be generalized to convex sets as follows.
Theorem 3.20 (strong separation of convex sets)
Let C1 , C2 be nonempty closed convex sets satisfying C1 ∩ C2 = ∅. If C2 is
bounded, there exists s ∈ Rn so that

sup ⟨s, y⟩ < min ⟨s, y⟩ .


y∈C1 y∈C2
CHAPTER 3. CONVEX ANALYSIS 44

Proof: C1 − C2 = C1 + (−C2 ) is convex and closed, because C2 is compact


(Exercise). Furthermore, 0 ∈
/ C1 − C2 because C1 ∩ C2 = ∅. By Th 3.19 there
exists s ∈ Rn so that

sup{⟨s, y⟩ : y ∈ C1 − C2 } < ⟨s, 0⟩ = 0.

Exploiting compactness of C2 again, each y may be split to obtain

0 > sup ⟨s, y1 ⟩ + sup ⟨s, −y2 ⟩ = sup ⟨s, y1 ⟩ − inf ⟨s, y2 ⟩ .
y1 ∈C1 y2 ∈C2 y1 ∈C1 y2 ∈C2

By compactness the second inf is attained, which proves the claim. □


In general the compactness require-
ment cannot be dropped for the sets
to be strongly separable.
sketch separable unbounded convex sets

As usual, closedness and a cautious treatment of directions to infinity are


essential.
Convex separation allows to establish the existence of supporting hyperplanes
for convex sets and these are the basis of duality.
Definition 3.21 A hyperplane Hs,r = supports a set C, if C is fully contained

in one of its closed half spaces. It supports C in x ∈ bd C, if in addition


⟨s, x⟩ = r.
Lemma 3.22 Let ∅ = ̸ C ⊆ Rn be convex and x ∈ bd C. There is a hyper-
plane, that supports C in x.

Proof: By Obs 3.12 C, cl C and their complements have the same boundary.
Therefore there exist (xk )k∈N → x with
xk ̸∈ cl C. For each k, by Th 3.19 there is
an sk with ∥sk ∥ = 1 so that

⟨sk , xk − y⟩ > 0 ∀y ∈ cl C ⊇ C.
sketch construction of s

Because B1 (0) is compact the sequence (sk ) has a cluster point s and

⟨s, x − y⟩ ≥ 0 ∀y ∈ C.
= with r := ⟨s, x⟩ ≥ ⟨s, y⟩ for y ∈ C supports C in x.
Thus Hs,r □
Remark 3.23 If C is “flat”, it may
happen that C ⊆ Hs,r= . For x ∈ rbd C

this can be avoided by carrying out the


construction in aff C.
=
sketch flat C contained in Hs,r .

Closed convex sets can be fully be described “from the outside” by their
supporting hyperplanes.
CHAPTER 3. CONVEX ANALYSIS 45

Theorem 3.24 Let ∅ = ̸ C ⊆ Rn be convex, then the intersection of all


halfspaces that contain C is the closure of C.

Proof: There holds

cl C ⊆ C ∗ := x ∈ Rn : ⟨s, x⟩ ≤ r for all s, r with C ⊆ {y : ⟨s, y⟩ ≤ r} ,




because if C is contained in a closed halfspace, then cl C is as well.


C ∗ ⊆ cl C: Let x ∈ / cl C. By Th 3.19 x may be separated from cl C by some
/ C ∗.
s̄ so that ⟨s̄, x⟩ > supy∈C ⟨s̄, y⟩ =: r̄, thus x ∈ □

Corollary 3.25 Every closed convex set C ⊂ Rn is the intersection of all


halfspaces that contain C.

Note, this also works for C = Rn and C = ∅ (recall, = Rn ). An important


T

special case are polyhedra.

Definition 3.26 A (closed convex) polyhedron is the intersection of finitely


many halfspaces,

P = {x ∈ Rn : ⟨ai , x⟩ ≤ bi , i = 1, . . . , m} = {x ∈ Rn : Ax ≤ b}.

If b = 0, P is a (closed convex) polyhedral cone.

Exercise Justify in the name ployhedral cone the use of “cone”.


A helpful interpretation of the polar cone K ◦ to a cone K is that it collects ⋉

supporting halfspaces that support K in the origin. Taking the polar of the
polar cone results in the so-called bipolar cone K ◦◦ which is closed ly related
to K.

Theorem 3.27 Let ∅ = ̸ K ⊆ Rn be a convex cone with polar cone K ◦ . The


◦◦
bipolar cone K is the closure of K.

Proof: By Def 3.16 K ◦ = {s ∈ Rn : ⟨s, x⟩ ≤ 0 ∀x ∈ K}. Note, for any cone


K and s ∈ Rn there are only two possible outcomes for the conic optimization
problem
0 ⇔ s ∈ K ◦,

sup ⟨s, x⟩ =
x∈K ∞ otherwise.
Therefore K ◦ describes all halfspaces that contain K without any slack (for
s ∈ K ◦ any ⟨s, x⟩ ≤ r with r > 0 is redundant and r < 0 cuts of the origin).
Th 3.24 assert that the intersection of these halfspaces is the closure of K,
\
cl K = {x ∈ Rn : ⟨s, x⟩ ≤ 0} = {x ∈ Rn : ⟨s, x⟩ ≤ 0 ∀s ∈ K ◦ } = K ◦◦ .
s∈K ◦ □
CHAPTER 3. CONVEX ANALYSIS 46

When optimizing over polyhedral sets (like in LP) and one fails to find a
point in
X = {x ≥ 0 : Ax = b}
one would like to give a preferably short proof that the set is indeed empty.
Reformulated as the question
X
Is b ∈ A•,i xi : xi ≥ 0 ?

this results in the geometric question, whether


b is contained in the cone spanned by the
colummns of A.
sketch cone generated by columns and b

The following Lemma of Farkas shows that there is indeed a good way to
answer this, i. e., there is a short proof also for infeasibility. This famous
lemma exists in several variants. We start with one in the form of a theorem
of the alternative.
Lemma 3.28 (Farkas) Let b ∈ Rn , A ∈ Rn×m . Exactly one of the systems
Ax = b, x ≥ 0 and A⊤ y ≤ 0, b⊤ y > 0 has a solution.

In words, “Either b is contained in the


cone that is spanned by the columns
of A or there exists a hyperplane with
normal vector y that separates b from
the cone.
sketch a separating y

If b does not lie in the cone, the cone only needs to be closed for the Lemma
to follow directly from Th 3.19. Therefore it suffices to prove the following
reformulated variant.
Lemma 3.29 (Farkas, cone version) Pm Let a1 , . . . , am ∈ Rn . The convex
cone K = cone {a1 , . . . , am } = { i=1 λi ai : λi ≥ 0, i = 1, . . . , m} is closed.
Pm
Proof: Let (bk )k∈N → b with bk ∈ K. So bk = k
i=1 λi ai with λki ≥ 0 ∀i, k.
If the ai are linearly independent, the map A = [a1 , . . . , am ] is injective and
the inverse map to Aλk = bk is continuous. In this case λki → λi ≥ 0 with
b = ki=1 λi ai ∈ K and K is closed.
P

If the ai are linearly dependent, the λk may vary a lot due to linearly
dependent contributions. In order to get rid of linear dependence, apply the
conic version of Caratheodory, by which each bk can be represented as a conic
combination of linearly independent vectors ai . As there are only finitely
many linearly independent subsets of {a1 , . . . , am }, each corresponding to a
subcone of K, one of these subcones has to contain infinitely many of the bk
and, by the previous argument, also b. □
CHAPTER 3. CONVEX ANALYSIS 47

Exercise Prove that {Ax : x ≥ 0}◦ = {y : A⊤ y ≤ 0} and {y : A⊤ y ≤ 0}◦ =


{Ax : x ≥ 0}.



Faces and Extreme Points

Definition 3.30 Let F, C ⊆ Rn convex with F ⊆ C. F is a face of C if


there holds
∀x1 , x2 ∈ C [(x1 , x2 ) ∩ F ̸= ∅ ⇒ x1 , x2 ∈ F ].
In words, if a point in the relative interior of a straight line segment in C is
in F , then the full straight line segment is in F .

Consider as C the 3D unit-cube.


• The cube itself is a face.
• The convex hull of any four vertices shar-
ing an identical coordinate is a face.
• The convex hull of any two vertices shar-
ing two identical coordinates (the edges)
is a face.
• Each vertex is a face.
• The empty set is a face
sketch 3D unit cube

Exercise What are faces if C is the 2D unit-disk or, more generally, the
n-dimensional unit-ball?



Definition 3.31 Let C ⊆ Rn be convex.
• ∅ and C are the trivial faces of C.
• A face F of C having dim(F ) = dim C − 1 is a facet of C.
• A face F of C having dim(F ) = 1 is an edge of C.
• A face F of C having dim(F ) = 0, thus F = {x}, is an extreme point
of C and this also refers to the point x itself.
• A subset F ⊆ C is an exposed face of C if there is a supporting
= of C with F = H = ∩ C.
hyperplane Hs,r s,r
• An exposed extreme point of C is a vertex.

Not all extreme points


need to be vertices.
sketch rectangle and attached a quarter disk

Observation 3.32 An exposed face is a face.

= to C convex.
Proof: Let F be an exposed face for supporting hyperplane Hs,r
Any x1 , x2 ∈ C satisfy ⟨s, x1 ⟩ ≤ r and ⟨s, x2 ⟩ ≤ r. Suppose there exists
x̄ = αx1 + (1 − α)x2 ∈ F for some α ∈ (0, 1), then
r = ⟨s, x̄⟩ = α ⟨s, x1 ⟩ + (1 − α) ⟨s, x2 ⟩ ≤ αr + (1 − α)r = r.
CHAPTER 3. CONVEX ANALYSIS 48

This requires ⟨s, x1 ⟩ = r and ⟨s, x2 ⟩ = r, i. e., x1 , x2 ∈ F . □


Observation 3.33 Let ∅ ̸= C ⊆ Rn be convex and compact, then C has
extreme points.

Proof: Because C is compact, the maximum of the continuous function ∥x∥


over C is attained in some point x̄ ∈ C. Suppose, for contradiction, x̄ is no
face. Then there are x1 , x2 ∈ C with x̄ = 12 (x1 + x2 ) and
∥x̄∥2 = ∥ 21 (x1 + x2 )∥2 < 1 (∥x1 ∥2 + ∥x2 ∥2 ) ≤ 12 (∥x̄∥2 + ∥x̄∥2 ) = ∥x̄∥2 .
str. cvx 2 x̄ max □
Observation 3.34 The faces of the faces of a convex set C are faces of C.
In particular, the extreme points of a face are extreme points of C.

Proof: Let G be a face of F and F be a face of C. Let x1 , x2 ∈ C with


∃x ∈ G ⊆ F so that x ∈ (x1 , x2 ). Then by x ∈ F and F a face of C it follows
that x1 , x2 ∈ F , and therefore also x1 , x2 ∈ G by G being a face of F . □
Exercise The intersection of faces is a face.



Theorem 3.35 (Minkowski/Krein-Milman) Let C ⊆ Rn be convex and
compact. Then C is the convex hull of its extreme points.

Proof: This is true if C = ∅ or dim C = 0 (a single point). We continue


by induction on dim C. Let x ∈ C. We need to show that x is a convex
combination of extreme points of C. For this discern the following two cases.
x ∈ rbd C: By Lem 3.22 and the subsequent Rem 3.23 there exists a hyper-
plane H supporting C in x so that dim(C ∩ H) ≤ dim C − 1. By induction
hypothesis x is a convex combination of extreme points of F = C ∩ H. By
Obs 3.32 F is a face of C and the extreme points of F are also extreme points
of C by Obs 3.34.
x ∈ rint C: Choose a straight line within aff C that goes through x. Because
C is compact, the straight line intersects rbd C in two points x̄, x̄ ¯ and
¯). Thus x is a convex combination of x̄ and x̄
x ∈ (x̄, x̄ ¯, which are convex
combinations of extreme points of C by the first part. □
This has important consequences in optimization.
̸ C ⊆ Rn be convex and compact, let s ∈ Rn .
Observation 3.36 Let ∅ =
There holds
max ⟨s, x⟩ = max{⟨s, x⟩ : x extreme point of C}
x∈C

and
Argmaxx∈C ⟨s, x⟩ = conv Argmax{⟨s, x⟩ : x extreme point of C},
where Argmax denotes the set of maximizing arguments.
CHAPTER 3. CONVEX ANALYSIS 49

Proof: Because C is compact, the functional ⟨s, ·⟩ attains its maximum r


= is a supporting hyperplane of C. The set of
on C. By maximality of r, Hs,r
= . F is convex and compact
maximizers is thus the exposed face F = C ∩ Hs,r
and by Obs 3.32 a face of C. By Th 3.35 F is the convex hull of its extreme
points, which are also extreme points of C by Obs 3.34. □
For the special case of linear optimization this gives a first indication, why
for linear programs in normal form at least one vertex is always contained in
the set of optimizers.

Tangent and Normal Cone

In smooth nonlinear optimization functions are locally approximated by


tangent planes via the gradients of the functions. In the convex world
and also for feasible sets in smooth constrained optimization the kinks are
approximated by cones.

Definition 3.37 Let ∅ ̸= S ⊆ Rn . A direction d ∈ Rn is called tangent or


limiting direction to S in x ∈ S if there is a feasible sequence (xk )k∈N → x
with xk ∈ S and (tk ) ↘ 0 with limk→∞ xkt−xk
= d.
The tangent cone TS (x) to S in x is the set of all such directions.

Attention! For non convex S the tangent cone is i. g. non convex, as well.
It is a cone in the sense of satisfying d ∈ T ⇒ ∀λ > 0 λd ∈ T . △

sketch convex, kinky smooth and discrete examples


The origin 0 is always contained in TS (x) and for C convex and x ∈ C the
tangent cont TC (x) is convex, as well.
xk −x
Putting dk := tk → d yields the following reformulation.

Observation 3.38 Let x ∈ S ⊆ Rn .

d ∈ TS (x) ⇔ ∃(dk )k∈N → d, (tk )k∈N ↘ 0 ∀k ∈ N x + tk dk ∈ S.

Observation 3.39 Let x ∈ S ⊆ Rn . TS (x) is closed.

Proof: Let (dℓ )ℓ∈N → d with dℓ ∈ TS (x). Each dℓ is the limit of a feasible
sequence (xℓ,k ), (tℓ,k ) with k ∈ N. For each ℓ ∈ N there exists a kℓ with
CHAPTER 3. CONVEX ANALYSIS 50

xℓ,kℓ −x 1
∥ tℓ,kℓ − dℓ ∥ ≤ ℓ and tℓ,kℓ < tℓ−1,kℓ−1 . The feasible sequence (xℓ,kℓ ), (tℓ,kℓ ),
xℓ,kℓ −x
ℓ ∈ N satisfies limk→∞ tℓ,kℓ = d. □
For convex sets the tangent cone max be described in a rather intuitive form.
Observation 3.40 Let C ⊆ Rn be closed and convex and let x ∈ C. The
tangent cone to S in x is the closure of the cone generated by C − {x},

TC (x) = cone (C − {x}) = cl R+ (C − {x}) = cl {d = α(y − x) : y ∈ C, α ≥ 0}.

Proof: The second and third equality hold by Obs 3.4(iii) and Obs 3.9(i),
we only need to prove the first equality.
⊇: C − {x} ⊆ TC (x), because x + td ∈ C for all d ∈ C − {x} and all t ∈ [0, 1].
By Obs 3.39 TC (x) is closed, therefore cl (R+ (C − {x})) ⊆ TC (x).
xk −x
⊆: Let d ∈ TC (x) arise from a feasible sequence (xk ), (tk ), then tk ∈
R+ (C − {x}). Hence, the limit is contained in cl (R+ (C − {x})). □
Definition 3.41 Let C ⊂ Rn be convex. A direction s ∈ Rn is normal to C
in x ∈ C, if
⟨s, y − x⟩ ≤ 0 for all y ∈ C.
The normal cone NC (x) to C in x is the set of all such directions.

By Th 3.14 the normal cone con-


tains the difference vectors to all
points x′ that have x as their pro-
jection, x = pC (x′ ).
sketch convex set with tangent and normal cone

Observation 3.42 Let C ⊆ Rn


be convex and x ∈ C. The normal cone
NC (x) is polar to the tangent cone TC (x).

Proof: Recall that by Obs 3.40 TC (x) = cl R+ (C − {x}).


NC (x) ⊆ TC (x)◦ : Let s ∈ NC (x), then ⟨s, y − x⟩ ≤ 0 for all y ∈ C. Thus,
for all d = y − x ∈ C − {x} we have ⟨s, d⟩ ≤ 0. This implies ⟨s, d⟩ ≤ 0 for all
d ∈ R+ (C − {x}) and, by continuity, ⟨s, d⟩ ≤ 0 for all d ∈ cl R+ (C − {x}).
Hence, s ∈ TC (x)◦ .
TC (x)◦ ⊆ NC (x): Any s ∈ TC (x)◦ satisfies ⟨s, d⟩ ≤ 0 for all d ∈ cl R+ (C −
{x}). In particular ⟨s, d⟩ ≤ 0 for all d ∈ C − {x} or, equivalently, ⟨s, y − x⟩ ≤
0 for all y ∈ C. Hence, s ∈ NC (x). □
For convex cones K we have (K ◦ )◦ = cl K by 3.27. Now, the tangent cone is
already closed (Obs 3.39), so the following is a direct corollary.
Corollary 3.43 For convex sets C ⊆ Rn and x ∈ C the tangent cone TC (x)
is the polar cone of the normal cone NC (x).
CHAPTER 3. CONVEX ANALYSIS 51

3.2 Convex Functions

Definition 3.44
• The set Conv Rn denotes the set of proper convex functions f : Rn →
R̄ := R ∪ {∞} with f not identical to +∞.
• dom f := {x : f (x) < ∞}, the domain of f is the set of points with
finite function values (for f ∈ Conv Rn it is nonempty).
• The set Conv Rn denotes the set of functions f ∈ Conv Rn whose
epigraph epi f is closed.

For example, the smiley function is in


Conv R but not in Conv R.
sketch smiley

Theorem 3.45 (Jensen’s inequality) Let f ∈ Conv Rn . For k ∈ N, xi ∈


dom f , i = 1, . . . , k, α ∈ △k there holds
k
X k
X
f( αi xi ) ≤ αi f (xi ).
i=1 i=1

Pk−1 αi
Proof: Inductively by splitting off αk xk via ȳ = i=1 1−αk xi . □
Of particular importance are the affine functions

f (x) = ⟨s̄, x⟩ + r̄ with s̄ ∈ Rn , r̄ ∈ R.



Their epigraphs {( xr ) : f (x) ≤ r} are halfspaces {( xr ) :
 x
−1 , ( r ) ≤ −r̄}


with the property that the last component of the normal vector −1 is
strictly negative, i. e., it may not become horizontal.
Theorem 3.46 (affine minorants) Let f ∈ Conv Rn and x̄ ∈ rint dom f .
There exists an affine minorant that supports f in x̄, i. e.,

∃s ∈ Rn ∀x ∈ Rn f (x) ≥ f (x̄) + ⟨s, x − x̄⟩ .

Proof: W. l. o. g. assume aff dom f = Rn and let x̄ ∈ int dom f . Because


(x̄, f (x̄)) is on the boundary of the convex set epi f , Lem 3.22 asserts the
existence of a supporting hyperplane (s, ρ) in (x̄, f (x̄)) with s ∈ Rn , ρ ∈ R
so that
⟨s, x⟩ + ρr ≤ ⟨s, x̄⟩ + ρf (x̄) for all (x, r) ∈ epi f.
In particular this holds for all (x̄, r) with r ≥ f (x̄), thus ρ ≤ 0. By x̄ ∈
int dom f the choice x = x̄ + εs (for ε > 0 small enough) implies ρ < 0.
Divide the inequality by ρ < 0 to obtain
 
s
f (x̄) + , x − x̄ ≤ r for all (x, r) ∈ epi f.
|ρ|
CHAPTER 3. CONVEX ANALYSIS 52

s
Putting s̄ = |ρ| and r = f (x) proves the claim. □
Convex functions may have strange behavior along the (relative) boundary
of the domain but are quite tame inside. In order the show this, we need an
intermediate technical step.

Lemma 3.47 Let f ∈ Conv Rn and assume there are x̄ ∈ Rn , δ > 0,


m, M ∈ R so that

m ≤ f (x) ≤ M for all x ∈ B̊2δ (x̄) := {x : ∥x − x̄∥ < 2δ}.

Then f is Lipschitz continuous on B̊δ (x̄),

M −m
|f (y) − f (y ′ )| ≤ ∥y − y ′ ∥ ∀y, y ′ ∈ B̊δ (x̄).
δ

Proof: Choose y, y ′ ∈ B̊δ (x̄) and put

y′ − y
y ′′ := y ′ + δ ∈ B̊2δ (x̄).
∥y ′ − y∥

By construction y ′ ∈ [y ′′ , y], concretely

∥y ′ − y∥ δ
y′ = ′
y ′′ + y
δ + ∥y − y∥ δ + ∥y ′ − y∥
| {z } | {z }
=:α 1−α

Because f is convex and m ≤ f (x) ≤ M ,

f (y ′ ) − f (y) ≤ αf (y ′′ ) + (1 − α)f (y) − f (y)


∥y ′ − y∥
= [f (y ′′ ) − f (y)]
δ + ∥y ′ − y∥
1 ′
≤ ∥y − y∥(M − m).
δ
For the absolute value swap y ′ ↔ y. □
sketch B̊ for δ, 2δ, y ′′ , y ′ , y and f
This now allows to show that any convex function is Lipschitz continuous on
any compact subset of the relative interior of its domain.

Theorem 3.48 Let f ∈ Conv Rn and let S be a compact subset of rint dom f .
There exists an L = L(S) ≥ 0 with

|f (x) − f (x′ )| ≤ L∥x − x′ ∥ for all x, x′ ∈ S.

Proof: W. l. o. g. assume aff dim f = Rn (thus rint dom f = int dom f ) and
S ⊆ int dom f convex and compact (otherwise work on conv S ⊆ int dom f ).
CHAPTER 3. CONVEX ANALYSIS 53

The key lies in proving the following property,

∀x̄ ∈ S ∃δ = δ(x̄) > 0, L′ = L(x̄, δ) > 0


(B̊) B̊δ (x̄) ⊆ int dom F ∧
[∀y ′ , y ∈ B̊δ (x̄) |f (y) − f (y ′ )| ≤ L′ ∥y − y ′ ∥].

It gives rise to an open covering of S.


sketch covering
Compactness of S allows to extract a

finite covering (x1 , δ1 , L1 ), . . . , (xk , δk , Lk ), put L = max{L1 , . . . , Lk }.

Any straight line segment [x, x′ ] ⊆ S may now be decomposed into

[x, x′ ] = [x = y0 , y1 ] ∪ [y1 , y2 ] ∪ · · · ∪ [yℓ−1 , yℓ = x′ ] with


[yi−1 , yi ] ⊆ B̊δki (xki ), i = 1, . . . , ℓ and ∥x − x′ ∥ = ℓi=1 ∥yi − yi−1 ∥.
P


X (B̊) ℓ
X

⇒ |f (x) − f (x )| ≤ |f (yi ) − f (yi−1 )| ≤ Lki ∥yi − yi−1 ∥ ≤ L∥x − x′ ∥.
i=1 i=1

It remains to prove (B̊) by means of Lem 3.47. For x̄ ∈ int dom f choose,
similar to the proof of Th 3.10, a δ > 0 so that

B̊2δ (x̄) ⊆ conv {v0 , . . . , vn } for suitably chosen v0 , . . . , vn ∈ dom f.

Each y ∈ B̊2δ (x̄) may written in the from y = ni=0 αi vi for some α ∈ △n+1 .
P
Convexity of f implies
n
Th 3.45 X
f (y) ≤ αi f (vi ) ≤ max f (vi ) =: M.
i
i=0

A lower bound m on f over B̊2δ is obtained by Th 3.46 via the existence of


an affine minorant f (x̄) + ⟨s̄, x − x̄⟩ for f in x̄, i. e., ∃m ∀y ∈ B̊2δ (x̄) f (y) ≥
f (x̄) + ⟨s̄, y − x̄⟩ ≥ m. Hence, by Lem 3.47 f is Lipschitz continuous on B̊δ (x̄)
with L(x̄, δ) = M −m δ , which proves (B̊). □
So any convex function f is continuous on rint dom f . It may, however, have
discontinuities at points on the relative boundary of dom f .

The subdifferential of a convex function

For smooth functions f function value f (x) and gradient ∇f (x) describe a
tangent plane to the epigraph epi f at (x, f (x)) and for any direction p ∈ Rn
the directional derivative is obtained by ∇f (x)⊤ p.
CHAPTER 3. CONVEX ANALYSIS 54

For general convex functions there need not


exist a unique tangent plane. On the relative
interior of the domain, however, there is always
at least one and maybe even a set of possible
tangent planes and almost all of these also
correspond to affine minorants of f .
sketch smiley, kink and tangent planes
The subdifferential of f at a point describes the set of all supporting affine
minorants.
Definition 3.49 Let f ∈ Conv Rn and x ∈ Rn . The set
∂f (x) := {s ∈ Rn : f (y) ≥ f (x) + ⟨s, y − x⟩ for all y ∈ Rn }
is the subdifferential of f at x and any s ∈ ∂f (x) is a subgradient of f at x.
Remark
• f (y) ≥ f (x) + ⟨s, y − x⟩ is called subgradient inequality.
• It holds globally for all y ∈ Rn (compare this to the local nature of the
linear model for general smooth functions).
• For x ∈ rint dom f the subdifferential is nonempty by Th 3.46. In
particular, for finite valued convex f : Rn → R there holds ∂f (x) ̸= ∅
for all x ∈ Rn .
• For x ∈ rbd dom f it may happen that ∂f (x) = ∅. On the boundary of
dom f the tangent planes to epi f with horizontal normal vector never
give rise to subgradients, as they do not correspond to affine functions.
• Note that ∂f (x) is convex for all x ∈ Rn (why?).
• As will be proven below, the boundary of ∂f (x) directly corresponds to
the directional derivative f ′ (x; p) = limt↘0 f (x+tp)−f
t
(x)
(if it exists).

For convex functions the directional derivative may also be characterized as


the infimum over all difference quotients.
Observation 3.50 For convex f : Rn → R and x ∈ Rn ,
 f (x + tp) − f (x)
f ′ (x; p) = inf :t>0 .
t
Proof: It suffices to prove that the quotient is monotonically nondecreasing
in t. Indeed, let t1 > t2 > 0, then
t2 t2 t2 t2
f (x + t2 p) = f ((1 − t1 )x + t1 (x + t1 p)) ≤ t1 f (x + t1 p) + (1 − t1 )f (x)

f (x + t2 p) − f (x) f (x + t1 p) − f (x)
⇒ ≤ □
t2 t1
Now consider, for some fixed x ∈ dom f , the directional derivative f ′ (x; ·) as
a function of the direction. Its epigraph turns out to be a cone emanating
from the origin, f ′ (x; 0) = 0. For ease of exposition we consider this for finite
valued functions f , only.
CHAPTER 3. CONVEX ANALYSIS 55

Observation 3.51 For f : Rn → R convex and fixed x ∈ Rn the directional


derivative f ′ (x, ·) is convex, finite, and positively homogenous (f ′ (x; λp) =
λf ′ (x; p) for λ > 0).

Proof: Let p1 , p2 ∈ Rn , α1 , α2 ≥ 0 with α1 + α2 = 1. Because f is convex,


there holds for t > 0

f (x + t(α1 p1 + α2 p2 )) − f (x) =
= f (α1 (x + tp1 ) + α2 (x + tp2 )) − α1 f (x) − α2 f (x)
≤ α1 [f (x + tp1 ) − f (x)] + α2 [f (x + tp2 ) − f (x)].

Divide by t and let t ↘ 0 to obtain

f ′ (x; α1 p1 + α2 p2 ) ≤ α1 f ′ (x; p1 ) + α2 f ′ (x; p2 ).

Positive homogeneity follows directly from the definition,

f (x + λtp) − f (x) τ =λt f (x + τ p) − f (x)


f ′ (x; λp) = lim λ = λ lim .
t↘0 λt τ ↘0 τ

Finiteness follows by x ∈ int dom f with f being locally Lipschitz continuous


by Th 3.48,

for ∥p∥ = 1 : ∃ε > 0, L > 0 with |f (x + tp) − f (x)| ≤ Lt for all 0 ≤ t ≤ ε.

Thus, for ∥p∥ = 1, f ′ (x; p) ≤ L and so for ∥p∥ > 0

f ′ (x; p) = ∥p∥f ′ (x; ∥p∥


p
) ≤ L∥p∥.

Definition 3.52 A positive homogenous function f ∈ Conv Rn is called


sublinear.

Note, the epigraph of a sublinear function is a [?]cone. A particular case of


sublinear functions are support functions.

Definition 3.53 Let S ⊆ Rn (not nec. convex). The support function of S


is
σS (x) := sup{⟨s, x⟩ : s ∈ S}.

Exercise Show that support functions are sublinear and closed (the epigraph
is a closed set).


CHAPTER 3. CONVEX ANALYSIS 56

x 
Geometrically, epi f ′ (x; ·) = Tepi f ( f (x) ),
in words, the epigraph of the directional
derivative at x is a the same time the tan-
gent cone to the epigraph of the function
f at x. The next result shows that
x 
s
∂f (x) = {s : ( −1 ) ∈ [Tepi f ( f (x) )]◦ }.
|D {z E }
s
 p
⇔ ( −1 ), f ′ (x;p) ≤0 ∀p∈Rn

It also proves that f ′ (x; p) = σ∂f (x) (p),


i. e., the directional derivative at x is the
support function of the subdifferential at
x.
sketch epi f , Tepi f (x), [Tepi f (x)]◦ , ∂f (x)

Theorem 3.54 Let f : Rn → R be convex and x ∈ Rn . Then

∂f (x) = {s ∈ Rn : ⟨s, p⟩ ≤ f ′ (x; p) ∀p ∈ Rn }.

In particular, for finite valued convex functions the subdifferential ∂f (x) is


compact and

f ′ (x; p) = max{⟨s; p⟩ : s ∈ ∂f (x)} = σ∂f (x) (p).

f (x+tp)−f (x)
Proof: Recall, by Obs 3.50 f ′ (x; p) = inf t>0 t .
subg-ineq
f (x+tp)−f (x) f (x)+t⟨s,p⟩−f (x)
⊆: s ∈ ∂f (x) ⇒ ∀t > 0 t ≥ t = ⟨s, p⟩.
f (x+tp)−f (x)
⊇: ∀p ∈ Rn ⟨s, p⟩ ≤ f ′ (x; p) ≤ t for all t > 0, thus

f (x) + t ⟨s, p⟩ ≤ f (x + tp) for all t > 0, p ∈ Rn .

For y ∈ Rn put p = y − x, t = 1 to see f (x) + ⟨s, y − x⟩ ≤ f (y), so s ∈ ∂f (x).


Compactness of ∂f (x) now follows from the finiteness and positive homo-
geneity of the directional derivative certified by Obs 3.51. □
With this we are now able to formulate and understand the optimality
conditions in nonsmooth convex optimization.

3.3 Convex optimality conditions

Theorem 3.55 (unconstrained convex optimality conditions)


For convex f : Rn → R and x ∈ Rn the following are equivalent:
(i) x ∈ Argminf := {x : f (y) ≥ f (x) for all y ∈ Rn },
(ii) 0 ∈ ∂f (x),
CHAPTER 3. CONVEX ANALYSIS 57

(iii) f ′ (x; p) ≥ 0 for all p ∈ Rn .

Proof: By definition of the subgradient there holds


(i) (ii)
∀y ∈ Rn f (y) ≥ f (x) = f (x) + ⟨0, y − x⟩ ⇔ 0 ∈ ∂f (x)
(ii) (iii)
and by Th 3.54, 0 ∈ ∂f (x) ⇔ ∀p ∈ R f ′ (x; p) ≥ 0 = ⟨0, p⟩. □
Note that these conditions are necessary and sufficient, require first order
information only and generalize the smooth first order conditions of Th 2.5.
Of course, convex functions need not have a minimum even if they are
bounded from below, just consider ex for x ∈ R. For proper convex functions
with infinite values, the results directly translate for x ∈ rint dom f and
p ∈ aff dom f , but the subdifferential will not be bounded in directional
components orthogonal to aff dom f . For closed convex functions (functions
with closed epigraph) and x ∈ rbd dom f , the subdifferential will also be
unbounded in some directions within p ∈ aff dom f . Geometrically this is
nicely visualized by the relation between the subdifferential and the normal
cone to the epigraph. We will, however, refrain from a detailed exposition.
For constrained optimization we currently restrict considerations to the
abstract setting of a general closed convex feasible set C,

min f (x) f : Rn → R convex,


s. t. x ∈ C, C ⊆ Rn convex and closed.

sketch 3D epi f over ground set C with x∗ ∈ bd C

A convenient way to include a convex ground set in an unconstrained formu-


lation is the indicator function,

ıC : Rn → R̄ 
0 x ∈ C,
x 7→ ıC (x) =
∞ x∈/ C.

With (f + ıC )(x) = f (x) + ıC (x) for x ∈ Rn the constrained problem may


be reformulated in unconstrained form,

inf f (x) = infn (f + ıC )(x).


x∈C x∈R
CHAPTER 3. CONVEX ANALYSIS 58

Because ıC also has infinite values, some properties of its subdifferential need
to be proven explicitly, but this is a good exercise.
Observation 3.56 Let C ⊆ Rn be closed and convex. For the indicator
function ıC , any x ∈ C and convex f : Rn → R there hold
(i) ∂ıC (x) = NC (x),
(ii) ∂(f + ıC )(x) = ∂f (x) + NC (x).

Proof: See the Exercises. □


In points on the boundary of C the directional derivative of ıC may turn out
not to be a closed function with respect to the direction (examples?) and
regularity issues arise.
None the less the optimality conditions for the constrained case are almost
the same as that of the unconstrained case.
Theorem 3.57 (set constrained convex optimality conditions)
Let f : Rn → R be convex, let C ⊆ Rn be convex and closed.
For x ∈ C the following are equivalent,
(i) x ∈ Argminx∈C f := {x ∈ C : f (x) ≤ f (y) for all y ∈ C},
(ii) f ′ (x; p) ≥ 0 for all p ∈ TC (x), [feasible directions are non improving]
(iii) 0 ∈ ∂f (x) + NC (x). [the negative gradient of a supporting
minorant to f lies in the cone of normal directions].

Proof: By inf x∈C f (x) = inf x∈Rn (f + ıC )(x) and Obs 3.56, (i)⇔(iii) follows
from the subgradient inequality for (f + ıC ) exactly in the same way as in
Th 3.55.
(i)⇒(ii): For y ∈ C Obs 3.40 asserts p = y − x ∈ TC (x) and by minimality
f (x + tp) ≥ f (x) for all t ∈ [0, 1], thus f ′ (x; p) = limt↘0 f (x+tp)−f
t
(x)
≥ 0. For
arbitrary p ∈ TC (x) there are y ∈ C and p = αk (y − x) with pk → p. By
k k k

Obs 3.51 f ′ (x; ·) is continuous in p, so f ′ (x; pk ) ≥ 0 ⇒ f ′ (x; p) ≥ 0.


(ii)⇒(i): By Obs 3.40, y ∈ C ⇒ (y − x) ∈ TC (x) and Obs 3.50 with t = 1
yields f (y) − f (x) ≥ f ′ (x; y − x) ≥ 0. □
In practice C is not known explicitly but described by convex inequality
constraints gi (x) ≤ 0, i ∈ I, and affine equality constraints Ax = b,
\
C= {x : gi (x) ≤ 0} ∩ {x : Ax = b}.
i∈I

In this setting, the decisive question is whether it is possible to derive good


algebraic descriptions for TC (x) and NC (x) based on the information provided
for the constraints. This will be discussed in the next chapter.
If C has particularly simple structure so that the projection pC (·) can be
computed efficiently, a surprisingly simple optimization method is available
for nonsmooth convex optimization.
CHAPTER 3. CONVEX ANALYSIS 59

3.4 The subgradient method

Consider the optimization problem

min f (x) f : Rn → R convex, Lipschitz on a neighborhood of C,


s. t. x ∈ C C ⊆ Rn convex, closed, of simple structure.

Let L be the Lipschitz constant of f and assume that the projection pC (·)
can be computed efficiently (e. g., C = Rn , a friendly cone C = Rn+ or a box
like C = [0, 1]n ).
f is assumed to be given via a first order oracle. Its evaluation for some
x ∈ C yields

f (x) and g(x) ∈ ∂f (x) [no matter, which].

Note that the Lipschitz property implies ∥g(x)∥ ≤ L.


Throughout this section we assume some optimal solution x∗ ∈ Argminx∈C f
exists. Then

∀x ∈ C f (x∗ ) ≥ f (x) + ⟨g(x), x∗ − x⟩


and 0 ≤ f (x) − f (x∗ ) ≤ ⟨g(x), x − x∗ ⟩ . sketch x, g(x) and x∗

In words, −g(x) points “into the direction” of x∗ in the sense, that going a
sufficiently small step into this direction allows to get closer to x∗ . Unfortu-
nately, ∥g(x)∥ does not provide a good indicator for a suitable step length in
the nonsmooth case and line search techniques may lead to false convergence.
The surprisingly simple way out is to fix the step lengths in advance.

Algorithm 3.58 (Subgradient Algorithm)


0. Choose
P x0 ∈ C and a sequence of step lengths (hk )k∈N with hk ↘ 0 and
k∈N k → ∞, set k ← 0.
h
1. Call the oracle for xk , giving f (xk ) and g(xk ). If g(xk ) = 0, stop.
g(xk )
2. xk+1 = pC (xk − hk ∥g(x k )∥
)
3. Set k ← k + 1, goto 1 (or stop for k large enough).

Theorem 3.59 Let f be convex and Lipschitz continuous on a neighborhood


of C with constant L, let x∗ ∈ C be a minimizer and let x0 ∈ C with
r0 = ∥x0 − x∗ ∥. Then

r02 + ki=0 h2i


P

min f (xi ) − f (x ) ≤ L .
2 ki=0 hi
P
i=1,...,k
CHAPTER 3. CONVEX ANALYSIS 60

Proof: Let ri = ∥xi − x∗ ∥. By Obs 3.15 ∥pC (x) − pC (y)∥ ≤ ∥x − y∥, so


g(xi )
2
ri+1 = ∥pC (xi − hi ∥g(xi )∥
) − x∗ ∥
g(xi ) 2
≤ ∥xi − x∗ − hi ∥g(x i )∥

D E
g(xi ) ∗
= ri2 − 2hi ∥g(x i )∥
, x i − x + h2i ,

D E k
g(xi )
X
thus ri2 + h2i ≥ ri+1
2
+ 2hi ∥g(xi )∥ , xi − x∗ |
i=0
k D k E
g(xi )
X X

r02 + h2i ≥ rk+1
2
+ 2hi ∥g(x i )∥
, xi − x
i=0
|{z} i=0 |{z} |
≥0
{z }
≥0 ≥0
D k
EX
g(xi ) ∗
≥ min ∥g(xi )∥ , xi −x 2hi
i=0,...,k
i=0

Let the minimum be attained for ı̄ and apply the subgradient inequality for
g(xı̄ ),

r02 + ki=0 h2i


P
subg-ineq
∗ ∗
f (xı̄ ) − f (x ) ≤ ⟨g(xı̄ ), xı̄ − x ⟩ ≤ ∥g(xı̄ )∥ · Pk .
i=0 2hi
| {z }
≤L


If an upper bound R > r0 is known, a good choice is
R
hk = √ .
k+1
r2 +R2 ln k+1
With this the factor after L is roughly 0 2R√k+1 , so convergence is quite
slow. The typical behavior in practice is to obtain good decrease initially
and then progress becomes extremely slow. Unfortunately, it can be proven
that over all convex functions the worst case behavior of this algorithm is
best possible. Smooth cases and smoothing techniques may, however, bring
along considerable improvements. In fact, for huge dimensional problems
like in data science rather simple variations of this method belong currently
to the most efficient approaches. For smaller dimensions other methods like
bundle methods that form models of the function by collecting supporting
hyperplanes, may exhibit significantly better practical performance.
Chapter 4

Optimality Conditions for


Smooth Constrained
Optimization

The starting point will be necessary optimality conditions for local optima in
the case of minimizing a smooth objective over a feasible set of general form,

min f (x) f : Rn → R sufficiently smooth (not nec. convex!),


s. t. x ∈ X X ⊆ Rn (no restrictions otherwise).

The result corresponding to Th 3.57 reads as follows.

Theorem 4.1 (first order necessary optimality conditions) Let ∅ = ̸


X ⊆ Rn , f : Rn → R be continuously differentiable on a neighborhood of X .
For any local minimum x∗ of min f (x)s. t.x ∈ X there holds

∇f (x∗ )⊤ p ≥ 0 for all p ∈ TX (x∗ ).

Proof: Let p ∈ TX (x∗ ) arising as limiting direction of X ∋ (xk )k∈N → x∗ ,



(tk )k∈N ↘ 0, p = limk→∞ xkt−x
k
. Because x∗ is a local minimum, for any k
large enough there holds for some suitable θk ∈ (0, 1)
Taylor
0 ≤ f (xk ) − f (x∗ ) = f (x∗ ) + ∇f (x∗ + θk (xk − x∗ ))⊤ (xk − x∗ ) − f (x∗ ).

Divide by tk to obtain ∇f (x∗ + θk (xk − x∗ )⊤ xkt−x
k
≥ 0. □
| {z } | {z }
→∇f (x∗ ) →p

In words, in local minima the directional derivative in any feasible directions


does not indicate decrease. Like in the unconstrained case Th 2.1 this
condition is necessary but not sufficient.

61
CHAPTER 4. CONSTRAINED OPTIMALITY CONDITIONS 62

Definition 4.2 A feasible point x ∈ X is a stationary point (of the mini-


mization problem) if ∇f (x∗ )⊤ p ≥ 0 holds for all p ∈ TX (x∗ ) (or, equiva-
lently, if −∇f (x∗ ) ∈ [TX (x∗ )]◦ ).

Stationary points are good candidates for local optima. For stationary points
the negative gradient lies in “the normal cone” to the tangent cone of the
feasible set, so all descent directions lead out of the feasible set.
For general sets X , the tangent cone TX and its polar cone are hard to
determine. In practice, X typically arises as the intersection of level sets of
inequality constraints and of solution sets to equality constrains like in
min f (x) f : Rn → R suff. smooth
s. t. gi (x) ≤ 0 i ∈ I = {1, . . . , nI }, gi : Rn → R suff. smooth,
(P)
hj (x) = 0, j ∈ E = {1, . . . , nE }, hj : Rn → R suff. smooth,
x ∈ Rn .
Throughout this chapter we consider problems of this form with f, gi , hi
given by first order (sometimes second order) oracles. From now on the
feasible set is

X = X (P) = {x ∈ Rn : gi (x) ≤ 0, i ∈ I, hj (x) = 0, j ∈ E}.

How to describe TX (x) for x ∈ X ?


To get started, it helps to view X as the intersection of level sets of the form
Sr (g) = {x ∈ Rn : g(x) ≤ r} for the constraint functions,
\ \
X = S0 (gi ) ∩ (S0 (hj ) ∩ S0 (−hj )).
i∈I j∈E

For the single level sets the tangent cone and its polar cone are typically easy
to determine in a current candidate point using the information available
from first order oracle evaluations (function value and gradient).

sketch 3D epigraph with 2D level set for gi with ∇gi ; 2D solution set to hj (x) = 0 with ∇hj

x ∈ int So (gi ) x ∈ bd S0 (gi ) x ∈ {x : hj (x) = 0}


[gi inactive] [gi active] [always active]
T =R n T = {p : ∇gi (x) p ≤ 0} T = {p : ∇hj (x)⊤ p = 0}

T ◦ = {0} T ◦ = {λ∇gi (x) : λ ≥ 0} T ◦ = {µ∇hj (x) : µ ∈ R}


CHAPTER 4. CONSTRAINED OPTIMALITY CONDITIONS 63

Note, however, there is an important


special case, where these formulas are
as wrong as they can get. Consider,
e. g., the tangent cone of the level set
to g = (x − x̄)2 (or h of this form) for
the point x̄, then

S0 (g) = {x̄}, T = 0, T ◦ = Rn .
sketch 3D epi g with tangent plane and X

The hope is that most of the the time TX (x) is obtained as the intersection
of the tangent cones of the single constraints.
For constraints that are active in x, nonzero
gradients describe supporting hyperplanes to
the respective level sets and in “regular cases”
the corresponding halfspaces are the respective
tangent cones shifted to x. The tangent cone
to X in x is contained in the intersection of
the tangent cones to the single level sets.
However, even in cases where the tangent
cones of the single constraints are described
correctly this may still go wrong. “Regularity
conditions” will be needed to exclude cases of
misdescription.
sketch 2D two ineqs and two circles

Definition 4.3 Let x ∈ X (P) be a feasible point of (P).



active (in x), if gi (x) = 0,
• Constraint gi (x) ≤ 0 with i ∈ I is
inactive, if gi (x) < 0.
• The active set A(x) = {i ∈ I : gi (x) = 0} denotes the index set of
active inequalities.
• The linearized tangent cone of X in x is

TP (x) = {p ∈ Rn : ∇gi (x)⊤ p ≤ 0, i ∈ A(x), ∇hj (x)⊤ p = 0, j ∈ E}.

Remark 4.4
• TP (x) is a polyhedral (convex) cone and (by Farkas, see the proof of
Th 4.8 below)
A(x)
X X
[TP (x)]◦ = { λi ∇gi (x) + µj ∇hj (x) : λ ∈ R+ , µ ∈ RE }.
i∈A(x) j∈E

• To show stationarity of x the hope is to prove −∇f (x) ∈ [TX (x)]◦ by


A(x)
finding λ ∈ R+ , µ ∈ RE so that
X X
−∇f (x) = λi ∇gi (x) + µj ∇hj (x).
i∈A(x) j∈E
CHAPTER 4. CONSTRAINED OPTIMALITY CONDITIONS 64

• Note, however, that TP (x) and [TP (x)]◦ do not so much depend on X
but rather on the description of X by the gi and hj of (P).
• As observed above, in general, TP (x) ̸= TX (x). Yet another example is
the following simple optimization problem.

max x1
s. t. g1 (x) := x31 + x2 ≤ 0
g1 (x) := −x2 ≤0

The optimal solution is x∗ = (0, 0).


sketch g1 , g2 , X

TX (x∗ ) = {λ −1

0 : λ ≥ 0}
 ∗ 2 ⊤
TP (x∗ ) = {p : 3(x11 ) 0 ⊤p ≤ 0} = {( p01 ) : p1 ∈ R}

p ≤ 0, −1

• Similar problems may occur with equality constraints (examples?).

At least the linearized tangent cone TP (x) is never too small.

Lemma 4.5 For x ∈ X (P) there holds TX (x) ⊆ TP (x) and therefore also
[TX (x)]◦ ⊇ [TP (x)]◦ .

Proof: Let p ∈ TX (x) arise from the feasible sequence X ∋ (xk ) → x, tk ↘ 0


via p = limk→∞ xkt−x k
. We only prove ∇gi (x)⊤ p ≤ 0 for i ∈ A(x), the proof
for hi , i ∈ E, is similar.
By xk ∈ X and gi (x) = 0 there are suitable θk ∈ (0, 1) so that
Taylor
0 ≥ gi (xk ) − gi (x) = ∇gi (x + θk (xk − x))⊤ (xk − x),
xk − x
thus ∇gi (x + θk (xk − x))⊤ ≤ 0. □
| {z } tk
→∇gi (x) | {z }
→p

Note, because of [TX (x)]◦


⊇ [TP (x)]◦ it may happen that −∇f (x) ∈ [TX (x)]◦
/ [TP (x)]◦ . So TP (x) may not suffice to recognize stationarity.
but ∈
Reliable statements can be made only subject to additional assumptions,
called constraint qualifications, on the relation between [TX (x)]◦ and [TP (x)]◦
for single points x of interest or for the entire problem (P). In their weakest
form they require the two polar cones to be equal for the current candidate.

Definition 4.6 A feasible point x ∈ X satisfies the constraint qualification


of Guignard (in short, the Guignard constraint qualification or (GCQ)) if

[TX (x)]◦ = [TP (x)]◦ .


CHAPTER 4. CONSTRAINED OPTIMALITY CONDITIONS 65

Of course this basic relation is often too difficult to check. Later, stronger
conditions will be introduced that ensure this equality and are typically
easier to verify. For the time being (GCQ) is exactly the right condition for
our purposes.
The basic optimality conditions for smooth constrained optimization problems
of the form (P) are not formulated with respect to these cones directly but
build on the Lagrange function approach for including constraints in the
objective by means of Lagrange multipliers.
Definition 4.7 For the constrained optimization problem (P) define the
Lagrangian L : Rn × RI × RE → R by
X X
L(x, λ, µ) = f (x) + λi gi (x) + µj hj (x).
i∈I j∈E

Let f, gi , hj ∈ C 1 (Rn ) for i ∈ I, j ∈ E.


(i) The Karusch-Kuhn-Tucker (KKT) conditions for stationarity read

∇x L(x, λ, µ) = 0 
gi (x) ≤ 0 i ∈ I
feasibility for (P)
(KKT) hj (x) = 0 j ∈ E
P λi ≥ 0 i ∈ I
λ g
i∈I i i (x) =0 complementarity

In this, ∇x refers to taking the gradient with respect to x,


X X
∇x L(x, λ, µ) = ∇f (x) + λi ∇gi (x) + µj hj (x).
i∈I j∈E

(ii) Each point (x∗ , λ∗ , µ∗ ) satisfying the KKT conditions is called a KKT-
point (of (P)) and λ∗ , µ∗ are called Lagrange multipliers (of x∗ ).

The KKT conditions only require the information (function values and
gradients) available via the first order oracles of the functions involved. They
form the basis of almost all algorithmic approaches for searching for points
satisfying the first order necessary conditions. It is therefore important
to understand the relation between KKT-points and first order necessary
conditions in depth.
Suppose (x̄, λ̄, µ̄) is a KKT-point. Then x̄ ∈ X is feasible and by comple-
mentarity λ̄i = 0 for i ∈ / A(x̄). So, by the arguments above, the existence of
Lagrange multipliers implies
X X
−∇f (x̄) = λ̄i ∇gi (x) + µ̄j hj (x) ∈ [TP (x̄)]◦ ⊆ [TX (x̄)]◦ .
i∈A(x) j∈E

Thus, x̄ is a stationary point satisfying ∇f (x̄)⊤ p ≥ 0 for all p ∈ TX (x̄).


CHAPTER 4. CONSTRAINED OPTIMALITY CONDITIONS 66

On the other hand, for a given stationary point x̄ (i. e., a point satisfying the
necessary optimality conditions) it may happen that there is no solution of
the KKT-system, if the algebraically derived (linearized) description [TP (x̄)]◦
falls short of spanning the full geometric [TX (x̄)]◦ . If no Lagrange multipliers
exist, it is almost impossible to algorithmically recognize a given stationary
point as such. If, however, certain regularity conditions at x̄ ensure that
(GCQ) is satisfied, then Lagrange multipliers exist and the stationarity
property of x̄ can be recognized algorithmically. As pointed out before,
this is a direct consequence of the polarity relation of the cones due to the
Farkas-Lemma, but because of its importance the proof is given explicitly.
Theorem 4.8 (KKT conditions under (GCQ)) Let x∗ be a local mini-
mum of (P) that satisfies (GCQ). There exist Lagrange multipliers λ∗ ∈ RI+
and µ∗ ∈ RE so that (x∗ , λ∗ , µ∗ ) is a KKT-point.

Proof: Th 4.1 asserts that ∇f (x∗ )⊤ p ≥ 0 for all p ∈ TX (x∗ ). Equivalently,


(GCQ)
−∇f (x∗ ) ∈ [TX (x∗ )]◦ = [TP (x∗ )]◦ . It remains to show that for x ∈ X
A(x)
X X
[TP (x)]◦ = { λi ∇gi (x) + µj ∇hj (x) : λ ∈ R+ , µ ∈ RE },
i∈A(x) j∈E

A(x∗ )
because extending any λ∗ ∈ R+ , µ∗ ∈ RE that generate −∇f (x∗ ) by
putting λ∗i = 0 for i ∈ I \ A(x)∗ to an appropriate λ∗ ∈ RI+ yields a
KKT-Point (x∗ , λ∗ , µ∗ ) as can be checked by direct inspection.
⊇: follows by linearity of the inner product and direct computation, because
for p ∈ TP (x) there holds ⟨∇gi (x), p⟩ ≤ 0 for i ∈ A(x) and ⟨∇hj (x), p⟩ = 0
for i ∈ E.
⊆: First note that TP (x) = {p : [G, H, −H]⊤ p ≤ 0} with G = [∇gi (x)]i∈A(x) ,
H = [∇hj (x)]j∈E . Let d ∈ [TP (x)]◦ , then d⊤ p ≤ 0 for all p ∈ TP (x), i. e.,

the system d⊤ p > 0, [G, H, −H]⊤ p ≤ 0 has no solution.

Now Farkas (Lem 3.28) ensures that


ν  ν 
the system [G, H, −H] ν′′′ = d, ν′
′′
≥0 has a solution.
ν ν

Put λi = νi , i ∈ A(x∗ ), and µi = νi′ − νi′′ , i ∈ E to see that d is contained in


the right hand side. □
A natural, geometrically intuitive condition is to require the two tangent
cones to be equal.
Definition 4.9 A feasible point x ∈ X satisfies the constraint qualification
of Abadie (in short, the Abadie constraint qualification or (ACQ)) if

TX (x) = TP (x).
CHAPTER 4. CONSTRAINED OPTIMALITY CONDITIONS 67

Corollary 4.10 (KKT conditions under (ACQ)) Let x∗ be a local min-


imum of (P) that satisfies (ACQ). There exist Lagrange multipliers λ∗ ∈ RI+
and µ∗ ∈ RE so that (x∗ , λ∗ , µ∗ ) is a KKT-point.

Proof: (ACQ) directly implies (GCQ), so the result follows from Th 4.8 □
Because the linearized tangent cone TP is always convex it is not difficult to
come up with examples where (GCQ) holds but (ACQ) does not, so (GCQ)
is indeed more general.
Next we explore conditions that ensure (ACQ). By Lem 4.5, TX(x) ⊆ TP(x) always holds, therefore the important part is to establish under which conditions every direction in TP is also contained in TX, i. e., for every p ∈ TP(x) it must be possible to construct a feasible sequence X ∋ (xk)_{k∈N} → x and tk ↘ 0 so that p = lim_{k→∞} (xk − x)/tk is its limiting direction. We have seen already that this will not always be possible.
It is possible, however, if we can prove the existence of a “feasible” curve
in X that goes through x and has p as its tangent direction in x. The
mathematical tool for showing the existence of curves in the intersection of
nonlinear equations is the implicit function theorem.
Theorem (Implicit Function Theorem) Let F : Rk × Rm → Rk and
ȳ ∈ Rk satisfy
(i) F (ȳ, 0) = 0,
(ii) F is continuously differentiable in a neighborhood U (ȳ, 0),
(iii) [JF (ȳ, t)]y is regular in (ȳ, 0) (the y-columns of the Jacobian).
The function y(·) : Rm → Rk implicitly defined by F (y(t), t) = 0 and y(0) = ȳ
is well defined and it is continuously differentiable in a neighborhood of the
origin. In particular there holds
Jy(·)(0) = −[JF(ȳ, 0)]_y^{−1} [JF(ȳ, 0)]_t.

(For a proof see, e. g., Heuser, “Lehrbuch der Analysis”, Part 2.)
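The derivative formula is easy to check numerically. The following minimal Python sketch (with a made-up one-dimensional F, so k = m = 1) solves F(y, t) = 0 by Newton's method in y and compares a finite-difference slope of y(·) at 0 with the value −[JF]_y^{−1}[JF]_t predicted by the theorem; all names in it are illustrative only.

```python
import numpy as np

# Made-up function F(y, t) = y + y^2 - t: F(0, 0) = 0 and [JF]_y = 1 + 2y
# is regular at (0, 0). The theorem predicts
# y'(0) = -[JF(0,0)]_y^{-1} [JF(0,0)]_t = -(1)^{-1} * (-1) = 1.

def y_of_t(t, iters=50):
    """Solve F(y, t) = 0 for y by Newton's method in y."""
    y = 0.0
    for _ in range(iters):
        y -= (y + y**2 - t) / (1.0 + 2.0 * y)
    return y

h = 1e-6
print((y_of_t(h) - y_of_t(-h)) / (2 * h))  # approx 1.0 as predicted
```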

[Sketch: the (t = 0)-plane and two parabola level sets, intersecting full-dimensionally / 0-dimensionally.]

For a feasible point x̄ ∈ X in which the gradients of the equality constraints


are linearly independent and for a direction p ∈ TP (x̄) that points strictly
into the level sets of the active inequality constraints, the following technical
lemma establishes the existence of a feasible curve with p as tangent direction.

Lemma 4.11 Let x̄ ∈ X be feasible with linearly independent ∇hj(x̄), j ∈ E, and let p ∈ Rn satisfy ∇hj(x̄)⊤p = 0, j ∈ E, as well as ∇gi(x̄)⊤p < 0 for i ∈ A(x̄). Then there exist ε > 0 and a curve x(·) : (−ε, ε) → Rn with
(i) x(·) is continuously differentiable on (−ε, ε),
(ii) x(0) = x̄, [starts in x̄]
(iii) x(t) ∈ X for t ∈ [0, ε), [runs in X]
(iv) x′(0) = p. [has tangent p in x̄]

Proof: The guiding idea is to form x(t) = x̄ + tp + Hy(t) with y(·) a correction term for the equality constraints. Put k = |E| and define F : R^k × R → R^k via

Fj(y, t) = hj(x̄ + tp + Hy), j ∈ E, where H = [∇hj(x̄)]_{j∈E}.

By assumption H has full column rank. With this,
• F(0, 0) = 0, because hj(x̄) = 0 for j ∈ E,
• F is cont. diff., because the hj as well as (y, t) ↦ x̄ + tp + Hy are,
• ∇yFj(y, t) = H⊤∇hj(x̄ + tp + Hy) [recall, the gradient is a column], therefore [JF(0, 0)]_y = H⊤H; it is positive definite, hence regular.
By the Implicit Function Theorem there exist ε̂ > 0 and a continuously differentiable y(·) : (−ε̂, ε̂) → R^k with
• F(y(t), t) = 0 for t ∈ (−ε̂, ε̂),
• y(0) = 0,
• y′(0) = Jy(·)(0) = −[JF(0, 0)]_y^{−1}[JF(0, 0)]_t = −(H⊤H)^{−1}H⊤p = 0, because ∇hj(x̄)⊤p = 0 for j ∈ E.

Put x(·) : (−ε̂, ε̂) → Rn, t ↦ x(t) = x̄ + tp + Hy(t), then
• x(·) is continuously differentiable,
• x(0) = x̄,
• x′(0) = p (because y′(0) = 0),
• ∃ε ∈ (0, ε̂) so that x(t) ∈ X for t ∈ [0, ε). Indeed, for
  j ∈ E: hj(x(t)) = Fj(y(t), t) = 0 for t ∈ (−ε, ε);
  i ∈ A(x̄): gi(x(t)) = gi(x̄) + ∇gi(x̄ + θ(x(t) − x̄))⊤(x(t) − x̄) < 0 for small t > 0, because the gradient factor tends to ∇gi(x̄), x(t) − x̄ behaves like tp, and ∇gi(x̄)⊤p < 0;
  i ∈ I \ A(x̄): gi(x(t)) < 0, because gi(x̄) < 0 and gi, x(·) are continuous. □


This result motivates the following regularity condition.

Definition 4.12 A feasible x̄ ∈ X satisfies the constraint qualification of


Mangasarian-Fromovitz (MFCQ) if

(i) the gradients ∇hj (x̄), j ∈ E, are linearly independent,


(ii) ∃p̄ ∈ Rn with ∇hj (x̄)⊤ p̄ = 0, j ∈ E, and ∇gi (x̄)⊤ p̄ < 0, i ∈ A(x̄).
Theorem 4.13 (KKT conditions under (MFCQ)) Let x∗ be a local minimum of (P) that satisfies (MFCQ). There exist Lagrange multipliers λ∗ ∈ RI+ and µ∗ ∈ RE so that (x∗, λ∗, µ∗) is a KKT-point.

Proof: By Cor 4.10 and Lem 4.5 (TX(x∗) ⊆ TP(x∗)) it suffices to prove TP(x∗) ⊆ TX(x∗) for x∗ satisfying (MFCQ).
Let p ∈ TP(x∗), then ∇hj(x∗)⊤p = 0, j ∈ E.
First suppose ∇gi(x∗)⊤p < 0 holds for all i ∈ A(x∗). Then p ∈ TX(x∗) follows from Lem 4.11 by choosing xk = x(ε/k), tk = ε/k, which yields p = lim_{k→∞} (xk − x∗)/tk.
If ∇gi(x∗)⊤p = 0 for some i ∈ A(x∗), then by (MFCQ) there exists p̄ with ∇gi(x∗)⊤p̄ < 0, i ∈ A(x∗). Put pk = (1 − 1/k)p + (1/k)p̄, then pk ∈ TX(x∗) for k ∈ N by the previous step, pk → p, and because TX(x∗) is closed (Obs 3.39) it follows p ∈ TX(x∗). □

Thus (MFCQ) ⇒ (ACQ), but in general (ACQ) ̸⇒ (MFCQ).

[Sketch: x2 ≤ x1², x2 ≥ 0 at the origin.]

The arguably most popular regularity condition just requires checking the linear independence of the gradients of the active constraints, because there are reasonably efficient numerical linear algebra routines for doing so.
Definition 4.14 A feasible x̄ ∈ X satisfies the linear independence con-
straint qualification (LICQ) if the gradients ∇gi (x̄), i ∈ A(x̄), and ∇hj (x̄),
j ∈ E, are linearly independent.
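Numerically, the (LICQ) test amounts to a rank computation on the matrix of active gradients. A minimal Python sketch with hypothetical data (the gradients are made up for illustration):

```python
import numpy as np

# Hypothetical data: the columns of G are the gradients ∇g_i(x̄), i ∈ A(x̄),
# and ∇h_j(x̄), j ∈ E, stacked side by side (here two gradients in R^3).
G = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

licq_holds = np.linalg.matrix_rank(G) == G.shape[1]
print("LICQ satisfied:", licq_holds)   # True: the two columns are independent
```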
Theorem 4.15 (KKT conditions under (LICQ)) Let x∗ be a local min-
imum of (P) that satisfies (LICQ). There exist Lagrange multipliers λ∗ ∈ RI+ ,
µ∗ ∈ RE so that (x∗ , λ∗ , µ∗ ) is a KKT-point.

Proof: By Th 4.13 it suffices to show (LICQ) ⇒ (MFCQ).
The matrix (G⊤; H⊤) with G = [∇gi(x∗)]_{i∈A(x∗)} and H = [∇hj(x∗)]_{j∈E} has full row rank. Therefore

(G⊤; H⊤) p = (−1; 0) has a solution p̄ ∈ Rn, as required for (MFCQ). □

Again, in general, (MFCQ) ̸⇒ (LICQ).

[Sketch: x2 ≥ −x1³, x2 ≥ 0 at the origin.]

No regularity conditions are needed if all constraints are affine. As to be


expected, the linearized cone is exactly the right object in this case.
Theorem 4.16 (KKT conditions for affine constraints) Let all con-
straint functions gi , i ∈ I, hj , j ∈ E, of (P) be affine and let x∗ be a local
minimum of (P). There exist Lagrange multipliers λ∗ ∈ RI+ , µ∗ ∈ RE so that
(x∗ , λ∗ , µ∗ ) is a KKT-point.

Proof: By Cor 4.10 and Lem 4.5 it suffices to prove TP(x∗) ⊆ TX(x∗). Let p ∈ TP(x∗). For x(t) = x∗ + tp there holds gi(x(t)) = gi(x∗) + t∇gi⊤p, i ∈ I, and hj(x(t)) = hj(x∗) + t∇hj⊤p, j ∈ E, so x(t) satisfies the constraints gi, i ∈ A(x∗), and hj, j ∈ E, for all t ≥ 0.
For gi, i ∈ I \ A(x∗), it is feasible for 0 ≤ t ≤ inf{ −gi(x∗)/(∇gi⊤p) : ∇gi⊤p > 0, i ∈ I \ A(x∗) } =: t̄ > 0. Choose tk = min{1, t̄}/k, k ∈ N, and xk = x(tk) = x∗ + tk p to see p = (x∗ + tk p − x∗)/tk ∈ TX(x∗). □
For convex problems (convex f and gi , i ∈ I, affine hj , j ∈ E) the affine
constraints are well represented by the linearized tangent cone. For the non
affine convex gi everything works out as long as there is a common point
that lies in the relative interior of each of the level sets.

[Sketch: two intersecting parabolas; two touching parabolas.]

Definition 4.17 Let (P) be a convex optimization problem (with convex f and gi, i ∈ I, affine hj, j ∈ E). (P) satisfies the regularity condition of Slater if there is x̄ ∈ X with gi(x̄) < 0 for all non affine gi with i ∈ I. Such an x̄ is called a strictly feasible point or Slater point.
Theorem 4.18 (KKT conditions for convex (P) under Slater) Let
(P) be a convex optimization problem with convex differentiable f, gi , i ∈ I,
affine hj , j ∈ E and with a Slater point x̄. If x∗ is a (global) minimum,
there exist Lagrange multipliers λ∗ ∈ RI+ , µ∗ ∈ RE so that (x∗ , λ∗ , µ∗ ) is a
KKT-point.

Proof: By Cor 4.10 and Lem 4.5 it suffices to prove TP(x∗) ⊆ TX(x∗). Let p ∈ TP(x∗).
Because x̄ ∈ X with gi(x̄) < 0 for non affine gi, i ∈ I, the direction p̄ = x̄ − x∗ satisfies

∇hj⊤p̄ = 0, j ∈ E (affine),
∇gi⊤p̄ ≤ 0, i ∈ A(x∗), gi affine,
∇gi(x∗)⊤p̄ < 0, i ∈ A(x∗), gi not affine (by convexity, ∇gi(x∗)⊤p̄ ≤ gi(x̄) − gi(x∗) = gi(x̄) < 0),

so p̄ ∈ TP(x∗).


Put pk = (1 − 1/k)p + (1/k)p̄ ∈ TP(x∗), k ∈ N.
For each k ∈ N there is ε̄k > 0 so that x∗ + εpk ∈ X for all 0 ≤ ε ≤ ε̄k, because
j ∈ E: hj(x∗ + εpk) = hj(x∗) + ε∇hj⊤pk = 0,
i ∈ A(x∗), gi affine: gi(x∗ + εpk) = gi(x∗) + ε∇gi⊤pk ≤ 0,
i ∈ A(x∗), gi not affine: gi(x∗ + εpk) = gi(x∗) + ε∇gi(x∗ + θεpk)⊤pk < 0 for ε̄k small enough, because by continuity of ∇gi the inner product stays close to ∇gi(x∗)⊤pk < 0,
i ∈ I \ A(x∗): gi(x∗ + εpk) < 0 by continuity of gi.
Thus, pk ∈ TX(x∗) and, because TX(x∗) is closed, lim_{k→∞} pk = p ∈ TX(x∗). □
For convex problems the KKT conditions are also sufficient for optimality.
This follows via −∇f (x∗ ) ∈ NC (x∗ ) and Th 3.57 but is also quickly proved
directly.
Theorem 4.19 (KKT conditions, sufficiency for smooth convex case)
Let (P) be a convex optimization problem with convex differentiable f, gi , i ∈ I
and affine hj , j ∈ E and let (x∗ , λ∗ , µ∗ ) be a KKT-point. Then x∗ is a global
minimum.

Proof: For x ∈ X we have

f(x) ≥ f(x∗) + ∇f(x∗)⊤(x − x∗)   [subgradient inequality]
     = f(x∗) − ∑_{i∈I} λ∗i ∇gi(x∗)⊤(x − x∗) − ∑_{j∈E} µ∗j ∇hj⊤(x − x∗)   [KKT; the hj-terms vanish because the hj are affine and x, x∗ are feasible]
     = f(x∗) − ∑_{i∈A(x∗)} λ∗i ∇gi(x∗)⊤(x − x∗)   [complementarity]
     ≥ f(x∗),

because ∇gi(x∗)⊤(x − x∗) ≤ gi(x) − gi(x∗) = gi(x) ≤ 0 for i ∈ A(x∗) and x ∈ X by the subgradient inequality. □

For not necessarily differentiable convex functions f, gi, i ∈ I, and affine hj, j ∈ E, the KKT conditions require the existence of a feasible x together with Lagrange multipliers λ ∈ RI+, µ ∈ RE so that

0 ∈ ∂f(x) + ∑_{i∈I} λi ∂gi(x) + ∑_{j∈E} µj {∇hj},
λi gi(x) = 0 for i ∈ I.

Considering X = ∩_{i∈I} {x : gi(x) ≤ 0} ∩ ∩_{j∈E} {x : hj(x) = 0}, we have for x̄ ∈ X

NX(x̄) ⊇ { ∑_{i∈A(x̄)} λi ∂gi(x̄) + ∑_{j∈E} µj ∇hj : λi ≥ 0, i ∈ A(x̄), µj ∈ R, j ∈ E }
       = ∑_{i∈A(x̄)} R+ ∂gi(x̄) + ∑_{j∈E} R{∇hj}.

If a Slater point exists for (P), equality holds for all x̄ ∈ X , but we will not
prove this here.

Second Order Optimality Conditions (under LICQ)

Throughout this part we assume that all the functions f, gi, hj appearing in (P) are twice continuously differentiable and that (LICQ) is satisfied whenever needed. For x ∈ X satisfying (LICQ) there holds TX(x) = TP(x), so the first order necessary conditions of Th 4.1 may be written as

∇f(x)⊤p ≥ 0 for all p ∈ TP(x).

For p ∈ TP(x) with ∇f(x)⊤p > 0 no local improvement is possible.
For p ∈ TP(x) with ∇f(x)⊤p = 0 second order information may help.
Let (x, λ, µ) be a KKT-point, then

∇f(x) + ∑_{i∈I} λi∇gi(x) + ∑_{j∈E} µj∇hj(x) = 0.

For p ∈ TP(x) there holds ∇gi(x)⊤p ≤ 0, i ∈ A(x), and ∇hj(x)⊤p = 0, j ∈ E, so

∇f(x)⊤p = 0 ⇔ ∇gi(x)⊤p = 0 for all i ∈ A(x) with λi > 0.

In consequence, second order information is only required for the following


directions.
Definition 4.20 Let (x, λ, µ) be a KKT-Point of (P). The cone associated
with x and λ is

Tλ (x) = {p ∈ TP (x) : ∇gi (x)⊤ p = 0 for all i ∈ A(x) with λi > 0}.

Note, if (LICQ) holds in x, the multipliers are unique and the cone depends
on x only. The dependence on λ is, however, relevant whenever Lagrange
multipliers are not unique.
Theorem 4.21 (Second Order Necessary Optimality Conditions) Let
x∗ be a local optimal solution of (P) satisfying (LICQ) with λ∗ and µ∗ the
(unique) Lagrange multipliers for x∗ . There holds

p⊤ ∇xx L(x∗ , λ∗ , µ∗ )p ≥ 0 for all p ∈ Tλ∗ (x∗ ).

Proof: For ease of notation assume E = ∅. Let p ∈ Tλ∗(x∗) ⊆ TP(x∗) = TX(x∗). For this p we construct a special feasible sequence X ∋ (xk) → x∗, tk ↘ 0, k ∈ N, with (xk − x∗)/tk → p and with the property that L(xk, λ∗) = f(xk). Similar to the proof of Lem 4.11 the construction uses the Implicit Function Theorem.

Let G = [∇gj (x∗ )] with j ∈ {i ∈ A(x∗ ) : ∇gi (x∗ )⊤ p = 0} =: J ⊇ {i : λ∗i > 0}.
Due to (LICQ), G has full column rank. Put

Fj (y, t) = gj (x∗ + tp + Gy) for j ∈ J .

Like in Lem 4.11 construct y(t) and x(·) : (−ε, ε) → Rn, t ↦ x(t) = x∗ + tp + Gy(t), with y′(0) = 0, x′(0) = p and

gj(x(t)) = 0 for t ∈ (−ε, ε), j ∈ J,
gi(x(t)) < 0 for i ∈ A(x∗) \ J.

Put xk = x(tk) for tk ↘ 0, so lim_{k→∞} (xk − x∗)/∥xk − x∗∥ = p/∥p∥, and for k large enough

L(xk, λ∗) = f(xk) + ∑_{i∈I} λ∗i gi(xk) = f(xk) ≥ f(x∗),

because λ∗i > 0 implies i ∈ J and thus gi(xk) = 0.

On the other hand, by Taylor there are suitable ξk ∈ [x∗, xk] so that

L(xk, λ∗) = L(x∗, λ∗) + ∇xL(x∗, λ∗)⊤(xk − x∗) + ½(xk − x∗)⊤∇xxL(ξk, λ∗)(xk − x∗),

where L(x∗, λ∗) = f(x∗) by complementarity and ∇xL(x∗, λ∗) = 0 by the KKT conditions. From this we obtain

½(xk − x∗)⊤∇xxL(ξk, λ∗)(xk − x∗) = L(xk, λ∗) − f(x∗) = f(xk) − f(x∗) ≥ 0.

Dividing by ∥xk − x∗∥² and taking the limit k → ∞ yields

½ (p/∥p∥)⊤ ∇xxL(x∗, λ∗) (p/∥p∥) ≥ 0. □


The corresponding sufficient conditions start with a KKT-point and therefore
do not need to assume (LICQ).

Theorem 4.22 (Sufficient Optimality Conditions) Let (x∗ , λ∗ , µ∗ ) be


a KKT-point of (P). If there holds

p⊤ ∇xx L(x∗ , λ∗ , µ∗ )p > 0 for all p ∈ Tλ∗ (x∗ ) \ {0}

the point x∗ is a local optimal solution.

Proof: For ease of notation let E = ∅. The proof shows that every feasible sequence (X \ {x∗}) ∋ (xk)_{k∈N} → x∗ satisfies f(xk) > f(x∗) for k large enough. By the usual compactness argument the sequence pk = (xk − x∗)/∥xk − x∗∥ has a subsequence K ⊆ N converging to some cluster point p; w. l. o. g. pk → p. By Lem 4.5, p ∈ TX(x∗) ⊆ TP(x∗). Discern the following two cases.

p ∈ Tλ∗(x∗): By Taylor there exist suitable ξk ∈ [xk, x∗] so that

f(xk) ≥ f(xk) + ∑_{i∈I} λ∗i gi(xk)   [gi(xk) ≤ 0]
      = L(xk, λ∗)
      = L(x∗, λ∗) + [∇xL(x∗, λ∗)⊤ + ½(xk − x∗)⊤∇xxL(ξk, λ∗)](xk − x∗)
      = f(x∗) + ½∥xk − x∗∥² · ((xk − x∗)/∥xk − x∗∥)⊤ ∇xxL(ξk, λ∗) ((xk − x∗)/∥xk − x∗∥),

using L(x∗, λ∗) = f(x∗) (complementarity) and ∇xL(x∗, λ∗) = 0 (KKT). The quadratic form converges to p⊤∇xxL(x∗, λ∗)p > 0, so f(xk) > f(x∗) for k large enough.

p ∈ TP(x∗) \ Tλ∗(x∗): There exists j ∈ A(x∗) with λ∗j > 0 and ∇gj(x∗)⊤p < 0. Now, by Taylor with suitable ξk ∈ [xk, x∗],

lim_{k→∞} (f(xk) − f(x∗))/∥xk − x∗∥ = lim_{k→∞} ∇f(ξk)⊤(xk − x∗)/∥xk − x∗∥ = ∇f(x∗)⊤p = −∑_{i∈I} λ∗i∇gi(x∗)⊤p > 0   [KKT].

Thus, for k large enough, there holds f (xk ) > f (x∗ ). □


In practice one usually only checks whether ∇xxL satisfies the definiteness conditions on the linear subspaces

T̲λ(x) := {p ∈ Rn : ∇gi(x)⊤p = 0, i ∈ A(x), ∇hj(x)⊤p = 0, j ∈ E} ⊆ Tλ(x),
T̄λ(x) := {p ∈ Rn : ∇gi(x)⊤p = 0, i ∈ A(x) ∧ λi > 0, ∇hj(x)⊤p = 0, j ∈ E} ⊇ Tλ(x).

Let the columns of Z̲ and Z̄ form a basis of T̲λ(x) and T̄λ(x), respectively; then

Z̲⊤∇xxL(x, λ, µ)Z̲ ⪰ 0 is necessary for optimality of x,
Z̄⊤∇xxL(x, λ, µ)Z̄ ≻ 0 is sufficient for optimality of x.

If (LICQ) holds, λ is unique. If, in addition, strict complementarity (i. e., gi(x) = 0 ⇔ λi > 0) holds, then T̲λ(x) = Tλ(x) = T̄λ(x) and the corresponding Z⊤∇xxL(x, λ, µ)Z is called the projected Hessian (of the Lagrangian).
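A minimal numeric sketch of this test (all data hypothetical): build the matrix whose rows are the relevant constraint gradients, take a nullspace basis Z, and inspect the eigenvalues of Z⊤∇xxL Z.

```python
import numpy as np
from scipy.linalg import null_space

# Hypothetical data: rows of C are the gradients defining the subspace,
# hess_L plays the role of ∇xx L(x, λ, µ).
C = np.array([[1.0, 1.0, 0.0]])
hess_L = np.diag([2.0, 1.0, 3.0])

Z = null_space(C)                    # columns form a basis of {p : Cp = 0}
projected = Z.T @ hess_L @ Z         # the projected Hessian
print(np.linalg.eigvalsh(projected)) # all positive: sufficient condition holds
```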

Sensitivity

Under suitable nondegeneracy assumptions the Lagrange multipliers may


be interpreted as marginal costs for changes in the right hand side of the
constraints in the sense that investing this per unit cost into changing the
right hand side by an infinitesimal amount leads to an identical limiting
improvement of the objective.
CHAPTER 4. CONSTRAINED OPTIMALITY CONDITIONS 75

[Sketch: nondegenerate and degenerate active gradient situations.]

Theorem 4.23 For δ ∈ R^{I∪E} let

(Pδ)  min f(x)
      s. t. gi(x) ≤ δi, i ∈ I,
            hj(x) = δj, j ∈ E.

If (x∗, λ∗, µ∗) is a KKT-point of (Pδ) for δ = 0, in which (LICQ), strict complementarity and the sufficient optimality conditions hold, then there exist a neighborhood U of δ = 0 and a continuously differentiable function x(·) : U → Rn with x(0) = x∗ so that x(δ) is a local optimum of (Pδ) and

∇δ f(x(δ))|_{δ=0} = (−λ∗; −µ∗).

Proof: W. l. o. g. assume I = ∅ (otherwise put E ← E ∪ A(x∗)), let (x∗, µ∗) be a KKT-point satisfying the requirements, put H(x) = [∇hj(x)]_{j∈E} and h(x) = (h1(x), . . . , h|E|(x))⊤. For (Pδ) the KKT-conditions read

F(x, µ, δ) = ( ∇f(x) + H(x)µ ; h(x) − δ ) = 0.

The Implicit Function Theorem may be applied to this for parameter δ, because
(i) F(x∗, µ∗, 0) = 0,
(ii) F is continuously differentiable on a neighborhood of (x∗, µ∗, 0),
(iii) [JF(x, µ, δ)]_{x,µ} = ( ∇²f(x) + ∑i µi∇²hi(x) , H(x) ; H(x)⊤ , 0 ) and
     J∗ := [JF(x∗, µ∗, 0)]_{x,µ} =: ( L , H ; H⊤ , 0 ) is regular.
     Indeed, by (LICQ) H = H(x∗) has full column rank, so for Z spanning the orthogonal complement Tλ∗(x∗) of the range of H the matrix [Z H] ∈ R^{n×n} is regular. Using the regularity of the block diagonal P = ( [Z H] , 0 ; 0 , I ), observe that J∗ is regular if and only if P⊤J∗P is regular. Now,

     P⊤J∗P = ( Z⊤LZ , Z⊤LH , 0 ; H⊤LZ , H⊤LH , H⊤H ; 0 , H⊤H , 0 ).

In this, H⊤H is positive definite, thus regular. The projected Hessian Z⊤LZ is positive definite by the sufficient optimality conditions. Block Gaussian elimination with respect to the blocks (1,1), (3,2) and (2,3) proves that P⊤J∗P is regular.
Therefore the Implicit Function Theorem ensures the existence of continuously differentiable x(δ), µ(δ) that solve the KKT-system for |δ| small enough. Applying the chain rule to line 2 of F (h(x(δ)) = δ) gives

∇δ h(x(δ))|_{δ=0} = ∇δ(x(δ)) ∇x h(x(δ)) |_{δ=0} = ∇δ(x(δ)) H(x(δ)) |_{δ=0} = ∇δ δ = I,

and with line 1 of F (∇xf(x(δ)) = −H(x(δ))µ(δ)),

∇δ f(x(δ))|_{δ=0} = ∇δ(x(δ)) ∇x f(x(δ)) |_{δ=0} = −∇δ(x(δ)) H µ |_{δ=0} = −µ∗,

using ∇δ(x(δ))H|_{δ=0} = I as above. □
It is important to note that λi and µj alone are, in general, not reliable indicators for sensitivity because they directly depend on the scaling of the constraints. Indeed, if gi(x) ≤ 0 is scaled (multiplied) by 1000 then the corresponding multiplier changes to λi/1000 and compensates the change. Thus, only the relative sizes λi∥∇gi(x∗)∥ have a chance to be meaningful.
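The effect is easy to reproduce numerically; the following tiny Python sketch uses the hypothetical problem min x² s. t. 1 − x ≤ 0, whose optimum x∗ = 1 has multiplier λ = 2, and rescales the constraint:

```python
# Hypothetical problem: min x^2 s.t. g(x) = 1 - x <= 0, optimum x* = 1.
# Stationarity 2 x* + lam * g'(x*) = 0 with g'(x*) = -scale gives lam = 2/scale.
for scale in (1.0, 1000.0):
    lam = 2.0 / scale                   # multiplier of the rescaled constraint
    grad_norm = scale                   # |∇g(x*)| = scale
    print(scale, lam, lam * grad_norm)  # λ‖∇g(x*)‖ = 2 independent of the scale
```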
Chapter 5

Saddle Points and Lagrangian


Duality

The objects of interest are bifunctions of the form

ℓ : X × Y → R, (x, y) ↦ ℓ(x, y), with X ⊆ Rn, Y ⊆ Rm.

These arise in optimization mostly via “Lagrangian relaxation”.


Example Let K1 ⊆ Rn, K2 ⊆ Rm be closed convex cones.

min c⊤x s. t. b − Ax ∈ K2, x ∈ K1   ⇔   inf_{x∈K1} [ c⊤x + sup_{y∈K2◦} ⟨b − Ax, y⟩ ],

because sup_{y∈K2◦} ⟨b − Ax, y⟩ = +∞ if b − Ax ∉ K2 and = 0 if b − Ax ∈ K2. Hence the problem reads

inf_{x∈K1} sup_{y∈K2◦} [ ℓ(x, y) := c⊤x + ⟨b − Ax, y⟩ ].


What happens if the positions of inf and sup are swapped?

Observation 5.1 (weak duality) inf_{x∈X} sup_{y∈Y} ℓ(x, y) ≥ sup_{y∈Y} inf_{x∈X} ℓ(x, y).

Proof: Observe that

ℓ(x̄, ȳ) ≥ inf_{x∈X} ℓ(x, ȳ) holds for all ȳ ∈ Y, x̄ ∈ X,
⇒ sup_{y∈Y} ℓ(x̄, y) ≥ sup_{ȳ∈Y} inf_{x∈X} ℓ(x, ȳ) holds for all x̄ ∈ X,
⇒ inf_{x̄∈X} sup_{y∈Y} ℓ(x̄, y) ≥ sup_{ȳ∈Y} inf_{x∈X} ℓ(x, ȳ). □


Example Continuing the example before,

inf_{x∈K1} sup_{y∈K2◦} [ c⊤x + b⊤y − x⊤A⊤y ] ≥ sup_{y∈K2◦} inf_{x∈K1} [ b⊤y + ⟨c − A⊤y, x⟩ ]
= sup_{y∈K2◦} [ b⊤y + inf_{x∈K1} ⟨c − A⊤y, x⟩ ]   ⇔   max b⊤y s. t. A⊤y − c ∈ K1◦, y ∈ K2◦,

because inf_{x∈K1} ⟨c − A⊤y, x⟩ = 0 if c − A⊤y ∈ −K1◦ and = −∞ otherwise.
For a concrete example, consider the Mozart problem

min [−9, −8] x   (= c⊤x)
s. t. (1 1; 2 1; 1 2) x ≤ (6; 11; 9)   (Ax ≤ b, i. e., b − Ax ∈ K2 := R³+),
      x ≥ 0   (x ∈ K1 := R²+),

whose Lagrangian dual reads

max b⊤y s. t. A⊤y − c ∈ K1◦ = R²− (≤ 0), y ∈ K2◦ = R³− (≤ 0),

i. e.,

max b⊤y
s. t. A⊤y ≤ c,
      y ≤ 0.

This approach to duality is known as “Lagrangian duality”. Note, weak
duality holds in general and does not require any convexity assumptions at
all. The decisive questions are
• Under which conditions is inf x supy ℓ(x, y) = supy inf x ℓ(x, y)?
• If equality holds, do there exist x̄, ȳ that attain this value?
Definition 5.2 A pair (x̄, ȳ) ∈ X × Y is a saddle point of a function ℓ : X × Y → R if sup_{y∈Y} ℓ(x̄, y) = ℓ(x̄, ȳ) = inf_{x∈X} ℓ(x, ȳ).

The following variants are equivalent to this. For (x̄, ȳ) ∈ X × Y there holds
ℓ(x̄, y) ≤ ℓ(x̄, ȳ) ≤ ℓ(x, ȳ) for all (x, y) ∈ X × Y,
ℓ(x̄, y) ≤ ℓ(x, ȳ) for all (x, y) ∈ X × Y.

[Sketch: a saddle function ℓ in x and y.]



The concept of saddle points fits game theory perfectly:
X cannot improve its position if Y does not move,
Y cannot improve its position if X does not move.
If there are saddle points, they all have the same value and can be combined arbitrarily.
Observation 5.3 The value ℓ(x̄, ȳ) is the same for all saddle points (x̄, ȳ). If (x̄1, ȳ1) and (x̄2, ȳ2) are saddle points, then (x̄1, ȳ2) and (x̄2, ȳ1) are saddle points as well.

Proof: Given the saddle points, there holds for all (x, y) ∈ X × Y

ℓ(x̄1, y) ≤ ℓ(x̄1, ȳ1) ≤ ℓ(x, ȳ1)   [use x = x̄2, y = ȳ2]
ℓ(x̄2, y) ≤ ℓ(x̄2, ȳ2) ≤ ℓ(x, ȳ2)   [use x = x̄1, y = ȳ1]

By the indicated choices of x and y all combinations give rise to the same value. This also shows that

∀(x, y) ∈ X × Y   ℓ(x̄1, y) ≤ ℓ(x̄1, ȳ1) = ℓ(x̄1, ȳ2) = ℓ(x̄2, ȳ2) ≤ ℓ(x, ȳ2).

Hence, (x̄1, ȳ2) is a saddle point. The same steps prove the property for (x̄2, ȳ1). □
Consider the following two functions:
the primal function φ(x) := sup_{y∈Y} ℓ(x, y) for x ∈ X; it may attain +∞,
the dual function ψ(y) := inf_{x∈X} ℓ(x, y) for y ∈ Y; it may attain −∞.
In the game theoretic setting these functions specify the worst case payoff for the "minimizing primal" and the "maximizing dual" player's current choice, if the opponent reacts in the best way. Any choice gives a bound on that, so the following is trivial and not the saddle point property.
Observation 5.4 ∀(x, y) ∈ X × Y ψ(y) ≤ ℓ(x, y) ≤ φ(x).

Proof: ⇔ inf x∈X ℓ(x, y) ≤ ℓ(x, y) ≤ supy∈Y ℓ(x, y). □


Each side tries to find a choice optimizing the respective function.
Theorem 5.5 Let Φ = Argminx∈X φ(x), Ψ = Argmaxy∈Y ψ(y). ℓ has
saddle points on X ×Y if and only if minx∈X φ(x) = maxy∈Y ψ(y) is attained
on both sides. In this case, Φ × Ψ is the set of saddle points.

Proof: If (x̄, ȳ) is a saddle point, then by definition φ(x̄) = ψ(ȳ) , thus by
Obs 5.4 (x̄, ȳ) ∈ Φ × Ψ and there holds minX φ = maxY ψ.
Conversely, let minX φ = maxY ψ be attained, then there exists (x̄, ȳ) ∈ Φ×Ψ
Obs 5.4 Obs 5.4
with φ(x̄) = ψ(ȳ). Thus ℓ(x̄, y) ≤ φ(x̄) = ψ(ȳ) ≤ ℓ(x, ȳ) holds for
all (x, y) ∈ X × Y proving the saddle point property. □
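For a quick numeric illustration of φ, ψ and Th 5.5, the following Python sketch evaluates the made-up bifunction ℓ(x, y) = x² − y² + xy on X = Y = [−1, 1] over a grid; the unique saddle point is (0, 0) with value 0, and indeed min φ = max ψ ≈ 0.

```python
import numpy as np

# Made-up bifunction ℓ(x, y) = x^2 - y^2 + x*y on X = Y = [-1, 1].
grid = np.linspace(-1.0, 1.0, 2001)
L = grid[:, None]**2 - grid[None, :]**2 + grid[:, None] * grid[None, :]

phi = L.max(axis=1)          # primal function φ(x) = sup_y ℓ(x, y)
psi = L.min(axis=0)          # dual function ψ(y) = inf_x ℓ(x, y)
print(phi.min(), psi.max())  # both ≈ 0: min φ = max ψ, attained at (0, 0)
```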
We will be able to ensure the existence of saddle points (= strong dual-
ity) under the following four, rather strong assumptions that guarantee
attainment.

(A1) X ⊆ Rn and Y ⊆ Rm are nonempty, convex and closed.
(A2) ℓ is continuous and convex-concave on X × Y, i. e.,
     for y ∈ Y the function ℓ(·, y) : X → R is convex,
     for x ∈ X the function ℓ(x, ·) : Y → R is concave.
(A3) X is bounded or ∃y0 ∈ Y : ℓ(x, y0) → ∞ for ∥x∥ → ∞, x ∈ X.
(A4) Y is bounded or ∃x0 ∈ X : ℓ(x0, y) → −∞ for ∥y∥ → ∞, y ∈ Y.

Theorem 5.6 If (A1)–(A4) are satisfied, the bifunction ℓ has a nonempty


convex compact set of saddle points on X × Y .

Proof: First, we prove that the set of saddle points is convex and compact. If there are saddle points (x̄, ȳ), they have a common saddle value l̄ = ℓ(x̄, ȳ) by Obs 5.3 and there holds ℓ(x̄, y) ≤ l̄ ≤ ℓ(x, ȳ) for all (x, y) ∈ X × Y. Thus, the primal optimizing set Φ is the intersection of the level sets of the convex functions ℓ(·, y),

Φ = ∩_{y∈Y} S_l̄(ℓ(·, y)).

Because ℓ(·, y) is convex and continuous, the level sets are convex and closed, so Φ is convex and closed. By (A3) at least one of the level sets is compact, so Φ is compact. The same holds for Ψ and thus for Φ × Ψ.
The existence of saddle points is proven in three steps starting with even
stronger assumptions and then weakening them again.
Step 1: In addition to (A1)–(A4) let X and Y be bounded and ℓ(x, ·) be strictly concave for each x ∈ X.
For each y ∈ Y the function hy(x) := ℓ(x, y) is convex and closed. Thus the primal function φ(x) := sup_{y∈Y} hy(x) is convex, closed and has compact domain dom φ = X. Therefore min_{x∈X} φ(x) is attained in some x̄ ∈ X.
The strict concavity of ℓ(x, ·) for each x ∈ X and the compactness of Y imply that for each x there is a unique maximizing y(x) ∈ Y with φ(x) = ℓ(x, y(x)).
Put ȳ = y(x̄), then

φ(x̄) = ℓ(x̄, ȳ) ≥ ℓ(x̄, y) for all y ∈ Y. (5.1)

For proving the second of the saddle point inequalities, let x ∈ X and put

xk := k1 x + (1 − k1 )x̄, yk = y(xk ) for k ∈ N.



By compactness of Y, there is a subsequence K with yk →_K ỹ, possibly not equal to ȳ. Then

φ(x̄) ≤ φ(xk) = ℓ(xk, yk) ≤ (1/k) ℓ(x, yk) + (1 − 1/k) ℓ(x̄, yk)   [x̄ minimizes φ; yk maximizes ℓ(xk, ·); convexity],

and along K the right hand side tends to 0 + ℓ(x̄, ỹ) ≤ φ(x̄) by (5.1). Because ȳ is the unique maximizer with ℓ(x̄, ȳ) = φ(x̄), this shows ỹ = ȳ. By ℓ(x̄, yk) ≤ ℓ(x̄, ȳ) = φ(x̄) it also implies

φ(x̄) ≤ (1/k) ℓ(x, yk) + (1 − 1/k) φ(x̄)   | · k
⇒ φ(x̄) ≤ ℓ(x, yk) →_K ℓ(x, ȳ).

This holds for any x ∈ X. Thus, (x̄, ȳ) is a saddle point.
Step 2: In addition to (A1)–(A4) let X and Y be bounded. In order to build on Step 1, define

for k ∈ N   ℓk(x, y) := ℓ(x, y) − (1/k)∥y∥²,   strictly concave in y.

For each ℓk there is a saddle point (x̄k, ȳk) satisfying

∀(x, y) ∈ X × Y   ℓ(x̄k, y) − (1/k)∥y∥² ≤ ℓ(x, ȳk) − (1/k)∥ȳk∥².

By compactness there is a subsequence K with (x̄k, ȳk) →_K (x̄, ȳ), and passing to the limit for fixed (x, y) yields

∀(x, y) ∈ X × Y   ℓ(x̄, y) ≤ ℓ(x, ȳ),

proving the saddle point property for (x̄, ȳ).
Step 3: Now assume (A1)–(A4) alone. Put

for k ∈ N   Xk := X ∩ Bk(0),   Yk := Y ∩ Bk(0).

By Step 2, ℓ restricted to Xk × Yk has a saddle point (x̄k, ȳk) ∈ Xk × Yk satisfying

ℓ(x̄k, y) ≤ ℓ(x, ȳk) for all (x, y) ∈ Xk × Yk.

Assume, for contradiction, that (ȳk)_{k∈N} is unbounded. Then Y is unbounded. By (A4) there holds x0 ∈ Xk for all k large enough and

∀y ∈ Yk   ℓ(x̄k, y) ≤ ℓ(x0, ȳk) → −∞.

For each fixed y ∈ Y this yields ℓ(x̄k, y) → −∞, which is possible only if (x̄k)_{k∈N} is unbounded. Thus X is unbounded. By (A3) y0 ∈ Yk for all k large enough and

+∞ ← ℓ(x̄k, y0) ≤ ℓ(x0, ȳk) → −∞,

a contradiction. So (ȳk)_{k∈N} and, by the symmetric argument, (x̄k)_{k∈N} remain bounded. Therefore there is a subsequence K with (x̄k, ȳk) →_K (x̄, ȳ) satisfying

ℓ(x̄, y) ≤ ℓ(x, ȳ) for all (x, y) ∈ X × Y. □


Given (A1)–(A4) the weak duality relation

inf_{x∈X} sup_{y∈Y} ℓ(x, y) ≥ sup_{y∈Y} inf_{x∈X} ℓ(x, y)

can now be improved to the much more useful strong duality relation

min_{x∈X} sup_{y∈Y} ℓ(x, y) = max_{y∈Y} inf_{x∈X} ℓ(x, y).

Example Consider the primal dual pair of conic programs with closed convex cones K1, K2 (the polars K1◦, K2◦ are again closed convex cones)

(P) min c⊤x s. t. b − Ax ∈ K2, x ∈ K1,   (D) max b⊤y s. t. A⊤y − c ∈ K1◦, y ∈ K2◦.
While the saddle point theorem does not offer an easy way to the typical strong duality result for conic programs, it may still be brought to use as follows. Assume K2 and K1◦ to be full dimensional and put

ℓ(x, y) := c⊤x + b⊤y − y⊤Ax.

Next consider the four assumptions:
(A1) holds for X = K1 and Y = K2◦.
(A2) holds for this ℓ, because it is affine in x and affine in y.
(A3) because K1 is unbounded we need to assume the existence of a dual Slater point, ∃ỹ ∈ K2◦ : A⊤ỹ − c ∈ int K1◦; then ∀x ∈ K1 \ {0} : ⟨A⊤ỹ − c, x/∥x∥⟩ < 0, so ℓ(x, ỹ) = b⊤ỹ + ⟨c − A⊤ỹ, x⟩ → +∞ for ∥x∥ → ∞, x ∈ K1.
(A4) because K2◦ is unbounded we need to assume the existence of a primal Slater point, ∃x̃ ∈ K1 : b − Ax̃ ∈ int K2; then ∀y ∈ K2◦ \ {0} : ⟨b − Ax̃, y/∥y∥⟩ < 0, so ℓ(x̃, y) = c⊤x̃ + ⟨b − Ax̃, y⟩ → −∞ for ∥y∥ → ∞, y ∈ K2◦.
Thus, if such Slater points exist, (P) and (D) will have optimal solutions exhibiting the same optimal value. In several typical situations these points are easy to produce. Consider, e. g., the Mozart problem,

min [−9, −8] x s. t. (1 1; 2 1; 1 2) x ≤ (6; 11; 9), x ≥ 0,
max (6; 11; 9)⊤ y s. t. (1 2 1; 1 1 2) y − (−9; −8) ≤ 0, y ≤ 0.
CHAPTER 5. SADDLE POINTS 83
Here, x̃ = 0 and ỹ = (−10; 0; 0) would be suitable choices. For a linear program this is, however, an unnecessary exercise, because there strong duality holds generically without further conditions, as we will see next. ♡

The Lagrange Function, Duality and Optimality

To illustrate and interpret the relevance of the saddle point approach for optimization problems of the form (w. l. o. g. inequality constraints only)

min f(x)   f : Rn → R,
s. t. gi(x) ≤ 0, i ∈ I,   g : Rn → R^I [g(x) ∈ K2 = R^I−],
x ∈ X   X convex of simple structure,

we consider the Lagrangian

L(x, λ) = f(x) + ∑_{i∈I} λi gi(x).

Note, inf_{x∈X} sup_{λ≥0} L(x, λ) is the original problem. Indeed, for x ∈ X:
if gi(x) > 0 for some i, choose λi → +∞, then sup_{λ≥0} ∑ λigi(x) = ∞;
if g(x) ≤ 0, the choice λ = 0 is optimal and sup_{λ≥0} ∑ λigi(x) = 0.
Only the sup = 0 cases, i. e., feasible x, are relevant for taking the inf.
Conversely, for fixed λ ≥ 0 one may view

ψ(λ) = inf_{x∈X} L(x, λ)   (−∞ is a possible value)

as a modified version of the original problem in which λi penalizes a possible


violation of constraint gi but unfortunately also rewards overfulfillment of it.
For each fixed λ ≥ 0 the value ψ(λ) is always a lower bound on the optimal value of the original problem (weak duality). Indeed, for all feasible x′ ∈ X (and, if it exists, also for an optimal x∗),

ψ(λ) = inf_{x∈X} L(x, λ) ≤ f(x′) + ∑_{i∈I} λi gi(x′) ≤ f(x′),

because λi ≥ 0 and gi(x′) ≤ 0 make the sum nonpositive.

In order to determine this lower bound ψ(λ) one has to solve the “simpler”
optimization problem over x ∈ X where violations of the gi are penalized in
the objective. This is Lagrangian relaxation of the constraints by a (Lagrange
multiplier) parameter λ.
By Obs 5.1 one obtains the best possible lower bound if one can solve the problem

sup_{λ≥0} inf_{x∈X} L(x, λ) = sup_{λ≥0} ψ(λ),

i. e., if one can solve the dual optimization problem. It is important to note that this dual problem is always a convex problem. To see this, observe that

ψ(λ) = inf_{x∈X} L(x, λ) = inf_{x∈X} [ f(x) + g(x)⊤λ ]

is always concave, because it is the infimum of functions that are affine in λ (one per x ∈ X). Moreover,

∂(−ψ)(λ) = {−g(x′) : x′ ∈ Argmin_{x∈X} L(x, λ)} [+ N_{R^I_+}(λ)].

Therefore the dual problem supλ≥0 ψ(λ) can be solved by the subgradient
algorithm (or similar nonsmooth optimization methods like bundle methods),
whenever a global optimizer to inf x∈X L(x, λ) can be determined efficiently
for each λ ≥ 0. If L(x, λ) has saddle points, its optimal value coincides with
the optimal value of the original problem.
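A minimal projected-supergradient sketch for sup_{λ≥0} ψ(λ), under the assumption that the inner problem inf_{x∈X} L(x, λ) can be solved exactly. The instance is hypothetical: f(x) = ∥x∥² on X = R² with the single constraint g(x) = 1 − x1 − x2 ≤ 0, for which argmin_x L(x, λ) = (λ/2, λ/2) is available in closed form.

```python
import numpy as np

# Hypothetical instance: min ||x||^2 s.t. 1 - x1 - x2 <= 0 on X = R^2.
# Then argmin_x L(x, λ) = (λ/2, λ/2), and g(x(λ)) is a supergradient of ψ at λ.

def inner_argmin(lam):
    return np.array([lam / 2.0, lam / 2.0])

lam = 0.0
for k in range(1, 1001):
    x = inner_argmin(lam)
    g = 1.0 - x[0] - x[1]            # supergradient of ψ at the current λ
    lam = max(0.0, lam + g / k)      # ascent step, projected onto λ >= 0
print(lam, inner_argmin(lam))        # λ = 1 and x = (0.5, 0.5), ψ(1) = 0.5
```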
I
Theorem P 5.7 A point (x̄, λ̄) I∈ X × R+ is a saddle point of L(x, λ) =
f (x) + i∈I λi gi (x) on X × R+ if and only if
(i) x̄ minimizes L(·, λ̄) on X,
(ii) gi (x) ≤ 0, i ∈ I, [primal feasibility]
(iii) λ̄i gi (x̄) = 0, i ∈ I. [complementarity]

Proof: ⇒: (i) follows from L(x̄, λ̄) ≤ L(x, λ̄) for all x ∈ X. (ii) and (iii)
are implied by L(x̄, ·) being affine in λ and L(x̄, λ) ≤ L(x̄, λ̄) for all λ ≥ 0,
because
gi (x̄) > 0 contradicts L(x̄, λ) ≤ L(x̄, λ̄) for all λ ≥ 0,
gi (x̄) < 0 ⇒ λ̄i = 0,
λ̄i > 0 ⇒ gi (x̄) = 0.

⇐: (ii) and (iii) ensure L(x̄, λ) ≤ L(x̄, λ̄) for all λ ≥ 0,


and by (i), L(x̄, λ̄) ≤ L(x, λ̄) for all x ∈ X. □

Corollary 5.8 If (x̄, λ̄) is a saddle point of L on X × RI+ , then x̄ is an


optimal solution of the optimization problem.

Proof: Feasibility of x̄ follows from Th 5.7(ii) and for any feasible x ∈ X,

f(x̄) = L(x̄, λ̄) ≤ L(x, λ̄) = f(x) + ∑_{i∈I} λ̄i gi(x) ≤ f(x),

using (iii) for the first equality, (i) for the inequality, and λ̄i ≥ 0, gi(x) ≤ 0 for the last step. □
For convex problems the existence of saddle points is equivalent to the
existence of Lagrange multipliers in the KKT system.

Theorem 5.9 Let X = Rn and f, gi convex. The following are equivalent


for the convex optimization problem:
(i) (x̄, λ̄) is a saddle point of L on Rn × RI+ ,
(ii) x̄ is an optimal solution and λ̄ a Lagrange multiplier for x̄ in the KKT
system.

Proof: Denote the feasible set by X = {x : gi(x) ≤ 0, i ∈ I}. Then

Th 5.7 ⇔ [ 0 ∈ ∂L(·, λ̄)(x̄) ∧ x̄ ∈ X ∧ λ̄igi(x̄) = 0, i ∈ I ]   [(i) via Th 3.55, (ii), (iii)]
       ⇔ [ x̄ ∈ X ∧ ∃λ̄ ∈ RI+ : 0 ∈ ∂f(x̄) + ∑_{i∈I} λ̄i ∂gi(x̄) ∧ λ̄igi(x̄) = 0, i ∈ I ]. □
If saddle points exist, the optimal solutions to the dual problem are therefore
Lagrange multipliers for the primal optimal solutions (and vice versa).
These results open one possibility to prove strong duality for linear program-
ming. Recall the primal dual pair of linear programs in normal form,

(P ) min c⊤ x
s. t. Ax = b, [b − Ax ∈ {0} =: K2 ]
x ≥ 0, [x ∈ Rn+ =: K1 ]

(D) max b⊤ y
s. t. A⊤ y ≤ c, [A⊤ y − c ∈ Rn− = (Rn+ )◦ = K1◦ ]
y free, [y ∈ Rm = {0}◦ = K2◦ ]

⇔ (D′ ) max b⊤ y,
s. t. A⊤ y + z = c,
y ∈ Rm , z ≥ 0.

Note that the dual to (D) is (P) and vice versa.

Theorem 5.10 (Strong Duality for Linear Programming)


If one of (P) or (D) has an optimal solution, the other one has as well and
the optimal objective values coincide.

Proof: By Th 4.16 affine constraints require no regularity conditions for the existence of multipliers. W. l. o. g. let x̄ be an optimal solution of (P), i. e., of

min c⊤x s. t. b − Ax ≤ 0 [multiplier λ+], −b + Ax ≤ 0 [multiplier λ−], −x ≤ 0 [multiplier λx], x free.

Since x̄ is optimal, Th 4.16 provides Lagrange multipliers λ̄ = (λ̄+, λ̄−, λ̄x) ≥ 0.

By Th 5.9, (x̄, λ̄) is a saddle point of the Lagrangian

L(x, λ) = c⊤x + ⟨b − Ax, λ+ − λ−⟩ + ⟨−x, λx⟩
        = b⊤(λ+ − λ−) + ⟨ c − A⊤(λ+ − λ−) − λx , x ⟩,

which suggests setting y = λ+ − λ− and z = λx. Therefore L(x̄, λ̄) = sup_{λ≥0} inf_{x∈Rn} L(x, λ), which is equivalent to y = λ̄+ − λ̄− and z = λ̄x being optimal for

max b⊤y s. t. c − A⊤y − z = 0, y ∈ Rm, z ≥ 0   ⇔   max b⊤y s. t. A⊤y ≤ c. □
Chapter 6

Interior Point Methods

Interior Point Methods offer a general approach for developing “efficient”


methods for linear/quadratic/semidefinite optimization. In the case of linear
and quadratic optimization suitably designed methods have polynomial
running time, in the case of semidefinite programming polynomiality requires
reasonable additional assumptions on the primal and dual feasible sets.
The approach follows a general strategy for dealing with inequality constraints first developed for general nonlinear optimization. The idea is to start in the interior of the feasible set and to avoid crossing the boundary of the feasible region by adding to the objective function a barrier function that has value infinity on the boundary. A line search for minimizing the objective will then not get too close to the boundary.

The most popular barrier function for x ≥ 0 (x ∈ R+) is −log x.

[Sketch: −log x]

For example, consider the linear program min cx s. t. a ≤ x ≤ b for some a < b with a, b, c ∈ R.

[Sketch: cx − log(x − a) − log(b − x)]

In order to be able to approach the boundary the influence of the barrier


term is decreased by a barrier parameter µ > 0. For general linear programs


in normal form this reads

(P)  min c⊤x             barrier problem [convex]:
     s. t. Ax = b,        (Pµ) min c⊤x − µ ∑_{i=1}^n log xi
     x ≥ 0                     s. t. Ax = b (x > 0)

(D)  max b⊤y              (Dµ) max b⊤y + µ ∑_{i=1}^n log zi
     s. t. A⊤y + z = c,        s. t. A⊤y + z = c (z > 0)
     y ∈ Rm, z ≥ 0
Definition 6.1
x is strictly feasible for (P ) if x is feasible for (P ) and x > 0.
(y, z) is strictly feasible for (D) if (y, z) is feasible for (D) and z > 0.

Primal and dual strictly feasible points will be assumed to exist in this
chapter unless explicitly stated otherwise.
Assumption 6.2 There exist x0 strictly feasible for (P) and (y0, z0) strictly feasible for (D).

The general strategy is to determine approximately optimal solutions of (Pµ) for µ ↘ 0. If successive values of µ are close enough, the next approximation should be easy to reach from the previous one.
First consider a fixed value of µ.
The optimal solution of (Pµ) is a saddle point of the Lagrangian

Lµ(x, y) = c⊤x − µ ∑_{i=1}^n log xi + (b − Ax)⊤y.

Each saddle point satisfies the KKT conditions and therefore

∇yLµ(x, y) = 0 (optimal in y):   b − Ax = 0   [primal feas.]
∇xLµ(x, y) = 0 (optimal in x):   c − µx⁻¹ − A⊤y = 0   [dual feas.?!]

In this, x⁻¹ = (1/x1, . . . , 1/xn)⊤ is to be interpreted componentwise. The missing dual slack z appears to be z = µx⁻¹.
This gives rise to the primal-dual KKT system,

Ax = b (x > 0)        primal feasibility,
A⊤y + z = c (z > 0)   dual feasibility,
x ◦ z = µ1            perturbed complementarity.

The ◦ represents the Hadamard product, x ◦ z = (x1z1, . . . , xnzn)⊤.

• perturbed complementarity: Observe that for µ = 0 we get x ◦ z = 0, thus ⟨x, z⟩ = 0 ⇒ optimal for (P), (D). This follows by optimality conditions, but it is worth recalling the easy direct proof: For any primal feasible x and dual feasible (y, z),

⟨c, x⟩ − ⟨b, y⟩ = ⟨c, x⟩ − ⟨Ax, y⟩ = ⟨x, c − A⊤y⟩ = ⟨x, z⟩.

• When starting from (Dµ ) one arrives at the same primal-dual KKT
system.

• The objective in (Pµ) is strictly convex in x, the objective in (Dµ) is strictly convex in z. Thus, the KKT system delivers a

unique optimal solution (xµ, zµ) of (Pµ), (Dµ).

In this view, y just serves to span the feasible z-space; yµ is unique if A has full row rank and is then determined by zµ.
→ (xµ, zµ) hold the most important information [they determine the position inside the cones].
Via the Implicit Function Theorem it is possible to show (not here) that

{(xµ, zµ) : µ > 0} describes a smooth curve, the central path.

It has numerous useful properties. In particular, in linear programming it


converges from the interior to an optimal solution satisfying strict comple-
mentarity.

Definition 6.3 A pair of primal and dual optimal solutions x∗ and (y∗, z∗) satisfies strict complementarity if for all i = 1, . . . , n either x∗i ≠ 0 or z∗i ≠ 0.

Before proving this, first recall that the affine primal and dual feasible
subspaces are orthogonal to each other.

Lemma 6.4 For x′, x′′ ∈ {x : Ax = b} and z′, z′′ ∈ {z : A⊤y + z = c for some y} there holds (x′ − x′′)⊤(z′ − z′′) = 0.

Proof: By A(x′ − x′′) = 0 and A⊤(y′ − y′′) = −(z′ − z′′) we obtain

0 = (y′ − y′′)⊤A(x′ − x′′) = −(z′ − z′′)⊤(x′ − x′′). □

Lemma 6.5 For µk ↘ 0 there exist cluster points of (xµk , zµk ) and each
corresponds to a strictly complementary pair (x∗ , z ∗ ) of optimal solutions of
(P ) and (D).

Proof: To show boundedness of (xµk, zµk), fix some µ̄, let 0 < µ < µ̄ and let x0, (y0, z0) be the strictly feasible points of Ass 6.2; then by Lem 6.4,

0 = (xµ − x0)⊤(zµ − z0) = xµ⊤zµ − xµ⊤z0 − x0⊤zµ + x0⊤z0,

where xµ⊤zµ = nµ by xµ ◦ zµ = µ1, so

xµ⊤z0 + x0⊤zµ = nµ + x0⊤z0 ≤ nµ̄ + x0⊤z0.   (6.1)

Because z0 > 0 and x0 > 0 the coordinates of xµk and zµk remain bounded. Thus there is a subsequence K ⊆ N with xµk →_K x̄ ≥ 0 and zµk →_K z̄ ≥ 0 with x̄ and z̄ feasible [smoothness of the central path would also yield uniqueness]. Now,

xµk⊤zµk = nµk → 0 ⇒ x̄⊤z̄ = 0,

so x̄ = x∗ and z̄ = z∗ are primal and dual optima.
In order to prove strict complementarity, replace in (6.1) x0 by x∗ and z0 by z∗ to obtain

xµ⊤z∗ + zµ⊤x∗ = nµ + 0.

For µ > 0, xµ ◦ zµ = µ1 yields xµ/µ = zµ⁻¹ and zµ/µ = xµ⁻¹, thus

∑_{i=1}^n z∗i/(zµ)i + ∑_{i=1}^n x∗i/(xµ)i = n.

Along K each term z∗i/(zµk)i converges to 0 or 1 (it is 0 if z∗i = 0 and tends to z∗i/z̄i = 1 otherwise), and the same holds for x∗i/(xµk)i; since the terms sum to n and x∗iz∗i = 0 for each i, exactly one of each complementary pair is nonzero, which is the claim. □

The Algorithmic Framework

We need to find a root of the nonlinear primal dual KKT system,

Fµ(x, y, z) = ( Ax − b ; A⊤y + z − c ; x ◦ z − µ1 ) = 0.

Newton's method determines the step via the linearization

Fµ + JFµ · (∆x; ∆y; ∆z) = 0,

i. e.,

I    A∆x = b − Ax =: fp
II   A⊤∆y + ∆z = c − A⊤y − z =: fd
III  ∆x ◦ z + x ◦ ∆z = µ1 − x ◦ z =: fc

Solve this by

II:   ∆z = fd − A⊤∆y,
III:  ∆x = z⁻¹ ◦ (fc − x ◦ ∆z) = µz⁻¹ − x − z⁻¹ ◦ x ◦ fd + z⁻¹ ◦ x ◦ A⊤∆y,
in I: A · Diag(x) · Diag(z⁻¹) · A⊤ ∆y = fp − A(µz⁻¹ − x − z⁻¹ ◦ x ◦ fd), with M := A Diag(x) Diag(z⁻¹) A⊤ ∈ S^m_+.

The matrix M is positive definite whenever A has full row rank (which may be assumed w. l. o. g. whenever Ax = b has a solution at all), because x > 0 and z > 0. Therefore ∆y can be computed by a Cholesky decomposition, which requires O(m³/3) arithmetic operations.
Algorithm 6.6 (interior point framework)
Input: A, b, c, x0 , y0 , z0 with x0 > 0, z0 > 0
1. Choose µ.
2. Compute ∆x, ∆y, ∆z as above.
3. Choose a step length α (≤ 1) so that

x + α∆x > 0 and z + α∆z > 0

4. Update x ← x + α∆x, y ← y + α∆y, z ← z + α∆z


5. If ∥fp ∥, ∥fd ∥ and x⊤ z are small enough, STOP, else GOTO 1.

It is important to note that the choice of µ heavily influences the quality of


the step direction as well as its length,
• decreasing µ slowly results in a good step but slow progress
• decreasing µ too fast may result in a bad step with small α giving slow
progress again.
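The following Python sketch puts Algorithm 6.6 together with the elimination scheme derived above (dense linear algebra, a simple µ-rule and a fraction-to-boundary step size; the test instance at the end is made up). It is an illustration of the framework, not a tuned solver.

```python
import numpy as np

def ipm(A, b, c, x, y, z, tol=1e-8, max_iter=100):
    """Interior point framework: one Newton step on the perturbed KKT system
    per iteration, with mu tied to the current duality measure."""
    m, n = A.shape
    for _ in range(max_iter):
        fp = b - A @ x
        fd = c - A.T @ y - z
        gap = x @ z
        if max(np.linalg.norm(fp), np.linalg.norm(fd), gap) < tol:
            break
        mu = 0.1 * gap / n                 # choose mu below the average x_i z_i
        fc = mu - x * z
        xz = x / z                         # Diag(x) Diag(z^-1) as a vector
        M = (A * xz) @ A.T                 # M = A Diag(x/z) A^T, pos. definite
        rhs = fp - A @ (fc / z - xz * fd)
        dy = np.linalg.solve(M, rhs)       # a Cholesky factorization would do
        dz = fd - A.T @ dy
        dx = fc / z - xz * dz
        alpha = 1.0                        # fraction-to-boundary step length
        for v, dv in ((x, dx), (z, dz)):
            neg = dv < 0
            if neg.any():
                alpha = min(alpha, 0.9995 * np.min(-v[neg] / dv[neg]))
        x, y, z = x + alpha * dx, y + alpha * dy, z + alpha * dz
    return x, y, z

# hypothetical test: min x1 + 2 x2 s.t. x1 + x2 = 1, x >= 0 (optimum (1, 0))
A = np.array([[1.0, 1.0]]); b = np.array([1.0]); c = np.array([1.0, 2.0])
x0 = np.array([0.5, 0.5]); y0 = np.array([0.0]); z0 = c - A.T @ y0
print(ipm(A, b, c, x0, y0, z0)[0])         # approx (1, 0)
```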

The Algorithm of Monteiro and Adler

It classifies as a feasible short-step primal-dual path-following algorithm.

[Sketch: polytope with central path, neighborhood and approximate steps.]

It starts close to the central path and, doing full Newton steps of step length
1, it stays within the following neighborhood of the central path

(N ) ∥x ◦ z − µ(x, z)1∥ ≤ θµ(x, z) for some fixed 0 < θ < 1




where µ(x, z) := x⊤z/n, so that

µ(x, z)1 = ⟨x ◦ z, (1/√n)1⟩ (1/√n)1,   the projection of x ◦ z onto (1/√n)1.

Because ∥·∥∞ ≤ ∥·∥, each point (x, z) in the neighborhood satisfies |xizi − µ(x, z)| ≤ θµ(x, z), or

(N′)   (1 − θ)µ(x, z) ≤ xi · zi ≤ (1 + θ)µ(x, z),   i = 1, . . . , n.

Algorithm 6.7 (Monteiro and Adler)
Input: A, b, c, strictly feasible starting point (x^0, y^0, z^0) satisfying (N).
0. Choose a fixed σ < 1 (see later), put k ← 0.
1. µk := (x^k)⊤z^k / n.
2. Solve
   I    A∆x = 0,
   II   A⊤∆y + ∆z = 0,
   III  ∆x ◦ z^k + x^k ◦ ∆z = σµk1 − x^k ◦ z^k.
3. (x^{k+1}, y^{k+1}, z^{k+1}) := (x^k + ∆x, y^k + ∆y, z^k + ∆z). [no line search!]
4. If (x^{k+1})⊤(z^{k+1}) < 2^{−2L}, STOP.
5. k ← k + 1, GOTO 1.

The rather strange L refers to the number of bits that are required to encode
the LP and is more or less standard for complexity theoretic purposes; the
reasons for this choice will not be pursued here in any detail.
In order to simplify notation in the analysis of the algorithm we will from
now on write
(x, y, z) for (xk , y k , z k ), [the current point]
(x+ , y+ , z+ ) for (xk+1 , y k+1 , z k+1 ). [the next point]

Because of A∆x = 0 and Ax = b, also x+ satisfies Ax+ = b.
Likewise (y+, z+) satisfies A⊤y+ + z+ = c.
The only line in the primal-dual KKT system that is in general not satisfied is
the perturbed complementarity condition, because it is bilinear and in general
the linearization yields ∆x ◦ ∆z ≠ 0. All the work goes into controlling this
term.
We will prove: (x+ , z+ ) satisfies (N ) and x+ > 0 and z+ > 0.

Lemma 6.8
(i) ∆x⊤∆z = 0,
(ii) µ+ := x+⊤z+/n = σµ,
(iii) x+ ◦ z+ = µ+1 + ∆x ◦ ∆z.

Proof: (i): follows from Lem 6.4.
(ii), (iii): Consider

x+ ◦ z+ = (x + ∆x) ◦ (z + ∆z) = x ◦ z + x ◦ ∆z + ∆x ◦ z + ∆x ◦ ∆z = σµ1 + ∆x ◦ ∆z   [III].

Then nµ+ = x+⊤z+ = 1⊤(x+ ◦ z+) = 1⊤(σµ1 + ∆x ◦ ∆z) = nσµ + ∆x⊤∆z = nσµ by (i), so µ+ = σµ. □
[∆x⊤∆z = 0 is a consequence of feasibility and simplifies the analysis considerably.]
With this and the next lemma, instead of “(x+ , z+ ) satisfies (N)” it now
remains to show “∥∆x ◦ ∆z∥ ≤ θµ+ ”.
Lemma 6.9 Let x(α) = x + α∆x, z(α) = z + α∆z. If (N) is satisfied for
α = 0 and α = 1, then (N) also holds for α ∈ [0, 1].

Proof: Exercise. □
For bounding ∥∆x ◦ ∆z∥ the following relation comes in handy.

Lemma 6.10 Let x > 0, z > 0 and h = ∆x ◦ z + x ◦ ∆z (= µ+1 − x ◦ z by III). Put d = x^{1/2} ◦ z^{−1/2}, then

∥d⁻¹ ◦ ∆x∥² + ∥d ◦ ∆z∥² + 2∆x⊤∆z = ∥x^{−1/2} ◦ z^{−1/2} ◦ h∥².

Proof: ∥x^{−1/2} ◦ z^{−1/2} ◦ h∥² = ∥x^{−1/2} ◦ z^{1/2} ◦ ∆x + x^{1/2} ◦ z^{−1/2} ◦ ∆z∥², and direct computation. □
Lemma 6.11 Put γ = min{xizi : i = 1, . . . , n}, then ∥∆x ◦ ∆z∥ ≤ ∥x ◦ z − µ+1∥² / (2γ).

Proof: For d of Lem 6.10 we get

∥∆x ◦ ∆z∥ = ∥(d⁻¹ ◦ ∆x) ◦ (d ◦ ∆z)∥
≤ ∥d⁻¹ ◦ ∆x∥ · ∥d ◦ ∆z∥   [∑(aibi)² ≤ ∑ai² ∑bi²]
≤ ½ (∥d⁻¹ ◦ ∆x∥² + ∥d ◦ ∆z∥²) + ∆x⊤∆z   [2ab ≤ a² + b²; ∆x⊤∆z = 0 by Lem 6.8(i)]
= ½ ∥x^{−1/2} ◦ z^{−1/2} ◦ (µ+1 − x ◦ z)∥²   [Lem 6.10]
≤ (1/(2γ)) ∥µ+1 − x ◦ z∥². □


Theorem 6.12 If 0 < θ < 1 and 0 < σ < 1 satisfy

(θ² + n(1 − σ)²) / (2(1 − θ)) ≤ θσ,   (6.2)

then ∥x+ ◦ z+ − µ+1∥ ≤ θµ+ holds in each iteration.

Proof: Because (x ◦ z − µ1)⊤1 = 0 [recall, µ1 is the projection of x ◦ z onto (1/√n)1], Pythagoras yields

∥x ◦ z − µ+1∥² = ∥x ◦ z − µ1∥² + ∥(µ − µ+)1∥² ≤ (θµ)² + µ²(1 − σ)² n = (θ² + n(1 − σ)²) µ²,

using (N) and Lem 6.8(ii). Because (N′) holds for (x, z), the γ of Lem 6.11 satisfies γ ≥ (1 − θ)µ, so

∥x+ ◦ z+ − µ+1∥ = ∥∆x ◦ ∆z∥ ≤ ∥x ◦ z − µ+1∥² / (2γ) ≤ ((θ² + n(1 − σ)²) / (2(1 − θ))) µ ≤ θσµ = θµ+,

using Lem 6.8(iii), Lem 6.11 and (6.2). □

For (6.2) to hold with a constant θ independent of n, σ must be of the form σ ~ 1 − δ/√n. A feasible choice for (6.2) is θ = δ = 0.35.

Theorem 6.13 Given θ and σ = 1 − δ/√n that satisfy (6.2) and a strictly feasible starting point (x^0, y^0, z^0) that satisfies (N) and (x^0)⊤z^0 ≤ 2^L, the algorithm terminates in O(√n L) iterations.

Proof: For the value (x^k)⊤z^k = nµk = nσ^k µ0 to fall below 2^{−2L}, the iteration counter k has to satisfy σ^k ≤ 2^{−2L}/(nµ0), which is guaranteed by

k log σ = k log(1 − δ/√n) ≤ −k δ/√n ≤ log(2^{−2L}/(nµ0))   [log(1 + x) ≤ x].

So the termination criterion certainly holds for [L > log n]

k ≥ (√n/δ) [2L log 2 + log(nµ0)] = O(√n L). □
The cost per iteration is O(m3 ) arithmetic operations and typically m =
O(n). Therefore the desired precision is reached within O(n3.5 L) arithmetic
operations. There are theoretic variants requiring O(n3 L) operations.

Centered Starting Points via Skew Symmetric Embedding

The algorithm of Monteiro and Adler needs a starting point satisfying (N ) and
this may seem like a difficult requirement. There is, however, an astonishingly
cheap way to produce such a starting point for a slightly modified problem
for the primal-dual pair (P) and (D) that
• has a trivial starting point,
• gives a certificate of infeasibility if no solutions exist,
• and allows to reconstruct the original optimal solutions from its optimal
solution, if optimal solutions exist.
For introducing the modified problem for (P) and (D) we start with the following skew-symmetric homogenized system

(HS)   Ax − τb = 0,
       −A⊤y + τc − z = 0,
       b⊤y − c⊤x − ρ = 0,
       x ≥ 0, τ ≥ 0, z ≥ 0, ρ ≥ 0.

The system is feasible by setting all variables to zero.
For any solution with τ > 0, the point x/τ, (y/τ, z/τ) is primal/dual feasible; the third line ensures b⊤(y/τ) ≥ c⊤(x/τ) while weak duality gives b⊤(y/τ) ≤ c⊤(x/τ), so both are optimal.
This, however, also shows that this system cannot have a strictly feasible solution. In order to obtain a strictly feasible system, we simply put τ = 1, x = 1, y = 0, z = 1, ρ = 1 and compensate the arising infeasibilities (captured by new constants α, β, b̄, c̄) by introducing yet another variable ϑ (and its "dual" η) that we drive to zero,

(SE)   min βϑ
       s. t. Ax − τb + ϑb̄ = 0,
             −A⊤y + τc − ϑc̄ − z = 0,
             b⊤y − c⊤x + ϑα − ρ = 0,
             −b̄⊤y + c̄⊤x − τα − η = −β,
             x ≥ 0, τ ≥ 0, ϑ ≥ 0, z ≥ 0, ρ ≥ 0, η ≥ 0,

where (computed for x^0 = 1, y^0 = 0, z^0 = 1, τ^0 = ρ^0 = ϑ^0 = η^0 = 1)

b̄ = −A1 + b,   c̄ = c − 1,   α = c⊤1 + 1,   β = −c̄⊤1 + c⊤1 + 1 + 1 = 1⊤1 + 2 = n + 2.
Problem (SE) is the skew symmetric embedding and has the special property of being selfdual, i. e., its dual is again exactly (SE). To see this, use multipliers ỹ for row 1, x̃ for row 2, τ̃ for row 3, ϑ̃ for row 4, and introduce dual slack variables z̃, ρ̃ and η̃.
Because the dual to (SE) is of the same form as (SE), it has the same strictly feasible starting point x̃^0 = 1, ỹ^0 = 0, z̃^0 = 1, τ̃^0 = ρ̃^0 = ϑ̃^0 = η̃^0 = 1.
In the primal-dual KKT system of (SE) the perturbed complementarity lines
read
x ◦ z̃ = µ1, x̃ ◦ z = µ1,
τ ρ̃ = µ, τ̃ ρ = µ,
ϑη̃ = µ, ϑ̃η = µ.
For µ = 1 the common strictly feasible starting point is exactly on the
central path. The interior point algorithm will update both variable groups
in exactly the same way,
for k = 0, 1, . . . xk = x̃k , y k = ỹ k , . . .
There is no need to keep both copies and the ˜-copy may be dropped. The
remaining primal-dual KKT system has only two constraint lines and two
complementarity lines more than that of the original system!
It is important to note that (SE) has, in fact, the trivial optimal solution of
setting η = β and all other variables to zero. This solution, however, is not
useful at all. We need to exploit the highly valuable property of interior point
methods that for linear programming problems they converge to a strictly
complementary solution, i. e., in every complementarity pairing at least one
coordinate is nonzero. This kind of solution of (SE) allows to reconstruct
the original optimal solutions or to certify that none exist.
Theorem 6.14
The selfdual program (SE) has an optimal solution (x∗, y∗, z∗, τ∗, ρ∗, ϑ∗, η∗) with either τ∗ > 0 or ρ∗ > 0. There holds

(i) τ∗ > 0 ⇔ (P) and (D) are feasible with optimal solutions x∗/τ∗ and (y∗/τ∗, z∗/τ∗),
(ii) τ∗ = 0 ⇔ (P) or (D) has an improving half ray [i. e., at least one of them is infeasible].

Proof: The existence of an optimal solution of (SE) with either τ∗ > 0 or ρ∗ > 0 follows from Lem 6.5. Because the optimal value is 0 (set η = β and all other variables to zero), ϑ∗ = 0 and the first three constraint lines of (SE) reduce to the skew-symmetric homogenized system (HS).

(i),⇒: By τ∗ > 0 and the arguments for (HS), x∗/τ∗ and (y∗/τ∗, z∗/τ∗) are feasible and optimal for (P) and (D).
(ii),⇒: For τ ∗ = 0 strict complementarity implies ρ∗ > 0. The three
constraints yield
Ax∗ = 0, A⊤ y ∗ + z ∗ = 0 and b⊤ y ∗ > c⊤ x∗ .

Thus b⊤ y ∗ > 0 or c⊤ x∗ < 0 which gives rise to a dual or primal improving


half ray. Indeed, consider the case c⊤ x∗ < 0. If a feasible primal point x̄ ≥ 0
with Ax̄ = b exists, then the half ray x(t) = x̄ + tx∗ is also feasible for t > 0
and limt→∞ c⊤ x(t) = −∞, so if (P) is feasible, it is unbounded. If b⊤ y ∗ > 0
an analogous construction works for (D), so at least one problem is infeasible
and no finite optima exist.
(i),⇐: For primal optimal x̄ and dual optimal (ȳ, z̄) set

τ∗ = β / (1⊤(x̄ + z̄) + 1) > 0,   x∗ = τ∗x̄, y∗ = τ∗ȳ, z∗ = τ∗z̄,   ϑ∗ = ρ∗ = η∗ = 0.

The first three constraints of (SE) hold by direct inspection; consider the fourth,

−b̄⊤y∗ + c̄⊤x∗ − τ∗α = −β.

Changing sign and substituting the respective definitions yields

τ∗( −1⊤A⊤ȳ + b⊤ȳ − c⊤x̄ + 1⊤x̄ + c⊤1 + 1 ) = β,

which holds by the choice of τ∗, because A⊤ȳ = c − z̄ and b⊤ȳ − c⊤x̄ = 0 reduce the bracket to 1⊤(x̄ + z̄) + 1.

(ii),⇐: We only prove the case of a primal improving half ray given by x̄ ≥ 0, Ax̄ = 0, c⊤x̄ < 0. We may assume x̄ to be small enough so that c̄⊤x̄ = c⊤x̄ − 1⊤x̄ ≥ −β. Then x∗ = x̄, y∗ = 0, z∗ = 0, τ∗ = 0, ρ∗ = −c⊤x̄, ϑ∗ = 0, η∗ = c̄⊤x̄ + β (≥ 0 by the assumption on x̄) is an optimal solution of the required form. A similar construction works for a dual improving half ray. □
The solution of the skew-symmetric embedding (SE) by interior point methods
thus provides
• either primal and dual optimal solutions
• or a certificate that at least one of the two problems is infeasible, which
proves via Th 5.10 that no optimal solutions exist.
Practical variants often work without ϑ and τ and use equivalent infeasible
methods instead.
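Setting up the embedding data is mechanical; a short Python sketch (with a hypothetical small instance) computes b̄, c̄, α, β from (A, b, c) exactly as defined above:

```python
import numpy as np

# Hypothetical instance (A, b, c); starting values x0 = 1, y0 = 0, z0 = 1,
# τ0 = ρ0 = ϑ0 = η0 = 1 as in the text.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
c = np.array([1.0, 2.0])
n = A.shape[1]
one = np.ones(n)

b_bar = -A @ one + b              # b̄ = -A1 + b
c_bar = c - one                   # c̄ = c - 1
alpha = c @ one + 1.0             # α = c^T 1 + 1
beta = -c_bar @ one + c @ one + 2.0
print(b_bar, c_bar, alpha, beta)  # β = n + 2 = 4 for this instance
```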

(Convex) Quadratic Optimization

A convex quadratic programming problem is of the form

(QP)   min ½ x⊤Qx + c⊤x   Q ∈ Sn+, c ∈ Rn,
       s. t. Ax ≥ b,   A ∈ R^{m×n}, b ∈ Rm,
       x ∈ Rn.

The Lagrangian dual (with constraints derived via the optimality condition on the choice of x) reads

max ½ x⊤Qx + c⊤x + (b − Ax)⊤y
s. t. Qx + c − A⊤y = 0,
x ∈ Rn, y ≥ 0.

Exercise Exploiting the constraint, show that the dual cost function is equivalent to b⊤y − ½ x⊤Qx, which is concave, thereby proving convexity of the problem. Furthermore, if Q ≻ 0 one may eliminate x to obtain the following quadratic problem in y only,

max (b + AQ⁻¹c)⊤y − ½ y⊤AQ⁻¹A⊤y − ½ c⊤Q⁻¹c
s. t. y ≥ 0.



Strong duality holds for this primal and dual by Th 4.16 and Th 5.9.
The barrier problem to (QP) reads

min_{x∈Rn} f(x) := ½ x⊤Qx + c⊤x − µ ∑_{i=1}^m log(A_{i,•}x − bi).

It is an unconstrained smooth convex problem and the first order optimality conditions are sufficient,

∇f(x) = 0:   Qx + c − µA⊤[ 1/(A_{i,•}x − bi) ]_{i=1,...,m} = 0.

Introduce slack variables s := Ax − b ≥ 0 and put y = µs−1 to obtain the


primal-dual KKT system for convex quadratic programming,
Qx + c − A⊤ y = 0 dual feasibility,
Ax − s = b primal feasibility,
s ◦ y = µ1 perturbed complementarity.
Compute the Newton step by solving
Q∆x − A⊤ ∆y = −(Qx + c − A⊤ y),
A∆x − ∆s = −(Ax − s − b),
∆s ◦ y + s ◦ ∆y = µ1 − s ◦ y.
Similar to linear programming, this algorithmic approach allows to solve
convex quadratic programming problems in O(n3.5 L) [or even O(n3 L)].
Example Portfolio optimization (Markowitz model): Let N = {1, . . . , n} be
n possible investments with xi being the share invested into i ∈ N giving a
stochastic per unit revenue wi with expected value w̄i = E(wi ). The expected
revenue w̄⊤ x should not fall below some level ω ∈ R while keeping the “risk”
small. In the Markowitz model “risk” is measured by the covariance matrix
Q = E((w − w̄)(w − w̄)⊤ ). In this, the distribution data is assumed to be
obtainable, e. g., from past observations or from “experts”. The task now
reduces to solving the quadratic programming problem
min 12 x⊤ Qx risk measure,
s. t. w̄⊤ x ≥ ω, expected revenue,
1⊤ x = 1, share of entire investment,
x ≥ 0.

The problem is solved for several values of ω in order to find a balance


between risk and revenue.
Today there are better risk measures (preferably convex ones). Several further
aspects like a minimum level of investment for items i that are invested
in (requires integer programming techniques) are included in the models.
The main difficulty, however, remains how to get reliable distribution data
reflecting the stochasticity. ♡
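A sketch of the Markowitz model above with made-up data, assuming the cvxpy modeling package is available (any convex QP solver of the kind discussed in this chapter would do as well):

```python
import numpy as np
import cvxpy as cp

# Hypothetical data for 3 assets: expected revenues and a covariance matrix.
w_bar = np.array([1.05, 1.10, 1.20])
Q = np.array([[0.010, 0.002, 0.000],
              [0.002, 0.040, 0.010],
              [0.000, 0.010, 0.090]])
omega = 1.10                          # required expected revenue level

x = cp.Variable(3)
prob = cp.Problem(cp.Minimize(0.5 * cp.quad_form(x, Q)),
                  [w_bar @ x >= omega,   # expected revenue
                   cp.sum(x) == 1,       # full investment
                   x >= 0])
prob.solve()
print(x.value, prob.value)            # optimal shares and risk value
```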
Example SQP-Methods (Sequential Quadratic Programming) belong to
the best available approaches for solving general constrained nonlinear pro-
gramming problems of the form

min f (x) f ∈ C2
s. t. gi (x) ≤ 0, i = 1, . . . , m g : Rn → Rm , g ∈ C 2
x ∈ Rn .

For the current point x the step direction ∆x is determined by solving a quadratic subproblem that may be derived, e. g., from a barrier approach for including the inequalities,

min_x f(x) − µ ∑_{i=1}^m log(−gi(x)).

Stationarity reads

0 = ∇f(x) + µ ∑_{i=1}^m ∇gi(x) · 1/(−gi(x));   put s := −g(x) ≥ 0, y := µs⁻¹.

This gives rise to the system

I    ∇f(x) + ∑ yi∇gi(x) = 0,
II   g(x) + s = 0,
III  s ◦ y = µ1.

Applying Newton's method to this nonlinear system gives the following two equations for the first two lines,

I    [∇²f(x) + ∑ yi∇²gi(x)]∆x + ∑ ∆yi∇gi(x) = −[∇f(x) + ∑ yi∇gi(x)],
II   Jg(x)∆x + ∆s = −[g(x) + s].

In order to recognize the quadratic subproblem in this, put

A := −Jg(x),   Q := ∇²f(x) + ∑ yi∇²gi(x),   c := ∇f(x),   b := g(x);

then, with x̄ := ∆x, s̄ := s + ∆s, ȳ := y + ∆y, the system for computing the step reads

I    Q∆x + c − A⊤(y + ∆y) = 0          Qx̄ + c − A⊤ȳ = 0
II   A∆x − (s + ∆s) = b        ←→      Ax̄ − s̄ = b
III  (s + ∆s) ◦ (y + ∆y) = µ1          s̄ ◦ ȳ = µ1

The latter are equivalent to the optimality conditions of the barrier problem for the quadratic program

min ½ x̄⊤Qx̄ + c⊤x̄           min ½ ∆x⊤[∇²f(x) + ∑ yi∇²gi(x)]∆x + ∇f(x)⊤∆x
s. t. Ax̄ ≥ b          ⇔     s. t. g(x) + Jg(x)∆x ≤ 0
x̄ ∈ Rn                      ∆x ∈ Rn
Note, the constraints are replaced by their linearizations in x, but the quadratic term of the cost function now includes the curvature information of f as well as that of the active gi weighted by their current Lagrange multiplier approximations yi. This results in cautious steps in directions where these functions have strong curvature. Once ∆x is computed, the method continues with a line search in the direction of ∆x (with step size ≤ 1 because it is a Newton method).
Similar approaches as in unconstrained optimization may be used to render
Q positive semidefinite. The convexified model can be solved efficiently with
interior point methods. ♡

Second-Order-Cone (SOC) Programming

The second order cone Qn = {(x0; x̄) ∈ Rn : x0 ≥ ∥x̄∥} (x ≥Q 0; and x >Q 0 :⇔ x0 > ∥x̄∥) is self-dual ((Qn)∗ = −(Qn)◦ = Qn). The standard primal-dual pair of (linear) second-order-cone programs reads (see the duality considerations in the initial example for conic linear programs of Chapter 5)

min c⊤x s. t. Ax = b, x ≥Q 0;   max b⊤y s. t. A⊤y + z = c, y ∈ Rm, z ≥Q 0.
Strong duality holds, if there exist strictly feasible primal x̃ >Q 0 with
Ax̃ = b and dual ỹ ∈ Rm , z̃ >Q 0 with A⊤ ỹ + z̃ = c (without proof).
SOC interior point methods are based on the barrier problem

min c⊤x − ½ µ log(x0² − ∥x̄∥²) s. t. Ax = b (x >Q 0),

with Lagrangian

Lµ(x, y) = c⊤x − ½ µ log(x0² − ∥x̄∥²) + y⊤(b − Ax).
Due to convexity, the sufficient optimality conditions require

∇xLµ = 0:   c − ½ µ (1/(x0² − ∥x̄∥²)) (2x0; −2x̄) − A⊤y = 0   [the subtracted term =: z],
∇yLµ = 0:   Ax = b.

We have z = (µ/(x0² − ∥x̄∥²)) (x0; −x̄) >Q 0 ⇔ x >Q 0. In order to express the perturbed complementarity condition in bilinear form, observe that z solves

( x0 , x̄⊤ ; x̄ , x0 I ) z = µ (1; 0̄),   where the matrix on the left is =: Arw(x).

Consider this arrow operator Arw(x). We have Arw(x) ≻ 0 ⇔ x0 > 0 ∧ x0 > (1/x0) x̄⊤x̄ (by the Schur Complement Theorem, Ex) ⇔ x >Q 0, so z is uniquely determined.
With this, the primal-dual KKT system reads

A⊤ y + z = c, z >Q 0,
Ax = b, x >Q 0,
Arw(x)z = µe0 .

Continue as usual by solving this with Newton’s method. In fact, for a


single SOC the optimal solution can be determined explicitly, but in all
practical applications SOCs never appear alone but in combination with
several other cones. In other words, in a conic linear program the cone is
typically the Cartesian product of several smaller nonnegative cones, SOCs, and semidefinite cones.
Any convex quadratic constraint can be represented by / modeled as a second order cone constraint (SOC constraint). To see this, consider, for Q = LL⊤ ∈ Sn+, q ∈ Rn, δ ∈ R, the convex quadratic function

g(x) = x⊤Qx + q⊤x + δ.

Then g(x) ≤ 0 ⇔ (x; 0) ∈ epi g and

epi g = {(x; r) : g(x) ≤ r} = { (x; r) : ∥( ((1 − r) + (q⊤x + δ))/2 ; L⊤x )∥ ≤ ((1 + r) − (q⊤x + δ))/2 }.
"1 #  1 ⊤ 1 
  (1−δ) −2q 2
ξ 2
Thus the linear constraints ξ̄0 = q +  L⊤ 0  ( xr ) and conic
1 1 ⊤ 1
2 (1+δ) 2q −2
¯ and r ≤ 0 give a valid SOC constraint representation.
constraints ξ0 ≥ ∥ξ∥
SOC programming with several second order cones therefore allows to model
arbitrary combinations of convex quadratic constraints (and objectives).
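The epigraph reformulation above is easy to validate numerically; the following Python sketch (random hypothetical data) checks that the quadratic inequality and the SOC inequality agree:

```python
import numpy as np

# Random made-up data Q = L L^T, q, delta; check (x, r) in epi g against the
# SOC inequality from the text.
rng = np.random.default_rng(0)
L = rng.standard_normal((3, 3)); Q = L @ L.T
q = rng.standard_normal(3); delta = -1.0

for _ in range(5):
    x = rng.standard_normal(3); r = rng.standard_normal()
    lhs = x @ Q @ x + q @ x + delta <= r             # (x, r) in epi g?
    s = q @ x + delta
    u = np.concatenate(([((1 - r) + s) / 2], L.T @ x))
    soc = np.linalg.norm(u) <= ((1 + r) - s) / 2     # SOC inequality
    print(lhs, soc)                                  # the two always agree
```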

Semidefinite Programming (SDP)

Here the variable X ∈ Sn+ (X ⪰ 0) is a symmetric matrix of order n required


to be positive semidefinite. Positive definite matrices are denoted by Sn++
(≻ 0). As inner product we use the standard Frobenius or trace inner product

for matrices, i. e., for A, B ∈ R^{m×n}, ⟨A, B⟩ := Σ_{ij} A_{ij}B_{ij} = tr B^⊤A (simply consider matrices as stacked column vectors). In normal form the primal semidefinite program reads
semidefinite program reads

    min  ⟨C, X⟩
    s. t.  AX = b,     where  AX = ( ⟨A₁, X⟩, …, ⟨A_m, X⟩ )^⊤  for given A₁, …, A_m ∈ S^n,
           X ⪰ 0.

For the dual program the adjoint operator A^⊤ to A corresponds, as usual, to linear combinations of the rows. Indeed, the adjoint operator is defined by

    ∀ X ∈ S^n, y ∈ R^m :   ⟨AX, y⟩ = Σᵢ yᵢ⟨Aᵢ, X⟩ = ⟨ Σᵢ yᵢAᵢ , X ⟩,    so  A^⊤y := Σᵢ yᵢAᵢ.

Because the positive semidefinite cone Sn+ is self-dual ((Sn+ )∗ = −(Sn+ )◦ = Sn+ ),
the dual semidefinite program reads

max b⊤ y
s. t. A⊤ y + Z = C,
y ∈ Rm , Z ⪰ 0.

Strong duality holds if there exist primal X ≻ 0 with AX = b and dual


y ∈ Rm , Z ≻ 0 with A⊤ y + Z = C (without proof). In contrast to linear
programming, however, there also exist feasible primal-dual pairs with a
strictly positive duality gap, i.e., the primal optimal value may be strictly
larger than the dual optimal value and still both may be attained by feasible
solutions.
Semidefinite optimization is significantly more general than linear, convex quadratic and SOC programming, which it contains as special cases:
• LP:  x ≥ 0  ⇔  X = Diag(x) ⪰ 0  (the diagonal matrix with entries x₁, …, xₙ).
• convex quadratic programming/constraints: for Q ⪰ 0,

    x^⊤Qx + b^⊤x + d ≤ 0   ⇔   [ −(b^⊤x + d)   (Q^{1/2}x)^⊤ ;  Q^{1/2}x   I ] ⪰ 0   (Schur complement).

• SOCP:  x ≥_Q 0 ⇔ Arw(x) ⪰ 0.
• several semidefinite variables:  X₁ ⪰ 0, …, X_k ⪰ 0  ⇔  blockdiag(X₁, …, X_k) ⪰ 0.

Indeed, for theoretical purposes it suffices to consider just one semidefinite


variable. In practice, however, one should split the variable into separate
linear, SOC or semidefinite blocks, whenever possible.

Because X is positive semidefinite if and only if its eigenvalues λᵢ(X) are nonnegative, a suitable barrier function is

    − log det X = − log Πᵢ₌₁ⁿ λᵢ(X) = − Σᵢ₌₁ⁿ log λᵢ(X).

One can prove
• − log det X is strictly convex on the positive definite matrices S^n₊₊,
• ⟨∇(− log det X), ·⟩ = ⟨−X^{−1}, ·⟩ [as a linear functional].
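Both statements are easy to probe numerically; the following sketch (assuming NumPy; names ours) compares ⟨−X^{−1}, D⟩ with a central finite difference of −log det X along a symmetric direction D:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
X = B @ B.T + 4 * np.eye(4)                     # a positive definite X

f = lambda M: -np.log(np.linalg.det(M))         # the barrier -log det
D = rng.normal(size=(4, 4)); D = (D + D.T) / 2  # symmetric direction
t = 1e-6
fd = (f(X + t * D) - f(X - t * D)) / (2 * t)    # directional derivative, numerically
print(fd, np.sum(-np.linalg.inv(X) * D))        # matches <-X^{-1}, D>
```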
The barrier problem gives rise to the Lagrangian

Lµ (X, y) = ⟨C, X⟩ − µ log det X + ⟨b − AX, y⟩

and the sufficient optimality conditions require

∇X Lµ (X, y) = 0 : C − µX −1 − A⊤ y = 0 → put Z = µX −1
∇y Lµ (X, y) = 0 : AX = b

The primal-dual KKT system and its direct linearization read

    A^⊤y + Z = C              A^⊤∆y + ∆Z = C − A^⊤y − Z
    AX = b            →       A∆X = b − AX
    XZ = µI                   ∆X·Z + X·∆Z = µI − XZ

Because of the third line, the solution of this linearized system will, in general, result in a nonsymmetric ∆X ∈ R^{n×n}. One can prove that it suffices to use the symmetric part ½(∆X + ∆X^⊤), but there are other attractive symmetrization strategies as well.
In all other aspects the interior point approach and its analysis follow the linear programming case. For a strictly feasible starting point (X⁰, y⁰, Z⁰) close to the central path the algorithm stops with a solution ⟨X, Z⟩ ≤ ε in O(√n · log(⟨X⁰, Z⁰⟩/ε)) iterations. The same skew-symmetric embedding works unless serious duality issues arise.
A decisive difference is that in semidefinite programming feasible solutions
may require doubly exponential encoding size relative to the encoding length
of the problem. For the current approaches the dependence on ε and the
starting point cannot be replaced by a polynomial expression depending on
the encoding length of the problem. Indeed, in the strict theoretical sense
it is not yet clear whether general semidefinite programs can be solved in
polynomial time or not.
For practical purposes and problems interior point methods need surprisingly
few iterations, but each iteration is quite expensive. Due to additional
numerical issues developing solvers for semidefinite programming is much

more demanding than for linear programming and there is still a lot of work
ahead.
Example Robust stability of dynamical systems: A dynamical system describes the change of a state x(t) over time by differential equations. The system is called stable if all trajectories lead to some desired goal state, which is typically shifted into the origin. In the robust linear setting considered here the coefficients of the linear system are not fully known in advance,

    dx(t)/dt = A(t)x(t)    where A may be any A(t) ∈ conv{A₁, …, A_k}.
All trajectories are certainly leading to the origin if there exists a norm ∥x∥_H = √(x^⊤Hx) with H ≻ 0 so that the norm of the state vector strictly decreases along the trajectories,

    d∥x(t)∥²_H / dt ≤ δ < 0.
If such an H ≻ 0 exists, the system is called quadratically stable and x^⊤Hx is called a Lyapunov function. For the current system the criterion evaluates to

    d/dt (x(t)^⊤Hx(t)) = (dx/dt)^⊤Hx(t) + x(t)^⊤H (dx/dt) = x(t)^⊤(A(t)^⊤H + HA(t))x(t).
Because this has to be less than some δ < 0 for any starting point x = x(0) ∈ R^n \ {0} and any A(t) ∈ conv{A₁, …, A_k}, the sought H ≻ 0 has to satisfy

    Aᵢ^⊤H + HAᵢ ≺ 0    for i = 1, …, k.
This may be cast as an SDP as follows,

    max  λ
    s. t.  H ⪰ λI,
           Aᵢ^⊤H + HAᵢ ⪯ −λI,   i = 1, …, k,
           λ ∈ R,  H ∈ S^n.

In order to illustrate how to write this as a dual in normal form, put y = (λ, h₁₁, …, h₁ₙ, h₂₂, …, hₙₙ)^⊤ and consider the block diagonal reformulation

    max  λ
    s. t.  blockdiag( λI − H,  A₁^⊤H + HA₁ + λI,  …,  A_k^⊤H + HA_k + λI ) + blockdiag( Z₀, Z₁, …, Z_k ) = 0,
           y ∈ R^{1+n(n+1)/2},  Z₀ ⪰ 0, …, Z_k ⪰ 0,

where the first block diagonal matrix takes the role of A^⊤y (it is linear in y). If λ* > 0, the corresponding H* generates the required Lyapunov function.
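As a modeling sketch of this SDP, assuming the Python package cvxpy with an SDP-capable solver is installed (all names ours; since the feasible set above is a cone, we add the normalization H ⪯ I to keep λ bounded):

```python
import numpy as np
import cvxpy as cp

def quadratic_stability(As):
    # Search H > 0 with Ai^T H + H Ai < 0 for all vertex matrices Ai.
    n = As[0].shape[0]
    H = cp.Variable((n, n), symmetric=True)
    lam = cp.Variable()
    cons = [H >> lam * np.eye(n), H << np.eye(n)]   # H >= lam*I; H <= I normalizes the cone
    cons += [Ai.T @ H + H @ Ai << -lam * np.eye(n) for Ai in As]
    cp.Problem(cp.Maximize(lam), cons).solve()
    return lam.value, H.value                       # lam > 0 certifies quadratic stability

A1 = np.array([[-1.0, 1.0], [0.0, -1.0]])
A2 = np.array([[-1.0, 0.0], [1.0, -1.0]])
lam, H = quadratic_stability([A1, A2])
print(lam)   # should come out positive here, so x^T H x is a common Lyapunov function
```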

Chapter 7

The Simplex Method

The simplex method is a classical and in many situations the most efficient method for solving linear programs, in particular if the program formulation needs to be changed dynamically by adding further constraints or variable columns. In contrast to interior point methods it heavily exploits the linear cost function and the polyhedral feasible set by starting in some feasible vertex and by switching to successively better ones along edges of the polyhedron. This gives the method a strong combinatorial flavor.
Unless explicitly stated otherwise we consider linear programs in normal
form,
min c⊤ x c ∈ Rn
(P ) s. t. Ax = b, A ∈ Rm×n of full row rank, b ∈ Rm ,
x≥0
with feasible set X = {x ≥ 0 : Ax = b} and corresponding dual
max b⊤ y
(D) s. t. A⊤ y + z = c,
y ∈ Rm , z ≥ 0,

with feasible set Z = {(y, z) ∈ Rm × Rn+ : A⊤ y + z = c}.

7.1 The (revised) Primal Simplex Algorithm

For deriving the method consider the given linear inequality system

Ax = b, m equations,
Ix ≥ 0, n inequalities,

and its relation to the polyhedral feasible set.


[sketch: one/two equations in R³₊]

A vertex of the feasible set is determined by n equations,


m are already fixed, Ax = b, and
n − m stem from the inequalities → N ⊆ {1, . . . , n}, |N | = n − m.
A selection of n − m equations xi = 0, i ∈ N , defines a vertex only if the
remaining elements of x are uniquely determined by Ax = b. Putting
B = {1, . . . , n} \ N with some ordering B = (B1 , . . . , Bm )
and splitting A and x into (by resorting the columns)

    A = [A_B  A_N],  x = (x_B, x_N)    gives    A_Bx_B + A_Nx_N = b,  x_N = 0.

This determines x_B uniquely if and only if A_B is regular/invertible. Then

    x_B = A_B^{-1}[b − A_Nx_N]    (with row i holding the element Bᵢ).
This is the case if AB has linearly independent columns, i. e., if the columns
form a basis of Rm .
Definition 7.1 Given an inequality system Ax = b, x ≥ 0 with A ∈ Rm×n
of full row rank, b ∈ Rm ,
• a regular submatrix AB with column indices B ⊆ {1, . . . , n}, |B| = m, is
a basis (in this, consider B as an m-tuple B = (B1 , . . . , Bm ) reflecting
the ordering of the columns).
• the variables xB are basis variables or dependent variables,
• the variables xN with N = {1, . . . , n} \ B are nonbasic/independent
variables,
• x̄_B = A_B^{-1}[b − A_N x̄_N] with x̄_N = 0 is a basic solution (to B),
• and it is a feasible basis/basic solution if x̄B ≥ 0.
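A direct translation of this definition into a numerical sketch (assuming NumPy; names ours):

```python
import numpy as np

def basic_solution(A, b, B):
    # x_B = A_B^{-1} b, x_N = 0; the basis B is feasible iff x_B >= 0.
    xB = np.linalg.solve(A[:, B], b)
    x = np.zeros(A.shape[1])
    x[list(B)] = xB
    return x, bool((xB >= 0).all())

A = np.array([[1., 1, 1, 0], [1, 2, 0, 1]])
b = np.array([4., 6])
print(basic_solution(A, b, [0, 1]))   # x = (2, 2, 0, 0), feasible
print(basic_solution(A, b, [0, 2]))   # x = (6, 0, -2, 0), basis but infeasible
```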

To start, let B be a feasible basis (how to get there will be discussed later),

    x_B = A_B^{-1}[b − A_Nx_N],  x_N = 0,  x_B ≥ 0.

In order to check whether the solution can be improved by increasing one of the nonbasic xᵢ with i ∈ N, express the objective c^⊤x in dependence of x_N,

    c^⊤x = c_B^⊤x_B + c_N^⊤x_N = c_B^⊤A_B^{-1}b − c_B^⊤A_B^{-1}A_Nx_N + c_N^⊤x_N
         = c_B^⊤A_B^{-1}b + (c_N − A_N^⊤A_B^{-⊤}c_B)^⊤x_N,

where the constant c_B^⊤A_B^{-1}b is the current objective value for x_N = 0 and c̃_N := c_N − A_N^⊤A_B^{-⊤}c_B are the reduced costs.

The reduced costs reflect the change in objective in dependence of x_N. For testing whether increasing some nonbasic variable improves the objective we need to check:

    ∃ j ∈ N with c̃ⱼ < 0 ?

What follows from c̃ⱼ ≥ 0 for all j ∈ N? [sketch: one equation with N = {2, 3}]

Lemma 7.2 Consider the linear program (P) min c^⊤x s. t. Ax = b, x ≥ 0 and let x̄ be a feasible basic solution to basis B having reduced costs c_N − A_N^⊤A_B^{-⊤}c_B ≥ 0. Then x̄ is optimal.

Proof: Let x be a feasible point of (P), i. e., Ax = b, x ≥ 0. Then

    x_B = A_B^{-1}[b − A_Nx_N] ≥ 0,   x_N ≥ 0 = x̄_N,

and   c^⊤x = c_B^⊤A_B^{-1}b + (c_N − A_N^⊤A_B^{-⊤}c_B)^⊤x_N ≥ c_B^⊤A_B^{-1}b = c^⊤x̄,

because both factors of the last product are nonnegative.   □
The proof also shows that multiple optima are possible only if at least one
component of the reduced costs is zero. Later this will be of relevance again.
Suppose now there is a nonbasic index with negative reduced costs,

∃ȷ̂ ∈ N c̃ȷ̂ < 0.

Computing the reduced cost vector c̃N and selecting such a ȷ̂ is called
“pricing”. For this fixed ȷ̂ the objective strictly improves when increasing xȷ̂
(all other nonbasic variables are kept at zero). So the next step is to increase
xȷ̂ as much as possible without leaving the feasible set,

    find the largest xȷ̂ with x_B(xȷ̂) = A_B^{-1}[b − A_{•,ȷ̂}xȷ̂] ≥ 0    (note, A_B^{-1}b ≥ 0).

With row i of A_B^{-1}[b − A_{•,ȷ̂}xȷ̂] corresponding to the i-th basis element Bᵢ, x_{Bᵢ} is in danger of becoming negative only if [A_B^{-1}A_{•,ȷ̂}]ᵢ > 0, thus feasibility is guaranteed for

    xȷ̂ ≤ inf{ [A_B^{-1}b]ᵢ / [A_B^{-1}A_{•,ȷ̂}]ᵢ : i ∈ {1, …, m}, [A_B^{-1}A_{•,ȷ̂}]ᵢ > 0 }    "ratio test".

If there are no indices with [A_B^{-1}A_{•,ȷ̂}]ᵢ > 0, the infimum evaluates to +∞, xȷ̂ may be increased to infinity without violating feasibility and at the same time the objective value decreases to minus infinity. This gives an improving half ray,

    ∀ α ≥ 0:   x(α) = (x_B = A_B^{-1}b, x_N = 0) + α (−A_B^{-1}A_{•,ȷ̂}, eȷ̂) ∈ X,

and, because c̃ȷ̂ < 0,

    inf{ c^⊤x(α) = c_B^⊤A_B^{-1}b + c̃ȷ̂ α : α ≥ 0 } = −∞.

[sketch: unbounded set in R³₊]

Thus, if there is no i ∈ {1, . . . , m} with [A−1


B A•,ȷ̂ ]i > 0, the optimization
problem is unbounded.
Suppose now that the infimum is finite and let ı̂ be an index for which this infimum is attained (in reference to an old tableau form of the algorithm designed for updating A_B^{-1}A_N by hand, the pair (ı̂, ȷ̂) is often referred to as pivot and the corresponding element [A_B^{-1}A_{•,ȷ̂}]ı̂ within the matrix A_B^{-1}A_N as pivot element). Set

    xȷ̂ ← [A_B^{-1}b]ı̂ / [A_B^{-1}A_{•,ȷ̂}]ı̂    and    x_B ← x_B − A_B^{-1}A_{•,ȷ̂} xȷ̂.
With this the basic variable to ı̂ is now zero, xBı̂ = 0, it is removed from the
current basis and added to the nonbasic variables. xBı̂ is called the leaving
variable, its place in the basis is taken by the entering variable xȷ̂ . This
indeed gives rise to a feasible basis again.
Lemma 7.3 The index set B⁺ = (B \ {Bı̂}) ∪ {ȷ̂} describes a feasible basis and

    x⁺ = (x_B = A_B^{-1}b, x_N = 0) + ( [A_B^{-1}b]ı̂ / [A_B^{-1}A_{•,ȷ̂}]ı̂ ) · (−A_B^{-1}A_{•,ȷ̂}, eȷ̂) ≥ 0

is the corresponding feasible basic solution.

Proof: Because A_B is a basis, A_Bw = A_{•,ȷ̂} has a unique solution. Because wı̂ > 0, the column A_{•,ȷ̂} is linearly independent of the columns in B \ {Bı̂}. Thus the columns of A_{B⁺} are again a basis by the exchange theorem of Steinitz. By construction, x⁺ ≥ 0, and for N⁺ = {1, …, n} \ B⁺ there holds x_{N⁺} = 0 and, with γ = [A_B^{-1}b]ı̂ / [A_B^{-1}A_{•,ȷ̂}]ı̂,

    A_{B⁺}x⁺_{B⁺} = A_Bx_B − γ A_BA_B^{-1}A_{•,ȷ̂} + γ A_{•,ȷ̂} = b,

because the first term equals b and the last two cancel.   □
In order to change to this basis it remains to

update the feasible basis to N ← (N ∪ {Bı̂ }) \ {ȷ̂} and Bı̂ ← ȷ̂.

The process is now repeated for the updated feasible basis. This completes
the derivation and description of the primal simplex algorithm.
Algorithm 7.4 ((Revised Primal) Simplex Algorithm)
Input: A, b, c, a feasible basis B and x̄B = A−1
B b ≥ 0.

1. BTRAN: Compute ȳ = A−⊤ B cB .


[For numerical stability and to exploit sparsity solve A⊤ B ȳ = cB .]
2. Pricing: Compute z̄N = cN − A⊤ N ȳ.
If z̄N ≥ 0 then x̄ is optimal, STOP,
else choose ȷ̂ ∈ N with z̄j < 0. xȷ̂ is the entering variable.
3. FTRAN: Solve AB w = A•,ȷ̂ [basis variables change by −w per unit of xȷ̂ ].
4. Ratio Test: If w ≤ 0 the LP is unbounded, STOP,
   else let γ = min{ x̄_{Bᵢ}/wᵢ : wᵢ > 0, i ∈ {1, …, m} } be attained for some ı̂ ∈ {1, …, m}; x_{Bı̂} is the leaving variable.

5. update x̄B ← x̄B − γw, xȷ̂ ← γ, N ← (N ∪ {Bı̂ }) \ {ȷ̂}, Bı̂ ← ȷ̂, GOTO 1.
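Since the algorithm is now fully specified, it is straightforward to turn into a small program. The following sketch (assuming NumPy; it uses dense solves instead of the factorization updates of a serious implementation, smallest-index pivot selection, and names of our choosing) reproduces the iterations of the example below:

```python
import numpy as np

def revised_primal_simplex(A, b, c, B, tol=1e-9):
    # Sketch of Alg 7.4; B is a feasible basis given as a list of 0-based indices.
    m, n = A.shape
    B = list(B)
    while True:
        AB = A[:, B]
        xB = np.linalg.solve(AB, b)                  # current basic solution
        y = np.linalg.solve(AB.T, c[B])              # 1. BTRAN
        N = sorted(set(range(n)) - set(B))
        improving = [j for j in N if c[j] - A[:, j] @ y < -tol]   # 2. pricing
        if not improving:
            x = np.zeros(n); x[B] = xB
            return x, y, B                           # reduced costs >= 0: optimal
        j = min(improving)                           # smallest entering index
        w = np.linalg.solve(AB, A[:, j])             # 3. FTRAN
        rows = [i for i in range(m) if w[i] > tol]
        if not rows:
            raise ValueError("the LP is unbounded")  # 4. ratio test
        gamma = min(xB[i] / w[i] for i in rows)
        ihat = min((i for i in rows if xB[i] / w[i] <= gamma + tol),
                   key=lambda i: B[i])               # smallest leaving index on ties
        B[ihat] = j                                  # 5. basis update

# Mozart problem from the example below, initial basis B = (3, 4, 5) (0-based):
A = np.array([[1., 1, 1, 0, 0], [2, 1, 0, 1, 0], [1, 2, 0, 0, 1]])
b = np.array([6., 11, 9])
c = np.array([-9., -8, 0, 0, 0])
x, y, B = revised_primal_simplex(A, b, c, [2, 3, 4])
print(x, c @ x)   # x = (5, 1, 0, 0, 2) with objective -53
```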

Example
Consider the Mozart problem,

    A = [ 1 1 1 0 0 ; 2 1 0 1 0 ; 1 2 0 0 1 ],   b = (6, 11, 9)^⊤,   c = (−9, −8, 0, 0, 0)^⊤,

with the origin as initial basis/vertex. [sketch: feasible region in the (x₁, x₂)-plane with the constraint lines for x₃, x₄, x₅, the objective direction c, and the visited vertices]

Input: A, b, c and B = (3, 4, 5), N = (1, 2), x̄ = (0, 0, 6, 11, 9)^⊤, x̄_B = (6, 11, 9)^⊤.

Iteration 1: A_B = I, c_B = (0, 0, 0)^⊤, ȳ = (0, 0, 0)^⊤, z̄_N = (−9, −8)^⊤.
Choose ȷ̂ = 1, then w = (1, 2, 1)^⊤ → ı̂ = 2, Bı̂ = 4, γ = 5.5.
Update to vertex/basis B = (3, 1, 5), N = (4, 2), x̄_B = (0.5, 5.5, 3.5)^⊤, x̄ = (5.5, 0, 0.5, 0, 3.5)^⊤.

Iteration 2: A_B = [ 1 1 0 ; 0 2 0 ; 0 1 1 ], c_B = (0, −9, 0)^⊤, ȳ = (0, −4.5, 0)^⊤, z̄_N = (4.5 [N₁=4], −3.5 [N₂=2])^⊤.
Choose ȷ̂ = 2, then w = (0.5, 0.5, 1.5)^⊤ → ı̂ = 1, Bı̂ = 3, γ = 1.
Update to vertex/basis B = (2, 1, 5), N = (4, 3), x̄_B = (1, 5, 2)^⊤, x̄ = (5, 1, 0, 0, 2)^⊤.

Iteration 3: A_B = [ 1 1 0 ; 1 2 0 ; 2 1 1 ], c_B = (−8, −9, 0)^⊤, ȳ = (−7, −1, 0)^⊤, z̄_N = (1, 7)^⊤ ≥ 0 → optimal. ♡
Historically, the simplex method was developed by George B. Dantzig before computers were available. It supported computations by hand via a clever arrangement in a tableau that facilitated the repeated application of Gaussian elimination steps. These are required to execute the basis changes for the selected pivot element.
Example Originally, the simplex tableau works on problems stated in canonical form. Here we describe a variant adapted to the revised method for the normal form described above and apply it to the first step of the Mozart problem with initial basis B = (3, 4, 5). In this variant it maintains the updated system b = Ax in the form A_B^{-1}b = A_B^{-1}A_Nx_N + Ix_B with an additional row zero holding the current and reduced costs in the form −c_B^⊤A_B^{-1}b | (c_N − A_N^⊤A_B^{-⊤}c_B)^⊤x_N + 0x_B. In the table below, the simplex tableau only consists of the first three boxed columns; the more or less redundant three identity columns are kept here only for illustrative purposes.

                 | N₁,x₁  N₂,x₂ | B₁,x₃  B₂,x₄  B₃,x₅
    cost row   0 |  −9     −8   |   0      0      0
    B₁,x₃      6 |   1      1   |   1      0      0
    B₂,x₄     11 |  [2]     1   |   0      1      0
    B₃,x₅      9 |   1      2   |   0      0      1

(first column: −c_B^⊤A_B^{-1}b above A_B^{-1}b; middle block: reduced costs above A_B^{-1}A_N; right block: identity columns; the boxed entry [2] is the pivot element)

The boxed pivot pair (2, 1) is determined by first choosing a column xȷ̂ with negative reduced cost (e. g., the most negative) and then choosing within this column a row ı̂ with positive value and smallest right hand side to value ratio. Replacing the current basic variable x_{B₂} = x₄ against the nonbasic variable x₁ now requires transforming the column corresponding to x₁ to the unit vector e₂ via standard Gaussian elimination steps. This means
• multiply line 2 (corresponding to B₂ or x₄) by 1/2,
• add 9 times this new line 2 to the cost line 0,
• add −1 times this new line 2 to basis line 1,
• add −1 times this new line 2 to basis line 3.
• swap the recomputed column of the leaving variable x4 into the position
of the entering variable x1 and change the labels of the rows and columns
accordingly.
This results in the next simplex tableau and the next pivot, etc.

                  | N₁,x₄  N₂,x₂
    cost row 49.5 |  9/2    −7/2
    B₁,x₃    1/2  | −1/2   [1/2]
    B₂,x₁   11/2  |  1/2    1/2
    B₃,x₅    7/2  | −1/2    3/2

(the next pivot pair is (1, 2) with the boxed pivot element [1/2])


The derivation above shows that the algorithm works correctly if it stops, but it is not yet clear whether it stops.

Theorem 7.5 If in Alg 7.4 the step size γ is always strictly positive, the
primal simplex algorithm stops in finitely many steps.

Proof: In each step with γ > 0 the objective value improves,

    c^⊤x⁺ = c_B^⊤A_B^{-1}b + (c_N − A_N^⊤A_B^{-⊤}c_B)^⊤x⁺_N = c_B^⊤A_B^{-1}b + z̄ȷ̂ · γ < c_B^⊤A_B^{-1}b = c^⊤x,

because z̄ȷ̂ < 0 and γ > 0.

Therefore no basic solution appears more than once. Each basic solution corresponds to a different choice of m indices of {1, …, n}. Thus, the algorithm ends after at most (n choose m) iterations.   □
The step size

    γ = min{ x̄_{Bᵢ}/wᵢ : wᵢ > 0, i ∈ {1, …, m} }

will be zero whenever there is an i ∈ {1, …, m} with wᵢ > 0 so that the corresponding basic variable/slack x̄_{Bᵢ} = 0, i. e., the inequality Bᵢ is satisfied with equality and is also active at this vertex/basic solution. [sketch: a degenerate vertex with N = (2, 3), B = (1, 4, 5), and x₄ = 0]

Definition 7.6
• A basis B is degenerate if xBi = 0 for some i ∈ {1, . . . , m}.
• A linear program is degenerate if it has a degenerate basis (not nec.
feasible).

Geometrically a basis is degenerate by definition, if the point of this basic


solution lies on more than n of the hyperplanes induced by the equality
and inequality constraints. For randomly chosen data this happens with
probability zero. In practice, however, this occurs rather frequently.
In the simplex algorithm setting the nonbasic variables in N to zero deter-
mines the point. In the pricing step one of the inequalities is taken out of
N and increasing its slack induces a one dimensional half ray along which
the solution improves strictly. If this half ray encounters the next limiting
inequality in B at step length zero, the corresponding next nonbasic variable
gives rise to the same point.
Such a zero step may be followed by further zero steps and, depending on the selection rules in pricing and ratio test, it may even lead to revisiting a previous basis; in this case the simplex algorithm is cycling and will not stop. For most practical selection rules in pricing and the ratio test, examples of cycling have been constructed, and it has also been observed in practice.
Bland proved that cycling cannot occur when using the following rather simple pivot selection rules (Bland's rules):
• in pricing, choose among the variables with negative reduced costs the one having smallest index,
• in the ratio test, choose among the selectable basic variables of value zero the one with smallest index.
Unfortunately there does not seem to be a proof that provides good intuition
on why this is the case, but as will become clear, the result is fundamental
for establishing an algorithmic proof of strong duality for linear programming
that is almost combinatorial in nature.
Theorem 7.7 If the pivot selection rules of Bland are used in Alg 7.4, the
primal simplex algorithm always terminates.

Proof: Assume, for contradiction, Alg 7.4 cycles in spite of using Bland's rules. Let B¹, …, Bᵏ be the cyclic sequence of bases and let

    I = { i ∈ {1, …, n} : ∃ k₁, k₂ ∈ {1, …, k} with i ∈ B^{k₁}, i ∉ B^{k₂} }

be the indices of variables moving in and out of the basis. Let t = max I. Let, w. l. o. g., t ∈ B¹ =: B̂, t ∉ B² be the leaving variable and let s ∈ B² \ B¹ (with s ∈ N¹ =: N̂) be the entering variable replacing t. Let h ≥ 2 be the first index with t ∉ Bʰ =: B̄, t ∈ B^{h+1}, i. e., t ∈ Nʰ =: N̄ is the entering variable and must have negative reduced cost at h,

    c̄_t = c_t − (A_{•,t})^⊤A_{B̄}^{-⊤}c_{B̄} < 0.
By Bland’s rule there holds c̄i ≥ 0 for all i ∈ (I ∩ N̄ ) \ {t}. It will be
convenient to set c̄i := 0 for i ∈ B̄. The objective value is
n
X
−1
β+ c̄j xj where β = c⊤
B̄ AB̄ b. (7.1)
j=1

For basis B̂ = B¹ there holds s ∈ N̂ with reduced cost ĉ_s = c_s − (A_{•,s})^⊤A_{B̂}^{-⊤}c_{B̂} < 0. Because β does not change, the objective in dependence on x_s is β + ĉ_s x_s. Equating this, with x_{B̂} = A_{B̂}^{-1}b − A_{B̂}^{-1}A_{•,s}x_s =: b̂ − â x_s, to (7.1) yields for all x_s ≥ 0

    β + ĉ_s x_s = β + Σ_{i∈B̂} c̄ᵢ(b̂ᵢ − âᵢx_s) + c̄_s x_s.

Thus,   ∀ x_s ≥ 0:   (ĉ_s − c̄_s + Σ_{i∈B̂} c̄ᵢâᵢ) x_s = Σ_{i∈B̂} c̄ᵢb̂ᵢ = constant

    ⇒   ĉ_s − c̄_s + Σ_{i∈B̂} c̄ᵢâᵢ = 0   ⇒   Σ_{i∈B̂} c̄ᵢâᵢ = c̄_s − ĉ_s > 0   ⇒   ∃ j ∈ B̂ : c̄ⱼâⱼ > 0,

since ĉ_s < 0 and c̄_s ≥ 0.

In particular, c̄ⱼ ≠ 0, therefore j ∉ B̄ and j ∈ B̂, i. e., j moves in and out of the basis, j ∈ I and xⱼ = 0. Note that â_t > 0 because t was selected in the ratio test for s, and c̄_t < 0, so j ≠ t. Hence, j < t and c̄ⱼ ≥ 0 (otherwise j would have been selected instead of t as entering variable at basis B̄ = Bʰ), giving âⱼ > 0. But by xⱼ = 0, j ∈ B̂, j < t and âⱼ > 0, Bland's rule would then have selected j instead of t as leaving variable at basis B̂ = B¹.   □

Corollary 7.8 When starting from a feasible basis and using Bland’s rules,
in finitely many steps Alg 7.4 either finds an optimal solution or certifies
unboundedness.

Proof: Lem 7.2, Lem 7.3 and Th 7.7. □


There are further variants that avoid cycling,
• choose the leaving variable randomly
• apply a small random perturbation to the right hand side,
• perturb the right hand side symbolically with perturbations 0 < ε1 ≪
ε2 ≪ · · · ≪ εm that cannot cancel each other.
These techniques are employed only if potential cycling is observed, e. g., if
there is no change in the objective for one hundred iterations.
Without indications of cycling other rules are preferred that aim at improving
the objective faster.
• the most negative reduced cost rule (pick i ∈ N minimizing z̄i ) is
considered the canonical original rule, but this rule heavily depends on
the scaling of the variables.
• the steepest edge rule tries to reduce the influence of scaling by choosing j ∈ N so that for the resulting direction ∆x the value c^⊤∆x/∥∆x∥ is minimized. It is currently considered the most efficient rule but requires intricate additional considerations in order to obtain an efficient implementation with little overhead.
For most practical instances it is not easy to predict whether simplex methods
or interior point methods will be faster. In the presence of strong degeneracy
interior point methods are typically preferable. In theory only interior point
methods are known to have a running time bounded by a polynomial in the
encoding length. In contrast, for most pivot rules it has been shown explicitly
that the simplex method will require exponential running time for worst case
examples. The first such example goes back to Klee and Minty 1972 and is based on perturbing the facets of a cube in n-space so that all 2ⁿ vertices are visited (examples of this kind are called Klee-Minty cubes). There is still a significant body of research going into the search for better pivot rules, into the diameter of polytopes (following the refuted Hirsch conjecture, the hope is that the length of a shortest path between any two vertices along edges of the polytope cannot be more than linear in the dimension), as well as better randomized and smoothed analysis approaches.

7.2 Feasibility and the Fundamental Theorem of


Linear Optimization

The simplex algorithm requires a feasible basis to start with. If no feasible


basis is known for (P), there are two main approaches.

7.2.1 The Two-Phase Method

In phase 1 the following artificial problem is solved, where w. l. o. g. b ≥ 0,

min 1⊤ s
s. t. Ax + s = b,
x ≥ 0, s ≥ 0.

For this problem the basis consisting of s1 , . . . , sm is feasible.


If the simplex algorithm finds an optimal solution with si = 0 for all i, the
corresponding optimal basis is feasible for the original problem. Phase 2
then solves the original problem starting from this feasible basis.

Remark
• As soon as any auxiliary variable si becomes nonbasic, it can be removed
from the problem.
• If the optimal solution of phase 1 still contains some si variables
(degenerate cases), they can be moved out of the basis by pivoting steps
with suitable columns.

If Phase 1 stops with an optimal solution having positive objective value,


the original problem cannot have a feasible solution by Cor 7.8.
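A sketch of phase 1 built on the revised_primal_simplex sketch of Section 7.1 (names ours; the degenerate case where artificial variables of value zero remain basic is only flagged, not handled):

```python
import numpy as np

def phase1(A, b, tol=1e-7):
    # min 1^T s  s.t.  Ax + s = b, x >= 0, s >= 0, with rows flipped so that b >= 0;
    # the slack basis s_1, ..., s_m is feasible for this artificial problem.
    m, n = A.shape
    sign = np.where(b < 0, -1.0, 1.0)
    A1 = np.hstack([sign[:, None] * A, np.eye(m)])
    c1 = np.concatenate([np.zeros(n), np.ones(m)])
    x, y, B = revised_primal_simplex(A1, sign * b, c1, list(range(n, n + m)))
    if c1 @ x > tol:
        raise ValueError("the original problem is infeasible")
    if any(j >= n for j in B):
        # degenerate: artificial variables at value zero remain basic and would
        # have to be pivoted out with suitable columns (see the remark above)
        raise NotImplementedError("pivot remaining artificials out of the basis")
    return B   # feasible basis for the original problem
```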

7.2.2 The Big-M Method

In order to move towards a better basis from the start, this approach combines
the search for a feasible solution with improving the objective by choosing a
cost factor M > 0 large enough and solving (w. l. o. g. b ≥ 0)

min c⊤ x + M 1⊤ s min c⊤ x + M σ
s. t. Ax + s = b, or s. t. Ax + b̄σ = b, [b̄ = b − Ax̄]
x ≥ 0, s ≥ 0, x ≥ 0, σ ≥ 0, [for some starting x̄ ≥ 0]

Advantages:
• The simplex algorithm searches for a good basic solution from the
start.

• When all auxiliary variables have become nonbasic the algorithm just
continues with the original problem.
Disadvantages:
• It is not clear how large M has to be chosen.
• A huge value in M usually causes numerical difficulties.
Commercial software packages also employ so called “crash methods”, which
refer to heuristic approaches for finding good starting bases.

Theorem 7.9 (Fundamental Theorem of Linear Programming)


If a linear program in normal form has an optimal solution, it is also attained
in a feasible basic solution/ a vertex. A linear program without optimal
solutions is either unbounded or infeasible.

Proof: Two-phase method and Cor 7.8. □


The theorem could alternatively be proven via Th 5.10 and Obs 3.36.
Note, the theorem heavily relies on x ≥ 0. Indeed, if free variables are
admitted as well, the feasible set need not have basic solutions/vertices,
consider e. g.
min x1 + x2
s. t. x1 + x2 = γ,
x ∈ R2 .

7.3 The Simplex Method and Duality

Consider the primal and dual feasible sets in normal form,

X = {x ≥ 0 : Ax = b} and Z = {(y, z) ∈ Rm × Rn+ : A⊤ y + z = c}

and recall

    weak duality:   p* := inf_{x∈X} c^⊤x  ≥  sup_{(y,z)∈Z} b^⊤y =: d*,

which may be proved directly for feasible x ∈ X and (y, z) ∈ Z by

    c^⊤x − b^⊤y = (A^⊤y + z)^⊤x − (Ax)^⊤y = x^⊤z ≥ 0    (since x ≥ 0 and z ≥ 0).

The simplex algorithm supplies an optimality certificate via its reduced cost
vector, so dual information should be available there. Indeed, if p∗ is finite
the simplex algorithm Alg 7.4 delivers a dual optimal solution directly via
1. BTRAN: Compute ȳ = A_B^{-⊤}c_B [by solving A_B^⊤ȳ = c_B].
2. Pricing: Compute z̄_N = c_N − A_N^⊤ȳ. If z̄_N ≥ 0, x̄ is optimal, STOP, …

If the algorithm stops with optimal x̄, then

    [ A_B^⊤ ; A_N^⊤ ] ȳ + ( 0 ; z̄_N ) = ( c_B ; c_N ),    so (ȳ, z̄) with z̄ := (0, z̄_N) ≥ 0 is dual feasible,

and the value of this dual solution is

    b^⊤ȳ = b^⊤A_B^{-⊤}c_B = x̄_B^⊤c_B = c^⊤x̄,

equal to that of the primal optimal solution. Thus, by weak duality, (ȳ, z̄) is a dual optimal solution. The simplex method therefore provides an alternative proof of the Strong Duality Theorem for Linear Programming, Th 5.10.

Theorem (Strong Duality for Linear Programming)
If one of (P) or (D) has an optimal solution, then so does the other, and the optimal objective values coincide.

Proof: Th 7.9 and considerations above. □


It may happen, however, that (P) and (D) are both infeasible, consider e. g.
the following primal dual pair in canonical form

max x1 min −y1


s. t. x1 − x2 ≤ −1, s. t. y1 − y2 ≥ 1,
−x1 + x2 ≤ 0, −y1 + y2 ≥ 0,
x1 ≥ 0, x2 ≥ 0, y1 ≥ 0, y2 ≥ 0.

Indeed, x2 ≤ x1 ≤ x2 − 1 is impossible as well as 1 + y2 ≤ y1 ≤ y2 .


Any pair of primal and dual optimal solutions needs to satisfy the complemen-
tarity relation x⊤ z = 0 and the Simplex solution satisfies it by construction,
because x̄B ≥ 0, x̄N = 0 and at optimality z̄N ≥ 0, z̄B = 0.

Corollary 7.10 (Complementary Slackness Theorem) Let x̄ and (ȳ, z̄)


be primal and dual feasible solutions of (P) and (D). They form a primal-dual
optimal pair if and only if x̄⊤ z̄ = 0.

Proof: ⇒: Strong duality theorem Th 5.10 and weak duality.


⇐: weak duality. □
Note,
• x̄⊤ z̄ = 0 holds for arbitrary pairings of primal and dual optimal
solutions,
• by x̄ ≥ 0 and z̄ ≥ 0 there holds x̄⊤ z̄ = 0 ⇔ ∀i x̄i z̄i = 0,

• the name complementary slackness has its origins in the interpretation of the variables as slacks in the canonical form,

    max c^⊤x   s. t.  Ax ≤ b, x ≥ 0          min b^⊤y   s. t.  A^⊤y ≥ c, y ≥ 0
        ⇕                                         ⇕
    min (−c)^⊤x                               max b^⊤(−y)
    s. t.  [A I](x, s) = b,                   s. t.  [A^⊤ ; I](−y) + (z^A, z^I) = (−c, 0),
           (x, s) ≥ 0                                y ∈ R^m,  (z^A, z^I) ≥ 0.

Thus,

    x̄, ȳ optimal   ⇔   (x̄, s̄ = b − Ax̄) and (ȳ, z̄^A = A^⊤ȳ − c, z̄^I = ȳ) optimal

    ⇔ (Cor 7.10)    1.  0 = x̄ᵢ · z̄ᵢ^A = x̄ᵢ · (A^⊤ȳ − c)ᵢ,
                    2.  0 = s̄ᵢ · z̄ᵢ^I = (b − Ax̄)ᵢ · ȳᵢ,
                    3.  and x̄, ȳ are primal-dual feasible.

With this the complementary slackness theorem has the following


important interpretation and consequence:
If the primal slack of the j-th inequality is positive in some optimal
solution x̄ (the inequality is inactive), then in every dual optimal
solution yj must be zero.
Conversely, if yj is the dual variable to inequality j and yj ̸= 0
in some dual optimal solution, then the primal inequality must be
satisfied with equality in every primal optimal solution.
The same interpretations hold for x as dual variables to the dual
inequality constraints.
For a geometric interpretation, the canonical form is again more suitable. [sketch: optimal vertex x̄ with active constraint normals a₁, a₂, inactive a₃, and the objective c contained in the cone spanned by a₁ and a₂] In the depicted situation the dual optimal solution may only combine inequalities 1 and 2 (y₃ = 0),

    a₁y₁ + a₂y₂ + a₃y₃ ≥ c,    b₁y₁ + b₂y₂ = c^⊤x̄,    with b₁ = x̄^⊤a₁ and b₂ = x̄^⊤a₂.

Note, ȳ solves the system arising from the complementarity relations x̄ᵢ ≠ 0 ⇒ [a₁y₁ + a₂y₂]ᵢ = cᵢ.

If the dual optimum is not unique, the primal must be degenerate. Likewise, if there are several primal optima, the dual must be degenerate. [sketch: a degenerate optimal vertex x̄ with constraint normals a₁, a₂, a₃, a₄ and objective c]

The Dual Simplex Method

In order to derive the dual simplex method for the dual in normal form
max b⊤ y
s. t. A⊤ y + z = c,
y ∈ Rm , z ≥ 0,
consider again the splitting into basic and nonbasic parts

    [ A_B^⊤ ; A_N^⊤ ] y + ( z_B ; z_N ) = ( c_B ; c_N ).

Motivated by complementary slackness, put z_B = 0 and compute the others in dependence of z_B,

    y = A_B^{-⊤}c_B − A_B^{-⊤}z_B,
    z_N = c_N − A_N^⊤y = c_N − A_N^⊤A_B^{-⊤}c_B + A_N^⊤A_B^{-⊤}z_B ≥ 0    for dual feasibility.

The dual objective may be expressed in dependence of z_B by

    b^⊤y = b^⊤A_B^{-⊤}c_B − b^⊤A_B^{-⊤}z_B.

Increasing z_{Bı̂} helps in maximizing if 0 > [A_B^{-1}b]ı̂ = x_{Bı̂}, thus if the primal constraint Bı̂ is infeasible. If such an ı̂ exists, z_{Bı̂} is increased as long as z_N remains feasible; this value is

    γ = sup{ γ ≥ 0 : z_N = c_N − A_N^⊤A_B^{-⊤}c_B + γ A_N^⊤A_B^{-⊤}eı̂ ≥ 0 }.

The analogous considerations on optimality and unboundedness lead to the


following algorithm.
Algorithm 7.11 (Dual Simplex Algorithm)
Input: A, b, c, a dual feasible basis B, z̄_N = c_N − A_N^⊤A_B^{-⊤}c_B ≥ 0.
1. FTRAN: Solve AB x̄B = b.
2. Pricing: If x̄B ≥ 0, B is optimal, STOP,
else choose ı̂ ∈ {1, . . . , m} with xBı̂ < 0 [zBı̂ is the leaving variable].
3. BTRAN: Solve A_B^⊤w = eı̂ and compute u = −A_N^⊤w.

4. Ratio Test: If u ≤ 0 the primal (P) is infeasible, STOP,
   else let γ = min{ z̄ₖ/uₖ : uₖ > 0, k ∈ N } be attained for ȷ̂ ∈ N [zȷ̂ is the entering variable].
5. update z̄N ← z̄N − γu, z̄Bı̂ ← γ, N ← (N \ {ȷ̂}) ∪ {Bı̂ }, Bı̂ ← ȷ̂, GOTO 1.
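In the same dense style as the primal sketch of Section 7.1, Alg 7.11 may be written as follows (names ours):

```python
import numpy as np

def dual_simplex(A, b, c, B, tol=1e-9):
    # Sketch of Alg 7.11; B must be dual feasible (z_N >= 0), e.g. an old optimal basis.
    m, n = A.shape
    B = list(B)
    while True:
        AB = A[:, B]
        xB = np.linalg.solve(AB, b)                    # 1. FTRAN
        if (xB >= -tol).all():
            x = np.zeros(n); x[B] = xB
            return x, B                                # 2. primal feasible: optimal
        ihat = int(np.argmin(xB))                      # some x_Bi < 0 determines ihat
        w = np.linalg.solve(AB.T, np.eye(m)[ihat])     # 3. BTRAN
        y = np.linalg.solve(AB.T, c[B])
        N = sorted(set(range(n)) - set(B))
        u = {k: -(A[:, k] @ w) for k in N}
        eligible = [k for k in N if u[k] > tol]
        if not eligible:
            raise ValueError("the primal problem is infeasible")   # 4. ratio test
        zN = {k: c[k] - A[:, k] @ y for k in N}
        jhat = min(eligible, key=lambda k: zN[k] / u[k])
        B[ihat] = jhat                                 # 5. basis update
```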

7.4 Sensitivity

A typical question that arises in practice for a computed optimal solution is

How strongly does it depend on changes of the costs or the right hand side?

This might be asked because


• one plans to change the prices,
• not all the data is known precisely,
• one would like to know which of the constraints are how relevant.
Consider the primal-dual pair of problems in normal form,
min c⊤ x max b⊤ y
s. t. Ax = b, s. t. A⊤ y + z = c,
x ≥ 0, y ∈ Rm , z ≥ 0.
Assume an optimal basis B has been determined with optimal solutions

    x*_B = A_B^{-1}b,  x*_N = 0,      optimal value  p* = c_B^⊤x*_B + c_N^⊤x*_N,
    y* = A_B^{-⊤}c_B,                 d* = b^⊤y*,
    z*_N = c_N − A_N^⊤y*.

In nondegenerate situations (depicted here for the canonical primal case) small changes in c or b will not affect the optimal basis. In this case only the values change, and they do so in a perfectly predictable way. [sketch: optimal vertex x* with an objective perturbation ∆c and right hand side perturbations ∆b₁, ∆b₂, ∆b₃]

Changes in the objective coefficients c

Note, changes in c do not affect primal feasibility but affect the dual feasible set. The current basis B stays optimal as long as the reduced costs z_N(c) remain nonnegative.
For c + t∆c with t ∈ R put

    z_N(t) = c_N + t∆c_N − A_N^⊤A_B^{-⊤}(c_B + t∆c_B) = z*_N + t(∆c_N − A_N^⊤A_B^{-⊤}∆c_B) =: z*_N + t∆z_N,

then

    z_N(t) ≥ 0   ⇔   sup_{i : [∆z_N]ᵢ > 0} (−[z*_N]ᵢ/[∆z_N]ᵢ)  ≤  t  ≤  inf_{i : [∆z_N]ᵢ < 0} (−[z*_N]ᵢ/[∆z_N]ᵢ).
For this range of t-values the new optimal value is simply (c + t∆c)⊤ x∗ .

Changes in the right hand side coefficients b

These preserve dual feasibility but affect the primal feasible region. The current basis stays optimal as long as x_B(t) remains nonnegative. For b + t∆b with t ∈ R put

    x_B(t) = x*_B + t A_B^{-1}∆b =: x*_B + t∆x_B,

then

    x_B(t) ≥ 0   ⇔   sup_{i : [∆x_B]ᵢ > 0} (−[x*_B]ᵢ/[∆x_B]ᵢ)  ≤  t  ≤  inf_{i : [∆x_B]ᵢ < 0} (−[x*_B]ᵢ/[∆x_B]ᵢ).

For this range of t-values the new optimal value is (b + t∆b)⊤ y ∗ .


Note, even for t exceeding this range with a different optimal basis, the
value (b + t∆b)⊤ y ∗ is always a valid lower bound on the new primal optimal
objective value [why?].
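Both ranges are cheap to compute from an optimal basis. A sketch for the cost case (assuming NumPy; names ours; the right hand side case is completely analogous with x_B(t)):

```python
import numpy as np

def cost_range(A, c, B, dc, tol=1e-12):
    # Interval of t for which basis B stays optimal under costs c + t*dc.
    N = sorted(set(range(A.shape[1])) - set(B))
    y = np.linalg.solve(A[:, B].T, c[B])
    zN = c[N] - A[:, N].T @ y                     # current reduced costs z_N^* >= 0
    dy = np.linalg.solve(A[:, B].T, dc[B])
    dz = dc[N] - A[:, N].T @ dy                   # Delta z_N per unit of t
    lo = max((-zN[i] / dz[i] for i in range(len(N)) if dz[i] > tol), default=-np.inf)
    hi = min((-zN[i] / dz[i] for i in range(len(N)) if dz[i] < -tol), default=np.inf)
    return lo, hi

# Mozart problem with optimal basis B = (2, 1, 5) (0-based: 1, 0, 4):
A = np.array([[1., 1, 1, 0, 0], [2, 1, 0, 1, 0], [1, 2, 0, 0, 1]])
c = np.array([-9., -8, 0, 0, 0])
print(cost_range(A, c, [1, 0, 4], np.eye(5)[0]))  # range for moving c_1 alone: (-7.0, 1.0)
```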
Within the canonical setting each optimal dual variable yᵢ* ≥ 0 is the Lagrange multiplier to a constraint A_{i,•}x ≤ bᵢ. Consider now the effect of
changing bı̄ for one fixed inequality index ı̄. The same considerations show
that in the nondegenerate case yı̄∗ gives the marginal change of the objective
per unit change in bı̄ (typically this is the exact value only for changes in a
small range as computed above). In applications in economics the constraints
Ai,• x∗ ≤ bi often correspond to resource constraints, where the right hand
side bı̄ indicates – like in the Mozart problem – the available amount of
resource ı̄. In this setting yı̄ are called shadow prices as they give the per
unit price up to which it is worth to buy additional resources in order to
improve the objective (if the objective corresponds to revenue). If the market
is at equilibrium, the market price of resource ı̄ will be exactly this value
yı̄ (otherwise revenue could be increased). In any case, a yı̄ ̸= 0 indicates,
that tightening this constraint will certainly influence the objective; the
corresponding inequalities and their dual variables are then called strongly
active.
If some yı̄ corresponds to an inequality constraint and yı̄∗ = 0 (inequality
ı̄ is inactive if the slack of the inequality is positive, otherwise it is weakly
active), the inequality can be dropped without changing the optimal value,
because
• the current primal solution stays feasible and its objective value does
not change,
• the current dual solution stays feasible and its objective value does not
change.

Note, however, that the set of optimal solutions might get bigger in degenerate
cases (if inequality ı̄ is weakly active).
In the same way one may drop variables x∗i = 0 or zi∗ = 0 without changing
the objective value.
For most large scale linear programs one tries to avoid including all potentially inactive inequalities and variables. They are added on demand and dropped
again if they no longer seem of importance. Such techniques are sketched
next.

7.5 Column Generation and Cutting Plane Methods

Column Generation

Sometimes it may be impossible to even store the system Ax = b because


the number of variables n is too large. This is typically the case if most of
the columns correspond to choices arising from some algorithmic process, so
for each of these variables xj there is an algorithmic description on how to
arrive at its column A•,j .
As long as xj is nonbasic (j ∈ N with xj = 0) the column A•,j is not needed
explicitly. The column is needed, however, once xj is selected as entering
variable in the pricing step,
2. Pricing: Compute z̄N = cN − A⊤
N ȳ and choose ȷ̂ ∈ N with z̄ȷ̂ < 0 . . .

For pricing it is sufficient if min_{j∈N} (cⱼ − A_{•,j}^⊤ȳ) can be determined algorithmically, but this does not necessarily require the explicit availability of the matrix A_N.
Example The following cutting stock problem from steel industry is close
to the original wall paper application, for which the column generation
technique was developed in the pioneering work of Gilmore and Gomory.
The task is to cut coils of sheet metal of
given widths bi , i = 1, . . . , m for total
lengths ℓi out of standardized mother
coils of width b̄ so that trim loss is
minimized.
[sketch: mother coil with cutting patterns]

A cutting pattern is a combination s ∈ N₀^m of the widths bᵢ with multiplicities sᵢ summing up to a total width of at most b̄. All cutting patterns are collected in the set

    S = { s ∈ N₀^m : Σᵢ₌₁^m sᵢbᵢ ≤ b̄ }.

Let variable x_s, s ∈ S, specify the length for which the cutting pattern s is to be employed; then the problem to be solved may be formulated as follows,

    min  Σ_{s∈S} x_s                               [minimize total length]
    s. t.  Σ_{s∈S} sᵢ x_s ≥ ℓᵢ,  i = 1, …, m,      [satisfy demand of width i]
           x_s ≥ 0,  s ∈ S.                        [|S| is typically huge!]

The column corresponding to s ∈ S is A_{•,s} = s and has objective coefficient c_s = 1. For the current dual variable values ȳ the pricing step needs to solve

    min_{s∈S} (1 − s^⊤ȳ)    ⇔    max ȳ^⊤s    s. t.  Σᵢ₌₁^m sᵢbᵢ ≤ b̄,  sᵢ ∈ N₀, i = 1, …, m.

The latter problem is a so called knapsack problem. Because of its integrality


constraints on s it is not an LP but an integer program (IP). For convenience
we may assume w. l. o. g. ȳ ≥ 0, because widths with ȳi < 0 will always have
si = 0. In practice the bi may be considered integer multiples of some minimal
unit width (typically 0.1 mm) and the width of a mother coil is approximately
1 m, so we may assume bi ∈ N and b̄ ∈ N with bi ≤ b̄ ≈ 10000. In this
range the problem can be solved reasonably well by dynamic programming
techniques, which build recursively on optimal solutions for smaller sizes.
Here the solution successively allows the use of the next width i = 1, …, m once the optimal solution has been determined for items 1, …, i − 1 and successive total widths 1, …, b̄. To formalize this let

    opt(b, k) := sup{ Σᵢ₌₁ᵏ ȳᵢsᵢ : Σᵢ₌₁ᵏ sᵢbᵢ ≤ b, sᵢ ∈ N₀ }    for b = −b̄, …, b̄ and k = 0, …, m.

It denotes the optimal value that can be achieved for total width b and items
i = 1, . . . , k. For convenience the definition ensures opt(b, k) = −∞ for b < 0
and opt(b, k) = 0 for b = 0 or (b ≥ 0 ∧ k = 0). With this the following
recursion holds for b = 1, . . . , b̄ and k = 1, . . . , m

opt(b, k) = max{opt(b, k − 1), opt(b − bk , k) + ȳk },

because the best solution of width b may either only use widths 1, . . . , k − 1
and no width k or it contains at least one copy of width k for a benefit of
ȳk on top of the best choice for the remaining total width b − bk filled with
widths 1, . . . , k.

Algorithm 7.12 (Knapsack by Dynamic Programming)


Input: b̄ ∈ N, b ∈ {1, …, b̄}^m, ȳ ∈ R^m₊
Output: opt ∈ R^{0,…,b̄}, sol ∈ {0, …, m}^{0,…,b̄}
Initialize opt ≡ 0, sol ≡ 0;

for k = 1, . . . , m do
for j = bk , . . . , b̄ do
if opt[j − bk ] + ȳk > opt[j] then
opt[j] ← opt[j − bk ] + ȳk , sol[j] ← k;
end if
end for
end for
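A direct Python transcription (function names ours), together with the reconstruction of the optimal pattern from sol; run on the data of the example below it returns opt[10] = 8 and s = (0, 2, 0):

```python
def knapsack_dp(bbar, widths, y):
    # Alg 7.12: opt[j] = best value of a pattern of total width at most j.
    opt = [0.0] * (bbar + 1)
    sol = [0] * (bbar + 1)                  # item used to reach opt[j] (0 = none)
    for k, (bk, yk) in enumerate(zip(widths, y), start=1):
        for j in range(bk, bbar + 1):
            if opt[j - bk] + yk > opt[j]:
                opt[j] = opt[j - bk] + yk
                sol[j] = k
    return opt, sol

def reconstruct(sol, widths, bbar):
    # Walk back through sol to recover the multiplicities s_i of the pattern.
    s = [0] * len(widths)
    j = bbar
    while j > 0 and sol[j] > 0:
        k = sol[j]
        s[k - 1] += 1
        j -= widths[k - 1]
    return s

opt, sol = knapsack_dp(10, [3, 5, 6], [2, 4, 5])
print(opt[10], reconstruct(sol, [3, 5, 6], 10))   # 8.0 and pattern s = (0, 2, 0)
```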

It helps to carry out the algorithm for a small example by hand, e. g. for
b1 = 3, b2 = 5, b3 = 6 and b̄ = 10 with ȳ1 = 2, ȳ2 = 4, ȳ3 = 5 which should
result in
      |  init   |  k=1    |  k=2    |  k=3
    b | opt sol | opt sol | opt sol | opt sol
   10 |  0   0  |  6   1  |  8   2  |  8   2
    9 |  0   0  |  6   1  |  6   1  |  7   3
    8 |  0   0  |  4   1  |  6   2  |  6   2
    7 |  0   0  |  4   1  |  4   1  |  5   3
    6 |  0   0  |  4   1  |  4   1  |  5   3
    5 |  0   0  |  2   1  |  4   2  |  4   2
    4 |  0   0  |  2   1  |  2   1  |  2   1
    3 |  0   0  |  2   1  |  2   1  |  2   1
    2 |  0   0  |  0   0  |  0   0  |  0   0
    1 |  0   0  |  0   0  |  0   0  |  0   0
    0 |  0   0  |  0   0  |  0   0  |  0   0
Note that the sol-vector allows to reconstruct the optimal solution, as the
item employed for getting the current optimal value indicates which next
solution to build on (here s1 = 0, s2 = 2, s3 = 0).
Once the pricing problem is solved, the generated cutting pattern is added as
a column together with its variable and the optimal solution is recomputed
for the updated problem.
Algorithm 7.13 (Column Generation Framework; Cutting Stock)
Input: b̄ ∈ N, b ∈ {1, . . . , b̄}m , ℓ ∈ Rm
0. Choose some initial patterns that ensure feasibility, e. g., s(i) = ⌊b̄/bᵢ⌋eᵢ for i = 1, …, m.
1. Solve the LP problem for the current selection of patterns → ȳ.
2. Remove unused patterns.
3. Find new patterns by pricing / column generation using e. g. Alg 7.12;
if there are none with negative reduced cost, STOP,
else add the new one(s) and GOTO 1.
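Combining the revised_primal_simplex and knapsack_dp sketches from above yields a deliberately naive version of this framework for integer widths (no warm starts, no pattern removal; all names ours):

```python
import numpy as np

def cutting_stock(widths, lengths, bbar, max_iter=100):
    # Alg 7.13 built on revised_primal_simplex and knapsack_dp from above.
    m = len(widths)
    patterns = [(bbar // widths[i]) * np.eye(m)[:, i] for i in range(m)]   # s(i)
    for _ in range(max_iter):
        S = np.column_stack(patterns)
        A = np.hstack([S, -np.eye(m)])              # surplus variables for >= constraints
        c = np.concatenate([np.ones(len(patterns)), np.zeros(m)])
        x, y, B = revised_primal_simplex(A, np.asarray(lengths, float), c,
                                         list(range(m)))   # the s(i) form a feasible basis
        opt, sol = knapsack_dp(bbar, widths, np.maximum(y, 0.0))   # pricing
        if opt[bbar] <= 1 + 1e-9:                   # all reduced costs 1 - y^T s >= 0
            break
        patterns.append(np.array(reconstruct(sol, widths, bbar), float))
    return patterns, x[:len(patterns)]
```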

In practice the first iterations typically show fast improvement that quickly
slows down dramatically. This tailing off effect is due to miniature im-
provements that are possible by slightly modifying the cutting patterns

selected from the huge available set. Mostly there is no point in going for
full optimality and the iterations are stopped once progress is slow.
The model used here has some disadvantages in practice.
1. The selected lengths xs are in general no useful multiple of available
mother coil lengths. The typical solution uses some patterns for a very
long time and numerous further ones for lengths that are almost not worth
setting up.
2. The time needed to set up a machine to cut a given pattern is often the
limiting factor. Therefore practical solutions should use as few patterns as
possible and rather provide a bit of excess length or trim loss. Note, that
the number of patterns actually employed in the solution of the simplex
method is at most m [why?], which is typically too large in practice.

Cutting Plane Methods

If the system Ax ≥ b (or Ax = b with slacks) contains too many inequalities


to work with this system, one starts with a first subset, solves the problem
for these and then repeatedly adds a selection of those inequalities that are
still violated by the current optimal solution and resolves.

Algorithm 7.14 (Cutting Plane Framework)


0. Choose some initial subset of inequalities.
1. Solve the LP for the current selection of inequalities → x̄.
2. Remove inequalities that do not seem important ( e. g. if slack exceeds
some value)
3. Solve the separation problem: Find (or try to find) valid inequalities of
the feasible set, that separate the current optimal x̄ from the feasible set.
If there are none, STOP,
else add a selection of those identified and GOTO 1.

In the separation problem, “try to find” alludes to the fact that for many
practical problems – in particular in the context of integer programming – it
is too time consuming to indeed explore or enumerate all relevant inequalities.
In these cases one aims at developing algorithmic approaches for finding
violated inequalities that either work exactly for a well defined subclass of
inequalities or that employ heuristic methods for hopefully identifying the
most relevant ones.
Note, adding violated cutting planes amounts to adding dual variables for
which the dual pricing step indicates relevance. Indeed, the cutting plane
approach is exactly the same as column generation for the dual problem.

Dynamic Problem Modifications and Warm Starting

The simplex algorithm may start from any basic solution that is either primal
feasible (by using the primal simplex algorithm) or dual feasible (by using
the dual simplex algorithm). After solving an initial variant of the problem
to optimality, the current basis B is primal and dual feasible.
• Column generation adds variables and columns with negative reduced
cost. This keeps the basis primal feasible but destroys its dual feasibility.
In this setting the primal simplex algorithm allows to continue directly
from the previous basis. Furthermore, for minor problem modifications
typically most of the decisions on which variables need to be in the
basis remain correct and the next optimal basis is found after relatively
few iterations.
• Cutting plane methods add inequalities violated by the current primal
solution. Therefore the latest basis is no longer primal feasible. On the
dual side, however, this only adds variables with improving reduced
dual costs, so the basis stays dual feasible and one may continue directly
from this basis with the dual simplex algorithm. Again, frequently the
next optimal basis is found within a few dual simplex iterations.
The approach to continue directly from a previously computed solution is
called warm starting and the simplex algorithm offers almost ideal possibilities
to do so.
This is in stark contrast to interior point methods where no good general
warm starting strategies seem to be available so far. Indeed, even rather
slight problem modifications change the shape of the central path and in
particular the location of its terminal point significantly. Therefore interior
point methods have to go back deeply into the interior to get reasonably
close to the central path again.
Interior point methods are often faster in solving initial problems or problems
that do not require dynamic modifications. If modifications are required, a
typical approach is to start by solving the initial problem by interior point
methods. This yields an approximation to an optimal solution, for which
an optimal basis is then constructed in a so called cross over step (this
is not always easy or efficient). Once this optimal basis is computed, the
modifications and resolves are then carried out via the appropriate simplex
algorithm.
