Opte - Optimization
Introduction
In this, (OP) stands for optimization problem or program, min reads “min-
imize” (it describes the task; a minimum value need not exist) and s. t. is
short for “subject to”. There are a number of similar problem description
conventions, like “f (x) → min s. t. . . . ”, but the given one seems most widely
spread. In this course we mainly consider rather friendly finite dimensional
optimization problems.
• Ω ⊆ Rn (sometimes also Cn ); in advanced courses on optimal control
(e. g. optimal acceleration and steering over time) or PDE optimiza-
tion (e.g. optimal shapes or temperatures) the goal is to find optimal
functions, in which case Ω specifies the corresponding function space.
• f, g_i, h_i : R^n → R "sufficiently smooth" (e. g. ∈ C^2(R^n) or convex).
• The index sets E, I will be finite throughout this course; the case of
infinite index sets combined with finite dimensional ground sets is
studied in the field of "semi-infinite optimization".
Definition 1.1 Consider the optimization problem (OP).
• X = {x ∈ Ω : hj (x) = 0, j ∈ E, gi (x) ≤ 0, i ∈ I} is the feasible set
( Menge der zulässigen Punkte / der zulässigen Lösungen oder zulässige
Menge)
• A problem with X = ∅ is called infeasible ( unzulässig).
f(x*) ≤ f(x) ∀x ∈ X.

sketch local and global optima of f over Ω
On the other hand this result is of little help in actually locating a minimum
even in the case of rather simple continuous functions. To see this it suffices
to imagine a function that is constant except for a tiny spike located in some
unknown position like illustrated below.
sketch f constant except for a tiny spike at an unknown position
Example
Definition 1.3
• A function f : R^n → R̄ := R ∪ {∞} is convex if f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) for all x, y ∈ R^n and α ∈ (0, 1).
• It is strictly convex if this inequality is strict for all x ≠ y with f(x), f(y) < ∞ and α ∈ (0, 1).
sketch epi f (with f set to ∞ outside dom f) and the level set S_r(f) at level r
Proof: ⇒: Let (x, r), (y, p) ∈ epi f, then f(x) ≤ r, f(y) ≤ p. For α ∈ (0, 1), convexity of f gives f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) ≤ αr + (1 − α)p, so α(x, r) + (1 − α)(y, p) ∈ epi f.
Proof: Let x, y ∈ C = ⋂_{i∈I} C_i, then x, y ∈ C_i for all i ∈ I, thus [x, y] ⊆ C_i for all i ∈ I, hence [x, y] ⊆ C. □
Proof: Check that epi f = ⋂_{i∈I} epi f_i and use Obs 1.4 and 1.5. □
Proof: Form S_r′ = epi f ∩ {(x, r) : x ∈ R^n}. By Obs 1.4 and 1.5 this S_r′ is cvx (the second set being cvx) and therefore also S_r = {x : (x, r) ∈ S_r′} = S_r(f) (Ex). □
Note, however, that functions with convex level sets are not necessarily
convex (these are called quasiconvex ; draw some examples).
For f , gi cvx, hi affine and Ω cvx the optimization problem
min f (x)
s. t. hi (x) = 0 i ∈ E
gi (x) ≤ 0 i ∈ I
x∈Ω
Because each single set is cvx, Obs 1.5 implies the convexity of X (even in
the case of infinitely many constraints).
An important reason for the attractivity of convex optimization is based on
the property that there is no need to discern local and global optima.
This lecture series will put a special focus on an important special case of
convex optimization where closedness is no problem. Linear Optimization or Linear Programming requires all f, g_i, h_j to be affine, i. e., of the form a^⊤x + β, and Ω = R^n_+ := {x ∈ R^n : x_i ≥ 0, i = 1, …, n}. Instead of x ∈ R^n_+ we usually
write x ≥ 0 which is to be interpreted as componentwise nonnegativity. As
we will see, all linear programs (LP) can be brought into the following normal
form,
min c⊤ x
s. t. Ax = b, LP in normal form
x ≥ 0.
Convex optimization offers a multitude of algorithmic possibilities, not only
for important special cases. Several of these allow the study of algorithmic complexity aspects; some of this will appear in this course. Already the
seemingly very special field of linear optimization allows for a wide spectrum
of applications. Let us start with a first glimpse of these and some related
theoretical questions.
Example The production of (simplified) Mozart-Balls and Mozart-Coins requires respective quantities of marzipan, nougat and dark chocolate (in quantities giving rise to the constraints below). With decision variables

x_1 . . . number of Mozart-Balls
x_2 . . . number of Mozart-Coins

the problem reads

max 5x_1 + 4x_2
s. t. x_1 + x_2 ≤ 6,
2x_1 + x_2 ≤ 11,
x_1 + 2x_2 ≤ 9,
x_1 ≥ 0, x_2 ≥ 0.

sketch feasible region in the (x_1, x_2)-plane with objective direction c
Clearly, the optimal solution is attained in a "vertex".
How can this solution be computed precisely (instead of reading it off)?
Determine the intersection point of the two straight lines
x_1 + x_2 = 6
2x_1 + x_2 = 11
⇒ x_1 = 5, x_2 = 1.
xi ≥ 0 for i = 1, . . . , n ⇔: x ≥ 0, x ∈ Rn+ .
These sign constraints are already linear side constraints. General linear
inequalities are of the form
Σ_{i=1}^n a_i x_i ≤ β ⇔ a^⊤x ≤ β.
sketch hyperplane {x : a^⊤x = β} with normal direction a/∥a∥ and the two halfspaces {x : a^⊤x ≤ β} and {x : a^⊤x ≥ β}
What are the correct choices of a and β for the sign constraints xi ≥ 0?
The linear inequality constraints are collected in a matrix representation (the
indices within ai now refer to the constraint and no longer to components of
a)
a_1^⊤x ≤ b_1
⋮          ⇝ Ax ≤ b with A = [a_1^⊤; …; a_m^⊤], b = (b_1, …, b_m)^⊤.
a_m^⊤x ≤ b_m
max c^⊤x
s. t. Ax + s = b,   with s_i = b_i − a_i^⊤x ≥ 0 the slack variables,
x ≥ 0, s ≥ 0
min c⊤ x
s. t. Ax = b,
x ≥ 0.
Geometrically the feasible set may now be viewed as the intersection of the cone of nonnegative vectors with an affine subspace, e. g. in the case of three variables and one equality constraint,

sketch cone R^3_+ intersected with an affine plane (axes x_1, x_2, x_3)
Before, the optimal solution (x∗1 , x∗2 ) resulted from intersecting the first and
second inequality constraints. How does this translate to the normal form
setting?
(x_1, x_2) lies on constraint 1 ⇒ slack variable x_3 = 0,
lies on constraint 2 ⇒ x_4 = 0.
It remains to solve
[ 1 1 0 ] [x_1]   [ 6 ]
[ 2 1 0 ] [x_2] = [ 11 ].
[ 1 2 1 ] [x_5]   [ 9 ]
In general, in order to compute a vertex, choose n−m slack variables that are
set to zero ⇒ the solution lies on the corresponding equalities. The solution
of the remaining system determines the slack to the remaining constraints.
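A minimal numeric sketch of this vertex computation for the Mozart example (setting x_3 = x_4 = 0 selects the vertex on constraints 1 and 2; numpy assumed):

import numpy as np

# Rows: x1 + x2 = 6, 2x1 + x2 = 11, x1 + 2x2 + x5 = 9 with x3 = x4 = 0.
B = np.array([[1.0, 1.0, 0.0],
              [2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0]])
rhs = np.array([6.0, 11.0, 9.0])
x1, x2, x5 = np.linalg.solve(B, rhs)
print(x1, x2, x5)  # 5.0 1.0 2.0: the vertex (5, 1) and its remaining slack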
a_1^⊤x ≤ b_1   | · y_1 ≥ 0
⋮
a_m^⊤x ≤ b_m   | · y_m ≥ 0
0^⊤x ≤ 1       | · γ ≥ 0

(Σ_{i=1}^m y_i a_i)^⊤ x ≤ Σ_{i=1}^m y_i b_i + γ
Later theory will show that this indeed generates all linear inequalities that
are valid for {x : Ax ≤ b} and all relevant ones do not require the right hand
side shift by γ.
max c^⊤x
s. t. Ax ≤ b
x ≥ 0

sketch constraint normals a_1, a_2, a_3 in the (x_1, x_2)-plane with aggregated inequalities 1a_1 + 1a_3 (y = (1, 0, 1)^⊤) and 1a_2 + 1a_3 (y = (0, 1, 1)^⊤)
For any y ∈ R^m_+, put a^⊤ = y^⊤A, β = y^⊤b, then any x satisfying Ax ≤ b also satisfies a^⊤x ≤ β. Now, if a ≥ c (componentwise as usual), then any x ∈ X satisfies x ≥ 0 and therefore also c^⊤x ≤ a^⊤x ≤ β. In summary,
⋉⋊
The dual program in canonical form asks for the y ≥ 0 that gives rise to the smallest upper bound,

min b^⊤y
s. t. A^⊤y ≥ c
y ≥ 0
Slack variables are introduced again in order to arrive at the primal dual pair of linear programs in normal form
min c⊤ x max b⊤ y
(P ) s. t. Ax = b (D) s. t. A⊤ y + z = c
x≥0 y ∈ Rm , z ≥ 0
⋉⋊
Weak Duality
refers to the fact that the objective value of any dual feasible point provides (in normal form) a lower bound on the objective value of any primal feasible point and vice versa. This holds in convex optimization in general and will be surprisingly easy to prove. To give a first flavor of this for linear programs in normal form, denote their feasible sets by X = {x ≥ 0 : Ax = b} and Z = {(y, z) : A^⊤y + z = c, z ≥ 0}.
□
Note that in this latter proof equality holds throughout whenever z^⊤x = 0.
By the nonnegativity of both vectors this amounts to each coordinate being
zero in at least one of z and x, so for each inequality constraint the slack
variable on the primal or the dual side (or both) must be zero. This will be
called the complementarity property. Whenever a complementary feasible
primal dual pair of solutions is found, this proves optimality of both sides
immediately.
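This can be observed numerically; a minimal sketch on the Mozart example, solving primal and dual with scipy's linprog (whose default variable bounds are x ≥ 0) and checking equal optimal values and complementarity:

import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0]])
b = np.array([6.0, 11.0, 9.0])
c = np.array([5.0, 4.0])

p = linprog(-c, A_ub=A, b_ub=b)        # primal: max c^T x, Ax <= b, x >= 0
d = linprog(b, A_ub=-A.T, b_ub=-c)     # dual:   min b^T y, A^T y >= c, y >= 0
x, y = p.x, d.x
print(c @ x, b @ y)                        # equal optimal values 29.0
print(y * (b - A @ x), x * (A.T @ y - c))  # complementarity: all (near) zero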
An important question is therefore: Under which conditions can we be sure
that primal and dual solutions with the same optimal value exist (strong
duality holds) or is it possible that p* = inf_{x∈X} c^⊤x > sup_{(y,z)∈Z} b^⊤y =: d*, i. e., that a duality gap p* − d* > 0 occurs?
We shall see that for LP there is no duality gap whenever at least one side is
feasible, but this may already fail to hold for linear programs over the second
order or the positive semidefinite cone.
Why is this important?
On the one hand, the optimality certificate and sensitivity information are
at their best only if strong duality holds. But there is also an entire modeling
technique of high practical relevance that relies on strong duality.
Robust Optimization
Chapter 2
Smooth Unconstrained Optimization

f, g_i, h_i ∈ C^1 or ∈ C^2,
f′(x; d) := lim_{t↘0} (f(x + td) − f(x))/t;   f′(x̄; d) = ∇f(x̄)^⊤d.
quadratic/second-order model of f at x̄:
f(x̄) + ∇f(x̄)^⊤(x − x̄) + ½(x − x̄)^⊤∇²f(x̄)(x − x̄).
sketch function
△
Proof: Th 2.1 implies ∇f(x*) = 0 and the second part follows by the same argument: Suppose, for contradiction, ∇²f(x*) ̸⪰ 0, i. e., ∃p ∈ R^n with p^⊤∇²f(x*)p < 0. For all sufficiently small α > 0 there holds p^⊤∇²f(x* + tαp)p < 0 for all t ∈ (0, 1), so by Taylor there is a suitable t ∈ (0, 1) with
Thus,
0 = ∇f(x*)^⊤(x̄ − x*) = lim_{t↘0} (f(x* + t(x̄ − x*)) − f(x*))/t ≤ f(x̄) − f(x*),
therefore f(x*) ≤ f(x̄). □
For differentiable convex functions the necessary conditions are also sufficient.
satisfies
(i) xk converges to x∗ , i. e., xk → x∗ ,
(ii) the rate of convergence is quadratic, i. e.,
(iii) the sequence of the norms of the gradients ∥∇f(x_k)∥ converges quadratically to zero.
This is exactly the description of the Newton step. Convergence will be fast
if the quadratic model is a good approximation, which is true if curvature
changes only slightly. The latter is ensured by the Lipschitz condition on
the second derivative.
An alternative important interpretation of Newton's method is its true origin as an iterative method for solving nonlinear equations, here the stationarity system
∇f(x) = (∂f/∂x_1, …, ∂f/∂x_n)^⊤ = 0.
After rearranging terms and switching signs the norm of the bracketed
expression reads
∃r > 0 [Br (x∗ ) ⊆ U (x∗ ) ∧ (∀x ∈ Br (x∗ ) ∥∇2 f (x)−1 ∥ ≤ 2∥∇2 f∗−1 ∥)].
For any starting point x_0 satisfying ∥x_0 − x*∥ < min{1, r, 1/(L∥∇²f_*^{-1}∥)} this proves x_k → x* with quadratic convergence, establishing (i) and (ii).
The proof of (iii) relies on the same estimate of the linearization error,
By using the same r as above, for all x_k ∈ B_r(x*) there holds ½L∥∇²f_k^{-1}∥² ≤ 2L∥∇²f_*^{-1}∥². The latter gives the constant factor proving quadratic convergence in the limit. □
Note that the proof does not rely on any finiteness properties of the vector
space, so the same proof also works for Banach spaces.
Let us emphasize two important properties of Newton's method:
• convergence is only guaranteed locally; within the region of quadratic convergence, the number of correct digits doubles in every iteration.
• as it searches for some stationary point, the step direction p = x_{k+1} − x_k may not even lead downwards but may well go for a local maximum instead.
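A minimal sketch of the pure Newton iteration on a hypothetical strictly convex test function; the printed gradient norms illustrate the doubling of correct digits within the region of quadratic convergence:

import numpy as np

def grad(x):                 # f(x) = x1^2 + exp(x2) - x2, minimum at (0, 0)
    return np.array([2.0 * x[0], np.exp(x[1]) - 1.0])

def hess(x):
    return np.diag([2.0, np.exp(x[1])])

x = np.array([0.5, 0.5])
for k in range(6):
    x = x - np.linalg.solve(hess(x), grad(x))  # pure Newton step
    print(k, np.linalg.norm(grad(x)))          # roughly squares per iteration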
In order to obtain a minimization algorithm, Newton's method has to be combined with some method that ensures some global form of descent. The two
main approaches for this are line search methods and trust region methods.
In this course we only deal with the most important aspects of the former.
The negative gradient −∇f (x) is always a descent direction if not zero and
it is called the “steepest descent direction”.
It is important, however, to understand that the gradient heavily depends
on the underlying scalar product. Indeed, the gradient represents the lin-
ear map of the derivative with respect to the canonical scalar product
⟨x, ∇f(x̄)⟩ = ∇f(x̄)^⊤x. A different scalar product results in a different coordi-
nate representation, so geometrically the steepest descent direction depends
on the choice of the scalar product. It is worthwhile to better understand the influence of the scalar product and its associated norm on the geometry of the linear and second-order model.
Consider a general inner product and its norm induced by a positive definite H ≻ 0,
⟨x, y⟩_H := y^⊤Hx,   ∥x∥_H := √⟨x, x⟩_H = √(x^⊤Hx).
For H = I we obtain the canonical inner product with linear [and second] order model
f(x̄) + ⟨x − x̄, ∇f(x̄)⟩ + ½⟨x − x̄, ∇²f(x̄)(x − x̄)⟩.
In order to obtain the same linear model with respect to a general inner
product ⟨·, ·⟩H for some H ≻ 0, the gradient has to be transformed to
H −1 ∇f (x̄), because
The rate at which the quality of the linear model deteriorates in a normalized direction x(α) = x̄ + α·p/∥p∥_H is given by
f(x̄ + α·p/∥p∥_H) − [f(x̄) + ⟨p/∥p∥_H, α∇f(x̄)⟩] ≈ (α²/2)·⟨p/∥p∥_H, H^{-1}∇²f(x̄)·p/∥p∥_H⟩_H.
For H = I this results in the usual curvature (p/∥p∥)^⊤∇²f(x̄)(p/∥p∥), which is most transparent in the eigenvalue decomposition of the Hessian (symmetric by smoothness) ∇²f(x̄) = PΛP^⊤ with diagonal matrix Λ = Diag(λ_1, …, λ_n) holding the eigenvalues λ_1 ≥ ⋯ ≥ λ_n and orthogonal P ∈ R^{n×n} with its columns v_i = P_{•,i} holding an orthogonal basis of the corresponding eigenvectors. For steps in an eigenvector direction v_i the error behaves like
f(x̄ + α·v_i/∥v_i∥) − [f(x̄) + α∇f(x̄)^⊤v_i/∥v_i∥] ≈ (α²/2)·(v_i/∥v_i∥)^⊤∇²f(x̄)(v_i/∥v_i∥) = (α²/2)·λ_i.
For smooth f, however, this is easily satisfied for an arbitrarily small step. There is no point in stopping if the current decrease still predicts good progress. It will be safe to stop if the current decrease has flattened out or worsened sufficiently, as required in the
Note the requirement c1 < c2 which makes sure that c1 ∇fk⊤ pk > c2 ∇fk⊤ pk
(by ∇fk⊤ pk < 0), so that c2 flattens out the descent less than c1 .
f(x_k + α_k p_k) ≤ f_k + c_1 α_k ∇f_k^⊤p_k
∇f(x_k + α_k p_k)^⊤p_k ≥ c_2 ∇f_k^⊤p_k
Stopping the search whenever these are satisfied will exhibit sufficient decrease.
There are other similar variants that frequently appear in the literature, e. g.,
f(x_k + α_k p_k) ≤ f_k + c_1 α_k ∇f_k^⊤p_k
|∇f(x_k + α_k p_k)^⊤p_k| ≤ c_2 |∇f_k^⊤p_k|
Proof: Because Φ(α) = f (xk + αpk ) is bounded from below for all α > 0,
the affine function fk + αc1 ∇fk⊤ pk must intersect Φ(α) for some α > 0.
Thus
∇f (xk + α′′ pk )⊤ pk = c1 ∇fk⊤ pk > c2 ∇fk⊤ pk .
Therefore there is a neighborhood (and in it an interval) that satisfies Wolfe.
For strong Wolfe note that by ∇f_k^⊤p_k < 0 also ∇f(x_k + α″p_k)^⊤p_k = c_1∇f_k^⊤p_k < 0, hence |∇f(x_k + α″p_k)^⊤p_k| < c_2|∇f_k^⊤p_k|. □
Here the αi refer to the internal iterative process of searching for some α
satisfying the Wolfe conditions for Φ(α) = f (xk + αpk ).
For Newton type methods, the size of the step pk is typically meaningful.
In this case one uses the initial step length α0 = 1 whenever possible and
checks Armijo only, because the Newton process typically ensures a sufficient
minimal step length. If Armijo fails, one may use a simple backtracking
scheme like repeatedly setting α_{i+1} = (9/10)·α_i until Armijo holds.
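A minimal sketch of such a backtracking scheme (names and constants are illustrative; p is assumed to be a descent direction, i. e. gx @ p < 0):

import numpy as np

def backtracking_armijo(f, x, fx, gx, p, alpha0=1.0, c1=1e-4, shrink=0.9):
    """Shrink alpha until Armijo f(x + a p) <= f(x) + c1 a grad_f^T p holds."""
    slope = gx @ p                 # directional derivative, assumed < 0
    alpha = alpha0
    while f(x + alpha * p) > fx + c1 * alpha * slope:
        alpha *= shrink            # the 9/10-type reduction from the text
    return alpha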
If nothing is known about the step direction, a reasonable very first step
size is almost impossible to predict. In the hope that a step size of norm 1
has a reasonable meaning in the problem formulation, one often starts with
α_0 = 1/∥p∥. If previous iterations exist, one either uses the last successful step
size or employs interpolation exploiting previous function values.
Given the information Φ(0), Φ′ (0) = ∇fk⊤ pk , Φ(α0 ) and typically also
Φ′ (α0 ) = ∇f (xk + α0 pk )⊤ pk (if evaluating the gradient is not too expensive)
with α0 not yet satisfying Wolfe, the search progresses in two phases,
CHAPTER 2. SMOOTH UNCONSTRAINED OPTIMIZATION 29
where the coefficients are determined so that function values and derivatives coincide at the interpolation points, i. e., the coefficients are computed from the equations
cos θ_k = −∇f_k^⊤p_k / (∥∇f_k∥∥p_k∥) ≥ δ > 0.   (2.2)
It will turn out that this condition allows to show convergence of several
schemes under mild conditions. The main result seems rather technical at
first.
Combining the Zoutendijk condition with (2.2) yields δ² Σ_{k≥0} ∥∇f_k∥² < ∞, which implies ∇f_k → 0. So whenever the step direction is not approaching
orthogonality to steepest descent and xk → x̄ this x̄ must be a stationary
point by continuity. In optimization a point sequence is globally convergent if
∇fk → 0. Mind however, that convergence refers to the gradients converging
to zero. A globally convergent sequence may well rush off to infinity, consider
e. g. minimizing ex .
Before embarking on the proof, let us sketch the intuitive idea. Due to the Lipschitz condition the gradients cannot change too fast. The curvature condition, however, requires a significant change in the directional derivative, which yields a lower bound on the step length in terms of ∇f_k^⊤p_k/∥p_k∥². Together with Armijo this implies a lower bound on the decrease relative to (∇f_k^⊤p_k)²/∥p_k∥². But f is bounded below, so either the direction has to go bad or ∥∇f_k∥ has to go to zero.
Proof: Subtracting ∇f_k^⊤p_k < 0 on both sides of the curvature condition ∇f_{k+1}^⊤p_k ≥ c_2∇f_k^⊤p_k yields the middle relation in
L∥α_k p_k∥∥p_k∥ ≥(Lipsch., C.-S.) (∇f_{k+1} − ∇f_k)^⊤p_k ≥ (c_2 − 1)·∇f_k^⊤p_k > 0
(both c_2 − 1 < 0 and ∇f_k^⊤p_k < 0)
⇒ α_k ≥ (c_2 − 1)/L · ∇f_k^⊤p_k/∥p_k∥².
Plugging this lower bound into Armijo gives a minimal decrease,
f_{k+1} ≤ f_k + c_1 α_k ∇f_k^⊤p_k ≤ f_k + c_1·(c_2 − 1)/L·(∇f_k^⊤p_k)²/∥p_k∥² = f_k − c·cos²θ_k·∥∇f_k∥²,
where c := −c_1(c_2 − 1)/L > 0 and cos²θ_k·∥∇f_k∥² = (∇f_k^⊤p_k)²/∥p_k∥² by (2.2).
In any point the steepest descent direction is always orthogonal to the level
lines (why?) and for the circle this direction points right to the origin. So
when taking the starting point x̄ = (1, 1)^⊤ steepest descent with line search will find the optimum in one step.
Now consider scaling the first coordinate axis by 1000 (in applications this could correspond to changing the unit of the axis from millimeters to meters), i. e., replace x_1 by 1000x̃_1. The equivalent change in Q leads to Q̃ = [1000² 0; 0 1], the change to the starting point gives x̃ = (1/1000, 1)^⊤. The level sets are now very narrow ellipses corresponding to the high curvature along the first coordinate and the steepest descent direction in x̃ reads p̃ = −Q̃x̃ = −(1000, 1)^⊤. Even
doing an exact line search in this direction (given the level lines, where is
the line search optimum?) will go somewhat to the negative side of the first
coordinate and the method starts zigzagging in a very slow manner towards
the origin.
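The zigzagging is easy to reproduce numerically; a minimal sketch of steepest descent with exact line search on the badly scaled quadratic (for f(x) = ½x^⊤Qx the exact step length along −g is α = g^⊤g/(g^⊤Qg)):

import numpy as np

Q = np.diag([1000.0 ** 2, 1.0])       # rescaled quadratic f(x) = 0.5 x^T Q x
x = np.array([1.0 / 1000.0, 1.0])      # scaled starting point from the text
for k in range(5):
    g = Q @ x                          # steepest descent direction is -g
    alpha = (g @ g) / (g @ Q @ g)      # exact line search step length
    x = x - alpha * g
    print(k, x, 0.5 * x @ Q @ x)       # slow zigzag toward the origin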
Finally, consider Newton’s method instead. In the first case the direction
reads p = −Q−1 Qx̄ = −x̄, in the scaled case it reads p̃ = −Q̃−1 Q̃x̃ = −x̃, so
in each case it converges in one step. This should be no surprise as the Newton
step goes to the minimum of the quadratic model which coincides with the
function. In the exercises, however, you will prove that Newton’s method is
scale invariant meaning that it generates exactly the same sequence of steps
independent of the choice of the basis. This holds for all kinds of ellipsoids and not just for axis-aligned ones.
Chapter 3
Convex Analysis
The main goals are necessary and sufficient optimality conditions for convex
optimization problems as well as laying the foundation for convex optimization
algorithms. We will see that most convex properties have beautiful geometric
interpretations.
We first recall the basic definitions of convex sets and introduce convex cones.
Definition 3.1
• A set C ⊆ Rn is convex if for x, y ∈ C we have [x, y] = {αx + (1 −
α)y : α ∈ [0, 1]} ⊆ C.
• A set C ⊆ Rn is a convex cone, if for x, y ∈ C the open half ray
R++ (x + y) := {λ(x + y) : λ > 0} ⊆ C.
△_k := {α ∈ R^k : Σ_{i=1}^k α_i = 1, α_i ≥ 0, i = 1, …, k}
sketch simplex
Observation 3.2
(i) Let (C_j)_{j∈J} be a family of convex sets in R^n, then ⋂_{j∈J} C_j is convex.
(ii) Let Ci ∈ Rni be convex (or convex cones) for i = 1, . . . , k, then
C1 × · · · × Ck is convex (or a convex cone) in Rn1 +···+nk .
(iii) Let A : Rn → Rm be an affine map. For C ⊆ Rn convex the image
A(C) is convex in Rm and for D ⊆ Rm the preimage A−1 (D) is convex
in Rn .
(iv) If C_1, C_2 ⊆ R^n are convex, then the Minkowski-sum C_1 + C_2 is convex
and for α1 , α2 ∈ R the set α1 C1 +α2 C2 := {α1 x1 +α2 x2 : x1 ∈ C1 , x2 ∈
C2 } is convex.
(v) If C ⊆ Rn is convex, then so is its interior int C and its closure cl C.
Definition 3.3
• Let x_1, …, x_k ∈ R^n, α_1, …, α_k ≥ 0 with Σ_{i=1}^k α_i = 1 (⇔ α ∈ △_k), then Σ_{i=1}^k α_i x_i is a convex combination of the x_i.
• For S ⊆ R^n the convex hull is
conv S := ⋂_{S⊆C cvx} C
Observation 3.4
(i) A set C ⊆ Rn is convex ⇔ C contains every convex combination of
its elements.
(ii) Let S ⊆ R^n, then
conv S = {x ∈ R^n : ∃k ∈ N, x_1, …, x_k ∈ S, α ∈ △_k with x = Σ_{i=1}^k α_i x_i}.
"The convex hull is the set of all convex combinations of its elements."
(iii) Let S ⊆ R^n, then cone S = R_+(conv S) = conv(R_+S), where R_+S = {λs : λ ∈ R_+, s ∈ S}.
⋉⋊
Theorem 3.8 Let S ⊆ Rn . If S is compact/bounded, then conv S is com-
pact/bounded.
Observation 3.9
(i) For S ⊆ R^n there holds \overline{conv} S = cl conv S and \overline{cone} S = cl cone S (the closed convex and conic hulls).
(ii) For bounded S ⊆ R^n, cl conv S = conv cl S, thus \overline{conv} S = conv cl S.
(iii) For compact S ⊆ R^n with 0 ∉ conv S, \overline{cone} S = R_+(conv S) [= cone S].
α_0 = (1/(k+1))·(1 − Σ_{j=1}^k ξ_j) ≥ 0   (since Σ_{j=1}^k ξ_j ≤ ½, i. e., 1 − Σ_{j=1}^k ξ_j ≥ ½),
α_i = (1/(k+1))·(1 − Σ_{j=1}^k ξ_j) + ξ_i ≥ 0,
and Σ_{i=0}^k α_i = ((k+1)/(k+1))·(1 − Σ_{j=1}^k ξ_j) + Σ_{j=1}^k ξ_j = 1.
Proof: That the affine hull is the same follows by Th 3.10 (also for cl C, because aff C is closed). All other claims follow by Lem 3.11. For proving, e. g., that rint C and C have the same closure we have to show cl C ⊆ cl rint C. For C = ∅ this holds, so let x ∈ cl C and y ∈ rint C (≠ ∅ by Th 3.10). Then
sketch examples for rint (C1 ∩ C2 ) ̸= rint (C1 ) ∩ rint (C2 ) and cl (C1 ∩ C2 ) ̸= cl (C1 ) ∩ cl (C2 )
Theorem 3.14 Let ∅ ≠ C ⊆ R^n be closed and convex, ȳ ∈ C, x ∈ R^n.
Thus
0 ≤ −α⟨x − ȳ, y − ȳ⟩ + α²·½∥y − ȳ∥²
or
⟨x − ȳ, y − ȳ⟩ ≤ α·½∥y − ȳ∥² → 0 for α → 0.
Definition 3.16 Let K ⊆ Rn be a convex cone (not nec. closed). The polar
cone is
K ◦ = {s ∈ Rn : ⟨s, x⟩ ≤ 0 ∀x ∈ K}.
Proof: ⇒: By Th 3.14, ⟨x − ȳ, y − ȳ⟩ ≤ 0 for all y ∈ K. This also holds for
y = αȳ for α ≥ 0, so
≥ ½∥x − ȳ∥².
□
Exercise Prove pK (x) + pK ◦ (x) = x for x ∈ Rn .
⋉⋊
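For K = R^n_+ (with polar K° = R^n_−) both projections are componentwise truncations, so the claimed decomposition can be verified directly; a minimal numeric sketch:

import numpy as np

x = np.random.default_rng(0).normal(size=5)
pK = np.maximum(x, 0.0)    # projection onto K = R^n_+
pKo = np.minimum(x, 0.0)   # projection onto K° = R^n_-
print(np.allclose(pK + pKo, x), pK @ pKo == 0.0)  # sum is x, parts orthogonal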
In the convex case the basic result concerns separating a point from a convex
closed set.
Theorem 3.19 (convex separation) Let ∅ ≠ C ⊆ R^n be convex and closed and let x ∉ C. Then s := x − p_C(x) ≠ 0 satisfies ⟨s, y⟩ ≤ ⟨s, x⟩ − ∥s∥² < ⟨s, x⟩ for all y ∈ C.
Proof: Put s := x − pC (x) (̸= 0). By Th 3.14 there holds for all y ∈ C
0 ≥ ⟨x − pC (x), y − pC (x)⟩
= ⟨s, y − x + s⟩ = ⟨s, y⟩ − ⟨s, x⟩ + ∥s∥2 ,
0 > sup_{y_1∈C_1} ⟨s, y_1⟩ + sup_{y_2∈C_2} ⟨s, −y_2⟩ = sup_{y_1∈C_1} ⟨s, y_1⟩ − inf_{y_2∈C_2} ⟨s, y_2⟩.
Proof: By Obs 3.12 C, cl C and their complements have the same boundary.
Therefore there exist (xk )k∈N → x with
xk ̸∈ cl C. For each k, by Th 3.19 there is
an sk with ∥sk ∥ = 1 so that
⟨sk , xk − y⟩ > 0 ∀y ∈ cl C ⊇ C.
sketch construction of s
Because B1 (0) is compact the sequence (sk ) has a cluster point s and
⟨s, x − y⟩ ≥ 0 ∀y ∈ C.
Thus H^=_{s,r} with r := ⟨s, x⟩ ≥ ⟨s, y⟩ for y ∈ C supports C in x. □
Remark 3.23 If C is "flat", it may happen that C ⊆ H^=_{s,r}. For x ∈ rbd C
Closed convex sets can be fully described "from the outside" by their supporting hyperplanes.
P = {x ∈ Rn : ⟨ai , x⟩ ≤ bi , i = 1, . . . , m} = {x ∈ Rn : Ax ≤ b}.
When optimizing over polyhedral sets (like in LP) and one fails to find a point in
X = {x ≥ 0 : Ax = b},
one would like to give a preferably short proof that the set is indeed empty. Reformulated as the question:
Is b ∈ {Σ_i A_{•,i} x_i : x_i ≥ 0}?
The following Lemma of Farkas shows that there is indeed a good way to
answer this, i. e., there is a short proof also for infeasibility. This famous
lemma exists in several variants. We start with one in the form of a theorem
of the alternative.
Lemma 3.28 (Farkas) Let b ∈ Rn , A ∈ Rn×m . Exactly one of the systems
Ax = b, x ≥ 0 and A⊤ y ≤ 0, b⊤ y > 0 has a solution.
If b does not lie in the cone, the cone only needs to be closed for the Lemma
to follow directly from Th 3.19. Therefore it suffices to prove the following
reformulated variant.
Lemma 3.29 (Farkas, cone version) Let a_1, …, a_m ∈ R^n. The convex cone K = cone{a_1, …, a_m} = {Σ_{i=1}^m λ_i a_i : λ_i ≥ 0, i = 1, …, m} is closed.
Proof: Let (b_k)_{k∈N} → b with b_k ∈ K, so b_k = Σ_{i=1}^m λ_i^k a_i with λ_i^k ≥ 0 for all i, k. If the a_i are linearly independent, the map A = [a_1, …, a_m] is injective and the inverse map to Aλ^k = b_k is continuous. In this case λ_i^k → λ_i ≥ 0 with b = Σ_{i=1}^m λ_i a_i ∈ K and K is closed.
If the ai are linearly dependent, the λk may vary a lot due to linearly
dependent contributions. In order to get rid of linear dependence, apply the
conic version of Caratheodory, by which each bk can be represented as a conic
combination of linearly independent vectors ai . As there are only finitely
many linearly independent subsets of {a1 , . . . , am }, each corresponding to a
subcone of K, one of these subcones has to contain infinitely many of the bk
and, by the previous argument, also b. □
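Numerically, which of the two alternatives of Lem 3.28 holds can be decided by two auxiliary LPs; a minimal sketch using scipy (the box on y merely keeps the certificate LP bounded, any positive bound works):

import numpy as np
from scipy.optimize import linprog

def farkas_alternative(A, b):
    """Return ('x', x) with Ax = b, x >= 0, or ('y', y) with A^T y <= 0, b^T y > 0."""
    m, n = A.shape
    res = linprog(np.zeros(n), A_eq=A, b_eq=b)   # default bounds give x >= 0
    if res.status == 0:
        return 'x', res.x
    res = linprog(-b, A_ub=A.T, b_ub=np.zeros(n), bounds=[(-1.0, 1.0)] * m)
    return 'y', res.x                            # b^T y > 0 by Lem 3.28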
⋉⋊
Faces and Extreme Points
Exercise What are faces if C is the 2D unit-disk or, more generally, the
n-dimensional unit-ball?
⋉⋊
Definition 3.31 Let C ⊆ Rn be convex.
• ∅ and C are the trivial faces of C.
• A face F of C having dim(F ) = dim C − 1 is a facet of C.
• A face F of C having dim(F ) = 1 is an edge of C.
• A face F of C having dim(F ) = 0, thus F = {x}, is an extreme point
of C and this also refers to the point x itself.
• A subset F ⊆ C is an exposed face of C if there is a supporting hyperplane H^=_{s,r} of C with F = H^=_{s,r} ∩ C.
• An exposed extreme point of C is a vertex.
Proof: Let F be an exposed face for a supporting hyperplane H^=_{s,r} to C. Any x_1, x_2 ∈ C satisfy ⟨s, x_1⟩ ≤ r and ⟨s, x_2⟩ ≤ r. Suppose there exists x̄ = αx_1 + (1 − α)x_2 ∈ F for some α ∈ (0, 1), then
r = ⟨s, x̄⟩ = α⟨s, x_1⟩ + (1 − α)⟨s, x_2⟩ ≤ αr + (1 − α)r = r.
⋉⋊
Theorem 3.35 (Minkowski/Krein-Milman) Let C ⊆ Rn be convex and
compact. Then C is the convex hull of its extreme points.
and
Argmax_{x∈C} ⟨s, x⟩ = conv Argmax{⟨s, x⟩ : x extreme point of C},
where Argmax denotes the set of maximizing arguments.
Attention! For non-convex S the tangent cone is in general non-convex as well. It is a cone in the sense of satisfying d ∈ T ⇒ ∀λ > 0: λd ∈ T. △
Proof: Let (d_ℓ)_{ℓ∈N} → d with d_ℓ ∈ T_S(x). Each d_ℓ is the limit of a feasible sequence (x_{ℓ,k}), (t_{ℓ,k}) with k ∈ N. For each ℓ ∈ N there exists a k_ℓ with ∥(x_{ℓ,k_ℓ} − x)/t_{ℓ,k_ℓ} − d_ℓ∥ ≤ 1/ℓ and t_{ℓ,k_ℓ} < t_{ℓ−1,k_{ℓ−1}}. The feasible sequence (x_{ℓ,k_ℓ}), (t_{ℓ,k_ℓ}), ℓ ∈ N, satisfies lim_{ℓ→∞} (x_{ℓ,k_ℓ} − x)/t_{ℓ,k_ℓ} = d. □
For convex sets the tangent cone may be described in a rather intuitive form.
Observation 3.40 Let C ⊆ R^n be closed and convex and let x ∈ C. The tangent cone to C in x is the closure of the cone generated by C − {x},
Proof: The second and third equality hold by Obs 3.4(iii) and Obs 3.9(i),
we only need to prove the first equality.
⊇: C − {x} ⊆ TC (x), because x + td ∈ C for all d ∈ C − {x} and all t ∈ [0, 1].
By Obs 3.39 TC (x) is closed, therefore cl (R+ (C − {x})) ⊆ TC (x).
⊆: Let d ∈ T_C(x) arise from a feasible sequence (x_k), (t_k), then (x_k − x)/t_k ∈ R_+(C − {x}). Hence, the limit is contained in cl(R_+(C − {x})). □
Definition 3.41 Let C ⊂ Rn be convex. A direction s ∈ Rn is normal to C
in x ∈ C, if
⟨s, y − x⟩ ≤ 0 for all y ∈ C.
The normal cone NC (x) to C in x is the set of all such directions.
Definition 3.44
• The set Conv Rn denotes the set of proper convex functions f : Rn →
R̄ := R ∪ {∞} with f not identical to +∞.
• dom f := {x : f (x) < ∞}, the domain of f is the set of points with
finite function values (for f ∈ Conv Rn it is nonempty).
• The set \overline{Conv} R^n denotes the set of functions f ∈ Conv R^n whose epigraph epi f is closed.
Proof: Inductively by splitting off α_k x_k via ȳ = Σ_{i=1}^{k−1} (α_i/(1 − α_k)) x_i. □
Of particular importance are the affine functions
Putting s̄ = s/|ρ| and r = f(x) proves the claim. □
Convex functions may have strange behavior along the (relative) boundary of the domain but are quite tame inside. In order to show this, we need an intermediate technical step.
|f(y) − f(y′)| ≤ ((M − m)/δ)·∥y − y′∥ ∀y, y′ ∈ B̊_δ(x̄).
y″ := y′ + δ·(y′ − y)/∥y′ − y∥ ∈ B̊_{2δ}(x̄).

y′ = (∥y′ − y∥/(δ + ∥y′ − y∥))·y″ + (δ/(δ + ∥y′ − y∥))·y,   with the coefficients =: α and 1 − α.
Theorem 3.48 Let f ∈ Conv Rn and let S be a compact subset of rint dom f .
There exists an L = L(S) ≥ 0 with
Proof: W. l. o. g. assume aff dom f = R^n (thus rint dom f = int dom f) and S ⊆ int dom f convex and compact (otherwise work on conv S ⊆ int dom f).
⇒ |f(x) − f(x′)| ≤ Σ_{i=1}^ℓ |f(y_i) − f(y_{i−1})| ≤(B̊) Σ_{i=1}^ℓ L_{k_i}∥y_i − y_{i−1}∥ ≤ L∥x − x′∥.
It remains to prove (B̊) by means of Lem 3.47. For x̄ ∈ int dom f choose,
similar to the proof of Th 3.10, a δ > 0 so that
Each y ∈ B̊_{2δ}(x̄) may be written in the form y = Σ_{i=0}^n α_i v_i for some α ∈ △_{n+1}.
Convexity of f implies
f(y) ≤(Th 3.45) Σ_{i=0}^n α_i f(v_i) ≤ max_i f(v_i) =: M.
For smooth functions f, function value f(x) and gradient ∇f(x) describe a tangent plane to the epigraph epi f at (x, f(x)) and for any direction p ∈ R^n the directional derivative is obtained by ∇f(x)^⊤p.
⇒ (f(x + t_2 p) − f(x))/t_2 ≤ (f(x + t_1 p) − f(x))/t_1.   □
Now consider, for some fixed x ∈ dom f , the directional derivative f ′ (x; ·) as
a function of the direction. Its epigraph turns out to be a cone emanating
from the origin, f′(x; 0) = 0. For ease of exposition we consider this for finite-valued functions f only.
f (x + t(α1 p1 + α2 p2 )) − f (x) =
= f (α1 (x + tp1 ) + α2 (x + tp2 )) − α1 f (x) − α2 f (x)
≤ α1 [f (x + tp1 ) − f (x)] + α2 [f (x + tp2 ) − f (x)].
Exercise Show that support functions are sublinear and closed (the epigraph
is a closed set).
⋉⋊
Geometrically, epi f′(x; ·) = T_{epi f}((x, f(x))), in words, the epigraph of the directional derivative at x is at the same time the tangent cone to the epigraph of the function f at x. The next result shows that
∂f(x) = {s : (s, −1) ∈ [T_{epi f}((x, f(x)))]°}
[⇔ ⟨(s, −1), (p, f′(x; p))⟩ ≤ 0 ∀p ∈ R^n].
Proof: Recall, by Obs 3.50, f′(x; p) = inf_{t>0} (f(x + tp) − f(x))/t.
⊆: s ∈ ∂f(x) ⇒ ∀t > 0: (f(x + tp) − f(x))/t ≥ (f(x) + t⟨s, p⟩ − f(x))/t = ⟨s, p⟩ by the subgradient inequality.
⊇: ∀p ∈ R^n: ⟨s, p⟩ ≤ f′(x; p) ≤ (f(x + tp) − f(x))/t for all t > 0, thus f(x + tp) ≥ f(x) + t⟨s, p⟩, i. e., s ∈ ∂f(x). □
ı_C : R^n → R̄,   x ↦ ı_C(x) = 0 if x ∈ C, ∞ if x ∉ C.
Because ıC also has infinite values, some properties of its subdifferential need
to be proven explicitly, but this is a good exercise.
Observation 3.56 Let C ⊆ Rn be closed and convex. For the indicator
function ıC , any x ∈ C and convex f : Rn → R there hold
(i) ∂ıC (x) = NC (x),
(ii) ∂(f + ıC )(x) = ∂f (x) + NC (x).
Proof: By inf x∈C f (x) = inf x∈Rn (f + ıC )(x) and Obs 3.56, (i)⇔(iii) follows
from the subgradient inequality for (f + ıC ) exactly in the same way as in
Th 3.55.
(i)⇒(ii): For y ∈ C Obs 3.40 asserts p = y − x ∈ T_C(x) and by minimality f(x + tp) ≥ f(x) for all t ∈ [0, 1], thus f′(x; p) = lim_{t↘0} (f(x + tp) − f(x))/t ≥ 0. For arbitrary p ∈ T_C(x) there are y_k ∈ C and p_k = α_k(y_k − x) with p_k → p. By
Let L be the Lipschitz constant of f and assume that the projection pC (·)
can be computed efficiently (e. g., C = Rn , a friendly cone C = Rn+ or a box
like C = [0, 1]n ).
f is assumed to be given via a first order oracle. Its evaluation for some
x ∈ C yields
In words, −g(x) points “into the direction” of x∗ in the sense, that going a
sufficiently small step into this direction allows to get closer to x∗ . Unfortu-
nately, ∥g(x)∥ does not provide a good indicator for a suitable step length in
the nonsmooth case and line search techniques may lead to false convergence.
The surprisingly simple way out is to fix the step lengths in advance.
thus
r_i² + h_i² ≥ r_{i+1}² + 2h_i·⟨g(x_i)/∥g(x_i)∥, x_i − x*⟩,
and summing over i = 0, …, k,
r_0² + Σ_{i=0}^k h_i² ≥ r_{k+1}² + Σ_{i=0}^k 2h_i·⟨g(x_i)/∥g(x_i)∥, x_i − x*⟩
≥ min_{i=0,…,k} ⟨g(x_i)/∥g(x_i)∥, x_i − x*⟩ · Σ_{i=0}^k 2h_i
(r_{k+1}², the h_i and the inner products all being ≥ 0).
Let the minimum be attained for ı̄ and apply the subgradient inequality for
g(xı̄ ),
□
If an upper bound R > r_0 is known, a good choice is
h_k = R/√(k+1).
With this the factor after L is roughly (r_0² + R²·ln(k+1))/(2R√(k+1)), so convergence is quite slow. The typical behavior in practice is to obtain good decrease initially
and then progress becomes extremely slow. Unfortunately, it can be proven
that over all convex functions the worst case behavior of this algorithm is
best possible. Smooth cases and smoothing techniques may, however, bring
along considerable improvements. In fact, for huge dimensional problems
like in data science rather simple variations of this method belong currently
to the most efficient approaches. For smaller dimensions other methods like
bundle methods that form models of the function by collecting supporting
hyperplanes, may exhibit significantly better practical performance.
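As a small illustration of the fixed step lengths, here is a minimal sketch of the projected subgradient method for the hypothetical instance f(x) = ∥x − a∥_1 over the box C = [0, 1]^n, where the projection is a simple clipping:

import numpy as np

def projected_subgradient(a, num_steps=500, R=2.0):
    x = np.zeros_like(a)                      # starting point in C
    best, best_f = x.copy(), np.abs(x - a).sum()
    for k in range(num_steps):
        g = np.sign(x - a)                    # a subgradient of the 1-norm
        ng = np.linalg.norm(g)
        if ng == 0:
            break                             # x already minimizes f
        h = R / np.sqrt(k + 1.0)              # predetermined step sizes
        x = np.clip(x - h * g / ng, 0.0, 1.0) # step, then project onto C
        fx = np.abs(x - a).sum()
        if fx < best_f:
            best, best_f = x.copy(), fx
    return best, best_f

print(projected_subgradient(np.array([1.5, -0.3, 0.7])))  # ~((1, 0, 0.7), 0.8)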
Chapter 4
Constrained Optimality Conditions
The starting point will be necessary optimality conditions for local optima in
the case of minimizing a smooth objective over a feasible set of general form,
Stationary points are good candidates for local optima. For stationary points
the negative gradient lies in “the normal cone” to the tangent cone of the
feasible set, so all descent directions lead out of the feasible set.
For general sets X , the tangent cone TX and its polar cone are hard to
determine. In practice, X typically arises as the intersection of level sets of
inequality constraints and of solution sets to equality constraints like in
min f (x) f : Rn → R suff. smooth
s. t. gi (x) ≤ 0 i ∈ I = {1, . . . , nI }, gi : Rn → R suff. smooth,
(P)
hj (x) = 0, j ∈ E = {1, . . . , nE }, hj : Rn → R suff. smooth,
x ∈ Rn .
Throughout this chapter we consider problems of this form with f, gi , hi
given by first order (sometimes second order) oracles. From now on the
feasible set is
For the single level sets the tangent cone and its polar cone are typically easy
to determine in a current candidate point using the information available
from first order oracle evaluations (function value and gradient).
sketch 3D epigraph with 2D level set for gi with ∇gi ; 2D solution set to hj (x) = 0 with ∇hj
S_0(g) = {x̄}, T = {0}, T° = R^n.
sketch 3D epi g with tangent plane and X
The hope is that most of the time T_X(x) is obtained as the intersection of the tangent cones of the single constraints.
For constraints that are active in x, nonzero
gradients describe supporting hyperplanes to
the respective level sets and in “regular cases”
the corresponding halfspaces are the respective
tangent cones shifted to x. The tangent cone
to X in x is contained in the intersection of
the tangent cones to the single level sets.
However, even in cases where the tangent
cones of the single constraints are described
correctly this may still go wrong. “Regularity
conditions” will be needed to exclude cases of
misdescription.
sketch 2D two ineqs and two circles
Remark 4.4
• TP (x) is a polyhedral (convex) cone and (by Farkas, see the proof of
Th 4.8 below)
[T_P(x)]° = {Σ_{i∈A(x)} λ_i ∇g_i(x) + Σ_{j∈E} μ_j ∇h_j(x) : λ ∈ R_+^{A(x)}, μ ∈ R^E}.
• Note, however, that TP (x) and [TP (x)]◦ do not so much depend on X
but rather on the description of X by the gi and hj of (P).
• As observed above, in general, TP (x) ̸= TX (x). Yet another example is
the following simple optimization problem.
max x_1
s. t. g_1(x) := x_1³ + x_2 ≤ 0
g_2(x) := −x_2 ≤ 0
with optimal solution x* = (0, 0)^⊤:
T_X(x*) = {λ(−1, 0)^⊤ : λ ≥ 0},
T_P(x*) = {p : (3(x_1*)², 1)p ≤ 0, (0, −1)p ≤ 0} = {(p_1, 0)^⊤ : p_1 ∈ R}.
Lemma 4.5 For x ∈ X (P) there holds TX (x) ⊆ TP (x) and therefore also
[TX (x)]◦ ⊇ [TP (x)]◦ .
Of course this basic relation is often too difficult to check. Later, stronger
conditions will be introduced that ensure this equality and are typically
easier to verify. For the time being (GCQ) is exactly the right condition for
our purposes.
The basic optimality conditions for smooth constrained optimization problems
of the form (P) are not formulated with respect to these cones directly but
build on the Lagrange function approach for including constraints in the
objective by means of Lagrange multipliers.
Definition 4.7 For the constrained optimization problem (P) define the Lagrangian L : R^n × R^I × R^E → R by
L(x, λ, μ) = f(x) + Σ_{i∈I} λ_i g_i(x) + Σ_{j∈E} μ_j h_j(x).
(KKT)   ∇_x L(x, λ, μ) = 0
        g_i(x) ≤ 0, i ∈ I   [feasibility for (P)]
        h_j(x) = 0, j ∈ E   [feasibility for (P)]
        λ_i ≥ 0, i ∈ I
        Σ_{i∈I} λ_i g_i(x) = 0   [complementarity]
(ii) Each point (x∗ , λ∗ , µ∗ ) satisfying the KKT conditions is called a KKT-
point (of (P)) and λ∗ , µ∗ are called Lagrange multipliers (of x∗ ).
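A minimal sketch of checking these conditions numerically (the interface and the toy problem are hypothetical; equality constraints omitted for brevity, the stacked residual vanishes exactly at a KKT point):

import numpy as np

def kkt_residual(grad_f, gs, grad_gs, x, lam):
    stationarity = grad_f(x) + sum(l * dg(x) for l, dg in zip(lam, grad_gs))
    feasibility = np.maximum([g(x) for g in gs], 0.0)   # violated g_i(x) <= 0
    sign = np.maximum(-np.asarray(lam), 0.0)            # violated lam_i >= 0
    complementarity = np.array([l * g(x) for l, g in zip(lam, gs)])
    return np.concatenate([stationarity, feasibility, sign, complementarity])

# min x1 + x2 s.t. x1^2 + x2^2 - 2 <= 0 has KKT point x = (-1, -1), lam = 1/2:
print(kkt_residual(lambda x: np.ones(2), [lambda x: x @ x - 2.0],
                   [lambda x: 2.0 * x], np.array([-1.0, -1.0]), [0.5]))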
The KKT conditions only require the information (function values and
gradients) available via the first order oracles of the functions involved. They
form the basis of almost all algorithmic approaches for searching for points
satisfying the first order necessary conditions. It is therefore important
to understand the relation between KKT-points and first order necessary
conditions in depth.
Suppose (x̄, λ̄, μ̄) is a KKT-point. Then x̄ ∈ X is feasible and by complementarity λ̄_i = 0 for i ∉ A(x̄). So, by the arguments above, the existence of Lagrange multipliers implies
−∇f(x̄) = Σ_{i∈A(x̄)} λ̄_i ∇g_i(x̄) + Σ_{j∈E} μ̄_j ∇h_j(x̄) ∈ [T_P(x̄)]° ⊆ [T_X(x̄)]°.
On the other hand, for a given stationary point x̄ (i. e., a point satisfying the
necessary optimality conditions) it may happen that there is no solution of
the KKT-system, if the algebraically derived (linearized) description [TP (x̄)]◦
falls short of spanning the full geometric [TX (x̄)]◦ . If no Lagrange multipliers
exist, it is almost impossible to algorithmically recognize a given stationary
point as such. If, however, certain regularity conditions at x̄ ensure that
(GCQ) is satisfied, then Lagrange multipliers exist and the stationarity
property of x̄ can be recognized algorithmically. As pointed out before,
this is a direct consequence of the polarity relation of the cones due to the
Farkas-Lemma, but because of its importance the proof is given explicitly.
Theorem 4.8 (KKT conditions under (GCQ)) Let x∗ be a local mini-
mum of (P) that satisfies (GCQ). There exist Lagrange multipliers λ∗ ∈ RI+
and µ∗ ∈ RE so that (x∗ , λ∗ , µ∗ ) is a KKT-point.
because extending any λ* ∈ R_+^{A(x*)}, μ* ∈ R^E that generate −∇f(x*) by putting λ_i* = 0 for i ∈ I \ A(x*) to an appropriate λ* ∈ R_+^I yields a KKT-point (x*, λ*, μ*), as can be checked by direct inspection.
⊇: follows by linearity of the inner product and direct computation, because for p ∈ T_P(x) there holds ⟨∇g_i(x), p⟩ ≤ 0 for i ∈ A(x) and ⟨∇h_j(x), p⟩ = 0 for j ∈ E.
⊆: First note that TP (x) = {p : [G, H, −H]⊤ p ≤ 0} with G = [∇gi (x)]i∈A(x) ,
H = [∇hj (x)]j∈E . Let d ∈ [TP (x)]◦ , then d⊤ p ≤ 0 for all p ∈ TP (x), i. e.,
TX (x) = TP (x).
Proof: (ACQ) directly implies (GCQ), so the result follows from Th 4.8 □
Because the linearized tangent cone TP is always convex it is not difficult to
come up with examples where (GCQ) holds but (ACQ) does not, so (GCQ)
is indeed more general.
Next we explore conditions that ensure (ACQ). By Lem 4.5 TX (x) ⊆ TP (x)
always holds, therefore the important part is to establish under which
conditions every direction in TP is also contained in TX , i. e., for every
p ∈ TP (x) it must be possible to construct a feasible sequence (xk )k∈N → x
and t_k ↘ 0 so that p = lim_{k→∞} (x_k − x)/t_k is its limiting direction. We have seen
is its limiting direction. We have seen
already that this will not always be possible.
It is possible, however, if we can prove the existence of a “feasible” curve
in X that goes through x and has p as its tangent direction in x. The
mathematical tool for showing the existence of curves in the intersection of
nonlinear equations is the implicit function theorem.
Theorem (Implicit Function Theorem) Let F : Rk × Rm → Rk and
ȳ ∈ Rk satisfy
(i) F (ȳ, 0) = 0,
(ii) F is continuously differentiable in a neighborhood U (ȳ, 0),
(iii) [JF (ȳ, t)]y is regular in (ȳ, 0) (the y-columns of the Jacobian).
The function y(·) : Rm → Rk implicitly defined by F (y(t), t) = 0 and y(0) = ȳ
is well defined and it is continuously differentiable in a neighborhood of the
origin. In particular there holds
J_{y(·)}(0) = −[J_F(ȳ, 0)]_y^{-1} [J_F(ȳ, 0)]_t.
(For a proof see, e. g., Heuser, “Lehrbuch der Analysis”, Part 2.)
Proof: The guiding idea is to form x(t) = x̄ + tp + y(t) with y(·) a correction
term for the equality constraints. Put k = |E| and define F : Rk × R → Rk
via
Fj (y, t) = hj (x̄ + tp + Hy), i ∈ E where H = [∇hj (x̄)]j∈E .
By assumption H has full column rank. With this,
• F (0, 0) = 0, because hj (x̄) = 0 for j ∈ E,
• F is cont. diff., because the h_j as well as x̄ + tp + Hy are,
• ∇y Fj (y, t) = H ⊤ ∇hj (x̄ + tp + Hy) [recall, the gradient is a column]
therefore [JF (0, 0)]y = H ⊤ H; it is positive definite, hence regular.
By the Implicit Function Theorem there exists ε̂ > 0 and a continuously
differentiable y(·) : (−ε̂, ε̂) → Rk with
• F (y(t), t) = 0 for t ∈ (−ε̂, ε̂),
• y(0) = 0,
• y′(0) = J_{y(·)}(0) = −[J_F(0, 0)]_y^{-1}[J_F(0, 0)]_t = −(H^⊤H)^{-1}[H^⊤p] = 0, because ∇h_j(x̄)^⊤p = 0 for j ∈ E.
□
This result motivates the following regularity condition.
Proof: By Cor 4.10 and Lem 4.5 (TX (x∗ ) ⊆ TP (x∗ )) it suffices to prove
TP (x∗ ) ⊆ TX (x∗ ) for x∗ satisfying (MFCQ).
Let p ∈ TP (x∗ ), then ∇hj (x∗ )⊤ p = 0, j ∈ E.
First suppose ∇g_i(x*)^⊤p < 0 holds for all i ∈ A(x*). Then p ∈ T_X(x*) follows from Lem 4.11 by choosing x_k = x(ε/k), t_k = ε/k, which yields p = lim_{k→∞} (x_k − x*)/t_k.
If ∇g_i(x*)^⊤p = 0 for some i ∈ A(x*), then by (MFCQ) there exists p̄ with ∇g_i(x*)^⊤p̄ < 0, i ∈ A(x*). Put p_k = (1 − 1/k)p + (1/k)p̄, then p_k ∈ T_X(x*) for
Thus (MFCQ) ⇒ (ACQ), but in general (ACQ) ̸⇒ (MFCQ).
sketch x_2 ≤ x_1², x_2 ≥ 0 in the origin
The arguably most popular regularity condition just requires checking the linear independence of the gradients to active constraints, because there are reasonably efficient numerical linear algebra routines for doing so.
Definition 4.14 A feasible x̄ ∈ X satisfies the linear independence con-
straint qualification (LICQ) if the gradients ∇gi (x̄), i ∈ A(x̄), and ∇hj (x̄),
j ∈ E, are linearly independent.
Theorem 4.15 (KKT conditions under (LICQ)) Let x∗ be a local min-
imum of (P) that satisfies (LICQ). There exist Lagrange multipliers λ∗ ∈ RI+ ,
µ∗ ∈ RE so that (x∗ , λ∗ , µ∗ ) is a KKT-point.
sketch x_2 ≥ −x_1³, x_2 ≥ 0 in the origin
Proof: By Cor 4.10 and Lem 4.5 it suffices to prove T_P(x*) ⊆ T_X(x*). Let p ∈ T_P(x*). For x(t) = x* + tp there holds g_i(x(t)) = g_i(x*) + t∇g_i^⊤p, i ∈ I, and h_j(x(t)) = h_j(x*) + t∇h_j^⊤p, j ∈ E, so x(t) is feasible for t ≥ 0 for g_i, i ∈ A(x*), and h_j, j ∈ E.
For g_i, i ∈ I \ A(x*), it is feasible for 0 ≤ t ≤ inf{−g_i(x*)/(∇g_i^⊤p) : ∇g_i^⊤p > 0, i ∈ I \ A(x*)} =: t̄ > 0. Choose t_k = min{1, t̄}/k, k ∈ N, and x_k = x(t_k) = x* + t_k p to see p = (x* + t_k p − x*)/t_k ∈ T_X(x*). □
For convex problems (convex f and gi , i ∈ I, affine hj , j ∈ E) the affine
constraints are well represented by the linearized tangent cone. For the non
affine convex gi everything works out as long as there is a common point
that lies in the relative interior of each of the level sets.
Proof: By Cor 4.10 and Lem 4.5 it suffices to prove TP (x∗ ) ⊆ TX (x∗ ). Let
p ∈ TP (x∗ ).
Because x̄ ∈ X with gi (x̄) < 0 for non affine gi , i ∈ I, the direction p̄ = x̄−x∗
satisfies
∇h_j^⊤p̄ = 0, j ∈ E (affine),
∇g_i^⊤p̄ ≤ 0, i ∈ A(x*), g_i affine,        ⇒ p̄ ∈ T_P(x*)
∇g_i(x*)^⊤p̄ < 0, i ∈ A(x*), g_i not affine
Put p_k = (1 − 1/k)p + (1/k)p̄ ∈ T_P, k ∈ N.
For each k ∈ N there is ε̄_k > 0 so that x* + εp_k ∈ X(P) for all 0 ≤ ε ≤ ε̄_k, because
j ∈ E: h_j(x* + εp_k) = h_j(x*) + ε∇h_j^⊤p_k = 0,
i ∈ A(x*), affine: g_i(x* + εp_k) = g_i(x*) + ε∇g_i^⊤p_k ≤ 0,
i ∈ A(x*), not affine: g_i(x* + εp_k) = g_i(x*) + ε∇g_i(x* + θεp_k)^⊤p_k < 0 for ε̄_k small enough by continuity of ∇g_i,
i ∈ I \ A(x*): g_i(x* + εp_k) < 0 by continuity of g_i.
Thus, p_k ∈ T_X(x*) and, because T_X(x*) is closed, lim_{k→∞} p_k = p ∈ T_X(x*). □
For convex problems the KKT conditions are also sufficient for optimality.
This follows via −∇f (x∗ ) ∈ NC (x∗ ) and Th 3.57 but is also quickly proved
directly.
Theorem 4.19 (KKT conditions, sufficiency for smooth convex case)
Let (P) be a convex optimization problem with convex differentiable f, gi , i ∈ I
and affine hj , j ∈ E and let (x∗ , λ∗ , µ∗ ) be a KKT-point. Then x∗ is a global
minimum.
If a Slater point exists for (P), equality holds for all x̄ ∈ X , but we will not
prove this here.
Tλ (x) = {p ∈ TP (x) : ∇gi (x)⊤ p = 0 for all i ∈ A(x) with λi > 0}.
Note, if (LICQ) holds in x, the multipliers are unique and the cone depends
on x only. The dependence on λ is, however, relevant whenever Lagrange
multipliers are not unique.
Theorem 4.21 (Second Order Necessary Optimality Conditions) Let
x∗ be a local optimal solution of (P) satisfying (LICQ) with λ∗ and µ∗ the
(unique) Lagrange multipliers for x∗ . There holds
Let G = [∇gj (x∗ )] with j ∈ {i ∈ A(x∗ ) : ∇gi (x∗ )⊤ p = 0} =: J ⊇ {i : λ∗i > 0}.
Due to (LICQ), G has full column rank. Put
L(x_k, λ*) = L(x*, λ*) + ∇_x L(x*, λ*)^⊤(x_k − x*) + ½(x_k − x*)^⊤∇_{xx}L(ξ_k, λ*)(x_k − x*),
where L(x_k, λ*) = f(x_k) by complementarity, and L(x*, λ*) = f(x*) and ∇_x L(x*, λ*) = 0 by (KKT).
□
The corresponding sufficient conditions start with a KKT-point and therefore
do not need to assume (LICQ).
Proof: For ease of notation let E = ∅. The proof shows that every feasible sequence (X \ {x*}) ∋ (x_k)_{k∈N} → x* satisfies f(x_k) > f(x*). By the usual compactness argument the sequence p_k = (x_k − x*)/∥x_k − x*∥ has a subsequence K ⊆ N converging to some cluster point p, w. l. o. g. p_k → p. By Lem 4.5, p ∈ T_X(x*) ⊆ T_P(x*). Discern the following two cases.
p ∈ T_{λ*}(x*):
f(x_k) ≥ f(x_k) + Σ_i λ_i* g_i(x_k) = L(x_k, λ*)   (using g_i(x_k) ≤ 0)
=(Taylor) L(x*, λ*) + [∇_x L(x*, λ*)^⊤ + ½(x_k − x*)^⊤∇_{xx}L(ξ_k, λ*)](x_k − x*)
= f(x*) + ½∥x_k − x*∥²·((x_k − x*)/∥x_k − x*∥)^⊤ ∇_{xx}L(ξ_k, λ*) ((x_k − x*)/∥x_k − x*∥),
where L(x*, λ*) = f(x*) and ∇_x L(x*, λ*) = 0 by (KKT) and complementarity, and the last factor converges to p^⊤∇_{xx}L(x*, λ*)p > 0 for k → ∞.

p ∈ T_P(x*) \ T_{λ*}(x*): There exists j ∈ A(x*) with λ_j* > 0 and ∇g_j(x*)^⊤p < 0. Now
lim_{k→∞} (f(x_k) − f(x*))/∥x_k − x*∥ =(Taylor, ξ_k ∈ [x_k, x*]) lim_{k→∞} ∇f(ξ_k)^⊤(x_k − x*)/∥x_k − x*∥
=(KKT) ∇f(x*)^⊤p = −Σ_{i∈I} λ_i*∇g_i(x*)^⊤p > 0.
Let the columns of Z and Z̄ form a basis of T_λ(x) and T_λ̄(x), respectively, then
Z^⊤∇_{xx}L(x, λ, μ)Z ⪰ 0 is necessary for optimality of x,
Z̄^⊤∇_{xx}L(x, λ, μ)Z̄ ≻ 0 is sufficient for optimality of x.
Sensitivity
min f(x)
(P_δ) s. t. g_i(x) ≤ δ_i, i ∈ I,
h_j(x) = δ_j, j ∈ E.

F(x, μ, δ) = ( ∇f(x) + H(x)μ ; h(x) − δ ) = 0.
ℓ: X × Y → R with X ⊆ Rn , Y ⊆ Rm .
(x, y) 7→ ℓ(x, y)
♡
What is happening if the positions of inf and sup are swapped?
Observation 5.1 (weak duality) inf_{x∈X} sup_{y∈Y} ℓ(x, y) ≥ sup_{y∈Y} inf_{x∈X} ℓ(x, y).
□
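For finite ground sets the observation can be checked directly; a minimal numeric sketch (rows indexed by x, columns by y, an arbitrary payoff matrix assumed):

import numpy as np

L = np.random.default_rng(1).normal(size=(4, 3))  # l(x, y) = L[x, y]
inf_sup = L.max(axis=1).min()   # inf_x sup_y l(x, y)
sup_inf = L.min(axis=0).max()   # sup_y inf_x l(x, y)
print(inf_sup >= sup_inf)       # weak duality: always True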
Example Continuing the example before,
inf_{x∈K_1} sup_{y∈K_2°} [c^⊤x + b^⊤y − x^⊤A^⊤y] ≥ sup_{y∈K_2°} inf_{x∈K_1} [b^⊤y + ⟨c − A^⊤y, x⟩] =

max b^⊤y
s. t. A^⊤y ≤ c,
y ≤ 0.
♡
This approach to duality is known as “Lagrangian duality”. Note, weak
duality holds in general and does not require any convexity assumptions at
all. The decisive questions are
• Under which conditions is inf x supy ℓ(x, y) = supy inf x ℓ(x, y)?
• If equality holds, do there exist x̄, ȳ that attain this value?
Definition 5.2 A pair (x̄, ȳ) ∈ X × Y is a saddle point of a function
ℓ : X × Y → R if sup ℓ(x̄, y) = ℓ(x̄, ȳ) = inf ℓ(x, ȳ).
y∈Y x∈X
The following variants are equivalent to this. For (x̄, ȳ) ∈ X × Y there holds
ℓ(x̄, y) ≤ ℓ(x̄, ȳ) ≤ ℓ(x, ȳ) for all (x, y) ∈ X × Y,
ℓ(x̄, y) ≤ ℓ(x, ȳ) for all (x, y) ∈ X × Y.
Proof: If (x̄, ȳ) is a saddle point, then by definition φ(x̄) = ψ(ȳ) , thus by
Obs 5.4 (x̄, ȳ) ∈ Φ × Ψ and there holds minX φ = maxY ψ.
Conversely, let min_X φ = max_Y ψ be attained, then there exists (x̄, ȳ) ∈ Φ × Ψ with φ(x̄) = ψ(ȳ). Thus ℓ(x̄, y) ≤(Obs 5.4) φ(x̄) = ψ(ȳ) ≤(Obs 5.4) ℓ(x, ȳ) holds for all (x, y) ∈ X × Y, proving the saddle point property. □
We will be able to ensure the existence of saddle points (= strong dual-
ity) under the following four, rather strong assumptions that guarantee
attainment.
(A4) Y is bounded or
Proof: First, we prove that the set of saddle points is convex and compact.
If there are saddle points (x̄, ȳ), they have a common saddle value ¯l = ℓ(x̄, ȳ)
by Obs 5.3 and there holds ℓ(x̄, y) ≤ ¯l ≤ ℓ(x, ȳ) for all (x, y) ∈ X × Y . Thus,
the primal optimizing set Φ is the intersection of the level sets of the convex functions ℓ(·, y),
Φ = ⋂_{y∈Y} S_{l̄}(ℓ(·, y)).
Because ℓ(·, y) is convex and continuous, the level sets are convex and closed,
so Φ is convex and closed. By (A3) at least one of the level sets is compact,
so Φ is compact. The same holds for Ψ and thus for Φ × Ψ.
The existence of saddle points is proven in three steps starting with even
stronger assumptions and then weakening them again.
Step 1: In addition to (A1)–(A4) let X and Y be bounded and ℓ(x, ·) be
strictly concave for each x ∈ X.
For each y ∈ Y the function hy (x) := ℓ(x, y) is convex and closed. Thus
the primal function φ(x) := supy∈Y hy (x) is convex, closed and has compact
domain dom φ = X. Therefore min_{x∈X} φ(x) is attained in some x̄ ∈ X.
The strict concavity of ℓ(x, ·) for each x ∈ X and the compactness of Y implies
that for each x there is a unique maximizing y(x) ∈ Y with φ(x) = ℓ(x, y(x)).
Put ȳ = y(x̄), then
For proving the second of the saddle point inequalities, let x ∈ X and put x_k := (1/k)x + (1 − 1/k)x̄ with y_k := y(x_k). By compactness of Y, there is a subsequence K with y_k →_K ỹ, possibly not equal to ȳ.
φ(x̄) ≤ φ(x_k) = ℓ(x_k, y_k) ≤ (1/k)·ℓ(x, y_k) + (1 − 1/k)·ℓ(x̄, y_k) →_K 0 + ℓ(x̄, ỹ) ≤ φ(x̄)   (5.1)
(x̄ minimizes φ; the middle inequality uses convexity of ℓ(·, y_k)).
Because ȳ is the unique maximizer with ℓ(x̄, ȳ) = φ(x̄), this shows ỹ = ȳ. By ℓ(x̄, y_k) ≤ ℓ(x̄, ȳ) it also implies
φ(x̄) ≤ (1/k)·ℓ(x, y_k) + (1 − 1/k)·φ(x̄)   | · k
⇒ φ(x̄) ≤ ℓ(x, y_k) →_K ℓ(x, ȳ).
This holds for any x ∈ X. Thus, (x̄, ȳ) is a saddle point.
Step 2: In addition to (A1)–(A4) let X and Y be bounded. In order to build
on Step 1, define
For each fixed y ∈ Y this yields ℓ(x̄k , y) → −∞, which is possible only if
(x̄k )k∈N is unbounded. Thus X is unbounded. By (A3) y0 ∈ Yk for all k
large enough and
can now be improved to the much more useful strong duality relation
Example Consider the primal dual pair of conic programs with closed
convex cones K1 , K2 (K1◦ , K2◦ are generically closed convex cones)
min c⊤ x max b⊤ y
(P ) s. t. b − Ax ∈ K2 , (D) s. t. A⊤ y − c ∈ K1◦ ,
x ∈ K1 , y ∈ K2◦ .
While the saddle point theorem does not offer an easy way to the typical
strong duality result for conic programs, it may still be brought to use as
follows. Assume K2 and K1◦ to be full dimensional and put
ℓ(x, y) := c⊤ x + b⊤ y − y ⊤ Ax.
To illustrate and interpret the relevance of the saddle point approach for
optimization problems of the form (w. l. o. g. inequality constraints only)
min f (x) f : Rn → R
s. t. gi (x) ≤ 0, i ∈ I, g : Rn → RI [g(x) ∈ K2 = RI− ]
x∈X X convex of simple structure
we consider the Lagrangian
L(x, λ) = f(x) + Σ_{i∈I} λ_i g_i(x).
Note, inf_{x∈X} sup_{λ≥0} L(x, λ) is the original problem. Indeed, if for x ∈ X
In order to determine this lower bound ψ(λ) one has to solve the “simpler”
optimization problem over x ∈ X where violations of the gi are penalized in
the objective. This is Lagrangian relaxation of the constraints by a (Lagrange
multiplier) parameter λ.
By Obs 5.1 one obtains the best possible lower bound if one can solve the
problem
sup_{λ≥0} inf_{x∈X} L(x, λ) = sup_{λ≥0} ψ(λ),
i. e., if one can solve the dual optimization problem. It is important to note
that this dual problem is always a convex problem. To see this, observe that
Therefore the dual problem supλ≥0 ψ(λ) can be solved by the subgradient
algorithm (or similar nonsmooth optimization methods like bundle methods),
whenever a global optimizer to inf x∈X L(x, λ) can be determined efficiently
for each λ ≥ 0. If L(x, λ) has saddle points, its optimal value coincides with
the optimal value of the original problem.
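A minimal sketch of this scheme on a hypothetical toy instance where the inner minimization is available in closed form (X = R^n, a single constraint a^⊤x ≤ b):

import numpy as np

a, b = np.array([1.0, 2.0]), -3.0

def x_of(lam):                        # minimizer of L(., lam) = x^T x + lam (a^T x - b)
    return -0.5 * lam * a

lam = 0.0
for k in range(200):
    g = a @ x_of(lam) - b             # supergradient of psi at lam
    lam = max(0.0, lam + g / (k + 1)) # projected (subgradient-type) ascent
print(lam, x_of(lam))                 # approaches lam = 1.2, x = (-0.6, -1.2)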
Theorem 5.7 A point (x̄, λ̄) ∈ X × R_+^I is a saddle point of L(x, λ) = f(x) + Σ_{i∈I} λ_i g_i(x) on X × R_+^I if and only if
(i) x̄ minimizes L(·, λ̄) on X,
(ii) g_i(x̄) ≤ 0, i ∈ I, [primal feasibility]
(iii) λ̄_i g_i(x̄) = 0, i ∈ I. [complementarity]
Proof: ⇒: (i) follows from L(x̄, λ̄) ≤ L(x, λ̄) for all x ∈ X. (ii) and (iii)
are implied by L(x̄, ·) being affine in λ and L(x̄, λ) ≤ L(x̄, λ̄) for all λ ≥ 0,
because
gi (x̄) > 0 contradicts L(x̄, λ) ≤ L(x̄, λ̄) for all λ ≥ 0,
gi (x̄) < 0 ⇒ λ̄i = 0,
λ̄i > 0 ⇒ gi (x̄) = 0.
f(x̄) =(iii) L(x̄, λ̄) ≤(i) L(x, λ̄) = f(x) + Σ_{i∈I} λ̄_i g_i(x) ≤ f(x),
because λ̄_i ≥ 0 and g_i(x) ≤ 0 for feasible x. □
For convex problems the existence of saddle points is equivalent to the
existence of Lagrange multipliers in the KKT system.
(P ) min c⊤ x
s. t. Ax = b, [b − Ax ∈ {0} =: K2 ]
x ≥ 0, [x ∈ Rn+ =: K1 ]
(D) max b⊤ y
s. t. A⊤ y ≤ c, [A⊤ y − c ∈ Rn− = (Rn+ )◦ = K1◦ ]
y free, [y ∈ Rm = {0}◦ = K2◦ ]
⇔ (D′ ) max b⊤ y,
s. t. A⊤ y + z = c,
y ∈ Rm , z ≥ 0.
min c^⊤x
s. t. +b − Ax ≤ 0   | · λ_+
−b + Ax ≤ 0   | · λ_−
−x ≤ 0   | · λ_x
x free

x̄ opt. ⇒(Th 4.16) ∃ Lagrange mult. λ̄
Therefore L(x̄, λ̄) = sup_{λ≥0} inf_{x∈R^n} L(x, λ), which is equivalent to y = λ̄_+ − λ̄_− and z = λ̄_x being optimal for
max b^⊤y s. t. c − A^⊤y − z = 0, y ∈ R^m, z ≥ 0   ⇔   max b^⊤y s. t. A^⊤y ≤ c.   □
Chapter 6
Interior Point Methods
Primal and dual strictly feasible points will be assumed to exist in this
chapter unless explicitly stated otherwise.
Assumption 6.2 There exist x_0 strictly feasible for (P) and (y_0, z_0) strictly feasible for (D).
In this, x^{-1} = (1/x_1, …, 1/x_n)^⊤ is to be interpreted componentwise. The missing dual slack z appears to be z = μx^{-1}.
This gives rise to the primal-dual KKT system,
Ax = b (x > 0) primal feasibility
A⊤ y + z = c (z > 0) dual feasibility
x ◦ z = µ1 perturbed complementarity
The ◦ represents the "Hadamard" product x ◦ z = (x_1·z_1, …, x_n·z_n)^⊤.
• When starting from (D_μ) one arrives at the same primal-dual KKT system.
• The objective in (P_μ) is strictly convex in x, the objective in (D_μ) is strictly convex in z.
Thus, the KKT system delivers a
Before proving this, first recall that the affine primal and dual feasible
subspaces are orthogonal to each other.
Lemma 6.5 For µk ↘ 0 there exist cluster points of (xµk , zµk ) and each
corresponds to a strictly complementary pair (x∗ , z ∗ ) of optimal solutions of
(P ) and (D).
Proof: To show boundedness of (x_{μ_k}, z_{μ_k}), fix some μ̄, let 0 < μ < μ̄ and let x_0, (y_0, z_0) be the strictly feasible points of Ass 6.2, then
0 =(Lem 6.4) (x_μ − x_0)^⊤(z_μ − z_0) = x_μ^⊤z_μ − x_μ^⊤z_0 − x_0^⊤z_μ + x_0^⊤z_0 with x_μ^⊤z_μ = nμ by x_μ ◦ z_μ = μ1,
⇒ x_μ^⊤z_0 + x_0^⊤z_μ = nμ + x_0^⊤z_0 ≤ nμ̄ + x_0^⊤z_0.   (6.1)
Because z_0 > 0 and x_0 > 0 the coordinates of x_{μ_k} and z_{μ_k} remain bounded. Thus there is a subsequence K ⊆ N with x_{μ_k} →_K x̄ ≥ 0 and z_{μ_k} →_K z̄ ≥ 0 with x̄ and z̄ feasible [smoothness of the central path would also yield uniqueness]. Now,
x_{μ_k}^⊤z_{μ_k} = nμ_k → 0 ⇒ x̄^⊤z̄ = 0,
so x̄ = x* and z̄ = z* are primal and dual optima.
In order to prove strict complementarity, replace in (6.1) x_0 by x* and z_0 by z* to obtain
x_μ^⊤z* + z_μ^⊤x* = nμ + 0.
For μ > 0, x_μ ◦ z_μ = μ1 yields x_μ/μ = z_μ^{-1} and z_μ/μ = x_μ^{-1}, thus
Σ_{i=1}^n z_i*/(z_μ)_i + Σ_{i=1}^n x_i*/(x_μ)_i = n.
By z_i*/(z_{μ_k})_i →_K 0 ∨ 1 and x_i*/(x_{μ_k})_i →_K 0 ∨ 1 the claim follows. □
I A∆x = b − Ax =: fp
II A⊤ ∆y + ∆z = c − A⊤ y − z =: fd
III ∆x ◦ z + x ◦ ∆z = µ1 − x ◦ z =: fc
Solve this by
II: ∆z = f_d − A^⊤∆y
III: ∆x = μz^{-1} − x − z^{-1}◦x◦∆z = μz^{-1} − x − z^{-1}◦x◦f_d + z^{-1}◦x◦A^⊤∆y
in I: A·Diag(x)·Diag(z^{-1})·A^⊤∆y = f_p − A(μz^{-1} − x − z^{-1}◦x◦f_d), where M := A·Diag(x)·Diag(z^{-1})·A^⊤ ∈ S_+^m.
The matrix M is positive definite whenever A has full row rank (which may be assumed w. l. o. g. whenever Ax = b has a solution at all), because x > 0 and z > 0. Therefore the system for ∆y can be solved by a Cholesky decomposition, which requires O(m³/3) arithmetic operations.
Algorithm 6.6 (interior point framework)
Input: A, b, c, x0 , y0 , z0 with x0 > 0, z0 > 0
1. Choose µ.
2. Compute ∆x, ∆y, ∆z as above.
3. Choose a step length α (≤ 1) so that x + α∆x > 0 and z + α∆z > 0.
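A minimal sketch of steps 1-3 for a strictly feasible iterate (illustrative names; the elimination follows the derivation above, with numpy's generic solver standing in for the Cholesky factorization):

import numpy as np

def max_step(v, dv):
    """Largest alpha in (0, 1] keeping v + alpha dv strictly positive (damped)."""
    neg = dv < 0
    return 1.0 if not neg.any() else min(1.0, 0.9995 * np.min(-v[neg] / dv[neg]))

def ip_step(A, b, c, x, y, z, sigma=0.1):
    n = x.size
    mu = sigma * (x @ z) / n                   # step 1: choose mu
    fp, fd = b - A @ x, c - A.T @ y - z
    fc = mu * np.ones(n) - x * z
    M = A @ ((x / z)[:, None] * A.T)           # A Diag(x) Diag(z^-1) A^T
    dy = np.linalg.solve(M, fp - A @ ((fc - x * fd) / z))   # step 2
    dz = fd - A.T @ dy
    dx = (fc - x * dz) / z
    alpha = min(max_step(x, dx), max_step(z, dz))           # step 3
    return x + alpha * dx, y + alpha * dy, z + alpha * dz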
It starts close to the central path and, doing full Newton steps of step length 1, it stays within the following neighborhood of the central path,
where μ(x, z) := x^⊤z/n, so that
μ(x, z)·1 = ⟨x ◦ z, 1/√n⟩ · 1/√n,   the "projection of x ◦ z onto 1/√n".
or
The rather strange L refers to the number of bits that are required to encode
the LP and is more or less standard for complexity theoretic purposes; the
reasons for this choice will not be pursued here in any detail.
In order to simplify notation in the analysis of the algorithm we will from
now on write
(x, y, z) for (xk , y k , z k ), [the current point]
(x+ , y+ , z+ ) for (xk+1 , y k+1 , z k+1 ). [the next point]
Lemma 6.8
(i)   ∆x⊤∆z = 0,
(ii)  µ₊ := x₊⊤z₊/n = σµ,
(iii) x₊ ◦ z₊ = µ₊1 + ∆x ◦ ∆z.
Proof: Exercise. □
For bounding ∥∆x ◦ ∆z∥ the following relation comes in handy.
Lemma 6.10 Let x > 0, z > 0 and h = ∆x ◦ z + x ◦ ∆z = µ₊1 − x ◦ z (by III). Put d = x^{1/2} ◦ z^{-1/2}; then
∥d⁻¹ ◦ ∆x∥² + ∥d ◦ ∆z∥² + 2∆x⊤∆z = ∥x^{-1/2} ◦ z^{-1/2} ◦ h∥².
Proof: ∥x^{-1/2} ◦ z^{-1/2} ◦ h∥² = ∥x^{-1/2} ◦ z^{1/2} ◦ ∆x + x^{1/2} ◦ z^{-1/2} ◦ ∆z∥² and direct computation. □
Lemma 6.11 Put γ = min{x_i z_i : i = 1, . . . , n}; then ∥∆x ◦ ∆z∥ ≤ ∥x ◦ z − µ₊1∥²/(2γ).
Proof: By Lem 6.10 with ∆x⊤∆z = 0 (Lem 6.8(i)),
∥∆x ◦ ∆z∥ = ∥(d⁻¹ ◦ ∆x) ◦ (d ◦ ∆z)∥ ≤ ∥d⁻¹ ◦ ∆x∥·∥d ◦ ∆z∥ ≤ ½(∥d⁻¹ ◦ ∆x∥² + ∥d ◦ ∆z∥²)
= ½∥x^{-1/2} ◦ z^{-1/2} ◦ (µ₊1 − x ◦ z)∥² ≤ ½∥γ^{-1/2}(µ₊1 − x ◦ z)∥². □
Assume
(θ² + n(1 − σ)²) / (2(1 − θ)) ≤ θσ.      (6.2)
By Pythagoras (x ◦ z − µ1 is orthogonal to 1, since 1⊤(x ◦ z) = nµ),
∥x ◦ z − µ₊1∥² = ∥x ◦ z − µ1∥² + ∥(µ − µ₊)1∥² ≤ (θµ)² + µ²(1 − σ)²·n = (θ² + n(1 − σ)²)·µ²,
using (N), Lem 6.8(ii) and ∥1∥² = n. Because (N′) holds for (x, z), the γ of Lem 6.11 satisfies γ ≥ (1 − θ)µ, so
∥x₊ ◦ z₊ − µ₊1∥ ≤ ∥x ◦ z − µ₊1∥²/(2γ) ≤ ((θ² + n(1 − σ)²)/(2(1 − θ)))·µ ≤ θσµ = θµ₊,
using Lem 6.8(iii), Lem 6.11 and (6.2). □
Proof: For the value (x^k)⊤z^k = nµ_k = nσ^kµ_0 to fall below 2^{−2L}, the iteration counter k has to satisfy σ^k ≤ 2^{−2L}/(nµ_0), which is guaranteed by
k log σ = k log(1 − δ/√n) ≤ −k·δ/√n ≤ log(2^{−2L}/(nµ_0))   [using log(1 + x) ≤ x],
i. e., by k ≥ (√n/δ)·(2L log 2 + log(nµ_0)), giving the O(√n·L) iteration bound. □
The algorithm of Monteiro and Adler needs a starting point satisfying (N ) and
this may seem like a difficult requirement. There is, however, an astonishingly
cheap way to produce such a starting point for a slightly modified problem
for the primal-dual pair (P) and (D) that
• has a trivial starting point,
• gives a certificate of infeasibility if no solutions exist,
• and allows to reconstruct the original optimal solutions from its optimal
solution, if optimal solutions exist.
For introducing the modified problem for (P) and (D) we start with the following skew-symmetric homogenized system
Ax − τb = 0
−A⊤y + τc − z = 0          (HS)
b⊤y − c⊤x − ρ = 0
x ≥ 0, τ ≥ 0, z ≥ 0, ρ ≥ 0.
The system is feasible by setting all variables to zero.
For any solution with τ > 0 the point x/τ, (y/τ, z/τ) is primal/dual feasible; the third line ensures b⊤(y/τ) ≥ c⊤(x/τ), while weak duality gives b⊤(y/τ) ≤ c⊤(x/τ), so both are optimal.
This, however, also shows that this system cannot have a strictly feasible
solution. In order to obtain a strictly feasible system, we simply put τ = 1,
x = 1, y = 0, z = 1, ρ = 1 and compensate the arising infeasibilities
(captured by new constants α, β, b̄, c̄) by introducing yet another variable ϑ
(and its “dual” η) that we drive to zero,
min βϑ
s. t.  Ax − τb + ϑb̄ = 0,
       −A⊤y + τc − ϑc̄ − z = 0,
       b⊤y − c⊤x + ϑα − ρ = 0,                    (SE)
       −b̄⊤y + c̄⊤x − τα − η = −β,
       x ≥ 0, τ ≥ 0, ϑ ≥ 0, z ≥ 0, ρ ≥ 0, η ≥ 0,
For the dual, assign multipliers ỹ to row 1, x̃ to row 2, τ̃ to row 3, ϑ̃ to row 4, and introduce dual slack variables z̃, ρ̃ and η̃. Because the dual to (SE) is of the same form as (SE), it has the same strictly feasible starting point x̃^0 = 1, ỹ^0 = 0, z̃^0 = 1, τ̃^0 = ρ̃^0 = ϑ̃^0 = η̃^0 = 1.
In the primal-dual KKT system of (SE) the perturbed complementarity lines
read
x ◦ z̃ = µ1, x̃ ◦ z = µ1,
τ ρ̃ = µ, τ̃ ρ = µ,
ϑη̃ = µ, ϑ̃η = µ.
For µ = 1 the common strictly feasible starting point is exactly on the
central path. The interior point algorithm will update both variable groups
in exactly the same way,
for k = 0, 1, . . . xk = x̃k , y k = ỹ k , . . .
There is no need to keep both copies and the ˜-copy may be dropped. The
remaining primal-dual KKT system has only two constraint lines and two
complementarity lines more than that of the original system!
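The compensating constants can be read off by plugging the starting point into (SE). A small numpy sketch under this reading (the closed forms below are reconstructed from the recipe τ = 1, x = 1, y = 0, z = ρ = η = ϑ = 1; they are not stated as formulas in the text):

    import numpy as np

    def embedding_data(A, b, c):
        # residuals of (SE)'s rows at x = 1, y = 0, z = 1, tau = theta = rho = eta = 1
        m, n = A.shape
        one = np.ones(n)
        b_bar = b - A @ one        # row 1: A*1 - b + b_bar = 0
        c_bar = c - one            # row 2: c - c_bar - 1 = 0
        alpha = c @ one + 1.0      # row 3: -c^T 1 + alpha - 1 = 0
        beta = n + 2.0             # row 4: c_bar^T 1 - alpha - 1 = -beta
        return b_bar, c_bar, alpha, beta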
It is important to note that (SE) has, in fact, the trivial optimal solution of
setting η = β and all other variables to zero. This solution, however, is not
useful at all. We need to exploit the highly valuable property of interior point
methods that for linear programming problems they converge to a strictly
complementary solution, i. e., in every complementarity pairing at least one
coordinate is nonzero. This kind of solution of (SE) allows to reconstruct
the original optimal solutions or to certify that none exist.
Theorem 6.14
The selfdual program (SE) has an optimal solution (x*, y*, z*, τ*, ρ*, ϑ*, η*) with either τ* > 0 or ρ* > 0. There holds
(i)  τ* > 0 ⇔ (P) and (D) are feasible with optimal solutions x*/τ* and (y*/τ*, z*/τ*),
(ii) τ* = 0 ⇔ (P) or (D) has an improving half ray [i. e., at least one of them is infeasible].
Proof: The existence of an optimal solution of (SE) with either τ* > 0 or ρ* > 0 follows from Lem 6.5. Because the value 0 is optimal (set η = β and all other variables to zero), βϑ* = 0 forces ϑ* = 0, and the first three constraint lines of (SE) reduce to the skew-symmetric homogenized system (HS).
(i),⇒: By τ* > 0 and the arguments for (HS), x*/τ* and (y*/τ*, z*/τ*) are feasible and optimal for (P) and (D).
(ii),⇒: For τ* = 0 strict complementarity implies ρ* > 0. The three constraints yield
Ax* = 0,   A⊤y* + z* = 0   and   b⊤y* > c⊤x*.
(i),⇐: Let x̄ and (ȳ, z̄) be optimal solutions of (P) and (D) and put
τ* = β/(1⊤(x̄ + z̄) + 1) > 0,   x* = τ*x̄,   y* = τ*ȳ,   z* = τ*z̄,   ϑ* = ρ* = η* = 0.
The first three constraints of (SE) hold by direct inspection, consider the fourth,
−b̄⊤y* + c̄⊤x* − τ*α = −β.
Changing sign and substituting the respective definitions yields
τ*(−1⊤A⊤ȳ + b⊤ȳ − c⊤x̄ + 1⊤x̄ + c⊤1 + 1) = β;
with A⊤ȳ = c − z̄ and b⊤ȳ − c⊤x̄ = 0 the left hand side reduces to τ*(1⊤(x̄ + z̄) + 1), which equals β by the choice of τ*.
Exercise Exploiting the constraint, show that the dual cost function is
equivalent to b⊤ y − 12 x⊤ Qx which is concave, thereby proving convexity of
the problem. Furthermore, if Q ≻ 0 one may eliminate x to obtain the
following quadratic problem in y only,
max (b + Ac)⊤y − ½y⊤AQ⁻¹A⊤y − ½c⊤Q⁻¹c
s. t. y ≥ 0.      ⋉⋊
Strong duality holds for this primal and dual by Th 4.16 and Th 5.9.
The barrier problem to (QP) reads
min_{x∈R^n}  f(x) := ½x⊤Qx + c⊤x − µ Σ_{i=1}^m log(A_{i,•}x − b_i).
min f (x) f ∈ C2
s. t. gi (x) ≤ 0, i = 1, . . . , m g : Rn → Rm , g ∈ C 2
x ∈ Rn .
Stationarity reads
0 = ∇f(x) + µ Σ_{i=1}^m ∇g_i(x) · 1/(−g_i(x));   put s := −g(x) ≥ 0,  y = µs⁻¹.
This gives rise to the system
I    ∇f(x) + Σ_i y_i∇g_i(x) = 0
II   g(x) + s = 0
III  s ◦ y = µ1
Applying Newton's method to this nonlinear system gives the following two equations for the first two lines,
I    [∇²f(x) + Σ_i y_i∇²g_i(x)]∆x + Σ_i ∆y_i∇g_i(x) = −[∇f(x) + Σ_i y_i∇g_i(x)]
II   J_g(x)∆x + ∆s = −[g(x) + s]
In the quadratic case I and II are linear, so with x̄ := x + ∆x, s̄ := s + ∆s, ȳ := y + ∆y the Newton equations read
I    Q(x + ∆x) + c − A⊤(y + ∆y) = 0    ←→   Qx̄ + c − A⊤ȳ = 0
II   A(x + ∆x) − (s + ∆s) = b          ←→   Ax̄ − s̄ = b
III  (s + ∆s) ◦ (y + ∆y) = µ1          ←→   s̄ ◦ ȳ = µ1
The latter are equivalent to the optimality conditions for the barrier problem
to the quadratic program
min ½x̄⊤Qx̄ + c⊤x̄                   min ½∆x⊤[∇²f(x) + Σ_i y_i∇²g_i(x)]∆x + ∇f(x)⊤∆x
s. t. Ax̄ ≥ b            ⇔         s. t. g(x) + J_g(x)∆x ≤ 0
      x̄ ∈ R^n                           ∆x ∈ R^n
Note, the constraints are replaced by their linearizations in x, but the
quadratic term of the cost function now includes the curvature information
of f as well as that of the active gi weighted by their current Lagrange
multiplier approximations yi . This results in cautious steps in directions
where these functions have strong curvature. Once ∆x is computed, the
method continues with a line search in the direction of ∆x (with step size ≤ 1, because it is a Newton method).
Similar approaches as in unconstrained optimization may be used to render
Q positive semidefinite. The convexified model can be solved efficiently with
interior point methods. ♡
We have z = µ/(x_0² − ∥x̄∥²) · (x_0; −x̄) and z >_Q 0 ⇔ x >_Q 0. In order to express the perturbed complementarity condition in bilinear form, observe that z solves
Arw(x)·z = µ·(1; 0̄),   where Arw(x) := [ x_0  x̄⊤ ; x̄  x_0 I ].
A⊤ y + z = c, z >Q 0,
Ax = b, x >Q 0,
Arw(x)z = µe0 .
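A small numpy sketch of Arw(x) (the function name is mine), which also checks the stated identity Arw(x)·z = µe_0 for the z given above:

    import numpy as np

    def arw(x):
        # arrow matrix Arw(x) = [[x0, xb^T], [xb, x0*I]] of x = (x0, xb)
        x0, xb = x[0], x[1:]
        k = len(xb)
        M = np.zeros((k + 1, k + 1))
        M[0, 0] = x0
        M[0, 1:] = xb
        M[1:, 0] = xb
        M[1:, 1:] = x0 * np.eye(k)
        return M

    # check: for x strictly interior (x0 > ||xb||) and mu > 0,
    # z = mu/(x0^2 - ||xb||^2) * (x0, -xb) solves Arw(x) z = mu * e0
    x = np.array([2.0, 0.5, 1.0])
    mu = 0.3
    z = mu / (x[0]**2 - x[1:] @ x[1:]) * np.concatenate(([x[0]], -x[1:]))
    print(np.allclose(arw(x) @ z, mu * np.eye(3)[0]))   # True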
g(x) = x⊤Qx + q⊤x + δ,   with a factorization Q = LL⊤,
epi g = {(x; r) : g(x) ≤ r} = { (x; r) : ∥( L⊤x ; ((1 − r) + (q⊤x + δ))/2 )∥ ≤ ((1 + r) − (q⊤x + δ))/2 }.
Thus the linear constraints
(ξ_0; ξ̄) = ( ½(1 − δ); 0; ½(1 + δ) ) + [ −½q⊤  ½ ;  L⊤  0 ;  ½q⊤  −½ ] (x; r)
and the conic constraints ξ_0 ≥ ∥ξ̄∥ and r ≤ 0 give a valid SOC constraint representation.
SOC programming with several second order cones therefore allows to model
arbitrary combinations of convex quadratic constraints (and objectives).
For the dual program the adjoint operator A⊤ to A corresponds, as usual, to linear combinations of the rows. Indeed, the adjoint operator is defined by
∀X ∈ S^n, y ∈ R^m:   ⟨AX, y⟩ = Σ_i y_i⟨A_i, X⟩ = ⟨Σ_i y_i A_i, X⟩,   so A⊤y := Σ_i y_i A_i.
Because the positive semidefinite cone Sn+ is self-dual ((Sn+ )∗ = −(Sn+ )◦ = Sn+ ),
the dual semidefinite program reads
max b⊤ y
s. t. A⊤ y + Z = C,
y ∈ Rm , Z ⪰ 0.
• SOCP: x ≥_Q 0 ⇔ Arw(x) ⪰ 0.
• several semidefinite variables: X_1 ⪰ 0, . . . , X_k ⪰ 0 ⇔ Diag(X_1, . . . , X_k) ⪰ 0.
∇_X L_µ(X, y) = 0 :   C − µX⁻¹ − A⊤y = 0   → put Z = µX⁻¹
∇_y L_µ(X, y) = 0 :   AX = b

A⊤y + Z = C            A⊤∆y + ∆Z = C − A⊤y − Z
AX = b           →     A∆X = b − AX
XZ = µI                ∆X·Z + X·∆Z = µI − XZ
Because of the third line, the solution to this linearized system will, in
general, result in a non symmetric ∆X ∈ Rn×n . One can prove that it
suffices to use the symmetric part 12 (∆X + ∆X ⊤ ), but there are attractive
other symmetrization strategies, as well.
In all other aspects the interior point approach and its analysis follow the
linear programming case. For a strictly feasible starting point (X 0 , y 0 , Z 0 )
close to the central path the algorithm stops with a solution ⟨X, Z⟩ ≤ ε in O(√n · log(⟨X^0, Z^0⟩/ε)) iterations. The same skew-symmetric embedding
works unless serious duality issues arise.
A decisive difference is that in semidefinite programming feasible solutions
may require doubly exponential encoding size relative to the encoding length
of the problem. For the current approaches the dependence on ε and the
starting point cannot be replaced by a polynomial expression depending on
the encoding length of the problem. Indeed, in the strict theoretical sense
it is not yet clear whether general semidefinite programs can be solved in
polynomial time or not.
For practical purposes and problems interior point methods need surprisingly
few iterations, but each iteration is quite expensive. Due to additional
numerical issues developing solvers for semidefinite programming is much
more demanding than for linear programming and there is still a lot of work
ahead.
Example Robust stability of dynamical systems: A dynamical system describes the change of a state x(t) over time by differential equations. The system is called stable if all trajectories lead to some desired goal state, which is typically shifted into the origin. In the robust linear setting considered here the coefficients of the linear system are not fully known in advance,
dx(t)/dt = A(t)x(t)   where A may be any A(t) ∈ conv{A_1, . . . , A_k}.
All trajectories are certainly leading to the origin if there exists a norm ∥x∥_H = √(x⊤Hx) with H ≻ 0 so that the norm of the state vector strictly decreases along the trajectories,
d∥x(t)∥²_H / dt ≤ δ < 0.
If such an H ≻ 0 exists, the system is called quadratically stable and x⊤ Hx
is called a Lyapunov-function. For the current system the criterion evaluates
to
d/dt (x(t)⊤Hx(t)) = (dx/dt)⊤Hx(t) + x(t)⊤H(dx/dt) = x(t)⊤(A(t)⊤H + HA(t))x(t).
Because this has to be less than some δ < 0 for any starting point x = x(0) ∈ R^n \ {0} and any A(t) ∈ conv{A_1, . . . , A_k}, the sought-for H ≻ 0 has to satisfy
A_i⊤H + HA_i ≺ 0   for i = 1, . . . , k.
This may be cast as an SDP as follows,
max λ
s. t. H ⪰ λI,
      A_i⊤H + HA_i ⪯ −λI,   i = 1, . . . , k,
      λ ∈ R, H ∈ S^n.
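For experimentation this SDP can be stated almost literally in a modeling package; a sketch assuming the cvxpy package is available (the added bound H ⪯ I is my normalization: the constraints are positively homogeneous, so without it the problem would be unbounded whenever some λ > 0 is feasible):

    import cvxpy as cp
    import numpy as np

    def quadratic_stability(As):
        # max lambda  s.t.  lambda*I <= H <= I,  Ai^T H + H Ai <= -lambda*I
        n = As[0].shape[0]
        H = cp.Variable((n, n), symmetric=True)
        lam = cp.Variable()
        I = np.eye(n)
        cons = [H >> lam * I, H << I]   # H << I is an added normalization
        cons += [Ai.T @ H + H @ Ai << -lam * I for Ai in As]
        cp.Problem(cp.Maximize(lam), cons).solve()
        return lam.value, H.value       # lam > 0 certifies quadratic stability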
In order to illustrate how to write this as a dual in normal form, put y = (λ, h_11, . . . , h_1n, h_22, . . . , h_nn)⊤ and consider the block diagonal reformulation
max λ
s. t. Diag( λI − H,  A_1⊤H + HA_1 + λI,  . . . ,  A_k⊤H + HA_k + λI ) + Diag( Z_0, Z_1, . . . , Z_k ) = 0,
      y ∈ R^{1+n(n+1)/2},   Z_0 ⪰ 0, . . . , Z_k ⪰ 0,
where the first block diagonal matrix is =: A⊤y (linear in y).
If λ∗ > 0, the corresponding H ∗ generates the required Lyapunov-function.
♡
Chapter 7
Simplex Method
The simplex method is a classical and in many situations the most efficient
solution method for solving linear programs, in particular if the program
formulation needs to be changed dynamically by adding further constraints
or variable columns. In contrast to interior point methods it heavily exploits
the linear cost function and the polyhedral feasible set by starting in some
feasible vertex and by switching to successively better ones along edges of
the polyhedron. This gives the method a strong combinatorial flavor.
Unless explicitly stated otherwise we consider linear programs in normal
form,
min c⊤ x c ∈ Rn
(P ) s. t. Ax = b, A ∈ Rm×n of full row rank, b ∈ Rm ,
x≥0
with feasible set X = {x ≥ 0 : Ax = b} and corresponding dual
max b⊤ y
(D) s. t. A⊤ y + z = c,
y ∈ R^m, z ≥ 0.
For deriving the method consider the given linear inequality system
Ax = b, m equations,
Ix ≥ 0, n inequalities,
To start, let B be a feasible basis (how to get there will be discussed later),
x_B = A_B⁻¹[b − A_N x_N],   x_N = 0,   x_B ≥ 0.
Any feasible x satisfies
x_B = A_B⁻¹[b − A_N x_N] ≥ 0,   x_N ≥ 0 = x̄_N,
and   c⊤x = c_B⊤A_B⁻¹b + (c_N − A_N⊤A_B⁻⊤c_B)⊤x_N ≥ c_B⊤A_B⁻¹b = c⊤x̄,
because the reduced costs c_N − A_N⊤A_B⁻⊤c_B are ≥ 0 and x_N ≥ 0. □
The proof also shows that multiple optima are possible only if at least one
component of the reduced costs is zero. Later this will be of relevance again.
Suppose now there is a nonbasic index ȷ̂ ∈ N with negative reduced cost c̃_ȷ̂ < 0.
Computing the reduced cost vector c̃_N and selecting such a ȷ̂ is called
“pricing”. For this fixed ȷ̂ the objective strictly improves when increasing xȷ̂
(all other nonbasic variables are kept at zero). So the next step is to increase
xȷ̂ as much as possible without leaving the feasible set,
With row i of A−1 B [b − A•,ȷ̂ xȷ̂ ] corresponding to the i-th basis element Bi ,
xBi is in danger of becoming negative only if [A−1 B A•,ȷ̂ ]i > 0, thus feasibility
is guaranteed for
x_ȷ̂ ≤ inf{ [A_B⁻¹b]_i / [A_B⁻¹A_{•,ȷ̂}]_i  :  i ∈ {1, . . . , m}, [A_B⁻¹A_{•,ȷ̂}]_i > 0 }    “ratio test”.
If there are no indices with [A_B⁻¹A_{•,ȷ̂}]_i > 0, the infimum evaluates to +∞, x_ȷ̂ may be increased to infinity without violating feasibility and at the same time the objective value decreases to minus infinity. This gives an improving half ray,
∀α ≥ 0   x(α) = (x_B; x_N) = (A_B⁻¹b; 0) + α·(−A_B⁻¹A_{•,ȷ̂}; e_ȷ̂) ∈ X,
and
inf{ c⊤x(α) = c_B⊤A_B⁻¹b + c̃_ȷ̂·α : α ≥ 0 } = −∞   because c̃_ȷ̂ < 0.
[Figure: sketch of an unbounded feasible set in R³₊.]
Otherwise the infimum in the ratio test is attained at some index ı̂. Set
x_ȷ̂ ← [A_B⁻¹b]_ı̂ / [A_B⁻¹A_{•,ȷ̂}]_ı̂    and    x_B ← x_B − A_B⁻¹A_{•,ȷ̂}·x_ȷ̂.
With this the basic variable to ı̂ is now zero, xBı̂ = 0, it is removed from the
current basis and added to the nonbasic variables. xBı̂ is called the leaving
variable, its place in the basis is taken by the entering variable xȷ̂ . This
indeed gives rise to a feasible basis again.
Lemma 7.3 The index set B⁺ = (B \ {Bı̂}) ∪ {ȷ̂} describes a feasible basis and
x⁺ = (x_B; x_N) = (A_B⁻¹b; 0) + ( [A_B⁻¹b]_ı̂ / [A_B⁻¹A_{•,ȷ̂}]_ı̂ )·(−A_B⁻¹A_{•,ȷ̂}; e_ȷ̂) ≥ 0.
The process is now repeated for the updated feasible basis. This completes
the derivation and description of the primal simplex algorithm.
Algorithm 7.4 ((Revised Primal) Simplex Algorithm)
Input: A, b, c, a feasible basis B and x̄_B = A_B⁻¹b ≥ 0.
1. BTRAN: compute ȳ = A_B⁻⊤c_B.
2. Pricing: compute z̄_N = c_N − A_N⊤ȳ; if z̄_N ≥ 0, STOP, x̄ is optimal; otherwise choose ȷ̂ ∈ N with z̄_ȷ̂ < 0.
3. FTRAN: compute w = A_B⁻¹A_{•,ȷ̂}.
4. Ratio test: if w ≤ 0, STOP, the problem is unbounded; otherwise compute γ = min{x̄_{B_i}/w_i : w_i > 0, i ∈ {1, . . . , m}}, attained at some ı̂.
5. Update x̄_B ← x̄_B − γw, x_ȷ̂ ← γ, N ← (N ∪ {Bı̂}) \ {ȷ̂}, Bı̂ ← ȷ̂, GOTO 1.
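A compact numpy sketch of Alg 7.4, using Bland's smallest-index rules (treated in Th 7.7 below) in pricing and the ratio test; tolerances and names are illustrative assumptions, and a production code would factorize A_B once and update the factors:

    import numpy as np

    def revised_simplex(A, b, c, B, tol=1e-9):
        m, n = A.shape
        B = list(B)
        while True:
            AB = A[:, B]
            xB = np.linalg.solve(AB, b)                 # current basic solution
            y = np.linalg.solve(AB.T, c[B])             # 1. BTRAN
            N = [j for j in range(n) if j not in B]
            zN = c[N] - A[:, N].T @ y                   # 2. pricing
            improving = [j for j, zj in zip(N, zN) if zj < -tol]
            if not improving:
                x = np.zeros(n); x[B] = xB
                return x, y                             # optimal
            js = min(improving)                         # Bland: smallest index enters
            w = np.linalg.solve(AB, A[:, js])           # 3. FTRAN
            rows = [i for i in range(m) if w[i] > tol]
            if not rows:
                raise ValueError("improving half ray, problem unbounded")
            gamma = min(xB[i] / w[i] for i in rows)     # 4. ratio test
            # Bland: among minimizing rows the smallest basis index leaves
            ih = min((i for i in rows if xB[i] / w[i] <= gamma + tol),
                     key=lambda i: B[i])
            B[ih] = js                                  # 5. basis update

For the Mozart problem below (with 0-based column indices and start basis B = [2, 3, 4]) this returns x̄ = (5, 1, 0, 0, 2) with c⊤x̄ = −53.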
Example
Consider the Mozart problem,
A = [ 1 1 1 0 0 ; 2 1 0 1 0 ; 1 2 0 0 1 ],   b = (6, 11, 9)⊤,   c = (−9, −8, 0, 0, 0)⊤,
with the origin as initial basis/vertex.
[Figure: the feasible region in the (x_1, x_2)-plane; the constraints correspond to the slacks x_3, x_4, x_5, and the iterations move along its vertices.]
Input: A, b, c and
B = (3, 4, 5), N = (1, 2),   x̄ = (0, 0, 6, 11, 9)⊤,   x̄_B = (6, 11, 9)⊤.
Iteration 1: A_B = [ 1 0 0 ; 0 1 0 ; 0 0 1 ], c_B = (0, 0, 0)⊤, ȳ = (0, 0, 0)⊤, z̄_N = (−9, −8)⊤.
Choose ȷ̂ = 1, then w = (1, 2, 1)⊤ → ı̂ = 2, Bı̂ = 4, γ = 5.5.
Update to vertex/basis B = (3, 1, 5), N = (4, 2), x̄_B = (0.5, 5.5, 3.5)⊤, x̄ = (5.5, 0, 0.5, 0, 3.5)⊤.
Iteration 2: A_B = [ 1 1 0 ; 0 2 0 ; 0 1 1 ], c_B = (0, −9, 0)⊤, ȳ = (0, −4.5, 0)⊤, z̄_N = (4.5 [N_1 = 4], −3.5 [N_2 = 2])⊤.
Choose ȷ̂ = 2, then w = (0.5, 0.5, 1.5)⊤ → ı̂ = 1, Bı̂ = 3, γ = 1.
Update to vertex/basis B = (2, 1, 5), N = (4, 3), x̄_B = (1, 5, 2)⊤, x̄ = (5, 1, 0, 0, 2)⊤.
Iteration 3: A_B = [ 1 1 0 ; 1 2 0 ; 2 1 1 ], c_B = (−8, −9, 0)⊤, ȳ = (−7, −1, 0)⊤, z̄_N = (1, 7)⊤ ≥ 0 → optimal. ♡
Historically, the simplex method was developed by George B. Dantzig before computers were available. It supported computations by hand via a clever arrangement in a tableau that facilitated the repeated application of Gaussian elimination steps. These are required to execute the basis changes for the selected pivot element.
Example Originally, the simplex tableau works on problems stated in canonical form. Here we describe a variant adapted to the revised method for the normal form described above and apply it to the first step of the Mozart problem with initial basis B = (3, 4, 5). In this variant it maintains the updated system b = Ax in the form A_B⁻¹b = A_B⁻¹A_N x_N + Ix_B with an additional row zero holding the current and reduced costs in the form −c_B⊤A_B⁻¹b | (c_N⊤ − c_B⊤A_B⁻¹A_N)x_N + 0·x_B. In the table below, the simplex tableau only consists of the first three boxed columns; the more or less redundant three identity columns are kept here only for illustrative purposes.
            |           | N_1,x_1  N_2,x_2 | B_1,x_3  B_2,x_4  B_3,x_5
  cost row  |     0     |   −9       −8    |
  B_1, x_3  |     6     |    1        1    |    1        0        0
  B_2, x_4  |    11     |   [2]       1    |    0        1        0
  B_3, x_5  |     9     |    1        2    |    0        0        1
            |  A_B⁻¹b   |    A_B⁻¹A_N      |            I

(the pivot element [2] in row 2, column x_1, is boxed)
The boxed pivot pair (2, 1) is determined by first choosing a column xȷ̂ with
negative reduced cost (e. g., the most negative) and then choosing within
this column a row ı̂ with positive value and smallest right hand side to value
ratio. Replacing the current basic variable xB2 = x4 against the nonbasic
variable x1 now requires to transform the column corresponding to x1 to the
unit vector e2 via standard Gaussian elimination steps. This means
• multiply line 2 (corresponding to B_2 or x_4) by 1/2,
• add 9 times this new line 2 to the cost line 0,
• add −1 times this new line 2 to basis line 1,
• add −1 times this new line 2 to basis line 3.
• swap the recomputed column of the leaving variable x4 into the position
of the entering variable x1 and change the labels of the rows and columns
accordingly.
This results in the next simplex tableau and the next pivot, etc.

            |        | N_1,x_4   N_2,x_2
  cost row  |  49.5  |   9/2      −7/2
  B_1, x_3  |  1/2   |  −1/2     [1/2]
  B_2, x_1  |  11/2  |   1/2       1/2
  B_3, x_5  |  7/2   |  −1/2       3/2
♡
The derivation above shows that the algorithm works correctly if it stops, but it is not yet clear whether it stops.
Theorem 7.5 If in Alg 7.4 the step size γ is always strictly positive, the
primal simplex algorithm stops in finitely many steps.
Proof: As long as γ > 0, the objective value strictly decreases in every iteration. Therefore no basic solution appears more than once. Each basic solution corresponds to a different choice of m indices of {1, . . . , n}. Thus, the algorithm ends after at most (n choose m) iterations. □
The step size
γ = min{ x̄_{B_i}/w_i : w_i > 0, i ∈ {1, . . . , m} }
will be zero whenever there is an i ∈ {1, . . . , m} with w_i > 0 so that the corresponding basic variable/slack x̄_{B_i} = 0, i. e., the inequality B_i is satisfied with equality and is also active at this vertex/basic solution.
[Figure: a degenerate vertex in the (x_1, x_2)-plane in which the lines of x_3, x_4 and x_5 intersect; there N = (2, 3), B = (1, 4, 5), and x_4 = 0.]
Definition 7.6
• A basis B is degenerate if xBi = 0 for some i ∈ {1, . . . , m}.
• A linear program is degenerate if it has a degenerate basis (not nec.
feasible).
For most practical selection rules in pricing and in the ratio test, examples of cycling have been constructed, and cycling has also been observed in practice.
Bland proved that cycling cannot occur when using the following rather simple pivot selection rules,
• in pricing, choose among the variables with negative reduced costs the
one having smallest index,
• in the ratio test, choose among the selectable basic variables of value
zero the one with smallest index.
Unfortunately there does not seem to be a proof that provides good intuition
on why this is the case, but as will become clear, the result is fundamental
for establishing an algorithmic proof of strong duality for linear programming
that is almost combinatorial in nature.
Theorem 7.7 If the pivot selection rules of Bland are used in Alg 7.4, the
primal simplex algorithm always terminates.
Proof: Assume, for contradiction, Alg 7.4 cycles in spite of using Bland's rules. Let B^1, . . . , B^k be the cyclic sequence of bases and let I = {i : variable i moves in and out of the basis during the cycle}.
For basis B̂ = B^1 there is s ∈ N̂ with reduced cost ĉ_s = c_s − (A_{•,s})⊤A_B̂⁻⊤c_B̂ < 0. Because β does not change, the objective in dependence on x_s is β + ĉ_s x_s. Equating this, with x_B̂ = A_B̂⁻¹b − A_B̂⁻¹A_{•,s}·x_s and b̂ := A_B̂⁻¹b, â := A_B̂⁻¹A_{•,s}, to (7.1) yields for all x_s ≥ 0
β + ĉ_s x_s = β + Σ_{i∈B̂} c̄_i(b̂_i − â_i x_s) + c̄_s x_s.
Thus,  ∀x_s ≥ 0   (ĉ_s − c̄_s + Σ_{i∈B̂} c̄_i â_i)·x_s = Σ_{i∈B̂} c̄_i b̂_i = constant
⇒  ĉ_s − c̄_s + Σ_{i∈B̂} c̄_i â_i = 0. Because ĉ_s < 0 and c̄_s ≥ 0, this forces Σ_{i∈B̂} c̄_i â_i > 0, so ∃j ∈ B̂ with c̄_j â_j > 0.
In particular, c̄_j ≠ 0, therefore j ∉ B̄ (basic variables have zero reduced costs) and j ∈ B̂, i. e., j moves in and out of the basis, j ∈ I and x_j = 0. Note that â_t > 0 because t was selected in the ratio test for s, and c̄_t < 0, so j ≠ t. Hence, j < t and c̄_j ≥ 0 (otherwise j would have been selected instead of t as entering variable at basis B̄ = B^h), giving â_j > 0. But by x_j = 0, j ∈ B̂, j < t and â_j > 0, Bland's rule would then have selected j instead of t as leaving variable at basis B̂ = B^1. □
Corollary 7.8 When starting from a feasible basis and using Bland’s rules,
in finitely many steps Alg 7.4 either finds an optimal solution or certifies
unboundedness.
A feasible starting basis may be found by solving the phase 1 auxiliary problem (w. l. o. g. b ≥ 0, so that x = 0, s = b is a feasible basic solution)
min 1⊤s
s. t. Ax + s = b,
      x ≥ 0, s ≥ 0,
whose optimal value is 0 if and only if (P) is feasible.
Remark
• As soon as any auxiliary variable si becomes nonbasic, it can be removed
from the problem.
• If the optimal solution of phase 1 still contains some si variables
(degenerate cases), they can be moved out of the basis by pivoting steps
with suitable columns.
In order to move towards a better basis from the start, this approach combines
the search for a feasible solution with improving the objective by choosing a
cost factor M > 0 large enough and solving (w. l. o. g. b ≥ 0)
min c⊤x + M·1⊤s              min c⊤x + Mσ
s. t. Ax + s = b,      or    s. t. Ax + b̄σ = b,        [b̄ = b − Ax̄]
      x ≥ 0, s ≥ 0,                x ≥ 0, σ ≥ 0.       [for some starting x̄ ≥ 0]
Advantages:
• The simplex algorithm searches for a good basic solution from the
start.
• When all auxiliary variables have become nonbasic the algorithm just
continues with the original problem.
Disadvantages:
• It is not clear how large M has to be chosen.
• A huge value in M usually causes numerical difficulties.
Commercial software packages also employ so called “crash methods”, which
refer to heuristic approaches for finding good starting bases.
The simplex algorithm supplies an optimality certificate via its reduced cost
vector, so dual information should be available there. Indeed, if p∗ is finite
the simplex algorithm Alg 7.4 delivers a dual optimal solution directly via
1. BTRAN: Compute ȳ = A_B⁻⊤c_B [by solving A_B⊤ȳ = c_B].
2. Pricing: Compute z̄_N = c_N − A_N⊤ȳ. If z̄_N ≥ 0, x̄ is optimal, STOP, . . .
At termination the objective value
b⊤ȳ = b⊤A_B⁻⊤c_B = x̄_B⊤c_B = c⊤x̄
is equal to that of the primal optimal solution. Thus, by weak duality, (ȳ, z̄) is a
dual optimal solution. The simplex method therefore provides an alternative
proof of the Strong Duality Theorem for Linear Programming, Th 5.10.
max c⊤x                          min b⊤y
s. t. Ax ≤ b                     s. t. A⊤y ≥ c
      x ≥ 0                            y ≥ 0
   ⇕                                ⇕
min (−c)⊤x                       max b⊤(−y)
s. t. [A I](x; s) = b            s. t. (A⊤; I)(−y) + (z^A; z^I) = (−c; 0)
      (x; s) ≥ 0                       y ∈ R^m, (z^A; z^I) ≥ 0

Thus,
x̄, ȳ opt.   ⇔   (x̄; s̄ = b − Ax̄)  and  (ȳ; z̄^A = A⊤ȳ − c; z̄^I = ȳ)  opt.
            ⇔ (Cor 7.10)
1. 0 = x̄_i · z̄_i^A = x̄_i · (A⊤ȳ − c)_i,
2. 0 = s̄_i · z̄_i^I = (b − Ax̄)_i · ȳ_i,
3. and x̄, ȳ are primal-dual feasible.
[Figure: a two-dimensional example; at the optimal vertex x̄ the first two constraints are active, x̄⊤a_1 = b_1 and x̄⊤a_2 = b_2, the dual constraints read a_1y_1 + a_2y_2 + a_3y_3 ≥ c, and b_1y_1 + b_2y_2 = c⊤x̄.]
Note, ȳ solves the system arising from the complementarity relations
x̄_i ≠ 0  ⇒  [a_1y_1 + a_2y_2]_i = c_i.
[Figure: constraint normals a_1, . . . , a_4 and the cost vector c at the optimal vertex x̄.]
If the dual optimum is not unique, the primal must be degenerate. Likewise, if there are several primal optima, the dual must be degenerate.
In order to derive the dual simplex method for the dual in normal form
max b⊤ y
s. t. A⊤ y + z = c,
y ∈ Rm , z ≥ 0,
consider again the splitting into basic and nonbasic parts
(A_B⊤; A_N⊤)·y + (z_B; z_N) = (c_B; c_N).
Increasing zBı̂ helps in maximizing if 0 > [A−1 B b]ı̂ = xBı̂ , thus if the primal
constraint Bı̂ is infeasible. If such a ı̂ exists, zBı̂ is increased as long as zN
remains feasible, this value is
γ = sup{ γ ≥ 0 : z_N = c_N − A_N⊤A_B⁻⊤c_B + γ·A_N⊤A_B⁻⊤e_ı̂ ≥ 0 }.
7.4 Sensitivity
How strongly does the optimal solution depend on changes of the costs or the right hand side?
z_N* = c_N − A_N⊤y*.
[Figure: the optimal vertex x* (canonical primal case) with the cost perturbation direction ∆c and right hand side perturbations ∆b_1, ∆b_2, ∆b_3.]
In nondegenerate situations small changes in c or b will not affect the optimal basis. In this case only the values change and they do so in a perfectly predictable way.
Note, changes in c do not affect primal feasibility but affect the dual feasible
set. The current basis B stays optimal as long as the reduced costs zN (c) ≥ 0
remain nonnegative.
For c + t∆c with t ∈ R put
z_N(t) = c_N + t∆c_N − A_N⊤A_B⁻⊤(c_B + t∆c_B) = z_N* + t·∆z_N,   ∆z_N := ∆c_N − A_N⊤A_B⁻⊤∆c_B,
then
z_N(t) ≥ 0   ⇔   sup_{i : [∆z_N]_i > 0} ( −[z_N*]_i/[∆z_N]_i ) ≤ t ≤ inf_{i : [∆z_N]_i < 0} ( −[z_N*]_i/[∆z_N]_i ).
For this range of t-values the new optimal value is simply (c + t∆c)⊤ x∗ .
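The admissible t-interval is a one-liner to compute; a Python sketch (function name mine) mirroring the formula above, equally applicable to the right hand side case below with (x_B*, ∆x_B) in place of (z_N*, ∆z_N):

    def sensitivity_range(v_star, dv):
        # largest interval [t_min, t_max] with v_star + t*dv >= 0 componentwise
        t_min, t_max = float('-inf'), float('inf')
        for v, d in zip(v_star, dv):
            if d > 0:
                t_min = max(t_min, -v / d)
            elif d < 0:
                t_max = min(t_max, -v / d)
        return t_min, t_max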
Changes in b preserve dual feasibility but affect the primal feasible region. The current basis is optimal as long as x_B ≥ 0 remains nonnegative.
For b + t∆b with t ∈ R put
x_B(t) = A_B⁻¹(b + t∆b) = x_B* + t·∆x_B,   ∆x_B := A_B⁻¹∆b,
then
x_B(t) ≥ 0   ⇔   sup_{i : [∆x_B]_i > 0} ( −[x_B*]_i/[∆x_B]_i ) ≤ t ≤ inf_{i : [∆x_B]_i < 0} ( −[x_B*]_i/[∆x_B]_i ).
Note, however, that the set of optimal solutions might get bigger in degenerate
cases (if inequality ı̄ is weakly active).
In the same way one may drop variables x∗i = 0 or zi∗ = 0 without changing
the objective value.
For most large scale linear programs one tries to avoid to include all potentially
inactive inequalities and variables. They are added on demand and dropped
again if they no longer seem of importance. Such techniques are sketched
next.
Column Generation
Let variable xs , s ∈ S, specify the length for which the cutting pattern s is
to be employed, then the problem to be solved may be formulated as follows,
min  Σ_{s∈S} x_s                                   [minimize total length]
s. t. Σ_{s∈S} s_i x_s ≥ ℓ_i,  i = 1, . . . , m,    [satisfy demand of width i]
      x_s ≥ 0,  s ∈ S.                             [|S| is typically huge!]
                          max  ȳ⊤s
min_{s∈S} (1 − s⊤ȳ)   ⇔   s. t. Σ_{i=1}^m s_i b_i ≤ b̄,
                                s_i ∈ N_0,  i = 1, . . . , m.
For b ∈ {0, . . . , b̄} and k ∈ {0, . . . , m} let opt(b, k) denote the optimal value that can be achieved for total width b and items i = 1, . . . , k. For convenience the definition ensures opt(b, k) = −∞ for b < 0 and opt(b, k) = 0 for b = 0 or (b ≥ 0 ∧ k = 0). With this the following recursion holds for b = 1, . . . , b̄ and k = 1, . . . , m,
opt(b, k) = max{ opt(b, k − 1),  ȳ_k + opt(b − b_k, k) },
because the best solution of width b may either only use widths 1, . . . , k − 1
and no width k or it contains at least one copy of width k for a benefit of
ȳk on top of the best choice for the remaining total width b − bk filled with
widths 1, . . . , k.
Algorithm 7.12 (dynamic program for the pricing problem)
initialize opt[j] ← 0 and sol[j] ← 0 for j = 0, . . . , b̄;
for k = 1, . . . , m do
  for j = b_k, . . . , b̄ do
    if opt[j − b_k] + ȳ_k > opt[j] then
      opt[j] ← opt[j − b_k] + ȳ_k , sol[j] ← k;
    end if
  end for
end for
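A runnable version of this dynamic program, including the reconstruction of the pattern from sol (which the pseudocode leaves implicit and the text below describes in prose):

    def knapsack_pricing(widths, y_bar, b_total):
        # opt[j]: best dual value packable into width j; sol[j]: last item used
        m = len(widths)
        opt = [0.0] * (b_total + 1)
        sol = [0] * (b_total + 1)
        for k in range(1, m + 1):
            bk, yk = widths[k - 1], y_bar[k - 1]
            for j in range(bk, b_total + 1):
                if opt[j - bk] + yk > opt[j]:
                    opt[j] = opt[j - bk] + yk
                    sol[j] = k
        s = [0] * m                      # backtrack the optimal pattern
        j = b_total
        while j > 0 and sol[j] > 0:
            k = sol[j]
            s[k - 1] += 1
            j -= widths[k - 1]
        return opt[b_total], s

    # the hand example below: widths (3, 5, 6), y_bar (2, 4, 5), b_total 10
    # yields value 8 and pattern s = (0, 2, 0)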
It helps to carry out the algorithm for a small example by hand, e. g. for b_1 = 3, b_2 = 5, b_3 = 6 and b̄ = 10 with ȳ_1 = 2, ȳ_2 = 4, ȳ_3 = 5, which should result in

  b  | opt sol | k=1: opt sol | k=2: opt sol | k=3: opt sol
 10  |  0   0  |      6   1   |      8   2   |      8   2
  9  |  0   0  |      6   1   |      6   1   |      7   3
  8  |  0   0  |      4   1   |      6   2   |      6   2
  7  |  0   0  |      4   1   |      4   1   |      5   3
  6  |  0   0  |      4   1   |      4   1   |      5   3
  5  |  0   0  |      2   1   |      4   2   |      4   2
  4  |  0   0  |      2   1   |      2   1   |      2   1
  3  |  0   0  |      2   1   |      2   1   |      2   1
  2  |  0   0  |      0   0   |      0   0   |      0   0
  1  |  0   0  |      0   0   |      0   0   |      0   0
  0  |  0   0  |      0   0   |      0   0   |      0   0
Note that the sol-vector allows to reconstruct the optimal solution, as the
item employed for getting the current optimal value indicates which next
solution to build on (here s1 = 0, s2 = 2, s3 = 0).
Once the pricing problem is solved, the generated cutting pattern is added as
a column together with its variable and the optimal solution is recomputed
for the updated problem.
Algorithm 7.13 (Column Generation Framework; Cutting Stock)
Input: b̄ ∈ N, b ∈ {1, . . . , b̄}m , ℓ ∈ Rm
0. Choose some initial patterns that ensure feasibility, e. g., s^(i) = ⌊b̄/b_i⌋·e_i for i = 1, . . . , m.
1. Solve the LP problem for the current selection of patterns → ȳ.
2. Remove unused patterns.
3. Find new patterns by pricing / column generation using e. g. Alg 7.12;
if there are none with negative reduced cost, STOP,
else add the new one(s) and GOTO 1.
In practice the first iterations typically show fast improvement that quickly
slows down dramatically. This tailing off effect is due to miniature im-
provements that are possible by slightly modifying the cutting patterns
CHAPTER 7. SIMPLEX METHOD 124
selected from the huge available set. Mostly there is no point in going for
full optimality and the iterations are stopped once progress is slow.
The model used here has some disadvantages in practice.
1. The selected lengths x_s are in general not useful multiples of the available mother coil lengths. The typical solution uses some patterns for a very
long time and numerous further ones for lengths that are almost not worth
setting up.
2. The time needed to set up a machine to cut a given pattern is often the
limiting factor. Therefore practical solutions should use as few patterns as
possible and rather provide a bit of excess length or trim loss. Note, that
the number of patterns actually employed in the solution of the simplex
method is at most m [why?], which is typically too large in practice.
♡
In the separation problem, “try to find” alludes to the fact that for many
practical problems – in particular in the context of integer programming – it
is too time consuming to indeed explore or enumerate all relevant inequalities.
In these cases one aims at developing algorithmic approaches for finding
violated inequalities that either work exactly for a well defined subclass of
inequalities or that employ heuristic methods for hopefully identifying the
most relevant ones.
Note, adding violated cutting planes amounts to adding dual variables for
which the dual pricing step indicates relevance. Indeed, the cutting plane
approach is exactly the same as column generation for the dual problem.
The simplex algorithm may start from any basic solution that is either primal
feasible (by using the primal simplex algorithm) or dual feasible (by using
the dual simplex algorithm). After solving an initial variant of the problem
to optimality, the current basis B is primal and dual feasible.
• Column generation adds variables and columns with negative reduced
cost. This keeps the basis primal feasible but destroys its dual feasibility.
In this setting the primal simplex algorithm allows to continue directly
from the previous basis. Furthermore, for minor problem modifications
typically most of the decisions on which variables need to be in the
basis remain correct and the next optimal basis is found after relatively
few iterations.
• Cutting plane methods add inequalities violated by the current primal
solution. Therefore the latest basis is no longer primal feasible. On the
dual side, however, this only adds variables with improving reduced
dual costs, so the basis stays dual feasible and one may continue directly
from this basis with the dual simplex algorithm. Again, frequently the
next optimal basis is found within a few dual simplex iterations.
The approach to continue directly from a previously computed solution is
called warm starting and the simplex algorithm offers almost ideal possibilities
to do so.
This is in stark contrast to interior point methods where no good general
warm starting strategies seem to be available so far. Indeed, even rather
slight problem modifications change the shape of the central path and in
particular the location of its terminal point significantly. Therefore interior
point methods have to go back deeply into the interior to get reasonably
close to the central path again.
Interior point methods are often faster in solving initial problems or problems
that do not require dynamic modifications. If modifications are required, a
typical approach is to start by solving the initial problem by interior point
methods. This yields an approximation to an optimal solution, for which
an optimal basis is then constructed in a so called cross over step (this
is not always easy or efficient). Once this optimal basis is computed, the
modifications and resolves are then carried out via the appropriate simplex
algorithm.