LECTURE NOTES
OPTIMIZATION III
CONVEX ANALYSIS
NONLINEAR PROGRAMMING THEORY
NONLINEAR PROGRAMMING ALGORITHMS
ISYE 6663
Aharon Ben-Tal† & Arkadi Nemirovski∗
† The William Davidson Faculty of Industrial Engineering & Management, Technion – Israel Institute of Technology
∗ H. Milton Stewart School of Industrial & Systems Engineering, Georgia Institute of Technology
Aim: Introduction to the Theory of Nonlinear Programming and algorithms of Continuous Opti-
mization.
Duration: 14 weeks, 3 hours per week
Prerequisites: elementary Linear Algebra (vectors, matrices, Euclidean spaces); basic knowledge of Calculus (including gradients and Hessians of multivariate functions). These prerequisites are summarized in the Appendix.
Contents:
Part I. Elements of Convex Analysis and Optimality Conditions
7 weeks
1-2. Convex sets (definitions, basic properties, Caratheodory-Radon-Helley theorems)
3-4. The Separation Theorem for convex sets (Farkas Lemma, Separation Theorem, Theorem on Alternative, Extreme points, Krein-Milman Theorem in Rn, structure of polyhedral sets, theory of Linear Programming)
5. Convex functions (definition, differential characterizations, operations preserving convexity)
6. Mathematical Programming programs and Lagrange duality in Convex Programming (Convex
Programming Duality Theorem with applications to linearly constrained convex Quadratic Programming)
7. Optimality conditions in unconstrained and constrained optimization (Fermat rule; Karush-Kuhn-
Tucker first order optimality condition for the regular case; necessary/sufficient second order optimality
conditions for unconstrained case; second order sufficient optimality conditions)
Part II: Algorithms
7 weeks
8. Univariate unconstrained minimization (Bisection; Curve Fitting; Armijo-terminated inexact line
search)
9. Multivariate unconstrained minimization: Gradient Descent and Newton methods
10. Around the Newton Method (variable metric methods, conjugate gradients, quasi-Newton algo-
rithms)
11. Polynomial solvability of Convex Programming
12. Constrained minimization: active sets and penalty/barrier approaches
13. Constrained minimization: augmented Lagrangians
14. Constrained minimization: Sequential Quadratic Programming
Contents
1 Convex sets in Rn 11
1.1 Definition and basic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.1.1 A convex set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.1.2 Examples of convex sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.1.2.1 Affine subspaces and polyhedral sets . . . . . . . . . . . . . . . 11
1.1.2.2 Unit balls of norms . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1.2.3 Ellipsoids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.1.2.4 Neighbourhood of a convex set . . . . . . . . . . . . . . . . . . 14
1.1.3 Inner description of convex sets: Convex combinations and convex hull . . 14
1.1.3.1 Convex combinations . . . . . . . . . . . . . . . . . . . . . . . . 14
1.1.3.2 Convex hull . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.1.3.3 Simplex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.1.4 Cones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.1.5 Calculus of convex sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1.6 Topological properties of convex sets . . . . . . . . . . . . . . . . . . . . . 17
1.1.6.1 The closure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1.6.2 The interior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.1.6.3 The relative interior . . . . . . . . . . . . . . . . . . . . . . . . 19
1.1.6.4 Nice topological properties of convex sets . . . . . . . . . . . . . 20
1.2 Main theorems on convex sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.2.1 Caratheodory Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.2.2 Radon Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.2.3 Helley Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.2.4 Polyhedral representations and Fourier-Motzkin Elimination . . . . . . . . 24
1.2.4.1 Polyhedral representations . . . . . . . . . . . . . . . . . . . . . 24
1.2.4.2 Every polyhedrally representable set is polyhedral! (Fourier-Motzkin elimination) . . . . . 25
1.2.4.3 Some applications . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.2.4.4 Calculus of polyhedral representations . . . . . . . . . . . . . . 27
1.2.5 General Theorem on Alternative and Linear Programming Duality . . . . 28
1.2.5.1 Homogeneous Farkas Lemma . . . . . . . . . . . . . . . . . . . 28
1.2.5.2 General Theorem on Alternative . . . . . . . . . . . . . . . . . 31
1.2.5.3 Application: Linear Programming Duality . . . . . . . . . . . . 34
1.2.6 Separation Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.2.6.1 Separation: definition . . . . . . . . . . . . . . . . . . . . . . . . 38
1.2.6.2 Separation Theorem . . . . . . . . . . . . . . . . . . . . . . . . 40
2 Convex functions 59
2.1 Convex functions: first acquaintance . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.1.1 Definition and Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.1.2 Elementary properties of convex functions . . . . . . . . . . . . . . . . . . 60
2.1.2.1 Jensen’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.1.2.2 Convexity of level sets of a convex function . . . . . . . . . . . 61
2.1.3 What is the value of a convex function outside its domain? . . . . . . . . 61
2.2 How to detect convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.2.1 Operations preserving convexity of functions . . . . . . . . . . . . . . . . 62
2.2.2 Differential criteria of convexity . . . . . . . . . . . . . . . . . . . . . . . . 64
2.3 Gradient inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.4 Boundedness and Lipschitz continuity of a convex function . . . . . . . . . . . . 68
2.5 Maxima and minima of convex functions . . . . . . . . . . . . . . . . . . . . . . . 71
2.6 Subgradients and Legendre transformation . . . . . . . . . . . . . . . . . . . . . . 76
2.6.1 Proper functions and their representation . . . . . . . . . . . . . . . . . . 76
2.6.2 Subgradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
2.6.3 Legendre transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Lecture 1. Convex sets in Rn

1.1 Definition and basic properties

1.1.1 A convex set

Definition 1.1.1 [Convex set] A set M ⊂ Rn is called convex, if it contains, along with every pair of its points x and y, the entire segment
[x, y] = {z = λx + (1 − λ)y : 0 ≤ λ ≤ 1}
linking these points:
x, y ∈ M, 0 ≤ λ ≤ 1 ⇒ λx + (1 − λ)y ∈ M.
Note that by this definition an empty set is convex (by convention, or better to say, by the
exact sense of the definition: for the empty set, you cannot present a counterexample to show
that it is not convex).
Convexity of affine subspaces immediately follows from the possibility to represent these sets as solution sets of systems of linear equations (Proposition A.3.7), due to the following simple and important fact:

Proposition 1.1.1 The solution set of an arbitrary (possibly, infinite) system
a_α^T x ≤ b_α, α ∈ A (!)
of nonstrict linear inequalities with n unknowns x – the set
S = {x ∈ Rn : a_α^T x ≤ b_α, α ∈ A}
– is convex.
In particular, the solution set {x : Ax ≤ b} of a finite system of nonstrict linear inequalities (a polyhedral set) is convex.
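The convexity just stated is easy to probe numerically. The following minimal sketch (the particular A, b, and test points are arbitrary choices, not part of the notes) samples convex combinations of two solutions of a finite system Ax ≤ b and checks that they solve the system as well:

```python
import numpy as np

# S = {x : Ax <= b}: here the triangle {x >= 0, x1 + x2 <= 1} (arbitrary choice)
A = np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 0.0])

def inside(x):
    return np.all(A @ x <= b + 1e-12)

# two points of S; every sampled convex combination must stay in S
x, y = np.array([0.2, 0.3]), np.array([0.7, 0.1])
assert inside(x) and inside(y)
rng = np.random.default_rng(0)
for lam in rng.uniform(0.0, 1.0, size=100):
    assert inside(lam * x + (1.0 - lam) * y)
print("all sampled convex combinations solve the system")
```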
Remark 1.1.1 Note that every set given by Proposition 1.1.1 is not only convex, but also closed
(why?). In fact, from Separation Theorem (Theorem 1.2.9 below) it follows that
Every closed convex set in Rn is the solution set of a (perhaps, infinite) system of
nonstrict linear inequalities.
Remark 1.1.2 Note that replacing some of the nonstrict linear inequalities a_α^T x ≤ b_α in (!) with their strict versions a_α^T x < b_α, we get a system whose solution set is still convex (why?), but now not necessarily closed.
Example 1.1.2 [unit ball of a norm] Let ‖·‖ be a norm on Rn, that is, a real-valued function which is positive outside the origin and satisfies
‖λx‖ = |λ| ‖x‖;
‖x + y‖ ≤ ‖x‖ + ‖y‖.
The unit ball of the norm – the set
{x ∈ Rn : ‖x‖ ≤ 1},
same as any other ‖·‖-ball, is convex. In particular, the unit balls of the standard ℓp-norms
‖x‖_p = (Σ_{i=1}^n |xi|^p)^{1/p}, 1 ≤ p ≤ ∞,
are convex. These indeed are norms (which is not clear in advance). When p = 2, we get the usual Euclidean norm; of course, you know how the Euclidean ball looks. When p = 1, we get
‖x‖_1 = Σ_{i=1}^n |xi|.
When p = ∞, we get
‖x‖_∞ = max_{1≤i≤n} |xi|,
and the corresponding unit ball is the unit cube
V = {x ∈ Rn : −1 ≤ xi ≤ 1, 1 ≤ i ≤ n}.
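For readers who like to experiment, here is a small numerical sanity check (the dimension and sample sizes are arbitrary choices of this sketch) of the homogeneity and triangle-inequality properties for the ℓp-norms with p = 1, 2, ∞:

```python
import numpy as np

# Empirical sanity check of the norm axioms for ||.||_p, p = 1, 2, inf
rng = np.random.default_rng(1)
for p in (1, 2, np.inf):
    for _ in range(1000):
        x, y = rng.normal(size=5), rng.normal(size=5)
        lam = rng.normal()
        # homogeneity: ||lam x|| = |lam| ||x||
        assert np.isclose(np.linalg.norm(lam * x, p), abs(lam) * np.linalg.norm(x, p))
        # triangle inequality: ||x + y|| <= ||x|| + ||y||
        assert np.linalg.norm(x + y, p) <= np.linalg.norm(x, p) + np.linalg.norm(y, p) + 1e-12
print("homogeneity and triangle inequality hold on all samples")
```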
Exercise 1.2 Prove that unit balls of norms on Rn are exactly the same as convex sets V in Rn satisfying the following three properties:
1. V is symmetric with respect to the origin: x ∈ V ⇒ −x ∈ V;
2. V is bounded and closed;
3. V contains a neighbourhood of the origin.
A set V satisfying the outlined properties is the unit ball of the norm
‖x‖ = inf{ t ≥ 0 : t⁻¹x ∈ V }.
Hint: You could find it useful to verify and exploit the following facts:
1. A norm ‖·‖ on Rn is Lipschitz continuous with respect to the standard Euclidean distance: there exists C‖·‖ < ∞ such that |‖x‖ − ‖y‖| ≤ C‖·‖ ‖x − y‖_2 for all x, y;
2. Vice versa, the Euclidean norm is Lipschitz continuous with respect to a given norm ‖·‖: there exists c‖·‖ < ∞ such that |‖x‖_2 − ‖y‖_2| ≤ c‖·‖ ‖x − y‖ for all x, y.
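The norm recovered from V in Exercise 1.2 can be evaluated numerically. The sketch below (the helper gauge and its tolerances are hypothetical, assuming V is given by a membership oracle and satisfies the three properties of the exercise) computes ‖x‖ = inf{t ≥ 0 : t⁻¹x ∈ V} by bisection; for V the unit cube it reproduces ‖·‖_∞:

```python
import numpy as np

def gauge(x, member, t_max=1e6, tol=1e-9):
    # Minkowski gauge ||x|| = inf{t >= 0 : x/t in V}, computed by bisection;
    # assumes V (given by the membership oracle `member`) is closed, bounded,
    # symmetric w.r.t. the origin, and contains a neighbourhood of the origin
    if np.allclose(x, 0.0):
        return 0.0
    lo, hi = 0.0, t_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid > 0.0 and member(x / mid):
            hi = mid   # x/mid in V: the gauge is at most mid
        else:
            lo = mid   # x/mid outside V: the gauge exceeds mid
    return hi

cube = lambda z: np.all(np.abs(z) <= 1.0)          # V = unit cube
x = np.array([0.5, -2.0, 1.5])
print(gauge(x, cube), np.linalg.norm(x, np.inf))   # both ~ 2.0
```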
1.1.2.3 Ellipsoids
Example 1.1.3 [Ellipsoid] Let Q be an n × n matrix which is symmetric (Q = Q^T) and positive definite (x^T Qx > 0 whenever x ≠ 0). Then, for every nonnegative r, the Q-ellipsoid of radius r centered at a – the set
{x : (x − a)^T Q(x − a) ≤ r²}
– is convex.
To see that an ellipsoid {x : (x − a)^T Q(x − a) ≤ r²} is convex, note that since Q is symmetric positive definite, the matrix Q^{1/2} is well-defined and positive definite. Now, if ‖·‖ is a norm on Rn and P is a nonsingular n × n matrix, the function ‖Px‖ is a norm along with ‖·‖ (why?). Thus, the function ‖x‖_Q ≡ √(x^T Qx) = ‖Q^{1/2}x‖_2 is a norm along with ‖·‖_2, and the ellipsoid in question is nothing but the ‖·‖_Q-ball of radius r centered at a; as such, it is convex.
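The identity ‖x‖_Q = ‖Q^{1/2}x‖_2 underlying the argument is easy to verify numerically; in the sketch below Q^{1/2} is built from the eigenvalue decomposition of an arbitrarily chosen positive definite Q:

```python
import numpy as np

# ||x||_Q = sqrt(x^T Q x) equals ||Q^{1/2} x||_2, with Q^{1/2} built from the
# eigenvalue decomposition of Q (the matrix below is an arbitrary choice)
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
w, U = np.linalg.eigh(Q)                 # Q = U diag(w) U^T, with w > 0
Q_half = U @ np.diag(np.sqrt(w)) @ U.T   # the positive definite square root

rng = np.random.default_rng(2)
for _ in range(100):
    x = rng.normal(size=2)
    assert np.isclose(np.sqrt(x @ Q @ x), np.linalg.norm(Q_half @ x))
print("||x||_Q == ||Q^{1/2} x||_2 on all samples")
```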
1.1.3 Inner description of convex sets: Convex combinations and convex hull
1.1.3.1 Convex combinations
Recall the notion of a linear combination y of vectors y1, ..., ym – this is a vector represented as
y = Σ_{i=1}^m λi yi,
where λi are real coefficients. Specializing this definition, we come to the notion of an affine combination – this is a linear combination with the sum of coefficients equal to one. The last notion in this genre is that of a convex combination.
Definition 1.1.2 A convex combination of vectors y1, ..., ym is their affine combination with nonnegative coefficients, or, which is the same, a linear combination
y = Σ_{i=1}^m λi yi
with nonnegative coefficients summing to 1:
λi ≥ 0, Σ_{i=1}^m λi = 1.
1.1.3.3 Simplex
The convex hull of m + 1 affinely independent points y0, ..., ym (Section A.3.3) is called an m-dimensional simplex with the vertices y0, ..., ym. By results of Section A.3.3, every point x of an m-dimensional simplex with vertices y0, ..., ym admits exactly one representation as a convex combination of the vertices; the corresponding coefficients form the unique solution to the system of linear equations
Σ_{i=0}^m λi yi = x, Σ_{i=0}^m λi = 1.
This system is solvable if and only if x ∈ M = Aff({y0, ..., ym}), and the components of the solution (the barycentric coordinates of x) are affine functions of x ∈ Aff(M); the simplex itself is comprised of points from M with nonnegative barycentric coordinates.
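Barycentric coordinates are thus computed by solving a single linear system. A minimal sketch (the vertices and the point are arbitrary choices) in dimension m = n = 2:

```python
import numpy as np

# Barycentric coordinates: solve sum_i lam_i y_i = x together with sum_i lam_i = 1
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]).T   # columns y0, y1, y2
x = np.array([0.2, 0.3])

# stack the affine constraint sum lam_i = 1 under the linear system Y lam = x
M = np.vstack([Y, np.ones((1, 3))])
rhs = np.concatenate([x, [1.0]])
lam = np.linalg.solve(M, rhs)

print(lam)               # [0.5, 0.2, 0.3]
print(np.all(lam >= 0))  # True: x has nonnegative barycentric coordinates,
                         # i.e., x lies in the simplex
```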
1.1.4 Cones
A nonempty subset M of Rn is called conic, if it contains, along with every point x ∈ M , the
entire ray Rx = {tx : t ≥ 0} spanned by the point:
x ∈ M ⇒ tx ∈ M ∀t ≥ 0.
A convex conic set is called a cone.
Proposition 1.1.5 A nonempty subset M of Rn is a cone if and only if it possesses the fol-
lowing pair of properties:
• is conic: x ∈ M, t ≥ 0 ⇒ tx ∈ M ;
• contains sums of its elements: x, y ∈ M ⇒ x + y ∈ M .
Theorem 1.1.3 The following operations preserve convexity of sets:
1. Taking intersection: if M1 and M2 are convex sets in Rn, so is their intersection M1 ∩ M2 (in fact, the intersection of an arbitrary family of convex sets is convex).
2. Taking direct product: if M1 ⊂ Rn1 and M2 ⊂ Rn2 are convex sets, so is the set
M1 × M2 = {(x1, x2) ∈ Rn1 × Rn2 : x1 ∈ M1, x2 ∈ M2}.
3. Arithmetic summation and multiplication by reals: if M1, ..., Mk are convex sets in Rn and λ1, ..., λk are arbitrary reals, then the set
λ1 M1 + ... + λk Mk = {Σ_{i=1}^k λi xi : xi ∈ Mi, i = 1, ..., k}
is convex.
With this fact in mind, it is easy to prove that, e.g., the closure of the open Euclidean ball
{x : ‖x − a‖_2 < r} [r > 0]
is the closed ball {x : ‖x − a‖_2 ≤ r}. Another useful application example is the closure of a set
M = {x : aTα x < bα , α ∈ A}
given by strict linear inequalities: if such a set is nonempty, then its closure is given by the
nonstrict versions of the same inequalities:
cl M = {x : aTα x ≤ bα , α ∈ A}.
Nonemptiness of M in the latter example is essential: the set M given by two strict inequal-
ities
x < 0, −x < 0
in R clearly is empty, so that its closure also is empty; in contrast to this, applying formally the above rule, we would get the wrong answer
cl M = {x : x ≤ 0, x ≥ 0} = {0}.
• The interior of the closed ball {x : ‖x − a‖_2 ≤ r} is the open ball {x : ‖x − a‖_2 < r} (why?)
• The interior of a polyhedral set {x : Ax ≤ b} with matrix A not containing zero rows is the set {x : Ax < b} (why?)
The latter statement is not, generally speaking, valid for sets of solutions of infinite systems of linear inequalities. For example, the system of inequalities
x ≤ 1/n, n = 1, 2, ...
in R has, as its solution set, the nonpositive ray R− = {x ≤ 0}; the interior of this ray is the negative ray {x < 0}. At the same time, the strict versions of our inequalities
x < 1/n, n = 1, 2, ...
define the same nonpositive ray, not the negative one.
It is also easily seen (this fact is valid for arbitrary metric spaces, not for Rn only) that the interior of a set is open and the closure is closed.
The interior of a set is, of course, contained in the set, which, in turn, is contained in its closure:
int M ⊂ M ⊂ cl M. (1.1.1)
The complement of the interior in the closure – the set
∂M = cl M \ int M
– is called the boundary of M , and the points of the boundary are called boundary points of
M (Warning: these points not necessarily belong to M , since M can be less than cl M ; in fact,
all boundary points belong to M if and only if M = cl M , i.e., if and only if M is closed).
The boundary of a set clearly is closed (as the intersection of two closed sets cl M and Rn \ int M; the latter set is closed as the complement of an open set). From the definition of the
boundary,
M ⊂ int M ∪ ∂M [= cl M ],
so that a point from M is either an interior, or a boundary point of M .
• we can define the relative boundary ∂ri M = cl M \ rint M, which is a closed set contained in Aff(M), and, as for the “actual” interior and boundary, we have
M ⊂ rint M ∪ ∂ri M [= cl M].
Of course, if Aff(M ) = Rn , then the relative interior becomes the usual interior, and similarly
for boundary; this for sure is the case when int M 6= ∅ (since then M contains a ball B, and
therefore the affine hull of M is the entire Rn , which is the affine hull of B).
Note that the inclusions in the chain
rint M ⊂ M ⊂ cl M
can be very “non-tight”. For example, let M be the set of rational numbers in the segment
[0, 1] ⊂ R. Then rint M = int M = ∅ – since every neighbourhood of every rational real
contains irrational reals – while cl M = [0, 1]. Thus, rint M is “incomparably smaller” than M ,
cl M is “incomparably larger”, and M is contained in its relative boundary (by the way, what
is this relative boundary?).
The following proposition demonstrates that the topology of a convex set M is much better than it might be for an arbitrary set.
(iii) The closure does not differ from the closure of the relative interior:
cl M = cl rint M
(in particular, every point of cl M is the limit of a sequence of points from rint M);
(iv) The relative interior remains unchanged when we replace M with its closure:
rint M = rint cl M.
under the linear transformation µ ↦ Aµ, where A is the matrix with the columns a1, ..., an. The standard simplex clearly has a nonempty interior (comprised of all vectors µ > 0 with Σ_i µi < 1); since A is nonsingular (due to linear independence of a1, ..., an), multiplication by A maps open sets onto open ones, so that ∆ has a nonempty interior. Since ∆ ⊂ M, the interior of M is nonempty. 2
(iii): We should prove that the closure of rint M is exactly the same as the closure of M. In fact we shall prove even more:
Lemma 1.1.1 Let x ∈ rint M and y ∈ cl M. Then all points from the half-segment
[x, y) = {z = (1 − λ)x + λy : 0 ≤ λ < 1}
belong to the relative interior of M.
Proof. Let L be the linear subspace parallel to Aff(M), so that
M ⊂ Aff(M) = x + L.
Since x ∈ rint M, there exists r > 0 such that
x + rB ⊂ M, (1.1.3)
where B is the unit Euclidean ball in L.
Now let λ ∈ [0, 1), and let z = (1 − λ)x + λy. Since y ∈ cl M, we have y = lim_{i→∞} yi for a certain sequence of points yi from M. Setting zi = (1 − λ)x + λyi, we get zi → z as i → ∞. Now, from (1.1.3) and the convexity of M it follows that the sets Zi = {u = (1 − λ)x′ + λyi : x′ ∈ x + rB} are contained in M; clearly, Zi is exactly the set zi + r′B, where r′ = (1 − λ)r > 0. Thus, z is the limit of the sequence zi, and the r′-neighbourhood (in Aff(M)) of every one of the points zi belongs to M. For every r″ < r′ and for all i such that zi is close enough to z, the r′-neighbourhood of zi contains the r″-neighbourhood of z; thus, a neighbourhood (in Aff(M)) of z belongs to M, whence z ∈ rint M.
A useful byproduct of Lemma 1.1.1 is as follows: if M is a convex set, then every convex combination Σ_i λi xi of points xi ∈ cl M where at least one term with positive coefficient corresponds to xi ∈ rint M is in fact a point from rint M.
(iv): The statement is evidently true when M is empty, so assume that M is nonempty. The inclusion
rint M ⊂ rint cl M is evident, and all we need is to prove the inverse inclusion. Thus, let z ∈ rint cl M ,
and let us prove that z ∈ rint M . Let x ∈ rint M (we already know that the latter set is nonempty).
Consider the segment [x, z]; since z is in the relative interior of cl M, we can extend this segment a little bit through the point z without leaving cl M, i.e., there exists y ∈ cl M such that z ∈ [x, y). We are
done, since by Lemma 1.1.1 from z ∈ [x, y), with x ∈ rint M , y ∈ cl M , it follows that z ∈ rint M .
We see from the proof of Theorem 1.1.1 that to get the closure of a (nonempty) convex set, it suffices to subject it to the “radial” closure, i.e., to take a point x ∈ rint M, take all rays in Aff(M) starting at x and look at the intersection of such a ray l with M; such an intersection will be a convex set on the line which contains a one-sided neighbourhood of x, i.e., is either a segment [x, yl], or the entire ray l, or a half-interval [x, yl). In the first two cases we should not do anything; in the third we should add yl to M. After all rays are looked through and all “missed” endpoints yl are added to M, we get the closure of M. To understand the role of convexity here, look at the nonconvex set of rational numbers from [0, 1]; the interior (≡ relative interior) of this “highly percolated” set is empty, the closure is [0, 1], and there is no way to restore the closure in terms of the interior.
Let us choose among all these representations of x as a convex combination of points from M the one with the smallest possible N, and let it be the above combination. We claim that N ≤ m + 1 (this claim leads to the desired statement). Indeed, if N > m + 1, then the system of m + 1 homogeneous equations with N unknowns µ1, ..., µN
Σ_{i=1}^N µi xi = 0,
Σ_{i=1}^N µi = 0
has a nontrivial solution µ1, ..., µN. Shifting the weights λi along this solution – replacing λi with λi − tµi and increasing t ≥ 0 until the first of the weights vanishes – we preserve the representation of x as a convex combination while reducing the number of terms with positive coefficients, which contradicts the minimality of N.
Proof. Since N > n + 1, the homogeneous system of n + 1 scalar equations with N unknowns µ1, ..., µN
Σ_{i=1}^N µi xi = 0,
Σ_{i=1}^N µi = 0
has a nontrivial solution λ1, ..., λN.
Let I = {i : λi ≥ 0}, J = {i : λi < 0}; then I and J are nonempty and form a partitioning of {1, ..., N}. We have
a ≡ Σ_{i∈I} λi = Σ_{j∈J} (−λj) > 0
(since the sum of all λ’s is zero and not all λ’s are zero). Setting
αi = λi / a, i ∈ I, βj = −λj / a, j ∈ J,
we get
αi ≥ 0, βj ≥ 0, Σ_{i∈I} αi = 1, Σ_{j∈J} βj = 1,
and
[Σ_{i∈I} αi xi] − [Σ_{j∈J} βj xj] = a⁻¹ ([Σ_{i∈I} λi xi] − [Σ_{j∈J} (−λj)xj]) = a⁻¹ Σ_{i=1}^N λi xi = 0,
so that the convex combination Σ_{i∈I} αi xi of the points indexed by I and the convex combination Σ_{j∈J} βj xj of the points indexed by J coincide, as required.
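The proof just given is fully constructive: a nontrivial solution of the homogeneous system yields the partitioning and the common point. The sketch below (the helper radon_partition is hypothetical, not part of the notes) follows the proof, extracting a nullspace vector via SVD:

```python
import numpy as np

def radon_partition(X):
    """Given N >= n+2 points (rows of X) in R^n, return index masks I, J and the
    common point of Conv{x_i : i in I} and Conv{x_j : j in J}, following the
    proof: find a nontrivial lam with sum lam_i x_i = 0, sum lam_i = 0."""
    N, n = X.shape
    M = np.vstack([X.T, np.ones((1, N))])   # (n+1) x N system matrix
    lam = np.linalg.svd(M)[2][-1]           # a nullspace vector (exists: N > n+1)
    I, J = lam >= 0, lam < 0                # split by sign, as in the proof
    a = lam[I].sum()                        # = sum over J of (-lam_j) > 0
    point = (lam[I] / a) @ X[I]             # convex combination over I
    assert np.allclose(point, (-lam[J] / a) @ X[J])  # ... equals the one over J
    return I, J, point

X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.7, 0.7]])  # 4 points in R^2
print(radon_partition(X))
```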
Proof. Let us prove the statement by induction on the number N of sets in the family. The case of
N ≤ n + 1 is evident. Now assume that the statement holds true for all families with certain number
N ≥ n + 1 of sets, and let S1 , ..., SN , SN +1 be a family of N + 1 convex sets which satisfies the premise
of the Helley Theorem; we should prove that the intersection of the sets S1 , ..., SN , SN +1 is nonempty.
Deleting from our (N + 1)-set family the set Si, we get an N-set family which satisfies the premise of the Helley Theorem and thus, by the inductive hypothesis, the intersection of its members is nonempty:
T^i ≡ ∩_{j ≤ N+1, j ≠ i} Sj ≠ ∅.
Let us choose a point xi in the (nonempty) set T^i. We get N + 1 ≥ n + 2 points from Rn. By Radon’s
Theorem, we can partition the index set {1, ..., N + 1} into two nonempty subsets I and J in such a way
that certain convex combination x of the points xi , i ∈ I, is a convex combination of the points xj , j ∈ J,
as well. Let us verify that x belongs to all the sets S1 , ..., SN +1 , which will complete the proof. Indeed,
let i* be an index from our index set; let us prove that x ∈ Si*. We have either i* ∈ I, or i* ∈ J. In the first case all the sets T^j, j ∈ J, are contained in Si* (since Si* participates in all intersections which give T^j with j ≠ i*). Consequently, all the points xj, j ∈ J, belong to Si*, and therefore x, which is
a convex combination of these points, also belongs to Si∗ (all our sets are convex!), as required. In the
second case similar reasoning says that all the points xi , i ∈ I, belong to Si∗ , and therefore x, which is a
convex combination of these points, belongs to Si∗ .
Exercise 1.9 Let S1 , ..., SN be a family of N convex sets in Rn , and let m be the affine dimension of
Aff(S1 ∪ ... ∪ SN ). Assume that every m + 1 sets from the family have a point in common. Prove that all
sets from the family have a point in common.
In the aforementioned version of the Helley Theorem we dealt with finite families of convex sets. To
extend the statement to the case of infinite families, we need to strengthen slightly the assumption. The
resulting statement is as follows:
Theorem 1.2.4 [Helley, II] Let F be an arbitrary family of convex sets in Rn . Assume that
(a) every n + 1 sets from the family have a point in common,
and
(b) every set in the family is closed, and the intersection of the sets from certain finite subfamily of
the family is bounded (e.g., one of the sets in the family is bounded).
Then all the sets from the family have a point in common.
Proof. By the previous theorem, all finite subfamilies of F have nonempty intersections, and these intersections are convex (since the intersection of a family of convex sets is convex, Theorem 1.1.3); in view of (b) these intersections are also closed. Adding to F all intersections of finite subfamilies of F, we get a larger family F′ comprised of closed convex sets, and a finite subfamily of this larger family again has a nonempty intersection. Besides this, from (b) it follows that this new family contains a bounded set Q. Since all the sets are closed, the family of sets
{Q ∩ Q′ : Q′ ∈ F′}
is a nested family of compact sets (i.e., a family of compact sets with nonempty intersection of sets from every finite subfamily); by the well-known Analysis theorem, such a family has a nonempty intersection.
Note that every polyhedrally representable set is the image under linear mapping (even a projection) of
a polyhedral, and thus convex, set. It follows that a polyhedrally representable set definitely is convex
(Proposition 1.1.6).
Examples: 1) Every polyhedral set X = {x ∈ Rn : Ax ≤ b} is polyhedrally representable – a polyhedral
description of X is nothing but a polyhedral representation with no slack variables (k = 0). Vice versa,
a polyhedral representation of a set X with no slack variables (k = 0) clearly is a polyhedral description
of the set (which therefore is polyhedral).
2) Looking at the set X = {x ∈ Rn : Σ_{i=1}^n |xi| ≤ 1}, we cannot say immediately whether it is or is not polyhedral; at least the initial description of X is not of the form {x : Ax ≤ b}. However, X admits a polyhedral representation, e.g., the representation
X = {x ∈ Rn : ∃u ∈ Rn : −ui ≤ xi ≤ ui (⇔ |xi| ≤ ui), 1 ≤ i ≤ n, Σ_{i=1}^n ui ≤ 1}. (1.2.2)
Note that the set X in question can be described by a system of linear inequalities in x-variables only, namely, as
X = {x ∈ Rn : Σ_{i=1}^n εi xi ≤ 1 ∀(εi = ±1, 1 ≤ i ≤ n)},
that is, X is polyhedral. However, the above polyhedral description of X (which in fact is minimal in terms of the number of inequalities involved) requires 2^n inequalities – an astronomically large number when n is just a few tens. In contrast to this, the polyhedral representation (1.2.2) of the same set requires just n slack variables u and 2n + 1 linear inequalities on x, u – the “complexity” of this representation is just linear in n.
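To see the representation (1.2.2) at work, one can minimize a linear form over X by passing to the (x, u)-space and invoking any LP solver; in the sketch below (an arbitrary instance, solved with scipy's linprog) the 2^n-inequality description is never formed:

```python
import numpy as np
from scipy.optimize import linprog

# Minimize c^T x over the l1 ball via (1.2.2): variables z = (x, u), constraints
# -u_i <= x_i <= u_i and sum_i u_i <= 1 -- only 2n+1 inequalities instead of 2^n
n = 10
rng = np.random.default_rng(3)
c = rng.normal(size=n)

I = np.eye(n)
A_ub = np.vstack([
    np.hstack([ I, -I]),                  #  x_i - u_i <= 0
    np.hstack([-I, -I]),                  # -x_i - u_i <= 0
    np.hstack([np.zeros(n), np.ones(n)])  #  sum_i u_i <= 1
])
b_ub = np.zeros(2 * n + 1); b_ub[-1] = 1.0
res = linprog(np.concatenate([c, np.zeros(n)]), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * n + [(0, None)] * n)
print(res.x[:n], res.fun)   # optimum is -max_i |c_i|, at a signed unit vector
```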
3) Let a1, ..., am be given vectors in Rn. Consider the conic hull of the finite set {a1, ..., am} – the set Cone {a1, ..., am} = {x = Σ_{i=1}^m λi ai : λ ≥ 0} (see Section 1.1.4). It is absolutely unclear whether this set is polyhedral. In contrast to this, its polyhedral representation is immediate:
Cone {a1, ..., am} = {x ∈ Rn : ∃λ ≥ 0 : x = Σ_{i=1}^m λi ai}
= {x ∈ Rn : ∃λ ∈ Rm : −λ ≤ 0, x − Σ_{i=1}^m λi ai ≤ 0, −x + Σ_{i=1}^m λi ai ≤ 0}.
In other words, the original description of the conic hull is nothing but its polyhedral representation (in slight disguise), with the λi’s in the role of slack variables.
Now let us prove that the projection
X = {x : ∃u : Ax + bu ≤ c}
of a polyhedral set Y = {(x, u) : Ax + bu ≤ c} with a single slack variable u on the space of x-variables is polyhedral. To see it, let us split the inequalities defining Y into three groups (some of them can be empty):
— “black” inequalities – those with bi = 0; these inequalities do not involve u at all;
— “red” inequalities – those with bi > 0. Such an inequality can be rewritten equivalently as u ≤ [ci − a_i^T x]/bi, and it imposes a (depending on x) upper bound on u;
— “green” inequalities – those with bi < 0. Such an inequality can be rewritten equivalently as u ≥ [ci − a_i^T x]/bi, and it imposes a (depending on x) lower bound on u.
Now it is clear when x ∈ X, that is, when x can be extended, by some u, to a point (x, u) from Y : this
is the case if and only if, first, x satisfies all black inequalities, and, second, the red upper bounds on u
specified by x are compatible with the green lower bounds on u specified by x, meaning that every lower
bound is ≤ every upper bound (the latter is necessary and sufficient to be able to find a value of u which
is ≥ all lower bounds and ≤ all upper bounds). Thus,
X = {x : a_i^T x ≤ ci for all “black” indexes i (those with bi = 0), and
[cj − a_j^T x]/bj ≤ [ck − a_k^T x]/bk for all “green” (i.e., with bj < 0) indexes j and all “red” (i.e., with bk > 0) indexes k}.
We see that X is given by finitely many nonstrict linear inequalities in x-variables only, as claimed.
The outlined procedure for building polyhedral descriptions (i.e., polyhedral representations not in-
volving slack variables) for projections of polyhedral sets is called Fourier-Motzkin elimination.
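A single elimination step is short to code. The sketch below (the helper fm_eliminate_last is hypothetical and skips numerical safeguards) implements the black/red/green classification verbatim for a system Ax + bu ≤ c with b the last column:

```python
import numpy as np

def fm_eliminate_last(A, c):
    """One Fourier-Motzkin step: from {(x,u): A x + b u <= c}, with b = A[:, -1],
    produce (A', c') describing the projection {x : exists u}."""
    b, A = A[:, -1], A[:, :-1]
    black = [(A[i], c[i]) for i in range(len(b)) if b[i] == 0]
    red   = [(A[i] / b[i], c[i] / b[i]) for i in range(len(b)) if b[i] > 0]  # u <= c_r - a_r x
    green = [(A[i] / b[i], c[i] / b[i]) for i in range(len(b)) if b[i] < 0]  # u >= c_g - a_g x
    rows, rhs = [a for a, _ in black], [g for _, g in black]
    for ag, cg in green:          # every green lower bound <= every red upper bound
        for ar, cr in red:
            rows.append(ar - ag)  # c_g - a_g x <= c_r - a_r x, rearranged
            rhs.append(cr - cg)
    return np.array(rows), np.array(rhs)

# project {(x, u): x - u <= 0, -x - u <= 0, u <= 1} onto x: yields -1 <= x <= 1
A = np.array([[1.0, -1.0], [-1.0, -1.0], [0.0, 1.0]])
c = np.array([0.0, 0.0, 1.0])
print(fm_eliminate_last(A, c))
```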
As an instructive application, consider a Linear Programming program min_x {c^T x : Ax ≤ b}, and let T be the set of values of the objective at feasible solutions:
T = {t ∈ R : ∃x : c^T x = t, Ax ≤ b}.
Rewriting the linear equality cT x = t as a pair of opposite inequalities, we see that T is polyhedrally
representable, and the above definition of T is nothing but a polyhedral representation of this set, with
x in the role of the vector of slack variables. By Fourier-Motzkin elimination, T is polyhedral – this set
is given by a finite system of nonstrict linear inequalities in variable t only. As such, as it is immediately
seen, T is
— either empty (meaning that the LP in question is infeasible),
— or is a below unbounded nonempty set of the form {t ∈ R : −∞ ≤ t ≤ b} with b ∈ R ∪ {+∞}
(meaning that the LP is feasible and unbounded),
— or is a below bounded nonempty set of the form {t ∈ R : −a ≤ t ≤ b} with a ∈ R and +∞ ≥ b ≥ a.
In this case, the LP is feasible and bounded, and a is its optimal value.
Note that given the list of linear inequalities defining T (this list can be built algorithmically by Fourier-
Motzkin elimination as applied to the original polyhedral representation of T ), we can easily detect which
one of the above cases indeed takes place, i.e., to identify the feasibility and boundedness status of the
LP and to find its optimal value. When it is finite (case 3 above), we can use the Fourier-Motzkin
elimination backward, starting with t = a ∈ T and extending this value to a pair (t, x) with t = a = c^T x and Ax ≤ b, that is, we can accompany the optimal value with an optimal solution. Thus, we can say that Fourier-Motzkin elimination is a finite Real Arithmetics algorithm which allows to check whether an LP is feasible and bounded, and when it is the case, allows to find the optimal value and an optimal solution.
An unpleasant fact of life is that this algorithm is completely impractical, since the elimination process
can blow up exponentially the number of inequalities. Indeed, from the description of the process it is
clear that if a polyhedral set is given by m linear inequalities, then eliminating one variable, we can end up with as many as m²/4 inequalities (this is what happens if there are m/2 red, m/2 green and no black inequalities). Eliminating the next variable, we again can “nearly square” the number of inequalities, and so on. Thus, the number of inequalities in the description of T can become astronomically large even when the dimension of x is something like 10. The actual importance of Fourier-Motzkin
elimination is of theoretical nature. For example, the LP-related reasoning we have just carried out
shows that every feasible and bounded LP program is solvable – has an optimal solution (we shall revisit
this result in more detail in Section 1.2.9.2). This is a fundamental fact for LP, and the above reasoning
(even with the justification of the elimination “charged” to it) is the shortest and most transparent way
to prove this fundamental fact. Another application of the fact that polyhedrally representable sets are
polyhedral is the Homogeneous Farkas Lemma to be stated and proved in section 1.2.5.A; this lemma
will be instrumental in numerous subsequent theoretical developments.
2. Taking direct product: Let Mi ⊂ Rni , 1 ≤ i ≤ m, be polyhedral sets given by polyhedral repre-
sentations
Mi = {xi ∈ Rni : ∃ui ∈ Rki : Ai xi + Bi ui ≤ ci }, 1 ≤ i ≤ m.
Then the direct product M1 × ... × Mm := {x = (x1 , ..., xm ) : xi ∈ Mi , 1 ≤ i ≤ m} of the sets is a
polyhedral set with explicit polyhedral representation, specifically,
M1 × ... × Mm = {x = (x1 , ..., xm ) ∈ Rn1 +...+nm :
∃u = (u1 , ..., um ) ∈ Rk1 +...+km : Ai xi + Bi ui ≤ ci , 1 ≤ i ≤ m}.
3. Taking arithmetic sums with coefficients: if Mi ⊂ Rn, 1 ≤ i ≤ m, are polyhedral sets given by polyhedral representations Mi = {xi ∈ Rn : ∃ui ∈ Rki : Ai xi + Bi ui ≤ ci}, and λ1, ..., λm are reals, then
λ1 M1 + ... + λm Mm = {x ∈ Rn : ∃(xi ∈ Rn, ui ∈ Rki, 1 ≤ i ≤ m) :
x ≤ Σ_i λi xi, x ≥ Σ_i λi xi, Ai xi + Bi ui ≤ ci, 1 ≤ i ≤ m}.
4. Taking the image under an affine mapping: Let M ⊂ Rn be a polyhedral set given by polyhedral
representation
M = {x ∈ Rn : ∃u ∈ Rk : Ax + Bu ≤ c}
and let P(x) = Px + p : Rn → Rm be an affine mapping. Then the image P(M) := {y = Px + p : x ∈ M} of M under the mapping is a polyhedral set with explicit polyhedral representation, specifically,
P(M) = {y ∈ Rm : ∃(x ∈ Rn, u ∈ Rk) : y ≤ Px + p, y ≥ Px + p, Ax + Bu ≤ c}.
5. Taking the inverse image under affine mapping: Let M ⊂ Rn be a polyhedral set given by polyhedral representation
M = {x ∈ Rn : ∃u ∈ Rk : Ax + Bu ≤ c}
and let P(y) = Py + p : Rm → Rn be an affine mapping. Then the inverse image P⁻¹(M) := {y : Py + p ∈ M} of M under the mapping is a polyhedral set with explicit polyhedral representation, specifically,
P −1 (M ) = {y ∈ Rm : ∃u : A(P y + p) + Bu ≤ c}.
Note that rules for intersection, taking direct products and taking inverse images, as applied to polyhedral
descriptions of operands, lead to polyhedral descriptions of the results. In contrast to this, the rules for
taking sums with coefficients and images under affine mappings heavily exploit the notion of polyhedral
representation: even when the operands in these rules are given by polyhedral descriptions, there are no
simple ways to point out polyhedral descriptions of the results.
Finally, we note that the problem of minimizing a linear form cT x over a set M given by polyhedral
representation:
M = {x ∈ Rn : ∃u ∈ Rk : Ax + Bu ≤ c}
can be immediately reduced to an explicit LP program, namely,
min_{x,u} { c^T x : Ax + Bu ≤ c }.
A reader with some experience in Linear Programming has certainly made extensive use of the above “calculus of polyhedral representations” when building LPs (perhaps without a clear understanding of what in fact is going on, same as Molière’s Monsieur Jourdain all his life had been speaking prose without knowing it).
If a vector a is a conic combination of vectors a1, ..., aN – a linear combination with nonnegative coefficients λi – then every vector h which has nonnegative inner products with all ai should also have a nonnegative inner product with a:
a = Σ_i λi ai & λi ≥ 0 ∀i & h^T ai ≥ 0 ∀i ⇒ h^T a ≥ 0.
The Homogeneous Farkas Lemma says that this evident necessary condition is also sufficient:
Lemma 1.2.1 [Homogeneous Farkas Lemma] Let a, a1 , ..., aN be vectors from Rn . The vector a is a
conic combination of the vectors ai (linear combination with nonnegative coefficients) if and only if every
vector h satisfying hT ai ≥ 0, i = 1, ..., N , satisfies also hT a ≥ 0. In other words, a homogeneous linear
inequality
aT h ≥ 0
in variable h is a consequence of the system
aTi h ≥ 0, 1 ≤ i ≤ N
of homogeneous linear inequalities if and only if it can be obtained from the inequalities of the system by
“admissible linear aggregation” – taking their weighted sum with nonnegative weights.
Proof. The necessity – the “only if” part of the statement – was proved before the Farkas Lemma
was formulated. Let us prove the “if” part of the Lemma. Thus, assume that every vector h satisfying
hT ai ≥ 0 ∀i satisfies also hT a ≥ 0, and let us prove that a is a conic combination of the vectors ai .
An “intelligent” proof goes as follows. The set Cone {a1, ..., aN} of all conic combinations of a1, ..., aN is polyhedrally representable (Example 3 in Section 1.2.5.A.1) and as such is polyhedral (Theorem 1.2.5):
Cone {a1, ..., aN} = {x : p_j^T x ≥ bj, 1 ≤ j ≤ J}. (!)
Observing that 0 ∈ Cone {a1, ..., aN}, we conclude that bj ≤ 0 for all j; and since λai ∈ Cone {a1, ..., aN} for every i and every λ ≥ 0, we should have λ p_j^T ai ≥ bj for all i, j and all λ ≥ 0, whence p_j^T ai ≥ 0 for all i and j. For every j, the relation p_j^T ai ≥ 0 for all i implies, by the premise of the statement we want to prove, that p_j^T a ≥ 0; and since bj ≤ 0, we see that p_j^T a ≥ bj for all j, meaning that a indeed belongs to Cone {a1, ..., aN} due to (!).
An interested reader can get a better understanding of the power of Fourier-Motzkin elimination,
which ultimately is the basis for the above intelligent proof, by comparing this proof with the one based
on Helley’s Theorem.
Proof based on Helley’s Theorem. As above, we assume that every vector h satisfying hT ai ≥ 0
∀i satisfies also hT a ≥ 0, and we want to prove that a is a conic combination of the vectors ai .
There is nothing to prove when a = 0 – the zero vector of course is a conic combination of the vectors
ai . Thus, from now on we assume that a 6= 0.
1°. Let
Π = {h : a^T h = −1},
and let
Ai = {h ∈ Π : a_i^T h ≥ 0}.
Π is a hyperplane in Rn , and every Ai is a polyhedral set contained in this hyperplane and is therefore
convex.
2°. What we know is that the intersection of all the sets Ai, i = 1, ..., N, is empty (since a vector h
from the intersection would have nonnegative inner products with all ai and the inner product −1 with
a, and we are given that no such h exists). Let us choose the smallest, in the number of elements, of
those sub-families of the family of sets A1 , ..., AN which still have empty intersection of their members;
without loss of generality we may assume that this is the family A1 , ..., Ak . Thus, the intersection of all
k sets A1 , ..., Ak is empty, but the intersection of every k − 1 sets from the family A1 , ..., Ak is nonempty.
3°. We claim that
(A) a ∈ Lin({a1 , ..., ak });
(B) The vectors a1 , ..., ak are linearly independent.
(A) is easy: assuming that a ∉ E = Lin({a1, ..., ak}), we conclude that the orthogonal projection f of the vector a onto the orthogonal complement E⊥ of E is nonzero. The inner product of f and a is the same as f^T f, i.e., is positive, while f^T ai = 0, i = 1, ..., k. Taking h = −(f^T f)⁻¹f, we see that h^T a = −1 and h^T ai = 0, i = 1, ..., k. In other words, h belongs to every set Ai, i = 1, ..., k, by definition of these sets, and therefore the intersection of the sets A1, ..., Ak is nonempty, which is a contradiction.
(B) is given by the Helley Theorem I. It is evident when k = 1, since in this case linear dependence of a1, ..., ak would mean that a1 = 0; by (A), this would imply that a = 0, which is not the case. Now let us prove (B) in the case of k > 1. Assume, on the contrary
to what should be proven, that a1, ..., ak are linearly dependent, so that the dimension of E = Lin({a1, ..., ak}) is a certain m < k. We already know from (A) that a ∈ E. Now let
A′i = Ai ∩ E. We claim that every k − 1 of the sets A′i have a nonempty intersection, while all k of these sets have empty intersection. The second claim is evident – since the sets A1, ..., Ak have empty intersection, the same is the case with their parts A′i. The first claim also is easily supported: let us take k − 1 of the dashed sets, say, A′1, ..., A′k−1. By construction, the intersection of A1, ..., Ak−1 is nonempty; let h be a vector from this intersection, i.e., a vector with nonnegative inner products with a1, ..., ak−1 and the product −1 with a. When replacing h with its orthogonal projection h′ on E, we do not vary all these inner products, since these are products with vectors from E; thus, h′ also is a common point of A1, ..., Ak−1, and since this is a point from E, it is a common point of the dashed sets A′1, ..., A′k−1 as well.
Now we can complete the proof of (B): the sets A′1, ..., A′k are convex sets belonging to the hyperplane Π′ = Π ∩ E = {h ∈ E : a^T h = −1} (Π′ indeed is a hyperplane in E, since 0 ≠ a ∈ E) in the m-dimensional linear subspace E. Π′ is an affine subspace of the affine dimension ℓ = dim E − 1 = m − 1 < k − 1 (recall that we are in the situation when m = dim E < k), and every ℓ + 1 ≤ k − 1 sets from the family A′1, ..., A′k have a nonempty intersection. From the Helley Theorem I (see Exercise 1.9) it follows that all the sets A′1, ..., A′k have a point in common, which, as we know, is not the case. The contradiction we have got proves that a1, ..., ak are linearly independent.
4°. With (A) and (B) at our disposal, we can easily complete the proof of the “if” part of the Farkas Lemma. Specifically, by (A), we have
a = Σ_{i=1}^k λi ai
with some real coefficients λi , and all we need is to prove that these coefficients are nonnegative. Assume,
on the contrary, that, say, λ1 < 0. Let us extend the (linearly independent in view of (B)) system of
vectors a1 , ..., ak by vectors f1 , ..., fn−k to a basis in Rn , and let ξi (x) be the coordinates of a vector x in
this basis. The function ξ1 (x) is a linear form of x and therefore is the inner product with certain vector:
ξ1 (x) = f T x ∀x.
Now we have
f^T a = ξ1(a) = λ1 < 0
and
f^T ai = 1 for i = 1, f^T ai = 0 for i = 2, ..., k,
so that f^T ai ≥ 0, i = 1, ..., k. We conclude that a proper normalization of f – namely, the vector |λ1|⁻¹f – belongs to every one of the sets A1, ..., Ak, i.e., to their intersection, which is the desired contradiction – by construction, this intersection is empty.
(?) How to recognize whether a given system of inequalities, e.g., the linear system
Ax − b ≥ 0,
has no solutions.
The general question above is too difficult, and it makes sense to pass from it to a seemingly simpler
one:
(??) How to certify that (S) has, or does not have, a solution.
Imagine that you are very smart and know the correct answer to (?); how could you convince me that
your answer is correct? What could be an “evident for everybody” validity certificate for your answer?
If your claim is that (S) is solvable, a certificate could be just to point out a solution x∗ to (S). Given
this certificate, one can substitute x∗ into the system and check whether x∗ indeed is a solution.
Assume now that your claim is that (S) has no solutions. What could be a “simple certificate”
of this claim? How one could certify a negative statement? This is a highly nontrivial problem not
just for mathematics; for example, in criminal law: how should someone accused of a murder prove his
innocence? The “real life” answer to the question “how to certify a negative statement” is discouraging:
such a statement normally cannot be certified (this is where the rule “a person is presumed innocent until
proven guilty” comes from). In mathematics, however, the situation is different: in some cases there exist
“simple certificates” of negative statements. For example, in order to certify that (S) has no solutions,
it suffices to demonstrate that a consequence of (S) is a contradictory inequality such as
−1 ≥ 0.
For example, assume that λi, i = 1, ..., m, are nonnegative weights. Combining inequalities from (S) with these weights, we come to the inequality
Σ_{i=1}^m λi fi(x) Ω 0 (Comb(λ))
where Ω is either ” > ” (this is the case when the weight of at least one strict inequality from (S) is
positive), or ” ≥ ” (otherwise). Since the resulting inequality, due to its origin, is a consequence of the
system (S), i.e., it is satisfied by every solution to (S), it follows that if (Comb(λ)) has no solutions at
all, we can be sure that (S) has no solution. Whenever this is the case, we may treat the corresponding
vector λ as a “simple certificate” of the fact that (S) is infeasible.
Let us look at what the outlined approach means when (S) is comprised of linear inequalities:
(S): { a_i^T x Ωi bi, i = 1, ..., m }, where Ωi is either “>” or “≥”.
Here the aggregated inequality (Comb(λ)) reads
(Σ_{i=1}^m λi ai)^T x Ω Σ_{i=1}^m λi bi
(Ω is “>” whenever λi > 0 for at least one i with Ωi = “>”, and Ω is “≥” otherwise). Now, when
can a linear inequality
dT x Ω e
be contradictory? Of course, it can happen only when d = 0. Whether in this case the inequality is contradictory depends on the relation Ω: if Ω = “>”, then the inequality is contradictory if and only if e ≥ 0, and if Ω = “≥”, it is contradictory if and only if e > 0. We have established the following simple result:
Proposition 1.2.1 Consider a system of linear inequalities
(S): a_i^T x > bi, i = 1, ..., ms; a_i^T x ≥ bi, i = ms + 1, ..., m,
with n-dimensional vector of unknowns x. Let us associate with (S) two systems of linear inequalities and equations with m-dimensional vector of unknowns λ:

TI:
(a) λ ≥ 0;
(b) Σ_{i=1}^m λi ai = 0;
(cI) Σ_{i=1}^m λi bi ≥ 0;
(dI) Σ_{i=1}^{ms} λi > 0;

TII:
(a) λ ≥ 0;
(b) Σ_{i=1}^m λi ai = 0;
(cII) Σ_{i=1}^m λi bi > 0.
Assume that at least one of the systems TI , TII is solvable. Then the system (S) is infeasible.
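For a system of nonstrict inequalities Ax ≥ b (so that ms = 0 and only TII is relevant), a certificate λ can be searched for by solving an auxiliary LP; the sketch below (the helper function and the normalization Σ λi ≤ 1 are choices of this illustration, not of the notes) uses scipy's linprog:

```python
import numpy as np
from scipy.optimize import linprog

def infeasibility_certificate(A, b):
    """For the all-nonstrict system Ax >= b, look for a TII-type certificate:
    lambda >= 0, A^T lambda = 0, b^T lambda > 0 (normalized by sum lambda <= 1).
    Returns lambda when the system is infeasible, None otherwise (a sketch)."""
    m, n = A.shape
    res = linprog(-b,                         # maximize b^T lambda
                  A_eq=A.T, b_eq=np.zeros(n), # sum_i lambda_i a_i = 0
                  A_ub=np.ones((1, m)), b_ub=[1.0],
                  bounds=[(0, None)] * m)
    if res.success and -res.fun > 1e-9:
        return res.x
    return None

# x >= 1 and -x >= 0 together are infeasible; equal weights certify it
A = np.array([[1.0], [-1.0]])
b = np.array([1.0, 0.0])
print(infeasibility_certificate(A, b))   # e.g. [0.5, 0.5]
```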
1.2.5.2.2 General Theorem on Alternative. Proposition 1.2.1 says that in some cases it is easy
to certify infeasibility of a linear system of inequalities: a “simple certificate” is a solution to another
system of linear inequalities. Note, however, that the existence of a certificate of this latter type is to the
moment only a sufficient, but not a necessary, condition for the infeasibility of (S). A fundamental result
in the theory of linear inequalities is that the sufficient condition in question is in fact also necessary:
Theorem 1.2.6 [General Theorem on Alternative] In the notation from Proposition 1.2.1, system (S)
has no solutions if and only if either TI , or TII , or both these systems, are solvable.
Proof. GTA is a more or less straightforward corollary of the Homogeneous Farkas Lemma. Indeed, in
view of Proposition 1.2.1, all we need to prove is that if (S) has no solution, then at least one of the
systems TI , or TII is solvable. Thus, assume that (S) has no solutions, and let us look at the consequences.
Let us associate with (S) the following system of homogeneous linear inequalities in variables x, τ, ε:
(a) τ − ε ≥ 0;
(b) a_i^T x − bi τ − ε ≥ 0, i = 1, ..., ms; (1.2.3)
(c) a_i^T x − bi τ ≥ 0, i = ms + 1, ..., m.
We claim that (S) has no solutions if and only if the homogeneous linear inequality
−ε ≥ 0 (1.2.4)
in variables x, τ, ε is a consequence of the system (1.2.3). Taking the claim for granted, assume that (S) has no solutions. Then (1.2.4) is a consequence of (1.2.3), so that by the Homogeneous Farkas Lemma there exist nonnegative weights ν (the weight of (1.2.3.a)) and λ1, ..., λm (the weights of (1.2.3.b) and (1.2.3.c)) such that the corresponding aggregation of (1.2.3) is exactly (1.2.4), that is,
(a) Σ_{i=1}^m λi ai = 0; (b) Σ_{i=1}^m λi bi = ν; (c) ν + Σ_{i=1}^{ms} λi = 1. (1.2.5)
Recall that by their origin, ν and all λi are nonnegative. Now, it may happen that λ1, ..., λms are zero. In this case ν > 0 by (1.2.5.c), and relations (1.2.5.a−b) say that λ1, ..., λm solve TII. In the remaining case (that is, when not all λ1, ..., λms are zero, or, which is the same, when Σ_{i=1}^{ms} λi > 0), the same relations say that λ1, ..., λm solve TI.
1.2.5.2.3 Corollaries of the Theorem on Alternative. We formulate here explicitly two very
useful principles following from the Theorem on Alternative:
A. A system of linear inequalities
aTi x Ωi bi , i = 1, ..., m
has no solutions if and only if one can combine the inequalities of the system in a linear fashion
(i.e., multiplying the inequalities by nonnegative weights, adding the results and passing, if
necessary, from an inequality aT x > b to the inequality aT x ≥ b) to get a contradictory
inequality, namely, either the inequality 0T x ≥ 1, or the inequality 0T x > 0.
B. A linear inequality
aT0 x Ω0 b0
is a consequence of a solvable system of linear inequalities
aTi x Ωi bi , i = 1, ..., m
if and only if it can be obtained by combining, in a linear fashion, the inequalities of the
system and the trivial inequality 0 > −1.
It should be stressed that the above principles are highly nontrivial and very deep. Consider, e.g., the following system of 4 linear inequalities with two variables u, v:
−1 ≤ u ≤ 1,
−1 ≤ v ≤ 1.
From these inequalities, by squaring, we get u² ≤ 1 and v² ≤ 1, whence
u² + v² ≤ 2, (!)
and therefore, by the Cauchy inequality,
u + v ≤ √2 · √(u² + v²) ≤ 2. (!!)
The concluding inequality is linear and is a consequence of the original system, but in the
demonstration of this fact both steps (!) and (!!) are “highly nonlinear”. It is absolutely
unclear a priori why the same consequence can, as it is stated by Principle A, be derived
from the system in a linear manner as well [of course it can – it suffices just to add two
inequalities u ≤ 1 and v ≤ 1].
Note that the Theorem on Alternative and its corollaries A and B heavily exploit the fact that
we are speaking about linear inequalities. For example, consider the following 2 quadratic
and 2 linear inequalities with two variables:
(a) u2 ≥ 1;
(b) v 2 ≥ 1;
(c) u ≥ 0;
(d) v ≥ 0;
(e) uv ≥ 1.
The inequality (e) is clearly a consequence of (a) – (d). However, if we extend the system of inequalities (a) – (d) by all “trivial” (i.e., identically true) linear and quadratic inequalities with 2 variables, like 0 > −1, u² + v² ≥ 0, u² + 2uv + v² ≥ 0, u² − uv + v² ≥ 0, etc.,
and ask whether (e) can be derived in a linear fashion from the inequalities of the extended
system, the answer will be negative. Thus, Principle A fails to be true already for quadratic inequalities (which is a great sorrow – otherwise there would be no difficult problems at all!)
1.2.5.3.1 Dual to an LP program: the origin. The motivation for constructing the problem dual to an LP program
c* = min_x { c^T x : Ax − b ≥ 0 }, where A is the m × n matrix with the rows a_1^T, ..., a_m^T, (LP)
is the desire to generate, in a systematic way, lower bounds on the optimal value c* of (LP). An evident way to bound from below a given function f(x) on the set X of solutions to a given system of inequalities
gi(x) ≥ bi, i = 1, ..., m, (1.2.6)
is offered by what is called the Lagrange duality (to be investigated in-depth in Section 3.2) and is as follows:
Lagrange Duality:
• Let us look at all inequalities which can be obtained from (1.2.6) by linear aggregation, i.e., at the inequalities of the form
Σ_i yi gi(x) ≥ Σ_i yi bi (1.2.7)
with the “aggregation weights” yi ≥ 0. Note that the inequality (1.2.7), due to its origin, is
valid on the entire set X of solutions of (1.2.6).
• Depending on the choice of aggregation weights, it may happen that the left hand side in (1.2.7) is ≤ f(x) for all x ∈ Rn. Whenever it is the case, the right hand side Σ_i yi bi of (1.2.7) is a lower bound on f in X.
Indeed, on X the quantity Σ_i yi bi is a lower bound on Σ_i yi gi(x), and for y in question the latter function of x is everywhere ≤ f(x).
It follows that
• The optimal value in the problem
max_y { Σ_i yi bi : y ≥ 0 (a), Σ_i yi gi(x) ≤ f(x) ∀x ∈ Rn (b) } (1.2.8)
is a lower bound on the values of f on the set of solutions to the system (1.2.6).
Let us look at what happens with the Lagrange duality when f and gi are homogeneous linear functions: f = c^T x, gi(x) = a_i^T x. In this case, the requirement (1.2.8.b) merely says that c = Σ_i yi ai (or, which is the same, A^T y = c due to the origin of A). Thus, problem (1.2.8) becomes the Linear Programming problem
max_y { b^T y : A^T y = c, y ≥ 0 }. (LP*)
Now let us use the General Theorem on Alternative to understand when a given real a is a lower bound on the optimal value c* of (LP), i.e., when the system of linear inequalities
(Sa): c^T x < a, Ax ≥ b
has no solution. We know by the Theorem on Alternative that the latter fact means that some other system of linear inequalities (more exactly, at least one of a certain pair of systems) does have a solution. More precisely,
(*) (Sa) has no solutions if and only if at least one of the following two systems with m + 1 unknowns λ = (λ0, λ1, ..., λm) has a solution:

TI:
(a) λ = (λ0, λ1, ..., λm) ≥ 0;
(b) −λ0 c + Σ_{i=1}^m λi ai = 0;
(cI) −λ0 a + Σ_{i=1}^m λi bi ≥ 0;
(dI) λ0 > 0,

or

TII:
(a) λ = (λ0, λ1, ..., λm) ≥ 0;
(b) −λ0 c + Σ_{i=1}^m λi ai = 0;
(cII) −λ0 a + Σ_{i=1}^m λi bi > 0

– has a solution.
Now assume that (LP) is feasible. We claim that under this assumption (Sa ) has no solutions if and
only if TI has a solution.
The implication ”TI has a solution ⇒ (Sa ) has no solution” is readily given by the above
remarks. To verify the inverse implication, assume that (Sa ) has no solutions and the system
Ax ≤ b has a solution, and let us prove that then TI has a solution. If TI has no solution, then
by (*) TII has a solution and, moreover, λ0 = 0 for (every) solution to TII (since a solution
to the latter system with λ0 > 0 solves TI as well). But the fact that TII has a solution λ
with λ0 = 0 is independent of the values of a and c; if this fact would take place, it would
mean, by the same Theorem on Alternative, that, e.g., the following instance of (Sa ):
0^T x > −1, Ax ≥ b
has no solutions. The latter means that the system Ax ≥ b has no solutions – a contradiction
with the assumption that (LP) is feasible.
Now, if TI has a solution, this system has a solution with λ0 = 1 as well (to see this, pass from a solution
λ to the one λ/λ0 ; this construction is well-defined, since λ0 > 0 for every solution to TI ). Now, an
(m + 1)-dimensional vector λ = (1, y) is a solution to TI if and only if the m-dimensional vector y solves
the system of linear inequalities and equations
y ≥ 0;
A^T y ≡ Σ_{i=1}^m yi ai = c; (D)
b^T y ≥ a.
Proposition 1.2.2 Assume that system (D) associated with the LP program (LP) has a solution (y, a).
Then a is a lower bound on the optimal value in (LP). Vice versa, if (LP) is feasible and a is a lower
bound on the optimal value of (LP), then a can be extended by a properly chosen m-dimensional vector y
to a solution to (D).
We see that the entity responsible for lower bounds on the optimal value of (LP) is the system (D): every
solution to the latter system induces a bound of this type, and in the case when (LP) is feasible, all lower
bounds can be obtained from solutions to (D). Now note that if (y, a) is a solution to (D), then the pair
(y, bT y) also is a solution to the same system, and the lower bound bT y on c∗ is not worse than the lower
bound a. Thus, as far as lower bounds on c∗ are concerned, we lose nothing by restricting ourselves to
T ∗
the solutions (y, a) of (D) with
T a = Tb y; the best lower bound on c given by (D) is therefore the optimal
value of the problem maxy b y : A y = c, y ≥ 0 , which is nothing but the dual to (LP) problem (LP∗ ).
Note that (LP∗ ) is also a Linear Programming program.
All we know about the dual problem to the moment is the following:
Proposition 1.2.3 Whenever y is a feasible solution to (LP∗ ), the corresponding value of the dual
objective bT y is a lower bound on the optimal value c∗ in (LP). If (LP) is feasible, then for every a ≤ c∗
there exists a feasible solution y of (LP∗ ) with bT y ≥ a.
Theorem 1.2.7 [Duality Theorem in Linear Programming] Consider a linear programming program
min_x { c^T x : Ax ≥ b } (LP)
along with its dual (LP*).
Then
1) The duality is symmetric: the problem dual to dual is equivalent to the primal;
2) The value of the dual objective at every dual feasible solution is ≤ the value of the primal objective
at every primal feasible solution
3) The following 5 properties are equivalent to each other:
(i) The primal is feasible and bounded below.
(ii) The dual is feasible and bounded above.
(iii) The primal is solvable.
(iv) The dual is solvable.
(v) Both primal and dual are feasible.
Whenever (i) ≡ (ii) ≡ (iii) ≡ (iv) ≡ (v) is the case, the optimal values of the primal and the dual problems
are equal to each other.
Proof. 1) is quite straightforward: writing the dual problem (LP*) in our standard form, we get
min_y { −b^T y : [Im; A^T; −A^T] y − [0; c; −c] ≥ 0 },
where Im is the m-dimensional unit matrix (so that the constraints read y ≥ 0, A^T y ≥ c, −A^T y ≥ −c). Applying the duality transformation to the latter problem, we come to the problem
max_{ξ,η,ζ} { 0^T ξ + c^T η + (−c)^T ζ : ξ ≥ 0, η ≥ 0, ζ ≥ 0, ξ + Aη − Aζ = −b },
which, after setting x = ζ − η and eliminating ξ = A(ζ − η) − b ≥ 0, becomes max_x { −c^T x : Ax − b ≥ 0 }, i.e., is equivalent to the primal problem (LP).
An immediate corollary of the LP Duality Theorem is the following necessary and sufficient optimality
condition in LP:
Theorem 1.2.8 [Necessary and sufficient optimality conditions in linear programming] Consider an LP
program (LP) along with its dual (LP∗ ). A pair (x, y) of primal and dual feasible solutions is comprised
of optimal solutions to the respective problems if and only if
yi [Ax − b]i = 0, i = 1, ..., m, [complementary slackness]
or, which is the same, if and only if
c T x − bT y = 0 [zero duality gap]
Indeed, the “zero duality gap” optimality condition is an immediate consequence of the fact
that the value of primal objective at every primal feasible solution is ≥ the value of the
dual objective at every dual feasible solution, while the optimal values in the primal and the
dual are equal to each other, see Theorem 1.2.7. The equivalence between the “zero duality
gap” and the “complementary slackness” optimality conditions is given by the following
computation: whenever x is primal feasible and y is dual feasible, the products yi [Ax − b]i ,
i = 1, ..., m, are nonnegative, while the sum of these products is precisely the duality gap:
y T [Ax − b] = (AT y)T x − bT y = cT x − bT y.
Thus, the duality gap can vanish at a primal-dual feasible pair (x, y) if and only if all products
yi [Ax − b]i for this pair are zeros.
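Both the zero duality gap and complementary slackness are easy to observe numerically. The sketch below (an arbitrary solvable instance, solved with scipy's linprog; the sign flips are needed because linprog expects ≤-constraints and minimization) solves (LP) and (LP*) and prints the gap and the products yi[Ax − b]i:

```python
import numpy as np
from scipy.optimize import linprog

# Primal: min c^T x s.t. Ax >= b.   Dual: max b^T y s.t. A^T y = c, y >= 0.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, 1.0])
c = np.array([1.0, 2.0])

p = linprog(c, A_ub=-A, b_ub=-b, bounds=[(None, None)] * 2)   # Ax >= b as -Ax <= -b
d = linprog(-b, A_eq=A.T, b_eq=c, bounds=[(0, None)] * 3)     # max b^T y as min -b^T y

x, y = p.x, d.x
print(p.fun, -d.fun)     # equal optimal values: zero duality gap
print(y * (A @ x - b))   # complementary slackness: all products ~ 0
```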
• Let S and T be nonempty sets in Rn. We say that a hyperplane M = {x : a^T x = b} (a ≠ 0) separates S and T, if, first,
S ⊂ {x : a^T x ≤ b}, T ⊂ {x : a^T x ≥ b}
(i.e., S and T belong to the opposite closed half-spaces into which M splits Rn), and, second, at least one of the sets S, T is not contained in M itself:
S ∪ T ⊄ M.
The separation is called strong, if there exist b′, b″, b′ < b < b″, such that
S ⊂ {x : a^T x ≤ b′}, T ⊂ {x : a^T x ≥ b″}.
• A linear form a 6= 0 is said to separate (strongly separate) S and T , if for properly chosen b the
hyperplane {x : aT x = b} separates (strongly separates) S and T .
• We say that S and T can be (strongly) separated, if there exists a hyperplane which (strongly)
separates S and T .
For example,
• the hyperplane {x : aT x ≡ x2 − x1 = 1} in R2 strongly separates convex polyhedral sets T = {x ∈
R2 : 0 ≤ x1 ≤ 1, 3 ≤ x2 ≤ 5} and S = {x ∈ R2 : x2 = 0; x1 ≥ −1};
• the hyperplane {x : aT x ≡ x = 1} in R1 separates (but not strongly separates) the convex sets
S = {x ≤ 1} and T = {x ≥ 1};
• the hyperplane {x : a^T x ≡ x1 = 0} in R2 separates (but not strongly separates) the sets S = {x ∈ R2 : x1 < 0, x2 ≥ −1/x1} and T = {x ∈ R2 : x1 > 0, x2 > 1/x1};
• the hyperplane {x : aT x ≡ x2 − x1 = 1} in R2 does not separate the convex sets S = {x ∈ R2 :
x2 ≥ 1} and T = {x ∈ R2 : x2 = 0};
• the hyperplane {x : aT x ≡ x2 = 0} in R2 does not separate the sets S = {x ∈ R2 : x2 = 0, x1 ≤ −1}
and T = {x ∈ R2 : x2 = 0, x1 ≥ 1}.
The following Exercise presents an equivalent description of separation:
Exercise 1.11 Let S, T be nonempty convex sets in Rn . Prove that a linear form a separates S and T
if and only if
sup_{x∈S} a^T x ≤ inf_{y∈T} a^T y
and
inf_{x∈S} a^T x < sup_{y∈T} a^T y.
In particular, if S, T are closed nonempty non-intersecting convex sets and one of these sets is compact,
S and T can be strongly separated.
(i), Necessity. Assume that S, T can be separated, so that for certain a ≠ 0 we have
sup_{x∈S} a^T x ≤ inf_{y∈T} a^T y; inf_{x∈S} a^T x < sup_{y∈T} a^T y. (1.2.9)
We should lead to a contradiction the assumption that rint S and rint T have a certain point x̄ in common.
Assume that it is the case; then from the first inequality in (1.2.9) it is clear that x̄ maximizes the linear
function f (x) = aT x on S and simultaneously minimizes this function on T . Now, we have the following
simple and important
Lemma 1.2.2 A linear function f (x) = aT x can attain its maximum/minimum over a
convex set Q at a point x ∈ rint Q if and only if the function is constant on Q.
Proof. ”if” part is evident. To prove the ”only if” part, let x̄ ∈ rint Q be, say, a minimizer
of f over Q and y be an arbitrary point of Q; we should prove that f (x̄) = f (y). There is
nothing to prove if y = x̄, so let us assume that y ≠ x̄. Since x̄ ∈ rint Q, the segment [y, x̄], which is contained in Q, can be extended a little bit through the point x̄ without leaving Q (since x̄ ∈ rint Q), so that there exists z ∈ Q such that x̄ ∈ [y, z), i.e., x̄ = (1 − λ)y + λz with a certain λ ∈ (0, 1]; since y ≠ x̄, we have in fact λ ∈ (0, 1). Since f is linear, we have
f(x̄) = (1 − λ)f(y) + λf(z);
since f(x̄) ≤ min{f(y), f(z)} and 0 < λ < 1, this relation can be satisfied only when
f(x̄) = f(y) = f(z). 2
By Lemma 1.2.2, f (x) = f (x̄) on S and on T , so that f (·) is constant on S ∪ T , which yields the
desired contradiction with the second inequality in (1.2.9). 2
(i), Sufficiency. The proof of sufficiency part of the Separation Theorem is much more instructive.
There are several ways to prove it, and we choose the one which goes via the Homogeneous Farkas Lemma
1.2.1, which is extremely important in its own right.
(i), Sufficiency, Step 1: Separation of a convex polytope and a point outside the
polytope. Let us start with a seemingly very particular case of the Separation Theorem – the one where
S is the convex hull of finitely many points x₁, ..., x_N, and T is a singleton T = {x} which does not belong to S. We
intend to prove that in this case there exists a linear form which separates x and S; in fact we shall prove
even the existence of strong separation.
Let us associate with the n-dimensional vectors x₁, ..., x_N, x the (n+1)-dimensional vectors a = (x; 1) and a_i = (x_i; 1), i = 1, ..., N. We claim that a does not belong to the conic hull of a₁, ..., a_N. Indeed, if a
would be representable as a linear combination of a1 , ..., aN with nonnegative coefficients, then, looking
at the last, (n + 1)-st, coordinates in such a representation, we would conclude that the sum of coefficients
should be 1, so that the representation, actually, represents x as a convex combination of x1 , ..., xN , which
was assumed to be impossible.
Since a does not belong to the conic hull of a₁, ..., a_N, by the Homogeneous Farkas Lemma (Lemma 1.2.1) there exists a vector h = (f; α) ∈ R^{n+1} which "separates" a and a₁, ..., a_N in the sense that
h^T a > 0,  h^T a_i ≤ 0, i = 1, ..., N,
whence, of course,
h^T a > max_i h^T a_i.
Since the contributions of the (n+1)-st coordinates to all the inner products h^T a, h^T a_i are equal to each other (namely, to α), we conclude that the n-dimensional component f of h separates x and x₁, ..., x_N:
f^T x > max_i f^T x_i.
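The reasoning of Step 1 is easy to turn into a computation. Below is a hedged numerical sketch (made-up data; numpy and scipy assumed): we look for h = (f; α) with h^T a_i ≤ 0 for all i and with h^T a as large as possible over the box −1 ≤ h ≤ 1; a positive optimal value delivers the separating form f.

import numpy as np
from scipy.optimize import linprog

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # hypothetical vertices x_i
x = np.array([1.0, 1.0])                             # a point outside their hull

A_i = np.hstack([X, np.ones((3, 1))])   # rows a_i^T = (x_i^T, 1)
a = np.append(x, 1.0)                   # a = (x; 1)

# linprog minimizes, so minimize -a^T h subject to a_i^T h <= 0, -1 <= h <= 1.
res = linprog(-a, A_ub=A_i, b_ub=np.zeros(3), bounds=[(-1, 1)] * 3)
h = res.x
f = h[:2]                               # n-dimensional component of h
print("f^T x       =", f @ x)           # strictly larger than ...
print("max_i f^T x_i =", (X @ f).max()) # ... the best value on the x_i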
(i), Sufficiency, Step 2: Separation of a convex set and a point outside of the set.
Now consider the case when S is an arbitrary nonempty convex set and T = {x} is a singleton outside S
(the difference with Step 1 is that now S is not assumed to be a polytope).
First of all, without loss of generality we may assume that S contains 0 (if it is not the case, we may
subject S and T to the translation S ↦ p + S, T ↦ p + T with p ∈ −S). Let L be the linear span of S. If
x ∉ L, the separation is easy: taking as f the component of x orthogonal to L, we get
f^T x = f^T f > 0 = max_{y∈S} f^T y,
so that f separates {x} and S. It remains to consider the case when x ∈ L. Let Σ = {h ∈ L : ‖h‖₂ = 1} be the unit sphere in L; it suffices to prove that there exists f ∈ Σ such that
f^T x ≥ sup_{y∈S} f^T y. (1.2.10)
Assume, on the contrary, that no such f exists, and let us lead this assumption to a contradiction. Under
our assumption for every h ∈ Σ there exists yh ∈ S such that
h^T y_h > h^T x.
Since the inequality is strict, it immediately follows that there exists a neighbourhood U_h of the vector
h such that
(h′)^T y_h > (h′)^T x  ∀h′ ∈ U_h. (1.2.11)
The family of open sets {U_h}_{h∈Σ} covers Σ; since Σ is compact, we can find a finite subfamily U_{h₁}, ..., U_{h_N}
of the family which still covers Σ. Let us take the corresponding points y₁ = y_{h₁}, y₂ = y_{h₂}, ..., y_N = y_{h_N}
and the polytope S′ = Conv({y₁, ..., y_N}) spanned by these points. Due to the origin of the y_i, all of them are
points from S; since S is convex, the polytope S′ is contained in S and, consequently, does not contain
x. By Step 1, x can be strongly separated from S′: there exists a such that
a^T x > max_{1≤i≤N} a^T y_i.
Replacing, if necessary, a with its orthogonal projection onto L (this changes neither a^T x nor the a^T y_i, since x and all y_i belong to L) and then normalizing, we may assume that a ∈ Σ. But Σ is covered by the sets U_{h₁}, ..., U_{h_N}, so that a ∈ U_{h_i} for some i, and (1.2.11) yields a^T y_{h_i} > a^T x – the desired contradiction.
Mathematically oriented reader should take into account that the simple-looking reasoning under-
lying Step 2 in fact brings us into a completely new world. Indeed, the considerations at Step
1 and in the proof of Homogeneous Farkas Lemma are “pure arithmetic” – we never used things
like convergence, compactness, etc., and used rational arithmetic only – no square roots, etc. It
means that the Homogeneous Farkas Lemma and the result stated at Step 1 remain valid if we, e.g.,
replace our universe Rn with the space Qn of n-dimensional rational vectors (those with rational
coordinates; of course, the multiplication by reals in this space should be restricted to multiplication
by rationals). The “rational” Farkas Lemma or the possibility to separate a rational vector from a
“rational” polytope by a rational linear form, which is the “rational” version of the result of Step 1,
definitely are of interest (e.g., for Integer Programming). In contrast to these “purely arithmetic”
considerations, at Step 2 we used compactness – something heavily exploiting the fact that our
universe is Rn and not, say, Qn (in the latter space bounded and closed sets are not necessarily
compact). Note also that we could not avoid things like compactness arguments at Step 2, since the
very fact we are proving is true in Rn but not in Qn . Indeed, consider the “rational plane” – the
universe comprised of all 2-dimensional vectors with rational entries, and let S be the half-plane in
this rational plane given by the linear inequality
x1 + αx2 ≤ 0,
where α is irrational. S clearly is a “convex set” in Q2 ; it is immediately seen that a point outside
this set cannot be separated from S by a rational linear form.
(i), Sufficiency, Step 3: Separation of two convex sets. Now let S, T be nonempty convex sets with non-intersecting relative interiors, let S′ = rint S, T′ = rint T, and let
∆ = S′ − T′ = {x − y : x ∈ S′, y ∈ T′}.
By Proposition 1.1.6.3, ∆ is a convex (and, of course, nonempty) set; since S′ and T′ do not intersect, ∆
does not contain 0. By Step 2, we can separate ∆ and {0}: there exists f ≠ 0 which separates these two sets, which, by Exercise 1.11, means that
0 ≥ sup_{x∈S′, y∈T′} [f^T x − f^T y]  and  0 > inf_{x∈S′, y∈T′} [f^T x − f^T y]. (1.2.13)
It is immediately seen that in fact f separates S and T. Indeed, the quantities in the left and right
hand sides of the first inequality in (1.2.13) clearly remain unchanged when we replace S′ with cl S′ and
T′ with cl T′; by Theorem 1.1.1, cl S′ = cl S ⊃ S and cl T′ = cl T ⊃ T, and we get inf_{x∈T} f^T x = inf_{x∈T′} f^T x
and, similarly, sup_{y∈S} f^T y = sup_{y∈S′} f^T y. Thus, we get from (1.2.13)
inf_{x∈T} f^T x ≥ sup_{y∈S} f^T y.
It remains to note that T′ ⊂ T and S′ ⊂ S, so that the second inequality in (1.2.13) implies that
inf_{x∈S} f^T x ≤ inf_{x∈S′} f^T x < sup_{y∈T′} f^T y ≤ sup_{y∈T} f^T y;
by Exercise 1.11, f indeed separates S and T. □
Exercise 1.13 Derive the statement in Remark 1.1.1 from the Separation Theorem.
Exercise 1.14 Implement the following alternative approach to the proof of Separation Theorem:
1. Prove that if x is a point in Rn and S is a nonempty closed convex set in Rn, then the problem
min_y { ‖x − y‖₂ : y ∈ S }
has a unique solution x̄ (the projection of x onto S);
2. In the situation of 1), prove that if x ∉ S, then the linear form e = x − x̄ strongly separates {x}
and S:
max_{y∈S} e^T y = e^T x̄ = e^T x − e^T e < e^T x,
thus getting a direct proof of the possibility to separate strongly a nonempty closed convex set and
a point outside this set.
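The following minimal sketch (not from the notes) illustrates the recipe of Exercise 1.14 in the special case where S is a box, so that the projection x̄ reduces to coordinatewise clipping; for a general closed convex S one would need a projection oracle.

import numpy as np

lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])  # S = [lo, hi], a box
x = np.array([3.0, 0.5])                               # a point outside S

x_bar = np.clip(x, lo, hi)   # projection of x onto the box S
e = x - x_bar                # separating form e = x - x_bar

# For a box, max_{y in S} e^T y is computed coordinatewise:
max_on_S = np.sum(np.maximum(e * lo, e * hi))
print(max_on_S, "=", e @ x_bar, "<", e @ x)  # e^T x_bar < e^T x: strong separation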
Definition 1.2.3 [Supporting plane] Let M be a convex closed set in Rn, and let x be a point from the
relative boundary of M. A hyperplane
Π = {y : a^T y = a^T x}  [a ≠ 0]
is called supporting to M at x, if it separates M and {x}, i.e., if
a^T x ≥ sup_{y∈M} a^T y  and  a^T x > inf_{y∈M} a^T y. (1.2.14)
Note that since x is a point from the relative boundary of M and therefore belongs to cl M = M, the
first inequality in (1.2.14) is in fact an equality. Thus, an equivalent definition of a supporting plane is as
follows:
Let M be a closed convex set and x be a relative boundary point of M . The hyperplane
{y : aT y = aT x} is called supporting to M at x, if the linear form a(y) = aT y attains its
maximum on M at the point x and is nonconstant on M .
For example, the hyperplane {x : x₁ = 1} in Rn clearly is supporting to the unit Euclidean ball {x : ‖x‖₂ ≤ 1}
at the point x = e₁ = (1, 0, ..., 0).
The most important property of a supporting plane is its existence:
Proposition 1.2.4 [Existence of supporting hyperplane] Let M be a convex closed set in Rn and x be
a point from the relative boundary of M . Then
(i) There exists at least one hyperplane which is supporting to M at x;
(ii) If Π is supporting to M at x, then the intersection M ∩ Π is of affine dimension less than the one
of M (recall that the affine dimension of a set is, by definition, the affine dimension of the affine hull of
the set).
Proof. (i) is easy: if x is a point from the relative boundary of M , then it is outside the relative interior of
M and therefore {x} and rint M can be separated by the Separation Theorem; the separating hyperplane
is exactly the desired supporting to M at x hyperplane.
To prove (ii), note that if Π = {y : a^T y = a^T x} is supporting to M at x ∈ ∂_ri M, then the set
M′ = M ∩ Π is a nonempty (it contains x) convex set, and the linear form a^T y is constant on M′
and therefore (why?) on Aff(M′). At the same time, the form is nonconstant on M by definition
of a supporting plane. Thus, Aff(M′) is a proper (less than the entire Aff(M)) subset of Aff(M), and
therefore the affine dimension of Aff(M′) (i.e., the affine dimension of M′) is less than the affine dimension
of Aff(M) (i.e., than the affine dimension of M). □ 2)
For a nonempty convex set M ⊂ Rn, the polar of M is defined as
Polar(M) = {a : a^T x ≤ 1 ∀x ∈ M}.
2) In the latter reasoning we used the following fact: if P ⊂ Q are two affine subspaces, then the affine
dimension of P is ≤ the one of Q, and the dimensions are equal if and only if P = Q. Please prove this fact.
For example, Polar(Rn) = {0}, Polar({0}) = Rn; if L is a linear subspace in Rn, then Polar(L) = L⊥
(why?).
The following properties of the polar are evident:
1. 0 ∈ Polar (M );
2. Polar (M ) is convex;
3. Polar (M ) is closed.
It turns out that these properties characterize polars:
Proposition 1.2.5 Every closed convex set M containing the origin is a polar, specifically, it is the polar of
its polar:
M is closed and convex, 0 ∈ M
⇕
M = Polar(Polar(M))
Proof. All we need is to prove that if M is closed and convex and 0 ∈ M , then M = Polar (Polar (M )).
By definition,
y ∈ Polar(M), x ∈ M ⇒ y^T x ≤ 1,
so that M ⊂ Polar (Polar (M )). To prove that this inclusion is in fact equality, assume, on the contrary,
that there exists x̄ ∈ Polar(Polar(M)) \ M. Since M is nonempty, convex and closed and x̄ ∉ M, the
point x̄ can be strongly separated from M (Separation Theorem, (ii)). Thus, for appropriate b one has
b^T x̄ > sup_{x∈M} b^T x.
Since 0 ∈ M, the left hand side quantity in this inequality is positive; passing from b to a proportional
vector a = λb with appropriately chosen positive λ, we may ensure that
a^T x̄ > 1 ≥ sup_{x∈M} a^T x.
This is the desired contradiction, since the relation 1 ≥ sup_{x∈M} a^T x implies that a ∈ Polar(M), so that the
relation a^T x̄ > 1 contradicts the assumption that x̄ ∈ Polar(Polar(M)). □
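As a numerical aside (not in the notes): for M = {x : ‖x‖∞ ≤ 1} one has sup_{x∈M} a^T x = ‖a‖₁, so Polar(M) = {a : ‖a‖₁ ≤ 1}, and Polar(Polar(M)) = M, in full agreement with Proposition 1.2.5. The sketch below tests the membership criterion "a ∈ Polar(M) iff sup_{x∈M} a^T x ≤ 1" on random points, computing the support function by brute force over the vertices of the box.

import numpy as np

rng = np.random.default_rng(0)
V = np.array(np.meshgrid(*[[-1, 1]] * 3)).reshape(3, -1).T  # vertices of the box
for _ in range(5):
    a = rng.normal(size=3)
    in_polar = np.abs(a).sum() <= 1.0   # ||a||_1 <= 1 ?
    sup = (V @ a).max()                 # support function of M at a
    print(in_polar, sup <= 1.0 + 1e-12) # the two tests agree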
Exercise 1.15 Let M be a convex set containing the origin, and let M′ be the polar of M. Prove the
following facts:
1. Polar(M) = Polar(cl M);
2. M is bounded if and only if 0 ∈ int M′;
3. int M ≠ ∅ if and only if M′ does not contain straight lines;
4. M is a closed cone if and only if M′ is a closed cone. If M is a cone (not necessarily closed), then
M′ = {a : a^T x ≤ 0 ∀x ∈ M}. (1.2.15)
Exercise 1.16 Let M be a closed cone in Rn , and M∗ be its dual cone. Prove that
1. M is pointed (i.e., does not contain lines) if and only if M∗ has a nonempty interior. Derive from
this fact that M is a closed pointed cone with a nonempty interior if and only if the dual cone has
the same properties.
2. Prove that a ∈ int M∗ if and only if aT x > 0 for all nonzero vectors x ∈ M .
Proposition 1.2.6 Let M1, ..., Mk be cones. The cone M∗ dual to the intersection M of the cones
M1, ..., Mk contains the arithmetic sum M̃ of the cones M1∗, ..., Mk∗ dual to M1, ..., Mk. If all the cones
M1, ..., Mk are closed, then M∗ is equal to cl M̃. In particular, for closed cones M1, ..., Mk, M∗ coincides
with M̃ if and only if the latter set is closed.
Proof. Whenever a_i ∈ M_i∗ and x ∈ M, we have a_i^T x ≥ 0, i = 1, ..., k, whence (a₁ + ... + a_k)^T x ≥ 0.
Since the latter relation is valid for all x ∈ M, we conclude that a₁ + ... + a_k ∈ M∗. Thus, M̃ ⊂ M∗.
Now assume that the cones M1, ..., Mk are closed, and let us prove that M∗ = cl M̃. Since M∗ is
closed and we have seen that M̃ ⊂ M∗, all we should prove is that if a ∈ M∗, then a ∈ M̂ = cl M̃ as
well. Assume, on the contrary, that there exists a ∈ M∗ \ M̂. Since the set M̃ clearly is a cone, its closure
M̂ is a closed cone; by assumption, a does not belong to this closed cone and therefore, by the Separation
Theorem (ii), a can be strongly separated from M̂ and therefore from M̃ ⊂ M̂. Thus, for some x one
has
a^T x < inf_{b∈M̃} b^T x = inf_{a_i∈M_i∗, i=1,...,k} (a₁ + ... + a_k)^T x = Σ_{i=1}^{k} inf_{a_i∈M_i∗} a_i^T x. (1.2.16)
From the resulting inequality it follows that inf_{a_i∈M_i∗} a_i^T x > −∞; since M_i∗ is a cone, the latter is possible
if and only if inf_{a_i∈M_i∗} a_i^T x = 0, i.e., if and only if for every i one has x ∈ (M_i∗)∗ = M_i (recall that the
cones M_i are closed). Thus, x ∈ M_i for all i, and the concluding quantity in (1.2.16) is 0. We see that
x ∈ M = ∩_i M_i, and that (1.2.16) reduces to a^T x < 0. This contradicts the inclusion a ∈ M∗. □
Note that in general M̃ can be non-closed even when all the cones M1, ..., Mk are closed. Indeed, take k = 2,
let M1 be the ice-cream cone {(x, y, z) ∈ R³ : z ≥ √(x² + y²)}, and let M2 be the ray {z = x ≤ 0, y = 0}
in R³. Observe that the points from M̃ ≡ M1 + M2 are exactly the points of the form (x − t, y, z − t) with
t ≥ 0 and z ≥ √(x² + y²). In particular, for x positive the points (0, 1, √(x² + 1) − x) = (x − x, 1, √(x² + 1) − x)
belong to M̃; as x → ∞, these points converge to p = (0, 1, 0), and thus p ∈ cl M̃. On the other hand,
there clearly do not exist x, y, z, t with t ≥ 0 and z ≥ √(x² + y²) such that (x − t, y, z − t) = (0, 1, 0), that
is, p ∉ M̃.
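A quick numerical look at this example (a made-up demo, numpy assumed): the points q(x) = (0, 1, √(x²+1) − x) are exhibited below together with the decomposition into a point of the ice-cream cone M1 plus a point of the ray M2, and visibly converge to p = (0, 1, 0).

import numpy as np

# q(x) = (0, 1, sqrt(x^2+1) - x) equals (x, 1, sqrt(x^2+1)) + x*(-1, 0, -1),
# i.e., a point of M1 plus a point of M2, so q(x) is in M1 + M2 for every x > 0.
for x in [1.0, 10.0, 100.0, 1000.0]:
    q = np.array([0.0, 1.0, np.hypot(x, 1.0) - x])
    print(x, q)   # the last coordinate tends to 0: q(x) -> p = (0, 1, 0)
# p itself is not in M1 + M2: (x - t, y, z - t) = (0, 1, 0) forces x = t = z and
# y = 1, and then z >= sqrt(z^2 + 1) is impossible; hence the sum of two closed
# cones need not be closed.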
The Dubovitski-Milutin Lemma presents a simple sufficient condition for M̃ to be closed and thus to
coincide with M∗:
Proposition 1.2.7 [Dubovitski-Milutin Lemma] Let M1 , ..., Mk be cones such that Mk is closed and the
set Mk ∩ int M1 ∩ int M2 ∩ ... ∩ int Mk−1 is nonempty, and let M = M1 ∩ ... ∩ Mk . Let also Mi∗ be the
cones dual to Mi . Then
(i) cl M = ∩_{i=1}^{k} cl M_i;
(ii) the cone M̃ = M1∗ + ... + Mk∗ is closed, and thus coincides with the cone M′ dual to cl M (or,
which is the same by Exercise 1.15.1, with the cone dual to M). In other words, every linear form which is
nonnegative on M can be represented as a sum of k linear forms which are nonnegative on the respective
cones M1, ..., Mk.
Proof. (i): We should prove that under the premise of the Dubovitski-Milutin Lemma, cl M = ∩_i cl M_i.
The right hand side here contains M and is closed, so that all we should prove is that every point x in
∩_{i=1}^{k} cl M_i is the limit of an appropriate sequence x_t ∈ M. By the premise of the Lemma, there exists a point
x̄ ∈ M_k ∩ int M1 ∩ int M2 ∩ ... ∩ int M_{k−1}; setting x_t = t^{−1} x̄ + (1 − t^{−1})x, t = 1, 2, ..., we get a sequence
converging to x as t → ∞; at the same time, x_t ∈ M_k (since x, x̄ are in cl M_k = M_k) and x_t ∈ M_i for
every i < k (by Lemma 1.1.1; note that for i < k one has x̄ ∈ int M_i and x ∈ cl M_i), and thus x_t ∈ M.
Thus, every point x ∈ ∩_{i=1}^{k} cl M_i is the limit of a sequence from M. □
(ii): To prove that under the premise of the Lemma the cone M̃ = M1∗ + ... + Mk∗ is closed, let
x̄ ∈ M_k ∩ int M1 ∩ ... ∩ int M_{k−1} (recall that we are in the case when this intersection is nonempty), and
let y_t = Σ_{i=1}^{k} y_{ti}, y_{ti} ∈ M_i∗, be a sequence of points from M̃ converging, as t → ∞, to a certain point
ȳ; we want to prove that ȳ ∈ M̃. Since the y_t have a limit as t → ∞, the reals p_t = y_t^T x̄ = Σ_{i=1}^{k} y_{ti}^T x̄
form a bounded sequence; since all terms in Σ_{i=1}^{k} y_{ti}^T x̄ are nonnegative (due to x̄ ∈ M_i and y_{ti} ∈ M_i∗),
we conclude that every one of the k sequences {y_{ti}^T x̄}_{t=1}^{∞}, 1 ≤ i ≤ k, is bounded. Now let us make the
following observation:
(!) Let K be a convex cone and x̄ be an interior point of K. Then for a certain constant C
(depending on x̄ and K) one has
∀y ∈ K∗ : ‖y‖₂ ≤ C y^T x̄. (!)
Indeed, K contains a ‖·‖₂-ball B of a certain positive radius r centered at x̄; consequently,
when y ∈ K∗, we have
0 ≤ min_{x∈B} y^T x = y^T x̄ − r‖y‖₂,
whence ‖y‖₂ ≤ r^{−1} y^T x̄, so that (!) holds with C = r^{−1}.
Definition 1.2.4 [extreme points] Let M be a nonempty convex set in Rn . A point x ∈ M is called an
extreme point of M, if there is no nontrivial (of positive length) segment [u, v] ⊂ M for which x is an
interior point, i.e., if the relation
x = λu + (1 − λ)v
with λ ∈ (0, 1) and u, v ∈ M is valid if and only if
u = v = x.
For example, the extreme points of a segment are exactly its endpoints; the extreme points of a triangle
are its vertices; the extreme points of a (closed) disk in the 2-dimensional plane are the points of its
bounding circumference.
Equivalent definitions of an extreme point are as follows:
Exercise 1.17 Let M be a convex set and let x ∈ M . Prove that
1. x is extreme if and only if the only vector h such that x ± h ∈ M is the zero vector;
2. x is extreme if and only if the set M \{x} is convex.
Theorem [Krein-Milman Theorem in Rn] Let M be a closed convex nonempty set in Rn. Then
(i) M possesses extreme points if and only if M does not contain lines;
(ii) if M is, in addition, bounded, then M is the convex hull of the set Ext(M) of its extreme points:
M = Conv(Ext(M)),
so that every point of M is a convex combination of extreme points of M.
A basic ingredient of the proof is the following
Lemma 1.2.3 Let M be a closed convex set in Rn which contains a ray {x̄ + th : t ≥ 0} (x̄ ∈ M, h ≠ 0).
Then for every x ∈ M, M contains the ray {x + th : t ≥ 0}.
Comment. For a closed convex set M, the set of all directions h such that x + th ∈ M
for some x ∈ M and all t ≥ 0 (i.e., by Lemma 1.2.3 – such that x + th ∈ M for all x ∈ M and all t ≥ 0)
is called the recessive cone of M [notation: Rec(M)]. With Lemma 1.2.3 it is immediately
seen (prove it!) that Rec(M) indeed is a closed cone, and that
M + Rec(M) = M.
for all ε ∈ (0, 1). As ε → +0, the left hand side tends to x + τh, and since M is closed,
x + τh ∈ M for every τ ≥ 0. □
Exercise 1.18 Let M be a closed nonempty convex set. Prove that Rec(M) ≠ {0} if and
only if M is unbounded.
Lemma 1.2.3, of course, resolves all our problems with the ”only if” part. Indeed, here we should
prove that if M possesses extreme points, then M does not contain lines, or, which is the same, that if
M contains lines, then it has no extreme points. But the latter statement is immediate: if M contains a
line, then, by the Lemma, there is a line in M passing through every given point of M, so that no point can
be extreme. □
Now let us prove the ”if” part of (i). Thus, from now on we assume that M does not contain lines;
our goal is to prove that then M possesses extreme points. Let us start with the following
Lemma 1.2.4 Let Q be a nonempty closed convex set, x̄ be a relative boundary point of Q
and Π be a hyperplane supporting to Q at x̄. Then all extreme points of the nonempty closed
convex set Π ∩ Q are extreme points of Q.
Proof of the Lemma. First, the set Π ∩ Q is closed and convex (as an intersection of
two sets with these properties); it is nonempty, since it contains x̄ (Π contains x̄ due to the
definition of a supporting plane, and Q contains x̄ due to the closedness of Q). Second, let
a be the linear form associated with Π:
Π = {y : aT y = aT x̄},
so that
inf_{x∈Q} a^T x < sup_{x∈Q} a^T x = a^T x̄. (1.2.17)
Now let y be an extreme point of Π ∩ Q; let us prove that y is an extreme point of Q as well. Indeed, let
y = λu + (1 − λ)v
with u, v ∈ Q and λ ∈ (0, 1). Since y ∈ Π, (1.2.17) gives
a^T y = a^T x̄ ≥ max{a^T u, a^T v},
while by linearity
a^T y = λ a^T u + (1 − λ) a^T v;
combining these observations and taking into account that λ ∈ (0, 1), we conclude that
a^T y = a^T u = a^T v.
Thus, u, v ∈ Π ∩ Q; since y is an extreme point of Π ∩ Q, we get u = v = y, so that y is an extreme point of Q as well. □
Equipped with the Lemma, we can easily prove (i) by induction on the dimension of the convex set
M (recall that this is nothing but the affine dimension of the affine span of M , i.e., the linear dimension
of the linear subspace L such that Aff(M ) = a + L).
There is nothing to do if the dimension of M is zero, i.e., if M is a point – then, of course, M = Ext(M ).
Now assume that we already have proved the nonemptiness of Ext(T ) for all nonempty closed and not
containing lines convex sets T of certain dimension k, and let us prove that the same statement is valid
for the sets of dimension k + 1. Let M be a closed convex nonempty and not containing lines set of
dimension k + 1. Since M does not contain lines and is of positive dimension, it differs from Aff(M )
and therefore it possesses a relative boundary point x̄ 3) . According to Proposition 1.2.4, there exists a
hyperplane Π = {x : aT x = aT x̄} which supports M at x̄:
inf_{x∈M} a^T x < max_{x∈M} a^T x = a^T x̄.
By the same Proposition, the set T = Π∩M (which is closed, convex and nonempty) is of affine dimension
less than the one of M , i.e., of the dimension ≤ k. T clearly does not contain lines (since even the larger
set M does not contain lines). By the inductive hypothesis, T possesses extreme points, and by Lemma
1.2.4 all these points are extreme also for M . The inductive step is completed, and (i) is proved. 2
Now let us prove (ii). Thus, let M be nonempty, convex, closed and bounded; we should prove that
M = Conv(Ext(M )).
What is immediately seen is that the right hand side set is contained in the left hand side one. Thus, all
we need is to prove that every x ∈ M is a convex combination of points from Ext(M ). Here we again
use induction on the affine dimension of M . The case of 0-dimensional set M (i.e., a point) is trivial.
Assume that the statement in question is valid for all k-dimensional convex closed and bounded sets, and
let M be a convex closed and bounded set of dimension k + 1. Let x ∈ M ; to represent x as a convex
combination of points from Ext(M), let us pass through x an arbitrary line ℓ = {x + λh : λ ∈ R} (h ≠ 0)
in the affine span Aff(M ) of M . Moving along this line from x in each of the two possible directions,
we eventually leave M (since M is bounded); as it was explained in the proof of (i), it means that there
exist nonnegative λ+ and λ− such that the points
x̄+ = x + λ+ h, x̄− = x − λ− h
both belong to the relative boundary of M . Let us verify that x̄± are convex combinations of the extreme
points of M (this will complete the proof, since x clearly is a convex combination of the two points x̄± ).
Indeed, M admits a hyperplane Π supporting it at x̄+; as it was explained in the proof of (i), the set Π ∩ M
(which clearly is convex, closed and bounded) is of affine dimension less than that one of M ; by the
inductive hypothesis, the point x̄+ of this set is a convex combination of extreme points of the set, and by
Lemma 1.2.4 all these extreme points are extreme points of M as well. Thus, x̄+ is a convex combination
of extreme points of M . Similar reasoning is valid for x̄− .
where A is an m × n matrix and b is a vector from Rm. What are the extreme points of K? The answer
is given by the following
Theorem 1.2.11 [extreme points of a polyhedral set] A point x ∈ K is an extreme point of K if and only if among the inequalities a_i^T x ≤ b_i of the system Ax ≤ b which are satisfied at x as equalities there are n inequalities with linearly independent vectors a_i.
To prove the “if” part, assume that x ∈ K is such that among the inequalities aTi x ≤ bi which are
equalities at x there are n linearly independent, say, those with indices 1, ..., n, and let us prove that x
is an extreme point of K. This is immediate: assuming that x is not an extreme point, we would get
the existence of a nonzero vector h such that x ± h ∈ K. In other words, for i = 1, ..., n we would have
bi ± aTi h ≡ aTi (x ± h) ≤ bi , which is possible only if aTi h = 0, i = 1, ..., n. But the only vector which is
orthogonal to n linearly independent vectors in Rn is the zero vector (why?), and we get h = 0, which
was assumed not to be the case. .
Indeed, according to the above Theorem, every extreme point of a polyhedral set K = {x ∈ Rn : Ax ≤ b}
satisfies the equality version of certain n-inequality subsystem of the original system, the matrix of
the subsystem being nonsingular. Due to the latter fact, an extreme point is uniquely defined by the
corresponding subsystem, so that the number of extreme points does not exceed the number C_m^n of ways
to choose n rows out of m (i.e., the number of n × n row submatrices of A) and is therefore finite.
Note that C_m^n is nothing but an upper (and typically very conservative) bound on the number of
extreme points of a polyhedral set given by m inequalities in Rn: some n × n submatrices of A can be
singular and, what is more important, the majority of the nonsingular ones normally produce "candidates"
which do not satisfy the remaining inequalities.
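The finiteness argument above is effectively an (exponential) algorithm, and for small instances one can run it literally. Here is a brute-force sketch (demo data: the unit square in R² cut by x₁ + x₂ ≤ 1.5; numpy assumed): run through all n × n row subsystems of Ax ≤ b, solve the equality system when the submatrix is nonsingular, and keep the solutions that satisfy the remaining inequalities.

import itertools
import numpy as np

A = np.array([[1.0, 0], [0, 1.0], [-1.0, 0], [0, -1.0], [1.0, 1.0]])
b = np.array([1.0, 1.0, 0.0, 0.0, 1.5])
n = A.shape[1]

extreme = []
for rows in itertools.combinations(range(len(A)), n):
    Asub, bsub = A[list(rows)], b[list(rows)]
    if abs(np.linalg.det(Asub)) < 1e-12:
        continue                      # singular subsystem: no candidate
    x = np.linalg.solve(Asub, bsub)   # candidate vertex
    if np.all(A @ x <= b + 1e-9):     # keep it only if feasible
        extreme.append(x)
print(np.unique(np.round(extreme, 9), axis=0))  # the five vertices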
Remark 1.2.1 The result of Theorem 1.2.11 is very important, in particular, for the theory
of the Simplex method – the traditional computational tool of Linear Programming. When
applied to the LP program in the standard form
min_x { c^T x : P x = p, x ≥ 0 }  [x ∈ Rn],
with k × n matrix P , the result of Theorem 1.2.11 is that extreme points of the feasible
set are exactly the basic feasible solutions of the system P x = p, i.e., nonnegative vectors
x such that P x = p and the set of columns of P associated with positive entries of x is
linearly independent. Since the feasible set of an LP program in the standard form clearly
does not contain lines, among the optimal solutions (if they exist) to an LP program in the
standard form at least one is an extreme point of the feasible set (Theorem 1.2.14.(ii)). Thus,
in principle we could look through the finite set of all extreme points of the feasible set (≡
through all basic feasible solutions) and to choose the one with the best value of the objective.
This recipe allows one to find an optimal solution in finitely many arithmetic operations, provided
that the program is solvable, and is, basically, what the Simplex method does; this latter
method, of course, looks through the basic feasible solutions in a smart way which normally
allows it to deal with a negligible part of them only.
Another useful consequence of Theorem 1.2.11 is that if all the data in an LP program are
rational, then every extreme point of the feasible domain of the program is a vector with
rational entries. In particular, a solvable standard form LP program with rational data has
at least one rational optimal solution.
By the Krein-Milman Theorem, Πn – the polytope of n × n double stochastic matrices, i.e., matrices with nonnegative entries and with all row and column sums equal to 1 – is the convex hull of its extreme points. What are these extreme points?
The answer is given by the important
Theorem 1.2.12 [Birkhoff Theorem] Extreme points of Πn are exactly the permutation matrices of order
n, i.e., n × n Boolean (i.e., with 0/1 entries) matrices with exactly one nonzero element (equal to 1) in
every row and every column.
Exercise 1.19 [Easy part] Prove the easy part of the Theorem, specifically, that every n × n permutation
matrix is an extreme point of Πn .
Proof of difficult part. Now let us prove that every extreme point of Πn is a permutation matrix. To
this end let us note that the 2n linear equations in the definition of Πn – those saying that all row and
column sums are equal to 1 – are linearly dependent, and dropping one of them, say, Σ_i x_{in} = 1, we do
not alter the set. Indeed, the remaining equalities say that all row sums are equal to 1, so that the total
sum of all entries in X is n, and that the first n − 1 column sums are equal to 1, meaning that the last
column sum is n − (n − 1) = 1. Thus, we lose nothing when assuming that there are just 2n − 1 equality
constraints in the description of Πn . Now let us prove the claim by induction in n. The base n = 1 is
trivial. Let us justify the inductive step n − 1 ⇒ n. Thus, let X be an extreme point of Πn . By Theorem
1.2.11, among the constraints defining Πn (i.e., 2n − 1 equalities and n2 inequalities xij ≥ 0) there should
be n2 linearly independent which are satisfied at X as equations. Thus, at least n2 − (2n − 1) = (n − 1)2
entries in X should be zeros. It follows that at least one of the columns of X contains ≤ 1 nonzero entries
(since otherwise the number of zero entries in X would be at most n(n − 2) < (n − 1)2 ). Thus, there
exists at least one column with at most 1 nonzero entry; since the sum of entries in this column is 1, this
nonzero entry, let it be x_{īj̄}, is equal to 1. Since the entries in row ī are nonnegative, sum up to 1, and
x_{īj̄} = 1, the entry x_{īj̄} is the only nonzero entry in its row and in its column. Eliminating from X the row ī and
the column j̄, we get an (n − 1) × (n − 1) double stochastic matrix. By inductive hypothesis, this matrix is
a convex combination of (n − 1) × (n − 1) permutation matrices. Augmenting every one of these matrices
by the column and the row we have eliminated, we get a representation of X as a convex combination of
n × n permutation matrices: X = Σ_ℓ λ_ℓ P_ℓ with nonnegative λ_ℓ summing up to 1. Since P_ℓ ∈ Πn and X
is an extreme point of Πn, all the P_ℓ with nonzero coefficients λ_ℓ must coincide with X, so that X is one
of the permutation matrices P_ℓ and as such is a permutation matrix itself. □
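The proof suggests that a double stochastic matrix can actually be peeled into permutation matrices. The sketch below (an illustration, not the proof's construction; numpy and scipy assumed) repeatedly finds a permutation supported on the positive entries of the remainder (such a permutation exists by the Birkhoff Theorem itself), subtracts the largest feasible multiple of it, and stops when the remainder vanishes; every step zeroes out at least one positive entry, so the loop is finite.

import numpy as np
from scipy.optimize import linear_sum_assignment

X = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.25, 0.5],
              [0.25, 0.25, 0.5]])   # a double stochastic matrix (demo data)

R, terms = X.copy(), []
while R.max() > 1e-12:
    # np.inf marks forbidden (zero) entries; the assignment found is then a
    # permutation lying entirely inside the positive support of R
    cost = np.where(R > 1e-12, 0.0, np.inf)
    rows, cols = linear_sum_assignment(cost)
    lam = R[rows, cols].min()            # largest feasible coefficient
    P = np.zeros_like(R)
    P[rows, cols] = 1.0
    terms.append((lam, P))
    R = R - lam * P
print([round(lam, 4) for lam, _ in terms],
      "coefficients sum to", sum(lam for lam, _ in terms))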
where A is a matrix with n columns and a certain number m of rows, and b is an m-dimensional vector. This is an
"outer" description of a polyhedral set. We are about to establish an important result on an equivalent
"inner" representation of a polyhedral set.
Consider the following construction. Let us take two finite nonempty sets of vectors V ("vertices")
and R ("rays") and build the set
M(V, R) = Conv(V) + Cone(R) = { Σ_{v∈V} λ_v v + Σ_{r∈R} μ_r r : λ_v ≥ 0, μ_r ≥ 0, Σ_{v∈V} λ_v = 1 }.
Thus, we take all vectors which can be represented as sums of convex combinations of the points from V
and conic combinations of the points from R. The set M (V, R) clearly is convex (as the arithmetic sum
of two convex sets Conv(V) and Cone(R)). The promised inner description of polyhedral sets is as follows:
Theorem 1.2.13 [Inner description of a polyhedral set] The sets of the form M (V, R) are exactly the
nonempty polyhedral sets: M (V, R) is polyhedral, and every nonempty polyhedral set M is M (V, R) for
properly chosen V and R.
The polytopes M (V, {0}) = Conv(V ) are exactly the nonempty and bounded polyhedral sets. The sets
of the type M ({0}, R) are exactly the polyhedral cones (sets given by finitely many nonstrict homogeneous
linear inequalities).
Remark 1.2.2 In addition to the results of the Theorem, it can be proved that in the representation of
a nonempty polyhedral set M as M = Conv(V ) + Cone (R)
– the "conic" part Cone(R) (not the set R itself!) is uniquely defined by M and is the recessive cone
of M (see Comment to Lemma 1.2.3);
– if M does not contain lines, then V can be chosen as the set of all extreme points of M .
Postponing temporarily the proof of Theorem 1.2.13, let us explain why this theorem is that important
– why it is so nice to know both inner and outer descriptions of a polyhedral set.
Consider a number of natural questions:
• A. Is it true that the inverse image of a polyhedral set M ⊂ Rn under an affine mapping y 7→
P(y) = P y + p : Rm → Rn , i.e., the set
P −1 (M ) = {y ∈ Rm : P y + p ∈ M }
is polyhedral?
• B. Is it true that the image of a polyhedral set M ⊂ Rn under an affine mapping x 7→ y = P(x) =
P x + p : Rn → Rm – the set
P(M ) = {P x + p : x ∈ M }
is polyhedral?
• C. Is it true that the intersection of two polyhedral sets is again a polyhedral set?
• D. Is it true that the arithmetic sum of two polyhedral sets is again a polyhedral set?
The answers to all these questions are positive; one way to see it is to use calculus of polyhedral rep-
resentations along with the fact that polyhedrally representable sets are exactly the same as polyhedral
sets (section 1.2.4). Another very instructive way is to use the just outlined results on the structure of
polyhedral sets, which we intend to do now.
It is very easy to answer A affirmatively, starting from the original – outer – definition of a
polyhedral set: if M = {x : Ax ≤ b}, then, of course,
P^{-1}(M) = {y : A(P y + p) ≤ b} = {y : (AP) y ≤ b − Ap},
and the latter set is given by finitely many nonstrict linear inequalities on y and thus is polyhedral.
An attempt to answer B affirmatively via the same definition fails – there is no easy way to
convert the linear inequalities defining a polyhedral set into those defining its image, and it is absolutely
unclear why the image indeed is given by finitely many linear inequalities. Note, however, that there
is no difficulty to answer affirmatively to B with the inner description of a nonempty polyhedral set: if
M = M (V, R), then, evidently,
P(M ) = M (P(V ), P R),
where P R = {P r : r ∈ R} is the image of R under the action of the homogeneous part of P.
Similarly, positive answer to C becomes evident, when we use the outer description of a polyhedral
set: taking intersection of the solution sets to two systems of nonstrict linear inequalities, we, of course,
again get the solution set to a system of this type – you simply should put together all inequalities from
the original two systems. And it is very unclear how to answer positively to D with the outer definition
of a polyhedral set – what happens with inequalities when we add the solution sets? In contrast to this,
the inner description gives the answer immediately: if M = M(V, R) and M′ = M(V′, R′), then
M + M′ = [Conv(V) + Cone(R)] + [Conv(V′) + Cone(R′)]
        = [Conv(V) + Conv(V′)] + [Cone(R) + Cone(R′)]
        = Conv(V + V′) + Cone(R ∪ R′)
        = M(V + V′, R ∪ R′),
where V + V′ = {v + v′ : v ∈ V, v′ ∈ V′}.
Note that in this computation we used two rules which should be justified: Conv(V) + Conv(V′) =
Conv(V + V′) and Cone(R) + Cone(R′) = Cone(R ∪ R′). The second is evident from the definition of
the conic hull, and only the first needs a simple reasoning. To prove it, note that Conv(V) + Conv(V′) is a
convex set which contains V + V′ and therefore contains Conv(V + V′). The inverse inclusion is proved
as follows: if
x = Σ_i λ_i v_i,  y = Σ_j λ′_j v′_j
are convex combinations of points from V, resp., V′, then, as is immediately seen (please check!),
x + y = Σ_{i,j} λ_i λ′_j (v_i + v′_j)
is a convex combination of points from V + V′ (the weights λ_i λ′_j are nonnegative and sum up to 1).
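The identity used in this reasoning is easy to check numerically; the toy sketch below (random made-up data, numpy assumed) verifies that x + y = Σ_{i,j} λ_i λ′_j (v_i + v′_j) and that the weights λ_i λ′_j indeed form a convex combination.

import numpy as np

rng = np.random.default_rng(1)
V, Vp = rng.normal(size=(4, 2)), rng.normal(size=(3, 2))
lam = rng.random(4); lam /= lam.sum()     # convex weights for V
lamp = rng.random(3); lamp /= lamp.sum()  # convex weights for V'

x, y = lam @ V, lamp @ Vp
w = np.outer(lam, lamp)                   # weights lam_i * lam'_j
sums = V[:, None, :] + Vp[None, :, :]     # all pairwise sums v_i + v'_j
print(np.allclose(x + y, (w[..., None] * sums).sum(axis=(0, 1))))  # True
print(np.isclose(w.sum(), 1.0))           # the weights sum to 1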
here c is a given n-dimensional vector – the objective, A is a given m × n constraint matrix and b ∈ Rm
is the right hand side vector. Note that (P) is called “Linear Programming program in the canonical
form”; there are other equivalent forms of the problem.
• solvable, if it is feasible and the optimal solution exists – the objective attains its maximum on the
feasible set.
If the program is bounded, then the least upper bound of the values of the objective on the feasible set is a
real; this real is called the optimal value of the program and is denoted by c∗. It is convenient to assign
an optimal value to unbounded and infeasible programs as well – for an unbounded program it, by definition,
is +∞, and for an infeasible one it is −∞.
Note that our terminology is aimed at maximization programs; if the program is to minimize
the objective, the terminology is updated in the natural way: when defining bounded/unbounded
programs, we should speak about boundedness of the objective from below rather than from above,
etc. For example, the optimal value of an unbounded minimization program is −∞, and of an
infeasible one it is +∞. This terminology is consistent with the usual way of converting a minimization
problem into an equivalent maximization one by replacing the original objective c with −c: the properties
of feasibility – boundedness – solvability remain unchanged, and the optimal value in all cases changes
its sign.
We have said that you for sure know the above terminology; this is not exactly true, since you definitely
have heard and used the words "infeasible LP program" and "unbounded LP program", but hardly ever used the
words "bounded LP program" – only "solvable" one. The reason is that, although absolutely unclear
in advance, a bounded LP program is always solvable. We have already established this fact, even twice
– via Fourier-Motzkin elimination (section 1.2.4) and via the LP Duality Theorem. Let us reestablish
this fact, fundamental for Linear Programming, with the tools we have at our disposal now.
Theorem 1.2.14 (i) A Linear Programming program is solvable if and only if it is bounded.
(ii) If the program is solvable and the feasible set of the program does not contain lines, then at least
one of the optimal solutions is an extreme point of the feasible set.
Proof. (i): The “only if” part of the statement is tautological: the definition of solvability includes
boundedness. What we should prove is the “if” part – that a bounded program is solvable. This is
immediately given by the inner description of the feasible set M of the program: this is a polyhedral set,
so that, being nonempty (as it is for a bounded program), it can be represented as
M = M(V, R)
for some nonempty finite sets V and R. We claim first of all that since (P) is bounded, the inner product
of c with every vector from R is nonpositive. Indeed, otherwise there would be r ∈ R with c^T r > 0; since
M(V, R) clearly contains, with every one of its points x, the entire ray {x + tr : t ≥ 0}, and the objective evidently
is unbounded on this ray, the objective would be unbounded from above on M, which is not the case.
Now let us choose in the finite and nonempty set V the point, let it be called v ∗ , which maximizes the
objective on V . We claim that v ∗ is an optimal solution to (P), so that (P) is solvable. The justification
of the claim is immediate: v ∗ clearly belongs to M ; now, a generic point of M = M (V, R) is
x = Σ_{v∈V} λ_v v + Σ_{r∈R} μ_r r
with nonnegative λ_v and μ_r and with Σ_v λ_v = 1, so that
c^T x = Σ_v λ_v c^T v + Σ_r μ_r c^T r
      ≤ Σ_v λ_v c^T v     [since μ_r ≥ 0 and c^T r ≤ 0, r ∈ R]
      ≤ Σ_v λ_v c^T v∗    [since λ_v ≥ 0 and c^T v ≤ c^T v∗]
      = c^T v∗            [since Σ_v λ_v = 1]  □
(ii): if the feasible set of (P), let it be called M , does not contain lines, it, being convex and closed
(as a polyhedral set) possesses extreme points. It follows that (ii) is valid in the trivial case when the
objective of (P) is constant on the entire feasible set, since then every extreme point of M can be taken
as the desired optimal solution. The case when the objective is nonconstant on M can be immediately
reduced to the aforementioned trivial case: if x∗ is an optimal solution to (P) and the linear form cT x is
nonconstant on M , then the hyperplane Π = {x : cT x = c∗ } is supporting to M at x∗ ; the set Π ∩ M is
closed, convex, nonempty and does not contain lines, therefore it possesses an extreme point x∗∗ which,
on one hand, clearly is an optimal solution to (P), and on another hand is an extreme point of M by
Lemma 1.2.4.
Theorem 1.2.15 [Structure of a bounded polyhedral set] A bounded and nonempty polyhedral set M in
Rn is a polytope, i.e., is the convex hull of a finite nonempty set:
M = Conv(V) for a properly chosen finite nonempty set V.
Proof. The first part of the statement – that a bounded nonempty polyhedral set is a polytope – is
readily given by the Krein-Milman Theorem combined with Corollary 1.2.1. Indeed, a polyhedral set
always is closed (as a set given by nonstrict inequalities involving continuous functions) and convex; if it
is also bounded and nonempty, it, by the Krein-Milman Theorem, is the convex hull of the set V of its
extreme points; V is finite by Corollary 1.2.1. 2
Now let us prove the more difficult part of the statement – that a polytope is a bounded polyhedral
set. The fact that a convex hull of a finite set is bounded is evident. Thus, all we need is to prove that
the convex hull of finitely many points is a polyhedral set. To this end note that this convex hull clearly
is polyhedrally representable:
Conv{v₁, ..., v_N} = {x : ∃λ : λ ≥ 0, Σ_i λ_i = 1, x = Σ_i λ_i v_i},
and polyhedrally representable sets are polyhedral (Theorem 1.2.5).
1.2.9.3.2. Structure of a general polyhedral set: completing the proof. Now let us
prove the general Theorem 1.2.13. The proof basically follows the lines of the one of Theorem 1.2.15,
but with one elaboration: now we cannot use the Krein-Milman Theorem to take upon itself part of our
difficulties.
To simplify language let us call VR-sets (“V” from “vertex”, “R” from rays) the sets of the form
M (V, R), and P-sets the nonempty polyhedral sets. We should prove that every P-set is a VR-set, and
vice versa. We start with proving that every P-set is a VR-set.
1.2.9.3.2.A. P⇒VR:
P⇒VR, Step 1: reduction to the case when the P-set does not contain lines. Let
M be a P-set, so that M is the set of all solutions to a solvable system of linear inequalities:
M = {x ∈ Rn : Ax ≤ b} (1.2.19)
with m × n matrix A. Such a set may contain lines; if h is the direction of a line in M , then A(x + th) ≤ b
for some x and all t ∈ R, which is possible only if Ah = 0. Vice versa, if h is from the kernel of A, i.e., if
Ah = 0, then the line x + Rh with x ∈ M clearly is contained in M . Thus, we come to the following fact:
Lemma 1.2.5 A nonempty polyhedral set (1.2.19) contains lines if and only if the kernel
of A is nontrivial, and the nonzero vectors from the kernel are exactly the directions of lines
contained in M: if M contains a line with direction h, then h ∈ Ker A, and vice versa: if
0 ≠ h ∈ Ker A and x ∈ M, then M contains the entire line x + Rh.
Given a nonempty set (1.2.19), let us denote by L the kernel of A and by L⊥ the orthogonal complement
to the kernel, and let M 0 be the cross-section of M by L⊥ :
M 0 = {x ∈ L⊥ : Ax ≤ b}.
The set M 0 clearly does not contain lines (since the direction of every line in M 0 , on one hand, should
belong to L⊥ due to M 0 ⊂ L⊥ , and on the other hand – should belong to L = Ker A, since a line in
M 0 ⊂ M is a line in M as well). The set M 0 is nonempty and, moreover, M = M 0 + L. Indeed, M 0
contains the orthogonal projections of all points from M onto L⊥ (since to project a point onto L⊥ , you
should move from this point along certain line with the direction in L, and all these movements, started
in M , keep you in M by the Lemma) and therefore is nonempty, first, and is such that M 0 + L ⊃ M ,
second. On the other hand, M 0 ⊂ M and M + L = M by Lemma 1.2.5, whence M 0 + L ⊂ M . Thus,
M0 + L = M.
Finally, M 0 is a polyhedral set together with M , since the inclusion x ∈ L⊥ can be represented by
dim L linear equations (i.e., by 2 dim L nonstrict linear inequalities): you should say that x is orthogonal
to dim L somehow chosen vectors a1 , ..., adim L forming a basis in L.
The results of our effort are as follows: given an arbitrary P-set M, we have represented it as the
sum of a P-set M′ not containing lines and a linear subspace L. With this decomposition in mind we see
that in order to achieve our current goal – to prove that every P-set is a VR-set – it suffices to prove the
same statement for P-sets not containing lines. Indeed, given that M′ = M(V, R) and denoting by R₀ a
finite set such that L = Cone(R₀) (to get R₀, take the set of 2 dim L vectors ±a_i, i = 1, ..., dim L, where
a₁, ..., a_{dim L} is a basis in L), we would obtain
M = M′ + L
  = [Conv(V) + Cone(R)] + Cone(R₀)
  = Conv(V) + [Cone(R) + Cone(R₀)]
  = Conv(V) + Cone(R ∪ R₀)
  = M(V, R ∪ R₀).
We see that in order to establish that a P-set is a VR-set it suffices to prove the same statement for
the case when the P-set in question does not contain lines.
P⇒VR, Step 2: the P-set does not contain lines. Our situation is as follows: we are
given a not containing lines P-set in Rn and should prove that it is a VR-set. We shall prove this
statement by induction on the dimension n of the space. The case of n = 0 is trivial. Now assume that
the statement in question is valid for n ≤ k, and let us prove that it is valid also for n = k + 1. Let M
be a not containing lines P-set in R^{k+1}:
M = {x ∈ R^{k+1} : a_i^T x ≤ b_i, i = 1, ..., m}. (1.2.20)
Without loss of generality we may assume that all ai are nonzero vectors (since M is nonempty, the
inequalities with ai = 0 are valid on the entire Rn , and removing them from the system, we do not vary
its solution set). Note that m > 0 – otherwise M would contain lines, since k ≥ 0.
1°. We may assume that M is unbounded – otherwise the desired result is given already by Theorem
1.2.15. By Exercise 1.18, there exists a recessive direction r ≠ 0 of M. Thus, M contains the ray
{x + tr : t ≥ 0}, whence, by Lemma 1.2.3, M + Cone({r}) = M. □
2°. For every i ≤ m, where m is the row size of the matrix A from (1.2.20), that is, the number of
linear inequalities in the description of M, let us denote by M_i the corresponding "facet" of M – the
polyhedral set given by the system of inequalities (1.2.20) with the inequality aTi x ≤ bi replaced by the
equality aTi x = bi . Some of these “facets” can be empty; let I be the set of indices i of nonempty Mi ’s.
When i ∈ I, the set Mi is a nonempty polyhedral set – i.e., a P-set – which does not contain lines
(since Mi ⊂ M and M does not contain lines). Besides this, Mi belongs to the hyperplane {aTi x = bi },
i.e., actually it is a P-set in Rk . By the inductive hypothesis, we have representations
Mi = M (Vi , Ri ), i ∈ I,
We claim that
M = Conv(V) + Cone(R),  V = ∪_{i∈I} V_i,  R = (∪_{i∈I} R_i) ∪ {r}, (1.2.21)
where r is a recessive direction of M found in 1°; after the claim is supported, our induction will be
completed.
To prove (1.2.21), note, first of all, that the right hand side of this relation is contained in the left
hand side one. Indeed, since Mi ⊂ M and Vi ⊂ Mi , we have Vi ⊂ M , whence also V = ∪i Vi ⊂ M ; since
M is convex, we have
Conv(V ) ⊂ M. (1.2.22)
Further, if r0 ∈ Ri , then r0 is a recessive direction of Mi ; since Mi ⊂ M , r0 is a recessive direction of M
by Lemma 1.2.3. Thus, every vector from ∪i∈I Ri is a recessive direction for M , same as r; thus, every
vector from R = ∪i∈I Ri ∪ {r} is a recessive direction of M , whence, again by Lemma 1.2.3,
M + Cone (R) = M.
1.2.9.3.2.B. VR⇒P: We already know that every P-set is a VR-set. Now we shall prove that
every VR-set is a P-set, thus completing the proof of Theorem 1.2.13. This is immediate: a VR-set is
polyhedrally representable (why?) and thus is a P-set by Theorem 1.2.5.
Lecture 2
Convex functions
If the above inequality is strict whenever x ≠ y and 0 < λ < 1, f is called strictly convex.
A function f such that −f is convex is called concave; the domain Q of a concave function should be
convex, and the function itself should satisfy the inequality opposite to (2.1.1):
f(λx + (1 − λ)y) ≥ λf(x) + (1 − λ)f(y)  for all x, y ∈ Q and λ ∈ [0, 1].
The simplest example of a convex function is an affine function
f(x) = a^T x + b
– the sum of a linear form and a constant. This function clearly is convex on the entire space, and the
"convexity inequality" for it is an equality. An affine function is both convex and concave; it is easily seen
that a function which is both convex and concave on the entire space is affine.
Here are several elementary examples of “nonlinear” convex functions of one variable:
• functions convex on the whole axis:
x^{2p}, p a positive integer;
exp{x};
• functions convex on the nonnegative ray:
x^p, p ≥ 1;
−x^p, 0 ≤ p ≤ 1;
x ln x;
• functions convex on the positive ray:
1/x^p, p > 0;
− ln x.
At the moment it is not clear why these functions are convex; in the meantime we shall derive a simple
analytic criterion for detecting convexity which immediately demonstrates that the above functions indeed
are convex.
A very convenient equivalent definition of a convex function is in terms of its epigraph. Given a
real-valued function f defined on a nonempty subset Q of Rn, we define its epigraph as the set
Epi(f) = {(t, x) ∈ R^{n+1} : x ∈ Q, t ≥ f(x)};
geometrically, to define the epigraph, you should take the graph of the function – the surface {t =
f(x), x ∈ Q} in R^{n+1} – and add to this surface all points which are "above" it. The equivalent, more
geometrical, definition of a convex function is given by the following simple statement (prove it!):
Proposition 2.1.1 [definition of convexity in terms of the epigraph]
A function f defined on a subset of Rn is convex if and only if its epigraph is a nonempty convex set
in Rn+1 .
More examples of convex functions: norms. Equipped with Proposition 2.1.1, we can extend
our initial list of convex functions (several one-dimensional functions and affine ones) by more examples –
norms. Let π(x) be a norm on Rn (see Section 1.1.2.B). To the moment we know three examples of norms
– the Euclidean norm ‖x‖₂ = √(x^T x), the 1-norm ‖x‖₁ = Σ_i |x_i| and the ∞-norm ‖x‖∞ = max_i |x_i|. It
was also claimed (although not proved) that these are three members of an infinite family of norms
‖x‖_p = (Σ_{i=1}^{n} |x_i|^p)^{1/p},  1 ≤ p ≤ ∞
(the right hand side of the latter relation for p = ∞ is, by definition, max_i |x_i|).
We are about to prove that every norm is convex:
Proposition 2.1.2 Let π(x) be a real-valued function on Rn which is positively homogeneous of degree
1:
π(tx) = tπ(x) ∀x ∈ Rn , t ≥ 0.
π is convex if and only if it is subadditive:
π(x + y) ≤ π(x) + π(y)  ∀x, y ∈ Rn.
Proposition 2.1.3 [Jensen's inequality] Let f be convex, let x_i ∈ Dom f, and let λ_i ≥ 0, i = 1, ..., N, be such that Σ_i λ_i = 1. Then
f(Σ_{i=1}^{N} λ_i x_i) ≤ Σ_{i=1}^{N} λ_i f(x_i).
Proof is immediate: the points (f(x_i), x_i) clearly belong to the epigraph of f; since f is convex, its
epigraph is a convex set, so that the convex combination
Σ_{i=1}^{N} λ_i (f(x_i), x_i) = (Σ_{i=1}^{N} λ_i f(x_i), Σ_{i=1}^{N} λ_i x_i)
of the points also belongs to Epi(f). By definition of the epigraph, the latter means exactly that
Σ_{i=1}^{N} λ_i f(x_i) ≥ f(Σ_{i=1}^{N} λ_i x_i). □
Note that the definition of convexity of a function f is exactly the requirement on f to satisfy the
Jensen inequality for the case of N = 2; we see that to satisfy this inequality for N = 2 is the same as to
satisfy it for all N .
Proposition 2.1.4 [convexity of level sets] Let f be a convex function with the domain Q. Then, for
every real α, the set
levα (f ) = {x ∈ Q : f (x) ≤ α}
– the level set of f – is convex.
The proof takes one line: if x, y ∈ levα (f ) and λ ∈ [0, 1], then f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y) ≤
λα + (1 − λ)α = α, so that λx + (1 − λ)y ∈ levα (f ).
Note that the convexity of level sets does not characterize convex functions; there are nonconvex
functions which share this property (e.g., every monotone function on the axis). The “proper” charac-
terization of convex functions in terms of convex sets is given by Proposition 2.1.1 – convex functions are
exactly the functions with convex epigraphs. Convexity of level sets specifies a wider family of functions,
the so-called quasiconvex ones.
If the expression in the right hand side involves infinities, it is assigned the value according to the
standard and reasonable conventions on what are arithmetic operations in the “extended real axis”
R ∪ {+∞} ∪ {−∞}:
• arithmetic operations with reals are understood in their usual sense;
• the sum of +∞ and a real, same as the sum of +∞ and +∞ is +∞; similarly, the sum of a real
and −∞, same as the sum of −∞ and −∞ is −∞. The sum of +∞ and −∞ is undefined;
• the product of a real and +∞ is +∞, 0 or −∞, depending on whether the real is positive, zero or
negative, and similarly for the product of a real and −∞. The product of two ”infinities” is again
infinity, with the usual rule for assigning the sign to the product.
Note that it is not clear in advance that our new definition of a convex function is equivalent to the initial
one: initially we included into the definition the requirement for the domain to be convex, and now we omit
explicitly indicating this requirement. In fact, of course, the definitions are equivalent: convexity of Dom f
– i.e., the set where f is finite – is an immediate consequence of the “convexity inequality” (2.1.2).
It is convenient to think of a convex function as of something which is defined everywhere, since it
saves a lot of words. For example, with this convention I can write f + g (f and g are convex functions
on Rn ), and everybody will understand what is meant; without this convention, I am supposed to add
to this expression the explanation as follows: “f + g is a function with the domain being the intersection
of those of f and g, and in this intersection it is defined as (f + g)(x) = f (x) + g(x)”.
convexity of the objective f and the constraints gi is crucial: it turns out that problems with this property
possess nice theoretical properties (e.g., the local necessary optimality conditions for these problems are
sufficient for global optimality); and what is much more important, convex problems can be efficiently
(both in theoretical and, to some extent, in the practical meaning of the word) solved numerically, which
is not, unfortunately, the case for general nonconvex problems. This is why it is so important to know
how one can detect convexity of a given function. This is the issue we are coming to.
The scheme of our investigation is typical for mathematics. Let me start with an example which
you know from Analysis. How do you detect continuity of a function? Of course, there is a definition
of continuity in terms of ε and δ, but it would be an actual disaster if each time we need to prove
continuity of a function, we were supposed to write down the proof that "for every positive ε there exists
positive δ such that ...". In fact we use another approach: we list once and for all a number of standard
operations which preserve continuity, like addition, multiplication, taking superpositions, etc., and point
out a number of standard examples of continuous functions – like the power function, the exponent,
etc. To prove that the operations in the list preserve continuity, same as to prove that the standard
functions are continuous, takes a certain effort and indeed is done in ε–δ terms; but after this effort
is once invested, we normally have no difficulties with proving continuity of a given function: it suffices
to demonstrate that the function can be obtained, in finitely many steps, from our "raw materials" – the
standard functions which are known to be continuous – by applying our machinery – the combination
rules which preserve continuity. Normally this demonstration is given by a single word "evident" or even
is understood by default.
This is exactly the case with convexity. Here we also should point out the list of operations which
preserve convexity and a number of standard convex functions.
[you can prove it directly by verifying the definition or by noting that the epigraph of the super-
position, if nonempty, is the inverse image of the epigraph of f under an affine mapping]
• [stability under taking pointwise sup] the upper bound sup_α f_α(·) of every family of convex functions on
Rn is convex, provided that this bound is finite at least at one point.
[to understand it, note that the epigraph of the upper bound clearly is the intersection of the epigraphs
of the functions from the family; recall that the intersection of every family of convex sets is convex]
• ["Convex Monotone superposition"] Let f(x) = (f₁(x), ..., f_k(x)) be a vector-function on Rn with
convex components f_i, and assume that F is a convex function on Rk which is monotone, i.e., such
that z ≤ z′ always implies that F(z) ≤ F(z′). Then the superposition
φ(x) = F(f(x)) = F(f₁(x), ..., f_k(x))
is convex.
Remark 2.2.1 The expression F (f1 (x), ..., fk (x)) makes no evident sense at a point x where some
of fi ’s are +∞. By definition, we assign the superposition at such a point the value +∞.
[To justify the rule, note that if λ ∈ (0, 1) and x, x′ ∈ Dom φ, then z = f(x), z′ = f(x′) are vectors
from Rk which belong to Dom F, and due to the convexity of the components of f we have
f(λx + (1 − λ)x′) ≤ λz + (1 − λ)z′;
in particular, the left hand side is a vector from Rk – it has no "infinite entries", and we may
further use the monotonicity of F:
φ(λx + (1 − λ)x′) = F(f(λx + (1 − λ)x′)) ≤ F(λz + (1 − λ)z′) ≤ λF(z) + (1 − λ)F(z′) = λφ(x) + (1 − λ)φ(x′).]
Imagine how many extra words would be necessary here if there were no convention on the value of a
convex function outside its domain!
Two more rules are as follows:
• [stability under partial minimization] if f(x, y) is a convex function of z = (x, y) ∈ Rn × Rm and the function
g(x) = inf_y f(x, y)
is proper, i.e., is > −∞ everywhere and is finite at least at one point, then g is convex.
[this can be proved as follows. We should prove that if x, x′ ∈ Dom g and x″ =
λx + (1 − λ)x′ with λ ∈ [0, 1], then x″ ∈ Dom g and g(x″) ≤ λg(x) + (1 − λ)g(x′).
Given positive ε, we can find y and y′ such that (x, y) ∈ Dom f, (x′, y′) ∈ Dom f and
g(x) + ε ≥ f(x, y), g(x′) + ε ≥ f(x′, y′). Taking the weighted sum of these two inequalities,
we get
λg(x) + (1 − λ)g(x′) + ε ≥ λf(x, y) + (1 − λ)f(x′, y′) ≥ f(λx + (1 − λ)x′, λy + (1 − λ)y′)
(the last ≥ follows from the convexity of f). The concluding quantity in the chain is
≥ g(x″), and we get g(x″) ≤ λg(x) + (1 − λ)g(x′) + ε. In particular, x″ ∈ Dom g (recall
that g is assumed to take only values from R and the value +∞). Moreover, since
the resulting inequality is valid for all ε > 0, we come to g(x″) ≤ λg(x) + (1 − λ)g(x′),
as required.]
• the “conic transformation” of a convex function f on Rn – the function g(y, x) =
yf (x/y) – is convex in the half-space y > 0 in Rn+1 .
Now we know the basic operations preserving convexity. Let us look at the standard
functions these operations can be applied to. A number of examples was already given, but we still do
not know why the functions in those examples are convex. The usual way to check convexity of a "simple"
– given by a simple formula – function is based on differential criteria of convexity. Let us look at
these criteria.
f (xi ) → f (x), i → ∞,
(ii) is an immediate consequence of (i), since, as we know from the very beginning of Calculus,
a differentiable function – in the case in question, f′ – is monotonically nondecreasing
on an interval if and only if its derivative is nonnegative on this interval.
In fact, for functions of one variable there is a differential criterion of convexity which does
not assume any smoothness (we shall not prove this criterion):
Proposition 2.2.3 [convexity criterion for univariate functions]
Let g : R → R ∪ {+∞} be a function. Let the domain ∆ = {t : g(t) < ∞} of the function be
a convex set which is not a singleton, i.e., let it be an interval (a, b) with possibly added one
or both endpoints (−∞ ≤ a < b ≤ ∞). g is convex if and only if it satisfies the following 3
requirements:
1) g is continuous on (a, b);
2) g is differentiable everywhere on (a, b), excluding, possibly, a countable set of points, and
the derivative g 0 (t) is nondecreasing on its domain;
3) at each endpoint u of the interval (a, b) which belongs to ∆, g is upper semicontinuous:
g(u) ≥ lim sup_{t∈(a,b), t→u} g(t).
Proof of Proposition 2.2.2: Let x, y ∈ M and z = λx + (1 − λ)y, λ ∈ [0, 1], and let us
prove that
f (z) ≤ λf (x) + (1 − λ)f (y).
As we know from Theorem 1.1.1.(iii), there exist sequences xi ∈ rint M and yi ∈ rint M
converging, respectively to x and to y. Then zi = λxi + (1 − λ)yi converges to z as i → ∞,
and since f is convex on rint M, we have
f(z_i) ≤ λf(x_i) + (1 − λ)f(y_i);
passing to limit and taking into account that f is continuous on M and x_i, y_i, z_i converge,
as i → ∞, to x, y, z ∈ M, respectively, we obtain the required inequality. □
From Propositions 2.2.1.(ii) and 2.2.2 we get the following convenient necessary and sufficient condition
for convexity of a smooth function of n variables: if Q ⊂ Rn is a convex set with a nonempty interior and f is continuous on Q and twice differentiable on int Q, then f is convex on Q if and only if the Hessian f″ of f is positive semidefinite everywhere on int Q:
h^T f″(x) h ≥ 0  ∀x ∈ int Q ∀h ∈ Rn.
Proof. The "only if" part is evident: if f is convex and x ∈ Q0 = int Q, then the function
of one variable

g(t) = f(x + th)

(h is an arbitrary fixed direction in Rⁿ) is convex in a certain neighbourhood of the point
t = 0 on the axis (recall that affine substitutions of argument preserve convexity). Since f
is twice differentiable in a neighbourhood of x, g is twice differentiable in a neighbourhood
of t = 0, so that g″(0) = hᵀf″(x)h ≥ 0 by Proposition 2.2.1. 2
Now let us prove the "if" part, so that we are given that hᵀf″(x)h ≥ 0 for every x ∈ int Q
and every h ∈ Rⁿ, and we should prove that f is convex.
Let us first prove that f is convex on the interior Q0 of the domain Q. By Theorem 1.1.1, Q0
is a convex set. Since, as it was already explained, the convexity of a function on a convex
set is a one-dimensional fact, all we should prove is that every one-dimensional function
g(t) = f(x + th) (x ∈ Q0, h ∈ Rⁿ) is convex on the set of those t for which x + th ∈ Q0; and
indeed it is, since g″(t) = hᵀf″(x + th)h ≥ 0, so that g is convex by Proposition 2.2.1. It
remains to note that f, being convex on Q0 and continuous on Q, is convex on the whole Q
by Proposition 2.2.2. 2
Applying the combination rules preserving convexity to simple functions which pass the "infinitesimal"
convexity tests, we can prove convexity of many complicated functions. Consider, e.g., an exponential
posynomial – a function

f(x) = Σ_{i=1}^N cᵢ exp{aᵢᵀx}

with positive coefficients cᵢ (this is why the function is called a posynomial). How could we prove that the
function is convex? This is immediate:
• exp{t} is convex (since its second order derivative is positive and therefore the first derivative is
monotone, as required by the infinitesimal convexity test for smooth functions of one variable);
• consequently, all the functions exp{aᵢᵀx} are convex (stability of convexity under affine substitutions
of argument);
• consequently, f is convex (stability of convexity under taking linear combinations with nonnegative
coefficients).
And if we were supposed to prove that the maximum of three posynomials is convex? Ok, we could
add to our three steps the fourth, which refers to stability of convexity under taking pointwise supremum.
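The chain of arguments above is easy to test numerically. The following sketch is not part of the original
notes: it assumes Python with numpy, and the data A, c are made up. It checks that the Hessian of an
exponential posynomial – a positive combination of the rank-one matrices aᵢaᵢᵀ – is positive semidefinite
at randomly sampled points, in accordance with the differential criterion above:

    import numpy as np

    rng = np.random.default_rng(0)
    N, n = 5, 3
    A = rng.standard_normal((N, n))    # rows are the vectors a_i
    c = rng.uniform(0.5, 2.0, size=N)  # positive coefficients c_i

    def hessian(x):
        # f''(x) = sum_i c_i exp(a_i^T x) a_i a_i^T: a nonnegative
        # combination of rank-one PSD matrices, hence PSD.
        w = c * np.exp(A @ x)
        return A.T @ (w[:, None] * A)

    for _ in range(100):
        x = rng.standard_normal(n)
        assert np.linalg.eigvalsh(hessian(x)).min() >= -1e-9
    print("Hessian PSD at all sampled points, as convexity requires")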
Proposition 2.3.1 [Gradient inequality] Let f be a function taking real values and, possibly, the value +∞,
let x be an interior point of the domain of f, and let Q be a convex set containing x. Assume that
• f is convex on Q
and
• f is differentiable at x,
and let ∇f(x) be the gradient of the function at x. Then the following inequality holds:

f(y) ≥ f(x) + (y − x)ᵀ∇f(x)  ∀y ∈ Q.   (2.3.1)
Proof. Let y ∈ Q. There is nothing to prove if y ∉ Dom f (since there the right hand side in the gradient
inequality is +∞), same as there is nothing to prove when y = x. Thus, we can assume that y ≠ x and
y ∈ Dom f. Let us set
yτ = x + τ (y − x), 0 < τ ≤ 1,
so that y1 = y and yτ is an interior point of the segment [x, y] for 0 < τ < 1. Now let us use the following
extremely simple
Lemma 2.3.1 Let x, x′, x″ be three distinct points with x′ ∈ [x, x″], and let f be convex and
finite on [x, x″]. Then

(f(x′) − f(x)) / ‖x′ − x‖₂ ≤ (f(x″) − f(x)) / ‖x″ − x‖₂.   (2.3.2)
Indeed, we have x′ = x + λ(x″ − x) with λ = ‖x′ − x‖₂ / ‖x″ − x‖₂ ∈ (0, 1), whence, by convexity,
f(x′) ≤ (1 − λ)f(x) + λf(x″); rearranging, we get (2.3.2). Applying the Lemma to the triple x, yτ, y,
we obtain (f(yτ) − f(x)) / ‖yτ − x‖₂ ≤ (f(y) − f(x)) / ‖y − x‖₂;
as τ → +0, the left hand side in this inequality, by the definition of the gradient, tends to
‖y − x‖₂⁻¹ (y − x)ᵀ∇f(x), and we get

‖y − x‖₂⁻¹ (y − x)ᵀ∇f(x) ≤ ‖y − x‖₂⁻¹ (f(y) − f(x)),

i.e., the Gradient inequality (2.3.1). 2
In fact the Gradient inequality characterizes convexity: if f is continuous on a convex set Q and satisfies
the Gradient inequality for all x ∈ int Q and all y ∈ Q, then f is convex on Q. The "only if" part of this
equivalence is given by Proposition 2.3.1. To prove the "if" part, i.e., to establish the implication inverse
to the above, assume that f satisfies the Gradient inequality for all x ∈ int Q and all y ∈ Q, and let us
verify that f is convex on Q. It suffices to prove that f is convex on the interior Q0 of the set Q (see
Proposition 2.2.2; recall that by assumption f is continuous on Q and Q is convex). To prove that f is
convex on Q0, note that Q0 is convex (Theorem 1.1.1) and that, due to the Gradient inequality, on Q0
f is the upper bound of a family of affine (and therefore convex) functions:

f(y) = sup_{x∈Q0} [f(x) + (y − x)ᵀ∇f(x)],  y ∈ Q0. 2
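As a numerical illustration of the Gradient inequality – again a sketch that is not part of the notes: the
choice of the convex function (log-sum-exp) and of its softmax gradient is ours, and numpy is assumed:

    import numpy as np

    def f(x):                      # log-sum-exp, a smooth convex function
        return np.log(np.sum(np.exp(x)))

    def grad_f(x):                 # its gradient (the softmax vector)
        e = np.exp(x)
        return e / e.sum()

    rng = np.random.default_rng(1)
    for _ in range(1000):
        x, y = rng.standard_normal(4), rng.standard_normal(4)
        assert f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-12
    print("Gradient inequality holds at all sampled pairs")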
In particular, f is bounded on K.
Remark 2.4.1 All three assumptions on K – (1) closedness, (2) boundedness, and (3) K ⊂ rint Dom f
– are essential, as is seen from the following three examples:
• f(x) = 1/x, Dom f = (0, +∞), K = (0, 1]. We have (2), (3) but not (1); f is neither bounded, nor
Lipschitz continuous on K.
• f(x) = x², Dom f = R, K = R. We have (1), (3) and not (2); f is neither bounded nor Lipschitz
continuous on K.
• f(x) = −√x, Dom f = [0, +∞), K = [0, 1]. We have (1), (2) and not (3); f is not Lipschitz
continuous on K1), although it is bounded. With a properly chosen convex function f of two variables
and a non-polyhedral compact domain (e.g., with Dom f being the unit circle), one could demonstrate
also that lack of (3), even in the presence of (1) and (2), may cause unboundedness of f on K as well.
Remark 2.4.2 Theorem 2.4.1 says that a convex function f is bounded on every compact (i.e., closed
and bounded) subset of the relative interior of Dom f. In fact there is a much stronger statement on the
below boundedness of f: f is below bounded on any bounded subset of Rⁿ!
Proof of Theorem 2.4.1. We shall start with the following local version of the Theorem.
Proposition 2.4.1 Let f be a convex function, and let x̄ be a point from the relative interior of the
domain Dom f of f . Then
(i) f is bounded at x̄: there exists a positive r such that f is bounded in the r-neighbourhood Ur(x̄)
of x̄ in the affine span of Dom f:

sup_{x∈Ur(x̄)} |f(x)| < ∞;

(ii) f is Lipschitz continuous at x̄, i.e., there exist a positive ρ and a constant L such that

|f(x) − f(x̄)| ≤ L ‖x − x̄‖₂  ∀x ∈ Uρ(x̄).
Implication “Proposition 2.4.1 ⇒ Theorem 2.4.1” is given by standard Analysis reasoning. All
we need is to prove that if K is a bounded and closed (i.e., a compact) subset of rint Dom f , then f
is Lipschitz continuous on K (the boundedness of f on K is an evident consequence of its Lipschitz
continuity on K and the boundedness of K). Assume, on the contrary, that f is not Lipschitz continuous on K;
then for every integer i there exists a pair of points xᵢ, yᵢ ∈ K such that

f(xᵢ) − f(yᵢ) > i ‖xᵢ − yᵢ‖₂.   (2.4.2)

Since K is compact, passing to a subsequence we may assume that xᵢ → x ∈ K and yᵢ → y ∈ K. If
x = y, then, by Proposition 2.4.1, f is Lipschitz continuous at x, so that for all large i the ratios
(f(xᵢ) − f(yᵢ)) / ‖xᵢ − yᵢ‖₂ would form a bounded sequence, which we know is not the case – by (2.4.2),
these ratios are > i. Thus, the case x = y is impossible.
The case x ≠ y is "even less possible": by Proposition 2.4.1, f is continuous at both the
points x and y (note that Lipschitz continuity at a point clearly implies the usual continuity at it), so
that we would have f(xᵢ) → f(x) and f(yᵢ) → f(y) as i → ∞. Thus, the left hand side in (2.4.2) remains
bounded as i → ∞. In the right hand side, one factor – i – tends to ∞, and the other one has a nonzero
limit ‖x − y‖₂, so that the right hand side tends to ∞ as i → ∞; this is the desired contradiction. 2
We know that x̄ is the point from the relative interior of ∆ (Exercise 1.8); since ∆ spans the same affine
subspace as Dom f , it means that ∆ contains Ur (x̄) with certain r > 0. Now, we have
∆ = { Σ_{i=0}^m λᵢxᵢ : λᵢ ≥ 0, Σᵢ λᵢ = 1 },

so that on ∆ f is bounded from above by the quantity C ≡ max_{0≤i≤m} f(xᵢ), by Jensen's inequality:

f(Σ_{i=0}^m λᵢxᵢ) ≤ Σ_{i=0}^m λᵢf(xᵢ) ≤ max_i f(xᵢ) = C.

In particular, f is bounded from above by C in Ur(x̄) ⊂ ∆. Further, for x ∈ Ur(x̄) the point x′ = 2x̄ − x
also belongs to Ur(x̄), and since x̄ is the midpoint of [x′, x], convexity gives f(x̄) ≤ ½f(x′) + ½f(x),
whence

f(x) ≥ 2f(x̄) − f(x′) ≥ 2f(x̄) − C,  x ∈ Ur(x̄),

and f is indeed below bounded in Ur.
(i) is proved.
3⁰. (ii) is an immediate consequence of (i) and Lemma 2.3.1. Indeed, let us prove that f is Lipschitz
continuous in the neighbourhood U_{r/2}(x̄), where r > 0 is such that f is bounded in Ur(x̄) (we already
know from (i) that the required r does exist). Let |f| ≤ C in Ur, and let x, x′ ∈ U_{r/2}, x ≠ x′.
2) To see that the required ∆ exists, let us act as follows: first, the case of Dom f being a singleton is evident, so
that we can assume that Dom f is a convex set of dimension m ≥ 1. Without loss of generality, we may assume
that x̄ = 0, so that 0 ∈ Dom f and therefore Aff(Dom f) = Lin(Dom f). By Linear Algebra, we can find m
vectors y₁, ..., yₘ in Dom f which form a basis in Lin(Dom f) = Aff(Dom f). Setting y₀ = −Σ_{i=1}^m yᵢ and taking
into account that 0 = x̄ ∈ rint Dom f, we can find ε > 0 such that the vectors xᵢ = εyᵢ, i = 0, ..., m, belong to
Ur̄(x̄). By construction, x̄ = 0 = (1/(m+1)) Σ_{i=0}^m xᵢ.
Let us extend the segment [x, x′] through the point x′ until it reaches, at a certain point x″, the (relative)
boundary of Ur. We have

x′ ∈ (x, x″);  ‖x″ − x̄‖₂ = r.

From (2.3.2) we have

f(x′) − f(x) ≤ ‖x′ − x‖₂ · (f(x″) − f(x)) / ‖x″ − x‖₂.

The second factor in the right hand side does not exceed the quantity (2C)/(r/2) = 4C/r; indeed, the
numerator is, in absolute value, at most 2C (since |f| is bounded by C in Ur and both x, x″ belong to
Ur), and the denominator is at least r/2 (indeed, x is at distance at most r/2 from x̄, and x″ is at
distance exactly r from x̄, so that the distance between x and x″, by the triangle inequality, is at
least r/2). Thus, we have

f(x′) − f(x) ≤ (4C/r) ‖x′ − x‖₂,  x, x′ ∈ U_{r/2};

swapping x and x′, we come to

f(x) − f(x′) ≤ (4C/r) ‖x′ − x‖₂,

whence

|f(x) − f(x′)| ≤ (4C/r) ‖x − x′‖₂,  x, x′ ∈ U_{r/2},

as required in (ii). 2
2) To prove convexity of Argmin_Q f, note that Argmin_Q f is nothing but the level set lev_α(f) of f
associated with the minimal value α = min_Q f of f on Q; as a level set of a convex function, this set is convex
(Proposition 2.1.4).
3) To prove that the set Argmin_Q f associated with a strictly convex f is, if nonempty, a singleton,
note that if there were two distinct minimizers x′, x″, then, from strict convexity, we would have

f(½x′ + ½x″) < ½[f(x′) + f(x″)] = min_Q f,

which clearly is impossible – the argument in the left hand side is a point from Q!
Another pleasant fact is that in the case of differentiable convex functions the necessary optimality
condition known from Calculus (the Fermat rule) is sufficient for global optimality:
Theorem 2.5.2 [Necessary and sufficient optimality condition for a differentiable convex function]
Let f be a convex function on a convex set Q ⊂ Rⁿ, and let x∗ be an interior point of Q. Assume that
f is differentiable at x∗. Then x∗ is a minimizer of f on Q if and only if
∇f (x∗ ) = 0.
Proof. As a necessary condition for local optimality, the relation ∇f(x∗) = 0 is known from Calculus;
it has nothing in common with convexity. The essence of the matter is, of course, the sufficiency of the
condition ∇f(x∗) = 0 for global optimality of x∗ in the case of convex f. This sufficiency is readily given
by the Gradient inequality (2.3.1): by virtue of this inequality and due to ∇f(x∗) = 0,

f(y) ≥ f(x∗) + (y − x∗)ᵀ∇f(x∗) = f(x∗)

for all y ∈ Q. 2
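A minimal computational sketch of Theorem 2.5.2 (not from the notes; the quadratic f(x) = ‖Ax − b‖₂²
with made-up data is our choice of a smooth convex function): solving ∇f(x) = 0 produces a point which
no sampled point beats.

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((8, 3))
    b = rng.standard_normal(8)

    f = lambda x: np.sum((A @ x - b) ** 2)     # smooth convex quadratic
    grad = lambda x: 2 * A.T @ (A @ x - b)

    # Fermat rule: solve grad f(x) = 0, i.e. the normal equations.
    x_star = np.linalg.solve(A.T @ A, A.T @ b)
    assert np.allclose(grad(x_star), 0)

    # By Theorem 2.5.2, x_star is a *global* minimizer; sampling agrees.
    assert all(f(x_star) <= f(x_star + rng.standard_normal(3))
               for _ in range(1000))
    print("grad f(x*) = 0 certifies the global minimum")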
A natural question is what happens if x∗ in the above statement is not necessarily an interior point
of Q. Thus, assume that x∗ is an arbitrary point of a convex set Q and that f is convex on Q and
differentiable at x∗ (the latter means exactly that Dom f contains a neighbourhood of x∗ and f possesses
the first order derivative at x∗ ). Under these assumptions, when x∗ is a minimizer of f on Q?
The answer is as follows: let

TQ(x∗) = {h ∈ Rⁿ : x∗ + th ∈ Q for all small enough t > 0}

be the radial cone of Q at x∗; geometrically, this is the set of all directions leading from x∗ inside Q, so
that a small enough positive step from x∗ along the direction keeps the point in Q. From the convexity
of Q it immediately follows that the radial cone indeed is a convex cone (not necessarily closed). For
example, when x∗ is an interior point of Q, then the radial cone to Q at x∗ clearly is the entire Rⁿ. A
more interesting example is the radial cone to a polyhedral set

Q = {x : aᵢᵀx ≤ bᵢ, i = 1, ..., m};   (2.5.3)

here the radial cone is the polyhedral cone

TQ(x∗) = {h : aᵢᵀh ≤ 0, i ∈ I(x∗)},   (2.5.4)

the inequalities defining it corresponding to the constraints aᵢᵀx ≤ bᵢ which are active at x∗ (i.e., satisfied
at the point as equalities rather than as strict inequalities), I(x∗) ≡ {i : aᵢᵀx∗ = bᵢ}.
Now, for the functions in question the necessary and sufficient condition for x∗ to be a minimizer of
f on Q is as follows:
Proposition 2.5.1 Let Q be a convex set, let x∗ ∈ Q, and let f be a function which is convex on Q and
differentiable at x∗. The necessary and sufficient condition for x∗ to be a minimizer of f on Q is that
the derivative of f taken at x∗ along every direction from TQ(x∗) is nonnegative:
hT ∇f (x∗ ) ≥ 0 ∀h ∈ TQ (x∗ ).
Proof is immediate. The necessity is an evident fact which has nothing in common with convexity:
assuming that x∗ is a local minimizer of f on Q, we note that if there were h ∈ TQ (x∗ ) with hT ∇f (x∗ ) < 0,
then we would have
f (x∗ + th) < f (x∗ )
for all small enough positive t. On the other hand, x∗ + th ∈ Q for all small enough positive t due to
h ∈ TQ(x∗). Combining these observations, we conclude that in every neighbourhood of x∗ there are
points from Q with values of f strictly better than the value at x∗; this contradicts the assumption that
x∗ is a local minimizer of f on Q.
The sufficiency is given by the Gradient Inequality, exactly as in the case when x∗ is an interior point
of Q.
Proposition 2.5.1 says that whenever f is convex on Q and differentiable at x∗ ∈ Q, the necessary
and sufficient condition for x∗ to be a minimizer of f on Q is that the linear form given by the gradient
∇f (x∗ ) of f at x∗ should be nonnegative at all directions from the radial cone TQ (x∗ ). The linear forms
nonnegative at all directions from the radial cone also form a cone; it is called the cone normal to Q at
x∗ and is denoted NQ (x∗ ). Thus, Proposition says that the necessary and sufficient condition for x∗ to
minimize f on Q is the inclusion ∇f (x∗ ) ∈ NQ (x∗ ). What does this condition actually mean, it depends
on what is the normal cone: whenever we have an explicit description of it, we have an explicit form of
the optimality condition.
For example, when TQ (x∗ ) = Rn (it is the same as to say that x∗ is an interior point of Q), then the
normal cone is comprised of the linear forms nonnegative at the entire space, i.e., it is the trivial cone
{0}; consequently, for the case in question the optimality condition becomes the Fermat rule ∇f (x∗ ) = 0,
as we already know.
When Q is the polyhedral set (2.5.3), the radial cone is the polyhedral cone (2.5.4); it is comprised
of all directions which have nonpositive inner products with all the aᵢ coming from the active, in the afore-
mentioned sense, constraints. The normal cone is then comprised of all vectors which have nonnegative inner
products with all these directions, i.e., of vectors a such that the inequality hᵀa ≥ 0 is a consequence
of the inequalities hᵀaᵢ ≤ 0, i ∈ I(x∗) ≡ {i : aᵢᵀx∗ = bᵢ}. From the Homogeneous Farkas Lemma we
conclude that the normal cone is simply the conic hull of the vectors −aᵢ, i ∈ I(x∗). Thus, in the case in
question the optimality condition ∇f(x∗) ∈ NQ(x∗) reads:
x∗ ∈ Q is a minimizer of f on Q if and only if there exist nonnegative reals λ∗i associated with “active”
(those from I(x∗ )) values of i such that
∇f(x∗) + Σ_{i∈I(x∗)} λ∗ᵢ aᵢ = 0.
These are the famous Karush-Kuhn-Tucker optimality conditions; these conditions are necessary for
optimality in an essentially wider situation.
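Here is a small sketch of the Karush-Kuhn-Tucker conditions in action (not part of the notes; the data
are hypothetical). We minimize ‖x − p‖₂² over the polyhedral set {x : −xᵢ ≤ 0}, where the minimizer is
the projection max(p, 0), and verify the KKT equation over the active set:

    import numpy as np

    # min ||x - p||_2^2 over Q = {x : -x_i <= 0, i = 1..n} (a_i = -e_i).
    # The minimizer is the projection x* = max(p, 0).
    p = np.array([1.5, -0.7, 0.0, -2.0])
    x_star = np.maximum(p, 0.0)

    grad = 2 * (x_star - p)
    I = np.where(x_star == 0.0)[0]   # active constraints -x_i <= 0
    lam = np.zeros_like(p)
    lam[I] = -2 * p[I]               # candidate multipliers, lam_i >= 0

    # KKT equation: grad f(x*) + sum_{i in I} lam_i a_i = 0 with a_i = -e_i.
    residual = grad.copy()
    residual[I] -= lam[I]
    assert np.allclose(residual, 0) and (lam >= 0).all()
    print("KKT conditions hold at the projection")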
The indicated results demonstrate that the fact that a point x∗ ∈ Dom f is a global minimizer of a
convex function f depends only on the local behaviour of f at x∗. This is not the case with maximizers
of a convex function. First of all, such a maximizer, if it exists, in all nontrivial cases belongs to the
boundary of the domain of the function:
Theorem 2.5.3 Let f be convex, and let Q be the domain of f . Assume that f attains its maximum on
Q at a point x∗ from the relative interior of Q. Then f is constant on Q.
Proof. Let y ∈ Q; we should prove that f(y) = f(x∗). There is nothing to prove if y = x∗, so that we
may assume that y ≠ x∗. Since, by assumption, x∗ ∈ rint Q, we can extend the segment [x∗, y] through
the endpoint x∗, keeping the left endpoint of the segment in Q; in other words, there exists a point y′ ∈ Q
such that x∗ is an interior point of the segment [y′, y]:

x∗ = λy′ + (1 − λ)y,  λ ∈ (0, 1).

By convexity, f(x∗) ≤ λf(y′) + (1 − λ)f(y). Since both f(y′) and f(y) do not exceed f(x∗) (x∗ is a
maximizer of f on Q!) and both the weights λ and 1 − λ are strictly positive, the indicated inequality
can be valid only if f(y′) = f(y) = f(x∗). 2
The next theorem gives further information on maxima of convex functions:

Theorem 2.5.4 Let f be a convex function and let E ⊂ Rⁿ be a set with Conv E ⊂ Dom f. Then

sup_{Conv E} f = sup_E f.   (2.5.5)

In particular, if S ⊂ Rⁿ is a convex and compact set, then the supremum of f on S is equal to the supremum
of f on the set of extreme points of S:

sup_S f = sup_{Ext(S)} f.   (2.5.6)
Proof. To prove (2.5.5), let x ∈ Conv E, so that x is a convex combination of points from E (Theorem
1.1.4 on the structure of convex hull):

x = Σᵢ λᵢxᵢ  [xᵢ ∈ E, λᵢ ≥ 0, Σᵢ λᵢ = 1].

Applying Jensen's inequality, we get f(x) ≤ Σᵢ λᵢf(xᵢ) ≤ sup_E f,
so that the left hand side in (2.5.5) is ≤ the right hand one; the inverse inequality is evident, since
Conv E ⊃ E. 2
To derive (2.5.6) from (2.5.5), it suffices to note that from the Krein-Milman Theorem (Theorem
1.2.10) for a convex compact set S one has S = Conv Ext(S).
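A quick numerical companion to (2.5.5)–(2.5.6) – a sketch, not from the notes; the convex function ‖Bx‖₂
and the unit box are our choices: the maximum over sampled points of the box never exceeds the best
vertex value.

    import numpy as np
    from itertools import product

    rng = np.random.default_rng(3)
    B = rng.standard_normal((3, 3))
    f = lambda x: np.linalg.norm(B @ x)   # convex: a norm of a linear map

    # Best value over the vertices {0,1}^3 of the unit box.
    best_vertex = max(f(np.array(v)) for v in product([0.0, 1.0], repeat=3))

    # No sampled point of the box beats the best vertex.
    for _ in range(2000):
        x = rng.uniform(0.0, 1.0, size=3)
        assert f(x) <= best_vertex + 1e-12
    print("maximum over the box is attained at a vertex:", best_vertex)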
The last theorem on maxima of convex functions is as follows:
Theorem 2.5.5∗ Let f be a convex function such that the domain Q of f is closed and
does not contain lines. Then
(i) If the set

Argmax_Q f ≡ {x ∈ Q : f(x) ≥ f(y) ∀y ∈ Q}

of global maximizers of f is nonempty, then it intersects the set Ext(Q) of the extreme points
of Q, so that at least one of the maximizers of f is an extreme point of Q;
(ii) If the set Q is polyhedral and f is above bounded on Q, then the maximum of f on Q is
achieved: Argmax_Q f ≠ ∅.
Proof. Let us start with (i). We shall prove this statement by induction on the dimension
of Q. The base dim Q = 0, i.e., the case of a singleton Q, is trivial, since here Q = Ext Q =
Argmax_Q f. Now assume that the statement is valid for the case of dim Q ≤ p, and let us
prove that it is valid also for the case of dim Q = p + 1. Let us first verify that the set
Argmax_Q f intersects with the (relative) boundary of Q. Indeed, let x ∈ Argmax_Q f. There
is nothing to prove if x itself is a relative boundary point of Q; and if x is not a boundary
point, then, by Theorem 2.5.3, f is constant on Q, so that Argmax_Q f = Q; and since Q is
closed, every relative boundary point of Q (such a point does exist, since Q does not contain
lines and is of positive dimension) is a maximizer of f on Q, so that here again Argmax_Q f
intersects ∂ri Q.
Thus, among the maximizers of f there exists at least one, let it be x, which belongs to the
relative boundary of Q. Let H be the hyperplane which supports Q at x (see Section 1.2.6),
and let Q′ = Q ∩ H. The set Q′ is closed and convex (since Q and H are), nonempty (it
contains x) and does not contain lines (since Q does not). We have max_Q f = f(x) = max_{Q′} f
(note that Q′ ⊂ Q), whence

∅ ≠ Argmax_{Q′} f ⊂ Argmax_Q f.

Same as in the proof of the Krein-Milman Theorem (Theorem 1.2.10), we have dim Q′ <
dim Q. In view of this inequality we can apply to f and Q′ our inductive hypothesis to get

Ext(Q′) ∩ Argmax_{Q′} f ≠ ∅.

Since Ext(Q′) ⊂ Ext(Q) by Lemma 1.2.4 and, as we just have seen, Argmax_{Q′} f ⊂ Argmax_Q f,
we conclude that the set Ext(Q) ∩ Argmax_Q f is not smaller than Ext(Q′) ∩ Argmax_{Q′} f and is
therefore nonempty, as required. 2
To prove (ii), let us use the known to us from Lecture 4 results on the structure of a polyhedral
convex set:
Q = Conv(V ) + Cone (R),
where V and R are finite sets. We are about to prove that the upper bound of f on Q is
exactly the maximum of f on the finite set V:

sup_Q f = max_{v∈V} f(v).   (2.5.7)

This will mean, in particular, that f attains its maximum on Q – e.g., at the point of V
where f attains its maximum on V.
To prove the announced statement, I first claim that if f is above bounded on Q, then every
direction r ∈ Cone(R) is a descent direction for f, i.e., every step in this direction taken
from any point x ∈ Q does not increase f:

f(x + tr) ≤ f(x)  ∀(x ∈ Q, r ∈ Cone(R), t ≥ 0).   (2.5.8)
Indeed, if, on the contrary, there were x ∈ Q, r ∈ R and t ≥ 0 such that f(x + tr) > f(x), we
would have t > 0 and, by Lemma 2.3.1,

f(x + sr) ≥ f(x) + (s/t)(f(x + tr) − f(x)),  s ≥ t.
Since x ∈ Q and r ∈ Cone (R), x + sr ∈ Q for all s ≥ 0, and since f is above bounded on Q,
the left hand side in the latter inequality is above bounded, while the right hand one, due to
f (x + tr) > f (x), goes to +∞ as s → ∞, which is the desired contradiction.
Now we are done: to prove (2.5.7), note that a generic point x ∈ Q can be represented as

x = Σ_{v∈V} λᵥv + r  [r ∈ Cone(R); Σᵥ λᵥ = 1, λᵥ ≥ 0],

and we have

f(x) = f(Σ_{v∈V} λᵥv + r)
 ≤ f(Σ_{v∈V} λᵥv)   [by (2.5.8)]
 ≤ Σ_{v∈V} λᵥf(v)   [Jensen's Inequality]
 ≤ max_{v∈V} f(v). 2
Epi(f) = {(t, x) ∈ Rⁿ⁺¹ : t ≥ f(x)}

is a nonempty convex set. Thus, there is no essential difference between convex functions
and convex sets: a convex function generates a convex set – its epigraph – which of course
remembers everything about the function. And the only specific property of the epigraph as
a convex set is that it has a recessive direction – namely, e = (1, 0) – such that the intersection
of the epigraph with every line directed by e is either empty, or is a closed ray. Whenever a
nonempty convex set possesses such a property with respect to a certain direction, it can be
represented, in properly chosen coordinates, as the epigraph of some convex function. Thus,
a convex function is, basically, nothing but a way to look, in the literal meaning of the latter
verb, at a convex set.
Now, we know that “actually good” convex sets are closed ones: they possess a lot of
important properties (e.g., admit a good outer description) which are not shared by arbitrary
convex sets. It means that among convex functions there also are “actually good” ones –
those with closed epigraphs. Closedness of the epigraph can be “translated” to the functional
language and there becomes a special kind of continuity – lower semicontinuity:
Definition 2.6.1 [Lower semicontinuity] Let f be a function (not necessarily convex) defined
on Rⁿ and taking values in R ∪ {+∞}. We say that f is lower semicontinuous at a point
x̄, if for every sequence of points {xᵢ} converging to x̄ one has

f(x̄) ≤ lim inf_{i→∞} f(xᵢ)

(here, of course, lim inf of a sequence with all terms equal to +∞ is +∞).
f is called lower semicontinuous, if it is lower semicontinuous at every point.
Proposition 2.6.1 A function f defined on Rn and taking values from R ∪ {+∞} is lower
semicontinuous if and only if its epigraph is closed (e.g., due to its emptiness).
We shall not prove this statement, same as most of other statements in this Section; the
reader definitely is able to restore (very simple) proofs we are skipping.
An immediate consequence of the latter proposition is as follows: the upper bound sup_α f_α of an
arbitrary family of lower semicontinuous functions is lower semicontinuous.
Indeed, the epigraph of the upper bound is the intersection of the epigraphs of the functions
forming the bound, and the intersection of closed sets always is closed.
Now let us look at convex lower semicontinuous functions; according to our general conven-
tion, “convex” means “satisfying the convexity inequality and finite at least at one point”,
or, which is the same, “with convex nonempty epigraph”; and as we just have seen, “lower
semicontinuous” means “with closed epigraph”. Thus, we are interested in functions with
closed convex nonempty epigraphs; to save words, let us call these functions proper.
What we are about to do is to translate to the functional language several constructions and
results related to convex sets. In the usual life, a translation (e.g. of poetry) typically results
in something less rich than the original; in contrast to this, in mathematics this is a powerful
source of new ideas and constructions.
Consider a closed half-space

Π = {(t, x) : αt + dᵀx ≥ a}   (*)

containing Epi(f) (note that what is written in the right hand side of the latter relation
is one of many universal forms of writing down a general nonstrict linear inequality in the
space where the epigraph lives; this is the form most convenient for us now). Thus, e
should be a recessive direction of Π ⊃ Epi(f); as is immediately seen, recessivity of e for
Π means exactly that α ≥ 0. Thus, speaking about closed half-spaces containing Epi(f), we
in fact are considering some of the half-spaces (*) with α ≥ 0.
Now, there are two essentially different possibilities for α to be nonnegative – (A) to be
positive, and (B) to be zero. In the case of (B) the boundary hyperplane of Π is "vertical"
– it is parallel to e, and in fact it "bounds" only x: Π is comprised of all pairs (t, x) with x
belonging to a certain half-space in the x-subspace and t being an arbitrary real. These "vertical"
half-spaces will be of no interest for us.
The half-spaces which indeed are of interest for us are the "nonvertical" ones: those given
by the case (A), i.e., with α > 0. For a non-vertical half-space Π, we always can divide the
inequality defining Π by α and thus make α = 1. Thus, a "nonvertical" candidate to the role
of a closed half-space containing Epi(f) always can be written down as

Π = {(t, x) : t ≥ dᵀx − a};

the inclusion Epi(f) ⊂ Π means exactly that the affine function dᵀx − a is a minorant of f.
Proposition 2.6.2 A proper convex function f is the upper bound of all its affine minorants.
Moreover, at every point x̄ ∈ rint Dom f from the relative interior of the domain of f, f is not
merely the upper bound, but the maximum of its affine minorants: there exists an affine function
fx̄(x) which is ≤ f(x) everywhere in Rⁿ and is equal to f at x = x̄.
Proof. I. We start with the “Moreover” part of the statement; this is the key to the entire
statement. Thus, we are about to prove that if x̄ ∈ rint Dom f , then there exists an affine
function fx̄ (x) which is everywhere ≤ f (x), and at x = x̄ the inequality becomes an equality.
I.1⁰. First of all, we can easily reduce the situation to the one when Dom f is full-dimensional.
Indeed, by shifting f we may make the affine span Aff(Dom f) of the domain of f a
linear subspace L in Rⁿ; restricting f onto this linear subspace, we clearly get a proper
function on L. If we believe that our statement is true for the case when the domain of f is
full-dimensional, we can conclude that there exists an affine function

dᵀx − a  [x ∈ L]

on L (d ∈ L) such that

f(x) ≥ dᵀx − a ∀x ∈ L;  f(x̄) = dᵀx̄ − a.

The affine function we get clearly can be extended, by the same formula, from L to the
entire Rⁿ and is a minorant of f on the entire Rⁿ – outside of L ⊃ Dom f, f simply is +∞!
This minorant on Rⁿ is exactly what we need.
I.2⁰. Now let us prove that our statement is valid when Dom f is full-dimensional, so that
x̄ is an interior point of the domain of f. Let us look at the point y = (f(x̄), x̄). This is a
point from the epigraph of f, and we claim that it is a point from the relative boundary of
the epigraph. Indeed, if y were a relative interior point of Epi(f), then, taking y′ = y + e,
we would get a segment [y′, y] contained in Epi(f); since the endpoint y of the segment is
assumed to be relative interior for Epi(f), we could extend this segment a little through this
endpoint, not leaving Epi(f); but this clearly is impossible, since the t-coordinate of the new
endpoint would be < f(x̄), and the x-component of it still would be x̄.
Thus, y is a point from the relative boundary of Epi(f). Now we claim that y′ is an interior
point of Epi(f). This is immediate: we know from Theorem 2.4.1 that f is continuous at x̄,
so that there exists a neighbourhood U of x̄ in Aff(Dom f) = Rⁿ such that f(x) ≤ f(x̄) + 0.5
whenever x ∈ U, or, in other words, the set {(t, x) : x ∈ U, t ≥ f(x̄) + 0.5} – a set containing
a neighbourhood of y′ – is contained in Epi(f).
so that the right hand side is an affine minorant of f on Dom f and therefore – on Rn
(f = +∞ outside Dom f !). It remains to note that (#) is equality at x̄, since (&) is equality
at y. 2
II. We have proved that if F is the set of all affine functions which are minorants of f, then
the function

f̄(x) = sup_{φ∈F} φ(x)

is equal to f on rint Dom f (and at x from the latter set, in fact, sup in the right hand side
can be replaced with max); to complete the proof of the Proposition, we should prove that
f̄ is equal to f also outside rint Dom f.
II.1⁰. Let us first prove that f̄ is equal to f outside cl Dom f, or, which is the same, that
f̄(x) = +∞ outside cl Dom f. This is easy: if x̄ is a point outside cl Dom f, it can be
strongly separated from Dom f, see Separation Theorem (ii) (Theorem 1.2.9). Thus, there
exist z ∈ Rⁿ and ζ > 0 such that

zᵀx̄ ≥ zᵀx + ζ  ∀x ∈ Dom f.

Besides this, we already know that there exists at least one affine minorant of f, or, which
is the same, there exist a and d such that

f(x) ≥ dᵀx − a  ∀x.

Now, for every λ > 0, consider the affine function φλ(x) = dᵀx − a + λ[zᵀ(x − x̄) + ζ]. For
x ∈ Dom f we have zᵀ(x − x̄) + ζ ≤ 0, whence φλ(x) ≤ dᵀx − a ≤ f(x); outside Dom f,
f = +∞. This clearly says that φλ(·) is an affine minorant of f on Rⁿ for every λ > 0.
The value of this minorant at x = x̄ is equal to dᵀx̄ − a + λζ and therefore it goes to +∞
as λ → +∞. We see that the upper bound of affine minorants of f at x̄ indeed is +∞, as
claimed.
II.2⁰. Thus, we know that the upper bound f̄ of all affine minorants of f is equal to f
everywhere on the relative interior of Dom f and everywhere outside the closure of Dom f;
all we should prove is that this equality is also valid at the points of the relative boundary of
Dom f. Let x̄ be such a point. There is nothing to prove if f̄(x̄) = +∞, since by construction
f̄ is everywhere ≤ f. Thus, we should prove that if f̄(x̄) = c < ∞, then f(x̄) = c. Since
f̄ ≤ f everywhere, to prove that f(x̄) = c is the same as to prove that f(x̄) ≤ c. This
is immediately given by lower semicontinuity of f: let us choose x′ ∈ rint Dom f and look
what happens along a sequence of points xᵢ ∈ [x′, x̄) converging to x̄, xᵢ = (1 − λᵢ)x′ + λᵢx̄
with λᵢ → 1. All the points of this sequence are relative interior points of Dom f (Lemma
1.1.1), and consequently

f(xᵢ) = f̄(xᵢ) ≤ (1 − λᵢ)f̄(x′) + λᵢf̄(x̄)

(we have used the convexity of f̄, an upper bound of affine functions). As i → ∞, xᵢ → x̄, and
the right hand side in our inequality converges to f̄(x̄) = c; since f
is lower semicontinuous, we get f(x̄) ≤ c. 2
We see why “translation of mathematical facts from one mathematical language to another”
– in our case, from the language of convex sets to the language of convex functions – may
be fruitful: because we invest a lot into the process rather than run it mechanically.
the intersection of cl Epi(f) with a vertical line is the entire line – never occurs. This fact
evidently is a corollary of the following simple statement: a convex function f is below
bounded on every bounded subset of Rⁿ (cf. Remark 2.4.2).
Proof. Without loss of generality we may assume that the domain of the function f is
full-dimensional and that 0 is an interior point of the domain. According to Theorem 2.4.1,
there exists a neighbourhood U of the origin – which can be thought of as a ball of some
radius r > 0 centered at the origin – where f is bounded from above by some C. Now, if
R > 0 is arbitrary and x is an arbitrary point with ‖x‖₂ ≤ R, then the point

y = −(r/R) x

belongs to U, and we have

0 = (r/(r+R)) x + (R/(r+R)) y;

since f is convex, we conclude that

f(0) ≤ (r/(r+R)) f(x) + (R/(r+R)) f(y) ≤ (r/(r+R)) f(x) + (R/(r+R)) C,

and we get the lower bound

f(x) ≥ ((r+R)/r) f(0) − (R/r) C

for the values of f in the ball of radius R centered at 0. 2
Thus, we conclude that the closure of the epigraph of a convex function f is the epigraph of
a certain function, which we call the closure cl f of f. Of course, this latter function is convex
(its epigraph is convex – it is the closure of a convex set), and since its epigraph is closed,
cl f is proper. The following statement gives a direct description of cl f in terms of f:
In particular,
f (x) ≥ cl f (x)
for all x, and
f (x) = cl f (x)
whenever x ∈ rint Dom f , same as whenever x 6∈ cl Dom f .
Thus, the "correction" f ↦ cl f may vary f only at the points from the relative boundary of
Dom f; in particular,

Dom f ⊂ Dom cl f ⊂ cl Dom f,

whence also

rint Dom f = rint Dom cl f.
(ii) The family of affine minorants of cl f is exactly the family of affine minorants of f , so
that
cl f (x) = sup{φ(x) : φ is an affine minorant of f },
and the sup in the right hand side can be replaced with max whenever x ∈ rint Dom cl f =
rint Dom f .
[“so that” comes from the fact that cl f is proper and is therefore the upper bound of its
affine minorants]
2.6.2 Subgradients
Let f be a convex function, and let x ∈ Dom f. It may happen that there exists an affine
minorant dᵀy − a of f which coincides with f at x:

f(y) ≥ dᵀy − a  ∀y;  f(x) = dᵀx − a.

From the equality in the latter relation we get a = dᵀx − f(x), and substituting this repre-
sentation of a into the first inequality, we get

f(y) ≥ f(x) + dᵀ(y − x)  ∀y.   (2.6.3)

Thus, if f admits an affine minorant which is exact at x, then there exists d which gives rise
to inequality (2.6.3). Vice versa, if d is such that (2.6.3) takes place, then the right hand
side of (2.6.3), regarded as a function of y, is an affine minorant of f which is exact at x.
Now note that (2.6.3) expresses certain property of a vector d. A vector satisfying, for a
given x, this property – i.e., the slope of an exact at x affine minorant of f – is called a
subgradient of f at x, and the set of all subgradients of f at x is denoted ∂f (x).
Subgradients of convex functions play an important role in the theory and numerical methods
of Convex Programming – they are quite reasonable surrogates of gradients. The most
elementary properties of the subgradients are summarized in the following statement:
Proposition 2.6.5 Let f be a convex function and x be a point from Dom f . Then
(i) ∂f (x) is a closed convex set which for sure is nonempty when x ∈ rint Dom f
(ii) If x ∈ int Dom f and f is differentiable at x, then ∂f (x) is the singleton comprised of
the usual gradient of f at x.
Proof. (i): Closedness and convexity of ∂f (x) are evident – (2.6.3) is an infinite system
of nonstrict linear inequalities with respect to d, the inequalities being indexed by y ∈ Rn .
Nonemptiness of ∂f (x) for the case when x ∈ rint Dom f – this is the most important fact
about the subgradients – is readily given by our preceding results. Indeed, we should prove
that if x ∈ rint Dom f , then there exists an affine minorant of f which is exact at x. But
this is an immediate consequence of Proposition 2.6.4: part (ii) of the proposition says that
there exists an affine minorant of f which is equal to cl f(x) at the point x, and part (i) says
that f(x) = cl f(x).
(ii): If x ∈ int Dom f and f is differentiable at x, then ∇f (x) ∈ ∂f (x) by the Gradient
Inequality. To prove that in the case in question ∇f (x) is the only subgradient of f at x,
note that if d ∈ ∂f (x), then, by definition,
f (y) − f (x) ≥ dT (y − x) ∀y
Substituting y − x = th, h being a fixed direction and t being > 0, dividing both sides of the
resulting inequality by t and passing to limit as t → +0, we get
hT ∇f (x) ≥ hT d.
This inequality should be valid for all h, which is possible if and only if d = ∇f (x).
Proposition 2.6.5 explains why subgradients are good surrogates of gradients: at a point
where gradient exists, it is the only subgradient, but, in contrast to the gradient, a sub-
gradient exists basically everywhere (for sure in the relative interior of the domain of the
function). For example, let us look at the simple function
f (x) = |x|
on the axis. It is, of course, convex (as the maximum of two linear forms x and −x). Whenever
x ≠ 0, f is differentiable at x with the derivative +1 for x > 0 and −1 for x < 0. At the point
x = 0, f is not differentiable; nevertheless, it clearly has subgradients there: by (2.6.3),
d ∈ ∂f(0) means that |y| ≥ dy for all y, i.e., ∂f(0) is the entire segment [−1, 1].
Note also that if x is a relative boundary point of the domain of a convex function, even
a "good" one, the set of subgradients of f at x may be empty, as is the case with the
function

f(y) = { −√y, y ≥ 0; +∞, y < 0 };

it is clear that there is no non-vertical supporting line to the epigraph of the function at the
point (0, f(0)), and, consequently, there is no affine minorant of the function which is exact
at x = 0.
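Both examples are easy to probe numerically via the definition (2.6.3). The sketch below is not part of
the notes: numpy is assumed, and the test grids are our choice. It confirms that ∂|·|(0) = [−1, 1], while
−√· has no subgradient at 0:

    import numpy as np

    # d is a subgradient of f at x iff f(y) >= f(x) + d*(y - x) for all y;
    # we test this on a grid (including points very close to 0).
    ys = np.concatenate([np.linspace(-5.0, 5.0, 10001),
                         10.0 ** np.arange(-12.0, -3.0)])

    def is_subgradient(f, x, d):
        return bool(np.all(f(ys) >= f(x) + d * (ys - x) - 1e-12))

    # For f = |.| at x = 0: every d in [-1, 1] works, nothing outside does.
    assert is_subgradient(np.abs, 0.0, 0.3) and is_subgradient(np.abs, 0.0, -1.0)
    assert not is_subgradient(np.abs, 0.0, 1.1)

    # For f(y) = -sqrt(y) (y >= 0; +infinity otherwise): no d works at 0,
    # since near 0 the function falls steeper than any linear function.
    f_sqrt = lambda y: np.where(y >= 0, -np.sqrt(np.maximum(y, 0.0)), np.inf)
    assert not any(is_subgradient(f_sqrt, 0.0, d)
                   for d in np.linspace(-100.0, 100.0, 201))
    print("subgradient tests agree with the theory")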
A significant – and important – part of Convex Analysis deals with subgradient calculus –
with the rules for computing subgradients of "composite" functions, like sums, superposi-
tions, maxima, etc., given subgradients of the operands. These rules extend the standard
Calculus rules to the nonsmooth convex case and are very nice and instructive; the related
considerations, however, are beyond our scope.
Given d ∈ Rⁿ, when is the affine function dᵀx − a a minorant of f? This is the case if and
only if a ≥ dᵀx − f(x) for all x, i.e., if and only if a ≥ sup_{x∈Rⁿ}[dᵀx − f(x)]. The supremum
in the right hand side of the latter relation is a certain function of d; this function is called
the Legendre transformation of f and is denoted f∗:

f∗(d) = sup_{x∈Rⁿ} [dᵀx − f(x)].
Geometrically, the Legendre transformation answers the following question: given a slope d
of an affine function, i.e., given the hyperplane t = dT x in Rn+1 , what is the minimal “shift
down” of the hyperplane which places it below the graph of f ?
From the definition of the Legendre transformation it follows that f∗ is a proper function.
Indeed, we lose nothing when replacing sup_{x∈Rⁿ}[dᵀx − f(x)] by sup_{x∈Dom f}[dᵀx − f(x)], so that
the Legendre transformation is the upper bound of a family of affine functions of d. Since this
bound is finite at least at one point (namely, at every d coming from an affine minorant of f; we
know that such a minorant exists), it is a convex lower semicontinuous function which is not
identically +∞, i.e., a proper function, as claimed.
The most elementary (and the most fundamental) fact about the Legendre transformation
is its symmetry:
Proposition 2.6.6 Let f be a convex function. Then twice taken Legendre transformation
of f is the closure cl f of f :
(f ∗ )∗ = cl f.
In particular, if f is proper, then it is the Legendre transformation of its Legendre transfor-
mation (which also is proper).
Proof: by definition, (f∗)∗(x) = sup_d [dᵀx − f∗(d)] = sup_{(d,a): a ≥ f∗(d)} [dᵀx − a];
the second sup here is exactly the supremum of all affine minorants of f (this is the origin of
the Legendre transformation: a ≥ f∗(d) if and only if the affine form dᵀx − a is a minorant
of f). And we already know that the upper bound of all affine minorants of f is the closure
of f. 2
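The Legendre transformation is also easy to approximate on a grid. In the sketch below (not from the
notes; the example f(x) = x²/2, for which f∗(d) = d²/2 is a classical computation, is our choice):

    import numpy as np

    # f*(d) = sup_x [d*x - f(x)], approximated over a grid of x values.
    xs = np.linspace(-50.0, 50.0, 200001)

    def legendre(f, ds):
        return np.array([np.max(d * xs - f(xs)) for d in ds])

    ds = np.linspace(-3.0, 3.0, 13)
    f_star = legendre(lambda x: x ** 2 / 2, ds)
    assert np.allclose(f_star, ds ** 2 / 2, atol=1e-6)
    print("f(x) = x^2/2 coincides with its Legendre transform (on the grid)")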
The Legendre transformation is a very powerful tool – this is a "global" transformation, so
that local properties of f∗ correspond to global properties of f. Thus, whenever we can compute
explicitly the Legendre transformation of f, we get a lot of "global" information on f.
Unfortunately, a more detailed investigation of the properties of the Legendre transformation is
beyond our scope; I simply list several simple facts and examples:
From the definition of the Legendre transformation it follows that

xᵀd ≤ f(x) + f∗(d)  ∀x, d.

Specifying here f and f∗, we get a certain particular inequality, e.g., the following one:
[Young's Inequality] if p and q are positive reals such that 1/p + 1/q = 1, then

|x|^p/p + |d|^q/q ≥ xd  ∀x, d ∈ R.
Now let 1 < p < ∞, so that also 1 < q < ∞. In this case we should prove that

Σᵢ |xᵢyᵢ| ≤ (Σᵢ |xᵢ|^p)^{1/p} (Σᵢ |yᵢ|^q)^{1/q}.

There is nothing to prove if one of the factors in the right hand side vanishes; thus, we
can assume that x ≠ 0 and y ≠ 0. Now, both sides of the inequality are of homogeneity
degree 1 with respect to x (when we multiply x by t, both sides are multiplied by |t|),
and similarly with respect to y. Multiplying x and y by appropriate reals, we can make
both factors in the right hand side equal to 1: ‖x‖_p = ‖y‖_q = 1. Now we should
prove that under this normalization the left hand side in the inequality is ≤ 1, which
is immediately given by the Young inequality:

Σᵢ |xᵢyᵢ| ≤ Σᵢ [|xᵢ|^p/p + |yᵢ|^q/q] = 1/p + 1/q = 1.
Thus,

|xᵀy| ≤ ‖x‖_p ‖y‖_q;   (2.6.5)

when p = q = 2, we get the Cauchy inequality. Now, inequality (2.6.5) is precise in the
sense that for every x there exists y with ‖y‖_q = 1 such that

xᵀy = ‖x‖_p  [= ‖x‖_p ‖y‖_q];

it suffices to take

yᵢ = ‖x‖_p^{1−p} |xᵢ|^{p−1} sign(xᵢ)

(here x ≠ 0; the case of x = 0 is trivial – here y can be an arbitrary vector with
‖y‖_q = 1).
Combining our observations, we come to an extremely important, although simple,
fact:

‖x‖_p = max{yᵀx : ‖y‖_q ≤ 1}  [1/p + 1/q = 1].   (2.6.6)

It follows, in particular, that ‖x‖_p is convex (as an upper bound of a family of linear
forms), whence

‖x′ + x″‖_p = 2 ‖½x′ + ½x″‖_p ≤ 2(‖x′‖_p/2 + ‖x″‖_p/2) = ‖x′‖_p + ‖x″‖_p;

this is nothing but the triangle inequality. Thus, ‖x‖_p satisfies the triangle inequal-
ity; it clearly possesses two other characteristic properties of a norm – positivity and
homogeneity. Consequently, ‖·‖_p is a norm – the fact that we announced twice and
have finally proven now. (A numerical illustration of (2.6.5) and (2.6.6) is sketched right
after this list of examples.)
• The Legendre transformation of the function f(x) ≡ −a is the function which is equal
to a at the origin and is +∞ outside the origin; similarly, the Legendre transformation
of an affine function d̄ᵀx − a is equal to a at d = d̄ and is +∞ when d ≠ d̄;
• The Legendre transformation of the Euclidean norm

f(x) = ‖x‖₂

is the function which is equal to 0 in the closed unit ball centered at the origin and is
+∞ outside the ball.
The latter example is a particular case of the following statement:
Let ‖x‖ be a norm on Rⁿ, and let

‖d‖∗ = sup{dᵀx : ‖x‖ ≤ 1}

be the conjugate norm; then the Legendre transformation of f(x) = ‖x‖ is the function
equal to 0 on the closed unit ball {d : ‖d‖∗ ≤ 1} of the conjugate norm and to +∞
outside this ball.
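As promised after the Hölder inequality: a numerical sketch (not part of the notes; the random data and
the exponent p = 3 are made up) of (2.6.5) and of the extremal y realizing equality in (2.6.6):

    import numpy as np

    rng = np.random.default_rng(4)
    p = 3.0
    q = p / (p - 1)                       # 1/p + 1/q = 1

    for _ in range(100):
        x, y = rng.standard_normal(5), rng.standard_normal(5)
        lhs = abs(x @ y)
        rhs = (np.sum(np.abs(x) ** p) ** (1 / p)
               * np.sum(np.abs(y) ** q) ** (1 / q))
        assert lhs <= rhs + 1e-12

    # Extremal y: y_i = ||x||_p^(1-p) |x_i|^(p-1) sign(x_i) gives equality.
    x = rng.standard_normal(5)
    norm_x = np.sum(np.abs(x) ** p) ** (1 / p)
    y = norm_x ** (1 - p) * np.abs(x) ** (p - 1) * np.sign(x)
    assert np.isclose(x @ y, norm_x) and np.isclose(np.sum(np.abs(y) ** q), 1.0)
    print("Hoelder inequality and its extremal case verified")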
– [below boundedness] the problem is called below bounded, if its optimal value is > −∞, i.e.,
if the objective is below bounded on the feasible set
• [optimal solution] a point x ∈ Rⁿ is called an optimal solution to (3.1.1), if x is feasible and
f(x) ≤ f(x′) for any other feasible solution, i.e., if

x ∈ Argmin_{x′∈X: g(x′)≤0, h(x′)=0} f(x′);
the fact that a system of linear inequalities has no solutions; to this end we observed that if we can
combine, in a linear fashion, the inequalities of the system and get an obviously false inequality like
0 ≤ −1, then the system is unsolvable; this condition is a certain affirmative statement with respect to the
weights with which we are combining the original inequalities.
Now, the scheme of the above reasoning has nothing in common with linearity (and even convexity)
of the inequalities in question. Indeed, consider an arbitrary inequality system of the type (3.2.2):
(I) :
f (x) < c
gj (x) ≤ 0, j = 1, ..., m
x ∈ X;
all we assume is that X is a nonempty subset in Rn and f, g1 , ..., gm are real-valued functions on X. It
is absolutely evident that
if there exist nonnegative λ₁, ..., λₘ such that the inequality

f(x) + Σ_{j=1}^m λⱼgⱼ(x) < c   (3.2.3)

has no solutions in X, then (I) also has no solutions (indeed, every solution to (I) would clearly
solve (3.2.3) as well).
An inequality constrained problem (IC) (f, g₁, ..., gₘ being real-valued functions on X) is said to
satisfy the Slater condition, if g₁, ..., gₘ satisfy this condition on X.
Definition 3.2.2 [Relaxed Slater Condition] Let X ⊂ Rn and g1 , ..., gm be real-valued functions on X.
We say that these functions satisfy the Relaxed Slater condition on X, if, after appropriate reordering of
g1 , ..., gm , for properly selected k ∈ {0, 1, ..., m} the functions g1 , ..., gk are affine, and there exists x in
the relative interior of X such that gj (x) ≤ 0, 1 ≤ j ≤ k, and gj (x) < 0, k + 1 ≤ j ≤ m.
An inequality constrained problem (IC) is said to satisfy the Relaxed Slater condition, if g₁, ..., gₘ
satisfy this condition on X.
Clearly, the validity of Slater condition implies the validity of the Relaxed Slater condition (in the latter,
we should set k = 0). We are about to establish the following fundamental fact:
where
• X is a nonempty convex set in Rn , and f : X → R is convex,
• K ⊂ Rν is a regular cone, regularity meaning that K is closed, convex, possesses a nonempty
interior, and is pointed, that is K ∩ [−K] = {0}
• ĝ(x) : X → Rν is K-convex, meaning that whenever x, y ∈ X and λ ∈ [0, 1], we have

λĝ(x) + (1 − λ)ĝ(y) − ĝ(λx + (1 − λ)y) ∈ K.
Note that in the simplest case of K = Rν+ (the nonnegative orthant is a regular cone!) K-convexity means
exactly that ĝ is a vector function with components convex on X.
Given a regular cone K ⊂ Rν, we can associate with it the K-inequality between vectors of Rν,
saying that a ∈ Rν is K-greater than or equal to b ∈ Rν (notation: a ≥K b, or, equivalently,
b ≤K a) when a − b ∈ K:

a ≥K b ⇔ b ≤K a ⇔ a − b ∈ K.

For example, when K = Rν+ is the nonnegative orthant, ≥K is the standard coordinate-wise
vector inequality ≥: a ≥ b means that every entry of a is greater than or equal to, in the
standard arithmetic sense, the corresponding entry of b. The K-inequality possesses all the
algebraic properties of ≥:
1. it is a partial order on Rν, meaning that the relation a ≥K b is
• reflexive: a ≥K a for all a;
• anti-symmetric: a ≥K b and b ≥K a if and only if a = b;
• transitive: if a ≥K b and b ≥K c, then a ≥K c;
2. it is compatible with linear operations, meaning that
• ≥K-inequalities can be summed up: if a ≥K b and c ≥K d, then a + c ≥K b + d;
• and multiplied by nonnegative reals: if a ≥K b and λ is a nonnegative real, then λa ≥K λb;
3. it is compatible with convergence, meaning that one can pass to limits in both sides of a
≥K-inequality:
• if a_t ≥K b_t, t = 1, 2, ..., and a_t → a and b_t → b as t → ∞, then a ≥K b;
4. it gives rise to the strict version >K of the ≥K-inequality: a >K b (equivalently: b <K a) means
that a − b ∈ int K. The strict K-inequality possesses the basic properties of the
coordinate-wise >, specifically
• >K is stable: if a >K b and a′, b′ are close enough to a, b respectively, then
a′ >K b′;
• if a >K b, λ is a positive real, and c ≥K d, then λa >K λb and a + c >K b + d.
In summary, the arithmetics of the K-inequalities ≥K, >K is completely similar to that of
the usual ≥ and >.
Since the 1990s it has been realized that, as far as nonlinear Convex Optimization is concerned, it is
extremely convenient to consider, along with the usual Mathematical Programming form of a convex
problem, its cone constrained form

min_x { f(x) : ĝ(x) ≤K 0, x ∈ X }.   (3.2.5)

Note also that when K is the nonnegative orthant, (3.2.5) recovers the Mathematical Program-
ming form of a convex problem.
Exercise 3.2 Let K be the cone of m × m positive semidefinite matrices in the space Sm
of m × m symmetric matrices2), and let ĝ(x) = xxᵀ : X := Rm×n → Sm. Verify that ĝ is
K-convex.
It turns out that with the "cone constrained" approach to Convex Programming, we lose noth-
ing when restricting ourselves to X = Rⁿ, linear f(x), and affine ĝ(x); this specific version
of (3.2.5) is called a "conic problem" (to be considered in more detail later). In our course,
where convex problems are not the only subject, it makes sense to speak about a "less ex-
treme" cone constrained form of a convex program, specifically, the one presented in (3.2.5); we
call problems of this form "convex problems in cone constrained form," reserving the words
"conic problems" for problems (3.2.5) with X = Rⁿ, linear f and affine ĝ.
2) So far, we have assumed that K "lives" in some Rν, while now we are speaking about something living in the
space of symmetric matrices. Well, we can identify the latter linear space with Rν, ν = m(m+1)/2, by representing a
symmetric matrix [aᵢⱼ] by the collection of its diagonal and below-diagonal entries arranged into a column vector,
writing down these entries, say, row by row: [aᵢⱼ] ↦ [a₁,₁; a₂,₁; a₂,₂; ...; aₘ,₁; ...; aₘ,ₘ].
What we have done so far in this section can be naturally extended to the cone constrained case. Specif-
ically (in what follows X ⊂ Rⁿ is a nonempty set, K ⊂ Rν is a regular cone, f is a real-valued function
on X, and ĝ(·) is a mapping from X into Rν):
1. Instead of feasibility/infeasibility of system (I) we can speak about feasibility/infeasibility of system
of constraints
(ConI):
f(x) < c
g(x) := Ax − b ≤ 0
ĝ(x) ≤K 0  [⇔ ĝ(x) ∈ −K]
x ∈ X

in variables x. Denoting by K∗ the cone dual to K, a sufficient condition for infeasibility of (ConI)
is solvability of the system of constraints

(ConII):
inf_{x∈X} [ f(x) + λᵀg(x) + λ̂ᵀĝ(x) ] ≥ c
λ ≥ 0
λ̂ ≥K∗ 0  [⇔ λ̂ ∈ K∗]

in variables λ, λ̂.
2. Similarly, (ConI) is said to satisfy the Slater/Relaxed Slater condition, if the system of its
constraints satisfies this condition on X.
Note that in the case of K = Rν+ , (ConI) and (ConII) become, respectively, (I) and (II), and the cone
constrained versions of Slater/Relaxed Slater condition become the usual ones.
The cone constrained version of Theorem 3.2.1 reads as follows:
Note: “In reality” (ConI) may have no polyhedral part g(x) := Ax − b ≤ 0 and/or no “general part”
gb(x)≤K 0; absence of one or both of these parts leads to self-evident modifications in (ConII). To unify our
forthcoming considerations, it is convenient to assume that both these parts are present. This assumption
is for free: it is immediately seen that in our context, in absence of one or both of g-constraints in (ConI)
we lose nothing when adding artificial polyhedral part g(x) := 0T x − 1 ≤ 0 instead of missing polyhedral
part, and/or artificial general part gb(x) := 0T x − 1≤K 0 with K = R+ instead of missing general part.
Thus, we lose nothing when assuming from the very beginning that both polyhedral and general parts
are present.
It is immediately seen that Convex Theorem on Alternative (Theorem 3.2.1) is a special case of the
latter Theorem corresponding to the case when K is a nonnegative orthant.
Proof of Theorem 3.2.2. The first part of the statement – “if (ConII) has a solution, then (ConI) has
no solutions” – has been already verified. What we need is to prove the inverse statement. Thus, let us
assume that (ConI) has no solutions, and let us prove that then (ConII) has a solution.
0⁰. Without loss of generality we may assume that X is full-dimensional: rint X = int X (indeed,
otherwise we could replace our "universe" Rⁿ with the affine span of X). Besides this, shifting f by a
constant, we can assume that c = 0. Thus, we are in the case where

(ConI):
f(x) < 0
g(x) := Ax − b ≤ 0
ĝ(x) ≤K 0  [⇔ ĝ(x) ∈ −K]
x ∈ X;

(ConII):
inf_{x∈X} [ f(x) + λᵀg(x) + λ̂ᵀĝ(x) ] ≥ 0
λ ≥ 0
λ̂ ≥K∗ 0  [⇔ λ̂ ∈ K∗]
We are in the case when g(x̄) ≤ 0 and ĝ(x̄) ∈ −int K for some x̄ ∈ rint X = int X; by shifting X (this
clearly does not affect the statement we need to prove) we can assume that x̄ is just the origin, so that

0 ∈ int X,  g(0) ≤ 0,  ĝ(0) ∈ −int K.   (3.2.6)
Recall that we are in the situation when (ConI) is infeasible, that is, the optimization problem (P) of
minimizing f over the feasible set {x ∈ X : g(x) ≤ 0, ĝ(x) ≤K 0} has Opt(P) ≥ 0,
so that S and T are nonempty (since (P) is feasible by (3.2.6)), non-intersecting (since Opt(P) ≥ 0),
convex (since X and f are convex, and ĝ(x) is K-convex on X) sets. By the Separation Theorem (Theorem
1.2.9), S and T can be separated by an appropriately chosen linear form α. Thus,
α₀t̄₀ + α₁ᵀt̄₁ ≥ 0. Assuming that α₀ = 0, it follows that α₁ᵀt̄₁ ≥ 0, and since t̄₁ ∈ −int K (see (3.2.6))
and α₁ ∈ K∗, we conclude (see observation (!) on p. 47) that α₁ = 0 on top of α₀ = 0, which is
impossible, since α ≠ 0.
In the sequel, we set ᾱ₁ = α₁/α₀ and

h(x) = f(x) + ᾱ₁ᵀĝ(x).

Observing that (3.2.8) remains valid when replacing α with ᾱ = α/α₀ and that the vector [f(x); ĝ(x)]
belongs to T when x ∈ Y, we conclude that

x ∈ Y ⇒ h(x) ≥ 0.   (3.2.9)
These sets clearly are nonempty and do not intersect (since the x-component x of a point from S ∩ T
would satisfy the premise and violate the conclusion in (3.2.9)). By Separation Theorem, there exists
[e; α] ≠ 0 such that

sup_{[x;τ]∈S} [eᵀx + ατ] ≤ inf_{[x;τ]∈T} [eᵀx + ατ].   (3.2.10)
Taking into account what S is, we have α ≥ 0 in (3.2.10) (since otherwise the left hand side in the
inequality would be +∞). The next step relies on the following

Lemma 3.2.1 Let X ⊂ Rⁿ be a convex set with 0 ∈ int X, and let g(x) = Ax + a be an affine
mapping with g(0) = a ≤ 0. Assume that

x ∈ X, g(x) ≤ 0 ⇒ dᵀx + δ ≤ 0.

Then there exists µ ≥ 0 such that

dᵀx + δ ≤ µᵀg(x)  ∀x ∈ X.

Note that when X = Rⁿ, Lemma 3.2.1 is nothing but the Inhomogeneous Farkas Lemma.
Proof of Lemma 3.2.1. Consider the cones

M₁ = cl {[x; t] ∈ Rⁿ × R : t > 0, x/t ∈ X},  M₂ = {y = [x; t] : [A, a]y ≡ Ax + ta ≤ 0}.

M₂ is a polyhedral cone, and M₁ is a closed convex cone with a nonempty interior (since
the point [0; 1] belongs to int M₁ due to 0 ∈ int X); moreover, [int M₁] ∩ M₂ ≠ ∅ (since
the point [0; 1] ∈ int M₁ belongs to M₂ due to g(0) ≤ 0). Observe that the linear form
fᵀ[x; t] := dᵀx + tδ is nonpositive on M₁ ∩ M₂, that is, (−f) ∈ [M₁ ∩ M₂]∗, where, as usual,
M∗ is the cone dual to a cone M.
Indeed, let y = [z; t] ∈ M₁ ∩ M₂, and let ys = [zs; ts] = (1 − s)y + s[0; 1]. Observe that
ys ∈ M₁ ∩ M₂ when 0 ≤ s ≤ 1. When 0 < s ≤ 1, we have ts > 0, and ys ∈ M₁
implies that ws := zs/ts ∈ cl X (why?), while ys ∈ M₂ implies that g(ws) ≤ 0.
Since 0 ∈ int X, ws ∈ cl X implies that θws ∈ X for all θ ∈ [0, 1), while g(0) ≤ 0
along with g(ws) ≤ 0 implies that g(θws) ≤ 0 for θ ∈ [0, 1). Invoking the premise
of the Lemma, we conclude that dᵀ(θws) + δ ≤ 0 for all θ ∈ [0, 1); passing to limit
as θ → 1, we get fᵀ[ws; 1] = dᵀws + δ ≤ 0, whence fᵀys = ts fᵀ[ws; 1] ≤ 0. Passing
to limit as s → +0, we arrive at fᵀy ≤ 0, as claimed.
Thus, M₁, M₂ are closed cones such that [int M₁] ∩ M₂ ≠ ∅ and (−f) ∈ [M₁ ∩ M₂]∗. Applying to
M₁, M₂ the Dubovitski-Milutin Lemma (Proposition 1.2.7), we conclude that [M₁ ∩ M₂]∗ =
(M₁)∗ + (M₂)∗. Since −f ∈ [M₁ ∩ M₂]∗, there exist ψ ∈ (M₁)∗ and φ ∈ (M₂)∗ such that
[d; δ] = −φ − ψ. The inclusion φ ∈ (M₂)∗ means that the homogeneous linear inequality
φᵀy ≥ 0 is a consequence of the system [A, a]y ≤ 0 of homogeneous linear inequalities;
by the Homogeneous Farkas Lemma (Lemma 1.2.1), it follows that identically in x it holds
φᵀ[x; 1] = −µᵀg(x), with some nonnegative µ. Thus,

∀x ∈ Rⁿ : dᵀx + δ = [d; δ]ᵀ[x; 1] = µᵀg(x) − ψᵀ[x; 1],

and since [x; 1] ∈ M₁ whenever x ∈ X, we have ψᵀ[x; 1] ≥ 0 for x ∈ X, so that µ satisfies the
requirements stated in the Lemma we are proving. 2
3⁰.3. Recall that we have seen that in (3.2.10), α ≥ 0. We claim that in fact α > 0.
Indeed, assuming α = 0 and setting a = sup_x {eᵀx : x ∈ X, g(x) ≤ 0}, (3.2.10) implies that
a ∈ R and

{∀(x ∈ X, g(x) ≤ 0) : eᵀx − a ≤ 0}  &  {∀x ∈ X : eᵀx ≥ a}.   (3.2.11)

By Lemma 3.2.1, the first of these relations implies that for some nonnegative µ it holds

eᵀx − a ≤ µᵀg(x)  ∀x ∈ X.

Setting x = 0, we get a ≥ 0 due to µ ≥ 0 and g(0) ≤ 0, so that the second relation in (3.2.11)
implies eᵀx ≥ 0 for all x ∈ X, which is impossible: since 0 ∈ int X, eᵀx ≥ 0 for all x ∈ X
would imply that e = 0, and since we are in the case of α = 0, we would get [e; α] = 0, which
is not the case due to the origin of [e; α].
3⁰.4. Thus, α in (3.2.10) is strictly positive; replacing e with α⁻¹e, we can assume that (3.2.10) holds
true with α = 1. Setting a = sup_x {eᵀx : x ∈ X, g(x) ≤ 0}, (3.2.10) reads

∀(x ∈ X) : h(x) + eᵀx ≥ a,

while the definition of a along with Lemma 3.2.1 implies that there exists µ ≥ 0 such that

∀x ∈ X : eᵀx − a ≤ µᵀg(x).

Combining these relations, we conclude that

h(x) + µᵀg(x) ≥ 0  ∀x ∈ X.   (3.2.12)
Recalling that h(x) = f(x) + ᾱ₁ᵀĝ(x) with ᾱ₁ ∈ K∗ and setting λ = µ, λ̂ = ᾱ₁, we get λ ≥ 0, λ̂ ∈ K∗,
while by (3.2.12) it holds

f(x) + λᵀg(x) + λ̂ᵀĝ(x) ≥ 0  ∀x ∈ X,

meaning that λ, λ̂ solve (ConII) (recall that we are in the case of c = 0). 2
from which this function comes. The aggregate

L(x, λ) = f(x) + Σ_{j=1}^m λⱼgⱼ(x)   (3.2.14)

is called the Lagrange function of the inequality constrained optimization program

(IC):  min_x { f(x) : gⱼ(x) ≤ 0, j = 1, ..., m, x ∈ X }.
The Lagrange function of an optimization program is a very important entity: most optimality condi-
tions are expressed in terms of this function. Let us start with translating what we already know into
the language of the Lagrange function.
Theorem 3.2.3 [Convex Programming Duality Theorem]
(i) The infimum

L(λ) = inf_{x∈X} L(x, λ)

of the Lagrange function in x ∈ X is, for every λ ≥ 0, a lower bound on the optimal value in (IC), so
that the optimal value in the optimization program

(IC∗):  max_{λ≥0} L(λ)

is a lower bound on the optimal value in (IC).
(ii) If the problem (IC) is convex,
• is below bounded
and
• satisfies the Relaxed Slater condition,
then the optimal value in (IC∗) is attained and is equal to the optimal value in (IC).
Proof. (i) is nothing but Proposition 3.2.1 (why?). It makes sense, however, to repeat here the corre-
sponding one-line reasoning:
Let λ ≥ 0; in order to prove that

L(λ) ≡ inf_{x∈X} L(x, λ) ≤ c∗  [L(x, λ) = f(x) + Σ_{j=1}^m λⱼgⱼ(x)],

where c∗ is the optimal value in (IC), note that if x is feasible for (IC), then evidently
L(x, λ) ≤ f(x), so that the infimum of L over x ∈ X is ≤ the infimum c∗ of f over the
feasible set of (IC). 2
(ii) is an immediate consequence of the Convex Theorem on Alternative. Indeed, let c∗ be the optimal
value in (IC). Then the system
f (x) < c∗ , gj (x) ≤ 0, j = 1, ..., m
has no solutions in X, and by the above Theorem the system (II) associated with c = c∗ has a solution,
i.e., there exists λ∗ ≥ 0 such that L(λ∗ ) ≥ c∗ . But we know from (i) that the strict inequality here is
impossible and, besides this, that L(λ) ≤ c∗ for every λ ≥ 0. Thus, L(λ∗ ) = c∗ and λ∗ is a maximizer of
L over λ ≥ 0.
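To see the theorem at work on a one-dimensional toy problem – a sketch, not from the notes: the program
min{x² : 1 − x ≤ 0} and the grids are our choices; here L(λ) = λ − λ²/4, and the dual optimum 1 is
attained at λ∗ = 2:

    import numpy as np

    # Toy convex program min {x^2 : 1 - x <= 0}; c* = 1 at x* = 1.
    # Dual function: L(lam) = inf_x [x^2 + lam*(1 - x)] = lam - lam^2/4.
    xs = np.linspace(-10.0, 10.0, 40001)
    lams = np.linspace(0.0, 10.0, 1001)

    L = np.array([np.min(xs ** 2 + lam * (1 - xs)) for lam in lams])
    assert np.allclose(L, lams - lams ** 2 / 4, atol=1e-6)

    # Weak duality: L(lam) <= c* = 1 for every lam >= 0; the Slater
    # condition holds (x = 2 gives 1 - x < 0), and max L = 1 at lam = 2.
    assert L.max() <= 1 + 1e-6
    assert abs(lams[L.argmax()] - 2.0) < 0.05
    print("dual optimal value:", L.max())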
(the variables λ of the dual problem are called the Lagrange multipliers of the primal problem). The
Theorem says that the optimal value in the dual problem is ≤ the one in the primal, and under some
favourable circumstances (the primal problem is convex, below bounded, and satisfies the Slater condition)
the optimal values in the two programs are equal to each other.
In our formulation there is some asymmetry between the primal and the dual programs. In fact both
of the programs are related to the Lagrange function in a quite symmetric way. Indeed, consider the
program

min_{x∈X} L̄(x),  L̄(x) = sup_{λ≥0} L(x, λ).
The objective in this program clearly is +∞ at every point x ∈ X which is not feasible for (IC) and is
f (x) on the feasible set of (IC), so that the program is equivalent to (IC). We see that both the primal
and the dual programs come from the Lagrange function: in the primal problem, we minimize over X
the result of maximization of L(x, λ) in λ ≥ 0, and in the dual program we maximize over λ ≥ 0 the
result of minimization of L(x, λ) in x ∈ X. This is a particular (and the most important) example of a
zero sum two person game – the issue we will speak about later.
We have seen that under certain convexity and regularity assumptions the optimal values in (IC)
and (IC∗) are equal to each other. There is also another way to say when these optimal values are equal
– this is always the case when the Lagrange function possesses a saddle point, i.e., there exists a pair
x∗ ∈ X, λ∗ ≥ 0 such that at the pair L(x, λ) attains its minimum as a function of x ∈ X and attains its
maximum as a function of λ ≥ 0:

L(x, λ∗) ≥ L(x∗, λ∗) ≥ L(x∗, λ)  ∀x ∈ X, ∀λ ≥ 0.   (3.2.15)
Proposition 3.2.2 (x∗ , λ∗ ) is a saddle point of the Lagrange function L of (IC) if and only if x∗ is
an optimal solution to (IC), λ∗ is an optimal solution to (IC∗ ) and the optimal values in the indicated
problems are equal to each other.
3.2.2.4 Cone constrained forms of Lagrange Function, Lagrange Duality, and Con-
vex Programming Duality Theorem
The above results related to convex optimization problems in the standard MP format admit instructive
extensions to the case of convex problems in cone constrained form. These extensions are as follows:
3.2.2.4.1. A convex problem in cone constrained form is an optimization problem of the form

Opt(P) = min_x { f(x) : g(x) := Ax − b ≤ 0, ĝ(x) ≤K 0, x ∈ X }   (P)

with convex X and f and K-convex ĝ.
We see that (D) has a feasible solution with the value of the objective ≥ Opt(P); by Weak Duality, this
value is exactly Opt(P), the solution in question is optimal for (D), and Opt(P) = Opt(D).
It is easy to show that the cone dual to a regular cone is itself a regular cone; as a result, problem
(D), called the conic dual of the conic problem (P), indeed is a conic problem. An immediate computation
(utilizing the fact that [K∗]∗ = K for every regular cone K) shows that conic duality is symmetric – the
conic dual to (D) is (equivalent to) (P).
Indeed, rewriting (D) in the minimization form with ≤ 0 polyhedral constraints, as required
by our recipe for building the conic dual to a conic problem, we get
−Opt(D) = min_{λ,λ̂} { b^T λ + p^T λ̂ :  A^T λ + P^T λ̂ + c ≤ 0,  −A^T λ − P^T λ̂ − c ≤ 0,  −λ ≤ 0,  −λ̂ ∈ −K∗ },

and the conic dual of the latter problem is

max_{u,v,w,y} { c^T [u − v] : b + A[u − v] − w = 0, P[u − v] + p − y = 0, u ≥ 0, v ≥ 0, w ≥ 0, y ∈ K }.

Setting x = v − u, so that w = b − Ax ≥ 0 and y = p − P x ∈ K, the latter problem reads max_x { −c^T x : b − Ax ≥ 0, p − P x ∈ K }, which is equivalent to (P).
Theorem 3.2.5 [Conic Duality Theorem] Consider a primal-dual pair of conic problems
Opt(P) = min_{x∈R^n} { c^T x : Ax − b ≤ 0, P x − p ≤_K 0 }    (P)
Opt(D) = max_{λ=[λ;λ̂]} { −b^T λ − p^T λ̂ : A^T λ + P^T λ̂ + c = 0, λ ≥ 0, λ̂ ∈ K∗ }    (D)
One always has Opt(D) ≤ Opt(P). Besides this, if one of the problems in the pair is bounded and satisfies the Relaxed Slater condition, then the other problem in the pair is solvable, and Opt(P) = Opt(D). Finally, if both of the problems satisfy the Relaxed Slater condition, then both are solvable with equal optimal values.
Proof is immediate. Weak duality has already been verified. To verify the second claim, note that by primal-dual symmetry we can assume that the bounded problem satisfying the Relaxed Slater condition is (P); but then the claim in question is given by Theorem 3.2.4. Finally, if both problems satisfy the Relaxed Slater condition (and in particular are feasible), both are bounded, and therefore both are solvable with equal optimal values by the second claim.
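A numeric sanity check of the Conic Duality Theorem in the simplest case K = R²_+, where both (P) and (D) are LPs. The sketch below assumes SciPy is available; the data are made up for illustration:

```python
import numpy as np
from scipy.optimize import linprog

# Primal (P): min c^T x  s.t.  Ax - b <= 0,  Px - p <= 0   (here K = R^2_+)
c = np.array([1.0, 1.0])
A = np.eye(2);  b = np.array([2.0, 2.0])    # x <= 2 componentwise
P = -np.eye(2); p = np.zeros(2)             # -x <= 0, i.e. x >= 0

res_p = linprog(c, A_ub=np.vstack([A, P]), b_ub=np.concatenate([b, p]),
                bounds=[(None, None)] * 2)  # x is a free variable

# Dual (D): max { -b^T lam - p^T lamhat : A^T lam + P^T lamhat + c = 0,
#                 lam >= 0, lamhat >= 0 }; we minimize the negated objective.
res_d = linprog(np.concatenate([b, p]), A_eq=np.hstack([A.T, P.T]),
                b_eq=-c, bounds=[(0, None)] * 4)

print("Opt(P) =", res_p.fun)     # 0.0 (attained at x = 0)
print("Opt(D) =", -res_d.fun)    # 0.0 as well: Opt(P) = Opt(D), no duality gap
```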
(ii) if the problem (IC) is convex and satisfies the Relaxed Slater condition, then the above condition
is necessary for optimality of x∗ : if x∗ is optimal for (IC), then there exists λ∗ ≥ 0 such that (x∗ , λ∗ ) is
a saddle point of the Lagrange function.
Proof. (i): assume that for a given x∗ ∈ X there exists λ∗ ≥ 0 such that (3.2.15) is satisfied, and let us prove that then x∗ is optimal for (IC). First of all, x∗ is feasible: indeed, if gj(x∗) > 0 for some j, then, of course, sup_{λ≥0} L(x∗, λ) = +∞ (look what happens when all λ's, except λj, are fixed, and λj → +∞); but sup_{λ≥0} L(x∗, λ) = +∞ is forbidden by the second inequality in (3.2.15).
Since x∗ is feasible, sup_{λ≥0} L(x∗, λ) = f(x∗), and we conclude from the second inequality in (3.2.15) that L(x∗, λ∗) = f(x∗). Now the first inequality in (3.2.15) reads
f(x) + ∑_{j=1}^m λ∗_j g_j(x) ≥ f(x∗)    ∀x ∈ X.
This inequality immediately implies that x∗ is optimal: indeed, if x is feasible for (IC), then the left hand
side in the latter inequality is ≤ f (x) (recall that λ∗ ≥ 0), and the inequality implies that f (x) ≥ f (x∗ ).
□
(ii): Assume that (IC) is a convex program, x∗ is its optimal solution and the problem satisfies the
Relaxed Slater condition; we should prove that then there exists λ∗ ≥ 0 such that (x∗ , λ∗ ) is a saddle
point of the Lagrange function, i.e., that (3.2.15) is satisfied. As we know from the Convex Programming
Duality Theorem (Theorem 3.2.3.(ii)), the dual problem (IC∗ ) has a solution λ∗ ≥ 0 and the optimal
value of the dual problem is equal to the optimal value in the primal one, i.e., to f (x∗ ):
the terms λ∗_j g_j(x∗) in the right hand side are nonpositive (since x∗ is feasible for (IC)), while the sum ∑_j λ∗_j g_j(x∗) itself is nonnegative due to our inequality; this is possible if and only if all the terms in the sum are zero, and this is exactly the complementary slackness.
From the complementary slackness we immediately conclude that f (x∗ ) = L(x∗ , λ∗ ), so that (3.2.16)
results in
L(x∗, λ∗) = f(x∗) = inf_{x∈X} L(x, λ∗).
On the other hand, since x∗ is feasible for (IC), we have L(x∗ , λ) ≤ f (x∗ ) whenever λ ≥ 0. Combining
our observations, we conclude that
Theorem 3.2.7 [Saddle Point formulation of Optimality Conditions in Convex Conic Programming]
Consider a convex cone constrained problem
(X is convex, f : X → R is convex, K is a regular cone, and ĝ is K-convex on X) along with its Conic Lagrange dual problem

Opt(D) = max_{λ=[λ;λ̂]} { L(λ) := inf_{x∈X} [ f(x) + λ^T g(x) + λ̂^T ĝ(x) ] : λ ≥ 0, λ̂ ∈ K∗ }    (D)

and assume that (P) is bounded and satisfies the Relaxed Slater condition. Then a point x∗ ∈ X is an optimal solution to (P) if and only if x∗ can be augmented by a properly selected λ∗ ∈ Λ := {[λ; λ̂] : λ ≥ 0, λ̂ ∈ K∗} to a saddle point of the cone constrained Lagrange function

L(x; λ = [λ; λ̂]) := f(x) + λ^T g(x) + λ̂^T ĝ(x)

on X × Λ.
Proof repeats, with evident modifications, the proof of the relevant part of Theorem 3.2.6, with the Convex Conic Programming Duality Theorem (Theorem 3.2.4) in the role of the Convex Programming Duality Theorem (Theorem 3.2.3).
Definition 3.2.3 [Normal Cone] Let X ⊂ Rn and x ∈ X. The normal cone N_X(x) of X taken at the point x is the set

N_X(x) = {h ∈ Rn : h^T (x′ − x) ≥ 0 ∀x′ ∈ X}.
and

∇f(x∗) + ∑_{j=1}^m λ∗_j ∇g_j(x∗) ∈ N_X(x∗)    (3.2.18)
Proof. Observe that under the premise of Theorem 3.2.8 the Karush-Kuhn-Tucker condition is necessary and sufficient for (x∗, λ∗) to be a saddle point of the Lagrange function. Indeed, (x∗, λ∗) is a saddle point of the Lagrange function if and only if
a) L(x∗, λ) = f(x∗) + ∑_{j=1}^m λ_j g_j(x∗), as a function of λ ≥ 0, attains its maximum at λ = λ∗. Since L(x∗, λ) is linear in λ, this is exactly the same as to say that for all j we have g_j(x∗) ≤ 0 (which we knew in advance – x∗ is feasible!) and λ∗_j g_j(x∗) = 0, which is complementary slackness;
b) L(x, λ∗), as a function of x ∈ X, attains its minimum at x∗. Since L(x, λ∗) is convex in x ∈ X due to λ∗ ≥ 0 and L(x, λ∗) is differentiable at x∗ by the Theorem's premise, Proposition 2.5.1 says that L(x, λ∗) achieves its minimum at x∗ if and only if ∇_x L(x∗, λ∗) = ∇f(x∗) + ∑_{i=1}^m λ∗_i ∇g_i(x∗) has nonnegative inner products with all vectors h from the radial cone T_X(x∗), i.e., with all h such that x∗ + th ∈ X for all small enough t > 0; this is exactly the same as to say that ∇f(x∗) + ∑_{i=1}^m λ∗_i ∇g_i(x∗) ∈ N_X(x∗) (since for a convex set X and all x ∈ X it clearly holds that N_X(x) = {f : f^T h ≥ 0 ∀h ∈ T_X(x)}).
The bottom line is that for a feasible x∗ and λ∗ ≥ 0, (x∗, λ∗) is a saddle point of the Lagrange function if and only if (x∗, λ∗) meets the Karush-Kuhn-Tucker condition. This observation proves (i), and combined with Theorem 3.2.3, proves (ii) as well.
Note that the optimality conditions stated in Theorem 2.5.2 and Proposition 2.5.1 are particular
cases of the above Theorem corresponding to m = 0.
Note that in the case when x∗ ∈ int X, we have NX (x∗ ) = {0}, so that (3.2.18) reads
∇f(x∗) + ∑_{i=1}^m λ∗_i ∇g_i(x∗) = 0;
when x∗ ∈ rint X, NX (x∗ ) is the orthogonal complement to the linear subspace L to which Aff(X) is
parallel, so that (3.2.18) reads
∇f(x∗) + ∑_{i=1}^m λ∗_i ∇g_i(x∗) is orthogonal to L = Lin(X − x∗).
and assume that both problems satisfy the Relaxed Slater condition. Then a pair of feasible solutions x∗ to (P) and λ∗ := [λ∗; λ̂∗] to (D) is comprised of optimal solutions to the respective problems
— [Zero Duality Gap] if and only if

DualityGap(x∗; λ∗) := c^T x∗ − [−b^T λ∗ − p^T λ̂∗] = 0,

and
— [Complementary Slackness] if and only if

λ∗^T [b − Ax∗] + λ̂∗^T [p − P x∗] = 0.
Note: Under the premise of the Theorem, from feasibility of x∗ and λ∗ for the respective problems it follows that b − Ax∗ ≥ 0 and p − P x∗ ∈ K. Therefore Complementary Slackness (which says that a sum of two inner products, each one of a vector from a regular cone and a vector from the dual of this cone, and as such automatically nonnegative, is zero) is a really strong restriction.
Proof of Theorem 3.2.9 is immediate. By Conic Duality Theorem (Theorem 3.2.5) we are in the case
when Opt(P ) = Opt(D) ∈ R, and therefore
DualityGap(x∗; λ∗) = c^T x∗ + b^T λ∗ + p^T λ̂∗ = −[A^T λ∗ + P^T λ̂∗]^T x∗ + b^T λ∗ + p^T λ̂∗
                   = λ∗^T [b − Ax∗] + λ̂∗^T [p − P x∗],
so that, for x∗ and λ∗ feasible for the respective problems, Complementary Slackness is exactly the same as Zero Duality Gap.
(3.2.17) says that L(x∗, λ) attains its maximum in λ ≥ 0 at λ = λ∗, and (3.2.18) says that L(x, λ∗) attains its minimum in x ∈ X at x = x∗.
Now consider the particular case of (IC) where X = Rn is the entire space, the objective f is convex
and everywhere differentiable and the constraints g1 , ..., gm are linear. For this case, Theorem 3.2.8 says
to us that the KKT (Karush-Kuhn-Tucker) condition is necessary and sufficient for optimality of x∗ ; as
we just have explained, this is the same as to say that the necessary and sufficient condition of optimality
for x∗ is that x∗ along with certain λ∗ ≥ 0 form a saddle point of the Lagrange function. Combining
these observations with Proposition 3.2.2, we get the following simple result:
Proposition 3.3.1 Let (IC) be a convex program with X = Rn , everywhere differentiable objective f
and linear constraints g1, ..., gm. Then x∗ is an optimal solution to (IC) if and only if there exists λ∗ ≥ 0 such that (x∗, λ∗) is a saddle point of the Lagrange function (3.3.1) (regarded as a function of x ∈ Rn and λ ≥ 0). In particular, (IC) is solvable if and only if L has saddle points, and if it is the case, then both (IC) and its Lagrange dual

(IC∗):   max_λ { L(λ) : λ ≥ 0 }

are solvable with equal optimal values.
Let us look at what this proposition says in the Linear Programming case, i.e., when (IC) is the program

(P)   min_x { c^T x : g_j(x) ≡ b_j − a_j^T x ≤ 0, j = 1, ..., m }.
In order to get the Lagrange dual, we should form the Lagrange function
L(x, λ) = f(x) + ∑_{j=1}^m λ_j g_j(x) = [c − ∑_{j=1}^m λ_j a_j]^T x + ∑_{j=1}^m λ_j b_j

of (IC) and to minimize it in x ∈ Rn; this will give us the dual objective. In our case the minimization in x is immediate: the minimal value is −∞ if c − ∑_{j=1}^m λ_j a_j ≠ 0, and is ∑_{j=1}^m λ_j b_j otherwise. We see that the Lagrange dual is

(D)   max_λ { b^T λ : ∑_{j=1}^m λ_j a_j = c, λ ≥ 0 }.
The problem we get is the usual LP dual to (P ), and Proposition 3.3.1 is one of the equivalent forms of
the Linear Programming Duality Theorem which we already know.
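Assuming SciPy is available, one can solve a small primal-dual LP pair of exactly this form and watch the equality of optimal values (the data below are illustrative, not from the notes):

```python
import numpy as np
from scipy.optimize import linprog

# Primal (P): min c^T x  s.t.  a_j^T x >= b_j  <=>  -A x <= -b   (x free)
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 1.0, 3.0])
c = np.array([1.0, 2.0])
primal = linprog(c, A_ub=-A, b_ub=-b, bounds=[(None, None)] * 2)

# Dual (D): max b^T lam  s.t.  sum_j lam_j a_j = c, lam >= 0
dual = linprog(-b, A_eq=A.T, b_eq=c, bounds=[(0, None)] * 3)

print("Opt(P) =", primal.fun)   # 4.0, attained at x = (2, 1)
print("Opt(D) =", -dual.fun)    # 4.0: the LP Duality Theorem in action
```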
where the objective is a strictly convex quadratic form, so that D = D^T is a positive definite matrix: x^T Dx > 0 whenever x ≠ 0. It is convenient to rewrite the constraints in the vector-matrix form

g(x) = b − Ax ≤ 0,    b = [b_1; ...; b_m],    A = [a_1^T; ...; a_m^T].
In order to form the Lagrange dual to the program (P), we write down the Lagrange function

L(x, λ) = f(x) + ∑_{j=1}^m λ_j g_j(x)
        = c^T x + ½ x^T Dx + λ^T (b − Ax)
        = ½ x^T Dx − [A^T λ − c]^T x + b^T λ
and minimize it in x. Since the function is convex and differentiable in x, the minimum, if it exists, is given by the Fermat rule
∇x L(x, λ) = 0,
which in our situation becomes
Dx = [AT λ − c].
Since D is positive definite, it is nonsingular, so that the Fermat equation has a unique solution, which is the minimizer of L(·, λ); this solution is

x = D^{−1}[A^T λ − c].
Substituting the value of x into the expression for the Lagrange function, we get the dual objective:
L(λ) = −½ [A^T λ − c]^T D^{−1} [A^T λ − c] + b^T λ,
and the dual problem is to maximize this objective over the nonnegative orthant. Usually people rewrite this dual problem equivalently by introducing additional variables t = D^{−1}[c − A^T λ]:

(D)   max_{λ,t} { b^T λ − ½ t^T Dt : A^T λ + Dt = c, λ ≥ 0 }.

For x feasible for (P) and (λ, t) feasible for (D), consider the difference ∆ between the values of the primal and the dual objectives:
∆ = c^T x + ½ x^T Dx − [b^T λ − ½ t^T Dt]
  = [A^T λ + Dt]^T x + ½ x^T Dx + ½ t^T Dt − b^T λ    [since A^T λ + Dt = c]
  = λ^T [Ax − b] + ½ [x + t]^T D[x + t].
Since Ax − b ≥ 0 and λ ≥ 0 due to primal feasibility of x and dual feasibility of (λ, t), both terms in the resulting expression for ∆ are nonnegative. Thus, ∆ = 0 (which, by (i), is equivalent to optimality of x for (P) and optimality of (λ, t) for (D)) if and only if

∑_{j=1}^m λ_j (Ax − b)_j = 0   and   (x + t)^T D(x + t) = 0.

The first of these equalities, due to λ ≥ 0 and Ax ≥ b, is equivalent to λ_j(Ax − b)_j = 0, j = 1, ..., m; the second, due to positive definiteness of D, is equivalent to x + t = 0.
3) since its objective, due to positive definiteness of D, goes to infinity as |x| → ∞, and due to the following general fact:
general fact:
Let (IC) be a feasible program with closed domain X, continuous on X objective and constraints and such that
f (x) → ∞ as x ∈ X “goes to infinity” (i.e., |x| → ∞). Then (IC) is solvable.
You are welcome to prove this simple statement (it is among the exercises accompanying the Lecture)
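Here is a numeric sketch of the whole quadratic-programming duality story above (made-up data, SciPy assumed): we maximize the concave dual objective L(λ) over λ ≥ 0, recover x = D⁻¹[Aᵀλ − c] and t = D⁻¹[c − Aᵀλ], and observe a zero duality gap together with x + t = 0:

```python
import numpy as np
from scipy.optimize import minimize

# (P): min c^T x + 0.5 x^T D x  s.t.  A x >= b   (illustrative data)
D = np.array([[2.0, 0.0], [0.0, 2.0]])
c = np.array([-2.0, -2.0])
A = np.array([[1.0, 1.0]])
b = np.array([3.0])
Dinv = np.linalg.inv(D)

def neg_dual(lam):
    u = A.T @ lam - c
    return 0.5 * u @ Dinv @ u - b @ lam     # minus the dual objective L(lam)

res = minimize(neg_dual, x0=np.zeros(1), bounds=[(0.0, None)])
lam = res.x
x = Dinv @ (A.T @ lam - c)                  # primal solution recovered from lam
t = Dinv @ (c - A.T @ lam)                  # the additional dual variable; t = -x

f_primal = c @ x + 0.5 * x @ D @ x
print("x* =", x, " lam* =", lam)              # (1.5, 1.5), lam = 1
print("duality gap:", f_primal - (-res.fun))  # ~ 0
print("x + t =", x + t)                       # ~ (0, 0), as the theory predicts
```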
L(x, λ) : X × Λ → R
Consequently, my policy should be to choose x which minimizes my loss function L, i.e., the one which
solves the optimization problem
(I)   min_{x∈X} L(x);
In the case (B), similar reasoning of II enforces him to choose λ maximizing his profit function
Note that these two reasonings relate to two different games: the one with priority of II (when making
his decision, II already knows the choice of I), and the one with similar priority of I. Therefore we should
not, generally speaking, expect that the anticipated loss of I in (A) is equal to the anticipated profit of II
in (B). What can be guessed is that the anticipated loss of I in (B) is less than or equal to the anticipated
profit of II in (A), since the conditions of the game (B) are better for I than those of (A). Thus, we may
guess that independently of the structure of the function L(x, λ), there is the inequality

sup_{λ∈Λ} inf_{x∈X} L(x, λ) ≤ inf_{x∈X} sup_{λ∈Λ} L(x, λ).    (3.4.2)

This inequality indeed is true, as is seen from the following reasoning:
consequently, the quantity sup_{λ∈Λ} inf_{x∈X} L(x, λ) is a lower bound for the function L(y), y ∈ X, and is therefore a lower bound for the infimum of the latter function over y ∈ X, i.e., is a lower bound for inf_{y∈X} sup_{λ∈Λ} L(y, λ).
Now let us look at what happens when the game in question has a saddle point (x∗, λ∗), so that

L(x∗, λ) ≤ L(x∗, λ∗) ≤ L(x, λ∗)    ∀(x ∈ X, λ ∈ Λ),

whence, of course,

sup_{λ∈Λ} L(λ) ≥ L(λ∗) ≥ L(x∗, λ∗) ≥ L(x∗) ≥ inf_{x∈X} L(x).
the very first quantity in the latter chain is ≤ the very last quantity by (3.4.2), which is possible if and
only if all the inequalities in the chain are equalities, which is exactly what is said by (A) and (B).
Thus, if (x∗ , λ∗ ) is a saddle point of L, then (*) takes place. We are about to demonstrate that the
inverse also is true:
Theorem 3.4.1 [Structure of the saddle point set] Let L : X × Λ → R be a function. The set of saddle points of the function is nonempty if and only if the related optimization problems (I) and (II) are solvable
and the optimal values in the problems are equal to each other. If it is the case, then the saddle points of
L are exactly all pairs (x∗ , λ∗ ) with x∗ being an optimal solution to (I) and λ∗ being an optimal solution
to (II), and the value of the cost function L(·, ·) at every one of these points is equal to the common
optimal value in (I) and (II).
Proof. We already have established “half” of the theorem: if there are saddle points of L, then their
components are optimal solutions to (I), respectively, (II), and the optimal values in these two problems
are equal to each other and to the value of L at the saddle point in question. To complete the proof,
we should demonstrate that if x∗ is an optimal solution to (I), λ∗ is an optimal solution to (II) and the
optimal values in the problems are equal to each other, then (x∗ , λ∗ ) is a saddle point of L. This is
immediate: we have
L(x, λ∗) ≥ L(λ∗)     [definition of L(λ)]
         = L(x∗)     [by assumption]
         ≥ L(x∗, λ)  [definition of L(x)]
whence
L(x, λ∗ ) ≥ L(x∗ , λ) ∀x ∈ X, λ ∈ Λ;
substituting λ = λ∗ in the right hand side of this inequality, we get L(x, λ∗ ) ≥ L(x∗ , λ∗ ), and substituting
x = x∗ in the right hand side of our inequality, we get L(x∗ , λ∗ ) ≥ L(x∗ , λ); thus, (x∗ , λ∗ ) indeed is a
saddle point of L.
is the Lagrange function of a solvable convex program satisfying the Slater condition. Note that in this case L is convex in x for every λ ∈ Λ ≡ R^m_+ and is linear (and therefore concave) in λ for every fixed x. As we shall see in a while, these are the structural properties of L which take upon themselves the "main responsibility" for the fact that in the case in question the saddle points exist. Namely, we have the following
Theorem 3.4.2 [Existence of saddle points of a convex-concave function (Sion-Kakutani)] Let X and Λ
be convex compact sets in Rn and Rm , respectively, and let
L(x, λ) : X × Λ → R
be a continuous function which is convex in x ∈ X for every fixed λ ∈ Λ and is concave in λ ∈ Λ for
every fixed x ∈ X. Then L has saddle points on X × Λ.
Proof. According to Theorem 3.4.1, we should prove that
• (i) Optimization problems (I) and (II) are solvable
• (ii) the optimal values in (I) and (II) are equal to each other.
(i) is valid independently of convexity-concavity of L and is given by the following routine reasoning from
the Analysis:
Since X and Λ are compact sets and L is continuous on X × Λ, due to the well-known Analysis theorem L is uniformly continuous on X × Λ: for every ε > 0 there exists δ(ε) > 0 such that

|x − x′| + |λ − λ′| ≤ δ(ε) ⇒ |L(x, λ) − L(x′, λ′)| ≤ ε 4)    (3.4.4)
4) for those not too familiar with Analysis, I wish to stress the difference between the usual continuity and the uniform continuity: continuity of L means that given ε > 0 and a point (x, λ), it is possible to choose δ > 0 such that (3.4.4) is valid; the corresponding δ may depend on (x, λ), not only on ε. Uniform continuity means that this positive δ may be chosen as a function of ε only. The fact that a function continuous on a compact set automatically is uniformly continuous on the set is one of the most useful features of compact sets
In particular,

|x − x′| ≤ δ(ε) ⇒ |L(x, λ) − L(x′, λ)| ≤ ε,

whence, of course, also

|x − x′| ≤ δ(ε) ⇒ |L(x) − L(x′)| ≤ ε,
so that the function L is continuous on X. Similarly, L is continuous on Λ. Taking in account that X
and Λ are compact sets, we conclude that the problems (I) and (II) are solvable.
(ii) is the essence of the matter; here, of course, the entire construction heavily exploits convexity-
concavity of L.
0°. To prove (ii), we first establish the following statement, which is important in its own right:
Lemma 3.4.1 [Minmax Lemma] Let X be a convex compact set and f_0, ..., f_N be a collection of N + 1 convex and continuous functions on X. Then the minmax

min_{x∈X} max_{i=0,...,N} f_i(x)

of the functions equals the minimum, over x ∈ X, of a properly chosen convex combination of the functions:

min_{x∈X} max_{i=0,...,N} f_i(x) = min_{x∈X} ∑_{i=0}^N λ∗_i f_i(x)    for certain λ∗_i ≥ 0 with ∑_{i=0}^N λ∗_i = 1.

Note that for any weights λ_i ≥ 0, ∑_i λ_i = 1, one clearly has min_{x∈X} ∑_i λ_i f_i(x) ≤ min_{x∈X} max_i f_i(x); the Minmax Lemma says that if the f_i are convex and continuous on a convex compact set X, then this inequality is in fact an equality for properly chosen weights. You can easily verify that this is nothing but the claim that the function M(x, λ) = ∑_{i=0}^N λ_i f_i(x) possesses a saddle point on X × ∆, ∆ being the standard simplex of weights. Thus, the Minmax Lemma is in fact a particular case of the Sion-Kakutani Theorem; we are about to give a direct proof of this particular case of the Theorem and then to derive the general case from this particular one.
To prove the Lemma, consider the optimization program

(S)   min_{t,x} { t : f_i(x) − t ≤ 0, i = 0, ..., N, x ∈ X }

(note that (t, x) is a feasible solution for (S) if and only if x ∈ X and t ≥ max_{i=0,...,N} f_i(x)). The problem clearly satisfies the Slater condition and is solvable (since X is a compact set and the f_i, i = 0, ..., N, are continuous on X; therefore their maximum also is continuous on X and thus attains its minimum on the compact set X). Let (t∗, x∗) be an optimal solution to the
problem. According to Theorem 3.2.6, there exists λ∗ ≥ 0 such that ((t∗ , x∗ ), λ∗ ) is a saddle
point of the corresponding Lagrange function
L(t, x; λ) = t + ∑_{i=0}^N λ_i (f_i(x) − t) = t (1 − ∑_{i=0}^N λ_i) + ∑_{i=0}^N λ_i f_i(x),
and the value of this function at ((t∗ , x∗ ), λ∗ ) is equal to the optimal value in (S), i.e., to t∗ .
Now, since L(t, x; λ∗) attains its minimum in (t, x) over the set {t ∈ R, x ∈ X} at (t∗, x∗), we should have

∑_{i=0}^N λ∗_i = 1

(otherwise the infimum of L(t, x; λ∗) in t ∈ R would be −∞),
so that

min_{x∈X} max_{i=0,...,N} f_i(x) = min_{x∈X} ∑_{i=0}^N λ∗_i f_i(x)

with some λ∗_i ≥ 0, ∑_{i=0}^N λ∗_i = 1, as claimed.
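A tiny numeric illustration of the Minmax Lemma (our own example, not from the notes): on X = [−1, 1] with f₀(x) = (x − 1)² and f₁(x) = (x + 1)², the minmax equals 1 (attained at x = 0), and the weights λ∗ = (½, ½) – forced here by symmetry – reproduce this value:

```python
import numpy as np

# Minmax Lemma sketch: X = [-1, 1], f0(x) = (x-1)^2, f1(x) = (x+1)^2.
xs = np.linspace(-1.0, 1.0, 2001)
f0, f1 = (xs - 1.0)**2, (xs + 1.0)**2

minmax = np.max([f0, f1], axis=0).min()   # min_x max_i f_i(x) = 1, at x = 0
combo = 0.5 * f0 + 0.5 * f1               # convex combination = x^2 + 1
print(minmax, combo.min())                # 1.0  1.0: the Lemma's equality
```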
From the Minmax Lemma to the Sion-Kakutani Theorem. We should prove that the optimal values in (I) and (II) (which, by (i), are well defined reals) are equal to each other, i.e., that

inf_{x∈X} sup_{λ∈Λ} L(x, λ) = sup_{λ∈Λ} inf_{x∈X} L(x, λ).

We know from (3.4.2) that the first of these two quantities is greater than or equal to the second, so that all we need is to prove the inverse inequality. For me it is convenient to assume that the right quantity (the optimal value in (II)) is 0, which, of course, does not restrict generality; and all we need to prove is that the left quantity – the optimal value in (I) – cannot be positive.
1°. What does it mean that the optimal value in (II) is zero? When it is zero, then the function L(λ)
is nonpositive for every λ, or, which is the same, the convex continuous function of x ∈ X – the function
L(x, λ) – has nonpositive minimal value over x ∈ X. Since X is compact, this minimal value is achieved,
so that the set
X(λ) = {x ∈ X | L(x, λ) ≤ 0}
is nonempty; and since X is convex and L is convex in x ∈ X, the set X(λ) is convex (as a level set of
a convex function, Proposition 2.1.4). Note also that the set is closed (since X is closed and L(x, λ) is
continuous in x ∈ X).
2°. Thus, if the optimal value in (II) is zero, then the set X(λ) is a nonempty convex compact set for
every λ ∈ Λ. And what does it mean that the optimal value in (I) is nonpositive? It means exactly that
there is a point x ∈ X where the function L is nonpositive, i.e., the point x ∈ X where L(x, λ) ≤ 0 for
all λ ∈ Λ. In other words, to prove that the optimal value in (I) is nonpositive is the same as to prove
that the sets X(λ), λ ∈ Λ, have a point in common.
3°. With the above observations we see that the situation is as follows: we are given a family of closed
nonempty convex subsets X(λ), λ ∈ Λ, of a compact set X, and we should prove that these sets have a
point in common. To this end, in turn, it suffices to prove that every finite number of sets from our family
have a point in common (to justify this claim, I can refer to the Helley Theorem II, which gives us much
stronger result: to prove that all X(λ) have a point in common, it suffices to prove that every (n + 1)
sets of this family, n being the affine dimension of X, have a point in common). Let X(λ0 ), ..., X(λN ) be
N + 1 sets from our family; we should prove that the sets have a point in common. In other words, let
fi (x) = L(x, λi ), i = 0, ..., N ;
all we should prove is that there exists a point x where all our functions are nonpositive, or, which is the
same, that the minmax of our collection of functions – the quantity

α ≡ min_{x∈X} max_{i=0,...,N} f_i(x)

– is nonpositive.
The proof of the inequality α ≤ 0 is as follows. According to the Minmax Lemma (which can be applied in our situation – since L is convex and continuous in x, all f_i are convex and continuous, and X is compact), α is the minimum in x ∈ X of a certain convex combination φ(x) = ∑_{i=0}^N ν_i f_i(x) of the functions f_i(x). We have
φ(x) = ∑_{i=0}^N ν_i f_i(x) ≡ ∑_{i=0}^N ν_i L(x, λ_i) ≤ L(x, ∑_{i=0}^N ν_i λ_i)
(the last inequality follows from concavity of L in λ; this is the only – and crucial – point where we use this assumption). We see that φ(·) is majorized by L(·, λ̄) for the properly chosen λ̄ = ∑_{i=0}^N ν_i λ_i ∈ Λ; it follows that the minimum of φ in x ∈ X – and we already know that this minimum is exactly α – is nonpositive (recall that the minimum of L(·, λ) in x is nonpositive for every λ).
Semi-Bounded case. The next theorem lifts the assumption of boundedness of X and Λ in Theorem
3.4.2 – now only one of these sets should be bounded – at the price of some weakening of the conclusion.
Theorem 3.4.3 [Swapping min and max in convex-concave saddle point problem (Sion-Kakutani)] Let
X and Λ be convex sets in Rn and Rm , respectively, with X being compact, and let
L(x, λ) : X × Λ → R
be a continuous function which is convex in x ∈ X for every fixed λ ∈ Λ and is concave in λ ∈ Λ for
every fixed x ∈ X. Then
inf_{x∈X} sup_{λ∈Λ} L(x, λ) = sup_{λ∈Λ} inf_{x∈X} L(x, λ).    (3.4.6)
Proof. By general theory, in (3.4.6) the left hand side is ≥ the right hand side, so that there is nothing to prove when the right hand side is +∞. Assume that this is not the case. Since X is compact and L is continuous in x ∈ X, inf_{x∈X} L(x, λ) > −∞ for every λ ∈ Λ, so that the left hand side in (3.4.6) cannot be −∞; since it is not +∞ as well, it is a real, and by shift, we can assume w.l.o.g. that this real is 0:

sup_{λ∈Λ} inf_{x∈X} L(x, λ) = 0.
All we need to prove now is that the left hand side in (3.4.6) is nonpositive. Assume, on the contrary,
that it is positive and thus is > c with some c > 0. Then for every x ∈ X there exists λx ∈ Λ such that
L(x, λx ) > c. By continuity, there exists a neighborhood Vx of x in X such that L(x0 , λx ) ≥ c for all
x0 ∈ Vx . Since X is compact, we can find finitely many points x1 , ..., xn in X such that the union, over
1 ≤ i ≤ n, of Vxi is exactly X, implying that max1≤i≤n L(x, λxi ) ≥ c for every x ∈ X. Now let Λ̄ be the
convex hull of {λx1 , ..., λxn }, so that maxλ∈Λ̄ L(x, λ) ≥ c for every x ∈ X. Applying to L and the convex
compact sets X, Λ̄ Theorem 3.4.2, we get the equality in the following chain:
c ≤ min_{x∈X} max_{λ∈Λ̄} L(x, λ) = max_{λ∈Λ̄} min_{x∈X} L(x, λ) ≤ sup_{λ∈Λ} min_{x∈X} L(x, λ) = 0,

which is impossible since c > 0. This contradiction completes the proof.
Theorem 3.4.4 [Existence of saddle point in convex-concave saddle point problem (Sion-Kakutani,
Semi-Bounded case)] Let X and Λ be closed convex sets in Rn and Rm , respectively, with X being
compact, and let
L(x, λ) : X × Λ → R
be a continuous function which is convex in x ∈ X for every fixed λ ∈ Λ and is concave in λ ∈ Λ for
every fixed x ∈ X. Assume that for every a ∈ R there exists a collection x^a_1, ..., x^a_{n_a} ∈ X such that the set

{λ ∈ Λ : L(x^a_i, λ) ≥ a, 1 ≤ i ≤ n_a}

is bounded 5). Then L possesses a saddle point on X × Λ.
Proof. Since X is compact and L is continuous, the function L(λ) = minx∈X L(x, λ) is real-valued and
continuous on Λ. Further, for every a ∈ R, the set {λ ∈ Λ : L(λ) ≥ a} clearly is contained in the set
{λ : L(xai , λ) ≥ a, 1 ≤ i ≤ na } and thus is bounded. Thus, L(λ) is a continuous function on a closed
set Λ, and the level sets {λ ∈ Λ : L(λ) ≥ a} are bounded, implying that L attains its maximum on
Λ. Invoking Theorem 3.4.3, it follows that inf_{x∈X} [L(x) := sup_{λ∈Λ} L(x, λ)] is finite, implying that the function L(·) is not identically +∞ on X. Since L(x, λ) is continuous, L(·), being a supremum of continuous functions, is lower semicontinuous. Thus, L : X → R ∪ {+∞} is a lower semicontinuous proper (i.e., not identically +∞) function on X; since X is compact, L attains its minimum on X. Thus, both problems max_{λ∈Λ} L(λ) and min_{x∈X} L(x) are solvable, and the optimal values in these problems are equal by Theorem 3.4.3. Invoking Theorem 3.4.1, L has a saddle point.
5) this definitely is the case when L(x̄, λ) is coercive in λ for some x̄ ∈ X, meaning that the sets {λ ∈ Λ : L(x̄, λ) ≥ a} are bounded for every a ∈ R, or, equivalently, whenever λ_i ∈ Λ and ‖λ_i‖_2 → ∞ as i → ∞, we have L(x̄, λ_i) → −∞ as i → ∞.
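A grid-based sketch of the Sion-Kakutani phenomenon (an illustrative function of our choosing): for the convex-concave L(x, λ) = x² + 2xλ − λ² on X = Λ = [−1, 1], both min_x max_λ and max_λ min_x evaluate to the common saddle point value 0, attained at (0, 0):

```python
import numpy as np

# Convex-concave L(x, lam) = x^2 + 2*x*lam - lam^2 on X = Lam = [-1, 1].
grid = np.linspace(-1.0, 1.0, 401)
X, LAM = np.meshgrid(grid, grid, indexing="ij")   # L[i, j] = L(x_i, lam_j)
L = X**2 + 2.0 * X * LAM - LAM**2

minmax = L.max(axis=1).min()   # min over x of max over lam
maxmin = L.min(axis=0).max()   # max over lam of min over x
print(minmax, maxmin)          # both ~ 0: min max = max min, a saddle point exists
```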
Lecture 4
Optimality Conditions
This Lecture, the last in the theoretical part of the course, is devoted to optimality conditions in general-
type Mathematical Programming programs
(P)   min_x { f(x) : g(x) ≡ (g_1(x), g_2(x), ..., g_m(x)) ≤ 0, h(x) = (h_1(x), ..., h_k(x)) = 0, x ∈ X }.
(note that in the last relation we skip the inclusion x ∈ X in the premise of the implication; this is
because we have assumed that x∗ is an interior point of X, so that shrinking, if necessary, U , we always
can make it part of X and thus make the inclusion x ∈ X a consequence of the inclusion x ∈ U ).
In the convex case local optimality is the same as the global one (this follows from Theorem 2.5.1
combined with the fact that the feasible set of a convex program is convex). In the general case these
two notions are different – a globally optimal solution is, of course, a locally optimal one, but not vice
versa: look at something like the problem
here all points xk = πk are local minimizers of the objective, but only one of them – 0 – is its global
minimizer.
Note that since a globally optimal solution for sure is a locally optimal one, the necessary conditions for local optimality (and these are all the necessary optimality conditions to be discussed) are necessary for global optimality as well.
Now it also is clear why in the general case it is impossible to point out a local sufficient condition for global optimality: local information on a function f at a local minimizer x∗ of the function does not allow to understand that the minimizer is only local and not a global one. Indeed, let
us take the above f and x∗ = π; this is only local, not global, minimizer of f . At the same time we can
easily change f outside a neighbourhood of x∗ and make x∗ the global minimizer of the updated function
(draw the graph of f to see it). Note that we can easily make the updated function f¯ as smooth as we
wish. Now, local information – value and derivatives at x∗ – on f and on the updated function f¯ are
the same, since the functions coincide with each other in a neighbourhood of x∗ . It follows that there
is no test which takes on input local information on the problem at x∗ and correctly reports on output
whether x∗ is or is not a global minimizer of the objective, even if we assume the objective to be very
smooth. Indeed, such a test is unable to distinguish the above f and f¯ and is therefore enforced, being
asked once about f and once about f¯, report both times the same answer; whatever is the answer, in one
of these two cases it is false!
The difficulty we have outlined is intrinsic for nonconvex optimization: not only does there not exist an "efficient local test" for global optimality; there also does not exist, as we shall see in our forthcoming lectures, an efficient algorithm capable to approximate a global minimizer of a general-type Mathematical Programming problem, even one with very smooth data.
In view of this unpleasant and unavoidable feature of general-type Mathematical Programming problems, the answer to the second of the outlined questions – what we use Optimality conditions in Mathematical Programming for – is not as optimistic as we would wish it to be. As far as conditions for global optimality are concerned, we may hope for necessary optimality conditions only; in other words, we may hope for a test which is capable to say that what we have is not a globally optimal solution.
Since there is no (local) sufficient condition for global optimality, we have no hope to design a local test
capable to say that what we have is the “actual” – global – solution to the problem. The best we may
hope for in this direction is a sufficient condition for local optimality, i.e., a local test capable to say that
what we have cannot be improved by small modifications.
The pessimism caused by the above remarks has, however, its bounds. A necessary optimality condition is a certain relation which must be satisfied at an optimal solution. If we are clever enough to generate – on paper or algorithmically – all candidates x∗ which satisfy this relation, and if the list of
these candidates turns out to be finite, then we may hope that looking through the list and choosing in it
the best, from the viewpoint of the objective, feasible solution, we eventually shall find the globally optimal
solution (given that it exists). Needless to say, the outlined possibility is met only in “simple particular
cases”, but already these cases sometimes are extremely important (we shall discuss an example of this
type at the end of this Lecture). Another way to utilize necessary and/or sufficient conditions for local
optimality is to use them as “driving force” in optimization algorithms. Here we generate a sequence of
approximate solutions and subject them to the test for local optimality given by our optimality condition.
If the current iterate passes the test, we terminate with a locally optimal solution to the problem; if it
is not the case, then the optimality condition (which is violated at the current iterate) normally says
to us how to update the iterate in order to reduce “violation” of the condition. As a result of these
sequential updatings, we get a sequence of iterates which, under reasonable assumptions, can be proved
to converge to a locally optimal solution to the problem. As we shall see in the forthcoming lectures, this
idea underlies all traditional computational methods of Mathematical Programming. Of course, with this
scheme it in principle is impossible to guarantee convergence to globally optimal solution (imagine that we
start at a locally optimal solution which is not a globally optimal one; with the outlined scheme, we shall
terminate immediately!) Although this is a severe drawback of the scheme, it does not kill the traditional
“optimality-conditions-based” methods. First, it may happen that we are lucky and there are no “false”
local solutions – the only local solution is the global one; then the above scheme will approximate the
actual solution (although we never will know that it is the case...) Second, in many practical situations
we are interested in improving a given initial solution to the problem rather than in finding the “best
possible” solution, and the traditional methods normally allow to achieve this restricted goal.
Now let us pass from the “motivation preamble” to the mathematics of optimality conditions. There
are two kinds of them – conditions utilizing the first-order information of the objective and the constraints
at x∗ only (the values and the gradients of these functions), and second-order conditions using the second
order derivatives as well. As we shall see, all first-order optimality conditions are only necessary for
optimality. Among the second-order optimality conditions, there are both necessary and sufficient for
local optimality.
(P̂):   min   f(x∗) + (x − x∗)^T ∇f(x∗)
       s.t.  g_i(x∗) + (x − x∗)^T ∇g_i(x∗) ≤ 0, i = 1, ..., m
             (x − x∗)^T ∇h_j(x∗) = 0, j = 1, ..., k
It is absolutely clear that if d ∈ K, then all vectors x_t = x∗ + td corresponding to small enough positive t are feasible for (P̂). Since x∗ is optimal for the latter problem, we should have

∇f(x∗) = − ∑_{i∈I(x∗)} λ∗_i ∇g_i(x∗) − ∑_{j=1}^k µ∗_j ∇h_j(x∗)    (4.1.1)
with some nonnegative λ∗i and some real µ∗j . To see it, note that K is exactly the polyhedral cone
and (*) says that the vector ∇f (x∗ ) has nonnegative inner products with all vectors from K, i.e., with
all vectors which have nonnegative inner products with the vectors from the finite set
By the Homogeneous Farkas Lemma this is the case if and only if ∇f(x∗) is a combination with nonnegative coefficients of the vectors from A:

∇f(x∗) = − ∑_{i∈I(x∗)} λ∗_i ∇g_i(x∗) + ∑_{j=1}^k [µ∗_{j,+} − µ∗_{j,−}] ∇h_j(x∗)

with nonnegative λ∗_i, µ∗_{j,+}, µ∗_{j,−}. And to say that ∇f(x∗) is representable in the latter form is the same as to say that it is representable as required in (4.1.1).
Now, λ∗_i to the moment are defined for i ∈ I(x∗) only. We lose nothing when defining λ∗_i = 0 for i ∉ I(x∗) and treating the right hand side sum in (4.1.1) as the sum over all i = 1, ..., m. Note also that now we have the complementary slackness relations λ∗_i g_i(x∗) = 0, i = 1, ..., m.
We have established the following conditional statement:
Proposition 4.1.1 Let x∗ be locally optimal for (P) and such that the hypothesis (B) takes place: x∗ remains optimal also for the linearized LP program (P̂). Then there exist nonnegative λ∗_i and real µ∗_j such that

λ∗_i g_i(x∗) = 0, i = 1, ..., m    [complementary slackness]
∇f(x∗) + ∑_{i=1}^m λ∗_i ∇g_i(x∗) + ∑_{j=1}^k µ∗_j ∇h_j(x∗) = 0    [KKT Equation]    (4.1.2)
The property of x∗ to be feasible for (P) and to satisfy the condition "there exist nonnegative λ∗_i and real µ∗_j such that ..." from the above Proposition is called the Karush-Kuhn-Tucker Optimality Condition; we already know the version of this condition related to the case of inequality constrained problems. The point x∗ which satisfies the KKT Optimality Condition is called a KKT point of (P) (sometimes this name is used for the pair (x∗; λ∗, µ∗), i.e., for the point x∗ along with the certificate that it satisfies the KKT condition).
From the above discussion it follows that all we may hope for is that the KKT condition is necessary for local optimality of x∗; Proposition 4.1.1 says that this indeed is the case, but under the implicit additional assumption "x∗ remains...". The problem, consequently, is to convert this implicit assumption into something verifiable, or to eliminate the assumption at all. The latter, unfortunately, is impossible, as is seen from the following elementary example (where the problem is even convex):

min_x { x : g(x) ≡ x² ≤ 0 }.

Here the only feasible, and therefore the optimal, solution is x∗ = 0; however, ∇f(0) = 1 while ∇g(0) = 0, so that the KKT Equation ∇f(0) + λ∗∇g(0) = 0 has no solution λ∗ ≥ 0, and x∗ is not a KKT point.
Qualification of Constraints actually says that the feasible set of the actual problem (P) should approximate the feasible set of the linearized problem (P̂) in a neighbourhood of x∗ "up to the highest-order terms in |x − x∗|", similarly to what is the case with the data of the problems. To give the precise definition, let us agree to write

θ(t) = o(t^s)

(θ is a function on the nonnegative ray, s > 0), if θ(t) t^{−s} → 0 as t → +0 and θ(0) = 0; this is one of the standard Calculus conventions. And we say that problem (P) satisfies the Qualification of Constraints property at a feasible solution x∗, if there exists a function θ(t) = o(t) such that
for every feasible solution x to the linearized problem (P̂) there exists a feasible solution x′ to the actual problem (P) such that

|x − x′| ≤ θ(|x − x∗|)

– the distance between x and x′ goes to zero faster than the distance from x to x∗ as x → x∗.
The Qualification of Constraints condition roughly speaking says that the feasible set of the linearized problem (P̂) cannot be (locally, of course) "much wider" than the feasible set of (P): for every x close to x∗ and feasible for (P̂) there exists x′ "very close" to x and feasible for (P). Note that in the above "bad example" the situation is opposite: the feasible set of (P̂) is the entire axis (since the constraint in the linearized problem is 0 × x ≤ 0), which is a "much wider" set, even locally, than the feasible set {0} of (P).
It is easily seen that under the Qualification of Constraints assumption local optimality of x∗ for (P) implies global optimality of x∗ for (P̂), so that this assumption makes the KKT Optimality condition necessary for optimality:
Proposition 4.1.2 Let x∗ be locally optimal for (P) and let (P) satisfy the Qualification of Constraints assumption at x∗. Then x∗ is optimal for (P̂) and, consequently, is a KKT point of (P).
Proof. Let x∗ be locally optimal for (P) and let the Qualification of Constraints take place; we should prove that then x∗ is optimal for (P̂). Assume, on the contrary, that x∗ is not optimal for (P̂). Since x∗ clearly is feasible for (P̂), non-optimality of x∗ for the latter problem means that there exists a feasible solution x̄ to (P̂) with a smaller value of the linearized objective f(x∗) + (x − x∗)^T ∇f(x∗) than the value of this objective at x∗. Setting d = x̄ − x∗, we therefore obtain
dT ∇f (x∗ ) < 0.
Now let
xt = x∗ + t(x̄ − x∗ ), 0 ≤ t ≤ 1.
The points x_t are convex combinations of two feasible solutions to (P̂) and therefore also are feasible solutions to the latter problem (it is an LP program!). By Qualification of Constraints, there exist feasible solutions x′_t to the actual problem (P) such that

|x_t − x′_t| ≤ θ(|x_t − x∗|) = θ(t|x̄ − x∗|) ≡ θ(tq),   q = |x̄ − x∗|,    (4.1.3)
with θ(t) = o(t). Now, f is continuously differentiable in a neighbourhood of x∗ (this is our once for ever
assumption made in the beginning of the Lecture). It follows – this is a well-known fact from Analysis (an
immediate consequence of the Lagrange Mean Value Theorem) – that f is locally Lipschitz continuous
at x∗ : there exists a neighbourhood U of x∗ and a constant C < ∞ such that
|f (x) − f (y)| ≤ C|x − y|, x, y ∈ U. (4.1.4)
As t → +0, we have x_t → x∗, and since |x′_t − x_t| ≤ θ(tq) → 0 as t → 0, x′_t also converges to x∗ as t → 0; in particular, both x_t and x′_t belong to U for all small enough positive t. Besides this, from local optimality of x∗ and the fact that x′_t converges to x∗ as t → +0 and is feasible for (P) for all t we conclude that

f(x′_t) ≥ f(x∗)
for all small enough positive t. It follows that for small positive t we have

0 ≤ t^{−1}[f(x′_t) − f(x∗)]
  ≤ t^{−1}[f(x_t) − f(x∗)] + t^{−1}[f(x′_t) − f(x_t)]
  ≤ t^{−1}[f(x_t) − f(x∗)] + t^{−1} C|x′_t − x_t|    [see (4.1.4)]
  ≤ t^{−1}[f(x_t) − f(x∗)] + t^{−1} Cθ(tq)    [see (4.1.3)]
  = [f(x∗ + td) − f(x∗)]/t + t^{−1} Cθ(tq).
As t → 0, the last expression in the chain tends to dT ∇f (x∗ ) < 0 (since θ(tq) = o(t)), while the chain
itself says that the expression is nonnegative. This is the desired contradiction.
Proposition 4.1.2 is very close to a tautology: the question was when the KKT condition is necessary for local optimality, and the answer we have now is that it for sure is the case when (P) satisfies the Qualification of Constraints assumption at x∗. If we can gain something from this answer, this something is indeed very small – we do not know how to certify that the Qualification of Constraints takes place. There is one trivial case – the one when the constraints of (P) are linear; in this case the feasible set of the linearized problem is not merely close to, but simply coincides with, the feasible set of the actual problem (in fact it suffices to assume linearity of the constraints active at x∗ only; then the feasible sets of (P) and (P̂) coincide with each other in a neighbourhood of x∗, which is quite sufficient for Constraint Qualification).
Among the more general certificates – sufficient conditions – for the Qualification of Constraints 1), the most frequently used is the assumption of regularity of x∗ for (P), which is the following property:
1) look how strange is what we are doing – we are discussing a sufficient condition for something – namely, the Qualification of Constraints – which in turn is nothing but a sufficient condition for making something third – the KKT – a necessary condition for local optimality. There indeed is something in a human being, if he/she is capable of understanding these "conditions for conditions" and operating with them!
Definition 4.1.1 A feasible solution x of problem (P ) is called regular, if the gradients of all active at
x constraints of the problem are linearly independent.
Why regularity of x∗ implies the Qualification of Constraints becomes clear from the following fundamental theorem of Analysis (this is one of the forms of the Implicit Function Theorem):
Theorem 4.1.1 Let x∗ be a point from Rn and φ1 , ..., φl be k ≥ 1 times continuously differentiable in
a neighbourhood of x∗ functions which are equal to 0 at x∗ and are such that their gradients ∇φi (x∗ ) at
x∗ , i = 1, ..., l, form a linearly independent set.
Then there exists a substitution of argument

x = S(y),

k times continuously differentiable along with its inverse, which makes all the functions the coordinate ones; i.e., there exist
• a neighbourhood X of the point x∗ in Rn
• a neighbourhood Y of the origin in Rn
• a one-to-one mapping y 7→ S(y) of Y onto X which maps y = 0 to x∗ : S(0) = x∗
– such that
• (i) S is k times continuously differentiable in Y, and its inverse S^{−1}(x) is k times continuously
differentiable in X;
• (ii) the functions
ψi (y) ≡ φi (S(y))
in Y are just the coordinate functions yi , i = 1, ..., l.
Corollary 4.1.1 Let x∗, φ_1, ..., φ_l satisfy the premise of Theorem 4.1.1, let q ≤ l, let X be the neighbourhood of x∗ given by the Theorem, and let Φ be the solution set of the system

φ_i(x) ≤ 0, 1 ≤ i ≤ q;    φ_i(x) = 0, q + 1 ≤ i ≤ l.

Then there exists a neighbourhood U ⊂ X of x∗ such that the distance from a point x ∈ U to Φ is bounded from above by a constant times the norm of the "violation vector"

δ(x) = ( max{φ_1(x), 0}, ..., max{φ_q(x), 0}, |φ_{q+1}(x)|, ..., |φ_l(x)| ),

i.e., there exists a constant D < ∞ such that for every x ∈ U there exists x′ ∈ Φ with

|x − x′| ≤ D|δ(x)|.    (4.1.5)
Proof. Let V be a closed ball of positive radius r which is centered at the origin and is contained in Y. Since S is at least once continuously differentiable in a neighbourhood of the compact set V, its first order derivatives are bounded in V and therefore S is Lipschitz continuous in V with a certain constant D > 0:

|S(y′) − S(y″)| ≤ D|y′ − y″|    ∀y′, y″ ∈ V.

Since S^{−1} is continuous and S^{−1}(x∗) = 0, there exists a neighbourhood U ⊂ X of x∗ such that S^{−1} maps this neighbourhood into V.
Now let x ∈ U, and consider the vector y = S^{−1}(x). Due to the origin of U, this vector belongs to V, and due to the origin of S, the first l coordinates of the vector are exactly φ_i(x), i = 1, ..., l (since x = S(y), and we know that φ_i(S(y)) = y_i, i = 1, ..., l). Now consider the vector y′ with the coordinates

y′_i = min{y_i, 0}, i = 1, ..., q;    y′_i = 0, i = q + 1, ..., l;    y′_i = y_i, i = l + 1, ..., n.
First Order Optimality Conditions. Now we are able to reach the first of our two targets – to
get the First Order Optimality Conditions.
Besides this, there exists a neighbourhood W of x∗ such that all the inequality constraints which are
not active at x∗ are satisfied in the entire W (indeed, all the constraint functions are continuous at x∗ ,
so that the constraints nonactive at x∗ , being strict inequalities at the point, indeed are satisfied in a
neighbourhood of x∗ ). Now consider the mapping
x ↦ x′(x)

defined as follows: for x ∈ U, x′(x) is the vector x′ given by (4.1.6), if the latter vector belongs to W; otherwise, same as in the case of x ∉ U, let x′(x) = x∗. Note that with this definition of x′(x), this latter vector always is a feasible solution to (P) (why?). Besides this, as x → x∗, the violation vector δ(x) clearly goes to 0, so that x′ given by (4.1.6) also goes to x∗ and therefore eventually becomes a vector from W; it follows that for all x close enough to x∗ the vector x′(x) is exactly the vector given by (4.1.6).
Summarizing our observations, we come to the following conclusions:

We have defined a mapping which puts into correspondence to an arbitrary x ∈ Rn a feasible solution x′(x) to (P). This mapping is bounded, and in a certain neighbourhood Q of x∗ is such that

|x′(x) − x| ≤ D|δ(x)|.    (4.1.7)
Now assume that x is a feasible solution to the linearized problem (P̂). Note that the vector φ(x) = (φ_1(x), ..., φ_l(x)) admits the representation

φ(x) = φ_lin(x) + φ_rem(x),

where φ_lin comes from the linearizations of the functions φ_i at x∗ – i.e., from the constraint functions of (P̂) – and φ_rem comes from the remainders in the first order Taylor expansions of φ_i at x∗. Since x is feasible for (P̂), the first q entries of φ_lin(x) clearly are nonpositive, and the remaining entries are equal to 0. It follows that if x is feasible for (P̂), then the norm of the violation vector δ(x) does not exceed the norm of the vector φ_rem(x) (look at the definition of the violation vector), and the latter norm is ≤ θ(|x − x∗|) with a certain θ(t) = o(t); indeed, the remainders in the first order Taylor expansions of the constraint functions, continuously differentiable in a neighbourhood of x∗, are o(|x − x∗|), x being the point where the remainders are evaluated. Combining this observation with (4.1.7), we conclude that
there is a neighbourhood Z of x∗ such that if x ∈ Z is feasible for (P̂), then

|x′(x) − x| ≤ θ(|x − x∗|)    (4.1.8)

with some θ(t) = o(t). Outside Z the left hand side in the latter inequality clearly is bounded from above by D′|x − x∗| for some D′ (recall that x′(x) is bounded), so that, redefining θ(t) in an appropriate manner outside a neighbourhood of t = 0, we can ensure that (4.1.8) is valid for all x feasible for (P̂). Since x′(x), by construction, is feasible for (P), (4.1.8) demonstrates that the Qualification of Constraints does hold.
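In computations, regularity (Definition 4.1.1 above) is just a rank test on the gradients of the constraints active at the point; here is a minimal sketch with data of our own choosing:

```python
import numpy as np

# Regularity check (cf. Definition 4.1.1): at a feasible x, stack the gradients
# of the constraints active at x and test linear independence via matrix rank.
# Illustrative data: g1(x) = x1^2 + x2^2 - 1 <= 0, h1(x) = x1 - x2 = 0,
# tested at x = (1/sqrt(2), 1/sqrt(2)), where both constraints are active.
x = np.array([1.0, 1.0]) / np.sqrt(2.0)
grad_g1 = 2.0 * x                      # gradient of the active inequality
grad_h1 = np.array([1.0, -1.0])        # gradient of the equality

G = np.vstack([grad_g1, grad_h1])
print("regular point:", np.linalg.matrix_rank(G) == G.shape[0])   # True
```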
min_x { φ(x) : x ∈ Rn },
x∗ is a local minimizer ⇒
∇φ(x∗) = 0     [Fermat rule – the first-order part of the condition]
∇²φ(x∗) ≥ 0    [i.e., d^T ∇²φ(x∗) d ≥ 0 ∀d – the second-order part of the condition]
The second is the sufficient condition "if the gradient ∇φ(x∗) vanishes and the Hessian ∇²φ(x∗) is positive definite, then x∗ is a local minimizer of φ":
The proof of this well-known fact of Calculus is very easy: let us write down the second-order Taylor
expansion of φ at x∗ (t > 0, d is a unit vector: |d| = 1):
(∗)   φ(x∗ + td) − φ(x∗) = t d^T ∇φ(x∗) + (t²/2) d^T ∇²φ(x∗) d + θ(t)    [θ(t) = o(t²)].
If x∗ is a local minimizer of φ, then the left hand side in this relation is nonnegative whenever |d| = 1 and
t is small enough positive real; dividing both sides of the relation by t and passing to limit as t → +0,
we get dT ∇φ(x∗ ) ≥ 0 for all unit vectors d, which is possible if and only if ∇φ(x∗ ) = 0. Given this fact,
dividing both sides of the relation by t2 and passing to limit as t → +0, we get dT ∇2 φ(x∗ )d ≥ 0 for all
unit (and then – for all) vectors d. Thus, we come to the necessary second-order optimality condition.
To prove sufficiency of the sufficient second-order optimality condition, note that under this condition
the linear in t term in the right hand side of (*) vanishes, and the right hand side itself can be rewritten
as
(∗∗)   (t²/2) [ d^T ∇²φ(x∗) d + 2t^{−2} θ(t) ].
Due to the positive definiteness of ∇2 φ(x∗ ), the first term in the parentheses is positive (and, of course,
continuous) function on the unit sphere {d | |d| = 1}. Since the sphere is compact, this function attains
its minimum on the sphere, so that its minimum is positive. It follows that the function dT ∇2 φ(x∗ )d is
bounded away from 0 on the unit sphere:

d^T ∇²φ(x∗) d ≥ α > 0    ∀(d : |d| = 1).
Thus, the first term in the parentheses in (∗∗) is ≥ α > 0. The second term tends to 0 as t → +0, since θ(t) = o(t²). Thus, there exists δ > 0 such that the quantity in the parentheses is strictly positive whenever 0 < t ≤ δ. Now we see that the right hand side in (∗) is positive whenever |d| = 1 and 0 < t ≤ δ, and (∗) says that x∗ indeed is a local minimizer of φ.
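In numeric form, the two second order conditions amount to checking that the gradient vanishes and that the Hessian eigenvalues are nonnegative (necessity) or strictly positive (sufficiency); here is a sketch on an illustrative function of our own:

```python
import numpy as np

# Numeric form of the second order conditions at a candidate x*:
# gradient ~ 0 (Fermat rule) and Hessian positive definite => local minimizer.
# Illustrative function phi(x) = (x1 - 1)^2 + 3*(x2 + 2)^2, minimizer (1, -2).
x_star = np.array([1.0, -2.0])
grad = np.array([2.0 * (x_star[0] - 1.0), 6.0 * (x_star[1] + 2.0)])
hess = np.array([[2.0, 0.0], [0.0, 6.0]])

print("gradient at x*:", grad)                           # [0. 0.]
print("Hessian eigenvalues:", np.linalg.eigvalsh(hess))  # all > 0 => local min
```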
We are about to extend these second order optimality conditions to the case of constrained problems.
The extension will go in two stages: first, we shall consider the case of a very special constrained problem
(P∗)   min_y { φ(y) :  φ_1(y) ≡ y_1 ≤ 0, ..., φ_q(y) ≡ y_q ≤ 0,
                       φ_{q+1}(y) ≡ y_{q+1} = 0, ..., φ_{q+k}(y) ≡ y_{q+k} = 0 },

where φ is a smooth function and the solution to be tested for local optimality is y∗ = 0. Then we shall easily pass from this special case to the general one, using Theorem 4.1.1.
Case of Special problem (P ∗ ). We start with the necessary optimality condition, which is imme-
diate:
• (ii) The Hessian of L∗ in y taken at the point (0; λ∗, µ∗) is positive semidefinite on the linear subspace T∗ = {y | y_1 = ... = y_{q+k} = 0}:
Proof is immediate. (i) is the KKT condition (which in our case is necessary for optimality due to Theorem 4.1.2, since y∗ = 0 clearly is a regular feasible solution to the problem). We can also see directly why (i) is true: a direction d with d_i ≤ 0, i = 1, ..., q, and d_i = 0, i = q + 1, ..., q + k (the remaining entries in d can be arbitrary) clearly is a feasible direction for our problem at the point y∗ = 0, i.e., y∗ + td = td is a feasible solution to the problem for all small positive – in fact, for all positive – values of t. Since y∗ = 0 is locally optimal for (P∗), the objective φ locally cannot decrease along such a direction: d^T ∇φ(0) ≥ 0. Indeed, if d^T ∇φ(0) were negative, then the values of the objective at the feasible solutions td to (P∗) would be, for small positive t, strictly less than the value of the objective at y∗ = 0, which is impossible, since y∗ = 0 is locally optimal. Now, what does it mean that

d^T ∇φ(0) ≡ ∑_{i=1}^n d_i (∂/∂y_i) φ(0) ≥ 0
for all d with the first q entries being nonpositive, the next k entries being zero and the remaining entries being arbitrary? It clearly means that the first q components of ∇φ(0) are nonpositive, the next k components may be arbitrary, and the remaining components are zero. Now let us set λ∗_i = −(∂/∂y_i) φ(0), i = 1, ..., q (note that the λ∗_i are nonnegative) and µ∗_j = −(∂/∂y_{q+j}) φ(0), j = 1, ..., k; then the vector

∇φ(0) + ∑_{i=1}^q λ∗_i ∇φ_i(0) + ∑_{j=1}^k µ∗_j ∇φ_{q+j}(0)
will be 0 (note that ∇φi are simply the basic orths of Rn ), and we get the Lagrange multipliers required
in (i); their uniqueness is an immediate consequence of linear independence of ∇φi (0).
(ii) also is immediate: the entire linear subspace T∗ is feasible for (P∗), so that from local optimality of y∗ = 0 for (P∗) it follows that 0 is a local minimizer of φ on the linear subspace T∗. But we know what the necessary optimality condition for this latter phenomenon is, since the problem of minimizing φ on a linear subspace is equivalent to an unconstrained minimization problem: we simply should treat T∗, and not Rn, as our universe. From the above second order necessary optimality condition for unconstrained minimization we conclude that the second order derivative of φ taken at the point y∗ = 0 along any direction d ∈ T∗, i.e., the quantity d^T ∇²φ(0) d, should be nonnegative. But this quantity is the same as d^T ∇²_y L∗(0; λ∗, µ∗) d, since φ(y) differs from L∗(y; λ∗, µ∗) by a linear function of y, and we come to (ii).
One may ask: why should we express a simple thing – nonnegativity of the second order
derivative of φ taken at y ∗ = 0 along a direction from T ∗ – in that strange way – as similar
property of the Lagrange function. The answer is that this is the L∗ -form of this fact, not
the φ-form of it, which is stable with respect to nonlinear substitutions of the argument and
is therefore appropriate for extension from the case of special problem (P ∗ ) to the case of
general optimization problem (P ).
Now we know what is the necessary second order optimality condition for our special problem (P ∗ ),
and what we are about to do is to understand how the condition should be strengthened in order to
become sufficient for local optimality. The second order part of our necessary condition in fact comes
from the second order part of the necessary optimality condition for the unconstrained case – we simply
replace the entire Rn by the “unconstrained part of the feasible set” – the linear subspace T ∗ , and
the condition says that ∇2 φ(0) ≡ ∇2y L∗ (0; λ∗ , µ∗ ) should be positive semidefinite along the directions
from T ∗ . By analogy with the unconstrained case we could guess that to make our necessary condition
sufficient, we should replace “positive semidefinite” in the latter sentence by “positive definite”. The
truth, however, is more sophisticated, as is seen from the following example:
(q = 1, k = 0). Here the "first-order" part of the optimality condition from Proposition 4.2.1 is satisfied with λ∗_1 = 0, the linear subspace T∗ is T∗ = {d = (d_1, d_2) | d_1 = 0}, and the Hessian of φ taken at 0 and restricted to T∗ is positive definite: (∂²/∂y_2²) φ(0) = 2 > 0. Nevertheless, y∗ = 0 clearly is not locally optimal, and we see that the "naive" modification of the condition – "replace d^T ∇²L∗ d ≥ 0 by d^T ∇²L∗ d > 0 for nonzero d ∈ T∗" – does not work: it does not make the condition sufficient for local optimality.
The actual sufficient condition for local optimality is as follows:
Proposition 4.2.2 Consider special optimization program (P ∗ ), and assume that the data φ, φi are twice
continuously differentiable in a neighbourhood of y ∗ = 0. Assume also that there exist Lagrange multipliers
λ∗i ≥ 0, i = 1, ..., q, and µ∗j , j = 1, ..., k, such that for the Lagrange function
L∗(y; λ, µ) = φ(y) + ∑_{i=1}^q λ_i φ_i(y) + ∑_{j=1}^k µ_j φ_{q+j}(y)
one has
• [first-order part] ∇y L∗ (0; λ∗ , µ∗ ) = 0
and
• [second-order part] d^T ∇²_y L∗(0; λ∗, µ∗) d > 0 for all nonzero d from the linear subspace

T∗∗ = {d : d_i = 0 for all i ≤ q such that λ∗_i > 0;  d_{q+j} = 0, j = 1, ..., k}.

Then y∗ = 0 is a locally optimal solution to (P∗).
Before proving this statement, let us stress the difference between the necessary optimality condition from
Proposition 4.2.1 and our new sufficient optimality condition. The first order parts of the conditions are
identical. The second order part of the sufficient optimality condition is, as it should be, stronger than
the one of the necessary condition, and it is stronger in two points:
– first, now we require positive definiteness of the Hessian ∇2y L∗ of the Lagrange function along certain
subspace of directions, not simply positive semidefiniteness of it, as it was in the necessary condition for
optimality. There is no surprise – this is the case already in the unconstrained situation;
– second, and more important, the subspace of directions along which we require definiteness of the Hessian is wider for the sufficient condition than for the necessary one. Indeed, T∗∗ and T∗ impose the same requirements on the entries of d with indices > q, and at the same time impose different requirements on the entries of d with indices ≤ q: T∗ requires all these entries to be zero, while T∗∗ asks to be zero only the entries d_i, i ≤ q, associated with positive Lagrange multipliers λ∗_i; the entries of d associated with zero Lagrange multipliers now can be arbitrary.
Note that the above "bad example" does not satisfy the second order sufficient condition for optimality (and how could it? In this example y∗ = 0 is not locally optimal!) exactly due to the second of the above two points: in this example λ∗1 = 0, which makes T∗∗ larger than T∗ (T∗ is the y2-axis, while T∗∗ is the entire plane of y1, y2); the Hessian of the Lagrange function in this example is positive definite on T∗, but is not positive definite on T∗∗.
Note also that the second of the above two "strengthening points" is important only in the case when some of the inequality constraints φi(y) ≤ 0, i = 1, ..., q, are associated with zero Lagrange multipliers λ∗i; this is the only case when T∗∗ may indeed be wider than T∗.
The proof of Proposition 4.2.2 is as follows. Assume that the condition for optimality stated in the Proposition is satisfied; we should prove that y∗ = 0 is a locally optimal solution to (P∗). Assume, on the contrary, that this is not the case: there exists a sequence {y^t} of feasible solutions to (P∗) which converges to y∗ = 0 and is such that φ(y^t) < φ(0), and let us lead this assumption to a contradiction. Of course, we have y^t ≠ 0 (since φ(y^t) < φ(0)). Let s_t = |y^t| be the norm of y^t, and let
d^t = s_t^{-1} y^t,
so that |d^t| = 1.
From the first order part of our optimality condition we know that
\[ \nabla\varphi(0) + \sum_{i=1}^{q}\lambda_i^*\nabla\varphi_i(0) + \sum_{j=1}^{k}\mu_j^*\nabla\varphi_{q+j}(0) = 0. \]
Substituting y = y^t = s_t d^t into the second order Taylor expansion of φ at 0 and taking into account that the entries with indices q + 1, ..., q + k in d^t vanish (since y^t is feasible for (P∗)), we get
\[ 0 > s_t^{-1}[\varphi(y^t)-\varphi(0)] = -\sum_{i=1}^{q}\lambda_i^* d_i^t + s_t\Bigl[\tfrac{1}{2}(d^t)^T H d^t + s_t^{-2}\rho(s_t d^t)\Bigr]_2, \tag{4.2.5} \]
where H = ∇2φ(0) and ρ(·) is the Taylor remainder, |ρ(y)| ≤ θ(|y|) with θ(s) = o(s²).
• (c): as t → ∞, the quantities [·]2 converge to (1/2)(d∗)T H d∗ (since |ρ(s_t d^t)| ≤ θ(s_t) and θ(s) = o(s²)), whence the quantities s_t[·]2 converge to 0.
Our first conclusion from these observations is that the right hand side of (4.2.5) converges, as t → ∞, to
\[ -\sum_{i=1}^{q}\lambda_i^* d_i^* \;\ge\; 0, \]
while the left hand side is nonpositive, whence the limit of the right hand side actually is equal to 0:
\[ (!)\qquad \sum_{i=1}^{q}\lambda_i^* d_i^* = 0. \]
Since the vector d∗ inherits from the vectors d^t the property of having the first q coordinates nonpositive and the next k coordinates zero, we see that every term λ∗i d∗i in the sum (!) is nonpositive; since the sum is equal to 0, we conclude that λ∗i d∗i = 0, i = 1, ..., q. Since, besides this, d∗_{q+j} = 0, j = 1, ..., k, we conclude that d∗ ∈ T∗∗.
Now let us divide both sides of (4.2.5) by s_t; taking into account observation (a), we get the inequality
\[ 0 \ge s_t^{-2}(\varphi(y^t)-\varphi(0)) \ge [\cdot]_2. \]
From (c), the right hand side in this inequality converges, as t → ∞, to (1/2)(d∗)T H d∗, so that this quantity should be nonpositive. At the same time, we already know that d∗ ∈ T∗∗, and we know from the very beginning that |d∗| = 1; thus, the second-order part of the optimality condition implies that (1/2)(d∗)T H d∗ > 0 (note that ∇2φ(0) = ∇2y L∗(0; λ∗, µ∗): the φi are linear!); this is the desired contradiction.
From special case to the general one. Now we are in a position where we can easily achieve the
second of our targets – to get the second order optimality conditions for the case of a general optimization
problem (P ).
• The Hessian ∇2x L(x∗; λ∗, µ∗) of the Lagrange function in x is positive semidefinite on the orthogonal complement M∗ of the set of gradients of the constraints active at x∗:
M∗ = {d : dT∇gi(x∗) = 0, i ∈ I(x∗); dT∇hj(x∗) = 0, j = 1, ..., k};
• The Hessian ∇2x L(x∗; λ∗, µ∗) of the Lagrange function in x is positive definite on the orthogonal complement M∗∗ of the set of gradients of the equality constraints and of those inequality constraints active at x∗ which are associated with positive Lagrange multipliers λ∗i:
M∗∗ = {d : dT∇gi(x∗) = 0, i ∈ J(x∗); dT∇hj(x∗) = 0, j = 1, ..., k}
(here J(x∗) is the set of indices of the inequality constraints active at x∗ and associated with positive Lagrange multipliers λ∗i).
Then x∗ is locally optimal for (P ).
Proof. Without loss of generality, we may assume that the inequality constraints active at x∗ are the first q inequality constraints. Now let us consider the q + k functions g1, ..., gq, h1, ..., hk; all these functions are equal to 0 at x∗, and their gradients at this point are linearly independent. According to Theorem 4.1.1, there exists a substitution of argument y ↦ S(y), S(0) = x∗, at least twice continuously differentiable in a neighbourhood Y of the point y∗ = 0, which is a one-to-one mapping of Y onto a neighbourhood X of x∗, has an at least twice continuously differentiable inverse S⁻¹(x) on X, and "linearizes" the functions g1, ..., gq, h1, ..., hk:
φi(y) ≡ gi(S(y)) ≡ yi, i = 1, ..., q;
φi(y) ≡ hi−q(S(y)) ≡ yi, i = q + 1, ..., q + k.
Now let
φ(y) = f(S(y)),
and consider the special problem (P∗) with the data φ, φ1, ..., φq+k. Denoting by R(x) ≡ S⁻¹(x) the inverse substitution, we have
f(x) = φ(R(x)), gi(x) = φi(R(x)), i = 1, ..., q, hj(x) = φq+j(R(x)), j = 1, ..., k, (4.2.7)
and, differentiating the identity S(R(x)) ≡ x at x = x∗,
S′(0)R′(x∗) = I.
do participate in the optimality conditions – they appear in the Lagrange function of (P). But this is nothing but an illusion (caused by our desire to write the conditions with sums over i = 1, ..., m instead of sums over i ∈ I(x∗)): both the necessary and the sufficient conditions in question include the complementary slackness λ∗i gi(x∗) = 0, i = 1, ..., m, which implies that the Lagrange multipliers associated with nonactive gi are zeros; looking at our optimality conditions, we conclude that in fact the nonactive constraints do not enter them.
After we drop the non-active constraints (we just have seen that it can be done), the Lagrange
functions of (P ) and (P ∗ ) become linked with each other by the same relation as the data functions of
the problems:
L(x; λ, µ) = L∗ (R(x); λ, µ). (4.2.8)
Fourth, and most important, from (4.2.8) and the rules for differentiating superpositions we have the following relations:
∇x L(x∗; λ, µ) = QT ∇y L∗(y∗; λ, µ) (4.2.9)
and
vT ∇2x L(x∗; λ, µ)v = (Qv)T ∇2y L∗(y∗; λ, µ)(Qv) + [∇y L∗(0; λ, µ)]T R″[v],
where v is an arbitrary vector, Q = R′(x∗), and
R″[v] = (d²/dt²)|t=0 R(x∗ + tv)
is the second-order derivative of the mapping R(·) taken at x∗ along the direction v. It follows that if λ and µ are such that ∇y L∗(0; λ, µ) = 0, then the Hessians of L and L∗ in the primal variables are linked by the simple relation
∇2x L(x∗; λ, µ) = QT ∇2y L∗(0; λ, µ)Q (4.2.10)
which does not involve the “curvature” R00 [·] of the substitution x 7→ R(x) which converts (P ) into (P ∗ ).
Since in the optimality conditions for (P∗) we focus on Lagrange multipliers which make ∇L∗(0; λ, µ) equal to zero, the latter observation makes the "translation" of the optimality conditions from the (P∗)-language to the (P)-language indeed easy. Let me show how this translation is carried out for the necessary optimality condition; with this example, the reader will certainly be able to translate the sufficient optimality condition on his own.
Thus, let us prove (i). Assume that x∗ is locally optimal for (P ). Then, according to the above
remarks, y ∗ = 0 is locally optimal for (P ∗ ). By Proposition 4.2.1 the latter means that there exist λ∗i ≥ 0
and µ∗j such that
(#) ∇y L∗(0; λ∗, µ∗) = 0,
(&) dT ∇2y L∗(0; λ∗, µ∗)d ≥ 0 ∀d ∈ T∗ ≡ {d | di = 0, i = 1, ..., q + k}.
Let us verify that the above λ∗ and µ∗ are exactly the entities required in (i)2) . First, we should not
bother about complementary slackness λ∗i gi (x∗ ) = 0 – see the “recall” in the latter footnote. Second, we
do have the KKT Equation ∇x L(x∗ ; λ∗ , µ∗ ) = 0, since we have similar equation (#) for L∗ and we have
the chain rule (4.2.9). Thus, we do have the first-order part of (i). To get the second-order part, note that we have the similar statement (&) for L∗, and from (#) it follows that we also have the "chain rule" (4.2.10), so that we have the inequality
\[ d^T\nabla_x^2 L(x^*;\lambda^*,\mu^*)\,d = (Qd)^T\nabla_y^2 L^*(0;\lambda^*,\mu^*)(Qd) \ge 0 \quad \text{for every } d \text{ with } Qd\in T^*. \tag{4.2.11} \]
It remains to understand what it means that Qd ∈ T∗. We have φi(y) ≡ yi, gi(x) = φi(R(x)), i = 1, ..., q, and hj(x) = φq+j(R(x)), which, in simple words, means that the vector-function comprised
2) recall that we have reduced the situation to the case when all inequality constraints in (P) are active at x∗; otherwise we would have to say: "Let us take the λ∗i's coming from (P∗) as the Lagrange multipliers required in (i) for the active inequality constraints, and let us set the Lagrange multipliers of the nonactive constraints to zero"
of g1, ..., gq, h1, ..., hk is nothing but the (q + k)-dimensional initial fragment of the n-dimensional vector-function R. Since Q = R′(x∗), the first q + k entries of the vector Qd are exactly the inner products of the gradients of g1, ..., gq, h1, ..., hk taken at x∗ with the direction d; to say that Qd ∈ T∗ is the same as to say that all these inner products are zero (recall that T∗ is the subspace of vectors with the first q + k entries equal to 0), i.e., to say that d ∈ M∗. Thus, (4.2.11) is exactly the second-order part of (i). □
Remark 4.2.1 Now we understand why we formulated the second-order parts of the optimality conditions in Propositions 4.2.1, 4.2.2 in terms of the Hessian of the Lagrange function and not of the objective (although in the case of the special problem (P∗) the Hessian of the Lagrangian is exactly the same as the Hessian of the objective). Only the formulation in terms of the Lagrange function remains invariant under nonlinear substitutions of the argument – the tool we used to get the optimality conditions in the general case as immediate consequences of the same conditions for the simple special case.
(these coefficients should be ≤ 0). This conclusion comes from the fact that at a locally optimal x∗ the objective has no right to decrease not only along directions tangent to S, but also along those tangent to the surface given by the equality constraints and leading inside the "curved half-spaces" given by the active inequality constraints. We see that the KKT condition – the first-order part of the second order necessary optimality condition – is very geometric.
What about the geometry of the second-order part of the condition? A naive guess would be as follows:
"if x∗ is locally optimal, we for sure are unable to improve the objective by small displacements along S. Perhaps this means that we are unable to improve the objective by small displacements along the tangent plane x∗ + M∗ to S as well – this tangent plane locally is so close to S! If our guess is true, it means that x∗ is a local minimizer of the objective on the tangent plane, and we know what that means – the gradient of the objective at x∗ should be orthogonal to M∗, and the second order derivative of the objective along every direction from M∗ should be nonnegative".
The above reasoning is absolutely false. The second order part of the "unconstrained" optimality condition – "the second order derivative along any direction should be nonnegative" – comes from analyzing second-order effects caused by small perturbations of x∗, and the tangent plane does not, normally, approximate a curved surface within second-order terms. It follows that the second-order phenomena we meet when travelling along a curved surface are not the same as those we meet when travelling along the tangent plane to this surface, so that conclusions derived from analyzing these "plane" phenomena may have nothing in common with reality. For example, in the optimization problem
the point x∗ = (0, 1) clearly is locally optimal, in spite of the fact that the second order
derivative of the objective taken at x∗ along the tangent line {x2 = 1} to the feasible
circumference is negative.
It turns out that it is the second order derivative of the Lagrange function (with properly chosen Lagrange multipliers), and not that of the objective, which should be nonnegative along directions tangent to S at a local optimum; this is the main moral which can be extracted from the second order optimality conditions.
• x∗ satisfies the Sufficient Second Order Optimality condition from Theorem 4.2.1,
and
• x∗ is such that all the Lagrange multipliers λ∗i given by the Sufficient Second Order Optimality condition for the inequality constraints active at x∗ are strictly positive rather than simply nonnegative.
The third requirement imposed in the definition might look strange – what is it for? Already the first two requirements enforce x∗ to be locally optimal! In fact, all three requirements make sense – all of them are needed by Sensitivity Analysis.
Theorem 4.2.2 Consider the parameterized family of problems (4.2.12), let x∗ be a nondegenerate solution to problem (P0,0), and let λ∗, µ∗ be the corresponding vector of Lagrange multipliers, so that (x∗; λ∗, µ∗) is a KKT point of the problem (P0,0). Then there exist
• a small enough neighbourhood X of the point x∗ in the space Rn of the design vectors,
• a small enough neighbourhood V of the point (b, d) = (0, 0) in the space Rm × Rk of the parameters,
• continuously differentiable functions
x∗(b, d) : V → X, λ∗(b, d) : V → Rm, µ∗(b, d) : V → Rk,
such that
• whenever (b, d) is close enough to (0, 0), namely, whenever (b, d) ∈ V, x∗(b, d) ∈ X is a nondegenerate locally optimal solution to the "perturbed" problem (Pb,d), λ∗(b, d), µ∗(b, d) being the corresponding vectors of Lagrange multipliers; (x∗(b, d), λ∗(b, d), µ∗(b, d)) is the only KKT point of problem (Pb,d) in the set of points (x, λ, µ) with x ∈ X;
• for (b, d) ∈ V, −λ∗i(b, d) and −µ∗j(b, d) are the partial derivatives, with respect to the parameters bi and dj respectively, of the "local optimal value" f(x∗(b, d)) of the perturbed problem (this is relation (4.2.13)).
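To make the sensitivity interpretation concrete, here is a tiny numerical check on a toy problem of our own (min x² subject to x − b = 0); every name here is an assumption for illustration, not data from the notes:

```python
# Toy illustration of the sensitivity claim: for min { x^2 : x - b = 0 } the
# local optimal value is c(b) = b^2, and the KKT equation 2x + lambda = 0 at
# x = b gives lambda*(b) = -2b, so dc/db = 2b = -lambda*(b), as the theorem says.
def lam_star(b):
    return -2.0 * b        # Lagrange multiplier of the constraint x - b = 0

def local_opt_value(b):
    return b * b           # c(b): the "local optimal value"

b, eps = 0.7, 1e-6
numeric = (local_opt_value(b + eps) - local_opt_value(b - eps)) / (2 * eps)
print(numeric, -lam_star(b))   # both are approximately 1.4
```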
Theorem 4.2.2 can be easily derived from the Implicit Function Theorem in the following standard form (which is slightly different from the form presented in Theorem 4.1.1):
Implicit Function Theorem. Let U′ be a neighbourhood of a point u∗ ∈ RN, V′ be a neighbourhood of a point v∗ ∈ RM, and let
Φ(u, v) : U′ × V′ → RN
be an s ≥ 1 times continuously differentiable mapping such that Φ(u∗, v∗) = 0 and the matrix Φ′u(u∗, v∗) is nonsingular. Then there exist a neighbourhood U of u∗, a neighbourhood V of v∗, and an s times continuously differentiable function
u∗(v) : V → U
– "the solution of the system of equations Φ(u, v) = 0 with respect to u" –
such that
• Φ(u∗(v), v) ≡ 0, v ∈ V;
• for every v ∈ V, the point u∗(v) ∈ U is the unique in U solution of the system of equations Φ(u, v) = 0, u being the unknown in the system.
Here is the sketch of a derivation of Theorem 4.2.2 from the Implicit Function Theorem:
1°. Notation. Let x∗ be a nondegenerate locally optimal solution to (P0,0), and let λ∗ and µ∗ be the corresponding Lagrange multipliers. Without loss of generality, we may assume that all inequality constraints of the problem are active at x∗ – the nonactive constraints locally do not influence the problem, and we can drop them.
2°. Let us set
N = n + m + k; M = m + k.
We partition vectors u ∈ RN into parts
u = (x, λ, µ)
of the sizes n, m, k, respectively, and set
u∗ = (x∗, λ∗, µ∗);
similarly, we partition vectors v ∈ RM into parts
v = (b, d)
of the sizes m, k, respectively, and set v∗ = (0, 0). With Φ(u, v) standing for the mapping which expresses the KKT system of (Pv),
\[ \Phi(x,\lambda,\mu;\,b,d) = \Bigl(\nabla_x f(x) + \textstyle\sum_{i=1}^{m}\lambda_i\nabla h_i(x) + \sum_{j=1}^{k}\mu_j\nabla g_j(x);\ \{h_i(x)-b_i\}_{i=1}^{m};\ \{\mu_j(g_j(x)-d_j)\}_{j=1}^{k}\Bigr), \]
we have
Φ(u∗, v∗) = 0,
since x∗ is a nondegenerate solution of (Pv∗) = (P0,0), and λ∗, µ∗ are the corresponding Lagrange multipliers.
3°. All we should prove is that the matrix Φ′u(u∗, v∗) is nondegenerate. To this end it suffices to verify that the only solution (dx, dλ, dµ) of the homogeneous linear system Φ′u(u∗, v∗)(dx; dλ; dµ) = 0, i.e., of the system
\[
\begin{array}{ll}
(\alpha) & \nabla_x^2 L(x^*;\lambda^*,\mu^*)\,dx + \sum_{i=1}^{m} d\lambda_i\,\nabla h_i(x^*) + \sum_{j=1}^{k} d\mu_j\,\nabla g_j(x^*) = 0,\\
(\beta) & (\nabla h_i(x^*))^T dx = 0,\quad i = 1, ..., m,\\
(\gamma) & \mu_j^*\,(\nabla g_j(x^*))^T dx = 0,\quad j = 1, ..., k,
\end{array}
\]
is the trivial one (when computing the last k rows of the matrix Φ′u(u∗, v∗), one should take into account that, by virtue of our convention from 1°, gj(x∗) = 0 for all j, so that the terms dµj gj(x∗) which occur when differentiating Φ in fact vanish).
From (β) and (γ) it follows that dx is orthogonal to the gradients of all equality constraints and to
the gradients of those inequality constraints for which µ∗j > 0, all the gradients being taken at x∗ . With
this in mind, let us take the inner product of both sides of (α) (this is an n-dimensional vector equality)
and the vector dx; we will get
(dx)T ∇2x L(x∗ ; λ∗ , µ∗ )dx = 0,
which, by definition of a nondegenerate solution and due to the already established orthogonality of dx
and the gradients of the equality constraints and the gradients of those inequality constraints with µ∗j > 0,
implies that dx = 0. Since dx = 0, (α) results in
\[ \sum_{i=1}^{m} d\lambda_i\,\nabla h_i(x^*) + \sum_{j=1}^{k} d\mu_j\,\nabla g_j(x^*) = 0, \]
whence dλ = 0 and dµ = 0 (x∗ is regular, so that the set comprised of gradients of all the constraints is
linearly independent; here again we use our convention that all the inequality constraints are active at
x∗ ). Thus, dx = 0, dλ = 0, dµ = 0, as claimed.
4°. Since, by 3°, the premise of the Implicit Function Theorem is satisfied at (u∗, v∗) with s = 1, the theorem implies that there exist a neighbourhood V of the point v∗ = (0, 0) in the plane of parameters identifying the problems (Pv) from our family, a neighbourhood U of the point u∗, and a once continuously differentiable function
u∗(v) ≡ (x∗(b, d), λ∗(b, d), µ∗(b, d)) : V → U [v = (b, d)]
such that
A:
Φ(u∗ (v), v) ≡ 0, v ∈ V,
and for v ∈ V the point u∗ (v) is the unique in U solution to the system of equations
Φ(u, v) = 0.
Since x∗ is nondegenerate, µ∗ > 0; since u∗ (·) is continuous, passing, if necessary, to smaller U and
V , we may assume that
B: µ∗ (v) > 0 when v ∈ V .
Now let X′ be a neighbourhood of x∗ such that the gradients of the functions hi(x), gj(x), i = 1, ..., m, j = 1, ..., k, are linearly independent for x ∈ X′; since these gradients are linearly independent at x = x∗ (x∗ is a regular point of the constraints!), such a neighbourhood exists. Some points x from X′ can be extended by properly chosen Lagrange multipliers to triples (x, λ, µ) satisfying the KKT equation, and some, perhaps, cannot; let X″ be the set of those x ∈ X′ where these multipliers exist. Since the gradients of all the constraints at the points from X′ are linearly independent, the Lagrange multipliers for x ∈ X″ are uniquely defined and continuous on X″; when x = x∗, these multipliers are λ∗, µ∗. Consequently, there exists a small enough neighbourhood of x∗, let it be called X, such that
C: the set of gradients of all the constraints at any point x ∈ X is linearly independent, and if x ∈ X
can be extended by Lagrange multipliers λ, µ to a triple (x, λ, µ) satisfying the KKT equation, then
(x, λ, µ) ∈ U .
Shrinking, if necessary, V , we can ensure that
D: x∗ (v) ∈ X when v ∈ V .
Now note that the matrix ∇2x L(x; λ, µ) is continuous in x, λ, µ, and the plane
Φ(·, v) = 0.
By A such a solution is uniquely defined and is (x∗ (v), λ∗ (v), µ∗ (v)), as claimed.
Thus, we have proved all the required statements except for (4.2.13). The proof of the latter relation is immediate: we know from the Implicit Function Theorem that x∗(v) is differentiable in v ∈ V; let x′ be the derivative of this mapping, taken at a certain point v̄ ∈ V along a certain direction δv = (δb, δd). Taking the inner product of both sides of the equality
\[ \nabla_x f(x^*(\bar v)) + \sum_{i=1}^{m}\lambda_i^*(\bar v)\nabla_x h_i(x^*(\bar v)) + \sum_{j=1}^{k}\mu_j^*(\bar v)\nabla_x g_j(x^*(\bar v)) = 0 \]
with the vector x′, and combining the result with the equalities hi(x∗(v)) ≡ bi, i = 1, ..., m, and gj(x∗(v)) ≡ dj, j = 1, ..., k (the fact that all the gj are active at x∗(v) comes from the KKT system due to µ∗j(v) > 0, j = 1, ..., k, see B), differentiated at the point v̄ in the direction δv, we observe that the right hand side in (4.2.14) is equal to
\[ -\sum_{i=1}^{m}\lambda_i^*(\bar v)\,\delta b_i - \sum_{j=1}^{k}\mu_j^*(\bar v)\,\delta d_j, \]
as required in (4.2.13).
The equality part of this system is a system of n + m + k nonlinear equations with n + m + k unknowns – the entries of x∗, λ∗, µ∗. Normally such a system has only finitely many solutions. If we are clever enough to find all these solutions analytically, and if for some reason we know that the optimal solution exists and indeed satisfies the KKT condition (e.g., the assumptions of Theorem 4.1.2 are satisfied at every feasible solution), then, looking through all solutions to the KKT system and choosing among them the one which is feasible and has the best value of the objective, we may be sure that we shall end up with the optimal solution to the problem. In this process, we may use the inequality part of the system (as well as the additional inequalities coming from the second order necessary optimality conditions) to eliminate from the list the candidates which do not satisfy the inequalities, which enables us to skip a more detailed analysis of these candidates.
An approach of this type is especially fruitful if (P) is convex (i.e., f, g1, ..., gm are convex and h1, ..., hk are linear). The reason is that in this case the KKT conditions are sufficient for global optimality (we know this from the previous Lecture, although for the case when there are no equality constraints; the extension to the case of linear equality constraints is quite straightforward). Thus, if the problem is convex and we are able to point out a single solution to the KKT system, then we can be sure that it is the actual – globally optimal – solution to (P), and we need not look for other KKT points and compare them to each other.
Unfortunately, the outlined program can be carried out in simple cases only; normally the nonlinear KKT system is too difficult for analytical solution. Let us look at one – and a very instructive one – of the "simple cases".
Minimizing a homogeneous quadratic form over the unit ball. Consider the problem
(Q) min { f(x) = xT Ax : g1(x) ≡ |x|² − 1 ≤ 0 },
A being a symmetric n × n matrix. Let us list all locally optimal solutions to the problem.
Step 0. Let f∗ denote the optimal value. Since x = 0 clearly is a feasible solution and f(0) = 0, we have f∗ ≤ 0. There are, consequently, two possible cases:
Case (A): f ∗ = 0;
Case (B): f ∗ < 0.
Step 1: Case (A). Case (A) takes place if and only if xT Ax ≥ 0 for all x, |x| ≤ 1, or, which is the
same due to homogeneity with respect to x, if and only if
xT Ax ≥ 0 ∀x.
We know that symmetric matrices with this property have a special name – they are called symmetric positive semidefinite (we met these matrices in this Lecture and also in the Convexity criterion for twice continuously differentiable functions: such a function is convex on a certain open convex domain if and only if the Hessian of the function at any point of the domain is positive semidefinite). In Linear Algebra there are tests for positive semidefiniteness (the simplest is the Sylvester rule: a symmetric matrix is positive semidefinite if and only if all its principal minors – those formed by several rows and the columns with the same indices as the rows – are nonnegative; there are also tests which can be run in a number of arithmetic operations cubic in the size of the matrix). Now, what are the locally optimal solutions to the problem in the case of a positive semidefinite A? We claim that these are exactly the points x from the unit ball (the feasible set of the problem) which belong to the kernel of A, i.e., are such that
Ax = 0.
First of all, if x possesses the latter property, then xT Ax = 0 = f∗, so that x is even globally optimal. Vice versa, assume that x is locally optimal, and let us prove that Ax = 0. The constraint in our problem is convex; the objective also is convex (recall the criterion of convexity for smooth functions and note that f″(x) = 2A), so that a locally optimal solution is in fact optimal. Thus, x is locally optimal if and only if xT Ax = 0. In particular, if x is locally optimal, then, say, x′ = x/2 also is. At this new locally optimal solution the constraint is satisfied as a strict inequality, so that x′ is an unconstrained local minimizer of the function f(·), and by the Fermat rule we get ∇f(x′) ≡ 2Ax′ = 0, whence also Ax = 0, as claimed.
Step 2: Case (B). Now consider the case of f ∗ < 0, i.e., the one when there exists h, |h| ≤ 1, such
that
(#) hT Ah < 0.
What are the locally optimal solutions x∗ to the problem?
What is said by the First Order Optimality conditions. Logically, two possibilities can take place: the first is that |x∗| < 1, and the second is that |x∗| = 1.
Let us prove that the first possibility is in fact impossible. Indeed, in the case of |x∗| < 1, x∗ should be locally optimal for the unconstrained problem min_x {f(x) : x ∈ Rn} with smooth objective. By the second order necessary condition for unconstrained local optimality, the Hessian of f at x∗ (which is equal to 2A) should be positive semidefinite, which contradicts (#).
Thus, in the case in question a locally optimal solution x∗ is on the boundary of the unit ball, and
the constraint g1 (x) ≤ 0 is active at x∗ . The gradient 2x∗ of this constraint is therefore nonzero at x∗ ,
so that (by Theorem 4.1.2) x∗ is a KKT point:
∃λ∗1 ≥ 0 : ∇f (x∗ ) + λ∗1 ∇g1 (x∗ ) = 0,
or, which is the same,
Ax∗ = −λ∗1 x∗ .
Thus, x∗ should be a unit eigenvector3) of A with a nonpositive eigenvalue λ ≡ −λ∗1; this is all we can extract from the First Order Necessary Optimality conditions.
Looking at the example
A = Diag(1, 0, −1, −2, −3, ..., −8)
in R10, we see that the First Order Necessary Optimality conditions are satisfied by 18 vectors ±e2, ±e3, ..., ±e10, where ei, i = 1, ..., 10, are the standard basic orths in R10. All these 18 vectors are Karush-Kuhn-Tucker points of the problem, and the First Order Optimality conditions do not allow us to find out which of these 18 candidates are locally optimal and which are not. To get the answer, we should use the Second Order Optimality conditions.
What is said by the Second Order Optimality conditions. We come back to our general problem (Q);
recall that we are in the case of (B). We already know that a locally optimal solution to (Q) is a unit
vector, and the set of constraints active at x∗ is comprised of our only inequality constraint; its gradient
2x∗ at x∗ is nonzero, so that we have (Regularity), and consequently have the Second Order Necessary
Optimality condition (Theorem 4.2.1.(i)). Thus, there should exist Lagrange multiplier λ∗1 ≡ −λ ≥ 0
such that
2Ax∗ − 2λx∗ ≡ ∇f (x∗ ) + λ∗1 ∇g1 (x∗ ) = 0
3) recall that an eigenvector of a square matrix M is a nonzero vector e such that Me = se for a certain real s (this real is called the eigenvalue of M associated with the eigenvector e).
and
dT[2A − 2λI]d ≡ dT[∇2f(x∗) + λ∗1∇2g1(x∗)]d ≥ 0
(I is the unit matrix) for any d satisfying the condition dT∇g1(x∗) ≡ 2dT x∗ = 0. In other words, at a locally optimal x∗ we should have
Ax∗ = λx∗ (4.3.1)
and
dT(A − λI)d ≥ 0 (4.3.2)
for all d such that
dT x∗ = 0.
Let us prove that in fact (4.3.2) is valid for all d ∈ Rn . Indeed, given a d ∈ Rn , we can decompose it as
d = d1 + d2 ,
where d1 is the orthogonal projection of d onto the one-dimensional subspace spanned by x∗, and d2 is the orthogonal projection of d onto the orthogonal complement of x∗:
d1 = αx∗, (d2)T x∗ = 0.
We have
dT(A − λI)d = (d1)T(A − λI)d1 + 2(d1)T(A − λI)d2 + (d2)T(A − λI)d2;
from d1 = αx∗ and from (4.3.1) it follows that the first two terms – those containing d1 – in the last expression are zero, and from (4.3.2) it follows that the third term is nonnegative (recall that (d2)T x∗ = 0). Consequently, the expression is nonnegative, as claimed.
We conclude that a locally optimal solution x∗ of the problem (Q) should be a unit vector such that x∗ is an eigenvector of A with a nonpositive eigenvalue λ and, besides this, such that the matrix A − λI is positive semidefinite – (4.3.2) should be valid for all d. It follows that λ is the smallest of the eigenvalues of A: indeed, if λ′ is another eigenvalue, so that there exists a nonzero x′ with Ax′ = λ′x′, then we should have
0 ≤ (x′)T(A − λI)x′ ≡ (x′)T(λ′ − λ)x′ = (λ′ − λ)|x′|²,
whence λ′ ≥ λ.
We conclude that λ – minus the Lagrange multiplier associated with a locally optimal solution x∗ – is uniquely defined by the problem: it is the smallest eigenvalue of A. And since
f(x∗) = (x∗)T Ax∗ = λ(x∗)T x∗ = λ
(we already know that a locally optimal solution must be a unit eigenvector of A with the eigenvalue λ), we conclude that the value of the objective at a locally optimal solution also is uniquely defined by the
problem. Since the problem clearly is solvable (the feasible set is compact and the objective is continuous),
locally optimal solutions exist and among them there are optimal ones; and since the objective, as we
have seen, is constant on the set of locally optimal solutions, we conclude that
In the case of (B) locally optimal solutions are the same as optimal ones and all of them are unit
eigenvectors of A associated with the smallest eigenvalue of the matrix.
On the other hand, if x∗ is a unit eigenvector of A with the eigenvalue λ, then f (x∗ ) = λ (see the above
computation), so that all unit eigenvectors of A associated with the smallest eigenvalue of the matrix
are, in the case of (B), optimal solutions to (Q). We see that the Second Order Necessary Optimality
condition, in contrast to the First Order one, allows us to get complete description of the solutions to
(Q).
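This description is easy to confirm numerically; below is a small sanity check of our own (a sketch assuming numpy), run on the Diag(1, 0, −1, ..., −8) example from above:

```python
# Numerical check: for A = Diag(1, 0, -1, ..., -8) in R^10, case (B) holds,
# and the minimizers of x^T A x over the unit ball should be the unit
# eigenvectors for the smallest eigenvalue, i.e. +-e_10, with value -8.
import numpy as np

A = np.diag([1.0, 0.0] + [-float(s) for s in range(1, 9)])
eigvals, eigvecs = np.linalg.eigh(A)          # eigenvalues in ascending order
lam_min, x_star = eigvals[0], eigvecs[:, 0]   # smallest eigenvalue, unit eigenvector

print(lam_min)                 # ~ -8
print(x_star @ A @ x_star)     # ~ -8: objective value at the minimizer
# The other KKT points +-e_i (nonpositive eigenvalues) give larger values
# lambda_i; only the smallest eigenvalue yields (local = global) optimality.
```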
Remark 4.3.1 A byproduct of our reasoning is the statement that if a symmetric matrix A satisfies (#), then A has an eigenvector ((Q) for sure is solvable, and the First Order Necessary condition says, as we have seen, that an optimal solution must be an eigenvector). Note that it is far from being clear in advance why a symmetric matrix should have an eigenvector. Of course, our reasoning establishes the existence of an eigenvector only under assumption (#), but this restriction can be immediately eliminated: given an arbitrary symmetric matrix A′, one can apply the reasoning to the matrix A = A′ − T I, which, for large enough T, for sure satisfies (#), and thus get the existence of an eigenvector of A; of course, it will also be an eigenvector of A′.
The existence of an eigenvector for a symmetric matrix is, of course, a perfectly well known elementary
fact of Linear Algebra; here is a several-line proof:
Let us prove first that an arbitrary matrix A, even with complex entries, possesses a complex
eigenvalue. Indeed, λ is an eigenvalue of A if and only if there exists a nonzero (complex)
vector z such that (A − λI)z = 0, i.e., if and only if the matrix λI − A is singular, or, which is
the same, the determinant of the matrix is zero. On the other hand, the determinant of the
matrix λI − A clearly is a nonconstant polynomial of λ, and such a polynomial, according to
FTA – the Fundamental Theorem of Algebra – has a root; such a root is an eigenvalue of A.
Now we should prove that if A is real and symmetric, then it has a real eigenvalue and a
real eigenvector. This is immediate: we simply shall prove that all eigenvalues of A are real.
Indeed, if λ is an eigenvalue of A (regarded as a complex matrix) and z is the corresponding
(complex) eigenvector, then the expression
\[ \sum_{i,j=1}^{n} A_{ij}\, z_j z_i^* \]
(∗ means complex conjugation) is real (look at its conjugate!); on the other hand, for the eigenvector z we have Σj Aij zj = λzi, so that our expression equals λ Σi zi zi∗ = λ Σi |zi|²; since z ≠ 0, this latter expression can be real if and only if λ is real.
Finally, after we know that an eigenvalue λ of a real symmetric matrix (regarded as a matrix
with complex entries) in fact is real, we can immediately prove that the eigenvector associated
with this eigenvalue also can be chosen to be real: indeed, the real matrix λI − A is singular
and has therefore a nontrivial kernel.
In fact all the results of our analysis of (Q) can be immediately derived from several other basic facts of Linear Algebra (namely, from the possibility to represent a quadratic form xT Ax in the diagonal form Σ_{i=1}^{n} λi ui²(x),
ui (x) being the coordinates of x in a properly chosen orthonormal basis). Thus, in fact in our particular
example the Optimization Theory with its Optimality conditions is, in a sense, redundant. Two things,
however, should be mentioned:
• The Linear Algebra proof of the existence of an eigenvector is based on the FTA which states
existence of a (complex) root of a polynomial. To get the same result on the existence of an
eigenvector, in our proof (and in all the proofs it is based upon) we never used something like FTA!
All we used from Algebra was the elementary theory of systems of linear equations, and we never
thought about complex numbers, roots of polynomials, etc.!
This is an example of what Mathematics is, and it would be a very useful exercise for a mathematician to trace both theories back and see what the "common roots" of these two quite different ways to prove the same fact are4).
4) the only solution to this exercise which comes to my mind is as follows: the simplest proof of the Fundamental Theorem of Algebra is of an optimization nature. This hardly is the complete answer, since the initial proof of FTA given by Gauss is not of optimization nature; the only fact from Analysis used in that proof is that a continuous function on the axis which takes both positive and negative values has a zero (Gauss used this fact for a polynomial of odd degree). This is too distant a relationship, we think.
• It is worth mentioning that the Optimization Theory (which seems to be redundant for establishing the existence of an eigenvector of a symmetric matrix) becomes unavoidable when proving a fundamental infinite-dimensional generalization of this fact: the theorem (Hilbert) that a compact symmetric linear operator in a Hilbert space possesses an eigenvector [and, finally, even an orthonormal basis comprised of eigenvectors]. We are not going to explain what all these words mean; roughly speaking, it is said that an infinite-dimensional symmetric matrix can be diagonalized in a properly chosen orthonormal basis (e.g., an integral operator f(s) ↦ ∫₀¹ K(t, s)f(s)ds with a not that bad (e.g., square summable) symmetric (K(t, s) = K∗(s, t)) kernel K possesses a complete in L2[0, 1] orthonormal system of eigenfunctions; this fact, in particular, explains why atomic spectra are discrete rather than continuous). When proving this extremely important theorem, one cannot use Linear Algebra tools (there are no determinants and polynomials anymore), but one still can use the optimization ones (compactness of the operator implies solvability of the corresponding problem (Q), and the first order necessary optimality condition – which in the case in question says that the solution is an eigenvector of the operator – in contrast to FTA, is "dimension-invariant" and remains valid in the infinite-dimensional case as well).
Lecture 5

Optimization Methods: Introduction

This lecture starts the second part of our course; what we are interested in from now on are numerical methods for nonlinear continuous optimization, i.e., for solving problems of the type
minimize f(x) s.t. gi(x) ≤ 0, i = 1, ..., m; hj(x) = 0, j = 1, ..., k. (5.0.1)
Here x varies over Rn, and the objective f(x), same as the functions gi and hj, are smooth enough (normally we assume them to be at least once continuously differentiable). The constraints
gi(x) ≤ 0, i = 1, ..., m; hj(x) = 0, j = 1, ..., k
are referred to as functional constraints, divided in the evident manner into inequality and equality constraints.
We refer to (5.0.1) as nonlinear optimization problems in order to distinguish between these problems and Linear Programming programs; the latter correspond to the case when all the functions f, gi, hj are linear. And we mention continuous optimization in the description of our subject to distinguish between it and discrete optimization, where we look for a solution from a discrete set, e.g., one comprised of vectors with integer coordinates (Integer Programming), vectors with 0-1 coordinates (Boolean Programming), etc.
Problems (5.0.1) arise in a huge variety of applications, since whenever people make decisions, they try to make them in an "optimal" manner. If the situation is simple enough, so that the candidate decisions can be parameterized by finite-dimensional vectors and the quality of these decisions can be characterized by a finite set of "computable" criteria, the concept of an "optimal" decision typically takes the form of problem (5.0.1). Note that in real-world applications this preliminary phase – modelling the actual decision problem as an optimization problem with computable objective and constraints – is, normally, much more difficult and creative than the subsequent phase when we solve the resulting problem. In our course, anyhow, we do not touch this modelling phase, and focus on techniques for solving optimization programs.
Recall that we developed optimality conditions for problems (5.0.1) in Lecture 4. We remember that one can form a square system of nonlinear equations and a system of inequalities which together define a certain set – the one of Karush-Kuhn-Tucker points of the problem – which, under mild regularity conditions, contains all optimal solutions to the problem. The Karush-Kuhn-Tucker system of equations and inequalities typically has finitely many solutions, and if we were clever enough to find all of them analytically, we could look through them and choose the one with the best value of the objective, thus getting the optimal solution in closed analytical form. The difficulty, however, is that as a rule we are not clever enough to solve the Karush-Kuhn-Tucker system analytically, just as we are unable to find the optimal solution analytically by other means. In all these "difficult" cases – and basically all optimization problems coming from real-world applications are difficult in this sense – all we may hope for is a numerical routine, an algorithm which allows us to approximate numerically the solutions we are interested in. Thus, numerical optimization methods form the main tool for solving real-world optimization problems.
• second-order routines using the values, the gradients and the Hessians (i.e., matrices of second-order
derivatives) of the objective and the constraints.
In principle, of course, we could speak about methods of orders higher than 2; these methods, however, are never used in practice. Indeed, to use a method of order k, you should provide the possibility to compute partial derivatives of the objective and the constraints up to order k. In the multi-dimensional case this is not that easy even for k = 1, and even in the case when your functions are given by explicit analytical expressions (which is not always the case). And there is an "explosion of difficulties" in computing higher order derivatives: for a function of n variables, there are n first order derivatives to be computed, n(n+1)/2 second order derivatives, n(n+1)(n+2)/6 third order derivatives, etc.; as a result, even in the medium-scale case with n being several tens, the difficulties with programming, computation time, and memory required to deal with higher-order derivatives make exploiting these derivatives too expensive. On the other hand, there are no serious theoretical advantages in methods of order higher than 2, so there is no compensation for the effort of computing these derivatives.
another choice of the error function could be the residual in terms of the objective and the constraints, like
resP(x) = max{ f(x) − f∗; [g1(x)]+; ...; [gm(x)]+; |h1(x)|; ...; |hk(x)| },
f∗ being the optimal value in P and [a]+ = max(a, 0) being the positive part of a real a, etc.
For properly chosen error function (e.g., for distP ), convergence of the iterates to the solution
set implies that the scalar sequence
rt = err(xt )
converges to 0, and we measure the quality of convergence by the rate at which the nonneg-
ative reals rt tend to zero.
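In code, the residual resP above is a one-liner; the sketch below assumes the problem data f, gs, hs and the optimal value f_star are supplied by the user:

```python
# Direct transcription of res_P (a sketch; f, gs, hs, f_star assumed given).
def res_P(x, f, gs, hs, f_star):
    terms = [f(x) - f_star]                 # residual in the objective
    terms += [max(g(x), 0.0) for g in gs]   # [g_i(x)]_+ = max(g_i(x), 0)
    terms += [abs(h(x)) for h in hs]        # |h_j(x)|
    return max(terms)

# e.g. res_P(1.0, lambda x: x**2, [lambda x: x - 2], [], 0.0) == 1.0
```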
• [sub- and superlinear convergence] a sequence which converges to 0, but is not linearly converging (e.g., the sequence rt = t^{-1}), is called sublinearly converging. A sequence which linearly converges to zero with any positive ratio (so that the convergence ratio of the sequence is 0) is called superlinearly converging (e.g., the sequence rt = t^{-t}).
A sufficient condition for a sequence {rt > 0} to be superlinearly converging is
lim_{t→∞} r_{t+1}/r_t = 0.
• [convergence of order p] a sequence {rt} converges to 0 with order p > 1 if rt → 0 and, for a certain C < ∞ and all large enough t,
r_{t+1} ≤ C r_t^p.
The upper bound of those p for which the sequence converges to 0 with order p is called the order of convergence of the sequence.
For example, the sequence rt = a^{p^t} (a ∈ (0, 1), p > 1) converges to zero with order p, since r_{t+1}/r_t^p = 1. The sequences converging to 0 with order 2 have a special name – they are called quadratically converging.
Of course, a sequence converging to 0 with order p > 1 is superlinearly converging to 0 (but, generally speaking, not vice versa).
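These definitions are easy to illustrate numerically; the sketch below uses hand-picked model sequences (our own choices) and prints the ratios that identify each convergence type:

```python
# Model sequences for the convergence scale; the ratios r_{t+1}/r_t (and
# r_{t+1}/r_t^2 for the last one) reveal the type of convergence.
def sub(t):  return 1.0 / t          # sublinear: ratio tends to 1
def lin(t):  return 0.5 ** t         # linear, convergence ratio 1/2
def sup(t):  return float(t) ** -t   # superlinear: ratio tends to 0
def quad(t): return 0.5 ** (2 ** t)  # order 2: r_{t+1}/r_t^2 is constant

for t in (2, 3, 4):
    print(sub(t + 1) / sub(t), lin(t + 1) / lin(t),
          sup(t + 1) / sup(t), quad(t + 1) / quad(t) ** 2)
```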
Traditionally, the rate of convergence of iterative numerical routines is measured by the rank of the corresponding sequence of errors {rt = err(xt)} in the above scale; in particular, we may speak about sublinearly, linearly, superlinearly, or quadratically converging methods, same as about methods with order of convergence p > 1. It is common to think that the better the rate of convergence of a method, the more preferable the method itself. Thus, a linearly converging method is thought to be better than a sublinearly converging one; of two linearly converging methods, the more preferable is the one with the smaller convergence ratio; a superlinearly converging method is preferred to a linearly converging one, etc. Of course, all these preferences are "conditional", provided that there are no significant differences in the computational complexity of steps, etc.
We should stress that the rate of convergence, same as the very property of convergence, is an
asymptotic characteristic of the sequence of errors; it does not say when the announced rate of convergence
occurs, i.e., what are the values of C or/and “large enough” values of t mentioned in the corresponding
definitions. For concrete methods, the bounds on the indicated quantities typically can be extracted
from the convergence proofs, but it does not help a lot – the bounds are usually very complicated, rough
and depend on “invisible” quantitative characteristics of the problem like the magnitudes of high-order
derivatives, condition numbers of Hessians, etc. From these observations combined with the fact that our
life is finite it follows that one should not overestimate the importance of the rate-of-convergence ranking
of the methods. This traditional approach gives a kind of orientation, nothing more; unfortunately, there
seems to be no purely theoretical way to get detailed ranking of numerical optimization methods. As a
result, practical recommendations on which method to use are based on different theoretical and empirical
considerations: theoretical rate of convergence, actual behaviour on test problems, numerical stability,
simplicity and robustness, etc.
[Figure: a univariate function with two local minimizers, x′ and x″]
The function has two local minimizers, x′ and x″. Observing a small enough neighbourhood of each of these minimizers, it is impossible to guess that there in fact exists another one. As a result, any "normal" method of nonlinear optimization, as applied to the objective in question and started from a small neighbourhood of the "wrong" (local, not global) minimizer x′, will converge to x′ – the local information on f available to the method does not allow it to guess that x″ exists!
It would be wrong to say that this difficulty is absolutely unavoidable. We could run the method from different starting points, or even look through the values of the objective along a sequence which is dense in R 1) and define xt as the best, in terms of the values of f, of the first t points of the sequence. The latter "method" can be easily extended to general constrained multidimensional problems; one can immediately prove its convergence to the global solution; the method is simple in implementation, etc. There is only one small drawback of the method: the tremendous number of function evaluations required to solve a problem within inaccuracy ε.
It can be easily seen that the outlined "method", as applied to a problem of minimizing an objective f over the unit ball {|x| ≤ 1} in Rn, requires, in the worst case, at least ε^{-n} steps to find a point x_ε with the residual in terms of the objective – the quantity f(x_ε) − min_{|x|≤1} f – not exceeding ε.
1) i.e., a sequence visiting any arbitrarily small neighbourhood of every point of R, as does, e.g., the sequence of all rational numbers (to arrange the rationals into a single sequence, list them according to the sum of the absolute values of the numerator and denominator of the corresponding fractions: first those with this sum equal to 1 (the only such rational is 0 = 0/1), then those with the sum 2 (−1 = −1/1, 1 = 1/1), then those with the sum 3 (−2/1, −1/2, 1/2, 2/1), etc.)
When ε = 0.01 and n = 20 (very modest accuracy and dimension requirements), the number of steps becomes > 10^40, and this is the lower complexity bound!
Moreover, for the family of problems in question the lower bound ε^{-n} on the number of function evaluations required to guarantee residual ≤ ε is valid for an arbitrary optimization method which uses only local information on the objective.
Thus, we can approximate, within any given error ε > 0, the global solution to any optimization problem; but to say that the best we can promise is to do this, for ε = 0.01, n = 20, in 10^40 steps – is worse than to say nothing.
As a consequence of the above considerations (same as of other, more advanced results of Information-Based Complexity Theory of Optimization), we come to the following important, although desperate, conclusion:
It makes no sense to expect an optimization method to be able to approximate, with a reasonable inaccuracy in a reasonable time, the global solution to all optimization problems of a given, even moderate, size.
In fact all we may hope to do in reasonable time is to find tight approximations to certain (not necessarily
corresponding to the optimal solution) Karush-Kuhn-Tucker point of an optimization problem (in the
unconstrained case – to a critical point of the objective). In simple cases we may hope also to approximate
a locally optimal solution, without any guarantees of its global optimality.
There is, anyhow, a "solvable case" where we can approximate, with reasonable complexity, the globally optimal solution to an optimization problem. This is the case when the problem is convex (i.e., the functions f and gi, i = 1, ..., m, are convex, while the hj, if any, are linear). Properties of convex optimization problems and numerical methods for these problems form the subject of Convex Programming. Convex Programming is, in its nature, simpler and, consequently, much more advanced than general Nonlinear Optimization. In particular, in Convex Programming we are able to point out methods with quite reasonable global (not asymptotical!) rate of convergence which are capable of guaranteeing, at a reasonable computational cost, high-accuracy approximations of globally optimal solutions even for general-type convex programs.
I would be happy to restrict the remainder of our course to the nice world of Convex Programming, but we cannot afford it: in actual applications we, unfortunately, too often meet nonconvex problems, and we have no choice but to solve them – even at the cost of weakening the notion of "optimal solution" from the global one to a local one (or even to that of a Karush-Kuhn-Tucker point).
5.2 Line search

We start with the one-dimensional problem
min { f(x) : x ∈ [a, b] } (5.2.2)
of minimizing the objective over a given finite (−∞ < a < b < ∞) segment [a, b] of the real axis. To ensure well-posedness of the problem, we make the following assumption:
f is unimodal on [a, b], i.e., possesses a unique local minimum x∗ on the segment.
This assumption, as is easily seen, implies that f is strictly decreasing on [a, b] to the left of x∗:
x′ < x″ ≤ x∗, x′, x″ ∈ [a, b] ⇒ f(x′) > f(x″), (5.2.3)
and strictly increasing on [a, b] to the right of x∗:
x∗ ≤ x′ < x″, x′, x″ ∈ [a, b] ⇒ f(x′) < f(x″). (5.2.4)
Indeed, if (5.2.3) were false, there would exist x′ and x″ such that
a ≤ x′ < x″ ≤ x∗, f(x′) ≤ f(x″).
It follows that the set of minimizers of f on [a, x″] contains a minimizer, x̄, which differs from x″ 2). Since x̄ is a minimizer of f on [a, x″] and x̄ differs from x″, x̄ is a local minimizer of f on [a, b], while it was assumed that the only local minimizer of f on [a, b] is x∗; since x̄ ≤ x″ ≤ x∗ and x̄ ≠ x″, we have x̄ ≠ x∗, and this gives the desired contradiction. Verification of (5.2.4) is similar.
Note that relations (5.2.3) - (5.2.4), in turn, imply that f is unimodal on [a, b] and even on every smaller segment [a′, b′] ⊂ [a, b].
Given that f is unimodal on [a, b], we immediately can point out a strategy for approximating x∗, namely, as follows. Let us choose somehow two points x^- and x^+ in (a, b),
a < x^- < x^+ < b,
and let us compute the values of f at these points. The basic observation is that
if [case A] f(x^-) ≤ f(x^+), then x∗ is to the left of x^+ [indeed, if x∗ were to the right of x^+, then we would have f(x^-) > f(x^+) by (5.2.3)], and if [case B] f(x^-) ≥ f(x^+), then x∗ is to the right of x^- [by the "symmetric" reasoning based on (5.2.4)].
Consequently, in case A we can replace the initial "uncertainty segment" ∆0 = [a, b] with the new segment ∆1 = [a, x^+], and in case B – with the segment ∆1 = [x^-, b]; in both cases the new "uncertainty segment" ∆1 covers x∗ and is strictly smaller than ∆0. Since, as was already mentioned, the objective, being unimodal on the initial segment ∆0 = [a, b], is unimodal also on the smaller segment ∆1 ⊂ ∆0, we may iterate this procedure – choose two points in ∆1, compute the values of the objective at these points, compare the results and replace ∆1 with a smaller uncertainty segment ∆2, still containing the desired solution x∗, and so on.
Thus, we come to the following
2) indeed: if x″ itself is not a minimizer of f on [a, x″], then any minimizer of f on [a, x″] can be chosen as x̄; if x″ is a minimizer of f on [a, x″], then x′ also is a minimizer, since f(x′) ≤ f(x″), and we can set x̄ = x′
• compute f(x_t^-) and f(x_t^+);
It is immediately seen that we may ensure linear convergence of the lengths of subsequent uncertainty segments to 0, thus coming to a linearly converging algorithm for approximating x∗. For example, if x_t^-, x_t^+ are chosen to split ∆_t into three equal parts, we ensure |∆_{t+1}| = (2/3)|∆_t| (|∆| stands for the length of a segment ∆), which results in a linearly converging algorithm with convergence ratio √(2/3):
|x^k − x∗| ≤ (2/3)^{⌊k/2⌋} |b − a|, (5.2.5)
k being the # of function evaluations performed so far and x^k being an arbitrary point of the uncertainty segment ∆_{⌊k/2⌋} formed after k function evaluations.
Estimate (5.2.5) is fine – we have a non-asymptotical linear convergence rate with problem-independent convergence ratio. Could there be something better?
The answer is "yes". The way to improve the rate of convergence is to note that one of the two search points used to pass from ∆t to ∆t+1 will for sure be inside ∆t+1, and we could try to make it one of the search points used to pass from ∆t+1 to ∆t+2; with this strategy, the cost of updating ∆t into ∆t+1 will be one function evaluation, not two of them (except for the very first updating ∆0 → ∆1, which still costs two function evaluations). There are two ways to implement this smart strategy – the optimal one ("Fibonacci search") and the suboptimal one ("Golden search").
Fibonacci search. Let F0 = F1 = 1, Ft = Ft−1 + Ft−2, t ≥ 2, be the Fibonacci numbers, and let N be the number of function evaluations we are going to perform. Denote d0 = b − a = |∆0|, and let the first two search points x_1^- and x_1^+ be placed at the distance
d1 = (FN−1/FN) d0
from the right and from the left endpoints of ∆0, respectively (since FN/FN−1 = (FN−1 + FN−2)/FN−1 = 1 + FN−2/FN−1 < 2, we have d1 > d0/2, so that x_1^- < x_1^+). The length of the new uncertainty segment ∆1 clearly will be d1.
What we are about to do is to iterate the above step, with N replaced by N − 1. Thus, now we should evaluate f at the two points x_2^-, x_2^+ of the segment ∆1 placed at the distance
d2 = (FN−2/FN−1) d1 [= (FN−2/FN−1)(FN−1/FN) d0 = (FN−2/FN) d0] (5.2.6)
from the right and the left endpoint of ∆1 . The crucial fact (which takes its origin in the
arithmetic properties of the Fibonacci numbers) is that
one of these two required points where f should be computed is already processed – this is
the one of the previous two points which belongs to ∆1 .
Indeed, assume, without loss of generality, that ∆1 = [a, x_1^+] (the case ∆1 = [x_1^-, b] is completely similar), so that the one of the first two search points belonging to ∆1 is x_1^-. We have
x_1^- − a = (b − d1) − a = (b − a) − d1 = d0 − d1 = d0 (1 − FN−1/FN) = [since FN = FN−1 + FN−2 and d2 = (FN−2/FN) d0] = (FN−2/FN) d0 = d2.
Thus, only one of the two required points of ∆1 is new for us, and another comes from the
previous step; consequently, in order to update ∆1 into ∆2 we need one function evaluation,
not two of them. After this new function evaluation, we are able to replace ∆1 with ∆2 . To
process ∆2, we act exactly as above, but with N replaced by N − 2; here we need to evaluate f at the two points of ∆2 placed at the distance
d3 = (FN−3/FN−2) d2 [= (FN−3/FN) d0, see (5.2.6)]
from the endpoints of the segment, and again one of these two points is already processed.
Iterating this procedure, we come to the segment ∆N−1 which covers x∗; the length of this segment is
dN−1 = (F1/FN) d0 = (b − a)/FN,
and the total # of evaluations of f required to get this segment is N (we need 2 evaluations of f to pass from ∆0 to ∆1 and one evaluation per each of the N − 2 subsequent updatings ∆t ↦ ∆t+1, 1 ≤ t ≤ N − 2).
Taking, as an approximation of x∗, any point xN of the segment ∆N−1, we have
|xN − x∗| ≤ |∆N−1| = (b − a)/FN. (5.2.7)
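In code, the bookkeeping just described looks as follows (a sketch of our own transcription; f is assumed unimodal on [a, b] and N ≥ 2):

```python
# Sketch of the N-evaluation Fibonacci search described above.
def fibonacci_search(f, a, b, N):
    F = [1, 1]
    while len(F) <= N:
        F.append(F[-1] + F[-2])            # F_0 = F_1 = 1, F_t = F_{t-1} + F_{t-2}
    d = (F[N - 1] / F[N]) * (b - a)        # d_1: distance from each endpoint
    x_m, x_p = b - d, a + d
    f_m, f_p = f(x_m), f(x_p)              # two evaluations to start
    for k in range(1, N):                  # updatings Delta_{k-1} -> Delta_k
        if f_m <= f_p:                     # case A: x* lies in [a, x_p]
            b, x_p, f_p = x_p, x_m, f_m    # the old x_m is reused as the new x_p
            if k < N - 1:                  # one new evaluation, unless we are done
                x_m = b - (F[N - k - 1] / F[N - k]) * (b - a)
                f_m = f(x_m)
        else:                              # case B: x* lies in [x_m, b]
            a, x_m, f_m = x_m, x_p, f_p    # the old x_p is reused as the new x_m
            if k < N - 1:
                x_p = a + (F[N - k - 1] / F[N - k]) * (b - a)
                f_p = f(x_p)
    return 0.5 * (a + b)   # any point of Delta_{N-1}; its length is (b-a)/F_N
                           # (at the last step the two points coincide, a known quirk)

# e.g. fibonacci_search(lambda x: (x - 1.3) ** 2, 0.0, 2.0, 20) is within 2/F_20 of 1.3
```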
To compare (5.2.7) with the accuracy estimate (5.2.5) of our initial – unsophisticated – method, note that
Ft = (1/(λ + 2)) [(λ + 1)λ^t + (−1)^t λ^{−t}], λ = (1 + √5)/2 > 1.3) (5.2.8)
3) Here is the computation: the Fibonacci numbers satisfy the homogeneous finite-difference equation
xt − xt−1 − xt−2 = 0
and the initial conditions x0 = x1 = 1. To solve a homogeneous finite-difference equation, one should first look for its fundamental solutions – those of the type xt = λ^t. Substituting xt = λ^t into the equation, we get a quadratic equation for λ:
λ² − λ − 1 = 0,
and we come to two fundamental solutions:
xt^{(i)} = λi^t, i = 1, 2, with λ1 = (1 + √5)/2 > 1, λ2 = −1/λ1.
Any linear combination of these fundamental solutions again is a solution to the equation, and to get {Ft}, it remains to choose the coefficients of the combination to fit the initial conditions F0 = F1 = 1. As a result, we come to (5.2.8). A surprise is that the expression for the integer quantities Ft involves an irrational number!
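A quick numerical check of (5.2.8) against the recurrence (our own sketch; the tolerance only absorbs floating-point rounding):

```python
# Verify the closed form (5.2.8) for the Fibonacci numbers.
lam = (1 + 5 ** 0.5) / 2
def fib_closed(t):
    return ((lam + 1) * lam ** t + (-1) ** t * lam ** (-t)) / (lam + 2)

F = [1, 1]
for t in range(2, 20):
    F.append(F[-1] + F[-2])
print(all(abs(fib_closed(t) - F[t]) < 1e-9 for t in range(20)))  # True
```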
Consequently, (5.2.7) implies that
|xN − x∗| ≤ ((λ + 2)/(λ + 1)) λ^{−N} |b − a| (1 + o(1)), (5.2.9)
where o(1) denotes a function of N which converges to 0 as N → ∞.
We see that the convergence ratio for the Fibonacci search is
λ^{−1} = 2/(1 + √5) = 0.61803...,
which is much better than the ratio √(2/3) = 0.81649... given by (5.2.5).
It can be proved that the Fibonacci search is, in a certain precise sense, the optimal zero-order method, in terms of the accuracy guaranteed after N function evaluations, for minimizing a unimodal function on a segment. In spite of this fine theoretical property, the method is not that convenient from the practical viewpoint: we should choose the number of function evaluations to be performed in advance (i.e., tune the method to a certain accuracy chosen in advance), which is sometimes inconvenient. The Golden search method we are about to present is free of this shortcoming and at the same time is, for not too small N, basically as efficient as the original Fibonacci search.
The idea of the Golden search method is very simple: at the k-th step of the N-step Fibonacci search, we choose two search points in the segment ∆k−1, and each of these points divides the segment (from the closer endpoint to the farther one) in the ratio FN−k−1 : FN−k. According to (5.2.8), this ratio, for large N − k, is close to 1 : λ, λ = (1 + √5)/2. In the Golden search we use this ratio at every step, and that is it!
Namely, given the previous uncertainty segment ∆t−1 = [at−1, bt−1], we choose the search points
x_t^- = (λ/(1 + λ)) at−1 + (1/(1 + λ)) bt−1; x_t^+ = (1/(1 + λ)) at−1 + (λ/(1 + λ)) bt−1. (5.2.10)
It is easily seen that for t ≥ 2, one of the search points required to update ∆t−1 into ∆t is already processed in the course of updating ∆t−2 into ∆t−1. To verify this, it suffices to consider the case when ∆t−2 = [α, β] and ∆t−1 = [α, x_{t−1}^+] (the "symmetric" case ∆t−1 = [x_{t−1}^-, β] is completely similar). Denoting d = β − α, we have
x_{t−1}^- = α + (1/(1 + λ)) d, x_{t−1}^+ = α + (λ/(1 + λ)) d; (5.2.11)
now, we are in the situation ∆t−1 = [α, x_{t−1}^+], so that the second of the two search points needed to update ∆t−1 into ∆t is
x_t^+ = α + (λ/(1 + λ))(x_{t−1}^+ − α) = α + (λ²/(1 + λ)²) d
(see the second equality in (5.2.11)). The latter quantity, due to the first equality in (5.2.11) and the characteristic equation λ² = 1 + λ defining λ, is nothing but x_{t−1}^-:
λ² = 1 + λ ⇔ 1/(1 + λ) = λ²/(1 + λ)².
Thus, in the Golden search each updating ∆t−1 ↦ ∆t, except the very first one, requires a single function evaluation, not two of them. The length of the uncertainty segment is reduced by every updating by the factor
λ/(1 + λ) = 1/λ,
i.e.,
|∆t| = λ^{−t}(b − a).
After N ≥ 2 function evaluations (i.e., after t = N − 1 steps of the Golden search) we can approximate x∗ by (any) point xN of the resulting segment ∆N−1, and the inaccuracy bound will be
|xN − x∗| ≤ |∆N−1| = λ^{1−N}(b − a). (5.2.12)
Thus, we have a linear rate of convergence with convergence ratio λ^{−1} = 0.61803..., the same as for the Fibonacci search, but now the method is "stationary" – we can perform as many steps of it as we wish.
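Here is a compact transcription of the method (our own sketch; f is assumed unimodal on [a, b] and n_evals ≥ 2):

```python
# Sketch of the Golden search: after the first updating, every updating reuses
# one of the previous search points and costs a single new evaluation of f.
def golden_search(f, a, b, n_evals):
    lam = (1 + 5 ** 0.5) / 2               # the golden ratio; lam**2 = 1 + lam
    x_m = a + (b - a) / (1 + lam)          # the search points of (5.2.10)
    x_p = a + (b - a) * lam / (1 + lam)
    f_m, f_p = f(x_m), f(x_p)
    evals = 2                              # n_evals >= 2 is assumed
    while True:
        if f_m <= f_p:                     # case A: x* lies in [a, x_p]
            b, x_p, f_p = x_p, x_m, f_m    # old x_m becomes the new x_p
            if evals == n_evals:
                break
            x_m = a + (b - a) / (1 + lam)
            f_m, evals = f(x_m), evals + 1
        else:                              # case B: x* lies in [x_m, b]
            a, x_m, f_m = x_m, x_p, f_p    # old x_p becomes the new x_m
            if evals == n_evals:
                break
            x_p = a + (b - a) * lam / (1 + lam)
            f_p, evals = f(x_p), evals + 1
    return 0.5 * (a + b)   # |b - a| = lam**(1 - n_evals) * original length

# e.g. golden_search(lambda x: (x - 1.3) ** 2, 0.0, 2.0, 30) is within ~2e-6 of 1.3
```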
5.2.2 Bisection
The theoretical advantage of the zero-order methods, like the Fibonacci search and the Golden search,
is that these methods use the minimal information on the objective – its values only. Besides this, the
methods have a very wide field of applications – the only requirement imposed on the objective is to be
unimodal on a given segment which localizes the minimizer to be approximated. And even under these
extremely mild restrictions these methods converge linearly with an objective-independent convergence ratio; moreover, the corresponding efficiency estimates (5.2.9) and (5.2.12) are non-asymptotical: they
do not contain “uncertain” constant factors and are valid for all values of N . At the same time, typically
our objective is better behaved than a general unimodal function, e.g., is smooth enough. Making use of
these additional properties of the objective, we may improve the behaviour of the line search methods.
Let us look at what happens if we are solving problem (5.2.2) with a smooth – continuously differentiable – objective. Same as above, assume that the objective is unimodal on [a, b]. In fact we make a little bit stronger assumption:

(A): the minimizer x∗ of f on [a, b] is an interior point of the segment, and f′(x) changes its sign at x∗:

f′(x) < 0, x ∈ [a, x∗);   f′(x) > 0, x ∈ (x∗, b]

[unimodality + differentiability imply only that f′(x) ≤ 0 on [a, x∗) and f′(x) ≥ 0 on (x∗, b]].
Besides these restrictions on the problem, assume, as is normally the case, that we are able to compute not only the value, but also the derivative of the objective at a given point.
Under these assumptions we can solve (5.2.2) by definitely the simplest possible method – bisection. Namely, let us compute f′ at the midpoint x1 of ∆0 = [a, b]. There are three possible cases:
• f′(x1) > 0. This case, according to (A), is possible if and only if x∗ < x1, and we can replace the initial segment of uncertainty with ∆1 = [a, x1], thus reducing the length of the uncertainty segment by factor 2;
• f′(x1) < 0. Similarly to the previous case, this inequality is possible if and only if x∗ > x1, and we can replace the initial segment of uncertainty with [x1, b], again reducing the length of the uncertainty segment by factor 2;
• f′(x1) = 0. According to (A), this is possible if and only if x1 = x∗, and we can terminate with the exact minimizer at hand.
In the first two cases our objective clearly possesses property (A) with respect to the new segment of
uncertainty, and we can iterate the construction. Thus, we come to
Algorithm 5.2.2 [Bisection]
Initialization: set ∆0 = [a, b], t = 1
Step t: Given the previous uncertainty segment ∆t−1 = [at−1, bt−1], compute f′ at its midpoint xt. If f′(xt) > 0, set ∆t = [at−1, xt]; if f′(xt) < 0, set ∆t = [xt, bt−1]; if f′(xt) = 0, terminate – xt is the minimizer of f on [a, b]. Otherwise replace t with t + 1 and loop.
Remark 5.2.1 The convergence ratio of the Bisection algorithm (0.5) is better than the ratio 0.61803... of the Fibonacci/Golden search. There is no contradiction with the announced optimality of the Fibonacci search: the latter is optimal among the zero-order methods for minimizing unimodal functions, while Bisection is a first-order method.
Remark 5.2.2 The Bisection method can be viewed as the "limiting case" of the conceptual zero-order Algorithm 5.2.1: when, in the latter algorithm, we make both the search points x⁻t and x⁺t close to the midpoint of the uncertainty segment ∆t−1, the result of the comparison between f(x⁻t) and f(x⁺t) which governs the choice of the new uncertainty segment in Algorithm 5.2.1 is given by the sign of f′ at the midpoint of ∆t−1.
Remark 5.2.3 Note that assumption (A) can be weakened. Namely, let us assume that f′ changes its sign on the segment [a, b]: f′(a) < 0, f′(b) > 0; and we assume nothing about the derivative in (a, b), except its continuity. In this case we still can successfully use the Bisection method to approximate a critical point of f in (a, b), i.e., a point where f′(x) = 0. Indeed, from the description of the method it is immediately seen that what we do is generate a sequence of nested segments ∆0 ⊃ ∆1 ⊃ ∆2 ⊃ ..., each twice smaller than the previous one, with the property that f′ changes its sign from − to + when passing from the left endpoint of every segment ∆t to its right endpoint. This process can terminate only in the case when the current iterate xt is a critical point of f. If this does not happen, then the nested segments ∆t have a unique common point x∗, and since in any neighbourhood of this point there are points both with positive and with negative values of f′, we have f′(x∗) = 0 (recall that f′ is continuous). This is the critical point of f to which the algorithm converges linearly with convergence ratio 0.5.
The indicated remark explains the nature of the bisection algorithm. This is an algorithm for finding a zero of the function f′ rather than for minimizing f itself (under assumption (A), of course, this is the same – to minimize f on [a, b] or to find the zero of f′ on (a, b)). And the idea of the algorithm is absolutely trivial: given that a zero of f′ is bracketed by the initial uncertainty segment ∆0 = [a, b] (i.e., that the values of f′ at the endpoints of the segment are of different signs), we generate a sequence of nested segments, also bracketing a zero of f′, as follows: we split the previous segment ∆t−1 = [at−1, bt−1] by its midpoint xt into two subsegments [at−1, xt] and [xt, bt−1]. Since f′ changes its sign when passing from at−1 to bt−1, it changes its sign either when passing from at−1 to xt, or when passing from xt to bt−1 (provided that f′(xt) ≠ 0, so that we can speak about the sign of f′(xt); if f′(xt) = 0, we are already done). We detect on which of the two subsegments f′ in fact changes sign and take it as the new uncertainty segment ∆t; by construction, it also brackets a zero of f′.
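This zero-bracketing description translates into code directly; the following Python sketch (the function name and the tolerance-based termination are ours) implements Algorithm 5.2.2 in the weakened setting of Remark 5.2.3:

def bisection(f_prime, a, b, tol=1e-10):
    # Bisection on g = f': assumes f'(a) < 0 < f'(b) and keeps a bracket
    # on which f' changes sign from - to +.
    assert f_prime(a) < 0 < f_prime(b), "a zero of f' must be bracketed"
    while b - a > tol:
        x = 0.5 * (a + b)
        d = f_prime(x)
        if d > 0:
            b = x            # the critical point is to the left of x
        elif d < 0:
            a = x            # the critical point is to the right of x
        else:
            return x         # exact critical point found
    return 0.5 * (a + b)

# minimize f(x) = x**4/4 - x on [0, 2]: f'(x) = x**3 - 1 vanishes at x* = 1
print(bisection(lambda x: x ** 3 - 1, 0.0, 2.0))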
Now let

ρ = min{r, k2/k1}.   (5.2.14)
Assume that for certain t the iterate xt−1 belongs to the ρ-neighbourhood

Uρ = [x∗ − ρ, x∗ + ρ]

of the point x∗. Then g′(xt−1) ≥ k2 > 0 (due to (5.2.13); note that ρ ≤ r), so that the Newton iterate xt of xt−1 is well-defined. We have

xt − x∗ = xt−1 − x∗ − g(xt−1)/g′(xt−1) =

[since g(x∗) = 0]

= xt−1 − x∗ − (g(xt−1) − g(x∗))/g′(xt−1) = (g(x∗) − g(xt−1) − g′(xt−1)(x∗ − xt−1))/g′(xt−1).

The numerator in the resulting fraction is the remainder in the first order Taylor expansion of g at xt−1; due to (5.2.13) and since |xt−1 − x∗| ≤ ρ ≤ r, it does not exceed in absolute value the quantity (1/2)k1|x∗ − xt−1|². The denominator, by the same (5.2.13), is at least k2. Thus,

xt−1 ∈ Uρ ⇒ |xt − x∗| ≤ (k1/(2k2)) |xt−1 − x∗|².   (5.2.15)
Due to the origin of ρ, (5.2.15) implies that

xt−1 ∈ Uρ ⇒ |xt − x∗| ≤ 0.5|xt−1 − x∗|;

we see that the trajectory of the Newton method, once reaching Uρ, never leaves this neighbourhood and converges to x∗ linearly with convergence ratio 0.5. This for sure is the case when x0 ∈ Uρ, so let us specify the "close enough" in the statement of the proposition just as the inclusion x0 ∈ Uρ. With this specification, we get that the trajectory converges to x∗ linearly, and from (5.2.15) it follows that in fact the order of convergence is (at least) 2.
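In code, the routine just analyzed is a one-liner; the following Python sketch (the names and the test problem are ours) runs the Newton recurrence on g = f′ for f(x) = x − ln x, whose minimizer x∗ = 1 is nondegenerate:

def newton_1d(g, g_prime, x0, n_steps=8):
    # Newton's method for g(x) = 0 with g = f': x_t = x_{t-1} - g/g';
    # exhibits the quadratic convergence established above when started
    # close enough to a nondegenerate minimizer.
    x = x0
    for _ in range(n_steps):
        x = x - g(x) / g_prime(x)
    return x

# f(x) = x - ln(x):  g(x) = 1 - 1/x,  g'(x) = 1/x**2
print(newton_1d(lambda x: 1 - 1 / x, lambda x: 1 / x ** 2, 0.5))   # ~1.0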
Remark 5.2.4 Both the assumption that f″(x∗) > 0 and the assumption that x0 is close enough are essential 4). For example, as applied to the smooth convex function

f(x) = √(1 + x²)

4) In fact, the assumption f″(x∗) > 0 can be replaced with f″(x∗) < 0, since the trajectory of the method remains unchanged when f is replaced with −f (in other words, the Newton method does not distinguish between the local minima and local maxima of the objective). We speak about the case f″(x∗) > 0, not the one of f″(x∗) < 0, simply because the former is the only one important for minimization.
with unique (and nondegenerate) local (and global as well) minimizer x∗ = 0, the method becomes, as is immediately seen,

xt = −x³t−1;

this procedure converges (very fast: with order 3) to 0 provided that the starting point is in (−1, 1), and diverges to infinity – also very fast – if |x0| > 1.
In fact the Newton method is a linearization method for finding a zero of f′: given the previous iterate xt−1, we linearize g = f′ at this point and take as xt the solution to the linearized equation

g(xt−1) + g′(xt−1)(x − xt−1) = 0.

[figure: Newton method as a zero-finding routine for f′(x): the tangent to the graph of f′ at xt−1 crosses the x-axis at xt]
In the Regula Falsi (secant) method, the derivative g′(xt−1) = f″(xt−1) required by the Newton step is replaced with the divided difference

(f′(xt−1) − f′(xt−2))/(xt−1 − xt−2),

and we use this approximation to approximate f by a quadratic function

p(x) = f(xt−1) + f′(xt−1)(x − xt−1) + (1/2) · ((f′(xt−1) − f′(xt−2))/(xt−1 − xt−2)) (x − xt−1)².
The new iterate is the minimizer of this quadratic function:

xt = xt−1 − f′(xt−1) (xt−1 − xt−2)/(f′(xt−1) − f′(xt−2)).   (5.2.16)
Note that although the polynomial p is chosen in a way asymmetric with respect to xt−1 and xt−2 (it is tangent to f at xt−1, but need not even coincide with f at xt−2), the minimizer xt of this polynomial is symmetric with respect to the pair of working points; as is immediately seen, the right hand side of (5.2.16) can be equivalently rewritten as

xt = xt−2 − f′(xt−2) (xt−1 − xt−2)/(f′(xt−1) − f′(xt−2)).
The geometry of the method is very simple: same as the Newton method, this is a method which actually approximates a zero of g(x) = f′(x) (look: the values of f are not involved in the recurrence (5.2.16)). In the Newton method, given the value and the derivative of g at xt−1, we approximate the graph of g by its tangent line at xt−1, g(xt−1) + g′(xt−1)(x − xt−1), and choose xt as the point where this tangent line crosses the x-axis. In the Regula Falsi method, given the values of g at two points xt−1 and xt−2, we approximate the graph of g by the secant line passing through (xt−1, g(xt−1)) and (xt−2, g(xt−2)) and choose as xt the point where this secant line crosses the x-axis.
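A minimal Python sketch of the recurrence (5.2.16) (the names and the test problem are ours; a production implementation would safeguard against f′(xt−1) = f′(xt−2)):

def regula_falsi(f_prime, x_prev, x_cur, n_steps=12):
    # Secant (Regula Falsi) recurrence (5.2.16): the Newton step with g'
    # replaced by the divided difference of g = f' at the two last iterates.
    g_prev, g_cur = f_prime(x_prev), f_prime(x_cur)
    for _ in range(n_steps):
        x_next = x_cur - g_cur * (x_cur - x_prev) / (g_cur - g_prev)
        x_prev, g_prev = x_cur, g_cur
        x_cur, g_cur = x_next, f_prime(x_next)
    return x_cur

# same test problem as before: f(x) = x - ln(x), f'(x) = 1 - 1/x, x* = 1
print(regula_falsi(lambda x: 1 - 1 / x, 0.5, 2.0))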
[figure: Regula Falsi method as a zero-finding routine: the secant to the graph of f′(x) through the points at xt−2 and xt−1 crosses the x-axis at xt]
[figure: cubic fit: iterates x0, x1, x2, x3 against the graph of f′(x)]

Cubic fit. Given two working points x′ ≠ x″ along with the values f(x′), f′(x′), f(x″), f′(x″) of the objective and its derivative at these points, one chooses as p the cubic polynomial matching these four values
(these are four linear equations on the four coefficients of the polynomial; if x′ ≠ x″, which, as we shall see, always can be assumed, the equations uniquely define p). As the next iterate, one chooses the local minimizer of p. If it exists, this minimizer is given by the relations

xt = x′ − (x′ − x″) (u1 + u2 − f′(x′))/(f′(x″) − f′(x′) + 2u2),

u1 = f′(x′) + f′(x″) − 3 (f(x′) − f(x″))/(x′ − x″),   u2 = √(u1² − f′(x′)f′(x″)).   (5.2.17)
The step is for sure well-defined if f′(x′) and f′(x″) are of opposite signs ("V-shape"; compare with Bisection). One can prove that if x∗ is a nondegenerate local minimizer of a smooth enough function f, then the method, started close enough to x∗, converges to x∗ quadratically.
In practical line search routines one combines the advantages of the reliable bracketing search (global linear convergence with an objective-independent rate in the unimodal case) and those of curve fitting (superlinear local convergence for well-behaved objectives).
x ↦ x + γ∗ d. Since f is continuously differentiable, the function

φ(γ) = f(x + γd)

of one variable also is once continuously differentiable; moreover, due to (5.2.18), we have φ′(0) < 0.
Our desire is to choose a "reasonably large" stepsize γ∗ > 0 which results in a progress φ(γ∗) − φ(0) in the objective "of order of γ∗φ′(0)". The Armijo test for this requirement is as follows:

Armijo's Test:
we fix once for ever constants ε ∈ (0, 1) (a popular choice is ε = 0.2) and η > 1 (say, η = 2 or η = 10) and say that a candidate value γ > 0 is appropriate, if the following two relations are satisfied:

φ(γ) ≤ φ(0) + εγφ′(0)   (5.2.19)

[this part of the test says that the progress in the value of φ given by the stepsize γ is "of order of γφ′(0)"]

φ(ηγ) > φ(0) + εηγφ′(0)

[this part of the test says that γ is a "maximal in order" stepsize satisfying (5.2.19) – if we multiply γ by η, the increased value fails to satisfy (5.2.19)]
Under assumption (5.2.18) and the additional (very natural) assumption that f (and, consequently, φ) is below bounded, the Armijo test is consistent: there do exist values of γ > 0 which pass the test. To see it, it suffices to notice that

A. (5.2.19) is valid for all small enough positive γ.

Indeed, we have

0 > φ′(0) = lim_{γ→+0} (φ(γ) − φ(0))/γ,

whence

εφ′(0) ≥ (φ(γ) − φ(0))/γ

for all small enough positive γ (since εφ′(0) > φ′(0) due to φ′(0) < 0, ε ∈ (0, 1)); the resulting inequality is equivalent to (5.2.19);
B. (5.2.19) is not valid for all large enough values of γ.
Indeed, the right hand side of (5.2.19) tends to −∞ as γ → ∞, due to φ′(0) < 0, while the left hand side is assumed to be below bounded.
Now let us choose an arbitrary positive γ = γ0 and test whether it satisfies (5.2.19). If it is the case, let us replace this value subsequently by γ1 = ηγ0, γ2 = ηγ1, etc., each time verifying whether the new value of γ satisfies (5.2.19). According to B, this cannot last forever: for certain s ≥ 1, γs for sure fails to satisfy (5.2.19). When it happens for the first time, the quantity γs−1 turns out to satisfy (5.2.19), while the quantity γs = ηγs−1 fails to satisfy (5.2.19), which means that γ = γs−1 passes the Armijo test.
If the initial γ0 does not satisfy (5.2.19), we replace this value subsequently by γ1 = η^{−1}γ0, γ2 = η^{−1}γ1, etc., each time verifying whether the new value of γ still does not satisfy (5.2.19). According to A, this cannot last forever: for certain s ≥ 1, γs for sure satisfies (5.2.19). When it happens for the first time, γs turns out to satisfy (5.2.19), while γs−1 = ηγs fails to satisfy (5.2.19), and γ = γs passes the Armijo test.
Note that the presented proof in fact gives an explicit (and fast) algorithm for finding a stepsize
passing the Armijo test, and this algorithm can be used (and often is used) in Armijo-aimed line search
instead of more accurate (and normally more time-consuming) line search methods from the previous
sections.
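Here is a Python sketch of that algorithm (the function name, the default parameters and the test function are ours; we assume φ′(0) < 0 and φ below bounded, as in the consistency argument above):

def armijo_stepsize(phi, dphi0, gamma=1.0, eps=0.2, eta=2.0):
    # Doubling/halving search for a stepsize passing the Armijo test:
    # the returned gamma satisfies (5.2.19), while eta*gamma fails it.
    phi0 = phi(0.0)
    ok = lambda g: phi(g) <= phi0 + eps * g * dphi0    # test (5.2.19)
    if ok(gamma):
        while ok(eta * gamma):      # grow while the larger step still passes
            gamma *= eta
    else:
        while not ok(gamma):        # shrink until (5.2.19) holds
            gamma /= eta
    return gamma

phi = lambda g: (g - 1.0) ** 2      # phi'(0) = -2 < 0
print(armijo_stepsize(phi, -2.0))   # here any gamma in (0.8, 1.6] passes; prints 1.0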
Goldstein test:
we fix once for ever a constant ε ∈ (0, 1/2) and say that a candidate value γ > 0 is appropriate, if

φ(0) + (1 − ε)γφ′(0) ≤ φ(γ) ≤ φ(0) + εγφ′(0).

Here again relation (5.2.18) and the below boundedness of f imply consistency of the test.
Lecture 6

Gradient Descent and Newton Methods
Starting from this lecture, we shall speak about methods for solving unconstrained multidimensional
problems
f (x) → min | x ∈ Rn . (6.0.1)
From now on, let us make the following assumptions:
• (A) the objective f in (6.0.1) is continuously differentiable;
• (B) the problem in question is solvable: the set

X∗ = Argmin_{x∈Rn} f(x)

is nonempty.
(d/dγ)|γ=0 f(x − γ∇f(x)) = −|∇f(x)|² < 0;

moreover, the antigradient g = −∇f(x) is the best among the descent directions h (normalized to have the same length as g) of f at x: for any h with |h| = |g|, one has

(d/dγ)|γ=0 f(x + γh) = hᵀ∇f(x) ≥ −|h||∇f(x)| = −|∇f(x)|²

(we have used the Cauchy inequality), the inequality being an equality if and only if h = g.
The indicated observation demonstrates that in order to improve x – to form a new point with a smaller value of the objective – it makes sense to perform a step

x ↦ x + γg ≡ x − γ∇f(x)
from x in the antigradient direction; with a properly chosen stepsize γ > 0, such a step will for sure decrease f. And in the Gradient Descent method, we simply iterate the above step. Thus, the generic scheme of the method is as follows:

xt = xt−1 − γt∇f(xt−1),   γt > 0.   (6.1.1)

Thus, the generic Gradient Descent method is the recurrence (6.1.1) with a certain rule for choosing the stepsizes γt > 0; normally, the stepsizes are given by a kind of line search applied to the univariate functions

φt(γ) = f(xt−1 − γ∇f(xt−1)).
• ArD [Gradient Descent with Armijo-terminated line search]: the stepsize γt > 0 at iteration t where ∇f(xt−1) ≠ 0 is chosen according to the Armijo test (Section 5.2.4.1):

f(xt−1 − γt∇f(xt−1)) ≤ f(xt−1) − εγt|∇f(xt−1)|²,
f(xt−1 − ηγt∇f(xt−1)) > f(xt−1) − εηγt|∇f(xt−1)|²,   (6.1.2)

ε ∈ (0, 1) and η > 1 being the parameters of the method. And if xt−1 is a critical point of f, i.e., ∇f(xt−1) = 0, the choice of γt > 0 is absolutely unimportant: independently of the value of γt, (6.1.1) will result in xt = xt−1.
• StD [Steepest Descent]: γt minimizes f along the ray {xt−1 − γ∇f(xt−1) | γ ≥ 0}:

γt ∈ Argmin_{γ≥0} f(xt−1 − γ∇f(xt−1)).

Of course, the Steepest Descent is a kind of idealization: in nontrivial cases we are unable to minimize the objective along the search ray exactly. Moreover, to make this idealization valid, we should assume that the corresponding steps are well-defined, i.e., that

Argmin_{γ≥0} f(x − γ∇f(x)) ≠ ∅

for every x; in what follows, this is assumed "by default" whenever we are speaking about the Steepest Descent.
In contrast to the Steepest Descent, the Gradient Descent with Armijo-terminated line search is quite
“constructive” – we know from Section 5.2.4.1 how to find a stepsize passing the Armijo test.
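Combining the recurrence (6.1.1) with the Armijo-terminated search gives the following Python sketch of ArD (the routine armijo_stepsize repeats the sketch from the line search chapter; the stopping test and the sample quadratic are ours):

import numpy as np

def armijo_stepsize(phi, dphi0, gamma=1.0, eps=0.2, eta=2.0):
    # doubling/halving search for a stepsize passing the Armijo test
    phi0 = phi(0.0)
    ok = lambda g: phi(g) <= phi0 + eps * g * dphi0
    if ok(gamma):
        while ok(eta * gamma):
            gamma *= eta
    else:
        while not ok(gamma):
            gamma /= eta
    return gamma

def gradient_descent_armijo(f, grad, x0, n_iters=200, eps=0.2, eta=2.0):
    # ArD sketch: iterate (6.1.1), choosing gamma_t by the Armijo test
    # applied to phi(gamma) = f(x - gamma * grad f(x)).
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = grad(x)
        gg = float(g @ g)
        if gg < 1e-16:                       # (almost) critical point
            break
        phi = lambda gamma: f(x - gamma * g)
        x = x - armijo_stepsize(phi, -gg, eps=eps, eta=eta) * g
    return x

# ill-conditioned quadratic, condition number Q = 50
f = lambda x: 0.5 * (x[0] ** 2 + 50.0 * x[1] ** 2)
grad = lambda x: np.array([x[0], 50.0 * x[1]])
print(gradient_descent_armijo(f, grad, [10.0, 1.0]))   # ~ (0, 0), slowly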
In what follows, an important role is played by the set of critical points of the objective,

X∗∗ = {x ∈ Rn | ∇f(x) = 0}.
Theorem 6.1.1 [Global convergence of Gradient Descent] For both StD and ArD , the following
statements are true:
(i) If the trajectory {xt } of the method is bounded, then the trajectory possesses limiting points, and
all these limiting points are critical points of f ;
(ii) If the level set
S = {x ∈ Rn | f (x) ≤ f (x0 )}
of the objective is bounded, then the trajectory of the method is bounded (and, consequently, all its limiting
points, by (i), belong to X ∗∗ ).
Proof. (ii) is an immediate consequence of (i), since both ArD and StD clearly are descent methods:

f(xt) ≤ f(xt−1) for all t.

Therefore the trajectory, for each of the methods, is contained in the level set S; since under the assumption of (ii) this set is bounded, the trajectory also is bounded, as claimed in (ii).
It remains to prove (i). Thus, let the trajectory {xt} be bounded, and let x∗ be a limiting point of the trajectory; we should prove that ∇f(x∗) = 0. Assume, on the contrary, that this is not the case, and let us lead this assumption to a contradiction. The idea of what follows is very simple: since ∇f(x∗) ≠ 0, a step of the method taken from x∗ reduces f by a certain positive amount δ; this is absolutely clear from the construction of the step. What is very likely (it should, of course, be proved, and we shall do it in a while) is that there exists a small neighbourhood U of x∗ such that a step of the method taken from an arbitrary point x ∈ U also improves the objective at least by a fixed positive quantity δ′. It is absolutely unimportant for us what this δ′ is; all we need is to know that this quantity is positive and is independent of the particular choice of x ∈ U. Assume that we have already proved that the required U and δ′ exist. With this assumption, we get the desired contradiction immediately: since x∗ is a limiting point of the trajectory, the trajectory visits U infinitely many times. Each time it visits U, the corresponding step decreases f at least by δ′ > 0, and no step of the method increases the objective. Thus, in the course of running the method we infinitely many times decrease the objective by δ′ and never increase it, so that the objective must diverge to −∞ along our trajectory; the latter is impossible, since the objective is below bounded.
Now it is time to prove our key argument – the one on the existence of the above U and δ′. Let us stress why there is something to be proved, in spite of the already known to us descentness of the method – the fact that the objective is improved by every step taken from a non-critical point of f (and all points close enough to the non-critical x∗ also are non-critical, since ∇f is continuous). The difficulty is that the progress in f in the course of a step depends on the point from which the step is taken; in principle it might happen that a step from every point of a neighbourhood of x∗ improves the objective, but there is no point-independent positive lower bound δ′ for the improvements. And in the above reasoning we indeed require "point-independent" progress – otherwise it might happen that subsequent visits of U by the trajectory result in smaller and smaller improvements in f, with the sum of these improvements finite; this possibility would kill the above reasoning completely.
In fact, of course, the required U, δ′ exist. It suffices to prove this statement for ArD only – it is absolutely clear that the progress in the objective in the course of a step of StD is at least the one for a step of ArD, both steps being taken from the same point. The proof for the case of ArD looks as follows:
Since f is continuously differentiable and ∇f(x∗) ≠ 0, there exist positive r, P and p such that

|x − x∗| < r ⇒ p ≤ |∇f(x)| ≤ P;

for the same reasons, there exists r′ ∈ (0, r) such that in the r′-neighbourhood V of x∗ one has

|∇f(x′) − ∇f(x″)| ≤ ζ ≡ (1 − ε)P^{−1}p²,   x′, x″ ∈ V.
Let U be the r′/2-neighbourhood of x∗. We claim that

(*) whenever x ∈ U, the stepsize sx given by the Armijo line search as applied to the function φ(s) = f(x − s∇f(x)) is at least

s∗ = (1/2) r′η^{−1}P^{−1}.
Note that (*) is all we need. Indeed, the progress in the objective given by the Armijo line search as applied to a function φ and resulting in a stepsize s is at least εs|φ′(0)| (see (5.2.19)). Applying this observation to a step of ArD taken from a point x ∈ U and using (*), we come to the conclusion that the progress in the objective at the step is at least εs∗|∇f(x)|² ≥ εs∗p², and the latter quantity (which is positive and is independent of x ∈ U) can be taken as the desired δ′.
It remains to prove (*), which is immediate: assuming that x ∈ U, sx < s∗, and taking into account the construction of the Armijo test (the second relation in (6.1.2)), we would get

φ(ηsx) − φ(0) > εηsxφ′(0).   (6.1.5)

Now, since sx < s∗, the segment [x, x − ηsx∇f(x)] is of length at most ηs∗P ≤ r′/2, and since one endpoint of the segment belongs to U, the segment itself belongs to V. Consequently, the gradient of f varies along the segment by at most ζ, so that the derivative of φ varies on the segment [0, ηsx] by at most

ζ|∇f(x)| ≤ ζP = (1 − ε)p².

On the other hand, from the Lagrange Mean Value Theorem it follows that

φ(ηsx) − φ(0) = ηsxφ′(ξ) ≤ ηsx[φ′(0) + (1 − ε)p²];

here ξ is some point of the segment [0, ηsx]. Combining this inequality with (6.1.5), we come to

ηsx(1 − ε)p² > −(1 − ε)ηsxφ′(0) ≡ (1 − ε)ηsx|∇f(x)|² ≥ (1 − ε)ηsxp²,

which is a contradiction.
Please pay attention to the above proof: its structure is typical for convergence proofs in traditional Optimization and looks as follows. We know in advance that the iterative process in question possesses a certain Lyapunov function L – one which decreases along the trajectory of the process and is below bounded (in the above proof this function is f itself); we also either assume that the trajectory is bounded, or assume boundedness of the level set of the Lyapunov function associated with the value of the function at the initial point of the trajectory (then, of course, the trajectory for sure is bounded – since the Lyapunov function never increases along the trajectory, the latter is unable to leave the aforementioned level set). Now assume that the three entities – (1) the Lyapunov function, (2) our iterative process, and (3) the set X∗ we agree to treat as the solution set of our problem – are linked by the following relation:
(**) if a point on the trajectory does not belong to X ∗ , then the step of the process from
this point strictly decreases the Lyapunov function
Normally (**) is evident from the construction of the process and of the Lyapunov function; e.g., in the above proof, where L is the objective, the process is ArD or StD and X∗ is the set of critical points of the objective, one should not work hard in order to prove that a step from a non-critical point somehow decreases the objective. Now, given all this, we want to prove that the trajectory of the process converges to X∗; what is the main point of the proof? Of course, an analogue of (*), i.e., a "locally uniform" version of (**): we should prove that a point not belonging to X∗ possesses a neighbourhood such that whenever the trajectory visits this neighbourhood, the progress in the Lyapunov function at the corresponding step is bounded away from zero. After we have proved this crucial fact, we can immediately apply the scheme of the above proof to demonstrate that the trajectory indeed converges to X∗.
We have had a good reason to invest that much effort in explaining the "driving forces" of the above convergence proof: from now on, we shall skip similar proofs, since we believe that the reader understands the general principle, and the technicalities are of minor interest here. We hope that it is now clear why in the Armijo test we require the stepsize to be the largest one (up to the factor η) resulting in "significant" progress in the objective. If we skipped this "maximality" requirement, we would allow arbitrarily small stepsizes even from points which are far from the solution set; as a result, (*) would not be the case anymore, and we would become unable to ensure convergence of the process (and this property indeed may be lost).
For StD one has

min_{0≤t<N} |∇f(xt)|² ≤ (2Lf/N) [f(x0) − min f],   (6.1.7)

while for ArD

min_{0≤t<N} |∇f(xt)|² ≤ (ηLf/(2ε(1 − ε)N)) [f(x0) − min f],   (6.1.8)

ε ∈ (0, 1), η > 1 being the parameters of the underlying Armijo test.
Proof.
1°. Let us start with the following simple
Lemma 6.1.1 Let f be C1,1, i.e., let (6.1.6) hold: |∇f(x) − ∇f(y)| ≤ Lf|x − y| for all x, y ∈ Rn. Then, for all x, y ∈ Rn,

f(y) ≤ f(x) + (y − x)ᵀ∇f(x) + (Lf/2)|y − x|².   (6.1.9)
Proof of Lemma. Let φ(γ) = f(x + γ(y − x)). Note that φ is continuously differentiable (since f is) and

|φ′(α) − φ′(β)| = |(y − x)ᵀ(∇f(x + α(y − x)) − ∇f(x + β(y − x)))| ≤
[Cauchy's inequality]
≤ |y − x| |∇f(x + α(y − x)) − ∇f(x + β(y − x))| ≤
[(6.1.6)]
≤ |y − x|² Lf |α − β|.

Thus,

|φ′(α) − φ′(β)| ≤ Lf|y − x|²|α − β|,   ∀α, β ∈ R.   (6.1.10)

We have

f(y) − f(x) − (y − x)ᵀ∇f(x) = φ(1) − φ(0) − φ′(0) = ∫₀¹ φ′(α)dα − φ′(0) = ∫₀¹ [φ′(α) − φ′(0)]dα ≤
[see (6.1.10)]
≤ |y − x|² Lf ∫₀¹ α dα = (Lf/2)|y − x|²,

as required in (6.1.9).
2°. Now we are ready to prove (i). By construction of the Steepest Descent and by Lemma 6.1.1, at every step

f(xt−1) − f(xt) ≥ max_{γ≥0} [γ|∇f(xt−1)|² − (Lf/2)γ²|∇f(xt−1)|²] = (1/(2Lf))|∇f(xt−1)|²;

summing up these inequalities over t = 1, ..., N, we get

Σ_{t=0}^{N−1} (1/(2Lf))|∇f(xt)|² ≤ f(x0) − f(xN) ≤ f(x0) − min f.

The left hand side here is ≥ (N/(2Lf)) min_{0≤t<N} |∇f(xt)|², and (6.1.7) follows.
3°. The proof of (ii) is a little bit more involved, but follows the same basic idea: the progress at a step of ArD can be small only if the gradient at the previous iterate is small, and the progress at some step from a long segment of steps must be small, since the total progress cannot be larger than the initial residual. Thus, in a long segment of steps we must pass through a point with a small norm of the gradient.
The quantitative reasoning is as follows. First of all, the progress in the objective at a step t of ArD is not too small, provided that both γt and |∇f(xt−1)|² are not too small:

f(xt−1) − f(xt) ≥ εγt|∇f(xt−1)|²;   (6.1.11)

this is an immediate consequence of the first inequality in (6.1.2). Second, γt is not too small.
Indeed, by Lemma 6.1.1 applied with x = xt−1, y = xt−1 − ηγt∇f(xt−1) we have

f(xt−1 − ηγt∇f(xt−1)) ≤ f(xt−1) − ηγt|∇f(xt−1)|² + (Lf/2)η²γt²|∇f(xt−1)|²,   (6.1.12)

while by the second inequality in (6.1.2)

f(xt−1 − ηγt∇f(xt−1)) > f(xt−1) − εηγt|∇f(xt−1)|².

Combining these two relations, we come to

γt ≥ 2(1 − ε)/(ηLf);   (6.1.13)

combining (6.1.13) with (6.1.11), we come to the following inequality (compare with (6.1.11)):

f(xt−1) − f(xt) ≥ (2ε(1 − ε)/(ηLf)) |∇f(xt−1)|².   (6.1.14)
Now the proof can be completed exactly as in the case of the Steepest Descent.
Remark 6.1.1 The efficiency estimate of Proposition 6.1.1 gives a non-asymptotical upper bound, sublinearly converging to 0, on the squared gradient norms along the iterates. Note, anyhow, that this is a bound for the best (with the smallest norm of the gradient) of the iterates generated in the course of the first N steps of the method, not for the last iterate xN (the quantities |∇f(xt)|² may oscillate, in contrast to the values f(xt) of the objective).
Proof.
1°. Let x∗ be a point from X∗, and let us look at how the squared distances

d²t = |xt − x∗|²

vary along the trajectory. Since xt = xt−1 − γt∇f(xt−1), we have

d²t = d²t−1 − 2γt(xt−1 − x∗)ᵀ∇f(xt−1) + γt²|∇f(xt−1)|²;

by convexity of f, (xt−1 − x∗)ᵀ∇f(xt−1) ≥ f(xt−1) − f(x∗) ≡ εt−1, so that

d²t ≤ d²t−1 − 2γtεt−1 + γt²|∇f(xt−1)|².   (6.1.18)

From the first inequality in (6.1.2) it follows that

γt|∇f(xt−1)|² ≤ (1/ε)[f(xt−1) − f(xt)] = (1/ε)[εt−1 − εt].

Combining this inequality with (6.1.18), we get

d²t ≤ d²t−1 − γt[(2 − ε^{−1})εt−1 + ε^{−1}εt].
Since, by assumption, 1/2 ≤ ε, and clearly εs ≥ 0, the quantity in the parentheses in the right hand side is nonnegative. We know also from (6.1.13) that

γt ≥ γ̄ = 2(1 − ε)/(ηLf),

whence

d²t ≤ d²t−1 − γ̄[(2 − ε^{−1})εt−1 + ε^{−1}εt],   (6.1.19)

or, which is the same,

γ̄[(2 − ε^{−1})εt−1 + ε^{−1}εt] ≤ d²t−1 − d²t.   (6.1.20)

In particular,
(*) for every x∗ ∈ X∗, the distances |xt − x∗| do not increase with t.
From (*) it immediately follows that {xt } converges to certain point x̄∗ ∈ X ∗ , as claimed
in (i). Indeed, by Theorem 6.1.1 the trajectory, being bounded, has all its limiting points
in the set X ∗∗ of critical points of f , or, which is the same (f is convex!), in the set X ∗ of
global minimizers of f . Let x̄∗ be one of these limiting points, and let us prove that in fact
{xt } converges to x̄∗ . To this end note that the sequence |xt − x̄∗ |, which, as we know from
(*), is non-increasing, has 0 as its limiting point; consequently, the sequence converges to 0,
so that xt → x̄∗ as t → ∞, as claimed.
It remains to prove (6.1.15). Taking the sum of inequalities (6.1.20) over t = 1, ..., N, we get

γ̄ Σ_{t=1}^{N} [(2 − ε^{−1})εt−1 + ε^{−1}εt] ≤ d²0 − d²N ≤ |x0 − x∗|².

Since ε0 ≥ ε1 ≥ ε2 ≥ ... (our method is descent – it never increases the values of the objective!), the left hand side in the resulting inequality can only become smaller if we replace all εt with εN; thus, we get

2Nγ̄εN ≤ |x0 − x∗|², i.e., εN ≤ ηLf|x0 − x∗|²/(4(1 − ε)N);

since x∗ ∈ X∗ is arbitrary, (6.1.15) follows.
Strongly convex C1,1 case. Proposition 6.1.2 deals with the case of smooth convex f , but there
were no assumptions on the non-degeneracy of the minimizer – the minimizer might be non-unique, and
the graph of f could be very “flat” around X ∗ . Under additional assumption of strong convexity of f we
may get better convergence results.
The notion of strong convexity is given by the following
Definition 6.1.1 [Strongly convex function] A function f : Rn → R is called strongly convex with the parameters of strong convexity (lf, Lf), 0 < lf ≤ Lf < ∞, if f is continuously differentiable and satisfies the inequalities

f(x) + (y − x)ᵀ∇f(x) + (lf/2)|y − x|² ≤ f(y) ≤ f(x) + (y − x)ᵀ∇f(x) + (Lf/2)|y − x|²,   ∀x, y ∈ Rn.   (6.1.22)
Strongly convex functions traditionally play the role of "excellent" objectives, and this is the family on which the theoretical convergence analysis of optimization methods is normally performed. For our further purposes it is worth knowing how to detect strong convexity and what the basic properties of strongly convex functions are; this is the issue we are coming to.
The most convenient sufficient condition for strong convexity is given by the following
Proposition 6.1.3 [Criterion of strong convexity for twice continuously differentiable functions]
Let f : Rn → R be a twice continuously differentiable function, and let (lf, Lf), 0 < lf ≤ Lf < ∞, be two given reals. f is strongly convex with parameters lf, Lf if and only if the spectrum of the Hessian of f at every point x ∈ Rn belongs to the segment [lf, Lf]:

lf ≤ λmin(∇²f(x)) ≤ λmax(∇²f(x)) ≤ Lf   ∀x ∈ Rn,   (6.1.23)

where λmin(A), λmax(A) denote the minimal, respectively, the maximal eigenvalue of a symmetric matrix A, and ∇²f(x) denotes the Hessian (the matrix of the second order derivatives) of f at x.
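Proposition 6.1.3 suggests a simple numerical sanity check: sample the spectrum of the Hessian at several points. The Python sketch below (the function name and the finite sampling are ours; a finite sample can refute, but of course never certify, the "for every x" condition) illustrates the criterion:

import numpy as np

def strong_convexity_bounds(hessian, sample_points):
    # Estimate (l_f, L_f) as the extreme Hessian eigenvalues over a
    # finite sample of points (a heuristic check of Proposition 6.1.3).
    lo, hi = np.inf, -np.inf
    for x in sample_points:
        eigs = np.linalg.eigvalsh(hessian(x))   # ascending eigenvalues
        lo, hi = min(lo, eigs[0]), max(hi, eigs[-1])
    return lo, hi

# f(x) = 0.5 x^T A x has constant Hessian A, so the sample is exact here
A = np.array([[2.0, 0.5], [0.5, 1.0]])
points = [np.random.randn(2) for _ in range(10)]
print(strong_convexity_bounds(lambda x: A, points))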
The most important for us properties of strongly convex functions are summarized in Proposition 6.1.4, to which the proofs below refer.
Now we come back to Gradient Descent. The following important proposition says that ArD as applied to a strongly convex f possesses global linear convergence, with the convergence ratio governed by the condition number of the objective

Qf = Lf/lf.   (6.1.25)
Proof.
1°. According to Proposition 6.1.4, f is a C1,1 convex function which attains its minimum, and the gradient of f is Lipschitz continuous with constant Lf. Consequently, all conclusions of the proof of Proposition 6.1.2 are valid, in particular, relation (6.1.19):

d²t ≡ |xt − x∗|² ≤ d²t−1 − γ̄[(2 − ε^{−1})εt−1 + ε^{−1}εt],   γ̄ = 2(1 − ε)/(ηLf),   εs = f(xs) − min f.   (6.1.27)
Applying (6.1.22) to the pair (x = x∗, y = xs) and taking into account that ∇f(x∗) = 0, we get

εs ≥ (lf/2)|xs − x∗|² = (lf/2)d²s;

therefore (6.1.27) implies

d²t ≤ d²t−1 − (γ̄lf/2)[(2 − ε^{−1})d²t−1 + ε^{−1}d²t],
or, substituting the expression for γ̄ and rearranging the terms,

d²t ≤ θ²d²t−1,

where θ is exactly the quantity appearing in (6.1.31) below; whence

|xN − x∗| ≤ θ^{N}|x0 − x∗|.

2°. Applying (6.1.22) to the pair (x = x∗, y = xN) and taking into account that ∇f(x∗) = 0, we get

f(xN) − min f ≡ f(xN) − f(x∗) ≤ (Lf/2)|xN − x∗|²;

consequently,

f(xN) − min f ≤ (Lf/2)|xN − x∗|² ≤
[see (6.1.24)]
≤ (Lf/2)θ^{2N}|x0 − x∗|² ≤
[see (6.1.29)]
≤ (Lf/lf)θ^{2N}[f(x0) − min f],

as required in (6.1.26).
Global rate of convergence in convex C1,1 case: summary. The results given by Proposi-
tions 6.1.2 and 6.1.5 can be summarized as follows. Assume that we are solving the problem
f(x) → min

with a convex C1,1 objective (i.e., ∇f(x) is a Lipschitz continuous vector field), and assume that f possesses a nonempty set X∗ of global minimizers. And assume that we are minimizing f by ArD with properly chosen parameter ε, namely, 1/2 ≤ ε < 1. Then
• A. In the general case, where no strong convexity of f is imposed, the trajectory {xt} of the method converges to a certain point x̄∗ ∈ X∗, and the residuals in terms of the objective – the quantities εN = f(xN) − min f – go to zero at least as O(1/N), namely, they satisfy the estimate

εN ≤ ηLf dist²(x0, X∗)/(4(1 − ε)N).   (6.1.30)
Note that
– no quantitative assertions on the rate of convergence of the quantities |xN − x̄∗ | can be given;
all we know is that these quantities converge to 0, but the convergence can be as slow as
you wish. Namely, given an arbitrary decreasing sequence {dt } converging to 0, one can
point out a C1,1 convex function f on 2D plane such that Lf = 1, dist(x0 , X ∗ ) = d0 and
dist(xt , X ∗ ) ≥ dt for every t;
– estimate (6.1.30) establishes the correct order of convergence to 0 of the residuals in terms of the objective: for a properly chosen C1,1 convex function f on the 2D plane one has

εN ≥ α/N,   N = 1, 2, ...

with a certain positive α.
• B. If f is strongly convex with parameters (lf, Lf), then the method converges linearly:

|xN − x∗| ≤ θ^{N}|x0 − x∗|,   f(xN) − min f ≤ Qf θ^{2N}[f(x0) − min f],

θ = √[(Qf − (2 − ε^{−1})(1 − ε)η^{−1}) / (Qf + (ε^{−1} − 1)η^{−1})],   (6.1.31)

Qf = Lf/lf being the condition number of f.
Note that the convergence ratio θ (or θ2 , depending on which accuracy measure – the distance from the
iterate to the optimal set or the residual in terms of the objective – we use) tends to 1 as the condition
number of the problem goes to infinity (as people say, as the problem becomes ill-conditioned). When
Qf is large, we have

θ ≈ 1 − pQf^{−1},   p = (1 − ε)η^{−1},   (6.1.32)

so that decreasing the upper bound (6.1.31) on |x· − x∗| by an absolute constant factor, say, by factor 10, requires O(Qf) steps of the method. In other words, what we can extract from (6.1.31) is that
(**) the number of steps of the method resulting in a given in advance progress in accuracy (the one required to decrease the initial distance from the optimal set by a given factor, say, 10⁶) is proportional to the condition number Qf of the objective.
Of course, this conclusion is derived from an upper bound on inaccuracy; it might happen that our upper
bounds “underestimate” actual performance of the method. It turns out, anyhow, that our bounds are
tight, and the conclusion is valid:
the number of steps of the Gradient Descent required to reduce initial inaccuracy (measured
either as the distance from the optimal set or as the residual in terms of the objective) by a
given factor is typically proportional to the condition number of f .
To justify the claim, let us look what happens in the case of quadratic objective.
Let us look at what happens if Gradient Descent is applied to a strongly convex quadratic objective

f(x) = (1/2)xᵀAx − bᵀx + c,

A being a symmetric positive definite matrix. As we know from Example 6.1.1, f is strongly convex with the
parameters lf = λmin (A), Lf = λmax (A) (the minimal and the maximal eigenvalues of A, respectively).
It is convenient to speak about the Steepest Descent rather than about the Armijo-based Gradient
Descent (in the latter case our considerations would suffer from uncertainty in the choice of the stepsizes).
We have the following relations:

∇f(x) = Ax − b;   (6.1.33)

in particular, the unique minimizer of f is given by the equation (the Fermat rule)

Ax∗ = b.   (6.1.34)

Further,

f(x) = E(x) + f(x∗),   E(x) = (1/2)(x − x∗)ᵀA(x − x∗),   (6.1.35)

and the gradient at the iterate xt is

gt ≡ ∇f(xt) = Axt − b = A(xt − x∗).   (6.1.36)

The Steepest Descent step is xt+1 = xt − γt+1gt, where γt+1 is the minimizer of the strongly convex quadratic function φ(γ) = f(xt − γgt) of the real variable γ. Solving the equation φ′(γ) = 0 which identifies γt+1, we get

γt+1 = gtᵀgt / (gtᵀAgt);   (6.1.37)

thus,

xt+1 = xt − (gtᵀgt / (gtᵀAgt)) gt.   (6.1.38)
• Explicit computation 1) results in

E(xt+1) = (1 − (gtᵀgt)² / ([gtᵀAgt][gtᵀA^{−1}gt])) E(xt).   (6.1.39)
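The relations (6.1.37) – (6.1.39) are easy to verify numerically. The Python sketch below (the names and the diagonal test matrix are ours) runs the recurrence (6.1.38) from a deliberately bad starting point and compares the observed ratio E(xt+1)/E(xt) with the Kantorovich-type bound derived below:

import numpy as np

def steepest_descent_quadratic(A, b, x0, n_iters=20):
    # Steepest Descent on f(x) = 0.5 x^T A x - b^T x with the exact
    # stepsize (6.1.37); returns the whole trajectory.
    x = np.asarray(x0, dtype=float)
    xs = [x.copy()]
    for _ in range(n_iters):
        g = A @ x - b                        # gradient (6.1.36)
        gamma = (g @ g) / (g @ (A @ g))      # exact line search (6.1.37)
        x = x - gamma * g                    # recurrence (6.1.38)
        xs.append(x.copy())
    return xs

Q = 100.0
A, b = np.diag([1.0, Q]), np.zeros(2)        # condition number Q, x* = 0
xs = steepest_descent_quadratic(A, b, np.array([Q, 1.0]))
E = [0.5 * x @ (A @ x) for x in xs]          # E(x) from (6.1.35)
print(E[1] / E[0], ((Q - 1) / (Q + 1)) ** 2) # observed ratio vs the bound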
Now we can obtain the convergence rate of the method from the following
Lemma 6.1.2 [Kantorovich] Let A be a positive definite symmetric matrix with the condition number
(the ratio between the largest and the smallest eigenvalue) Q. Then for any nonzero vector x one has
(xᵀx)² / ([xᵀAx][xᵀA^{−1}x]) ≥ 4Q/(1 + Q)².
Proof of Lemma. Passing to an orthonormal eigenbasis of A, we may assume that A is a diagonal matrix with diagonal entries λ1 ≤ λ2 ≤ ... ≤ λn; denoting by yi the coordinates of x in this basis, the quantity in question becomes

(Σi yi²)² / [(Σi λiyi²)(Σi λi^{−1}yi²)].   (6.1.40)
This quantity remains unchanged if all yi's are multiplied by a common nonzero factor; thus, without loss of generality we may assume that Σi yi² = 1. Further, the quantity in question remains unchanged if all λi's are multiplied by a common positive factor; thus, we may assume that λ1 = 1, so that λn = Q is the condition number of the matrix A. Setting ai = yi², we come to the necessity to prove that

if u = Σi aiλi, v = Σi aiλi^{−1}, where 0 ≤ ai, Σi ai = 1, and 1 ≤ λi ≤ Q, then uv ≤ (1 + Q)²/(4Q).
This is easy: due to its origin, the point (u, v) on the 2D plane is a convex combination, the coefficients being ai, of the points Pi = (λi, λi^{−1}) belonging to the arc Γ on the graph of the function η = 1/ξ, the arc corresponding to the segment [1, Q] of values of ξ (ξ, η are
1) Here is the computation: since φ(γ) is a convex quadratic form and γt+1 is its minimizer, we have

φ(0) = φ(γt+1) + (1/2)γ²t+1φ″;

due to the origin of φ, we have φ″ = gtᵀAgt, so that

E(xt) − E(xt+1) ≡ f(xt) − f(xt+1) ≡ φ(0) − φ(γt+1) = (1/2)γ²t+1[gtᵀAgt],

or, due to (6.1.37),

E(xt) − E(xt+1) = (gtᵀgt)²/(2gtᵀAgt).

At the same time, by (6.1.35), (6.1.36) one has

E(xt) = (1/2)(xt − x∗)ᵀA(xt − x∗) = (1/2)[A^{−1}gt]ᵀA[A^{−1}gt] = (1/2)gtᵀA^{−1}gt.

Combining the resulting relations, we come to

E(xt+1) = [1 − (gtᵀgt)²/((gtᵀAgt)(gtᵀA^{−1}gt))] E(xt),

as required in (6.1.39).
the coordinates on the plane). Consequently, (u, v) belongs to the convex hull C of Γ, which is bounded by the arc Γ itself and by the chord of Γ joining its endpoints (1, 1) and (Q, 1/Q).

[figure: the arc η = 1/ξ, 1 ≤ ξ ≤ Q, and its convex hull C]

Consequently,

uv ≤ max_{0≤a≤1} [(a + (1 − a)Q)(a + (1 − a)/Q)];

the right hand side maximum can be explicitly computed (it corresponds to a = 1/2), and the resulting value is (Q + 1)²/(4Q), as claimed.
Proposition 6.1.6 [Convergence ratio for the Steepest Descent as applied to strongly convex quadratic
form]
As applied to a strongly convex quadratic form f with condition number Q, the Steepest Descent converges
linearly with the convergence ratio not worse than

1 − 4Q/(Q + 1)² = ((Q − 1)/(Q + 1))².   (6.1.41)
Note that the Proposition says that the convergence ratio is not worse than (Q − 1)2 (Q + 1)−2 ; the actual
convergence ratio depends on the starting point x0 . It is known, anyhow, that (6.1.42) gives correct
description of the rate of convergence: for “almost all” starting points, the process indeed converges
at the rate close to the indicated upper bound. Since the convergence ratio given by Proposition is
1 − O(1/Q) (cf. (6.1.32)), quantitative conclusion (**) from the previous subsection indeed is valid, even
in the case of strongly convex quadratic f .
6.1.5 Conclusions
Let us summarize our knowledge of Gradient Descent. We know that
• In the most general case, under mild regularity assumptions, both StD and ArD converge to the
set of critical points of the objective (see Theorem 6.1.1), and there is certain guaranteed sublinear
rate of global convergence in terms of the quantities |∇f (xN )|2 (see Proposition 6.1.1);
• In the convex C1,1 case ArD converges to a global minimizer of the objective (provided that such
a minimizer exists), and there is certain guaranteed (sublinear) rate of global convergence in terms
of the residuals in the objective f (xN ) − min f (see Proposition 6.1.2);
• In the strongly convex case ArD converges to the unique minimizer of the objective, and both dis-
tances to the minimizer and the residuals in terms of the objective admit global linearly converging
to zero upper bounds. The corresponding convergence ratio is given by the condition number of
the objective Q (see Proposition 6.1.5) and is of the type 1 − O(1/Q), so that the number of steps
required to reduce the initial inaccuracy by a given factor is proportional to Q (this is an upper
bound, but typically it reflects the actual behaviour of the method);
• In the quadratic case – globally, and in the nonquadratic one – asymptotically, StD converges linearly with the convergence ratio 1 − O(1/Q), Q being the condition number of the Hessian of the objective at the minimizer towards which the method converges (in the quadratic case, of course, this Hessian simply is the matrix of our quadratic form).
This is what we know. What should be conclusions – is the method good, or bad, or what? As it normally
is the case in numerical optimization, we are unable to give a definite answer: there are too many different
criteria to be taken into account. What we can do is list the advantages and disadvantages of the method. Such knowledge provides us with a kind of orientation: when we know the strong and the weak points of an optimization method and are given a particular application we are interested in, we can decide "how strong in the case in question are the strong points and how weak are the weak ones", thus getting the possibility to choose the solution method better fitting the situation. As for the Gradient Descent, the evident strong points of the method are
• broad family of problems where we can guarantee global convergence to a critical point (normally
- to a local minimizer) of the objective;
• simplicity: at a step of the method, we need a single evaluation of ∇f and a small number of evaluations of f (the evaluations of f are required by the line search; if one uses ArD with the simplified line search mentioned in Section 5.2.4.1, this number indeed is small). Note that each evaluation of f is accompanied by a small (O(n), n being the dimension of the design vector) number of arithmetic operations.
The most important weak point of the method is relatively low rate of convergence: even in the strongly
convex quadratic case, the method converges linearly. This itself is not that bad; what indeed is bad, is
that the convergence ratio is too sensitive to the condition number Q of the objective. As we remember,
the number of steps of the method, for a given progress in accuracy, is proportional to Q. And this
indeed is too bad, since in applications we typically meet with ill-conditioned problems, with condition
numbers of orders of thousands and millions; whenever this is the case, we hardly can expect something
good from Gradient Descent, at least when we are interested in high-accuracy solutions.
It is worth understanding the geometry underlying the slowing down of Gradient Descent in the case of an ill-conditioned objective. Consider the case of strongly convex quadratic f. The level surfaces

Sδ = {x | f(x) = min f + δ}

of f are homothetic ellipsoids centered at the minimizer x∗ of f; the squared half-axes of these ellipsoids are inversely proportional to the eigenvalues of A = ∇²f. Indeed, as we know from (6.1.35),
f(x) = (1/2)(x − x∗)ᵀA(x − x∗) + min f,

so that in the orthogonal coordinates xi associated with the orthonormal eigenbasis of A and the origin placed at x∗ we have

f(x) = (1/2) Σi λixi² + min f,

λi being the eigenvalues of A. Consequently, the equation of Sδ in the indicated coordinates is

Σi λixi² = 2δ.
Now, if A is ill-conditioned, the ellipsoids Sδ become a kind of “valleys” – they are relatively narrow in
some directions (those associated with smallest half-axes of the ellipsoids) and relatively long in other
directions (associated with the largest half-axes). The gradient – which is orthogonal to the level surface
– on the dominating part of this surface looks “almost across the valley”, and since the valley is narrow,
the steps turn out to be short. As a result, the trajectory of the method is a kind of short-step zig-zag
movement with slow overall trend towards the minimizer.
What should be stressed is that in the case in question there is nothing intrinsically bad in the problem itself; all the difficulties come from the fact that we relate the objective to "badly chosen" initial coordinates. Under an appropriate non-orthogonal linear transformation of coordinates (pass from xi to yi = √λi xi) the objective becomes perfectly conditioned – it becomes the sum of squares of the coordinates, so that the condition number now equals 1, and Gradient Descent, being run in the new coordinates, will go
directly towards the minimizer. The problem, of course, is that Gradient Descent is associated with once for ever fixed initial Euclidean coordinates (since the underlying notion of gradient is a Euclidean
notion: different Euclidean structures result in different gradient vectors of the same function at the
same point). If these initial coordinates are badly chosen for a given objective f (so that the condition
number of f with respect to these coordinates is large), the Gradient Descent will be slow, although if
we were clever enough to perform first appropriate scaling – linear non-orthogonal transformation of the
coordinates – and then run Gradient Descent in these new coordinates, we might obtain fast convergence.
In the next Section we will consider the famous Newton method which, in a sense, is nothing but “locally
optimally scaled” Gradient Descent, with the scaling varying from step to step.
6.2 Basic Newton's Method

The Basic Newton method for the unconstrained problem (6.0.1) is the recurrence

xt+1 = xt − [∇²f(xt)]^{−1}∇f(xt).

The indicated method is not necessarily well-defined (e.g., what should we do when the Hessian at the current iterate turns out to be singular?). We shall address this difficulty, same as several other difficulties which may occur in the method, in the next Lecture. Our current goal is to establish the fundamental result on the method – its local quadratic convergence in the non-degenerate case:
Theorem 6.2.1 [Local Quadratic Convergence of the Newton method in the nondegenerate case]
Assume that f is three times continuously differentiable in a neighbourhood of x∗ ∈ Rn , and that x∗ is
a nondegenerate local minimizer of f , i.e., ∇f (x∗ ) = 0 and ∇2 f (x∗ ) is positive definite. Then the Basic
Newton method, starting close enough to x∗ , converges to x∗ quadratically.
Proof. Let U be a convex neighbourhood of x∗ where the third order partial derivatives of f (i.e., the second order partial derivatives of the components of ∇f) are bounded. In this neighbourhood, consequently,

|∇f(x∗) − ∇f(x) − ∇²f(x)(x∗ − x)| ≤ β1|x∗ − x|²,   x ∈ U,

with some constant β1 (we have applied to the components of ∇f the standard upper bound on the remainder in the first order Taylor expansion: if g(·) is a scalar function with bounded second order derivatives in U, then

|g(y) − g(x) − (y − x)ᵀ∇g(x)| ≤ β|y − x|²,   x, y ∈ U,

for some constant β 2)). Besides this, since ∇²f(x∗) is nonsingular and ∇²f is continuous, there exist r > 0 and β2 such that the ball {x | |x − x∗| ≤ r} is contained in U and

|x − x∗| ≤ r ⇒ |[∇²f(x)]^{−1}| ≤ β2;

here and in what follows, for a matrix A, |A| denotes the operator norm of A, i.e.,

|A| = max_{|h|≤1} |Ah|,

the right hand side norms being the standard Euclidean norms on the corresponding vector spaces.

2) note that the magnitude of β is of the order of the magnitude of the second order derivatives of g in U
Now assume that a certain point xt of the trajectory of the Basic Newton method on f is close enough to x∗, namely, such that

xt ∈ U″,   U″ = {x | |x − x∗| ≤ ρ ≡ min[1/(2β1β2), r]}.   (6.2.4)

We have

|xt+1 − x∗| = |xt − x∗ − [∇²f(xt)]^{−1}∇f(xt)| = |[∇²f(xt)]^{−1}[∇²f(xt)(xt − x∗) − ∇f(xt)]| ≤
≤ |[∇²f(xt)]^{−1}| |∇f(x∗) − ∇f(xt) − ∇²f(xt)(x∗ − xt)| ≤ β2β1|xt − x∗|² ≤ (1/2)|xt − x∗|

(we have used ∇f(x∗) = 0, the Taylor remainder bound above and (6.2.4)). Thus, once in U″, the trajectory remains in U″ and converges to x∗, and the convergence is quadratic: |xt+1 − x∗| ≤ β1β2|xt − x∗|².
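For reference, here is a Python sketch of the recurrence just analyzed (the names and the test problem are ours; note that no safeguards are included against a singular or indefinite Hessian – this is exactly the issue discussed next):

import numpy as np

def basic_newton(grad, hess, x0, n_steps=10):
    # Basic Newton method: x_{t+1} = x_t - [Hess f(x_t)]^{-1} grad f(x_t);
    # we solve the linear system instead of forming the inverse.
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# f(x) = exp(x1) + exp(-x1) + x2**2: nondegenerate minimizer (0, 0)
grad = lambda x: np.array([np.exp(x[0]) - np.exp(-x[0]), 2 * x[1]])
hess = lambda x: np.diag([np.exp(x[0]) + np.exp(-x[0]), 2.0])
print(basic_newton(grad, hess, [1.0, 1.0]))   # ~ (0, 0) in a few steps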
In order to ensure global convergence, it is natural to pass from the Basic Newton method to its line search versions

xt+1 = xt + γt+1e(xt),   e(x) = −[∇²f(x)]^{−1}∇f(x),

where the stepsize γt+1 > 0 along the Newton direction e(xt) is given by a line search. The standard options are
• the Steepest Newton method, where γt+1 minimizes f(xt + γe(xt)) over γ ≥ 0;
• the Armijo-based Newton method, where γt+1 is given by the Armijo-terminated line search, or
• the Goldstein-based Newton method, where γt+1 is given by the Goldstein-terminated line search.
We could expect the indicated modifications to make the method globally converging; at the same time,
we may hope that close enough to a nondegenerate local minimizer of f , the indicated line search will
result in stepsize close to 1, so that the asymptotical behaviour of the modified method will be similar
to the one of the basic method (provided, of course, that the modified method indeed converges to a
nondegenerate local minimizer of f). Whether these hopes are valid or not depends on f, and at least one property of f seems to be necessary to make them valid: e(x) should be a descent direction of f at a non-critical x:
∇f(x) ≠ 0 ⇒ eᵀ(x)∇f(x) ≡ −(∇f(x))ᵀ[∇²f(x)]^{−1}∇f(x) < 0.   (6.2.8)
Indeed, if there exists x with nonzero ∇f(x) such that the Newton direction e(x) is not a descent direction of f at x, it is unclear whether there exists a step in this direction which reduces f. If, e.g., f(x + γe(x)) is a nondecreasing function of γ > 0, then the Steepest Newton method started at x clearly will never leave the point, thus converging to (simply staying at) a non-critical point of f. Similar difficulties occur with the Armijo- and Goldstein-based Newton methods: if e(x) is not a descent direction of f at x, then the Armijo/Goldstein-terminated line search makes no sense at all.
We see that one may hope to prove convergence of the Newton method with line search only if f satisfies the property

∇f(x) ≠ 0 ⇒ (∇f(x))ᵀ[∇²f(x)]^{−1}∇f(x) > 0.

The simplest way to impose this property is to assume that f is convex with nonsingular Hessian:

∇²f(x) is positive definite for every x.   (6.2.9)

Indeed, if the matrix ∇²f(x) is positive definite for every x, then, as is known from Linear Algebra (and can be proved immediately), the matrix [∇²f(x)]^{−1} also is positive definite at every x, so that (6.2.8) takes place. It indeed turns out that under assumption (6.2.9) the line search versions of the Newton method possess global convergence:
Proposition 6.2.1 Let f be a twice continuously differentiable convex function with the Hessian ∇2 f (x)
being positive definite at every point x, and let x0 be such that the level set
S = {x | f (x) ≤ f (x0 )}
associated with x0 is bounded. Then the Steepest Newton method, same as the Armijo/Goldstein-based
Newton method, started at x0 converges to the unique global minimizer of f .
The proof of the Proposition is completely similar to the one of Theorem 6.1.1 and is therefore omitted.
The "line search" modification of the Basic Newton method is not quite satisfactory: as we have just seen, in this modification we meet with severe difficulties when the Newton direction at a certain point is not a descent direction of the objective. Another difficulty is that the Newton direction, generally speaking, can simply be undefined – ∇²f(x) may be singular at a point of the trajectory. What should we do in this situation? We see that in order to make the Newton method reliable, we need to modify not only the stepsize used in the method, but also the search direction itself, at least in the cases when it is "bad" (is not a descent direction of f or simply is undefined). We shall discuss the corresponding modifications later.
The main advantage of the method is its fast (quadratic) local convergence to a nondegenerate local
minimizer of the objective, provided that we were lucky to bring the trajectory close enough to such a
minimizer.
Rigorously speaking, the indicated attractive property is possessed only by the basic version of the
method. For a modified version, the indicated phenomenon takes place only when the modified method
manages to drive the trajectory to a small neighbourhood of a nondegenerate local minimizer of f ; if it
is not so, we have no reasons to expect fast convergence. Thus, let us assume that the trajectory of the
modified Newton method converges to a nondegenerate local minimizer x∗ of f 3) . Is then the convergence
asymptotically quadratic?
The answer depends on the rules for the line search; indeed, to get something asymptotically close to the Basic Newton method, we need nearly unit stepsizes at the final phase of the process. One can prove, e.g., that the required property of "asymptotically unit stepsizes in the Newton direction" is ensured by the exact line search. To get the same behaviour with the Armijo-terminated line search, the parameters ε and η of the underlying Armijo test should be chosen properly (e.g., ε = 0.2 and η = 10), and we should always start the line search with testing the unit stepsize.
In spite of all indicated remarks which say that the modifications of the Basic Newton method
aimed to ensure global convergence may spoil the theoretical quadratic convergence of the basic method
(either because of bad implementation, or due to degeneracy of the minimizer the trajectory converges
to), the Newton-based methods should be qualified as the most efficient tool for smooth unconstrained
minimization. The actual reason of the efficiency of these methods is their intrinsic ability (spoiled, to
some extent, by the modifications aimed to ensure global convergence) to adjust themselves to the “local
geometry” of the objective.
The main shortcoming of the Newton-type methods is their relatively high computational cost. To run such a method, we should be able to compute the Hessian of the objective and should solve at each step an n × n system of linear equations. If the objective is too complicated and/or the dimension n of the problem is large, these requirements may become too costly from the viewpoint of coding, execution time and memory considerations. In order to overcome these drawbacks, significant effort was invested into the theoretical and computational development of first-order routines (those not using second-order derivatives) capable of "imitating" the Newton method. These are the methods we are about to consider in Lecture 10.
6.2.4.1 Preliminaries
The traditional “starting point” in the theory of the Newton method – Theorem 6.2.1 –
possesses an evident drawback (which, anyhow, remained unnoticed by generations of re-
searchers). The Theorem establishes local quadratic convergence of the Basic Newton method
as applied to a function f with positive definite Hessian at the minimizer, this is fine; but
what is the “quantitative” information given by the Theorem? What indeed is the “region
of quadratic convergence” Q of the method – the set of those starting points from which
the method converges quickly to x∗ ? The proof provides us with certain “constructive”
description of Q, but look – this description involves differential characteristics of f like
3) This, e.g., for sure is the case when f is strongly convex: here the only critical point is a nondegenerate global minimizer, while convergence to the set of critical points is given by Proposition 6.2.1.
To find the Newton iterate xt+1 of the previous iterate xt , take the second order Taylor
expansion of f at xt and choose, as xt+1 , the minimizer of the resulting quadratic form.
Thus, the coordinates are responsible only for the point of view we use to investigate the
process and are absolutely irrelevant to the process itself. And the results of Theorem 6.2.1
in their quantitative part (same as other traditional results on the Newton method) reflect
this “point of view”, not only the actual properties of the Newton process! This “dependence
on viewpoint" is a severe drawback: how can we get a correct impression of the actual abilities of the method by looking at it from an "occasionally chosen" position? This is exactly the same as trying to get a good picture of a landscape by directing the camera at random.
6.2.4.2 Self-concordance
After the drawback of the traditional results is realized, could we choose a proper point of
view – to orient our camera properly, at least for “good” objectives? Assume, e.g., that our
objective f is convex with nondegenerate Hessian. Then at every point x there is a natural,
intrinsic for the objective, Euclidean structure on the space of variables, namely, the one
given by the Hessian of the objective at x; the corresponding norm is
|h|f,x = √(hᵀ∇²f(x)h) ≡ √((d²/dt²)|t=0 f(x + th)).   (6.2.10)
Note that the first expression for |h|f,x seems to be “frame-dependent” – it is given in terms
of coordinates used to compute inner product and the Hessian. But in fact the value of this
expression is “frame-independent”, as it is seen from the second representation of |h|f,x .
Now, from the standard results on the Newton method we know that the behaviour of the
method depends on the magnitudes of the third-order derivatives of f . Thus, these results
are expressed, among other, in terms of upper bounds
|(d³/dt³)|t=0 f(x + th)| ≤ α
on the third-order directional derivatives of the objective, the derivatives being taken along
unit in the standard Euclidean metric directions h. What happens if we impose similar upper
bound on the third-order directional derivatives along the directions of the unit | · |f,x length
rather than along the directions of the unit “usual” length? In other words, what happens
if we assume that
|h|f,x ≤ 1 ⇒ |(d³/dt³)|t=0 f(x + th)| ≤ α ?
Since the left hand side of the concluding inequality is of homogeneity degree 3 with respect
to h, the indicated assumption is equivalent to the one
|(d³/dt³)|t=0 f(x + th)| ≤ α|h|³f,x   ∀x ∀h.
Now, the resulting inequality, qualitatively, remains true when we scale f – replace it by λf
with positive constant λ, but the value of α varies: α 7→ λ−1/2 α. We can use this property
to normalize the constant factor α, e.g., to set it equal to 2 (this is the most technically
convenient normalization).
Thus, we come to the main ingredient of the notion of a self-concordant function: a three times continuously differentiable convex function f with an open convex domain Qf ⊆ Rn such that

|(d³/dt³)|t=0 f(x + th)| ≤ 2[hᵀ∇²f(x)h]^{3/2}   ∀x ∈ Qf ∀h,

and such that f(xi) → ∞ along every sequence of points xi ∈ Qf converging to a boundary point of Qf.
Of course, the second part of the definition imposes something on f only when the domain of f is less than the entire Rn.
Note that the definition of a self-concordant function is “coordinateless” – it imposes certain
inequality between third- and second-order directional derivatives of the function and certain
behaviour of the function on the boundary of its domain; all notions involved are “frame-
independent”.
It turns out that the Newton method as applied to a self-concordant function f possesses
extremely nice global convergence properties. Namely, one can more or less straightforwardly
prove the following statements:
and
λ(f, x_{t+1}) ≤ 2λ²(f, x_t) / (1 − λ(f, x_t)). (6.2.16)
The indicated statements demonstrate extremely nice global convergence properties of the
Damped Newton method (6.2.13) as applied to a self-concordant function f . Namely, as-
sume that f is self-concordant with nondegenerate Hessian at certain (and then, as it was
mentioned in the above proposition, at any) point of Qf . Assume, besides this, that f is
below bounded on Qf (and, consequently, attains its minimum on Qf by (ii)). According to
(iii), the Damped Newton method keeps the iterates in A_f. Now, we may partition the
trajectory into two parts:
• the initial phase: from the beginning to the first moment, let it be called t∗ , when
λ(f, xt ) ≤ 1/4;
• the final phase: starting from the moment t∗ .
According to (iii.3), at every step of the initial phase the objective is decreased at least by
absolute constant
κ = 1/4 − ln(5/4) > 0;
consequently,
• the initial phase is finite and is comprised of no more than
N_ini = (f(x_0) − min_{Q_f} f) / κ
iterations.
Starting with t = t∗, we have in view of (6.2.16):
λ(f, x_{t+1}) ≤ 2λ²(f, x_t)/(1 − λ(f, x_t)) ≤ (1/2) λ(f, x_t);
thus, in the final phase λ(f, x_t) converges to 0 quadratically fast, and it takes only O(1) ln ln(1/ε) steps of this phase to make λ(f, x_t) ≤ ε.
Note also that
λ²(f, x)/2 = f̂(x) − min_y f̂(y),
where
f̂(y) = f(x) + (y − x)^T ∇f(x) + (1/2)(y − x)^T ∇²f(x)(y − x)
is the second-order Taylor expansion of f at x. This is a coordinateless definition of λ(f, x).
Note that the region of quadratic convergence of the Damped Newton method as applied to a below bounded self-concordant function f is, according to (iii.4), the set
{x ∈ Q_f | λ(f, x) ≤ 1/4}. (6.2.18)
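To make this concrete, here is a minimal Python sketch of the Damped Newton method, assuming the standard damped stepsize 1/(1 + λ(f, x)) as the rule behind (6.2.13) (the rule itself did not survive in the text above); the demo data – the log-barrier of the 2D box analyzed below – are purely illustrative.

    import numpy as np

    def newton_decrement_and_step(g, H):
        # lambda(f,x) = sqrt(g^T H^{-1} g); the solve also yields the Newton step direction
        Hinv_g = np.linalg.solve(H, g)
        return np.sqrt(g @ Hinv_g), Hinv_g

    def damped_newton(grad_f, hess_f, x0, tol=1e-10, max_iter=100):
        # Damped Newton: x+ = x - [1/(1 + lambda)] H^{-1} g (assumed form of rule (6.2.13))
        x = np.array(x0, dtype=float)
        for _ in range(max_iter):
            lam, step = newton_decrement_and_step(grad_f(x), hess_f(x))
            if lam <= tol:
                break
            x -= step / (1.0 + lam)
        return x

    # hypothetical illustration: log-barrier of the rectangle {|x1| < delta, |x2| < 1}
    delta = 10.0
    grad_f = lambda x: np.array([2*x[0]/(delta**2 - x[0]**2), 2*x[1]/(1 - x[1]**2)])
    hess_f = lambda x: np.diag([2*(delta**2 + x[0]**2)/(delta**2 - x[0]**2)**2,
                                2*(1 + x[1]**2)/(1 - x[1]**2)**2])
    print(damped_newton(grad_f, hess_f, [9.0, 0.9]))   # converges to the minimizer (0, 0)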
The theory of self-concordant functions underlies the Interior Point methods; the development of these methods was the main event in Optimization during the last decade, and it completely changed the entire area of large-scale Convex Optimization, in particular, Linear Programming.
Right now we are not going to speak about Interior Point methods in more detail; we shall come back to these methods at the end of our course. What should be stressed now is that the crucial point in the design of the Interior Point methods is our ability to construct “good” self-concordant functions with prescribed domains. To this end it is worth explaining how to construct self-concordant functions. Here the following “raw materials” and “combination rules” are useful:
Raw materials: basic examples of self-concordant functions. For the time being,
the following examples are sufficient:
• [Convex quadratic form] The function
f(x) = (1/2) x^T Ax − b^T x + c
(A is a symmetric positive semidefinite n × n matrix) is self-concordant on R^n;
• [Logarithm] The function
− ln(x)
is self-concordant with the domain R+ = {x ∈ R | x > 0};
• [Extension of the previous example: Logarithmic barrier, linear/quadratic case] Let, in particular,
Q = {x ∈ R^n | a_i^T x < b_i, i = 1, ..., m}
be a nonempty open polyhedral set; then the logarithmic barrier
f(x) = − Σ_{i=1}^m ln(b_i − a_i^T x)
of Q is self-concordant with the domain Q.
Combination rules. Self-concordance is preserved by the basic operations: a sum of self-concordant functions is self-concordant (the domain being the intersection of the domains of the terms), and so is an affine substitution of the argument: if f is self-concordant with the domain Q_f and ξ ↦ Aξ + b is an affine mapping, then g(ξ) = f(Aξ + b) is self-concordant with the domain
Q_g = {ξ | Aξ + b ∈ Q_f}.
To justify the self-concordance of the indicated functions, same as the validity of the combination rules, only minimal effort is required; at the same time, these examples and rules give almost all that is required to establish excellent global efficiency estimates for Interior Point methods as applied to Linear Programming and Convex Quadratically Constrained Quadratic Programming.
Now that we know examples of self-concordant functions, let us look at how our new understanding of the behaviour of the Newton method on such a function differs from the one given by
Theorem 6.2.1. To this end consider a particular self-concordant function – the logarithmic
barrier
f (x) = − ln(δ − x1 ) − ln(δ + x1 ) − ln(1 − x2 ) − ln(1 + x2 )
for the 2D rectangle
D = {x ∈ R2 | |x1 | < δ, |x2 | < 1};
in what follows we assume that the rectangle is “wide”, i.e., that
δ ≫ 1.
This function indeed is self-concordant (see the third of the above “raw material” examples).
The minimizer of the function clearly is the origin; the region of quadratic convergence of
the Damped Newton method is given by
Q = {x ∈ D | x_1²/(δ² + x_1²) + x_2²/(1 + x_2²) ≤ 1/32}
(see (6.2.18)). We see that the region of quadratic convergence of the Damped Newton method is large enough – it contains, e.g., the rectangle D′ concentric to D and 8 times smaller than D. Besides this, (6.2.17) says that in order to minimize f to inaccuracy ε, in terms of the objective, starting with a point x_0 ∈ D, it suffices to perform no more than
O(1) [ln(1/(1 − ‖x_0‖)) + ln ln(1/ε)]
Newton steps, where
‖x‖ = max{|x_1|/δ, |x_2|}.
Now let us look at what Theorem 6.2.1 says. The Hessian ∇²f(0) of the objective at the minimizer is
H = Diag{2δ^{−2}, 2},
and |H^{−1}| = O(δ²); in, say, the 0.5-neighbourhood U of x* = 0 we also have |[∇²f(x)]^{−1}| = O(δ²). The third-order derivatives of f in U are of order of 1. Thus, in the notation from the proof of Theorem 6.2.1 we have β₁ = O(1) (this is the magnitude of the third order derivatives of f in U), U′ = U, r = 0.5 (the radius of the circle U′ = U) and β₂ = O(δ²) (this is the upper bound on the norm of the inverted Hessian of f in U′). According to the proof, the region U″ of quadratic convergence of the Newton method is the ρ-neighbourhood of x* = 0 with
ρ = min[r, (2β₁β₂)^{−1}] = O(δ^{−2}).
Thus, according to Theorem 6.2.1, the region of quadratic convergence of the method becomes
the smaller the larger is δ, while the actual behaviour of this region is quite opposite.
In this simple example, the aforementioned drawback of the traditional approach – its “frame-dependence” – is clearly seen. Applying Theorem 6.2.1 to the situation in question, we used an extremely bad “frame” – the standard Euclidean structure. If we were clever enough to scale the variable x_1 ↦ x_1/δ, i.e., to pass to coordinates in which D becomes the unit square, the conclusions of Theorem 6.2.1 would become adequate to the actual behaviour of the method.
Lecture 7: Around the Newton Method

7.1 Newton method with cubic regularization

Consider the problem
min_{x∈R^n} f(x),
f being the objective to be minimized; from now on we assume the objective to be smooth enough (namely, 3 times continuously differentiable) and such that the level set {x | f(x) ≤ f(x_0)} associated with the starting point x_0 is bounded. We start with the Newton method with cubic regularization1) – a method of unconstrained minimization of three times continuously differentiable function f. The idea of the method is very simple. Assume for starters that the third derivative of f is bounded: the third directional derivative of f, taken at any point along any direction of unit length, does not exceed some constant L ∈ (0, ∞) depending solely on f. Then for every x and h it holds
f(x + h) ≤ f_x(h) := f(x) + h^T ∇f(x) + (1/2) h^T ∇²f(x) h + (L/6) ‖h‖³. (7.1.2)
(Theorem A.6.5). Now, the right hand side of this inequality, as a function of h, goes to +∞ as ‖h‖ → ∞ and as such achieves its minimum over h ∈ R^n; let h* be a (global) minimizer. Observe that f_x(h*) ≤ f_x(0), and this inequality is strict unless h = 0 is a global minimizer of f_x. Assuming that the latter is not the
case, we have
f(x + h*) ≤ f_x(h*) < f_x(0) = f(x), (7.1.3)
where the first inequality is given by (7.1.2). Thus, in the case in question passing from x to x+ = x + h∗ ,
we strictly reduce the value of f. On the other hand, when h = 0 is a global minimizer of f_x(h), we should have ∇f(x) = 0 and ∇²f(x) ⪰ 0 (by the second order necessary optimality conditions for unconstrained minimization as applied to f_x(·); note that this function is twice continuously differentiable). Iterating
the construction, we arrive at the algorithm
x_{t+1} = x_t + h_t,  h_t ∈ Argmin_h [ f(x_t) + h^T ∇f(x_t) + (1/2) h^T ∇²f(x_t) h + (L/6) ‖h‖³ ]. (7.1.4)
From the above analysis it follows that the algorithm is well-defined and monotone:
f(x_{t+1}) < f(x_t)
for all t such that x_t does not satisfy the second order necessary optimality conditions. If x_t does satisfy these conditions, then f_{x_t}(·) is a convex function satisfying f_{x_t}(h) > f_{x_t}(0) whenever h ≠ 0, implying that h = 0 is the only minimizer of the function; consequently, in the case in question the algorithm gets stuck at x_t: x_t = x_{t+1} = x_{t+2} = ...
So far, we have assumed that f is three times continuously differentiable with the third derivative bounded on the entire R^n; besides this, we have implicitly assumed that we know the constant L participating in (7.1.2) – this constant is used in the recurrence (7.1.4). Both these assumptions can be weakened,
specifically,
1) Yu. Nesterov and B.T. Polyak, Cubic regularization of Newton method and its global performance, Mathematical Programming 108 (2006), 177-205.
1. We can assume that f : R^n → R is three times continuously differentiable on a closed and bounded convex set X which is “large enough,” specifically, int X contains the level set
X_0 = {x ∈ R^n | f(x) ≤ f(x_0)}
associated with the starting point x_0.
2. Instead of the worst-case constant L, we can use in the recurrence its step-dependent substitute:
x_{t+1} = x_t + h_t,  h_t ∈ Argmin_h [ f(x_t) + h^T ∇f(x_t) + (1/2) h^T ∇²f(x_t) h + (L_t/6) ‖h‖³ ], (7.1.5)
where Lt > 0 is good, goodness meaning that the point xt+1 defined according to (7.1.5) belongs
to X and satisfies the inequality
f(x_{t+1}) ≤ f(x_t) + h_t^T ∇f(x_t) + (1/2) h_t^T ∇²f(x_t) h_t + (L_t/6) ‖h_t‖³,  h_t = x_{t+1} − x_t; (7.1.6)
whenever this is the case, we, same as above, have
x_{t+1} ∈ X_0  &  f(x_{t+1}) ≤ min_h [ f(x_t) + h^T ∇f(x_t) + (1/2) h^T ∇²f(x_t) h + (L_t/6) ‖h‖³ ] ≤ f(x_t), (7.1.7)
where the concluding inequality is strict, provided xt does not satisfy second order necessary
conditions for (unconstrained) local optimality.
It is immediately seen that when x_t ∈ X_0, all large enough values of L_t are good; specifically, we have the following
Proposition. Let M = M_X(f) be an upper bound on the third-order directional derivatives of f taken at the points of X along unit directions. Then, whenever x_t ∈ X_0, every value L_t ≥ M is good.
Proof. To save notation, and w.l.o.g., let x_t = 0, let L ≥ M, and let h̄ be a minimizer of the function φ(h) := f(0) + h^T ∇f(0) + (1/2) h^T ∇²f(0) h + (L/6) ‖h‖³. Observe that h̄^T ∇f(0) ≤ 0, since otherwise we would have φ(−h̄) < φ(h̄). Observe also that setting γ(t) = φ(t h̄), t ≥ 0, we get a cubic polynomial with γ′(0) = h̄^T ∇f(0) ≤ 0 and γ′(1) = 0; besides this, γ′(t) is a convex quadratic function. As a result, we have γ′(t) ≤ 0 for 0 ≤ t ≤ 1, implying that γ(t) ≤ γ(0) = f(0) for t ∈ [0, 1]. Now let t̄ be the largest t ∈ [0, 1]
such that th̄ ∈ X, so that the function g(t) = f (th̄) is well defined on [0, t̄], and since L ≥ M , we have
g(t) ≤ γ(t) for 0 ≤ t ≤ t̄, which combines with the facts that xt = 0 ∈ X0 and γ(t) ≤ f (0), 0 ≤ t ≤ 1, to
imply that th̄ ∈ X0 for t ≤ t̄. Recalling what t̄ is and that X0 ⊂ int X, we conclude that t̄ = 1 and that
f(h̄) = g(1) ≤ γ(1) = φ(h̄) = min_h [ f(0) + h^T ∇f(0) + (1/2) h^T ∇²f(0) h + (L/6) ‖h‖³ ],
which combines with the already established fact that h̄ = t̄h̄ ∈ X_0 to imply that L_t = L is good for x_t = 0.
What we would like to use, is the “nearly smallest” good value of Lt (since the less is Lt , the less
is the upper bound on f (xt+1 ) given by (7.1.7)). Such a value can be rapidly found by a kind of “line
search,” namely, as follows. Given xt and Lt−1 (where L−1 is, say, 1), we check whether Lt = Lt−1 is
good; if it is the case, we test for goodness one by one the values L_{t−1}/2, L_{t−1}/4, L_{t−1}/8, ... until either the value of L_t we are testing loses goodness, or it becomes less than a small a priori selected threshold, say, 10^{−6}; in both cases, the last good value of L_t we have observed is used to build the actual, and not
just a candidate, next iterate xt+1 according to (7.1.5). On the other hand, when the value Lt−1 of Lt
turns out to be bad, we test one by one the values 2Lt−1 , 4Lt−1 , 8Lt−1 ,..., until the first good value of
Lt is met, and use this value to build the next iterate. Note that with this policy, the values of Lt used
to build xt+1 are bounded by 2 max[MX (f ), L−1 ].
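For illustration, here is a minimal Python sketch of the just described policy. It assumes a routine cubic_step(g, H, L) returning a global minimizer of the cubic model from (7.1.5) (such a routine is sketched at the end of this section), and it omits, for brevity, the membership test x_{t+1} ∈ X.

    import numpy as np

    def model_value(f_x, g, H, L, h):
        # value of the cubic model f(x) + g^T h + 0.5 h^T H h + (L/6)||h||^3
        return f_x + g @ h + 0.5 * h @ (H @ h) + (L / 6.0) * np.linalg.norm(h) ** 3

    def try_L(f, x, f_x, g, H, L, cubic_step):
        # L is "good" if the candidate step satisfies inequality (7.1.6)
        h = cubic_step(g, H, L)
        return f(x + h) <= model_value(f_x, g, H, L, h), h

    def adaptive_cubic_iteration(f, grad_f, hess_f, x, L_prev, cubic_step, L_min=1e-6):
        f_x, g, H = f(x), grad_f(x), hess_f(x)
        good, h = try_L(f, x, f_x, g, H, L_prev, cubic_step)
        L = L_prev
        if good:
            # test L/2, L/4, ... while goodness is kept and L stays above the threshold
            while L / 2.0 >= L_min:
                ok, h2 = try_L(f, x, f_x, g, H, L / 2.0, cubic_step)
                if not ok:
                    break
                L, h = L / 2.0, h2
        else:
            # test 2L, 4L, ... until the first good value of L is met;
            # by the Proposition above, the loop terminates once L >= M_X(f)
            while not good:
                L *= 2.0
                good, h = try_L(f, x, f_x, g, H, L, cubic_step)
        return x + h, L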
It is easily seen that given X and x0 as above, the just described algorithm with line search is well
defined and monotone:
f (x0 ) ≥ f (x1 ) ≥ f (x2 ) ≥ ... (7.1.8)
with all inequalities being strict, unless the algorithm “gets stuck” at a point where the necessary second order optimality conditions are met. It is not difficult to verify that if the algorithm does not get stuck,
every limiting point of the trajectory (the trajectory is bounded by (7.1.8); recall that X0 is bounded)
satisfies the second order necessary optimality conditions for unconstrained minimization. Moreover,
given ε > 0, δ > 0, one can explicitly upper-bound, via ε, δ, M_X(f) and the diameter of X, the number of steps t in which the second order necessary optimality conditions are satisfied within accuracy (ε, δ), that is, x_t satisfies the relation
‖∇f(x_t)‖ ≤ ε  &  λ_min(∇²f(x_t)) ≥ −δ,
where λmin (A) is the minimal eigenvalue of a symmetric matrix A. Moreover, it can be proved (for the
proofs of all announced results, see the cited paper of Nesterov and Polyak) that if among the limiting
points of the trajectory there is a point x̄ where the necessary second order conditions for local optimality
are satisfied (i.e., ∇f(x̄) = 0, ∇²f(x̄) ⪰ 0), then in fact the trajectory converges to x̄ quadratically.
Pay attention to the extremely attractive property of the Newton method with cubic regularization
– under mild smoothness and boundedness assumptions, all limiting points of the trajectory provably
satisfy second order necessary optimality conditions, which is in sharp contrast with basically all other
traditional algorithms for smooth unconstrained minimization – for these algorithms, one can prove only
that all limiting points of the trajectory are critical points of the objective, those where the first order
optimality conditions for unconstrained minimization are satisfied.
The only remaining issue is how to minimize, over h ∈ R^n, an auxiliary function of the form
φ(h) = a^T h + h^T Qh + ρ ‖h‖³,
where ρ > 0 and Q is a symmetric matrix. The simplest, to the best of our knowledge, answer to this question is given by the following algorithm:
1. We start with computing eigenvalue decomposition Q = U M U T (U is orthogonal, M =
Diag{µ1 , ..., µn } is diagonal) of Q. Passing from the original variable h to the variable y = U T h,
we reduce the problem of minimizing φ(·) to the problem of minimizing in y the function
ψ(y) = b^T y + Σ_{i=1}^n µ_i y_i² + ρ (Σ_i y_i²)^{3/2}  [b = U^T a]
2. It is clear that at a global minimizer of ψ, the signs of yi are opposite to the signs of bi , that is,
bi yi ≤ 0 (otherwise replacing yi with −yi , we would reduce the value of the function). Thus, to
minimize ψ(y) is the same as to minimize the function
ω(z) = − Σ_i |b_i| z_i + Σ_i µ_i z_i² + ρ (Σ_i z_i²)^{3/2}
over z ≥ 0: a minimizer z* of the latter function gives rise to the global minimizer y* of ψ given by y_i* = −sign(b_i) z_i*.
of our problem of interest; after a high accuracy solution λ∗ to the dual problem is found, we can recover
a high accuracy minimizer of χ as ζ∗ = ζ(λ∗ ).
Note that the computationally most expensive step in the algorithm – the eigenvalue decomposition – does not involve ρ. As a result, in the above “line search” implementation of the Newton method with cubic regularization, adjusting L_t is relatively cheap (the eigenvalue decomposition of ∇²f(x_t) should be done once, and the remaining effort is dominated by the necessity to compute the values of f at the candidates for the role of x_{t+1} stemming from the trial values of L_t).
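Here is a minimal Python sketch of such a routine. Rather than the dual recipe above (parts of which did not survive in these notes), it uses the equivalent standard first-order characterization of the global minimizer h of the cubic model: (H + (L/2)r·I)h = −g with r = ‖h‖ and H + (L/2)r·I ⪰ 0; the degenerate “hard case” is ignored. In line with the preceding remark, the eigenvalue decomposition is computed once, independently of the trial values of L.

    import numpy as np

    def cubic_step(g, H, L, tol=1e-12):
        # global minimizer of m(h) = g^T h + 0.5 h^T H h + (L/6)||h||^3 (generic case)
        mu, U = np.linalg.eigh(H)              # H = U diag(mu) U^T; independent of L
        b = U.T @ g
        r_lo = max(0.0, -2.0 * mu[0] / L)      # need mu_min + (L/2) r >= 0

        def s(r):                              # s(r) = ||(H + (L/2) r I)^{-1} g||
            return np.sqrt(np.sum(b ** 2 / (mu + 0.5 * L * r) ** 2))

        # bracket and bisect the root of s(r) = r (s decreases, the identity increases)
        lo, hi = r_lo + 1e-16, max(2.0 * r_lo + 1.0, 1.0)
        while s(hi) > hi:
            hi *= 2.0
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if s(mid) > mid:
                lo = mid
            else:
                hi = mid
            if hi - lo <= tol * max(1.0, hi):
                break
        r = 0.5 * (lo + hi)
        return U @ (-b / (mu + 0.5 * L * r))   # h = -(H + (L/2) r I)^{-1} g

Together with the adaptive choice of L_t sketched above, this gives a complete (if simplified) implementation of the method.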
7.2 Variable Metric Methods

The Gradient Descent scheme admits the following interpretation: at the current iterate x we replace the objective by its linearization and choose, as the next direction of movement, the “most promising” of the descent directions of the right hand side. Now, how were we comparing the directions to choose the “most promising” one? We took the unit ball
W = {d | d^T d ≤ 1}
of directions and chose in this ball the direction which minimizes the value
f̄(x + d) = f(x) + d^T ∇f(x)
of the approximate objective. This direction, as it is immediately seen, is simply the normalized anti-
gradient direction
−|∇f(x)|^{−1} ∇f(x),
and in the Gradient Descent we used it as the current direction of movement, choosing the stepsize in
order to achieve “significant” progress in the objective value. Note that instead of minimizing f¯(x + d)
on the ball W , we could minimize the quadratic function
f̂(d) = d^T ∇f(x) + (1/2) d^T d
over d ∈ Rn ; the result will be simply the anti-gradient direction −∇f (x). This is not the same as the
above normalized anti-gradient direction, but the difference in normalization is absolutely unimportant
for us – in any case we intend to use line search in the generated direction, so that what we are in fact interested in is the search ray {x + γd | γ ≥ 0}, and proportional directions, with positive coefficients, result in the same ray.
With the outlined interpretation of the Gradient Descent as a method with line search and the search direction given by minimization of the linearized objective f̄(x + d) over d ∈ W, we may ask ourselves: why do we use in this scheme the unit ball W and not something else? For example, why not use an ellipsoid
W_A = {d | d^T Ad ≤ 1},
A being a symmetric positive definite matrix? The resulting scheme, with the search direction
d ∈ Argmin{d^T ∇f(x) | d^T Ad ≤ 1}, (7.2.2)
has the same “right to exist” as the Gradient Descent.
is nothing but the usual Gradient Descent, but associated with the coordinates on Rn where WA becomes
the unit ball. Note that the usual Gradient Descent corresponds to the case A = I, I being the unit
matrix. Now, the initial coordinates are absolutely “occasional” – they have nothing in common with the
problem we are solving; consequently, we have no reason to prefer these particular coordinates. Moreover,
if we were lucky to adjust the coordinates we use to the “geometry” of the objective (cf. discussion in
the previous Lecture), we could get a method with better convergence than the one of the usual Gradient
Descent.
Same as above, the direction given by (7.2.2) is, up to renormalization (the latter, as it was already
explained, is unimportant – it is “suppressed” by line search), nothing but the direction given by the
minimization of the quadratic form
f̂_A(d) = d^T ∇f(x) + (1/2) d^T Ad; (7.2.3)
minimizing the right hand side with respect to d (to this end it suffices to solve the Fermat equation ∇_d f̂_A(d) ≡ ∇f(x) + Ad = 0), we come to the explicit form of the search direction:
d = −A^{−1} ∇f(x).
Note that this direction for sure is a descent direction of f at x, provided that x is not a critical point of f:
∇f(x) ≠ 0 ⇒ d^T ∇f(x) = −(∇f(x))^T A^{−1} ∇f(x) < 0
(recall that A is symmetric positive definite, whence A−1 also is symmetric positive definite), so that we
are in a good position to apply to f line search in the direction d.
The summary of our considerations is as follows: choosing a positive definite symmetric matrix A,
we can associate with it the “A-anti-gradient direction” −A^{−1}∇f(x), which is a descent direction of f at x (provided that ∇f(x) ≠ 0). And we have the same reasons to use this direction in order to improve f as
those to use the standard anti-gradient direction (given by the same construction with A = I).
Now we can make one step of generalization more: why should we use at each step of the method a
once for ever fixed matrix A instead of varying this matrix from iteration to iteration? The “geometry” of
the objective varies along the trajectory, and it is natural to adjust the matrix A to this varying geometry.
Thus, we come to the following generic scheme of a Variable Metric method:
• choose somehow a positive definite symmetric matrix A_t and compute the A_t-anti-gradient direction
d_t = −A_t^{−1} ∇f(x_{t−1})
of f at x_{t−1};
• perform line search from x_{t−1} in the direction d_t, thus getting the new iterate
x_t = x_{t−1} + γ_t d_t;
The outlined scheme covers all methods we know so far: to get different versions of the Gradient Descent,
we should set At ≡ I and should specify the version of the line search to be used. With At = ∇2 f (xt−1 ),
we get, as dt , the Newton direction of f at xt−1 , and we come to the Newton method with line search;
further specifying the line search by the “programmed” rule γt = 1, we get the Basic Newton method.
Thus, the Basic Newton method is nothing but the Gradient Descent scaled by the Hessian of the objective
at the current iterate. Note, anyhow, that the “Newton choice” At = ∇2 f (xt−1 ) is compatible with
the outlined scheme (where At should be symmetric positive definite) only when ∇2 f (xt−1 ) is positive
definite2) ; from the above discussion we know that if it is not the case, we indeed have no reasons to use
the Newton direction and should somehow modify it to make it descent. Thus, the generic Algorithm 7.2.1 covers all we know up to this moment and provides us with a good idea of how to “cure” the Newton method at a “bad” iterate:
at such an iterate, we should replace the actual Hessian A_t = ∇²f(x_{t−1}) in the expression −A_t^{−1}∇f(x_{t−1}) for the Newton direction by its positive definite “correction” in order to make the resulting direction descent for the objective.
Thus, we come to the family of Modified Newton methods – those given by the generic Algorithm 7.2.1
where we use as At , whenever it is possible, the Hessian ∇2 f (xt−1 ) of the objective, and when it is
impossible, choose as At a “positive definite correction” of the current Hessian.
Before passing to the Modified Newton methods themselves, we should understand whether our new
scheme indeed achieves our target A – whether it makes the modified method globally converging.
(as always, λ_min(A) and λ_max(A) denote the minimal and the maximal eigenvalues of a symmetric matrix A). Thus, “uniform descentness” simply means that the matrices A_t never become “too large” (their maximal eigenvalues are bounded away from infinity) and never become “almost degenerate” (their minimal eigenvalues are bounded away from zero).
2) the scheme requires also symmetry of A_t, but here we have no problems: since f is from the very beginning assumed to be three times continuously differentiable, its Hessians for sure are symmetric

Proposition. Let f be such that the level set
S = {x ∈ R^n | f(x) ≤ f(x_0)}
is bounded. Assume that f is minimized by a uniformly descent Variable Metric method started at x0 ,
and assume that the line search used in the method is either the exact one-dimensional line search, or
the Armijo-terminated one. Then the trajectory of the method is bounded, the objective is non-increasing
along the trajectory, and all limiting points of the trajectory (which for sure exist, since the trajectory is
bounded) belong to the set
X ∗∗ = {x | ∇f (x) = 0}
of critical points of f .
Proof is similar to the one of Theorem 6.1.1 and is therefore omitted. It is an excellent (non-obligatory!)
exercise to restore the proof and to understand what is the role of the “uniform descentness”.
A straightforward implementation computes the spectral decomposition of the current Hessian H_t = ∇²f(x_t),
H_t = U_t D_t U_t^T (7.2.6)
(U_t is orthogonal, D_t is diagonal), and compares the eigenvalues of H_t – i.e., the diagonal entries of D_t – with a once for ever chosen “threshold” δ > 0. If all the eigenvalues are ≥ δ, we qualify H_t as “well positive definite” and use it as A_{t+1}. If some of the eigenvalues of H_t are < δ (e.g., are negative), we set
A_{t+1} = U_t D̄_t U_t^T,
where D̄_t is the diagonal matrix obtained from D_t by replacing the diagonal entries smaller than δ by δ.
Another way to “correct” Ht is to replace the negative diagonal values in Dt by their absolute values
(and then to replace by δ those diagonal entries, if any, which are less than δ).
Both indicated strategies result in
λmin (At ) ≥ δ, t = 1, 2, ...,
and never increase “significantly” the norm of the Hessian:
λmax (At ) ≤ max [|λmax (Ht )|, |λmin (Ht )|, δ] ;
as a result, the associated modified Newton method turns out to be uniformly descent (and thus globally
converging), provided that the level set of f associated with the starting point is bounded (so that the
Hessians along the trajectory are uniformly bounded). A drawback of the approach is its relatively large
computational cost: to find spectral decomposition (7.2.6) to machine precision, it normally requires
between 2n3 and 4n3 arithmetic operations, n being the row size of Ht (i.e., the design dimension of the
optimization problem). As we shall see in a while, this is, in a sense, too much.
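In code, the spectral correction is a few lines; the following Python sketch implements the first of the two strategies (the second differs in a single line, indicated in a comment).

    import numpy as np

    def corrected_newton_direction(g, H, delta=1e-6):
        # spectral decomposition (7.2.6): H = U diag(mu) U^T -- this is the O(n^3) part
        mu, U = np.linalg.eigh(H)
        mu_bar = np.maximum(mu, delta)            # lift eigenvalues below delta up to delta
        # second strategy instead: mu_bar = np.maximum(np.abs(mu), delta)
        # direction d = -A^{-1} g with A = U diag(mu_bar) U^T
        return -U @ ((U.T @ g) / mu_bar)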
It is known from Linear Algebra that a symmetric n × n matrix A is positive definite if and only if it admits the factorization
A = LDL^T, (7.2.8)
where L is a lower triangular matrix with unit diagonal and D is a diagonal matrix with positive diagonal entries.
The Choleski Factorization is an algorithm which computes the factors L and D in decom-
position (7.2.8), if such a decomposition exists. The algorithm is given by the recurrence
d_j = a_jj − Σ_{s=1}^{j−1} d_s l_js², (7.2.9)
l_ij = (1/d_j) ( a_ij − Σ_{s=1}^{j−1} d_s l_js l_is ),  j ≤ i ≤ n, (7.2.10)
(dj is j-th diagonal entry of D, lij and aij are the entries of L and A, respectively). The
indicated recurrence allows to compute D and L, if they exist, in
C_n = (1 + o(1)) n³/6
arithmetic operations (o(1) → 0, n → ∞), in a numerically stable manner. Note that L is computed in the “column by column” fashion: the order of computations is
d_1 → l_{1,1}, l_{2,1}, ..., l_{n,1} → d_2 → l_{2,2}, l_{3,2}, ..., l_{n,2} → d_3 → l_{3,3}, l_{4,3}, ..., l_{n,3} → ... → d_n → l_{n,n};
if the right hand side in (7.2.9) turns out to be nonpositive for some j, this indicates that A
is not positive definite.
The main advantage of the Choleski factorization is not only its ability to check in C_n computations whether A is positive definite, but also to provide, as a byproduct, the factorization (7.2.8) of a positive definite A. With this factorization, we can immediately solve the linear system
Ax = b;
the solution x to the system can be identified by backsubstitutions – sequential solving, for
u, v and x, two triangular and one diagonal systems
Lu = b; Dv = u; LT x = v,
and this computation requires only O(n2 ) arithmetic operations. The resulting method –
Choleski decomposition with subsequent backsubstitutions (the square root method) – is
thought to be the most efficient in the operation count and most numerically stable Lin-
ear Algebra routine for solving linear systems with general-type symmetric positive definite
matrices.
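For completeness, here is a direct, didactic Python transcription of the recurrences (7.2.9)-(7.2.10) and of the backsubstitutions; production codes would rely on optimized LAPACK routines instead.

    import numpy as np

    def ldl_factor(A):
        # LDL^T factorization via (7.2.9)-(7.2.10); raises if A is not positive definite
        n = A.shape[0]
        L, d = np.eye(n), np.zeros(n)
        for j in range(n):
            d[j] = A[j, j] - np.sum(d[:j] * L[j, :j] ** 2)                        # (7.2.9)
            if d[j] <= 0:
                raise np.linalg.LinAlgError("A is not positive definite")
            for i in range(j + 1, n):
                L[i, j] = (A[i, j] - np.sum(d[:j] * L[j, :j] * L[i, :j])) / d[j]  # (7.2.10)
        return L, d

    def ldl_solve(L, d, b):
        # backsubstitutions L u = b, D v = u, L^T x = v -- O(n^2) operations in total
        n = len(b)
        u = np.zeros(n)
        for i in range(n):                       # forward substitution
            u[i] = b[i] - L[i, :i] @ u[:i]
        v = u / d
        x = np.zeros(n)
        for i in reversed(range(n)):             # back substitution
            x[i] = v[i] - L[i + 1:, i] @ x[i + 1:]
        return x

    A = np.array([[4.0, 2.0], [2.0, 3.0]])
    L, d = ldl_factor(A)
    print(ldl_solve(L, d, np.array([1.0, 2.0])))   # solves A x = b: [-0.125, 0.75]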
7.3 Conjugate Gradient methods

In the Newton method we act as if we were believing that the objective is a quadratic form, and apply direct Linear Algebra tools to find, under this assumption, its minimizer – namely, we form and solve with respect to x the linear system
(N)  ∇²f(x_t)(x − x_t) = −∇f(x_t),
which, for the case of quadratic (and convex) f would give us the minimizer of f . In the non-quadratic
case the resulting x, of course, does not minimize f , and we treat it as our new iterate. The drawbacks
of the scheme we are trying to eliminate are that we need second-order information on f to assemble the
Newton system (N ) and should solve the system in order to get x.
In the Conjugate Gradient scheme we also act as if we were believing that the objective is quadratic,
but instead of direct forming and solving (N ) we solve the system by an iterative method. It turns out
that one can choose this method in such a way that it
• (i) does not involve explicitly the matrix of the system; all operations are described in terms of the
values and the first order derivatives of f at subsequent iterates;
• (ii) solves the system exactly in no more than n steps, n being the dimension of x.
Since the iterative method for solving (N ) we use is described in terms of the values and the gradients
of f , it formally can be applied to an arbitrary smooth objective; and if the objective turns out to be
convex quadratic form, the method, by (ii), will find its minimizer in at most n steps. In view of the
latter property of the method we could expect that if the objective is “nearly quadratic”, then n steps
of the method, although not resulting in exact minimizer of f , give us a much better approximation
to the minimizer than the point the method is started from. Choosing this approximation as our new
starting point and performing n steps of the method more, we may hope for new significant reduction
in inaccuracy, and so on. Of course, all these hopes are under the assumption that the objective is
“nearly quadratic”; but this indeed will be eventually our case, if the method will converge, and this
latter property will indeed be ensured by our construction.
It is clear from the outlined general idea that the “main ingredient” of the scheme is certain iterative
method for minimizing convex quadratic forms, and we start with the description of this method.
Let f(x) = (1/2) x^T Hx − b^T x be a strongly convex quadratic form (7.3.1) (H is symmetric positive definite), let x_0 be a starting point, and let g_0 = ∇f(x_0) = Hx_0 − b. The Krylov vectors associated with (f, x_0) are the vectors g_0, Hg_0, H²g_0, ..., and the Krylov subspaces are
E_t = Lin{g_0, Hg_0, ..., H^{t−1}g_0},
so that E_t is the linear span of the first t Krylov vectors. It is easily seen that
• The Krylov subspaces grow with t:
E1 ⊂ E2 ⊂ E3 ⊂ ...
• Let k ≥ 0 be the first value of t such that the first t Krylov vectors are linearly dependent. Then
the inclusion
Et ⊂ Et+1
is strict for t < k − 1 and is equality for t ≥ k − 1.
Indeed, there is nothing to prove if k = 1 (it is possible if and only if g0 = 0), so that
let us assume that k > 1 (i.e., that g_0 ≠ 0). When t < k − 1, the dimensions of E_t and
Et+1 clearly are t, t + 1, respectively, so that the inclusion Et ⊂ Et+1 is strict. Now,
the vectors g0 , Hg0 , ..., H k−2 g0 are linearly independent, and if we add to this family
the vector H k−1 g0 , we get a linearly dependent set; it follows that the vector we add is
a linear combination of the vectors g_0, Hg_0, ..., H^{k−2}g_0:
H^{k−1}g_0 = λ_0 g_0 + λ_1 Hg_0 + ... + λ_{k−2} H^{k−2}g_0.
Multiplying this relation by H, H², H³, ..., we see that the t-th Krylov vector, starting with t = k, is a linear combination of the previous Krylov vectors, whence (by induction) it is also a linear combination of the first k − 1 of these vectors. Thus, E_t = E_{k−1} whenever t ≥ k.
Now consider the affine sets
Ft = x0 + Et ,
and let x_t be the minimizer of the quadratic form f on F_t. By definition, the trajectory of the Conjugate Gradient method as applied to f, x_0 being the starting point, is the sequence x_0, x_1, ..., x_{k−1}. We are
about to prove that
• xk−1 is the global minimizer of f ;
• there exists explicit recurrence which allows to build sequentially the points x1 , ..., xk−1 .
Algorithm 7.3.1 [Conjugate Gradient method for minimizing the strongly convex quadratic form f(x) = (1/2) x^T Hx − b^T x]
Initialization: choose a starting point x_0; compute g_0 = ∇f(x_0) = Hx_0 − b, set d_0 = −g_0 and set t = 1.
Step t: if gt−1 ≡ ∇f (xt−1 ) = 0, terminate, xt−1 being the result. Otherwise set
[new iterate]
x_t = x_{t−1} + γ_t d_{t−1},  γ_t = − g_{t−1}^T d_{t−1} / (d_{t−1}^T Hd_{t−1}), (7.3.4)
[new gradient]
gt = ∇f (xt ) ≡ Hxt − b, (7.3.5)
[new direction]
d_t = −g_t + β_t d_{t−1},  β_t = g_t^T Hd_{t−1} / (d_{t−1}^T Hd_{t−1}), (7.3.6)
replace t with t + 1 and loop.
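For reference, here is a direct Python transcription of the algorithm, with γ_t and β_t computed exactly by (7.3.4)-(7.3.6):

    import numpy as np

    def cg_quadratic(H, b, x0, tol=1e-12, max_iter=None):
        # Algorithm 7.3.1 for f(x) = 0.5 x^T H x - b^T x, H symmetric positive definite
        x = np.array(x0, dtype=float)
        g = H @ x - b                          # g_0 = grad f(x_0)
        d = -g                                 # d_0 = -g_0
        for _ in range(max_iter or len(b)):
            if np.linalg.norm(g) <= tol:       # g_{t-1} = 0: terminate
                break
            Hd = H @ d
            gamma = -(g @ d) / (d @ Hd)        # (7.3.4)
            x = x + gamma * d
            g = H @ x - b                      # (7.3.5)
            beta = (g @ Hd) / (d @ Hd)         # (7.3.6)
            d = -g + beta * d
        return x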
We are about to prove that the presented algorithm is a Conjugate Direction method:
(vii) The algorithm terminates no later than after n steps with the result being the exact minimizer
of f .
Proof.
1°. We first prove (i)-(iv) by induction on t.
Base t = 1. We should prove that if the algorithm does not terminate at the first step (i.e., if g_0 ≠ 0), then (i_1) - (iv_1) are valid. The only statement which should be proved is (iv_1),
and to prove it is the same as to prove that g_1 = ∇f(x_1) is orthogonal to E_1. This is an immediate corollary of the following
Lemma 7.3.1 The stepsize γ_t given by (7.3.4) is the exact line search stepsize: x_t = x_{t−1} + γ_t d_{t−1} minimizes f on the line {x_{t−1} + γ d_{t−1} | γ ∈ R}; in particular, g_t = ∇f(x_t) is orthogonal to d_{t−1}.
Step t ↦ t + 1. Assume that (i_s) - (iv_s) are valid for s ≤ t and that the algorithm does not terminate at the step t + 1, i.e., that g_t ≠ 0, and let us derive from this assumption (i_{t+1}) - (iv_{t+1}).
1°. From (iv_s), s ≤ t, we know that g_s is orthogonal to E_s, and from (ii_s) and (i_s), the subspace E_s is the linear span of the vectors d_0, ..., d_{s−1}, same as it is the linear span of the vectors g_0, ..., g_{s−1}; we conclude that g_s is orthogonal to the vectors d_0, ..., d_{s−1}:
g_s^T d_i = 0,  0 ≤ i < s ≤ t. (7.3.12)
Besides this, from x_t = x_{t−1} + γ_t d_{t−1} we get g_t = Hx_t − b = g_{t−1} + γ_t Hd_{t−1}.
By (i_{t−1}) and (ii_{t−1}), both g_{t−1} and d_{t−1} belong to Lin{g_0, Hg_0, ..., H^{t−1}g_0}, so that the above computation demonstrates that g_t ∈ Lin{g_0, Hg_0, ..., H^t g_0}, which combined with (i_t) means that
Lin{g_0, ..., g_t} ⊂ Lin{g_0, Hg_0, ..., H^t g_0} = E_{t+1}.
Since g_0, ..., g_t are nonzero and mutually orthogonal (see (7.3.13)), the left hand side subspace in the latter inclusion is of dimension t + 1, while the right hand side subspace is of dimension at most t + 1; a (t + 1)-dimensional linear subspace can be enclosed into a linear subspace of dimension ≤ t + 1 only if the subspaces are equal, and we come to (i_{t+1}).
To prove (ii_{t+1}), note that by (7.3.6)
d_t = −g_t + β_t d_{t−1};
both right hand side vectors are from E_{t+1} = Lin{g_0, Hg_0, ..., H^t g_0} (by (ii_t) and the already proved (i_{t+1})), so that d_t ∈ Lin{g_0, Hg_0, ..., H^t g_0}; combining this observation with (ii_t), we come to
Lin{d_0, ..., d_t} ⊂ Lin{g_0, Hg_0, ..., H^t g_0}. (7.3.14)
Besides this, d_t ≠ 0 (indeed, g_t is nonzero and is orthogonal to d_{t−1} by (7.3.12)). Now let us prove that d_t is H-orthogonal to d_0, ..., d_{t−1}. From the formula for d_t we have
d_t^T Hd_s = −g_t^T Hd_s + β_t d_{t−1}^T Hd_s. (7.3.15)
When s = t − 1, the right hand side is 0 by definition of βt ; and when s < t − 1, both terms in
the right hand side are zero. Indeed, the first term is zero due to (iit ): this relation implies
that
Hd_s ∈ Lin{Hg_0, H²g_0, ..., H^s g_0},
and the right hand side subspace, due to (i_{s+1}) (s < t − 1, so that we can use (i_{s+1})!), is contained in the linear span of the gradients g_0, ..., g_{s+1}, and g_t is orthogonal to all these gradients by virtue of (7.3.13) (recall that s < t − 1). The second term in the right hand side of (7.3.15) vanishes because of (iii_t).
Thus, the right hand side in (7.3.15) is zero for all s < t; in other words, we have proved
(iiit+1 ) – the vectors d0 , ..., dt indeed are H-orthogonal. We already know that dt 6= 0;
consequently, in view of (iit ), all d0 , ..., dt are nonzero. Since, as we already know, these
vectors are H-orthogonal, they are linearly independent4) . Consequently, inclusion in (7.3.14)
is in fact equality, and we have proved (iit+1 ).
It remains to prove (iv_{t+1}). We have x_{t+1} = x_t + γ_{t+1} d_t, whence
g_{t+1} = g_t + γ_{t+1} Hd_t.
We should prove that g_{t+1} is orthogonal to E_{t+1}, i.e., due to the already proved (ii_{t+1}), is orthogonal to every d_i, i ≤ t. Orthogonality of g_{t+1} and d_t is given by Lemma 7.3.1, so that we should verify that g_{t+1}^T d_i = 0 for i < t. This is immediate:
g_{t+1}^T d_i = g_t^T d_i + γ_{t+1} d_t^T Hd_i;
the first term in the right hand side is zero due to (7.3.12), and the second term is zero due to the already proved (iii_{t+1}).
The inductive step is completed.
2°. To prove (v), note that if t > 1 then, by (7.3.6),
−g_{t−1}^T d_{t−1} = g_{t−1}^T g_{t−1} − β_{t−1} g_{t−1}^T d_{t−2},
and the second term in the right hand side is 0 due to (7.3.12); thus, −g_{t−1}^T d_{t−1} = g_{t−1}^T g_{t−1} for t > 1. For t = 1 this relation also is valid (since, by the initialization rule, d_0 = −g_0). Substituting the equality −g_{t−1}^T d_{t−1} = g_{t−1}^T g_{t−1} into the formula for γ_t from (7.3.4), we get (7.3.10).
To prove (vi), note that g_t^T g_{t−1} = 0 (see (7.3.13)). Besides this, from (7.3.4) we have
Hd_{t−1} = (1/γ_t) [g_t − g_{t−1}]
(note that γ_t > 0 due to (v) and (i_t), so that we indeed can rewrite (7.3.4) in the desired way); taking the inner product with g_t, we get
g_t^T Hd_{t−1} = (1/γ_t) g_t^T g_t = (d_{t−1}^T Hd_{t−1} / g_{t−1}^T g_{t−1}) g_t^T g_t
(we have used (7.3.10)); substituting the result into (7.3.6), we come to (7.3.11).
3°. It remains to prove (vii). This is immediate: as we already know, if the method does not terminate at step t, i.e., if g_{t−1} ≠ 0, then the vectors g_0, ..., g_{t−1} are mutually orthogonal nonzero (and, consequently, linearly independent) vectors which form a basis in E_t; since there cannot be more than k − 1 ≤ n linearly independent vectors in E_t, k being the smallest t such that the Krylov vectors g_0, Hg_0, ..., H^{t−1}g_0 are linearly dependent, we see that the method terminates in no more than k ≤ n steps. And since it terminates only when g_t = ∇f(x_t) = 0, the result of the method indeed is the global minimizer of f.
4) Indeed, we should verify that if
Σ_{i=0}^t λ_i d_i = 0,
then λ_i = 0, i = 0, ..., t. Multiplying the equality by H and then taking the inner product of the result with d_i, we get
λ_i d_i^T Hd_i = 0
(the terms d_j^T Hd_i with j ≠ i are zero due to the H-orthogonality of d_0, ..., d_t), whence λ_i = 0 (since d_i^T Hd_i > 0 due to d_i ≠ 0 and to the positive definiteness of H). Thus, all coefficients λ_i indeed are zeros.
where x0 is the starting point, x∗ is the exact minimizer of f , I is the unit matrix, and Pk
is the family of all polynomials of degree ≤ k.
5) please do not think that the problem in question can be solved by a straightforward application of the CG: the influence of rounding errors makes the actually computed gradients very far from being mutually orthogonal!
Corollary 7.3.1 Let Σ be the spectrum (the set of distinct eigenvalues) of H, and let x_t be the t-th point of the trajectory of CG as applied to f. Then
E(x_t) ≤ min_{q∈P_t*} max_{λ∈Σ} q²(λ) · E(x_0), (7.3.20)
where P_t* is the set of all polynomials q(z) of degree ≤ t equal to 1 at z = 0, and E(x) = f(x) − min_y f(y) is the residual in terms of the objective; besides this,
E(x_t) ≤ (1/2) min_{q∈P_t*} max_{λ∈Σ} λ q²(λ) · |x_0 − x*|². (7.3.21)
applying this identity to the polynomial r(z) = z(1 − zp(z))², p ∈ P_{t−1}, we get
(x_0 − x*)^T H[1 − Hp(H)]²(x_0 − x*) = [Σ_{i=1}^n λ_i(1 − λ_i p(λ_i))² s_i e_i]^T [Σ_{i=1}^n s_i e_i] = Σ_{i=1}^n (1 − λ_i p(λ_i))² λ_i s_i² (7.3.22)
[see (7.3.19)]
≤ 2 max_{λ∈Σ} (1 − λp(λ))² · E(x_0).
When p runs through Pt−1 , the polynomial q(z) = 1 − zp(z) clearly runs through the entire
Pt∗ , and (7.3.20) follows.
Relation (7.3.21) is proved similarly; the only difference is that instead of bounding from above the right hand side of (7.3.22) by the quantity S, we bound the expression by
max_{λ∈Σ} λ(1 − λp(λ))² Σ_{i=1}^n s_i² = max_{λ∈Σ} λ(1 − λp(λ))² · |x_0 − x*|².
Corollary 7.3.1 provides us with a lot of information on the rate of convergence of the
Conjugate Gradient method:
• A. Rate of convergence in terms of Condition number It can be proved that for any
segment ∆ = [l, L], 0 < l < L < ∞, and for any positive integer s there exists a
polynomial q_s ∈ P_s* with
max_{λ∈Δ} q_s²(λ) ≤ 4 ((√Q − 1)/(√Q + 1))^{2s},  Q = L/l.
Combining (7.3.20) and the just indicated result, where one should substitute Δ = [λ_min(H), λ_max(H)], we get
E(x_N) ≤ 4 ((√Q_H − 1)/(√Q_H + 1))^{2N} E(x_0),  N = 1, 2, ..., (7.3.23)
where
Q_H = λ_max(H)/λ_min(H)
is the condition number of the matrix H.
We came to a non-asymptotical linear rate of convergence with the convergence ratio
((√Q_H − 1)/(√Q_H + 1))²;
for large Q_H, this ratio is of the form 1 − O(1)/√Q_H, so that the number of steps required to improve the initial inaccuracy by a given factor ε is, independently of the value of n, bounded from above by O(1) √Q_H ln(1/ε). The number of required steps is
proportional to the square root of the condition number of the Hessian, while for the
Steepest Descent in the quadratic case similar quantity is proportional to the condition
number itself (see Lecture 3); this indeed is a great difference!
• B. Rate of convergence in terms of |x0 − x∗ | It can be proved also that for any L > 0
and any integer s > 0 there exists a polynomial rs ∈ Ps∗ such that
max_{0≤λ≤L} λ r_s²(λ) ≤ L/(2s + 1)².
Since Σ for sure is contained in the segment [0, L = |H|], |H| being the norm of the matrix H, we can use the just indicated result and (7.3.21) to get the non-asymptotical (and independent of the condition number of H) sublinear efficiency estimate
E(x_N) ≤ (|H| |x_0 − x*|²) / (2 (2N + 1)²),  N = 1, 2, ... (7.3.24)
This result resembles sublinear global efficiency estimate for Gradient Descent as ap-
plied to a convex C1,1 function (Lecture 9, Proposition 6.1.2); note that a convex
quadratic form (7.3.1) indeed is a C1,1 function with Lipschitz constant of gradient
equal to |H|. As compared to the indicated result about Gradient Descent, where the convergence was with the rate O(1/N), the convergence given by (7.3.24) is better in order – O(1/N²) instead of O(1/N).
• C. Rate of convergence in terms of the spectrum of H The above results established
rate of convergence of CG in terms of bounds – both lower and upper or only the upper
one – on the eigenvalues of H. If we take into account details of the distribution of the
eigenvalues, more detailed information on convergence rate can be obtained. Let
λmax (H) ≡ λ1 > λ2 > ... > λm ≡ λmin (H)
be the distinct eigenvalues of H written down in descending order. For every k and s, let
π_k(z) = Π_{i=1}^k (1 − z/λ_i) ∈ P_k*,
and let q_{s,k} ∈ P_s* be the polynomial given by the result of item A as applied to the segment Δ = [λ_m, λ_{k+1}]; then
max_{λ∈Σ} [π_k(λ) q_{s,k}(λ)]² ≤ 4 ((√Q_{k,H} − 1)/(√Q_{k,H} + 1))^{2s},  Q_{k,H} = λ_{k+1}/λ_m
(indeed, π_k vanishes on the spectrum of H to the right of λ_{k+1} and is in absolute value ≤ 1 between zero and λ_{k+1}, while q_{s,k} satisfies the required bound on the part of Σ to the left of λ_{k+1}). Besides this, π_k q_{s,k} ∈ P*_{k+s}. Consequently, by (7.3.20)
E(x_N) ≤ 4 min_{1≤s≤N} ((√Q_{N−s,H} − 1)/(√Q_{N−s,H} + 1))^{2s} E(x_0)
– this is an extension of the estimate (7.3.23) (this latter estimate corresponds to the
case when we eliminate the outer min and set s = N in the inner brackets, which
results in Q0,H = QH ). We see also that if N ≥ m, m being the number of distinct
eigenvalues in H, then E(xN ) = 0 (set q = πm in (7.3.20)); thus, in fact
CG finds the exact minimizer of f in at most as many steps as there are distinct eigenvalues of the matrix H.
Taking into account the details of the spectrum of H, one can strengthen the estimate
of item B as well.
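A quick numerical illustration of the latter fact, reusing the cg_quadratic sketch above (the data are hypothetical):

    import numpy as np
    rng = np.random.default_rng(0)

    # a 30x30 symmetric positive definite H with only 3 distinct eigenvalues: 1, 2, 5
    Q, _ = np.linalg.qr(rng.standard_normal((30, 30)))
    H = Q @ np.diag(np.repeat([1.0, 2.0, 5.0], 10)) @ Q.T
    b = rng.standard_normal(30)

    x3 = cg_quadratic(H, b, np.zeros(30), max_iter=3)
    print(np.linalg.norm(H @ x3 - b))   # ~ 1e-13: three steps already solve Hx = b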
The most important applications of CG are solving linear systems
Hx = b
with symmetric positive definite matrices H. At each step, the only “nontrivial” action is the single matrix-vector multiplication needed to compute Hd_{t−1}; all remaining actions at a step are simple – taking inner products and linear combinations of vectors, and all these actions together cost O(n) arithmetic operations. Thus, the arithmetic cost of a step in CG (same as that one for Steepest Descent) is
O(n) + [the cost of a single matrix-vector multiplication].
This is a very important fact. It demonstrates that sparsity of H – relatively small number
N (N ≪ n²) of nonzero entries – can be immediately utilized by CG. Indeed, in the “dense” case matrix-vector multiplication costs 2n² multiplications and additions, and this is the principal term in the arithmetic cost of a step of the method; in the sparse case this principal term reduces to 2N ≪ 2n².
Large-scale linear systems of equations typically have matrices H which either are extremely sparse (something like 0.01%–1% of nonzero entries), or are not sparse themselves, but are products of two sparse matrices (“implicit sparsity”; e.g., the least squares matrices arising in Tomography); in this latter case matrix-vector multiplications are as cheap as if H itself were sparse. If the size of the matrix is large enough (tens of thousands; in Tomography people deal with sizes of order of 10⁵–10⁶) and no sparsity – explicit or “implicit” – is present, then,
typically, there are no ways to solve the system. Now, if the matrix of the system is sparse
and the pattern of the nonzero entries is good enough, one can solve the system by a kind
of Choleski decomposition or Gauss elimination, both the methods being modified in order
to work with sparse data and not to destroy sparsity in course of their work. If the matrix
is large and sparsity is not “well-structured” or is “implicit”, the direct methods of Linear
Algebra are unable to solve the system, and all we can do is to use an iterative method, like
Steepest Descent or CG. Here we exploit the main advantage of an iterative method based
on matrix-vector multiplications – cheap step and modest memory requirements.
The indicated advantages of iterative methods are shared by both Steepest Descent and
Conjugate Gradient. But there is an important argument in favour of CG – its better rate
of convergence. In fact, the Conjugate Gradient algorithm possesses the best, in certain
exact sense, rate of convergence an iterative method (i.e., the one based on matrix-vector
multiplications) may have.
These are the main advantages of the CG – simplicity and theoretical optimality among the
iterative methods for quadratic minimization. And the main disadvantage of the method is
7.3. CONJUGATE GRADIENT METHODS 209
its sensitivity to the condition number of the matrix of the system – although less than the
one for Steepest Descent (see item A in the discussion above), but still rather unpleasant.
Theoretically, all bad we could expect of an ill-conditioned H is that the convergence of CG
will be slow, but after n steps (as always, n is the size of H) the method should magically
bring us the exact solution. The influence of rounding errors makes this attractive picture
absolutely unrealistic. Even with moderate condition number of H, the method will not find
exact solution in n steps, but will come rather close to it; and with large condition number,
the n-th approximate solution can be even worse than the initial one. Therefore when people, for some reason, are interested in solving a moderate-size linear system by CG, they allow the method to run 2n, 4n or so steps (I am speaking about “moderate size” systems, since for large-scale ones 2n or 4n steps of CG simply cannot be carried out in reasonable time).
The conclusion here should be as follows: if you are solving a linear system with symmetric
positive definite matrix and the size of the system is such that direct Linear Algebra methods
– like Choleski decomposition – can be run in reasonable time, it is better to use these direct
methods, since they are much more numerically stable and less sensitive to the conditioning
of the system than the iterative methods. It makes sense to solve the system by CG only
when the direct methods cannot be used, and in this case your chances to solve the problem heavily depend on whether you can exploit explicit or implicit sparsity of the matrix in question, and especially on how well-conditioned the matrix is.
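To illustrate the “cheap step and modest memory” point (assuming SciPy is available; the matrices below are hypothetical demo data):

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import cg, LinearOperator

    n = 10_000
    # explicitly sparse SPD matrix: ~3 nonzeros per row, eigenvalues in [0.5, 4.5]
    H = sp.diags([-1.0, 2.5, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
    b = np.ones(n)
    x, info = cg(H, b)                 # each iteration: one cheap matvec plus O(n) work
    print(info, np.linalg.norm(H @ x - b))

    # "implicit sparsity": H = A^T A given only a sparse A, via a LinearOperator
    A = sp.eye(n, format="csr") + 0.1 * sp.random(n, n, density=1e-4, format="csr", random_state=0)
    op = LinearOperator((n, n), matvec=lambda v: A.T @ (A @ v))
    x2, info2 = cg(op, b)              # H itself is never formed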
Looking at the recurrence of Algorithm 7.3.1,
(A): d_0 = −g_0;
(B): x_t = x_{t−1} + γ_t d_{t−1},  γ_t = −g_{t−1}^T d_{t−1}/(d_{t−1}^T Hd_{t−1});
(C): g_t = ∇f(x_t);
(D): d_t = −g_t + β_t d_{t−1},  β_t = g_t^T Hd_{t−1}/(d_{t−1}^T Hd_{t−1});
we see that the method, written in this form, “almost ignores” the quadratic nature of f: the matrix H is involved only in the formulae for the scalars
γt (the stepsize) and βt (the coefficient in the updating formulae for the search directions). If we were
able to eliminate the presence of H completely and to describe the process in terms of f and ∇f only,
we would get a recurrence CG∗ which, formally, could be applied to an arbitrary objective f , and in
the case of strongly convex quadratic objective would become our basic method CG. This latter method
solves quadratic problem exactly in n steps; since close to a nondegenerate local minimizer x∗ a general
smooth objective f is very similar to a strongly convex quadratic one fq , we could hope that CG∗ applied
to f and started close to x∗ would “significantly” reduce inaccuracy in n steps. Now we could again
apply to f n steps of the same routine CG∗ , but with the starting point given by the first n steps, again
hopefully significantly reducing inaccuracy, and so on. If, besides this, we were clever enough to ensure
global convergence of the indicated “cyclic” routine, we would get a globally converging method with
good asymptotical behaviour.
This is the idea, and now let us look how to implement it. First of all, we should eliminate H in
formulae for the method in the quadratic case. It is easy to do it with the formula for the stepsize γt .
Indeed, we know from Lemma 7.3.1 that in the strongly convex quadratic case gt is orthogonal to dt−1 ,
so that xt is the minimizer of f along the line passing through xt−1 in the direction dt−1 . Thus, we may
replace (B) by equivalent, in the case of strongly convex quadratic f , rule
(B′): x_t is the minimizer of f on the ray {x_{t−1} + γ d_{t−1} | γ ≥ 0}, given by exact line search. To eliminate H from (D), it suffices to use the expression (7.3.11) for β_t:
β_t = g_t^T g_t / (g_{t−1}^T g_{t−1}).
Algorithm 7.3.2 [Fletcher-Reeves Conjugate Gradient method for minimization of a general-type func-
tion f over Rn ]
Initialization: choose arbitrary starting point x0 . Set cycle counter k = 1.
Cycle k:
Initialization of the cycle: given the starting point x_0 of the cycle, compute
g_0 = ∇f(x_0),  d_0 = −g_0.
Inner steps t = 1, ..., n: perform exact line search from x_{t−1} in the direction d_{t−1}, thus getting x_t; compute g_t = ∇f(x_t) and set d_t = −g_t + β_t d_{t−1}, β_t being given by (7.3.25). After n inner steps, take x_n as the starting point of cycle k + 1, replace k with k + 1 and loop.
The Fletcher-Reeves algorithm is not the only extension of the quadratic Conjugate Gradient algo-
rithm onto non-quadratic case. There are many other ways to eliminate H from Algorithm 7.3.1, and
each of them gives rise to a non-quadratic version of CG. For example, one can rewrite the relation
β_t = g_t^T g_t / (g_{t−1}^T g_{t−1}) (7.3.25)
equivalently as
β_t = g_t^T (g_t − g_{t−1}) / (g_{t−1}^T g_{t−1}) (7.3.26)
(as we remember from the proof of Theorem 7.3.1, in the quadratic case gt is orthogonal to gt−1 , so
that both equations for βt are in this case equivalent). When we replace the formula for βt in the
Fletcher-Reeves method by (7.3.26), we again obtain a method (the Polak-Ribiere one) for unconstrained
minimization of smooth general-type functions, and this method also becomes the quadratic CG in the
case of quadratic objective. It should be stressed that in the nonquadratic case the Polak-Ribiere method
differs from the Fletcher-Reeves one, since relations (7.3.25) and (7.3.26) are equivalent only in the case
of quadratic f .
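A minimal Python sketch of the cycle scheme with both choices of β_t follows; the exact line search prescribed above is replaced, for simplicity of the sketch, with a crude Armijo-type backtracking, which is an assumption of the sketch rather than the text's prescription.

    import numpy as np

    def cg_cycles(f, grad_f, x0, n_cycles=50, variant="FR", tol=1e-8):
        # cycle-restarted nonquadratic CG (Fletcher-Reeves / Polak-Ribiere)
        x = np.array(x0, dtype=float)
        n = x.size
        for _ in range(n_cycles):
            g = grad_f(x)
            d = -g                                  # cycle starts with a Steepest Descent step
            for _ in range(n):                      # n inner steps, then restart
                if np.linalg.norm(g) <= tol:
                    return x
                if g @ d >= 0:                      # safeguard: restart with steepest descent
                    d = -g
                gamma, fx = 1.0, f(x)
                while f(x + gamma * d) > fx + 1e-4 * gamma * (g @ d):
                    gamma *= 0.5                    # backtracking line search
                x = x + gamma * d
                g_new = grad_f(x)
                if variant == "FR":
                    beta = (g_new @ g_new) / (g @ g)          # (7.3.25)
                else:
                    beta = (g_new @ (g_new - g)) / (g @ g)    # (7.3.26), Polak-Ribiere
                d = -g_new + beta * d
                g = g_new
        return x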
The proof, basically, repeats the one for the Steepest Descent, and I shall only sketch it. The
first observation is that the objective never increases along the trajectory, since all steps of
the method are based on precise line search. In particular, the trajectory never leaves the
compact set S.
Now, the crucial observation is that the first step of cycle k is the usual Steepest Descent
step from xk and therefore it “significantly” decreases f , provided that the gradient of f at
xk is not very small 6) . Since the subsequent inter-cycle steps do not increase the objective,
we conclude that the sum of progresses in the objective values at the Steepest Descent steps
starting the cycles is bounded from above (by the initial residual in terms of the objective).
Consequently, these progresses tend to zero as k → ∞, and due to the aforementioned relation
f(x_k) − f(x_{k+1}) ≥ ψ(|∇f(x_k)|),
we conclude that ∇f(x_k) → 0, k → ∞. Thus, any limiting point of the sequence {x_k} is a
critical point of f .
The actual justification of non-quadratic extensions of the Conjugate Gradient method is the following
proposition which says that the property “finite n-step convergence” of the method in the quadratic case
transforms into the property of “n-step quadratic convergence” in the nonquadratic one:
Proposition 7.3.3 [Asymptotical “n-step quadratic” convergence of the Fletcher-Reeves and Polak-
Ribiere methods in nonquadratic nondegenerate case]
Let f : Rn → R be three times continuously differentiable function, and let x∗ be a nondegenerate local
minimizer of f , so that ∇f (x∗ ) = 0 and ∇2 f (x∗ ) is positive definite. Assume that f is minimized by
the Fletcher-Reeves or Polak-Ribiere versions of the Conjugate Gradient algorithm, and assume that the
sequence {xk } of points starting the cycles of the algorithm converges to x∗ . Then the sequence {xk }
converges to x∗ quadratically:
|xk+1 − x∗ | ≤ C|xk − x∗ |2
We should stress that the quadratic convergence indicated in the theorem is not the quadratic convergence
of the subsequent search points generated by the method: in the Proposition we speak about “squaring
the distance to x∗ ” in n steps of the method, not after each step, and for large n this “n-step quadratic
convergence” is not that attractive.
6) all is going on within a compact set S where f is continuously differentiable; therefore for x ∈ S we have
f(x + h) ≤ f(x) + h^T ∇f(x) + |h| ρ(|h|)
with a remainder ρ(s) → 0, s → 0, independent of x ∈ S. Consequently, a properly chosen step from a point x ∈ S in the anti-gradient direction indeed decreases f at least by ψ(|∇f(x)|), with some function ψ(s) which is positive and nondecreasing on the ray s > 0.
7.4 Quasi-Newton Methods

The quasi-Newton methods are Variable Metric methods of the generic form
x_{t+1} = x_t + γ_{t+1} d_{t+1},  d_{t+1} = −S_{t+1} ∇f(x_t),
where S_{t+1} are symmetric positive definite matrices (in the notation of Algorithm 7.2.1, S_{t+1} = A_{t+1}^{−1}). As about the stepsizes γ_{t+1}, they are given by a kind of line search in the direction d_{t+1}; this direction for sure is a descent direction of f at x_t when ∇f(x_t) ≠ 0:
∇f(x_t) ≠ 0 ⇒ d_{t+1}^T ∇f(x_t) ≡ −(∇f(x_t))^T S_{t+1} ∇f(x_t) < 0 (7.4.2)
In the quasi-Newton methods we use another approach to implement the same idea: we compute the
matrices St+1 recursively without explicit usage of the Hessians of the objective and inverting these
Hessians. The recurrence defining St+1 is aimed to ensure, at least in good cases, that
which, basically, means that the method asymptotically becomes close to the Newton one and therefore
quickly converges.
The problem is how to ensure (7.4.3) at a “cheap” computational cost.
Thus,
pt ≡ xt − xt−1 ≡ −γt St gt−1 , (7.4.4)
γt being the stepsize given by linesearch. The first-order information on f at the points xt and xt−1
allows us to define the vectors
qt ≡ gt − gt−1 . (7.4.5)
If the step pt is small in norm (as it should be the case at the final stage) and f is twice continuously
differentiable (which we assume from now on), then
qt ≡ gt − gt−1 ≡ ∇f (xt ) − ∇f (xt−1 ) ≈ [∇2 f (xt−1 )](xt − xt−1 ) ≡ [∇2 f (xt−1 )]pt ; (7.4.6)
in view of (7.4.6), it is natural to select S_{t+1} satisfying the relation
S_{t+1} q_t = p_t
(for a strongly convex quadratic f, this is exactly the relation satisfied by the inverse Hessian [∇²f]^{−1}).
This relation, of course, can be satisfied by infinitely many updatings S_t ↦ S_{t+1}, and there are different
“natural” ways to specify this updating. When the policy of the updatings is fixed, we get a particular
version of the generic Variable Metric algorithm, up to the freedom in the linesearch tactics. We shall
always assume in what follows that this latter issue – which linesearch to use – is resolved in favour of
the exact linesearch. Thus, the generic algorithm we are going to consider is as follows:
Algorithm 7.4.1 [generic quasi-Newton method with exact line search]
Initialization: choose a starting point x_0 and a symmetric positive definite matrix S_1; set t = 1.
Step t: given x_{t−1} and S_t,
• set
d_t = −S_t g_{t−1},  g_{t−1} = ∇f(x_{t−1});
• perform exact line search from xt−1 in the direction dt , thus getting new iterate
xt = xt−1 + γt dt ;
• (U): update S_t into a positive definite symmetric matrix S_{t+1}, maintaining the relation
S_{t+1} q_t = p_t, (7.4.7)
and replace t with t + 1.
When specifying the only “degree of freedom” in the presented generic algorithm, i.e., the rules for (U), our targets are at least the following:
• (A) the updating should maintain symmetry and positive definiteness of the matrices S_t, thus keeping the method within the Variable Metric scheme with descent search directions;
• (B) in the case of strongly convex quadratic f the matrices S_t should converge (ideally – in a finite number of steps) to the inverse Hessian [∇²f]^{−1}.
The first requirement is the standard requirement for Variable Metric algorithms; it was motivated in
Section 7.2.1. The second requirement comes from our final goal – to ensure, at least in good cases (when
the trajectory converges to a nondegenerate local minimizer of f ), that St+1 − [∇2 f (xt )]−1 → 0, t → ∞;
as we remember, this property underlies fast – superlinear – asymptotical convergence of the algorithm.
Implementing (U) in a way which ensures the indicated properties and incorporating, in addition,
certain policies which make the algorithm globally converging (e.g., restarts, which we already have used
in non-quadratic extensions of the Conjugate Gradient method), we will come to globally converging
methods with nice, in good cases, asymptotical convergence properties.
In the rest of the Lecture we focus on two frequently used forms of updating (U)
7.4.3 Implementations
7.4.3.1 Davidon-Fletcher-Powell method
In this method, (U) is given by
S_{t+1} = S_t + (1/(p_t^T q_t)) p_t p_t^T − (1/(q_t^T S_t q_t)) S_t q_t q_t^T S_t. (7.4.8)
We are about to demonstrate that (7.4.8) is well-defined, results in positive definite St+1 and maintains
(7.4.7).
Proposition 7.4.1 Let R be a positive definite symmetric matrix, and let p and q be two vectors such
that
pT q > 0. (7.4.9)
Then the matrix
R′ = R + (1/(p^T q)) pp^T − (1/(q^T Rq)) Rqq^T R (7.4.10)
is symmetric positive definite and satisfies the relation
R′q = p. (7.4.11)
Proof. The relation (7.4.11) is given by a direct computation:
R′q = Rq + (1/(p^T q)) (p^T q) p − (1/(q^T Rq)) (q^T Rq) Rq = Rq + p − Rq = p.
Further, for every x ≠ 0 we have
x^T R′x = x^T Rx + (x^T p)²/(p^T q) − (x^T Rq)²/(q^T Rq) = [(a^T a)(b^T b) − (a^T b)²]/(b^T b) + (x^T p)²/(p^T q),
where a = R^{1/2}x and b = R^{1/2}q.
Since, by assumption, pT q > 0, the second fraction is nonnegative. The first one also is
nonnegative by Cauchy’s inequality. Thus, xT R0 x ≥ 0, while we need to prove that this
quantity is positive. To this end it suffices to verify that both numerators in the right hand
side of the latter relation cannot vanish simultaneously. Indeed, the first numerator can
vanish only if a is proportional to b (this is said by Cauchy’s inequality: it becomes equality
only when the vectors in question are proportional). If a is proportional to b, then x is
proportional to q (see the origin of a and b). But if x = sq for some nonzero (x is nonzero!)
s, then the second numerator is s2 (q T p)2 , and we know that q T p is positive.
Proposition 7.4.2 Let at a step t of Algorithm 7.4.1 with (U) given by (7.4.8) the matrix St be positive
definite and g_{t−1} = ∇f(x_{t−1}) ≠ 0 (so that x_{t−1} is not a critical point of f, and the step t indeed should
be performed). Then St+1 is well-defined, is positive definite and satisfies (7.4.7).
7.4. QUASI-NEWTON METHODS 215
Proof. It suffices to verify that q_t^T p_t > 0; then we would be able to get all we need from Proposition 7.4.1 (applied with R = S_t, q = q_t, p = p_t).
Since g_{t−1} ≠ 0 and S_t is positive definite, the direction d_t = −S_t g_{t−1} is a descent direction of f at x_{t−1} (see (7.4.2)). Consequently, the exact linesearch results in a nonzero stepsize, and p_t = x_t − x_{t−1} = γ_t d_t ≠ 0. We have
q_t^T p_t = (g_t − g_{t−1})^T p_t = γ_t g_t^T d_t − γ_t g_{t−1}^T d_t
[since x_t is a minimizer of f on the ray {x_{t−1} + γd_t | γ > 0} and therefore g_t is orthogonal to the direction d_t of the ray]
= −γ_t g_{t−1}^T d_t = γ_t g_{t−1}^T S_t g_{t−1} > 0
Now we have two updating formulae – (I) and (II); they transform a positive definite matrix S_t into positive definite matrices S_{t+1}^{BFGS} and S_{t+1}^{DFP}, respectively, satisfying (*). Since (*) is linear in S_{t+1}, any convex combination of the resulting matrices also satisfies (*). Thus, we come to the Broyden implementation of (U) given by
S_{t+1}^φ = (1 − φ) S_{t+1}^{DFP} + φ S_{t+1}^{BFGS}; (7.4.12)
here φ ∈ [0, 1] is the parameter specifying the updating. Note that the particular case of φ = 0 corresponds
to the Davidon-Fletcher-Powell method.
One can verify by a direct computation that
S_{t+1}^φ = S_{t+1}^{DFP} + φ v_{t+1} v_{t+1}^T,  v_{t+1} = (q_t^T S_t q_t)^{1/2} [ (1/(p_t^T q_t)) p_t − (1/(q_t^T S_t q_t)) S_t q_t ]. (7.4.13)
From the considerations which led us to (7.4.12), we get the following
Corollary 7.4.1 A Broyden method, i.e., Algorithm 7.4.1 with (U) given by (7.4.12), φ ∈ [0, 1] being
the parameter of the method (which may vary from iteration to iteration), is a quasi-Newton method: it
maintains symmetry and positive definiteness of the matrices St and ensures (7.4.7).
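In code, the updates are one-liners. The sketch below implements (II), i.e., (7.4.8), and the Broyden combination (7.4.12); formula (I) did not survive in the text above, and the bfgs_update below is the standard BFGS inverse-Hessian update, which is what (I) presumably displays.

    import numpy as np

    def dfp_update(S, p, q):
        # (7.4.8): the Davidon-Fletcher-Powell update
        Sq = S @ q
        return S + np.outer(p, p) / (p @ q) - np.outer(Sq, Sq) / (q @ Sq)

    def bfgs_update(S, p, q):
        # standard BFGS inverse-Hessian update (assumed content of formula (I))
        pq, Sq = p @ q, S @ q
        return (S + (1.0 + (q @ Sq) / pq) * np.outer(p, p) / pq
                  - (np.outer(p, Sq) + np.outer(Sq, p)) / pq)

    def broyden_update(S, p, q, phi=1.0):
        # (7.4.12): phi = 0 gives DFP, phi = 1 gives BFGS
        return (1.0 - phi) * dfp_update(S, p, q) + phi * bfgs_update(S, p, q)

    # sanity check of the quasi-Newton relation (7.4.7): S_{t+1} q_t = p_t
    rng = np.random.default_rng(1)
    p, q = rng.standard_normal(4), rng.standard_normal(4)
    if p @ q <= 0:                     # (7.4.9) must hold for the check to make sense
        q = -q
    print(np.allclose(broyden_update(np.eye(4), p, q, phi=0.37) @ q, p))   # True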
It can be proved that, as applied to a strongly convex quadratic form f , the Broyden method minimizes the
form exactly in no more than n steps, n being the dimension of the design vector, and if S0 is proportional
to the unit matrix, then the trajectory of the method on f is exactly the one of the Conjugate Gradient
method.
There is the following remarkable fact:
all Broyden methods, independently of the choice of the parameter φ, being started from the same pair (x_0, S_1), equipped with the same exact line search and being applied to the same problem, generate the same sequence of iterates (although not the same sequence of matrices S_t!).
Thus, in the case of exact line search all methods from the Broyden family are the same. The methods
become different only when inexact line search is used; although our theoretical considerations assume
exact line search, inexact search is what is always used in actual computations.
Broyden methods are thought to be practically the most efficient versions of the Conjugate Gradient
and quasi-Newton methods. After intensive numerical testing of different policies of tuning the parameter
φ, it was found that the best one is the simplest policy φ ≡ 1, i.e., the pure Broyden-Fletcher-Goldfarb-Shanno
method.
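To make the updating formulae concrete, here is a small numerical sketch (Python with numpy; the function names are mine, not part of the lecture) of the DFP and BFGS updates of St and of their Broyden combination (7.4.12); the BFGS update is written in the standard product form. The secant relation (7.4.7), S_{t+1} q_t = p_t, can then be checked directly:

    import numpy as np

    def dfp_update(S, p, q):
        # DFP: S+ = S + p p^T/(p^T q) - (S q)(S q)^T/(q^T S q)
        Sq = S @ q
        return S + np.outer(p, p) / (p @ q) - np.outer(Sq, Sq) / (q @ Sq)

    def bfgs_update(S, p, q):
        # BFGS update of S, in the standard product form
        rho = 1.0 / (p @ q)
        V = np.eye(len(p)) - rho * np.outer(p, q)
        return V @ S @ V.T + rho * np.outer(p, p)

    def broyden_update(S, p, q, phi):
        # Broyden family (7.4.12): convex combination of DFP and BFGS
        return (1 - phi) * dfp_update(S, p, q) + phi * bfgs_update(S, p, q)

    # sanity check of S+ q = p (requires q^T p > 0, cf. Proposition 7.4.2)
    rng = np.random.default_rng(0)
    S, p = np.eye(5), rng.standard_normal(5)
    q = p + 0.1 * rng.standard_normal(5)      # q^T p > 0 for this data
    assert np.allclose(broyden_update(S, p, q, phi=0.5) @ q, p)

Both ingredients satisfy the secant relation exactly, so every convex combination does as well – which is precisely the linearity argument used above.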
Remark 7.4.1 Practical versions of the Broyden methods.
In practical versions of Broyden methods, exact line search is replaced with inexact one.
Besides the standard requirements of “significant decrease” of the objective in course of
the line search (like those given by the Armijo test), here we meet with specific additional
requirement: the line search should ensure the relation

p_t^T q_t > 0.    (7.4.14)

In view of Proposition 7.4.1 this relation ensures that the updating formulae (I) and (II)
(and, consequently, the final formula (7.4.12) with φ ∈ [0, 1]) maintain positive definiteness
of St ’s and relation (7.4.7), i.e., the properties crucial for the quasi-Newton methods.
Relation (7.4.14) is ensured by the exact line search (we know this from the proof of Propo-
sition 7.4.2), but of course not only by it: the property is given by a strict inequality and
therefore, being valid for the stepsize given by the exact line search, is for sure valid if the
stepsize is close enough to the “exact” one.
Another important implementation issue is as follows. Under assumption (7.4.14), updating
(7.4.12) should maintain positive definiteness of the matrices St. In actual computations,
however, rounding errors may eventually destroy this crucial property, and when it happens,
the method may start behaving erratically. To overcome this difficulty, in good implemen-
tations people store and update from step to step not the matrices St themselves, but their
Choleski factors: lower-triangular matrices Ct such that St = Ct Ct^T, or, more frequently, the
Choleski factors Ct′ of the inverse matrices St^{−1}. Updating formula (7.4.12) implies certain
routines for the updatings Ct ↦ Ct+1 (respectively, Ct′ ↦ C′t+1), and these are the formulae in
fact used in the algorithms. The arithmetic cost of implementing such a routine is O(n²),
i.e., of the same order of complexity as the original formula (7.4.12); on the other hand,
it turns out that this scheme is much more stable with respect to rounding errors, as far as
the descent properties of the actually computed search directions are concerned.
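The notes do not spell out the factor-updating routines; as an illustration of the O(n²) idea, here is the classical rank-one Choleski update (a sketch, not taken from the lecture; a rank-two correction such as (7.4.12) would be handled by one such update followed by an analogous downdate):

    import numpy as np

    def chol_rank1_update(C, x):
        # Given lower-triangular C with S = C C^T, return the Choleski
        # factor of S + x x^T in O(n^2) operations (Givens-based routine)
        C, x = C.copy(), x.copy()
        n = len(x)
        for k in range(n):
            r = np.hypot(C[k, k], x[k])
            c, s = r / C[k, k], x[k] / C[k, k]
            C[k, k] = r
            C[k+1:, k] = (C[k+1:, k] + s * x[k+1:]) / c
            x[k+1:] = c * x[k+1:] - s * C[k+1:, k]
        return C

Working with C rather than with St itself keeps the represented matrix positive (semi)definite by construction: whatever rounding errors occur in C, the product C C^T cannot lose this property.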
Of course, there is no difficulty in proving global convergence for the scheme with restarts, where one
resets current St after every m steps to the initial (positive definite) value of the matrix; here m is certain
fixed “cycle duration” (compare with the non-quadratic Conjugate Gradient methods from the previous
Section). For the latter scheme, it is easy to prove that if the level set
L = {x | f (x) ≤ f (x0 )}
associated with the initial point is bounded, then the trajectory is bounded and all limiting points of the
sequence {x_{mk}}_{k=0}^∞ are critical points of f. The proof is quite similar to the one of Proposition 7.3.2: the
steps with indices 1 + mt are of the form

x_{mt} ↦ x_{mt+1} = x_{mt} − γ_{mt+1} S ∇f(x_{mt}),

S being a once for ever fixed symmetric positive definite matrix (the one to which St is reset at the
restart steps).
restart steps). Such a step decreases the objective “significantly”, provided that ∇f (xmk ) is not small7) ;
this property can be immediately derived (try to do it!) from positive definiteness of S, continuous
differentiability of the objective and boundedness of L. At the remaining steps the objective never
increases due to the linesearch, and we conclude that the sum over t of progresses f (xt−1 ) − f (xt ) in
the objective value is bounded. Thus, the progress at step t tends to 0 as t → ∞, and, consequently,
∇f (xmk ) → 0 as k → ∞ – otherwise, as it was indicated, the progresses at the restart steps could not
tend to 0 as k → ∞. Thus, ∇f (xmk ) → 0, k → ∞, so that every limiting point of the sequence {xmk }k
indeed is a critical point of f .
We are about to answer the question of the rate of convergence for the most frequently used BFGS method (recall that it is the
Broyden method with φ ≡ 1).
First of all, consider the scheme with restarts and assume, for the sake of simplicity, that m = n and
S = I, S being the matrix to which St is reset after every n steps. Besides this, assume that the objective
is smooth – three times continuously differentiable. In this case one can expect that the convergence of
the sequence {x_{mk}}_{k=0}^∞ to x∗ will be quadratic. Indeed, in the strongly convex quadratic case the BFGS
method with the initialization in question becomes the Conjugate Gradient method, so that in fact we
are speaking about a nonquadratic extension of the CG and can use the reasoning from the previous
section. Our expectations indeed turn out to be valid.
Anyhow, the “squaring the distance to x∗ in every n steps” is not that attractive a property, especially
when n is large. Moreover, the scheme with restarts is not too reasonable from the asymptotical point of
view: all our motivation of the quasi-Newton scheme came from the desire to ensure in a good case (when
the trajectory converges to a nondegenerate local minimizer of f ) the relation St − ∇2 f (xt ) → 0, t → ∞,
in order to make the method asymptotically similar to the Newton one; and in the scheme with restarts
we destroy this property when resetting St to the unit matrix at the restarting steps. To utilize the actual
potential of the quasi-Newton scheme, we should avoid restarts, at least asymptotically (at the beginning
they might become necessary to ensure global convergence).
It is very difficult to prove something good about local convergence of a quasi-Newton method ex-
ecuted in “continuous fashion” without restarts. There is, anyhow, the following remarkable result of
Powell (1976):
7)
here is the exact formulation: for every ε > 0 there exists δ > 0 such that if, for some k, |∇f(x_{mk})| ≥ ε, then
f(x_{mk+1}) ≤ f(x_{mk}) − δ
Theorem 7.4.1 Consider the Broyden-Fletcher-Goldfarb-Shanno method (i.e., the Broyden method with
φ ≡ 1) without restarts and assume that the method converges to a nondegenerate local minimizer x∗ of
a three times continuously differentiable function f . Then the method converges to x∗ superlinearly.
This result also has extensions to the “practical” – with properly specified inexact line search – versions
of the method.
The quasi-Newton relation (7.4.7), i.e.,

S_{t+1} q_t = p_t,

for the case of symmetric positive definite St+1, is equivalent to the relation

H_{t+1} p_t = q_t    (7.4.15)

with symmetric positive definite H_{t+1} ≡ S_{t+1}^{−1}. Thus,
each policy P of generating positive definite symmetric matrices St maintaining (7.4.7) induces certain
policy P ∗ of generating positive definite symmetric matrices Ht = St−1 maintaining (7.4.15),
and, of course, vice versa:
each policy P ∗ of generating positive definite symmetric matrices Ht maintaining (7.4.15) induces certain
policy of generating positive definite symmetric matrices St = Ht−1 maintaining (7.4.7).
Now, we know how to generate matrices Ht satisfying relation (7.4.15) – this is, basically, the same
problem as we already have studied, but with swapped qt and pt . Thus,
given any one of the above updating formulae for St , we can construct a “complementary” updating
formula for Ht by replacing, in the initial formula, all S’s by H’s and interchanging q’s and p’s.
For example, the complementary formula for the DFP updating scheme (7.4.8) is

H_{t+1} = H_t + (1/(q_t^T p_t)) q_t q_t^T − (1/(p_t^T H_t p_t)) H_t p_t p_t^T H_t;    (7.4.16)

it is called the Broyden-Fletcher-Goldfarb-Shanno updating of Ht.
We have the following analogy to Proposition 7.4.2:
Proposition 7.4.3 Let Ht be a positive definite symmetric matrix, let xt−1 be an arbitrary point with
gt−1 = ∇f(xt−1) ≠ 0, and let dt be an arbitrary descent direction of f at xt−1 (i.e., d_t^T g_{t−1} < 0). Let,
further, xt be the minimizer of f on the ray {xt−1 + γdt | γ ≥ 0}, and let

p_t = x_t − x_{t−1},   q_t = ∇f(x_t) − ∇f(x_{t−1}).

Then the matrix Ht+1 given by (7.4.16) is symmetric positive definite and satisfies (7.4.15).
Proof. From the proof of Proposition 7.4.2 we know that p_t^T q_t > 0, and it remains to apply Proposition
7.4.1 to the data R = Ht, p = qt, q = pt.
According to Proposition 7.4.3, we can look at (7.4.16) as at a certain policy P∗ of updating positive
definite symmetric matrices Ht which maintains (7.4.15). As we know, every policy of this type induces a
certain policy of updating positive definite matrices St = Ht^{−1} which maintains (7.4.7). In our case the
induced policy is given by

S_{t+1}^{−1} = S_t^{−1} + (1/(p_t^T q_t)) q_t q_t^T − (1/(p_t^T S_t^{−1} p_t)) S_t^{−1} p_t p_t^T S_t^{−1}.    (7.4.17)
It can be shown by a direct computation (which I skip) that the latter relation is nothing but the BFGS
updating formula.
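A quick numerical check of the complementarity (a sketch in numpy; the names are mine): the update (7.4.16) indeed maintains (7.4.15), and its inverse coincides with the product-form BFGS update of S = H^{−1}:

    import numpy as np

    def bfgs_update_H(H, p, q):
        # (7.4.16): H+ = H + q q^T/(q^T p) - (H p)(H p)^T/(p^T H p)
        Hp = H @ p
        return H + np.outer(q, q) / (q @ p) - np.outer(Hp, Hp) / (p @ Hp)

    rng = np.random.default_rng(1)
    H, p = np.eye(4), rng.standard_normal(4)
    q = p + 0.2 * rng.standard_normal(4)          # p^T q > 0 for this data
    H_plus = bfgs_update_H(H, p, q)
    assert np.allclose(H_plus @ p, q)             # the relation (7.4.15)

    # inverting H+ reproduces the BFGS update of S = H^{-1} (product form)
    rho = 1.0 / (p @ q)
    V = np.eye(4) - rho * np.outer(p, q)
    S_plus = V @ np.linalg.inv(H) @ V.T + rho * np.outer(p, p)
    assert np.allclose(np.linalg.inv(H_plus), S_plus)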
Lecture 8
Convex Programming
Today we start the last part of the Course – methods for solving constrained optimization problems.
This lecture is in a sense slightly special – we shall speak about Convex Programming problems. As we
remember, a Convex Programming problem is one of the form

f_0(x) → min  s.t.  f_i(x) ≤ 0, i = 1, ..., m,  x ∈ G ⊆ R^n,    (8.0.1)

where
• the objective f0 and the constraints fi , i = 1, ..., m, are convex functions on Rn ; for the sake of
simplicity, from now on we assume that the domains of these functions are the entire Rn .
Convex Programming is, in a sense, the “solvable case” in Nonlinear Optimization: as we shall see in
a while, convex programs, in contrast to the general ones, can be solved efficiently: one can approximate
their global solutions by methods which converge globally linearly, with a data-independent ratio. This
phenomenon – the possibility to approximate global solutions efficiently – has no analogy in the general
nonconvex case1) and comes from the nice geometry of convex programs.
Computational tools for Convex Programming form the most developed, both theoretically and algo-
rithmically, area in Continuous Optimization, and what follows is nothing but a brief introduction to
this rich area. In fact I shall speak about only one method – the Ellipsoid algorithm, since it is the
simplest way to achieve the main goal of the lecture – to demonstrate that Convex Programming indeed
is “solvable”. The practical conclusion of this essentially theoretical phenomenon is, in my opinion, very
important, and I would formulate it as follows:
When modelling a real-world situation as an optimization program, do your best to make the program
convex; I strongly believe that this is more important than to minimize the number of design variables or
to ensure smoothness of the objective and the constraints. Whenever you are able to cast the problem
as a convex program, you get at your disposal a wide spectrum of efficient and reliable optimization tools.
8.1 Preliminaries
Let me start with recalling two important notions of Convex Analysis – those of a subgradient of
a convex function and a separating plane.
1)
Recall that in all our previous considerations the best we could hope for was the convergence to a local
minimizer of the objective, and we never could guarantee this local minimizer to be the global one. Besides this,
basically all our rate-of-convergence results were asymptotical and depended on local properties of the objective
A subgradient of a convex function g : R^n → R at a point x is a vector g′(x) such that

g(y) ≥ g(x) + (y − x)^T g′(x)   ∀ y ∈ R^n.    (8.1.1)

Geometrically: the graph of the affine function g(x) + (y − x)^T g′(x) of y
is everywhere below the graph of g itself, and at the point x both graphs touch each other. Generally
speaking, a subgradient of g at a point x is not unique; it is unique if and only if g is differentiable at x,
and in this case the unique subgradient of g at x is exactly the gradient ∇g(x) of g at x.
When speaking about methods for solving (8.0.1), we always assume that we have at our disposal
a First order oracle, i.e., a routine which, given on input a point x ∈ R^n, returns on output the values
fi(x) and certain subgradients fi′(x), i = 0, ..., m, of the objective and the constraints at x.
The second notion is that of a separating plane. Given a closed convex set G ⊂ R^n and a point x outside
G, one can always find a hyperplane

Π = {y | e^T y = a}

such that x and G belong to the opposite open half-spaces into which Π splits R^n:

e^T x > a ≥ e^T y   ∀ y ∈ G;    (8.1.2)

such a vector e is called a separator of x and G.
When speaking about methods for solving (8.0.1), we always assume that we have at our disposal a
Separation oracle for G, i.e., a routine which, given on input a point x ∈ R^n, reports whether x ∈ G, and
if it is not the case, returns “the proof” of the relation x ∉ G, i.e., a vector e satisfying (8.1.2).
Remark 8.1.1 Implementation of the oracles. The assumption that when solving (8.0.1) we have at our
disposal a First order oracle for the objective and the constraints and a Separation oracle for the domain
G is crucial for our ability to solve (8.0.1) efficiently. Fortunately, this assumption indeed is valid in
all “normal” cases. Indeed, typical convex functions f arising in applications are either differentiable –
and then, as was already mentioned, subgradients are the same as gradients, so that they are available
whenever we have an “explicit” representation of the function – or are pointwise suprema of differentiable
convex functions:

f(x) = sup_{α∈A} f_α(x).
In the latter case we have no problems with computing subgradients of f in the discrete case – when the
index set A is finite. Indeed, it is immediately seen from the definition of a subgradient that if α(x) is the
index of the largest at x (or a largest, if there are several of them) of the functions fα, then the gradient
of fα(x) at the point x is a subgradient of f as well. Thus, if all fα are “explicitly given”, we can simply
compute all of them at x and choose one which is largest at the point; its value will be the value of f at x,
and its gradient at the point will be a subgradient of f at x.
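In code, such a First order oracle is a few lines (a sketch in Python; the data A, b defining the affine pieces are hypothetical, not from the lecture):

    import numpy as np

    # f(x) = max_j f_j(x) with affine pieces f_j(x) = a_j^T x + b_j
    A = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -3.0]])
    b = np.array([0.0, 1.0, -2.0])

    def first_order_oracle(x):
        vals = A @ x + b
        j = int(np.argmax(vals))      # index of a largest piece at x
        return vals[j], A[j]          # the value f(x) and a subgradient

    f_x, g_x = first_order_oracle(np.array([0.5, -0.5]))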
Now, what about the Separation oracle? In typical applications the domain G is given by a finite system
of convex inequalities:

G = {x ∈ R^n | g_j(x) ≤ 0, j = 1, ..., k},

e.g., G may be a Euclidean ball

G = {x | (x − a)^T (x − a) − R² ≤ 0}

or a box

G = {x | a_i ≤ x_i ≤ b_i, i = 1, ..., n}.

Whenever it is the case, we have no problems with the Separation oracle for G: given x, it suffices to
check whether all the inequalities describing G are satisfied at x. If it is the case, then x ∈ G, otherwise
g_j(x) > 0 for some j, and it is immediately seen that ∇g_j(x) can be chosen as a separator.
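A Separation oracle along these lines might look as follows (a sketch; by convexity of g_j, for every y ∈ G one has (y − x)^T ∇g_j(x) ≤ g_j(y) − g_j(x) < 0 whenever g_j(x) > 0, so the returned gradient indeed separates):

    import numpy as np

    def separation_oracle(x, gs, grads):
        # G = {x : g_j(x) <= 0 for all j}; gs[j], grads[j] evaluate g_j
        # and its gradient. Returns (True, None) if x is in G, otherwise
        # (False, e) with e = grad g_j(x) for some violated constraint
        for g, grad in zip(gs, grads):
            if g(x) > 0:
                return False, grad(x)
        return True, None

    # example: the Euclidean ball {x : (x - a)^T(x - a) - R^2 <= 0}
    a, R = np.zeros(3), 1.0
    inside, e = separation_oracle(1.5 * np.ones(3),
                                  [lambda x: (x - a) @ (x - a) - R**2],
                                  [lambda x: 2.0 * (x - a)])  # inside == False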
It should be stressed, anyhow, that not all convex programs admit “efficient” oracles. For example,
we may meet with a problem where the objective (or a constraint) is given by “continuous” maximization
like
f (x) = max fα (x)
α∈[0,1]
(“semi-infinite programs”); this situation occurs, e.g., in the problem of best uniform approximation of a given
function φ(α), 0 ≤ α ≤ 1, by a linear combination Σ_{i=1}^n x_i φ_i(α) of other given functions; the problem can
be written down as a “simple” convex program

min_x { f(x) ≡ max_{α∈[0,1]} |φ(α) − Σ_{i=1}^n x_i φ_i(α)| };
this is a convex program, but whether it can or cannot be solved efficiently depends heavily on whether
we know how to perform efficiently the maximization in α required to compute the value of f at a point.
8.2 The Ellipsoid method

We start with the problem

min_x { f(x) : x ∈ G }    (8.2.1)

of minimizing a convex function f on a solid G, i.e., a closed and bounded convex set with a nonempty
interior. In the mean time we shall see that the general problem (8.0.1) can be easily reduced to (8.2.1).
Let x1 be an interior point of G. Calling the First order oracle at x1, we get f′(x1), and the definition
(8.1.1) of a subgradient yields

f(x) ≥ f(x1) + (x − x1)^T f′(x1)   ∀ x.    (8.2.2)

It is possible that f′(x1) = 0; then (8.2.2) says that x1 is a global minimizer of f on R^n, and since this
global minimizer belongs to G, it is an optimal solution to the problem, and we can terminate. Now
assume that f 0 (x1 ) 6= 0 and let us ask ourselves what can be said about localization of the optimal set,
let it be called X ∗ . The answer is immediate:
If x ∈ G is such that the right hand side in (8.2.2) is positive at x – i.e., (x − x1)^T f′(x1) > 0 – then x for
sure is not optimal: (8.2.2) says that f(x) > f(x1), so that x is worse than the feasible solution x1.
Consequently, the optimal set (about which we initially knew only that X∗ ⊂ G) belongs to the new localizer
G1 = {x ∈ G | (x − x1 )T f 0 (x1 ) ≤ 0}.
This localizer again is a convex solid (as the intersection of G and a closed half-space with the boundary
passing through the interior point x1 of G) and is smaller than G (since an interior point x1 of G is a
boundary point of G1 ).
Thus, choosing somehow an interior point x1 in the “initial localizer of the optimal set” G ≡ G0 and
looking at f 0 (x1 ), we either terminate with exact solution, or can perform a cut – pass from G0 to a
smaller solid G1 which also contains the optimal set. In this latter case we can iterate the construction
– choose somehow an interior point x2 in G1 , compute f 0 (x2 ) and, in the case of f 0 (x2 ) 6= 0, perform a
new cut – replace G1 with the new localizer
G2 = {x ∈ G1 | (x − x2 )T f 0 (x2 ) ≤ 0},
and so on. With the resulting recurrence, we either terminate at certain step with exact solution, or
generate a sequence of shrinking solids
G = G0 ⊃ G1 ⊃ G2 ⊃ ...,
A (very nontrivial) geometrical fact is that with this Center of Gravity policy we get linear convergence
of the volumes of Gt to 0:

Vol(G_t) ≤ [1 − (n/(n+1))^n]^t Vol(G_0),
which in turn implies linear convergence in terms of the residual in the objective value: if x̄^t is the best
– with the smallest value of the objective – of the points x1, ..., xt, then

f(x̄^t) − min_G f ≤ [1 − (n/(n+1))^n]^{t/n} [max_G f − min_G f].    (8.2.3)
(8.2.3) demonstrates global linear convergence of the Center-of-Gravity method with an objective-independent
convergence ratio

κ(n) = [1 − (n/(n+1))^n]^{1/n} ≤ (1 − exp{−1})^{1/n}.
Consequently, to get an ε-solution to the problem – to find a point x̄^t ∈ G with

f(x̄^t) − min_G f ≤ ε [max_G f − min_G f]

– it suffices to perform

⌈ ln(1/ε) / ln(1/κ(n)) ⌉ ≤ 2.13 n ln(1/ε) + 1    (8.2.4)

steps of the method.
The Ellipsoid method implements the outlined cutting scheme with ellipsoidal localizers: assuming for a
moment that G0 = G is an ellipsoid, we take its center x1, perform the cut, embed the resulting half-
ellipsoid G_0^+ into the ellipsoid G1 of the smallest possible volume containing it, then take the center x2
of the ellipsoid G1 to perform a new cut, getting a new half-ellipsoid G_1^+ which covers
the optimal set, embed it into the ellipsoid G2 of the smallest possible volume, etc. In this scheme,
we actually make a kind of trade-off between efficiency of the routine and computational complexity of a
step: when extending the “actual localizers” – half-ellipsoids – to ellipsoids, we add points which for sure
could not be optimal solutions, and thus slow the procedure down. At the cost of this slowing the process
down we, anyhow, enable ourselves to deal with “simple sets” – ellipsoids, and thus reduce dramatically
the computational complexity of a step.
There are two points which should be clarified.
• first, it is unclear in advance whether we indeed are able to decrease at a linear rate the volumes of
sequential localizers and thus get a converging method – it could happen that, when extending the
half-ellipsoid G_{t+1}^+ to the ellipsoid G_{t+1} of the smallest volume containing G_{t+1}^+, we come back to
the previous ellipsoid Gt, so that no progress in volume is achieved. Fortunately, this is not the case:
the procedure reduces the volumes of the sequential ellipsoids Gt by a factor κ*(n) ≤ exp{−1/(2n)}, thus
enforcing the volumes of Gt to go to 0 at a linear rate with the ratio κ*(n) 3). This ratio is worse
than the one for the Center-of-Gravity method (there the ratio was at most the absolute constant
1 − exp{−1}, now it is a dimension-dependent constant close to 1 − 1/(2n)); but we still have linear
convergence!
• second, it was assumed that G is an ellipsoid. What to do if it is not so? The answer is easy: let
us choose, as G0 , an arbitrary ellipsoid which covers the domain G of the problem; such a G0 for
sure will be a localizer of the optimal set, and this is all we need. This answer is good, but there
is a weak point: it may happen that the center x1 of the ellipsoid G0 is outside G; how should we
perform the first cut in this case? Moreover, this “bad” situation – when the center xt+1 of the
current localizer Gt is outside G – may eventually occur even when G0 = G is an ellipsoid: at each
step of the method, we add something to the “half” of the previous localizer, and this something
could contain points not belonging to G. As a result, Gt , generally speaking, is not contained in
G, and it may happen that the center of Gt is outside G. And up to now we know how to
perform cuts only through points of G, not through arbitrary points of R^n.
We can immediately resolve the indicated difficulty. Given the previous ellipsoidal localizer Gt and
its center xt+1 , let us ask the Separation oracle whether xt+1 ∈ G. If it is the case, we have no
problems with the cut – we call the First order oracle, get a subgradient of f at xt+1 and use it
as explained above to produce the cut. Now, if xt+1 is not in G, the Separation oracle will
return a separator e:

e^T x_{t+1} > max_{x∈G} e^T x.

Consequently, every point x of G satisfies (x − x_{t+1})^T e < 0, and we may use e to perform
the cut, setting

G_{t+1}^+ = {x ∈ G_t | (x − x_{t+1})^T e ≤ 0};

since the half-space {x | (x − x_{t+1})^T e ≤ 0} contains G, the new localizer G_{t+1}^+ still
contains the optimal set.
After all explanations and remarks, we can pass to formal description of the Ellipsoid method.
3)
The fact that even after extending G_{t+1}^+ to G_{t+1} we still have progress in volume heavily depends on the
specific geometrical properties of ellipsoids; if, e.g., we tried to replace the ellipsoids with boxes, we would
fail to ensure the desired progress. The ellipsoids, anyhow, are not the only solids appropriate for our goals; we
could use simplexes as well, although with worse progress in volume per step
Algorithm 8.2.1 [The Ellipsoid algorithm for convex program min_x {f(x) : x ∈ G ⊂ R^n}]
Assumptions:
• G is a solid (bounded and closed convex set with a nonempty interior)
• we are given First order oracle for f and Separation oracle for G
• we are able to point out an ellipsoid G0 = E(c0 , B0 ) which contains G.
Initialization: set t = 1
Step t: Given ellipsoid Gt−1 = E(ct−1 , Bt−1 ), set xt = ct−1 and act as follows:
1) [Call the Separation Oracle] Call the Separation oracle, xt being the input. If the oracle says that
xt ∈ G, call the step t productive and go to 2), otherwise call the step t non-productive, set et equal to
the separator returned by the oracle:
x ∈ G ⇒ (x − xt )T et < 0
and go to 3)
2) [Call the First order oracle] Call the First order oracle, xt being the input. If f′(xt) = 0, terminate
– xt ∈ G is the minimizer of f on the entire R^n (see the definition (8.1.1) of a subgradient) and is
therefore an optimal solution to the problem. In the case of f′(xt) ≠ 0 set

et = f′(xt)

and go to 3).
3) [Update the ellipsoid] Update the ellipsoid E(ct−1, Bt−1) into E(ct, Bt) according to formula (8.2.7)
with e = et, i.e., set

B_t = α(n) B_{t−1} − γ(n) (B_{t−1} p_t) p_t^T,   c_t = c_{t−1} − (1/(n+1)) B_{t−1} p_t,

where

α(n) = (n²/(n²−1))^{1/2},   γ(n) = α(n) (1 − √((n−1)/(n+1))),   p_t = B_{t−1}^T e_t / √(e_t^T B_{t−1} B_{t−1}^T e_t).
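For the record, one step of the updating 3) takes a dozen lines (a numpy sketch based on the formulae as reconstructed above, with E(c, B) = {c + Bu : |u| ≤ 1} and n ≥ 2):

    import numpy as np

    def ellipsoid_step(c, B, e):
        # update E(c, B) after the cut e^T (x - c) <= 0
        n = len(c)
        alpha = n / np.sqrt(n**2 - 1.0)
        gamma = alpha * (1.0 - np.sqrt((n - 1.0) / (n + 1.0)))
        p = B.T @ e
        p /= np.linalg.norm(p)                 # p_t of the algorithm
        Bp = B @ p
        return c - Bp / (n + 1.0), alpha * B - gamma * np.outer(Bp, p)

The cost of a step is dominated by the two matrix-vector products and the rank-one correction, i.e., O(n²) arithmetic operations, plus the cost of the two oracle calls.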
Then, for any ε ∈ (0, 1), the approximate solution x̄_{N(ε)} found by the method in course of the first N(ε)
steps is well-defined and is an ε-solution to the problem, i.e., belongs to G and satisfies the inequality

f(x̄_{N(ε)}) − min_G f ≤ ε [max_G f − min_G f].    (8.2.8)
Proof. For the sake of brevity, let N = N(ε). We may assume that the method does not terminate in
course of the first N steps – otherwise there is nothing to prove: the only possibility for the method to
terminate is to find an exact optimal solution to the problem.
Let us fix ε′ ∈ (ε, 1), and let x∗ be an optimal solution to (8.2.1) (it exists, since the domain of the
problem is compact and the objective, being convex on R^n, is continuous – we had such a theorem in the
course Optimization I – and therefore attains its minimum on the compact set G).
1⁰. Set

G∗ = x∗ + ε′ (G − x∗),

so that G∗ is the image of G under the homothety transformation with the center at x∗ and the coefficient
ε′.
By construction, we have

Vol(G∗) = (ε′)^n Vol(G) > ε^n Vol(G)    (8.2.9)

(“a homothety with coefficient α > 0 in R^n multiplies volumes by α^n”). On the other hand, by Lemma
8.2.1 we have Vol(E(ct, Bt)) ≤ exp{−1/(2(n − 1))} Vol(E(ct−1, Bt−1)), whence

Vol(E(c_N, B_N)) ≤ exp{−N/(2(n−1))} Vol(E(c_0, B_0)) ≤
[definition of N = N(ε)]
≤ ε^n V^{−n} Vol(E(c_0, B_0)) =
[definition of V]
= ε^n (Vol(G)/Vol(E(c_0, B_0))) Vol(E(c_0, B_0)) = ε^n Vol(G).
Comparing the resulting inequality with (8.2.9), we conclude that Vol(E(cN, BN)) < Vol(G∗), so that

G∗ \ E(c_N, B_N) ≠ ∅.    (8.2.10)

2⁰. Let y be a point of the set G∗ \ E(cN, BN). I claim that

y ∈ G∗ ⊂ G  and  (y − x_t)^T e_t > 0 for certain t ≤ N.    (8.2.11)

The first relation in (8.2.11) follows from convexity of G: indeed,

G∗ = x∗ + ε′(G − x∗) = {x∗ + ε′(z − x∗) | z ∈ G},

and every point of the latter set, being a convex combination of two points of the convex set G, belongs
to G.
To prove the second relation in (8.2.11), note that from the first one y ∈ G ⊂ E(c0, B0), while by the
origin of y we have y ∉ E(cN, BN). Consequently, there exists t ≤ N such that

y ∈ E(c_{t−1}, B_{t−1})  and  y ∉ E(c_t, B_t).    (8.2.12)

According to our policy of updating the ellipsoids (see Lemma 8.2.1), E(ct, Bt) contains the half-ellipsoid

{x ∈ E(c_{t−1}, B_{t−1}) | (x − x_t)^T e_t ≤ 0};

this inclusion and (8.2.12) demonstrate that (y − xt)^T et > 0, as required in the second relation in (8.2.11).
3⁰. Now let us look what happens at the step t given by the second relation in (8.2.11). First of all, I
claim that the step t is productive: xt ∈ G. Indeed, otherwise, by construction of the method, et would
be a separator of xt and G:

(x − xt)^T et < 0  ∀ x ∈ G,

but this relation, as we know from (8.2.11), is violated at y ∈ G and therefore cannot take place.
Thus, t is productive, whence, by construction of the method, et = f′(xt). Now the second relation
in (8.2.11) reads

(y − xt)^T f′(xt) > 0,

whence, by definition of a subgradient, f(y) > f(xt). This inequality, along with productivity of the step
t and the definition of approximate solutions, says that x̄_N is well-defined and

f(x̄_N) ≤ f(x_t) < f(y).
4⁰. Now we are done. By the first relation in (8.2.11), y = x∗ + ε′(z − x∗) for some z ∈ G, and due to
convexity of f we have

f(y) ≤ (1 − ε′) f(x∗) + ε′ f(z) ≤ f(x∗) + ε′ [max_G f − min_G f],

whence

f(x̄_N) − min_G f < f(y) − f(x∗) ≤ ε′ [max_G f − min_G f].

The latter inequality is valid for all ε′ ∈ (ε, 1), and (8.2.8) follows.
Algorithm 8.2.2 [The Ellipsoid algorithm for convex program (8.0.1)]
Assumptions:
• Ĝ = {x ∈ G | fi(x) ≤ 0, i = 1, ..., m} is a solid (a bounded and closed convex set with a nonempty
interior)
• we are given a First order oracle for f0, ..., fm and a Separation oracle for G
• we are able to point out an ellipsoid G0 = E(c0, B0) which contains Ĝ.
Initialization: set t = 1
Step t: Given ellipsoid Gt−1 = E(ct−1 , Bt−1 ), set xt = ct−1 and act as follows:
1) [Call the Separation oracle] Call the Separation oracle for G, xt being the input. If the oracle says
that xt ∈ G, go to 2), otherwise call the step t non-productive, set et equal to the separator returned by
the oracle:
x ∈ G ⇒ (x − xt )T et < 0
and go to 4)
2) [Call the First order oracle] Call the First order oracle, xt being the input, and check whether
fi(xt) ≤ 0, i = 1, ..., m. If fi(xt) ≤ 0 for all i ≥ 1, call the step t productive and look at f0′(xt). If
f0′(xt) = 0, terminate – xt is feasible for (8.0.1) and is the minimizer of f0 on the entire R^n (see the
definition (8.1.1) of a subgradient), whence it is an optimal solution to the problem. In the case of
f0′(xt) ≠ 0 set

et = f0′(xt)

and go to 4).
3) [The case of xt ∈ G and fi(xt) > 0 for some i ≥ 1] Call the step t non-productive and find i ≥ 1 such
that fi(xt) > 0 (when we arrive at 3), such an i exists), set

et = fi′(xt)

and go to 4).
4) [Updating the ellipsoid] Update the ellipsoid E(ct−1, Bt−1) into E(ct, Bt) according to formula
(8.2.7) with e = et, i.e., set

B_t = α(n) B_{t−1} − γ(n) (B_{t−1} p_t) p_t^T,   c_t = c_{t−1} − (1/(n+1)) B_{t−1} p_t,

where

α(n) = (n²/(n²−1))^{1/2},   γ(n) = α(n) (1 − √((n−1)/(n+1))),   p_t = B_{t−1}^T e_t / √(e_t^T B_{t−1} B_{t−1}^T e_t).
Example 1. Linear Programming over Reals. Here the problem instances are of the form

min_x { f(x) = c^T x : x ∈ G = {x ∈ R^n | Ax ≤ b} },

A being an m × n matrix and b an m-dimensional vector. The data vector of a problem instance
is comprised of the pair n, m and of the n + m + nm entries of c, b, A written down in a once for ever
fixed order (e.g., first the n entries of c, then the m entries of b and finally the nm entries of A in
the row-wise order).
Example 2. Quadratically Constrained Quadratic Programming. Here the instances are
of the form

min_x f_0(x) = (1/2) x^T H_0 x − b_0^T x

subject to

x ∈ G = {x ∈ R^n | f_i(x) = (1/2) x^T H_i x − b_i^T x + c_i ≤ 0, i = 1, ..., m},

and the data vector is comprised of n, m and the entries of the matrices Hi, the vectors bi and the
reals ci written down in a once for ever fixed order.
The number of examples can be easily extended; basically, the typical families of problems in
question are comprised of problems of a “fixed generic analytical structure”, and the data vector is
the set of numerical coefficients of the (fixed by the description of the family) analytical expressions
corresponding to the particular instance in question.
In what follows we restrict ourselves to families comprised of solvable problems with bounded
feasible sets.
there are other definitions of an ε-solution, but let us restrict ourselves to the indicated one.
• A computational algorithm for P is a program for an idealized computer capable of performing
operations of exact real arithmetic (the four arithmetic operations and computation of elementary
functions like √·, sin(·), log(·), etc.). When solving a problem instance from P, the algorithm gets on
input the data vector of the instance and the required accuracy ε and starts to process these
data; after finitely many elementary steps the algorithm should terminate and return the vec-
tor xout of the corresponding dimension, which should be an ε-solution to the input instance.
The computational effort of solving the instance is, by definition, the total # of elementary steps
(arithmetic operations) performed in course of processing the instance.
The second way to formalize the complexity-related notions, the way which is the most widely used
in theoretical Computer Science and in Combinatorial Optimization, is given by
Algorithmic Complexity Model:
• A family of problems is a set of problems (8.2.1) such that a particular member p = (f, G) of the
family is encoded by an integer i(p); the # L of digits in this integer is the bit size of the instance
p.
Example 3. Integer programming. Here the problem instances are of the form

min_x { f(x) = c^T x : x ∈ G = {x ∈ Z^n | Ax ≤ b} },

where Z^n is the space of n-dimensional vectors with integer entries, c and b are vectors of dimensions
n and m with integer entries, and A is an m × n matrix with integer entries.
To encode the data (which form a collection of integers) as a single integer, one can act as follows:
first, write down the data as a finite sequence of binary integers (similarly to Example 1, where
the data were real);
second, encode the resulting sequence of binary integers as a single integer, e.g., by representing
- binary digit 0 as 00
- binary digit 1 as 11
- blank between integers as 01
- sign – at an integer as 10
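For instance, a small Python sketch of the encoding just described (the leading-zero subtlety of the integer 0 is ignored for brevity):

    def encode(ints):
        # digits 0/1 -> "00"/"11", blank between integers -> "01",
        # a minus sign -> "10"; the binary word is then read as one integer
        chunks = []
        for k in ints:
            word = "10" if k < 0 else ""
            word += "".join("11" if d == "1" else "00"
                            for d in bin(abs(k))[2:])
            chunks.append(word)
        return int("01".join(chunks), 2)

    code = encode([3, -1, 2])   # 3 -> "1111", -1 -> "1011", 2 -> "1100"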
Example 4. Linear Programming over Rationals. This is a subfamily of the Linear Pro-
gramming family (Example 1), where we impose on all the entries of c, b, A the requirement to be
rational numbers. To encode a problem instance by a single binary integer, it suffices to write the
data as a sequence of binary integers, same as in Example 3 (with the only difference that now
every rational element of the data is represented by two sequential integers, its numerator and
denominator), and then encode the sequence, as it was explained above, by a single binary integer.
In what follows, speaking about Algorithmic Complexity model, we always assume that
- the families in question are comprised of solvable problems with bounded feasible sets
- any problem instance from the family admits a solution which can be naturally encoded by a
binary integer
The second assumption is clearly valid for the Integer Programming (Example 3). It is also valid
for Linear Programming over Rationals (Example 4), since it can be easily proved that a solvable
LP program with integer data admits a solution with rational entries; and we already know that
any finite sequence of integers/rationals can be naturally encoded by a single integer.
• Normally, in Algorithmic Complexity model people are not interested in approximate solutions and
speak only about exact ones; consequently, no problem with measuring accuracy occurs.
• A computational algorithm is an algorithm in any of the (equivalent to each other) definitions given
in Mathematical Logic; you lose nothing when thinking of a program for an idealized computer
which is capable of storing in memory as many finite binary words as you need and of performing
bit-wise operations with these words. The algorithm as applied to a problem instance p from P
gets on input the code i(p) of the instance and starts to process it; after finitely many elementary
steps, the algorithm should terminate and return the code of an exact solution to the instance.
The computational effort is measured as the total # of elementary bit-wise operations performed
in course of the solution process.
Let me stress the difference between the notions of the size of an instance in the Real Arithmetic
and the Algorithmic Complexity models. In the first model, the size of an instance is, basically, the #
of real coefficients in the natural analytical representation of the instance, and the size is independent of
the numerical values of these coefficients – they may be arbitrary reals. For example, the Real Arithmetic
size of any LP program over Reals (Example 1) with n = 2 variables and m = 3 inequality constraints is
the dimension of the vector

(n, m, c, b, A),

i.e., 2 + n + m + nm = 13. In contrast to this, the Algorithmic Complexity size of an LP program over
Rationals (Example 4) with n = 2 variables and m = 3 constraints can be arbitrarily large, if the rational
data are “long”.
We can observe similar difference between the measures of computational effort in the Real Arithmetic
and the Algorithmic Complexity models. In the first of them, the effort required, say, to add two reals is
one, independently of what the reals are. In the second model, we in principle cannot deal with reals –
only with integers; and the effort to add two N -digit integers is not 1 – it is proportional to N .
The analogy of this definition for the Real Arithmetic Complexity model is as follows:
A solution algorithm A for a family P of problem instances is called R-polynomial, if the computational
effort required by A to solve a problem instance p of an arbitrary size l to an arbitrary accuracy ε ∈ (0, 1)
never exceeds

q(l) ln( V(p)/ε ),    (8.3.1)

where q is certain polynomial and V(p) is certain data-dependent quantity; here all complexity-related
notions are understood according to the Real Arithmetic Complexity model.
The definition of an R-polynomial algorithm admits a very natural interpretation. The quantity
ln(V(p)/ε) can be viewed as the # of accuracy digits in an ε-solution; with this interpretation, (8.3.1) means
that the arithmetic cost per accuracy digit is bounded from above by a polynomial of the dimension of
the data vector.
• imitate the First order oracle for f, i.e., compute f(x) and f′(x).
Then there exists an R-polynomial algorithm for P.
we have

N(ε) ≤ N′(ε) = ⌈ 2(n_p − 1) n_p ln( Vol^{1/n_p}(E(c_0, B_0)) / (ε v_p^{1/n_p}) ) ⌉

(see (i)); according to (i), we can compute N′(ε) in a polynomial in l_p number of operations and
terminate the process after N′(ε) steps, thus getting an ε-solution to p.
The overall arithmetic effort to find an ε-solution to p, in view of the above remarks, is bounded from
above by

r(l_p) N′(ε) ≤ q(l_p) ln( V(p)/ε ),   V(p) ≡ ( Vol(E(c_0, B_0)) / v_p )^{1/n_p},

both r(·) and q(·) being certain polynomials, so that the presented method indeed is R-polynomial.
When thinking of Theorem 8.3.1, you should take into account that the “unpleasant” assumptions
(i) are completely technical and normally can be ensured by a slight regularization of problem instances.
Assumption (ii) is satisfied in all “non-pathological” applications, so that in fact Theorem 8.3.1 can be
qualified as a General Theorem on R-Polynomial Solvability of Convex Programming. I should add that
this is, in a sense, an “existence theorem”: it claims that in Convex Programming there exists a “universal”
R-polynomial solution routine, but it does not say that this is the best possible R-polynomial routine for
all particular cases. The main practical drawback of the Ellipsoid method is that it is “slow” – the # of
iterations required to solve the problem within a once for ever fixed accuracy grows quadratically with
the dimension n of the problem. This quadratic in n growth of the effort makes it basically impossible
to use the method as a practical tool in dimensions like hundreds and thousands. Recently, for many
important families of Convex Programming problems (Linear, Quadratically Constrained Quadratic,
Geometric and some others) more specialized and more efficient interior point polynomial algorithms
were developed; these methods are capable of handling large-scale problems of real-world origin.
I would say that the Ellipsoid method, as a practical tool, is fine for small – up to 20-30 variables
– convex problems. The advantages of the method are:
• simplicity in implementation and good numerical stability
• universality
• low order of dependence of the computational effort on m – the # of functional constraints
The latter remark relates to Algorithm 8.2.2: the number of steps required to achieve a given
accuracy is independent of m, and the effort per step is at most proportional to m (the
only place where the number of constraints influences the effort is the necessity to check feasibility of
the current iterate and to point out a violated constraint, if any exists). As a result, for “high and
narrow” convex programs (say, with up to 20-30 variables and as many thousands of constraints as
you wish) the Ellipsoid method seems to be one of the best known Convex Programming routines.
knows what “efficiently” means in the above sentence. This is practical efficiency – when solving
an LP program

min_x { c^T x : Ax ≤ b }    (8.4.1)

with n variables and m > n inequality constraints, the Simplex method normally performs 2m - 4m iterations
to find the solution. However, theoretically the Simplex method is not a polynomial algorithm: since the beginning
of the sixties it is known that there exist simple LP programs pn with n = 1, 2, 3, ... variables and integer
data of the total bit size Ln = O(n²) such that some versions of the Simplex method solve pn in no less
than 2^n steps. Consequently, the aforementioned versions of the Simplex method are not polynomial –
for a polynomial algorithm, the solution time should be bounded by a polynomial of Ln = O(n²), i.e.,
by a polynomial of n. Similar “bad examples” were constructed for other versions of the Simplex method; and although
nobody was able to prove that no version of Simplex can be polynomial, nobody was lucky enough to point
out a polynomial version of the Simplex method. Moreover, since the mid-sixties, when the Algorithmic
Complexity approach became standard, and till 1979 nobody knew whether LP over Rationals (Example
4 above) is or is not polynomially solvable.
• the data in (8.4.4) are integer, and their bit size L2 is bounded by a polynomial (something like
200L²) of the bit size L of the initial data 4).
It follows that if we were able to solve in polynomial time the Feasibility problem (8.4.4), we would get,
as a byproduct, a polynomial algorithm for the initial problem (8.4.1).
Let

f(z) = max_{1≤i≤M} [(Rz − r)_i]

(M being the number of rows in R) be the residual function of the system of inequalities Rz ≤ r; this
function clearly is nonpositive at a nonnegative z if and only if z is a feasible solution to (8.4.4).
Consequently, (S) is exactly the following problem:

(S′): is f(z) ≤ 0 for some nonnegative z?

Lemma 8.4.1 The answer in (S′) is “yes” if and only if it is “yes” in the following problem:

(S″): is f(z) ≤ 0 for some z from the box {z | 0 ≤ z_i ≤ 2^{L_2}, 1 ≤ i ≤ N}?
Proof. If the answer in (S″) is “yes”, then, of course, the answer in (S′) also is “yes”. It remains
to prove, therefore, that if the answer in (S′) is “yes” (or, which is the same, if (S) is solvable),
then the answer in (S″) also is “yes”. This is easy. The solution set of (S), being nonempty, is a
nonempty closed convex polyhedral set (here and in what follows I use the standard terminology
of Convex Analysis; this terminology, along with all required facts, was presented in the course
Optimization I); since (S) involves nonnegativity constraints, this set does not contain lines, and
therefore, due to the well-known fact of Convex Analysis, possesses an extreme point z̄. From the
standard facts on the structure of extreme points of a polyhedral set given by (8.4.4) it is known
that the vector z∗ comprised of the nonzero entries of z̄, if such entries exist, satisfies a nonsingular
system of linear equations of the form

R̄ z∗ = r̄,

where R̄ is a square nonsingular submatrix of R and r̄ is the corresponding subvector of the right
hand side r.
4
the latter property comes from the fact that the common denominator of the entries in (8.4.3) is an integer
of bit size at most L1 ≤ 10L; therefore when passing from (8.4.3) to (8.4.4), we increase the bit size of each entry
by at most 10L. Since even the total bit size of the entries in (8.4.3) is at most 10L, the bit size of an entry in
(8.4.4) is at most 20L; and there are at most 10L entries. All our estimates are extremely rough, but it does not
matter – all we are interested in is to ensure polynomiality of the bit size of the transformed problem in the bit
size of the initial one
By Cramer's rule, the entries of z∗ are ratios of determinants:

z_i∗ = Δ_i / Δ_0,

Δ_0 being the determinant of R̄ and Δ_i being the determinant of the matrix obtained from R̄ by
replacing its i-th column with r̄ (all these matrices are submatrices of [R̄|r̄]).
All entries in this matrix are integers, and the total bit size of the entries does not exceed L2. It
follows that all the determinants are, in absolute value, at most 2^{L_2} 5). Thus, the numerators in
the Cramer formulae are ≤ 2^{L_2} in absolute value, while the denominator (being a nonzero integer)
is in absolute value ≥ 1. Consequently, |z_i∗| ≤ 2^{L_2}.
Thus, all nonzero entries in z̄ are ≤ 2^{L_2} in absolute value. Since z̄ is a solution to (S), it is
a point where f is nonpositive. We conclude that if the answer in (S) is “yes”, then f attains a
nonpositive value in the box 0 ≤ z_i ≤ 2^{L_2}, 1 ≤ i ≤ N, so that the answer in (S″) also is “yes”.
• (S″) as a convex program. To solve problem (S″) is exactly the same as to check whether the
optimal value in the optimization program

f∗ = min_z { f(z) : 0 ≤ z_i ≤ 2^{L_2}, 1 ≤ i ≤ N }    (8.4.5)

is or is not positive. The objective in the problem is an easily computable convex function (since
it is the maximum of M linear forms), and the domain G of the problem is a simple solid - a box.
Remark 8.1.1 explains how to imitate the First order and the Separation oracles for the problem;
we can immediately point out an initial ellipsoid which contains G (simply the Euclidean ball
circumscribed around the cube G). Thus, we are able to solve the problem by the Ellipsoid
method. From Theorem 8.2.1 (where one should estimate the quantities V and max_G f − min_G f via
L2; this is quite straightforward) it follows that in order to approximate the optimal value f∗ in
(8.4.5) within a prescribed absolute accuracy ν > 0, it suffices to perform
(8.4.5) within a prescribed absolute accuracy ν > 0, it suffices to perform
1
Nν = O(1)N 2 [L2 + ln ]
ν
steps with at most O(1)(M + N )N arithmetic operations per step, which gives totally
1
M(ν) = O(1)N 3 (M + N )[L2 + ln ] (8.4.6)
ν
arithmetic operations (O(1) here and in what follows are positive absolute constants).
All this looks fine, but in order to detect whether the optimal value f ∗ in (8.4.5) is or is not non-
positive (i.e., whether (S) is or is not solvable), we should distinguish between two “infinitesimally
close to each other” hypotheses f ∗ ≤ 0 and f ∗ > 0, which seems to require the exact value of f ∗ ;
5)
This is a consequence of the Hadamard inequality: the absolute value of a determinant (≡ the volume of the
parallelotope spanned by the rows of the determinant) does not exceed the product of the Euclidean lengths of
the rows of the determinant (the product of the edges of the parallelotope). Consequently, log₂|Δ_i| does not exceed
the sum of the binary logs of the Euclidean lengths of the rows of [R̄|r̄]. It remains to note that the binary logarithm
of the Euclidean length of an integer vector clearly does not exceed the total bit length of the vector:

(1/2) log₂(a₁² + ... + a_k²) ≤ (1/2) log₂[(1 + a₁²)(1 + a₂²)...(1 + a_k²)] = Σ_{i=1}^k (1/2) log₂[1 + a_i²] ≤ Σ_{i=1}^k log₂[1 + |a_i|],

and the latter expression clearly is ≤ the total # of binary digits in the integers a₁, ..., a_k.
and all we can do is to approximate “quickly” f ∗ to a prescribed accuracy ν > 0, not to find f ∗
exactly.
Fortunately, our two hypotheses are not infinitesimally close to each other – there is a “gap”
between them. Namely, it can be proved that
if the optimal value f∗ in (8.4.5) is positive, then it is not too small, namely f∗ ≥ 2^{−π(L_2)} with certain
polynomial π(·) 6).
The latter remark says that to distinguish between the cases f∗ ≤ 0 and f∗ > 0 means in fact to
distinguish between f∗ ≤ 0 and f∗ ≥ 2^{−π(L_2)}; and to this end it suffices to restore f∗ within an absolute
accuracy like ν = 2^{−π(L_2)−2}. According to (8.4.6), this requires O(1)N³(N + M)[L_2 + π(L_2)]
arithmetic operations, which is not more than a polynomial of L_2 and, consequently, of L (since
L_2 is bounded from above by a polynomial of L).
This is not exactly what we need: our complexity model counts not arithmetic, but bit-wise
operations. It turns out, anyhow (the verification is straightforward, although very dull), that one
can run the Ellipsoid method with inexact arithmetic instead of the exact one, rounding the
intermediate results to a polynomial in L number of accuracy digits, and this still allows us to restore
f∗ within the required accuracy. With the resulting inexact computations, the overall bit-wise effort
becomes polynomial in L, and the method becomes polynomial.
Thus, (S) is polynomially solvable.
6)
The optimal value in this problem is achieved at an extreme point, and this point, same as in Lemma 8.4.1, is
rational with not too large numerators and denominators of the entries. Consequently, the optimal value of t is
rational with a not too large numerator and denominator, and such a fraction, if positive, of course, is not too close
to 0.
(*0 ) into a new system (*1 ); this new system is solvable, and every solution to it is a solution to (*0 ) as
well, and the new system has one inequality less than the initial system.
Now let us apply the same trick to system (*1 ), trying to make the second inequality
P_2^T z ≤ p_2 of the initial system an equality; as a result, we will “kill” this second inequality – either turn
it into an equality, or eliminate it altogether – and all solutions to the resulting system (*2 ) (which for sure will be
solvable) will be solutions to (*1 ) and, consequently, to the initial system (*0 ).
Proceeding in this manner, we in M steps (M is the row size of P ) will “kill” all the inequalities
P z ≤ p - some of them will be eliminated, and some transformed into equalities. Now let us kill in
the same fashion the inequalities zi ≥ 0, i = 1, ..., N. As a result, in N more steps we shall “kill” all
inequalities in the original system (*0 ), including the nonnegativity ones, and shall end up with a system
of linear equations. According to our construction, the resulting system (*M +N ) will be solvable and
every solution to it will be a solution to (*0 ).
It follows that to get a solution to (*0 ) it remains to solve the resulting solvable system (*M +N ) of
linear equations by any standard Linear Algebra routine (all these routines are polynomial).
Note that the overall process requires solving N + M Solvability problems of, basically, the same bit
size as (8.4.4), and solving a system of linear equations, again of the same bit size as (8.4.4); thus, the
overall complexity is polynomial in the size of (8.4.4).
The proof of polynomial time solvability of LP over Rationals – which might look long and dull – in
fact uses absolutely simple and standard arguments; the only nonstandard – and the key one – argument
is the Ellipsoid method.
(P )
f(x) → min
s.t.
h_i(x) = 0, i = 1, ..., m,
g_j(x) ≤ 0, j = 1, ..., k
with x ∈ Rn (when saying “general type problems”, I mean not necessarily convex ones; Convex Pro-
gramming is another and much nicer story) can be, roughly speaking, separated into the following four
groups:
• primal methods, where we try to act similarly to what we did in the unconstrained case – i.e., to
move along the feasible set in a way which ensures, at each step, progress in the objective;
• barrier and penalty methods, where we reduce (P ) to a series of unconstrained programs
“approximating” (P );
• Lagrange multipliers methods, where we focus on the dual problem associated with (P ); this dual
problem is either an unconstrained one (when (P ) is equality constrained), or has simple nonnegativity
constraints (when (P ) includes inequalities) and is therefore simpler than (P ). When solving the
dual problem, we get, as a byproduct, approximate solutions to (P ) itself. Note that a posteriori the
Lagrange multiplier methods, same as the penalty/barrier ones, reduce (P ) to a sequence of uncon-
strained problems, but in a “smart” manner quite different from the straightforward penalty/barrier
scheme;
• SQP (Sequential Quadratic Programming) methods. The SQP methods, in contrast to all previous
ones, neither try to improve the objective staying within the feasible set, nor approximate the
constrained problem by unconstrained ones, but directly solve the KKT system of (nonlinear)
equations associated with (P ) by a kind of the Newton method.
This lecture is devoted to a brief overview of the first two groups of methods; my choice is motivated
by the fact that these methods basically directly reduce the constrained problem to a sequence of un-
constrained ones, and it is easy to understand them, given our knowledge of techniques for unconstrained
minimization.
Before passing to the main body of the lecture, let me make an important comment.
When solving an unconstrained minimization problem, we aimed at finding the optimal
solution; but the best we indeed were able to do was to ensure convergence to the set of
critical points of the objective, the set of points where the First Order Necessary Optimality
condition – the Fermat rule – is satisfied. Similarly, the best we can do in constrained
optimization is to ensure convergence to the set of KKT points of the problem – those points
where the First Order Necessary Optimality condition from Lecture 7 is satisfied. Whether
it fits our actual goal or not – this is another story, sometimes with a happy ending (e.g., in the
case of a convex problem a KKT point is for sure a globally optimal solution), sometimes
not, but this is all we can achieve.
During all this lecture, we make the following assumption on problem (P ) in question:
Regularity: The problem is regular, i.e., it is feasible, the objective and the constraints are at least once
continuously differentiable, and every feasible solution x is a regular point for the system of constraints:
the gradients of the constraints active at x (i.e., satisfied at x as equalities) are linearly independent.
The idea of the method is as follows: the First Order Necessary Optimality conditions (the KKT condi-
tions) are “constructive”: if they are not satisfied at a given feasible solution x, then we can explicitly
point out a better – with a smaller value of the objective – feasible solution x+ .
Indeed, we remember from Lecture 7 that the KKT conditions were obtained from the following
construction: given a feasible solution x, we associate with it the linearization of (LIP) at x – the auxiliary
Linear Programming program

(P_x):   min_y (y − x)^T ∇f(x)

subject to

g_j(y) ≤ 0, j = 1, ..., m.
x is a KKT point of (9.1.1) if and only if x is an optimal solution to (Px ) (this was exactly the conjecture
which led us to the KKT condition).
From this “if and only if” statement we conclude that if x is not a KKT point of (9.1.1), then x is
not an optimal solution to (Px ). In other words (pass in (Px ) from the variable y to d = y − x), there exists a
descent direction d – a direction satisfying

d^T ∇f(x) < 0,   g_j(x + d) ≤ 0, j = 1, ..., m.
When performing a small enough step in this direction, we improve the objective (by the same reasons
as in the Gradient Descent) and do not violate the constraints; after such a step is performed, we can
iterate the construction at the new point, and so on.
Normally, at a non-KKT point there exist many descent directions. In the Feasible Directions method
we choose the “most promising” one – the one along which the objective decreases at the highest possible
rate. Of course, to choose such a direction, we should somehow normalize the candidates.
The standard normalization here is given by the restriction

−1 ≤ d_i ≤ 1, i = 1, ..., n.    (9.1.2)

Normalization (9.1.2), instead of a more natural normalization like |d| ≤ 1, is motivated by the desire
to determine the direction via an LP – and thus efficiently solvable – program.
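For linear constraints g_j(y) = a_j^T y − b_j ≤ 0 this direction-finding LP can be passed, e.g., to scipy's LP solver (a sketch under this assumption; with d = y − x the constraints become A d ≤ b − Ax):

    import numpy as np
    from scipy.optimize import linprog

    def feasible_direction(x, grad_f, A, b):
        # minimize grad_f(x)^T d  s.t.  A(x + d) <= b, -1 <= d_i <= 1
        res = linprog(c=grad_f, A_ub=A, b_ub=b - A @ x,
                      bounds=[(-1.0, 1.0)] * len(x), method="highs")
        return res.x      # at a KKT point the optimal d is d = 0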
After the “most promising” direction d is found, we define the largest stepsize, let it be called γ̄,
among the stepsizes γ ≥ 0 which keep the point x + γd feasible, and define the next iterate x+ via linesearch
applied to f on the segment [0, γ̄]:

x_+ = x + γ_+ d,   γ_+ ∈ Argmin { f(x + γd) : 0 ≤ γ ≤ γ̄ }.
Given a subset I of the set {1, ..., m} of constraint indices, let SI be the affine set given by the system
of linear equations

g_i(x) = 0, i ∈ I,

and let GI be the corresponding face of the feasible set G, so that GI is a polyhedral set in SI; to get this
set, we should add to the linear equalities defining SI the linear inequalities

g_i(x) ≤ 0, i ∉ I.
From the Regularity assumption it follows that a face GI is a polyhedral set of affine dimension exactly
n − Card(I), where Card(I) is the number of elements in the set I, and that affine dimension of the affine
set SI also is n−Card(I); in other words, SI is the affine span of GI . G∅ – the only face of affine dimension
n - is the entire G; the boundary of G is comprised of facets – (n − 1)-dimensional polyhedral sets GI
associated with one-element sets I. The relative boundary of a facet is comprised of (n − 2)-dimensional
faces GI corresponding to 2-element sets I, and so on, until we come to 1-dimensional edges – faces GI
associated with (n−1)-element sets I – and vertices – faces of dimension 0 given by n-element sets I. The
simplest way to get an impression of this face structure of G is to imagine a 3D cube given by the 6 linear
inequalities x_i − 1 ≤ 0, −1 − x_i ≤ 0, i = 1, 2, 3. Here we have one 3-dimensional face – the entire cube;
it is bounded by 6 2-dimensional facets, each of them being bounded by 4 1-dimensional edges (there are
totally 12 of them). And the edges, in turn, are bounded by 0-dimensional vertices (totally 8).
so that there is exactly one KKT point x∗ in our problem, which is the global solution. Let us denote by
I ∗ the set of indices of those constraints which are active at x∗ ; then x∗ belongs to the relative interior of
the face GI ∗ (the constraints active at x∗ participate in the description of the affine span SI ∗ of GI ∗ , and
the remaining constraints are satisfied at x∗ as strict inequalities). Since x∗ is globally optimal solution
to our problem, it is also optimal solution to the problem
min {f (x) : x ∈ GI ∗ } ,
x
(since the feasible set of the latter problem contains x∗ and is contained in G), and since x∗ is a relative
interior point of GI ∗ , it is a local solution to the problem

(P_{I∗}):   min_x { f(x) : x ∈ S_{I∗} }.
But the objective f is strongly convex, so that x∗ is a local solution to the latter problem if and only
if it is a global solution to it, and such a solution is unique. It follows that if we were clever enough to
guess in advance what is I ∗ , we could solve instead of our original problem (P ) the problem (PI ∗ ). Now,
the problem (PI ∗ ) is actually unconstrained – we could choose orthogonal coordinates in the affine set
SI ∗ , express the n original coordinates of a point x ∈ SI ∗ as linear functions of these new coordinates
and substitute these expressions into f , thus getting a function f¯ of k = dim SI ∗ new coordinates – the
restriction of f onto SI ∗ . Problem (PI ∗ ) clearly is equivalent to unconstrained minimization of f¯, so that
it can be solved via unconstrained minimization machinery already known to us1) .
We see that if we were clever enough to guess what are the constraints active at the optimal solution,
we would be able to immediately reduce our constrained problem to an unconstrained one which we already
know how to solve. The difficulty, of course, is that we never are that clever to guess the active constraints.
The idea of the Active Set methods is to use at every iteration t a certain current guess It−1 of the actual
“active set” I ∗ and act as if we were sure that our guess is exact, until the current information tells us
that the guess is false. When it happens, we somehow update the guess and proceed with the new one,
and so on. Now let us look how this idea is implemented.
as it was already explained, this latter problem is, essentially, an unconstrained one. The only assumption
on the method we use is that it is of our standard structure – if xt−1 is not a critical point of f on our
current working plane S t−1 ≡ SIt−1 , i.e., if the orthogonal projection of ∇f (xt−1 ) on the working plane
is nonzero, the method chooses somehow a descent direction dt of f at xt−1 in the working plane and
uses a kind of line search to perform a step in this direction which reduces the value of the objective2) .
Assume that xt−1 is not a critical point of f on the current working plane S t−1 , so that dt is well-
defined descent direction of the objective. When performing line search in this direction, let us impose
1)
In actual computations, people do not use literally this way to solve (PI ∗ ) – instead of explicit parame-
terization of SI ∗ and then running, say, a Quasi-Newton method in the new coordinates, it is computationally
better to work with the original coordinates, modifying the method accordingly; this algorithmic difference has no
importance for us, since the trajectory we get is exactly the same trajectory we would get when parameterizing
SI ∗ and running the usual Quasi-Newton in the coordinates parameterizing the affine set SI ∗ .
2)
Please note slight incorrectness in the above description. Rigorously speaking, we were supposed to say
“orthogonal projection of the gradient on the linear subspace of directions parallel to the working plane” instead
of “orthogonal projection of the gradient on the working plane” and “direction dt parallel to the working plane”
instead of “direction dt in the working plane”; exact formulations would be too long.
where γt∗ is the largest stepsize which keeps the shifted point in the face Gt−1 ≡ GIt−1 :
γt∗ = sup { γ ≥ 0 | xt−1 + γdt ∈ G } .
Note that the situation is as follows: since It−1 is exactly the set of indices of the constraints active at
xt−1 , xt−1 is relative interior point of the face Gt−1 ; since the direction dt is a direction in the current
working plane S t−1 , an arbitrary step from xt−1 in this direction keeps the shifted point in S t−1 – i.e., the
shifted point for sure satisfies the constraints which were active at xt−1 . A too large step, however, may
make the shifted point violate one or more of the constraints which were nonactive at xt−1 ; γt∗ is exactly
the largest of those steps for which this bad thing does not occur. It is clear that small enough steps
from xt−1 in the direction dt keep the shifted point within the feasible domain, so that γt∗ is positive. It
may happen that γt∗ is +∞ – an arbitrarily large step in the direction dt keeps the shifted point feasible;
but in the case when γt∗ is finite, the point x+t = xt−1 + γt∗ dt for sure belongs to the relative boundary
of the face Gt−1 , i.e., the set of constraints active at x+t is strictly larger than the set {gi , i ∈ It−1 } of
constraints active at xt−1 .
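To make the computation of γt∗ concrete: for linear constraints, writing them as aiT x ≤ bi (so that gi (x) = aiT x − bi ), the largest feasible stepsize is given by the usual ratio test. Below is a minimal Python sketch of this computation; the function name and calling convention are our own illustration, not part of the lecture.

    import numpy as np

    def max_feasible_step(x, d, A, b, active, tol=1e-12):
        """Largest gamma >= 0 with A @ (x + gamma*d) <= b, for linear
        constraints a_i^T x <= b_i. Indices in `active` are skipped:
        the step stays in their zero set by the choice of d.
        Returns np.inf when every stepsize is feasible."""
        Ad = A @ d
        slack = b - A @ x            # nonnegative at a feasible x
        gamma = np.inf
        for i in range(A.shape[0]):
            if i in active or Ad[i] <= tol:   # constraint i is never hit
                continue
            gamma = min(gamma, slack[i] / Ad[i])
        return gamma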
Now let us look at the new iterate
xt = xt−1 + γt dt
given by our method for unconstrained minimization with the above restriction on the stepsize. There
are two possibilities:
• γt < γt∗ . It means that xt is in the relative interior of the face Gt−1 , so that the set of indices It of
the constraints active at xt is exactly the same as It−1 , and the new working plane S t = SIt is the
same as the old working plane S t−1 ; in this case the next iteration will deal with the same problem
(Pt ) and will simply perform a new step of the chosen method for unconstrained minimization of
f over the old working plane;
• γt = γt∗ . It means that xt is on the relative boundary of the face Gt−1 , i.e., that the set It of indices
of the constraints active at xt is strictly larger than It−1 – some constraints which were nonactive
at xt−1 become active at xt . In this case the new working plane S t = SIt is strictly less than the
previous one; in other words, we have shrunk the working plane, and the next iteration will deal
with a new problem (Pt+1 ) – the one of minimizing f on the new working plane. In other words,
we have corrected our guess for the actual active set I ∗ by extending the previous guess It−1 to a
new guess It .
Now let us look at what will eventually happen in the outlined process. During a number of iterations, it
will work with our initial guess I0 , solving the problem of unconstrained minimization of f on the initial
working plane SI0 . Then the guess will be corrected, the working plane – shrunk, and the method will
switch to minimization of f on this new working plane. After several steps more, the working plane
again will be shrunk, and so on. It is clear that there could be at most n switches of this type – the
initial working plane is of the dimension at most n, and each time it is shrunk, the dimension of the new
working plane is reduced at least by 1. It follows that, starting from some iteration, the method all the
time will work with a fixed working plane and will solve the problem of minimizing f on this working
plane. Assuming the method for unconstrained minimization we use to be globally converging, we conclude
that the trajectory of the method
• either will be finite and will terminate at a critical point of f at the current working plane,
• or will be infinite, and all limiting points of the trajectory will be critical points of the objective
at the current working plane.
With a reasonable idealization, we may assume that in the course of running the method, we eventually will
find a critical point of the objective at the current working plane3) . Let us look at the situation when this
happens – our current iterate xt turns out to be a critical point of f on the current working plane
S t . In other words, assume that the gradient of f at the point xt is orthogonal to all directions parallel
to the affine plane
S t = {x | gi (x) = 0, i ∈ It },
It being the set of indices of the constraints active at xt . The set of directions parallel to S t is the linear
subspace
Lt = {d | dT ∇gi = 0, i ∈ It },
and the vector ∇f (xt ) is orthogonal to Lt if and only if it is a linear combination of the vectors ∇gi ,
i ∈ It (see the results of Lecture 1 on the structure of the orthogonal complement to a linear subspace
given by a system of homogeneous linear equations). In other words, we have
∇f (xt ) + Σ_{i∈It} λi ∇gi = 0    (9.1.4)
with properly chosen coefficients λi . Note that these coefficients are uniquely defined, since from Regu-
larity assumption it follows that the vectors ∇gi , i ∈ It , are linearly independent.
There are two possibilities:
• A. The coefficients λi , i ∈ It , are nonnegative. It means that xt is a KKT point of the problem (P ), λi
being the Lagrange multipliers certifying this fact.
• B. At least one of the coefficients λi is negative.
In the case A we have reached our target – have found a KKT point of the original problem – and can
terminate. What to do in the case B? The answer is as follows: in this case we can easily find a new
iterate which is feasible for (P ) and is better than xt from the viewpoint of the objective. To see this,
assume for the sake of definiteness that 1 ∈ It and λ1 is negative, and let d be the orthogonal projection
of −∇f (xt ) on the plane
LI = {h | hT ∇gi = 0, i ∈ I},   I = It \{1}.
We claim that d is a nonzero direction which is descent for the objective and is such that
dT ∇g1 < 0.    (9.1.5)
Before proving this simple fact, let us look at its consequences. Let us perform a small step γ > 0 from
xt along the direction d and look at what can be said about the shifted point
x+ = xt + γd.
Since d belongs to the plane LI , d is orthogonal to the gradients of the constraints gi with the indices
i ∈ I; all these constraints are active at xt , and due to the indicated orthogonality they, independently
of γ, will be satisfied as active constraints also at x+ . The constraints which were not active at xt , will
remain nonactive also at x+ , provided that γ is small. All we need is to understand what will happen
with the constraint g1 which was active at xt . This is clear – from (9.1.5) it follows that this constraint
will be satisfied and will be nonactive at x+ whenever γ > 0. Thus, for small positive γ, x+ will be
feasible, and the constraints active at x+ are exactly those with indices from I. And what about the
objective? It also is clear: it was claimed that d is a descent direction of the objective at xt , so that small
enough step from xt along this direction reduces the objective. We see that the step
xt 7→ xt+1 ≡ xt + γd
3) As we shall see, sometimes this is even not an idealization. In the general case it, of course, is an idealization:
minimizing the objective over a fixed working plane, we normally never reach the set of exact critical points
of f on the plane. In actual computations, we should speak about “ε-critical points” – those where the gradient
of the objective reduced to the working plane, i.e., the orthogonal projection of ∇f onto the plane, is of norm at
most ε, ε being a small tolerance; such a point indeed will be eventually reached.
with properly chosen stepsize γ > 0 strictly reduces the objective and results in a new iterate xt+1 , still
feasible for (P ), with It+1 = I. Thus, when the above process “gets stuck” – reaches a feasible solution
which is a critical point of f on the current working plane, but is not a KKT point of (P ) – we can
“release” the process by a step which improves the objective value and extends the current working plane
(since It+1 = I is strictly less than It , the new working plane S t+1 is strictly larger than S t ).
Incorporating in our scheme this mechanism of extending working planes whenever the method reaches
a critical point of the objective in the current working plane, but this point is not a KKT point of the
problem – we get a descent method (one which travels along the feasible set, each iteration improving
the objective) which can terminate only at a KKT point of the problem, and this is what is called an
Active Set method. In fact the presented scheme is an “idealized” method – we have assumed that the
method for unconstrained minimization we use, being applied to the problem of minimizing f on a fixed
working plane, will eventually find a critical point of f on this plane. For a general-type objective this
is not the case, and here in actual computations we should speak about “nearly critical, within a given
tolerance, points”, pass to approximate versions of (9.1.4), etc.
To proceed, we should prove the announced statement about the direction d. This is immediate:
since d is the orthogonal projection of −∇f (xt ) on LI , this vector is orthogonal to all ∇gi with
i ∈ I, and the inner product of d with ∇f (xt ) is −|d|2 . With these observations, multiplying
(9.1.4) by d, we get
−|d|2 + λ1 dT ∇g1 = 0,
and since λ1 is negative, we conclude that dT ∇g1 is negative whenever d is nonzero. To
prove that d indeed is nonzero, note that otherwise ∇f (xt ) would be orthogonal to LI =
{∇gi , i ∈ I}⊥ , i.e., would be a linear combination of the vectors ∇gi , i ∈ I, or, which is
the same, along with equality (9.1.4) there would exist similar equality not involving ∇g1 .
This is impossible, since from the Regularity assumption it follows that the coefficients in
a relation of the type (9.1.4) are uniquely defined. Thus, (9.1.5) indeed takes place and d
is nonzero; since, as we already know, dT ∇f (xt ) = −|d|2 , d is a descent direction of the
objective.
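For illustration, here is a small Python sketch of this “release” computation – the multipliers from (9.1.4) and, if one of them is negative, the projected direction d. The names and the least-squares call are our own; linear independence of the active gradients (the Regularity assumption) is taken for granted.

    import numpy as np

    def kkt_check_or_release(grad_f, G):
        """grad_f = the gradient of f at x_t; rows of G are the gradients
        of the active constraints g_i, i in I_t, assumed linearly
        independent. Returns (None, lam) at a KKT point (case A), or
        (d, lam) with d the projection of -grad_f onto
        L_I = {h | G_red h = 0}, I = I_t minus a violating index (case B)."""
        # multipliers from (9.1.4); least squares is exact at a critical point
        lam, *_ = np.linalg.lstsq(G.T, -grad_f, rcond=None)
        if np.all(lam >= 0):
            return None, lam                  # case A: KKT point, terminate
        j = int(np.argmin(lam))               # case B: a negative multiplier
        G_red = np.delete(G, j, axis=0)
        if G_red.shape[0] == 0:
            return -grad_f, lam               # projection onto all of R^n
        # orthogonal projection of -grad_f onto the nullspace of G_red
        w = np.linalg.solve(G_red @ G_red.T, G_red @ (-grad_f))
        d = -grad_f - G_red.T @ w
        return d, lam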
Linear Programming: not only the constraints in (P ) are linear inequalities, but also the objective
is linear. In this case the above Active Set scheme becomes as follows:
We start from a feasible solution x0 and try to minimize our linear objective on the corresponding
working plane S 0 . If this plane is of positive dimension and the problem is bounded below (which we
assume from now on), then the first step – minimization of the objective in a chosen descent direction of
it in the working plane – will bring us to the boundary of the initial face G0 , the working plane will be
shrunk, and so on, until the working plane becomes 0-dimensional, i.e., until we find ourselves at a
vertex of the feasible domain4) . Starting with this moment, the procedure will be as follows. Our current
working plane is a singleton which is a vertex of the feasible domain, and there are exactly n inequality
constraints active at the iterate (recall that we are under the Regularity assumption!). Of course, this
iterate is a critical point of the objective on the (0-dimensional) current working plane, so that we have
relation (9.1.4). If current λ’s are nonnegative, we are done – we have found a KKT point of the problem,
i.e., a global solution to it (the problem is convex!). If there is a negative λ, it gives us a new working
plane, extended by one dimension, along with a new iterate in the corresponding one-dimensional face of the
feasible domain. The subsequent step from this new iterate will bring us to a boundary point of this
one-dimensional face, i.e., to a new vertex of the feasible domain, and will shrink the working plane by
one dimension – it again will become 0-dimensional, and so on. As a result of the two subsequent steps
vertex ↦ better point on a 1-dimensional face ↦ better vertex
we simply move from a vertex to a new vertex (linked with the previous one by an edge), each time
improving the value of the objective, until we reach the optimal vertex. As you have already guessed,
what we get is the usual Simplex method.
Linearly Constrained Convex Quadratic Programming. Now assume that the constraints
in (P ) are linear, and the objective
f (x) = (1/2) xT Ax + bT x
is a strongly convex quadratic form (A is a positive definite symmetric matrix). Problems of this type
are extremely important in their own right; besides this, they are the auxiliary problems solved at
each iteration of several general purpose optimization methods: the Sequential Quadratic Programming
method for smooth nonconvex constrained optimization (this method is very popular now), Bundle
methods for large-scale nonsmooth convex optimization, etc.
Quadratic Programming is a very convenient application field for the Active Set scheme, since here
we can explicitly minimize the objective on the current working plane SI . Indeed, representing SI by the
set of linear equations:
SI = {x | P x = p},
P being a matrix of full row rank with the rows (∇gi )T , i ∈ I, we observe that the necessary (and also
sufficient – f is convex!) condition for a point x ∈ SI to minimize f on SI is the existence of a multiplier
vector λ of the same dimension as p such that
Ax + b + P T λ = 0,   P x = p    (9.1.6)
(you can look at this system as the KKT optimality condition or simply observe that x ∈ SI is the
minimizer of f on SI if and only if ∇f (x) = Ax + b is orthogonal to the linear subspace {h | P h = 0} of
4) We ignore the degenerate case when the objective is constant on a non-singleton face of G.
the directions parallel to SI , which in turn is the case if and only if ∇f (x) can be represented as a linear
combination of rows of P , i.e., as −P T λ for some λ).
What we get is a square system of linear equations with the unknowns x and λ, and you can easily
prove that the matrix
[ A  P T ]
[ P   0  ]
of the system is nonsingular5) . Consequently, the system has a unique
solution; the x-component of this solution is exactly the minimizer of f on SI .
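In code, this one-step minimization is a single linear solve. A hedged numpy sketch (the function name is ours):

    import numpy as np

    def minimize_quadratic_on_plane(A, b, P, p):
        """Minimizer of f(x) = (1/2) x^T A x + b^T x over {x | P x = p}
        via the KKT system (9.1.6); A positive definite symmetric,
        P of full row rank, so the block matrix is nonsingular."""
        n, m = A.shape[0], P.shape[0]
        K = np.block([[A, P.T],
                      [P, np.zeros((m, m))]])
        sol = np.linalg.solve(K, np.concatenate([-b, p]))
        return sol[:n], sol[n:]    # the minimizer x and the multipliers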
With the possibility to find a minimizer of our strongly convex quadratic form on every working plane
SI in one step, just by solving the corresponding linear system, we can implement the Active Set scheme
to find the global minimizer of the function f on the polyhedral set G as follows:
At an iteration t, we have a feasible solution xt−1 to the problem along with a certain set Jt−1
of indices of constraints active at xt−1 (this set can be the set It−1 of indices of all the constraints
active at xt−1 or can be smaller than the latter set), and define the current working plane S t−1
as SJt−1 .
In the course of the iteration, we act as follows:
1) Find, as explained above, the minimizer x+t of f on the working plane S t−1 , and check
whether this point is feasible for the problem. If it is the case, we go to 2), otherwise to 3).
2) If x+t is feasible for the problem, we check whether the Lagrange multipliers λ associated
with this point as the solution to the problem
min_x { f (x) : x ∈ S t−1 }
are nonnegative; note that these Lagrange multipliers are given by the λ-part of the same
system (9.1.6) which gives us x+t . If all the Lagrange multipliers are nonnegative, we
terminate – x+t is a KKT point of (P ), i.e., a global solution to the problem (the problem is
convex!). If some of the λ’s are negative, we choose one of these negative λ’s, let the index of
the corresponding constraint be i, and set
xt = x+t ,   Jt = Jt−1 \{i}.
3) If x+t is not feasible for the problem, we perform the maximal feasible step from xt−1
towards x+t and add to Jt−1 the index of a constraint which becomes active at the resulting
point xt , thus getting Jt .

9.2 Penalty and Barrier Methods

9.2.1 The idea

The Penalty Scheme: equality constrained case. Consider an equality constrained problem
(ECP)    min_x { f (x) : hi (x) = 0, i = 1, ..., m } .
In order to approximate this constrained problem by an unconstrained one, let us add to our objective a
term which “penalizes” violation of constraints; the simplest term of this type is
(ρ/2) Σ_{i=1}^m h_i^2 (x),
where ρ > 0 is the “penalty parameter”. This term is zero on the feasible set and is positive outside this set;
if ρ is large, then the penalty term is large everywhere except in a tight neighbourhood of the feasible set.
Now let us add the penalty term to the objective. From the above discussion it follows that the
resulting “combined objective”
fρ (x) = f (x) + (ρ/2) Σ_{i=1}^m h_i^2 (x)    (9.2.1)
is, for large ρ, close to f in a tight neighbourhood of the feasible set and is very large outside such a
neighbourhood. Thus, we could say that the limit of fρ as ρ → ∞ is the function taking values in the extended real axis
(with +∞ added) which coincides with f on the feasible set and is +∞ outside this set; it is clear that
unconstrained local/global minimizers of this limit are exactly the constrained local, respectively global,
minimizers of f . This exact coincidence takes place only in the limit; we could, anyhow, expect that
“close to the limit”, for large enough values of the penalty parameter, the unconstrained minimizers of
the penalized objective are close to the constrained minimizers of the actual objective. Thus, solving the
unconstrained problem
min fρ (x)
x
for large enough value of ρ, we may hope to get good approximations to the solutions of (ECP). As we
shall see in the mean time, under mild regularity assumptions all these “could expect” and “may hope”
indeed take place.
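As a small numerical illustration of this “may hope” (our own sketch, not part of the original text): the following Python fragment minimizes fρ for an increasing sequence of penalty parameters, warm-starting each stage; scipy’s BFGS plays the role of the “unconstrained minimization machinery”.

    import numpy as np
    from scipy.optimize import minimize

    def penalty_scheme(f, h, x0, rho0=1.0, factor=10.0, stages=8):
        """Quadratic-penalty scheme for (ECP) min{f(x) : h(x) = 0}:
        minimize f_rho = f + (rho/2)|h|^2 for rho = rho0, rho0*factor, ...,
        warm-starting each unconstrained solve at the previous minimizer."""
        x, rho = np.asarray(x0, dtype=float), rho0
        for _ in range(stages):
            f_rho = lambda z, r=rho: f(z) + 0.5 * r * np.sum(h(z) ** 2)
            x = minimize(f_rho, x, method="BFGS").x
            rho *= factor
        return x, rho

    # toy instance: min x1 + x2 s.t. x1^2 + x2^2 - 2 = 0; solution (-1, -1)
    x_hat, _ = penalty_scheme(lambda z: z[0] + z[1],
                              lambda z: np.array([z[0]**2 + z[1]**2 - 2.0]),
                              x0=[2.0, 0.5])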
The Penalty Scheme: general constrained case. The idea of penalization can be easily
carried out in the case when there are inequality constraints as well. Given a general type constrained
problem
(GCP)    min_x { f (x) : hi (x) = 0, i = 1, ..., m, gj (x) ≤ 0, j = 1, ..., k } ,
we penalize the equality constraints as above and penalize the inequality constraints via the term
(ρ/2) Σ_{j=1}^k (g_j^+ (x))^2 ,   where
a+ = max[a, 0] = { a, a ≥ 0;  0, a < 0 }
is the “positive part” of a real a. The resulting penalty term is zero at any point where all the inequality
constraints are satisfied and is positive (and proportional to ρ) at any point where at least one of the
inequalities is violated.
Adding to the objective penalty terms for both equality and inequality constraints, we come to the
penalized objective
fρ (x) = f (x) + (ρ/2) Σ_{i=1}^m h_i^2 (x) + (ρ/2) Σ_{j=1}^k (g_j^+ (x))^2 ;    (9.2.2)
same as above, we can expect that the unconstrained minimizers of the penalized objective approach the
constrained minimizers of the actual objective as the penalty parameter goes to infinity. Thus, solutions
of the unconstrained problem
min fρ (x),
x
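In code, the penalized objective (9.2.2) is a one-liner; a sketch with our own names:

    import numpy as np

    def penalized_objective(f, h, g, rho):
        """Builds f_rho of (9.2.2): quadratic penalty on the equalities h
        plus squared positive parts g_j^+ of the inequalities g."""
        def f_rho(x):
            gplus = np.maximum(g(x), 0.0)          # the positive parts
            return f(x) + 0.5 * rho * (np.sum(h(x)**2) + np.sum(gplus**2))
        return f_rho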
The Barrier Scheme. The idea of the barrier methods is similar; the only difference is that instead
of allowing violations of the constraints and penalizing these violations, we now prevent the constraints
from being violated by a kind of interior penalty which blows up to infinity as a constraint is about to be
violated. This “interior penalty” approach can normally be used in the case of inequality constrained
problems with “full-dimensional” feasible set. Namely, consider an inequality constrained problem
(ICP)    min_x { f (x) : gj (x) ≤ 0, j = 1, ..., k }
and assume that the feasible domain G of the problem is such that
• the interior int G of the domain G is nonempty, and every gj is strictly negative in int G
• every point from G can be represented as the limit of a sequence of points from the interior of G
Assume also, just for the sake of simplicity, that G is bounded.
Given problem with the indicated properties, one can in many ways define an interior penalty function
(also called a barrier) for the feasible domain G, i.e., a function F defined on the interior of G and such
that
• F is continuously differentiable on int G
• F (xi ) → ∞ for any sequence of points xi ∈ int G converging to a boundary point x of G
For example, one can set
F (x) = Σ_{j=1}^k 1/(−gj (x)).
Now let us aggregate the objective and the barrier into the function
Fρ (x) = f (x) + (1/ρ) F (x),   ρ > 0.
This function tends to ∞ along every sequence of points xi ∈ int G converging to a boundary point of G (along such a
sequence F possesses the required behaviour, while f remains bounded due to continuity of the objective).
In particular, the level sets of Fρ – the sets of the type
{x ∈ int G | Fρ (x) ≤ a}
are closed6) . Since they are also bounded (recall that G is assumed to be bounded), they are compact sets,
and Fρ , being continuous on such a compact set, attains its minimum on it; the corresponding minimizer
clearly is a minimizer of Fρ on int G as well.
Thus, for every positive ρ the function Fρ attains its minimum on int G. At the same time, when
the penalty parameter ρ is large, Fρ “almost everywhere in int G” is “almost equal” to f – indeed, due
to the factor 1/ρ at F in Fρ , the contribution of the interior penalty term for large ρ is non-negligible only
in a thin neighbourhood of the boundary, the thinner the larger ρ is. From this observation it is natural
to guess (and it turns out to be indeed true) that the minimizers of Fρ on int G are, for large ρ, close to
the optimal set of (ICP) and could therefore be treated as good approximate solutions to (ICP).
Now, the problem
min_x Fρ (x)
is, formally, a constrained problem – since the domain of the objective is int G rather than the entire Rn .
Nevertheless, we have basically the same possibilities to solve the problem as if it was unconstrained.
Indeed, any descent (i.e., forming a sequence of iterates along which the objective never increases) method
for unconstrained minimization, as applied to Fρ and started at an interior point of G, never will come
too close to the boundary of G (since, as we know, close to the boundary Fρ is large, and along the
trajectory of the method Fρ is not greater than at the starting point). It means that the behaviour of
the method as applied to Fρ will, basically, be the same as if Fρ was defined everywhere – the method
simply will not feel that the objective is only partially defined. Thus, the barrier scheme in fact reduces
the constrained minimization problem to an unconstrained one (or, better to say, allows us to approximate
the constrained problem by “essentially unconstrained” problem).
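A crude Python sketch of this scheme with the interior penalty F = Σj 1/(−gj ) (names ours; the +∞ value outside int G is exactly what keeps a descent method from leaving the interior, and the derivative-free Nelder–Mead solver is used here to sidestep the infinite values):

    import numpy as np
    from scipy.optimize import minimize

    def barrier_scheme(f, g, x0, rho0=1.0, factor=10.0, stages=8):
        """Interior-penalty scheme for min{f(x) : g_j(x) <= 0}: minimize
        F_rho = f + (1/rho) * sum_j 1/(-g_j) over int G for growing rho,
        starting each stage at the previous (interior) minimizer."""
        def F_rho(x, rho):
            gx = g(x)
            if np.any(gx >= 0.0):
                return np.inf                      # x is not in int G
            return f(x) + (1.0 / rho) * np.sum(1.0 / (-gx))
        x, rho = np.asarray(x0, dtype=float), rho0
        for _ in range(stages):
            x = minimize(F_rho, x, args=(rho,), method="Nelder-Mead").x
            rho *= factor
        return x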
Now that the ideas of the penalty and barrier schemes are outlined, let us come to a more detailed
investigation of the schemes.
9.2.2.1 Convergence
The main questions we should focus on are whether, and in what sense, the minimizers of the penalized
(barrier) objectives converge, as ρ → ∞, to the solutions of the problem of interest.
For the sake of definiteness, let us focus on the case of the equality constrained problem (ECP) (the results
for the general case are similar).
The first – and very simple – statement is as follows:
6) Indeed, to prove closedness of a level set – let it be called L – is the same as to prove that if Fρ (xi ) ≤ a < ∞
for a certain sequence of points {xi } converging to a point x, then Fρ (x) ≤ a (a closed set is, by definition, one
which contains the limits of all converging sequences comprised of elements of the set). A priori x might be either
an interior or a boundary point of G. The second possibility should be excluded, since if it were the case, then, due
to the already indicated properties of Fρ , we would have Fρ (xi ) → ∞ as i → ∞, which is impossible, since
Fρ (xi ) ≤ a for all i. Thus, x ∈ int G; but then Fρ is continuous at x, and since Fρ (xi ) ≤ a and xi → x,
i → ∞, we get Fρ (x) ≤ a, as required.
Theorem 9.2.1 Let the objective f in problem (ECP) possess bounded level sets:
f (x) → ∞, |x| → ∞,
and let (ECP) be feasible. Then, for any positive ρ, the set X ∗ (ρ) of global minimizers of the penalized
objective fρ is nonempty. Moreover, if X ∗ is the set of globally optimal solutions to (ECP), then, for
large ρ, the set X ∗ (ρ) is “close” to X ∗ : for any ε > 0 there exists ρ̄ = ρ̄(ε) such that the set X ∗ (ρ), for all
ρ ≥ ρ̄(ε), is contained in the ε-neighbourhood of X ∗ .
Proof. First of all, (ECP) is solvable. Indeed, let x0 be a feasible solution to the problem,
and let U = {x | f (x) ≤ f (x0 )} be the corresponding level set of the objective; U is closed and
bounded (the level sets of f are bounded) and intersects the closed feasible set of (ECP) (the
intersection contains x0 ), so that f attains its minimum f ∗ on the feasible set at certain
globally optimal solution x∗ . Now fix ρ > 0; since fρ ≥ f , the penalized objective also
possesses bounded level sets, so that the set X ∗ (ρ) of its global minimizers is nonempty.
Let x∗ρ ∈ X ∗ (ρ). Since the penalty term vanishes on the feasible set, we have
fρ (x∗ ) = f (x∗ ) = f ∗ ,
so that the optimal value in (Pρ ) can be only ≤ the one in (ECP), which justifies the first
claim. Further, due to nonnegativity of the penalty term we have
f (x∗ρ ) ≤ fρ (x∗ρ ) ≤ fρ (x∗ ) = f ∗ ,    (9.2.3)
which justifies the second claim. And this second claim immediately implies the third one: x∗ρ ∈ U , by
construction of U .
Our observations immediately result in the desired conclusions. Indeed, we already have
proved that X ∗ (ρ) is nonempty, and all we need is to verify that, for large ρ, the set X ∗ (ρ) is
contained in a tight neighbourhood of X ∗ . Assume that this is not the case: then there exist a positive ε
and a sequence ρi → ∞ such that X ∗ (ρi ) is not contained in the ε-neighbourhood Xε∗ of X ∗ , so that one can choose
x∗i ∈ X ∗ (ρi ) in such a way that x∗i is outside Xε∗ . According to the third claim, the points
x∗i belong to U and form therefore a bounded sequence. Passing to a subsequence, we may
assume that x∗i converge to a certain point x̄ as i → ∞. We claim that x̄ is an optimal solution
to (ECP) (this will give us the desired contradiction: since x∗i → x̄ ∈ X ∗ , the points x∗i for all
large enough i must be in Xε∗ – look at the definition of the latter set – and we have chosen
x∗i not to be in Xε∗ ). To prove that x̄ is an optimal solution to (ECP), we should prove that
f (x̄) ≤ f ∗ and that x̄ is feasible. The first inequality is readily given by (9.2.3) – it is
satisfied with x∗ρ = x∗i for every i, and the latter points converge to x̄; recall that f is continuous.
Feasibility of x̄ is evident: otherwise h(x̄) ≠ 0 and, since x∗i → x̄ and h is continuous, for all
large enough i one has
|h(x∗i )| ≥ a ≡ (1/2)|h(x̄)| > 0,
whence for these i
fρ∗i ≡ fρi (x∗i ) = (ρi /2)|h(x∗i )|2 + f (x∗i ) ≥ (a2 /2)ρi + f (x∗i ) → ∞, i → ∞
(note that ρi → ∞, while f (x∗i ) → f (x̄)), which contradicts our first claim.
The formulated theorem is not that useful: we could conclude something reasonable from its state-
ment, if we were able to approximate the global solutions to (Pρ ), which is the case only when fρ is
convex (as it, e.g., happens when f is convex and the equality constraints are linear). What we indeed
need is a local version of the theorem. This version is as follows:
Theorem 9.2.2 Let x∗ be a nondegenerate locally optimal solution to (ECP), i.e., a solution where the
gradients of the constraints are linearly independent and the Second Order Sufficient Optimality Condition
from Lecture 7 is satisfied. Then there exists a neighbourhood V of x∗ (an open set containing x∗ ) and
ρ̄ > 0 such that for every ρ ≥ ρ̄ the penalized objective fρ possesses in V exactly one critical point x∗ (ρ).
This point is a nondegenerate local minimizer of fρ and a minimizer of fρ in V , and x∗ (ρ) → x∗ as
ρ → ∞.
Proof. The simplest way to prove the theorem is to reduce the situation to the case when the
constraints are linear; this is the same approach we used to prove the Optimality Conditions
in Lecture 7. The detailed proof is as follows (cf. the proof of the Optimality Conditions):
Since x∗ is nondegenerate, the gradients of the constraints at x∗ are linearly independent.
From the appropriate version of the Implicit Function Theorem (we again use this magic
Calculus tool) it follows that one can choose locally new coordinates y (nonlinearly related
to the original ones!) in which our m constraints become simply the first m coordinate functions.
Namely, there exist a neighbourhood V0 of x∗ , a neighbourhood W ′ of the origin in Rn , and
mutually inverse mappings X(·) : W ′ → V0 and Y (·) : V0 → W ′ (x = X(y), y = Y (x))
such that
• X(·) and Y (·) are continuously differentiable as many times as the constraints h (i.e.,
at least twice; recall that we once and for all restricted our considerations to problems
with twice continuously differentiable data)
• x∗ = X(0) (“x∗ in the coordinates y becomes the origin”)
• hi (X(y)) ≡ yi , i = 1, ..., m (“in y-coordinates the constraints become the first m
coordinate functions”).
Now let us pass in (ECP) from coordinates x to coordinates y, which results in the problem
(ECP′)    min_y { φ(y) ≡ f (X(y)) : yi ≡ hi (X(y)) = 0, i = 1, ..., m } ;
this, of course, makes sense only in the neighbourhood W ′ of the origin in y-variables.
1⁰. We claim that y ∗ = 0 is a nondegenerate solution to the problem (ECP′). Indeed, we
should prove that this is a regular point for the constraints in the problem (which is evident)
and that the Second Order Sufficient Optimality condition takes place at the point, i.e., there
exists λ∗ such that the Lagrange function
L′(y, λ) = φ(y) + Σ_{i=1}^m λi yi
satisfies
∇y L′(y ∗ , λ∗ ) = 0
and
dT ∇2y L′(y ∗ , λ∗ )d > 0
for every nonzero vector d which is orthogonal to the gradients of the constraints of (ECP′).
And what we know is that there exists λ∗ which ensures similar properties of the Lagrange
function
L(x, λ) = f (x) + Σ_{i=1}^m λi hi (x)
of the original problem at x = x∗ . What we shall prove is that this very λ∗ satisfies all our
needs in (ECP′) as well. Indeed, we clearly have
L′(y, λ∗ ) = L(X(y), λ∗ ),
whence ∇y L′(0, λ∗ ) = [X ′ (0)]T ∇x L(x∗ , λ∗ ) = 0
[the second factor is zero due to the origin of λ∗ ], so that the first order part of the required
conditions is satisfied. Further, for every d we have
dT ∇2y L′(0, λ∗ )d = (d2 /dt2 )|t=0 L(X(td), λ∗ )
= dT [X ′ (0)]T ∇2x L(x∗ , λ∗ )X ′ (0)d + [∇x L(x∗ , λ∗ )]T (d2 /dt2 )|t=0 X(td)
= dT [X ′ (0)]T ∇2x L(x∗ , λ∗ )X ′ (0)d
[the second term is zero due to the origin of λ∗ ]. If now d ≠ 0 is orthogonal to the gradients
of the constraints of (ECP′), i.e., has zero first m coordinates, then X ′ (0)d is nonzero (X ′ (0)
is nonsingular) and is orthogonal to the gradients ∇x hi (x∗ ) (differentiate the identities
hi (X(y)) ≡ yi ), so that the rightmost quantity is positive
[again due to the origin of λ∗ ], and we see that the second order part of the required conditions
also is satisfied.
2⁰. Now note that (ECP′) is a problem with linear constraints, so that the Hessian with
respect to y of the Lagrangian of the problem, independently of the values of the Lagrange
multipliers, is ∇2y φ(y). In particular, the “second-order part” of the fact that y ∗ = 0 is a
nondegenerate solution to (ECP′) simply says that H = ∇2y φ(y ∗ ) is positive definite on the
plane
L = {d | di = 0, i = 1, ..., m}
(this is the tangent plane at y ∗ to the feasible surface of the problem). This is the key argument
in the proof of the following crucial fact:
(*) The function
φρ (y) = φ(y) + (ρ/2) Σ_{i=1}^m y_i^2
– the penalized objective of (ECP′) – is, for large enough ρ, strictly convex in a properly
chosen convex neighbourhood W of y ∗ = 0, i.e., there exist a small enough convex neighbour-
hood W ⊂ W ′ of y ∗ = 0 (simply an open ball of certain small enough positive radius) and
ρ∗ > 0 such that
∇2y φρ (y)
is positive definite whenever y ∈ W and ρ ≥ ρ∗ .
3⁰. The proof of (*) relies on the following simple
Lemma 9.2.1 Let A be a symmetric n × n matrix, L be a linear subspace in Rn , P be the
orthoprojector onto L, and assume that
dT Ad ≥ α|d|2 ∀d ∈ L
with certain α > 0. Then there exists ρ∗ such that the matrix
A + ρ(I − P )
is positive definite whenever ρ ≥ ρ∗ ; one can take ρ∗ = 4β 2 /α, β being an upper bound on the norm
of the matrix A.
Indeed, representing d = d′ + d′′ with d′ = P d, d′′ = (I − P )d and bounding the cross terms
via |2uv| ≤ γ −1 u2 + γv 2 (note that this is nothing but the inequality between the
arithmetic and the geometric mean), we get, for every γ > 0,
dT (A + ρ(I − P ))d ≥ α|d′ |2 − (β/γ)|d′ |2 − βγ|d′′ |2 + ρ|d′′ |2 .
Setting here γ = 2β/α, we get
dT (A + ρ(I − P ))d ≥ (α/2)|d′ |2 + [ρ − 2β 2 /α]|d′′ |2 ;
choosing finally ρ∗ = 4β 2 /α and assuming ρ ≥ ρ∗ , so that ρ − 2β 2 /α ≥ ρ/2, we come to
dT (A + ρ(I − P ))d ≥ (α/2)|d′ |2 + (ρ/2)|d′′ |2
for all ρ ≥ ρ∗ .
Now we are ready to prove (*). Indeed, since ∇2y φ(y) is continuous in y and ∇2y φ(y ∗ ) is
positive definite on
L = {d | di = 0, i = 1, ..., m},
we can choose
• a small enough ball W centered at y ∗ = 0 and contained in W ′ ,
• a small enough positive α, and
• a large enough positive β
such that
dT [∇2y φ(y)]d ≥ α|d|2 ,   y ∈ W, d ∈ L,
and
|∇2y φ(y)| ≤ β,   y ∈ W.
Since ∇2y φρ (y) = ∇2y φ(y) + ρ(I − P ), P being the orthoprojector onto L, Lemma 9.2.1
(applied to A = ∇2y φ(y)) says that there exists ρ∗ > 0 such that all the matrices
∇2y φρ (y)
corresponding to y ∈ W and ρ ≥ ρ∗ are positive definite and, moreover, satisfy
dT [∇2y φρ (y)]d ≥ (α/2)|d′ |2 + (ρ/2)|d′′ |2 ,   d′ = P d, d′′ = (I − P )d.    (9.2.4)
Thus, whenever ρ ≥ ρ∗ , the Hessian of the function φρ is positive definite in W , so that φρ
is convex in W .
4⁰. Let ρ ≥ ρ∗ . From (9.2.4) it immediately follows that
φρ (y) ≥ φρ (0) + y T ∇y φ(0) + (α/4)|y ′ |2 + (ρ/4)|y ′′ |2 ,   y ′ = P y, y ′′ = (I − P )y.    (9.2.5)
The gradient of φρ at y ∗ = 0 is, first, independent of ρ (it is simply the gradient of φ at the
point) and, second, is orthogonal to L, as it is given by the “first order part” of the fact that
y ∗ = 0 is a nondegenerate solution to (ECP′). Consequently, (9.2.5) can be rewritten as
φρ (y) ≥ φρ (0) + (y ′′ )T g + (α/4)|y ′ |2 + (ρ/4)|y ′′ |2 ,    (9.2.6)
g = ∇y φ(y ∗ ) being a fixed vector (independent of ρ). From this relation it easily follows7) that
(**) there exists ρ̄ ≥ ρ∗ such that φρ (y) > φρ (0) whenever ρ ≥ ρ̄ and y is a boundary point
of the ball W .
Now let ρ ≥ ρ̄. The function φρ , being continuous on the closure of W (which is a closed
ball, i.e., a compact set), attains its minimum on cl W , and, due to strong convexity of the
function, the minimizer, let it be called y ∗ (ρ), is unique. By (**), this minimizer cannot be
a boundary point of cl W – on the boundary φρ is greater than at y ∗ = 0, i.e., it is a point
from W . Since φρ is smooth and y ∗ (ρ) ∈ W , y ∗ (ρ) is a critical point of φρ . There is no other
critical point of φρ in W , since φρ is convex and therefore a critical point of the function
is its minimizer on cl W , and we know that such a minimizer is unique. Note that y ∗ (ρ)
is a nondegenerate local minimizer of φρ , since the Hessian of φρ at this point is positive definite. Last,
y ∗ (ρ) → y ∗ = 0 as ρ → ∞. Indeed, we have
φρ (y ∗ (ρ)) ≤ φρ (0),
whence, denoting yL (ρ) = P y ∗ (ρ), yL⊥ (ρ) = (I − P )y ∗ (ρ) and applying (9.2.6),
g T yL⊥ (ρ) + (α/4)|yL (ρ)|2 + (ρ/4)|yL⊥ (ρ)|2 ≤ 0,
or, with the Cauchy inequality,
|g||yL⊥ (ρ)| ≥ (α/4)|yL (ρ)|2 + (ρ/4)|yL⊥ (ρ)|2 .
From this inequality it immediately follows that |yL⊥ (ρ)| → 0 as ρ → ∞ (why?), and with
this observation the same inequality results also in |yL (ρ)| → 0, ρ → ∞, so that indeed
y ∗ (ρ) → y ∗ = 0, ρ → ∞.
5⁰. Thus, we have established the statement we are going to prove – but for the “locally
equivalent to (ECP)” problem (ECP′) rather than for the actual problem of interest: we
have pointed out a neighbourhood W of the point y ∗ such that the penalized objective φρ of
(ECP′), for all large enough ρ, has in this neighbourhood exactly one critical point, which
is a nondegenerate minimizer of φρ in the neighbourhood; and as ρ → ∞, this critical point
y ∗ (ρ) converges to y ∗ = 0. Now let us take the image V of W under our substitution of
variables mapping y 7→ X(y). We will get a neighbourhood of x∗ , and in this neighbourhood
we clearly have
fρ (x) = φρ (Y (x)).
Now, Y is a one-to-one differentiable mapping of V onto W with differentiable inverse X; it
immediately follows that a point x is a critical point (or a minimizer) of fρ in V if and only
if Y (x) is a critical point, respectively, a minimizer of φρ in W ; in particular, for ρ ≥ ρ̄, fρ
7) Here is the derivation: we should prove that
(y ′′ )T g + (α/4)|y ′ |2 + (ρ/4)|y ′′ |2 > 0
whenever y is on the boundary of W and ρ is large enough; recall that W is a ball of certain
radius r > 0 centered at the origin. Denoting s = |y ′′ |2 and taking into account that |y ′ |2 = r2 − s and (y ′′ )T g ≥ −cs1/2 by Cauchy’s
inequality (c = |g|), we reduce the problem in question to the following one: prove that
(!)    min_{0≤s≤r²} θρ (s),    θρ (s) = (α/4)r2 + ((ρ − α)/4)s − c s1/2 ,
is positive, provided that ρ is large enough. This is evident (split the entire segment [0, r2 ] where s varies into
the segment ∆ where cs1/2 ≤ (α/8)r2 – in this segment θρ (s) is positive whenever ρ > α – and the complementary
segment ∆′ , and note that in this complementary segment s is bounded away from zero and therefore θρ for sure
is positive for all large enough values of ρ).
indeed possesses a unique critical point x∗ (ρ) = X(y ∗ (ρ)) in V , and this is the minimizer of
fρ in V . As ρ → ∞, we have y ∗ (ρ) → y ∗ = 0, whence x∗ (ρ) → X(0) = x∗ . The only
property of x∗ (ρ) we did not verify so far is that it is a nondegenerate local minimizer of fρ ,
i.e., that ∇2x fρ (x∗ (ρ)) is positive definite; we know that it is the case for φρ and y ∗ (ρ), but
our substitution of variables is nonlinear and therefore, generally speaking, does not preserve
positive definiteness of Hessians. Fortunately, it does preserve positive definiteness of the
Hessians taken at critical points: if Y (x) is a twice continuously differentiable mapping with
differentiable inverse and ψ is such that ∇y ψ(ȳ) = 0, ȳ = Y (x̄), then, as is immediately
seen, the Hessian of the composite function g(x) = ψ(Y (x)) at the point x̄ is given by
∇2x g(x̄) = [Y ′ (x̄)]T ∇2y ψ(ȳ) Y ′ (x̄)
(in the general case, there are also terms coming from the first order derivative of ψ at ȳ
and second order derivatives of Y (·), but in our case, when ȳ is a critical point of ψ, these
terms are zero), so that ∇2x g(x̄) is positive definite if and only if ∇2y ψ(ȳ) is so. Applying this
observation to ψ = φρ and ȳ = y ∗ (ρ), we obtain the last fact we need – nondegeneracy of
x∗ (ρ) as an unconstrained local minimizer of fρ .
Properties of the path x∗ (ρ). The path x∗ (ρ) given by Theorem 9.2.2 possesses a number of nice
properties:
• [Minimality of x∗ (ρ)] For every ρ ≥ ρ̄ one has f (x∗ (ρ)) = min { f (x) | x ∈ X + (ρ) }, where
X + (ρ) = { x ∈ V | |h(x)| ≤ |h(x∗ (ρ))| }
(“x∗ (ρ) minimizes f on the set of all those points from V where the constraints are violated at
most as they are violated at x∗ (ρ)”).
Indeed, x∗ (ρ) evidently belongs to X + (ρ); if there were a point x in X + (ρ) with f (x) < f (x∗ (ρ)), we
would have
fρ (x) = f (x) + (ρ/2)|h(x)|2 <
[since f (x) < f (x∗ (ρ)) and |h(x)| ≤ |h(x∗ (ρ))| due to x ∈ X + (ρ)]
< f (x∗ (ρ)) + (ρ/2)|h(x∗ (ρ))|2 = fρ (x∗ (ρ)).
The resulting inequality is impossible, since x∗ (ρ) minimizes fρ in V .
• [Monotonicity of the optimal value of the penalized objective] The optimal (in V ) values fρ (x∗ (ρ))
of the penalized objectives do not decrease with ρ.
Indeed, if ρ ≤ ρ′ , then clearly fρ (x) ≤ fρ′ (x) everywhere in V , and consequently the
same inequality holds for the minimal values of the functions.
• [Monotonicity of constraint violations] The violations v(ρ) = |h(x∗ (ρ))| of the constraints along the path
do not increase with ρ (“the larger is the penalty parameter, the less are violations of the constraints
at the solution to the penalized problem”).
Indeed, assume that ρ′ > ρ′′ , and let x′ = x∗ (ρ′ ), x′′ = x∗ (ρ′′ ). We have
[fρ′ (x′′ ) ≡] f (x′′ ) + (ρ′ /2)|h(x′′ )|2 ≥ f (x′ ) + (ρ′ /2)|h(x′ )|2 [≡ fρ′ (x′ )]
and, similarly,
f (x′ ) + (ρ′′ /2)|h(x′ )|2 ≥ f (x′′ ) + (ρ′′ /2)|h(x′′ )|2 .
Taking the sum of these inequalities, we get
((ρ′ − ρ′′ )/2)|h(x′′ )|2 ≥ ((ρ′ − ρ′′ )/2)|h(x′ )|2 ,
whence, due to ρ′ > ρ′′ , v(ρ′′ ) = |h(x′′ )| ≥ |h(x′ )| = v(ρ′ ), as required.
• [Monotonicity of the actual objective] The values of the actual objective f along the path x∗ (ρ) do
not decrease with ρ.
Indeed, we already know that
f (x∗ (ρ)) = min { f (x) | x ∈ X + (ρ) } .
According to the previous statement, the sets X + (ρ) in the right hand side over which the
minimum is taken do not extend as ρ grows, and, consequently, the minimal value of f
over these sets does not decrease with ρ.
The indicated properties demonstrate the following nice behaviour of the path x∗ (ρ): as the penalty
parameter grows, the path approaches the constrained minimizer x∗ of (ECP); the values of the objective
along the path are always better (more exactly, not worse) than at x∗ and increase (actually, do not
decrease) with ρ, approaching the optimal value of the constrained problem from the left. Similarly, the
violation of constraints becomes smaller and smaller as ρ grows and approaches zero. In other words, for
every finite value of the penalty parameter it turns out to be profitable to violate constraints, getting,
as a result, certain progress in the actual objective; and as the penalty parameter grows, this violation,
same as progress in the objective, monotonically goes to zero.
An additional important property of the path is as follows:
• [Lagrange multipliers and the path] The quantities
λi (ρ) ≡ ρ hi (x∗ (ρ))
converge, as ρ → ∞, to the Lagrange multipliers λ∗i of (ECP) at x∗ . Indeed, x∗ (ρ) is a critical
point of fρ , whence
∇x f (x∗ (ρ)) + Σ_i [ρ hi (x∗ (ρ))] ∇x hi (x∗ (ρ)) = 0.
Thus, λi (ρ) are the coefficients in the representation of the vector ψ(ρ) ≡ −∇x f (x∗ (ρ))
as a linear combination of the vectors ψi (ρ) = ∇x hi (x∗ (ρ)). As we know, x∗ (ρ) → x∗
as ρ → ∞, so that ψ(ρ) → ψ ≡ ∇x f (x∗ ) and ψi (ρ) → ψi ≡ ∇x hi (x∗ ). Since the
vectors ψ1 , ..., ψm are linearly independent, from the indicated convergencies it follows
(why?) that λi (ρ) → λ∗i , ρ → ∞, where λ∗i are the (uniquely defined) coefficients in the
representation of −ψ as a linear combination of ψi , i.e., are the Lagrange multipliers.
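A two-line illustration (hypothetical names, in the spirit of the penalty sketch above): the multiplier estimates are read off the constraint residuals of the last penalized solve.

    import numpy as np

    def multiplier_estimates(h, x_rho, rho):
        """lambda_i(rho) = rho * h_i(x*(rho)): approximate Lagrange
        multipliers of (ECP) recovered from a penalty-method solution."""
        return rho * np.asarray(h(x_rho))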
Remark 9.2.1 Similar results and properties take place also for the penalty method as applied to the
general type constrained problem (GCP).
Remark 9.2.2 The quadratic penalty term we used is not, of course, the only option; for (ECP), we
could use a penalty term of the form
ρ Φ(h(x))
as well, with a smooth function Φ which is zero at the origin and positive outside the origin (in our
considerations, Φ(u) was set to (1/2)|u|2 ); similar generalizations are possible for (GCP). The results for these
more general penalties, under reasonable assumptions on Φ, would be similar to those for the particular
case we have considered.
Advantages and drawbacks. Now it is time to look at our abilities to solve the unconstrained
problems
(Pρ )    min_x fρ (x),
which, as we already know, for large ρ are good approximations of the constrained problem in question.
In principle we can solve these problems by any one of unconstrained minimization methods we know,
and this is definitely a great advantage of the approach.
There is, anyhow, a severe weak point in the construction – to approximate the constrained
problem well by an unconstrained one, we must deal with large values of the penalty parameter, and this, as
we shall see in a while, unavoidably makes the unconstrained problem (Pρ ) ill-conditioned and thus
very difficult for any unconstrained minimization method sensitive to the conditioning of the problem.
And all the methods for unconstrained minimization we know, except, possibly, the Newton method,
are “sensitive” to conditioning (e.g., in the Gradient Descent the number of steps required to achieve an
ε-solution is, asymptotically, inversely proportional to the condition number of the Hessian of the objective at
the optimal point). Even the Newton method, which does not react to the conditioning explicitly – it is
“self-scaled” – suffers a lot as applied to an ill-conditioned problem, since here we are forced to invert
ill-conditioned Hessian matrices, and this, in actual computations with their rounding errors, causes a lot
of trouble. The indicated drawback – ill-conditioning of the auxiliary unconstrained problems – is the main
disadvantage of the “straightforward” penalty scheme, and because of it the scheme is not that widely
used now and is in many cases replaced with the smarter modified Lagrangian scheme.
It is time now to justify the above claim that problem (Pρ ) is, for large ρ, ill-conditioned.
Indeed, assume that we are in the situation of Theorem 9.2.2, and let us compute the Hessian
Hρ of fρ at the point x = x∗ (ρ). The computation yields
" m
#
X
2
Hρ = ∇x f (x) + [ρhi (x)]∇x hi (x) + ρ[∇x h(x)]T [∇x h(x)].
2
i=1
We see that Hρ is comprised of two terms: the matrix in the brackets, let it be called Lρ , and
the matrix ρMρ proportional to ρ, where Mρ ≡ [∇x h(x)]T [∇x h(x)]. When ρ → ∞, then, as we know,
x = x∗ (ρ) converges to x∗ and the quantities ρhi (x∗ (ρ)) converge to the Lagrange multipliers λ∗i of (ECP),
so that Lρ possesses a quite respectable limit L, namely, the Hessian of the Lagrange function,
∇2x L(x∗ , λ∗ ). The matrix Mρ also possesses a limit, namely,
M = [∇x h(x∗ )]T [∇x h(x∗ )];
this limit, as is clearly seen, is a matrix which vanishes on the tangent plane T at x∗
to the feasible surface of the problem and is nondegenerate on the orthogonal complement
T ⊥ to this tangent plane. Since M is symmetric, both T and T ⊥ are invariant for M ,
and M possesses n − m eigenvectors with zero eigenvalues – these vectors span T – and m
eigenvectors with positive eigenvalues – these latter vectors span T ⊥ . Since
Hρ = Lρ + ρMρ ,
we conclude that the spectrum of Hρ , for large ρ, is as follows: there are n − m eigenvectors
“almost in T ” with the eigenvalues “almost equal” to those of the reduction of L onto T ;
since L is positive definite on T , these eigenvalues are positive reals. Now, Hρ possesses m
eigenvectors “almost orthogonal” to T with eigenvalues “almost equal” to ρ times the nonzero
eigenvalues of M . Thus, excluding trivial cases m = 0 (no constraints at all) and m = n
(locally unique feasible solution x∗ ), the eigenvalues of Hρ form two groups – group of n − m
asymptotically constant positive reals and group of m reals asymptotically proportional to
ρ. We conclude that the condition number of Hρ is of order of ρ, as it was claimed.
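A three-line numerical illustration of this effect (a toy example of ours, not from the text): for f (x) = x1² + ½x2² and the single linear constraint h(x) = x1 + x2 − 1, the Hessian of fρ is diag(2, 1) + ρJᵀJ with J = [1 1], and its condition number grows proportionally to ρ.

    import numpy as np

    J = np.array([[1.0, 1.0]])      # Jacobian of h(x) = x1 + x2 - 1
    L = np.diag([2.0, 1.0])         # the "respectable" limit part
    for rho in [1e1, 1e3, 1e5]:
        H_rho = L + rho * J.T @ J
        print(rho, np.linalg.cond(H_rho))   # condition number ~ rho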
Theorem 9.2.3 Let F be an interior penalty function for the feasible domain G of (ICP), and assume
that the feasible domain is bounded and is the closure of its interior int G. Then the set X ∗ (ρ) of
minimizers of Fρ on int G is nonempty, and these sets converge, as ρ → ∞, to the optimal set X ∗ of
(ICP): for every ε > 0 there exists ρ̄ such that X ∗ (ρ), for all ρ ≥ ρ̄, is contained in the ε-neighbourhood
of X ∗ .
Proof is completely similar to that of Theorem 9.2.1. First of all, X ∗ is nonempty
(since f is continuous and the feasible domain is bounded and closed and is therefore a
compact set). The fact that X ∗ (ρ) are nonempty for all positive ρ was proved in Section
9.2.1. To prove that X ∗ (ρ) is contained, for large ρ, in a tight neighbourhood of X ∗ , let us
act as follows. Same as in the proof of Theorem 9.2.1, it suffices to lead to a contradiction
the assumption that there exists a sequence {xi ∈ X ∗ (ρi )} with ρi → ∞ which converges
to a point x̄ ∈ G\X ∗ . Assume that this is the case; then f (x̄) > min_G f + δ with certain
positive δ. Let x∗ be a global minimizer of f on G. Since G is the closure of int G, x∗ can
be approximated, within an arbitrarily high accuracy, by points from int G, and since f is
continuous, we can find a point x′ ∈ int G such that
f (x′ ) ≤ f (x∗ ) + δ/2.
We have
Fρi (xi ) = min_{x∈int G} Fρi (x) ≤ Fρi (x′ ),    (9.2.8)
whence
f (xi ) + (1/ρi )F (xi ) ≤ f (x′ ) + (1/ρi )F (x′ ).
Since F is a barrier for the bounded domain G, F is bounded below on int G (since it attains
its minimum on int G – use the reasoning from Section 9.2.1 for the case of f ≡ 0). Thus,
F (x) ≥ a > −∞ for all x ∈ int G, and therefore (9.2.8) implies that
f (xi ) ≤ f (x′ ) + (1/ρi )F (x′ ) − (1/ρi )a.
As i → ∞, the right hand side in this inequality tends to f (x′ ) ≤ f (x∗ ) + δ/2, and the left
hand side tends to f (x̄) ≥ f (x∗ ) + δ; since δ > 0, we get the desired contradiction.
If I were writing this lecture in the 1980s, we would proceed with statements similar to the one of Theorem
9.2.2 and those on the behaviour of the path of minimizers of Fρ , and conclude with the same laments: “all this
is fine, but the problems of minimization of Fρ normally (when the solution to the original problem is on
the boundary of G; otherwise the problem actually is unconstrained) become the more ill-conditioned the
larger is ρ, so that the difficulties of their numerical solution grow with the penalty parameter”. Writing
this lecture now, I say something quite opposite: there exist important situations when
the difficulties in numerical minimization of Fρ do not increase with the penalty parameter, and the
overall scheme turns out to be theoretically efficient and, moreover, the best known so far. This change
in evaluation of the scheme is the result of the recent “interior point revolution” in Optimization which
we have already mentioned in Lecture 10. Those interested may get some impression of the essence of the
matter from the non-obligatory part of the lecture below; please take into account that the prerequisite
for reading it is the non-obligatory Section 6.2.4 from Lecture 4.
Assume from now on that our problem (ICP) is convex (the revolution we are speaking
about deals, at least directly, with Convex Programming only). It is well-known that convex
program (ICP) can be easily rewritten as a program with linear objective; indeed, it suffices
to extend the vector of design variables by one more variable, let it be called t, and to rewrite
(ICP) as the problem
min_{x,t} { t : f (x) − t ≤ 0, gj (x) ≤ 0, j = 1, ..., k } .
Thus, from now on we lose nothing when assuming that the objective is linear:
(P)    min_x { f (x) ≡ cT x : x ∈ G ⊂ Rn } .
Here the feasible set G of the problem is convex (we are speaking about convex programs!);
we also assume that it is closed, bounded and possesses a nonempty interior.
Our abilities to solve (P) efficiently by an interior point method depend on our abilities to
point out a “good” interior penalty F for the feasible domain. What we are interested in is
a ϑ-self-concordant barrier F ; the meaning of these words is given by the following
Definition 9.2.1 [ϑ-self-concordant barrier] Let ϑ ≥ 1. A function F is a ϑ-self-concordant
barrier for G if
• F is a self-concordant function on int G (Section 6.2.4.2, Lecture 4), i.e., a three times
continuously differentiable convex function on int G possessing the barrier property
(i.e., F (xi ) → ∞ along every sequence of points xi ∈ int G converging to a boundary
point of G) and satisfying the differential inequality
| (d3 /dt3 )|t=0 F (x + th) | ≤ 2 [hT F ′′ (x)h]^{3/2}    ∀x ∈ int G ∀h ∈ Rn ;
• F satisfies the differential inequality
| (d/dt)|t=0 F (x + th) | ≤ √ϑ [hT F ′′ (x)h]^{1/2}    ∀x ∈ int G ∀h ∈ Rn .    (9.2.9)
An immediate example is as follows (cf. “Raw materials” in Section 6.2.4.4, Lecture 4):
Example 9.2.1 [Standard logarithmic barrier for a polytope] Let
G = {x ∈ Rn | aTj x ≤ bj , j = 1, ..., m}
be a polytope given by a list of linear inequalities satisfying the Slater condition (i.e., there
exists x̄ such that aTj x̄ < bj , j = 1, ..., m). Then the function
F (x) = − Σ_{j=1}^m ln(bj − aTj x)
is an m-self-concordant barrier for G.
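For later use in a path-following sketch, the gradient and Hessian of this barrier are explicit; in Python (helper names ours):

    import numpy as np

    def logbar_grad(x, A, b):
        """Gradient of F(x) = -sum_j ln(b_j - a_j^T x): A^T (1/s)."""
        s = b - A @ x                 # slacks, positive on int G
        return A.T @ (1.0 / s)

    def logbar_hess(x, A, b):
        """Hessian of the same barrier: A^T diag(1/s^2) A."""
        s = b - A @ x
        return A.T @ ((1.0 / s**2)[:, None] * A)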
In the mean time, we shall justify this example (and shall also consider the crucial issue of
how to find a self-concordant barrier for a given feasible domain). For the time being, let us
focus on another issue: how to solve (P), given a ϑ-self-concordant barrier for the feasible
domain of the problem.
What we intend to do is to use the path-following scheme associated with the barrier –
a very natural implementation of the barrier method. Namely, let us associate with (P) the
family of aggregates
Φρ (x) = ρ cT x + F (x),   ρ > 0,
and the path of their minimizers
x∗ (ρ) = argmin_{x∈int G} Φρ (x)
(up to the scaling Φρ = ρFρ , which does not affect the minimizers, this is the path of minimizers
of the function Fρ from Section 9.2.1); as we know from Theorem 9.2.3, this path converges to the optimal set of (P) as ρ → ∞;
besides this, it can be easily seen that the path is continuous (even continuously differentiable)
in ρ. In order to approximate x∗ (ρ) with large values of ρ via the path-following scheme,
we trace the path x∗ (ρ), namely, generate sequentially approximations x(ρt ) to the points
x∗ (ρt ) along certain diverging to infinity sequence ρ0 < ρ1 < ... of values of the parameter.
This is done as follows:
given “tight” approximation x(ρt ) to x∗ (ρt ), we update it into “tight” approximation x(ρt+1 )
to x∗ (ρt+1 ) as follows:
• first, choose somehow a new value ρt+1 > ρt of the penalty parameter
• second, apply to the function Fρt+1 (·) a method for unconstrained minimization started
at x(ρt ), and run the method until closeness to the new target point x∗ (ρt+1 ) is restored,
thus coming to the new iterate x(ρt+1 ) close to the new target point of the path.
All this is very close to what we did when tracing the feasible surface with the Gradient Projection
scheme; our hope is that since x∗ (ρ) is continuous in ρ and x(ρt ) is “close” to x∗ (ρt ), for “not
too large” ρt+1 − ρt the point x(ρt ) will be “not too far” from the new target point x∗ (ρt+1 ),
so that the unconstrained minimization method we use will quickly restore closeness to the
new target point. With this “gradual” movement, we may hope to arrive near x∗ (ρ) with
large ρ faster than by attacking the problem (Pρ ) directly.
All this was known for many years; the progress during the last decade was in transforming
these qualitative ideas into exact quantitative recommendations.
Namely, it turned out that
• A. The best possibilities to carry this scheme out arise when the barrier F is ϑ-self-
concordant; the smaller the value of ϑ, the better;
• B. The natural measure of “closeness” of a point x ∈ int G to the point x∗ (ρ) of the
path is the Newton decrement of the self-concordant function
Φρ (·) = ρ cT (·) + F (·)
at the point x, i.e., the quantity
λ(Φρ , x) = ( [∇x Φρ (x)]T [∇2x Φρ (x)]−1 [∇x Φρ (x)] )^{1/2}
(cf. Proposition 6.2.2.(iii)). More specifically, the notion “x is close to x∗ (ρ)” is conve-
nient to define as the relation
λ(Φρ , x) ≤ 0.05    (9.2.10)
(in fact, 0.05 in the right hand side could be replaced with an arbitrary absolute constant
< 1, with slight modification of the subsequent statements; we choose this particular value
for the sake of simplicity).
Now, what do all these words “the best possibility” and “natural measure” actually mean?
This is explained by the following two statements.
• C. Assume that x is close, in the sense of (9.2.10), to a point x∗ (ρ) of the path x∗ (·)
associated with a ϑ-self-concordant barrier for the feasible domain G of problem (P).
Let us increase the parameter ρ to the larger value
ρ+ = (1 + 0.08/√ϑ) ρ    (9.2.11)
and replace x by its damped Newton iterate (cf. (6.2.13), Lecture 4)
x+ = x − [1/(1 + λ(Φρ+ , x))] [∇2x Φρ+ (x)]−1 ∇x Φρ+ (x).    (9.2.12)
Then x+ is close, in the sense of (9.2.10), to the new target point x∗ (ρ+ ) of the path.
C. says that we are able to trace the path (all the time staying close to it in the sense of
B.), increasing the penalty parameter at a linear rate – by the factor (1 + 0.08ϑ−1/2 ) per step – and accompanying
each step in the penalty parameter by a single Newton step in x. And why we should be
happy with this is explained by
• D. If x is close, in the sense of (9.2.10), to a point x∗ (ρ) of the path, then the inaccuracy,
in terms of the objective, of the point x as an approximate solution to (P) is bounded
from above by 2ϑρ−1 :
f (x) − min_{x∈G} f (x) ≤ 2ϑ/ρ.    (9.2.13)
D. says that the inaccuracy of the iterates x(ρt ) formed in the above path-following procedure
goes to 0 as 1/ρt , while C. says that we are able to increase ρt at a linear rate, at the cost of a single
Newton step per each updating of ρ. Thus, we come to the following
Theorem 9.2.4 Let F be a ϑ-self-concordant barrier for the feasible domain G of problem (P),
and let the starting pair (x0 , ρ0 ), ρ0 > 0, be close, in the sense of (9.2.10), to the path x∗ (·).
Consider the recurrence
ρt+1 = (1 + 0.08/√ϑ) ρt ,   xt+1 = xt − [1/(1 + λ(Φρt+1 , xt ))] [∇2x Φρt+1 (xt )]−1 ∇x Φρt+1 (xt ).    (9.2.14)
Then the iterates of the method are well-defined, belong to the interior of G and the method
possesses linear global rate of convergence:
f (xt ) − min_G f ≤ (2ϑ/ρ0 ) (1 + 0.08/√ϑ)^{−t} .    (9.2.15)
In particular, to make the residual in f less than a given ε > 0, it suffices to perform no
more than
N (ε) ≤ ⌈ 20 √ϑ ln ( 1 + 20ϑ/(ρ0 ε) ) ⌉    (9.2.16)
Newton steps.
We see that the parameter ϑ of the self-concordant barrier underlying the method is respon-
sible for the Newton complexity of the method – the factor at the log-term in the complexity
bound (9.2.16).
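To make C.–D. concrete, here is a minimal sketch of the resulting short-step method (names ours; for G a polytope one can plug in the logarithmic-barrier derivatives from Example 9.2.1 with ϑ = m):

    import numpy as np

    def short_step_path_following(c, barrier_grad, barrier_hess,
                                  x0, rho0, theta, n_steps):
        """Tracing the path per C.: multiply rho by (1 + 0.08/sqrt(theta))
        and follow with one damped Newton step on Phi_rho = rho*c^T x + F(x).
        (x0, rho0) is assumed close to the path in the sense of (9.2.10)."""
        x, rho = np.asarray(x0, dtype=float), rho0
        for _ in range(n_steps):
            rho *= 1.0 + 0.08 / np.sqrt(theta)    # penalty update (9.2.11)
            grad = rho * c + barrier_grad(x)      # gradient of Phi_rho
            step = np.linalg.solve(barrier_hess(x), grad)
            lam = np.sqrt(grad @ step)            # Newton decrement
            x = x - step / (1.0 + lam)            # damped Newton (9.2.12)
        return x, rho                             # accuracy ~ 2*theta/rho by D.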
Remark 9.2.3 The presented result does not explain how to start tracing the path – how
to get an initial pair (x0 , ρ0 ) close to the path. This turns out to be a minor difficulty: given
in advance a strictly feasible solution x̄ to (P), we could use the same path-following scheme
(applied to a certain artificial objective) to come close to the path x∗ (·), thus arriving at a
position from which we can start tracing the path. In our very brief outline of the topic, it
makes no sense to go into these “details of initialization”; it suffices to say that the necessity to
start from approaching x∗ (·) basically does not spoil the overall complexity of the method.
It makes sense, if not to prove the aforementioned statements – the complete proofs, although
rather simple, go beyond the scope of our lecture – then at least to motivate them – to explain
the role of self-concordance and of the “magic inequality” (9.2.9) in ensuring properties C.
and D. (and this is all we need – the Theorem, of course, is an immediate consequence of these
two properties).
Let us start with C. – this property is much more important. Thus, assume we are at a
point x close, in the sense of (9.2.10), to x∗ (ρ). What does this inequality actually say?
Let us denote by
‖h‖_{H⁻¹} = (hT H −1 h)^{1/2}
the scaled Euclidean norm given by the inverse to the Hessian matrix
H = ∇2x Φρ (x) = ∇2x F (x)
(the equality comes from the fact that Φρ and F differ by a linear function ρf (x) ≡ ρcT x).
Note that by definition of λ(·, ·) one has, for every ρ′ > 0,
λ(Φρ′ , x) = ‖∇x Φρ′ (x)‖_{H⁻¹} = ‖ρ′ c + ∇x F (x)‖_{H⁻¹} .
Due to the last formula, the closeness of x to x∗ (ρ) (see (9.2.10)) means exactly that
‖ρc + ∇x F (x)‖_{H⁻¹} ≤ 0.05, whence
‖ρc‖_{H⁻¹} ≤ 0.05 + ‖∇x F (x)‖_{H⁻¹} ≤ 0.05 + √ϑ    (9.2.17)
(the concluding inequality here is given by (9.2.9) 8) , and this is the main point where this
component of the definition of a self-concordant barrier comes into the play).
From the indicated relations,
λ(Φρ+ , x) = ‖ρ+ c + ∇x F (x)‖_{H⁻¹} ≤ (|ρ+ − ρ|/ρ) ‖ρc‖_{H⁻¹} + λ(Φρ , x) ≤
[see (9.2.11), (9.2.17)]
≤ (0.08/√ϑ)(0.05 + √ϑ) + 0.05 ≤ 0.134
(note that ϑ ≥ 1 by Definition 9.2.1). According to Proposition 6.2.2.(iii.3), Lecture 4,
the indicated inequality says that we are in the domain of quadratic convergence of the
damped Newton method as applied to the self-concordant function Φρ+ ; namely, the indicated
Proposition says that
λ(Φρ+ , x+ ) ≤ 2(0.134)2 /(1 − 0.134) < 0.05,
as claimed in C. Note that this reasoning heavily exploits the self-concordance of F .
Establishing property D. requires analyzing the notion of a self-concordant
barrier in more detail, and we are not going to do it here. Just to demonstrate where ϑ comes from, let
us prove an estimate similar to (9.2.13) for the particular case when, first, the barrier in
question is the standard logarithmic barrier given by Example 9.2.1 and, second, the point
x is exactly the point x∗ (ρ) rather than a point close to the latter. Under the outlined
assumptions we have
assumptions we have
x = x∗ (ρ) ⇒ ∇x Φρ (x) = 0 ⇒
[substitute expressions for Φρ and F ]
m
X aj
ρc + =0⇒
j=1
bj − aTj x
[take into account that aTj (x∗ − x) = aTj x∗ − aTj x ≤ bj − aTj x due to x∗ ∈ G]
≤ m,
whence
m ϑ
cT (x − x∗ ) ≤ ≡
ρ ρ
(for the case in question ϑ = m). This estimate is better than (9.2.13) by a factor of 2 – this is because
we have considered the case of x = x∗ (ρ) rather than that of x close to x∗ (ρ).
8) Indeed, for a positive definite symmetric matrix H it clearly is the same (why?) to say that ‖g‖_{H⁻¹} ≤ α and
to say that |g T h| ≤ α ‖h‖_H for all h.
9.2.3.3 Applications
The most famous (although, we believe, not the most important) application of Theorem
9.2.4 deals with Linear Programming, when G is a polytope and F is the standard logarithmic
barrier for this polytope (see Example 9.2.1). For this case, the Newton complexity of the
method9) is O(√m), m being the # of linear inequalities involved in the description of
G. Each Newton step costs, as is easily seen, O(mn2 ) arithmetic operations, so that
the arithmetic cost per accuracy digit – the number of arithmetic operations required to reduce
the current inaccuracy by an absolute constant factor – turns out to be O(m1.5 n2 ). Thus, we get
a polynomial time solution method for LP, with complexity characteristics typically (for m
and n of the same order) better than those for the Ellipsoid method (Lecture 7). Note also
that with certain “smart” implementation of Linear Algebra, the above arithmetic cost can
be reduced to O(mn2 ); this is the best known so far cubic in the size of the problem upper
complexity bound for Linear Programming.
To extend the list of application examples, note that our ability to solve, in the outlined style,
a convex program of a given structure is limited only by our ability to point out a self-
concordant barrier for the corresponding feasible domain. In principle, there are no limits
at all: it can be proved that every closed convex domain in R^n admits a self-concordant
barrier with the value of the parameter at most O(n). This “universal barrier”, however, is given by
a certain multivariate integral and is too complicated for actual computations; recall that we
should form and solve the Newton systems associated with our barrier, so that we need it to be
“explicitly computable”.
Thus, we come to the following important question:
How to construct “explicit” self-concordant barriers. There are many cases when we
are clever enough to point out “explicitly computable” self-concordant barriers for the convex
domains we are interested in. We already know one example of this type – Linear Program-
ming (although we do not know at the moment why the standard logarithmic barrier for
a polytope given by m linear constraints is m-self-concordant; why this is so will become
clear in a moment). What helps us to construct self-concordant barriers and to evaluate
their parameters are the following extremely simple combination rules, completely similar to
those for self-concordant functions (see Section 6.2.4.4, Lecture 4):
• [Linear combination with coefficients ≥ 1] Let Fi, i = 1, ..., m, be ϑi-self-concordant
barriers for the closed convex domains Gi, let the intersection G of these domains
possess a nonempty interior, and let αi ≥ 1, i = 1, ..., m, be given reals. Then the
function

F(x) = Σ_{i=1}^m αi Fi(x)

is a (Σ_{i=1}^m αi ϑi)-self-concordant barrier for G.
• [Affine substitution] Let F(x) be a ϑ-self-concordant barrier for the closed convex domain
G ⊂ R^n, and let x = Aξ + b be an affine mapping from R^k into R^n with the image
intersecting int G. Then the composite function

F⁺(ξ) = F(Aξ + b)

is a ϑ-self-concordant barrier for the closed convex domain

G⁺ = {ξ | Aξ + b ∈ G},

the inverse image of G under the mapping.
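These two rules already explain why the standard logarithmic barrier for a polytope given by m linear inequalities is m-self-concordant; here is the derivation in one line, using the standard fact that −ln t is a 1-self-concordant barrier for the nonnegative ray:

% -\ln t is a 1-s.c. barrier for \{t \ge 0\}; the affine substitution
% t = b_j - a_j^T x preserves the parameter, and summation with weights 1 adds the parameters up:
F(x) \;=\; \sum_{j=1}^{m} -\ln\bigl(b_j - a_j^T x\bigr)
\qquad\Longrightarrow\qquad
\vartheta(F) \;\le\; \sum_{j=1}^{m} 1\cdot 1 \;=\; m.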
9)
recall that this is the factor at the logarithmic term in (9.2.16), i.e., the number of Newton steps sufficient to reduce
the current inaccuracy by an absolute constant factor, say, by factor 2; cf. the stories about polynomial time
methods from Lecture 7
Let us add one more example of an explicit barrier: the function

F(X) = − ln Det X

is an n-self-concordant barrier for the cone S^n_+ of positive semidefinite symmetric n×n matrices.
One can hardly imagine how wide the class of applications of such barriers is – from Combinatorial op-
timization to Structural Design and Stability Analysis/Synthesis in Control – especially of the ln Det one.
What matters in practice is not just the theoretical upper bound on the worst-case complexity of the method, but an
indication of the “typical” performance of the algorithm. A method actually working
according to the complexity estimate (9.2.16) could be fine theoretically, but it would definitely
be of very restricted practical interest in the large-scale case. For example, for an LP program
with m ≈ 10^5 inequality constraints and n ≈ 10^4 variables (respectable, but in no
sense “outstanding” sizes for a practical LP program), estimate (9.2.16) predicts something
like hundreds of Newton steps with Newton systems of size 10^4 × 10^4; even in the case
of good sparsity structure of the systems, such a computation would be much more time
consuming than that performed by the Simplex Method.
In order to get “practical” path-following methods, we need long-step tactics – rules for
on-line adjustment of the stepsizes in the penalty parameter to the local curvature of the
path, rules which allow us to increase the parameter as fast as possible – possible from the viewpoint
of the actual “numerical circumstances” the method encounters, rather than from the viewpoint of
a very conservative theoretical worst-case complexity analysis.
Today there exist efficient “long-step” policies of tracing the paths – policies which are both
fine theoretically (i.e., satisfy the complexity bound (9.2.16)) and very efficient computationally.
An extremely surprising phenomenon here is that for “good” long-step path-following methods
as applied to convex problems of the most important classes (Linear Programming, Quadrat-
ically Constrained Convex Quadratic Programming and some others) it turns out that

the actually observed number of Newton iterations required to solve the problem
within reasonable accuracy is basically independent of the size of the problem
and is within 30–50.
This “empirical fact” (which can be only partly supported by theoretical considerations, not
proved completely) is extremely important for applications; it makes polynomial time interior
point methods the most attractive (and sometimes the only appropriate) optimization tool
in many important large-scale applications.
We should add that the efficient “long-step” implementations of the path-following scheme
are relatively new; for a long time the only interior point methods which demonstrated
the outlined “data- and size-independent” convergence rate were the so-called potential re-
duction interior point methods. In fact, the very first interior point method – the method
of Karmarkar for LP – which initiated the entire interior point revolution, was a potential
reduction algorithm, and what indeed caused the revolution was the outstanding practical per-
formance of this method. The method of Karmarkar possesses a very nice (and in fact very
simple) geometry and is closely related to the interior penalty scheme; however, time limita-
tions force us to skip the description of this wonderful, although now a little old-fashioned,
algorithm.
The concluding remark we would like to make is as follows: all polynomial time implemen-
tations of the penalty/barrier scheme known so far are implementations of the barrier scheme (which
is reflected in the name of these implementations: “interior point methods”); numerous at-
tempts to do something similar with the penalty approach have failed. This is a
pity, because of some attractive properties of the penalty scheme (e.g., there one does not face the
problem of finding a feasible starting point, which, of course, is needed to start the barrier
scheme).
Lecture 10
Augmented Lagrangians
Penalty methods studied in the previous lecture are very natural and general; unfortunately, they suffer
from a serious drawback: to approximate well the solution to a constrained problem, you have to work
with large penalty parameters, and this inevitably makes the problem of unconstrained minimization of
the penalized objective very ill-conditioned. The main advantage of the augmented Lagrangian methods is
that they allow us to approximate well the solution to a constrained problem by solutions of unconstrained
(and penalized) auxiliary problems without pushing the penalty parameter to infinity; as a result, the
auxiliary problems remain reasonably conditioned even when we seek high-accuracy solutions.
Let us look at how all this works. We shall mainly focus on the case of the equality constrained problem

(ECP)  f(x) → min | h(x) = (h1(x), ..., hm(x))^T = 0  [x ∈ R^n];

later I shall explain how the results for this case can be straightforwardly extended to the
general one.
Let x∗ be a locally optimal solution to (ECP), let

L(x, λ) = f(x) + Σ_{i=1}^m λi hi(x)

be the Lagrange function of the problem, and let λ∗ be the vector of Lagrange multipliers corresponding
to the solution x∗, so that

∇x L(x∗, λ∗) = 0,   (10.1.1)

and the matrix ∇²x L(x∗, λ∗) is positive definite along the plane tangent at x∗ to the feasible surface of
the problem:

∇h(x∗)d = 0 ⇒ d^T ∇²x L(x∗, λ∗)d ≥ α|d|²   (10.1.2)

with some positive α.
Assume for a moment that instead of (10.1.2) a stronger condition is satisfied:
(!) the matrix ∇²x L(x∗, λ∗) is positive definite on the entire space.
What would be the consequences of this stronger assumption?
The consequences are immediate: (10.1.1) together with (!) means that x∗ is a nondegenerate local
minimizer of the Lagrange function L(·, λ∗). It follows that if we knew λ∗ in advance, we could
find x∗ by applying to L(·, λ∗) any method of unconstrained minimization.
The dual function L(λ) attains its maximum over λ ∈ Λ at the point λ∗, the maximizer being unique and nonde-
generate1).
The problem is well-posed and basically unconstrained: the Theorem says that for λ close enough to λ∗, Lλ
possesses a unique critical point x∗(λ) in V, and this point is a nondegenerate local minimizer. Thus, we can
approximate x∗(λ) by unconstrained minimization tools. And from (10.1.3) it follows that x∗(λ) gives us
not only the value L(λ) of the dual objective, but also its gradient – this gradient is simply the vector
h(x∗(λ)) of the constraint values at x∗(λ).
With these observations, we see that it is possible to approximate λ∗ by applying to L(·) an uncon-
strained optimization method (Gradient Descent, a Quasi-Newton method or whatever else); in order to provide
the method with the required first-order information at the current dual iterate λ, it suffices to find, again
by unconstrained minimization, the minimizer x∗(λ) of the function Lλ(x). And what is nice in this
picture is that there are no “large parameters”: both the dual problem of maximizing L(λ) and the
auxiliary primal problems of minimizing Lλ(x) possess, at least in neighbourhoods of their solutions, nice
smoothness and nondegeneracy properties, and the values of the parameters responsible for these
properties are not spoiled from iteration to iteration.
Now let us penalize the objective in the usual way, passing from f(x) to fρ(x) = f(x) + (ρ/2) Σ_{i=1}^m h²i(x);
but in contrast to what we did in the pure penalty method, let us keep the constraints as they were, thus
coming to the problem

(ECPρ)  fρ(x) → min | h(x) = 0.

Since the penalty term is zero on the feasible surface, the new problem is equivalent to the initial one,
and one may ask what the goal of this strange manipulation is. The answer is as follows:
(+) if x∗ is a nondegenerate solution to the original problem (ECP), then it is a nondegenerate local
solution to (ECPρ) as well [which by itself is not that important] and, moreover, for all large enough ρ
this nondegenerate solution to (ECPρ) satisfies the crucial property (!).

The immediate consequence of (+) is as follows: passing from (ECP) to the equivalent problem (ECPρ),
we may ensure (!) and thus get all the nice possibilities of solving (ECPρ) (or, which is the same, (ECP))
outlined in the previous section.
Even if you believe in (+), you may ask: all this is fine, but we again meet with a “large enough” ρ;
this large ρ will eventually occur in the objectives of the auxiliary primal problems given by the scheme
of the previous section, and we know from our discussion of the penalty methods that it would make
these auxiliary problems very difficult for numerical solution. This would be a quite reasonable objection,
if there were no crucial difference between the situation with penalty methods and our present situation. In
the penalty scheme, the value of the penalty parameter is directly linked with the accuracy to which
we are going to solve the problem: as we remember, the violations of the constraints at the unconstrained
minimizer of the penalized objective are of order of 1/ρ, and to make these violations ≤ ε we must work with
a penalty parameter of order of 1/ε. The role of penalization now is quite different: it should enforce (!),
i.e., positive definiteness (on the entire space) of the Hessian of the Lagrangian Lρ(x, λ) of the penalized
problem, and (+) says that all large enough values of ρ are appropriate for this purpose. Whenever a
given ρ results in a positive definite ∇²x Lρ(x∗, λ∗), we may use this value of the penalty in the scheme of
the previous section to solve problem (ECPρ) (or, which is the same, our original problem – they are
equivalent!) to arbitrarily high accuracy ε. Thus, in our present situation we are not forced to push the
penalty parameter to infinity and can be satisfied by a fixed value of it; whether this value indeed must be
large and will cause numerical difficulties when solving the auxiliary primal problems, or is quite moderate
and no problems of this type occur – depends on the local properties of the data, but not on the accuracy
to which the problem should be solved. The simplest way to see the indicated difference is to look at what
happens in the convex case (linear constraints, convex objective). It can easily be seen that here every
positive value of ρ makes ∇²x Lρ(x∗, λ∗) positive definite, so that every value of ρ is “large enough”; note
that in the penalty scheme we have to work with really large penalties independently of whether the
problem is convex or not.
It is time now to support our claim (+). To see that (+) indeed is valid, let us denote by λ∗ the
vector of Lagrange multipliers associated with x∗ in the original problem (ECP); as we know, this vector
is uniquely defined by the KKT condition

∇x f(x∗) + Σ_{i=1}^m λ∗i ∇hi(x∗) = 0.
Computing the gradient of the penalized objective, ∇x fρ(x) = ∇x f(x) + ρ Σ_{i=1}^m hi(x)∇x hi(x), we
see that on the feasible surface (and, in particular, at the point x∗), where h(x) = 0, the gradients
of the actual and the penalized objectives coincide with each other; thus, λ∗ serves as the vector of
Lagrange multipliers associated with x∗ in problem (ECPρ) as well.
Now let us compute the Hessian with respect to x of the Lagrange function

Lρ(x, λ) = fρ(x) + Σ_{i=1}^m λi hi(x) ≡ f(x) + Σ_{i=1}^m λi hi(x) + (ρ/2) Σ_{i=1}^m h²i(x) ≡ L(x, λ) + (ρ/2) Σ_{i=1}^m h²i(x)   (10.1.4)

of the modified problem (ECPρ) (from now on, L(x, λ) means the Lagrangian of the original problem).
We have

∇x Lρ(x, λ) = ∇x L(x, λ) + ρ Σ_{i=1}^m hi(x)∇x hi(x),

whence

∇²x Lρ(x, λ) = ∇²x L(x, λ) + ρ Σ_{i=1}^m ∇x hi(x)[∇x hi(x)]^T + ρ Σ_{i=1}^m hi(x)∇²x hi(x).
Introducing the notation

H(x, λ) = ∇²x L(x, λ),  A(x) = ∇x h(x) (the m × n matrix with the rows [∇h1(x)]^T, ..., [∇hm(x)]^T),  R(x) = Σ_{i=1}^m hi(x)∇²x hi(x),

we can rewrite the latter relation as

∇²x Lρ(x, λ) = H(x, λ) + ρA^T(x)A(x) + ρR(x).   (10.1.5)

We would like to prove that the left hand side of this relation at the point (x = x∗, λ = λ∗) is positive
definite, provided that ρ is large enough. To this end let us make the following useful observations:
• H(x∗, λ∗) is positive definite on the subspace T of directions tangent to the feasible surface at
the point x∗; this is the second order part of the Second Order Sufficient Optimality condition,
which is assumed to be satisfied at x∗ (recall that we are speaking all the time about the case
when x∗ is a nondegenerate solution to (ECP)). As we know, T is exactly the kernel of the matrix
∇h(x∗) = A(x∗):

T = {d | A(x∗)d = 0};
• the matrix A^T(x∗)A(x∗) is positive semidefinite (as is any matrix of the form B^T B: indeed,
d^T B^T Bd = |Bd|²); its kernel, as is immediately seen from the computation in parentheses,
is exactly the kernel of A(x∗), i.e., T;
• the matrix R(·) simply vanishes at x∗, due to the presence of the factors hi(x) in the expression for the
matrix: at the point x∗ – this point is feasible for (ECP) – these factors become zero, so that
R(x∗) = 0.
We conclude that
∇2x Lρ (x∗ , λ∗ ) = H(x∗ , λ∗ ) + ρAT (x∗ )A(x∗ ),
and the fact that the right hand side is positive definite for large enough ρ is an immediate consequence
of the following simple fact (prove it!):
Lemma 10.1.1 Let D be a positive semidefinite matrix and C be a symmetric matrix which is positive
definite on the kernel of D. Then the matrix C + ρD is positive definite for all large enough values of ρ.
To derive (+) from the above observations, it suffices to apply the Lemma to D = A^T(x∗)A(x∗) and
C = H(x∗, λ∗).
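For completeness, here is a sketch of one possible (compactness) proof of the Lemma:

\text{Suppose, on the contrary, that for every } k \text{ there is a unit vector } d_k
\text{ with } d_k^T(C+kD)d_k \le 0.\\
\text{Since } D \succeq 0 \text{ and } |d_k^T C d_k| \text{ is bounded, we get }
0 \le d_k^T D d_k \le -\tfrac{1}{k}\,d_k^T C d_k \to 0,\quad k \to \infty.\\
\text{Passing to a subsequence with } d_k \to d,\ |d|=1, \text{ we obtain } d^T D d = 0,
\text{ i.e., } Dd = 0;\\
\text{at the same time } d_k^T C d_k \le -k\, d_k^T D d_k \le 0 \text{ gives } d^T C d \le 0,\\
\text{which contradicts positive definiteness of } C \text{ on } \mathrm{Ker}\, D.\qquad\square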
Let us fix ρ such that the Hessian ∇²x Lρ of the augmented Lagrangian,
taken at the point (x∗, λ∗), is positive definite on the entire space; here x∗ is the nondegenerate solution of
(ECP) to be approximated and λ∗ is the corresponding vector of Lagrange multipliers. To make the further
considerations clear, let us summarize what has been said about the situation in the discussion of Section 10.1.1.
We know that

• to find λ∗ is the same as to solve the dual problem of maximizing the dual objective Lρ(λ) = min_x Lρ(x, λ);
Now we may run any method for unconstrained maximization of the dual objective Lρ (“outer iterations”
generating a sequence λ0, λ1, ... of iterates converging to λ∗), accompanying every step t of the method by
solving the auxiliary primal problem of minimizing Lρ(x, λt−1) in x, in order to find

xt−1 = xρ(λt−1)

and thus to get the first order information on the dual objective which is required to update λt−1 into
λt.

Implementations of the scheme differ from each other mainly in the methods used to maximize the
dual objective and to solve the auxiliary primal problems.
Relation (10.2.2) has a very simple motivation: since xt−1 is an unconstrained minimizer of Lρ(·, λt−1), we
have

0 = ∇x Lρ(xt−1, λt−1) = ∇x f(xt−1) + ρ Σ_{i=1}^m hi(xt−1)∇x hi(xt−1) + Σ_{i=1}^m (λt−1)i ∇hi(xt−1) =

= ∇x f(xt−1) + Σ_{i=1}^m [(λt−1)i + ρhi(xt−1)] ∇hi(xt−1).
Comparing this equality with the one

0 = ∇x f(x∗) + Σ_{i=1}^m λ∗i ∇x hi(x∗)

defining the Lagrange multipliers λ∗, we see that the right hand side of (10.2.2) can be viewed as a natural
approximation of λ∗ – the better, the closer xt−1 is to x∗.
This is only a motivation, of course; now let us look at what indeed can be said about recurrence (10.2.2).
From (10.2.1) we see that h(xt−1) is exactly the gradient of the dual objective at λt−1, so that
(10.2.2) is in fact the Gradient Ascent method for the dual problem with constant stepsize ρ (recall that the
dual problem is a maximization one, so that a gradient-type method here should perform steps along
the gradient rather than along the antigradient, as in the minimization case; this is why we are
speaking about Gradient Ascent).
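Schematically, the resulting method of multipliers looks as follows (a minimal Python sketch; the inner solver, the tolerances and the toy problem are our own illustrative choices):

import numpy as np
from scipy.optimize import minimize

def multiplier_method(f, h, x0, lam0, rho=10.0, outer_iters=20):
    # Outer iterations: dual Gradient Ascent (10.2.2), lam <- lam + rho*h(x(lam)),
    # where x(lam) is found by unconstrained minimization of L_rho(., lam).
    x, lam = np.asarray(x0, float), np.asarray(lam0, float)
    for _ in range(outer_iters):
        L_rho = lambda z, l=lam: f(z) + l @ h(z) + 0.5 * rho * np.sum(h(z) ** 2)
        x = minimize(L_rho, x).x        # auxiliary primal problem
        lam = lam + rho * h(x)          # Gradient Ascent step on the dual
    return x, lam

# toy instance: min x1^2 + 2*x2^2  s.t.  x1 + x2 - 1 = 0
f = lambda z: z[0] ** 2 + 2 * z[1] ** 2
h = lambda z: np.array([z[0] + z[1] - 1.0])
x_opt, lam_opt = multiplier_method(f, h, x0=[0.0, 0.0], lam0=[0.0])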
To understand the rate of convergence of this recurrence, let us compute the Hessian of the dual objective.
Since the gradient of the dual objective is h(x(λ)), its Hessian is

Ψρ(λ) = [∇x h(x(λ))] ∇λ x(λ).   (10.2.3)

Further, the minimizer x(λ) of the augmented Lagrangian satisfies the identity

∇x Lρ(x(λ), λ) = 0;

differentiating this identity in λ, we get

∇²x Lρ(x(λ), λ) ∇λ x(λ) + (∂²/∂λ∂x) Lρ(x(λ), λ) = 0;

as is immediately seen from the expression for Lρ, the second term in the left hand side of the latter
equation is [∇x h(x(λ))]^T, and we get

∇λ x(λ) = −[Φρ]^{-1} [∇x h(x(λ))]^T,

Φρ being the Hessian of the augmented Lagrangian Lρ with respect to x taken at the point (x(λ), λ).
Substituting this expression into (10.2.3), we get

Ψρ(λ) = −[∇x h(x(λ))] [Φρ]^{-1} [∇x h(x(λ))]^T.   (10.2.4)

Substituting the expression (10.1.5) for the Hessian of the augmented Lagrangian Lρ in x, we get the
following important conclusion:

The Hessian Ψρ of the dual objective Lρ(·) at the point λ∗ is given by

Ψρ = −A[H + ρA^T A]^{-1}A^T,   (10.2.5)

where A = ∇x h(x∗) and

H = ∇²x f(x∗) + Σ_{i=1}^m λ∗i ∇²x hi(x∗)

is the Hessian with respect to x of the Lagrange function of the original problem (ECP), the Hessian
being taken at the point (x∗, λ∗).
Relation (10.2.5) allows us to make the following crucial observation:

Proposition 10.2.1 The matrix ρΨρ tends to −I_m as ρ → ∞.

Proof. Assuming, for the sake of simplicity, that the matrices H and AH^{-1}A^T are nonsingular (this
assumption can be eliminated by a slightly more complicated reasoning) and applying the Sherman–
Morrison formula, we get

ρΨρ = −ρS(I_m + ρS)^{-1},  S = AH^{-1}A^T,

and the right hand side clearly tends to −I_m as ρ → ∞.
Corollary 10.2.1 [Dual convergence rate] Let ρ be large enough, and let the starting point λ0 in re-
currence (10.2.2) be close enough to λ∗ . Then recurrence (10.2.2) converges to λ∗ linearly with the
convergence ratio κ(ρ) which tends to 0 as ρ → ∞.
The statement given by the Corollary is indeed very important: it establishes (local) linear convergence
of the simple and easily implementable recurrence (10.2.2), provided that ρ is large enough; moreover, the
convergence can be made arbitrarily fast by choosing ρ large enough. Of course, this observation does not say
that in actual computations you should set ρ to a huge value and get the solution in one step: this
“one step” is one step of the outer iteration in λ, and to perform this step, you have to solve the auxiliary
primal problem with a huge value of the penalty, which, as we already know, is difficult (note also that we
know, without any dual considerations, that solving the auxiliary primal problem with huge ρ we get a good
approximate solution to (ECP) – this is just the penalty method). The actual meaning of the statement is, of
course, that we may hope to find a value of the penalty which is “large enough” to ensure a reasonable
rate of convergence of (10.2.2) and at the same time not “too large”, so that the auxiliary primal
problems do not become very difficult.
It is time now to prove the Corollary. We shall present only a model of the proof. Namely, let us believe
that the behaviour of recurrence (10.2.2), started close enough to λ∗, is basically the same as if the dual
objective were the quadratic function

ψ(λ) = ½(λ − λ∗)^T Ψρ (λ − λ∗);

it indeed is so, but let us skip the required dull (although simple) justification. With this quadratic model
of the dual objective, the result can be obtained in one line. Recall that (10.2.2) is Gradient Ascent with
stepsize ρ, so that

λt − λ∗ = λt−1 + ρ∇ψ(λt−1) − λ∗ = (I_m + ρΨρ)(λt−1 − λ∗);

thus, the residual vector λt − λ∗ is multiplied at each step by the matrix I_m + ρΨρ, and this matrix tends
to 0 as ρ → ∞ by Proposition 10.2.1 and therefore is close to 0 for large ρ.
2)
coming from the standard expansion (1 + ε)^{-1} = 1 − ε + ε² − ε³ + ...,
which is valid, for all ε small in absolute value, also in the matrix case
One may ask whether the simple recurrence (10.2.2) is indeed the best we can use. The question is quite
reasonable: for fixed ρ – and our main motivation for the entire method was the desire to deal with a perhaps
large, but fixed value of the penalty parameter – the rate of convergence of this recurrence is linear. It is
easily seen that the distance between xt and the exact solution x∗ is of order of the distance between λt and
λ∗, so that with recurrence (10.2.2) we get only linear convergence of both the primal and the dual variables.
Could we get something better? Of course we could: relation (10.2.4) expresses the Hessian of the dual
objective via data which become available after x(λ) = xρ(λ) is found from the solution of the auxiliary
primal problem; thus, we could solve the dual problem by the Newton method, thereby ensuring quadratic
rate of local convergence of the scheme.
The standard way to incorporate the inequality constraints gj(x) ≤ 0 is to convert them into equalities
gj(x) + s²j = 0 by means of slack variables sj, thus passing to an equality constrained problem (ECP∗)
with an extended list of variables. Applying the Augmented Lagrangian Scheme to the resulting problem,
we come to the necessity to work with the augmented Lagrangian

Lρ(x, s; λ, μ) = f(x) + Σ_{i=1}^m λi hi(x) + Σ_{j=1}^k μj [gj(x) + s²j] + (ρ/2) Σ_{i=1}^m h²i(x) + (ρ/2) Σ_{j=1}^k [gj(x) + s²j]²;

what we should do with this Lagrangian is to find its (local) saddle point, and to this end the scheme in
question suggests maximizing with respect to the dual variables λ, μ the function

Lρ(λ, μ) = min_{x,s} Lρ(x, s; λ, μ).
In fact we have no problems with the minimization with respect to sj in the right hand side of the latter
relation. Indeed, explicit computation results in

Lρ(x, s; λ, μ) = f(x) + (ρ/2) Σ_{j=1}^k [gj(x) + s²j + μj/ρ]² + Σ_{i=1}^m λi hi(x) + (ρ/2) Σ_{i=1}^m h²i(x) − Σ_{j=1}^k μ²j/(2ρ).

It follows that minimization of the Lagrange function with respect to the part s of the primal
variables results in setting s²j equal to the quantity −(gj(x) + μj/ρ) when this quantity is nonnegative,
and to zero when this quantity is negative. Consequently,

Lρ(λ, μ) = min_x { f(x) + (ρ/2) Σ_{j=1}^k (gj(x) + μj/ρ)²₊ + Σ_{i=1}^m λi hi(x) + (ρ/2) Σ_{i=1}^m h²i(x) } − Σ_{j=1}^k μ²j/(2ρ),   (10.3.1)

where, as in the previous Lecture, (a)₊ is the positive part of a real a (a itself, if it is positive, and 0
otherwise).
Thus, the auxiliary primal problems arising in the Augmented Lagrangian Scheme are in fact problems
in the initial design variables x only.
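For illustration, here is how the objective of this auxiliary problem could be coded (a sketch under the notation of (10.3.1); the function names are our own):

import numpy as np

def aux_primal_objective(x, lam, mu, rho, f, h, g):
    # Objective of the auxiliary primal problem in (10.3.1); the constant term
    # -sum_j mu_j^2/(2*rho) of the dual objective is omitted, since it does not
    # depend on x.
    gp = np.maximum(g(x) + mu / rho, 0.0)      # (g_j(x) + mu_j/rho)_+
    return (f(x)
            + 0.5 * rho * np.sum(gp ** 2)      # contribution of the eliminated slacks
            + lam @ h(x)
            + 0.5 * rho * np.sum(h(x) ** 2))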
From the theoretical viewpoint, the following question is important. Validity of the Augmented La-
grangian Scheme for the equality constrained problem (i.e., the theoretical local convergence results of the
previous sections) required nondegeneracy of the solution x∗ to be approximated. Thus, we should
understand when a locally optimal solution x∗ to (GCP), extended in the evident way (by setting
s∗j = √(−gj(x∗))) to a solution (x∗, s∗) of (ECP∗), results in a nondegenerate3) solution to the latter problem.

3)
Definition 4.2.1: x∗ is regular for the constraints, satisfies the Second Order Sufficient Optimality conditions
and, besides this, the Lagrange multipliers related to all inequality constraints active at x∗ are strictly positive – this
is called strict complementary slackness
Consider the convex program

(CnvP)  min_x {f(x) : gj(x) ≤ 0, j = 1, ..., k},

where f and the gj are convex and twice continuously differentiable functions on R^n. From now on, let
us assume that the problem is solvable and satisfies the Slater condition: there exists x̄ such that all the
constraints are strictly negative at x̄.
Those who have passed through the course “Optimization I” know, and the others should believe, that
under the indicated assumptions one can define the problem dual to (CnvP), namely, the problem

(CnvD)  φ∗(μ) → max, μ ≥ 0,  where φ∗(μ) = min_x L(x, μ),

with

L(x, μ) = f(x) + Σ_{j=1}^k μj gj(x)

being the classical Lagrange function of problem (CnvP). From the construction of the dual problem it
follows that the dual objective is a concave function taking values in the real axis extended by −∞. Due
to the concavity of the dual objective, (CnvD) is in fact a convex optimization problem (it is the same
to maximize a concave function φ∗ or to minimize its negation −φ∗, which is a convex function).
The links between the primal problem (CnvP) and its dual (CnvD) are given by the Convex Program-
ming Duality Theorem, which states that

If (CnvP) is solvable and satisfies the Slater condition, then (CnvD) is also solvable, and the optimal
values of the problems are equal to each other. In this case a feasible solution x∗ to (CnvP) and a feasible
solution μ∗ to (CnvD) are optimal for the corresponding problems if and only if (x∗, μ∗) is a saddle point
of the Lagrange function L(x, μ) on R^n × R^k₊ (i.e., L(x, μ∗) attains its minimum over x ∈ R^n at x = x∗,
while L(x∗, μ) attains its maximum over μ ≥ 0 at μ = μ∗).

In particular, all optimal solutions to (CnvP) are among the unconstrained minimizers of the function
L(x, μ∗), μ∗ being any optimal solution of the dual problem.

Note that Theorem 10.1.1 is a local version of this Duality Theorem.
The Duality Theorem has a lot of applications, both of theoretical and of computational type. From
the computational viewpoint, there are two ways to exploit the theorem:

• There are cases when the structure of the data in (CnvP) allows one to compute the dual objective
analytically (this is the case in Linear, Linearly Constrained Quadratic and Geometric Program-
ming). Whenever this is the case, we may gain a lot by solving (CnvD) instead of (CnvP) and then
recovering the primal solutions from the dual ones (which sometimes is easy and sometimes is not,
see below).

Computational consequences of this type clearly are restricted by our ability to compute the dual objec-
tive analytically; in today's lecture we are not interested in these (in fact very important) “particular
cases”.
• In the general case we can implement the Lagrange Multipliers scheme, namely, solve (CnvD)
numerically by a first-order method for convex optimization under linear inequality constraints
(note that the only explicit inequalities in (CnvD) are μj ≥ 0); thus, we may switch from the nonlinearly
constrained primal problem to a problem with simple linear inequalities. Of course, to solve the
dual problem, we need the possibility to compute its objective and its gradient at a given point; as
in the Augmented Lagrangian Scheme for general-type problems, to this end we can solve
numerically the problem of minimizing the Lagrangian in x with μ ≥ 0 fixed. In our present
situation, this “auxiliary primal problem” is fine – with smooth convex objective – so that we have
good possibilities to approximate its global solution.
If we somehow ensure the existence and uniqueness of the solution x∗(μ) of the auxiliary primal problem

(10.4.1)  min_x L(x, μ),

as well as its continuity in μ, then we, first, get a continuously differentiable dual objective. Indeed, it can be
proved (cf. Section 10.1.1) that in the case in question

∇φ∗(μ) = g(x∗(μ)) ≡ (g1(x∗(μ)), ..., gk(x∗(μ)))^T,

which is a continuous function of μ, provided that x∗(μ) is continuous. Second, in the case under
consideration we clearly shall be able to restore the optimal solution to the primal problem via the
optimal solution to the dual.
There is one evident case when problem (10.4.1) has exactly one solution: this is the case when the
sum

r(x) = f(x) + Σ_{j=1}^k gj(x)

of the objective and the constraints is strongly convex (with positive definite Hessian at every point) and
grows faster than |x| as |x| → ∞:

r(x)/|x| → ∞ as |x| → ∞.

Indeed, in this case it is easily seen that for every μ > 0 the objective in (10.4.1) also is strongly convex
and goes to infinity as |x| → ∞; consequently, (10.4.1) has a unique solution (which, as is easily seen,
is continuous in μ > 0). Thus, in the case under consideration the application of the Lagrange
Multipliers Scheme causes no conceptual difficulties.
What to do if the just mentioned sufficient condition for “well-posedness” of (10.4.1) is not satisfied? A
good idea is to pass to an augmented Lagrange function – basically the same idea we used in the general case
of nonconvex problems. Namely, let θj be increasing strongly convex functions on the axis, normalized by
the conditions

θj(0) = 0, θj′(0) = 1.
The problem

(CnvP∗)  min_x {f(x) : θj(gj(x)) ≤ 0, j = 1, ..., k}

clearly is equivalent to (CnvP) and, as is easily seen, is also convex. One can straightforwardly verify
that, due to the normalization conditions on θj, the vector of optimal Lagrange multipliers for the new
problem is the same as for the old one (the new dual objective is, of course, different from the old one).
Under mild regularity assumptions one can prove that the modified problem satisfies the aforementioned
sufficient condition for well-posedness of problems (10.4.1); e.g., if all gj are linear, this sufficient condition
is satisfied whenever the feasible domain of (CnvP) is bounded. Thus, by passing from (CnvP) to the equivalent
problem (CnvP∗), or, which is the same, by passing from the classical Lagrange function L(x, μ) of problem
(CnvP) to the augmented Lagrange function

L(x, μ) = f(x) + Σ_{j=1}^k μj θj(gj(x))

(which is the classical Lagrange function of the equivalent problem), we normally get the possibility to solve
the constrained problem (CnvP) by the Lagrange Multipliers Scheme (which now becomes augmented).
All this is completely similar to the Augmented Lagrangian Scheme we used in the general case, in-
cluding the possibility to introduce a penalty parameter into our augmented scheme for (CnvP); the “penalized”
augmented Lagrangian for (CnvP) is

Lρ(x, μ) = f(x) + Σ_{j=1}^k μj ρ^{-1} θj(ρ gj(x));

this corresponds to replacing θj(·) by the rescaled functions θ̄j(s) = ρ^{-1}θj(ρs), which also satisfy the
normalization conditions. In the general (nonconvex) case the role of the penalty parameter is to ensure
the crucial property (!); besides this, a large penalty, as we remember, enforces the Hessian of the dual
objective at λ∗ to be almost proportional to the unit matrix and thus ensures fast local convergence of
simple gradient-type methods for solving the dual problem. In our present case, we are not interested in the
first of these two issues – due to the convexity of the original problem, (!) is ensured even for small
values of ρ; the second advantage of penalization, however, is still important: the larger ρ is, the better
conditioned is the dual objective associated with the augmented Lagrange function Lρ, and the better
are our abilities for fast maximization of this objective.
We should stress that the presented brief outline of the Augmented Lagrangian Multipliers Scheme
in Convex Programming is far from complete; e.g., we did not discuss in detail how to solve
the dual problem associated with the augmented Lagrangian (this problem is now constrained, although
with simple constraints; simple gradient projection routines are possible here, but are by no means the best
options). A more detailed consideration of these issues is beyond the scope of this course.
Let us come back to the classical Lagrange function and assume that the problem (CnvP) is separable,
i.e., the design vector is partitioned into blocks, x = (x1, ..., xl), and

f(x) = Σ_{i=1}^l fi(xi),  gj(x) = Σ_{i=1}^l gji(xi),  j = 1, ..., k.

Whenever this is the case, the Lagrange scheme becomes especially attractive due to the following evident
property: computation of the dual objective

φ∗(μ) = min_x [ f(x) + Σ_{j=1}^k μj gj(x) ]

decomposes into the l independent problems

(Pi)  min_{xi} [ fi(xi) + Σ_{j=1}^k μj gji(xi) ],

i = 1, ..., l. Note that the computational complexity of an optimization problem normally is a superadditive
function of the design dimension: it is easier, and usually much easier, to solve l independent convex
problems (Pi) of a given total design dimension n than to solve a single convex program of dimension
n. Therefore separability simplifies a lot the implementation of the classical Lagrange scheme (note that
the main computational effort in this scheme is that of solving the auxiliary primal problems). In other
words, the Lagrange scheme, in contrast to other Convex Programming techniques, is capable of exploiting
separability of the objective and the constraints: it allows us to decompose the original problem into a
collection of independent problems which “interact” with each other only via the “coordinating” dual
problem (CnvD).
In fact, a similar decomposition approach can be carried out also in the case when (CnvP) contains,
besides the separable objective and constraints, a number of constraints each depending on its own part xi of
the variables. Such a problem can be written as

(CnvPS)  min_x { Σ_{i=1}^l fi(xi) : Σ_{i=1}^l gji(xi) ≤ 0, j = 1, ..., k, xi ∈ Xi },

where the Xi are the convex sets given by those inequalities which depend on xi solely.
The corresponding version of Lagrange Duality says that the natural dual problem is

(CnvDS)  Σ_{i=1}^l φ∗i(μ) → max, μ ≥ 0,

where

φ∗i(μ) = min_{xi ∈ Xi} [ fi(xi) + Σ_{j=1}^k μj gji(xi) ];

the difference from the standard duality is that now the minimization over xi in the formula for the dual objective
is taken over the set Xi given by the constraints depending on xi solely, not over the entire space.
The meaning of the words “(CnvDS) is the natural dual to (CnvPS)” is given by the corresponding
version of the Duality Theorem:

(*) If (CnvPS) is solvable and satisfies the Slater condition, then (CnvDS) (which always is a Convex
Programming problem) is also solvable, the optimal values of the problems are equal to each other, and
a pair (x∗, μ∗) of feasible primal and dual solutions to the problems is comprised of optimal solutions
if and only if (x∗, μ∗) is a saddle point of the classical Lagrange function L(x, μ) of (CnvPS) on the set
X ≡ X1 × ... × Xl of values of x and the set R^k₊ = {μ ∈ R^k | μ ≥ 0} of values of μ, i.e., if L(x, μ∗)
attains its minimum over x ∈ X at x = x∗, while the function L(x∗, μ) attains its maximum over μ ≥ 0
at μ = μ∗.
As a corollary, we get that if μ∗ is an optimal solution to (CnvDS), then the components x∗i of (any)
optimal solution to (CnvPS) are among the minimizers of the Lagrange functions

Li(xi, μ∗) = fi(xi) + Σ_{j=1}^k μ∗j gji(xi)
over xi ∈ Xi. Under assumptions of strong convexity similar to those indicated in the previous subsection,
this information is sufficient to restore, given μ∗, the components of the optimal primal solution x∗. In the
general case we again may meet the difficulty that there are many minimizers of Li(xi, μ∗) over
xi ∈ Xi, and only some of these minimizers can be combined into an optimal solution to (CnvPS). How
to overcome this difficulty is another story which goes beyond the bounds of this lecture; what
should be stressed, however, is that the outlined decomposition scheme allows us to obtain important, and
sometimes complete, information on the optimal solution to (CnvPS).
The Decomposition approach is especially attractive when l and the total number of constraints
participating in the description of the sets Xi are large, while the number of separable
“linking constraints” gj(x) ≤ 0 is relatively small. Indeed, in this case the dual problem (CnvDS) is of small
dimension, which simplifies it computationally. At the same time, to compute the objective of (CnvDS) at a given
μ, one has to solve l independent problems of minimizing Li(xi, μ) over xi ∈ Xi. This latter task can be
carried out in parallel, and even with serial computations it is, as already mentioned,
much easier to solve many “small” independent problems than a single “large” one.
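Schematically, the resulting coordination loop might look as follows (a Python sketch; the projected-supergradient price update with a fixed stepsize is our own illustrative choice of a method for the dual problem):

import numpy as np

def dual_decomposition(subproblem_solvers, g_funcs, k, outer_iters=100, step=0.1):
    # subproblem_solvers[i](mu): argmin over x_i in X_i of
    #     f_i(x_i) + sum_j mu_j * g_ji(x_i)     [the problem (P_i)]
    # g_funcs[i](x_i): the k-vector (g_1i(x_i), ..., g_ki(x_i)).
    mu = np.zeros(k)
    for _ in range(outer_iters):
        xs = [solve(mu) for solve in subproblem_solvers]   # l independent problems
        grad = sum(g(x) for g, x in zip(g_funcs, xs))      # supergradient of the dual
        mu = np.maximum(mu + step * grad, 0.0)             # projected ascent, mu >= 0
    return xs, mu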
The outlined Decomposition scheme has a very nice economic interpretation. Imagine that a certain
company is comprised of basically independent units; the only interconnection between the units is
that all of them consume a number of common resources like money, energy, etc., which are controlled by the
company. Let us interpret xi as the action of the i-th unit, and Xi as the set of actions allowed for the unit by
its own technological constraints. Let, further, gji(xi) be the amount of resource j which is required by
unit i to implement action xi, and let −fi(xi) be the profit the unit will get by carrying out this action.
The company is interested in maximizing the total profit of the units,

−f(x) ≡ Σ_{i=1}^l (−fi(xi)),

under the restrictions Σ_{i=1}^l gji(xi) ≤ bj, j = 1, ..., k, on the total consumption of the resources. To this
end the company may set internal prices μj ≥ 0 on the resources and ask every unit to choose its action by
maximizing its own profit minus the cost of the consumed resources – that is, to minimize,
over xi ∈ Xi, its loss Li(xi, μ); the resources the units then order allow the company to compute the
objective of the dual problem (CnvDS), which for the case of nonzero right hand sides bj in the resource
inequality constraints clearly is

φ∗(μ) = Σ_{i=1}^l Li(x∗i(μ), μ) − Σ_{j=1}^k μj bj.
In fact the vectors of the resources ordered by the units allow the company to compute the gradient of
the dual objective at the current µ (or a supergradient, if the dual objective is nonsmooth). With this
information, the company can update the prices on the resources by applying a method for maximizing the
concave dual objective over the nonnegative orthant, ask the units to update their
decisions accordingly, and repeat this procedure until the dual objective is maximized, thus solving – in a
parallel and distributed fashion – the problem of maximizing the total profit under the given constraints on the
common resources. This is exactly what Decomposition does.
It is time now to explain why in the Decomposition scheme we deal with the classical Lagrange func-
tion rather than with an augmented one. The reason is that the “augmenting” gj ↦ θj(gj) of a separable constraint
gj(x) ≤ 0 destroys separability. The price we pay for using the classical Lagrangian is the typical nonsmoothness
of the dual objective, and in fact Decomposition is the main source of nonsmooth convex optimization
programs.
Lecture 11

Sequential Quadratic Programming
This last lecture of the course is devoted to the Sequential Quadratic Programming methods – the methods
which are now thought to be the most efficient for practical solution of general-type (i.e., nonconvex)
constrained optimization problems.
The idea underlying the methods is to solve directly the KKT system of the problem under con-
sideration by a Newton-type iterative process. In order to get the idea, let us start with the equality
constrained case.
Thus, consider the equality constrained problem (ECP) and the corresponding KKT system

(KKT)  ∇x L(x, λ) ≡ ∇f(x) + [∇h(x)]^T λ = 0,  h(x) = 0,

where

L(x, λ) = f(x) + λ^T h(x)

is the Lagrangian of (ECP). As we remember from Lecture 8, any locally optimal solution x∗ of (ECP)
which is a regular point of the system of constraints (i.e., such that the gradients of the constraints taken at
this point are linearly independent) can be extended by a properly chosen λ = λ∗ to a solution of (KKT).
Now, (KKT) is a system of n + m equations with n + m unknowns x, λ, and we can try to solve this
system with the Newton method. So far we know what the Newton minimization method is, but not
the Newton method for solving systems of equations; let us look at what this latter routine is.
Given a system of N nonlinear equations with N unknowns,

(E)  P(u) ≡ (p1(u), ..., pN(u))^T = 0,

with continuously differentiable real-valued functions pi, one can try to solve this system as follows. Given a
current approximate solution ū, let us linearize the system at this point, i.e., replace the actual nonlinear
system by the linear one

[P(u) ≈] P(ū) + P′(ū)(u − ū) ≡ (p1(ū) + [∇p1(ū)]^T(u − ū), ..., pN(ū) + [∇pN(ū)]^T(u − ū))^T = 0.

Assuming the N × N matrix P′(ū) nonsingular, we can write down the solution of the linearized system:

ū⁺ = ū − [P′(ū)]^{-1} P(ū).   (11.1.1)

The vector

ū⁺ − ū ≡ −[P′(ū)]^{-1} P(ū)

is the Newton displacement, and recurrence (11.1.1) defines the Newton method for solving (E).
Note that the basic Newton method for unconstrained minimization of a smooth function f (Lecture 4)
is nothing but the above recurrence applied to the Fermat equation1)

P(u) ≡ ∇f(u) = 0.
Proposition 11.1.1 [Local superlinear/quadratic convergence of the Newton method for solving systems
of equations]
Let u∗ be a solution to (E), and assume that this solution is nondegenerate, i.e., that the matrix P 0 (u∗ )
is nonsingular. Then the Newton method (11.1.1) converges to u∗ locally superlinearly: there exists a
neighbourhood U of u∗ such that the recurrence (11.1.1), being started at a point u0 ∈ U, is well defined
(i.e., the required inverse matrices exist), keeps the iterates in U and ensures that, for every κ ∈ (0, 1),

|ut − u∗| ≤ Cκ κ^t, t = 0, 1, ...,

with some Cκ < ∞.
The proof of this proposition repeats word by word the one of the similar statement from Lecture 4,
and you are welcome to prove the statement by yourself.
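For illustration, here is recurrence (11.1.1) in Python (a minimal sketch; the toy system is our own):

import numpy as np

def newton_system(P, P_prime, u0, iters=20, tol=1e-12):
    # Newton method (11.1.1) for P(u) = 0:  u <- u - [P'(u)]^{-1} P(u).
    u = np.asarray(u0, dtype=float)
    for _ in range(iters):
        du = np.linalg.solve(P_prime(u), P(u))   # solve the linearized system
        u = u - du
        if np.linalg.norm(du) < tol:             # stop once the displacement is tiny
            break
    return u

# toy system: p1 = u1^2 + u2 - 3, p2 = u1 - u2 + 1; one of its roots is (1, 2)
P = lambda u: np.array([u[0] ** 2 + u[1] - 3.0, u[0] - u[1] + 1.0])
P_prime = lambda u: np.array([[2 * u[0], 1.0], [1.0, -1.0]])
print(newton_system(P, P_prime, u0=[2.0, 0.0]))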
Now let us look how all this works when (E) is the Karush-Kuhn-Tucker system of the equality
constrained problem (ECP).
1)
to save words, we shall use the word “equation” also for systems of scalar equations, interpreting such a
system as a single vector equation. It will always be clear which equation – a vector or a scalar one – is meant.
Proposition 11.1.2 Let x∗ be a nondegenerate solution to (ECP) (Definition 4.2.1), and let λ∗ be the
corresponding vector of Lagrange multipliers. Then the matrix P 0 (x∗ , λ∗ ) is nonsingular.
Proof. Denote H = ∇²x L(x∗, λ∗) and A = ∇h(x∗), so that

P′(x∗, λ∗) = [ H  A^T ; A  0 ]

is the matrix of the linearized KKT system. Assume that P′(x∗, λ∗)[v; w] = 0 for some v ∈ R^n, w ∈ R^m,
whence

Hv + A^T w = 0;  Av = 0.
The second equation says that v is in the kernel of A = ∇h(x∗ ), i.e., is in the plane T tangent to the
feasible surface at the point x∗ . Multiplying the first equation from the left by v T and taking into account
that v T AT w = (Av)T w = 0 by the second equation, we get v T Hv = 0. Since x∗ is a nondegenerate local
solution to (ECP), H = ∇2x L(x∗ , λ∗ ) is positive definite on T ; consequently, v ∈ T and v T Hv = 0 imply
that v = 0.
Since v = 0, the first of our two equations implies that AT w = 0; but x∗ is regular for the constraints
of (ECP), i.e., the columns of the matrix AT = [∇h(x∗ )]T – these columns are exactly the gradients of the
constraints at the point x∗ – are linearly independent; since the columns of AT are linearly independent,
their combination AT w can be zero only if the vector w of coefficients in the combination is zero. Thus,
both v and w are zero.
(x̄, λ̄) being the current iterate to be updated by the Newton step.
It is convenient to pass from the unknowns dx and dλ to dx and λ⁺ = λ̄ + dλ (note that λ⁺ is the
λ-component of the Newton iterate of (x̄, λ̄)). The new unknowns are given by the system

(11.1.4)  ∇²x L(x̄, λ̄)dx + [∇h(x̄)]^T λ⁺ = −∇f(x̄),  [∇h(x̄)]dx = −h(x̄).
The resulting form of the Newton system is very instructive. To see this, let us for the time being forget
about the Newton method for solving the KKT system of problem (ECP) and look at the situation from
a little bit different viewpoint. Let x∗ be a nondegenerate solution to (ECP) and λ∗ be the corresponding
vector of Lagrange multipliers. Then
∇x L(x∗ , λ∗ ) = 0
and the Hessian ∇2x L(x∗ , λ∗ ) of the Lagrange function is positive definite on the plane T tangent to the
feasible surface of (ECP) at the point x∗ . In other words, the Lagrange function regarded as function
of x (with λ being set to λ∗ ) attains at x = x∗ its nondegenerate local minimum on the plane T . If we
knew the tangent plane T (and λ∗) in advance, we could find x∗ by applying to L(x, λ∗) any method
of unconstrained minimization – not in the entire space, of course, but in the plane T. Let us look at what
would happen if we used the Newton minimization method in this scheme. How should we act? We
should replace our current objective – the function L(·, λ∗) – by its quadratic expansion at the current
iterate x̄, i.e., by the function

L(x̄, λ∗) + (dx)^T ∇x L(x̄, λ∗) + ½(dx)^T ∇²x L(x̄, λ∗)dx   (11.1.5)

and minimize this quadratic approximation over displacements dx such that x̄ + dx ∈ T; the new
iterate would then be x̄ + dx, with dx given by the aforementioned minimization.
Now, we actually do not know T and λ∗; all we have is the current approximation (x̄, λ̄) to (x∗, λ∗).
With this approximation, we can imitate the latter construction as follows:
• use, instead of λ∗, its current approximation λ̄;
• use, instead of T , the plane T̄ given by linearization of the constraints h(x) = 0 at the current
iterate x̄:
T̄ = {y = x̄ + dx | [∇h(x̄)]dx + h(x̄) = 0}.
Thus, the above idea – to improve the current iterate x̄ by applying a single step of Newton minimization
of the Lagrange function over the plane T – can be implemented as follows: we define dx as the solution
to the quadratic minimization problem with linear equality constraints:

(∗)
L(x̄, λ̄) + (dx)^T ∇x L(x̄, λ̄) + ½(dx)^T ∇²x L(x̄, λ̄)dx → min
s.t.
[∇h(x̄)]dx = −h(x̄)
Rewriting the linear term as (dx)^T ∇x L(x̄, λ̄) = (dx)^T ∇f(x̄) + [λ̄^T(∇h(x̄)dx)], we note that
the quantity in the brackets is constant on the feasible plane of the problem and therefore can be elimi-
nated from the objective without varying the solution. Thus, problem (*) is equivalent to

(QP(x̄,λ̄))
f(x̄) + (dx)^T ∇f(x̄) + ½(dx)^T ∇²x L(x̄, λ̄)dx → min
s.t.
[∇h(x̄)]dx = −h(x̄).
Problem (QP(x̄,λ̄)) is a quite respectable optimization problem, and its solution, if it exists, is achieved at a
KKT point. Moreover, the solution does exist if (x̄, λ̄) is close enough to (x∗, λ∗). Indeed,
if x̄ = x∗, λ̄ = λ∗, then the objective of the problem is a quadratic form which is strongly convex on the
feasible plane T of the problem (recall that x∗ is a nondegenerate solution to (ECP)). By continuity
reasons, for (x̄, λ̄) close enough to (x∗, λ∗) the objective in (QP(x̄,λ̄)) is also a strongly convex quadratic
form on the feasible plane T̄ of the problem and therefore attains its minimum on this plane.
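Numerically, the step described by the Newton system (11.1.4) (equivalently, by (QP(x̄,λ̄)), as explained next) amounts to solving one saddle-point linear system; here is a minimal numpy sketch with illustrative function names:

import numpy as np

def sqp_step_eq(grad_f, hess_L, h, jac_h, x, lam):
    # One Newton step on the KKT system, i.e., solve (11.1.4):
    #   [ H  A^T ] [ dx      ]   [ -grad f(x) ]
    #   [ A   0  ] [ lam_new ] = [ -h(x)      ]
    # with H = Hessian of the Lagrangian at (x, lam), A = Jacobian of h at x.
    H, A = hess_L(x, lam), jac_h(x)
    n, m = len(x), A.shape[0]
    K = np.block([[H, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-grad_f(x), -h(x)])
    sol = np.linalg.solve(K, rhs)
    return x + sol[:n], sol[n:]      # new x-iterate and lambda^+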
Now let us make the following crucial observation (for the sake of brevity, we write (QP) instead of
(QP(x̄,λ̄) )):
Let the Newton system (11.1.4) be a system with nonsingular matrix (we know that it indeed
is the case when (x̄, λ̄) is close enough to (x∗ , λ∗ )). Then the Newton displacement dx given
by the Newton system is the unique KKT point of the problem (QP), and the vector λ+
given by the Newton system is the vector of Lagrange multipliers associated with the KKT
point dx in the problem (QP).
Indeed, let z be a KKT point of (QP) and μ be the corresponding vector of Lagrange multipliers. The
KKT equations for (QP) are (check it!)

∇f(x̄) + ∇²x L(x̄, λ̄)z + [∇h(x̄)]^T μ = 0,  [∇h(x̄)]z = −h(x̄),

which is exactly the Newton system (11.1.4) in the unknowns (z, μ).
What should be stressed is that the linear term in the objective of (QP) comes from our original
objective f 2); in contrast to this, the quadratic term in the objective of (QP) comes from the Lagrange
function, not from f. This is quite natural: it is the Lagrange function, not the actual objective, which
attains its nondegenerate local minimum on T at x∗. For f itself, x∗ is nothing but a critical
point on T, and it may well happen to be a point of maximum on T.
Our starting observation is that if x∗ is a nondegenerate solution to (GCP) and (λ∗, μ∗) is the corre-
sponding vector of Lagrange multipliers (the λ's are the multipliers for the equality constraints, and the μ's for
the inequality ones), then, as is immediately given by straightforward verification of Definition 4.2.1,
x∗ is a nondegenerate solution, with λ∗, μ∗ being the Lagrange multipliers, of the following linearly constrained
quadratic programming program:

(QP∗)
f(x∗) + (x − x∗)^T ∇f(x∗) + ½(x − x∗)^T ∇²x L(x∗; λ∗, μ∗)(x − x∗) → min
s.t.
hi(x∗) + (x − x∗)^T ∇hi(x∗) = 0, i = 1, ..., m,
gj(x∗) + (x − x∗)^T ∇gj(x∗) ≤ 0, j = 1, ..., k,

where

L(x; λ, μ) = f(x) + Σ_{i=1}^m λi hi(x) + Σ_{j=1}^k μj gj(x)

is the Lagrange function of (GCP). Replacing in this problem the unknown point (x∗; λ∗, μ∗) by its current
approximation (xt; λt, μt) and the displacement x − x∗ by dx, we arrive at the auxiliary quadratic problem
(QPt)
f(xt) + (dx)^T ∇f(xt) + ½(dx)^T [∇²x L(xt; λt, μt)]dx → min
s.t.
[∇h(xt)]dx = −h(xt),
[∇g(xt)]dx ≤ −g(xt).
2)
note, however, that we could deal with the linear term coming from the Lagrange function as well: recall that
(QP) is equivalent to (*). The only advantage of (QP) is that the vector of Lagrange multipliers in this problem
is exactly the new estimate λ⁺ of λ∗, while the vector of Lagrange multipliers in (*) is λ⁺ − λ̄
Solving this problem, we get the corresponding displacement dx and, besides this, the vectors λ⁺ and μ⁺
of the Lagrange multipliers for (QPt):

∇f(xt) + ∇²x L(xt; λt, μt)dx + Σ_{i=1}^m λ⁺i ∇hi(xt) + Σ_{j=1}^k μ⁺j ∇gj(xt) = 0,

μ⁺j [gj(xt) + (dx)^T ∇gj(xt)] = 0, j = 1, ..., k,

μ⁺j ≥ 0, j = 1, ..., k;
after dx, λ⁺, μ⁺ are identified, we take as the new iterate the triple

(xt+1, λt+1, μt+1) = (xt + dx, λ⁺, μ⁺)

and loop.

It can be proved that if x∗ is a nondegenerate solution to (GCP) and the outlined recurrence is
started close enough to (x∗, λ∗, μ∗), then the recurrence is well-defined and converges quadratically to
(x∗, λ∗, μ∗).
A popular modification is to replace the Hessian ∇²x L(xt; λt, μt) in (QPt) by a positive definite matrix Bt,
thus coming to the auxiliary problem (QP′t); the x-component of the new iterate is xt+1 = xt + dx, dx being
the solution to (QP′t), while λt+1 and μt+1 are the vectors of Lagrange multipliers of (QP′t).
The indicated modification (which, as we shall see in a while, is still convergent) possesses the following
attractive properties:
• problem (QP′t) does not depend explicitly on the current approximations λt and μt to the Lagrange
multipliers of (GCP) (any dependence on λt, μt is only via the choice of the matrix Bt);
• it ensures global convergence of a reasonable linesearch version of the SQP scheme (see Section
11.3);
• (QP′t) becomes a problem with strongly convex quadratic objective and linear constraints; the
problem therefore can be efficiently solved in finitely many steps by the active set method (Section
9.1.2).
It should also be mentioned that if, as is the case under reasonable regularity assumptions, the
Lagrange multipliers μt+1 for the inequality constraints in problem (QP′t) converge to the actual
Lagrange multipliers μ∗ for the inequality constraints in (GCP), then for large t the vector μt is
close to μt+1. This means that μt allows us to predict which inequalities in (QP′t) are active at the
solution to this problem: for large t, these active inequalities are exactly those corresponding to
the positive entries of μt. It follows that when solving (QP′t) by the active set method, it makes sense
to use this prediction as the initial guess for the active set of (QP′t). With this policy, eventually
(for large enough values of t) (QP′t) will be solved in one step of the active set method (i.e., at the
cost of solving a single system of linear equations)! And even for initial values of t, when it may
take several steps of the active set method to solve (QP′t), this prediction used as the initial guess
for the actual active set of (QP′t) typically saves a lot of computations.
• using a matrix Bt instead of the Hessian of the Lagrangian, we may apply the Quasi-Newton tactics
to approximate this Hessian (more exactly, its positive definite correction) without explicit usage
of second-order information on the data.
The idea of the linesearch version of the scheme is to treat the displacement dt suggested by the basic
method not as the actual step we should perform from ut to get the new iterate ut+1, but as a search
direction only; the new iterate ut+1 is obtained from ut by performing a certain – not necessarily unit –
stepsize γt+1 from ut in the direction dt:

ut+1 = ut + γt+1 dt.
The general scheme we arrive at – some policy of choosing sequential search directions, plus a tactics of
choosing stepsizes in these directions based on minimization of a reasonable auxiliary objective
over the search rays – is typical for many iterative methods, and the auxiliary objective in these methods
has a special name: it is called the merit function. A good choice of the merit function is as important
for the quality of the overall routine as a good policy of choosing the search directions, and in fact both these
components of a method should fit each other. The crucial requirement which should be satisfied by
these components is as follows:

(!) if the current iterate u is not a solution to the problem in question, then the search direction d associated
with u should be a descent direction for the merit function M(·):

M(u + γd) < M(u) for all small enough γ > 0.

The role of property (!) is clear: if it is violated, then we may arrive at an iterate which is not a solution
to the problem we want to solve and at the same time is such that the merit function increases along
the search ray associated with the iterate. In this case the linesearch will return zero stepsize, the
next iterate will be the same as the previous one, and we shall stay at this “bad” iterate forever.
On the other hand, if (!) is satisfied, then the linesearch ensures nonzero progress in the value of
the merit function at each step where the current iterate is not an exact solution, and the method
becomes descent with respect to the merit function. In this case we have good chances to use the Global
Convergence Theorem from Lecture 1 to prove convergence of the method.
Now let us apply the outlined general considerations to the case we are interested in, i.e., the
case when the problem of interest is (GCP) and the policy of choosing the search directions is given by the
Sequential Quadratic Programming scheme. What we need is a merit function which, together with this
policy, fits (!).
Lemma 11.3.1 Let xt be the current iterate, let Bt be the positive definite matrix used in (QP′t), let dx be the
solution to the latter problem, and let λ ≡ λt+1, μ ≡ μt+1 be the Lagrange multipliers associated with (QP′t).
Further, let

M(x) = f(x) + θ[ Σ_{i=1}^m |hi(x)| + Σ_{j=1}^k g⁺j(x) ],  g⁺j(x) = max[0, gj(x)],

be the l1 merit function, and assume that θ is large enough:

θ ≥ max{|λ1|, ..., |λm|, μ1, ..., μk}.   (11.3.4)

Then either dx = 0, and then xt is a KKT point of (GCP), or dx ≠ 0, and then dx is a descent direction
of the merit function M.
Proof. By the origin of dx, λ and μ we have (values and derivatives of f, h, g are taken at xt, B = Bt):

∇f + Bdx + Σ_{i=1}^m λi ∇hi + Σ_{j=1}^k μj ∇gj = 0,   (11.3.5)

h + [∇h]dx = 0,   (11.3.6)
g + [∇g]dx ≤ 0, (11.3.7)
and, finally,
µj ≥ 0, µj [gj + (dx)T ∇gj ] = 0, j = 1, ..., k. (11.3.8)
Looking at these relations, we see that if dx = 0, then xt clearly is a KKT point of (GCP). Now let
dx ≠ 0. We have

(dx)^T ∇f =
[see (11.3.5)]
= −(dx)^T Bdx − Σ_{i=1}^m λi (dx)^T ∇hi − Σ_{j=1}^k μj (dx)^T ∇gj ≤
[see (11.3.6), (11.3.8), and note that μj ≥ 0]
≤ −(dx)^T Bdx + Σ_{i=1}^m |λi||hi| + Σ_{j=1}^k μj g⁺j ≤
[see (11.3.4)]
≤ −(dx)^T Bdx + θ[ Σ_{i=1}^m |hi| + Σ_{j=1}^k g⁺j ].   (11.3.9)

Now let 0 ≤ γ ≤ 1. By the definition of M,

M(xt) − M(xt + γdx) = f(xt) − f(xt + γdx) + θ[ Σ_{i=1}^m (|hi(xt)| − |hi(xt + γdx)|) + Σ_{j=1}^k (g⁺j(xt) − g⁺j(xt + γdx)) ],   (11.3.10)

and

f(xt + γdx) = f + γ(dx)^T ∇f + γO0(γ),   (11.3.11)

where here and below the quantities Or(γ) tend to 0 as γ → +0.
Further,
hi (xt + γdx) = hi + γ(dx)T ∇hi + γOi (γ) =
[see (11.3.6)]
= (1 − γ)hi + γOi (γ),
whence
|hi (xt + γdx)| ≤ (1 − γ)|hi | + γ|Oi (γ)|, i = 1, ..., m. (11.3.12)
Similarly,
gj (xt + γdx) = gj + γ(dx)T ∇gj + γOm+j (γ) ≤
[see (11.3.7)]
≤ (1 − γ)gj + γOm+j (γ),
whence
gj+ (xt + γdx) ≤ (1 − γ)gj+ + γ|Om+j (γ)|, j = 1, ..., k. (11.3.13)
Substituting (11.3.11) – (11.3.13) into (11.3.10), we get

M(xt) − M(xt + γdx) ≥ −γ(dx)^T ∇f + γθ[ Σ_{i=1}^m |hi| + Σ_{j=1}^k g⁺j ] + γθO(γ) ≥
[see (11.3.9)]
≥ γ[ (dx)^T Bt dx + θO(γ) ].

Since dx ≠ 0 and Bt is positive definite, we have (dx)^T Bt dx > 0, and since O(γ) → 0 as γ → +0, we also
have

(dx)^T Bt dx + θO(γ) > 0

for all small enough positive γ; by the above computation, for these γ the quantity M(xt) − M(xt + γdx)
is positive, as required.
Algorithm 11.3.1 [Generic SQP Algorithm with l1 Merit Function for (GCP)]
Initialization. Choose θ1 > 0, an arbitrary starting point x1 and set t = 1.
Step t. Given current iterate xt , act as follows:
• Choose somehow positive definite matrix Bt ;
• Form the linearly constrained quadratic problem (QP′t) given by xt and Bt;
• Solve (QP′t), thus obtaining the solution dx and the vectors of Lagrange multipliers λ = λt+1 and
μ = μt+1. If dx = 0, terminate: xt is a KKT point of (GCP).
• Check whether

θt ≥ θ̄t ≡ max{|λ1|, ..., |λm|, μ1, ..., μk}.

If this inequality is satisfied, set θt+1 = θt; otherwise set θt+1 = max{2θt, θ̄t};
• Perform linesearch with the merit function M(·) = f(·) + θt+1[ Σ_{i=1}^m |hi(·)| + Σ_{j=1}^k g⁺j(·) ]
on the search ray {xt + γdx | γ ≥ 0}, and take as xt+1 the point given by the linesearch. Replace t with
t + 1 and loop.
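For illustration, here is how the merit function and a crude linesearch for the last step could look in Python (a sketch: the backtracking rule is our own simple stand-in for a genuine linesearch):

import numpy as np

def merit_l1(f, h, g, theta):
    # l1 merit function M(x) = f(x) + theta*(sum_i |h_i(x)| + sum_j g_j^+(x)).
    return lambda x: (f(x)
                      + theta * (np.sum(np.abs(h(x)))
                                 + np.sum(np.maximum(g(x), 0.0))))

def linesearch(M, x, dx, gamma=1.0, shrink=0.5, max_trials=30):
    # Shrink the stepsize along the ray {x + gamma*dx : gamma >= 0} until the
    # merit function decreases (Lemma 11.3.1 guarantees that it does for all
    # small enough gamma > 0).
    M0 = M(x)
    for _ in range(max_trials):
        if M(x + gamma * dx) < M0:
            return x + gamma * dx
        gamma *= shrink
    return x  # no progress found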
Theorem 11.3.1 [Global Convergence of the SQP Method with l1 Merit Function]
Let problem (GCP) be solved by Algorithm 11.3.1, and assume that
• a) there exists a bounded and closed set Ω ⊂ R^n such that for x ∈ Ω the solution set D(x) of the
system

(S(x))  [∇h(x)]dx = −h(x),  [∇g(x)]dx ≤ −g(x)

of linear equality and inequality constraints with unknowns dx is nonempty, and each vector dx ∈ D(x)
is a regular solution of system S(x) (Definition 4.1.1);
• b) the trajectory {xt} of the algorithm belongs to Ω and is infinite (i.e., the method does not
terminate with an exact KKT point of (GCP));
• c) the matrices Bt used in the method are uniformly bounded and uniformly positive definite:

cI ≤ Bt ≤ CI for all t, with some 0 < c ≤ C < ∞.

Then all accumulation points of the trajectory of the method are KKT points of (GCP).

Proof. 1) Let B denote the set of all symmetric matrices B with

cI ≤ B ≤ CI,

and for x ∈ Ω, B ∈ B let (QP(x, B)) denote problem (QP′t) with xt = x and Bt = B.
From a) and c) one can extract (this is easy, although dull) that for x ∈ Ω and B ∈ B the problem
(QP(x, B)) has a unique solution dx(x, B) with uniquely defined Lagrange multipliers λ(x, B), μ(x, B),
and the triple (dx(x, B), λ(x, B), μ(x, B)) is continuous in (x, B) ∈ Ω × B.
2) From 1) and b) it follows that the Lagrange multipliers generated in the course of running the method are uniformly bounded; from the description of the method it therefore follows that the parameter θt of the merit function is updated only finitely many times. Thus, forgetting about a certain finite initial fragment of the trajectory, we may assume that this parameter simply is constant: θt ≡ θ.
3) To prove the Theorem, we should prove that if x̄ is the limit of a certain subsequence {x_{t_p}}_{p=1}^∞ of our (bounded) trajectory, then x̄ is a KKT point of (GCP). From c) it follows that the matrices B_{t_p} belong to the (clearly compact) set B; passing to a subsequence, we therefore may assume that
B_{t_p} → B̄ ∈ B, p → ∞.
If the solution d+ = dx(x̄, B̄) to problem (QP(x̄, B̄)) is zero, then x̄ is a KKT point of (GCP) by Lemma 11.3.1, as required. Thus, all we need is to lead to a contradiction the assumption that d+ ≠ 0. Under this assumption, from the convergences x_{t_p} → x̄, B_{t_p} → B̄, p → ∞, and from 1) it follows that dp ≡ dx(x_{t_p}, B_{t_p}) → d+ ≠ 0. From the indicated convergences and the relation d+ ≠ 0, by arguments completely similar to those used in the proof of Lemma 11.3.1, it follows that the progresses δp ≡ M(x_{t_p}) − M(x_{t_p+1}) in the values of the merit function at the steps t_p, p = 1, 2, ..., are bounded away from zero:
δp ≥ δ > 0, p = 1, 2, ...
This is clearly a contradiction: our process at each step decreases the merit function, and since this function clearly is bounded below on Ω, the sum of the progresses in its values over all steps is finite, and therefore these progresses go to 0 as t → ∞.
• In the presentation of the Generic SQP algorithm with l1 merit function we did not specify how
the matrices Bt are generated; Theorem 11.3.1 also is not that specific in this respect (all it asks
for is uniform boundedness and uniform positive definiteness of the matrices Bt ). Our preceding
considerations, however, make it clear that in order to get a fast algorithm, one should make Bt
“close” to the Hessian of the Lagrange function associated with the current iterate xt and the
current (coming from the previous auxiliary quadratic programming problem) estimates of the
Lagrange multipliers. Of course, if this Hessian, being available, is not “well positive definite”,
one should replace it by its positive definite correction. And if the second order information is not
available at all, we can generate Bt via quasi-Newton approximations similar to those from Lecture
7.
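As an illustration of the quasi-Newton option just mentioned, here is a sketch of one standard way to keep Bt positive definite: Powell's damped BFGS update, commonly used in SQP codes. The notes do not prescribe this particular rule; the 0.2/0.8 constants are the conventional ones, and s, y denote the step and the change in the gradient of the Lagrange function.

import numpy as np

def damped_bfgs_update(B, s, y):
    # Powell's damping: replace y by a blend r of y and Bs chosen so that
    # s^T r >= 0.2 s^T B s; the BFGS update with r then preserves positive
    # definiteness of B even when s^T y <= 0.
    Bs = B @ s
    sBs = s @ Bs
    sy = s @ y
    if sy >= 0.2 * sBs:
        r = y
    else:
        phi = 0.8 * sBs / (sBs - sy)
        r = phi * y + (1 - phi) * Bs
    return B - np.outer(Bs, Bs) / sBs + np.outer(r, r) / (s @ r)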
• A weak point of the method in its present form is the possibility for the auxiliary problems (QP0t) to be
infeasible. This indeed may happen, even in the case of a feasible problem (GCP). More exactly, if
x∗ is a nondegenerate solution to (GCP), then the quadratic problems associated with the iterates
close enough to x∗ are feasible, but if the current iterate is too far from x∗, the quadratic problem
can be infeasible. What to do if an infeasible auxiliary quadratic problem is met?
A popular way to overcome this difficulty is to pass from the auxiliary problem (QP0t ) to the
“trust region SQP problem” which gives us both the search direction and the stepsize and always
is feasible and solvable. Namely, given current iterate xt and positive definite approximation Bt
to the Hessian of the Lagrange function, we find the new iterate xt+1 = xt + dx by solving the
optimization problem as follows:
f(xt) + (dx)T∇f(xt) + ½(dx)TBt dx + θ[Σ_{i=1}^m |hi(xt) + (dx)T∇hi(xt)| + Σ_{j=1}^k max(0, gj(xt) + (dx)T∇gj(xt))] → min
subject to −δt ≤ (dx)i ≤ δt, i = 1, ..., n.
This problem can be easily rewritten as a linearly constrained problem with convex quadratic
objective (see the sketch after this bullet). Solving this problem, we take direct care of decreasing the l1 Merit Function, or, better
to say, its analogy obtained when replacing the actual – nonlinear – constraints in the merit function
by their linearizations. The “trust region” bounds −δt ≤ (dx)i ≤ δt should be tight enough to
make the “analogy” sufficiently close to the actual merit function.
It should also be mentioned that if (GCP) is a (feasible) program with convex constraints (i.e., the
equality constraints hi are linear, and the inequality ones gj are convex), then the auxiliary quadratic
problems (QP0t) for sure are feasible. Indeed, if x̄ is an arbitrary point of Rn and x is a feasible
solution to (GCP), then, from linearity of the equality constraints,
hi(x̄) + (x − x̄)T∇hi(x̄) = hi(x) = 0, i = 1, ..., m,
and, from the Gradient inequality3) for the convex functions gj,
gj(x̄) + (x − x̄)T∇gj(x̄) ≤ gj(x) ≤ 0, j = 1, ..., k,
so that dx = x − x̄ is a feasible solution to the quadratic program (QP0) associated with x̄.
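For the record, here is a sketch of the slack-variable rewriting alluded to above, which turns the l1-penalized trust-region subproblem into a linearly constrained problem with convex quadratic objective (the slack variables u, v are ours):

minimize over (dx, u, v): (dx)T∇f(xt) + ½(dx)TBt dx + θ[Σ_{i=1}^m ui + Σ_{j=1}^k vj]
subject to
−ui ≤ hi(xt) + (dx)T∇hi(xt) ≤ ui, i = 1, ..., m,
gj(xt) + (dx)T∇gj(xt) ≤ vj, vj ≥ 0, j = 1, ..., k,
−δt ≤ (dx)i ≤ δt, i = 1, ..., n.

At an optimal solution, ui = |hi(xt) + (dx)T∇hi(xt)| and vj = max(0, gj(xt) + (dx)T∇gj(xt)), so the optimal dx is the same as in the penalized problem.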
• “Maratos effect”. The Merit Function technique ensures global convergence of the SQP scheme at
the cost of “spoiling”, to some extent, local quadratic convergence of the “pure” SQP scheme from
Sections 11.1.2 and 11.2.1. For the sake of definiteness, let us speak about the case of equality
constrained problem (ECP). Moreover, let the matrix H = ∇2x L(x∗ , λ∗ ) be the unit matrix (so
that H is positive definite on the entire space). Assume that we use the unit matrix also as
Bt at all steps of the SQP method; then, as we know from Section 11.1.2, the unit stepsizes γt
would result in asymptotical quadratic convergence of the routine, provided that it converges at
all. And it is easy to demonstrate that, vice versa, “essentially non-unit” stepsizes will prevent the
superlinear convergence. Now, what about “asymptotical compatibility” of the unit stepsizes with
3) which says that for a differentiable convex function φ(·) one has
φ(v) ≥ φ(u) + (v − u)T∇φ(u)
for all u, v
the linesearch based on the l1 Merit Function? Is it true that the linesearch will eventually result
in nearly unit stepsizes, or is it at least true that the unit stepsizes do not contradict the Merit
Function philosophy, i.e., that close to the solution they decrease the merit function? The answer
to both these questions is negative. There are examples of problems (ECP) where, arbitrarily close
to a nondegenerate solution to the problem, the unit stepsizes in the SQP scheme increase the l1
Merit Function (and in fact all natural merit functions). This phenomenon (it was discovered by
N. Maratos in ’78 and is called the Maratos effect) demonstrates that we cannot expect the SQP
scheme with merit function based linesearch to possess superlinear asymptotic rate of convergence.
There are some ways to avoid this drawback, but these details are beyond the scope of the course.
Appendix A
Prerequisites from Linear Algebra and Analysis
Regarded as mathematical entities, the objective and the constraints in a Mathematical Programming
problem are functions of several real variables; therefore before entering the Optimization Theory and
Methods, we need to recall several basic notions and facts about the spaces Rn where these functions
live, same as about the functions themselves. The reader is supposed to know most of the facts to follow,
so he/she should not be surprised by a “cooking book” style which we intend to use in this Lecture.
A.1 Space Rn: algebraic structure
A.1.1 A point in Rn
A point in Rn (called also an n-dimensional vector) is an ordered collection x = (x1 , ..., xn ) of n reals,
called the coordinates, or components, or entries of vector x; the space Rn itself is the set of all collections
of this type.
Rn is equipped with linear operations: addition of vectors,
x + y = (x1 + y1, ..., xn + yn),
and multiplication of vectors by reals,
λx = (λx1, ..., λxn).
always is a linear subspace in Rn . This example is “generic”, that is, every linear subspace in Rn is
the solution set of a (finite) system of homogeneous linear equations, see Proposition A.3.6 below.
5. Linear span of a set of vectors. Given a nonempty set X of vectors, one can form a linear subspace
Lin(X), called the linear span of X; this subspace consists of all vectors x which can be represented
as linear combinations Σ_{i=1}^N λixi of vectors from X (here N is an arbitrary positive integer, the λi are reals and the xi belong to X). Note that
Lin(X) is the smallest linear subspace which contains X: if L is a linear subspace such
that L ⊃ X, then L ⊃ Lin(X) (why?).
Every linear subspace in Rn is the linear span of an appropriately chosen finite set of
vectors from Rn .
A.1.3.B. Sums and intersections of linear subspaces. Let {Lα }α∈I be a family (finite or
infinite) of linear subspaces of Rn . From this family, one can build two sets:
1. The sum Σ_α Lα of the subspaces Lα, which consists of all vectors which can be represented as finite sums of vectors taken each from its own subspace of the family;
2. The intersection ∩_α Lα of the subspaces from the family.
1)
Pay attention to the notation: we use the same symbol 0 to denote the real zero and the n-dimensional
vector with all coordinates equal to zero; these two zeros are not the same, and one should understand from the
context (it always is very easy) which zero is meant.
Besides the bases of the entire Rn , one can speak about the bases of linear subspaces:
A collection {f 1 , ..., f m } of vectors is called a basis of a linear subspace L, if
1. The collection is linearly independent,
2. L = Lin{f 1 , ..., f m }, i.e., all vectors f i belong to L, and every vector from L is a linear combination
of the vectors f 1 , ..., f m .
In order to avoid trivial remarks, it makes sense to agree once for ever that
An empty set of vectors is linearly independent, and an empty linear combination of vectors Σ_{i∈∅} λixi equals zero.
With this convention, the trivial linear subspace L = {0} also has a basis, specifically, an empty set of
vectors.
Theorem A.1.2 (i) Let L be a linear subspace of Rn . Then L admits a (finite) basis, and all bases of
L are comprised of the same number of vectors; this number is called the dimension of L and is denoted
by dim (L).
We have seen that Rn admits a basis comprised of n elements (the standard basic orths).
From (i) it follows that every basis of Rn contains exactly n vectors, and the dimension of
Rn is n.
(ii) The larger is a linear subspace of Rn , the larger is its dimension: if L ⊂ L0 are linear subspaces
of Rn , then dim (L) ≤ dim (L0 ), and the equality takes place if and only if L = L0 .
We have seen that the dimension of Rn is n; according to the above convention, the trivial
linear subspace {0} of Rn admits an empty basis, so that its dimension is 0. Since {0} ⊂
L ⊂ Rn for every linear subspace L of Rn , it follows from (ii) that the dimension of a linear
subspace in Rn is an integer between 0 and n.
Given a basis f 1, ..., f m of L, the mapping
x ↦ (λ1(x), ..., λm(x)),
where λi(x) are the coefficients in the (unique) representation x = Σ_{i=1}^m λi(x)f i of a vector x ∈ L, is a one-to-one mapping of L onto Rm which is linear, i.e., for every i = 1, ..., m one has
λi(x + y) = λi(x) + λi(y), λi(νx) = νλi(x).
The reals λi(x), i = 1, ..., m, are called the coordinates of x ∈ L in the basis f 1, ..., f m.
E.g., the coordinates of a vector x ∈ Rn in the standard basis e1, ..., en of Rn – the one
comprised of the standard basic orths – are exactly the entries of x.
(v) [Dimension formula] Let L1, L2 be linear subspaces of Rn. Then
dim(L1 + L2) + dim(L1 ∩ L2) = dim(L1) + dim(L2).
A linear mapping A(x) : Rn → Rm is represented by the m × n matrix A whose columns are the images of the standard basic orths:
Aj = A(ej).
This representation respects linear operations over the matrices representing the mappings: adding mappings / multiplying mappings by reals, we add, respectively multiply by reals, the corresponding matrices.
Given two linear mappings A(x) : Rn → Rm and B(y) : Rm → Rk , we can build their superposition
C(x) ≡ B(A(x)) : Rn → Rk ,
which is again a linear mapping, now from Rn to Rk . In the language of matrices representing the
mappings, the superposition corresponds to matrix multiplication: the k × n matrix C representing the
mapping C is the product of the matrices representing B and A:
C = BA.
Important convention. When speaking about adding n-dimensional vectors and multiplying them
by reals, it is absolutely unimportant whether we treat the vectors as the column ones, or the row ones,
or write down the entries in rectangular tables, or something else. However, when matrix operations
(matrix-vector multiplication, transposition, etc.) become involved, it is important whether we treat
our vectors as columns, as rows, or as something else. For the sake of definiteness, from now on we
treat all vectors as column ones, independently of how we refer to them in the text. For example, when
saying for the first time what a vector is, we wrote x = (x1, ..., xn), which might suggest that we were
speaking about row vectors. We stress that it is not the case, and the only reason for using the notation
x = (x1, ..., xn) instead of the “correct” one x = [x1; ...; xn] (a column) is to save space and to avoid ugly formulas like
f([x1; ...; xn]) when speaking about functions with vector arguments. After we have agreed that there is no
such thing as a row vector in this Lecture course, we can use (and do use) without any harm whatever
notation we want.
Exercise A.1 1. Mark in the list below those subsets of Rn which are linear subspaces, find out their
dimensions and point out their bases:
(a) Rn
(b) {0}
(c) ∅
(d) {x ∈ Rn : Σ_{i=1}^n i xi = 0}
(e) {x ∈ Rn : Σ_{i=1}^n i xi² = 0}
(f) {x ∈ Rn : Σ_{i=1}^n i xi = 1}
(g) {x ∈ Rn : Σ_{i=1}^n i xi² = 1}
2. Consider the space Rm×n of m × n matrices with real entries.
(a) Find the dimension of Rm×n and point out a basis in this space
(b) In the space Rn×n of square n × n matrices, there are two interesting subsets: the set Sn
of symmetric matrices {A = [Aij] : Aij = Aji} and the set Jn of skew-symmetric matrices
{A = [Aij] : Aij = −Aji}.
i. Verify that both Sn and Jn are linear subspaces of Rn×n
ii. Find the dimension and point out a basis in Sn
iii. Find the dimension and point out a basis in Jn
iv. What is the sum of Sn and Jn ? What is the intersection of Sn and Jn ?
A.2 Space Rn: Euclidean structure
The standard inner product of vectors x, y ∈ Rn is the quantity ⟨x, y⟩ = xTy = Σ_{i=1}^n xiyi. It possesses the following basic properties:
1. [bi-linearity]: The real-valued function ⟨x, y⟩ of two vector arguments x, y ∈ Rn is linear with
respect to every one of the arguments, the other argument being fixed:
⟨λu + λ′u′, y⟩ = λ⟨u, y⟩ + λ′⟨u′, y⟩, ⟨x, λu + λ′u′⟩ = λ⟨x, u⟩ + λ′⟨x, u′⟩;
2. [symmetry]: ⟨x, y⟩ = ⟨y, x⟩ for all x, y;
3. [positive definiteness]: The quantity ⟨x, x⟩ always is nonnegative, and it is zero if and only if x is
zero.
Remark A.2.1 The outlined 3 properties – bi-linearity, symmetry and positive definiteness – form the
definition of a Euclidean inner product, and there are infinitely many ways, different from each other, to
satisfy these properties; in other words, there are infinitely many different Euclidean inner products on
Rn. The standard inner product ⟨x, y⟩ = xTy is just a particular case of this general notion. Although
in the sequel we normally work with the standard inner product, the reader should remember that the
facts we are about to recall are valid for all Euclidean inner products, and not only for the standard one.
The notion of an inner product underlies a number of purely algebraic constructions, in particular, those
of inner product representation of linear forms and of orthogonal complement.
Theorem A.2.2 (i) Twice taken, orthogonal complement recovers the original subspace: whenever L is
a linear subspace of Rn , one has
(L⊥ )⊥ = L;
(ii) The larger is a linear subspace L, the smaller is its orthogonal complement: if L1 ⊂ L2 are linear
subspaces of Rn, then L1⊥ ⊃ L2⊥;
(iii) The intersection of a subspace and its orthogonal complement is trivial, and the sum of these
subspaces is the entire Rn :
L ∩ L⊥ = {0}, L + L⊥ = Rn .
Remark A.2.2 From Theorem A.2.2.(iii) and the Dimension formula (Theorem A.1.2.(v)) it follows,
first, that for every subspace L in Rn one has
dim (L) + dim (L⊥ ) = n.
Second, every vector x ∈ Rn admits a unique decomposition
x = xL + xL⊥
into a sum of two vectors: the first of them, xL , belongs to L, and the second, xL⊥ , belongs to L⊥ .
This decomposition is called the orthogonal decomposition of x taken with respect to L, L⊥ ; xL is called
the orthogonal projection of x onto L, and xL⊥ – the orthogonal projection of x onto the orthogonal
complement of L. Both projections depend on x linearly, for example,
(x + y)L = xL + yL , (λx)L = λxL .
The mapping x ↦ xL is called the orthogonal projector onto L.
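A small numpy sketch of the orthogonal decomposition just described: an orthonormal basis of L (here taken as the column span of a full-column-rank matrix F, an assumption of ours) is obtained via QR factorization, and the projector onto L is then Q QT.

import numpy as np

def orthogonal_decomposition(x, F):
    # columns of Q: an orthonormal basis of L = column span of F
    Q, _ = np.linalg.qr(F)
    x_L = Q @ (Q.T @ x)            # orthogonal projection of x onto L
    return x_L, x - x_L            # the second part lies in the orthogonal complement

x = np.array([1.0, 2.0, 3.0])
F = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # L = {x : x3 = 0}
x_L, x_perp = orthogonal_decomposition(x, F)
print(x_L, x_perp, x_L @ x_perp)   # the two parts are orthogonal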
A collection f 1, ..., f m of vectors is called orthonormal, if
i ≠ j ⇒ ⟨f i, f j⟩ = 0; ⟨f i, f i⟩ = 1, i = 1, ..., m.
Theorem A.2.3 (i) An orthonormal collection f 1 , ..., f m always is linearly independent and is therefore
a basis of its linear span L = Lin(f 1 , ..., f m ) (such a basis in a linear subspace is called orthonormal). The
coordinates of a vector x ∈ L w.r.t. an orthonormal basis f 1 , ..., f m of L are given by explicit formulas:
x = Σ_{i=1}^m λi(x)f i ⇔ λi(x) = ⟨x, f i⟩.
Example of an orthonormal basis in Rn: The standard basis {e1, ..., en} is orthonormal with
respect to the standard inner product ⟨x, y⟩ = xTy on Rn (but is not orthonormal w.r.t.
other Euclidean inner products on Rn).
Proof of (i): Taking the inner product of both sides in the equality
x = Σ_j λj(x)f j
with f i, we get
⟨x, f i⟩ = ⟨Σ_j λj(x)f j, f i⟩
= Σ_j λj(x)⟨f j, f i⟩   [bilinearity of inner product]
= λi(x)   [orthonormality of {f i}]
(ii) The inner product of two vectors x, y ∈ L can be computed via their coordinates in an orthonormal basis f 1, ..., f m of L:
⟨x, y⟩ = Σ_{i=1}^m λi(x)λi(y).
Proof:
x = Σ_i λi(x)f i, y = Σ_i λi(y)f i
⇒ ⟨x, y⟩ = ⟨Σ_i λi(x)f i, Σ_i λi(y)f i⟩
= Σ_{i,j} λi(x)λj(y)⟨f i, f j⟩   [bilinearity of inner product]
= Σ_i λi(x)λi(y)   [orthonormality of {f i}]
(iii) Every linear subspace L of Rn admits an orthonormal basis; moreover, every orthonormal system
f 1, ..., f m of vectors from L can be extended to an orthonormal basis in L.
Important corollary: All Euclidean spaces of the same dimension are “the same”. Specifically,
if L is an m-dimensional space in a space Rn equipped with a Euclidean inner product
⟨·, ·⟩, then there exists a one-to-one mapping x ↦ A(x) of L onto Rm such that
• The mapping preserves linear operations:
A(x + y) = A(x) + A(y), A(λx) = λA(x);
• The mapping converts the ⟨·, ·⟩ inner product on L into the standard inner product on
Rm:
⟨x, y⟩ = (A(x))TA(y) ∀x, y ∈ L.
Indeed, by (iii) L admits an orthonormal basis f 1 , ..., f m ; using (ii), one can immediately
check that the mapping
x ↦ A(x) = (λ1(x), ..., λm(x)),
which maps x ∈ L into the m-dimensional vector comprised of the coordinates of x in the
basis f 1 , ..., f m , meets all the requirements.
Proof of (iii) is given by the Gram-Schmidt orthogonalization process, important in its own right,
as follows. We start with an arbitrary basis h1, ..., hm in L and step by step convert it into
an orthonormal basis f 1, ..., f m. At the beginning of step t of the construction, we already
have an orthonormal collection f 1, ..., f t−1 such that Lin{f 1, ..., f t−1} = Lin{h1, ..., ht−1}.
At step t we
1. Build the vector
g t = ht − Σ_{j=1}^{t−1} ⟨ht, f j⟩f j
(it is easily seen that g t ≠ 0 – why?);
2. Set f t = g t/√⟨g t, g t⟩, thus obtaining an orthonormal collection f 1, ..., f t with Lin{f 1, ..., f t} = Lin{h1, ..., ht}.
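The process translates directly into a few lines of numpy; this sketch assumes, as in the construction above, that the columns of H are linearly independent.

import numpy as np

def gram_schmidt(H):
    # orthonormalize the columns h^1, ..., h^m of H, exactly as in steps 1-2:
    # subtract from h^t its projection onto Lin{f^1, ..., f^{t-1}}, then normalize
    F = []
    for t in range(H.shape[1]):
        g = H[:, t] - sum((H[:, t] @ f) * f for f in F)   # step 1: g^t
        F.append(g / np.linalg.norm(g))                   # step 2: f^t
    return np.column_stack(F)

Q = gram_schmidt(np.array([[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]))
print(np.round(Q.T @ Q, 12))   # the identity matrix: the columns are orthonormal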
Exercise A.2 1. What is the orthogonal complement (w.r.t. the standard inner product) of the subspace {x ∈ Rn : Σ_{i=1}^n xi = 0} in Rn?
2. Find an orthonormal basis (w.r.t. the standard inner product) in the linear subspace {x ∈ Rn :
x1 = 0} of Rn
3. Let L be a linear subspace of Rn , and f 1 , ..., f m be an orthonormal basis in L. Prove that for every
x ∈ Rn , the orthogonal projection xL of x onto L is given by the formula
xL = Σ_{i=1}^m (xTf i)f i.
4. Let L1, L2 be linear subspaces in Rn. Verify the formulas
(L1 + L2)⊥ = L1⊥ ∩ L2⊥; (L1 ∩ L2)⊥ = L1⊥ + L2⊥.
5. Consider the space of m × n matrices Rm×n , and let us equip it with the “standard inner product”
(called the Frobenius inner product)
⟨A, B⟩ = Σ_{i,j} AijBij
(as if we were treating m×n matrices as mn-dimensional vectors, writing the entries of the matrices
column by column, and then taking the standard inner product of the resulting long vectors).
(a) Verify that in terms of matrix multiplication the Frobenius inner product can be written as
⟨A, B⟩ = Tr(ABT),
where Tr(C) is the trace (the sum of diagonal elements) of a square matrix C.
(b) Build an orthonormal basis in the linear subspace Sn of symmetric n × n matrices
(c) What is the orthogonal complement of the subspace Sn of symmetric n × n matrices in the
space Rn×n of square n × n matrices?
(d) Find the orthogonal decomposition, w.r.t. S2, of the matrix [1 2; 3 4].
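A quick numerical illustration of items (a) and (d): the trace formula reproduces the entrywise Frobenius inner product, and the symmetric/skew-symmetric split of a matrix gives its orthogonal decomposition w.r.t. Sn (assuming, as item (c) suggests, that the orthogonal complement of Sn is Jn).

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.arange(4.0).reshape(2, 2)
print((A * B).sum(), np.trace(A @ B.T))        # the two expressions coincide

A_sym = (A + A.T) / 2                          # projection of A onto S_2
A_skew = (A - A.T) / 2                         # the component in the complement
print(A_sym, A_skew, (A_sym * A_skew).sum())   # the two parts are orthogonal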
A.3 Affine subspaces in Rn
An affine subspace M in Rn is a set which can be represented as a translation of a linear subspace:
M = a + L = {a + x : x ∈ L}, (A.3.1)
where a ∈ Rn and L is a linear subspace in Rn. E.g., shifting the linear subspace L comprised of vectors with zero first entry by a vector a = (a1, ..., an),
we get the set M = a + L of all vectors x with x1 = a1; according to our terminology, this is an affine
subspace.
An immediate question about the notion of an affine subspace is: what are the “degrees of freedom” in
decomposition (A.3.1) – how strictly does M determine a and L? The answer is as follows:
Proposition A.3.1 The linear subspace L in decomposition (A.3.1) is uniquely defined by M and is the
set of all differences of the vectors from M :
L = M − M = {x − y : x, y ∈ M }. (A.3.2)
The shifting vector a is not uniquely defined by M and can be chosen as an arbitrary vector from M .
Corollary A.3.1 Let {Mα} be an arbitrary family of affine subspaces in Rn, and assume that the set
M = ∩αMα is nonempty. Then M is an affine subspace.
From Corollary A.3.1 it immediately follows that for every nonempty subset Y of Rn there exists the
smallest affine subspace containing Y – the intersection of all affine subspaces containing Y . This smallest
affine subspace containing Y is called the affine hull of Y (notation: Aff(Y )).
All this resembles a lot the story about linear spans. Can we further extend this analogy and get
a description of the affine hull Aff(Y) in terms of elements of Y, similar to the one of the linear span
(“the linear span of X is the set of all linear combinations of vectors from X”)? Sure we can!
Let us choose somehow a point y0 ∈ Y , and consider the set
X = Y − y0 .
All affine subspaces containing Y should contain also y0 and therefore, by Proposition A.3.1, can be
represented as M = y0 + L, L being a linear subspace. It is absolutely evident that an affine subspace
M = y0 + L contains Y if and only if the subspace L contains X, and that the larger is L, the larger is
M:
L ⊂ L0 ⇒ M = y0 + L ⊂ M 0 = y0 + L0 .
Thus, to find the smallest among the affine subspaces containing Y, it suffices to find the smallest among the
linear subspaces containing X and to translate the latter space by y0:
Aff(Y) = y0 + Lin(Y − y0).
Now, we know what Lin(Y − y0) is – this is the set of all linear combinations of vectors from Y − y0, so
that a generic element of Lin(Y − y0) is
x = Σ_{i=1}^k µi(yi − y0) [k may depend on x]
with yi ∈ Y and real coefficients µi. It follows that the generic element of Aff(Y) is
y = y0 + Σ_{i=1}^k µi(yi − y0) = Σ_{i=0}^k λiyi,
where
λ0 = 1 − Σ_i µi, λi = µi, i ≥ 1.
We see that a generic element of Aff(Y ) is a linear combination of vectors from Y . Note, however, that
the coefficients λi in this combination are not completely arbitrary: their sum is equal to 1. Linear
combinations of this type – with the unit sum of coefficients – have a special name – they are called
affine combinations.
We have seen that every vector from Aff(Y) is an affine combination of vectors of Y. Is the converse
true, i.e., does Aff(Y) contain all affine combinations of vectors from Y? The answer is
positive. Indeed, if
y = Σ_{i=1}^k λiyi
is an affine combination of vectors from Y, then, using the equality Σ_i λi = 1, we can write it also as
y = y0 + Σ_{i=1}^k λi(yi − y0),
y0 being the “marked” vector we used in our previous reasoning, and a vector of this form, as we
already know, belongs to Aff(Y). Thus, we come to the following
Proposition A.3.2 Aff(Y) is exactly the set of all affine combinations of vectors from Y.
When Y itself is an affine subspace, it, of course, coincides with its affine hull, and the above Proposition
leads to the following
Corollary A.3.2 An affine subspace M is closed with respect to taking affine combinations of its mem-
bers – every combination of this type is a vector from M . Vice versa, a nonempty set which is closed with
respect to taking affine combinations of its members is an affine subspace.
Proposition A.3.3 Let M = a + L be an affine subspace and Y be a subset of M, and let y0 ∈ Y. The
set Y affinely spans M – M = Aff(Y) – if and only if the set
X = Y − y0
spans the linear subspace L: Lin(X) = L.
Affinely independent sets. A linearly independent set x1 , ..., xk is a set such that no nontrivial
linear combination of x1 , ..., xk equals to zero. An equivalent definition is given by Theorem A.1.2.(iv):
x1 , ..., xk are linearly independent, if the coefficients in a linear combination
x = Σ_{i=1}^k λixi
are uniquely defined by the value x of the combination. This equivalent form reflects the essence of
the matter – what we indeed need, is the uniqueness of the coefficients in expansions. Accordingly, this
equivalent form is the prototype for the notion of an affinely independent set: we want to introduce this
notion in such a way that the coefficients λi in an affine combination
y = Σ_{i=0}^k λiyi
of an “affinely independent” set of vectors y0, ..., yk would be uniquely defined by y. Non-uniqueness would
mean that
Σ_{i=0}^k λiyi = Σ_{i=0}^k λ′iyi
for two different collections of coefficients λi and λ′i with unit sums of coefficients; if it is the case, then
Σ_{i=0}^k (λi − λ′i)yi = 0,
i.e., a nontrivial linear combination of y0, ..., yk with zero sum of coefficients µi = λi − λ′i would be zero. Thus, we have motivated the following
Definition A.3.2 [Affinely independent set] A collection y0 , ..., yk of n-dimensional vectors is called
affinely independent, if no nontrivial linear combination of the vectors with zero sum of coefficients is
zero:
Σ_{i=0}^k λiyi = 0, Σ_{i=0}^k λi = 0 ⇒ λ0 = λ1 = ... = λk = 0.
With this definition, we get the result completely similar to the one of Theorem A.1.2.(iv):
Corollary A.3.3 Let y0 , ..., yk be affinely independent. Then the coefficients λi in an affine combination
y = Σ_{i=0}^k λiyi [Σ_i λi = 1]
of the vectors y0 , ..., yk are uniquely defined by the value y of the combination.
Verification of affine independence of a collection can be immediately reduced to verification of linear
independence of closely related collection:
Proposition A.3.4 k + 1 vectors y0 , ..., yk are affinely independent if and only if the k vectors (y1 −
y0 ), (y2 − y0 ), ..., (yk − y0 ) are linearly independent.
From the latter Proposition it follows, e.g., that the collection 0, e1 , ..., en comprised of the origin and
the standard basic orths is affinely independent. Note that this collection is linearly dependent (as
every collection containing zero). You should definitely know the difference between the two notions of
independence we deal with: linear independence means that no nontrivial linear combination of the vectors
can be zero, while affine independence means that no nontrivial linear combination from certain restricted
class of them (with zero sum of coefficients) can be zero. Therefore, there are more affinely independent
sets than the linearly independent ones: a linearly independent set is for sure affinely independent, but
not vice versa.
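Proposition A.3.4 makes checking affine independence a one-line rank computation; a small numpy sketch (the function name is ours):

import numpy as np

def affinely_independent(Y):
    # columns of Y: y^0, ..., y^k; test linear independence of the differences
    # y^1 - y^0, ..., y^k - y^0 (Proposition A.3.4) via the rank
    D = Y[:, 1:] - Y[:, [0]]
    return np.linalg.matrix_rank(D) == D.shape[1]

n = 3
Y = np.column_stack([np.zeros(n), *np.eye(n)])   # 0, e1, ..., en
print(affinely_independent(Y))                   # True, although the set is linearly dependent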
Affine bases and affine dimension. Propositions A.3.2 and A.3.3 reduce the notions of affine
spanning/affine independent sets to the notions of spanning/linearly independent ones. Combined with
Theorem A.1.2, they result in the following analogies of the latter two statements:
Proposition A.3.5 [Affine dimension] Let M = a + L be an affine subspace in Rn . Then the following
two quantities are finite integers which are equal to each other:
(i) minimal # of elements in the subsets of M which affinely span M ;
(ii) maximal # of elements in affine independent subsets of M .
The common value of these two integers is larger by 1 than the dimension dim L of L.
By definition, the affine dimension of an affine subspace M = a + L is the dimension dim L of L. Thus, if
M is of affine dimension k, then the minimal cardinality of sets affinely spanning M , same as the maximal
cardinality of affine independent subsets of M , is k + 1.
We already know that the standard basic orths e1 , ..., en form a basis of the entire space Rn . And what
about affine bases in Rn ? According to Theorem A.3.1.A, you can choose as such a basis a collection
e0 , e0 + e1 , ..., e0 + en , e0 being an arbitrary vector.
Barycentric coordinates. Let M be an affine subspace, and let y0 , ..., yk be an affine basis of M .
Since the basis, by definition, affinely spans M , every vector y from M is an affine combination of the
vectors of the basis:
y = Σ_{i=0}^k λiyi [Σ_{i=0}^k λi = 1],
and since the vectors of the affine basis are affinely independent, the coefficients of this combination are
uniquely defined by y (Corollary A.3.3). These coefficients are called the barycentric coordinates of y with
respect to the affine basis in question. In contrast to the usual coordinates with respect to a (linear)
basis, the barycentric coordinates cannot be quite arbitrary: their sum should be equal to 1.
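Computationally, the barycentric coordinates of y are found from one linear system: the equations Σ_i λiyi = y augmented with the normalization Σ_i λi = 1. A sketch (uniqueness holds exactly when y0, ..., yk is an affine basis):

import numpy as np

def barycentric_coordinates(Y, y):
    # columns of Y: the affine basis y^0, ..., y^k; stack the normalization
    # row (1, ..., 1) under Y and solve for lambda
    A = np.vstack([Y, np.ones(Y.shape[1])])
    b = np.append(y, 1.0)
    lam, *_ = np.linalg.lstsq(A, b, rcond=None)
    return lam

Y = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])            # an affine basis of R^2
print(barycentric_coordinates(Y, np.array([0.25, 0.5])))    # coordinates sum to 1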
Comment. The “outer” description of a linear subspace/affine subspace – the “artist’s” one – is in
many cases much more useful than the “inner” description via linear/affine combinations (the “worker’s”
one). E.g., with the outer description it is very easy to check whether a given vector belongs or does not
belong to a given linear subspace/affine subspace, which is not that easy with the inner one5) . In fact
both descriptions are “complementary” to each other and perfectly well work in parallel: what is difficult
to see with one of them, is clear with another. The idea of using “inner” and “outer” descriptions of
the entities we meet with – linear subspaces, affine subspaces, convex sets, optimization problems – the
general idea of duality – is, I would say, the main driving force of Convex Analysis and Optimization,
and in the sequel we will all the time meet with different implementations of this fundamental idea.
• Subspaces of dimension 1 (lines). These subspaces are translations of one-dimensional linear sub-
spaces of Rn . A one-dimensional linear subspace has a single-element basis given by a nonzero
vector d and is comprised of all multiples of this vector. Consequently, a line is a set of the form
{y = a + td : t ∈ R}
given by a pair of vectors a (the origin of the line) and d (the direction of the line), d ≠ 0. The
origin of the line and its direction are not uniquely defined by the line; you can choose as origin
any point on the line and multiply a particular direction by nonzero reals.
In the barycentric coordinates a line is described as follows:
l = {λ0y0 + λ1y1 : λ0 + λ1 = 1},
where y0, y1 is an affine basis of l; you can choose as such a basis any pair of distinct points on the
line.
The “outer” description of a line is as follows: it is the set of solutions to a linear system with n
variables and n − 1 linearly independent equations.
• Subspaces of dimension > 1 and < n − 1 have no special names; sometimes they are called affine
planes of such and such dimension.
• Affine subspaces of dimension n − 1, due to the important role they play in Convex Analysis, have
a special name – they are called hyperplanes. The outer description of a hyperplane is that a
hyperplane is the solution set of a single linear equation
aTx = b
with nontrivial left hand side (a ≠ 0). In other words, a hyperplane is the level set a(x) = const of
a nonconstant linear form a(x) = aTx.
• The “largest possible” affine subspace – the one of dimension n – is unique and is the entire Rn .
This subspace is given by an empty system of linear equations.
5)
in principle it is not difficult to certify that a given point belongs to, say, a linear subspace given as the linear
span of some set – it suffices to point out a representation of the point as a linear combination of vectors from
the set. But how could you certify that the point does not belong to the subspace?
For every x ∈ Rn, the quantity
|x| = √⟨x, x⟩ = √(x1² + x2² + ... + xn²)
is well-defined; this quantity is called the (standard) Euclidean norm of the vector x (or simply the norm
of x) and is treated as the distance from the origin to x. The distance between two arbitrary points
x, y ∈ Rn is, by definition, the norm |x − y| of the difference x − y. The notions we have defined satisfy
all basic requirements on the general notions of a norm and distance, specifically:
1. Positivity of norm: The norm of a vector always is nonnegative; it is zero if and only if the vector
is zero:
|x| ≥ 0 ∀x; |x| = 0 ⇔ x = 0.
2. Homogeneity of norm: When a vector is multiplied by a real, its norm is multiplied by the absolute
value of the real:
|λx| = |λ| · |x| ∀(x ∈ Rn , λ ∈ R).
3. Triangle inequality: The norm of the sum of two vectors is ≤ the sum of their norms:
|x + y| ≤ |x| + |y| ∀x, y ∈ Rn.
In contrast to the properties of positivity and homogeneity, which are absolutely evident,
the Triangle inequality is not trivial and definitely requires a proof. The proof goes
through a fact which is extremely important by its own right – the Cauchy Inequality,
which perhaps is the most frequently used inequality in Mathematics:
Theorem A.4.1 [Cauchy’s Inequality] The absolute value of the inner product of two
vectors does not exceed the product of their norms:
|xTy| ≤ |x||y|,
and is equal to the product of the norms if and only if one of the vectors is proportional
to the other one:
x = λy or y = λx for some real λ.
Proof is immediate: we may assume that both x and y are nonzero (otherwise the
Cauchy inequality clearly is equality, and one of the vectors is constant times (specifically, zero times) the other one, as announced in the Theorem). Assuming x, y ≠ 0, consider
the function
the function
f (λ) = (x − λy)T (x − λy) = xT x − 2λxT y + λ2 y T y.
By positive definiteness of the inner product, this function – which is a second order
polynomial – is nonnegative on the entire axis, whence the discriminant of the polyno-
mial
(xT y)2 − (xT x)(y T y)
is nonpositive:
(xT y)2 ≤ (xT x)(y T y).
Taking square roots of both sides, we arrive at the Cauchy Inequality. We also see that
the inequality is equality if and only if the discriminant of the second order polynomial
f (λ) is zero, i.e., if and only if the polynomial has a (multiple) real root; but due to
positive definiteness of inner product, f (·) has a root λ if and only if x = λy, which
proves the second part of Theorem.
The properties of norm (i.e., of the distance to the origin) we have established induce properties of the
distances between pairs of arbitrary points in Rn , specifically:
1. Positivity of distances: The distance |x−y| between two points is positive, except for the case when
the points coincide (x = y), when the distance between x and y is zero;
2. Symmetry of distances: The distance from x to y is the same as the distance from y to x:
|x − y| = |y − x|;
3. Triangle inequality for distances: For every three points x, y, z, the distance from x to z does not
exceed the sum of distances between x and y and between y and z:
|z − x| ≤ |y − x| + |z − y| ∀(x, y, z ∈ Rn )
A.4.2 Convergence
Equipped with distances, we can define the fundamental notion of convergence of a sequence of vectors.
Specifically, we say that a sequence x1, x2, ... of vectors from Rn converges to a vector x̄, or, equivalently,
that x̄ is the limit of the sequence {xi} (notation: x̄ = lim_{i→∞} xi), if the distances from x̄ to xi go to 0 as
i → ∞:
x̄ = lim_{i→∞} xi ⇔ |x̄ − xi| → 0, i → ∞,
or, which is the same, for every ε > 0 there exists i = i(ε) such that the distance between every point xi,
i ≥ i(ε), and x̄ does not exceed ε:
i ≥ i(ε) ⇒ |x̄ − xi| ≤ ε.
• A set X ⊂ Rn is called closed, if it contains the limits of all converging sequences of its elements:
xi ∈ X, x̄ = lim_{i→∞} xi ⇒ x̄ ∈ X;
• A set X ⊂ Rn is called open, if whenever x belongs to X, all points close enough to x also belong
to X:
∀(x ∈ X) ∃(δ > 0): |x′ − x| < δ ⇒ x′ ∈ X.
An open set containing a point x is called a neighbourhood of x.
Examples of closed sets: (1) Rn; (2) ∅; (3) the sequence xi = (i, 0, ..., 0), i = 1, 2, 3, ...; (4)
{x ∈ Rn : Σ_{j=1}^n aijxj = 0, i = 1, ..., m} (in other words: a linear subspace in Rn always is
closed, see Proposition A.3.6); (5) {x ∈ Rn : Σ_{j=1}^n aijxj = bi, i = 1, ..., m} (in other words: an
affine subspace of Rn always is closed, see Proposition A.3.7); (6) any finite subset of Rn.
Examples of non-closed sets: (1) Rn\{0}; (2) the sequence xi = (1/i, 0, ..., 0), i = 1, 2, 3, ...;
(3) {x ∈ Rn : xj > 0, j = 1, ..., n}; (4) {x ∈ Rn : Σ_{j=1}^n xj > 5}.
Examples of open sets: (1) Rn; (2) ∅; (3) {x ∈ Rn : Σ_{j=1}^n aijxj > bi, i = 1, ..., m}; (4)
the complement of a finite set.
Examples of non-open sets: (1) a nonempty finite set; (2) the sequence xi = (1/i, 0, ..., 0),
i = 1, 2, 3, ..., and the sequence xi = (i, 0, ..., 0), i = 1, 2, 3, ...; (3) {x ∈ Rn : xj ≥ 0, j = 1, ..., n}; (4) {x ∈ Rn : Σ_{j=1}^n xj ≥ 5}.
Exercise A.4 Mark in the list to follow those sets which are closed and those which are open:
1. All vectors with integer coordinates
2. All vectors with rational coordinates
3. All vectors with positive coordinates
4. All vectors with nonnegative coordinates
5. {x : |x| < 1};
6. {x : |x| = 1};
7. {x : |x| ≤ 1};
8. {x : |x| ≥ 1}:
9. {x : |x| > 1};
10. {x : 1 < |x| ≤ 2}.
Verify the following facts
1. A set X ⊂ Rn is closed if and only if its complement X̄ = Rn \X is open;
2. The intersection of every family (finite or infinite) of closed sets is closed. The union of every family (finite
or infinite) of open sets is open.
3. Union of finitely many closed sets is closed. Intersection of finitely many open sets is open.
3. {x ∈ R2 : x1 = 0};
4. {x ∈ R2 : x2 = 0};
5. {x ∈ R2 : x1 + x2 = 0};
6. {x ∈ R2 : x1 − x2 = 0};
7. {x ∈ R2 : |x1 − x2| ≤ x1⁴ + x2⁴};
• Let f : Rn → Rm be a continuous mapping. Mark those of the following statements which always
are true:
1. If U is an open set in Rm , then so is the set f −1 (U ) = {x : f (x) ∈ U };
2. If U is an open set in Rn , then so is the set f (U ) = {f (x) : x ∈ U };
3. If F is a closed set in Rm , then so is the set f −1 (F ) = {x : f (x) ∈ F };
4. If F is a closed set in Rn, then so is the set f(F) = {f(x) : x ∈ F}.
is continuous on X;
(ii) [stability of continuity w.r.t. superposition] Let
• X ⊂ Rn , Y ⊂ Rm ;
• f : X → Rm be a continuous mapping such that f (x) ∈ Y for every x ∈ X;
• g : Y → Rk be a continuous mapping.
Then the composite mapping
h(x) = g(f (x)) : X → Rk
is continuous on X.
Proof. Assume, on the contrary to what should be proved, that there exists ε > 0 such
that for every δ > 0 one can find a pair of points x, y in X such that |x − y| < δ and
|f(x) − f(y)| ≥ ε. In particular, for every i = 1, 2, ... we can find two points xi, yi in X
such that |xi − yi| ≤ 1/i and |f(xi) − f(yi)| ≥ ε. By Theorem A.4.2, we can extract from
the sequence {xi} a subsequence {xij}_{j=1}^∞ which converges to a certain point x̄ ∈ X. Since
|yij − xij| ≤ 1/ij → 0 as j → ∞, the sequence {yij}_{j=1}^∞ converges to the same point x̄ as the
sequence {xij}_{j=1}^∞ (why?). Since f is continuous, we have
lim_{j→∞} f(xij) = f(x̄) = lim_{j→∞} f(yij),
whence lim_{j→∞}(f(xij) − f(yij)) = 0, which contradicts the fact that |f(xij) − f(yij)| ≥ ε > 0
for all j.
(iii) Let f be a real-valued continuous function on X. Then f attains its minimum on X, same as it attains its maximum:
∃x∗ ∈ X : f(x∗) = min_{x∈X} f(x); ∃x∗ ∈ X : f(x∗) = max_{x∈X} f(x).
Proof: Let us prove that f attains its maximum on X (the proof for the minimum is completely
similar). Since f is bounded on X by (i), the quantity
f∗ = sup_{x∈X} f(x)
is finite; of course, we can find a sequence {xi} of points from X such that f∗ = lim_{i→∞} f(xi).
By Theorem A.4.2, we can extract from the sequence {xi} a subsequence {xij}_{j=1}^∞ which
converges to a certain point x̄ ∈ X. Since f is continuous on X, we have
f(x̄) = lim_{j→∞} f(xij) = f∗,
so that the maximum is attained at x̄.
Exercise A.6 Prove that in general none of the three statements in Theorem A.5.2 remains valid when
X is closed but not bounded, same as when X is bounded but not closed.
exists, then the linear function f′(x)∆x of ∆x approximates the change f(x + ∆x) − f(x) in f up to a
remainder which is infinitesimal of a higher order than |∆x| as ∆x → 0:
f(x + ∆x) − f(x) = f′(x)∆x + ō(|∆x|).
In the above formula, we meet with the notation ō(|∆x|), and here is the explanation of this notation:
ō(|∆x|) is a common name of all functions φ(∆x) of ∆x which are well-defined in a neighbourhood of the point ∆x = 0 on the axis, vanish at the point ∆x = 0 and are such that
φ(∆x)/|∆x| → 0 as ∆x → 0.
For example,
1. (∆x)2 = ō(|∆x|), ∆x → 0,
2. |∆x|1.01 = ō(|∆x|), ∆x → 0,
3. sin2 (∆x) = ō(|∆x|), ∆x → 0,
4. ∆x ≠ ō(|∆x|), ∆x → 0.
Later on we shall meet with the notation “ō(|∆x|k ) as ∆x → 0”, where k is a positive integer.
The definition is completely similar to the one for the case of k = 1:
ō(|∆x|^k) is a common name of all functions φ(∆x) of ∆x which are well-defined in a neighbourhood of the point ∆x = 0 on the axis, vanish at the point ∆x = 0 and are such that
φ(∆x)/|∆x|^k → 0 as ∆x → 0.
Note that if f (·) is a function defined in a neighbourhood of a point x on the axis, then there perhaps
are many linear functions a∆x of ∆x which well approximate f (x + ∆x) − f (x), in the sense that the
remainder in the approximation
f (x + ∆x) − f (x) − a∆x
tends to 0 as ∆x → 0; among these approximations, however, there exists at most one which approximates
f (x + ∆x) − f (x) “very well” – so that the remainder is ō(|∆x|), and not merely tends to 0 as ∆x → 0.
Indeed, if
f (x + ∆x) − f (x) − a∆x = ō(|∆x|),
then, dividing both sides by ∆x, we get
a = lim_{∆x→0} (f(x + ∆x) − f(x))/∆x = f′(x).
Thus, if a linear function a∆x of ∆x approximates the change f (x + ∆x) − f (x) in f up to the remainder
which is ō(|∆x|) as ∆x → 0, then a is the derivative of f at x. You can easily verify that the converse statement also is true: if the derivative of f at x exists, then the linear function f′(x)∆x of ∆x approximates
the change f (x + ∆x) − f (x) in f up to the remainder which is ō(|∆x|) as ∆x → 0.
The advantage of the “ō(|∆x|)”-definition of derivative is that it can be naturally extended onto
vector-valued functions of vector arguments (you should just replace “axis” with Rn in the definition of
ō) and enlightens the essence of the notion of derivative: when it exists, this is exactly the linear function
of ∆x which approximates the change f (x + ∆x) − f (x) in f up to a remainder which is ō(|∆x|). The
precise definition is as follows:
∀∆x ∈ Rn: Df(x)[∆x] = lim_{t→+0} (f(x + t∆x) − f(x))/t. (A.6.2)
In particular, the derivative Df (x)[·] is uniquely defined by f and x.
Proof. We have f(x + t∆x) − f(x) = Df(x)[t∆x] + ō(|t∆x|) = tDf(x)[∆x] + ō(t) for t > 0; dividing by t and passing to the limit as t → +0, we arrive at (A.6.2).
1. The Euclidean norm
f(x) = |x|
possesses, at the point x = 0, derivatives in all directions:
lim_{t→+0} (f(0 + t∆x) − f(0))/t = |∆x|;
these derivatives, however, depend non-linearly on ∆x, so that the Euclidean norm is not differentiable at the origin (although it is differentiable everywhere outside the origin, but this is another
story).
2. It should be stressed that the derivative, if it exists, is what it is: a linear function of ∆x ∈ Rn taking
values in Rm . As we shall see in a while, we can represent this function by something “tractable”,
like a vector or a matrix, and can understand how to compute such a representation; however,
an intelligent reader should bear in mind that a representation is not exactly the same as the
represented entity. Sometimes the difference between derivatives and the entities which represent
them is reflected in the terminology: what we call the derivative, is also called the differential,
while the word “derivative” is reserved for the vector/matrix representing the differential.
Case of m = 1 – the gradient. Let us start with real-valued functions (i.e., with the case of
m = 1); in this case the derivative is a linear real-valued function on Rn . As we remember, the standard
Euclidean structure on Rn allows to represent every linear function on Rn as the inner product of the
argument with a certain fixed vector. In particular, the derivative Df(x)[∆x] of a scalar function can be
represented as
Df(x)[∆x] = [vector]T∆x;
what is denoted “vector” in this relation is called the gradient of f at x and is denoted by ∇f(x):
Df(x)[∆x] = (∇f(x))T∆x. (A.6.3)
How to compute the gradient? The answer is given by (A.6.2). Indeed, let us look what (A.6.3) and
(A.6.2) say when ∆x is the i-th standard basic orth. According to (A.6.3), Df (x)[ei ] is the i-th coordinate
of the vector ∇f (x); according to (A.6.2),
Df(x)[ei] = lim_{t→+0} (f(x + tei) − f(x))/t,
Df(x)[ei] = −Df(x)[−ei] = −lim_{t→+0} (f(x − tei) − f(x))/t = lim_{t→−0} (f(x + tei) − f(x))/t
⇒ Df(x)[ei] = ∂f(x)/∂xi.
Thus,
If a real-valued function f is differentiable at x, then the first order partial derivatives of f
at x exist, and the gradient of f at x is just the vector with the coordinates which are the
first order partial derivatives of f taken at x:
∇f(x) = (∂f(x)/∂x1, ..., ∂f(x)/∂xn)T.
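The coordinate-wise recipe above is easy to sanity-check numerically; here is a small sketch comparing symmetric difference quotients with a hand-computed gradient (the test function is ours):

import numpy as np

def numeric_gradient(f, x, eps=1e-6):
    # partial derivatives, coordinate by coordinate, via central differences
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = 1.0
        g[i] = (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    return g

f = lambda x: np.sin(x[0]) + x[0] * x[1] ** 2
x = np.array([0.3, -1.2])
exact = np.array([np.cos(x[0]) + x[1] ** 2, 2 * x[0] * x[1]])
print(numeric_gradient(f, x), exact)   # the two agree to roughly 1e-9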
General case – the Jacobian. Now let f : Rn → Rm with m ≥ 1. In this case, Df (x)[∆x],
regarded as a function of ∆x, is a linear mapping from Rn to Rm ; as we remember, the standard way to
represent a linear mapping from Rn to Rm is to represent it as the multiplication by m × n matrix:
Df (x)[∆x] = [m × n matrix] · ∆x. (A.6.4)
What is denoted by “matrix” in (A.6.4) is called the Jacobian of f at x and is denoted by f′(x). How
to compute the entries of the Jacobian? Here again the answer is readily given by (A.6.2). Indeed, on
one hand, we have
Df(x)[∆x] = f′(x)∆x, (A.6.5)
whence
[Df(x)[ej]]i = (f′(x))ij, i = 1, ..., m, j = 1, ..., n.
On the other hand, denoting
f(x) = (f1(x), ..., fm(x)),
the same computation as in the case of gradient demonstrates that
[Df(x)[ej]]i = ∂fi(x)/∂xj,
and we arrive at the following conclusion:
If a vector-valued function f(x) = (f1(x), ..., fm(x)) is differentiable at x, then the first order
partial derivatives of all fi at x exist, and the Jacobian of f at x is just the m × n matrix
with the entries [∂fi(x)/∂xj]_{i,j} (so that the rows in the Jacobian are [∇f1(x)]T, ..., [∇fm(x)]T).
The derivative of f, taken at x, is the linear vector-valued function of ∆x given by
Df(x)[∆x] = f′(x)∆x = ([∇f1(x)]T∆x; ...; [∇fm(x)]T∆x).
Remark A.6.1 Note that for a real-valued function f we have defined both the gradient ∇f (x) and the
Jacobian f′(x). These two entities are “nearly the same”, but not exactly the same: the Jacobian is a
row vector, and the gradient is a column vector; they are linked by the relation
f′(x) = (∇f(x))T.
Of course, both these representations of the derivative of f yield the same linear approximation of the
change in f:
Df(x)[∆x] = (∇f(x))T∆x = f′(x)∆x.
Theorem A.6.2 (i) [Differentiability and linear operations] Let f1 (x), f2 (x) be mappings defined in a
neighbourhood of x0 ∈ Rn and taking values in Rm , and λ1 (x), λ2 (x) be real-valued functions defined
in a neighbourhood of x0 . Assume that f1 , f2 , λ1 , λ2 are differentiable at x0 . Then so is the function
f (x) = λ1 (x)f1 (x) + λ2 (x)f2 (x), with the derivative at x0 given by
Df(x0)[∆x] = [Dλ1(x0)[∆x]]f1(x0) + λ1(x0)Df1(x0)[∆x] + [Dλ2(x0)[∆x]]f2(x0) + λ2(x0)Df2(x0)[∆x]
⇓
f′(x0) = f1(x0)[∇λ1(x0)]T + λ1(x0)f1′(x0) + f2(x0)[∇λ2(x0)]T + λ2(x0)f2′(x0).
Example 1: The gradient of an affine function. An affine function
f(x) = a + gTx
of x ∈ Rn is differentiable at every point (Theorem A.6.1) and its gradient, of course, equals g: indeed,
Df(x)[∆x] = lim_{t→+0} t^{-1}[(a + gT(x + t∆x)) − (a + gTx)] = gT∆x,
and we arrive at
∇(a + gTx) = g.
Example 2: The gradient of a quadratic form. For the time being, let us define a
homogeneous quadratic form on Rn as a function
f(x) = Σ_{i,j} Aijxixj = xTAx,
where A is an n × n matrix. Note that the matrices A and AT define the same quadratic form, and
therefore the symmetric matrix B = 12 (A + AT ) also produces the same quadratic form as A and AT .
It follows that we always may assume (and do assume from now on) that the matrix A producing the
quadratic form in question is symmetric.
A quadratic form is a simple polynomial and as such is differentiable at every point (Theorem A.6.1).
What is the gradient of f at a point x? Here is the computation:
Df(x)[∆x]
= lim_{t→+0} t^{-1}[(x + t∆x)TA(x + t∆x) − xTAx]   [(A.6.2)]
= lim_{t→+0} t^{-1}[xTAx + t(∆x)TAx + txTA∆x + t²(∆x)TA∆x − xTAx]   [opening parentheses]
= lim_{t→+0} t^{-1}[2t(Ax)T∆x + t²(∆x)TA∆x]   [arithmetics + symmetry of A]
= 2(Ax)T∆x.
We conclude that
∇(xTAx) = 2Ax
(recall that A = AT).
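A quick numerical confirmation of ∇(xTAx) = 2Ax, comparing a difference quotient along a random direction with the claimed gradient (the data below are random, seeded for reproducibility):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
A = (A + A.T) / 2                                     # symmetrize, as agreed above
x, dx, t = rng.standard_normal(4), rng.standard_normal(4), 1e-7
f = lambda y: y @ A @ y
print((f(x + t * dx) - f(x)) / t, (2 * A @ x) @ dx)   # the directional derivative, twice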
Example 3: The derivative of the log-det barrier. Let us compute the derivative of the
log-det barrier (playing an extremely important role in modern optimization)
F(X) = ln Det(X);
here X is an n × n matrix (or, if you prefer, an n²-dimensional vector). Note that F(X) is well-defined
and differentiable in a neighbourhood of every point X̄ with positive determinant (indeed, Det (X) is a
polynomial of the entries of X and thus – is everywhere continuous and differentiable with continuous
partial derivatives, while the function ln(t) is continuous and differentiable on the positive ray; by The-
orems A.5.1.(ii), A.6.2.(ii), F is differentiable at every X such that Det (X) > 0). The reader is kindly
asked to try to find the derivative of F by the standard techniques; if the result will not be obtained in,
say, 30 minutes, please look at the 8-line computation to follow (in this computation, Det (X̄) > 0, and
G(X) = Det (X)):
DF(X̄)[∆X]
= D ln(G(X̄))[DG(X̄)[∆X]]   [chain rule]
= G^{-1}(X̄)DG(X̄)[∆X]   [ln′(t) = t^{-1}]
= Det^{-1}(X̄) lim_{t→+0} t^{-1}[Det(X̄ + t∆X) − Det(X̄)]   [definition of G and (A.6.2)]
= Det^{-1}(X̄) lim_{t→+0} t^{-1}[Det(X̄(I + tX̄^{-1}∆X)) − Det(X̄)]
= Det^{-1}(X̄) lim_{t→+0} t^{-1}[Det(X̄)(Det(I + tX̄^{-1}∆X) − 1)]   [Det(AB) = Det(A)Det(B)]
= lim_{t→+0} t^{-1}[Det(I + tX̄^{-1}∆X) − 1]
= Tr(X̄^{-1}∆X) = Σ_{i,j} [X̄^{-1}]ji(∆X)ij,
where the concluding equality
lim_{t→+0} t^{-1}[Det(I + tA) − 1] = Tr(A) (A.6.6)
is immediately given by recalling what is the determinant of I + tA: this is a polynomial of t which is the
sum of products, taken along all diagonals of an n × n matrix and assigned certain signs, of the entries of
I + tA. On every one of these diagonals, except for the main one, there are at least two cells with entries
proportional to t, so that the corresponding products do not contribute to the constant and the linear in t
terms in Det (I + tA) and thus do not affect the limit in (A.6.6). The only product which does contribute
to the linear and the constant terms in Det (I + tA) is the product (1 + tA11 )(1 + tA22 )...(1 + tAnn )
coming from the main diagonal; it is clear that in this product the constant term is 1, and the linear in
t term is t(A11 + ... + Ann ), and (A.6.6) follows.
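The identity DF(X̄)[∆X] = Tr(X̄^{-1}∆X) is easy to test numerically; a sketch (the mildly perturbed identity matrix is our way of guaranteeing Det(X) > 0):

import numpy as np

rng = np.random.default_rng(2)
X = np.eye(4) + 0.1 * rng.standard_normal((4, 4))   # eigenvalues near 1, so Det(X) > 0
dX, t = rng.standard_normal((4, 4)), 1e-7

logdet = lambda M: np.linalg.slogdet(M)[1]          # ln Det(M), computed stably
fd = (logdet(X + t * dX) - logdet(X)) / t           # difference quotient
print(fd, np.trace(np.linalg.solve(X, dX)))         # both approximate Tr(X^{-1} dX)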
Suppose that the mapping f : Rn → Rm is differentiable at every point x of an open set U. The Jacobian of this mapping, J(x) = f′(x), is a mapping from Rn to the space Rm×n of m × n matrices, i.e., is a
mapping taking values in a certain RM (M = mn). The derivative of this mapping, if it exists, is called the
second derivative of f ; it again is a mapping from Rn to certain RM and as such can be differentiable,
and so on, so that we can speak about the second, the third, ... derivatives of a vector-valued function of
vector argument. A sufficient condition for the existence of k derivatives of f in U is that f is Ck in U ,
i.e., that all partial derivatives of f of orders ≤ k exist and are continuous everywhere in U (cf. Theorem
A.6.1).
We have explained what it means that f has k derivatives in U; note, however, that according to
the definition, highest order derivatives at a point x are just long vectors; say, the second order derivative
of a scalar function f of 2 variables is the Jacobian of the mapping x 7→ f 0 (x) : R2 → R2 , i.e., a mapping
from R2 to R2×2 = R4 ; the third order derivative of f is therefore the Jacobian of a mapping from R2
to R4 , i.e., a mapping from R2 to R4×2 = R8 , and so on. The question which should be addressed now
is: What is a natural and transparent way to represent the highest order derivatives?
The answer is as follows:
(∗) Let f : Rn → Rm be Ck on an open set U ⊂ Rn. The derivative of order ℓ ≤ k of f,
taken at a point x ∈ U, can be naturally identified with a function
Dℓf(x)[∆x1, ∆x2, ..., ∆xℓ]
of ℓ vector arguments ∆xi ∈ Rn, i = 1, ..., ℓ, and taking values in Rm. This function is linear
in every one of the arguments ∆xi, the other arguments being fixed, and is symmetric with
respect to permutation of the arguments ∆x1, ..., ∆xℓ.
In terms of f, the quantity Dℓf(x)[∆x1, ∆x2, ..., ∆xℓ] (full name: “the ℓ-th derivative (or
differential) of f taken at a point x along the directions ∆x1, ..., ∆xℓ”) is given by
Dℓf(x)[∆x1, ∆x2, ..., ∆xℓ] = ∂ℓ/∂tℓ∂tℓ−1...∂t1 |_{t1=...=tℓ=0} f(x + t1∆x1 + t2∆x2 + ... + tℓ∆xℓ).
(A.6.7)
The explanation to our claims is as follows. Let f : Rn → Rm be Ck on an open set U ⊂ Rn .
1. When ℓ = 1, (∗) says to us that the first order derivative of f, taken at x, is a linear function
Df(x)[∆x1] of ∆x1 ∈ Rn, taking values in Rm, and that the value of this function at every ∆x1
is given by the relation
Df(x)[∆x1] = ∂/∂t1 |_{t1=0} f(x + t1∆x1) (A.6.8)
(cf. (A.6.2)), which is in complete accordance with what we already know about the derivative.
2. To understand what is the second derivative, let us take the first derivative Df (x)[∆x1 ], let us
temporarily fix somehow the argument ∆x1 and treat the derivative as a function of x. As a
function of x, ∆x1 being fixed, the quantity Df (x)[∆x1 ] is again a mapping which maps U into
Rm and is differentiable by Theorem A.6.1 (provided, of course, that k ≥ 2). The derivative of
this mapping is certain linear function of ∆x ≡ ∆x2 ∈ Rn , depending on x as on a parameter; and
of course it depends on ∆x1 as on a parameter as well. Thus, the derivative of Df (x)[∆x1 ] in x is
certain function
D2 f (x)[∆x1 , ∆x2 ]
of x ∈ U and ∆x1 , ∆x2 ∈ Rn and taking values in Rm . What we know about this function is
that it is linear in ∆x2 . In fact, it is also linear in ∆x1 , since it is the derivative in x of certain
function (namely, of Df (x)[∆x1 ]) linearly depending on the parameter ∆x1 , so that the derivative
of the function in x is linear in the parameter ∆x1 as well (differentiation is a linear operation
with respect to a function we are differentiating: summing up functions and multiplying them by
real constants, we sum up, respectively, multiply by the same constants, the derivatives). Thus,
D2 f (x)[∆x1 , ∆x2 ] is linear in ∆x1 when x and ∆x2 are fixed, and is linear in ∆x2 when x and
∆x1 are fixed. Moreover, we have
D²f(x)[∆x1, ∆x2] = ∂/∂t2 |_{t2=0} Df(x + t2∆x2)[∆x1]   [cf. (A.6.8)]
= ∂/∂t2 |_{t2=0} ∂/∂t1 |_{t1=0} f(x + t2∆x2 + t1∆x1)   [by (A.6.8)]
= ∂²/∂t2∂t1 |_{t1=t2=0} f(x + t1∆x1 + t2∆x2)
(A.6.9)
as claimed in (A.6.7) for ` = 2. The only piece of information about the second derivative which
is contained in (∗) and is not justified yet is that D2 f (x)[∆x1 , ∆x2 ] is symmetric in ∆x1 , ∆x2 ;
but this fact is readily given by the representation (A.6.7), since, as they prove in Calculus, if a
function φ possesses continuous partial derivatives of orders ≤ ` in a neighbourhood of a point,
then these derivatives in this neighbourhood are independent of the order in which they are taken;
it follows that
D²f(x)[∆x1, ∆x2] = ∂²/∂t2∂t1 |_{t1=t2=0} f(x + t1∆x1 + t2∆x2)   [(A.6.9)]
= ∂²/∂t1∂t2 |_{t1=t2=0} f(x + t1∆x1 + t2∆x2)   [symmetry of partial derivatives]
= ∂²/∂t1∂t2 |_{t1=t2=0} f(x + t2∆x2 + t1∆x1)
= D²f(x)[∆x2, ∆x1]   [the same (A.6.9)]
3. Now it is clear how to proceed: to define D3 f (x)[∆x1 , ∆x2 , ∆x3 ], we fix in the second order
derivative D2 f (x)[∆x1 , ∆x2 ] the arguments ∆x1 , ∆x2 and treat it as a function of x only, thus
arriving at a mapping which maps U into Rm and depends on ∆x1 , ∆x2 as on parameters (lin-
early in every one of them). Differentiating the resulting mapping in x, we arrive at a function
D3 f (x)[∆x1 , ∆x2 , ∆x3 ] which by construction is linear in every one of the arguments ∆x1 , ∆x2 ,
∆x3 and satisfies (A.6.7); the latter relation, due to the Calculus result on the symmetry of partial
derivatives, implies that D³f(x)[∆x1, ∆x2, ∆x3] is symmetric in ∆x1, ∆x2, ∆x3. After we have at
our disposal the third derivative D³f, we can build from it in the already explained fashion the
fourth derivative, and so on, until the k-th derivative is defined.
Remark A.6.2 Since Dℓf(x)[∆x1, ..., ∆xℓ] is linear in every one of the ∆xi, we can expand the derivative
in a multiple sum:
∆xi = Σ_{j=1}^n ∆xij ej
⇓
Dℓf(x)[∆x1, ..., ∆xℓ] = Σ_{j1=1}^n ... Σ_{jℓ=1}^n Dℓf(x)[ej1, ..., ejℓ] ∆x1j1...∆xℓjℓ (A.6.10)
What is the origin of the coefficients Dℓf(x)[ej1, ..., ejℓ]? According to (A.6.7), one has
Dℓf(x)[ej1, ..., ejℓ] = ∂ℓ/∂tℓ∂tℓ−1...∂t1 |_{t1=...=tℓ=0} f(x + t1ej1 + t2ej2 + ... + tℓejℓ)
= ∂ℓf(x)/∂xjℓ∂xjℓ−1...∂xj1,
so that the coefficients in (A.6.10) are nothing but the partial derivatives, of order ℓ, of f.
Remark A.6.3 An important particular case of relation (A.6.7) is the one when ∆x1 = ∆x2 = ... = ∆xℓ;
let us call the common value of these ℓ vectors d. According to (A.6.7), we have
Dℓf(x)[d, d, ..., d] = ∂ℓ/∂tℓ∂tℓ−1...∂t1 |_{t1=...=tℓ=0} f(x + t1d + t2d + ... + tℓd).
Introducing the univariate function
φ(t) = f(x + td),
we see that
φ(ℓ)(0) = ∂ℓ/∂tℓ∂tℓ−1...∂t1 |_{t1=...=tℓ=0} f(x + t1d + t2d + ... + tℓd) = Dℓf(x)[d, ..., d].
In other words, Dℓf(x)[d, ..., d] is what is called the ℓ-th directional derivative of f taken at x along the
direction d; to define this quantity, we pass from the function f of several variables to the univariate function
φ(t) = f(x + td) – restrict f onto the line passing through x and directed by d – and then take the
“usual” derivative of order ℓ of the resulting function of a single real variable t at the point t = 0 (which
corresponds to the point x of our line).
We may say that the k-th derivative can be represented by a k-index collection of m-dimensional vectors ∂^k f(x)/∂x_{i_k} ∂x_{i_{k-1}} ... ∂x_{i_1}. This collection, however, is a difficult-to-handle entity, so that such a representation does not help. There is, however, a case when the collection becomes an entity we know how to handle; this is the case of the second-order derivative of a scalar function (k = 2, m = 1). In this case, the collection in question is just a symmetric matrix H(x) = [∂^2 f(x)/∂x_i ∂x_j]_{1≤i,j≤n}. This matrix is called the Hessian of f at x. Note that

    D^2 f(x)[∆x^1, ∆x^2] = (∆x^1)^T H(x) ∆x^2.
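Here is a small numerical sketch of this identity (again assuming NumPy; the test function and the step size are hypothetical illustrations): the Hessian is approximated by central differences, and the bilinear form is evaluated on two directions in both orders, exhibiting the symmetry of D^2 f.

    import numpy as np

    def f(x):                                  # a hypothetical smooth scalar test function
        return np.sin(x[0]) * x[1]**2 + x[0] * x[1]

    def hessian_fd(f, x, h=1e-5):              # finite-difference Hessian via the mixed-partials stencil
        n = x.size
        H = np.zeros((n, n))
        I = np.eye(n)
        for i in range(n):
            for j in range(n):
                H[i, j] = (f(x + h*I[i] + h*I[j]) - f(x + h*I[i] - h*I[j])
                           - f(x - h*I[i] + h*I[j]) + f(x - h*I[i] - h*I[j])) / (4 * h**2)
        return H

    x = np.array([0.4, 1.3])
    H = hessian_fd(f, x)
    d1, d2 = np.array([1.0, -2.0]), np.array([0.5, 0.25])
    print(d1 @ H @ d2, d2 @ H @ d1)            # D^2 f(x)[d1, d2] = D^2 f(x)[d2, d1]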
Theorem A.6.3 (i) Let U be an open set in R^n, f_1(·), f_2(·) : R^n → R^m be C^k in U, and let real-valued functions λ_1(·), λ_2(·) be C^k in U. Then the function

    f(x) = λ_1(x) f_1(x) + λ_2(x) f_2(x)

is C^k in U.
(ii) Let U be an open set in Rn , V be an open set in Rm , let a mapping f : Rn → Rm be Ck in
U and such that f (x) ∈ V for x ∈ U , and, finally, let a mapping g : Rm → Rp be Ck in V . Then the
superposition
h(x) = g(f (x))
is Ck in U .
Remark A.6.4 For higher order derivatives, in contrast to the first order ones, there is no simple "chain rule" for computing the derivative of a superposition. For example, the second-order derivative of the superposition h(x) = g(f(x)) of two C^2-mappings is given by the formula

    D^2 h(x)[∆x^1, ∆x^2] = D^2 g(f(x))[Df(x)[∆x^1], Df(x)[∆x^2]] + Dg(f(x))[D^2 f(x)[∆x^1, ∆x^2]]

(check it!). We see that both the first- and the second-order derivatives of f and g contribute to the second-order derivative of the superposition h.
The only case when there does exist a simple formula for high order derivatives of a superposition is
the case when the inner function is affine: if f (x) = Ax + b and h(x) = g(f (x)) = g(Ax + b) with a C`
mapping g, then
D` h(x)[∆x1 , ..., ∆x` ] = D` g(Ax + b)[A∆x1 , ..., A∆x` ]. (A.6.11)
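A minimal numerical check of (A.6.11) for ℓ = 2 (assuming NumPy; the outer function g(y) = Σ_i exp(y_i) and all data are hypothetical illustrations): the Hessian of h(x) = g(Ax + b) should be A^T H_g(Ax + b) A, which is compared below against a finite-difference mixed second derivative of h.

    import numpy as np

    A = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
    b = np.array([0.1, 0.2, -0.3])
    g = lambda y: np.sum(np.exp(y))            # outer smooth function
    h = lambda x: g(A @ x + b)                 # superposition with affine inner map

    x = np.array([0.05, -0.07])
    Hg = np.diag(np.exp(A @ x + b))            # Hessian of g at y = Ax + b
    Hh = A.T @ Hg @ A                          # Hessian of h predicted by (A.6.11) with l = 2

    d1, d2 = np.array([1.0, 0.5]), np.array([-0.3, 2.0])
    t = 1e-5                                   # finite-difference value of D^2 h(x)[d1, d2]
    num = (h(x + t*d1 + t*d2) - h(x + t*d1 - t*d2)
           - h(x - t*d1 + t*d2) + h(x - t*d1 - t*d2)) / (4 * t**2)
    print(num, d1 @ Hh @ d2)                   # agree up to discretization error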
Note that for an affine scalar function f(x) = c + b^T x, the first derivative

    Df(x)[∆x^1] = b^T ∆x^1

is independent of x, and therefore the derivative of Df(x)[∆x^1] in x, which should give us the second derivative D^2 f(x)[∆x^1, ∆x^2], is zero. Clearly, the third, the fourth, etc., derivatives of an affine function are zero as well.
Similarly, for the quadratic form f(x) = x^T Ax (A = A^T) the first derivative is Df(x)[∆x^1] = 2x^T A∆x^1. Differentiating in x, we get

    D^2 f(x)[∆x^1, ∆x^2] = lim_{t→+0} t^{-1} [2(x + t∆x^2)^T A∆x^1 − 2x^T A∆x^1] = 2(∆x^2)^T A∆x^1,

so that

    D^2 f(x)[∆x^1, ∆x^2] = 2(∆x^2)^T A∆x^1.
Note that the second derivative of a quadratic form is independent of x; consequently, the third, the
fourth, etc., derivatives of a quadratic form are identically zero.
Consider next the function F(X) = ln Det(X) of a nondegenerate n × n matrix X, for which DF(X)[∆X] = Tr(X^{-1}∆X). To differentiate the right hand side in X, let us first find the derivative of the mapping G(X) = X^{-1} which is defined on the open set of non-degenerate n × n matrices. We have

    DG(X)[∆X] = −X^{-1} ∆X X^{-1}

(to see it, differentiate in t the identity G(X + t∆X)(X + t∆X) = I), whence

    D^2 F(X)[∆X^1, ∆X^2] = −Tr(X^{-1} ∆X^1 X^{-1} ∆X^2).

Since Tr(AB) = Tr(BA) (check it!) for all matrices A, B such that the product AB makes sense and is square, the right hand side in the above formula is symmetric in ∆X^1, ∆X^2, as it should be for the second derivative of a C^2 function.
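A numerical sketch of these two formulas (assuming NumPy; the matrix data are random illustrations, and a positive definite X is used so that Det(X) > 0):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    M = rng.standard_normal((n, n))
    X = M @ M.T + n * np.eye(n)                    # a positive definite (hence nondegenerate) X
    dX1, dX2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))

    F = lambda X: np.linalg.slogdet(X)[1]          # ln Det(X)
    Xinv = np.linalg.inv(X)
    t = 1e-5

    d1 = (F(X + t*dX1) - F(X - t*dX1)) / (2*t)     # first derivative along dX1
    print(d1, np.trace(Xinv @ dX1))

    d2 = (F(X + t*dX1 + t*dX2) - F(X + t*dX1 - t*dX2)
          - F(X - t*dX1 + t*dX2) + F(X - t*dX1 - t*dX2)) / (4 * t**2)
    print(d2, -np.trace(Xinv @ dX1 @ Xinv @ dX2))  # symmetric in dX1, dX2 by Tr(AB) = Tr(BA)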
The first order Taylor expansion F_1(x) = f(x̄) + Df(x̄)[x − x̄] – this is the affine function of x which approximates "very well" f(x) in a neighbourhood of x̄, namely, within approximation error ō(|x − x̄|). A similar fact is true for Taylor expansions of higher order, F_k(x) = Σ_{ℓ=0}^k (1/ℓ!) D^ℓ f(x̄)[x − x̄, ..., x − x̄]:
Theorem A.6.4 Let f : Rn → Rm be Ck in a neighbourhood of x̄, and let Fk (x) be the Taylor expansion
of f at x̄ of degree k. Then
(i) Fk (x) is a vector-valued polynomial of full degree ≤ k (i.e., every one of the coordinates of the
vector Fk (x) is a polynomial of x1 , ..., xn , and the sum of powers of xi ’s in every term of this polynomial
does not exceed k);
(ii) F_k(x) approximates f(x) in a neighbourhood of x̄ up to a remainder which is ō(|x − x̄|^k) as x → x̄: for every ε > 0, there exists δ > 0 such that

    |x − x̄| ≤ δ ⇒ |f(x) − F_k(x)| ≤ ε|x − x̄|^k.

F_k(·) is the unique polynomial with components of full degree ≤ k which approximates f up to a remainder which is ō(|x − x̄|^k).
(iii) The value and the derivatives of Fk of orders 1, 2, ..., k, taken at x̄, are the same as the value and
the corresponding derivatives of f taken at the same point.
As stated in the Theorem, F_k(x) approximates f(x) for x close to x̄ up to a remainder which is ō(|x − x̄|^k). In many cases it is not enough to know that the remainder is "ō(|x − x̄|^k)"; we need an explicit bound on this remainder. The standard bound of this type is as follows:
Theorem A.6.5 Let k be a positive integer, and let f : R^n → R^m be C^{k+1} in a ball B_r = B_r(x̄) = {x ∈ R^n : |x − x̄| < r} of a radius r > 0 centered at a point x̄. Assume that the directional derivatives of order k + 1, taken at every point of B_r along every unit direction, do not exceed in magnitude a certain L < ∞:

    |D^{k+1} f(x)[d, ..., d]| ≤ L    ∀(x ∈ B_r, d : |d| = 1).

Then

    |f(x) − F_k(x)| ≤ L|x − x̄|^{k+1}/(k + 1)!    ∀(x ∈ B_r).
Thus, in a neighbourhood of x̄ the remainder of the k-th order Taylor expansion, taken at x̄, is of order
of L|x − x̄|k+1 , where L is the maximal (over all unit directions and all points from the neighbourhood)
magnitude of the directional derivatives of order k + 1 of f .
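The bound is easy to observe numerically. A minimal univariate sketch (n = m = 1, f = exp, x̄ = 0, k = 2; it assumes NumPy, and the radius r is a hypothetical choice): on B_r one has |f'''| ≤ e^r = L, so the remainder of F_2(t) = 1 + t + t^2/2 must be ≤ L|t|^3/3!.

    import numpy as np

    r = 0.5
    L = np.exp(r)                              # bound on |f'''| = e^t over |t| < r
    t = np.linspace(-r + 1e-9, r - 1e-9, 1001)
    F2 = 1 + t + t**2 / 2                      # second order Taylor polynomial of exp at 0
    remainder = np.abs(np.exp(t) - F2)
    bound = L * np.abs(t)**3 / 6               # L |t|^{k+1} / (k+1)! with k = 2
    print(np.all(remainder <= bound))          # True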
The space M_{m,n} of m × n real matrices is equipped with the inner product ⟨X, Y⟩ = Tr(X^T Y) = Σ_{i,j} X_{ij}Y_{ij}. Here Tr stands for the trace – the sum of diagonal elements of a (square) matrix. With this inner product (called the Frobenius inner product), M_{m,n} becomes a legitimate Euclidean space, and we may use in connection with this space all notions based upon the Euclidean structure, e.g., the (Frobenius) norm of a matrix

    ‖X‖_2 = √⟨X, X⟩ = √(Σ_{i=1}^m Σ_{j=1}^n X_{ij}^2) = √(Tr(X^T X)),

and likewise the notions of orthogonality, orthogonal complement of a linear subspace, etc. The same applies to the space S^m of symmetric m × m matrices equipped with the Frobenius inner product; of course, the Frobenius inner product of symmetric matrices can be written without the transposition sign:

    ⟨X, Y⟩ = Tr(XY), X, Y ∈ S^m.
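As a quick numerical sanity check (assuming NumPy; the matrix is a random illustration), the expressions for the Frobenius norm coincide:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((3, 5))
    fro_trace = np.sqrt(np.trace(X.T @ X))     # sqrt(Tr(X^T X))
    fro_sum   = np.sqrt(np.sum(X**2))          # square root of the sum of squared entries
    fro_lib   = np.linalg.norm(X, 'fro')       # library Frobenius norm
    print(fro_trace, fro_sum, fro_lib)         # all three coincide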
Theorem A.7.1 An n × n matrix A is symmetric if and only if there exists an orthonormal basis e_1, ..., e_n of R^n such that Ae_i = λ_i e_i, i = 1, ..., n, for reals λ_i.

In connection with Theorem A.7.1, it is worth recalling the following notions and facts: an eigenvector of a square matrix A is a nonzero vector e such that Ae = λe for a certain λ; this λ is called the eigenvalue of A corresponding to the eigenvector e, and the eigenvalues are exactly the roots of the characteristic polynomial det(zI − A) of A.
Theorem A.7.1 states, in particular, that for a symmetric matrix A, all eigenvalues are real, and the
corresponding eigenvectors can be chosen to be real and to form an orthonormal basis in Rn .
Theorem A.7.2 An n × n matrix A is symmetric if and only if it can be represented in the form
A = U ΛU T , (A.7.2)
where
• U is an orthogonal matrix: U^{-1} = U^T (or, which is the same, U^T U = I, or, which is the same, U U^T = I, or, which is the same, the columns of U form an orthonormal basis in R^n, or, which is the same, the rows of U form an orthonormal basis in R^n);
• Λ is the diagonal matrix with the diagonal entries λ1 , ..., λn .
Representation (A.7.2) with orthogonal U and diagonal Λ is called the eigenvalue decomposition of A.
In such a representation,
• The columns of U form an orthonormal system of eigenvectors of A;
• The diagonal entries in Λ are the eigenvalues of A corresponding to these eigenvectors.
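Numerically, the eigenvalue decomposition of a symmetric matrix is delivered, e.g., by NumPy's eigh routine; a minimal sketch (random data; note that eigh returns the eigenvalues in ascending rather than non-ascending order):

    import numpy as np

    rng = np.random.default_rng(2)
    M = rng.standard_normal((4, 4))
    A = (M + M.T) / 2                                # a symmetric matrix

    lam, U = np.linalg.eigh(A)                       # eigenvalues (ascending) and orthonormal eigenvectors
    print(np.allclose(U @ np.diag(lam) @ U.T, A))    # A = U Lambda U^T
    print(np.allclose(U.T @ U, np.eye(4)))           # U is orthogonal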
A.7.2.D. Freedom in eigenvalue decomposition. Part of the data Λ, U in the eigenvalue de-
composition (A.7.2) is uniquely defined by A, while the other data admit certain “freedom”. Specifically,
the sequence λ1 , ..., λn of eigenvalues of A (i.e., diagonal entries of Λ) is exactly the sequence of roots
of the characteristic polynomial of A (every root is repeated according to its multiplicity) and thus is
uniquely defined by A (provided that we arrange the entries of the sequence in the non-ascending order).
The columns of U are not uniquely defined by A. What is uniquely defined are the linear spans E(λ) of the columns of U corresponding to all eigenvalues equal to a certain λ; such a linear span is nothing but the spectral subspace {x : Ax = λx} of A corresponding to the eigenvalue λ. There are as many spectral subspaces as there are distinct eigenvalues; spectral subspaces corresponding to different eigenvalues of a symmetric matrix are orthogonal to each other, and their sum is the entire space. When building an orthogonal matrix U in the spectral decomposition, one chooses an orthonormal eigenbasis in the spectral subspace corresponding to the largest eigenvalue and makes the vectors of this basis the first columns of U, then chooses an orthonormal basis in the spectral subspace corresponding to the second largest eigenvalue and makes the vectors of this basis the next columns of U, and so on.
A.7.2.E. Simultaneous decomposition of commuting symmetric matrices. If A_1, ..., A_k are symmetric n × n matrices which commute with each other (A_i A_j = A_j A_i for all i, j), then they admit a common eigenvalue decomposition: there exist an orthogonal matrix U and diagonal matrices Λ_1, ..., Λ_k such that

    A_i = U Λ_i U^T, i = 1, ..., k.

You are welcome to prove this statement by yourself; to simplify your task, here are two simple statements, important in their own right, which will help you to reach the target:
A.7.2.E.1: Let λ be a real and A, B be two commuting n × n matrices. Then the spectral
subspace E = {x : Ax = λx} of A corresponding to λ is invariant for B (i.e., Be ∈ E for
every e ∈ E).
A.7.2.E.2: If A is an n × n matrix and L is an invariant subspace of A (i.e., L is a linear subspace such that Ae ∈ L whenever e ∈ L), then the orthogonal complement L^⊥ of L is invariant for the matrix A^T. In particular, if A is symmetric and L is an invariant subspace of A, then L^⊥ is an invariant subspace of A as well.
Theorem A.7.3 [VCE – Variational Characterization of Eigenvalues] Let A be a symmetric n × n matrix with eigenvalues λ_1(A) ≥ λ_2(A) ≥ ... ≥ λ_n(A). Then for every ℓ, 1 ≤ ℓ ≤ n, one has

    λ_ℓ(A) = min_{E∈E_ℓ} max_{x∈E: x^T x=1} x^T Ax,    (A.7.3)

where E_ℓ is the family of all linear subspaces of R^n of dimension n − ℓ + 1.

VCE says that to get the largest eigenvalue λ_1(A), you should maximize the quadratic form x^T Ax over the unit sphere S = {x ∈ R^n : x^T x = 1}; the maximum is exactly λ_1(A). To get the second largest eigenvalue λ_2(A), you should act as follows: you choose a linear subspace E of dimension n − 1 and maximize the quadratic form x^T Ax over the cross-section of S by this subspace; the maximum value of the form depends on E, and you minimize this maximum over linear subspaces E of the dimension
n − 1; the result is exactly λ_2(A). To get λ_3(A), you replace in the latter construction subspaces of the dimension n − 1 by those of the dimension n − 2, and so on. In particular, the smallest eigenvalue λ_n(A) is just the minimum, over all linear subspaces E of the dimension n − n + 1 = 1, i.e., over all lines passing through the origin, of the quantities x^T Ax, where x ∈ E is unit (x^T x = 1); in other words, λ_n(A) is just the minimum of the quadratic form x^T Ax over the unit sphere S.
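The proof below shows that the outer minimum in (A.7.3) is attained at E = G_ℓ = Lin{e_ℓ, ..., e_n}. This is easy to check numerically; a minimal sketch (assuming NumPy; random data and the choice ℓ = 2 are illustrations): the inner maximum over the unit vectors of G_ℓ is the largest eigenvalue of A restricted to G_ℓ.

    import numpy as np

    rng = np.random.default_rng(3)
    M = rng.standard_normal((5, 5))
    A = (M + M.T) / 2
    lam, U = np.linalg.eigh(A)
    lam, U = lam[::-1], U[:, ::-1]                     # reorder: lambda_1 >= ... >= lambda_n

    l = 2
    G = U[:, l-1:]                                     # orthonormal basis of G_l = Lin{e_l, ..., e_n}
    inner_max = np.linalg.eigvalsh(G.T @ A @ G).max()  # max of x^T A x over unit x in G_l
    print(inner_max, lam[l-1])                         # both equal lambda_l(A)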
Proof of the VCE is pretty easy. Let e_1, ..., e_n be an orthonormal eigenbasis of A: Ae_ℓ = λ_ℓ(A)e_ℓ. For 1 ≤ ℓ ≤ n, let F_ℓ = Lin{e_1, ..., e_ℓ}, G_ℓ = Lin{e_ℓ, e_{ℓ+1}, ..., e_n}. Finally, for x ∈ R^n let ξ(x) be the vector of coordinates of x in the orthonormal basis e_1, ..., e_n. Note that

    x^T x = ξ^T(x)ξ(x),

since {e_1, ..., e_n} is an orthonormal basis, and that

    x^T Ax = x^T A(Σ_i ξ_i(x)e_i) = x^T Σ_i λ_i(A)ξ_i(x)e_i = Σ_i λ_i(A)ξ_i(x)(x^T e_i) = Σ_i λ_i(A)ξ_i^2(x),    (A.7.4)

where the concluding equality uses x^T e_i = ξ_i(x).
Now, given ℓ, 1 ≤ ℓ ≤ n, let us set E = G_ℓ; note that E is a linear subspace of dimension n − ℓ + 1. In view of (A.7.4), the maximum of the quadratic form x^T Ax over the intersection of our E with the unit sphere is

    max { Σ_{i=ℓ}^n λ_i(A)ξ_i^2 : Σ_{i=ℓ}^n ξ_i^2 = 1 },

and the latter quantity clearly equals max_{ℓ≤i≤n} λ_i(A) = λ_ℓ(A). Thus, for appropriately chosen E ∈ E_ℓ, the inner maximum in the right hand side of (A.7.3) equals λ_ℓ(A), whence the right hand side of (A.7.3) is ≤ λ_ℓ(A). It remains to prove the opposite inequality. To this end,
consider a linear subspace E of dimension n − ℓ + 1 and observe that it has nontrivial intersection with the linear subspace F_ℓ of dimension ℓ (indeed, dim E + dim F_ℓ = (n − ℓ + 1) + ℓ > n, so that dim(E ∩ F_ℓ) > 0 by the Dimension formula). It follows that there exists a unit vector y belonging to both E and F_ℓ. Since y is a unit vector from F_ℓ, we have y = Σ_{i=1}^ℓ η_i e_i with Σ_{i=1}^ℓ η_i^2 = 1, whence, by (A.7.4),

    y^T Ay = Σ_{i=1}^ℓ λ_i(A)η_i^2 ≥ min_{1≤i≤ℓ} λ_i(A) = λ_ℓ(A).

We conclude that

    max_{x∈E: x^T x=1} x^T Ax ≥ y^T Ay ≥ λ_ℓ(A).
Since E is an arbitrary subspace from E_ℓ, we conclude that the right hand side in (A.7.3) is ≥ λ_ℓ(A).
A simple and useful byproduct of our reasoning is the relation (A.7.4):
Corollary A.7.1 For a symmetric matrix A, the quadratic form x^T Ax is a weighted sum of squares of the coordinates ξ_i(x) of x taken with respect to an orthonormal eigenbasis of A; the weights in this sum are exactly the eigenvalues of A:

    x^T Ax = Σ_i λ_i(A)ξ_i^2(x).
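A one-line numerical confirmation (assuming NumPy; random data): with ξ = U^T x, the quadratic form equals Σ_i λ_i ξ_i^2.

    import numpy as np

    rng = np.random.default_rng(4)
    M = rng.standard_normal((4, 4))
    A = (M + M.T) / 2
    lam, U = np.linalg.eigh(A)
    x = rng.standard_normal(4)
    xi = U.T @ x                                 # coordinates of x in the orthonormal eigenbasis
    print(x @ A @ x, np.sum(lam * xi**2))        # the two values coincide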
Proposition A.7.1 A symmetric matrix A is positive semidefinite if and only if its eigenvalues are nonnegative; A is positive definite if and only if all eigenvalues of A are positive.

Indeed, A is positive definite if and only if the minimum value of x^T Ax over the unit sphere is positive, and is positive semidefinite if and only if this minimum value is nonnegative; it remains to note that by VCE, the minimum value of x^T Ax over the unit sphere is exactly the minimum eigenvalue of A.
Proposition A.7.2 If A ⪰ B, then λ(A) ≥ λ(B), and if A ≻ B, then λ(A) > λ(B).

Indeed, A ⪰ B means that x^T Ax ≥ x^T Bx for every x, whence, for every linear subspace E,

    max_{x∈E: x^T x=1} x^T Ax ≥ max_{x∈E: x^T x=1} x^T Bx;

taking the minimum of both sides over E ∈ E_ℓ and invoking VCE (A.7.3), we get λ_ℓ(A) ≥ λ_ℓ(B) for every ℓ; the strict case is completely similar.
Theorem A.7.4 [Eigenvalue Interlacement Theorem] Let A be a symmetric n × n matrix and Ā be the angular (n − k) × (n − k) submatrix of A. Then, for every ℓ ≤ n − k, the ℓ-th eigenvalue of Ā separates the ℓ-th and the (ℓ + k)-th eigenvalues of A:

    λ_{ℓ+k}(A) ≤ λ_ℓ(Ā) ≤ λ_ℓ(A).    (A.7.5)

Indeed, by VCE, λ_ℓ(Ā) = min_{E∈Ē_ℓ} max_{x∈E: x^T x=1} x^T Ax, where Ē_ℓ is the family of all linear subspaces of the dimension n − k − ℓ + 1 contained in the linear subspace {x ∈ R^n : x_{n−k+1} = x_{n−k+2} = ... = x_n = 0}. Since Ē_ℓ ⊂ E_{ℓ+k}, we have

    λ_ℓ(Ā) = min_{E∈Ē_ℓ} max_{x∈E: x^T x=1} x^T Ax ≥ min_{E∈E_{ℓ+k}} max_{x∈E: x^T x=1} x^T Ax = λ_{ℓ+k}(A).

We have proved the left inequality in (A.7.5). Applying this inequality to the matrix −A (and renaming the index accordingly), we get λ_ℓ(Ā) ≤ λ_ℓ(A), which is the right inequality in (A.7.5).
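The interlacement is easy to observe numerically; a minimal sketch (assuming NumPy; random data with n = 6, k = 2 and a small tolerance for round-off are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(5)
    n, k = 6, 2
    M = rng.standard_normal((n, n))
    A = (M + M.T) / 2
    Abar = A[:n-k, :n-k]                             # angular (n-k) x (n-k) submatrix

    lamA = np.sort(np.linalg.eigvalsh(A))[::-1]      # lambda_1(A) >= ... >= lambda_n(A)
    lamAbar = np.sort(np.linalg.eigvalsh(Abar))[::-1]

    tol = 1e-12
    ok = all(lamA[l+k-1] <= lamAbar[l-1] + tol and lamAbar[l-1] <= lamA[l-1] + tol
             for l in range(1, n - k + 1))           # lambda_{l+k}(A) <= lambda_l(Abar) <= lambda_l(A)
    print(ok)                                        # True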
A.7.4.A. Positive semidefinite matrices. A symmetric matrix A is called positive semidefinite (notation: A ⪰ 0), if the corresponding quadratic form is nonnegative:

    A ⪰ 0 ⇔ {A = A^T and x^T Ax ≥ 0 ∀x}.

A is called positive definite (notation: A ≻ 0), if it is positive semidefinite and the corresponding quadratic form is positive outside the origin:

    A ≻ 0 ⇔ {A = A^T and x^T Ax > 0 ∀x ≠ 0}.
Theorem A.7.5 Let A be a symmetric n × n matrix. Then the following properties of A are equivalent to each other:
(i) A ⪰ 0
(ii) λ(A) ≥ 0
(iii) A = D^T D for certain rectangular matrix D
(iv) A = ∆^T ∆ for certain upper triangular n × n matrix ∆
(v) A = B^2 for certain symmetric matrix B;
(vi) A = B^2 for certain B ⪰ 0.
The following properties of a symmetric matrix A also are equivalent to each other:
(i′) A ≻ 0
(ii′) λ(A) > 0
(iii′) A = D^T D for certain rectangular matrix D of rank n
(iv′) A = ∆^T ∆ for certain nondegenerate upper triangular n × n matrix ∆
(v′) A = B^2 for certain nondegenerate symmetric matrix B;
(vi′) A = B^2 for certain B ≻ 0.
Proof. (i)⇒(ii): This implication is given by Proposition A.7.1.
(ii)⇒(vi): Let A = UΛU^T be an eigenvalue decomposition of A, so that, by (ii), the diagonal entries of Λ are nonnegative. Setting B = UΛ^{1/2}U^T, where Λ^{1/2} is the diagonal matrix with the diagonal entries λ_i^{1/2}(A), we get a symmetric matrix with nonnegative eigenvalues, i.e., B ⪰ 0, such that B^2 = UΛ^{1/2}Λ^{1/2}U^T = UΛU^T = A, as required in (vi).
(vi)⇒(v): evident.
(v)⇒(iv): Let A = B^2 with certain symmetric B, and let b_i be the i-th column of B. Applying the Gram-Schmidt orthogonalization process (see proof of Theorem A.2.3.(iii)), we can find an orthonormal system of vectors u_1, ..., u_n and a lower triangular matrix L such that b_i = Σ_{j=1}^i L_{ij} u_j, or, which is the same, B^T = LU, where U is the orthogonal matrix with the rows u_1^T, ..., u_n^T. We now have A = B^2 = B^T(B^T)^T = LU U^T L^T = LL^T. We see that A = ∆^T ∆, where the matrix ∆ = L^T is upper triangular.
(iv)⇒(iii): evident.
(iii)⇒(i): If A = D^T D, then x^T Ax = (Dx)^T(Dx) ≥ 0 for all x.
We have proved the equivalence of the properties (i) – (vi). Slightly modifying the reasoning (do it yourself!), one can prove the equivalence of the properties (i′) – (vi′).
Remark A.7.1 (i) [Checking positive semidefiniteness] Given an n × n symmetric matrix A, one can check whether it is positive semidefinite by a purely algebraic finite algorithm (the so-called Lagrange diagonalization of a quadratic form) which requires at most O(n^3) arithmetic operations. Positive definiteness of a matrix can be checked also by the Choleski factorization algorithm which finds the decomposition in (iv′), if it exists, in approximately (1/6)n^3 arithmetic operations.
There exists another useful algebraic criterion (Sylvester’s criterion) for positive semidefiniteness of
a matrix; according to this criterion, a symmetric matrix A is positive definite if and only if its angular
minors are positive, and A is positive semidefinite if and only if all its principal minors are nonnegative. For example, a symmetric 2 × 2 matrix A = [a b; b c] is positive semidefinite if and only if a ≥ 0, c ≥ 0 and det(A) ≡ ac − b^2 ≥ 0.
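A sketch of both checks (assuming NumPy; the matrix is a hypothetical illustration): NumPy's cholesky raises LinAlgError precisely when the matrix is not (numerically) positive definite, and the 2 × 2 semidefiniteness test is spelled out directly.

    import numpy as np

    A = np.array([[2.0, 1.0], [1.0, 2.0]])           # a = 2, b = 1, c = 2

    try:                                             # positive definiteness via Choleski factorization
        Ltri = np.linalg.cholesky(A)                 # A = Ltri Ltri^T, i.e., Delta = Ltri^T in (iv')
        print("positive definite")
    except np.linalg.LinAlgError:
        print("not positive definite")

    a, b, c = A[0, 0], A[0, 1], A[1, 1]              # the 2 x 2 criterion above
    print(a >= 0 and c >= 0 and a*c - b**2 >= 0)     # positive semidefinite: True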
(ii) [Square root of a positive semidefinite matrix] By the first chain of equivalences in Theorem A.7.5, a symmetric matrix A is ⪰ 0 if and only if A is the square of a positive semidefinite matrix B. The latter matrix is uniquely defined by A ⪰ 0 and is called the square root of A (notation: A^{1/2}).
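A^{1/2} can be computed from the eigenvalue decomposition; a minimal sketch (assuming NumPy; random data, with tiny negative round-off in the eigenvalues clipped to zero):

    import numpy as np

    rng = np.random.default_rng(6)
    D = rng.standard_normal((3, 4))
    A = D.T @ D                                  # a positive semidefinite matrix (Theorem A.7.5.(iii))

    lam, U = np.linalg.eigh(A)
    lam = np.clip(lam, 0.0, None)                # remove tiny negative round-off
    B = U @ np.diag(np.sqrt(lam)) @ U.T          # the square root A^{1/2}: B >= 0 and B^2 = A
    print(np.allclose(B @ B, A))                 # True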
A.7.4.B. The semidefinite cone. When adding symmetric matrices and multiplying them by
reals, we add, respectively multiply by reals, the corresponding quadratic forms. It follows that
A.7.4.B.1: The sum of positive semidefinite matrices and a product of a positive semidefinite
matrix and a nonnegative real is positive semidefinite,
or, which is the same (see Section 1.1.4),
A.7.4.B.2: n × n positive semidefinite matrices form a cone S^n_+ in the Euclidean space S^n of symmetric n × n matrices, the Euclidean structure being given by the Frobenius inner product ⟨A, B⟩ = Tr(AB) = Σ_{i,j} A_{ij}B_{ij}.
The cone S^n_+ is called the semidefinite cone of size n. It is immediately seen that the semidefinite cone S^n_+ is "good" (see Lecture 5), specifically:
• S^n_+ is closed: the limit of a converging sequence of positive semidefinite matrices is positive semidefinite;
• S^n_+ is pointed: the only n × n matrix A such that both A and −A are positive semidefinite is the zero n × n matrix;
• S^n_+ possesses a nonempty interior, which is comprised of the positive definite matrices.
Note that the relation A ⪰ B means exactly that A − B ∈ S^n_+, while A ≻ B is equivalent to A − B ∈ int S^n_+. The "matrix inequalities" A ⪰ B (A ≻ B) match the standard properties of the usual scalar inequalities, e.g.:

    A ⪰ A                                                  [reflexivity]
    A ⪰ B, B ⪰ A ⇒ A = B                                   [antisymmetry]
    A ⪰ B, B ⪰ C ⇒ A ⪰ C                                   [transitivity]
    A ⪰ B, C ⪰ D ⇒ A + C ⪰ B + D                           [compatibility with linear operations, I]
    A ⪰ B, λ ≥ 0 ⇒ λA ⪰ λB                                 [compatibility with linear operations, II]
    A_i ⪰ B_i, A_i → A, B_i → B as i → ∞ ⇒ A ⪰ B           [closedness]

with evident modifications when ⪰ is replaced with ≻, or

    A ⪰ B, C ≻ D ⇒ A + C ≻ B + D,

etc. Along with these standard properties of inequalities, the inequality ⪰ possesses a nice additional property:
A.7.4.B.3: In a valid ⪰-inequality

    A ⪰ B

one can multiply both sides from the left and from the right by a (rectangular) matrix and its transpose:

    A, B ∈ S^n, A ⪰ B, V ∈ M_{n,m}
    ⇓
    V^T A V ⪰ V^T B V
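A quick numerical illustration of A.7.4.B.3 (assuming NumPy; the data are random, with A ⪰ B arranged by construction):

    import numpy as np

    rng = np.random.default_rng(7)
    n, m = 4, 3
    M = rng.standard_normal((n, n))
    B = (M + M.T) / 2
    C = rng.standard_normal((n, n))
    A = B + C.T @ C                                  # A >= B, since A - B = C^T C >= 0
    V = rng.standard_normal((n, m))

    gap = V.T @ (A - B) @ V                          # V^T A V - V^T B V
    print(np.linalg.eigvalsh(gap).min() >= -1e-12)   # psd gap: V^T A V >= V^T B V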
An additional important fact is that the semidefinite cone is self-dual: a symmetric matrix Y satisfies Tr(Y X) ≥ 0 for all X ⪰ 0 if and only if Y ⪰ 0.
"if" part: let Y ⪰ 0 and X ⪰ 0. Representing Y = Σ_i λ_i(Y) e_i e_i^T, where e_1, ..., e_n is an orthonormal eigenbasis of Y (Theorem A.7.2), we get

    Tr(Y X) = Σ_i λ_i(Y) Tr(e_i e_i^T X) = Σ_i λ_i(Y) e_i^T X e_i,    (A.7.6)

where the concluding equality is given by the following well-known property of the trace:
A.7.4.B.4: Whenever matrices A, B are such that the product AB makes sense and is a
square matrix, one has
Tr(AB) = Tr(BA).
Indeed, we should verify that if A ∈ M_{p,q} and B ∈ M_{q,p}, then Tr(AB) = Tr(BA). The left hand side quantity in our hypothetic equality is Σ_{i=1}^p Σ_{j=1}^q A_{ij}B_{ji}, and the right hand side quantity is Σ_{j=1}^q Σ_{i=1}^p B_{ji}A_{ij}; they indeed are equal.
Looking at the concluding quantity in (A.7.6), we see that it indeed is nonnegative whenever X ⪰ 0 (since Y ⪰ 0 and thus λ_i(Y) ≥ 0 by Theorem A.7.5).
"only if" part: We are given Y such that Tr(Y X) ≥ 0 for all matrices X ⪰ 0, and we should prove that Y ⪰ 0. This is immediate: for every vector x, the matrix X = xx^T is positive semidefinite (Theorem A.7.5.(iii)), so that 0 ≤ Tr(Y xx^T) = Tr(x^T Y x) = x^T Y x. Since the resulting inequality x^T Y x ≥ 0 is valid for every x, we have Y ⪰ 0.
A.7.4.B.5: The Schur Complement Lemma is the following simple and extremely useful fact:

Proposition A.7.3 [Schur Complement Lemma] A symmetric block 2 × 2 matrix

    A = [ P    Q
          Q^T  R ]

with R ≻ 0 is positive definite (semidefinite) if and only if the matrix

    P − QR^{-1}Q^T

is positive definite (resp. semidefinite).
Proof is immediate. When P ∈ S^p, R ∈ S^q, splitting a vector x ∈ R^{p+q} into blocks u ∈ R^p, v ∈ R^q, we have x^T Ax = u^T P u + 2u^T Qv + v^T Rv, whence

    A ⪰ 0 ⇔ u^T P u + 2u^T Qv + v^T Rv ≥ 0 ∀(u, v)
          ⇔ min_v [u^T P u + 2u^T Qv + v^T Rv] ≥ 0 ∀u
          ⇔ min_v [(v + R^{-1}Q^T u)^T R (v + R^{-1}Q^T u)] + u^T [P − QR^{-1}Q^T] u ≥ 0 ∀u
          ⇔ u^T [P − QR^{-1}Q^T] u ≥ 0 ∀u
          ⇔ P − QR^{-1}Q^T ⪰ 0,

as claimed by the "positive semidefinite" version of the Lemma. The same reasoning, with evident modifications, justifies the "positive definite" claim.
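A numerical sketch of the Lemma (assuming NumPy; random block data with R made positive definite by construction): the positive semidefiniteness test applied to A and to its Schur complement gives the same verdict.

    import numpy as np

    rng = np.random.default_rng(8)
    p, q = 3, 2
    Mp = rng.standard_normal((p, p)); P = (Mp + Mp.T) / 2
    Q = rng.standard_normal((p, q))
    Mq = rng.standard_normal((q, q)); R = Mq @ Mq.T + np.eye(q)   # R is positive definite

    A = np.block([[P, Q], [Q.T, R]])
    S = P - Q @ np.linalg.inv(R) @ Q.T                            # the Schur complement

    psd_A = np.linalg.eigvalsh(A).min() >= -1e-10
    psd_S = np.linalg.eigvalsh(S).min() >= -1e-10
    print(psd_A == psd_S)                                         # True: the verdicts coincide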