A MATHEMATICAL VIEW
OF INTERIOR-POINT METHODS
IN CONVEX OPTIMIZATION
MPS/SIAM Series on Optimization
This series is published jointly by the Mathematical Programming Society and the Society
for Industrial and Applied Mathematics. It includes research monographs, textbooks at all
levels, books on applications, and tutorials. Besides being of high scientific quality, books
in the series must advance the understanding and practice of optimization and be written
clearly, in a manner appropriate to their level.
Editor-in-Chief
John E. Dennis, Jr., Rice University
Editorial Board
Daniel Bienstock, Columbia University
John R. Birge, Northwestern University
Andrew V. Goldberg, InterTrust Technologies Corporation
Matthias Heinkenschloss, Rice University
David S. Johnson, AT&T Labs - Research
Gil Kalai, Hebrew University
Ravi Kannan, Yale University
C. T. Kelley, North Carolina State University
Jan Karel Lenstra, Technische Universiteit Eindhoven
Adrian S. Lewis, University of Waterloo
Daniel Ralph, The Judge Institute of Management Studies
James Renegar, Cornell University
Alexander Schrijver, CWI, The Netherlands
David P. Williamson, IBM T.J. Watson Research Center
Jochem Zowe, University of Erlangen-Nuremberg, Germany
Series Volumes
Renegar, James, A Mathematical View of Interior-Point Methods in Convex Optimization
Ben-Tal, Aharon and Nemirovski, Arkadi, Lectures on Modern Convex Optimization:
Analysis, Algorithms, and Engineering Applications
Conn, Andrew R., Gould, Nicholas I. M., and Toint, Philippe L., Trust-Region Methods
A MATHEMATICAL VIEW
OF INTERIOR-POINT METHODS
IN CONVEX OPTIMIZATION
James Renegar
Cornell University
Ithaca, New York
Society for Industrial and Applied Mathematics, Philadelphia
Mathematical Programming Society, Philadelphia
Copyright ©2001 by the Society for Industrial and Applied Mathematics.
All rights reserved. Printed in the United States of America. No part of this book
may be reproduced, stored, or transmitted in any manner without the written
permission of the publisher. For information, write to the Society for Industrial
and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA
19104-2688.
Preface vii
1 Preliminaries 1
1.1 Linear Algebra 2
1.2 Gradients 5
1.3 Hessians 9
1.4 Convexity 11
1.5 Fundamental Theorems of Calculus 14
1.6 Newton's Method 18
Bibliography 115
Index 117
Preface
This book aims at developing a thorough understanding of the most general theory for
interior-point methods, a class of algorithms for convex optimization problems. The study
of these algorithms has dominated the continuous optimization literature for nearly 15
years, beginning with the paper by Karmarkar [10]. In that time, the theory has matured
tremendously, perhaps most notably due to the path-breaking, broadly encompassing, and
enormously influential work of Nesterov and Nemirovskii [15].
Much of the literature on the general theory of interior-point methods is difficult to
understand, even for specialists. My hope is that this book will make the most general theory
accessible to a wide audience—especially Ph.D. students, the next generation of optimizers.
The book might be used for graduate courses in optimization, both in mathematics and
engineering departments. Moreover, it is self-contained and thus appropriate for individual
study by students as well as by seasoned researchers who wish to better assimilate the most
general interior-point method theory.
This book grew out of my lecture notes for the Ph.D. course on interior-point methods
at Cornell University. The impetus for writing the book came from an invitation to give the
annual lecture series at the Center for Operations Research and Econometrics (CORE) at
Louvain-la-Neuve, Belgium, in fall, 1996. The book is being published by CORE as well
as in the SIAM-MPS series.*
The writing of this book has been a particularly satisfying experience. It has brought
into sharp focus the beauty and coherence of the interior-point method theory as developed
and influenced through the efforts of many researchers. I hope those researchers will not
be offended by my choice to cite few references (and I hope they largely agree that the
ones I have cited are the ones that should be cited, given the material covered herein).
The reader can look to recent articles and books for extensive bibliographies (cf. [21],
[22], [24], [25]); a great place to stay up to date is Interior-Point Methods Online
(http://www-unix.mcs.anl.gov/otc/InteriorPoint/).
In citing few references, my intent is not to give the impression of research originality.
Indeed, most of the results in this monograph are undoubtedly embedded somewhere in [15],
[16], and [17] (which in turn were influenced by other works). My only claim to originality
is in presenting a simplifying perspective.
*This SIAM-MPS version is slightly more polished than the CORE version, having benefited from reviewers as
well as from students who combed the manuscript while taking the Ph.D. course on interior-point methods. I wish
to thank Osman Güler, who carefully read the manuscript in his role as a reviewer for the SIAM-MPS series. His
suggestions improved the presentation. I am very grateful to Francis Su and Trevor Park, both of whom suggested
many changes that benefit the reader.
Chapter 1
Preliminaries
where ||Ax|| := ⟨Ax, Ax⟩^{1/2}. In general, the function whose jth coordinate is ∂f/∂x_j is
the gradient only if ⟨ , ⟩ is the Euclidean inner product.
The natural geometry varies from point to point in the domains of optimization prob-
lems that can be solved by ipm's. As the algorithms progress from one point to the next, one
changes the inner product—and hence the geometry—to visualize the headway achieved
by the algorithms. The relevant inner products may bear no relation to an initially imposed
coordinate system. Consequently, in aiming for the most transparent and least cumbersome
proofs, one should dispense with coordinate systems.
We begin with a review of linear algebra by recalling, for example, the notion of a
self-adjoint linear operator. We then define gradients and Hessians, emphasizing how they
change when the underlying inner product is changed. Next is a brief review of basic results
for convex functionals, followed by results akin to the fundamental theorem of calculus.
Although these "calculus results" are elementary and dry, they are essential in achieving
lift-off for the ipm theory. Finally, we recall Newton's method for continuous optimization,
proving a standard theorem which later plays a central motivational role.
Seasoned researchers will likely find parts of this chapter to be overly detailed, espe-
cially the proofs for results which are essentially only multivariate calculus. The researchers
might wonder why standard calculus results are not employed for the sake of brevity. This
chapter is aimed at new Ph.D. students rather than seasoned researchers. The development is
meant to teach students to think coordinate-free, quite contrary to how multivariate calculus is typically taught.
Perhaps the most useful relation between the inner product and the norm is the Cauchy-Schwarz inequality,

    |⟨x, y⟩| ≤ ||x|| ||y||.

We remark that the Frobenius norm is submultiplicative, meaning ||XS|| ≤ ||X|| ||S||,
as is seen from the relations
Recall that vectors x_1, x_2 ∈ R^n are said to be orthogonal if ⟨x_1, x_2⟩ = 0. Recall that
a basis v_1, ..., v_n for R^n is said to be orthonormal if ⟨v_i, v_j⟩ = δ_{ij},
where δ_{ij} is the Kronecker delta. A linear operator (i.e., a linear transformation) Q : R^n → R^n is said to be orthogonal if ⟨Qx_1, Qx_2⟩ = ⟨x_1, x_2⟩ for all x_1, x_2.
Assume R^n and R^m are endowed with inner products and A : R^n → R^m is a linear operator. The unique linear operator A* : R^m → R^n satisfying ⟨A*w, x⟩ = ⟨w, Ax⟩ for all x ∈ R^n and w ∈ R^m is the adjoint of A. The range space of A* is orthogonal to the nullspace of A.
If A is surjective, then A* is injective and the linear operator A*(AA*)^{-1}A projects
R^n orthogonally onto the range space of A*; that is, the image of x is the point in the range
space closest to x. Likewise, I − A*(AA*)^{-1}A projects R^n orthogonally onto the nullspace
of A.
If both R^n and R^m are endowed with the dot product and if A and A* are written as
matrices, then A* = A^T, the transpose of A.
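The projection identities above are easy to check numerically. The following is a minimal sketch assuming NumPy, with a made-up surjective operator A; it verifies that A*(AA*)^{-1}A maps onto the range of A* while I − A*(AA*)^{-1}A maps into the nullspace of A.

```python
import numpy as np

# A made-up surjective operator A : R^5 -> R^2 (with the dot product, A* is the transpose).
A = np.array([[1.0, 2.0, 0.0, -1.0, 3.0],
              [0.0, 1.0, 1.0,  2.0, 0.0]])

P_range = A.T @ np.linalg.solve(A @ A.T, A)      # A*(AA*)^{-1}A
P_null  = np.eye(5) - P_range                    # I - A*(AA*)^{-1}A

x = np.random.default_rng(0).standard_normal(5)

print(np.allclose(A @ (P_null @ x), 0))          # True: P_null x lies in the nullspace of A
print(np.allclose(P_range @ x + P_null @ x, x))  # True: the two pieces sum to x
print(abs((P_range @ x) @ (P_null @ x)) < 1e-12) # True: and they are orthogonal
```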
It is a simple but important exercise for SDP to show that if S_1, ..., S_m ∈ S^{n×n} and
A : S^{n×n} → R^m is the linear operator defined by

    A(X) := (⟨S_1, X⟩, ..., ⟨S_m, X⟩),

then

    A*y = Σ_i y_i S_i,

assuming S^{n×n} is endowed with the trace product and R^m is endowed with the dot product.
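The adjoint identity ⟨A*y, X⟩ = ⟨y, A(X)⟩ behind this exercise can be checked numerically. A minimal sketch assuming NumPy, with made-up symmetric matrices S_1, ..., S_m:

```python
import numpy as np

rng = np.random.default_rng(1)
sym = lambda M: (M + M.T) / 2       # symmetrize a square matrix

n, m = 4, 3
S = [sym(rng.standard_normal((n, n))) for _ in range(m)]   # the data S_1, ..., S_m
X = sym(rng.standard_normal((n, n)))
y = rng.standard_normal(m)

A_X = np.array([np.trace(Si @ X) for Si in S])              # A(X)_i = <S_i, X> (trace product)
Aadj_y = sum(yi * Si for yi, Si in zip(y, S))               # A*y = sum_i y_i S_i

# Adjoint identity: <A*y, X> under the trace product equals <y, A(X)> under the dot product.
print(np.isclose(np.trace(Aadj_y @ X), y @ A_X))            # True
```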
Continuing to assume R^n and R^m are endowed with inner products, and hence norms,
one obtains an induced operator norm on the vector space consisting of linear operators
A : R^n → R^m:

    ||A|| := max{ ||Ax|| : ||x|| ≤ 1 }.
(Equivalently, A*w_i = γ_i u_i for i = 1, ..., r and Au_i = 0 for i > r.) The numbers γ_i are
the singular values of A; if r < n, then the number 0 is also considered to be a singular
value of A. It is easily seen that ||A|| = γ_r. Moreover,
so that the values γ_i (and possibly 0) are also the singular values of A*. It follows that
||A*|| = ||A||.
If R^n and R^m are endowed with the dot product, the singular-value decomposition
corresponds to the fact that if A is an m × n matrix, there exist orthogonal matrices Q_m
and Q_n such that Q_m A Q_n = Γ, where Γ is an m × n matrix with zeros everywhere except
possibly for positive numbers on the main diagonal.
It is not difficult to prove that a linear operator Q : R^n → R^n is orthogonal iff
Q* = Q^{-1}. For orthogonal operators, ||Q|| = 1.
A linear operator S : R^n → R^n is said to be self-adjoint if S = S*.
If ⟨ , ⟩ is the dot product and S is written as a matrix, then S being self-adjoint is
equivalent to S being symmetric.
It is instructive to show that for S ∈ S^{n×n}, the linear operator A : S^{n×n} → S^{n×n}
defined by

    A(X) := SXS

is self-adjoint. Such operators are important in the ipm theory for SDP.
A linear operator S : R^n → R^n is said to be positive semidefinite (psd) if S is self-adjoint and

    ⟨x, Sx⟩ ≥ 0 for all x.

If, in addition, ⟨x, Sx⟩ > 0 for all x ≠ 0, then S is said to be positive definite (pd). Keep in mind that self-adjointness is a part of
our definitions of psd and pd operators, contrary to the definitions found in much of the
optimization literature.
Each self-adjoint linear operator S has a spectral decomposition. Precisely, for each
self-adjoint linear operator S there exists an orthonormal basis v_1, ..., v_n and real numbers
λ_1 ≤ ... ≤ λ_n such that for all x,

    Sx = Σ_i λ_i ⟨v_i, x⟩ v_i.
The spectral decomposition for a psd operator S allows one to easily prove the existence of a psd operator S^{1/2} satisfying S = (S^{1/2})^2; simply replace λ_i by √λ_i in the
decomposition. In turn, the uniqueness of S^{1/2} can readily be proven by relying on the fact
that if T is a psd operator satisfying T^2 = S, then the eigenvectors for T are eigenvectors
for S. The operator S^{1/2} is the square root of S.
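The construction of the square root via the spectral decomposition translates directly into a short computation. A minimal sketch assuming NumPy, with an arbitrary made-up pd matrix S:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
S = M @ M.T + np.eye(5)                    # a pd matrix (self-adjoint w.r.t. the dot product)

lam, V = np.linalg.eigh(S)                 # spectral decomposition: S = V diag(lam) V^T
S_half = V @ np.diag(np.sqrt(lam)) @ V.T   # replace each eigenvalue by its square root

print(np.allclose(S_half @ S_half, S))     # True: (S^{1/2})^2 = S
print(np.allclose(S_half, S_half.T))       # True: S^{1/2} is again self-adjoint (and psd)
```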
Here is a crucial observation: If S is pd, then S defines a new inner product, namely,

    ⟨x, y⟩_S := ⟨x, Sy⟩.

Every inner product on R^n arises in this way; that is, regardless of the initial inner product
⟨ , ⟩, for every other inner product there exists S which is pd with respect to (w.r.t.) ⟨ , ⟩
and for which ⟨ , ⟩_S is precisely the other inner product. (We do not rely on this fact.)
Let || ||_S denote the norm induced by ⟨ , ⟩_S.
Assume A* is the adjoint of A : R^n → R^m. Assuming S and T are pd w.r.t. the
respective inner products, if the inner product on R^n is replaced by ⟨ , ⟩_S and that on R^m is
replaced by ⟨ , ⟩_T, then the adjoint of A becomes S^{-1}A*T, as is easily shown. Moreover,
letting ||A||_{S,T} denote the resulting operator norm, it is easily proven that

    ||A||_{S,T} = ||T^{1/2} A S^{-1/2}||.
Similarly, since for c ∈ R^n, the linear function given by x ↦ ⟨c, x⟩ is identical to the
function x ↦ ⟨S^{-1}c, x⟩_S, if c is the objective vector of an optimization problem written
in terms of ⟨ , ⟩—that is, if the objective is "min ⟨c, x⟩"—then when written in terms of
⟨ , ⟩_S, the objective vector is S^{-1}c. The notation used in this paragraph illustrates our
earlier assertion that the reference inner product will be useful in fixing notation as the inner
products change.
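These change-of-inner-product formulas are easy to verify numerically. A minimal sketch assuming NumPy, with made-up pd matrices S and T and a made-up operator A; it checks that ⟨c, x⟩ = ⟨S^{-1}c, x⟩_S and that S^{-1}A*T is the adjoint of A w.r.t. ⟨ , ⟩_S and ⟨ , ⟩_T:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 4, 2
S = (lambda M: M @ M.T + np.eye(n))(rng.standard_normal((n, n)))   # pd on R^n
T = (lambda M: M @ M.T + np.eye(m))(rng.standard_normal((m, m)))   # pd on R^m
A = rng.standard_normal((m, n))
c, x = rng.standard_normal(n), rng.standard_normal(n)
w = rng.standard_normal(m)

ip_S = lambda u, v: u @ S @ v      # <u, v>_S := <u, Sv>
ip_T = lambda u, v: u @ T @ v      # <u, v>_T := <u, Tv>

# The linear functional x -> <c, x> coincides with x -> <S^{-1}c, x>_S.
print(np.isclose(c @ x, ip_S(np.linalg.solve(S, c), x)))            # True

# The adjoint of A w.r.t. <,>_S and <,>_T is S^{-1} A^T T.
A_adj = np.linalg.solve(S, A.T @ T)
print(np.isclose(ip_T(w, A @ x), ip_S(A_adj @ w, x)))               # True
```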
It is instructive to consider the shape of the unit ball w.r.t. || ||_S viewed in terms of the
geometry of the reference inner product. The spectral decomposition of S easily implies
the unit ball to be an ellipsoid with axes in the directions of the orthonormal basis vectors
v_1, ..., v_n, the length of the axis in the direction of v_i being 2√(1/λ_i).
1.2 Gradients
Recall that a functional is a function whose range lies in R. We use D_f to denote the
domain of a functional f. It will always be assumed that D_f is an open subset of R^n in the
norm topology (recalling that all norms on R^n induce the same topology).
Let ⟨ , ⟩ denote an arbitrary inner product on R^n and let || || denote the norm induced
by ⟨ , ⟩.
The functional f is said to be (Fréchet) differentiable at x ∈ D_f if there exists a vector
g(x), the gradient of f at x, satisfying

    lim_{Δx → 0} ( f(x + Δx) − f(x) − ⟨g(x), Δx⟩ ) / ||Δx|| = 0.
To illustrate the definition, consider the functional f(X) := −ln det(X), with domain S^{n×n}_{++}, the set of all pd matrices in S^{n×n}. This functional plays an especially
important role in SDP. We claim that w.r.t. the trace product,

    g(X) = −X^{-1}.

Indeed, let ΔX ∈ S^{n×n} and denote the eigenvalues of X^{-1}(ΔX) by γ_1, ..., γ_n. Since
γ_1, ..., γ_n are also the eigenvalues of the symmetric matrix X^{-1/2}(ΔX)X^{-1/2}, the eigenvalues are real numbers. Note that X^{-1} • ΔX = Σ_i γ_i and

    f(X + ΔX) − f(X) = −ln det(I + X^{-1}(ΔX)) = −Σ_i ln(1 + γ_i).

Since ||X^{-1/2}(ΔX)X^{-1/2}|| = ||γ||, where ||γ|| is the Euclidean norm of the vector γ := (γ_1, ..., γ_n),
the inequalities in (1.1) show the statement "||ΔX|| → 0" is equivalent to "||γ|| → 0."
Finally, to establish the claim g(X) = −X^{-1}, observe that
Proof. Letting λ_1 denote the least eigenvalue of S and λ_n the greatest, the proof relies on
the relations
the relations
we have
Finally, we make a few observations that will be important for applying ipm theory
to optimization problems having linear equations among the constraints.
Recall that we use R^n to denote a generic finite-dimensional vector space with an inner
product, and hence our definition of the gradient applies to subspaces, too. In particular, if
f is defined on a vector space and the domain of f intersects a subspace L, one can speak
of the gradient of the function f|_L obtained by restricting f to L. Substituting L for R^n in
the definition of the gradient, we see that the gradient of f|_L at x ∈ D_f ∩ L is the vector
g|_L(x) ∈ L satisfying
1.3 Hessians
The functional f is said to be twice differentiable at x ∈ D_f if f ∈ C^1 and there exists a
linear operator H(x) : R^n → R^n satisfying
(If the Hessian matrix does not vary continuously in x, the order in which the partials are
taken can matter, resulting in a nonsymmetric matrix.)
To illustrate the definition of the Hessian, we again consider the functional f(X) := −ln det(X),
with domain S^{n×n}_{++}, the set of all pd matrices in S^{n×n}. We saw that g(X) = −X^{-1}. We claim
that H(X) is the linear operator given by

    H(X)ΔX = X^{-1}(ΔX)X^{-1}.

This can be proven by relying on the fact that if ||ΔX|| is sufficiently small, then
and hence
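The claimed gradient and Hessian of X ↦ −ln det(X) can be sanity-checked by finite differences. A minimal sketch assuming NumPy, with made-up test matrices X and ΔX (the step size and tolerances are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
X  = (lambda M: M @ M.T + n * np.eye(n))(rng.standard_normal((n, n)))   # pd test matrix
dX = (lambda M: (M + M.T) / 2)(rng.standard_normal((n, n)))             # symmetric direction
t  = 1e-6                                                                # finite-difference step

f = lambda Y: -np.log(np.linalg.det(Y))
g = lambda Y: -np.linalg.inv(Y)                              # claimed gradient (trace product)
H = lambda Y, D: np.linalg.inv(Y) @ D @ np.linalg.inv(Y)     # claimed Hessian action

# Directional derivative of f at X along dX vs. <g(X), dX> = trace(g(X) dX).
print(abs((f(X + t * dX) - f(X)) / t - np.trace(g(X) @ dX)) < 1e-4)     # True

# Change in gradients along dX vs. the Hessian acting on dX.
print(np.allclose((g(X + t * dX) - g(X)) / t, H(X, dX), atol=1e-4))     # True
```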
Theorem 1.3.1. If S is pd and f is twice differentiable at x, then the Hessian of f at x w.r.t.
⟨ , ⟩_S is S^{-1}H(x).
The lack of symmetry in the expression S^{-1}H(x) might puzzle readers who are
thinking of Hessians as being symmetric matrices. We remark that if one chose a basis for
R^n which is orthonormal w.r.t. ⟨ , ⟩_S and expressed the operator S^{-1}H(x) as a matrix
in terms of the resulting coordinate system, then the matrix would be symmetric. (Again
we emphasize that in developing ipm theory, it is best not to think in terms of coordinate
systems.)
Theorems 1.2.1 and 1.3.1 have the unsurprising consequence that the second-order
approximation of f at x is independent of the inner product:
1.4 Convexity
Recall that a set S ⊆ R^n is said to be convex if whenever x, y ∈ S and 0 ≤ t ≤ 1 we have
x + t(y − x) ∈ S.
Recall that a functional f is said to be convex if D_f is convex and if whenever
x, y ∈ D_f and 0 ≤ t ≤ 1, we have

    f(x + t(y − x)) ≤ (1 − t)f(x) + t f(y).

If the inequality is strict whenever 0 < t < 1 and x ≠ y, then f is said to be strictly convex.
The minimizers of a convex functional form a convex set. A strictly convex functional
has at most one minimizer.
Henceforth, we assume f ∈ C^2 and we assume D_f is an open, convex set.
If f is a univariate functional, we know from calculus that f is convex iff f''(x) ≥ 0
for all x ∈ D_f. Similarly, if f''(x) > 0 for all x ∈ D_f, then f is strictly convex. The
following standard theorem generalizes these facts.
Theorem 1.4.1. The functional f is convex iff H(x) is psd for all x ∈ D_f. If H(x) is pd
for all x ∈ D_f, then f is strictly convex.
The following elementary proposition, which is relied on in the proof of Theorem 1.4.1,
is fundamental in this book. It does not assume convexity of f .
Then
and
we have
an inequality that is certainly valid if φ is convex on the interval [0, 1]. Hence to prove (1.2)
it suffices to prove φ''(t) ≥ 0 for all 0 < t < 1. However, Proposition 1.4.2 implies
So the initial rate of change cannot exceed ||g(x)|| in magnitude, regardless of which direction v of unit length is chosen. However, assuming g(x) ≠ 0, if one chooses the direction
iff P_L g(z) = 0, that is, iff g(z) is orthogonal to L. In particular, if L is the nullspace of a
linear operator A, then z solves the optimization problem iff g(z) = A*y for some y. The
same holds when L is replaced by a translate of L, that is, when L is replaced by an affine
space.
Theorem 1.4.3. If f is convex and A is a linear operator, then z ∈ D_f solves the linearly
constrained optimization problem
and
Proof. Again considering the univariate functional φ(t) := f(x + t(y − x)), the fundamental
theorem of calculus implies
and
Using Proposition 1.4.2 to make the obvious substitutions, (1.6) yields (1.4), whereas (1.7)
yields (1.5).
Proposition 1.5.2 provides the means to bound the error in the first- and second-order
approximations of f.
and
Relying on continuity of g and H, observe that the error in the first-order approximation is o(||y − x||) (i.e., tends to zero faster than ||y − x||), whereas the error in the
second-order approximation is o(||y − x||²).
Theorem 1.5.1 gives a fundamental theorem of calculus for a functional f. It will be
necessary to have an analogous theorem for g, a theorem which expresses the difference
g(y) − g(x) as an integral involving the Hessian. To keep our development coordinate-free,
we introduce the following definition:
The univariate function t ↦ v(t) ∈ R^n, with domain [a, b], is said to be
integrable if there exists a vector u such that for all w,

    ⟨u, w⟩ = ∫_a^b ⟨v(t), w⟩ dt.

If it exists, the vector u is uniquely determined (as is not difficult to prove) and
is called the integral of the function v(t). One uses the notation ∫_a^b v(t) dt to
represent this vector.
Although the definition of the integral is phrased in terms of the inner product ⟨ , ⟩,
it is independent of the inner product. Indeed, if u is the integral as defined by ⟨ , ⟩ and if
S is pd, then for all vectors w,
Proposition 1.5.4. If the univariate function t ↦ v(t) ∈ R^n, with domain [a, b], is
integrable, then

    || ∫_a^b v(t) dt || ≤ ∫_a^b ||v(t)|| dt.

Proof. Let u := ∫_a^b v(t) dt. By definition of the integral, for all w we have
However,
where the second equality is by definition of ∫_a^b v(t) dt.
Next is the fundamental theorem of calculus for the gradient.
Comparing (1.11) with (1.10), we see that to prove (1.10) it suffices to show for arbitrary
0 < t < 1 that
where
Toward proving (1.12), recall that H(u) is the unique operator satisfying
Since, by Cauchy-Schwarz,
Since
we thus have
from which it is immediate that ⟨H(u)(y − x), w⟩ = Φ'(t). Thus, (1.12) is established and
the proof is complete.
Proposition 1.6.1. The gradient of q_x at y is g(x) + H(x)(y − x) and the Hessian is H(x)
(regardless of y).
Invoking the assumed continuity of the Hessian, the theorem is seen to imply that if
H(z) is invertible and x is sufficiently close to z, then x+ will be closer to z than is x.
Now we present a brief discussion of Newton's method and subspaces, as will be
important when we consider applications of ipm theory to optimization problems having
linear equations among the constraints. Assume L is a subspace of R^n and x ∈ L ∩ D_f.
Let n|_L(x) denote the Newton step for f|_L at x. Since the gradient of f|_L at x is P_L g(x)
and the Hessian is P_L H(x), the Newton step n|_L(x) is the vector in L solving

    min_{v ∈ L}  ⟨P_L g(x), v⟩ + ½ ⟨v, P_L H(x) v⟩;

that is, n|_L(x) is the vector in L for which H(x)n|_L(x) + g(x) is orthogonal to L. In
particular, if L is the nullspace of a linear operator A : R^n → R^m, then n|_L(x) is the vector
in R^n for which there exists y ∈ R^m satisfying

    H(x)n|_L(x) + g(x) = A*y   and   A n|_L(x) = 0.

Computing n|_L(x) (and y) can thus be accomplished by solving a system of m + n equations
in m + n variables.
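The (m + n) × (m + n) system just described is straightforward to assemble and solve. A minimal sketch assuming NumPy; the operator A, the point x, and the choice of f (here the logarithmic barrier for the nonnegative orthant) are made-up stand-ins:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 5, 2
A = rng.standard_normal((m, n))      # constraint operator; L is its nullspace
x = rng.random(n) + 0.5              # point with positive coordinates (of the affine set {y : Ay = Ax})

# Example functional: the logarithmic barrier f(x) = -sum_j ln x_j.
g = -1.0 / x                         # gradient of f at x
H = np.diag(1.0 / x**2)              # Hessian of f at x

# Solve  H n + g = A^T y,  A n = 0  for the step n and multiplier y.
K = np.block([[H, -A.T],
              [A, np.zeros((m, m))]])
rhs = np.concatenate([-g, np.zeros(m)])
sol = np.linalg.solve(K, rhs)
n_step, y = sol[:n], sol[n:]

print(np.allclose(A @ n_step, 0))              # True: the step stays in the nullspace of A
print(np.allclose(H @ n_step + g, A.T @ y))    # True: the residual is orthogonal to L
```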
If H(x)^{-1} is readily computed (as it is for functionals f used in ipm's), the size of the
system of linear equations to be solved can easily be reduced to m variables. One solves
the linear system in the variables y,

    A H(x)^{-1}A* y = A H(x)^{-1}g(x),

and then sets n|_L(x) = H(x)^{-1}(A*y − g(x)).
In closing this section we remark that the error bound given by Theorem 1.6.2 can of
course be applied to f|_L. Assuming z' minimizes f|_L (trivially, the minimizer z of f over
all of D_f satisfies z = z' iff z ∈ L), we obtain
Hence,
The Hessian on the right being H rather than H|_L makes for less cumbersome applications
of the inequality.
Chapter 2
Basic Interior-Point Method Theory
Throughout this chapter, unless otherwise stated, f refers to a functional having at least the
following properties: D_f is open and convex, f ∈ C^2, and H(x) is pd for all x ∈ D_f. In
particular, f is strictly convex.
These inner products vary continuously with x. In particular, given ε > 0, there exists a
neighborhood of x consisting of points y with the property that for all vectors v ≠ 0,
We often refer to the inner product ⟨ , ⟩_{H(x)} as the local inner product (at x).
In the inner product ⟨ , ⟩_{H(x)}, the gradient at y is H(x)^{-1}g(y) and the Hessian is
H(x)^{-1}H(y). In particular, the gradient at x is −n(x), the negative of the Newton step,
and the Hessian is I, the identity. Thus, in the local inner product, Newton's method
coincides with the "method of steepest descent," i.e., Newton's method coincides with the
algorithm which attempts to minimize f by moving in the direction given by the negative
of the gradient. (Whereas Newton's method is independent of inner products, the method
of steepest descent is not independent because gradients are not independent.)
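The coincidence of Newton's method with steepest descent in the local inner product can be seen concretely. A minimal sketch assuming NumPy; the test functional below (a convex quadratic plus the logarithmic barrier) is a made-up example:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
Q = (lambda M: M @ M.T + np.eye(n))(rng.standard_normal((n, n)))   # pd matrix for the quadratic term
x = rng.random(n) + 0.5

# Test functional f(x) = (1/2) x^T Q x - sum_j ln x_j  (its Hessian is pd everywhere).
g = Q @ x - 1.0 / x                  # gradient w.r.t. the dot product
H = Q + np.diag(1.0 / x**2)          # Hessian w.r.t. the dot product

# Gradient w.r.t. the local inner product <u, v>_x := u^T H(x) v is H(x)^{-1} g(x):
g_local = np.linalg.solve(H, g)
v = rng.standard_normal(n)
print(np.isclose(g_local @ H @ v, g @ v))      # True: defining property of the gradient

# Its negative is exactly the Newton step n(x) = -H(x)^{-1} g(x).
newton = -np.linalg.solve(H, g)
print(np.allclose(-g_local, newton))           # True: steepest descent in <,>_x is Newton's method
```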
It appears from our definition that the local inner product potentially depends on the
reference inner product ⟨ , ⟩. In fact, the local inner product is independent of the reference
inner product; indeed, if the reference inner product is changed to ⟨ , ⟩_S, and hence the
Hessian is changed to S^{-1}H(x), the resulting local inner product is

    (u, v) ↦ ⟨u, S^{-1}H(x)v⟩_S = ⟨u, H(x)v⟩,

which is the original local inner product.
The independence of the local inner products from the reference inner product shows
the local inner products to be intrinsic to the functional f. For that reason, we often refer
to the local inner products as the intrinsic inner products. To highlight the independence of
the local inner products from any reference inner product, we adopt notation which avoids
the Hessians of a reference. We denote the intrinsic inner product at x by ⟨ , ⟩_x. Let || ||_x
denote the induced norm. For y ∈ D_f, let g_x(y) denote the gradient at y and let H_x(y)
denote the Hessian. Thus, g_x(x) = −n(x) and H_x(x) = I. If A : R^n → R^m is a linear
operator, let A*_x denote its adjoint w.r.t. ⟨ , ⟩_x. (Of course the adjoint also depends on
the inner product on R^m. That inner product will always be fixed but arbitrary, unlike the
intrinsic inner products which vary with x and are not arbitrary, depending on f.)
The reader should be especially aware that we use g_x(x) and −n(x) interchangeably,
depending on context.
A minuscule amount of the ipm literature is written in terms of the local inner products.
Rather, in much of the literature, only a reference inner product is explicit, say, the dot
product. There, the proofs are done by manipulating operators built from Hessians, operators
like H(x)^{-1}H(y) and AH(x)^{-1}A^T, operators we recognize as being H_x(y) and AA*_x. An
advantage to working in the local inner products is that the underlying geometry becomes
evident and, consequently, the operator manipulations in the proofs become less mysterious.
Observe that the quadratic approximation of f at x is
where the latter norm is the operator norm induced by the local norm. Similarly, the progress
made by Newton's method toward approximating a minimizer z (Theorem 1.6.2) is captured
by the inequality
Consequently, the local inner product on L induced by f|_L is precisely the restriction of
⟨ , ⟩_x to L. Thus,
That is, in the local inner product, the Newton step for f|_L is the orthogonal projection of
the Newton step for f.
If L is the nullspace of a surjective linear operator A, the relation
provides the means to compute n|_L(x) from n(x). One solves the linear system
As the parentheses in our definition indicate, for brevity we typically refer to strongly
nondegenerate self-concordant functionals simply as "self-concordant functionals."
If a linear functional is added to a self-concordant functional (x ↦ ⟨c, x⟩ + f(x)), the
resulting functional is self-concordant because the Hessians are unaffected. Similarly, if one
restricts a self-concordant functional f to a subspace L (or to a translation of the subspace),
one obtains a self-concordant functional, a simple consequence of the local norms for f|_L
being the restrictions of the local norms for f.
The primordial self-concordant barrier functional is the "logarithmic barrier function
for the nonnegative orthant," having domain D_f := R^n_{++} (i.e., the strictly positive orthant).
It is defined by f(x) := −Σ_j ln x_j. Since the coordinates of vectors play a prominent role
in the definition of this functional (as they do in the definition of the nonnegative orthant),
to prove self-concordance it is natural to use the dot product as a reference inner product.
Expressing the Hessian H(x) as a matrix, one sees it is diagonal with jth diagonal entry
1/x_j². Consequently, y ∈ B_x(x, 1) is equivalent to
Since
where η > 0 is a fixed constant, f is the logarithmic barrier function for the nonnegative
orthant, and L := {x : Ax = b}. (Of course L is an affine space, a translation of a
subspace; L is a subspace iff b = 0. For a brief discussion of how our results for functionals
restricted to subspaces readily translate to functionals restricted to affine spaces, recall the
last paragraph of §1.2.)
Another important self-concordant functional is the "logarithmic barrier function for
the cone of psd matrices" in S^{n×n}. This is the functional defined by f(X) := −ln det(X),
having domain S^{n×n}_{++} (i.e., the pd matrices in S^{n×n}). To prove self-concordance, it is natural
to rely on the trace product, for which we know H(X)ΔX = X^{-1}(ΔX)X^{-1}. For arbitrary
Y ∈ S^{n×n}, keeping in mind that the trace of a matrix depends only on the eigenvalues, we
have

    ||Y − X||_X² = ||X^{-1/2}(Y − X)X^{-1/2}||² = Σ_i (λ_i − 1)²,

where λ_1 ≤ ... ≤ λ_n are the eigenvalues of X^{-1/2}YX^{-1/2}. Assuming ||Y − X||_X < 1, all
of the values λ_i are thus positive, and hence X^{-1/2}YX^{-1/2} is pd, which is easily seen to
be equivalent to Y being pd. Consequently, if ||Y − X||_X < 1, then Y ∈ D_f (= S^{n×n}_{++}), as
required by the definition of self-concordance.
Assuming Y ∈ B_X(X, 1) and V ∈ S^{n×n}, let
Note that ||S_1|| = ||V||_X and the eigenvalues of S_2 are 0 < 1/λ_n ≤ ... ≤ 1/λ_1. In
establishing the bounds for ||V||_Y/||V||_X in the definition of self-concordance, we rely on
the inequality
However, as the Frobenius norm of a matrix M satisfies ||M|| = (Σ_{i,j} m_{ij}²)^{1/2}, it is easily seen
that
where the last equality once again relies on the Frobenius norm of a symmetric matrix being
determined solely by the eigenvalues.
We have
Since
the rightmost inequality for ||V||_Y/||V||_X in the definition of self-concordance is proven.
The leftmost inequality is proven similarly. Thus, the logarithmic barrier function for the
cone of pd matrices is indeed self-concordant.
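A small numerical companion to this calculation, assuming NumPy; the matrices X and W are arbitrary test data. The local norm at X is the Frobenius norm of X^{-1/2} W X^{-1/2}, and whenever ||Y − X||_X < 1 the matrix Y remains pd, as just proven:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4
X = (lambda M: M @ M.T + np.eye(n))(rng.standard_normal((n, n)))    # pd
W = (lambda M: (M + M.T) / 2)(rng.standard_normal((n, n)))          # symmetric direction

def inv_sqrt(S):
    """S^{-1/2} for a symmetric pd matrix S, via its spectral decomposition."""
    lam, V = np.linalg.eigh(S)
    return V @ np.diag(lam ** -0.5) @ V.T

def local_norm(X, W):
    """||W||_X = Frobenius norm of X^{-1/2} W X^{-1/2}, the local norm of the -ln det barrier."""
    Xih = inv_sqrt(X)
    return np.linalg.norm(Xih @ W @ Xih, 'fro')

# Scale W so that Y := X + W satisfies ||Y - X||_X < 1; then Y must remain pd.
W = 0.9 * W / local_norm(X, W)
Y = X + W
print(local_norm(X, Y - X) < 1)               # True by construction
print(np.all(np.linalg.eigvalsh(Y) > 0))      # True: Y stays inside the cone
```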
For an SDP
where A : S^{n×n} → R^m is a linear operator, the most important self-concordant functionals
are those of the form
where η > 0 is a fixed constant, f is the logarithmic barrier function for the cone of pd
matrices, and L := {X : A(X) = b}.
LP can be viewed as a special case of SDP by identifying, in the obvious manner,
R^n with the subspace in S^{n×n} consisting of diagonal matrices. Then the logarithmic barrier
function for the pd cone restricts to be the logarithmic barrier function for the nonnegative
orthant. Thus, we were redundant in giving a proof that the logarithmic barrier function
for the nonnegative orthant is indeed a self-concordant functional, since the restriction of
self-concordant functionals to affine spaces yields self-concordant functionals. The insight
gained from the simplicity of the nonnegative orthant justifies the redundancy.
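The identification is easy to see in a few lines. A minimal sketch assuming NumPy, with a made-up positive vector x: under x ↦ Diag(x), the trace product restricts to the dot product and −ln det restricts to the logarithmic barrier for the nonnegative orthant.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.random(5) + 0.1
y = rng.random(5) + 0.1
X, Y = np.diag(x), np.diag(y)

# On diagonal matrices, the trace product reduces to the dot product ...
print(np.isclose(np.trace(X @ Y), x @ y))                           # True
# ... and -ln det reduces to the log barrier for the nonnegative orthant.
print(np.isclose(-np.log(np.linalg.det(X)), -np.sum(np.log(x))))    # True
```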
In §2.5 we show that the self-concordance of each of these two logarithmic barrier
functions is a simple consequence of the original definition of self-concordance due to
Nesterov and Nemirovskii [15]. (The original definition is not particularly well suited for a
transparent development of the theory, but it is well suited for establishing self-concordance.)
To apply the definition of self-concordance in developing the theory, it is useful to
rephrase it in terms of Hessians. (The operator norm appearing in the following theorem is
the one induced by the local norm || ||_x.)
Theorem 2.2.1. Assume the functional f has the property that B_x(x, 1) ⊆ D_f for all
x ∈ D_f. Then f is self-concordant iff for all x ∈ D_f and y ∈ B_x(x, 1),
and, similarly,
the pair of inequalities in the definition of self-concordance is equivalent to the pair (2.1).
To complete the proof, we show that the pair of inequalities (2.1) is equivalent to the pair
(2.2).
Since the eigenvalues of I − H_x(y) are 1 − λ_i, we have
where the inequality relies on the relations 0 < λ_1 ≤ λ_n. Hence, the first inequality in (2.2)
is implied by the pair of inequalities (2.1), and similarly for the second inequality in (2.2).
Finally, it is immediate that the first inequality (resp., second inequality) of (2.2) implies
the first inequality (resp., second inequality) of (2.1).
Recalling that H_x(x) = I, it is evident from (2.2) that to assume a functional to
be self-concordant is essentially to assume Lipschitz continuity of the Hessians w.r.t. the
operator norms induced by the local norms.
An aside for those familiar with third differentials: dividing the quantities on the left
and right of (2.2) by ||y − x||_x and taking the limits supremum as y tends to x suggests,
when f is thrice differentiable, that self-concordance implies the local norm of the third
differential to be bounded by "2." In fact, the converse is also true, that is, a bound of "2"
on the local norm of the third differential for all x ∈ D_f, together with the requirement that
the local unit balls be contained in the functional domain, implies self-concordance, as we
shall see in §2.5. Indeed, the original definition of self-concordance in [15] is phrased as a
bound on the third differential.
where n(x) := −H(x)^{-1}g(x) is the Newton step for f at x.
Theorem 2.2.3. Assume f ∈ SC and x ∈ D_f. If z minimizes f and z ∈ B_x(x, 1), then
satisfies
The use of the local norm || ||_x in Theorem 2.2.3 to measure the difference x_+ − z
makes for a particularly simple proof but does not result in a theorem immediately ready
for induction. At x_+, the containment z ∈ B_{x_+}(x_+, 1) is needed to apply the theorem, i.e.,
a bound on ||x_+ − z||_{x_+} rather than a bound on ||x_+ − z||_x. Given that the definition of
self-concordance restricts the norms to vary nicely, it is no surprise that the theorem can
easily be transformed into a statement ready for induction. For example, substituting into
the theorem the inequalities
and
as are immediate from the definition of self-concordance when x ∈ B_z(z, 1), we find as a
corollary to the theorem that if ||x − z||_z ≤ 1/4, then
and, inductively,
where x_1 = x_+, x_2, ... is the sequence generated by Newton's method. The bound (2.4)
makes apparent the rapid convergence of Newton's method.
Recall n(x) := −H(x)^{-1}g(x), the Newton step. The most elegant proofs of some
key results in the ipm theory are obtained by phrasing the analysis in terms of ||n(x)||_x rather
than in terms of ||z − x||_x or ||z − x||_z. In this regard, the following theorem is especially
useful.
useful.
we thus have
A rather unsatisfying, but unavoidable, aspect of general convergence results for Newton's method is an assumption of x being sufficiently close to a minimizer z, where "sufficiently close" depends explicitly on z. For general functionals, it is impossible to verify
that x is indeed sufficiently close to z without knowing z. For self-concordant functionals,
we know that the explicit dependence on z of what constitutes "sufficiently close" can take
a particularly simple form (e.g., we know x is sufficiently close to z if ||x − z||_z ≤ 1/4), albeit
a form which appears still to require knowing z. The next theorem provides means to verify
proximity to a minimizer without knowing the minimizer.
Theorem 2.2.5. Assume f ∈ SC. If ||n(x)||_x ≤ 1/4 for some x ∈ D_f, then f has a minimizer
z and
Thus,
Proof. We first prove a weaker result, namely, if ||n(x)||_x < 1/3, then f has a minimizer z
and ||x − z||_x ≤ 3||n(x)||_x.
Theorem 2.2.2 implies that for all y ∈ B_x(x, 1),
It follows that if ||n(x)||_x < 1/3 and ||y − x||_x = 3||n(x)||_x, then f(y) > f(x). However, it
is easily proven that whenever a continuous, convex functional f satisfies f(y) > f(x) for
all y on the boundary of a compact, convex set S and some x in the interior of S, then f has
a minimizer in S. Thus, if ||n(x)||_x < 1/3, f has a minimizer z and ||x − z||_x ≤ 3||n(x)||_x.
Now assume ||n(x)||_x ≤ 1/4. Theorem 2.2.4 implies
Applying the conclusion of the preceding paragraph to x_+ rather than to x, we find that f
has a minimizer z and ||z − x_+||_{x_+} ≤ 3||n(x_+)||_{x_+}. Thus,
Theorem 2.2.6. The set SC is closed under addition; that is, if f_1 and f_2 are self-concordant
functionals satisfying D_{f_1} ∩ D_{f_2} ≠ ∅, then f_1 + f_2 : D_{f_1} ∩ D_{f_2} → R is a self-concordant
functional.
that is,
Hence,
Consequently, if y ∈ D_{f_1} ∩ D_{f_2},
Proof. Denote the functional x ↦ f(Ax − b) by f'. Let || ||'_x denote the local norm derived
from f'. Assuming x ∈ D_{f'}, one easily verifies from the identity H'(x) = A*H(Ax − b)A
that H'(x) is pd and that ||v||'_x = ||Av||_{Ax−b} for all v. In particular,
and thus
establishing the upper bound on ||v||'_y/||v||'_x in the definition of self-concordance. One
establishes the lower bound similarly.
Applying Theorem 2.2.7 with the logarithmic barrier function for the nonnegative
orthant in R^m, one verifies self-concordance for the functional

    x ↦ −Σ_i ln(a_i · x − b_i),

whose domain consists of the points x satisfying the strict linear inequality constraints
a_i · x > b_i. This self-concordant functional is important for LPs with constraints written in
the form Ax > b. It, too, is referred to as a "logarithmic barrier function."
To provide the reader with another (logarithmic barrier) functional with which to apply
the above theorems, we mention that x ↦ −ln(1 − ||x||²) is a self-concordant functional
with domain the open unit ball. (Verification of self-concordance is made in §2.5.) Given
an ellipsoid {x : ||Ax|| < r}, it then follows from Theorem 2.2.7 that

    x ↦ −ln(1 − ||Ax/r||²)

is a self-concordant functional whose domain is the ellipsoid, yet another logarithmic barrier
function. For an intersection of ellipsoids, one simply adds the functionals for the individual
ellipsoids, as justified by Theorem 2.2.6.
Although the values of finite-valued convex functionals with bounded domains (i.e.,
bounded w.r.t. any reference norm) are always bounded from below, such a functional need
not have a minimizer even if it is continuous. This is not the case for self-concordant
functionals.
Theorem 2.2.8. If f ∈ SC and the values of f are bounded from below, then f has a
minimizer. (In particular, if D_f is bounded, then f has a minimizer.)
that is,
Theorem 2.2.9. Assume f ∈ SC and x ∈ ∂D_f, the boundary of D_f. If the sequence
{x_i} ⊂ D_f converges to x, then lim_i f(x_i) = ∞ and ||g(x_i)|| → ∞.
Proof. Assume x_i → x ∈ ∂D_f. We claim that if f(x_i) → ∞, then ||g(x_i)|| → ∞. Indeed,
fix y ∈ D_f. By convexity,
The claim easily follows. Thus it only remains to prove f(x_i) → ∞.
Adding f to the functional x ↦ −ln(R − ||x||²), where R > ||x||², ||x_i||² (for all
i), one obtains a self-concordant functional whose domain is bounded and whose values at
x_i tend to ∞ iff lim_i f(x_i) = ∞. Consequently, we may assume D_f is bounded.
Assuming D_f is bounded, we will construct from {x_i} a sequence {y_i} ⊂ D_f which
has a unique limit point, the limit point lying in ∂D_f, and which satisfies
Applying the same construction to the sequence {y_i}, and so on, we will thus conclude
that if lim_i f(x_i) ≠ ∞, then f assumes arbitrarily small values, contradicting the lower
boundedness of finite-valued convex functionals having bounded domains.
Shortly, we prove liminf_i ||n(x_i)||_{x_i} ≥ 1/4. In particular, for sufficiently large i, it
follows that y_i is well-defined and, from Theorem 2.2.2,
Moreover, all limit points of {y_i} lie in ∂D_f; indeed, otherwise, passing to a subsequence of
{y_i} if necessary, there exists ε > 0 such that B(y_i, ε) ⊂ D_f for all i, where the ball is w.r.t. a
reference norm. However, since y_i and x_i − (y_i − x_i) lie in D_f (because ||y_i − x_i||_{x_i} < 1),
it then follows from convexity of D_f that B(x_i, ε/2) ⊂ D_f, contradicting x_i → x ∈ ∂D_f.
Consequently, all limit points of {y_i} do indeed lie in ∂D_f. Restricting to a subsequence if
necessary, we may assume {y_i} has a unique limit point, the limit point lying in ∂D_f.
Finally, we show liminf_i ||n(x_i)||_{x_i} ≥ 1/4. Since D_f is bounded, Theorem 2.2.8
shows that f has a minimizer z. Since B_z(z, 1) ⊂ D_f and x_i → x ∈ ∂D_f, we have
liminf_i ||x_i − z||_z ≥ 1. Hence, from the definition of self-concordance, liminf_i ||x_i − z||_{x_i}
is bounded away from 0. Because f has at most one minimizer, Theorem 2.2.5 then implies
liminf_i ||n(x_i)||_{x_i} ≥ 1/4, concluding the proof.
We close this section with a technical proposition to be called upon later.
If g(x) + v is a vector sufficiently near g(x), there exists x + u close to x such that
g(x + u) = g(x) + v, a consequence of H(x) being pd and hence invertible. It is useful to
quantify "near" and "close" when the inner product is the local inner product, that is, when
H_x(x) = I and hence u ≈ v.
Proposition 2.2.10. Assume f ∈ SC and x ∈ D_f. If ||v||_x ≤ r where r < 1/4, there exists
u such that
a functional whose local inner products agree with those of f. Note that a point z' minimizes
the functional iff g_x(z') = g_x(x) + v. Under the assumption of the proposition, we thus
wish to show z' exists and u := z' − x satisfies
Since at x the Newton step for the functional (2.5) is u, the assumption
allows us to apply Theorem 2.2.5, concluding that a minimizer z' does indeed exist and
Let SCB denote the family of functionals thus defined. We typically refer to elements of
SCB as "barrier functionals."
The definition of barrier functionals is phrased in terms of ||g_x(x)||_x rather than in
terms of the identical quantity ||n(x)||_x because the importance of barrier functionals for
ipm's lies not in applying Newton's method to them directly but rather in applying Newton's
method to self-concordant functionals built from them. As mentioned before, for an LP
where η > 0 is a fixed constant, f is the logarithmic barrier function for the nonnegative
orthant, and L := {x : Ax = b}.
When Nesterov and Nemirovskii [15] defined barrier functionals, they referred to θ_f
as "the parameter of the barrier f." Unfortunately, this can be confused with the phrase
"barrier parameter" which predates [15] and refers to the constant η in (2.6). Consequently,
we prefer to call θ_f the complexity value of f, especially because it is the quantity that most
often represents f in the complexity analysis of ipm's relying on f.
If one restricts a barrier functional f to a subspace L (or a translation of a subspace),
one obtains a barrier functional simply because the local norms for f|_L are the restrictions
of the local norms for f and
Thus, θ_f = n.
Now let f denote the logarithmic barrier function for the cone of pd matrices in S^{n×n},
that is, f(X) := −ln det(X). Relying on the trace product, we have, for all X ∈ S^{n×n}_{++},
and hence
Consequently,
It readily follows that θ_f = 1, showing the complexity value need not depend on the
dimension n.
Nesterov and Nemirovskii [15] proved a most impressive and theoretically important
result, namely, that each open, convex set containing no lines is the domain of a (strongly
nondegenerate self-concordant) barrier functional. In fact, they proved the existence of a
universal constant C with the property that for each n, if the convex set lies in R^n, there
exists a barrier functional whose domain is the set and whose complexity value is bounded
by Cn. (I have yet to find a relatively transparent proof of this result, and hence a proof
is not contained in this book.) Unfortunately, the result is only of theoretical interest. To
rely on self-concordant functionals in devising ipm's, one must be able to readily compute
their gradients and Hessians. For the self-concordant functionals proven to exist, from a
computational viewpoint one cannot say much more than that the gradients and Hessians
exist. By contrast, the importance of the various logarithmic barrier functions we have
described lies largely in the ease with which their gradients and Hessians can be computed.
In §2.2 we noted that if a linear functional is added to a self-concordant functional, the
resulting functional is self-concordant because the Hessians are unchanged; the definition
of self-concordance depends on the Hessians alone. By contrast, adding a linear functional
to a barrier functional need not result in a barrier functional. For example, consider the
univariate barrier functional x ↦ −ln x and the functional x ↦ x − ln x.
The set SCB, like SC, is closed under addition.
Proof. Assume x ∈ D_f. Let the reference inner product ⟨ , ⟩ be the local inner product
at x defined by f. Thus, I = H(x) = H_1(x) + H_2(x). In particular, H_1(x) and H_2(x)
commute, i.e., H_1(x)H_2(x) = H_2(x)H_1(x). Consequently, so do H_1(x)^{1/2} and H_2(x)^{1/2}.
For brevity, let H_i := H_i(x) and g_i := g_i(x) for i = 1, 2.
To prove the inequality in the statement of the theorem, it suffices to show
since, by definition, the quantity on the right is bounded from above by θ_{f_1} + θ_{f_2}.
Defining y_i := H_i^{-1/2} g_i for i = 1, 2, we have
(The last inequality is due to the operator being an orthogonal projection operator; the
operator has norm equal to one.)
With regard to theory, the following theorem is perhaps the most useful tool in
establishing properties possessed by all barrier functionals. The inner product is arbitrary.
Proof. We wish to prove φ'(0) ≤ θ_f, where φ is the univariate functional defined by
φ(t) := f(x + t(y − x)). In doing so, we may assume φ'(0) > 0 and hence, by convexity
of φ, φ'(t) > 0 for all t ≥ 0 in the domain of φ.
Thus,
that is,
Consequently, the domain of the convex functional φ is contained in the open interval
(−∞, θ_f/φ'(0)). In particular, θ_f/φ'(0) is not in the domain. Since t = 1 is in the domain
of φ we thus have 1 < θ_f/φ'(0).
Proof. Restricting f to the line through x and y, we may assume f is univariate. Viewing the
line as R with values increasing as one travels from x to y, the assumption ⟨g(x), y − x⟩ ≥ 0
is then equivalent to g(x) ≥ 0, i.e., g(x) is a nonnegative number.
Let v denote the smallest nonnegative number for which ||g_x(x) + v||_x ≥ 1/4. Since
g_x(x) ≥ 0, we have ||v||_x ≤ 1/4. Applying Proposition 2.2.10, we find there exists u satisfying
where the last inequality makes use of g_x(x) + v and y − x both being nonnegative.
However, since ||g_x(x) + v||_x > 1/4 only if v = 0 (and hence only if u = 0), we
Thus,
Corollary 2.3.5. Assume f ∈ SCB. If z is the analytic center for f, then
Proof. Since f ∈ SC, the leftmost containment is by assumption. The rightmost containment is immediate from Theorem 2.3.4 since g(z) = 0.
Corollary 2.3.5 suggests that if one were to choose a single inner product as being
especially natural for a barrier functional with bounded domain, the local inner product at
the analytic center would be an appropriate choice because the resulting balls conform to
the shape of the domain.
When do analytic centers exist? The answer is given by the following corollary.
Proof. The proof is immediate from Theorem 2.2.8 and Corollary 2.3.5.
an analytic center, it is readily proven (without making use of a particular inner product)
that ⟨g_{j,e}(e), e_j⟩_e ≤ 0. Since g_{j,e} and e_j are collinear (because D_{f_j} is one-dimensional), it
follows that
Hence,
Proof. For brevity, let s := sym(x, D_f). Assuming x, y ∈ D_f, note that
s)(x − y) ∈ D̄_f (the closure of D_f) since w, x, y are collinear and
Since B_y(y, 1) ⊂ D_f and D_f is convex, we thus have
Theorem 2.3.8. Assume f ∈ SCB and x ∈ D_f. If y ∈ D_f, then for all 0 < t < 1,
Proof. For s ≥ 0 let x(s) := y + e^{−s}(x − y) and consider the univariate functional
φ(s) := f(x(s)). Relying on the chain rule, observe
x_1, x_2 ∈ K and t_1, t_2 > 0, then t_1 x_1 + t_2 x_2 ∈ K. A barrier functional f : K° → R is said
to be logarithmically homogeneous if for all x ∈ K° and t > 0,
It is easily established that the logarithmic barrier functions for the nonnegative orthant
and the cone of psd matrices are logarithmically homogeneous, as are barrier functionals
of the form x ↦ f(Ax) where f is logarithmically homogeneous. Another important
example of a logarithmically homogeneous barrier functional is
the domain of this functional being the interior of the second-order cone
It has complexity value θ_f = 2. As with the standard barrier functionals for the nonnegative
orthant and the cone of psd matrices, it is referred to as a logarithmic barrier function.
The following theorem provides a characterization of logarithmic homogeneity as
well as other properties useful in analysis.
Hence, the three quantities in (2.9) are independent of x. In particular, the middle quantity
is bounded above as a functional of x, and thus f is not only self-concordant, it is also a
barrier functional. Moreover, the independence from x of the three quantities implies each
of them is equal to θ_f.
Finally,
where D̄_f denotes the closure of D_f. Among many other problems, linear programs are
of this form. Specifically, restricting the logarithmic barrier function for the nonnegative
orthant to the affine space {x : Ax = b}, we obtain a barrier functional f for which
for η > 0. It is readily proven when D_f is bounded that the central path begins at the analytic
center z of f and consists of the minimizers of the barrier functionals f|_{L(v)} obtained by
restricting f to the affine spaces
for v satisfying val < v < ⟨c, z⟩. Similarly, when D_f is unbounded, the central path
consists of the minimizers of the barrier functionals f|_{L(v)} for v satisfying val < v.
In the literature, it is standard to define the central path, and to do ipm analysis, with
the functionals
where μ > 0, rather than with the functionals f_η. The difference is only cosmetic. The
minimizer z(η) of f_η is the minimizer of the functional (2.11) for μ = 1/η. Similarly,
Newton's method applied to minimizing f_η produces exactly the same sequence of points
as Newton's method applied to minimizing the functional (2.11) for μ = 1/η. Whether one
relies on the functionals f_η or the functionals (2.11), one arrives at exactly the same ipm
results, the only difference being insignificant changes in minor algebraic steps along the
way. The reason we use the functionals f_η rather than the customary functionals (2.11) is
that the local inner products for the functionals f_η are identical to those for f, whereas the
local inner products for the functionals (2.11) depend on μ. Since a goal of this book is to
elucidate the geometry underlying ipm's, it is more natural for us to rely on the functionals
f_η.
We observe that for each y ∈ D_f, the optimization problem (2.10) is equivalent to
where c_y := H(y)^{-1}c. (In other words, the objective vector is c_y w.r.t. ⟨ , ⟩_y.)
The desirability of following the central path is made evident by considering the
objective values ⟨c, z(η)⟩. Since g(z(η)) = −ηc, Theorem 2.3.3 implies for all y ∈ D_f,
and hence
Moreover, the point z(η) is well centered in the sense that all feasible points y with objective
value at most ⟨c, z(η)⟩ satisfy y ∈ B_{z(η)}(z(η), 4θ_f + 1), a consequence of Theorem 2.3.4
and g(z(η)) = −ηc.
Path-following ipm's follow the central path approximately, generating points near
the central path where "near" is measured by local norms. If a point y is computed for which
||y − z(η)||_{z(η)} is small, then, relatively, the objective value at y will not be much worse than
at z(η), and hence (2.12) implies a bound on ⟨c, y⟩. In fact, if x is an arbitrary point in D_f
and y is a point for which ||y − x||_x is small, then, relatively, the objective value at y will
not be much worse than at x. To make this precise, first observe B_x(x, 1) ⊂ D_f implies
x − tc_x ∈ D_f if 0 ≤ t < 1/||c_x||_x. Since the objective value at x − tc_x is
we thus have
Before discussing algorithms, we record a piece of notation: Let n_η(x) denote the
Newton step for f_η at x, that is,

    n_η(x) := −H(x)^{-1}(ηc + g(x)).
Continuing this procedure indefinitely (i.e., increasing η, applying Newton's method, increasing η, etc.), we have the barrier method.
One would like η_2 to be much larger than η_1. However, if η_2 is "too" large relative
to η_1, Newton's method can fail to approximate z(η_2); in fact, it can happen that x_2 ∉ D_f,
bringing the algorithm to a halt. The main goal in analyzing the barrier method is to prove
that η_2 can be larger than η_1 by a reasonable amount without the algorithm losing sight of
the central path.
In analyzing the barrier method, it is most natural to rely on the length of Newton
steps to measure proximity to the central path. We will assume x_1 is near z(η_1) in the sense
that ||n_{η_1}(x_1)||_{x_1} is small. Keep in mind that the Newton step taken by the algorithm is
n_{η_2}(x_1), not n_{η_1}(x_1). The relevance of n_{η_1}(x_1) for n_{η_2}(x_1) is due to the following easily
proven relation:
In particular,
By requiring ||n_{η_1}(x_1)||_{x_1} ≤ α and 1 ≤ η_2/η_1 ≤ β, we then find from (2.15) that
Consequently, x_2 will be close to the central path like x_1. Continuing, by requiring 1 ≤
η_3/η_2 ≤ β, x_3 will be close to the central path, too, and so on. Hence, we will have determined
a value β such that if one has an initial point appropriately close to the central path, and if
one never increases the barrier parameter from η to more than βη, the barrier method will
follow the central path, always generating points close to it.
The reader can verify, for example, that

    α := 1/9   and   β := 1 + 1/(8√θ_f)

satisfy the relations. Now we have a "safe" value for β. Relying on it, the algorithm is
guaranteed to stay on track. It is a remarkable aspect of ipm's that safe values for quantities
like β depend only on the complexity value θ_f of the underlying barrier functional f.
Concerning LPs,
if one relies on the logarithmic barrier function for the strictly nonnegative orthant R^n_{++},
then β = 1 + 1/(8√n) is safe regardless of A, b, and c.
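To make the short-step barrier method concrete, here is a minimal sketch for an LP in the form min ⟨c, x⟩ subject to Ax = b, x ≥ 0, assuming NumPy. The problem data, the strictly feasible starting point, the initial value of η, and the iteration count are made-up placeholders, and the sketch omits the proximity checks and stopping criteria discussed in the text.

```python
import numpy as np

def newton_step(x, eta, A, c):
    """Newton step for f_eta(x) = eta*<c,x> - sum_j ln x_j restricted to {y : Ay = b}."""
    g = eta * c - 1.0 / x                       # gradient of f_eta at x
    H = np.diag(1.0 / x**2)                     # Hessian of f_eta at x
    m = A.shape[0]
    K = np.block([[H, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-g, np.zeros(m)])
    return np.linalg.solve(K, rhs)[:x.size]     # component tangent to {y : Ay = b}

def short_step_barrier(A, b, c, x, eta, num_iters):
    """Repeatedly increase eta by the safe factor 1 + 1/(8*sqrt(n)) and take one Newton step.
    Feasibility Ax = b is preserved automatically since each step lies in the nullspace of A."""
    beta = 1.0 + 1.0 / (8.0 * np.sqrt(x.size))
    for _ in range(num_iters):
        eta *= beta
        x = x + newton_step(x, eta, A, c)
    return x

# Hypothetical tiny LP: min c.x  s.t.  x1 + x2 + x3 = 3, x >= 0; x0 is strictly feasible.
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([3.0])
c = np.array([1.0, 2.0, 3.0])
x0 = np.array([1.0, 1.0, 1.0])      # analytic center of the feasible set, so a small eta starts near the path
print(short_step_barrier(A, b, c, x0, eta=0.01, num_iters=200))
```

On this made-up example the iterates remain strictly feasible and drift toward the optimal vertex (3, 0, 0) as η grows; the safe values α and β from the analysis are what justify taking a single undamped Newton step per increase of η.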
Assuming at each iteration of the barrier method the parameter η is never increased
by a factor of more than 1 + 1/(8√θ_f), we now know that for each x generated by the algorithm, there
corresponds z(η) which x approximates in that ||n_η(x)||_x ≤ 1/4; hence, by Theorem 2.2.5
and the definition of self-concordance, ||x − z(η)||_{z(η)} is bounded by a fixed constant less than 1.
All points generated by the algorithm lie within that distance of the central path.
Assuming that at each iteration of the barrier method, the parameter η is increased by
exactly the factor 1 + 1/(8√θ_f), the number of iterations required to increase the parameter
from an initial value η_1 to some value η > η_1 is
where in the inequality we rely on θ_f ≥ 1 (as discussed in §2.3.3). Hence, from (2.14),
given ε > 0,
The point x' is on the central path for this optimization problem. In fact, x' = z'(v) for
v = 1.
Let n'_v(x) denote the Newton step for f'_v at x.
Rather than increasing the parameter v, we decrease it toward zero, following the
central path to the analytic center z of f. From there, we switch to following the central
path {z(η) : η > 0} as before.
We showed η can safely be increased by a factor of 1 + 1/(8√θ_f). Brief consideration
of the analysis shows it is also safe to decrease η by a factor 1 − 1/(8√θ_f) and, hence, safe to
decrease v by that factor. Thus, to complete our understanding of the difficulty of following
the path {z'(v) : v > 0}, and then the path {z(η) : η > 0}, it remains only to understand the
process of switching paths.
One way to know when it is safe to switch paths is to compute the length of the
gradients for f at the points x generated in following the path {z'(v) : v > 0}. Once one
encounters a point x for which, say, ||g_x(x)||_x ≤ 1/6, one can safely switch paths. For then,
by choosing η_1 = 1/(12||c_x||_x), we find the Newton step for f_{η_1} at x satisfies
and hence, by Theorem 2.2.4, the Newton step takes us from x to a point x_1 for which
||n_{η_1}(x_1)||_{x_1} ≤ 1/9, putting us precisely in the setting of the earlier analysis (where α = 1/9 was
determined safe).
How much will v have to be decreased from the initial value v = 1 before we compute
a point x for which ||g_x(x)||_x ≤ 1/6 so that paths can be switched? An answer is found in
the relations
the last inequality by Proposition 2.3.7. In particular, with ||n'_v(x)||_x ≤ 1/9, v need only
satisfy
in order for
The requirement on v specified by (2.19) gives geometric interpretation to the efficiency of the algorithm in following the path {z'(v) : v > 0}, beginning with the initial
value v = 1. If the domain D_f is nearly symmetric about the initial point x', not much time
will be required to follow the path to a point where we can switch to following the path
{z(η) : η > 0}.
We stipulated that the algorithm switch paths when it encounters x satisfying ||g_x(x)||_x
≤ 1/6, and we stipulated that one choose the initial value η_1 := 1/(12||c_x||_x). Letting
and hence
Theorem 2.4.1. Assume f ∈ SCB and D_f is bounded. Assume x' ∈ D_f, a point at which
to initiate the barrier method. If 0 < ε < 1, then within
Consider the following modification to the algorithm. Choose V > ⟨c, x'⟩. Rather
than relying on f, rely on the barrier functional

    x ↦ f(x) − ln(V − ⟨c, x⟩),

whose domain is the set (2.20) and whose complexity value does not exceed θ_f + 1. In the theorem, the quantity
sup_{x ∈ D̄_f} ⟨c, x⟩ − val is then replaced by the potentially much smaller quantity V − val. Of course the quantity
sym(x', D_f) must then be replaced by the symmetry of the set (2.20) about x'.
Finally, we highlight an implicit assumption underlying our analysis, namely, the
complexity value θ_f is known. The value is used to safely increase the parameter η. What
is actually required is an upper bound ϑ ≥ θ_f. If one relies on an upper bound ϑ rather
than the precise complexity value θ_f, then θ_f in the theorem must be replaced by ϑ.
Except for θ_f, none of the quantities appearing in the theorem are assumed to be
known or approximated. The quantities appear naturally in the analysis of the algorithm,
but the algorithm itself does not rely on the quantities.
No ipm's have proven complexity bounds which are better than (2.18) even in the
restricted setting of LP. In the case of LP where θ_f = n, a bound like (2.18) was first
established in [18] for an algorithm other than the barrier method; it was established by
Gonzaga [6] for the barrier method. By contrast, in the complexity analysis that started the
waves of ipm papers, Karmarkar [10] proved for LP a bound like (2.18) in which the factor
√n (= √θ_f) is replaced by n.
Although no ipm's have proven complexity bounds which are better than (2.18), the
barrier method is not considered to be practically efficient relative to some other ipm's,
especially relative to primal-dual methods (discussed in sections 3.7 and 3.8). The barrier
method is an excellent algorithm with which to begin one's understanding of ipm's, and it is
often the perfect choice for concise complexity theory proofs, but it is not one of the ipm's
that appear in widely used software.
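To fix ideas, here is a minimal sketch (mine, not the book's) of one short-step iteration of the barrier method, written for the simplest case of minimizing ⟨c, x⟩ over the nonnegative orthant with the logarithmic barrier f(x) = −∑_j ln x_j; the function and variable names are illustrative assumptions.

    import numpy as np

    # Illustrative sketch: one short-step barrier-method iteration for
    # min <c, x> over x >= 0, using f(x) = -sum_j ln x_j (so vartheta_f = n).
    def barrier_step(c, x, eta):
        n = len(x)
        g = -1.0 / x                                      # gradient of f at x
        H_inv = np.diag(x ** 2)                           # H(x)^{-1} for the log barrier
        eta_new = eta * (1.0 + 1.0 / (8.0 * np.sqrt(n)))  # safe increase of eta
        newton = -H_inv @ (eta_new * c + g)               # Newton step for f_eta at x
        return x + newton, eta_new

Iterating barrier_step, with η growing by the factor 1 + 1/(8√n) per step, is what produces the √ϑ_f dependence in bounds like (2.18).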
and then let η₂ := Kη₁. At each point y_k, the algorithm will determine whether the point is close to
z(η₂) by, say, checking whether ||n_{η₂}(y_k)||_{y_k} ≤ 1/4. (We choose the specific value 1/4 because
it is the largest value for which Theorem 2.2.5 applies.) The point y_K will be the first point
that is determined to satisfy this inequality.
To compute y_{k+1} from y_k, the algorithm minimizes the univariate functional

    t ↦ f_{η₂}(y_k + t n_{η₂}(y_k)).

This is the place in the algorithm to which the phrase "exact line search" alludes. "Line"
refers to the functional being univariate. "Exact" refers to an assumption that the exact
minimizer is computed, certainly an exaggeration, but an assumption useful for keeping the
analysis succinct. Letting t_{k+1} denote the exact minimizer, define

    y_{k+1} := y_k + t_{k+1} n_{η₂}(y_k).

Then we show that f_{η₂}(y_k) − f_{η₂}(y_{k+1}) is bounded below by a positive amount τ independent
of k; that is, each exact line search decreases the value of f_{η₂} by at least a certain amount.
Consequently, K ≤ ρ/τ. Proofs like this—showing a certain functional decreases by at
least a fixed amount with each iteration—are common in the ipm literature.
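One such exact line search might be approximated numerically as in the following sketch (illustrative assumptions: the LP log barrier, the Newton direction as the search direction, and a bounded scalar minimization standing in for a truly exact minimizer).

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Illustrative sketch: one line search of the long-step barrier method for
    # f_eta(y) = eta*<c, y> - sum_j ln y_j on the positive orthant.
    def line_search_step(c, y, eta):
        grad = eta * c - 1.0 / y                   # gradient of f_eta at y
        d = -(y ** 2) * grad                       # Newton direction n_eta(y)
        neg = d < 0
        t_max = np.min(-y[neg] / d[neg]) if neg.any() else 10.0   # stay inside y > 0
        phi = lambda t: eta * c @ (y + t * d) - np.sum(np.log(y + t * d))
        t_star = minimize_scalar(phi, bounds=(0.0, 0.99 * min(t_max, 10.0)),
                                 method="bounded").x
        return y + t_star * d

Repeating this until ||n_{η₂}(y_k)||_{y_k} drops below 1/4 completes one outer iteration of the long-step method.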
In proving an upper bound on the difference ρ, we make use of the fact that for any
convex functional f and x, y ∈ D_f, one has

    f(y) ≥ f(x) + ⟨g(x), y − x⟩.    (2.22)
Assuming x₁ is close to z(η₁) in the sense that ||n_{η₁}(x₁)||_{x₁} ≤ 1/4, we see that Theo-
rem 2.2.5 implies ||x₁ − z(η₁)||_{x₁} ≤ 3/4. Thus, applying (2.22) to the functional f_{η₂},
the final equality because z(η₁) minimizes f_{η₁} and hence n_{η₁}(z(η₁)) = 0. Thus,
Finally,
It follows that if one fixes a constant K > 1 and always chooses successive values
η_i, η_{i+1} to satisfy η_{i+1} = Kη_i, the number of points generated by the long-step barrier
method (i.e., the number of exact line searches) in increasing the parameter from an initial
value η₁ to some value η > η₁ is
In the case of linear programming, where ϑ_f = n, such a bound was first established by
Gonzaga [7] (see also den Hertog, Roos, and Vial [3]).
Fixing K (say, K = 100), we obtain the bound
This bound is worse than the analogous bound (2.17) for the short-step method by a factor
√ϑ_f. It is one of the ironies of the ipm literature that algorithms which are more efficient
in practice often have somewhat worse complexity bounds.
Thus, the corrector step is n|_{L(v)}(x), this being the orthogonal projection of the Newton step
n(x) for f onto the subspace L(0) = {y : ⟨c, y⟩ = 0}, orthogonal w.r.t. ⟨ , ⟩_x. In the
literature, the corrector step is often referred to as the "centering direction." It aims to move
from x toward the point on the central path having the same objective value as x.
Since the multiples of c_x (= H(x)⁻¹c) form the orthogonal complement of L(0), the
difference n(x) − n|_{L(v)}(x) is a multiple of c_x, and hence so is the predictor step
The vector −c_x predicts the tangential direction of the central path near x. If x is on
the central path, the vector −c_x is exactly tangential to the path, pointing in the direction
of decreasing objective values. (Indeed, observe that by differentiating both sides of the
identity ηc + g(z(η)) = 0 w.r.t. η, we have c + H(z(η))z'(η) = 0, i.e., z'(η) = −c_{z(η)}.)
In the literature, −c_x is often referred to as the "affine-scaling direction." With regards to
⟨ , ⟩_x, it is the direction in which one would move to decrease the objective value most
quickly.
Whereas the barrier method combines a predictor step and a corrector step in one
step, predictor-corrector methods separate the two types of steps. After a predictor step,
several corrector steps might be applied. In practice, predictor-corrector methods tend to be
substantially more efficient than the barrier method, but the (worst-case) complexity bounds
that have been proven for them are worse.
Perhaps the most natural predictor-corrector method is based on moving in the pre-
dictor direction a fixed fraction of the distance toward the boundary and then recentering
via exact line searches. We now formalize and analyze such an algorithm.
Fix α satisfying 0 < α < 1. Assume x₁ is near the central path. Let v₁ := ⟨c, x₁⟩.
The algorithm first computes
Beginning with y₁, the algorithm takes corrector steps, moving toward the point z₂ on
the central path with objective value v₂ by using the Newton steps for the functional f|_{L(v₂)}
as directions in performing exact line searches. Precisely, given y_k, the algorithm computes
the minimizer t_{k+1} for the univariate functional
Let
When the first point y_K is encountered for which ||n|_{L(v₂)}(y_K)||_{y_K} is appropriately small, the
algorithm lets x₂ := y_K and takes a predictor step from x₂, relying on the same value α as
in the predictor step from x₁. The predictor step is followed by corrector steps, and so on.
Typically, α is chosen near 1, say, α = .99. In the following analysis, we assume
α ≥ 1/2.
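For orientation, here is a schematic of the predictor-corrector loop just described (illustrative assumptions: the LP log barrier on the positive orthant, no equality constraints, a few damped corrector steps in place of exact line searches, and the corrector's stopping test replaced by a fixed iteration count; the names are mine, not the book's).

    import numpy as np

    # Schematic predictor-corrector step for min <c, x> over x >= 0 with the
    # logarithmic barrier f(x) = -sum_j ln x_j.
    def predictor_corrector(c, x, alpha=0.99, n_corrector=3):
        H_inv = lambda z, v: (z ** 2) * v       # H(z)^{-1} v for the log barrier

        # Predictor: move the fraction alpha of the distance to the boundary
        # along the affine-scaling direction -c_x = -H(x)^{-1} c.
        c_x = H_inv(x, c)
        dec = c_x > 0                           # coordinates that decrease
        t_bdry = np.min(x[dec] / c_x[dec]) if dec.any() else 1.0
        y = x - alpha * t_bdry * c_x
        v2 = c @ y                              # objective value defining L(v2)

        # Corrector: Newton step for f at y, projected onto {d : <c, d> = 0}
        # orthogonally w.r.t. < , >_y, i.e. the Newton step for f restricted to L(v2).
        for _ in range(n_corrector):
            n_full = -H_inv(y, -1.0 / y)        # Newton step for f at y
            c_y = H_inv(y, c)
            d = n_full - c_y * (c @ n_full) / (c @ c_y)
            negd = d < 0
            t_cap = np.min(-y[negd] / d[negd]) if negd.any() else np.inf
            y = y + min(0.5, 0.9 * t_cap) * d   # damped step standing in for the line search
        return y, v2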
In analyzing the predictor-corrector method, we determine an upper bound on the
number K of exact line searches made in moving from x₁ to x₂, and we determine a lower
bound on the progress made in decreasing the objective value by moving from x₁ to x₂.
Unlike the previous algorithms, the predictor-corrector method does not rely on the
parameter η. However, it is useful to rely on η in analyzing the method so as to make use of
the previous analysis. In this regard, let η₁ be the (positive) value satisfying ⟨c, z(η₁)⟩ = v₁,
i.e., z(η₁) is the point on the central path whose objective value is the same as the objective
value for x₁.
For the analysis of the predictor-corrector method, we consider ||n|_{L(v)}(x)||_x ≤ 1/33 to
be the criterion for claiming x to be close to the point z on the central path satisfying ⟨c, z⟩ =
v (= ⟨c, x⟩). The specific value 1/33 is chosen so that we are in position to rely on the barrier
method analysis. Specifically, we claim that ||n|_{L(v₁)}(x₁)||_{x₁} ≤ 1/33 implies ||n_{η₁}(x₁)||_{x₁} ≤ 1/4,
precisely the criterion we assumed x₁ to satisfy for the barrier method. To justify the claim,
let z₁ denote the minimizer of f|_{L(v₁)}; thus, z₁ = z(η₁). If ||n|_{L(v₁)}(x₁)||_{x₁} ≤ 1/33, then
Theorem 2.2.5 applied to f|_{L(v₁)} implies
In taking the step, the barrier method decreases the objective function value from ⟨c, x₁⟩ to
⟨c, x₁ + n_{η₂}(x₁)⟩; hence, the decrease is at most
method in decreasing the objective value is at least as great as the progress made by the
barrier method in taking one step from x₁. (Clearly, the point x₂ computed by the predictor-
corrector method generally differs from the point x₁ + n_{η₂}(x₁) computed by the barrier method.) Of
course in moving from x₁ to x₂, the predictor-corrector method might require several exact
line searches. We now bound the number K of exact line searches.
Analogous to our analysis for the long-step barrier method, we obtain an upper bound
on K by dividing an upper bound on f(y₁) − f(z₂) by a lower bound on the differences
f(y_k) − f(y_{k+1}).
The lower bound on the differences f(y_k) − f(y_{k+1}) is proven exactly as was the lower
bound for the differences f_{η₂}(y_k) − f_{η₂}(y_{k+1}) in our analysis of the long-step barrier method.
Assuming ||n|_{L(v₂)}(y_k)||_{y_k} > 1/33 (as is the case if the algorithm proceeds to compute y_{k+1}),
one relies on Theorem 2.2.2, now applied to the functional f|_{L(v₂)}, to show f(y_k) − f(y_{k+1})
is bounded below independently of k.
To obtain an upper bound on f(y₁) − f(z₂), one can use the identity
In all,
Combined with the constant lower bound on the differences f(y_k) − f(y_{k+1}), we thus find
that the number K of exact line searches performed in moving from x₁ to x₂ satisfies
Having shown the progress made by the predictor-corrector method in decreasing the
objective value is at least as great as the progress made by the barrier method in taking
one step from x₁, we obtain complexity bounds for the predictor-corrector method which
are greater than the bounds for the barrier method by a factor K, that is, by a factor ϑ_f
(assuming α fixed, say, α = .99). The bounds are greater than the bounds for the long-step
barrier method by a factor √ϑ_f.
Nesterov and Nemirovskii [15]. In this section we consider various equivalent definitions
of self-concordance, including the original definition. We close the section with a brief
discussion of the term "strongly nondegenerate," a term we have suppressed.
The proofs in this section are more technical than conceptual, yielding useful results,
but ones that are not central to understanding the core theory of ipm's. It is suggested that
on a first pass through the book, the reader absorb only the exposition and statements of
results, bypassing the formal proofs.
Unless otherwise stated, we assume only that f ∈ C², D_f is open and convex, and
H(x) is pd for all x ∈ D_f.
For ease of reference, we recall our definition of self-concordance.
Specifically, if x, y, and z are colinear with y between x and z, if x and y satisfy (2.24), if
y and z satisfy the analogous condition
the equality relying on the colinearity of x, y, and z. Note (2.26) is immediate from (2.24)
and (2.27), hence the transitivity.
We now show that in the definition of self-concordance, the lower bound on ||v||_y/||v||_x
is redundant.
Theorem 2.5.1. Assume f is such that for all x ∈ D_f we have B_x(x, 1) ⊆ D_f and is such
that whenever y ∈ B_x(x, 1) we have
Then
Proof. Since
it suffices to show for all positive € in some open neighborhood of 0, and for all x, y e Df,
that
Toward establishing the implication (2.29), note from (2.28) that whenever points x
and y satisfy ||y — Jc|U < -rr (with e small), we have
Letting Amin denote the minimum eigenvalue of Hx(y), from the identities Hx(y) =
H ( x ) ~ l H ( y ) = Hy(x)~l we of course have
Since
Hence, by (2.30),
x, y e Df except in that y is required to lie in a ball with smaller radius r than the claimed
radius (i.e., \\y — x\\x < r rather than \\y — x\\x < -^, where r is independent of x, y),
then the smaller radius r can be increased to r' := r + €(-^ — r). Likewise, r' can be
increased. In the limit, we arrive at the validity of desired implication (2.29) for the claimed
radius -^.
To implement this proof strategy, assume 0 < € < 1, -^ < r < -^, and assume
that for all x,y e D/,
Since x and y are arbitrary, to prove that r can be increased to r' it suffices to show that
(2.31) gives
the equality relying on the colinearity of x, y, and z. That is, for all v / 0,
If we assume
we thus have
and hence the relation (2.32). Since (2.35) follows from the bound
we have thus shown the implication (2.31) to indeed give the relation (2.32), thereby com-
pleting the proof.
The following theorem provides various equivalent definitions of self-concordant
functionals.
(1b) For each x ∈ D_f, and for all y in some open neighborhood of x,
Moreover, if f satisfies any (and hence all) of the above conditions, as well as any condition
from the following list, then f satisfies all conditions from the following list:
(2a) For all x ∈ D_f we have
(2b) There exists 0 < r < 1 such that for all x ∈ D_f we have
Hence, since SC consists precisely of those functionals satisfying conditions 1a and 2a, by
choosing one condition from the first set and one from the second, we can define the set SC
as the set of functionals satisfying the two chosen conditions.
Proof. To prove the theorem, we first establish the equivalence of conditions 1a, 1b, and 1c.
We then prove conditions 1a and 2b together imply 2a, as do conditions 1a and 2c. Trivially,
2a implies 2b. To conclude the proof, it then suffices to recall that by Theorem 2.2.9,
conditions 1a and 2a together imply 2c.
Now we establish the equivalence of la, Ib, and Ic. Trivially, la implies Ib. Next
note
condition la not holding implies there exist x, /, and € > 0 such that ||/ — x \\x < -^ and
Considering points on the line segment between x and /, and relying on continuity of the
Hessian, it then readily follows that there exists y (possibly y = x) satisfying
and
In particular,
that is,
contradicting (2.37). We have thus proven the equivalence of conditions la, Ib, and Ic.
Now we prove that conditions 1a and 2b together imply 2a. Let 0 < r < 1 be as in
2b, i.e., B_x(x, r) ⊆ D_f for all x ∈ D_f. Let r' := (2 − r)r. We show 1a and 2b together
imply B_x(x, r') ⊆ D_f for all x ∈ D_f, i.e., 2b holds with r' in place of the smaller value r.
Likewise, r' can be replaced with a larger value. In the limit we find B_x(x, 1) ⊆ D_f for all
x ∈ D_f, precisely condition 2a.
To verify that 1a and 2b together allow r to be replaced with the larger value r', assume
x, y ∈ D_f satisfy ||y − x||_x ≤ r. Let
The collinear points x, y, z satisfy ||z − x||_x = (2 − ||y − x||_x)||y − x||_x. Consequently, since
y is an arbitrary point in B_x(x, r), it suffices to verify that 1a and 2b together imply z ∈ D_f.
By condition 2b, y ∈ D_f. Condition 1a applied with v := z − y thus gives
Hence, since ||y − x||_x ≤ r, we have
Consequently, 2b can be applied to y and z (in place of x and y), yielding the desired
inclusion z ∈ D_f.
To conclude the proof of the theorem, it remains only to prove that conditions 1a and
2c together imply 2a.
Assuming y ∈ D_f satisfies ||y − x||_x < 1, condition 1a implies
In particular, f ( y ) is bounded away from oo on each set Bx(x, r) n Df, where r < 1. By
condition 2c we conclude Bx(x, r) c. D f i f r < 1. Hence, Bx(x, 1) c Df, completing the
proof.
We turn to the original definition of self-concordance due to Nesterov and Nemirovskii.
First, we provide some motivation.
We know that if one restricts a self-concordant functional / to subspaces—or translates
thereof—one obtains self-concordant functionals. In particular, if / is restricted to a line
t M- x + td (where x, d e E"), then
the property
is identical to
Squaring both sides, then subtracting 1 from both sides, and finally multiplying both sides
by0"(f)/k-f|,wefind
This result has a converse. The converse, given by the following theorem, coincides with the
original definition of self-concordance due to Nesterov and Nemirovskii. (Keep in mind our
standing assumptions that / e C2, Df is open and convex, and H(x) is pd for all x e Df.)
Theorem 2.5.3. Assume f e C3 and assume each of the univariate functionals (f> obtained
by restricting f to lines intersecting Df satisfy (2.41) for all t in their domains. Furthermore,
assume that if a sequence {xk} converges to a point in the boundary dDf, then f(xk) -> oo.
Then f e SC.
Proof. The proof assumes the reader to be familar with certain properties of differentials.
To prove the theorem, it suffices to prove / satisfies conditions Ic and 2c of Theo-
rem 2.5.2. Of course 2c is satisfied by assumption.
The family of inequalities (2.41) (an inequality for each x, d, and t) is equivalent to
the family of inequalities
(an inequality for each jc and u). On the other hand, the family of inequalities given by
condition Ic is equivalent to
However, for any C3-functional / and for any inner product norm
and
are self-concordant according to our definition but not according to the original definition.
Neither functional is thrice-differentiable at the origin.
Our definition was not chosen for the slightly broader set of functionals it defines. It
was chosen because it provides the reader with a sense of the geometry underlying self-
concordance and because it is handy in developing the theory. Nonetheless, the original
definition has distinct advantages, especially in proving a functional to be self-concordant.
For example, assume D c R" is open and convex, and assume F e C3 is a functional which
takes on only positive values in D and only the value 0 on the boundary 3D. Furthermore,
assume that for each line intersecting D, the univariate functional 11->- F(x + td) obtained
by restricting F to the line happens to be a polynomial—moreover, a polynomial with only
real roots. Then, relying on the original definition of self-concordance, it is easy to prove
that the functional
the inequality due to the relation || ||3 < || ||2 between the 2-norm and the 3-norm on
are not inner products if f is not nondegenerate. However, there is indeed an analogue, ob-
tained as a simple generalization of the definition of strongly nondegenerate self-concordance.
Roughly, strongly self-concordant functionals are those obtained by extending strongly
nondegenerate self-concordant functionals to larger vector spaces by having the functional
be constant on parallel slices. Specifically, one can prove (as is done in [15]) that f is strongly
self-concordant iff Rⁿ is a direct sum L₁ ⊕ L₂ of subspaces for which there exists a strongly
nondegenerate self-concordant functional h, with D_h ⊆ L₁, satisfying f(x₁, x₂) = h(x₁).
For example, f(x) := −ln(x₁) is a strongly self-concordant functional with domain the
half-space R₊ ⊕ Rⁿ⁻¹ in Rⁿ, but it is not nondegenerate.
If self-concordant (resp., strongly self-concordant) functionals are added, the resulting
functional is self-concordant (resp., strongly self-concordant). If one of the summands is
strongly nondegenerate, so is the sum. This indicates how the theory of self-concordant
functionals, and strongly self-concordant functionals, parallels the theory developed in this
book. To get to the heart of the ipm theory quickly and cleanly, we focus on strongly
nondegenerate self-concordant functionals.
Henceforth, we return to our practice of referring to functionals as self-concordant
when, strictly speaking, we mean the functionals to be strongly nondegenerate self-
concordant.
Chapter 3
Conic Programming and Duality
Assuming Rᵐ, like Rⁿ, is endowed with an inner product, the dual instance is
where A* denotes the adjoint of A. The constraints of the dual instance can be written in
the same form as the primal if we introduce slack variables s e R":
Let val* denote the optimal value of the dual instance: —oo if the instance is infeasible.
An especially important relation between feasible points for the primal and dual in-
stances is given by the identities
Since x ∈ K and s ∈ K*, it follows that val* ≤ val, an inequality known as weak duality.
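Spelled out (a routine verification, using only the constraints Ax = b and A*y + s = c and the pairing of K with K*):
\[
\langle c, x\rangle - \langle b, y\rangle
\;=\; \langle A^*y + s,\, x\rangle - \langle Ax,\, y\rangle
\;=\; \langle s, x\rangle \;\ge\; 0,
\]
so every dual feasible objective value is a lower bound on every primal feasible objective value.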
To understand how the geometry of the primal and dual instances are related, fix x
satisfying Ax = b and (y, s) satisfying A*y + s = c. Letting L denote the nullspace of
A, note by (3.2) that up to an additive constant in the objective functional (specifically, the
constant —(b, y}), the primal instance is precisely
On the other hand, making the standard (simplifying) assumption that A is surjective, and
hence A* is injective, if (y, s) is dual feasible, then y is uniquely determined by s. Conse-
quently, the dual instance is equivalent to
(In fact, they are equivalent even if A is not surjective.) Geometrical relations between the
primal and dual instances are apparent from (3.3) and (3.4). Moreover, since (K*)* = K (as
follows from Corollary 3.2.2 and the closedness of K), (3.3) and (3.4) make it geometrically
clear that the dual of the dual instance is the primal instance, a fact that is also readily
established algebraically.
In the literature, CP is often introduced without reference to an inner product. The
objective functional (c, x) is expressed c*(x), where c* is an element of the dual space of
the primal space M", i.e., an element of the space of all continuous linear functionals on
the primal space. For us, the inner product naturally identifies the primal space with the
dual space; for each element c* in the dual space there exists c in the primal space such that
c*(x) = (c,x) for all x. The main advantage of relying on an inner product is that it makes
apparent geometrical relations between a primal instance and its dual instance. The main
disadvantage is that the identification of the primal space with the dual space changes when
the inner product is changed, forcing one to use notation that depends on the inner product.
However, by this point the reader is accustomed to changing local inner products, so the
advantages far outweigh the disadvantages.
Let us be precise as to how the primal and dual instances change when the inner
product is changed. Assume the inner product on Rⁿ is changed from ⟨ , ⟩ to ⟨ , ⟩_H,
where H is pd w.r.t. ⟨ , ⟩. (Recall ⟨u, v⟩_H := ⟨u, Hv⟩.) The primal instance can then be
expressed as
where c_H := H⁻¹c. Assuming the inner product on Rᵐ, the range space of A, is left
unchanged, the dual instance is
where A*_H = H⁻¹A* is the new adjoint of A, K*_H = H⁻¹K* is the new dual cone of K,
and s_H is the vector of slack variables. A point (y, s) is dual feasible before the change of
inner product iff the point (y, H⁻¹s) is dual feasible after the change.
We close this section with an overview of the remainder of the chapter.
In LP, one has strong duality; if either val or val* is finite, then val = val*. Strong
duality can fail for CP instances, even for simple SDP instances over the 2 x 2 symmetric
matrices. However, under conditions which typically must be satisfied for the application
of ipm's, strong duality is present. For example, it is present if the primal and dual instances
are feasible and at least one of them remains feasible whenever its right-hand-side vector (b
or c, respectively) is perturbed by a small amount, as is the case, say, if A is surjective and
the primal instance has a feasible point in the interior of K. We prove this result, as well as
other basic results in the classical duality theory, in §3.2.
We study conjugate functionals in §3.3. Given a barrier functional / whose domain
is the interior of K, we show (among other things) that the conjugate functional is a barrier
functional whose domain is the interior of the dual cone K*. The connections between
ipm's and duality theory hinge critically on this fact.
In §3.4 we establish the fundamental relations between the central path for the primal
instance and the central path for the dual instance. We show that in following one path, the
other is virtually generated as a by-product.
In §3.5 we present the theory of self-scaled (or symmetric) cones, a topic first de-
veloped in the ipm literature by Nesterov and Todd [16], [17]. The pronounced structure
of these cones allows for the development of symmetric primal-dual ipm's (algorithms in
which the roles of the primal and dual instances are mirror images). The most important
cones—both practically and theoretically—are self-scaled cones. The nonnegative orthant,
the cone of psd matrices, and the second-order cone are all self-scaled.
In designing an iterative algorithm (an algorithm that generates a sequence of points
as do ipm's), the choice of direction for moving from one point to the next is crucial. In §3.6
we discuss the Nesterov-Todd directions, perhaps the most prevalent directions appearing
in the design of primal-dual ipm's.
We present and analyze two types of primal-dual ipm's. In §3.7 we consider path-
following methods, algorithms which stay near the central path, much like the barrier
method in §2.4. In §3.8 we consider a potential-reduction method. Whereas the theory
of path-following methods requires the algorithms to stay near the central path, the theory
of potential-reduction methods does not, thus suggesting that potential-reduction methods
are more robust. The analysis of the progress for a potential-reduction method depends on
showing that a certain function—an appropriately chosen potential function—decreases by
a constant amount with each iteration. The decrease can be established regardless of where
the iterates lie; they need not lie near the central path.
For an example where strong duality fails, let K be the second-order cone in R³:
With regards to the dot product, it is an instructive exercise to show K = K*; that is, K is
self-dual. Consequently, for the primal instance
Rewriting x₁² + x₂² ≤ x₃² as x₁² ≤ (x₃ + x₂)(x₃ − x₂), it is readily seen that val = 0. On the
other hand, to be feasible for the dual, y must satisfy 1 + y² ≤ y², obviously impossible.
Thus val* = −∞ and hence val ≠ val*, even though one of the optimal objective values is
finite.
A slightly more elaborate example shows strong duality can fail even when both of
the optimal objective values are finite. Consider the primal instance
where
As before, val = 0. On the other hand, y = (— 1, 0) is dual optimal as all dual feasible
points satisfy y\ = — 1. Hence val* = — 1.
A standard approach to proving strong duality for LP is via the Farkas lemma. The
Farkas lemma is a "theorem of exclusive alternatives," stating that exactly one of the fol-
lowing two systems is feasible:
The analogue for CP would be that exactly one of the following two systems is feasible:
However, just as strong duality can fail for CP instances, so can this analogue of the Farkas
lemma, as the reader can readily verify from the following example in which K is the
second-order cone in E3:
for an extremely general development). However, by restricting to R n , the proofs are eased
considerably. Hence, just as we are developing ipm theory in R n , we develop duality theory
in finite dimensions.
Proof. Using the fact that in Rⁿ bounded sequences have convergent subsequences, it is
not difficult to prove that S has a closest point to x, i.e., a point s' ∈ S satisfying
Indeed, assume otherwise, letting s ∈ S violate the inequality. For 0 < t < 1, define
s(t) := s' + t(s − s'). Since s(t) ∈ S and
we conclude that for sufficiently small t > 0, s(t) is a point in S strictly closer to x than is
s', a contradiction.
However,
Corollary 3.2.2. Assume K is a nonempty, closed, convex cone in R". Ifx e R" but x £ K,
there exists d e R" such that
The Farkas lemma for LP extends to CP if one relaxes the notion of feasibility. In LP,
an instance is either feasible, or is infeasible and remains infeasible if the right-hand-side
vector b is perturbed to any vector b + Δb for which ||Δb|| is sufficiently small. This is
not true of CP instances in general. For example, the system on the left of (3.5) becomes
feasible if the right-hand-side coordinate 0 is replaced by ε > 0.
Assuming R m , like R", is endowed with an inner product (and the induced norm), one
says the system
Proof. Fixing A, consider the cone consisting of right-hand-side vectors for which the first
system is feasible:
Clearly, b e K(A) (the closure of K(A)) iff the system with right-hand-side vector b is
asymptotically feasible. Consequently, invoking Corollary 3.2.2 for the cone K(A), one
has the mutually exclusive alternatives that either the system with right-hand-side vector b
is asymptotically feasible or there exists y e Rm satisfying
Theorem 3.2.4 (asymptotic strong duality). If the primal instance is asymptotically fea-
sible, then a-val = val*.
Proof. Fix v G R and consider the following constraints in the variables (x, a) e R" x R:
The dual cone for K x IR+ is then K* x R+. Likewise, extend the inner product on W to
Rm x R. The adjoint of the linear operator
is then
It is readily seen that the system (3.9) is asymptotically feasible iff a-val < v. Con-
sequently, by Theorem 3.2.3 we have the exclusive alternatives that either a-val < v or the
following system in the variables (y, 8) G Rm x R is consistent:
We now prove a-val > val*. Assume otherwise and choose v to satisfy a-val < v <
val*. Since a-val < v, we know the system (3.10) is inconsistent. Since v < val*, we know
there exists dual feasible y satisfying (b, y) > v. Then (y, — 1) satisfies the system (3.10),
a contradiction. Hence a-val > val*.
Now we prove a-val < val*. Assume otherwise and choose v to satisfy val* < v <
a-val. Since v < a-val, the system (3.10) is satisfied by some point (y, ft}. If ft = 0, then
y satisfies
and hence, by Theorem 3.2.3, the primal instance is not asymptotically feasible, contradict-
ing the assumption of the theorem. Hence, ft < 0. Consequently, \/f$ is defined and thus
the point y := ~y satisfies
contradicting val* < v. Hence, a-val < val*, completing the proof.
Not surprisingly, the theorem has a dual analogue which can be proven by showing
that the dual of the dual instance is (equivalent to) the primal instance, then invoking the
theorem. To that end, define the asymptotic optimal value for the dual instance to be
Corollary 3.2.5. If the dual instance is asymptotically feasible, then a-val* = val.
Theorem 3.2.6. If either the primal instance or the dual instance is strongly feasible, then
val = val*.
Proof. We first observe that we may assume both instances are asymptotically feasible. For
if, say, the primal instance is not asymptotically feasible—hence val = oo—there exists
y satisfying the second system of Theorem 3.2.3. With the dual instance being strongly
feasible—in particular, being feasible—by adding arbitrarily large positive multiples of y
to any feasible point for the dual instance, we obtain dual feasible points with arbitrarily
large objective values. Thus val* = 00 = val, giving the equality in the theorem. Hence,
assume both instances are asymptotically feasible.
We know that both instances being asymptotically feasible implies
For definiteness, assume the dual instance to be strongly feasible. (A similar proof
applies if the primal instance is strongly feasible.) In light of the relations (3.13), to prove
the theorem it suffices to assume a-val < val and show it follows that the dual instance is
not strongly feasible, a contradiction.
So assume a-val < val. By the definition of a-val and the assumption that the primal
instance is asymptotically feasible, there exist sequences {x_i} ⊆ K, {Δb_i} such that
We claim ||x_i|| → ∞. For otherwise, {x_i} has a convergent subsequence; the limit
point x is easily proven to satisfy
from which it is immediate that a-val = val, contrary to our assumption a-val < val.
Let Δc be a limit point of the sequence {(1/||x_i||)x_i}. Of course ||Δc|| = 1; in particular,
Δc ≠ 0. For fixed ε > 0 consider the following CP instance:
Hence, by Theorem 3.2.4, the dual instance of (3.14) is infeasible. The dual instance is
Since ε can be an arbitrarily small positive number, this infeasibility contradicts the assumed
strong feasibility of the original dual instance.
In applying ipm's to solve a primal CP instance, one relies on a barrier functional
whose domain is the interior of the cone K. Thus, for the central path to exist, there must
be primal feasible points in the interior of K. In that regard, the following corollary shows
the potential failure of strong duality to be somewhat irrelevant to the study of ipm's. The
corollary implies that if the central path exists for the primal instance, then strong duality
holds. The same holds for the dual instance (as is further elucidated in §3.4).
Corollary 3.2.7. If the linear operator A is surjective and there is a primal feasible point
in the interior of K, then val = val*. Similarly, if there exists a dual feasible point (y,s)
with s in the interior of K*, then val — val*.
Lastly, we briefly discuss the fact that in CP there may not exist optimal solutions even
when the optimal objective values are finite—and even when, in addition, strong duality
is present. Strictly speaking, our use of "min" for the primal instance and "max" for the
dual instance should be replaced by "inf" and "sup," respectively. However, under an
assumption of strong feasibility, an optimal solution does indeed exist. Insofar as we are
interested in the connections between duality theory and ipm theory, "min" and "max" are
entirely appropriate.
For an example of an optimal solution not existing, let K denote the second-order
cone in E3. We leave it to the reader to verify that the primal instance
satisfies val = 0, the dual instance satisfies val* = 0, but the primal instance does not have
an optimal solution.
Theorem 3.2.8. If the dual instance is strongly feasible and the primal instance is feasible,
then the primal instance has an optimal solution. Similarly, if the primal instance is strongly
feasible and the dual instance is feasible, then the dual instance has an optimal solution.
For these relations, one assumes K°—the interior of K—is the domain of a barrier func-
tional.
To get a sense of how a barrier functional f : K° → R joins with duality theory,
consider the (negative) gradient map x ↦ −g(x). For example, in LP with the dot product,
this is the map (x₁, ..., xₙ) ↦ (1/x₁, ..., 1/xₙ). We claim the map x ↦ −g(x) takes K° into
K*. Indeed, by Theorem 2.3.3, whenever x, x' ∈ K°,
Applying this with tx' in place of x' and letting t → ∞, one deduces
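Filling in the limiting argument as a sketch (assuming the inequality supplied by Theorem 2.3.3 is the familiar barrier bound ⟨g(x), y − x⟩ ≤ ϑ_f for y ∈ D_f):
\[
\langle g(x),\, t x' - x\rangle \le \vartheta_f \ \text{ for all } t > 0
\quad\Longrightarrow\quad
\langle g(x), x'\rangle \le \frac{\vartheta_f + \langle g(x), x\rangle}{t} \xrightarrow[\;t\to\infty\;]{} 0,
\]
so ⟨−g(x), x'⟩ ≥ 0 for every x' ∈ K°, and hence −g(x) ∈ K*.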
The conjugate functional played a prominent role in optimization long before ipm research
blossomed, long before the notion of a self-concordant functional took form (cf. [19]).
Later in this chapter when local inner products become useful, it will be important to
keep in mind that the definition of the conjugate functional, like the definition of the dual
cone, depends on the underlying inner product. When the inner product changes, so does
the conjugate functional.
The definition of the conjugate functional /* applies to each functional /, not just bar-
rier functionals. Regardless of /, the conjugate functional is convex, for it is the supremum
of the convex functionals—in fact, linear functionals—
We define the domain D/* to be the set consisting of all s for which f * ( s ) is finite.
Throughout this section we assume / to satisfy only the following properties unless
otherwise stated: / e C2, Df is open and convex, H(x) is pd for all x e Df.
Recall SC denotes the set of (strongly nondegenerate) self-concordant functionals and
SCB denotes the subset of SC composed of barrier functionals.
It is suggested that on a first pass, the reader skip proofs in this section.
Theorem 3.3.1. If f ∈ SC, then f* ∈ SC. If f ∈ SCB and D_f = K°, where K is a cone,
then f* ∈ SCB and
¹This differs slightly from the standard definition of the conjugate functional: s ↦ sup_{x∈D_f} ⟨x, s⟩ − f(x).
The only real effects of the difference are to reduce the number of minus signs appearing in our exposition and to
make the domain of the conjugate functional be the dual cone rather than the polar cone.
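As a concrete illustration (mine, and using the sign convention the footnote describes, so that f*(s) = sup_x {−⟨s, x⟩ − f(x)}): for the logarithmic barrier f(x) = −∑_j ln x_j on the interior of K = R₊ⁿ with the dot product,
\[
f^*(s) \;=\; \sup_{x>0}\Bigl\{-\langle s, x\rangle + \sum_j \ln x_j\Bigr\}
\;=\; -\,n \;-\; \sum_j \ln s_j \qquad (s > 0),
\]
so D_{f*} is the interior of the dual cone (again the nonnegative orthant), f* is again a logarithmically homogeneous barrier, and −g*(s) = (1/s₁, ..., 1/sₙ) inverts the map −g, in line with Theorem 3.3.4 below.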
This proposition will be proven using the following proposition, which figures promi-
nently in proofs throughout this section.
Proof. If — g(jti) = s = —g(x2), then jci and *i minimize the strictly convex functional
Hence, x\ = x^ and thus g is injective. Moreover, because the functional (3.17) then has a
minimizer, 5 e Df*. Hence,
Assume s e £>/* and / e SC. Since Df* consists of s for which f * ( s ) is finite, the
self-concordant functional (3.17) is bounded below. By Theorem 2.2.8, it has a minimizer
x. Clearly, — g(x) = s. Hence,
has a minimizer.
Since 5 e (K*)°, there exists a > 0 such that
the inequality by Theorem 2.3.3. Observe that if \\x\\ > #//«, then (3.19) implies the
functional / to be strictly increasing as one moves outward along the ray [x + tx : t > 0}.
Hence,
For the self-concordant functional /, Theorem 2.2.9 shows /(*,-) —> oo if a sequence
{*,} converges to a point in the boundary of K. Consequently, using (3.20), it is easily
argued by compactness that / has a minimizer.
The identities stated in the next theorem are absolutely fundamental in developing
primal-dual ipm's.
Theorem 3.3.4. Assume f e SC. Then /* € C2. Moreover, ifx and s satisfy s = —g(x),
then
Proof. The proof goes somewhat outside this book, relying on differentials and the inverse
function theorem.
By Proposition 3.3.3, the map −g : D_f → D_{f*} is invertible. For each s ∈ D_{f*},
let x(s) denote the point in D_f satisfying −g(x(s)) = s. Since g ∈ C¹ and the first
differentials of g are invertible (i.e., the Hessians H(x) are invertible), the inverse function
theorem implies s ↦ x(s) is a C¹ map.
Differentiating both sides of the identity −g(x(s)) = s w.r.t. s by making use of the
chain rule, we have
that is,
Let [D_s x(s)]^T denote the adjoint of D_s x(s) : Rⁿ → Rⁿ. (We use "T" because "*" is
in use.)
Clearly,
Since the map s ↦ x(s) is in C¹, so is the functional f*. Differentiating both sides w.r.t.
s, making use of the product rule and the chain rule, we have
the last equality because s = −g(x(s)). Thus, whenever x and s satisfy −g(x) = s we
have −g*(s) = x. Moreover, since s ↦ x(s) is a C¹ mapping, (3.22) trivially shows the
same to be true of g*. Hence, f* ∈ C².
Finally, note (3.22) and (3.21) imply
Proof of Theorem 3.3.1. We use a superscript "*" to distinguish between objects associated
with /* and those associated with /. For example, || ||* denotes the local norm associated
with /*.
Assume / € SC. Assume x and 5 satisfy — g(x) = s. By Theorem 3.3.4, for all
W\, U)2,
an identity we use freely in the remainder of the proof. Likewise for the resulting relation
We now prove /* e SC. By Theorem 2.5.2, it suffices to show that if s e D/», then
and
Assume As is a vector satisfying r := ||A5||* < |, that is, \\v\\x < |, where v :=
-H(x)~l&s. Noting ^p < 9r2, Proposition 2.2.10 shows there exists u e Bx(v,9\\v\\2x)
such that
Note that
However,
implying
Evenif / is not logarithmically homogeneous, we have (g(x), 0 — x ) > 0 since— g(x) e K*.
Consequently, because 0 e K, Theorem 2.3.4 implies
The central path {(y^, s^} : r\ > 0} for the dual instance
We shall see that under the assumption of logarithmic homogeneity, these central paths are
strongly related, each arising from the optimality conditions for the other. For this, it is
convenient to adopt the geometrical viewpoint described in §3.1.
Fixing x satisfying Ax = b and ( y , s) satisfying A*y + s = c, recall that the primal
instance is identical (up to an additive constant in the objective functional) to
whereas the dual central path consists of the points s_η solving the problems
Hence, defining
Moreover, since Theorem 3.3.1 shows −g(x_η) ∈ (K*)°, and since K* is a cone, we have
s_η ∈ (K*)°. Thus, s_η is dual feasible.
Assuming logarithmic homogeneity, we claim that s_η, the vector arising
from the optimality conditions for x_η being on the primal central path, is itself on the dual cen-
tral path (moreover, with the same parameter value η). For under logarithmic homogeneity
we have
the last equality because −g* is the inverse of −g by Theorem 3.3.4 (regardless of whether
logarithmic homogeneity holds). Hence, by the first of the necessary conditions (3.31),
Together with (3.32) we now have that s_η satisfies the sufficient (and necessary) conditions
for s_η to be the unique optimal solution of the strictly convex problem (3.30):
that x_η := −(1/η)g*(s_η) satisfies the sufficient conditions (3.31). The roles of x_η and s_η are
entirely symmetric.
An important consequence of the relations s_η = −(1/η)g(x_η) and x_η = −(1/η)g*(s_η) is that
by following one of the central paths, one can generate the other as a by-product. The dual
instance can be solved by solving the primal instance and vice versa.
We claim that regardless of whether logarithmic homogeneity holds, the dual points
s_η (primal points x_η) tend to optimality as η → ∞. For we have already proven that s_η is
dual feasible; that did not rely on logarithmic homogeneity. Moreover, letting y_η denote the
vector satisfying A*y_η + s_η = c, we have
the inequality by Theorem 2.3.3. Consequently, s_η does indeed tend to optimality as η → ∞,
and similarly for x_η.
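To make the rate quantitative, here is the standard computation as a sketch; it assumes logarithmic homogeneity, under which ⟨g(x), x⟩ = −ϑ_f, together with the relation s_η = −(1/η)g(x_η) noted above and the weak-duality identity from §3.1:
\[
\langle c, x_\eta\rangle - \langle b, y_\eta\rangle
\;=\; \langle x_\eta, s_\eta\rangle
\;=\; -\tfrac{1}{\eta}\,\langle g(x_\eta), x_\eta\rangle
\;=\; \frac{\vartheta_f}{\eta},
\]
so the duality gap along the central paths shrinks like ϑ_f/η as η → ∞.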
If K is intrinsically self-dual, then / and f* have K° as their domain (by Theorem 3.3.1).
We define / to be intrinsically self-conjugate if / is logarithmically homogeneous,
if K is intrinsically self-dual, and if for each z e K° there exists a constant Cz such that
The theory of ipm's focuses on the gradients and Hessians of barrier functionals rather
than on values of the functionals. The relevance of intrinsic self-conjugateness is that for
each local inner product { , ) z , the gradients gz and Hessians Hz of / for the primal setting
in CP are exactly the same as the gradients g* and Hessians H* of f* for the dual setting.
A cone K is self-scaled (or symmetric) if K° is the domain of an intrinsically self-
conjugate barrier functional.
In the definition of intrinsic self-conjugateness, the constant Cz is unrestricted. How-
ever, it happens that the various conditions of the definition force the constant to equal a
specific value, as is shown by the following proposition.
Proof. Theorem 3.3.4 shows that regardless of the inner product, the conjugate functional
satisfies
In particular,
Consequently,
The same applies for the logarithmic barrier function for the cone of psd matrices.
Letting E denote the identity matrix (hence { , }E is the trace product), we have
Hence,
Similarly, with somewhat more involved calculations one can use the theorem to
show that the logarithmic barrier functional for the second-order cone is intrinsically self-
conjugate.
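For concreteness, the standard formulas behind such computations (stated as a sketch, not as the book's displays): with f(X) = −ln det X on the interior of the psd cone and the trace inner product,
\[
g(X) = -X^{-1}, \qquad H(X)V = X^{-1} V X^{-1}, \qquad H(Z)^{-1} = H(Z^{-1}),
\]
so −g carries the cone's interior onto itself, f is logarithmically homogeneous, and the gradients and Hessians taken w.r.t. a local inner product ⟨ , ⟩_Z agree for f and f*, which is what intrinsic self-conjugacy asks of the logarithmic barrier.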
The family of all self-scaled cones is limited. Güler [8] made this known to optimizers
(see also Faybusovich [5]). Güler realized that self-scaled cones are the same as symmetric
cones, objects that have been studied by analysts for decades (cf. [4]). There are five basic
symmetric cones, all others being Cartesian products of these:
• the cone of psd symmetric matrices,
• the second-order cone,
• the cone of psd Hermitian matrices,
• the cone of psd Hermitian quaternion matrices,
• a 27-dimensional exceptional cone.
Nesterov and Todd did not know of the equivalence between self-scaled cones and
symmetric cones when they wrote their foundational papers [16], [17]. Their principal
motivation was to generalize primal-dual methods beyond LP. Although the generalization
they achieved was perhaps not as great as we imagined (because we now know there are only
five basic self-scaled cones), they succeeded in uncovering the essence of the conic structure
needed by primal-dual methods. One can develop the theory of primal-dual methods for
each of the five cones individually, but developing it generally has the advantage of making
apparent the essential structure.
K°. When we refer specifically to the nonnegative orthant, the cone of psd matrices, or the
second-order cone, the underlying intrinsically self-conjugate barrier functional is assumed
to be the logarithmic barrier function.
Consequently, H(z) is a linear operator carrying K bijectively onto itself, i.e., H(z) is a
linear automorphism of K, a "scaling" of K.
The next theorem shows the set of linear automorphisms {H(z) : z e K°} forms a
complete set of scalings in the sense that for each x e K°,
that is, for each point in K° there exists some—in fact, a unique—automorphism H(z)
carrying x to that point.
When H(w)x = s, the point w is referred to as the scaling point (w.r.t the local inner
product at e} for the ordered pair x, s. That it is unique proves to be very important in the
theory.
If w is the scaling point for the ordered pair x, s, then — g(w) is the scaling point for
the pair s, x. In fact,
the first equality because H = H*, the second equality by Theorem 3.3.4.
The primal-dual methods we present in sections 3.7 and 3.8 require that when given
x, s ∈ K°, one can compute the scaling point w as well as g(w) and H(w). In this regard,
we note that if K is the nonnegative orthant and e = (1, ..., 1), then w is the vector whose
jth component is √(x_j/s_j). Similarly, if X and S are pd matrices, and if E is the identity
matrix, then
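A small numerical sketch of both computations (the psd formula below is the usual closed form W = X^{1/2}(X^{1/2}SX^{1/2})^{-1/2}X^{1/2}, written here as an assumption consistent with the requirement H(W)X = W^{-1}XW^{-1} = S; the helper names are mine):

    import numpy as np

    def scaling_point_orthant(x, s):
        # H(w) = diag(1/w_j^2) for the log barrier, so H(w)x = s gives w_j = sqrt(x_j/s_j).
        return np.sqrt(x / s)

    def psd_sqrt(A):
        # Symmetric square root via an eigendecomposition.
        vals, vecs = np.linalg.eigh(A)
        return (vecs * np.sqrt(vals)) @ vecs.T

    def scaling_point_psd(X, S):
        # For f(X) = -ln det X, H(W)V = W^{-1} V W^{-1}; solve W^{-1} X W^{-1} = S.
        Xh = psd_sqrt(X)
        return Xh @ np.linalg.inv(psd_sqrt(Xh @ S @ Xh)) @ Xh

    # Quick check on a random pd pair that H(W)X = S.
    rng = np.random.default_rng(0)
    M = rng.standard_normal((3, 3)); X = M @ M.T + np.eye(3)
    N = rng.standard_normal((3, 3)); S = N @ N.T + np.eye(3)
    W = scaling_point_psd(X, S); Winv = np.linalg.inv(W)
    assert np.allclose(Winv @ X @ Winv, S)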
To get a sense of the role the scaling point plays in the theory of primal-dual methods,
recall our discussion in §3.1 regarding the dependence of the primal and dual instances on
the inner product. If in the initial inner product { , ) = ( , )e the primal instance is
where
Consequently, assuming w is the scaling point for a pair x,s, by rewriting the instances
in terms of ( , )„,, the pair is transformed to the pair x,sw := H(w)~ls which satisfies
x = sw, i.e., each point in the pair is transformed to the point of intersection for the rewritten
primal and dual feasible regions. The primal-dual methods we consider are invariant under
the transformation in the sense that the iterates they generate for the rewritten instances are
the transformations of the iterates generated for the original instances. For the rewritten
instances, the primal and dual search directions—the directions in which one moves from
the first and second points in the current pair x, sw to the first and second points in the
next pair—are respectively obtained by projecting the gradient (w.r.t. ( , } w ) of a certain
functional onto the nullspace of A and onto the range space of A*w. As these spaces are
orthogonal complements (w.r.t. { , )w), the sum of the search directions is the gradient
itself. As we shall see, this makes for a particularly clean and symmetric analysis of the
methods. However, before we can develop primal-dual methods, we need to know more
about self-scaled cones.
Toward proving Theorem 3.5.3 we have the following proposition. We suggest that
the reader skip arguments marked "Proof" on the first pass through this subsection.
Proposition 3.5.4. Assume f is a barrier functional with D_f = K°, the interior of a (not
necessarily self-scaled) cone. Let K* denote the dual cone of K w.r.t. an arbitrary inner
product ⟨ , ⟩. If x ∈ K° and s ∈ (K*)°, then there exists w ∈ K° such that H(w)x = s.
Proof. Consider the functional ψ(w) := ⟨x, −g(w)⟩ + ⟨s, w⟩
with domain K°. To prove the proposition, it suffices to show ψ has a minimizer, for at a
minimizer the gradient is 0, that is, −H(w)x + s = 0. To prove ψ has a minimizer, it suffices
to show that if {w_i} ⊆ K° is a sequence satisfying either
then
Assume ||w_i|| → ∞. Since s ∈ (K*)° and {w_i} ⊆ K it follows that ⟨s, w_i⟩ → ∞.
Observing that ⟨x, −g(w_i)⟩ > 0 because −g(w_i) ∈ K*, we conclude ψ(w_i) → ∞.
Now assume w_i → w ∈ ∂K. Theorem 2.2.9 shows ||g(w_i)|| → ∞. Since −g(w_i) ∈
K* and x ∈ K° we thus have ⟨x, −g(w_i)⟩ → ∞. Observing that ⟨s, w_i⟩ > 0 since s ∈ K*
and w_i ∈ K, we conclude
Applying Proposition 3.5.4 to a self-scaled cone K with a local inner product ⟨ , ⟩ =
⟨ , ⟩_e, we have for each ordered pair x, s ∈ K° the existence of a scaling point w for the pair.
In light of the discussion just prior to the statement of Theorem 3.5.3, to complete the proof
of that theorem it remains only to prove the uniqueness of the scaling point. The uniqueness
is crucial for the subsequent theory. To prove uniqueness, we rely on the following theorem,
which is important in itself. We freely invoke its identities throughout the remainder of the
chapter; thus, the reader should keep them in mind.
and
Using the values for C and Cw given by Proposition 3.5.1 establishes (3.34).
To prove (3.35), one can simply differentiate both sides of (3.34) w.r.t. x, making use of
the chain rule. Alternatively, we can observe that since — g is an involution (Theorem 3.5.2),
Thus,
that is,
establishing (3.35).
To establish (3.36), one can differentiate both sides of (3.35) w.r.t. x using the chain
rule. Alternatively, relying on (3.35) and — g being an involution, observe
and so
that is,
Thus,
that is,
Since H(e)v = Iv = v, the uniqueness given by Theorem 3.5.3 thus implies x' = e.
By definition, the local norms induced by a self-concordant functional / change
gradually; if \\x — e\\ < 1, then for all v,
(the Minkowski gauge function for B}. For example, if K = R^_ and e = (!,...,!), then
| | is the ^oo-norm, whereas || || is the ^2-norm.
Since B(e, 1) := {x : \\x — e\\ < 1} is a set which is centrally symmetric about e and
satisfies B(e, 1) c K°, we have B(e, 1) c e + B. Stated differently,
Theorem 3.5.7. Assume K is self-scaled. Ifx satisfies \x — e\ < 1, then for all v,
and
The bounds (3.38) imply the Hessian of / to vary gradually. For example, just as
we established the bounds in Theorem 2.2.1 using the bounds in the definition of self-
concordance, from the bounds (3.38) we obtain that for all x satisfying \x — e\ < 1,
This has implications for Newton's method when applied to functionals obtained by adding
a linear term to /, like those for the barrier method (§2.4.1) and many other ipm's. To see
why, assume z minimizes such a functional. Repeating the proof of Theorem 2.2.3, now
making use of (3.40), we find that whenever \z — e\ < 1, then one step of Newton's method
beginning at e gives a point e+ satisfying
Unfortunately, although the stronger bound (3.41) indicates that ipm's like the barrier method
might perform better when the underlying barrier functional is intrinsically self-conjugate,
the stronger bound does not appear to be sufficient to prove complexity bounds which are
better than what we have already proven, i.e., (2.18). Still, the bound is considered to be
important conceptually and so we have discussed it.
The proof of Theorem 3.5.7 depends on the following theorem, which is of interest in
itself and which plays a major role in the proof of an important theorem in §3.5.5 (a theorem
which is central in our analysis of primal-dual methods).
For observe that x—e e KandB(e, 1) c K (by self-concordance) together imply B(x, 1) c
K (keeping in mind that B(x, 1) := Be(x, 1)). Hence, by symmetry and the fact that for
each vector v, either (g(x), v) > 0 or (g(x), —v) > 0, Theorem 2.3.4 implies
giving (3.42).
Since
Proof. Assume x € K° satisfies (g(e), x — e) > 0. Since g(e) = —e, we are thus assuming
(x,e) < \\e\\2, an upper bound on (x, e). A strict lower bound of 0 is immediate from
e,x e K° = (K*)°. In all,
is a scalar multiple of* where the scalar is no less than 1. Stated differently, x is a convex
combination of 0 and x. Hence,
Since ||i>|| = 1 we have e — v e K. Since K* = K, we thus see (jc, e — v) > 0, that is,
Hence, using \\e\\2 = #/, (v, e] = 0, and ||u|| = 1, we have ||jc — e\\ < #/. Consequently,
by (3.43) we find \\x — e\\ < #/. By openness, the inequality is seen to be strict, that is,
x e B(e, #/).
Theorem 3.5.10. Assume K is self-scaled and x e K°. For each t e R there exists unique
xt e K° such that
Proof. That there is at most one point xt with the desired property is immediate from the
uniqueness in Theorem 3.5.3.
Define
that is,
By the fact mentioned above, the operators H(x)t{~t2 and H(x)i(t]~t2) are pd w.r.t. { , }Xt .
Since the square root of a pd operator is unique, (3.44) thus implies
Consequently,
so that
The following theorem and proposition are used to prove Theorem 3.5.11.
then
Since
the largest eigenvalue of HZf(x) is (A/(A - e))2; in particular, if 0 < £ < A, the largest
eigenvalue is greater than 1. Hence, if 0 < 6 < A, Theorem 3.5.8 (applied with ze in place
of e) shows
that is.
Consequently, since € can be arbitrarily near 0 and since Bx (x, 1) c K° (by self-concordance),
In the case that A. = \\H(x) l\\ (= \\H(—g(x))\\), one simply interchanges the roles of x
and — g(x).
where || || is the local norm at e given by /. Note the following relation between the local
inner product at e given by /' and the one given by /: { , }' = 2{ , }. Letting g' denote
the gradient of /' w.r.t. { , )', it follows that
where
Note that 8 H-> (8 — l)/<5 is an increasing function of 8 when 8 > 0, whereas for fixed
Y > 0, the function 8 H» y/V8 is decreasing. Consequently, the function
Thus, by (3.45),
More geometrically, we know from §3.1 that fixing x satisfying Ax = b and (y, s) satisfying
A*y + s = c, and letting L denote the nullspace of A, the primal and dual instances can be
expressed as
and
where x_η solves
Thus, s_η solves
To highlight the symmetry between the primal and dual instances, we often hide y_η, referring to
{s_η : η > 0} as the dual central path, keeping in mind that y_η is uniquely determined by s_η
through the equation A*y_η + s_η = c (assuming A is surjective and hence A* is injective).
Assume x approximates x_η and (y, s) approximates (y_η, s_η). One strategy for obtain-
ing a better approximation to x_η is to move from x in the steepest descent direction for the
linearly constrained optimization problem (3.46), i.e., in the direction opposite the projected
gradient. This direction w.r.t. the inner product ⟨ , ⟩_x is the direction of the Newton step,
the step taken by the barrier method. If, instead, one relies on ⟨ , ⟩_w, where w is the scaling
point for the ordered pair x, s, one arrives at the primal Nesterov-Todd direction:
where c_w := H(w)⁻¹c and P_{L,w} denotes orthogonal projection onto L, orthogonal w.r.t.
⟨ , ⟩_w. Since c_w = A*_w y + H(w)⁻¹s and P_{L,w}A*_w = 0 (where A*_w = H(w)⁻¹A* is the
adjoint of A w.r.t. ⟨ , ⟩_w), the direction can also be expressed as
where w* = −g(w) is the scaling point for the ordered pair s, x (and where L^⊥ is the
orthogonal complement of L w.r.t. the original inner product ⟨ , ⟩).
One gets a primal Nesterov-Todd direction and a dual Nesterov-Todd direction for
each triple 77, x, s, where x is primal feasible and s is dual feasible.
The Nesterov-Todd directions appeared in primal-dual ipm's for LP long before Nes-
terov andTodd developed them generally (cf. Megiddo [13], Kojima, Mizuno, andYoshise
[11], and Monteiro and Adler [14]). Indeed, the goal of Nesterov andTodd was to generalize
earlier work for LP.
Theoretical and practical insight into the Nesterov-Todd directions is gained by con-
sidering d_x together with H(w)⁻¹d_s (or, symmetrically, H(w*)⁻¹d_x together with d_s).
Note that H(w)⁻¹d_s ∈ L^{⊥_w}, where L^{⊥_w} = H(w)⁻¹L^⊥ is the orthogonal complement
of L w.r.t. ⟨ , ⟩_w. We claim that
and hence
for then
and
However, (3.49) is immediate from the identity L^{⊥_w} = H(w)⁻¹L^⊥, whereas (3.50) is
immediate from
In the literature, these equations are often taken as the starting point for discussions of the
Nesterov-Todd directions, i.e., they are taken as defining the directions. The equations can
be used to compute the directions in practice.
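As an illustration of how such a computation might look for LP with the logarithmic barrier (a sketch under stated assumptions: the primal direction is taken to be d_x = −P_{L,w}(ηc_w + g_w(x)), the projection is with respect to ⟨u, v⟩_w = ⟨u, H(w)v⟩ onto L = null(A), and the function names are mine):

    import numpy as np

    # Sketch: primal Nesterov-Todd direction for LP (min <c,x>, Ax = b, x >= 0)
    # with the log barrier, for which H(w) = diag(1/w_j^2) and g(x) = -1/x.
    def primal_nt_direction(A, c, x, s, eta):
        w = np.sqrt(x / s)                      # scaling point: H(w)x = s
        H_inv = lambda v: (w ** 2) * v          # H(w)^{-1} v
        c_w = H_inv(c)                          # c_w = H(w)^{-1} c
        g_w_x = H_inv(-1.0 / x)                 # g_w(x) = H(w)^{-1} g(x)

        grad = eta * c_w + g_w_x                # gradient w.r.t. < , >_w
        # Projection onto L = null(A), orthogonal w.r.t. < , >_w:
        # P_{L,w} v = v - H(w)^{-1} A^T (A H(w)^{-1} A^T)^{-1} A v.
        AWA = A @ ((w ** 2)[:, None] * A.T)
        proj = grad - H_inv(A.T @ np.linalg.solve(AWA, A @ grad))
        return -proj                            # d_x = -P_{L,w}(eta*c_w + g_w(x))

A quick sanity check is that A applied to the returned direction is zero up to roundoff, i.e., the direction indeed lies in the nullspace of A.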
It should be mentioned that dx and H(w}~lds are the Nesterov-Todd directions if one
replaces { , } with ( , )w, rewriting the instances in terms of this inner product. Moreover,
one can show quite generally that the Nesterov-Todd directions are essentially invariant
under changes in the inner product. For example, if one replaces { , } with { , }z where
z e K°, then dx and H(z)~lds are the Nesterov-Todd directions for the rewritten instances
The simple geometry given by viewing the sum d_x + H(w)⁻¹d_s as an ⟨ , ⟩_w-
orthogonal decomposition of −(ηx + g_w(x)) provides analyses of primal-dual ipm's which
are more transparent than arguments phrased only in terms of the original inner product
⟨ , ⟩ and the original instances. However, other inner products and rewritings of the
instances reveal the same simple geometry, for example, the approach which is symmetric
to the one we employ. The symmetric approach relies on ⟨ , ⟩_{w*}. In this approach, L^⊥
is fixed, L is replaced by (L^⊥)^{⊥_{w*}} = H(w*)⁻¹L, and the sum H(w*)⁻¹d_x + d_s gives an
⟨ , ⟩_{w*}-orthogonal decomposition of −(ηs + g_{w*}(s)).
One can even use the original inner product ( , } to reveal the same simple geometry
if one transforms the instances appropriately. Specifically, transforming the primal feasible
rather than on w and w*. (Take note that we slightly abuse notation in that
unless η = 1.) To give a sense as to why it is convenient, we show that a necessary and
sufficient condition to have the simultaneous equalities x = x_η and s = s_η is that w = x
(likewise, w* = s), for observe that w is the scaling point for the ordered pair x, ηs (whereas
w* is the scaling point for the pair s, ηx). Hence,
As we know from §3.4, a necessary and sufficient condition for the simultaneous equalities
x = x_η and s = s_η is that x = −(1/η)g*(s). Consequently, since logarithmic homogeneity
and the uniqueness of scaling points (Theorem 3.5.3) imply g_w(x) = −x iff w = x, (3.51)
shows we do indeed have the simultaneous equalities x = x_η and s = s_η iff w = x.
In the remaining sections of this chapter it is useful to scale the Nesterov-Todd direc-
tions, letting
and
Moreover,
where w is the scaling point for the ordered pair x, s and w* (= −g(w)) is the scaling
point for the pair s, x. We know from §3.6 that a necessary and sufficient condition for the
simultaneous equalities x = x_η and s = s_η is that w = x (likewise, w* = s). Thus, one
might consider measuring proximity of the pair x, s to the central path pair x_η, s_η by the
quantity ‖x − w‖_x. It is not difficult to show
and
Moreover,
Proof. To prove (3.52), substitute H(w)^{-1}s for x and H(w)x for s in the expression
‖ηs + g(x)‖_{−g(x)}, then use the identities
and
Next, we claim that H(x)^{1/2}e = −g(x), for by Theorem 3.5.10 there exists x_{1/2} ∈ K°
such that H(x_{1/2}) = H(x)^{1/2} and hence by (3.36),
Toward proving the final inequality of the theorem, we remark that from the identities
H(w)^{-1}ηs = x, g_w(x) = H(w)^{-1}g(x), and
3.7.2 An Algorithm
A feature of our first primal-dual path-following ipm is the simplicity of its analysis. Indeed,
the algorithm is designed expressly for this feature. In the next subsection we present an
algorithm which is more typical of the ipm's found in the literature, one that relies on the
Nesterov-Todd directions. Its analysis extends the simpler analysis.
Assume x and s are the current feasible points, meant to crudely approximate x_η and
s_η. Our first algorithm obtains better approximations by computing
and
where P_{L,w} denotes orthogonal projection onto L, orthogonal w.r.t. ⟨ , ⟩_w. The algorithm
then increases η to some value η+. The points x+, s+ (computed with the aim of obtaining
better approximations to x_η, s_η than x, s) can be considered as crude approximations to
x_{η+}, s_{η+}, just as x, s were considered to be crude approximations to x_η, s_η. The algorithm
repeats the steps with η+ in place of η, and so on.
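The displayed update formulas aside, the projection P_{L,w} itself is easy to realize concretely. Here is a small numpy helper (ours) for the LP case, where ⟨u, v⟩_w = ⟨u, H(w)v⟩ and the diagonal of H(w)^{-1} is passed in as W2 (for instance W2 = x/(ηs) when w is the scaling point for the pair x, ηs):

    import numpy as np

    def project_L_w(A, W2, v):
        # Orthogonal projection of v onto L = null(A) w.r.t. <u,v>_w = u^T H(w) v,
        # where W2 is the diagonal of H(w)^{-1} (LP log barrier).
        # The complementary piece v - p lies in H(w)^{-1} range(A^T) = L-perp_w.
        M = (A * W2) @ A.T                              # A H(w)^{-1} A^T
        p = v - W2 * (A.T @ np.linalg.solve(M, A @ v))
        return p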
Insight into the algorithm is gained by understanding that the primal and dual direction
vectors
can essentially be viewed as orthogonal projections of the same vector, similarly to the
primal and dual Nesterov-Todd directions. In particular, we claim
and hence
Thus,
3.7. Primal-Dual Path-Following Methods 105
where A*_w := H(w)^{-1}A*, so that ηH(w)^{-1}A* = ηA*_w. These equations can be used to compute
the directions in practice.
In light of Theorem 3.7.1, the following theorem justifies the algorithm.
Theorem 3.7.2. Assume |1 − η/η+| < 1/(12√ϑ_f). If x and s are feasible points satisfying
then
we have
and
Consequently,
Hence, by g_w(w) = −w, H_w(w) = I, Corollary 1.5.8, and Theorem 2.2.1, we have
Since (3.39) (applied with w in place of e and x+ in place of x) implies that for all v,
it follows that
and
The algorithm then increases η to η+. The points x+, s+ crudely approximate x_{η+}, s_{η+}. The
algorithm repeats the steps with η+ in place of η, and so on.
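To convey the flavor of such an iteration, here is a schematic numpy sketch of a standard feasible short-step Nesterov-Todd loop for linear programming. It is not a transcription of the algorithm analyzed here: the full Newton-type step, the update factor 1 + 1/(20√n), and the proximity quantity monitored at the end are standard choices we supply for illustration (with ϑ_f = n for the LP barrier).

    import numpy as np

    def nt_short_step_lp(A, x, y, s, eta, iters=50):
        # Schematic short-step path following for
        #   min <c,x> s.t. Ax = b, x >= 0,  with dual slack s = c - A^T y > 0.
        # Each pass: increase eta modestly, then take one Nesterov-Todd step
        # aimed at the new central pair, i.e. at x_i s_i = 1/eta.
        n = x.size
        for _ in range(iters):
            eta *= 1.0 + 1.0 / (20.0 * np.sqrt(n))   # modest increase in eta
            W2 = x / s                                # diag of H(w)^{-1}, w_i = sqrt(x_i/s_i)
            r = 1.0 / (eta * s) - x                   # residual toward x_i s_i = 1/eta
            M = (A * W2) @ A.T
            dy = -np.linalg.solve(M, A @ r)
            ds = -A.T @ dy
            dx = r - W2 * ds                          # A dx = 0, so Ax = b is preserved
            x, y, s = x + dx, y + dy, s + ds          # (production code would guard x, s > 0)
            w = np.sqrt(x / (eta * s))                # scaling point for the pair x, eta*s
            prox = np.linalg.norm((x - w) / x)        # the proximity quantity ||x - w||_x
        return x, y, s, eta, prox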
The algorithm is closely related to the one of the previous subsection. Indeed, if x, s
are near x_η, s_η, then w ≈ x and hence x + g_w(x) ≈ 2(x − w); likewise,
s + g_{w*}(s) ≈ 2(s − w*). Consequently, it is not surprising that the analysis of the
algorithm is an extension of the arguments in the proof of Theorem 3.7.2.
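A brief justification of these approximations (our sketch; only logarithmic homogeneity of the barrier is used): let φ(w) := x + g_w(x) = x + H(w)^{-1}g(x). Then

$$ \varphi(x) = x + H(x)^{-1}g(x) = 0, \qquad D\varphi(x)[\delta] = -H(x)^{-1}\,DH(x)[\delta]\,H(x)^{-1}g(x) = H(x)^{-1}\,DH(x)[x]\,\delta = -2\delta, $$

using H(x)^{-1}g(x) = −x, the symmetry DH(x)[δ]x = DH(x)[x]δ of the third derivative, and DH(x)[x] = −2H(x) (differentiate H(tx) = t^{-2}H(x) at t = 1). Hence φ(w) ≈ 2(x − w) for w near x, and the dual approximation follows in the same way.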
Theorem 3.7.3. Assume |1 − η/η+| < 1/(20√ϑ_f). If x and s are feasible points satisfying
then
Hence, defining
Hence,
and
Consequently,
Thus,
where
The appropriateness of this potential function is shown by the following theorem, recalling
that
Proof. Let t := ϑ_f/⟨x, s⟩. Recall that f*(z) = f(z) + C_e for all z ∈ K°, where
C_e := −ϑ_f − 2f(e) (Theorem 3.5.1). Thus, by definition of f*,
(with equality iff ts = −g(x), i.e., iff s is a multiple of −g(x)). Consequently, since
f(ts) = f(s) − ϑ_f ln t (logarithmic homogeneity),
is concave when restricted to the feasible regions, observe that its Hessian acts on vectors
(u, v) ∈ R^n × R^n as follows:
Thus, if
Hence, the Hessian of the functional obtained by restricting (3.56) to the primal and dual
feasible regions is negative semidefinite, i.e., the restricted functional is concave.
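For orientation (our LP specialization, not a transcription of (3.56)): with ρ := ϑ_f + √ϑ_f, a primal-dual potential of the type studied here is

$$ \Phi(x, s) = \rho \,\ln\langle x, s\rangle + f(x) + f(s), $$

which for the LP barrier f(x) = −Σ_i ln x_i (ϑ_f = n) becomes the familiar Tanabe-Todd-Ye potential (n + √n) ln(xᵀs) − Σ_i ln x_i − Σ_i ln s_i.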
and
where
For motivation, we note that the gradient w.r.t. ⟨ , ⟩_w of the functional obtained by restricting
x ↦ Φ(x, s) to the primal feasible region is precisely −d_x, i.e., d_x is the direction of steepest
descent. Likewise, d_s is the direction of steepest descent w.r.t. ⟨ , ⟩_{w*} for the functional
obtained by restricting s ↦ Φ(x, s) to the dual feasible region.
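In the LP case, with the potential written as above, these steepest-descent directions can be computed along the following lines (a sketch under our LP assumptions; the normalization may differ from that of the directions d_x, d_s defined in the text):

    import numpy as np

    def potential_descent_directions_lp(A, x, s, rho):
        # Steepest-descent directions for Phi(x,s) = rho*ln<x,s> - sum ln x - sum ln s,
        # descent w.r.t. < , >_w on the primal side and < , >_{w*} on the dual side,
        # where w is the scaling point for the pair x, s (so H(w)^{-1} = diag(x/s)).
        W2 = x / s
        mu = x @ s
        gx = rho * s / mu - 1.0 / x        # gradient of Phi in x
        gs = rho * x / mu - 1.0 / s        # gradient of Phi in s
        M = (A * W2) @ A.T                 # A H(w)^{-1} A^T
        v = W2 * gx                        # H(w)^{-1} grad_x Phi
        dx = -(v - W2 * (A.T @ np.linalg.solve(M, A @ v)))   # minus its P_{L,w} projection
        ds = -(A.T @ np.linalg.solve(M, A @ gs))             # dual piece, in range(A^T)
        return dx, ds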
The algorithm searches the line {(x + t d_x, s + t d_s) : t ∈ R} to obtain an approximate
minimizer t_min of the univariate functional
It then lets
We show
For theory purposes we can formalize the condition that t_min be an "approximate minimizer"
by requiring t_min satisfy, say,
The specific constant is irrelevant for the theory. What matters is that the constant is
positive and universal (i.e., entirely independent of the particular instances) and that potential
reduction by that amount can always be achieved. In practice, drastically greater reduction
of the potential function always happens.
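As an illustration of the line search (our crude stand-in, using the LP potential and directions sketched above; a real implementation would use a proper univariate minimizer):

    import numpy as np

    def lp_potential(x, s, rho):
        # Tanabe-Todd-Ye-type potential, our LP stand-in for the functional above.
        return rho * np.log(x @ s) - np.sum(np.log(x)) - np.sum(np.log(s))

    def potential_line_search(x, s, dx, ds, rho, samples=200):
        # Sample t on (0, 1] and keep the strictly feasible point of least potential.
        best_t, best_val = 0.0, lp_potential(x, s, rho)
        for t in np.linspace(0.0, 1.0, samples + 1)[1:]:
            xt, st = x + t * dx, s + t * ds
            if np.all(xt > 0) and np.all(st > 0):
                val = lp_potential(xt, st, rho)
                if val < best_val:
                    best_t, best_val = t, val
        return x + best_t * dx, s + best_t * ds, best_val

What matters for the theory, as just noted, is only that each accepted step decreases the potential by at least a universal positive constant.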
The algorithm generates a sequence of points (x_0, s_0), (x_1, s_1), ..., each obtained
from the previous one by line search. The relation (3.57) in conjunction with Theorem 3.8.1
immediately imply for all i that
Hence, for all ε > 0, the algorithm computes feasible points (x_i, s_i) satisfying
within
iterations.
From a theoretical perspective, by far the most important term in the bound (3.58) is
√ϑ_f ln(1/ε). The bound can be roughly interpreted as stating that the objective value difference
⟨c, x_i⟩ − ⟨b, y_i⟩ is halved in O(√ϑ_f) iterations, a result which is essentially the same as
what we obtained for the barrier method in §2.4.2 and for the primal-dual path-following
methods in §3.7. However, the potential-reduction method has the potential for much
greater efficiency (especially if each line search yields t_min as the exact minimizer) as it is
not required to stay near the central path like the other algorithms.
The potential-reduction method described above was first introduced in the context
of linear complementarity problems by Kojima, Mizuno, and Yoshise [12].
Theorem 3.8.2.
and
Let
It suffices to prove
Let
we have
and hence,
Let
so that
The function ψ_1 is concave since the map (x, s) ↦ ρ ln⟨x, s⟩ is concave when
restricted to the primal and dual feasible regions (as was discussed at the end of §3.8.1).
Hence,
that is,
However, since ‖x‖² = ρ = ϑ_f + √ϑ_f, ‖w‖_w = √ϑ_f, and ϑ_f ≥ 1, it is not difficult to
prove that
Thus,
Substitution yields
Bibliography
[14] R.D.C. Monteiro and I. Adler, "Interior path following primal-dual algorithms. Part I:
Linear programming," Mathematical Programming, 44 (1989), pp. 27-41.
[15] Yu.E. Nesterov and A.S. Nemirovskii, Interior-Point Polynomial Algorithms in Convex
Programming, SIAM, Philadelphia, 1994.
[16] Yu.E. Nesterov and M.J. Todd, "Self-scaled barriers and interior-point methods for
convex programming," Mathematics of Operations Research, 22 (1997), pp. 1-42.
[17] Yu.E. Nesterov and M.J. Todd, "Primal-dual interior-point methods for self-scaled
cones," SIAM Journal on Optimization, 8 (1998), pp. 324-364.
[18] J. Renegar, "A polynomial-time algorithm based on Newton's method for linear pro-
gramming," Mathematical Programming, 40 (1988), pp. 59-94.
[19] R.T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[20] K. Tanabe, "Centered Newton method for mathematical programming," System Mod-
elling and Optimization: Proceedings of the 13th IFIP Conference, Tokyo, Japan,
Aug./Sept. 1987, Lecture Notes in Control and Information Sciences 113, M. Iri and
K. Yajima, eds., Springer-Verlag, Berlin, 1988, pp. 197-206.
[23] M.J. Todd and Y. Ye, "A centered projective algorithm for linear programming," Math-
ematics of Operations Research, 15 (1990), pp. 508-529.
[25] Y. Ye, Interior Point Algorithms: Theory and Analysis, John Wiley and Sons, New
York, 1997.
Index
A*, 3
A_x, 22
analytic center, 39
asymptotic feasibility, 71
asymptotic optimal value, 71, 73
a-val, 71
a-val*, 73
B_x(y, r), 23
barrier functional, 35
central path, 43, 81
complexity value, 35
ϑ_f, 35
conjugate functional, 76
f*, 76
C¹, 6
C², 9
x_1 · x_2, 2
dual cone, 65
K*, 65
K*, 83
Frobenius norm, 2
gradient, 6
g(x), 6
g_x(y), 22
g_{|L,x}(y), 22
Hessian, 9
H(x), 9
H_{|L}(x)
H_x(y), 22
H_{|L,x}(y), 22
⟨ , ⟩, 2
⟨ , ⟩_S, 5
⟨ , ⟩_x, 22
intrinsically self-conjugate, 83
intrinsically self-dual, 83
logarithmic homogeneity, 42
Nesterov-Todd directions, 98, 99
n(x), 19
‖ ‖, 2
‖ ‖_S, 5
‖ ‖_x, 22
‖A‖, 4
positive definite (pd), 4
positive semidefinite (psd), 4
P_L, 8
P_{L,x}, 22
R^n_{++}
scaling point, 86
self-adjoint, 4
self-concordant functional, 23
self-scaled (or symmetric) cone, 84
SC, 23
SCB, 35
strong duality, 67
strong feasibility, 73
S^{n×n}, 2
S^{n×n}_{++}, 6
sym(·, ·), 40
X_1 ∘ X_2, 2
val, 65
val*, 66
weak duality, 66