
A MATHEMATICAL VIEW

OF INTERIOR-POINT METHODS
IN CONVEX OPTIMIZATION
MPS/SIAM Series on Optimization

This series is published jointly by the Mathematical Programming Society and the Society
for Industrial and Applied Mathematics. It includes research monographs, textbooks at all
levels, books on applications, and tutorials. Besides being of high scientific quality, books
in the series must advance the understanding and practice of optimization and be written
clearly, in a manner appropriate to their level.

Editor-in-Chief
John E. Dennis, Jr., Rice University

Continuous Optimization Editor


Stephen J. Wright, Argonne National Laboratory

Discrete Optimization Editor


David B. Shmoys, Cornell University

Editorial Board
Daniel Bienstock, Columbia University
John R. Birge, Northwestern University
Andrew V. Goldberg, InterTrust Technologies Corporation
Matthias Heinkenschloss, Rice University
David S. Johnson, AT&T Labs - Research
Gil Kalai, Hebrew University
Ravi Kannan, Yale University
C. T. Kelley, North Carolina State University
Jan Karel Lenstra, Technische Universiteit Eindhoven
Adrian S. Lewis, University of Waterloo
Daniel Ralph, The Judge Institute of Management Studies
James Renegar, Cornell University
Alexander Schrijver, CWI, The Netherlands
David P. Williamson, IBM T.J. Watson Research Center
Jochem Zowe, University of Erlangen-Nuremberg, Germany

Series Volumes
Renegar, James, A Mathematical View of Interior-Point Methods in Convex Optimization
Ben-Tal, Aharon and Nemirovski, Arkadi, Lectures on Modern Convex Optimization:
Analysis, Algorithms, and Engineering Applications
Conn, Andrew R., Gould, Nicholas I. M., and Toint, Philippe L., Trust-Region Methods

James Renegar
Cornell University
Ithaca, New York

Society for Industrial and Applied Mathematics, Philadelphia
Mathematical Programming Society, Philadelphia
Copyright ©2001 by the Society for Industrial and Applied Mathematics.


All rights reserved. Printed in the United States of America. No part of this book
may be reproduced, stored, or transmitted in any manner without the written
permission of the publisher. For information, write to the Society for Industrial
and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA
19104-2688.

Library of Congress Cataloging-in-Publication Data


Renegar, James, 1955-
A mathematical view of interior-point methods in convex optimization / James
Renegar.
p. cm. - (MPS-SIAM series on optimization)
Includes bibliographical references and index.
ISBN 0-89871-502-4 (pbk.)
1. Interior-point methods. 2. Mathematical optimization. 3. Convex
programming. I. Title. II. Series.

QA402.5 .R46 2001


519.3-dc21
2001042999

This research was supported by NSF grant CCR-9403580.

siam is a registered trademark.


Contents

Preface

1 Preliminaries
1.1 Linear Algebra
1.2 Gradients
1.3 Hessians
1.4 Convexity
1.5 Fundamental Theorems of Calculus
1.6 Newton's Method

2 Basic Interior-Point Method Theory
2.1 Intrinsic Inner Products
2.2 Self-Concordant Functionals
2.2.1 Introduction
2.2.2 Self-Concordancy and Newton's Method
2.2.3 Other Properties
2.3 Barrier Functionals
2.3.1 Introduction
2.3.2 Analytic Centers
2.3.3 Optimal Barrier Functionals
2.3.4 Other Properties
2.3.5 Logarithmic Homogeneity
2.4 Primal Algorithms
2.4.1 Introduction
2.4.2 The Barrier Method
2.4.3 The Long-Step Barrier Method
2.4.4 A Predictor-Corrector Method
2.5 Matters of Definition

3 Conic Programming and Duality
3.1 Conic Programming
3.2 Classical Duality Theory
3.3 The Conjugate Functional
3.4 Duality of the Central Paths
3.5 Self-Scaled (or Symmetric) Cones
3.5.1 Introduction
3.5.2 An Important Remark on Notation
3.5.3 Scaling Points
3.5.4 Gradients and Norms
3.5.5 A Useful Theorem
3.6 The Nesterov-Todd Directions
3.7 Primal-Dual Path-Following Methods
3.7.1 Measures of Proximity
3.7.2 An Algorithm
3.7.3 Another Algorithm
3.8 A Primal-Dual Potential-Reduction Method
3.8.1 The Potential Function
3.8.2 The Algorithm
3.8.3 The Analysis

Bibliography

Index

Preface
This book aims at developing a thorough understanding of the most general theory for
interior-point methods, a class of algorithms for convex optimization problems. The study
of these algorithms has dominated the continuous optimization literature for nearly 15
years, beginning with the paper by Karmarkar [10]. In that time, the theory has matured
tremendously, perhaps most notably due to the path-breaking, broadly encompassing, and
enormously influential work of Nesterov and Nemirovskii [15].
Much of the literature on the general theory of interior-point methods is difficult to
understand, even for specialists. My hope is that this book will make the most general theory
accessible to a wide audience—especially Ph.D. students, the next generation of optimizers.
The book might be used for graduate courses in optimization, both in mathematics and
engineering departments. Moreover, it is self-contained and thus appropriate for individual
study by students as well as by seasoned researchers who wish to better assimilate the most
general interior-point method theory.
This book grew out of my lecture notes for the Ph.D. course on interior-point methods
at Cornell University. The impetus for writing the book came from an invitation to give the
annual lecture series at the Center for Operations Research and Econometrics (CORE) at
Louvain-la-Neuve, Belgium, in fall, 1996. The book is being published by CORE as well
as in the SIAM-MPS series.*
The writing of this book has been a particularly satisfying experience. It has brought
into sharp focus the beauty and coherence of the interior-point method theory as developed
and influenced through the efforts of many researchers. I hope those researchers will not
be offended by my choice to cite few references (and I hope they largely agree that the
ones I have cited are the ones that should be cited, given the material covered herein).
The reader can look to recent articles and books for extensive bibliographies (cf. [21],
[22], [24], [25]); a great place to stay up to date is Interior-Point Methods Online
(http://www-unix.mcs.anl.gov/otc/InteriorPoint/).
In citing few references, my intent is not to give the impression of research originality.
Indeed, most of the results in this monograph are undoubtedly embedded somewhere in [15],
[16], and [17] (which in turn were influenced by other works). My only claim to originality
is in presenting a simplifying perspective.

Ithaca, January, 2001

*This SIAM-MPS version is slightly more polished than the CORE version, having benefited from reviewers as
well as from students who combed the manuscript while taking the Ph.D. course on interior-point methods. I wish
to thank Osman Güler, who carefully read the manuscript in his role as a reviewer for the SIAM-MPS series. His
suggestions improved the presentation. I am very grateful to Francis Su and Trevor Park, both of whom suggested
many changes that benefit the reader.

Chapter 1

Preliminaries

This chapter provides a review of material generally pertinent to continuous optimization
theory, although it is written so as to be readily applicable in developing interior-point
method (ipm) theory. The primary difference between our exposition and more customary
approaches is that we do not rely on coordinate systems. For example, it is customary to
define the gradient of a functional $f : \mathbb{R}^n \to \mathbb{R}$ as the vector-valued function
$g : \mathbb{R}^n \to \mathbb{R}^n$ whose $j$th coordinate is $\partial f/\partial x_j$. Instead, we consider the gradient as
determined by an underlying inner product $\langle\ ,\ \rangle$. For us, the gradient is the function $g$ satisfying

$$\lim_{\|\Delta x\| \to 0} \frac{f(x + \Delta x) - f(x) - \langle g(x), \Delta x \rangle}{\|\Delta x\|} = 0,$$

where $\|\Delta x\| := \langle \Delta x, \Delta x \rangle^{1/2}$. In general, the function whose $j$th coordinate is $\partial f/\partial x_j$ is
the gradient only if $\langle\ ,\ \rangle$ is the Euclidean inner product.
The natural geometry varies from point to point in the domains of optimization prob-
lems that can be solved by ipm's. As the algorithms progress from one point to the next, one
changes the inner product—and hence the geometry—to visualize the headway achieved
by the algorithms. The relevant inner products may bear no relation to an initially imposed
coordinate system. Consequently, in aiming for the most transparent and least cumbersome
proofs, one should dispense with coordinate systems.
We begin with a review of linear algebra by recalling, for example, the notion of a
self-adjoint linear operator. We then define gradients and Hessians, emphasizing how they
change when the underlying inner product is changed. Next is a brief review of basic results
for convex functionals, followed by results akin to the fundamental theorem of calculus.
Although these "calculus results" are elementary and dry, they are essential in achieving
lift-off for the ipm theory. Finally, we recall Newton's method for continuous optimization,
proving a standard theorem which later plays a central motivational role.
Seasoned researchers will likely find parts of this chapter to be overly detailed, espe-
cially the proofs for results which are essentially only multivariate calculus. The researchers
might wonder why standard calculus results are not employed for the sake of brevity. This
chapter is aimed at new Ph.D. students rather than seasoned researchers. The development is
meant to teach students to think coordinate-free, quite contrary to how multivariate calculus
is customarily taught; the development is in the spirit of functional analysis. By patiently
learning to think coordinate-free in Chapter 1, students are in a strong position to quickly
grasp the essentials of the ipm theory in Chapter 2. Beginning with Chapter 2, proofs are
more concise and written at a level appropriate to research.

1.1 Linear Algebra


We let $\langle\ ,\ \rangle$ denote an arbitrary inner product on $\mathbb{R}^n$. In later sections, $\langle\ ,\ \rangle$ will act as a
reference inner product, an inner product from which other inner products are constructed.
For the ipm theory, it happens that the reference inner product $\langle\ ,\ \rangle$ is irrelevant; the inner
products essential to the theory are independent of the reference inner product. To a large
extent, the reference inner product will serve only to fix notation.
Although the particular reference inner product will prove to be irrelevant for ipm
theory, for many optimization problems to be solved by ipm's there are natural reference
inner products. For example, in linear programming (LP) where vectors $x$ are expressed
coordinatewise and "$x \ge 0$" means each coordinate is nonnegative, the natural inner product
is the Euclidean inner product, which we refer to as the dot product, writing $x_1 \cdot x_2$. Similarly,
in semidefinite programming (SDP) where the relevant vector space is $\mathbb{S}^{n \times n}$ (the space of
symmetric $n \times n$ real matrices $X$) and "$X \succeq 0$" means $X$ is positive semidefinite (i.e., has
no negative eigenvalues), the natural inner product is the trace product,

$$X_1 \circ X_2 := \operatorname{trace}(X_1 X_2).$$

Thus, $X_1 \circ X_2$ equals the sum of the eigenvalues of the matrix $X_1 X_2$.$^1$

Throughout the general development, we use $\mathbb{R}^n$ to denote an arbitrary finite-dimensional
real vector space with an inner product, be it $\mathbb{S}^{n \times n}$ or whatever. (The general
development can be done in the setting of arbitrary real Hilbert spaces, even infinite-dimensional
ones. We restrict ourselves to $\mathbb{R}^n$ to reach a wide audience.)
The inner product $\langle\ ,\ \rangle$ induces a norm on $\mathbb{R}^n$,

$$\|x\| := \langle x, x \rangle^{1/2}.$$

Perhaps the most useful relation between the inner product and the norm is the Cauchy-
Schwarz inequality,

$$|\langle x_1, x_2 \rangle| \le \|x_1\| \, \|x_2\|,$$

with equality iff $x_1$ and $x_2$ are colinear.


Whereas the dot product gives rise to the Euclidean norm, the norm arising from the
trace product is known as the Frobenius norm (or the Hilbert-Schmidt norm). The trace
product, and hence the Frobenius norm, can be extended to the vector space of all real $n \times n$
matrices by defining $M_1 \circ M_2 := \operatorname{trace}(M_1^T M_2)$ and $\|M\| := (\sum_{i,j} m_{ij}^2)^{1/2}$ where $m_{ij}$ are the
coefficients of $M$.
$^1$If one catenates the rows of $n \times n$ symmetric matrices to form vectors in $\mathbb{R}^{n^2}$, the trace product of two matrices
is identical to the dot product of their resulting vectors. This is not to say that $\mathbb{S}^{n \times n}$ endowed with the trace product
is no different than $\mathbb{R}^{n^2}$ endowed with the dot product, since the vectors obtained by catenating rows of symmetric
matrices form a proper and highly structured subspace of $\mathbb{R}^{n^2}$. Rather than thinking of the dot product when
considering the trace product, it is best to think of eigenvalues: $X_1 \circ X_2$ is the sum of the eigenvalues of the matrix
$X_1 X_2$.
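
As a numerical sanity check on the trace product (not part of the original development), the following NumPy sketch compares the three descriptions given in this section: the trace of the product, the sum of the eigenvalues of the product, and the dot product of the catenated rows. It also checks the submultiplicative inequality $\|XS\| \le \|X\| \, \|S\|$ for the Frobenius norm. The matrices and the helper name `frobenius` are arbitrary illustrative choices.

```python
import numpy as np

# Two symmetric 2x2 matrices (arbitrary illustrative values).
X1 = np.array([[2.0, 1.0], [1.0, 3.0]])
X2 = np.array([[1.0, -1.0], [-1.0, 4.0]])

# Trace product X1 o X2 := trace(X1 X2).
trace_product = np.trace(X1 @ X2)

# Sum of the eigenvalues of the (generally nonsymmetric) product X1 X2.
eig_sum = np.sum(np.linalg.eigvals(X1 @ X2)).real

# Dot product of the vectors obtained by catenating rows.
flat_dot = X1.ravel() @ X2.ravel()

def frobenius(M):
    """Norm induced by the extended trace product: trace(M^T M)^(1/2)."""
    return np.sqrt(np.trace(M.T @ M))

print(trace_product, eig_sum, flat_dot)
print(frobenius(X1 @ X2) <= frobenius(X1) * frobenius(X2))
```

The three printed numbers should agree up to rounding, and the final line should print `True`.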
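
We remark that the Frobenius norm is submultiplicative, meaning $\|XS\| \le \|X\| \, \|S\|$,
as is seen from the relations

$$\|XS\|^2 = \sum_{i,j} \Big( \sum_k x_{ik} s_{kj} \Big)^2 \le \sum_{i,j} \Big( \sum_k x_{ik}^2 \Big) \Big( \sum_k s_{kj}^2 \Big) = \|X\|^2 \, \|S\|^2,$$

where the inequality is Cauchy-Schwarz applied to each row of $X$ and each column of $S$.
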
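Recall that vectors $x_1, x_2 \in \mathbb{R}^n$ are said to be orthogonal if $\langle x_1, x_2 \rangle = 0$. Recall that
a basis $v_1, \ldots, v_n$ for $\mathbb{R}^n$ is said to be orthonormal if

$$\langle v_i, v_j \rangle = \delta_{ij} \quad \text{for all } i, j,$$

where $\delta_{ij}$ is the Kronecker delta. A linear operator (i.e., a linear transformation) $Q : \mathbb{R}^n \to \mathbb{R}^n$
is said to be orthogonal if

$$\langle Qx, Qy \rangle = \langle x, y \rangle \quad \text{for all } x, y \in \mathbb{R}^n.$$
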
If, given an inner product $\langle\ ,\ \rangle$ on $\mathbb{R}^n$, one adopts a coordinate system obtained by
expressing vectors as linear combinations of an orthonormal basis, the inner product $\langle\ ,\ \rangle$
is the dot product for that coordinate system. Consequently, one can consider the results we
review below as following from the special case of the dot product, but one should keep in
mind that thinking in terms of coordinates is best avoided for understanding the ipm theory.
If both $\mathbb{R}^n$ and $\mathbb{R}^m$ are endowed with inner products and $A : \mathbb{R}^n \to \mathbb{R}^m$ is a linear
operator, there exists a unique linear operator $A^* : \mathbb{R}^m \to \mathbb{R}^n$ satisfying

$$\langle Ax, y \rangle = \langle x, A^* y \rangle \quad \text{for all } x \in \mathbb{R}^n, \ y \in \mathbb{R}^m.$$

The operator $A^*$ is the adjoint of $A$. The range space of $A^*$ is orthogonal to the nullspace
of $A$.
If $A$ is surjective, then $A^*$ is injective and the linear operator $A^*(AA^*)^{-1}A$ projects
$\mathbb{R}^n$ orthogonally onto the range space of $A^*$; that is, the image of $x$ is the point in the range
space closest to $x$. Likewise, $I - A^*(AA^*)^{-1}A$ projects $\mathbb{R}^n$ orthogonally onto the nullspace
of $A$.
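
The two projection formulas can be checked numerically in the dot-product setting, where the adjoint of a matrix is its transpose. The sketch below is illustrative only: the surjective matrix `A` and the test vector `x` are arbitrary choices, and `np.linalg.solve` is used in place of an explicit inverse of $AA^*$.

```python
import numpy as np

# A surjective (full row rank) operator from R^3 to R^2; values are illustrative.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

# A^T (A A^T)^{-1} A: orthogonal projection onto the range space of A^T.
P_range = A.T @ np.linalg.solve(A @ A.T, A)

# I - A^T (A A^T)^{-1} A: orthogonal projection onto the nullspace of A.
P_null = np.eye(3) - P_range

x = np.array([1.0, 2.0, 3.0])
x_range, x_null = P_range @ x, P_null @ x

print(np.allclose(P_range @ P_range, P_range))  # idempotent
print(np.allclose(P_range, P_range.T))          # self-adjoint
print(np.allclose(A @ x_null, 0.0))             # x_null lies in the nullspace
print(np.isclose(x_range @ x_null, 0.0))        # the two pieces are orthogonal
```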
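If both $\mathbb{R}^n$ and $\mathbb{R}^m$ are endowed with the dot product and if $A$ and $A^*$ are written as
matrices, then $A^* = A^T$, the transpose of $A$.
It is a simple but important exercise for SDP to show that if $S_1, \ldots, S_m \in \mathbb{S}^{n \times n}$ and
$A : \mathbb{S}^{n \times n} \to \mathbb{R}^m$ is the linear operator defined by

$$A(X) := (S_1 \circ X, \ldots, S_m \circ X),$$

then

$$A^* y = \sum_{i=1}^m y_i S_i,$$

assuming $\mathbb{S}^{n \times n}$ is endowed with the trace product and $\mathbb{R}^m$ is endowed with the dot product.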
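
The SDP exercise can be verified numerically. In the sketch below, which is an illustration under assumed data rather than a proof, the operator sends $X$ to the vector of trace products $S_i \circ X$, the candidate adjoint sends $y$ to $\sum_i y_i S_i$, and the defining identity of the adjoint is checked for one pair $(X, y)$. The matrices `S_list`, `X` and the vector `y` are arbitrary.

```python
import numpy as np

# Symmetric matrices S_1, ..., S_m defining the operator (illustrative values).
S_list = [np.array([[1.0, 0.0], [0.0, 0.0]]),
          np.array([[0.0, 1.0], [1.0, 0.0]]),
          np.array([[2.0, 1.0], [1.0, 3.0]])]

def A(X):
    """X -> (S_1 o X, ..., S_m o X), where o is the trace product."""
    return np.array([np.trace(S @ X) for S in S_list])

def A_adj(y):
    """Candidate adjoint: y -> sum_i y_i S_i."""
    return sum(yi * S for yi, S in zip(y, S_list))

X = np.array([[1.0, 2.0], [2.0, -1.0]])
y = np.array([0.5, -1.0, 2.0])

lhs = A(X) @ y                 # <A(X), y> with the dot product on R^m
rhs = np.trace(X @ A_adj(y))   # X o A*(y) with the trace product
print(lhs, rhs)
```

Both printed values should coincide, which is the adjoint identity for this particular pair.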

Continuing to assume $\mathbb{R}^n$ and $\mathbb{R}^m$ are endowed with inner products, and hence norms,
one obtains an induced operator norm on the vector space consisting of linear operators
$A : \mathbb{R}^n \to \mathbb{R}^m$:

$$\|A\| := \max\{ \|Ax\| : \|x\| \le 1 \}.$$

Each linear operator $A : \mathbb{R}^n \to \mathbb{R}^m$ has a singular-value decomposition. Precisely,
there exist orthonormal bases $u_1, \ldots, u_n$ and $w_1, \ldots, w_m$, as well as real numbers
$0 < \gamma_1 \le \cdots \le \gamma_r$ where $r$ is the rank of $A$, such that for all $x$,

$$Ax = \sum_{i=1}^r \gamma_i \langle u_i, x \rangle w_i.$$

(Equivalently, $A u_i = \gamma_i w_i$ for $i = 1, \ldots, r$ and $A u_i = 0$ for $i > r$.) The numbers $\gamma_i$ are
the singular values of $A$; if $r < n$, then the number 0 is also considered to be a singular
value of $A$. It is easily seen that $\|A\| = \gamma_r$. Moreover,

$$A^* y = \sum_{i=1}^r \gamma_i \langle w_i, y \rangle u_i,$$

so that the values $\gamma_i$ (and possibly 0) are also the singular values of $A^*$. It follows that
$\|A^*\| = \|A\|$.
If $\mathbb{R}^n$ and $\mathbb{R}^m$ are endowed with the dot product, the singular-value decomposition
corresponds to the fact that if $A$ is an $m \times n$ matrix, there exist orthogonal matrices $Q_m$
and $Q_n$ such that $Q_m A Q_n = \Gamma$ where $\Gamma$ is an $m \times n$ matrix with zeros everywhere except
possibly for positive numbers on the main diagonal.
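
For matrices under the dot product, these facts can be observed with `numpy.linalg.svd`. The sketch below is illustrative: the matrix `A` is arbitrary, and NumPy returns the singular values in decreasing order, whereas the text lists them in increasing order, so the operator norm is the first entry of `gammas`.

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])   # illustrative 2x3 matrix of rank 2

# A = W @ Gamma @ Ut, where the rows of Ut are the u_i and the columns of W are the w_i.
W, gammas, Ut = np.linalg.svd(A)

# Operator norms of A and of its adjoint A^T (spectral norms).
op_norm = np.linalg.norm(A, 2)
op_norm_adj = np.linalg.norm(A.T, 2)

# A u_i = gamma_i w_i for i up to the rank, and A u_i = 0 beyond the rank.
relations_hold = all(np.allclose(A @ Ut[i], gammas[i] * W[:, i]) for i in range(2))
kernel_hold = np.allclose(A @ Ut[2], 0.0)

print(gammas, op_norm, op_norm_adj, relations_hold, kernel_hold)
```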
It is not difficult to prove that a linear operator $Q : \mathbb{R}^n \to \mathbb{R}^n$ is orthogonal iff
$Q^* = Q^{-1}$. For orthogonal operators, $\|Q\| = 1$.
A linear operator $S : \mathbb{R}^n \to \mathbb{R}^n$ is said to be self-adjoint if $S = S^*$.
If $\langle\ ,\ \rangle$ is the dot product and $S$ is written as a matrix, then $S$ being self-adjoint is
equivalent to $S$ being symmetric.
It is instructive to show that for $S \in \mathbb{S}^{n \times n}$, the linear operator $A : \mathbb{S}^{n \times n} \to \mathbb{S}^{n \times n}$
defined by

$$X \mapsto SXS$$

is self-adjoint. Such operators are important in the ipm theory for SDP.
A linear operator $S : \mathbb{R}^n \to \mathbb{R}^n$ is said to be positive semidefinite (psd) if $S$ is self-
adjoint and

$$\langle x, Sx \rangle \ge 0 \quad \text{for all } x \in \mathbb{R}^n.$$

If, further, $S$ satisfies

$$\langle x, Sx \rangle > 0 \quad \text{for all } x \ne 0,$$

then $S$ is said to be positive definite (pd). Keep in mind that self-adjointness is a part of
our definitions of psd and pd operators, contrary to the definitions found in much of the
optimization literature.
Each self-adjoint linear operator $S$ has a spectral decomposition. Precisely, for each
self-adjoint linear operator $S$ there exists an orthonormal basis $v_1, \ldots, v_n$ and real numbers
$\lambda_1 \le \cdots \le \lambda_n$ such that for all $x$,

$$Sx = \sum_{i=1}^n \lambda_i \langle v_i, x \rangle v_i.$$

It is easily seen that $v_i$ is an eigenvector for $S$ with eigenvalue $\lambda_i$.

If $\langle\ ,\ \rangle$ is the dot product and $S$ is a symmetric matrix, then the spectral decomposition
corresponds to the fact that $S$ can be diagonalized using an orthogonal matrix, i.e.,

$$Q^T S Q = \Lambda,$$

where $Q$ is orthogonal and $\Lambda$ is diagonal.
The following relations are easily established:

• $S$ is psd iff $\lambda_i \ge 0$ for all $i$;
• $S$ is pd iff $\lambda_i > 0$ for all $i$;
• If $S^{-1}$ exists, then it, too, is self-adjoint and has eigenvalues $1/\lambda_i$. (In particular,
$\|S^{-1}\| = 1/\min_i |\lambda_i|$.)
The spectral decomposition for a psd operator $S$ allows one to easily prove the existence
of a psd operator $S^{1/2}$ satisfying $S = (S^{1/2})^2$; simply replace $\lambda_i$ by $\sqrt{\lambda_i}$ in the
decomposition. In turn, the uniqueness of $S^{1/2}$ can readily be proven by relying on the fact
that if $T$ is a psd operator satisfying $T^2 = S$, then the eigenvectors for $T$ are eigenvectors
for $S$. The operator $S^{1/2}$ is the square root of $S$.
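
For a symmetric matrix under the dot product, the construction of the square root can be carried out directly from the output of `numpy.linalg.eigh`. This is a minimal sketch with an arbitrary psd matrix `S`; the variable names are illustrative.

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # symmetric with eigenvalues 1 and 3, hence pd

# Spectral decomposition: columns of V are orthonormal eigenvectors v_i,
# lam holds the eigenvalues in increasing order.
lam, V = np.linalg.eigh(S)

# Rebuild S from the decomposition, then replace each lambda_i by its square root.
S_rebuilt = V @ np.diag(lam) @ V.T
S_half = V @ np.diag(np.sqrt(lam)) @ V.T

print(np.allclose(S_rebuilt, S))
print(np.allclose(S_half @ S_half, S))
print(np.all(np.linalg.eigvalsh(S_half) >= 0.0))   # S_half is itself psd
```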
Here is a crucial observation: If $S$ is pd, then $S$ defines a new inner product, namely,

$$\langle u, v \rangle_S := \langle u, Sv \rangle.$$

Every inner product on $\mathbb{R}^n$ arises in this way; that is, regardless of the initial inner product
$\langle\ ,\ \rangle$, for every other inner product there exists $S$ which is pd with respect to (w.r.t.) $\langle\ ,\ \rangle$
and for which $\langle\ ,\ \rangle_S$ is precisely the other inner product. (We do not rely on this fact.)
Let $\|\ \|_S$ denote the norm induced by $\langle\ ,\ \rangle_S$.
Assume $A^*$ is the adjoint of $A : \mathbb{R}^n \to \mathbb{R}^m$. Assuming $S$ and $T$ are pd w.r.t. the
respective inner products, if the inner product on $\mathbb{R}^n$ is replaced by $\langle\ ,\ \rangle_S$ and that on $\mathbb{R}^m$ is
replaced by $\langle\ ,\ \rangle_T$, then the adjoint of $A$ becomes $S^{-1} A^* T$, as is easily shown. Moreover,
letting $\|A\|_{S,T}$ denote the resulting operator norm, it is easily proven that

$$\|A\|_{S,T} = \|T^{1/2} A S^{-1/2}\|.$$

Similarly, since for $c \in \mathbb{R}^n$, the linear function given by $x \mapsto \langle c, x \rangle$ is identical to the
function $x \mapsto \langle S^{-1} c, x \rangle_S$, if $c$ is the objective vector of an optimization problem written
in terms of $\langle\ ,\ \rangle$ (that is, if the objective is "$\min \langle c, x \rangle$"), then when written in terms of
$\langle\ ,\ \rangle_S$, the objective vector is $S^{-1} c$. The notation used in this paragraph illustrates our
earlier assertion that the reference inner product will be useful in fixing notation as the inner
products change.
It is instructive to consider the shape of the unit ball w.r.t. $\|\ \|_S$ viewed in terms of the
geometry of the reference inner product. The spectral decomposition of $S$ easily implies
the unit ball to be an ellipsoid with axes in the directions of the orthonormal basis vectors
$v_1, \ldots, v_n$, the length of the axis in the direction of $v_i$ being $2\sqrt{1/\lambda_i}$.
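
The following sketch illustrates, with the dot product as reference inner product, how the adjoint changes when the inner products on domain and range are replaced by those defined by pd matrices, and how the unit ball of the new norm has semi-axes of length $\sqrt{1/\lambda_i}$ along the eigenvectors. All matrices, vectors and the helper name `inner` are illustrative assumptions.

```python
import numpy as np

S = np.array([[2.0, 1.0], [1.0, 2.0]])   # pd on R^2
T = np.diag([3.0, 1.0, 2.0])             # pd on R^3
A = np.array([[1.0, 2.0], [0.0, 1.0], [1.0, -1.0]])   # operator from R^2 to R^3

def inner(u, v, M):
    """Inner product <u, v>_M := <u, M v>, with the dot product as reference."""
    return u @ (M @ v)

# Adjoint of A w.r.t. <,>_S on R^2 and <,>_T on R^3: S^{-1} A^T T.
A_adj_ST = np.linalg.solve(S, A.T @ T)

x = np.array([1.0, -2.0])
y = np.array([0.5, 1.0, 2.0])
lhs = inner(A @ x, y, T)          # <A x, y>_T
rhs = inner(x, A_adj_ST @ y, S)   # <x, (S^{-1} A^T T) y>_S

# Boundary points of the unit ball of || ||_S along the eigenvectors of S.
lam, V = np.linalg.eigh(S)
axis_points = [V[:, i] / np.sqrt(lam[i]) for i in range(2)]
s_norms = [np.sqrt(inner(p, p, S)) for p in axis_points]

print(lhs, rhs, s_norms)
```

The first two printed values should agree, and each entry of `s_norms` should equal 1 up to rounding.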

1.2 Gradients
Recall that a functional is a function whose range lies in $\mathbb{R}$. We use $D_f$ to denote the
domain of a functional $f$. It will always be assumed that $D_f$ is an open subset of $\mathbb{R}^n$ in the
norm topology (recalling that all norms on $\mathbb{R}^n$ induce the same topology).

Let $\langle\ ,\ \rangle$ denote an arbitrary inner product on $\mathbb{R}^n$ and let $\|\ \|$ denote the norm induced
by $\langle\ ,\ \rangle$.
The functional $f$ is said to be (Fréchet) differentiable at $x \in D_f$ if there exists a vector
$g(x)$ satisfying

$$\lim_{\|\Delta x\| \to 0} \frac{f(x + \Delta x) - f(x) - \langle g(x), \Delta x \rangle}{\|\Delta x\|} = 0.$$

The vector $g(x)$ is the gradient of $f$ at $x$ w.r.t. $\langle\ ,\ \rangle$.


If one chooses the inner product $\langle\ ,\ \rangle$ on $\mathbb{R}^n$ to be the dot product and expresses $g(x)$
coordinatewise, the $j$th coordinate is $\partial f/\partial x_j$. This is seen by restricting $\Delta x$ to be scalar
multiples of $e_j$, the $j$th unit vector.
For an arbitrary inner product $\langle\ ,\ \rangle$, the gradient has the same geometrical interpretation
that is taught in calculus for the dot product gradient. Roughly speaking, the gradient
$g(x)$ points in the direction for which the functional output increases the fastest per unit
distance traveled, and the magnitude $\|g(x)\|$ equals the amount the functional will change
per unit distance traveled in that direction. (Since what constitutes "unit distance" changes
as the inner product changes, it is not mysterious that the gradient changes as the inner
product changes.) We give rigor to this geometrical interpretation in §1.4.
The first-order approximation of $f$ at $x$ is the linear functional

$$y \mapsto f(x) + \langle g(x), y - x \rangle.$$

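If $f$ is differentiable at each $x \in D_f$, then $f$ is said to be differentiable. Henceforth,
assume $f$ is differentiable.
If the function $x \mapsto g(x)$ is continuous at each $x \in D_f$, then $f$ is said to be continuously
differentiable; one writes $f \in C^1$.
To illustrate the definition of gradient, consider the functional

$$f(X) := -\ln \det(X)$$

with domain $\mathbb{S}^{n \times n}_{++}$, the set of all pd matrices in $\mathbb{S}^{n \times n}$. This functional plays an especially
important role in SDP. We claim that w.r.t. the trace product,

$$g(X) = -X^{-1}.$$

Indeed, let $\Delta X \in \mathbb{S}^{n \times n}$ and denote the eigenvalues of $X^{-1}(\Delta X)$ by $\gamma_1, \ldots, \gamma_n$. Since
$\gamma_1, \ldots, \gamma_n$ are also the eigenvalues of the symmetric matrix $X^{-1/2}(\Delta X)X^{-1/2}$, the eigenvalues
are real numbers. Note that $X^{-1} \circ \Delta X = \sum_i \gamma_i$ and, for $\|\Delta X\|$ sufficiently small,

$$f(X + \Delta X) - f(X) = -\ln \det(I + X^{-1}(\Delta X)) = -\sum_i \ln(1 + \gamma_i).$$

The submultiplicativity of the Frobenius norm gives

$$\|X^{-1/2}(\Delta X)X^{-1/2}\| \le \|X^{-1/2}\|^2 \, \|\Delta X\| \quad \text{and} \quad \|\Delta X\| \le \|X^{1/2}\|^2 \, \|X^{-1/2}(\Delta X)X^{-1/2}\|. \tag{1.1}$$

Since $\|X^{-1/2}(\Delta X)X^{-1/2}\| = \|\gamma\|$ where $\|\gamma\|$ is the Euclidean norm of the vector $\gamma := (\gamma_1, \ldots, \gamma_n)$,
the inequalities in (1.1) show the statement "$\|\Delta X\| \to 0$" is equivalent to "$\|\gamma\| \to 0$."
Finally, to establish the claim $g(X) = -X^{-1}$, observe that

$$\lim_{\|\Delta X\| \to 0} \frac{f(X + \Delta X) - f(X) - (-X^{-1}) \circ \Delta X}{\|\Delta X\|} = \lim_{\|\gamma\| \to 0} \frac{\sum_i (\gamma_i - \ln(1 + \gamma_i))}{\|\Delta X\|} = 0,$$

the final equality because $\gamma_i - \ln(1 + \gamma_i)$ is of order $\gamma_i^2$ while, by (1.1), $\|\Delta X\| \ge \|\gamma\| / \|X^{-1/2}\|^2$.
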
Our definition of what it means for a functional / to be differentiable is phrased in


terms of the inner product ( , ). However, relying on the equivalence of all norm topologies
on R", it is readily proven that the property of being differentiable—and being continuously
differentiable—is independent of the inner product. The gradient depends on the inner
product, but differentiability does not.
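
The claim that the gradient of $f(X) = -\ln \det(X)$ w.r.t. the trace product is $-X^{-1}$ can be checked by finite differences. In the sketch below, an illustration rather than a proof, the directional derivative of $f$ at $X$ along a symmetric direction is approximated by a central difference and compared with the trace product of $-X^{-1}$ with that direction. The matrices `X`, `dX` and the step `t` are arbitrary choices.

```python
import numpy as np

def f(X):
    """f(X) = -ln det(X) for a pd matrix X, computed stably via slogdet."""
    sign, logabsdet = np.linalg.slogdet(X)
    return -logabsdet

X = np.array([[2.0, 0.5], [0.5, 1.0]])     # pd matrix
dX = np.array([[1.0, -1.0], [-1.0, 0.5]])  # symmetric direction

g = -np.linalg.inv(X)                      # claimed gradient w.r.t. the trace product

t = 1e-6
finite_difference = (f(X + t * dX) - f(X - t * dX)) / (2.0 * t)
predicted = np.trace(g @ dX)               # g(X) o dX

print(finite_difference, predicted)
```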
The following theorem shows how the gradient changes as the inner product changes.

Theorem 1.2.1. If $S$ is pd and $f$ is differentiable at $x$, then the gradient of $f$ at $x$ w.r.t.
$\langle\ ,\ \rangle_S$ is $S^{-1} g(x)$.

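Proof. Letting $\lambda_1$ denote the least eigenvalue of $S$ and $\lambda_n$ the greatest, the proof relies on
the relations

$$\sqrt{\lambda_1} \, \|v\| \le \|v\|_S \le \sqrt{\lambda_n} \, \|v\| \quad \text{for all } v,$$

as follow easily from the spectral decomposition of $S$.

To prove that $S^{-1} g(x)$ is the gradient of $f$ at $x$ w.r.t. $\langle\ ,\ \rangle_S$, we wish to show

$$\lim_{\|\Delta x\|_S \to 0} \frac{f(x + \Delta x) - f(x) - \langle S^{-1} g(x), \Delta x \rangle_S}{\|\Delta x\|_S} = 0.$$

However, noting that for all $v$,

$$\langle S^{-1} g(x), v \rangle_S = \langle g(x), v \rangle,$$

we have

$$\limsup_{\|\Delta x\|_S \to 0} \frac{|f(x + \Delta x) - f(x) - \langle S^{-1} g(x), \Delta x \rangle_S|}{\|\Delta x\|_S} \le \frac{1}{\sqrt{\lambda_1}} \lim_{\|\Delta x\| \to 0} \frac{|f(x + \Delta x) - f(x) - \langle g(x), \Delta x \rangle|}{\|\Delta x\|} = 0,$$

completing the proof.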
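
Theorem 1.2.1 can be illustrated numerically together with the steepest-ascent interpretation of the gradient. In the sketch below (assumed test functional `f`, assumed pd matrix `S`, dot product as reference), directions of unit length in the norm $\|\ \|_S$ are sampled on a fine grid, and the largest rate of increase of $f$ is compared with $\|S^{-1} g(x)\|_S$. The grid search is only an approximation of the maximum.

```python
import numpy as np

S = np.array([[4.0, 1.0], [1.0, 1.0]])   # pd w.r.t. the dot product

def grad_dot(x):
    """Dot-product gradient of the illustrative f(x) = x0^2 + 3 x0 x1 + exp(x1)."""
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0] + np.exp(x[1])])

def norm_S(v):
    return np.sqrt(v @ S @ v)

x = np.array([1.0, 0.5])
g = grad_dot(x)
g_S = np.linalg.solve(S, g)              # gradient w.r.t. <,>_S, per Theorem 1.2.1

# Sample directions and rescale each one to unit length in the S-norm.
angles = np.linspace(0.0, 2.0 * np.pi, 3601)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
scale = np.sqrt(np.einsum('ij,jk,ik->i', dirs, S, dirs))
dirs = dirs / scale[:, None]

rates = dirs @ g                         # initial rate of change <g(x), v> along each v
best_rate = rates.max()
best_dir = dirs[np.argmax(rates)]

print(best_rate, norm_S(g_S))
print(best_dir, g_S / norm_S(g_S))
```

Up to the grid resolution, the best sampled rate matches $\|S^{-1} g(x)\|_S$ and the best direction matches the normalized vector $S^{-1} g(x)$, which generally differs from the direction of the dot-product gradient.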


Theorem 1.2.1 has the unsurprising consequence that the first-order approximation of
$f$ at $x$ is independent of the inner product:

$$f(x) + \langle g(x), y - x \rangle = f(x) + \langle S^{-1} g(x), y - x \rangle_S.$$

Finally, we make a few observations that will be important for applying ipm theory
to optimization problems having linear equations among the constraints.
Recall that we use $\mathbb{R}^n$ to denote a generic finite-dimensional vector space with an inner
product, and hence our definition of the gradient applies to subspaces, too. In particular, if
$f$ is defined on a vector space and the domain of $f$ intersects a subspace $L$, one can speak
of the gradient of the function $f|_L$ obtained by restricting $f$ to $L$. Substituting $L$ for $\mathbb{R}^n$ in
the definition of the gradient, we see that the gradient of $f|_L$ at $x \in D_f \cap L$ is the vector
$g|_L(x) \in L$ satisfying

$$\lim_{\substack{\|\Delta x\| \to 0 \\ \Delta x \in L}} \frac{f(x + \Delta x) - f(x) - \langle g|_L(x), \Delta x \rangle}{\|\Delta x\|} = 0.$$

If $D_f$ is an open subset of $\mathbb{R}^n$ and $D_f \cap L \ne \emptyset$ where $L$ is a subspace, then for
$x \in D_f \cap L$ we can refer both to the gradient $g(x) \in \mathbb{R}^n$ and to the gradient $g|_L(x) \in L$.
We claim that $g|_L(x) = P_L g(x)$ where $P_L$ is the operator projecting $\mathbb{R}^n$ orthogonally onto
$L$. Indeed, it is readily proven that $P_L$ is self-adjoint (e.g., use the facts that $P_L u = u$ if
$u \in L$, $P_L v = 0$ if $v \in L^\perp$, and each vector $w$ in $\mathbb{R}^n$ can be expressed $w = u + v$ where
$u \in L$, $v \in L^\perp$). Consequently, for all $\Delta x \in L$,

$$\langle g(x), \Delta x \rangle = \langle g(x), P_L \Delta x \rangle = \langle P_L g(x), \Delta x \rangle,$$

and hence $g|_L(x) = P_L g(x)$ as claimed.


Similarly, if $L'$ is a translate of a subspace $L$ and $D_f \cap L' \ne \emptyset$, we can consider
the restricted functional $f|_{L'}$. As the gradient is naturally defined for subspaces rather than
for affine spaces like $L'$, it is sometimes expedient to instead consider the functional with
domain $L$ defined by $v \mapsto f|_{L'}(v + x)$, where $x$ is a fixed point in $L'$. In this way, facts
about gradients can be applied to analyzing $f|_{L'}$. We sometimes state results for functionals
restricted to affine spaces while giving only proofs for subspaces, leaving the remaining
details to the reader.

1.3 Hessians
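The functional $f$ is said to be twice differentiable at $x \in D_f$ if $f \in C^1$ and there exists a
linear operator $H(x) : \mathbb{R}^n \to \mathbb{R}^n$ satisfying

$$\lim_{\|\Delta x\| \to 0} \frac{\|g(x + \Delta x) - g(x) - H(x) \Delta x\|}{\|\Delta x\|} = 0.$$

If it exists, $H(x)$ is said to be the Hessian of $f$ at $x$ w.r.t. $\langle\ ,\ \rangle$.

If $\langle\ ,\ \rangle$ is the dot product and $H(x)$ is written as a matrix, the $(i, j)$ entry of $H(x)$ is

$$\frac{\partial^2 f}{\partial x_i \, \partial x_j}.$$

The second-order approximation of $f$ at $x$ is the quadratic functional

$$y \mapsto f(x) + \langle g(x), y - x \rangle + \tfrac{1}{2} \langle y - x, H(x)(y - x) \rangle.$$
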
If $f$ is twice differentiable at each $x \in D_f$, then $f$ is said to be twice differentiable.
Henceforth, assume $f$ is twice differentiable.
If the function $x \mapsto H(x)$ is continuous at $x$ (w.r.t. the operator-norm topology or,
equivalently, any norm topology on the vector space of linear operators from $\mathbb{R}^n$ to $\mathbb{R}^n$),
then $H(x)$ is self-adjoint. If the function $x \mapsto H(x)$ is continuous at each $x \in D_f$, then $f$
is said to be twice continuously differentiable and one writes $f \in C^2$.
The assumption of twice continuous differentiability, as opposed to mere twice differentiability,
is often made in optimization to guarantee self-adjointness of the Hessian. In
this book, functionals are always assumed to be twice continuously differentiable.
If the inner product is the dot product and the Hessian is expressed as a matrix,
self-adjointness is equivalent to the matrix being symmetric, that is,

$$\frac{\partial^2 f}{\partial x_i \, \partial x_j} = \frac{\partial^2 f}{\partial x_j \, \partial x_i}.$$

(If the Hessian matrix does not vary continuously in x, the order in which the partials are
taken can matter, resulting in a nonsymmetric matrix.)
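To illustrate the definition of the Hessian, we again consider the functional

$$f(X) := -\ln \det(X)$$

with domain $\mathbb{S}^{n \times n}_{++}$, the set of all pd matrices in $\mathbb{S}^{n \times n}$. We saw that $g(X) = -X^{-1}$. We claim
that $H(X)$ is the linear operator given by

$$\Delta X \mapsto X^{-1}(\Delta X)X^{-1}.$$

This can be proven by relying on the fact that if $\|\Delta X\|$ is sufficiently small, then

$$(X + \Delta X)^{-1} = X^{-1} \sum_{k=0}^{\infty} \big( -(\Delta X) X^{-1} \big)^k,$$

and hence

$$g(X + \Delta X) - g(X) - X^{-1}(\Delta X)X^{-1} = -X^{-1} \sum_{k=2}^{\infty} \big( -(\Delta X) X^{-1} \big)^k.$$

For then, from the submultiplicativity of the Frobenius norm,

$$\limsup_{\|\Delta X\| \to 0} \frac{\|g(X + \Delta X) - g(X) - X^{-1}(\Delta X)X^{-1}\|}{\|\Delta X\|} \le \lim_{\|\Delta X\| \to 0} \|X^{-1}\| \sum_{k=2}^{\infty} \|\Delta X\|^{k-1} \|X^{-1}\|^k = 0,$$

establishing that the Hessian is as claimed.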


Since $f(X) := -\ln \det(X)$ is twice continuously differentiable (in fact, analytic),
the Hessian $H(X)$ is self-adjoint and hence has a spectral decomposition, i.e., has an orthonormal
basis of eigenvectors, where "orthonormal" is w.r.t. the trace product. Indeed,
letting $v_1, \ldots, v_n$ be an orthonormal basis of eigenvectors for the symmetric matrix $X$ (here,
"orthonormal" refers to the dot product on $\mathbb{R}^n$), and letting $\lambda_1, \ldots, \lambda_n$ be the eigenvalues, it
is readily verified that an orthonormal basis of eigenvectors for the operator $H(X)$ is given
by the set of matrices $\{M_{ij} : 1 \le i \le j \le n\}$ where

$$M_{ij} := \begin{cases} v_i v_i^T & \text{if } i = j, \\ \tfrac{1}{\sqrt{2}} \big( v_i v_j^T + v_j v_i^T \big) & \text{if } i < j. \end{cases}$$

Moreover, it is readily verified that the eigenvalue for $M_{ij}$ is $1/(\lambda_i \lambda_j)$, since $X^{-1} v_i = v_i/\lambda_i$.
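
Both the Hessian formula for $f(X) = -\ln \det(X)$ and its eigen-structure can be checked numerically. The sketch below, built on arbitrary illustrative matrices, compares a central difference of the gradient $-X^{-1}$ with $X^{-1}(\Delta X)X^{-1}$, and then applies the Hessian to a matrix of the form $(v_1 v_2^T + v_2 v_1^T)/\sqrt{2}$ built from eigenvectors of $X$. By direct computation, the result is that matrix divided by the product of the two corresponding eigenvalues of $X$.

```python
import numpy as np

X = np.array([[2.0, 0.5], [0.5, 1.0]])     # pd matrix
X_inv = np.linalg.inv(X)

def grad(Y):
    """Gradient of -ln det w.r.t. the trace product."""
    return -np.linalg.inv(Y)

def hess_apply(dX):
    """Claimed Hessian at X applied to dX: X^{-1} dX X^{-1}."""
    return X_inv @ dX @ X_inv

dX = np.array([[1.0, -1.0], [-1.0, 0.5]])
t = 1e-6
finite_difference = (grad(X + t * dX) - grad(X - t * dX)) / (2.0 * t)

# Eigen-matrix built from orthonormal eigenvectors v1, v2 of X.
lam, V = np.linalg.eigh(X)
v1, v2 = V[:, 0], V[:, 1]
M12 = (np.outer(v1, v2) + np.outer(v2, v1)) / np.sqrt(2.0)

print(np.allclose(finite_difference, hess_apply(dX), atol=1e-6))
print(np.allclose(hess_apply(M12), M12 / (lam[0] * lam[1])))
print(np.isclose(np.trace(M12 @ M12), 1.0))   # unit length w.r.t. the trace product
```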


The property of being twice continuously differentiable does not depend on the inner
product, whereas the Hessian most certainly does depend on the inner product. The depen-
dence of the Hessian on the inner product is made explicit in the following theorem. The
proof of the theorem is similar to the proof of Theorem 1.2.1 and hence is left to the reader.

Theorem 1.3.1. If $S$ is pd and $f$ is twice differentiable at $x$, then the Hessian of $f$ at $x$ w.r.t.
$\langle\ ,\ \rangle_S$ is $S^{-1} H(x)$.

The lack of symmetry in the expression $S^{-1} H(x)$ might puzzle readers who are
thinking of Hessians as being symmetric matrices. We remark that if one chose a basis for
$\mathbb{R}^n$ which is orthonormal w.r.t. $\langle\ ,\ \rangle_S$ and expressed the operator $S^{-1} H(x)$ as a matrix
in terms of the resulting coordinate system, then the matrix would be symmetric. (Again
we emphasize that in developing ipm theory, it is best not to think in terms of coordinate
systems.)
Theorems 1.2.1 and 1.3.1 have the unsurprising consequence that the second-order
approximation of $f$ at $x$ is independent of the inner product:

$$f(x) + \langle g(x), y - x \rangle + \tfrac{1}{2} \langle y - x, H(x)(y - x) \rangle = f(x) + \langle S^{-1} g(x), y - x \rangle_S + \tfrac{1}{2} \langle y - x, S^{-1} H(x)(y - x) \rangle_S.$$

Finally, we make an observation regarding Hessians and subspaces. It is straightforward
to prove that if $L$ is a subspace of $\mathbb{R}^n$ and $f$ satisfies $D_f \cap L \ne \emptyset$, then the Hessian of
$f|_L$ at $x \in L$ (an operator from $L$ to $L$) is given by $H|_L(x) = P_L H(x)$ $(= P_L H(x) P_L)$
where $P_L$ is the operator which orthogonally projects $\mathbb{R}^n$ onto $L$. That is, when one applies
$H|_L(x)$ to a vector $v \in L$, one obtains the vector $P_L H(x) v$.

1.4 Convexity
Recall that a set $S \subseteq \mathbb{R}^n$ is said to be convex if whenever $x, y \in S$ and $0 \le t \le 1$ we have
$x + t(y - x) \in S$.
Recall that a functional $f$ is said to be convex if $D_f$ is convex and if whenever
$x, y \in D_f$ and $0 \le t \le 1$, we have

$$f(x + t(y - x)) \le f(x) + t(f(y) - f(x)).$$

If the inequality is strict whenever $0 < t < 1$ and $x \ne y$, then $f$ is said to be strictly convex.
The minimizers of a convex functional form a convex set. A strictly convex functional
has at most one minimizer.
Henceforth, we assume $f \in C^2$ and we assume $D_f$ is an open, convex set.
If $f$ is a univariate functional, we know from calculus that $f$ is convex iff $f''(x) \ge 0$
for all $x \in D_f$. Similarly, if $f''(x) > 0$ for all $x \in D_f$, then $f$ is strictly convex. The
following standard theorem generalizes these facts.

Theorem 1.4.1. The functional $f$ is convex iff $H(x)$ is psd for all $x \in D_f$. If $H(x)$ is pd
for all $x \in D_f$, then $f$ is strictly convex.

The following elementary proposition, which is relied on in the proof of Theorem 1.4.1,
is fundamental in this book. It does not assume convexity of f .

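Proposition 1.4.2. Assume $x, y \in D_f$ and define a univariate functional $\phi : [0, 1] \to \mathbb{R}$ by

$$\phi(t) := f(x + t(y - x)).$$

Then

$$\phi'(t) = \langle g(x + t(y - x)), y - x \rangle$$

and

$$\phi''(t) = \langle y - x, H(x + t(y - x))(y - x) \rangle.$$

Proof. Fix $t$ and let $u = x + t(y - x)$. We wish to prove

$$\phi'(t) = \langle g(u), y - x \rangle \quad \text{and} \quad \phi''(t) = \langle y - x, H(u)(y - x) \rangle.$$

To prove $\phi'(t) = \langle g(u), y - x \rangle$ it suffices to show

$$\lim_{s \to 0} \frac{\phi(t + s) - \phi(t) - s \langle g(u), y - x \rangle}{s} = 0.$$

However, noting $\phi(t) = f(u)$ and $\phi(t + s) = f(u + s(y - x))$, we have

$$\lim_{s \to 0} \frac{|\phi(t + s) - \phi(t) - s \langle g(u), y - x \rangle|}{|s|} = \|y - x\| \lim_{s \to 0} \frac{|f(u + s(y - x)) - f(u) - \langle g(u), s(y - x) \rangle|}{\|s(y - x)\|} = 0,$$

the final equality by definition of $g(u)$.

Similarly, to prove $\phi''(t) = \langle y - x, H(u)(y - x) \rangle$, it suffices to show

$$\lim_{s \to 0} \frac{\phi'(t + s) - \phi'(t) - s \langle y - x, H(u)(y - x) \rangle}{s} = 0.$$

However, since we now know

$$\phi'(t + s) = \langle g(u + s(y - x)), y - x \rangle \quad \text{and} \quad \phi'(t) = \langle g(u), y - x \rangle,$$

we have

$$\lim_{s \to 0} \frac{|\phi'(t + s) - \phi'(t) - s \langle y - x, H(u)(y - x) \rangle|}{|s|} = \lim_{s \to 0} \frac{|\langle g(u + s(y - x)) - g(u) - H(u)(s(y - x)), y - x \rangle|}{|s|}$$

$$\le \|y - x\|^2 \lim_{s \to 0} \frac{\|g(u + s(y - x)) - g(u) - H(u)(s(y - x))\|}{\|s(y - x)\|} = 0,$$

where the inequality is by Cauchy-Schwarz and the final equality is by definition of
$H(u)$.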
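
The two formulas of Proposition 1.4.2 can be checked by finite differences on a concrete functional. The sketch below uses the dot product on $\mathbb{R}^2$ and an assumed smooth test functional `f` (not necessarily convex, in keeping with the proposition); the points `x`, `y`, the parameter `t0` and the step `h` are arbitrary.

```python
import numpy as np

def f(x):
    return x[0] ** 2 + 3.0 * x[0] * x[1] + np.exp(x[1])

def grad(x):
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0] + np.exp(x[1])])

def hess(x):
    return np.array([[2.0, 3.0], [3.0, np.exp(x[1])]])

x = np.array([1.0, 0.5])
y = np.array([-0.5, 2.0])

def phi(t):
    """Restriction of f to the segment from x to y."""
    return f(x + t * (y - x))

t0, h = 0.3, 1e-4
u = x + t0 * (y - x)

# Central finite differences for phi'(t0) and phi''(t0).
d1_fd = (phi(t0 + h) - phi(t0 - h)) / (2.0 * h)
d2_fd = (phi(t0 + h) - 2.0 * phi(t0) + phi(t0 - h)) / h ** 2

# Formulas from the proposition.
d1 = grad(u) @ (y - x)
d2 = (y - x) @ hess(u) @ (y - x)

print(d1_fd, d1)
print(d2_fd, d2)
```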
Proof of Theorem 1.4.1. We first show that if the Hessian is psd everywhere on $D_f$, then
$f$ is convex. So assume the Hessian is psd everywhere on $D_f$.
Assume $x$ and $y$ are arbitrary points in $D_f$. We wish to show that if $t$ satisfies
$0 \le t \le 1$, then

$$f(x + t(y - x)) \le f(x) + t(f(y) - f(x)). \tag{1.2}$$

Consider the univariate functional $\phi$ defined by

$$\phi(t) := f(x + t(y - x)).$$

Observe that (1.2) is equivalent to

$$\phi(t) \le \phi(0) + t(\phi(1) - \phi(0)),$$

an inequality that is certainly valid if $\phi$ is convex on the interval $[0, 1]$. Hence to prove (1.2)
it suffices to prove $\phi''(t) \ge 0$ for all $0 \le t \le 1$. However, Proposition 1.4.2 implies

$$\phi''(t) = \langle y - x, H(x + t(y - x))(y - x) \rangle \ge 0,$$

the inequality because $H(x + t(y - x))$ is psd.

The proof that $f$ is strictly convex if the Hessian is everywhere pd on $D_f$ is similar
and hence is left to the reader.
To conclude the proof, it suffices to show that if $H(x)$ is not psd for some $x$, then $f$
is not convex. If $H(x)$ is not psd, then $H(x)$ has an eigenvector $v$ with negative eigenvalue
$\lambda$. To show $f$ is not convex, it suffices to show that the functional $\phi(t) := f(x + tv)$ is not
convex. To show $\phi$ is not convex, it suffices to show $\phi''(0) < 0$. This is straightforward,
again relying on Proposition 1.4.2.
It was asserted earlier that, roughly speaking, the gradient $g(x)$ points in the direction
for which the functional output increases the fastest per unit distance traveled, and the
magnitude $\|g(x)\|$ equals the amount the functional will change per unit distance traveled in
that direction. Proposition 1.4.2 provides the means to make this rigorous. Indeed, choose
an arbitrary direction $v$ of unit length. The initial rate of change in the output of $f$ as one
moves from $x$ to $x + v$ in unit time is given by $\phi'(0)$ where
\[ \phi(t) := f(x + tv). \]
Note that Proposition 1.4.2 implies
\[ \phi'(0) = \langle g(x), v \rangle \tag{1.3} \]
and hence, by Cauchy-Schwarz and $\|v\| = 1$, we have
\[ |\phi'(0)| \le \|g(x)\|. \]
So the initial rate of change cannot exceed $\|g(x)\|$ in magnitude, regardless of which
direction $v$ of unit length is chosen. However, assuming $g(x) \ne 0$, if one chooses the direction
\[ v := \frac{g(x)}{\|g(x)\|}, \]
then (1.3) implies $\phi'(0) = \|g(x)\|$.
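The claim can be checked numerically. The following is a minimal sketch, assuming the dot product as the inner product and a hypothetical quadratic functional chosen only for illustration; it samples unit directions, confirms that no initial rate $\langle g(x), v\rangle$ exceeds $\|g(x)\|$ in magnitude, and confirms that the normalized gradient attains the bound.

```python
import math

def grad(x):
    """Gradient (w.r.t. the dot product) of the illustrative functional
    f(x) = x0^2 + 3*x0*x1 + 2*x1^2 (an assumption, not from the text)."""
    return (2.0 * x[0] + 3.0 * x[1], 3.0 * x[0] + 4.0 * x[1])

def initial_rate(x, v):
    """phi'(0) = <g(x), v> for phi(t) = f(x + t*v)."""
    g = grad(x)
    return g[0] * v[0] + g[1] * v[1]

x = (1.0, 2.0)
g = grad(x)
norm_g = math.hypot(g[0], g[1])

# Initial rates along 360 unit directions spaced one degree apart.
rates = [
    initial_rate(x, (math.cos(math.radians(k)), math.sin(math.radians(k))))
    for k in range(360)
]

# Initial rate along the normalized gradient direction v = g(x)/||g(x)||.
best = initial_rate(x, (g[0] / norm_g, g[1] / norm_g))

print(max(abs(r) for r in rates) <= norm_g + 1e-12)
print(abs(best - norm_g) < 1e-12)
```

Changing the inner product would change both the gradient and the notion of unit length, so the sampled directions here are specific to the dot product.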


We mention that a point $z$ minimizes a convex functional $f$ iff $g(z) = 0$. (For the "if,"
assume $g(z) = 0$. For $y \in D_f$, consider the univariate functional $\phi(t) := f(z + t(y - z))$.
By Proposition 1.4.2, $\phi'(0) = 0$. Since $\phi$ is convex, we know from univariate calculus that
$t = 0$ minimizes $\phi$. In particular, $\phi(0) \le \phi(1)$, that is, $f(z) \le f(y)$. For the "only if," assume
$g(z) \ne 0$ and consider $y := z - t\,g(z)$ for small $t > 0$.)
As with the two preceding sections, we close this one with a discussion of subspaces.
If $L$ is a subspace of $\mathbb{R}^n$ and $D_f \cap L \ne \emptyset$, we know the gradient of $f|_L$ to be $P_L g$,
the orthogonal projection of $g$ onto $L$. Thus, $z \in L$ solves the linearly constrained convex
optimization problem
\[ \min_x \ f(x) \quad \text{s.t.} \quad x \in L \]
iff $P_L g(z) = 0$, that is, iff $g(z)$ is orthogonal to $L$. In particular, if $L$ is the nullspace of a
linear operator $A$, then $z$ solves the optimization problem iff $g(z) = A^* y$ for some $y$. The
same holds when $L$ is replaced by a translate of $L$, that is, when $L$ is replaced by an affine
space.

Theorem 1.4.3. If $f$ is convex and $A$ is a linear operator, then $z \in D_f$ solves the linearly
constrained optimization problem
\[ \min_x \ f(x) \quad \text{s.t.} \quad Ax = b \]
iff $Az = b$ and $g(z) = A^* y$ for some $y$.
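As an illustration of the optimality conditions in Theorem 1.4.3, here is a minimal sketch under stated assumptions: the functional $f(x) = \tfrac12\|x\|^2$, the one-row operator `A`, and the right-hand side `b` are hypothetical data, and the dot product is the inner product so that $A^* = A^T$. The sketch builds $z$ from the two conditions and then checks that feasible moves do not decrease $f$.

```python
def f(x):
    # Illustrative convex functional f(x) = 0.5 * ||x||^2 (an assumption).
    return 0.5 * sum(xj * xj for xj in x)

def grad(x):
    # Its gradient w.r.t. the dot product is x itself.
    return list(x)

A = [1.0, 2.0, 2.0]   # one-row operator A : R^3 -> R (hypothetical data)
b = 9.0

# Look for z of the form A^T y; then g(z) = z = A^T y holds automatically,
# and y is fixed by the constraint A z = b.
y = b / sum(a * a for a in A)
z = [a * y for a in A]

feasibility_gap = abs(sum(a * zj for a, zj in zip(A, z)) - b)
stationarity_gap = max(abs(gj - a * y) for gj, a in zip(grad(z), A))

# Directions spanning the nullspace of A: moving along them keeps A x = b.
null_dirs = [(2.0, -1.0, 0.0), (0.0, 1.0, -1.0)]
no_feasible_descent = all(
    f([zj + t * dj for zj, dj in zip(z, d)]) >= f(z)
    for d in null_dirs
    for t in (-1.0, -0.1, 0.1, 1.0)
)
print(z, feasibility_gap, stationarity_gap, no_feasible_descent)
```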

1.5 Fundamental Theorems of Calculus


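We continue to assume $f \in C^2$ and $D_f$ is an open, convex set.
The following theorem generalizes the fundamental theorem of calculus.

Theorem 1.5.1. If $x, y \in D_f$, then
\[ f(y) - f(x) = \int_0^1 \langle g(x + t(y - x)), y - x \rangle \, dt. \]

Proof. Consider the univariate functional $\phi(t) := f(x + t(y - x))$. The fundamental
theorem of calculus asserts
\[ \phi(1) - \phi(0) = \int_0^1 \phi'(t) \, dt. \]
Since $\phi(1) = f(y)$, $\phi(0) = f(x)$ and, by Proposition 1.4.2,
\[ \phi'(t) = \langle g(x + t(y - x)), y - x \rangle, \]
the proof is complete.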


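In a similar vein, we have the following proposition.

Proposition 1.5.2. If $x, y \in D_f$, then
\[ f(y) = f(x) + \langle g(x), y - x \rangle + \int_0^1 \langle g(x + t(y - x)) - g(x), y - x \rangle \, dt \tag{1.4} \]
and
\[ f(y) = f(x) + \langle g(x), y - x \rangle + \tfrac12 \langle y - x, H(x)(y - x) \rangle
   + \int_0^1 \!\! \int_0^t \langle y - x, [H(x + s(y - x)) - H(x)](y - x) \rangle \, ds \, dt. \tag{1.5} \]

Proof. Again considering the univariate functional $\phi(t) := f(x + t(y - x))$, the fundamental
theorem of calculus implies
\[ \phi(1) = \phi(0) + \phi'(0) + \int_0^1 \big(\phi'(t) - \phi'(0)\big) \, dt \tag{1.6} \]
and
\[ \phi(1) = \phi(0) + \phi'(0) + \tfrac12 \phi''(0) + \int_0^1 \!\! \int_0^t \big(\phi''(s) - \phi''(0)\big) \, ds \, dt. \tag{1.7} \]
Using Proposition 1.4.2 to make the obvious substitutions, (1.6) yields (1.4), whereas (1.7)
yields (1.5).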
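Proposition 1.5.2 provides the means to bound the error in the first- and second-order
approximations of $f$.

Corollary 1.5.3. If $x, y \in D_f$, then
\[ |f(y) - f(x) - \langle g(x), y - x \rangle| \le \|y - x\| \int_0^1 \|g(x + t(y - x)) - g(x)\| \, dt \]
and
\[ \big|f(y) - f(x) - \langle g(x), y - x \rangle - \tfrac12 \langle y - x, H(x)(y - x) \rangle\big|
   \le \|y - x\|^2 \int_0^1 \!\! \int_0^t \|H(x + s(y - x)) - H(x)\| \, ds \, dt. \]

Relying on continuity of $g$ and $H$, observe that the error in the first-order
approximation is $o(\|y - x\|)$ (i.e., tends to zero faster than $\|y - x\|$), whereas the error in the
second-order approximation is $o(\|y - x\|^2)$.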
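The two error orders can be observed numerically. The sketch below is illustrative only: it assumes the dot product and a hypothetical functional $f(x) = e^{x_0 + 2x_1}$, and it records the first-order error divided by $\|y - x\|$ and the second-order error divided by $\|y - x\|^2$ as $y$ approaches $x$ along a fixed unit direction. Both ratios should shrink toward zero.

```python
import math

def f(x):
    # Hypothetical smooth functional used only for illustration.
    return math.exp(x[0] + 2.0 * x[1])

x = (0.0, 0.0)
fx = f(x)
g = (fx, 2.0 * fx)                                  # gradient at x
H = ((fx, 2.0 * fx), (2.0 * fx, 4.0 * fx))          # Hessian at x
d = (0.6, 0.8)                                      # unit direction

first_ratios, second_ratios = [], []
for h in (1e-1, 1e-2, 1e-3):
    step = (h * d[0], h * d[1])                     # y - x, with ||y - x|| = h
    y = (x[0] + step[0], x[1] + step[1])
    lin = g[0] * step[0] + g[1] * step[1]
    quad = 0.5 * sum(step[i] * H[i][j] * step[j]
                     for i in range(2) for j in range(2))
    first_ratios.append(abs(f(y) - fx - lin) / h)
    second_ratios.append(abs(f(y) - fx - lin - quad) / h ** 2)

print(first_ratios)
print(second_ratios)
```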
Theorem 1.5.1 gives a fundamental theorem of calculus for a functional $f$. It will be
necessary to have an analogous theorem for $g$, a theorem which expresses the difference
$g(y) - g(x)$ as an integral involving the Hessian. To keep our development coordinate-free,
we introduce the following definition:
The univariate function $t \mapsto v(t) \in \mathbb{R}^n$, with domain $[a, b]$, is said to be
integrable if there exists a vector $u$ such that
\[ \langle u, w \rangle = \int_a^b \langle v(t), w \rangle \, dt \quad \text{for all } w \in \mathbb{R}^n. \]
If it exists, the vector $u$ is uniquely determined (as is not difficult to prove) and
is called the integral of the function $v(t)$. One uses the notation $\int_a^b v(t) \, dt$ to
represent this vector.
Although the definition of the integral is phrased in terms of the inner product $\langle\ ,\ \rangle$,
it is independent of the inner product. Indeed, if $u$ is the integral as defined by $\langle\ ,\ \rangle$ and if
$S$ is pd, then for all vectors $w$,
\[ \langle u, w \rangle_S = \langle u, Sw \rangle = \int_a^b \langle v(t), Sw \rangle \, dt = \int_a^b \langle v(t), w \rangle_S \, dt. \]
The following are two useful, elementary propositions.

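Proposition 1.5.4. If the univariate function $t \mapsto v(t) \in \mathbb{R}^n$, with domain $[a, b]$, is
integrable, then
\[ \Big\| \int_a^b v(t) \, dt \Big\| \le \int_a^b \|v(t)\| \, dt. \]

Proof. Let $u := \int_a^b v(t) \, dt$. By definition of the integral, for all $w$ we have
\[ \langle u, w \rangle = \int_a^b \langle v(t), w \rangle \, dt. \]
In particular, choosing $w = u$ gives
\[ \|u\|^2 = \int_a^b \langle v(t), u \rangle \, dt. \tag{1.8} \]
However,
\[ \int_a^b \langle v(t), u \rangle \, dt \le \|u\| \int_a^b \|v(t)\| \, dt. \tag{1.9} \]
Combining (1.8) and (1.9) gives
\[ \|u\| \le \int_a^b \|v(t)\| \, dt. \]
Since $\|u\| = \| \int_a^b v(t) \, dt \|$, the proof is complete.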


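Proposition 1.5.5. If the univariate function $t \mapsto v(t) \in \mathbb{R}^n$, with domain $[a, b]$, is
integrable and if $A : \mathbb{R}^n \to \mathbb{R}^m$ is a linear operator, then the function $t \mapsto Av(t)$ is
integrable and
\[ \int_a^b Av(t) \, dt = A \int_a^b v(t) \, dt. \]

Proof. Observe that for all $w \in \mathbb{R}^m$ we have
\[ \int_a^b \langle Av(t), w \rangle \, dt = \int_a^b \langle v(t), A^* w \rangle \, dt
   = \Big\langle \int_a^b v(t) \, dt, A^* w \Big\rangle = \Big\langle A \int_a^b v(t) \, dt, w \Big\rangle, \]
where the second equality is by definition of $\int_a^b v(t) \, dt$.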
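Next is the fundamental theorem of calculus for the gradient.

Theorem 1.5.6. If $x, y \in D_f$, then
\[ g(y) - g(x) = \int_0^1 H(x + t(y - x))(y - x) \, dt. \]

Proof. By definition of the integral, we wish to prove that for all $w$,
\[ \langle g(y) - g(x), w \rangle = \int_0^1 \langle H(x + t(y - x))(y - x), w \rangle \, dt. \tag{1.10} \]
Fix arbitrary $w$ and consider the functional
\[ \phi(t) := \langle g(x + t(y - x)), w \rangle. \]
The fundamental theorem of calculus asserts
\[ \phi(1) - \phi(0) = \int_0^1 \phi'(t) \, dt, \]
which, by definition of $\phi$, is equivalent to
\[ \langle g(y) - g(x), w \rangle = \int_0^1 \phi'(t) \, dt. \tag{1.11} \]
Comparing (1.11) with (1.10), we see that to prove (1.10) it suffices to show for arbitrary
$0 \le t \le 1$ that
\[ \phi'(t) = \langle H(u)(y - x), w \rangle, \tag{1.12} \]
where
\[ u := x + t(y - x). \]
Toward proving (1.12), recall that $H(u)$ is the unique operator satisfying
\[ \lim_{\|\Delta u\| \to 0} \frac{\|g(u + \Delta u) - g(u) - H(u)\Delta u\|}{\|\Delta u\|} = 0. \tag{1.13} \]
Thinking of $\Delta u$ as being $s(y - x)$ where $s \to 0$, it follows from (1.13) that
\[ \lim_{s \to 0} \frac{\|g(u + s(y - x)) - g(u) - sH(u)(y - x)\|}{|s|} = 0. \tag{1.14} \]
Since, by Cauchy-Schwarz,
\[ |\langle g(u + s(y - x)) - g(u) - sH(u)(y - x), w \rangle| \le \|g(u + s(y - x)) - g(u) - sH(u)(y - x)\| \, \|w\|, \]
equation (1.14) implies
\[ \lim_{s \to 0} \frac{\langle g(u + s(y - x)) - g(u) - sH(u)(y - x), w \rangle}{s} = 0. \]
Since
\[ \langle g(u + s(y - x)) - g(u), w \rangle = \phi(t + s) - \phi(t), \]
we thus have
\[ \lim_{s \to 0} \frac{\phi(t + s) - \phi(t) - s \langle H(u)(y - x), w \rangle}{s} = 0, \]
from which it is immediate that $\langle H(u)(y - x), w \rangle = \phi'(t)$. Thus, (1.12) is established and
the proof is complete.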

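Proposition 1.5.7. If $x, y \in D_f$, then
\[ g(y) - g(x) - H(x)(y - x) = \int_0^1 [H(x + t(y - x)) - H(x)](y - x) \, dt. \]

Proof. The proof is a simple consequence of Theorem 1.5.6 and
\[ H(x)(y - x) = \int_0^1 H(x)(y - x) \, dt, \]
an identity which is trivially verified.

Corollary 1.5.8. If $x, y \in D_f$, then
\[ \|g(y) - g(x) - H(x)(y - x)\| \le \|y - x\| \int_0^1 \|H(x + t(y - x)) - H(x)\| \, dt. \]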

1.6 Newton's Method


We continue to assume $f \in C^2$ and $D_f$ is an open, convex set.
In optimization, Newton's method is an algorithm for minimizing functionals. The
idea behind the algorithm is simple. Given a point $x$ in the domain of a functional $f$, where
$f$ is to be minimized, one replaces $f$ by the second-order approximation at $x$ and minimizes
the approximation to obtain a new point, $x_+$. One repeats this procedure with $x_+$ in place
of $x$, and so on, generating a sequence of points which, under certain conditions, converges
rapidly to a minimizer of $f$.
For $x \in D_f$, we denote the second-order (or "quadratic") approximation of $f$ at $x$
by
\[ q_x(y) := f(x) + \langle g(x), y - x \rangle + \tfrac12 \langle y - x, H(x)(y - x) \rangle. \]
The domain of $q_x$ is all of $\mathbb{R}^n$.

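Proposition 1.6.1. The gradient of $q_x$ at $y$ is $g(x) + H(x)(y - x)$ and the Hessian is $H(x)$
(regardless of $y$).

Proof. Using the self-adjointness of $H(x)$, it is easily established that
\[ q_x(y + \Delta y) - q_x(y) = \langle g(x) + H(x)(y - x), \Delta y \rangle + \tfrac12 \langle \Delta y, H(x) \Delta y \rangle. \]
Proving the gradient is as asserted is thus equivalent to proving
\[ \lim_{\|\Delta y\| \to 0} \frac{\langle \Delta y, H(x) \Delta y \rangle}{\|\Delta y\|} = 0, \]
an easily established identity.
Having proven the gradient is as asserted, it is simple to prove the Hessian is as
asserted.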
Henceforth, assume $H(x)$ is pd. Thus, $q_x$ is strictly convex and is minimized by the
point $x_+$ satisfying $g(x) + H(x)(x_+ - x) = 0$, that is, $q_x$ is minimized by the point
\[ x_+ := x - H(x)^{-1} g(x). \]
The "Newton step at $x$" is defined to be the difference
\[ n(x) := x_+ - x = -H(x)^{-1} g(x). \]
Newton's method steps from $x$ to $x + n(x)$.
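The iteration $x \mapsto x + n(x)$ can be sketched in a few lines. The example below is a minimal illustration, not a general-purpose implementation: it assumes the dot product, a hypothetical functional $f(x) = x_0^2 + x_1^2 + e^{x_0 - x_1}$ whose Hessian is pd everywhere, and a two-dimensional domain so that the linear system $H(x)\,n = -g(x)$ can be solved by Cramer's rule. No step-size safeguard is included, so convergence from an arbitrary starting point is not guaranteed in general.

```python
import math

def grad(x):
    # Gradient of the illustrative f(x) = x0^2 + x1^2 + exp(x0 - x1).
    e = math.exp(x[0] - x[1])
    return (2.0 * x[0] + e, 2.0 * x[1] - e)

def hess(x):
    # Hessian of the same functional; it is pd for every x.
    e = math.exp(x[0] - x[1])
    return ((2.0 + e, -e), (-e, 2.0 + e))

def newton_step(x):
    """Solve H(x) n = -g(x) for the 2x2 case by Cramer's rule."""
    g = grad(x)
    (a, b), (c, d) = hess(x)
    det = a * d - b * c          # positive because H(x) is pd
    return ((-g[0] * d + b * g[1]) / det, (c * g[0] - a * g[1]) / det)

x = (0.0, 0.0)
for _ in range(8):
    n = newton_step(x)
    x = (x[0] + n[0], x[1] + n[1])

grad_norm = math.hypot(*grad(x))
print(x, grad_norm)
```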


We emphasize that $x_+$ is the root of the map $y \mapsto g(x) + H(x)(y - x)$, the linear
approximation at $x$ of the gradient map $y \mapsto g(y)$. Although it is customary in optimization
to present Newton's method as the algorithm which attempts to approximate a minimizer of $f$
by minimizing the quadratic approximation $y \mapsto q_x(y)$ of $f$, one can equivalently present
it as the algorithm which attempts to approximate a solution to the nonlinear equations
$g(y) = 0$ by computing the root of the linear approximation $y \mapsto g(x) + H(x)(y - x)$ of
$g$. The latter viewpoint dovetails with the use of Newton's method in applied mathematics
beyond optimization; it is the algorithm which attempts to approximate a root to a general
nonlinear map $g : \mathbb{R}^n \to \mathbb{R}^n$ by computing the root of the linear approximation to the map.
We know the second-order approximation is independent of the inner product.
Consequently, so is Newton's method. More explicitly, in the inner product $\langle\ ,\ \rangle_S$, the gradient
of $f$ at $x$ is $S^{-1}g(x)$, the Hessian is $S^{-1}H(x)$, and so the Newton step is
\[ -(S^{-1}H(x))^{-1} S^{-1} g(x) = -H(x)^{-1} g(x). \]
The Newton step is independent of the inner product.


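The following theorem is the main tool for analyzing the progress of Newton's method.
Theorem 1.6.2. If $z$ minimizes $f$ and $H(x)$ is invertible, then
\[ \|x_+ - z\| \le \|x - z\| \, \|H(x)^{-1}\| \int_0^1 \|H(x + t(z - x)) - H(x)\| \, dt. \]

Proof. Noting $g(z) = 0$, we have
\[ \begin{aligned}
\|x_+ - z\| &= \|x - z - H(x)^{-1} g(x)\| \\
  &= \|x - z + H(x)^{-1}(g(z) - g(x))\| \\
  &= \Big\| x - z + H(x)^{-1} \int_0^1 H(x + t(z - x))(z - x) \, dt \Big\| \\
  &= \Big\| H(x)^{-1} \int_0^1 [H(x + t(z - x)) - H(x)](z - x) \, dt \Big\| \\
  &\le \|x - z\| \, \|H(x)^{-1}\| \int_0^1 \|H(x + t(z - x)) - H(x)\| \, dt.
\end{aligned} \]

Invoking the assumed continuity of the Hessian, the theorem is seen to imply that if
$H(z)$ is invertible and $x$ is sufficiently close to $z$, then $x_+$ will be closer to $z$ than is $x$.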
Now we present a brief discussion of Newton's method and subspaces, as will be
important when we consider applications of ipm theory to optimization problems having
linear equations among the constraints. Assume $L$ is a subspace of $\mathbb{R}^n$ and $x \in L \cap D_f$.
Let $n|_L(x)$ denote the Newton step for $f|_L$ at $x$. Since the gradient of $f|_L$ at $x$ is $P_L g(x)$
and the Hessian is $P_L H(x)$, the Newton step $n|_L(x)$ is the vector in $L$ solving
\[ P_L H(x)\, n|_L(x) = -P_L g(x), \]
that is, $n|_L(x)$ is the vector in $L$ for which $H(x) n|_L(x) + g(x)$ is orthogonal to $L$. In
particular, if $L$ is the nullspace of a linear operator $A : \mathbb{R}^n \to \mathbb{R}^m$, then $n|_L(x)$ is the vector
in $\mathbb{R}^n$ for which there exists $y \in \mathbb{R}^m$ satisfying
\[ H(x)\, n|_L(x) + g(x) = A^* y, \qquad A\, n|_L(x) = 0. \]
Computing $n|_L(x)$ (and $y$) can thus be accomplished by solving a system of $m + n$ equations
in $m + n$ variables.
If $H(x)^{-1}$ is readily computed (as it is for functionals $f$ used in ipm's), the size of the
system of linear equations to be solved can easily be reduced to $m$ variables. One solves
the linear system in the variables $y$,
\[ A H(x)^{-1} A^* y = A H(x)^{-1} g(x), \]
and then computes
\[ n|_L(x) = H(x)^{-1} (A^* y - g(x)). \]
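The reduction to $m$ variables can be sketched as follows. This is an illustration under stated assumptions rather than a general solver: the dot product is the reference inner product (so $A^* = A^T$), the functional is the hypothetical $f(x) = \langle c, x\rangle - \sum_j \ln x_j$ with diagonal Hessian, and $A$ is the single row of ones, so $m = 1$ and the reduced system $A H(x)^{-1} A^* y = A H(x)^{-1} g(x)$ is a scalar equation. The Newton step is then recovered as $H(x)^{-1}(A^* y - g(x))$.

```python
def constrained_newton_step(x, c):
    """Newton step for f(x) = <c, x> - sum(ln x_j) restricted to the
    nullspace of A = [1, ..., 1] (hypothetical data; dot product assumed)."""
    g = [cj - 1.0 / xj for cj, xj in zip(c, x)]     # gradient g(x)
    h_inv = [xj * xj for xj in x]                   # diagonal of H(x)^{-1}
    # Reduced system (1x1 here):  A H^{-1} A^T y = A H^{-1} g
    y = sum(hi * gj for hi, gj in zip(h_inv, g)) / sum(h_inv)
    # Newton step:  n = H^{-1} (A^T y - g)
    n = [hi * (y - gj) for hi, gj in zip(h_inv, g)]
    return n, y

x = [0.2, 0.3, 0.5]
c = [1.0, 2.0, 3.0]
n, y = constrained_newton_step(x, c)

# The step lies in the nullspace of A, and H n + g equals A^T y.
step_sum = sum(n)
residuals = [nj / (xj * xj) + (cj - 1.0 / xj) - y
             for nj, xj, cj in zip(n, x, c)]
slope = sum((cj - 1.0 / xj) * nj for cj, xj, nj in zip(c, x, n))
print(n, y, step_sum, slope)
```

The quantity `slope` is $\langle g(x), n|_L(x)\rangle$; it is negative here, consistent with the step being a descent direction within the constraint set.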

In closing this section we remark that the error bound given by Theorem 1.6.2 can of
course be applied to $f|_L$. Assuming $z'$ minimizes $f|_L$ (trivially, the minimizer $z$ of $f$ over
all of $D_f$ satisfies $z = z'$ iff $z \in L$), we obtain

However, since $H|_L(y) = P_L H(y) P_L$ for all $y \in D_f$, it is easily proven that

Hence,

The Hessian on the right being $H$ rather than $H|_L$ makes for less cumbersome applications
of the inequality.
Chapter 2

Basic Interior-Point Method Theory

Throughout this chapter, unless otherwise stated, $f$ refers to a functional having at least the
following properties: $D_f$ is open and convex, $f \in C^2$, and $H(x)$ is pd for all $x \in D_f$. In
particular, $f$ is strictly convex.

2.1 Intrinsic Inner Products


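The functional $f$ gives rise to a family of inner products, an inner product for each point
$x \in D_f$:
\[ \langle u, v \rangle_{H(x)} := \langle u, H(x) v \rangle. \]
These inner products vary continuously with $x$. In particular, given $\epsilon > 0$, there exists a
neighborhood of $x$ consisting of points $y$ with the property that for all vectors $v \ne 0$,
\[ 1 - \epsilon < \frac{\|v\|_{H(y)}}{\|v\|_{H(x)}} < 1 + \epsilon. \]
We often refer to the inner product $\langle\ ,\ \rangle_{H(x)}$ as the local inner product (at $x$).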
In the inner product $\langle\ ,\ \rangle_{H(x)}$, the gradient at $y$ is $H(x)^{-1} g(y)$ and the Hessian is
$H(x)^{-1} H(y)$. In particular, the gradient at $x$ is $-n(x)$, the negative of the Newton step,
and the Hessian is $I$, the identity. Thus, in the local inner product, Newton's method
coincides with the "method of steepest descent," i.e., Newton's method coincides with the
algorithm which attempts to minimize $f$ by moving in the direction given by the negative
of the gradient. (Whereas Newton's method is independent of inner products, the method
of steepest descent is not independent because gradients are not independent.)
It appears from our definition that the local inner product potentially depends on the
reference inner product $\langle\ ,\ \rangle$. In fact, the local inner product is independent of the reference
inner product; indeed, if the reference inner product is changed to $\langle\ ,\ \rangle_S$, and hence the
Hessian is changed to $S^{-1} H(x)$, the resulting local inner product is
\[ \langle u, S^{-1} H(x) v \rangle_S = \langle u, S S^{-1} H(x) v \rangle = \langle u, H(x) v \rangle, \]
that is, the local inner product is unchanged.


The independence of the local inner products from the reference inner product shows
the local inner products to be intrinsic to the functional $f$. For that reason, we often refer
to the local inner products as the intrinsic inner products. To highlight the independence of
the local inner products from any reference inner product, we adopt notation which avoids
the Hessians of a reference. We denote the intrinsic inner product at $x$ by $\langle\ ,\ \rangle_x$. Let $\|\ \|_x$
denote the induced norm. For $y \in D_f$, let $g_x(y)$ denote the gradient at $y$ and let $H_x(y)$
denote the Hessian. Thus, $g_x(x) = -n(x)$ and $H_x(x) = I$. If $A : \mathbb{R}^n \to \mathbb{R}^m$ is a linear
operator, let $A_x^*$ denote its adjoint w.r.t. $\langle\ ,\ \rangle_x$. (Of course the adjoint also depends on
the inner product on $\mathbb{R}^m$. That inner product will always be fixed but arbitrary, unlike the
intrinsic inner products which vary with $x$ and are not arbitrary, depending on $f$.)
The reader should be especially aware that we use $g_x(x)$ and $-n(x)$ interchangeably,
depending on context.
A minuscule amount of the ipm literature is written in terms of the local inner products.
Rather, in much of the literature, only a reference inner product is explicit, say, the dot
product. There, the proofs are done by manipulating operators built from Hessians, operators
like $H(x)^{-1} H(y)$ and $A H(x)^{-1} A^T$, operators we recognize as being $H_x(y)$ and $A A_x^*$. An
advantage to working in the local inner products is that the underlying geometry becomes
evident and, consequently, the operator manipulations in the proofs become less mysterious.
Observe that the quadratic approximation of $f$ at $x$ is
\[ q_x(y) = f(x) + \langle g_x(x), y - x \rangle_x + \tfrac12 \|y - x\|_x^2 \]
and its error in approximating $f(y)$ (Corollary 1.5.3) is no worse than
\[ \|y - x\|_x^2 \int_0^1 \!\! \int_0^t \|H_x(x + s(y - x)) - I\|_x \, ds \, dt, \]
where the latter norm is the operator norm induced by the local norm. Similarly, the progress
made by Newton's method toward approximating a minimizer $z$ (Theorem 1.6.2) is captured
by the inequality
\[ \|x_+ - z\|_x \le \|x - z\|_x \int_0^1 \|I - H_x(x + t(z - x))\|_x \, dt. \]

Assume $L$ is a subspace of $\mathbb{R}^n$ and $x \in L \cap D_f$. Let $g|_{L,x}$ denote the gradient of
$f|_L$ w.r.t. the inner product obtained by restricting $\langle\ ,\ \rangle_x$ to $L$. Likewise, let $H|_{L,x}$ denote
the Hessian. Let $P_{L,x}$ denote the linear operator which projects $\mathbb{R}^n$ orthogonally onto $L$,
orthogonally w.r.t. $\langle\ ,\ \rangle_x$. We know from sections 1.2 and 1.3 that $g|_{L,x}(y) = P_{L,x}\, g_x(y)$
and $H|_{L,x}(y) v = P_{L,x} H_x(y) v$ for all $y, v \in L$.
Observe that
\[ H|_{L,x}(x) = P_{L,x} H_x(x) = P_{L,x} = I \quad \text{(the last identity is valid on $L$, not $\mathbb{R}^n$).} \]
Consequently, the local inner product on $L$ induced by $f|_L$ is precisely the restriction of
$\langle\ ,\ \rangle_x$ to $L$. Thus,
\[ n|_L(x) = P_{L,x}\, n(x). \]
That is, in the local inner product, the Newton step for $f|_L$ is the orthogonal projection of
the Newton step for $f$.
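If $L$ is the nullspace of a surjective linear operator $A$, the relation
\[ P_{L,x} = I - A_x^* (A A_x^*)^{-1} A \]
provides the means to compute $n|_L(x)$ from $n(x)$. One solves the linear system
\[ A A_x^* y = -A\, n(x) \]
and then computes
\[ n|_L(x) = n(x) + A_x^* y. \]
Expressed in terms of an arbitrary inner product $\langle\ ,\ \rangle$, the equations become
\[ A H(x)^{-1} A^* y = A H(x)^{-1} g(x), \qquad n|_L(x) = H(x)^{-1} (A^* y - g(x)), \]
precisely the equations we arrived at in §1.6 by slightly different reasoning.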

2.2 Self-Concordant Functionals


2.2.1 Introduction
Let $B_x(y, r)$ denote the open ball of radius $r$ centered at $y$, where the radius is measured
w.r.t. $\|\ \|_x$. Let $\bar{B}_x(y, r)$ denote the closed ball.

A functional $f$ is said to be (strongly nondegenerate) self-concordant if for all
$x \in D_f$ we have $B_x(x, 1) \subseteq D_f$, and if whenever $y \in B_x(x, 1)$ we have
\[ 1 - \|y - x\|_x \le \frac{\|v\|_y}{\|v\|_x} \le \frac{1}{1 - \|y - x\|_x} \quad \text{for all } v \ne 0. \]
Let $SC$ denote the family of functionals thus defined.


Self-concordant functionals play a central role in the general theory of ipm's, as was
developed in the pioneering work of Nesterov and Nemirovskii [15]. Although our definition
of strongly nondegenerate self-concordant functionals is on the surface quite different from
the original definition given in [15], it is in fact equivalent except in assuming $f \in C^2$ as
opposed to the ever-so-slightly stronger assumption in [15] that $f$ is thrice continuously
differentiable. The equivalence is shown in §2.5, where it is also shown that our definition
can be "relaxed" in a few ways without altering the family of functionals so defined; for
example, the leftmost inequality involving $\|v\|_y / \|v\|_x$ is redundant.
The term "strongly" refers to the requirement $B_x(x, 1) \subseteq D_f$. The term "nondegenerate"
refers to the Hessians being pd, thereby giving the local inner products. The definition
of self-concordant functionals (not necessarily strongly nondegenerate) is a natural
relaxation of the above definition, only requiring the Hessians to be psd. However, it is the
strongly nondegenerate self-concordant functionals that play the central role in ipm theory,
and so the relaxation of the definition is best postponed until the reader has in mind a general
outline of the theory.

As the parentheses in our definition indicate, for brevity we typically refer to strongly
nondegenerate self-concordant functionals simply as "self-concordant functionals."
If a linear functional is added to a self-concordant functional ($x \mapsto \langle c, x \rangle + f(x)$), the
resulting functional is self-concordant because the Hessians are unaffected. Similarly, if one
restricts a self-concordant functional $f$ to a subspace $L$ (or to a translation of the subspace),
one obtains a self-concordant functional, a simple consequence of the local norms for $f|_L$
being the restrictions of the local norms for $f$.
The primordial self-concordant barrier functional is the "logarithmic barrier function
for the nonnegative orthant," having domain $D_f := \mathbb{R}^n_{++}$ (i.e., the strictly positive orthant).
It is defined by $f(x) := -\sum_j \ln x_j$. Since the coordinates of vectors play a prominent role
in the definition of this functional (as they do in the definition of the nonnegative orthant),
to prove self-concordance it is natural to use the dot product as a reference inner product.
Expressing the Hessian $H(x)$ as a matrix, one sees it is diagonal with $j$th diagonal entry
$1/x_j^2$. Consequently, $y \in B_x(x, 1)$ is equivalent to
\[ \sum_j \Big( \frac{y_j - x_j}{x_j} \Big)^2 < 1, \]
an inequality which is easily seen to imply $y \in D_f$ ($= \mathbb{R}^n_{++}$) as required by the definition of
self-concordance. Moreover, assuming $y \in B_x(x, 1)$ and $v$ is an arbitrary vector, we have
\[ \|v\|_y^2 = \sum_j \Big( \frac{v_j}{y_j} \Big)^2 = \sum_j \Big( \frac{v_j}{x_j} \Big)^2 \Big( \frac{x_j}{y_j} \Big)^2
   \le \|v\|_x^2 \, \max_j \Big( \frac{x_j}{y_j} \Big)^2. \]
Since
\[ \frac{x_j}{y_j} = \frac{1}{1 + (y_j - x_j)/x_j} \le \frac{1}{1 - |y_j - x_j|/x_j} \le \frac{1}{1 - \|y - x\|_x}, \]
the rightmost inequality on $\|v\|_y / \|v\|_x$ in the definition of self-concordance is proven. The
leftmost inequality is proven similarly. Thus, the barrier function for the nonnegative orthant
is indeed self-concordant.
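The pair of inequalities $1 - \|y - x\|_x \le \|v\|_y / \|v\|_x \le 1/(1 - \|y - x\|_x)$ can be spot-checked numerically for this barrier function. The following sketch is only a sanity check on random samples, not a proof; the dimension, sampling ranges, and tolerance are arbitrary choices made for illustration.

```python
import math
import random

def local_norm(v, x):
    """||v||_x for f(x) = -sum(ln x_j): sqrt(sum((v_j / x_j)^2))."""
    return math.sqrt(sum((vj / xj) ** 2 for vj, xj in zip(v, x)))

random.seed(0)
violations = 0
for _ in range(1000):
    dim = 4
    x = [random.uniform(0.1, 10.0) for _ in range(dim)]
    # Draw y with ||y - x||_x = r < 1 by rescaling a random direction.
    d = [random.gauss(0.0, 1.0) for _ in range(dim)]
    r = random.uniform(0.0, 0.99)
    scale = r / local_norm(d, x)
    y = [xj + scale * dj for xj, dj in zip(x, d)]
    in_domain = min(y) > 0.0          # y should stay in the positive orthant
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    ratio = local_norm(v, y) / local_norm(v, x) if in_domain else float("nan")
    ok = in_domain and (1.0 - r - 1e-12 <= ratio <= 1.0 / (1.0 - r) + 1e-12)
    if not ok:
        violations += 1

print(violations)
```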
For an LP
\[ \min \ \langle c, x \rangle \quad \text{s.t.} \quad Ax = b, \ x \ge 0, \]
the most important self-concordant functionals are those of the form
\[ \eta \langle c, x \rangle + f|_L(x), \]
where $\eta > 0$ is a fixed constant, $f$ is the logarithmic barrier function for the nonnegative
orthant, and $L := \{x : Ax = b\}$. (Of course $L$ is an affine space, a translation of a
subspace; $L$ is a subspace iff $b = 0$. For a brief discussion of how our results for functionals
restricted to subspaces readily translate to functionals restricted to affine spaces, recall the
last paragraph of §1.2.)
Another important self-concordant functional is the "logarithmic barrier function for
the cone of psd matrices" in $\mathbb{S}^{n \times n}$. This is the functional defined by $f(X) := -\ln \det(X)$,
having domain $\mathbb{S}^{n \times n}_{++}$ (i.e., the pd matrices in $\mathbb{S}^{n \times n}$). To prove self-concordance, it is natural
to rely on the trace product, for which we know $H(X) \Delta X = X^{-1} (\Delta X) X^{-1}$. For arbitrary
$Y \in \mathbb{S}^{n \times n}$, keeping in mind that the trace of a matrix depends only on the eigenvalues, we
have
\[ \|Y - X\|_X^2 = \operatorname{trace}\big( (Y - X) X^{-1} (Y - X) X^{-1} \big)
   = \operatorname{trace}\big( (X^{-1/2} Y X^{-1/2} - I)^2 \big) = \sum_j (1 - \lambda_j)^2, \]
where $\lambda_1 \le \cdots \le \lambda_n$ are the eigenvalues of $X^{-1/2} Y X^{-1/2}$. Assuming $\|Y - X\|_X < 1$, all
of the values $\lambda_j$ are thus positive, and hence $X^{-1/2} Y X^{-1/2}$ is pd, which is easily seen to
be equivalent to $Y$ being pd. Consequently, if $\|Y - X\|_X < 1$, then $Y \in D_f$ ($= \mathbb{S}^{n \times n}_{++}$), as
required by the definition of self-concordance.
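Assuming $Y \in B_X(X, 1)$ and $V \in \mathbb{S}^{n \times n}$, let
\[ S_1 := X^{-1/2} V X^{-1/2}, \qquad S_2 := X^{1/2} Y^{-1} X^{1/2}. \]
Note that $\|S_1\| = \|V\|_X$ and the eigenvalues of $S_2$ are $0 < 1/\lambda_n \le \cdots \le 1/\lambda_1$. In
establishing the bounds for $\|V\|_Y / \|V\|_X$ in the definition of self-concordance, we rely on
the inequality
\[ \|S_2^{1/2} S_1 S_2^{1/2}\| \le \frac{1}{\lambda_1} \|S_1\|. \]
To verify this inequality, let $Q$ be an orthogonal matrix diagonalizing $S_2^{1/2}$ so that $Q S_2^{1/2} Q^T$ is
a diagonal matrix $\Lambda^{-1/2}$ with diagonal entries $1/\sqrt{\lambda_n} \le \cdots \le 1/\sqrt{\lambda_1}$. Since the Frobenius
norm of a symmetric matrix is determined solely by the eigenvalues, it is immediate that
\[ \|S_2^{1/2} S_1 S_2^{1/2}\| = \|\Lambda^{-1/2} (Q S_1 Q^T) \Lambda^{-1/2}\|. \]
However, as the Frobenius norm of a matrix $M$ satisfies $\|M\| = (\sum_{i,j} m_{ij}^2)^{1/2}$, it is easily seen
that
\[ \|\Lambda^{-1/2} (Q S_1 Q^T) \Lambda^{-1/2}\| \le \frac{1}{\lambda_1} \|Q S_1 Q^T\| = \frac{1}{\lambda_1} \|S_1\|, \]
where the last equality once again relies on the Frobenius norm of a symmetric matrix being
determined solely by the eigenvalues.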
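We have
\[ \|V\|_Y^2 = \operatorname{trace}\big( V Y^{-1} V Y^{-1} \big) = \operatorname{trace}\big( (S_2^{1/2} S_1 S_2^{1/2})^2 \big)
   = \|S_2^{1/2} S_1 S_2^{1/2}\|^2 \le \frac{1}{\lambda_1^2} \|V\|_X^2. \]
Since
\[ \frac{1}{\lambda_1} \le \frac{1}{1 - |1 - \lambda_1|} \le \frac{1}{1 - \|Y - X\|_X}, \]
the rightmost inequality for $\|V\|_Y / \|V\|_X$ in the definition of self-concordance is proven.
The leftmost inequality is proven similarly. Thus, the logarithmic barrier function for the
cone of pd matrices is indeed self-concordant.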
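For an SDP
\[ \min \ \langle C, X \rangle \quad \text{s.t.} \quad A(X) = b, \ X \succeq 0, \]
where $A : \mathbb{S}^{n \times n} \to \mathbb{R}^m$ is a linear operator, the most important self-concordant functionals
are those of the form
\[ \eta \langle C, X \rangle + f|_L(X), \]
where $\eta > 0$ is a fixed constant, $f$ is the logarithmic barrier function for the cone of pd
matrices, and $L := \{X : A(X) = b\}$.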
LP can be viewed as a special case of SDP by identifying, in the obvious manner,
$\mathbb{R}^n$ with the subspace in $\mathbb{S}^{n \times n}$ consisting of diagonal matrices. Then the logarithmic barrier
function for the pd cone restricts to be the logarithmic barrier function for the nonnegative
orthant. Thus, we were redundant in giving a proof that the logarithmic barrier function
for the nonnegative orthant is indeed a self-concordant functional since the restriction of
self-concordant functionals to affine spaces yields self-concordant functionals. The insight
gained from the simplicity of the nonnegative orthant justifies the redundancy.
In §2.5 we show that the self-concordance of each of these two logarithmic barrier
functions is a simple consequence of the original definition of self-concordance due to
Nesterov and Nemirovskii [15]. (The original definition is not particularly well suited for a
transparent development of the theory, but it is well suited for establishing self-concordance.)
To apply the definition of self-concordance in developing the theory, it is useful to
rephrase it in terms of Hessians. (The operator norm appearing in the following theorem is
the one induced by the local norm $\|\ \|_x$.)

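Theorem 2.2.1. Assume the functional $f$ has the property that $B_x(x, 1) \subseteq D_f$ for all
$x \in D_f$. Then $f$ is self-concordant iff for all $x \in D_f$ and $y \in B_x(x, 1)$,
\[ \|H_x(y)\|_x, \ \|H_x(y)^{-1}\|_x \le \frac{1}{(1 - \|y - x\|_x)^2}; \tag{2.1} \]
likewise, $f$ is self-concordant iff
\[ \|I - H_x(y)\|_x, \ \|I - H_x(y)^{-1}\|_x \le \frac{1}{(1 - \|y - x\|_x)^2} - 1. \tag{2.2} \]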


Proof. Let $\lambda_1 \le \cdots \le \lambda_n$ denote the eigenvalues of $H_x(y)$. Since

and, similarly,

the pair of inequalities in the definition of self-concordance is equivalent to the pair (2.1).
To complete the proof, we show that the pair of inequalities (2.1) is equivalent to the pair
(2.2).
Since the eigenvalues of $I - H_x(y)$ are $1 - \lambda_i$, we have

where the inequality relies on the relations $0 < \lambda_1 \le \lambda_n$. Hence, the first inequality in (2.2)
is implied by the pair of inequalities (2.1), and similarly for the second inequality in (2.2).
Finally, it is immediate that the first inequality (resp., second inequality) of (2.2) implies
the first inequality (resp., second inequality) of (2.1).
Recalling that $H_x(x) = I$, it is evident from (2.2) that to assume a functional to
be self-concordant is essentially to assume Lipschitz continuity of the Hessians w.r.t. the
operator norms induced by the local norms.
An aside for those familiar with third differentials: dividing the quantities on the left
and right of (2.2) by $\|y - x\|_x$ and taking the limit supremum as $y$ tends to $x$ suggests,
when $f$ is thrice differentiable, that self-concordance implies the local norm of the third
differential to be bounded by "2." In fact, the converse is also true, that is, a bound of "2"
on the local norm of the third differential for all $x \in D_f$, together with the requirement that
the local unit balls be contained in the functional domain, implies self-concordance, as we
shall see in §2.5. Indeed, the original definition of self-concordance in [15] is phrased as a
bound on the third differential.

2.2.2 Self-Concordancy and Newton's Method


The following theorems display the simplifying role the conditions of self-concordance
play in analysis. The first theorem bounds the error of the quadratic approximation and the
second guarantees progress made by Newton's method. Note the elegance of the bounds
when compared to the more general Corollary 1.5.3 and Theorem 1.6.2.
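Recall that $q_x$ is the quadratic approximation of $f$ at $x$, that is,
\[ q_x(y) = f(x) - \langle n(x), y - x \rangle_x + \tfrac12 \|y - x\|_x^2, \]
where $n(x) := -H(x)^{-1} g(x)$ is the Newton step for $f$ at $x$.

Theorem 2.2.2. If $f \in SC$, $x \in D_f$ and $y \in B_x(x, 1)$, then
\[ |f(y) - q_x(y)| \le \frac{\|y - x\|_x^3}{3(1 - \|y - x\|_x)}. \]

Proof. Using Corollary 1.5.3 and Theorem 2.2.1, we have
\[ \begin{aligned}
|f(y) - q_x(y)| &\le \|y - x\|_x^2 \int_0^1 \!\! \int_0^t \|I - H_x(x + s(y - x))\|_x \, ds \, dt \\
  &\le \|y - x\|_x^2 \int_0^1 \!\! \int_0^t \Big[ \frac{1}{(1 - s\|y - x\|_x)^2} - 1 \Big] ds \, dt \\
  &= \|y - x\|_x^3 \int_0^1 \frac{t^2}{1 - t\|y - x\|_x} \, dt \\
  &\le \frac{\|y - x\|_x^3}{3(1 - \|y - x\|_x)}.
\end{aligned} \]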

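Theorem 2.2.3. Assume $f \in SC$ and $x \in D_f$. If $z$ minimizes $f$ and $z \in B_x(x, 1)$, then
\[ x_+ := x - H(x)^{-1} g(x) \]
satisfies
\[ \|x_+ - z\|_x \le \frac{\|x - z\|_x^2}{1 - \|x - z\|_x}. \]

Proof. Using Theorem 1.6.2 and Theorem 2.2.1, simply observe
\[ \|x_+ - z\|_x \le \|x - z\|_x \int_0^1 \|I - H_x(x + t(z - x))\|_x \, dt
   \le \|x - z\|_x \int_0^1 \Big[ \frac{1}{(1 - t\|x - z\|_x)^2} - 1 \Big] dt
   = \frac{\|x - z\|_x^2}{1 - \|x - z\|_x}. \]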
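The bound $\|x_+ - z\|_x \le \|x - z\|_x^2 / (1 - \|x - z\|_x)$ of Theorem 2.2.3 can be observed on a small instance. The sketch below is an illustration under assumptions: the functional is the hypothetical $f(x) = \langle c, x\rangle - \sum_j \ln x_j$ (a linear functional plus the logarithmic barrier function, hence self-concordant), whose minimizer is $z_j = 1/c_j$, and the data `c` and `x` are arbitrary.

```python
import math

def local_norm(v, x):
    """||v||_x for the log barrier Hessian: sqrt(sum((v_j / x_j)^2))."""
    return math.sqrt(sum((vj / xj) ** 2 for vj, xj in zip(v, x)))

c = [1.0, 2.0, 4.0]                 # hypothetical cost vector
z = [1.0 / cj for cj in c]          # minimizer of f(x) = <c, x> - sum(ln x_j)
x = [1.2, 0.45, 0.3]                # current point, chosen with ||x - z||_x < 1

# Newton iterate x_+ = x - H(x)^{-1} g(x), with g_j = c_j - 1/x_j, H_jj = 1/x_j^2.
x_plus = [xj - xj * xj * (cj - 1.0 / xj) for xj, cj in zip(x, c)]

r = local_norm([xj - zj for xj, zj in zip(x, z)], x)            # ||x - z||_x
err_plus = local_norm([xp - zj for xp, zj in zip(x_plus, z)], x)  # ||x_+ - z||_x
bound = r * r / (1.0 - r)

print(r, err_plus, bound)
```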

The use of the local norm $\|\ \|_x$ in Theorem 2.2.3 to measure the difference $x_+ - z$
makes for a particularly simple proof but does not result in a theorem immediately ready
for induction. At $x_+$, the containment $z \in B_{x_+}(x_+, 1)$ is needed to apply the theorem, i.e.,
a bound on $\|x_+ - z\|_{x_+}$ rather than a bound on $\|x_+ - z\|_x$. Given that the definition of
self-concordance restricts the norms to vary nicely, it is no surprise that the theorem can
easily be transformed into a statement ready for induction. For example, substituting into
the theorem the inequalities
\[ \|x_+ - z\|_x \ge (1 - \|x - z\|_z) \, \|x_+ - z\|_z \]
and
\[ \|x - z\|_x \le \frac{\|x - z\|_z}{1 - \|x - z\|_z}, \]
as are immediate from the definition of self-concordance when $x \in B_z(z, 1)$, we find as a
corollary to the theorem that if $\|x - z\|_z < \frac12$, then
\[ \|x_+ - z\|_z \le \frac{\|x - z\|_z^2}{(1 - \|x - z\|_z)^2 (1 - 2\|x - z\|_z)}. \tag{2.3} \]
Consequently, if one assumes $\|x - z\|_z < \frac14$, then
\[ \|x_+ - z\|_z \le 4 \|x - z\|_z^2 \]
and, inductively,
\[ \|x_i - z\|_z \le \tfrac14 \big( 4 \|x - z\|_z \big)^{2^i}, \tag{2.4} \]
where $x_1 = x_+, x_2, \ldots$ is the sequence generated by Newton's method. The bound (2.4)
makes apparent the rapid convergence of Newton's method.
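Recall $n(x) := -H(x)^{-1} g(x)$, the Newton step. The most elegant proofs of some
key results in the ipm theory are obtained by phrasing the analysis in terms of $\|n(x)\|_x$ rather
than in terms of $\|z - x\|_x$ or $\|z - x\|_z$. In this regard, the following theorem is especially
useful.

Theorem 2.2.4. Assume $f \in SC$. If $\|n(x)\|_x < 1$, then
\[ \|n(x_+)\|_{x_+} \le \Big( \frac{\|n(x)\|_x}{1 - \|n(x)\|_x} \Big)^2. \]

Proof. Assuming $\|n(x)\|_x < 1$, we have
\[ \|n(x_+)\|_{x_+}^2 = \langle H_x(x_+)^{-1} g_x(x_+), g_x(x_+) \rangle_x \le \|H_x(x_+)^{-1}\|_x \, \|g_x(x_+)\|_x^2. \]
Since by (2.1) we have
\[ \|H_x(x_+)^{-1}\|_x \le \frac{1}{(1 - \|n(x)\|_x)^2}, \]
we thus have
\[ \|n(x_+)\|_{x_+} \le \frac{\|g_x(x_+)\|_x}{1 - \|n(x)\|_x}. \]
The proof is completed by observing that since $g_x(x) = -n(x)$,
\[ \begin{aligned}
\|g_x(x_+)\|_x &= \|g_x(x_+) - g_x(x) - n(x)\|_x \\
  &= \Big\| \int_0^1 [H_x(x + t\,n(x)) - I]\, n(x) \, dt \Big\|_x \\
  &\le \|n(x)\|_x \int_0^1 \Big[ \frac{1}{(1 - t\|n(x)\|_x)^2} - 1 \Big] dt \\
  &= \frac{\|n(x)\|_x^2}{1 - \|n(x)\|_x}.
\end{aligned} \]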



A rather unsatisfying, but unavoidable, aspect of general convergence results for
Newton's method is an assumption of $x$ being sufficiently close to a minimizer $z$, where
"sufficiently close" depends explicitly on $z$. For general functionals, it is impossible to verify
that $x$ is indeed sufficiently close to $z$ without knowing $z$. For self-concordant functionals,
we know that the explicit dependence on $z$ of what constitutes "sufficiently close" can take
a particularly simple form (e.g., we know $x$ is sufficiently close to $z$ if $\|x - z\|_z < \frac14$), albeit
a form which appears still to require knowing $z$. The next theorem provides means to verify
proximity to a minimizer without knowing the minimizer.

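Theorem 2.2.5. Assume $f \in SC$. If $\|n(x)\|_x \le \frac14$ for some $x \in D_f$, then $f$ has a minimizer
$z$ and
\[ \|x_+ - z\|_x \le \frac{3 \|n(x)\|_x^2}{(1 - \|n(x)\|_x)^3}. \]
Thus,
\[ \|x - z\|_x \le \|n(x)\|_x + \frac{3 \|n(x)\|_x^2}{(1 - \|n(x)\|_x)^3}. \]

Proof. We first prove a weaker result, namely, if $\|n(x)\|_x \le \frac19$, then $f$ has a minimizer $z$
and $\|x - z\|_x \le 3 \|n(x)\|_x$.
Theorem 2.2.2 implies that for all $y \in \bar{B}_x(x, \frac13)$,
\[ |f(y) - q_x(y)| \le \tfrac12 \|y - x\|_x^3, \]
and hence, by definition of $q_x$,
\[ f(y) \ge f(x) - \|n(x)\|_x \|y - x\|_x + \tfrac12 \|y - x\|_x^2 - \tfrac12 \|y - x\|_x^3
   \ge f(x) - \|n(x)\|_x \|y - x\|_x + \tfrac13 \|y - x\|_x^2. \]
It follows that if $\|n(x)\|_x \le \frac19$ and $\|y - x\|_x = 3 \|n(x)\|_x$, then $f(y) \ge f(x)$. However, it
is easily proven that whenever a continuous, convex functional $f$ satisfies $f(y) \ge f(x)$ for
all $y$ on the boundary of a compact, convex set $S$ and some $x$ in the interior of $S$, then $f$ has
a minimizer in $S$. Thus, if $\|n(x)\|_x \le \frac19$, $f$ has a minimizer $z$ and $\|x - z\|_x \le 3 \|n(x)\|_x$.
Now assume $\|n(x)\|_x \le \frac14$. Theorem 2.2.4 implies
\[ \|n(x_+)\|_{x_+} \le \Big( \frac{\|n(x)\|_x}{1 - \|n(x)\|_x} \Big)^2 \le \frac19. \]
Applying the conclusion of the preceding paragraph to $x_+$ rather than to $x$, we find that $f$
has a minimizer $z$ and $\|z - x_+\|_{x_+} \le 3 \|n(x_+)\|_{x_+}$. Thus,
\[ \|x_+ - z\|_x \le \frac{\|x_+ - z\|_{x_+}}{1 - \|n(x)\|_x} \le \frac{3 \|n(x_+)\|_{x_+}}{1 - \|n(x)\|_x}
   \le \frac{3 \|n(x)\|_x^2}{(1 - \|n(x)\|_x)^3}. \]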
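The practical content of Theorem 2.2.5 is that the computable quantity $\|n(x)\|_x$ certifies proximity to the (unknown) minimizer. The sketch below illustrates this under assumptions: it uses the hypothetical functional $f(x) = \langle c, x\rangle - \sum_j \ln x_j$, for which the minimizer $z_j = 1/c_j$ happens to be known, so that the certified bound $\|x - z\|_x \le \|n(x)\|_x + 3\|n(x)\|_x^2 / (1 - \|n(x)\|_x)^3$ can be compared against the true distance. The test points are arbitrary.

```python
import math

def decrement_and_distance(x, c):
    """For f(x) = <c, x> - sum(ln x_j), return (||n(x)||_x, ||x - z||_x),
    where z_j = 1/c_j is the minimizer (known here only for comparison)."""
    u = [cj * xj for cj, xj in zip(c, x)]
    # n_j / x_j = 1 - c_j x_j  and  (x_j - z_j) / x_j = 1 - 1/(c_j x_j)
    decrement = math.sqrt(sum((1.0 - uj) ** 2 for uj in u))
    distance = math.sqrt(sum((1.0 - 1.0 / uj) ** 2 for uj in u))
    return decrement, distance

c = [1.0, 2.0, 4.0]
checks = []
for x in ([1.1, 0.5, 0.26], [0.9, 0.55, 0.23], [1.0, 0.45, 0.25]):
    delta, dist = decrement_and_distance(x, c)
    if delta <= 0.25:
        # Bound certified by the Newton decrement alone.
        certificate = delta + 3.0 * delta ** 2 / (1.0 - delta) ** 3
        checks.append(dist <= certificate)

print(checks)
```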

2.2.3 Other Properties


We noted that adding a linear functional to a self-concordant functional yields a self-
concordant functional. The next two theorems demonstrate other relevant ways of con-
structing self-concordant functionals. The first theorem shows the set SC to be closed under
addition.

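Theorem 2.2.6. The set $SC$ is closed under addition; that is, if $f_1$ and $f_2$ are self-concordant
functionals satisfying $D_{f_1} \cap D_{f_2} \ne \emptyset$, then $f_1 + f_2 : D_{f_1} \cap D_{f_2} \to \mathbb{R}$ is a self-concordant
functional.

Proof. Let $f := f_1 + f_2$, and for $k = 1, 2$ let $H_k$, $\|\ \|_{x,k}$, and $B_{x,k}$ denote the Hessian,
local norm, and local ball for $f_k$. Assume $x \in D_f$. For all $v$,
\[ \langle v, H(x) v \rangle = \langle v, H_1(x) v \rangle + \langle v, H_2(x) v \rangle, \]
that is,
\[ \|v\|_x^2 = \|v\|_{x,1}^2 + \|v\|_{x,2}^2. \]
Hence,
\[ B_x(x, 1) \subseteq B_{x,1}(x, 1) \cap B_{x,2}(x, 1) \subseteq D_{f_1} \cap D_{f_2} = D_f, \]
as required by the definition of self-concordance.
Note that whenever $a, b, c, d$ are positive numbers,
\[ \frac{a + b}{c + d} \le \max \Big\{ \frac{a}{c}, \frac{b}{d} \Big\}. \]
Consequently, if $y \in B_x(x, 1)$ and $v \ne 0$,
\[ \frac{\|v\|_y^2}{\|v\|_x^2} = \frac{\|v\|_{y,1}^2 + \|v\|_{y,2}^2}{\|v\|_{x,1}^2 + \|v\|_{x,2}^2}
   \le \max_{k} \frac{\|v\|_{y,k}^2}{\|v\|_{x,k}^2}
   \le \max_{k} \frac{1}{(1 - \|y - x\|_{x,k})^2}
   \le \frac{1}{(1 - \|y - x\|_x)^2}, \]
establishing the upper bound on $\|v\|_y / \|v\|_x$ in the definition of self-concordance. One
establishes the lower bound similarly.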

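Theorem 2.2.7. If $f \in SC$, $D_f \subseteq \mathbb{R}^m$, $b \in \mathbb{R}^m$, and $A : \mathbb{R}^n \to \mathbb{R}^m$ is an injective
linear operator, then $x \mapsto f(Ax - b)$ is a self-concordant functional, assuming the domain
$\{x : Ax - b \in D_f\}$ is nonempty.

Proof. Denote the functional $x \mapsto f(Ax - b)$ by $f'$. Let $\|\ \|'_x$ denote the local norm derived
from $f'$. Assuming $x \in D_{f'}$, one easily verifies from the identity $H'(x) = A^* H(Ax - b) A$
that $H'(x)$ is pd and that $\|v\|'_x = \|Av\|_{Ax - b}$ for all $v$. In particular,
\[ \{Ay - b : y \in B'_x(x, 1)\} \subseteq B_{Ax - b}(Ax - b, 1) \subseteq D_f, \]
and thus
\[ B'_x(x, 1) \subseteq D_{f'}, \]
as required by the definition of self-concordance.
Moreover, if $y \in B'_x(x, 1)$ and $v \ne 0$, then
\[ \frac{\|v\|'_y}{\|v\|'_x} = \frac{\|Av\|_{Ay - b}}{\|Av\|_{Ax - b}} \le \frac{1}{1 - \|A(y - x)\|_{Ax - b}} = \frac{1}{1 - \|y - x\|'_x}, \]
establishing the upper bound on $\|v\|'_y / \|v\|'_x$ in the definition of self-concordance. One
establishes the lower bound similarly.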
Applying Theorem 2.2.7 with the logarithmic barrier function for the nonnegative
orthant in $\mathbb{R}^m$, one verifies self-concordance for the functional
\[ x \mapsto -\sum_{i=1}^m \ln(a_i \cdot x - b_i), \]
whose domain consists of the points $x$ satisfying the strict linear inequality constraints
$a_i \cdot x > b_i$. This self-concordant functional is important for LPs with constraints written in
the form $Ax \ge b$. It, too, is referred to as a "logarithmic barrier function."
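The identity $\|v\|'_x = \|Av\|_{Ax - b}$ used in the proof of Theorem 2.2.7 can be checked directly for this barrier function. The following sketch assumes the dot product (so $A^* = A^T$) and a small hypothetical constraint system describing the open triangle $x_0 > 0$, $x_1 > 0$, $x_0 + x_1 < 1$; the Hessian is assembled as $A^T \operatorname{diag}(1/s_i^2) A$ with slacks $s = Ax - b$.

```python
A = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # rows a_i (hypothetical data)
b = [0.0, 0.0, -1.0]                          # domain: a_i . x > b_i for all i

def slacks(x):
    """s = A x - b, which must be strictly positive inside the domain."""
    return [sum(aij * xj for aij, xj in zip(ai, x)) - bi
            for ai, bi in zip(A, b)]

def barrier_hessian(x):
    """H'(x) = A^T diag(1/s_i^2) A for f'(x) = -sum(ln(a_i . x - b_i))."""
    s = slacks(x)
    n = len(x)
    return [[sum(A[i][j] * A[i][k] / s[i] ** 2 for i in range(len(A)))
             for k in range(n)] for j in range(n)]

x = [0.2, 0.3]
v = [0.7, -0.4]
s = slacks(x)
H = barrier_hessian(x)

# (||v||'_x)^2 computed from the Hessian of f'
lhs = sum(v[j] * H[j][k] * v[k] for j in range(2) for k in range(2))
# ||Av||^2 in the local norm of the orthant barrier at the point Ax - b
Av = [sum(aij * vj for aij, vj in zip(ai, v)) for ai in A]
rhs = sum((avi / si) ** 2 for avi, si in zip(Av, s))

print(lhs, rhs)
```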
To provide the reader with another (logarithmic barrier) functional with which to apply
the above theorems, we mention that $x \mapsto -\ln(1 - \|x\|^2)$ is a self-concordant functional
with domain the open unit ball. (Verification of self-concordance is made in §2.5.) Given
an ellipsoid $\{x : \|Ax\| \le r\}$, it then follows from Theorem 2.2.7 that
\[ x \mapsto -\ln\big(1 - \|Ax\|^2 / r^2\big) \]
is a self-concordant functional whose domain is the ellipsoid, yet another logarithmic barrier
function. For an intersection of ellipsoids, one simply adds the functionals for the individual
ellipsoids, as justified by Theorem 2.2.6.
Although the values of finite-valued convex functionals with bounded domains (i.e.,
bounded w.r.t. any reference norm) are always bounded from below, such a functional need
not have a minimizer even if it is continuous. This is not the case for self-concordant
functionals.

Theorem 2.2.8. If $f \in SC$ and the values of $f$ are bounded from below, then $f$ has a
minimizer. (In particular, if $D_f$ is bounded, then $f$ has a minimizer.)

Proof. Choose x that satisfies

Letting $y := x + \frac{1}{3 \|n(x)\|_x}\, n(x)$, Theorem 2.2.2 shows

and thus, by choice of x,

that is,

Theorem 2.2.5 now implies / to have a minimizer.


The conclusion of the next theorem is trivially verified for important self-concordant
functionals like those obtained by adding linear functionals to logarithmic barrier functions.
Whereas our definition of self-concordance plays a useful role in simplifying and unifying the
analysis of Newton's method for many functionals important to ipm's, it certainly does not
simplify the proof of the property established in the next theorem for those same functionals.
Nonetheless, for the theory it is important that the property is possessed by all self-concordant
functionals.

Theorem 2.2.9. Assume $f \in SC$ and $\bar{x} \in \partial D_f$, the boundary of $D_f$. If the sequence
$\{x_i\} \subset D_f$ converges to $\bar{x}$, then $\lim_i f(x_i) = \infty$ and $\|g(x_i)\| \to \infty$.

Proof. Assume $x_i \to \bar{x} \in \partial D_f$. We claim that if $f(x_i) \to \infty$, then $\|g(x_i)\| \to \infty$. Indeed,
fix $y \in D_f$. By convexity,
\[ f(y) \ge f(x_i) + \langle g(x_i), y - x_i \rangle \ge f(x_i) - \|g(x_i)\| \, \|y - x_i\|. \]
The claim easily follows. Thus it only remains to prove $f(x_i) \to \infty$.
Adding $f$ to the functional $x \mapsto -\ln(R - \|x\|^2)$ where $R > \|\bar{x}\|^2, \|x_i\|^2$ (for all
$i$), one obtains a self-concordant functional $\tilde{f}$ for which $D_{\tilde{f}}$ is bounded and for which
$\lim_i \tilde{f}(x_i) = \infty$ iff $\lim_i f(x_i) = \infty$. Consequently, we may assume $D_f$ is bounded.
Assuming $D_f$ is bounded, we will construct from $\{x_i\}$ a sequence $\{y_i\} \subset D_f$ which
has a unique limit point, the limit point lying in $\partial D_f$, and which satisfies

Applying the same construction to the sequence $\{y_i\}$, and so on, we will thus conclude
that if $\lim_i f(x_i) \ne \infty$, then $f$ assumes arbitrarily small values, contradicting the lower
boundedness of finite-valued convex functionals having bounded domains.
Shortly, we prove $\liminf_i \|n(x_i)\|_{x_i} \ge \frac15$. In particular, for sufficiently large $i$, it
follows that is well-defined and, from Theorem 2.2.2,

Moreover, all limit points of $\{y_i\}$ lie in $\partial D_f$; indeed, otherwise, passing to a subsequence of
$\{y_i\}$ if necessary, there exists $\epsilon > 0$ such that $B(y_i, \epsilon) \subset D_f$ for all $i$, where the ball is w.r.t. a
reference norm. However, since $y_i$ and $x_i - (y_i - x_i)$ lie in $D_f$ (because $\|y_i - x_i\|_{x_i} < 1$),
it then follows from convexity of $D_f$ that $B(x_i, \epsilon/2) \subset D_f$, contradicting $x_i \to x \in \partial D_f$.
Consequently, all limit points of $\{y_i\}$ do indeed lie in $\partial D_f$. Restricting to a subsequence if
necessary, we may assume $\{y_i\}$ has a unique limit point, the limit point lying in $\partial D_f$.
Finally, we show $\liminf_i \|n(x_i)\|_{x_i} \ge \tfrac15$. Since $D_f$ is bounded, Theorem 2.2.8
shows that $f$ has a minimizer $z$. Since $B_z(z, 1) \subset D_f$ and $x_i \to x \in \partial D_f$, we have
$\liminf_i \|x_i - z\|_z \ge 1$. Hence, from the definition of self-concordance,
$\liminf_i \|x_i - z\|_{x_i} \ge \tfrac12$. Because $f$ has at most one minimizer, Theorem 2.2.5 then implies
$\liminf_i \|n(x_i)\|_{x_i} \ge \tfrac15$, concluding the proof.
We close this section with a technical proposition to be called upon later.
If $g(x) + v$ is a vector sufficiently near $g(x)$, there exists $x + u$ close to $x$ such that
$g(x + u) = g(x) + v$, a consequence of $H(x)$ being pd and hence invertible. It is useful to
quantify "near" and "close" when the inner product is the local inner product, that is, when
$H_x(x) = I$ and hence $u \approx v$.

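Proposition 2.2.10. Assume $f \in \mathcal{SC}$ and $x \in D_f$. If $\|v\|_x \le r$ where $r \le \tfrac14$, there exists
$u \in B_x\bigl(v, \tfrac{3r^2}{(1-r)^3}\bigr)$ such that $g_x(x + u) = g_x(x) + v$.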

Proof. Consider the self-concordant functional
\[ y \mapsto -\langle g_x(x) + v, y \rangle_x + f(y), \tag{2.5} \]
a functional whose local inner products agree with those of $f$. Note that a point $z'$ minimizes
the functional iff $g_x(z') = g_x(x) + v$. Under the assumption of the proposition, we thus
wish to show $z'$ exists and $u := z' - x$ satisfies $\|u - v\|_x \le \tfrac{3r^2}{(1-r)^3}$.
Since at $x$ the Newton step for the functional (2.5) is $v$, the assumption $\|v\|_x \le r \le \tfrac14$
allows us to apply Theorem 2.2.5, concluding that a minimizer $z'$ does indeed exist and
$\|z' - (x + v)\|_x \le \tfrac{3r^2}{(1-r)^3}$.

2.3 Barrier Functionals


2.3.1 Introduction
A functional $f$ is said to be a (strongly nondegenerate self-concordant) barrier
functional if $f \in \mathcal{SC}$ and
\[ \vartheta_f := \sup_{x \in D_f} \|g_x(x)\|_x^2 < \infty. \]
Let SCB denote the family of functionals thus defined. We typically refer to elements of
SCB as "barrier functionals."
The definition of barrier functionals is phrased in terms of $\|g_x(x)\|_x$ rather than in
terms of the identical quantity $\|n(x)\|_x$ because the importance of barrier functionals for
ipm's lies not in applying Newton's method to them directly but rather in applying Newton's
method to self-concordant functionals built from them. As mentioned before, for an LP
\[ \min\ c \cdot x \quad \text{s.t.} \quad Ax = b,\ x \ge 0, \]
the most important self-concordant functionals are those of the form
\[ \eta\, c \cdot x + f|_L(x), \tag{2.6} \]
where $\eta > 0$ is a fixed constant, $f$ is the logarithmic barrier function for the nonnegative
orthant, and $L := \{x : Ax = b\}$.
When Nesterov and Nemirovskii [15] defined barrier functionals, they referred to $\vartheta_f$
as "the parameter of the barrier $f$." Unfortunately, this can be confused with the phrase
"barrier parameter" which predates [15] and refers to the constant $\eta$ in (2.6). Consequently,
we prefer to call $\vartheta_f$ the complexity value of $f$, especially because it is the quantity that most
often represents $f$ in the complexity analysis of ipm's relying on $f$.
If one restricts a barrier functional $f$ to a subspace $L$ (or a translation of a subspace),
one obtains a barrier functional simply because the local norms for $f|_L$ are the restrictions
of the local norms for $f$ and

Clearly, $\vartheta_{f|_L} \le \vartheta_f$.


The primordial barrier functional is the primordial self-concordant functional, that is,
the logarithmic barrier function for the nonnegative orthant, $f(x) := -\sum_j \ln x_j$. Relying
on the dot product, so that $g(x)$ is the vector with $j$th entry $-1/x_j$ and $H(x)$ is the diagonal
matrix with $j$th diagonal entry $1/x_j^2$, we have
\[ \|g_x(x)\|_x^2 = \langle g(x), H(x)^{-1} g(x) \rangle = \sum_j \Bigl(\frac{1}{x_j}\Bigr)^2 x_j^2 = n. \]
Thus, $\vartheta_f = n$.

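Now let $f$ denote the logarithmic barrier function for the cone of pd matrices in $\mathbb{S}^{n \times n}$,
that is, $f(X) := -\ln\det(X)$. Relying on the trace product, we have, for all $X \in \mathbb{S}^{n \times n}_{++}$,
\[ \|g_X(X)\|_X^2 = \langle g(X), H(X)^{-1} g(X) \rangle = \operatorname{trace}(X^{-1} X) = n. \]
Thus, $\vartheta_f = n$.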


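Finally, let $f$ denote the logarithmic barrier function for the unit ball in $\mathbb{R}^n$, that is,
$f(x) := -\ln(1 - \|x\|^2)$ (where $\|\ \| := \langle\ ,\ \rangle^{1/2}$ for some inner product). It is not difficult
to verify that for $x$ in the unit ball,
\[ g(x) = \frac{2}{1 - \|x\|^2}\, x, \qquad H(x)v = \frac{2}{1 - \|x\|^2}\, v + \frac{4 \langle x, v \rangle}{(1 - \|x\|^2)^2}\, x, \]
and hence
\[ H(x)^{-1} g(x) = \frac{1 - \|x\|^2}{1 + \|x\|^2}\, x. \]
Consequently,
\[ \|g_x(x)\|_x^2 = \langle g(x), H(x)^{-1} g(x) \rangle = \frac{2\|x\|^2}{1 + \|x\|^2}. \]
It readily follows that $\vartheta_f = 1$, showing the complexity value need not depend on the
dimension $n$.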
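The two complexity values above can be checked numerically. The following is a minimal sketch, assuming the dot product as the reference inner product; the function names are illustrative and not from the text. It evaluates $\|g_x(x)\|_x^2 = \langle g(x), H(x)^{-1} g(x) \rangle$ for the orthant barrier and for the unit-ball barrier $-\ln(1 - \|x\|^2)$, using the gradient $2x/(1-\|x\|^2)$ and the Hessian $aI + b\,xx^T$ with $a = 2/(1-\|x\|^2)$, $b = 4/(1-\|x\|^2)^2$.

```python
def orthant_barrier_sq_norm(x):
    """||g_x(x)||_x^2 = g^T H^{-1} g for f(x) = -sum_j ln x_j."""
    # g_j = -1/x_j and H is diagonal with entries 1/x_j^2,
    # so each coordinate contributes (1/x_j)^2 * x_j^2.
    return sum((-1.0 / xj) ** 2 * xj ** 2 for xj in x)


def ball_barrier_sq_norm(x):
    """||g_x(x)||_x^2 for f(x) = -ln(1 - ||x||^2), from g and H directly."""
    r2 = sum(xj * xj for xj in x)
    assert r2 < 1.0, "x must lie in the open unit ball"
    a = 2.0 / (1.0 - r2)          # H(x) = a*I + b*x x^T
    b = 4.0 / (1.0 - r2) ** 2
    g = [a * xj for xj in x]      # g(x) = 2x / (1 - ||x||^2)
    # Sherman-Morrison: (aI + b x x^T)^{-1} g = g/a - b <x,g> x / (a (a + b ||x||^2))
    xg = sum(xj * gj for xj, gj in zip(x, g))
    hinv_g = [gj / a - b * xg * xj / (a * (a + b * r2)) for xj, gj in zip(x, g)]
    return sum(gj * hj for gj, hj in zip(g, hinv_g))
```

For the orthant the value is $n$ at every point, while for the ball it equals $2\|x\|^2/(1+\|x\|^2)$, which stays below 1 and approaches 1 only at the boundary.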
Nesterov and Nemirovskii [15] proved a most impressive and theoretically important
result, namely, that each open, convex set containing no lines is the domain of a (strongly
nondegenerate self-concordant) barrier functional. In fact, they proved the existence of a
universal constant $C$ with the property that for each $n$, if the convex set lies in $\mathbb{R}^n$, there
exists a barrier functional whose domain is the set and whose complexity value is bounded
by Cn. (I have yet to find a relatively transparent proof of this result, and hence a proof
is not contained in this book.) Unfortunately, the result is only of theoretical interest. To
rely on self-concordant functionals in devising ipm's, one must be able to readily compute
their gradients and Hessians. For the self-concordant functionals proven to exist, from a
computational viewpoint one cannot say much more than that the gradients and Hessians
exist. By contrast, the importance of the various logarithmic barrier functions we have
described lies largely in the ease with which their gradients and Hessians can be computed.
In §2.2 we noted that if a linear functional is added to a self-concordant functional, the
resulting functional is self-concordant because the Hessians are unchanged; the definition
of self-concordance depends on the Hessians alone. By contrast, adding a linear functional
to a barrier functional need not result in a barrier functional. For example, consider the
univariate barrier functional $x \mapsto -\ln x$ and the functional $x \mapsto x - \ln x$.
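To see the failure concretely, one can compute $\|g_x(x)\|_x^2 = g(x)^2 / H(x)$ for each of the two univariate functionals (a short check, relying on the usual product on $\mathbb{R}$ as the reference inner product):

```latex
\[
f(x) = -\ln x: \qquad g(x) = -\frac{1}{x}, \quad H(x) = \frac{1}{x^2}, \quad
\|g_x(x)\|_x^2 = \frac{g(x)^2}{H(x)} = 1,
\]
\[
f(x) = x - \ln x: \qquad g(x) = 1 - \frac{1}{x}, \quad H(x) = \frac{1}{x^2}, \quad
\|g_x(x)\|_x^2 = \frac{g(x)^2}{H(x)} = (x - 1)^2 .
\]
```

The first quantity is bounded (its supremum is 1), whereas the second is unbounded as $x \to \infty$, so the supremum that would define a complexity value is infinite.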
The set SCB, like SC, is closed under addition.

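Theorem 2.3.1. If $f_1, f_2 \in \mathcal{SCB}$ and $D_{f_1} \cap D_{f_2} \ne \emptyset$, then $f := f_1 + f_2 \in \mathcal{SCB}$ (where
$D_f := D_{f_1} \cap D_{f_2}$) and
\[ \vartheta_f \le \vartheta_{f_1} + \vartheta_{f_2}. \]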

Proof. Assume $x \in D_f$. Let the reference inner product $\langle\ ,\ \rangle$ be the local inner product
at $x$ defined by $f$. Thus, $I = H(x) = H_1(x) + H_2(x)$. In particular, $H_1(x)$ and $H_2(x)$
commute, i.e., $H_1(x) H_2(x) = H_2(x) H_1(x)$. Consequently, so do $H_1(x)^{1/2}$ and $H_2(x)^{1/2}$.
For brevity, let $H_i := H_i(x)$ and $g_i := g_i(x)$ for $i = 1, 2$.
To prove the inequality in the statement of the theorem, it suffices to show
\[ \|g_1 + g_2\|^2 \le \|H_1^{-1/2} g_1\|^2 + \|H_2^{-1/2} g_2\|^2 \]
since, by definition, the quantity on the right is bounded from above by $\vartheta_{f_1} + \vartheta_{f_2}$.
Defining $y_i := H_i^{-1/2} g_i$ for $i = 1, 2$, we have

where the fourth equality relies on $H_1^{1/2}$ and $H_2^{1/2}$ commuting.


The set SCB, like SC, is closed under composition with injective linear maps.

Theorem 2.3.2. If $f \in \mathcal{SCB}$, $D_f \subseteq \mathbb{R}^m$, $b \in \mathbb{R}^m$, and $A : \mathbb{R}^n \to \mathbb{R}^m$ is an injective linear
operator, then $x \mapsto f(Ax - b)$ is a barrier functional (assuming the domain $\{x : Ax - b \in D_f\}$
is nonempty) and its complexity value does not exceed $\vartheta_f$.

Proof. Assume $Ax - b \in D_f$. Endow $\mathbb{R}^n$ with an arbitrary reference inner product
and let the reference inner product on $\mathbb{R}^m$ be the local inner product for $f$ at $Ax - b$.
Denoting the functional $x \mapsto f(Ax - b)$ by $f'$, we then have $g'(x) = A^* g(Ax - b)$ and $H'(x) = A^* A$.
Thus,

(The last inequality is due to the operator $A (A^* A)^{-1} A^*$ being an orthogonal projection operator; the
operator has norm equal to one.)
With regards to theory, the following theorem is perhaps the most useful tool in
establishing properties possessed by all barrier functionals. The inner product is arbitrary.

Theorem 2.3.3. Assume $f \in \mathcal{SCB}$. If $x, y \in D_f$, then
\[ \langle g(x), y - x \rangle < \vartheta_f. \]
Proof. We wish to prove $\phi'(0) < \vartheta_f$, where $\phi$ is the univariate functional defined by
$\phi(t) := f(x + t(y - x))$. In doing so, we may assume $\phi'(0) > 0$ and hence, by convexity
of $\phi$, $\phi'(t) > 0$ for all $t \ge 0$ in the domain of $\phi$.

Let $v(t) := x + t(y - x)$. Assuming $t \ge 0$ is in the domain of $\phi$,

and hence, for all $s \ge 0$ in the domain of $\phi$,

Thus,

that is,

Consequently, the domain of the convex functional $\phi$ is contained in the open interval
$(-\infty, \vartheta_f/\phi'(0))$. In particular, $\vartheta_f/\phi'(0)$ is not in the domain. Since $s = 1$ is in the domain
of $\phi$ we thus have $1 < \vartheta_f/\phi'(0)$.
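The intermediate steps of this argument can be sketched as follows. This is a reconstruction consistent with the surrounding text, assuming only the definition of $\vartheta_f$ as the supremum of $\|g_x(x)\|_x^2$ and the Cauchy-Schwarz inequality; it is not necessarily the exact sequence of displays in the original.

```latex
\begin{align*}
\phi'(t) &= \langle g(v(t)), y - x \rangle
          = \langle g_{v(t)}(v(t)), y - x \rangle_{v(t)}
          \le \sqrt{\vartheta_f}\, \|y - x\|_{v(t)}, \\
\phi''(t) &= \|y - x\|_{v(t)}^2 \ge \frac{\phi'(t)^2}{\vartheta_f}, \\
\frac{s}{\vartheta_f} &\le \int_0^s \frac{\phi''(t)}{\phi'(t)^2}\, dt
          = \frac{1}{\phi'(0)} - \frac{1}{\phi'(s)} < \frac{1}{\phi'(0)} .
\end{align*}
```

The last line uses $\phi'(s) > 0$ and forces $s < \vartheta_f/\phi'(0)$ for every $s \ge 0$ in the domain of $\phi$, which is the containment invoked in the final step of the proof.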

2.3.2 Analytic Centers


The next theorem implies that for each $x$ in the domain of a barrier functional, the ball
$B_x(x, 1)$ is, to within a factor of $4\vartheta_f + 1$, the largest among all ellipsoids centered at $x$
which are contained in the domain. A consequence is that no (strongly nondegenerate
self-concordant) barrier functional has a domain containing a line.

Theorem 2.3.4. Assume $f \in \mathcal{SCB}$. If $x, y \in D_f$ satisfy $\langle g(x), y - x \rangle \ge 0$, then $y \in B_x(x, 4\vartheta_f + 1)$.

Proof. Restricting $f$ to the line through $x$ and $y$, we may assume $f$ is univariate. Viewing the
line as $\mathbb{R}$ with values increasing as one travels from $x$ to $y$, the assumption $\langle g(x), y - x \rangle \ge 0$
is then equivalent to $g(x) \ge 0$, i.e., $g(x)$ is a nonnegative number.
Let $v$ denote the smallest nonnegative number for which $\|g_x(x) + v\|_x \ge \tfrac14$. Since
$g_x(x) \ge 0$, we have $\|v\|_x \le \tfrac14$. Applying Proposition 2.2.10, we find there exists $u$ satisfying

Note that \\u\\x < 1.


Theorem 2.3.3 implies

where the last inequality makes use of $g_x(x) + v$ and $y - x$ both being nonnegative.
However, since $\|g_x(x) + v\|_x > \tfrac14$ only if $v = 0$ (and hence only if $u = 0$), we

Thus,

from which the theorem is immediate.


Minimizers of barrier functionals are called analytic centers. The following corollary
gives meaning to the term "center."

Corollary 2.3.5. Assume $f \in \mathcal{SCB}$. If $z$ is the analytic center for $f$, then
\[ B_z(z, 1) \subseteq D_f \subseteq B_z(z, 4\vartheta_f + 1). \]
Proof. Since $f \in \mathcal{SC}$, the leftmost containment is by assumption. The rightmost containment
is immediate from Theorem 2.3.4 since $g(z) = 0$.
Corollary 2.3.5 suggests that if one were to choose a single inner product as being
especially natural for a barrier functional with bounded domain, the local inner product at
the analytic center would be an appropriate choice because the resulting balls conform to
the shape of the domain.
When do analytic centers exist? The answer is given by the following corollary.

Corollary 2.3.6. If $f \in \mathcal{SCB}$, then $f$ has an analytic center iff $D_f$ is bounded.

Proof. The proof is immediate from Theorem 2.2.8 and Corollary 2.3.5.

2.3.3 Optimal Barrier Functionals


In the complexity analysis of ipm's, it is desirable to have barrier functionals with small
complexity values. However, there is a positive threshold below which the complexity
values of no barrier functionals fall. Nesterov and Nemirovskii [15] prove that $\vartheta_f \ge 1$ for
all $f \in \mathcal{SCB}$. To understand why there is indeed a lower bound, assume $\vartheta_f < \tfrac{1}{16}$ for some
$f \in \mathcal{SCB}$. Since $g_x(x) = -n(x)$, so that $\|n(x)\|_x < \tfrac14$ for all $x \in D_f$, Theorem 2.2.5 then
implies $f$ has a (unique) minimizer $z$ and all $x \in D_f$ satisfy $\|z - x\|_x < 1$. However, by choosing
$x$ so that in the line $L$ through $x$ and $z$, the distance from $x$ to the boundary of $D_f \cap L$ is smaller
than the distance from $x$ to $z$, the containment $B_x(x, 1) \subset D_f$ implies $\|z - x\|_x \ge 1$, a contradiction.
Hence $\vartheta_f \ge \tfrac{1}{16}$ for all $f \in \mathcal{SCB}$.
Likewise, by Theorem 2.2.5, if $f \in \mathcal{SCB}$ and $D_f$ is unbounded (hence $f$ has no
minimizer), then $\|g_x(x)\|_x > \ell := \tfrac14$ for all $x \in D_f$.
It is worth noting that any universal lower bound $\ell$ as in the preceding paragraph implies
a lower bound $n\ell \le \vartheta_f$ for each barrier functional $f$ whose domain is the nonnegative
orthant $\mathbb{R}^n_{++}$. Indeed, let $e$ denote the vector of all ones and let $e_j$ denote the $j$th unit vector.
Consider the univariate barrier functional $f_j$ obtained by restricting $f$ to the line through $e$ in
the direction $e_j$. Let $g_e$ denote the gradient of $f$ w.r.t. $\langle\ ,\ \rangle_e$ and let $g_{j,e}$ denote the gradient
of $f_j$ w.r.t. the restricted inner product. Since $D_{f_j}$ is unbounded, and hence $f_j$ does not have
an analytic center, it is readily proven (without making use of a particular inner product)
that $\langle g_{j,e}(e), e_j \rangle_e < 0$. Since $g_{j,e}(e)$ and $e_j$ are colinear (because $D_{f_j}$ is one-dimensional), it
follows that

Noting that $\|e_j\|_e \ge 1$ because $e - e_j \notin D_f$, we thus have

Hence,

the last inequality by Theorem 2.3.3.
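The chain of inequalities in this argument can be sketched as follows. This is a reconstruction under the assumptions stated in the text ($\ell$ is a lower bound on $\|g_x(x)\|_x$ for barrier functionals with unbounded domains, and Theorem 2.3.3 is applied with $y$ tending to the boundary point $0$); the original displays may be arranged differently.

```latex
\begin{align*}
\langle g_{j,e}(e), e_j \rangle_e
  &= -\|g_{j,e}(e)\|_e\, \|e_j\|_e \le -\ell\, \|e_j\|_e \le -\ell, \\
n\ell
  &\le -\sum_{j=1}^n \langle g_{j,e}(e), e_j \rangle_e
   = -\sum_{j=1}^n \langle g_e(e), e_j \rangle_e
   = \langle g_e(e), 0 - e \rangle_e \le \vartheta_f .
\end{align*}
```

The middle equality in the second line uses the fact that the gradient of the restriction $f_j$ and the gradient of $f$ have the same inner product with the direction $e_j$ of the line.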


In light of the preceding paragraphs, we see that with regards to the complexity value,
the logarithmic barrier function for the nonnegative orthant $\mathbb{R}^n_{++}$ is the optimal barrier
functional having domain $\mathbb{R}^n_{++}$. Likewise, viewing $\mathbb{R}^n$ as a subspace of $\mathbb{S}^{n \times n}$, we can
conclude that the logarithmic barrier function for the cone of pd matrices is the optimal
barrier functional having that cone as its domain.

2.3.4 Other Properties


For arbitrary inner products, the bounds $\|g_x(x)\|_x \le \sqrt{\vartheta_f}$ imply nothing about the quantities
$\|g(x)\|$. However, the bounds do imply bounds on the quantities $\|g_y(x)\|_y$ for all $y \in D_f$.
This is the subject of the next proposition. First, a definition.
For $x$ in an arbitrary bounded convex set $D$, a natural way of measuring the relative
nearness of $x$ to the boundary of $D$, in a manner that is independent of a particular norm,
is the quantity known as the symmetry of $D$ about $x$, denoted $\operatorname{sym}(x, D)$. This quantity is
defined in terms of the set $\mathcal{L}(x, D)$ consisting of all lines through $x$ which intersect $D$ in
an interval of positive length. (If $D$ is lower-dimensional, most lines through $x$ will not be
in $\mathcal{L}(x, D)$.) If $x$ is an endpoint of $L \cap D$ for some $L \in \mathcal{L}(x, D)$, define $\operatorname{sym}(x, D) := 0$.
Otherwise, for each $L \in \mathcal{L}(x, D)$, letting $r(L)$ denote the ratio of the length of the smaller
to the larger of the two intervals in $L \cap (D \setminus \{x\})$, define
\[ \operatorname{sym}(x, D) := \inf_{L \in \mathcal{L}(x, D)} r(L). \]
Clearly, if $D$ is an ellipsoid centered at $x$, then $\operatorname{sym}(x, D) = 1$, "perfect symmetry."

Corollary 2.3.5 implies that if $z$ is the analytic center for a barrier functional $f$, then
$\operatorname{sym}(z, D_f) \ge 1/(4\vartheta_f + 1)$.

Proposition 2.3.7. Assume $f \in \mathcal{SCB}$. For all $x, y \in D_f$,
\[ \|g_y(x)\|_y \le \Bigl(1 + \frac{1}{\operatorname{sym}(x, D_f)}\Bigr) \vartheta_f. \]

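Proof. For brevity, let $s := \operatorname{sym}(x, D_f)$. Assuming $x, y \in D_f$, note that $w := x + s(x - y) \in \bar{D}_f$
(the closure of $D_f$) since $w, x, y$ are colinear and $\|w - x\| = s\|x - y\|$.
Since $B_y(y, 1) \subset D_f$ and $D_f$ is convex, we thus have

that is, $B_y\bigl(x, \tfrac{s}{1+s}\bigr) \subset D_f$. Consequently,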

the last inequality by Theorem 2.3.3.


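Theorem 2.2.9 shows that $f(x_i) \to \infty$ if $f$ is a self-concordant functional and $\{x_i\}$
converges to a point in the boundary of $D_f$. To close this subsection, we present a theorem
that implies the rate at which $f(x_i)$ goes to $\infty$ is "slow" if $f$ is a barrier functional.

Theorem 2.3.8. Assume $f \in \mathcal{SCB}$ and $x \in D_f$. If $y \in \bar{D}_f$, then for all $0 < t \le 1$,
\[ f(y + t(x - y)) \le f(x) - \vartheta_f \ln t. \]
Proof. For $s \ge 0$ let $x(s) := y + e^{-s}(x - y)$ and consider the univariate functional
$\phi(s) := f(x(s))$. Relying on the chain rule, observe
\[ \phi'(s) = \langle g(x(s)), -e^{-s}(x - y) \rangle = \langle g(x(s)), y - x(s) \rangle \le \vartheta_f, \]
the inequality due to Theorem 2.3.3. Hence, for $0 < t \le 1$,
\[ f(y + t(x - y)) = \phi(-\ln t) = \phi(0) + \int_0^{-\ln t} \phi'(s)\, ds \le f(x) - \vartheta_f \ln t. \]
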
2.3.5 Logarithmic Homogeneity


In Chapter 3, when we tie ipm's to duality theory, attention will often focus on a particular
type of barrier functional $f$ whose domain is the interior $K^\circ$ of a closed, convex cone $K$ (if
$x_1, x_2 \in K$ and $t_1, t_2 \ge 0$, then $t_1 x_1 + t_2 x_2 \in K$). A barrier functional $f : K^\circ \to \mathbb{R}$ is said
to be logarithmically homogeneous if for all $x \in K^\circ$ and $t > 0$,
\[ f(tx) = f(x) - \vartheta_f \ln t. \tag{2.7} \]
It is easily established that the logarithmic barrier functions for the nonnegative orthant
and the cone of psd matrices are logarithmically homogeneous, as are barrier functionals
of the form $x \mapsto f(Ax)$ where $f$ is logarithmically homogeneous. Another important
example of a logarithmically homogeneous barrier functional is

the domain of this functional being the interior of the second-order cone

It has complexity value $\vartheta_f = 2$. As with the standard barrier functionals for the nonnegative
orthant and the cone of psd matrices, it is referred to as a logarithmic barrier function.
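Logarithmic homogeneity, $f(tx) = f(x) - \vartheta_f \ln t$, is easy to test numerically. The sketch below does so for the orthant barrier ($\vartheta_f = n$) and for a second-order cone barrier. The specific form $-\ln(x_1^2 - x_2^2 - \cdots - x_n^2)$ with the first coordinate distinguished is an assumption made here for illustration, and all function names are hypothetical.

```python
import math


def orthant_barrier(x):
    """f(x) = -sum_j ln x_j on the strictly positive orthant."""
    return -sum(math.log(xj) for xj in x)


def soc_barrier(x):
    """-ln(x_1^2 - x_2^2 - ... - x_n^2) on the interior of the cone (assumed form)."""
    q = x[0] ** 2 - sum(xj ** 2 for xj in x[1:])
    assert x[0] > 0 and q > 0, "x must lie in the interior of the cone"
    return -math.log(q)


def homogeneity_gap(f, x, t, theta):
    """Return f(tx) - (f(x) - theta*ln t); zero for a logarithmically homogeneous f."""
    tx = [t * xj for xj in x]
    return f(tx) - (f(x) - theta * math.log(t))
```

With `theta` set to the complexity value ($n$ for the orthant, 2 for the second-order cone), the gap vanishes up to rounding for any `t > 0`.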
The following theorem provides a characterization of logarithmic homogeneity as
well as other properties useful in analysis.

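Theorem 2.3.9. A self-concordant functional $f : K^\circ \to \mathbb{R}$ is a logarithmically homogeneous
barrier functional iff for all $x \in K^\circ$ and $t > 0$
\[ g(tx) = \frac{1}{t}\, g(x). \tag{2.8} \]
Moreover, a logarithmically homogeneous barrier functional satisfies the identities
\[ H(tx) = \frac{1}{t^2}\, H(x), \qquad g_x(x) = -x, \qquad \|g_x(x)\|_x = \sqrt{\vartheta_f}. \]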
Proof. Assume $f$ is a logarithmically homogeneous barrier functional. To prove (2.8),
simply differentiate both sides of (2.7) w.r.t. $x$. Similarly, differentiating both sides of (2.8)
w.r.t. $x$ gives $H(tx) = \frac{1}{t^2} H(x)$.
Now assume $f$ is a self-concordant functional satisfying (2.8) for all $x \in K^\circ$ and
$t > 0$. Differentiating both sides of (2.8) w.r.t. $t$ gives
\[ H(tx)\, x = -\frac{1}{t^2}\, g(x). \]
For $t = 1$ this is the same as $g_x(x) = -x$.


Since $g_x(x) = -x$, we have

The gradient of the rightmost quantity in (2.9) as a functional of x is



Hence, the three quantities in (2.9) are independent of x. In particular, the middle quantity
is bounded above as a functional of x, and thus / is not only self-concordant, it is also a
barrier functional. Moreover, the independence from x of the three quantities implies each
of them is equal to $\vartheta_f$.
Finally,

showing / is logarithmically homogeneous.
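The displayed steps in the second half of this proof can be sketched as follows. This is a reconstruction based on the identities $g_x(x) = -x$ and $H(x)x = -g(x)$ established above; the order of the three quantities in (2.9) is an assumption.

```latex
\begin{align*}
\langle x, H(x) x \rangle &= \|g_x(x)\|_x^2 = -\langle g(x), x \rangle, \\
\nabla_x \bigl( -\langle g(x), x \rangle \bigr) &= -H(x)x - g(x) = 0, \\
f(tx) - f(x) &= \int_1^t \langle g(sx), x \rangle\, ds
              = \int_1^t \frac{1}{s} \langle g(x), x \rangle\, ds
              = -\vartheta_f \ln t .
\end{align*}
```

The second line shows the common value is constant on $K^\circ$, and the third uses (2.8) together with $-\langle g(x), x \rangle = \vartheta_f$.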


Using the "$g(tx) = \frac{1}{t} g(x)$" characterization of logarithmic homogeneity provided
by the theorem, it is simple to prove that a sum of logarithmically homogeneous barrier
functionals is itself a logarithmically homogeneous barrier functional.
In closing our discussion of logarithmic homogeneity, we recall that in §2.3.3 it was
established that for all barrier functionals with unbounded domains, $\vartheta_f \ge \tfrac{1}{16}$. We noted that
with greater effort, Nesterov and Nemirovskii proved $\vartheta_f \ge 1$. The stronger inequality is
easily established for logarithmically homogeneous barrier functionals. Indeed, since $f$ is
self-concordant and $0$ lies on the boundary of $K^\circ = D_f$, we must have $\|0 - x\|_x \ge 1$ for all
$x \in K^\circ$, that is, $\|g_x(x)\|_x \ge 1$ for all $x \in D_f$.

2.4 Primal Algorithms


2.4.1 Introduction
The importance of a barrier functional $f$ lies not in itself but in that it can be used to
efficiently solve optimization problems of the form
\[ \min\ \langle c, x \rangle \quad \text{s.t.} \quad x \in \bar{D}_f, \tag{2.10} \]
where $\bar{D}_f$ denotes the closure of $D_f$. Among many other problems, linear programs are
of this form. Specifically, restricting the logarithmic barrier function for the nonnegative
orthant to the affine space $\{x : Ax = b\}$, we obtain a barrier functional $f$ for which
\[ \bar{D}_f = \{x : Ax = b,\ x \ge 0\}. \]
Similarly for SDP.


Let $\mathrm{val}$ denote the optimal value of the optimization problem (2.10).
Path-following ipm's solve (2.10) by following the central path (a.k.a. "the path of
analytic centers" and "the central trajectory"), the path consisting of the minimizers $z(\eta)$ of
the self-concordant functionals
\[ f_\eta(x) := \eta \langle c, x \rangle + f(x) \]
for $\eta > 0$. It is readily proven when $D_f$ is bounded that the central path begins at the analytic
center $z$ of $f$ and consists of the minimizers of the barrier functionals $f|_{L(v)}$ obtained by
restricting $f$ to the affine spaces
\[ L(v) := \{x : \langle c, x \rangle = v\} \]
for $v$ satisfying $\mathrm{val} < v < \langle c, z \rangle$. Similarly, when $D_f$ is unbounded, the central path
consists of the minimizers of the barrier functionals $f|_{L(v)}$ for $v$ satisfying $\mathrm{val} < v$.
In the literature, it is standard to define the central path, and to do ipm analysis, with
the functionals
\[ x \mapsto \langle c, x \rangle + \mu f(x), \tag{2.11} \]
where $\mu > 0$, rather than with the functionals $f_\eta$. The difference is only cosmetic. The
minimizer $z(\eta)$ of $f_\eta$ is the minimizer of the functional (2.11) for $\mu = 1/\eta$. Similarly,
Newton's method applied to minimizing $f_\eta$ produces exactly the same sequence of points
as Newton's method applied to minimizing the functional (2.11) for $\mu = 1/\eta$. Whether one
relies on the functionals $f_\eta$ or the functionals (2.11), one arrives at exactly the same ipm
results, the only difference being insignificant changes in minor algebraic steps along the
way. The reason we use the functionals $f_\eta$ rather than the customary functionals (2.11) is
that the local inner products for the functionals $f_\eta$ are identical to those for $f$, whereas the
local inner products for the functionals (2.11) depend on $\mu$. Since a goal of this book is to
elucidate the geometry underlying ipm's, it is more natural for us to rely on the functionals
$f_\eta$.
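The claim that the two families of functionals yield identical Newton iterates can be checked in one line. The sketch below assumes the customary functional has the form $\langle c, x \rangle + \mu f(x)$ and writes $g$, $H$ for the gradient and Hessian of $f$ in a fixed reference inner product.

```latex
\[
f_\eta(x) = \eta \langle c, x \rangle + f(x)
          = \eta \bigl( \langle c, x \rangle + \mu f(x) \bigr),
\qquad \mu = \frac{1}{\eta},
\]
\[
-\bigl( \eta\, \mu H(x) \bigr)^{-1} \eta \bigl( c + \mu g(x) \bigr)
  = -\bigl( \mu H(x) \bigr)^{-1} \bigl( c + \mu g(x) \bigr).
\]
```

The left side of the second display is the Newton step for $f_\eta$ and the right side is the Newton step for the customary functional: scaling a functional by the positive constant $\eta$ scales its gradient and Hessian alike, so the step, and likewise the minimizer, is unchanged.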
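We observe that for each $y \in D_f$, the optimization problem (2.10) is equivalent to
\[ \min\ \langle c_y, x \rangle_y \quad \text{s.t.} \quad x \in \bar{D}_f, \]
where $c_y := H(y)^{-1} c$. (In other words, the objective vector is $c_y$ w.r.t. $\langle\ ,\ \rangle_y$.)
The desirability of following the central path is made evident by considering the
objective values $\langle c, z(\eta) \rangle$. Since $g(z(\eta)) = -\eta c$, Theorem 2.3.3 implies for all $y \in D_f$,
\[ \langle c, z(\eta) \rangle - \langle c, y \rangle = \frac{1}{\eta} \langle g(z(\eta)), y - z(\eta) \rangle < \frac{\vartheta_f}{\eta}, \]
and hence
\[ \langle c, z(\eta) \rangle \le \mathrm{val} + \frac{\vartheta_f}{\eta}. \tag{2.12} \]
Moreover, the point $z(\eta)$ is well centered in the sense that all feasible points $y$ with objective
value at most $\langle c, z(\eta) \rangle$ satisfy $y \in B_{z(\eta)}(z(\eta), 4\vartheta_f + 1)$, a consequence of Theorem 2.3.4
and $g(z(\eta)) = -\eta c$.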
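Path-following ipm's follow the central path approximately, generating points near
the central path where "near" is measured by local norms. If a point $y$ is computed for which
$\|y - z(\eta)\|_{z(\eta)}$ is small, then, relatively, the objective value at $y$ will not be much worse than
at $z(\eta)$, and hence (2.12) implies a bound on $\langle c, y \rangle$. In fact, if $x$ is an arbitrary point in $D_f$
and $y$ is a point for which $\|y - x\|_x$ is small, then, relatively, the objective value at $y$ will
not be much worse than at $x$. To make this precise, first observe $B_x(x, 1) \subset D_f$ implies
$x - t c_x \in D_f$ if $0 \le t < 1/\|c_x\|_x$. Since the objective value at $x - t c_x$ is $\langle c, x \rangle - t \|c_x\|_x^2$,
we thus have
\[ \|c_x\|_x \le \langle c, x \rangle - \mathrm{val}. \tag{2.13} \]
Hence for all $y \in \mathbb{R}^n$,
\[ \langle c, y \rangle - \mathrm{val} \le \bigl( \langle c, x \rangle - \mathrm{val} \bigr) \bigl( 1 + \|y - x\|_x \bigr). \]
In particular, using (2.12),
\[ \langle c, y \rangle \le \mathrm{val} + \frac{\vartheta_f}{\eta} \bigl( 1 + \|y - z(\eta)\|_{z(\eta)} \bigr). \tag{2.14} \]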
Before discussing algorithms, we record a piece of notation: Let $n_\eta(x)$ denote the
Newton step for $f_\eta$ at $x$, that is,
\[ n_\eta(x) := -\eta c_x - g_x(x). \]

2.4.2 The Barrier Method


"Short-step" ipm's follow the central path most closely, generating sequences of points
which are all near the path. We now present and analyze an elementary short-step ipm, the
"barrier method."
Assume, initially, we know $\eta_1 > 0$ and $x_1$ such that $x_1$ is "near" $z(\eta_1)$, that is, $x_1$ is
near the minimizer for the functional $f_{\eta_1}$. In the algorithm, we increase $\eta_1$ to a value $\eta_2$ and
then apply Newton's method to approximate $z(\eta_2)$, thus obtaining a point $x_2$. Assuming
only one iteration of Newton's method is applied,
\[ x_2 := x_1 + n_{\eta_2}(x_1). \]
Continuing this procedure indefinitely (i.e., increasing $\eta$, applying Newton's method,
increasing $\eta$, etc.), we have the barrier method.
One would like $\eta_2$ to be much larger than $\eta_1$. However, if $\eta_2$ is "too" large relative
to $\eta_1$, Newton's method can fail to approximate $z(\eta_2)$; in fact, it can happen that $x_2 \notin D_f$,
bringing the algorithm to a halt. The main goal in analyzing the barrier method is to prove
that $\eta_2$ can be larger than $\eta_1$ by a reasonable amount without the algorithm losing sight of
the central path.
In analyzing the barrier method, it is most natural to rely on the length of Newton
steps to measure proximity to the central path. We will assume $x_1$ is near $z(\eta_1)$ in the sense
that $\|n_{\eta_1}(x_1)\|_{x_1}$ is small. Keep in mind that the Newton step taken by the algorithm is
$n_{\eta_2}(x_1)$, not $n_{\eta_1}(x_1)$. The relevance of $n_{\eta_1}(x_1)$ for $n_{\eta_2}(x_1)$ is due to the following easily
proven relation:
\[ n_{\eta_2}(x) = \frac{\eta_2}{\eta_1}\, n_{\eta_1}(x) + \Bigl( \frac{\eta_2}{\eta_1} - 1 \Bigr) g_x(x). \]
In particular,
\[ \|n_{\eta_2}(x)\|_x \le \frac{\eta_2}{\eta_1}\, \|n_{\eta_1}(x)\|_x + \Bigl| \frac{\eta_2}{\eta_1} - 1 \Bigr| \sqrt{\vartheta_f}. \tag{2.15} \]
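Besides the bound (2.15), a crucial ingredient in the analysis is a bound on
$\|n_{\eta_2}(x_2)\|_{x_2}$ in terms of $\|n_{\eta_2}(x_1)\|_{x_1}$. Theorem 2.2.4 provides an appropriate bound: if
$\|n_{\eta_2}(x_1)\|_{x_1} < 1$, then
\[ \|n_{\eta_2}(x_2)\|_{x_2} \le \Bigl( \frac{\|n_{\eta_2}(x_1)\|_{x_1}}{1 - \|n_{\eta_2}(x_1)\|_{x_1}} \Bigr)^2. \tag{2.16} \]
Suppose we determine values $\alpha > 0$ and $\beta > 1$ such that if we define
\[ \gamma := \alpha\beta + (\beta - 1) \sqrt{\vartheta_f}, \]
then $\gamma < 1$ and
\[ \Bigl( \frac{\gamma}{1 - \gamma} \Bigr)^2 \le \alpha. \]
By requiring $\|n_{\eta_1}(x_1)\|_{x_1} \le \alpha$ and $1 \le \eta_2/\eta_1 \le \beta$, we then find from (2.15) that
\[ \|n_{\eta_2}(x_1)\|_{x_1} \le \gamma \]
and thus, from (2.16),
\[ \|n_{\eta_2}(x_2)\|_{x_2} \le \alpha. \]
Consequently, $x_2$ will be close to the central path like $x_1$. Continuing, by requiring
$1 \le \eta_3/\eta_2 \le \beta$, $x_3$ will be close to the central path, too, and so on. Hence, we will have determined
a value $\beta$ such that if one has an initial point appropriately close to the central path, and if
one never increases the barrier parameter from $\eta$ to more than $\beta\eta$, the barrier method will
follow the central path, always generating points close to it.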
The reader can verify, for example, that
\[ \alpha = \tfrac19 \quad \text{and} \quad \beta = 1 + \frac{1}{8\sqrt{\vartheta_f}} \]
satisfy the relations. Now we have a "safe" value for $\beta$. Relying on it, the algorithm is
guaranteed to stay on track. It is a remarkable aspect of ipm's that safe values for quantities
like $\beta$ depend only on the complexity value $\vartheta_f$ of the underlying barrier functional $f$.
Concerning LPs,

if one relies on the logarithmic barrier function for the strictly nonnegative orthant $\mathbb{R}^n_{++}$,
then $\beta = 1 + \frac{1}{8\sqrt{n}}$ is safe regardless of $A$, $b$, and $c$.
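The short-step scheme just described can be sketched in a few lines. The example below is illustrative only and makes several assumptions not in the text: the feasible region is the open box $(0,1)^n$ with barrier $f(x) = -\sum_j \ln x_j - \sum_j \ln(1 - x_j)$, the dot product is the reference inner product (so the Hessian is diagonal), the upper bound $2n$ on the complexity value is used, and the proximity level $\tfrac19$ with growth factor $1 + 1/(8\sqrt{\vartheta})$ is taken as the safe pair. All function names are hypothetical.

```python
import math


def newton_step(c, x, eta):
    """Newton step for f_eta(x) = eta*<c,x> + f(x), f the log barrier of the box (0,1)^n."""
    step = []
    for cj, xj in zip(c, x):
        g = -1.0 / xj + 1.0 / (1.0 - xj)           # j-th gradient entry of f
        h = 1.0 / xj ** 2 + 1.0 / (1.0 - xj) ** 2  # j-th diagonal Hessian entry of f
        step.append(-(eta * cj + g) / h)
    return step


def local_norm(x, v):
    """Local norm ||v||_x = sqrt(v^T H(x) v) for the box barrier."""
    return math.sqrt(sum((1.0 / xj ** 2 + 1.0 / (1.0 - xj) ** 2) * vj ** 2
                         for xj, vj in zip(x, v)))


def barrier_method(c, eta_target):
    """Follow the central path from the analytic center until eta >= eta_target."""
    n = len(c)
    theta = 2.0 * n                                # assumed upper bound on the complexity value
    beta = 1.0 + 1.0 / (8.0 * math.sqrt(theta))    # safe growth factor for eta
    x = [0.5] * n                                  # analytic center of the box
    # At the center g = 0 and H = 8I, so ||n_eta(x)||_x = eta*||c||/sqrt(8); start at 1/9.
    eta = math.sqrt(8.0) / (9.0 * math.sqrt(sum(cj * cj for cj in c)))
    worst = local_norm(x, newton_step(c, x, eta))
    iterations = 0
    while eta < eta_target:
        eta *= beta                                            # increase the parameter
        x = [xj + sj for xj, sj in zip(x, newton_step(c, x, eta))]  # one full Newton step
        worst = max(worst, local_norm(x, newton_step(c, x, eta)))
        iterations += 1
    return x, eta, worst, iterations
```

For example, `barrier_method([1.0, -2.0, 0.5], 1e3)` keeps every iterate inside the box, and the returned `worst` records the largest Newton-step length observed, which the analysis predicts never exceeds $\tfrac19$.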
Assuming at each iteration of the barrier method the parameter $\eta$ is never increased
by more than the factor $1 + 1/(8\sqrt{\vartheta_f})$, we now know that for each $x$ generated by the algorithm, there
corresponds $z(\eta)$ which $x$ approximates in that $\|n_\eta(x)\|_x \le \tfrac19$; hence, by Theorem 2.2.5,
$\|x - z(\eta)\|_x \le \tfrac16$; thus, by the definition of self-concordance, $\|x - z(\eta)\|_{z(\eta)} \le \tfrac15$. All points
generated by the algorithm lie within distance $\tfrac15$ of the central path.
Assuming that at each iteration of the barrier method, the parameter $\eta$ is increased by
exactly the factor $1 + 1/(8\sqrt{\vartheta_f})$, the number of iterations required to increase the parameter
from an initial value $\eta_1$ to some value $\eta > \eta_1$ is

where in the inequality we rely on $\vartheta_f \ge 1$ (as discussed in §2.3.3). Hence, from (2.14),
given $\epsilon > 0$,

iterations suffice to produce $x$ satisfying $\langle c, x \rangle \le \mathrm{val} + \epsilon$.


We have been assuming that an initial point x\ near the central path is available. What
if, instead, we know only some arbitrary point $x' \in D_f$? How might we use the barrier
method to solve the optimization problem efficiently? We now describe a simple approach,
assuming $D_f$ is bounded and hence $f$ has an analytic center.
Consider the optimization problem obtained by replacing the objective vector $c$ with
$-g(x')$. The central path then consists of the minimizers $z'(\nu)$ of the self-concordant
functionals
\[ f'_\nu(x) := -\nu \langle g(x'), x \rangle + f(x). \]
The point $x'$ is on the central path for this optimization problem. In fact, $x' = z'(\nu)$ for
$\nu = 1$.
Let $n'_\nu(x)$ denote the Newton step for $f'_\nu$ at $x$.
Rather than increasing the parameter v, we decrease it toward zero, following the
central path to the analytic center z of /. From there, we switch to following the central
path $\{z(\eta) : \eta > 0\}$ as before.
We showed $\eta$ can safely be increased by a factor of $1 + 1/(8\sqrt{\vartheta_f})$. Brief consideration
of the analysis shows it is also safe to decrease $\eta$ by a factor $1 - 1/(8\sqrt{\vartheta_f})$ and, hence, safe to
decrease $\nu$ by that factor. Thus, to complete our understanding of the difficulty of following
the path $\{z'(\nu) : \nu > 0\}$, and then the path $\{z(\eta) : \eta > 0\}$, it remains only to understand the
process of switching paths.
One way to know when it is safe to switch paths is to compute the length of the
gradients for $f$ at the points $x$ generated in following the path $\{z'(\nu) : \nu > 0\}$. Once one
encounters a point $x$ for which, say, $\|g_x(x)\|_x \le \tfrac16$, one can safely switch paths. For then,
by choosing $\eta_1 = 1/(12 \|c_x\|_x)$, we find the Newton step for $f_{\eta_1}$ at $x$ satisfies
\[ \|n_{\eta_1}(x)\|_x = \|\eta_1 c_x + g_x(x)\|_x \le \tfrac{1}{12} + \tfrac16 = \tfrac14 \]
and hence, by Theorem 2.2.4, the Newton step takes us from $x$ to a point $x_1$ for which
$\|n_{\eta_1}(x_1)\|_{x_1} \le \tfrac19$, putting us precisely in the setting of the earlier analysis (where $\alpha = \tfrac19$ was
determined safe).
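How much will $\nu$ have to be decreased from the initial value $\nu = 1$ before we compute
a point $x$ for which $\|g_x(x)\|_x \le \tfrac16$ so that paths can be switched? An answer is found in
the relations
\[ \|g_x(x)\|_x = \|\nu g_x(x') - n'_\nu(x)\|_x \le \|n'_\nu(x)\|_x + \nu \|g_x(x')\|_x
   \le \|n'_\nu(x)\|_x + \nu \Bigl( 1 + \frac{1}{\operatorname{sym}(x', D_f)} \Bigr) \vartheta_f, \]
the last inequality by Proposition 2.3.7. In particular, with $\|n'_\nu(x)\|_x \le \tfrac19$, $\nu$ need only
satisfy
\[ \nu \le \frac{1}{18\, \vartheta_f} \Bigl( 1 + \frac{1}{\operatorname{sym}(x', D_f)} \Bigr)^{-1} \tag{2.19} \]
in order for $\|g_x(x)\|_x \le \tfrac16$.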
The requirement on $\nu$ specified by (2.19) gives geometric interpretation to the efficiency
of the algorithm in following the path $\{z'(\nu) : \nu > 0\}$, beginning with the initial
value $\nu = 1$. If the domain $D_f$ is nearly symmetric about the initial point $x'$, not much time
will be required to follow the path to a point where we can switch to following the path
$\{z(\eta) : \eta > 0\}$.
We stipulated that the algorithm switch paths when it encounters $x$ satisfying $\|g_x(x)\|_x
\le \tfrac16$, and we stipulated that one choose the initial value $\eta_1 := 1/(12 \|c_x\|_x)$. Letting

note that (2.13) implies

and hence

We have now essentially proven the following theorem.

Theorem 2.4.1. Assume $f \in \mathcal{SCB}$ and $D_f$ is bounded. Assume $x' \in D_f$, a point at which
to initiate the barrier method. If $0 < \epsilon < 1$, then within

iterations of the algorithm, all points x computed thereafter satisfy

Consider the following modification to the algorithm. Choose $V' > \langle c, x' \rangle$. Rather
than relying on $f$, rely on the barrier functional
\[ x \mapsto f(x) - \ln\bigl( V' - \langle c, x \rangle \bigr), \]
a functional whose domain is
\[ D_f \cap \{x : \langle c, x \rangle < V'\} \tag{2.20} \]
and whose complexity value does not exceed $\vartheta_f + 1$. In the theorem, the quantity $V - \mathrm{val}$
is then replaced by the potentially much smaller quantity $V' - \mathrm{val}$. Of course the quantity
$\operatorname{sym}(x', D_f)$ must then be replaced by the symmetry of the set (2.20) about $x'$.
Finally, we highlight an implicit assumption underlying our analysis, namely, the
complexity value $\vartheta_f$ is known. The value is used to safely increase the parameter $\eta$. What
is actually required is an upper bound $\vartheta \ge \vartheta_f$. If one relies on an upper bound $\vartheta$ rather
than the precise complexity value $\vartheta_f$, then $\vartheta_f$ in the theorem must be replaced by $\vartheta$.
Except for $\vartheta_f$, none of the quantities appearing in the theorem are assumed to be
known or approximated. The quantities appear naturally in the analysis of the algorithm,
but the algorithm itself does not rely on the quantities.
No ipm's have proven complexity bounds which are better than (2.18) even in the
restricted setting of LP. In the case of LP where $\vartheta_f = n$, a bound like (2.18) was first
established in [18] for an algorithm other than the barrier method; it was established by
Gonzaga [6] for the barrier method. By contrast, in the complexity analysis that started the
waves of ipm papers, Karmarkar [10] proved for LP a bound like (2.18) in which the factor
$\sqrt{n}$ ($= \sqrt{\vartheta_f}$) is replaced by $n$.
Although no ipm's have proven complexity bounds which are better than (2.18), the
barrier method is not considered to be practically efficient relative to some other ipm's,
especially relative to primal-dual methods (discussed in sections 3.7 and 3.8). The barrier
method is an excellent algorithm with which to begin one's understanding of ipm's, and it is
often the perfect choice for concise complexity theory proofs, but it is not one of the ipm's
that appear in widely used software.

2.4.3 The Long-Step Barrier Method


One of the barrier method's shortcomings is obvious, being implicit in the terminology
"short-step algorithm." Although it is always safe to increase $\eta$ by a factor $1 + 1/(8\sqrt{\vartheta_f})$
with each iteration, that increase is small if $\vartheta_f$ is large. No doubt, for many instances, a
much larger increase is safe.
There is a trivial manner in which to modify the barrier method in hopes of having
a more practical algorithm. Rather than increase $\eta$ by the safe amount, increase it by
much more, apply (perhaps several iterations of) Newton's method, and check (say, using
Theorem 2.2.5) if the computed point is near the desired minimizer. If not, increase $\eta$ by a
smaller amount and try again.
A more interesting and more practical modification of the barrier method is known
as the "long-step barrier method." In this version, one increases $\eta$ by an arbitrarily large
amount but does not take Newton steps. Instead, the Newton steps are used as directions
for "exact line searches," as we now describe.
Assume as before that we have an initial value $\eta_1 > 0$ and a point $x_1$ approximating
$z(\eta_1)$. Choose $\eta_2$ larger than $\eta_1$, perhaps significantly larger. In search of a point $x_2$ which
approximates $z(\eta_2)$, the algorithm will generate a finite sequence of points

and then let $x_2 := y_K$. At each point $y_k$, the algorithm will determine if the point is close to
$z(\eta_2)$ by, say, checking whether $\|n_{\eta_2}(y_k)\|_{y_k} \le \tfrac14$. (We choose the specific value $\tfrac14$ because
it is the largest value for which Theorem 2.2.5 applies.) The point $y_K$ will be the first point
that is determined to satisfy this inequality.
To compute $y_{k+1}$ from $y_k$, the algorithm minimizes the univariate functional
\[ t \mapsto f_{\eta_2}\bigl( y_k + t\, n_{\eta_2}(y_k) \bigr). \tag{2.21} \]
This is the place in the algorithm to which the phrase "exact line search" alludes. "Line"
refers to the functional being univariate. "Exact" refers to an assumption that the exact
minimizer is computed, certainly an exaggeration, but an assumption useful for keeping the
analysis succinct. Letting $t_{k+1}$ denote the exact minimizer, define
\[ y_{k+1} := y_k + t_{k+1}\, n_{\eta_2}(y_k), \]
thus ending our description of the long-step barrier method.
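One update of the long-step method can be sketched as follows. As in any implementation, the "exact" line search is only approximated; here bisection on the derivative of the univariate functional is used, which is a choice made for this illustration rather than something the text prescribes. The sketch again assumes the box $(0,1)^n$ with barrier $-\sum_j \ln x_j - \sum_j \ln(1 - x_j)$ and the dot product as reference inner product; all names are hypothetical.

```python
import math


def grad_hess(x):
    """Gradient and diagonal Hessian of f(x) = -sum ln x_j - sum ln(1 - x_j)."""
    g = [-1.0 / xj + 1.0 / (1.0 - xj) for xj in x]
    h = [1.0 / xj ** 2 + 1.0 / (1.0 - xj) ** 2 for xj in x]
    return g, h


def newton_step(c, x, eta):
    """Return the Newton step n_eta(x) and its local length ||n_eta(x)||_x."""
    g, h = grad_hess(x)
    step = [-(eta * cj + gj) / hj for cj, gj, hj in zip(c, g, h)]
    length = math.sqrt(sum(hj * sj * sj for hj, sj in zip(h, step)))
    return step, length


def line_search(c, x, eta, d, iters=100):
    """Approximate the minimizer of t -> f_eta(x + t d) by bisection on its derivative."""
    # Largest t keeping x + t d inside the open box (finite because d is nonzero).
    t_max = min((1.0 - xj) / dj if dj > 0 else xj / -dj
                for xj, dj in zip(x, d) if dj != 0.0)
    lo, hi = 0.0, t_max * (1.0 - 1e-9)
    for _ in range(iters):
        t = 0.5 * (lo + hi)
        g, _ = grad_hess([xj + t * dj for xj, dj in zip(x, d)])
        slope = sum((eta * cj + gj) * dj for cj, gj, dj in zip(c, g, d))
        if slope < 0.0:
            lo = t   # still descending: minimizer lies to the right
        else:
            hi = t   # ascending: minimizer lies to the left
    return lo


def long_step_update(c, x, eta2, tol=0.25):
    """Line-search along Newton directions for f_eta2 until ||n_eta2(y)||_y <= tol."""
    searches = 0
    step, length = newton_step(c, x, eta2)
    while length > tol:
        t = line_search(c, x, eta2, step)
        x = [xj + t * sj for xj, sj in zip(x, step)]
        step, length = newton_step(c, x, eta2)
        searches += 1
    return x, searches, length
```

Calling `long_step_update(c, x1, eta2)` with `eta2` far larger than the parameter for which `x1` was computed returns the new point together with the number of line searches used, which is the quantity $K$ bounded in the analysis that follows.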


The short-step barrier method is confined to making slow but sure progress. The
long-step method is more adventurous, having the potential for much quicker progress.
Clearly, the complexity analysis of the long-step barrier method revolves around
determining an upper bound on $K$ in terms of the ratio $\eta_2/\eta_1$. We now undertake the task
of determining such a bound.
We begin by determining an upper bound on the difference

$$ \rho := f_{\eta_2}(x_1) - f_{\eta_2}(z(\eta_2)). $$

Then we show that $f_{\eta_2}(y_k) - f_{\eta_2}(y_{k+1})$ is bounded below by a positive amount $\tau$ independent
of $k$; that is, each exact line search decreases the value of $f_{\eta_2}$ by at least a certain amount.
Consequently $K \le \rho/\tau$. Proofs like this, showing that a certain functional decreases by at
least a fixed amount with each iteration, are common in the ipm literature.
In proving an upper bound on the difference $\rho$, we make use of the fact that for any
convex functional $f$ and $x, y \in D_f$, one has

The upper bound on $\rho$ is obtained by adding upper bounds for

Assuming x^ is close to 2(771) in the sense that ||w^,(JCi)|U, < |, we see that Theo-
rem 2.2.5 implies \\xi - z(rji)\\Xl < f . Thus, applying (2.22) to the functional fm,

Similarly, for all $y \in D_f$,

the final equality because $z(\eta_1)$ minimizes $f_{\eta_1}$ and hence $n_{\eta_1}(z(\eta_1)) = 0$. Thus,

the last inequality by Theorem 2.3.3.


Now we show $f_{\eta_2}(y_k) - f_{\eta_2}(y_{k+1})$ is bounded below by a positive amount $\tau$ independent
of $k$.
If the algorithm proceeds from $y_k$ to $y_{k+1}$, it is because $y_k$ happens not to be appropriately
close to $z(\eta_2)$, i.e., it happens that $\|n_{\eta_2}(y_k)\|_{y_k} > \frac14$. Letting

and $y := y_k + t\, n_{\eta_2}(y_k)$, Theorem 2.2.2 then implies

Since $t_{k+1}$ minimizes the functional (2.21), we thus have

Finally,

It follows that if one fixes a constant $\kappa > 1$ and always chooses successive values
$\eta_i$, $\eta_{i+1}$ to satisfy $\eta_{i+1} = \kappa\eta_i$, the number of points generated by the long-step barrier
method (i.e., the number of exact line searches) in increasing the parameter from an initial
value $\eta_1$ to some value $\eta > \eta_1$ is

In the case of linear programming where $\vartheta_f = n$, such a bound was first established by
Gonzaga [7] (see also den Hertog, Roos, and Vial [3]).
Fixing $\kappa$ (say, $\kappa = 100$), we obtain the bound

This bound is worse than the analogous bound (2.17) for the short-step method by a factor
$\sqrt{\vartheta_f}$. It is one of the ironies of the ipm literature that algorithms which are more efficient
in practice often have somewhat worse complexity bounds.

2.4.4 A Predictor-Corrector Method


The Newton step $n_\eta(x) := -\eta c_x - g_x(x)$ for the barrier method can be viewed as the sum of
two steps, one of which predicts the tangential direction of the central path and the other of
which corrects for the discrepancy between the tangential direction and the actual position
of the (curving) path.
The corrector step is the Newton step at $x$ for the barrier functional $f|_{L(v)}$, where
$v = \langle c, x \rangle$ and

$$ L(v) := \{ y : \langle c, y \rangle = v \}. $$

Thus, the corrector step is $n|_{L(v)}(x)$, this being the orthogonal projection of the Newton step
$n(x)$ for $f$ onto the subspace $L(0) = \{ y : \langle c, y \rangle = 0 \}$, orthogonal w.r.t. $\langle\ ,\ \rangle_x$. In the
literature, the corrector step is often referred to as the "centering direction." It aims to move
from $x$ toward the point on the central path having the same objective value as $x$.
Since the multiples of $c_x$ ($= H(x)^{-1}c$) form the orthogonal complement of $L(0)$, the
difference $n(x) - n|_{L(v)}(x)$ is a multiple of $c_x$, and hence so is the predictor step

The vector $-c_x$ predicts the tangential direction of the central path near $x$. If $x$ is on
the central path, the vector $-c_x$ is exactly tangential to the path, pointing in the direction
of decreasing objective values. (Indeed, observe that by differentiating both sides of the
identity $\eta c + g(z(\eta)) = 0$ w.r.t. $\eta$, we have $c + H(z(\eta))z'(\eta) = 0$, i.e., $z'(\eta) = -c_{z(\eta)}$.)
In the literature, $-c_x$ is often referred to as the "affine-scaling direction." With regard to
$\langle\ ,\ \rangle_x$, it is the direction in which one would move to decrease the objective value most
quickly.
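The splitting of the Newton step into a centering part and a multiple of $c_x$ can be checked numerically. The sketch below is an illustration only: it assumes the barrier $f(x) = -\sum_i \ln x_i$ on the positive orthant and the dot product as reference inner product, and the variable names are hypothetical. The projection formula used is the standard one for projecting onto $\{y : \langle c, y\rangle = 0\}$ orthogonally w.r.t. $\langle u, v\rangle_x = u^T H(x) v$.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
x = rng.uniform(0.5, 2.0, dim)   # a point in the positive orthant
c = rng.normal(size=dim)
eta = 3.0

g = -1.0 / x                     # gradient of the assumed barrier f(x) = -sum(ln x_i)
H_inv = np.diag(x**2)            # inverse of the Hessian diag(1 / x_i^2)

c_x = H_inv @ c                  # c_x = H(x)^{-1} c
n = -H_inv @ g                   # Newton step n(x) for f
n_eta = -eta * c_x + n           # Newton step -eta*c_x - g_x(x) for eta*<c, x> + f(x)

# Projection of n(x) onto L(0) = {y : <c, y> = 0}, orthogonal w.r.t. <u, v>_x.
corrector = n - (c @ n) / (c @ c_x) * c_x
predictor = n_eta - corrector    # what remains should be a multiple of c_x
coefficient = (c @ n) / (c @ c_x) - eta
```

The corrector leaves the objective value unchanged, while the remainder is parallel to $c_x$, in line with the decomposition described above.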
Whereas the barrier method combines a predictor step and a corrector step in one
step, predictor-corrector methods separate the two types of steps. After a predictor step,
several corrector steps might be applied. In practice, predictor-corrector methods tend to be
substantially more efficient than the barrier method, but the (worst-case) complexity bounds
that have been proven for them are worse.
Perhaps the most natural predictor-corrector method is based on moving in the pre-
dictor direction a fixed fraction of the distance toward the boundary and then recentering
via exact line searches. We now formalize and analyze such an algorithm.
Fix $\sigma$ satisfying $0 < \sigma < 1$. Assume $x_1$ is near the central path. Let $v_1 := \langle c, x_1 \rangle$.
The algorithm first computes

$$ s_1 := \sup\{ s : x_1 - s\, c_{x_1} \in D_f \}. $$

Let $y_1 := x_1 - \sigma s_1 c_{x_1}$. Thus, $-\sigma s_1 c_{x_1}$ is the predictor step. Let

$$ v_2 := \langle c, y_1 \rangle. $$

Beginning with $y_1$, the algorithm takes corrector steps, moving toward the point $z_2$ on
the central path with objective value $v_2$ by using the Newton steps for the functional $f|_{L(v_2)}$
as directions in performing exact line searches. Precisely, given $y_k$, the algorithm computes
the minimizer $t_{k+1}$ for the univariate functional

$$ t \mapsto f\big(y_k + t\, n|_{L(v_2)}(y_k)\big). $$

Let

$$ y_{k+1} := y_k + t_{k+1}\, n|_{L(v_2)}(y_k). $$

When the first point $y_K$ is encountered for which $\|n|_{L(v_2)}(y_K)\|_{y_K}$ is appropriately small, the
algorithm lets $x_2 := y_K$ and takes a predictor step from $x_2$, relying on the same value $\sigma$ as
in the predictor step from $x_1$. The predictor step is followed by corrector steps, and so on.
Typically, $\sigma$ is chosen near 1, say, $\sigma = .99$. In the following analysis, we assume
a>J.
In analyzing the predictor-corrector method, we determine an upper bound on the
number $K$ of exact line searches made in moving from $x_1$ to $x_2$, and we determine a lower
bound on progress made in decreasing the objective value by moving from $x_1$ to $x_2$.
Unlike the previous algorithms, the predictor-corrector method does not rely on the
parameter $\eta$. However, it is useful to rely on $\eta$ in analyzing the method so as to make use of
the previous analysis. In this regard, let $\eta_1$ be the (positive) value satisfying $\langle c, z(\eta_1) \rangle = v_1$,
i.e., $z(\eta_1)$ is the point on the central path whose objective value is the same as the objective
value for $x_1$.
For the analysis of the predictor-corrector method, we consider $\|n|_{L(v)}(x)\|_x \le \frac{1}{13}$ to
be the criterion for claiming $x$ to be close to the point $z$ on the central path satisfying $\langle c, z \rangle =
v$ ($= \langle c, x \rangle$). The specific value $\frac{1}{13}$ is chosen so that we are in position to rely on the barrier
method analysis. Specifically, we claim that $\|n|_{L(v_1)}(x_1)\|_{x_1} \le \frac{1}{13}$ implies $\|n_{\eta_1}(x_1)\|_{x_1} \le \frac19$,
precisely the criterion we assumed $x_1$ to satisfy for the barrier method. To justify the claim,
let $z_1$ denote the minimizer of $f|_{L(v_1)}$; thus, $z_1 = z(\eta_1)$. If $\|n|_{L(v_1)}(x_1)\|_{x_1} \le \frac{1}{13}$, then
Theorem 2.2.5 applied to $f|_{L(v_1)}$ implies

Consequently, since $z_1 = z(\eta_1)$, applying Theorem 2.2.3 to $f_{\eta_1}$ yields

The barrier method moves from $x_1$ to $x_1 + n_{\eta_2}(x_1)$, where $\eta_2 = (1 + 1/(8\sqrt{\vartheta_f}))\eta_1$.


The length of the barrier method step is thus

In taking the step, the barrier method decreases the objective function value from $\langle c, x_1 \rangle$ to
$\langle c, x_1 + n_{\eta_2}(x_1) \rangle$; hence, the decrease is at most

Assuming a > | for the predictor-corrector method, the predictor step is


direction — cxi and has length at least ^, a consequence of Bx^ (x\, 1) c D/. Thus, v\ — V2>
\\\cX} \\X}. Hence, in moving from jci to X2, the progress made by the predictor-corrector
method in decreasing the objective value is at least as great as the progress made by the
barrier method in taking one step from $x_1$. (Clearly, the point $x_2$ computed by the predictor-corrector
method generally differs from the point $x_2$ computed by the barrier method.) Of
course in moving from $x_1$ to $x_2$, the predictor-corrector method might require several exact
line searches. We now bound the number $K$ of exact line searches.
Analogous to our analysis for the long-step barrier method, we obtain an upper bound
on $K$ by dividing an upper bound on $f(y_1) - f(z_2)$ by a lower bound on the differences
$f(y_k) - f(y_{k+1})$.
The lower bound on the differences $f(y_k) - f(y_{k+1})$ is proven exactly as was the lower
bound for the differences $f_{\eta_2}(y_k) - f_{\eta_2}(y_{k+1})$ in our analysis of the long-step barrier method.
Assuming $\|n|_{L(v_2)}(y_k)\|_{y_k} > \frac{1}{13}$ (as is the case if the algorithm proceeds to compute $y_{k+1}$),
one relies on Theorem 2.2.2, now applied to the functional $f|_{L(v_2)}$, to show $f(y_k) - f(y_{k+1})$
is bounded below independently of $k$.
To obtain an upper bound on $f(y_1) - f(z_2)$, one can use the identity

Theorem 2.3.8 and the definition of yi imply

Relation (2.22) applied to $f|_{L(v_1)}$, together with (2.23), gives

Finally, (2.22) applied to $f$ gives

In all,

Combined with the constant lower bound on the differences $f(y_k) - f(y_{k+1})$, we thus find
that the number $K$ of exact line searches performed in moving from $x_1$ to $x_2$ satisfies

Having shown the progress made by the predictor-corrector method in decreasing the
objective value is at least as great as the progress made by the barrier method in taking
one step from $x_1$, we obtain complexity bounds for the predictor-corrector method which
are greater than the bounds for the barrier method by a factor $K$, that is, by a factor $\vartheta_f$
(assuming $\sigma$ fixed, say, $\sigma = .99$). The bounds are greater than the bounds for the long-step
barrier method by a factor $\sqrt{\vartheta_f}$.

2.5 Matters of Definition


There are various equivalent ways of defining self-concordant functionals. Our definition
is geometric and simple to employ in theory, but it is not the original definition due to
Nesterov and Nemirovskii [15]. In this section we consider various equivalent definitions
of self-concordance, including the original definition. We close the section with a brief
discussion of the term "strongly nondegenerate," a term we have suppressed.
The proofs in this section are more technical than conceptual, yielding useful results,
but ones that are not central to understanding the core theory of ipm's. It is suggested that
on a first pass through the book, the reader absorb only the exposition and statements of
results, bypassing the formal proofs.
Unless otherwise stated, we assume only that $f \in C^2$, $D_f$ is open and convex, and
$H(x)$ is pd for all $x \in D_f$.
For ease of reference, we recall our definition of self-concordance.

A functional $f$ is said to be (strongly nondegenerate) self-concordant if for all
$x \in D_f$, we have $B_x(x, 1) \subseteq D_f$, and if whenever $y \in B_x(x, 1)$, we have

Recall SC denotes the family of functionals thus defined.
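As a sanity check of the definition, one can test the two-sided bound numerically on a familiar barrier. The sketch below is hedged on two assumptions: the inequality is taken to be the standard form $1 - \|y - x\|_x \le \|v\|_y / \|v\|_x \le 1/(1 - \|y - x\|_x)$ for $v \ne 0$ (the displayed formula did not survive in this copy), and the functional is $f(x) = -\sum_i \ln x_i$ on the positive orthant with the dot product as reference inner product. The helper names are hypothetical.

```python
import numpy as np

def local_norm(x, v):
    # ||v||_x for f(x) = -sum(ln x_i): the Hessian is diag(1 / x_i^2).
    return float(np.sqrt(np.sum((v / x) ** 2)))

def check_definition(trials=2000, dim=5, seed=0):
    """Return True if the assumed two-sided bound holds on random samples."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x = rng.uniform(0.1, 10.0, dim)
        d = rng.normal(size=dim)
        r = rng.uniform(0.0, 0.99)
        y = x + r * d / local_norm(x, d)      # so that ||y - x||_x = r < 1
        if not np.all(y > 0):                 # B_x(x, 1) should lie in the domain
            return False
        v = rng.normal(size=dim)
        ratio = local_norm(y, v) / local_norm(x, v)
        if not (1 - r - 1e-12 <= ratio <= 1 / (1 - r) + 1e-12):
            return False
    return True

ok = check_definition()
```

A passing check is of course no proof; it only illustrates what the definition asserts for one functional.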


An important property in establishing the equivalence of various definitions of self-
concordance is the "transitivity" of the condition

Specifically, if x, y, and z are colinear with y between x and z, if x and y satisfy (2.24), if
y and z satisfy the analogous condition

and if $\|z - x\|_x < 1$, then

To establish this transitivity, note (2.24) implies

Substituting into (2.25) gives for all $v \ne 0$,

the equality relying on the colinearity of x, y, and z. Note (2.26) is immediate from (2.24)
and (2.27), hence the transitivity.

We now show that in the definition of self-concordance, the lower bound on $\|v\|_y / \|v\|_x$
is redundant.

Theorem 2.5.1. Assume $f$ is such that for all $x \in D_f$ we have $B_x(x, 1) \subseteq D_f$ and is such
that whenever $y \in B_x(x, 1)$ we have

Then

Proof. Since

it suffices to show for all positive $\epsilon$ in some open neighborhood of 0, and for all $x, y \in D_f$,
that

Toward establishing the implication (2.29), note from (2.28) that whenever points x
and y satisfy ||y — Jc|U < -rr (with e small), we have

Letting $\lambda_{\min}$ denote the minimum eigenvalue of $H_x(y)$, from the identities $H_x(y) =
H(x)^{-1}H(y) = H_y(x)^{-1}$ we of course have

Since

applying (2.28) with x and y interchanged thus gives

Hence, by (2.30),

Summarizing, for $\epsilon$ in an open neighborhood of 0, and for all $x, y \in D_f$,

This gives the desired implication (2.29) except in that replaces


-^. The proof strategy now is to show that when one has the desired implication for all
x, y e Df except in that y is required to lie in a ball with smaller radius r than the claimed
radius (i.e., \\y — x\\x < r rather than \\y — x\\x < -^, where r is independent of x, y),
then the smaller radius r can be increased to r' := r + €(-^ — r). Likewise, r' can be
increased. In the limit, we arrive at the validity of desired implication (2.29) for the claimed
radius -^.
To implement this proof strategy, assume 0 < € < 1, -^ < r < -^, and assume
that for all $x, y \in D_f$,

Define $z := y + t(y - x)$, where $t$ is chosen so that

Since x and y are arbitrary, to prove that r can be increased to r' it suffices to show that
(2.31) gives

Using (2.28) with v = z — y and the definition of z, we see

Hence, by (2.31) applied to y and z (rather than to x and y),

Trivially, by (2.33) we have

Substituting this in (2.34) gives

the equality relying on the colinearity of $x$, $y$, and $z$. That is, for all $v \ne 0$,

If we assume

we thus have

and hence the relation (2.32). Since (2.35) follows from the bound

we have thus shown the implication (2.31) to indeed give the relation (2.32), thereby com-
pleting the proof.
The following theorem provides various equivalent definitions of self-concordant
functionals.

Theorem 2.5.2. The following conditions on a functional $f$ are equivalent:

(1a) For all $x, y \in D_f$, if $\|y - x\|_x < 1$, then

(1b) For each $x \in D_f$, and for all $y$ in some open neighborhood of $x$,

(1c) For all

Moreover, if $f$ satisfies any (and hence all) of the above conditions, as well as any condition
from the following list, then $f$ satisfies all conditions from the following list:
(2a) For all $x \in D_f$ we have

(2b) There exists $0 < r < 1$ such that for all $x \in D_f$ we have

(2c) If a sequence $\{x_k\}$ converges to a point in the boundary $\partial D_f$, then

Hence, since SC consists precisely of those functionals satisfying conditions 1a and 2a, by
choosing one condition from the first set and one from the second, we can define the set SC
as the set of functionals satisfying the two chosen conditions.

Proof. To prove the theorem, we first establish the equivalence of conditions 1a, 1b, and 1c.
We then prove conditions 1a and 2b together imply 2a, as do conditions 1a and 2c. Trivially,
2a implies 2b. To conclude the proof, it then suffices to recall that by Theorem 2.2.9,
conditions 1a and 2a together imply 2c.

Now we establish the equivalence of 1a, 1b, and 1c. Trivially, 1a implies 1b. Next
note

Consequently, condition 1b implies for $y$ near $x$,

It easily follows that 1b implies 1c.

To conclude the proof of the equivalence of 1a, 1b, and 1c, we assume 1c holds but
1a does not, then obtain a contradiction.
Since

condition la not holding implies there exist x, /, and € > 0 such that ||/ — x \\x < -^ and

Considering points on the line segment between x and /, and relying on continuity of the
Hessian, it then readily follows that there exists y (possibly y = x) satisfying

and

for all z := y + t(y — x), where t > 0 is sufficiently small.


Condition 1c implies for $z$ near $y$,

That is, for all $v \ne 0$,

Likewise, from (2.36),



In particular,

Substituting into (2.38) and relying on the colinearity of x, y, and z, we have

From (2.39) and (2.40), for all $v \ne 0$,

that is,

contradicting (2.37). We have thus proven the equivalence of conditions 1a, 1b, and 1c.
Now we prove that conditions 1a and 2b together imply 2a. Let $0 < r < 1$ be as in
2b, i.e., $B_x(x, r) \subseteq D_f$ for all $x \in D_f$. Let $r' := (2 - r)r$. We show 1a and 2b together
imply $B_x(x, r') \subseteq D_f$ for all $x \in D_f$, i.e., 2b holds with $r'$ in place of the smaller value $r$.
Likewise, $r'$ can be replaced with a larger value. In the limit we find $B_x(x, 1) \subseteq D_f$ for all
$x \in D_f$, precisely condition 2a.
To verify that 1a and 2b together allow $r$ to be replaced with the larger value $r'$, assume
$x, y \in D_f$ satisfy $\|y - x\|_x < r$. Let

The colinear points $x$, $y$, $z$ satisfy $\|z - x\|_x = (2 - \|y - x\|_x)\|y - x\|_x$. Consequently, since
$y$ is an arbitrary point in $B_x(x, r)$, it suffices to verify that 1a and 2b together imply $z \in D_f$.
By condition 2b, $y \in D_f$. Condition 1a applied with $v := z - y$ thus gives

Hence, since x we h
Consequently, 2b can be applied to y and z (in place of x and y), yielding the desired
inclusion z € Df.
To conclude the proof of the theorem, it remains only to prove that conditions 1a and
2c together imply 2a.
Assuming $y \in D_f$ satisfies $\|y - x\|_x < 1$, condition 1a implies

In particular, $f(y)$ is bounded away from $\infty$ on each set $B_x(x, r) \cap D_f$, where $r < 1$. By
condition 2c we conclude $B_x(x, r) \subseteq D_f$ if $r < 1$. Hence, $B_x(x, 1) \subseteq D_f$, completing the
proof.
We turn to the original definition of self-concordance due to Nesterov and Nemirovskii.
First, we provide some motivation.
We know that if one restricts a self-concordant functional $f$ to subspaces (or translates
thereof) one obtains self-concordant functionals. In particular, if $f$ is restricted to a line
$t \mapsto x + td$ (where $x, d \in \mathbb{R}^n$), then

is a univariate self-concordant functional. Since for 0 we have

the property

is identical to

Squaring both sides, then subtracting 1 from both sides, and finally multiplying both sides
by0"(f)/k-f|,wefind

If $f$ (and hence $\phi$) is thrice-differentiable, we thus have

This result has a converse. The converse, given by the following theorem, coincides with the
original definition of self-concordance due to Nesterov and Nemirovskii. (Keep in mind our
standing assumptions that $f \in C^2$, $D_f$ is open and convex, and $H(x)$ is pd for all $x \in D_f$.)

Theorem 2.5.3. Assume $f \in C^3$ and assume each of the univariate functionals $\phi$ obtained
by restricting $f$ to lines intersecting $D_f$ satisfies (2.41) for all $t$ in its domain. Furthermore,
assume that if a sequence $\{x_k\}$ converges to a point in the boundary $\partial D_f$, then $f(x_k) \to \infty$.
Then $f \in SC$.

Proof. The proof assumes the reader to be familiar with certain properties of differentials.
To prove the theorem, it suffices to prove $f$ satisfies conditions 1c and 2c of Theorem 2.5.2.
Of course 2c is satisfied by assumption.
The family of inequalities (2.41) (an inequality for each $x$, $d$, and $t$) is equivalent to
the family of inequalities

(an inequality for each $x$ and $u$). On the other hand, the family of inequalities given by
condition Ic is equivalent to

However, for any $C^3$-functional $f$ and for any inner product norm

Hence, (2.43) follows from (2.42).


The original definition of (strongly nondegenerate) self-concordance was that $f$ satisfy
the assumptions of Theorem 2.5.3. The theorem shows such $f$ to be self-concordant
according to our definition. Our definition is ever-so-slightly less restrictive by requiring
only $f \in C^2$, not $f \in C^3$. For example, letting $\|\ \|$ denote the Euclidean norm, the
functionals

and

are self-concordant according to our definition but not according to the original definition.
Neither functional is thrice-differentiable at the origin.
Our definition was not chosen for the slightly broader set of functionals it defines. It
was chosen because it provides the reader with a sense of the geometry underlying self-
concordance and because it is handy in developing the theory. Nonetheless, the original
definition has distinct advantages, especially in proving a functional to be self-concordant.
For example, assume $D \subseteq \mathbb{R}^n$ is open and convex, and assume $F \in C^3$ is a functional which
takes on only positive values in $D$ and only the value 0 on the boundary $\partial D$. Furthermore,
assume that for each line intersecting $D$, the univariate functional $t \mapsto F(x + td)$ obtained
by restricting $F$ to the line happens to be a polynomial (moreover, a polynomial with only
real roots). Then, relying on the original definition of self-concordance, it is easy to prove
that the functional

is self-concordant. Indeed, letting $\phi(t) := f(x + td)$ be the restriction of $f$ to a line,


and letting r\,..., rj denote the roots of the univariate polynomial t M» F(x + td) (listed
according to multiplicity), we have

the inequality due to the relation $\|\ \|_3 \le \|\ \|_2$ between the 2-norm and the 3-norm on

It is an insightful exercise to show that the self-concordance of the various logarithmic
barrier functions is a special case of the result described in the preceding paragraph.
Incidentally, functionals $F$ as above are known as "hyperbolic polynomials" (cf. [2], [9]).
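The inequality behind this argument can be checked numerically. The sketch below rests on assumptions not recoverable from this copy: that the functional in question is $f = -\ln F$, so that along a line $\phi(t) = -\sum_i \ln|t - r_i|$ up to an additive constant, and that (2.41) is the usual Nesterov-Nemirovskii inequality $|\phi'''(t)| \le 2\,\phi''(t)^{3/2}$. The function names are hypothetical.

```python
import numpy as np

def second_and_third_derivative(t, roots):
    """Derivatives of phi(t) = -sum(ln|t - r_i|), assumed restriction of -ln F to a line."""
    d = t - roots
    phi2 = np.sum(d ** -2.0)          # phi''(t)  =  sum 1 / (t - r_i)^2
    phi3 = -2.0 * np.sum(d ** -3.0)   # phi'''(t) = -2 * sum 1 / (t - r_i)^3
    return phi2, phi3

def nn_inequality_holds(t, roots):
    # |phi'''| <= 2 * phi''^(3/2) is the 3-norm <= 2-norm relation applied to
    # the vector with entries 1 / (t - r_i).
    phi2, phi3 = second_and_third_derivative(t, roots)
    return abs(phi3) <= 2.0 * phi2 ** 1.5 * (1.0 + 1e-9)

rng = np.random.default_rng(0)
results = []
for _ in range(1000):
    roots = rng.normal(size=rng.integers(1, 8))   # real roots, with random degree
    t = rng.normal() * 3.0
    results.append(nn_inequality_holds(t, roots))
all_hold = all(results)
```

With a single root the two sides agree exactly, which is the one-dimensional logarithmic barrier $-\ln t$.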
We close this section with a discussion of the qualifying phrase "strongly non-
degenerate," which we suppress throughout the book.
Nesterov and Nemirovskii define self-concordant functionals as functionals $f \in C^3$,
with open and convex domains, satisfying (2.41) for the univariate functionals $\phi$ obtained
by restricting $f$ to lines. It is readily proven that self-concordant functionals (thus defined)
have psd Hessians. They define strongly self-concordant functionals as having the additional
property that $f(x_k) \to \infty$ if the sequence $\{x_k\}$ converges to a point in the boundary $\partial D_f$.
Finally, strongly nondegenerate self-concordant functionals are those which satisfy the yet
further property that $H(x)$ is pd for all $x \in D_f$.
One might ask if the Nesterov-Nemirovskii definition of, say, strong self-concordance
has a geometric analogue similar to our definition of (strongly nondegenerate) self-
concordance. One should not expect the analogue to be as intuitively simple as our
definition—for the bilinear forms

are not inner products if / is not nondegenerate. However, there is indeed an analogue, ob-
tained as a simple generalization of the definition for strongly nondegenerate self-concordance.
Roughly, strongly self-concordant functionals are those obtained by extending strongly
nondegenerate self-concordant functionals to larger vector spaces by having the functional
be constant on parallel slices. Specifically, one can prove (as is done in [15]) that $f$ is strongly
self-concordant iff $\mathbb{R}^n$ is a direct sum $L_1 \oplus L_2$ of subspaces for which there exists a strongly
nondegenerate self-concordant functional $h$, with $D_h \subseteq L_1$, satisfying $f(x_1, x_2) = h(x_1)$.
For example, $f(x) := -\ln(x_1)$ is a strongly self-concordant functional with domain the
half-space $\mathbb{R}_{++} \oplus \mathbb{R}^{n-1}$ in $\mathbb{R}^n$, but it is not nondegenerate.
If self-concordant (resp., strongly self-concordant) functionals are added, the resulting
functional is self-concordant (resp., strongly self-concordant). If one of the summands is
strongly nondegenerate, so is the sum. This indicates how the theory of self-concordant
functionals, and strongly self-concordant functionals, parallels the theory developed in this
book. To get to the heart of the ipm theory quickly and cleanly, we focus on strongly
nondegenerate self-concordant functionals.
Henceforth, we return to our practice of referring to functionals as self-concordant
when, strictly speaking, we mean the functionals to be strongly nondegenerate self-
concordant.
Chapter 3

Conic Programming and Duality

3.1 Conic Programming


Linear programming (LP) and semidefinite programming (SDP) are special cases of conic
programming (CP). The ingredients for a CP instance are a closed, convex cone $K \subseteq \mathbb{R}^n$
(if $x_1, x_2 \in K$ and $t_1, t_2 \ge 0$, then $t_1 x_1 + t_2 x_2 \in K$), a linear operator $A : \mathbb{R}^n \to \mathbb{R}^m$,
vectors $c \in \mathbb{R}^n$ and $b \in \mathbb{R}^m$, and for our development, an inner product $\langle\ ,\ \rangle$ on $\mathbb{R}^n$. These
ingredients give the following instance:

$$ \begin{array}{ll} \min & \langle c, x \rangle \\ \text{s.t.} & Ax = b, \\ & x \in K. \end{array} \tag{3.1} $$

LP is obtained with $K = \mathbb{R}^n_+$ (the nonnegative orthant), whereas SDP is obtained by letting
$K = \mathbb{S}^{n \times n}_+$ (the cone of psd matrices). If one lets $K$ be the second-order cone, that is,

then one has second-order programming.


Strictly speaking, "min" should be replaced with "inf" in (3.1) even though K is
closed. (We discuss this later.) Let val denote the optimal objective value,

If the constraints are infeasible (i.e., inconsistent), define $\mathrm{val} := \infty$.


We refer to (3.1) as the primal instance. The dual of the primal instance is itself a CP
instance, its cone being the dual cone of K,

Assuming $\mathbb{R}^m$, like $\mathbb{R}^n$, is endowed with an inner product, the dual instance is


where $A^*$ denotes the adjoint of $A$. The constraints of the dual instance can be written in
the same form as the primal if we introduce slack variables $s \in \mathbb{R}^n$:

Let $\mathrm{val}^*$ denote the optimal value of the dual instance: $-\infty$ if the instance is infeasible.
An especially important relation between feasible points for the primal and dual instances
is given by the identities

Since $x \in K$ and $s \in K^*$ it follows that $\mathrm{val}^* \le \mathrm{val}$, an inequality known as weak duality.
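Weak duality is easy to see on a small LP instance. The sketch below is an illustration under assumptions: $K$ is the nonnegative orthant (which is its own dual cone under the dot product), the adjoint $A^*$ is the transpose, and the identity being checked is taken to be $\langle c, x\rangle - \langle b, y\rangle = \langle x, s\rangle$ for primal feasible $x$ and dual feasible $(y, s)$. The data are random and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 6
A = rng.normal(size=(m, n))

# Build a primal feasible point first, then define b so that Ax = b holds.
x = rng.uniform(0.1, 1.0, n)     # x in K = nonnegative orthant
b = A @ x

# Build a dual feasible pair (y, s), then define c so that A^T y + s = c holds.
y = rng.normal(size=m)
s = rng.uniform(0.1, 1.0, n)     # s in K* = nonnegative orthant
c = A.T @ y + s

primal_value = c @ x
dual_value = b @ y
duality_gap = primal_value - dual_value   # expected to equal <x, s> >= 0
```

Because the gap equals $\langle x, s\rangle$, it is nonnegative for every feasible pair, which is exactly the inequality $\mathrm{val}^* \le \mathrm{val}$.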
To understand how the geometry of the primal and dual instances are related, fix x
satisfying Ax = b and (y, s) satisfying A*y + s = c. Letting L denote the nullspace of
A, note by (3.2) that up to an additive constant in the objective functional (specifically, the
constant $-\langle b, y \rangle$), the primal instance is precisely

On the other hand, making the standard (simplifying) assumption that A is surjective, and
hence A* is injective, if (y, s) is dual feasible, then y is uniquely determined by s. Conse-
quently, the dual instance is equivalent to

(In fact, they are equivalent even if A is not surjective.) Geometrical relations between the
primal and dual instances are apparent from (3.3) and (3.4). Moreover, since (K*)* = K (as
follows from Corollary 3.2.2 and the closedness of K), (3.3) and (3.4) make it geometrically
clear that the dual of the dual instance is the primal instance, a fact that is also readily
established algebraically.
In the literature, CP is often introduced without reference to an inner product. The
objective functional (c, x) is expressed c*(x), where c* is an element of the dual space of
the primal space $\mathbb{R}^n$, i.e., an element of the space of all continuous linear functionals on
the primal space. For us, the inner product naturally identifies the primal space with the
dual space; for each element c* in the dual space there exists c in the primal space such that
c*(x) = (c,x) for all x. The main advantage of relying on an inner product is that it makes
apparent geometrical relations between a primal instance and its dual instance. The main
disadvantage is that the identification of the primal space with the dual space changes when
the inner product is changed, forcing one to use notation that depends on the inner product.
However, by this point the reader is accustomed to changing local inner products, so the
advantages far outweigh the disadvantages.
Let us be precise as to how the primal and dual instances change when the inner
product is changed. Assume the inner product on $\mathbb{R}^n$ is changed from $\langle\ ,\ \rangle$ to $\langle\ ,\ \rangle_H$,
where $H$ is pd w.r.t. $\langle\ ,\ \rangle$. (Recall $\langle u, v \rangle_H := \langle u, Hv \rangle$.) The primal instance can then be
expressed as

where $c_H := H^{-1}c$. Assuming the inner product on $\mathbb{R}^m$, the range space of $A$, is left
unchanged, the dual instance is

where $A^*_H = H^{-1}A^*$ is the new adjoint of $A$, $K^*_H = H^{-1}K^*$ is the new dual cone of $K$,
and $s_H$ is the vector of slack variables. A point $(y, s)$ is dual feasible before the change of
inner product iff the point $(y, H^{-1}s)$ is dual feasible after the change.
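The bookkeeping for a change of inner product can be verified numerically. The following sketch assumes the original inner product is the dot product (so $A^* = A^T$), takes $K$ to be the nonnegative orthant, and uses a randomly generated positive definite $H$. The names `ip_H`, `c_H`, `s_H`, and `A_star_H_y` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 2, 4
A = rng.normal(size=(m, n))
M = rng.normal(size=(n, n))
H = M @ M.T + n * np.eye(n)            # symmetric positive definite

x = rng.uniform(0.1, 1.0, n)           # a point of K (nonnegative orthant)
y = rng.normal(size=m)
s = rng.uniform(0.1, 1.0, n)           # a point of K* w.r.t. the dot product
c = A.T @ y + s                        # makes (y, s) dual feasible

def ip_H(u, v):
    """The new inner product <u, v>_H := <u, Hv>."""
    return u @ (H @ v)

c_H = np.linalg.solve(H, c)            # c_H = H^{-1} c
s_H = np.linalg.solve(H, s)            # s_H = H^{-1} s
A_star_H_y = np.linalg.solve(H, A.T @ y)   # A*_H y = H^{-1} A^T y
```

The checks one would run are that $\langle c_H, x\rangle_H = \langle c, x\rangle$, that $\langle Ax, y\rangle = \langle x, A^*_H y\rangle_H$, that $A^*_H y + s_H = c_H$, and that $\langle x, s_H\rangle_H = \langle x, s\rangle \ge 0$, so $s_H$ pairs nonnegatively with $K$ in the new inner product.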
We close this section with an overview of the remainder of the chapter.
In LP, one has strong duality; if either val or val* is finite, then val = val*. Strong
duality can fail for CP instances, even for simple SDP instances over the 2 x 2 symmetric
matrices. However, under conditions which typically must be satisfied for the application
of ipm's, strong duality is present. For example, it is present if the primal and dual instances
are feasible and at least one of them remains feasible whenever its right-hand-side vector (b
or c, respectively) is perturbed by a small amount, as is the case, say, if A is surjective and
the primal instance has a feasible point in the interior of K. We prove this result, as well as
other basic results in the classical duality theory, in §3.2.
We study conjugate functionals in §3.3. Given a barrier functional / whose domain
is the interior of K, we show (among other things) that the conjugate functional is a barrier
functional whose domain is the interior of the dual cone K*. The connections between
ipm's and duality theory hinge critically on this fact.
In §3.4 we establish the fundamental relations between the central path for the primal
instance and the central path for the dual instance. We show that in following one path, the
other is virtually generated as a by-product.
In §3.5 we present the theory of self-scaled (or symmetric) cones, a topic first de-
veloped in the ipm literature by Nesterov and Todd [16], [17]. The pronounced structure
of these cones allows for the development of symmetric primal-dual ipm's (algorithms in
which the roles of the primal and dual instances are mirror images). The most important
cones—both practically and theoretically—are self-scaled cones. The nonnegative orthant,
the cone of psd matrices, and the second-order cone are all self-scaled.
In designing an iterative algorithm (an algorithm that generates a sequence of points
as do ipm's), the choice of direction for moving from one point to the next is crucial. In §3.6
we discuss the Nesterov-Todd directions, perhaps the most prevalent directions appearing
in the design of primal-dual ipm's.

We present and analyze two types of primal-dual ipm's. In §3.7 we consider path-
following methods, algorithms which stay near the central path, much like the barrier
method in §2.4. In §3.8 we consider a potential-reduction method. Whereas the theory
of path-following methods requires the algorithms to stay near the central path, the theory
of potential-reduction methods does not, thus suggesting that potential-reduction methods
are more robust. The analysis of the progress for a potential-reduction method depends on
showing that a certain function—an appropriately chosen potential function—decreases by
a constant amount with each iteration. The decrease can be established regardless of where
the iterates lie; they need not lie near the central path.

3.2 Classical Duality Theory


In this section we develop the basic duality theory of CP. Although the duality theory predates
ipm's by decades, we include its development because the ipm literature often assumes the
reader to be familiar with the central results. Understanding the proofs in this section is by
no means essential for understanding the rest of the chapter. Perhaps the main purpose of
this section is to serve as motivation for subsequent sections.
We remarked that for CP, the strong duality relation can fail to hold between a primal
instance

and its dual instance

For an example where strong duality fails, let $K$ be the second-order cone in $\mathbb{R}^3$:

With regard to the dot product, it is an instructive exercise to show $K = K^*$; that is, $K$ is
self-dual. Consequently, for the primal instance

the dual instance, involving a single variable y, is

Rewriting $x_1^2 + x_2^2 \le x_3^2$ as $x_1^2 \le (x_3 + x_2)(x_3 - x_2)$, it is readily seen that val = 0. On the
other hand, to be feasible for the dual, $y$ must satisfy $1 + y^2 \le y^2$, obviously impossible.
Thus $\mathrm{val}^* = -\infty$ and hence $\mathrm{val} \ne \mathrm{val}^*$, even though one of the optimal objective values is
finite.
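The self-duality of the second-order cone used in this example can be sanity-checked numerically. The sketch assumes the cone is $K = \{x \in \mathbb{R}^3 : x_3 \ge \sqrt{x_1^2 + x_2^2}\}$ with the dot product; the function names are hypothetical. It checks both inclusions: points of $K$ have nonnegative dot products with one another, and any point outside $K$ has a witness $s \in K$ with $\langle x, s\rangle < 0$.

```python
import numpy as np

def in_cone(x, tol=1e-12):
    """Membership test for the assumed cone x_3 >= sqrt(x_1^2 + x_2^2)."""
    return x[2] + tol >= np.hypot(x[0], x[1])

def witness(x):
    """For x outside K, return s in K with <x, s> < 0 (showing x is not in K*)."""
    r = np.hypot(x[0], x[1])
    if r == 0.0:
        return np.array([0.0, 0.0, 1.0])     # here x_3 < 0, so <x, s> = x_3 < 0
    return np.array([-x[0], -x[1], r])       # <x, s> = r * (x_3 - r) < 0

rng = np.random.default_rng(3)

def sample_in_cone():
    u = rng.normal(size=2)
    return np.array([u[0], u[1], np.hypot(u[0], u[1]) + abs(rng.normal())])

def sample_outside_cone():
    u = rng.normal(size=2)
    return np.array([u[0], u[1], np.hypot(u[0], u[1]) - abs(rng.normal()) - 0.1])

inner_ok = all(sample_in_cone() @ sample_in_cone() >= -1e-12 for _ in range(1000))
outer_ok = True
for _ in range(1000):
    x = sample_outside_cone()
    s = witness(x)
    outer_ok = outer_ok and in_cone(s) and (x @ s < 0)
```

The first check reflects $K \subseteq K^*$ (by the Cauchy-Schwarz inequality) and the second reflects $K^* \subseteq K$.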

A slightly more elaborate example shows strong duality can fail even when both of
the optimal objective values are finite. Consider the primal instance

where

Once again, K is self-dual. Consequently, the dual instance is

As before, val = 0. On the other hand, $y = (-1, 0)$ is dual optimal as all dual feasible
points satisfy $y_1 = -1$. Hence $\mathrm{val}^* = -1$.
A standard approach to proving strong duality for LP is via the Farkas lemma. The
Farkas lemma is a "theorem of exclusive alternatives," stating that exactly one of the fol-
lowing two systems is feasible:

The analogue for CP would be that exactly one of the following two systems is feasible:

However, just as strong duality can fail for CP instances, so can this analogue of the Farkas
lemma, as the reader can readily verify from the following example in which $K$ is the
second-order cone in $\mathbb{R}^3$:

Neither system is feasible.


Although strong duality can fail in CP, something only slightly weaker always holds,
something that is called "asymptotic strong duality." Similarly, although the strong version
of the Farkas lemma can fail in CP, something only slightly weaker never fails. Moreover,
in virtually the same manner one deduces strong duality from the Farkas lemma in LP, one
can prove asymptotic strong duality from an asymptotic version of the Farkas lemma for
general CP. We do precisely this.
The heart of CP duality theory is the Hahn-Banach theorem. From this single theorem
the theory is built. The Hahn-Banach theorem has many versions, most pertaining to
very general vector spaces, e.g., arbitrary topological vector spaces, including infinite-
dimensional ones. Likewise, CP duality theory can be developed quite generally (cf. [1]

for an extremely general development). However, by restricting to $\mathbb{R}^n$, the proofs are eased
considerably. Hence, just as we are developing ipm theory in $\mathbb{R}^n$, we develop duality theory
in finite dimensions.

Theorem 3.2.1 (a geometric "Hahn-Banach theorem"). Assume S is a nonempty, closed,
convex subset of $\mathbb{R}^n$. If $x \in \mathbb{R}^n$ but $x \notin S$, then there exists $d \in \mathbb{R}^n$ such that

Proof. Using the fact that in $\mathbb{R}^n$, bounded sequences have convergent subsequences, it is
not difficult to prove that S has a closest point to x, i.e., a point $s' \in S$ satisfying

$$\|s' - x\| = \min \{ \|s - x\| : s \in S \},$$

where $\| \cdot \|$ is the norm induced by $\langle \cdot , \cdot \rangle$.


We claim

Indeed, assume otherwise, letting $s \in S$ violate the inequality. For 0 < t < 1, define
$s(t) := s' + t(s - s')$. Since $s(t) \in S$ and

we conclude that for sufficiently small t > 0, s(t) is a point in S strictly closer to x than is
s', a contradiction.

However,

completing the proof.
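The construction in the proof (take the closest point s' of S to x and use the direction from x to s') can be sketched for a concrete set. Everything below is an illustrative assumption: S is taken to be the unit box, the inner product is the dot product, and the separation is checked in the standard form $\langle d, x \rangle < \langle d, s \rangle$ for all $s \in S$ with $d := s' - x$. The function names are hypothetical.

```python
import itertools

def project_onto_box(x, lo=0.0, hi=1.0):
    """Closest point of the box S = [lo, hi]^n to x in the Euclidean norm."""
    return [min(max(xi, lo), hi) for xi in x]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def separating_direction(x):
    """Return (d, s_closest) with d = s_closest - x, following the proof's construction."""
    s_closest = project_onto_box(x)
    d = [si - xi for si, xi in zip(s_closest, x)]
    return d, s_closest

x = [2.0, -0.5, 0.3]                      # a point outside S = [0, 1]^3
d, s_closest = separating_direction(x)
corners = [list(c) for c in itertools.product([0.0, 1.0], repeat=len(x))]
# A linear functional attains its minimum over a box at a corner,
# so checking the corners checks all of S.
margin = min(dot(d, s) for s in corners) - dot(d, x)
```

In this example the margin equals $\|s' - x\|^2$, the squared distance from x to S, which is positive precisely because x lies outside the closed set S.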

Corollary 3.2.2. Assume K is a nonempty, closed, convex cone in $\mathbb{R}^n$. If $x \in \mathbb{R}^n$ but $x \notin K$,
there exists $d \in \mathbb{R}^n$ such that

The Farkas lemma for LP extends to CP if one relaxes the notion of feasibility. In LP,
an instance is either feasible, or is infeasible and remains infeasible if the right-hand-side
vector b is perturbed to any vector $b + \Delta b$ for which $\|\Delta b\|$ is sufficiently small. This is
not true of CP instances in general. For example, the system on the left of (3.5) becomes
feasible if the right-hand-side coordinate 0 is replaced by $\epsilon > 0$.
Assuming $\mathbb{R}^m$, like $\mathbb{R}^n$, is endowed with an inner product (and the induced norm), one
says the system

is asymptotically feasible if it is feasible or can be made feasible by perturbing b by an
arbitrarily small amount, that is, if for each $\epsilon > 0$ there exists $\Delta b$ satisfying $\|\Delta b\| < \epsilon$
for which the system obtained by replacing b with $b + \Delta b$ is feasible.

Theorem 3.2.3 (asymptotic Farkas lemma). Either the system

is asymptotically feasible or the system

is feasible, but not both.

Proof. Fixing A, consider the cone consisting of right-hand-side vectors for which the first
system is feasible:

Clearly, $b \in \overline{K(A)}$ (the closure of K(A)) iff the system with right-hand-side vector b is
asymptotically feasible. Consequently, invoking Corollary 3.2.2 for the cone $\overline{K(A)}$, one
has the mutually exclusive alternatives that either the system with right-hand-side vector b
is asymptotically feasible or there exists $y \in \mathbb{R}^m$ satisfying

Observe that y satisfies (3.7) iff y satisfies

that is, iff y satisfies

Since (3.8) is the same as

the proof of the theorem is complete.


Now we turn to discussing asymptotic strong duality. The notion of asymptotic
feasibility wears a new guise, one pertaining to the optimal objective value. For the primal
instance, the relevant quantity is not the optimal objective value for the instance itself but,
rather, the infimum of the optimal objective values from instances obtained by perturbing
the right-hand-side vector b by an arbitrarily small amount. This quantity is called the
asymptotic optimal value for the instance. We denote it a-val. Precisely,

Note that a-val = $\infty$ if the primal instance is not asymptotically feasible.


Here is the main theorem relating the optimal objective values of the primal and dual
instances.

Theorem 3.2.4 (asymptotic strong duality). If the primal instance is asymptotically fea-
sible, then a-val = val*.

Proof. Fix $v \in \mathbb{R}$ and consider the following constraints in the variables $(x, \alpha) \in \mathbb{R}^n \times \mathbb{R}$:

Endow $\mathbb{R}^n \times \mathbb{R}$ with the inner product

The dual cone for $K \times \mathbb{R}_+$ is then $K^* \times \mathbb{R}_+$. Likewise, extend the inner product on $\mathbb{R}^m$ to
$\mathbb{R}^m \times \mathbb{R}$. The adjoint of the linear operator

is then

It is readily seen that the system (3.9) is asymptotically feasible iff a-val $\le$ v. Consequently,
by Theorem 3.2.3 we have the exclusive alternatives that either a-val $\le$ v or the
following system in the variables $(y, \beta) \in \mathbb{R}^m \times \mathbb{R}$ is consistent:

We now prove a-val $\ge$ val*. Assume otherwise and choose v to satisfy a-val < v <
val*. Since a-val < v, we know the system (3.10) is inconsistent. Since v < val*, we know
there exists dual feasible y satisfying $\langle b, y \rangle > v$. Then $(y, -1)$ satisfies the system (3.10),
a contradiction. Hence a-val $\ge$ val*.
Now we prove a-val $\le$ val*. Assume otherwise and choose v to satisfy val* < v <
a-val. Since v < a-val, the system (3.10) is satisfied by some point $(y, \beta)$. If $\beta = 0$, then
y satisfies

and hence, by Theorem 3.2.3, the primal instance is not asymptotically feasible, contradicting
the assumption of the theorem. Hence, $\beta < 0$. Consequently, $1/\beta$ is defined and thus
the point $\bar{y} := -\frac{1}{\beta} y$ satisfies

contradicting val* < v. Hence, a-val $\le$ val*, completing the proof.

Not surprisingly, the theorem has a dual analogue which can be proven by showing
that the dual of the dual instance is (equivalent to) the primal instance, then invoking the
theorem. To that end, define the asymptotic optimal value for the dual instance to be

Corollary 3.2.5. If the dual instance is asymptotically feasible, then a-val* = val.

Proof. The dual instance is equivalent to

Relying on Corollary 3.2.2 and the closedness of K, it is straightforward to prove K** = K.


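Consequently, the dual cone for the cone $\mathbb{R}^m \times K^*$ is $\{0\} \times K$. Since the adjoint of
$[A^* \ \ I] : \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}^n$ is the operator $x \mapsto (Ax, x)$ from $\mathbb{R}^n$ to $\mathbb{R}^m \times \mathbb{R}^n$,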

it follows that the dual instance for (3.11) is

Clearly, this instance is equivalent to the original primal instance.


Applying Theorem 3.2.4 to the instance (3.11) and its dual instance (3.12) by the
equivalences noted above, the corollary is proven.
Despite CP duality theory having to make special amends for the possible failure of
strong duality, the failure is rare in a generic sense as the next theorem indicates.
Let us say that the primal instance is strongly feasible if it is feasible and remains
feasible for all sufficiently small perturbations of b (i.e., if there exists $\epsilon > 0$ such that
when b is replaced by $b + \Delta b$ for any $\Delta b$ satisfying $\|\Delta b\| < \epsilon$, the resulting constraints
are consistent). Similarly, we can speak of strong feasibility of the dual instance.

Theorem 3.2.6. If either the primal instance or the dual instance is strongly feasible, then
val = val*.

Proof. We first observe that we may assume both instances are asymptotically feasible. For
if, say, the primal instance is not asymptotically feasible (hence val = $\infty$), there exists
y satisfying the second system of Theorem 3.2.3. With the dual instance being strongly
feasible (in particular, being feasible), by adding arbitrarily large positive multiples of y
to any feasible point for the dual instance, we obtain dual feasible points with arbitrarily
large objective values. Thus val* = $\infty$ = val, giving the equality in the theorem. Hence,
assume both instances are asymptotically feasible.
We know that both instances being asymptotically feasible implies

For definiteness, assume the dual instance to be strongly feasible. (A similar proof
applies if the primal instance is strongly feasible.) In light of the relations (3.13), to prove
the theorem it suffices to assume a-val < val and show it follows that the dual instance is
not strongly feasible, a contradiction.
So assume a-val < val. By the definition of a-val and the assumption that the primal
instance is asymptotically feasible, there exist sequences $\{x_i\} \subset K$, $\{\Delta b_i\}$ such that

We claim $\|x_i\| \to \infty$. For otherwise, $\{x_i\}$ has a convergent subsequence; the limit
point x is easily proven to satisfy

from which it is immediate that a-val = val, contrary to our assumption a-val < val.
Let $\Delta c$ be a limit point of the sequence $\{\frac{1}{\|x_i\|} x_i\}$. Of course $\|\Delta c\| = 1$; in particular,
$\Delta c \ne 0$. For fixed $\epsilon > 0$ consider the following CP instance:

Note the asymptotic optimal value of this instance is no greater than

Hence, by Theorem 3.2.4, the dual instance of (3.14) is infeasible. The dual instance is

Since $\epsilon$ can be an arbitrarily small positive number, this infeasibility contradicts the assumed
strong feasibility of the original dual instance.
In applying ipm's to solve a primal CP instance, one relies on a barrier functional
whose domain is the interior of the cone K. Thus, for the central path to exist, there must
be primal feasible points in the interior of K. In that regard, the following corollary shows
the potential failure of strong duality to be somewhat irrelevant to the study of ipm's. The
corollary implies that if the central path exists for the primal instance, then strong duality
holds. The same holds for the dual instance (as is further elucidated in §3.4).

Corollary 3.2.7. If the linear operator A is surjective and there is a primal feasible point
in the interior of K, then val = val*. Similarly, if there exists a dual feasible point (y,s)
with s in the interior of K*, then val = val*.

Lastly, we briefly discuss the fact that in CP there may not exist optimal solutions even
when the optimal objective values are finite—and even when, in addition, strong duality

is present. Strictly speaking, our use of "min" for the primal instance and "max" for the
dual instance should be replaced by "inf" and "sup," respectively. However, under an
assumption of strong feasibility, an optimal solution does indeed exist. Insofar as we are
interested in the connections between duality theory and ipm theory, "min" and "max" are
entirely appropriate.
For an example of an optimal solution not existing, let K denote the second-order
cone in $\mathbb{R}^3$. We leave it to the reader to verify that the primal instance

satisfies val = 0, the dual instance satisfies val* = 0, but the primal instance does not have
an optimal solution.
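To see how an infimum can fail to be attained, here is a sketch built on one instance with the stated properties. It is an assumption for illustration, not necessarily the instance displayed above: minimize $x_3 - x_2$ subject to $x_1 = 1$, $x \in K$, with K the second-order cone $\{x : x_1^2 + x_2^2 \le x_3^2,\ x_3 \ge 0\}$. Every feasible point has $x_3 \ge \sqrt{1 + x_2^2} > x_2$, so the objective is always positive, yet it tends to 0 along the feasible curve $x(t) = (1, t, \sqrt{1 + t^2})$. For this assumed instance the dual slack is $s = (-y, -1, 1)$, which lies in K only for y = 0, so val* = 0 as well.

```python
import math

def feasible_point(t):
    """x(t) = (1, t, sqrt(1 + t^2)): satisfies x1 = 1 and lies on the boundary of K."""
    return (1.0, t, math.sqrt(1.0 + t * t))

def objective_along_curve(t):
    """x3 - x2 at x(t), written as 1 / (sqrt(1 + t^2) + t) to avoid cancellation."""
    return 1.0 / (math.sqrt(1.0 + t * t) + t)

# Objective values for t = 1, 10, ..., 10^6: positive, decreasing, approaching 0.
values = [objective_along_curve(10.0 ** k) for k in range(0, 7)]
```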

Theorem 3.2.8. If the dual instance is strongly feasible and the primal instance is feasible,
then the primal instance has an optimal solution. Similarly, if the primal instance is strongly
feasible and the dual instance is feasible, then the dual instance has an optimal solution.

Proof. The proof is similar in spirit to that of Theorem 3.2.6.


We consider the case that the dual instance is strongly feasible and the primal instance is
feasible. Since the primal instance is feasible, there exists a feasible sequence $\{x_i\}$ satisfying
$\langle c, x_i \rangle \to$ val. If the sequence is bounded, it has limit points and those points are easily
argued to be optimal for the primal instance. Hence, we may assume the sequence to be
unbounded. Then, exactly as in the proof of Theorem 3.2.6, we can perturb the vector c by
an arbitrarily small amount to obtain a dual instance (3.15) which is infeasible, contradicting
the strong feasibility of the dual instance.
The other case is handled similarly.
It should be mentioned that strong duality and the Farkas lemma for LP can be seen as
special cases of the asymptotic analogues for CP. In the CP theory, the need for asymptotic
results has its roots in the cone K(A) appearing in the proof of Theorem 3.2.3. The culprit
is the potential nonclosedness of that cone. However, if the cone K is polyhedral (e.g.,
$K = \mathbb{R}^n_+$, as in LP), then so is K(A), and hence K(A) is closed. With K(A) closed, a strong
version of the Farkas lemma is obtained, and from it, strong duality.
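The nonclosedness of K(A) is easy to exhibit. The operator below is an illustrative assumption, not taken from the text: with K the second-order cone in $\mathbb{R}^3$ and $A x := (x_1, x_2 + x_3)$, the right-hand side b = (1, 0) is not attained (it would force $1 + x_3^2 \le x_3^2$), but $b_\epsilon = (1, \epsilon)$ is attained for every $\epsilon > 0$. So the system with b = (1, 0) is asymptotically feasible without being feasible.

```python
def in_soc(x, tol=1e-9):
    """True if x lies in K = {x : x1^2 + x2^2 <= x3^2, x3 >= 0}, up to a relative tolerance."""
    x1, x2, x3 = x
    return x3 >= -tol and x1 * x1 + x2 * x2 <= x3 * x3 + tol * max(1.0, x3 * x3)

def apply_A(x):
    """The ASSUMED illustrative operator A x = (x1, x2 + x3)."""
    return (x[0], x[1] + x[2])

def preimage(eps):
    """A point x in K with A x = (1, eps), valid for any eps > 0."""
    x3 = 0.5 * (eps + 1.0 / eps)
    x2 = 0.5 * (eps - 1.0 / eps)
    return (1.0, x2, x3)   # x1^2 + x2^2 = x3^2, so x is on the boundary of K

# Candidates for b = (1, 0) must have the form (1, -x3, x3); none lies in K.
# (Exact comparison on a moderate grid; a sanity check, not a proof.)
hits_for_b = [k * 0.5 for k in range(0, 2001) if in_soc((1.0, -k * 0.5, k * 0.5), tol=0.0)]

solutions = {eps: preimage(eps) for eps in (1e-1, 1e-2, 1e-3)}
```

Note how the solutions blow up as eps shrinks ($x_3$ is roughly $1/(2\epsilon)$), which is the typical signature of a right-hand side lying in the closure of K(A) but not in K(A).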

3.3 The Conjugate Functional


The ipm literature is full of interesting relations between primal instances

and their dual instances



For these relations, one assumes $K^\circ$ (the interior of K) is the domain of a barrier functional.
To get a sense of how a barrier functional $f : K^\circ \to \mathbb{R}$ joins with duality theory,
consider the (negative) gradient map $x \mapsto -g(x)$. For example, in LP with the dot product,
this is the map $(x_1, \ldots, x_n) \mapsto (\frac{1}{x_1}, \ldots, \frac{1}{x_n})$. We claim the map $x \mapsto -g(x)$ takes $K^\circ$ into
$K^*$. Indeed, by Theorem 2.3.3, whenever $x, x' \in K^\circ$,

Applying this with tx' in place of x' and letting $t \to \infty$, one deduces

that is, $-g(x) \in K^*$ as claimed.


The gradient map is injective, i.e., one-to-one. For if x and x' satisfy g(x) = g(x'),
then both x and x' minimize the strictly convex functional $\bar{x} \mapsto -\langle g(x), \bar{x} \rangle + f(\bar{x})$, forcing
x = x'.
We shall see (Theorem 3.3.1) that not only does the gradient map $x \mapsto -g(x)$ carry
$K^\circ$ injectively into $K^*$, it carries $K^\circ$ onto $(K^*)^\circ$, the interior of $K^*$. The map is a bijection
between $K^\circ$ and $(K^*)^\circ$. Furthermore, we shall see that the resulting inverse map is itself
the (negative) gradient map for a barrier functional, namely, for the conjugate functional$^1$
of f:

$$f^*(s) := -\inf_{x \in D_f} \left( \langle x, s \rangle + f(x) \right).$$
The conjugate functional played a prominent role in optimization long before ipm research
blossomed, long before the notion of a self-concordant functional took form (cf. [19]).
Later in this chapter when local inner products become useful, it will be important to
keep in mind that the definition of the conjugate functional, like the definition of the dual
cone, depends on the underlying inner product. When the inner product changes, so does
the conjugate functional.
The definition of the conjugate functional $f^*$ applies to each functional f, not just barrier
functionals. Regardless of f, the conjugate functional is convex, for it is the supremum
of the convex functionals (in fact, linear functionals)

$$s \mapsto -\langle x, s \rangle - f(x), \quad x \in D_f.$$

We define the domain $D_{f^*}$ to be the set consisting of all s for which $f^*(s)$ is finite.
Throughout this section we assume f to satisfy only the following properties unless
otherwise stated: $f \in C^2$, $D_f$ is open and convex, H(x) is pd for all $x \in D_f$.
Recall SC denotes the set of (strongly nondegenerate) self-concordant functionals and
SCB denotes the subset of SC composed of barrier functionals.
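As a concrete check of the definition, the following sketch computes the conjugate of the logarithmic barrier $f(x) = -\sum_j \ln x_j$ for the nonnegative orthant under the dot product. It assumes the sign convention described in the footnote, namely $f^*(s) = -\inf_x (\langle x, s \rangle + f(x))$. Because the infimum separates by coordinate, a one-dimensional golden-section search (a generic technique chosen here, not one used in the text) suffices; the function names and search bracket are hypothetical. The closed form it is compared against, $f^*(s) = -n - \sum_j \ln s_j = f(s) - n$, follows from minimizing $s_j t - \ln t$ at $t = 1/s_j$.

```python
import math

def golden_section_min(h, lo, hi, iters=200):
    """Approximate minimizer of a unimodal function h on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    for _ in range(iters):
        if h(c) < h(d):
            b = d          # minimizer lies in [a, d]
        else:
            a = c          # minimizer lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return 0.5 * (a + b)

def conjugate_numeric(s):
    """f*(s) = -inf_x ( <x, s> + f(x) ) for f(x) = -sum(log x_j), coordinate by coordinate."""
    total = 0.0
    for sj in s:
        h = lambda t, sj=sj: sj * t - math.log(t)
        t_min = golden_section_min(h, 1e-8, 1e4)   # bracket assumed to contain 1 / sj
        total += h(t_min)
    return -total

def conjugate_closed_form(s):
    """-n - sum(log s_j), i.e. f(s) - n."""
    return -len(s) - sum(math.log(sj) for sj in s)

s = [0.5, 2.0, 3.0]
numeric_value = conjugate_numeric(s)
closed_value = conjugate_closed_form(s)
```

In this example the conjugate differs from f only by the constant n, and its domain is the interior of the orthant, which is its own dual cone under the dot product.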
It is suggested that on a first pass, the reader skip proofs in this section.

Theorem 3.3.1. If $f \in SC$, then $f^* \in SC$. If $f \in SCB$ and $D_f = K^\circ$, where K is a cone,
then $f^* \in SCB$ and

$^1$This differs slightly from the standard definition of the conjugate functional: $s \mapsto \sup_{x \in D_f} \langle x, s \rangle - f(x)$.
The only real effects of the difference are to reduce the number of minus signs appearing in our exposition and to
make the domain of the conjugate functional be the dual cone rather than the polar cone.

Moreover, $\vartheta_{f^*} \le (4\vartheta_f + 1)^2$. Furthermore, if f is logarithmically homogeneous, then so
is $f^*$ and

Toward proving the theorem, we present two propositions and a theorem.

Proposition 3.3.2. If $f \in SCB$ and $D_f = K^\circ$, where K is a cone, then

This proposition will be proven using the following proposition, which figures promi-
nently in proofs throughout this section.

Proposition 3.3.3. The gradient map $g : D_f \to \mathbb{R}^n$ is injective. If $f \in SC$, then

and $D_{f^*}$ is open.

Proof. If $-g(x_1) = s = -g(x_2)$, then $x_1$ and $x_2$ minimize the strictly convex functional

Hence, $x_1 = x_2$ and thus g is injective. Moreover, because the functional (3.17) then has a
minimizer, $s \in D_{f^*}$. Hence,

Assume $s \in D_{f^*}$ and $f \in SC$. Since $D_{f^*}$ consists of s for which $f^*(s)$ is finite, the
self-concordant functional (3.17) is bounded below. By Theorem 2.2.8, it has a minimizer
x. Clearly, $-g(x) = s$. Hence,

and so with (3.18), we have (3.16).


The openness of $D_{f^*}$ now follows from the differential of g (i.e., the Hessian) being
surjective. Alternatively, openness follows from Proposition 2.2.10.
Proof of Proposition 3.3.2. In the opening paragraphs of this section we saw $\{-g(x) : x \in
K^\circ\} \subseteq K^*$. Hence, by Proposition 3.3.3, $D_{f^*} \subseteq (K^*)^\circ$. Thus, again by Proposition 3.3.3,
to complete the proof it suffices to show that if $s \in (K^*)^\circ$, then there exists $x \in K^\circ$ such
that $-g(x) = s$; that is, it suffices to show the self-concordant functional

has a minimizer.
Since $s \in (K^*)^\circ$, there exists $\alpha > 0$ such that

In particular, if $x \in K^\circ$, then



the inequality by Theorem 2.3.3. Observe that if $\|x\| > \vartheta_f / \alpha$, then (3.19) implies the
functional $x \mapsto \langle x, s \rangle + f(x)$ to be strictly increasing as one moves outward along the ray $\{x + tx : t \ge 0\}$.
Hence,

For this self-concordant functional, Theorem 2.2.9 shows its values along a sequence $\{x_i\}$ tend to $\infty$
if the sequence converges to a point in the boundary of K. Consequently, using (3.20), it is easily
argued by compactness that the functional has a minimizer.
The identities stated in the next theorem are absolutely fundamental in developing
primal-dual ipm's.

Theorem 3.3.4. Assume $f \in SC$. Then $f^* \in C^2$. Moreover, if x and s satisfy $s = -g(x)$,
then

where g* and H* denote the gradient and Hessian of f*.

Proof. The proof goes somewhat outside this book, relying on differentials and the inverse
function theorem.
By Proposition 3.3.3, the map $-g : D_f \to D_{f^*}$ is invertible. For each $s \in D_{f^*}$,
let x(s) denote the point in $D_f$ satisfying $-g(x(s)) = s$. Since $g \in C^1$ and the first
differentials of g are invertible (i.e., the Hessians H(x) are invertible), the inverse function
theorem implies $s \mapsto x(s)$ is a $C^1$ map.
Differentiating both sides of the identity — g(x(s)) = s w.r.t. s by making use of the
chain rule, we have

that is,

Let $[D_s x(s)]^T$ denote the adjoint of $D_s x(s) : \mathbb{R}^n \to \mathbb{R}^n$. (We use "T" because "*" is
in use.)
Clearly,

Since the map $s \mapsto x(s)$ is in $C^1$, so is the functional $f^*$. Differentiating both sides w.r.t.
s, making use of the product rule and the chain rule, we have

the last equality because $s = -g(x(s))$. Thus, whenever x and s satisfy $-g(x) = s$ we
have $-g^*(s) = x$. Moreover, since $s \mapsto x(s)$ is a $C^1$ mapping, (3.22) trivially shows the
same to be true of $g^*$. Hence, $f^* \in C^2$.
Finally, note (3.22) and (3.21) imply

that is, if $-g(x) = s$, then $H^*(s) = H(x)^{-1}$.
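The two relations just derived, $-g^*(s) = x$ and $H^*(s) = H(x)^{-1}$ whenever $s = -g(x)$, can be verified directly for the logarithmic barrier of the nonnegative orthant. The sketch assumes the dot product and the closed form $f^*(s) = -n - \sum_j \ln s_j$ for the conjugate of $f(x) = -\sum_j \ln x_j$ (a routine computation, stated here as an assumption rather than quoted from the text). The Hessians are diagonal, so only their diagonals are stored.

```python
def grad_f(x):
    """g(x) for f(x) = -sum(log x_j)."""
    return [-1.0 / xj for xj in x]

def hess_diag_f(x):
    """Diagonal of H(x) = diag(1 / x_j^2)."""
    return [1.0 / (xj * xj) for xj in x]

def grad_f_conj(s):
    """g*(s) for the assumed closed form f*(s) = -n - sum(log s_j)."""
    return [-1.0 / sj for sj in s]

def hess_diag_f_conj(s):
    """Diagonal of H*(s) = diag(1 / s_j^2)."""
    return [1.0 / (sj * sj) for sj in s]

x = [0.5, 2.0, 3.0]
s = [-gj for gj in grad_f(x)]                       # s = -g(x)
recovered_x = [-gj for gj in grad_f_conj(s)]        # should reproduce x
# H*(s) H(x) should be the identity; for diagonal matrices, multiply the diagonals.
hessian_products = [hs * hx for hs, hx in zip(hess_diag_f_conj(s), hess_diag_f(x))]
```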



Proof of Theorem 3.3.1. We use a superscript "*" to distinguish between objects associated
with $f^*$ and those associated with f. For example, $\| \cdot \|^*$ denotes the local norm associated
with $f^*$.
Assume $f \in SC$. Assume x and s satisfy $-g(x) = s$. By Theorem 3.3.4, for all
$w_1, w_2$,

an identity we use freely in the remainder of the proof. Likewise for the resulting relation

We now prove $f^* \in SC$. By Theorem 2.5.2, it suffices to show that if $s \in D_{f^*}$, then

and

Assume As is a vector satisfying r := ||A5||* < |, that is, \\v\\x < |, where v :=
-H(x)~l&s. Noting ^p < 9r2, Proposition 2.2.10 shows there exists u e Bx(v,9\\v\\2x)
such that

Note that

Consequently, by Proposition 3.3.3, we have $s + \Delta s \in D_{f^*}$, establishing (3.23). Moreover,

Toward establishing (3.24), observe



However,

implying

By (3.26), (3.27), (3.25), and self-concordance of f we have

from which (3.24) is immediate. Hence, $f^* \in SC$.


Assume $f \in SCB$ and $D_f = K^\circ$, where K is a cone. Proposition 3.3.2 shows
$D_{f^*} = (K^*)^\circ$. To see that $f^*$ is a barrier functional, first observe

If f is logarithmically homogeneous, Theorem 2.3.9 shows $x = -g_x(x)$, and hence (3.28)
gives

Even if f is not logarithmically homogeneous, we have $\langle g(x), 0 - x \rangle \ge 0$ since $-g(x) \in K^*$.
Consequently, because $0 \in K$, Theorem 2.3.4 implies

Hence, by (3.28), $\vartheta_{f^*} \le (4\vartheta_f + 1)^2$.


Finally, if f is logarithmically homogeneous,

and hence, by Theorem 3.3.4,

that is, $g^*$ is homogeneous of degree $-1$. Thus, by Theorem 2.3.9, $f^*$ is logarithmically
homogeneous.
We remark that using the first statement in Theorem 3.3.1 together with the identity
for $D_{f^*}$ in Proposition 3.3.3 and the relation $-g^*(s) = x$ in Theorem 3.3.4, it is readily
proven for $f \in SC$ that $f = (f^*)^*$, just as one wants for a perfectly symmetric duality
theory.

3.4 Duality of the Central Paths


Let f denote a barrier functional whose domain is the interior of a cone K, i.e., $D_f = K^\circ$.
We know from Theorem 3.3.1 that $f^*$, the conjugate functional, is a barrier functional whose
domain is $(K^*)^\circ$, the interior of the dual cone.
Recall that the central path $\{x_\eta : \eta > 0\}$ for the primal instance

consists of the points $x_\eta$ solving the linearly constrained optimization problems

The central path $\{(y_\eta, s_\eta) : \eta > 0\}$ for the dual instance

consists of the points $(y_\eta, s_\eta)$ solving the problems

We shall see that under the assumption of logarithmic homogeneity, these central paths are
strongly related, each arising from the optimality conditions for the other. For this, it is
convenient to adopt the geometrical viewpoint described in §3.1.
Fixing x satisfying Ax = b and (y, s) satisfying A*y + s = c, recall that the primal
instance is identical (up to an additive constant in the objective functional) to

where L is the nullspace of A. The dual instance is equivalent to

Maintaining our simplifying assumption that A is surjective, and hence A* is injective, y is
uniquely determined by s from A*y + s = c.
The primal central path consists of the points $x_\eta$ solving the optimization problems

whereas the dual central path consists of the points $s_\eta$ solving the problems

Necessary (and sufficient) conditions for $x_\eta$ to solve (3.29) are that

Hence, defining

by the latter condition we have

Moreover, since Theorem 3.3.1 shows $-g(x_\eta) \in (K^*)^\circ$, and since $K^*$ is a cone, we have
$\tilde{s}_\eta \in (K^*)^\circ$. Thus, $\tilde{s}_\eta$ is dual feasible.
Assuming logarithmic homogeneity, we claim $\tilde{s}_\eta = s_\eta$; that is, the vector $\tilde{s}_\eta$ arising
from the optimality conditions for $x_\eta$ being on the primal central path is itself on the dual
central path (moreover, with the same parameter value $\eta$). For under logarithmic homogeneity
we have

the last equality because $-g^*$ is the inverse of $-g$ by Theorem 3.3.4 (regardless of whether
logarithmic homogeneity holds). Hence, by the first of the necessary conditions (3.31),

Together with (3.32) we now have that $\tilde{s}_\eta$ satisfies the sufficient (and necessary) conditions
for $\tilde{s}_\eta$ to be the unique optimal solution of the strictly convex problem (3.30):

Consequently, $\tilde{s}_\eta = s_\eta$ as claimed.


Having proved $s_\eta = -\frac{1}{\eta} g(x_\eta)$, it is easy to establish $x_\eta = -\frac{1}{\eta} g^*(s_\eta)$, assuming logarithmic
homogeneity; indeed, from the fact that $-g^*$ and $-g$ are inverses,

Alternatively, just as we relied on the necessity of the conditions (3.31) to prove $s_\eta =
-\frac{1}{\eta} g(x_\eta)$ by establishing that $\tilde{s}_\eta := -\frac{1}{\eta} g(x_\eta)$ satisfies the sufficient conditions (3.33), we
could rely on the necessity of the conditions (3.33) to prove $x_\eta = -\frac{1}{\eta} g^*(s_\eta)$ by establishing

that $\tilde{x}_\eta := -\frac{1}{\eta} g^*(s_\eta)$ satisfies the sufficient conditions (3.31). The roles of $x_\eta$ and $s_\eta$ are
entirely symmetric.
An important consequence of the relations $s_\eta = -\frac{1}{\eta} g(x_\eta)$ and $x_\eta = -\frac{1}{\eta} g^*(s_\eta)$ is that
by following one of the central paths, one can generate the other as a by-product. The dual
instance can be solved by solving the primal instance and vice versa.
We claim that regardless of whether logarithmic homogeneity holds, the dual points
$\tilde{s}_\eta$ (primal points $\tilde{x}_\eta$) tend to optimality as $\eta \to \infty$. For we have already proven that $\tilde{s}_\eta$ is
dual feasible; that did not rely on logarithmic homogeneity. Moreover, letting $\tilde{y}_\eta$ denote the
vector satisfying $A^* \tilde{y}_\eta + \tilde{s}_\eta = c$, we have

the inequality by Theorem 2.3.3. Consequently, $\tilde{s}_\eta$ does indeed tend to optimality as $\eta \to \infty$,
and similarly for $\tilde{x}_\eta$.
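The way one central path generates the other can be seen on a tiny LP. The sketch below rests on stated assumptions: the instance is $\min \langle c, x \rangle$ subject to $\sum_j x_j = 1$, $x \ge 0$ (so A is a row of ones and b = 1), the barrier is $f(x) = -\sum_j \ln x_j$ with the dot product, and the primal central path point is taken to minimize $\eta \langle c, x \rangle + f(x)$ over the feasible set. The optimality condition $\eta c_j - 1/x_j = \lambda$ is solved for the multiplier $\lambda$ by bisection (a generic choice, not a method from the text). The dual point is then read off from $s_\eta = -\frac{1}{\eta} g(x_\eta)$.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def primal_central_point(c, eta, iters=200):
    """x_eta for  min <c, x>  s.t.  sum(x) = 1, x > 0,  with f(x) = -sum(log x_j).

    Solves  eta * c_j - 1 / x_j = lam  with  sum(x) = 1  by bisection on lam.
    """
    n = len(c)
    c_min = min(c)
    lo, hi = eta * c_min - n, eta * c_min - 1.0   # sum(x) <= 1 at lo, >= 1 at hi
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        if sum(1.0 / (eta * cj - lam) for cj in c) > 1.0:
            hi = lam
        else:
            lo = lam
    lam = 0.5 * (lo + hi)
    return [1.0 / (eta * cj - lam) for cj in c], lam

def dual_from_primal(x, eta):
    """s = -(1/eta) g(x), which is 1 / (eta * x_j) for the logarithmic barrier."""
    return [1.0 / (eta * xj) for xj in x]

c = [1.0, 2.0, 4.0]
eta = 10.0
x, lam = primal_central_point(c, eta)
s = dual_from_primal(x, eta)
y = lam / eta                     # then A^T y + s = c holds coordinate by coordinate
gap = dot(c, x) - y               # primal objective minus dual objective (b = 1)
```

For this barrier the duality gap equals $\langle x, s \rangle = n / \eta$, so both points tend to optimality as $\eta$ grows, in line with the bound obtained from Theorem 2.3.3.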

3.5 Self-Scaled (or Symmetric) Cones


3.5.1 Introduction
A self-scaled (or symmetric) cone is a cone K whose interior K° is the domain of a barrier
functional f with particularly strong properties. The properties allow for the development
of symmetric primal-dual ipm's, i.e., algorithms in which the roles of the primal and dual
instances are mirror images. To state the properties, we introduce some notation and a few
definitions.
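For $z \in K^\circ$, let $K^*_z$ denote the dual cone of K w.r.t. the intrinsic inner product $\langle \cdot , \cdot \rangle_z$,
that is,

$$K^*_z := \{ s : \langle x, s \rangle_z \ge 0 \text{ for all } x \in K \}.$$

We define K to be intrinsically self-dual if $K^*_z = K$ for all $z \in K^\circ$.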


Strictly speaking, the property of intrinsic self-duality is a property of f, not of K,
for $K^\circ$ may be the domain of a second barrier functional whose intrinsic inner products do
not satisfy $K^*_z = K$. However, in the relevant literature it is standard to accord the property
to K rather than to f, meaning that the structure of K is sufficiently rich to yield a barrier
functional with the property. Henceforth, when a cone K is assumed to be intrinsically
self-dual, the reader should take f to denote a barrier functional with the property that $K^*_z = K$
for all $z \in D_f = K^\circ$.
Let $f^*_z$ denote the conjugate of f w.r.t. $\langle \cdot , \cdot \rangle_z$:

If K is intrinsically self-dual, then f and $f^*_z$ have $K^\circ$ as their domain (by Theorem 3.3.1).
We define f to be intrinsically self-conjugate if f is logarithmically homogeneous,
if K is intrinsically self-dual, and if for each $z \in K^\circ$ there exists a constant $C_z$ such that

$f^*_z = f + C_z$, i.e., for all $s \in K^\circ$,

The theory of ipm's focuses on the gradients and Hessians of barrier functionals rather
than on values of the functionals. The relevance of intrinsic self-conjugateness is that for
each local inner product $\langle \cdot , \cdot \rangle_z$, the gradients $g_z$ and Hessians $H_z$ of f for the primal setting
in CP are exactly the same as the gradients $g^*_z$ and Hessians $H^*_z$ of $f^*_z$ for the dual setting.
A cone K is self-scaled (or symmetric) if $K^\circ$ is the domain of an intrinsically
self-conjugate barrier functional.
In the definition of intrinsic self-conjugateness, the constant $C_z$ is unrestricted. However,
it happens that the various conditions of the definition force the constant to equal a
specific value, as is shown by the following proposition.

Proposition 3.5.1. If $f : K^\circ \to \mathbb{R}$ is an intrinsically self-conjugate barrier functional,
then for all $z \in K^\circ$,

Proof. Theorem 3.3.4 shows that regardless of the inner product, the conjugate functional
satisfies

Thus, intrinsic self-conjugateness gives

In particular,

However, since f is logarithmically homogeneous, we know from Theorem 2.3.9 that
$\langle g(z), z \rangle = -\vartheta_f$ and $-g_z(z) = z$. Substitution completes the proof.
The following theorem provides a useful characterization of intrinsic self-conjugacy.

Theorem 3.5.2. Assume K is a cone and f is a logarithmically homogeneous barrier
functional with domain $K^\circ$. Then f is intrinsically self-conjugate iff for each $z \in K^\circ$, the
map $x \mapsto -g_z(x)$ is an involution, i.e., the range of the map (like the domain) is $K^\circ$ and
$-g_z(-g_z(x)) = x$ for all $x \in K^\circ$.

Proof. If f is intrinsically self-conjugate, then $g^*_z = g_z$ and $H^*_z = H_z$. Hence, Theorem 3.3.4
implies the map $x \mapsto -g_z(x)$ is an involution.
Conversely, assume for each $z \in K^\circ$ that $-g_z(-g_z(x)) = x$ for all $x \in K^\circ$. Since
Theorem 3.3.4 asserts $-g^*_z(-g_z(x)) = x$ for all $x \in K^\circ$, it follows that $g_z = g^*_z$. But then
f differs from $f^*_z$ by at most a constant, i.e., f is intrinsically self-conjugate.
The theorem provides an easy means to verify that the logarithmic barrier function
for the nonnegative orthant is intrinsically self-conjugate. For letting e denote the vector of

all ones (hence $\langle \cdot , \cdot \rangle_e$ is the dot product), we have

Consequently,
The same applies for the logarithmic barrier function for the cone of psd matrices.
Letting E denote the identity matrix (hence $\langle \cdot , \cdot \rangle_E$ is the trace product), we have

Hence,
Similarly, with somewhat more involved calculations one can use the theorem to
show that the logarithmic barrier functional for the second-order cone is intrinsically self-
conjugate.
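The involution property for the first two cones can be checked in a few lines. The sketch relies on two standard facts stated here as assumptions: for $f(x) = -\sum_j \ln x_j$ with the dot product, $-g(x)$ is the componentwise reciprocal of x; for $f(X) = -\ln \det X$ with the trace product, $-g(X) = X^{-1}$. The matrix case is restricted to symmetric 2x2 matrices so that the inverse can be written out by hand; the function names are hypothetical.

```python
def neg_grad_orthant(x):
    """-g(x) for f(x) = -sum(log x_j) w.r.t. the dot product: componentwise reciprocal."""
    return [1.0 / xj for xj in x]

def neg_grad_psd_2x2(X):
    """-g(X) = X^{-1} for f(X) = -log det X w.r.t. the trace product (symmetric pd 2x2 X)."""
    (a, b), (_, d) = X
    det = a * d - b * b
    return [[d / det, -b / det], [-b / det, a / det]]

# Applying the negative gradient map twice should return the starting point.
x = [0.5, 2.0, 7.0]
x_round_trip = neg_grad_orthant(neg_grad_orthant(x))

X = [[2.0, 1.0], [1.0, 3.0]]
X_round_trip = neg_grad_psd_2x2(neg_grad_psd_2x2(X))
```

Only the base points e and E are exercised here; the definition of intrinsic self-conjugacy requires the same property for every local inner product.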
The family of all self-scaled cones is limited. Güler [8] made this known to optimizers
(see also Faybusovich [5]). Güler realized that self-scaled cones are the same as symmetric
cones, objects that have been studied by analysts for decades (cf. [4]). There are five basic
symmetric cones, all others being Cartesian products of these:
• the cone of psd symmetric matrices,
• the second-order cone,
• the cone of psd Hermitian matrices,
• the cone of psd Hermitian quaternion matrices,
• a 27-dimensional exceptional cone.
Nesterov and Todd did not know of the equivalence between self-scaled cones and
symmetric cones when they wrote their foundational papers [16], [17]. Their principal
motivation was to generalize primal-dual methods beyond LP. Although the generalization
they achieved was perhaps not as great as we imagined (because we now know there are only
five basic self-scaled cones), they succeeded in uncovering the essence of the conic structure
needed by primal-dual methods. One can develop the theory of primal-dual methods for
each of the five cones individually, but developing it generally has the advantage of making
apparent the essential structure.

3.5.2 An Important Remark on Notation


Throughout the remainder of the book we fix $e \in K^\circ$ and let $\langle \cdot , \cdot \rangle = \langle \cdot , \cdot \rangle_e$; that is, we
suppress the subscript e. Likewise, we write g in place of $g_e$, H in place of $H_e$, $K^*$ in place
of $K^*_e$, etc. We do this mainly to avoid excessive notation. The choice of $e \in K^\circ$ is arbitrary,
i.e., any point in K° will do. Indeed, as the following development will make apparent, in
terms of each point's local inner product, the point's geometric view of the cone coincides
exactly with the geometric views of the other points in terms of their local inner products;
thus the name "symmetric cone."
Throughout the remainder of the book, whenever K is assumed to be self-scaled, the
reader should take f to denote an intrinsically self-conjugate barrier functional with domain

K°. When we refer specifically to the nonnegative orthant, the cone of psd matrices, or the
second-order cone, the underlying intrinsically self-conjugate barrier functional is assumed
to be the logarithmic barrier function.

3.5.3 Scaling Points


Assume K is self-scaled, fix an arbitrary point $e \in K^\circ$, and let $\langle \cdot , \cdot \rangle = \langle \cdot , \cdot \rangle_e$. For $z \in K^\circ$
we have

the second identity being a simple consequence of

Consequently, H(z) is a linear operator carrying K bijectively onto itself, i.e., H(z) is a
linear automorphism of K, a "scaling" of K.
The next theorem shows the set of linear automorphisms $\{H(z) : z \in K^\circ\}$ forms a
complete set of scalings in the sense that for each $x \in K^\circ$,

that is, for each point in K° there exists some—in fact, a unique—automorphism H(z)
carrying x to that point.

Theorem 3.5.3. If K is self-scaled, then H(z) is a linear automorphism of K for each
$z \in K^\circ$. Moreover, if $x, s \in K^\circ$, then there exists a unique $w \in K^\circ$ satisfying

$$H(w)x = s.$$

(Consequently, if $H(w_1) = H(w_2)$, then $w_1 = w_2$.)

When H(w)x = s, the point w is referred to as the scaling point (w.r.t. the local inner
product at e) for the ordered pair x, s. That it is unique proves to be very important in the
theory.
If w is the scaling point for the ordered pair x, s, then $-g(w)$ is the scaling point for
the pair s, x. In fact,

the first equality because H = H*, the second equality by Theorem 3.3.4.
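For the nonnegative orthant the scaling point can be written down explicitly, which makes both statements above easy to test. The sketch assumes the logarithmic barrier $f(x) = -\sum_j \ln x_j$ with e the vector of all ones, so that $H(w) = \mathrm{diag}(1 / w_j^2)$; solving $H(w)x = s$ coordinatewise then gives $w_j = \sqrt{x_j / s_j}$. The function names are hypothetical.

```python
import math

def hessian_apply(w, v):
    """H(w) v for f(x) = -sum(log x_j) with the dot product: H(w) = diag(1 / w_j^2)."""
    return [vj / (wj * wj) for wj, vj in zip(w, v)]

def scaling_point(x, s):
    """The w with H(w) x = s on the positive orthant: w_j = sqrt(x_j / s_j)."""
    return [math.sqrt(xj / sj) for xj, sj in zip(x, s)]

x = [1.0, 4.0, 0.25]
s = [4.0, 1.0, 1.0]
w = scaling_point(x, s)
neg_g_w = [1.0 / wj for wj in w]          # -g(w), the scaling point for the pair (s, x)

forward = hessian_apply(w, x)             # should equal s
backward = hessian_apply(neg_g_w, s)      # should equal x
```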
The primal-dual methods we present in sections 3.7 and 3.8 require that when given
$x, s \in K^\circ$, one can compute the scaling point w as well as g(w) and H(w). In this regard,
we note that if K is the nonnegative orthant and $e = (1, \ldots, 1)$, then w is the vector whose
jth component is $\sqrt{x_j / s_j}$. Similarly, if X and S are pd matrices, and if E is the identity
matrix, then

To get a sense of the role the scaling point plays in the theory of primal-dual methods,
recall our discussion in §3.1 regarding the dependence of the primal and dual instances on
the inner product. If in the initial inner product $\langle \cdot , \cdot \rangle = \langle \cdot , \cdot \rangle_e$ the primal instance is

and the dual instance is

then in terms of the local inner product $\langle \cdot , \cdot \rangle_w$ the primal instance is

and the dual instance is

where

Consequently, assuming w is the scaling point for a pair x,s, by rewriting the instances
in terms of ( , )„,, the pair is transformed to the pair x,sw := H(w)~ls which satisfies
x = sw, i.e., each point in the pair is transformed to the point of intersection for the rewritten
primal and dual feasible regions. The primal-dual methods we consider are invariant under
the transformation in the sense that the iterates they generate for the rewritten instances are
the transformations of the iterates generated for the original instances. For the rewritten
instances, the primal and dual search directions (the directions in which one moves from
the first and second points in the current pair $x, s_w$ to the first and second points in the
next pair) are respectively obtained by projecting the gradient (w.r.t. $\langle\ ,\ \rangle_w$) of a certain
functional onto the nullspace of A and onto the range space of $A^*_w$. As these spaces are
orthogonal complements (w.r.t. $\langle\ ,\ \rangle_w$), the sum of the search directions is the gradient
itself. As we shall see, this makes for a particularly clean and symmetric analysis of the
methods. However, before we can develop primal-dual methods, we need to know more
about self-scaled cones.
Toward proving Theorem 3.5.3 we have the following proposition. We suggest that
the reader skip arguments marked "Proof" on the first pass through this subsection.

Proposition 3.5.4. Assume f is a barrier functional with $D_f = K^\circ$, the interior of a (not
necessarily self-scaled) cone. Let $K^*$ denote the dual cone of K w.r.t. an arbitrary inner
product $\langle\ ,\ \rangle$. If $x \in K^\circ$ and $s \in (K^*)^\circ$, then there exists $w \in K^\circ$ such that $H(w)x = s$.

Proof. Consider the functional

$$\psi(w) := \langle s, w\rangle - \langle x, g(w)\rangle$$

with domain $K^\circ$. To prove the proposition, it suffices to show $\psi$ has a minimizer, for at a
minimizer the gradient is 0, that is, $-H(w)x + s = 0$. To prove $\psi$ has a minimizer, it suffices
to show that if $\{w_i\} \subset K^\circ$ is a sequence satisfying either $\|w_i\| \to \infty$ or $w_i \to w \in \partial K$,
then $\psi(w_i) \to \infty$.
Assume $\|w_i\| \to \infty$. Since $s \in (K^*)^\circ$ and $\{w_i\} \subset K$ it follows that $\langle s, w_i\rangle \to \infty$.
Observing that $\langle x, -g(w_i)\rangle > 0$ because $-g(w_i) \in K^*$, we conclude $\psi(w_i) \to \infty$.
Now assume $w_i \to w \in \partial K$. Theorem 2.2.9 shows $\|g(w_i)\| \to \infty$. Since $-g(w_i) \in K^*$
and $x \in K^\circ$ we thus have $\langle x, -g(w_i)\rangle \to \infty$. Observing that $\langle s, w_i\rangle > 0$ since $s \in K^*$
and $w_i \in K$, we conclude $\psi(w_i) \to \infty$.
Applying Proposition 3.5.4 to a self-scaled cone K with a local inner product $\langle\ ,\ \rangle = \langle\ ,\ \rangle_e$,
we have for each ordered pair $x, s \in K^\circ$ the existence of a scaling point w for the pair.
In light of the discussion just prior to the statement of Theorem 3.5.3, to complete the proof
of that theorem it remains only to prove the uniqueness of the scaling point. The uniqueness
is crucial for the subsequent theory. To prove uniqueness, we rely on the following theorem
which is important in itself. We freely invoke its identities throughout the remainder of the
chapter; thus, the reader should keep them in mind.

Theorem 3.5.5. Assume K is self-scaled. If $x, w \in K^\circ$, then

and

Proof. Letting C := Ce, where Ce is as in the definition of an intrinsically self-conjugate


functional, we have

Using the values for C and Cw given by Proposition 3.5.1 establishes (3.34).
To prove (3.35), one can simply differentiate both sides of (3.34) w.r.t. x, making use of
the chain rule. Alternatively, we can observe that since $-g$ is an involution (Theorem 3.5.2),

Thus,

that is,

Consequently, since $-g_w$ is an involution,

establishing (3.35).
To establish (3.36), one can differentiate both sides of (3.35) w.r.t. x using the chain
rule. Alternatively, relying on (3.35) and $-g$ being an involution, observe

The proof is complete.
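The displays (3.35) and (3.36) are not reproduced above. From the way they are invoked later in the section, they appear to be $g(H(w)x) = H(w)^{-1}g(x)$ and $H(H(w)x) = H(w)^{-1}H(x)H(w)^{-1}$; this reading is an assumption. The sketch below checks both identities numerically in the nonnegative orthant with $f(x) = -\sum_j \ln x_j$, where every Hessian is diagonal and can be stored as a list (the helper names are illustrative).

```python
import math

def grad(x):
    """g(x) = -1/x for f(x) = -sum(log x_j)."""
    return [-1.0 / xj for xj in x]

def hess_diag(x):
    """Diagonal of H(x) = diag(1 / x_j^2)."""
    return [1.0 / (xj * xj) for xj in x]

x = [1.0, 4.0, 0.5]
w = [2.0, 0.5, 3.0]

Hw = hess_diag(w)
y = [h * xj for h, xj in zip(Hw, x)]                  # y = H(w) x

# Assumed form of (3.35): g(H(w) x) = H(w)^{-1} g(x)
lhs_grad = grad(y)
rhs_grad = [gj / h for gj, h in zip(grad(x), Hw)]

# Assumed form of (3.36): H(H(w) x) = H(w)^{-1} H(x) H(w)^{-1}
lhs_hess = hess_diag(y)
rhs_hess = [hx / (h * h) for hx, h in zip(hess_diag(x), Hw)]
```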


Proof of Theorem 3.5.3. As observed just prior to the statement of Theorem 3.5.5, it remains
only to prove uniqueness of the scaling point for the ordered pair x, s.
Assume $H(w_1)x = s$ and $H(w_2)x = s$. By (3.36),

and so

that is,

Since the square root of a pd linear operator is unique, it follows that


and hence

Thus,

In particular, relying on the logarithmic homogeneity of $f$,

Of course logarithmic homogeneity also gives

Consider the functional

With respect to $\langle\ ,\ \rangle_{w_1}$, the gradient at w is $w + g_{w_1}(w)$. Since $w_1 = -g_{w_1}(w_1)$ and
$w_2 = -g_{w_1}(w_2)$, the gradient is 0 at both $w_1$ and $w_2$. As the functional is strictly convex
and hence has at most one minimizer, $w_1 = w_2$.

3.5.4 Gradients and Norms


We continue to fix an arbitrary point $e \in K^\circ$ and let $\langle\ ,\ \rangle = \langle\ ,\ \rangle_e$.
The next theorem is of paramount importance. It concerns the error of the linear map
$x \mapsto x - e$ in approximating the gradient map $x \mapsto g(x) - g(e)$. In terms of the norm
$\|\ \| = \|\ \|_e$, we know from Proposition 2.2.10 that for self-concordant functionals, the
error is small when $x \approx e$. The next theorem gives insight into the error for all x, not just
for x near e. Its proof depends heavily on the uniqueness given by Theorem 3.5.3.
Recall that logarithmic homogeneity implies $g(e) = -e$.

Theorem 3.5.6. Assume K is self-scaled. If $x \in K^\circ$, then

that is,

Proof. Since $K^* = K$, it suffices to show for each $v \in K^\circ$ that

which is the same as

Hence, it suffices to show e minimizes the functional $\psi$.

Toward proving e is indeed the minimizer, we first note $\psi$ has a minimizer, as is
immediate from two facts: if $\{x_i\} \subset K^\circ$ satisfies $\|x_i\| \to \infty$, then $\psi(x_i) \to \infty$, and if
the sequence satisfies $x_i \to x \in \partial K$, then $\psi(x_i) \to \infty$. The first fact follows from the
relations $v \in K^\circ = (K^*)^\circ$, $-g(x_i) \in K^* = K$, whereas the second fact follows from the
same relations along with $\|g(x_i)\| \to \infty$ if $x_i \to x \in \partial K$, a consequence of Theorem 2.2.9.
Hence, $\psi$ has a minimizer $x'$.
At the minimizer $x'$, the gradient is 0, that is,

Since H(e)v = Iv = v, the uniqueness given by Theorem 3.5.3 thus implies x' = e.
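The statement of Theorem 3.5.6 is not reproduced above; the proof indicates that it asserts $(x - e) - (g(x) - g(e)) \in K$, and the sketch below assumes that reading. In the nonnegative orthant with $f(x) = -\sum_j \ln x_j$ and $e = (1, \ldots, 1)$, the claim reduces to $x_j + 1/x_j - 2 \ge 0$ in every component, which the code evaluates at sample points far from e (the function name `gradient_error` is illustrative).

```python
def gradient_error(x):
    """Componentwise (x - e) - (g(x) - g(e)) for f = -sum(log x_j), e = ones.

    Here g(x) = -1/x and g(e) = -e, so each entry equals x_j + 1/x_j - 2.
    """
    return [(xj - 1.0) - (-1.0 / xj + 1.0) for xj in x]

samples = [
    [0.1, 1.0, 7.0],     # far from e in both directions
    [2.5, 0.3, 1.2],
]
errors = [gradient_error(x) for x in samples]

# Membership in the cone K = nonnegative orthant means every entry is >= 0.
in_cone = [all(err >= 0.0 for err in row) for row in errors]
```

The error vanishes only in components where $x_j = 1$, consistent with e being the unique minimizer in the proof.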
By definition, the local norms induced by a self-concordant functional $f$ change
gradually; if $\|x - e\| < 1$, then for all $v \neq 0$,

$$1 - \|x - e\| \le \frac{\|v\|_x}{\|v\|} \le \frac{1}{1 - \|x - e\|}.$$

Stronger bounds hold if $f$ is intrinsically self-conjugate, as we now describe.


Let B denote the largest set which is centrally symmetric about 0 and which satisfies
$e + B \subseteq K^\circ$, that is,

$$B = \{v : e + v \in K^\circ \text{ and } e - v \in K^\circ\}.$$

Define a norm $|\ |$ on $\mathbb{R}^n$ by

$$|v| := \inf\{t \ge 0 : v \in tB\}$$

(the Minkowski gauge function for B). For example, if $K = \mathbb{R}^n_+$ and $e = (1, \ldots, 1)$, then
$|\ |$ is the $\ell_\infty$-norm, whereas $\|\ \|$ is the $\ell_2$-norm.
Since $B(e, 1) := \{x : \|x - e\| < 1\}$ is a set which is centrally symmetric about e and
satisfies $B(e, 1) \subseteq K^\circ$, we have $B(e, 1) \subseteq e + B$. Stated differently,

$$|v| \le \|v\| \quad \text{for all } v.$$

Theorem 3.5.7. Assume K is self-scaled. If x satisfies $|x - e| < 1$, then for all v,

and

The bounds (3.38) imply that the Hessian of $f$ varies gradually. For example, just as
we established the bounds in Theorem 2.2.1 using the bounds in the definition of
self-concordance, from the bounds (3.38) we obtain that for all x satisfying $|x - e| < 1$,

This has implications for Newton's method when applied to functionals obtained by adding
a linear term to $f$, like those for the barrier method (§2.4.1) and many other ipm's. To see
why, assume z minimizes such a functional. Repeating the proof of Theorem 2.2.3, now
making use of (3.40), we find that whenever $|z - e| < 1$, then one step of Newton's method
beginning at e gives a point $e_+$ satisfying

Compare this with the bound of Theorem 2.2.3:

Unfortunately, although the stronger bound (3.41) indicates that ipm's like the barrier method
might perform better when the underlying barrier functional is intrinsically self-conjugate,
the stronger bound does not appear to be sufficient to prove complexity bounds which are
better than what we have already proven, i.e., (2.18). Still, the bound is considered to be
important conceptually and so we have discussed it.
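As an illustration of how the norm $|\ |$ can give information where $\|\ \|$ gives none, the sketch below works in the nonnegative orthant with $f(x) = -\sum_j \ln x_j$ and $e = (1, \ldots, 1)$, so that $|\ |$ is the $\ell_\infty$-norm and $\|\ \|$ the $\ell_2$-norm. It assumes (3.38) has the form $\|v\|/(1 + |x - e|) \le \|v\|_x \le \|v\|/(1 - |x - e|)$; the display is not reproduced above, so this form is our reading rather than a quotation. The chosen x satisfies $|x - e| < 1$ although $\|x - e\| > 1$, so the bounds from the definition of self-concordance do not apply to it.

```python
import math

def local_norm(x, v):
    """||v||_x = sqrt(sum(v_j^2 / x_j^2)) for f = -sum(log x_j)."""
    return math.sqrt(sum((vj / xj) ** 2 for xj, vj in zip(x, v)))

x = [1.6, 0.4, 1.6, 0.4]
v = [1.0, -2.0, 0.5, 3.0]

dist_l2 = math.sqrt(sum((xj - 1.0) ** 2 for xj in x))    # ||x - e||
dist_linf = max(abs(xj - 1.0) for xj in x)                # |x - e|

norm_e = math.sqrt(sum(vj * vj for vj in v))              # ||v|| = ||v||_e
norm_x = local_norm(x, v)                                 # ||v||_x

lower = norm_e / (1.0 + dist_linf)    # assumed leftmost bound in (3.38)
upper = norm_e / (1.0 - dist_linf)    # assumed rightmost bound in (3.38)
```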
The proof of Theorem 3.5.7 depends on the following theorem, which is of interest in
itself and which plays a major role in the proof of an important theorem in §3.5.5 (a theorem
which is central in our analysis of primal-dual methods).

Theorem 3.5.8. Assume K is self-scaled. If x satisfies $x - e \in K$, then



Proof. We first establish the less restrictive inequality

For observe that $x - e \in K$ and $B(e, 1) \subseteq K$ (by self-concordance) together imply $B(x, 1) \subseteq K$
(keeping in mind that $B(x, 1) := B_e(x, 1)$). Hence, by symmetry and the fact that for
each vector v, either $\langle g(x), v\rangle \ge 0$ or $\langle g(x), -v\rangle \ge 0$, Theorem 2.3.4 implies

giving (3.42).
Since

the inequality of the theorem is equivalent to asserting $\|H(x)\| \le 1$.


Let $x_0 := x$ and $x_1 := H(x_0)^{-1}e$. Since logarithmic homogeneity implies $g(e) = -e$,
we have $x_1 = -g_{x_0}(e)$. Hence, Theorem 3.5.6 applied with $x_0$ in place of e and e in place
of x implies

Since $x_0 - e \in K$ it follows that $x_1 - x_0 \in K$ and hence $x_1 - e \in K$. Moreover, by (3.36)
and $H(x_0)^{-1} = H(-g(x_0))$,

Proceeding inductively we find $x_k := H(x_{k-1})^{-1}e$ satisfies $x_k - e \in K$ and

Thus, $\|H(x_k)\| = \|H(x)\|^{2^k}$. However, applying (3.42) with $x_k$ in place of x, we have
$\|H(x_k)\| \le (4\vartheta_f + 1)^2$. Consequently, $\|H(x)\| \le 1$.

Proof of Theorem 3.5.7. Let

Since $|x_2 - e| = 1$ we have $x_2 \in K$. Observe that

Hence, $e - x_1 \in K$. Consequently, applying Theorem 3.5.8 with $x_1$ in place of e and e in
place of x,

Since logarithmic homogeneity of $f$ makes the Hessian homogeneous of degree $-2$, we
thus conclude by definition of $x_1$ that

giving the leftmost inequality in (3.38).


Similarly, to establish the rightmost inequality in (3.38), define

Note $x_2' \in K$ and $x = x_1' + |x - e|\,x_2'$. Hence, $x - x_1' \in K$. Theorem 3.5.8 thus implies for
all v,

Logarithmic homogeneity gives

establishing the desired inequality.


To prove (3.39) one uses (3.38) and the relation $H(-g(x)) = H(x)^{-1}$ (so the eigenvalues
of $H(-g(x))$ are the reciprocals of the eigenvalues of $H(x)$).
Recall Theorem 2.3.4 which asserts that if $f$ is a barrier functional and if $x, y \in D_f$
satisfy $\langle g(x), y - x\rangle \ge 0$, then $y \in B_x(x, 4\vartheta_f + 1)$. Among other things, this implies that the
set $B_x(x, 1)$ is, to within a factor $4\vartheta_f + 1$, the largest among all ellipsoids centered at x which
are contained in $D_f$. A stronger statement holds when $f$ is intrinsically self-conjugate.

Theorem 3.5.9. Assume K is self-scaled. If $x \in K^\circ$ satisfies $\langle g(e), x - e\rangle \ge 0$, then
$x \in B(e, \vartheta_f)$.

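Proof. Assume $x \in K^\circ$ satisfies $\langle g(e), x - e\rangle \ge 0$. Since $g(e) = -e$, we are thus assuming
$\langle x, e\rangle \le \|e\|^2$, an upper bound on $\langle x, e\rangle$. A strict lower bound of 0 is immediate from
$e, x \in K^\circ = (K^*)^\circ$. In all,

$$0 < \langle x, e\rangle \le \|e\|^2.$$

Thus, the point

$$\bar{x} := \frac{\|e\|^2}{\langle x, e\rangle}\,x$$

is a scalar multiple of x where the scalar is no less than 1. Stated differently, x is a convex
combination of 0 and $\bar{x}$. Hence,

$$\|x - e\| \le \max\{\|e\|, \|\bar{x} - e\|\} = \max\{\sqrt{\vartheta_f}, \|\bar{x} - e\|\}, \tag{3.43}$$

the equality by logarithmic homogeneity (Theorem 2.3.9).

Let

$$v := \frac{\bar{x} - e}{\|\bar{x} - e\|}.$$

Since $\|v\| = 1$ we have $e - v \in K$. Since $K^* = K$, we thus see $\langle \bar{x}, e - v\rangle \ge 0$, that is,

$$\langle \bar{x}, v\rangle \le \langle \bar{x}, e\rangle = \|e\|^2.$$

Hence, using $\|e\|^2 = \vartheta_f$, $\langle v, e\rangle = 0$, and $\|v\| = 1$, we have $\|\bar{x} - e\| \le \vartheta_f$. Consequently,
by (3.43) we find $\|x - e\| \le \vartheta_f$. By openness, the inequality is seen to be strict, that is,
$x \in B(e, \vartheta_f)$.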
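For a concrete instance of Theorem 3.5.9, take the nonnegative orthant in $\mathbb{R}^n$ with $f(x) = -\sum_j \ln x_j$ and $e = (1, \ldots, 1)$, for which $\vartheta_f = n$ and $\langle g(e), x - e\rangle = n - \sum_j x_j$. The theorem then says that positive vectors whose entries sum to at most n lie within Euclidean distance n of e. The sketch below (function names are illustrative) checks the hypothesis and the conclusion for one such point.

```python
import math

def hypothesis_value(x):
    """<g(e), x - e> with g(e) = -e, i.e. n - sum(x_j)."""
    return len(x) - sum(x)

def distance_to_e(x):
    """||x - e|| in the local norm at e, here the Euclidean norm."""
    return math.sqrt(sum((xj - 1.0) ** 2 for xj in x))

x = [2.5, 0.25, 0.25]          # positive entries summing to n = 3
theta = len(x)                 # complexity value of the barrier: n

hypothesis_holds = hypothesis_value(x) >= 0.0
inside_ball = distance_to_e(x) < theta
```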

Theorem 3.5.10. Assume K is self-scaled and $x \in K^\circ$. For each $t \in \mathbb{R}$ there exists a unique
$x_t \in K^\circ$ such that

$$H(x_t) = H(x)^t.$$

Proof. That there is at most one point $x_t$ with the desired property is immediate from the
uniqueness in Theorem 3.5.3.
Define

$$T := \{t \in \mathbb{R} : H(x_t) = H(x)^t \text{ for some } x_t \in K^\circ\}.$$

It suffices to prove that T is a closed and dense subset of $\mathbb{R}$.

For proving that T is closed, assume t to be in the closure of T and let $\{t_i\} \subset T$ be
a sequence satisfying $t_i \to t$. We claim the sequence $\{x_{t_i}\}$ has a subsequence converging
to a point $x' \in K^\circ$. Thus, by continuity of the Hessian, $H(x') = H(x)^t$. Hence $t \in T$ and
$x_t := x'$, establishing closedness.
To show $\{x_{t_i}\}$ has a subsequence converging to a point $x' \in K^\circ$, it suffices to show
the sequence is bounded and does not have any subsequence converging to a point in the
boundary of K. If the sequence (similarly, a subsequence) satisfied $\|x_{t_i}\| \to \infty$, then using
$H(x_{t_i}) \to H(x)^t$ we could conclude

$$\|x_{t_i}\|_{x_{t_i}} \to \infty,$$

contradicting $\|x_{t_i}\|^2_{x_{t_i}} = \vartheta_f$ (Theorem 2.3.9). If the sequence (similarly, a subsequence)
converged to a point in the boundary of K, then from the containments $B_{x_{t_i}}(x_{t_i}, 1) \subseteq K^\circ$ we
could conclude $\|H(x_{t_i})\| \to \infty$, contradicting $H(x_{t_i}) \to H(x)^t$. Hence, $\{x_{t_i}\}$ does indeed
have a subsequence converging to a point $x' \in K^\circ$, completing the proof that T is closed.
To prove that T is dense in $\mathbb{R}$, we show T contains all quantities that can be obtained
by adding finitely many numbers each of which is an integral power of 2 and contains the
negatives of those quantities. Noting $0, 1 \in T$ since $H(e) = I = H(x)^0$ and $H(x) = H(x)^1$,
for this it suffices to establish the following: (i) if $t \in T$, then $-t, 2t \in T$ and (ii) if
$t_1, t_2 \in T$, then $\frac{1}{2}(t_1 + t_2) \in T$.
Assume $t \in T$. Since $H(-g(x_t)) = H(x_t)^{-1} = H(x)^{-t}$ we have $-t \in T$. Defining
$x_{2t} := H(-g(x_t))e = H(x_t)^{-1}e$, (3.36) implies $H(x_{2t}) = H(x_t)^2 = H(x)^{2t}$. Hence,
$2t \in T$. We have proved (i).
In proving (ii), we rely on the fact that for each inner product $\langle\ ,\ \rangle$ and for each pd linear
operator H, if $t_1, t_2 \in \mathbb{R}$, then $H^{t_1}$ is pd w.r.t. the inner product defined by

$$(u, v) \mapsto \langle u, H^{t_2} v\rangle.$$

We leave the proof of this fact to the reader.
Assume $t_1, t_2 \in T$. We know $-t_2 \in T$. Let w be the scaling point for the pair $x_{t_1}, x_{-t_2}$,
that is, $H(w)x_{t_1} = x_{-t_2}$. Then (3.36) implies

that is,

By the fact mentioned above, the operators $H(x)^{t_1 - t_2}$ and $H(x)^{\frac{1}{2}(t_1 - t_2)}$ are pd w.r.t. $\langle\ ,\ \rangle_{x_{t_1}}$.
Since the square root of a pd operator is unique, (3.44) thus implies

Consequently,

so that

3.5.5 A Useful Theorem


We continue to fix an arbitrary point $e \in K^\circ$ and let $\langle\ ,\ \rangle = \langle\ ,\ \rangle_e$.
We close this section by proving a theorem which plays a central role in the analysis
of the primal-dual methods we consider.

Theorem 3.5.11. Assume K is a self-scaled cone. If $x \in K^\circ$, then

The following theorem and proposition are used to prove Theorem 3.5.11.

Theorem 3.5.12. Assume K is a self-scaled cone. If $x \in K^\circ$ and

then

Proof. We may assume $\lambda > 1$.

Consider the case that $\lambda = \|H(x)\|$, i.e., $\lambda$ is the largest eigenvalue of $H(x)$. For
$0 < \epsilon < \lambda$ let

Since

the largest eigenvalue of $H_{z_\epsilon}(x)$ is $(\lambda/(\lambda - \epsilon))^2$; in particular, if $0 < \epsilon < \lambda$, the largest
eigenvalue is greater than 1. Hence, if $0 < \epsilon < \lambda$, Theorem 3.5.8 (applied with $z_\epsilon$ in place
of e) shows

that is,

Equivalently (assuming $\epsilon \neq \lambda - 1$),

Since K is a cone, if $0 < \epsilon < \lambda - 1$ we thus have

Consequently, since $\epsilon$ can be arbitrarily near 0 and since $B_x(x, 1) \subseteq K^\circ$ (by self-concordance),

implying the desired inequality

In the case that $\lambda = \|H(x)^{-1}\|$ $(= \|H(-g(x))\|)$, one simply interchanges the roles of x
and $-g(x)$.

Proposition 3.5.13. Assume f is a logarithmically homogeneous barrier functional with
domain $K^\circ$. If $x \in K^\circ$, then

Proof. Consider the self-concordant functional

where $\|\ \|$ is the local norm at e given by $f$. Note the following relation between the local
inner product at e given by $f'$ and the one given by $f$: $\langle\ ,\ \rangle' = 2\langle\ ,\ \rangle$. Letting $g'$ denote
the gradient of $f'$ w.r.t. $\langle\ ,\ \rangle'$, it follows that

Since $f$ is logarithmically homogeneous, $g(e) = -e$ and hence $g'(e) = 0$. Applying


Proposition 2.2.10 to /', we thus find that for each value r < £, the image of the closed ball
B'(e, r + (1^r)3) under the map ;c h-> gf(x) contains the closed ball B'(0, r). In particular,
if r < -£j= (then -^—-^ < 1), the image of B'(e, 2r) contains B'(Q, r). Since the map
$x \mapsto g'(x)$ is injective (as are the gradient maps for all strictly convex functionals), it
follows that for all $x \in K^\circ$,

i.e., it follows that



proving the proposition.


Proof of Theorem 3.5.11. Theorem 3.5.12 and Proposition 3.5.13 give

where

Note that $\delta \mapsto (\delta - 1)/\delta$ is an increasing function of $\delta$ when $\delta > 0$, whereas for fixed
$\gamma > 0$, the function $\delta \mapsto \gamma/\sqrt{\delta}$ is decreasing. Consequently, the function

is minimized when $\delta$ satisfies

that is, when

Thus, by (3.45),

completing the proof.

3.6 The Nesterov-Todd Directions


In moving from a primal and dual feasible pair x, (y, s) to the next feasible pair $x_+, (y_+, s_+)$,
primal-dual ipm's rely on both x and (y, s) in determining the primal direction $x_+ - x$ and
in determining the dual direction $(y_+ - y, s_+ - s)$. Here we study what are perhaps the
most prevalent directions appearing in the design of primal-dual ipm's.
We continue to assume K is a self-scaled cone. We fix an arbitrary point $e \in K^\circ$ and
let $\langle\ ,\ \rangle = \langle\ ,\ \rangle_e$; that is, we continue to suppress the subscript e.
The primal instance is

and the dual instance is

More geometrically, we know from §3.1 that fixing x satisfying Ax = b and (y, s) satisfying
A*y + s = c, and letting L denote the nullspace of A, the primal and dual instances can be
expressed as

and

The primal central path is the set

where $x_\eta$ solves

that is, solves

The dual central path is the set

where $(y_\eta, s_\eta)$ solves

Thus, $s_\eta$ solves

To highlight symmetry between the primal and dual instances, we often hide $y_\eta$, referring to
$\{s_\eta : \eta > 0\}$ as the dual central path, keeping in mind that $y_\eta$ is uniquely determined by $s_\eta$
through the equations $A^* y_\eta + s_\eta = c$ (assuming A is surjective and hence $A^*$ is injective).
Assume x approximates $x_\eta$ and (y, s) approximates $(y_\eta, s_\eta)$. One strategy for obtaining
a better approximation to $x_\eta$ is to move from x in the steepest descent direction for the
linearly constrained optimization problem (3.46), i.e., in the direction opposite the projected
gradient. This direction w.r.t. the inner product $\langle\ ,\ \rangle_x$ is the direction of the Newton step,
the step taken by the barrier method. If, instead, one relies on $\langle\ ,\ \rangle_w$ where w is the scaling
point for the ordered pair x, s, one arrives at the primal Nesterov-Todd direction:

$$d_x := -P_{L,w}(\eta c_w + g_w(x)),$$

where $c_w := H(w)^{-1}c$ and $P_{L,w}$ denotes orthogonal projection onto L, orthogonal w.r.t.
$\langle\ ,\ \rangle_w$. Since $c_w = A^*_w y + H(w)^{-1}s$ and $P_{L,w}A^*_w = 0$ (where $A^*_w = H(w)^{-1}A^*$ is the
adjoint of A w.r.t. $\langle\ ,\ \rangle_w$), the direction can also be expressed as

$$d_x = -P_{L,w}(\eta x + g_w(x)).$$

The dual Nesterov-Todd direction is

$$d_s := -P_{L^\perp,w^*}(\eta s + g_{w^*}(s)),$$

where $w^* = -g(w)$ is the scaling point for the ordered pair s, x (and where $L^\perp$ is the
orthogonal complement of L w.r.t. the original inner product $\langle\ ,\ \rangle$).
One gets a primal Nesterov-Todd direction and a dual Nesterov-Todd direction for
each triple $\eta$, x, s, where x is primal feasible and s is dual feasible.
The Nesterov-Todd directions appeared in primal-dual ipm's for LP long before Nesterov
and Todd developed them generally (cf. Megiddo [13], Kojima, Mizuno, and Yoshise
[11], and Monteiro and Adler [14]). Indeed, the goal of Nesterov and Todd was to generalize
earlier work for LP.
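Theoretical and practical insight into the Nesterov-Todd directions is gained by considering
$d_x$ together with $H(w)^{-1}d_s$ (or, symmetrically, $H(w^*)^{-1}d_x$ together with $d_s$).
Note that $H(w)^{-1}d_s \in L^{\perp_w}$, where $L^{\perp_w} = H(w)^{-1}L^\perp$ is the orthogonal complement
of L w.r.t. $\langle\ ,\ \rangle_w$. We claim that

$$H(w)^{-1}d_s = -P_{L^{\perp_w},w}(\eta x + g_w(x)) \tag{3.47}$$

and hence

$$d_x + H(w)^{-1}d_s$$

gives an $\langle\ ,\ \rangle_w$-orthogonal decomposition of $-(\eta x + g_w(x))$.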


To establish (3.47) it suffices to prove the operator identity

$$P_{L^\perp,w^*} = H(w)\,P_{L^{\perp_w},w}\,H(w)^{-1} \tag{3.48}$$
for then

To prove the operator identity (3.48) it suffices to show



and

However, (3.49) is immediate from the identity $L^{\perp_w} = H(w)^{-1}L^\perp$, whereas (3.50) is
immediate from

Hence we have established (3.47), with the consequence that

does indeed give an $\langle\ ,\ \rangle_w$-orthogonal decomposition of


We reintroduce y, keeping in mind that y is uniquely determined by s and, likewise,
the direction $d_y$ is uniquely determined by $d_s$. We now see $d_x$, $d_s$, and $d_y$ satisfy

that is, they satisfy

In the literature, these equations are often taken as the starting point for discussions of the
Nesterov-Todd directions, i.e., they are taken as defining the directions. The equations can
be used to compute the directions in practice.
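To illustrate how such equations can be solved, the sketch below treats linear programming with a single equality constraint $\langle a, x\rangle = b$, so that the normal equation is a scalar equation. It assumes the system has the form $H(w)d_x + d_s = -(\eta s + g(x))$, $A d_x = 0$, $A^* d_y + d_s = 0$; the displays are not reproduced above, so this form is our reading. It further assumes $f(x) = -\sum_j \ln x_j$, for which $H(w) = \mathrm{diag}(s_j/x_j)$ and $g(x) = -1/x$. The function name `nt_directions` and the data are hypothetical, and feasibility of x and s is not checked.

```python
def nt_directions(a, x, s, eta):
    """Nesterov-Todd directions for an LP with one constraint <a, x> = b.

    Assumed system (nonnegative orthant, f(x) = -sum(log x_j)):
        H(w) dx + ds = -(eta * s + g(x)),   <a, dx> = 0,   a * dy + ds = 0,
    with H(w) = diag(s_j / x_j) and g(x) = -1/x.
    """
    d = [xj / sj for xj, sj in zip(x, s)]                # diagonal of H(w)^{-1}
    r = [1.0 / xj - eta * sj for xj, sj in zip(x, s)]    # -(eta * s + g(x))
    # Substitute ds = -a * dy and dx = H(w)^{-1} (r + a * dy) into <a, dx> = 0.
    numerator = sum(aj * dj * rj for aj, dj, rj in zip(a, d, r))
    denominator = sum(aj * aj * dj for aj, dj in zip(a, d))
    dy = -numerator / denominator
    ds = [-aj * dy for aj in a]
    dx = [dj * (rj - dsj) for dj, rj, dsj in zip(d, r, ds)]
    return dx, dy, ds

a = [1.0, 1.0, 1.0]
x = [0.2, 0.3, 0.5]
s = [1.5, 1.0, 0.6]
eta = 2.0

dx, dy, ds = nt_directions(a, x, s, eta)

# <dx, H(w)^{-1} ds>_w equals the ordinary dot product <dx, ds>.
w_inner_product = sum(dxj * dsj for dxj, dsj in zip(dx, ds))
```

With several constraints the scalar division becomes a solve with the matrix $A H(w)^{-1} A^*$, which is positive definite when A is surjective.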
It should be mentioned that $d_x$ and $H(w)^{-1}d_s$ are the Nesterov-Todd directions if one
replaces $\langle\ ,\ \rangle$ with $\langle\ ,\ \rangle_w$, rewriting the instances in terms of this inner product. Moreover,
one can show quite generally that the Nesterov-Todd directions are essentially invariant
under changes in the inner product. For example, if one replaces $\langle\ ,\ \rangle$ with $\langle\ ,\ \rangle_z$ where
$z \in K^\circ$, then $d_x$ and $H(z)^{-1}d_s$ are the Nesterov-Todd directions for the rewritten instances

The simple geometry given by viewing the summation $d_x + H(w)^{-1}d_s$ as an $\langle\ ,\ \rangle_w$-orthogonal
decomposition of $-(\eta x + g_w(x))$ provides analyses of primal-dual ipm's which
are more transparent than arguments phrased only in terms of the original inner product
$\langle\ ,\ \rangle$ and the original instances. However, other inner products and rewritings of the
instances reveal the same simple geometry, for example, the approach which is symmetric
to the one we employ. The symmetric approach relies on $\langle\ ,\ \rangle_{w^*}$. In this approach, $L^\perp$
is fixed, L is replaced by $(L^\perp)^{\perp_{w^*}} = H(w^*)^{-1}L$, and the sum $H(w^*)^{-1}d_x + d_s$ gives an
$\langle\ ,\ \rangle_{w^*}$-orthogonal decomposition of $-(\eta s + g_{w^*}(s))$.
One can even use the original inner product $\langle\ ,\ \rangle$ to reveal the same simple geometry
if one transforms the instances appropriately. Specifically, transforming the primal feasible
region by $H(w)^{1/2} = H(w^*)^{-1/2}$ and the dual feasible region by

$$H(w^*)^{1/2} = H(w)^{-1/2},$$

we see that the original primal and dual instances are, respectively, equivalent to the primal
instance

and its dual instance

Both of the original points x, s are transformed to the point

$$z := H(w)^{1/2}x = H(w^*)^{1/2}s,$$

and $H(w)^{1/2}d_x + H(w^*)^{1/2}d_s$ gives an $\langle\ ,\ \rangle$-orthogonal decomposition of $-(\eta z + g(z))$.
This last approach (relying on the original inner product $\langle\ ,\ \rangle$ but transforming the primal
and dual instances) is perhaps more consistent with the literature than the approach we
employ which uses the local inner product $\langle\ ,\ \rangle_w$, but it has the drawback of messier
notation and makes the appearance of $w, w^*$ through $H(w)^{1/2}, H(w^*)^{1/2}$ unnecessarily
mysterious. Consequently, we have chosen to rely on $\langle\ ,\ \rangle_w$, but we emphasize that the
various approaches are essentially equivalent.
In closing this section, we remark that in the analysis of algorithms using the Nesterov-Todd
directions, it is convenient to focus on

$$\tilde{w} := \frac{1}{\sqrt{\eta}}\,w \quad \text{and} \quad \tilde{w}^* := \frac{1}{\sqrt{\eta}}\,w^*$$

rather than on w and $w^*$. (Take note that we slightly abuse notation in that
$\tilde{w}^* \neq -g(\tilde{w})$ unless $\eta = 1$.) To give a sense as to why it is convenient, we show that a necessary and
sufficient condition to have the simultaneous equalities $x = x_\eta$ and $s = s_\eta$ is that $\tilde{w} = x$
(likewise, $\tilde{w}^* = s$), for observe that $\tilde{w}$ is the scaling point for the ordered pair x, $\eta s$ (whereas
$\tilde{w}^*$ is the scaling point for the pair s, $\eta x$). Hence,

As we know from §3.4, a necessary and sufficient condition for the simultaneous equalities
$x = x_\eta$ and $s = s_\eta$ is that $x = -\frac{1}{\eta}g(s)$. Consequently, since logarithmic homogeneity
and the uniqueness of scaling points (Theorem 3.5.3) imply $g_{\tilde{w}}(x) = -x$ iff $\tilde{w} = x$, (3.51)
shows we do indeed have the simultaneous equalities $x = x_\eta$ and $s = s_\eta$ iff $\tilde{w} = x$.
In the remaining sections of this chapter it is useful to scale the Nesterov-Todd direc-
tions, letting

Since $H(\tilde{w}) = \eta H(w)$ and $H(\tilde{w}^*) = \eta H(w^*)$, we have



and

Moreover,

is an $\langle\ ,\ \rangle_{\tilde{w}}$-orthogonal decomposition of $-(x + g_{\tilde{w}}(x))$. (In the last equality, we wrote


H(w)~lds rather than r]H(w)~lds to ease notation.)

3.7 Primal-Dual Path-Following Methods


We assume K is a self-scaled cone and rely on the notation of §3.6. We continue to fix an
arbitrary point $e \in K^\circ$ and let $\langle\ ,\ \rangle := \langle\ ,\ \rangle_e$.

3.7.1 Measures of Proximity


The analysis of primal-dual path-following ipm's relies on measuring proximity of current
points x and (y, s) to the primal and dual central paths. In the literature, there are various
expressions for measuring the proximity. Perhaps the most basic approach is simply to
measure the distance from x to $x_\eta$ by $\|x - x_\eta\|_{x_\eta}$ and, likewise, measure the distance from
(y, s) to $(y_\eta, s_\eta)$ by $\|s - s_\eta\|_{s_\eta}$. However, other expressions for measuring the proximity
yield more elegant and insightful analyses of primal-dual methods.
Let

where w is the scaling point for the ordered pair x, s and $w^*$ $(= -g(w))$ is the scaling
point for the pair s, x. We know from §3.6 that a necessary and sufficient condition for the
simultaneous equalities $x = x_\eta$ and $s = s_\eta$ is that $\tilde{w} = x$ (likewise, $\tilde{w}^* = s$). Thus, one
might consider measuring proximity of the pair x, s to the central path pair $x_\eta, s_\eta$ by the
quantity $\|x - \tilde{w}\|_{\tilde{w}}$. It is not difficult to show

so we have symmetry in this measure of proximity.


Although the quantity $\|x - \tilde{w}\|_{\tilde{w}}$ plays an important role in our analysis, it does not
have the starring role. That role is played by the value $\|\eta s + g(x)\|_{-g(x)}$, the usefulness of
which is indicated by the following theorem.
Recall that $\langle x, s\rangle = \langle c, x\rangle - \langle b, y\rangle$ for feasible points x and (y, s).

Theorem 3.7.1. If $x, s \in K^\circ$, then

and

Moreover,

Proof. To prove (3.52), substitute $H(w)^{-1}s$ for x and $H(w)x$ for s in the expression
$\|\eta s + g(x)\|_{-g(x)}$, then use the identities

and

Next, we claim that $H(x)^{1/2}e = -g(x)$, for by Theorem 3.5.10 there exists $x_{1/2} \in K^\circ$
such that $H(x_{1/2}) = H(x)^{1/2}$ and hence by (3.36),

Thus, $H(x)^{1/2}e = -g(x)$ by the uniqueness given in Theorem 3.5.3. Likewise, $H(x)^{-1/2}e = x$.
Consequently, relying on the fact that $\langle x, g(x)\rangle = -\vartheta_f$ by logarithmic homogeneity
(Theorem 2.3.9), we have

Toward proving the final inequality of the theorem, we remark that from the identities
$H(\tilde{w})^{-1}\eta s = x$, $g_{\tilde{w}}(x) = H(\tilde{w})^{-1}g(x)$, and

one can readily show

Since for all v,

Theorem 3.5.11 applied with $\tilde{w}$ in place of e thus gives

completing the proof.
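In linear programming the quantity $\|\eta s + g(x)\|_{-g(x)}$ takes a familiar form. With $f(x) = -\sum_j \ln x_j$ we have $g(x) = -1/x$ and $H(-g(x)) = \mathrm{diag}(x_j^2)$, so the quantity equals the Euclidean norm of the vector with entries $\eta x_j s_j - 1$. The sketch below (function names are illustrative) evaluates it directly and also with the roles of x and s interchanged; we read (3.52), whose display is not reproduced above, as asserting that the two values agree.

```python
import math

def norm_at_minus_grad(x, v):
    """||v||_{-g(x)} for f = -sum(log x_j): H(-g(x)) = H(1/x) = diag(x_j^2)."""
    return math.sqrt(sum((xj * vj) ** 2 for xj, vj in zip(x, v)))

def proximity(x, s, eta):
    """||eta * s + g(x)||_{-g(x)} with g(x) = -1/x."""
    v = [eta * sj - 1.0 / xj for xj, sj in zip(x, s)]
    return norm_at_minus_grad(x, v)

x = [0.2, 0.3, 0.5]
s = [1.5, 1.0, 0.6]
eta = 2.0

prox_xs = proximity(x, s, eta)        # ||eta s + g(x)||_{-g(x)}
prox_sx = proximity(s, x, eta)        # ||eta x + g(s)||_{-g(s)}
componentwise = math.sqrt(sum((eta * xj * sj - 1.0) ** 2 for xj, sj in zip(x, s)))
```

The value is zero exactly when $\eta x_j s_j = 1$ for all j, the usual description of the central path pair for LP.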


The quantity $\|\eta s + g(x)\|_{-g(x)}$ is independent of inner products in the sense that if
one expresses the original primal and dual instances in terms of a new inner product $\langle\ ,\ \rangle_z$,
transforming the dual feasible points by $s \mapsto H(z)^{-1}s$, one has

as is easily verified using the identity $H(-g_z(x)) = H(z)H(-g(x))H(z)$. In the special
case that $z = x$, this gives $\|\eta s + g(x)\|_{-g(x)} = \|\eta H(x)^{-1}s - x\|_x$, making evident that the
quantity measures distance between the primal point and the (scaled) dual feasible point.
Likewise,

3.7.2 An Algorithm
A feature of our first primal-dual path-following ipm is the simplicity of its analysis. Indeed,
the algorithm is designed expressly for this feature. In the next subsection we present an
algorithm which is more typical of the ipm's found in the literature, one that relies on the
Nesterov-Todd directions. Its analysis extends the simpler analysis.
Assume x and s are the current feasible points, meant to crudely approximate $x_\eta$ and
$s_\eta$. Our first algorithm obtains better approximations by computing

and

where $P_{L,\tilde{w}}$ denotes orthogonal projection onto L, orthogonal w.r.t. $\langle\ ,\ \rangle_{\tilde{w}}$. The algorithm
then increases $\eta$ to some value $\eta_+$. The points $x_+, s_+$ (computed with the aim of obtaining
better approximations to $x_\eta, s_\eta$ than x, s) can be considered as crude approximations to
$x_{\eta_+}, s_{\eta_+}$, just as x, s were considered to be crude approximations to $x_\eta, s_\eta$. The algorithm
repeats the steps with $\eta_+$ in place of $\eta$, and so on.
Insight into the algorithm is gained by understanding that the primal and dual direction
vectors

can essentially be viewed as orthogonal projections of the same vector, similarly to the
primal and dual Nesterov-Todd directions. In particular, we claim

and hence

gives an orthogonal decomposition of 2(w — x).


To verify (3.54), first note that since $\tilde{w}$ is a scalar multiple of w, the inner product
$\langle\ ,\ \rangle_{\tilde{w}}$ is a scalar multiple of $\langle\ ,\ \rangle_w$, and hence we have the operator identity $P_{L,\tilde{w}} = P_{L,w}$.
Similarly, we have $P_{L^\perp,\tilde{w}^*} = P_{L^\perp,w^*}$. Relying on the operator identity
$P_{L^\perp,w^*} = H(w)P_{L^{\perp_w},w}H(w)^{-1}$ (i.e., relying on (3.48)), it follows that

Also note that by logarithmic homogeneity (Theorem 2.3.9),

Thus,

establishing (3.54) and the fact that

gives an orthogonal decomposition of 2(w — x).


The orthogonal decomposition shows that $\Delta x$, $\Delta s$, and $\Delta y$ form the (unique) solution
of the system of linear equations

where $A^*_w = H(w)^{-1}A^* = \eta H(\tilde{w})^{-1}A^* = \eta A^*_{\tilde{w}}$. These equations can be used to compute
the directions in practice.
In light of Theorem 3.7.1, the following theorem justifies the algorithm.

Theorem 3.7.2. Assume $|1 - \eta_+/\eta| \le 1/(12\sqrt{\vartheta_f})$. If x and s are feasible points satisfying

then

Proof. The final inequality in Theorem 3.7.1 implies

Also note that since

we have

and

Consequently,

and because $\Delta x \perp_{\tilde{w}} H(\tilde{w})^{-1}\Delta s$,

Hence, by $g_{\tilde{w}}(\tilde{w}) = -\tilde{w}$, $H_{\tilde{w}}(\tilde{w}) = I$, Corollary 1.5.8, and Theorem 2.2.1, we have

Since (3.39) (applied with $\tilde{w}$ in place of e and $x_+$ in place of x) implies that for all v,

it follows that

Hence, using $|1 - \eta_+/\eta| \le 1/(12\sqrt{\vartheta_f})$ and $\vartheta_f \ge 1$, we have

completing the proof.


Applying the basic step of the algorithm repeatedly, using
Theorem 3.7.2 guarantees the algorithm stays on track. Moreover, the value $\eta$ will increase
by a factor of at least 2 with every $14\sqrt{\vartheta_f}$ iterations, i.e., will increase by a factor of at least
2 within $O(\sqrt{\vartheta_f})$ iterations. Consequently, using Theorem 3.7.1, we see that the difference
$\langle c, x_i\rangle - \langle b, y_i\rangle$ will decrease by a factor of at least $\frac{1}{2}$ within $O(\sqrt{\vartheta_f})$ iterations. This is
essentially the same as what was proven for the barrier method in §2.4.2.

3.7.3 Another Algorithm


Here we present an algorithm which is typical of primal-dual path-following methods found
in the literature, one that generalizes an approach for LP found in Kojima, Mizuno, and
Yoshise [11] and in Monteiro and Adler [14]. It relies on the (scaled) Nesterov-Todd
directions

and

introduced at the end of §3.6.


Given feasible x, s which crudely approximate $x_\eta, s_\eta$, the algorithm attempts to obtain
better approximations to $x_\eta, s_\eta$ by computing

The algorithm then increases $\eta$ to $\eta_+$. The points $x_+, s_+$ crudely approximate $x_{\eta_+}, s_{\eta_+}$. The
algorithm repeats the steps with $\eta_+$ in place of $\eta$, and so on.

The algorithm is closely related to the one of the previous subsection. Indeed, if x, s
are near $x_\eta, s_\eta$, then $\tilde{w} \approx x$ and hence

likewise, $s + g_{\tilde{w}^*}(s) \approx 2(s - \tilde{w}^*)$. Consequently, it is not surprising that the analysis of the
algorithm is an extension of the arguments in the proof of Theorem 3.7.2.

Theorem 3.7.3. Assume $|1 - \eta_+/\eta| \le 1/(20\sqrt{\vartheta_f})$. If x and s are feasible points satisfying

then

Proof. The final inequality in Theorem 3.7.1 implies

Hence, defining

Corollary 1.5.8 and Theorem 2.2.1 give

We know from §3.6 that

Hence,

and

Consequently,

Moreover, as we know from §3.6 that



Hence, by (3.55), Corollary 1.5.8, and Theorem 2.2.1 we have

However, (3.39) implies that for all v,

Thus,

Hence, using $|1 - \eta_+/\eta| \le 1/(20\sqrt{\vartheta_f})$ and $\vartheta_f \ge 1$, we have

completing the proof.

3.8 A Primal-Dual Potential-Reduction Method


We assume K is a self-scaled cone and rely on the notation of §3.6. We continue to fix an
arbitrary point $e \in K^\circ$ and let $\langle\ ,\ \rangle = \langle\ ,\ \rangle_e$.

3.8.1 The Potential Function


The progress of a potential reduction method is established by showing that a certain
function—an appropriately chosen "potential function"—decreases by a constant amount
with each iteration of the algorithm. The decrease can be established regardless of where
the current iterate lies; it need not lie near the central path. Potential-reduction methods are
not as restricted in their movement as path-following methods whose analysis depends on
staying near the central path.
The primal-dual potential-reduction method we consider relies on the Tanabe-Todd-
Ye potential function [20], [23]

where

The appropriateness of this potential function is shown by the following theorem, recalling
that

Theorem 3.8.1. For $x, s \in K^\circ$,

(with equality iff s is a multiple of $-g(x)$).

Proof. Let $t := \vartheta_f/\langle x, s\rangle$. Recall that $f^*(z) = f(z) + C_e$ for all $z \in K^\circ$, where
$C_e := -\vartheta_f - 2f(e)$ (Theorem 3.5.1). Thus, by definition of $f^*$,

(with equality iff $ts = -g(x)$, i.e., iff s is a multiple of $-g(x)$). Consequently, since
$f(ts) = f(s) - \vartheta_f \ln t$ (logarithmic homogeneity),

(with equality iff s is a multiple of $-g(x)$).


An immediate consequence of the theorem is that if an algorithm generates primal
and dual feasible iterates for which $\Phi(x, s)$ goes to $-\infty$, then the objective values of the
iterates converge to the optimal value.
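For a concrete view of the potential function, the sketch below specializes to the nonnegative orthant with $f(x) = -\sum_j \ln x_j$ and $e = (1, \ldots, 1)$, so that $\vartheta_f = n$ and $f(e) = 0$. The displays defining $\Phi$ are not reproduced above; the code assumes the commonly used form $\Phi(x, s) = \rho \ln\langle x, s\rangle + f(x) + f(s)$ with $\rho = \vartheta_f + \sqrt{\vartheta_f}$, and under that assumption the arithmetic-geometric mean inequality gives $\Phi(x, s) \ge \sqrt{n}\,\ln\langle x, s\rangle + n \ln n$, with equality when s is a multiple of $-g(x) = 1/x$. The function names are illustrative.

```python
import math

def potential(x, s):
    """Assumed form: rho * ln<x, s> + f(x) + f(s), rho = n + sqrt(n)."""
    n = len(x)
    rho = n + math.sqrt(n)
    gap = sum(xj * sj for xj, sj in zip(x, s))
    barrier = -sum(math.log(xj) for xj in x) - sum(math.log(sj) for sj in s)
    return rho * math.log(gap) + barrier

def lower_bound(x, s):
    """sqrt(n) * ln<x, s> + n * ln(n), valid for the orthant with f(e) = 0."""
    n = len(x)
    gap = sum(xj * sj for xj, sj in zip(x, s))
    return math.sqrt(n) * math.log(gap) + n * math.log(n)

x = [0.2, 0.3, 0.5]
s = [1.5, 1.0, 0.6]
s_multiple = [2.0 / xj for xj in x]     # a multiple of -g(x) = 1/x

slack_generic = potential(x, s) - lower_bound(x, s)                      # positive
slack_multiple = potential(x, s_multiple) - lower_bound(x, s_multiple)   # about 0
```

Because the bound grows with $\ln\langle x, s\rangle$, driving the potential to $-\infty$ forces the duality gap $\langle x, s\rangle$ to 0.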
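The potential function is the sum of a strictly convex functional (the map $(x, s) \mapsto
f(x) + f(s)$) and a functional which, when restricted to the primal and dual feasible regions,
is concave. Indeed, to see that the functional

$$(x, s) \mapsto \rho \ln\langle x, s\rangle \qquad (3.56)$$

is concave when restricted to the feasible regions, observe that its Hessian acts on vectors
$(u, v) \in \mathbb{R}^n \times \mathbb{R}^n$ as follows:

$$(u, v) \mapsto \frac{\rho}{\langle x, s\rangle}\,(v, u) - \frac{\rho\,(\langle u, s\rangle + \langle x, v\rangle)}{\langle x, s\rangle^2}\,(s, x).$$

Thus, if $u$ lies in the null space $L$ of $A$ and $v \in L^\perp$ (so that $\langle u, v\rangle = 0$), the inner
product of $(u, v)$ with its image under the Hessian is

$$\frac{2\rho\,\langle u, v\rangle}{\langle x, s\rangle} - \rho \left( \frac{\langle u, s\rangle + \langle x, v\rangle}{\langle x, s\rangle} \right)^2 = -\rho \left( \frac{\langle u, s\rangle + \langle x, v\rangle}{\langle x, s\rangle} \right)^2 \le 0.$$

Hence, the Hessian of the functional obtained by restricting (3.56) to the primal and dual
feasible regions is negative semidefinite, i.e., the restricted functional is concave.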
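The sign of the second derivative can be illustrated numerically. The sketch below assumes the orthant setting with the standard dot product, a single constraint vector `a` (so feasible primal directions satisfy `a.u = 0` and feasible dual directions are multiples of `a`), and a closed form for the second derivative of $t \mapsto \rho \ln\langle x + tu, s + tv\rangle$ at $t = 0$ obtained by differentiating twice. All names and data are hypothetical.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def second_derivative_fd(x, s, u, v, rho, h=1e-4):
    """Central finite difference of t -> rho*ln<x + t*u, s + t*v> at t = 0."""
    def value(t):
        xt = [xi + t * ui for xi, ui in zip(x, u)]
        st = [si + t * vi for si, vi in zip(s, v)]
        return rho * math.log(dot(xt, st))
    return (value(h) - 2.0 * value(0.0) + value(-h)) / (h * h)

def second_derivative_exact(x, s, u, v, rho):
    """rho * ( 2<u,v>/<x,s> - ((<u,s> + <x,v>)/<x,s>)^2 )."""
    gap = dot(x, s)
    return rho * (2.0 * dot(u, v) / gap - ((dot(u, s) + dot(x, v)) / gap) ** 2)

x, s = [1.0, 2.0, 3.0], [2.0, 1.0, 1.0]
rho = 3 + math.sqrt(3)

# Feasible directions for a = (1, 1, 1): a.u = 0 and v is a multiple of a.
u_feas, v_feas = [1.0, -1.0, 0.0], [0.5, 0.5, 0.5]
exact_feas = second_derivative_exact(x, s, u_feas, v_feas, rho)
fd_feas = second_derivative_fd(x, s, u_feas, v_feas, rho)

# Directions with <u, v> > 0 are not feasible; curvature can be positive there.
u_free, v_free = [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]
exact_free = second_derivative_exact(x, s, u_free, v_free, rho)
```

The second pair of directions shows why the restriction matters: without orthogonality of $u$ and $v$, the term $2\rho\langle u, v\rangle / \langle x, s\rangle$ can dominate and the functional need not be concave.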

3.8.2 The Algorithm


Assume $x \in K^\circ$ is primal feasible and $(y, s) \in \mathbb{R}^m \times K^\circ$ is dual feasible. The algorithm
moves from these points to $x_+$ and $(y_+, s_+)$. Maintaining our assumption that $A$ is
surjective (hence $A^*$ is injective), $y$ and $y_+$ are uniquely determined by $s$ and $s_+$, respectively.
Thus, to describe the algorithm, we need only describe how $x_+$ and $s_+$ are computed
from $x$ and $s$. We hide $y$ and $y_+$ to make the symmetry of the algorithm apparent.
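The claim that $y$ is determined by $s$ can be made concrete. The following short derivation is a sketch that assumes the dual feasibility constraint is written $A^* y + s = c$.

```latex
A^* y = c - s
\quad\Longrightarrow\quad
A A^* y = A (c - s)
\quad\Longrightarrow\quad
y = (A A^*)^{-1} A (c - s),
```

where $A A^*$ is invertible because $A$ is surjective. The same formula with $s_+$ in place of $s$ gives $y_+$.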
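The points $x_+$ and $s_+$ arise from a line search using the Nesterov-Todd directions

$$d_x := -P_{L,w}\bigl(\eta x + g_w(x)\bigr)$$

and

$$d_s := -P_{L^\perp,w^*}\bigl(\eta s + g_{w^*}(s)\bigr),$$

where

$$\eta := \rho / \langle x, s\rangle.$$
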
For motivation, we note that the gradient w.r.t. $\langle\ ,\ \rangle_w$ of the functional obtained by restricting
$x \mapsto \Phi(x, s)$ to the primal feasible region is precisely $-d_x$, i.e., $d_x$ is the direction of steepest
descent. Likewise, $d_s$ is the direction of steepest descent w.r.t. $\langle\ ,\ \rangle_{w^*}$ for the functional
obtained by restricting $s \mapsto \Phi(x, s)$ to the dual feasible region.
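The algorithm searches the line $\{(x + t d_x, s + t d_s) : t \in \mathbb{R}\}$ to obtain an approximate
minimizer $t_{\min}$ of the univariate functional

$$t \mapsto \Phi(x + t d_x, s + t d_s).$$

It then lets

$$x_+ := x + t_{\min} d_x, \qquad s_+ := s + t_{\min} d_s.$$
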
We show

For theory purposes we can formalize the condition that $t_{\min}$ be an "approximate minimizer"
by requiring $t_{\min}$ satisfy, say,

The specific constant is irrelevant for the theory. What matters is that the constant is
positive and universal (i.e., entirely independent of the particular instance) and that potential
reduction of that amount can always be achieved. In practice, the reduction of the potential
function is always drastically greater than this constant.
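One iteration of the method can be sketched for a toy linear program. This is a minimal illustration under several assumptions: $K$ is the nonnegative orthant with $f(x) = -\sum_j \ln x_j$, there is a single equality constraint $a \cdot x = b$, the scaling point satisfies $w_j^2 = x_j / s_j$, and the search directions are taken to be the steepest-descent directions of the restricted potential w.r.t. the local inner products, as described in the motivation above. The crude halving search merely stands in for whatever line search is actually used; all names and data are hypothetical.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def potential(x, s, rho):
    """rho*ln<x,s> + f(x) + f(s) with f(z) = -sum(ln z_j) on the orthant."""
    return rho * math.log(dot(x, s)) - sum(map(math.log, x)) - sum(map(math.log, s))

def directions(a, x, s, rho):
    """Steepest-descent directions for the single constraint a.x = b."""
    eta = rho / dot(x, s)
    d = [xi / si for xi, si in zip(x, s)]              # diagonal of H(w)^{-1}, w_j^2 = x_j/s_j
    v = [eta * xi - 1.0 / si for xi, si in zip(x, s)]  # eta*x + g_w(x)
    coef = dot(a, v) / sum(ai * ai * di for ai, di in zip(a, d))
    dx = [di * ai * coef - vi for di, ai, vi in zip(d, a, v)]  # satisfies a.dx = 0
    ds = [-coef * ai for ai in a]                               # multiple of a
    return dx, ds

def line_search(x, s, dx, ds, rho, tries=40):
    """Crude approximate minimizer: sample t = t_max/2^k and keep the best."""
    limits = [-z / dz for z, dz in zip(x + s, dx + ds) if dz < 0]
    t = 0.99 * min(limits) if limits else 1.0  # stay strictly inside the cone
    best_t, best_phi = 0.0, potential(x, s, rho)
    for _ in range(tries):
        phi = potential([xi + t * di for xi, di in zip(x, dx)],
                        [si + t * di for si, di in zip(s, ds)], rho)
        if phi < best_phi:
            best_t, best_phi = t, phi
        t /= 2.0
    return best_t

# Toy instance: minimize c.x subject to a.x = b, x >= 0.
a, b, c = [1.0, 1.0, 1.0], 3.0, [1.0, 2.0, 3.0]
x, y = [1.0, 1.0, 1.0], 0.0                  # primal feasible point and dual variable
s = [ci - ai * y for ci, ai in zip(c, a)]    # dual slack, strictly positive here
rho = len(x) + math.sqrt(len(x))             # theta_f = n for the orthant

dx, ds = directions(a, x, s, rho)
t_min = line_search(x, s, dx, ds, rho)
x_new = [xi + t_min * di for xi, di in zip(x, dx)]
s_new = [si + t_min * di for si, di in zip(s, ds)]
reduction = potential(x, s, rho) - potential(x_new, s_new, rho)
```

Because `dx` lies in the null space of `a` and `ds` is a multiple of `a`, the new points remain primal and dual feasible for any step that keeps them strictly positive, so only the potential needs to be monitored during the search.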
The algorithm generates a sequence of points $(x_0, s_0), (x_1, s_1), \ldots$, each obtained
from the previous one by line search. The relation (3.57), in conjunction with Theorem 3.8.1,
immediately implies for all $i$ that

Hence, for all $\epsilon > 0$, the algorithm computes feasible points $(x_i, s_i)$ satisfying

within

iterations.
From a theoretical perspective, by far the most important term in the bound (3.58) is
$\sqrt{\vartheta_f} \ln(1/\epsilon)$. The bound can be roughly interpreted as stating that the objective value difference
$\langle c, x_i\rangle - \langle b, y_i\rangle$ is halved in $O(\sqrt{\vartheta_f})$ iterations, a result which is essentially the same as
what we obtained for the barrier method in §2.4.2 and for the primal-dual path-following
methods in §3.7. However, the potential-reduction method has the potential for much
greater efficiency (especially if each line search yields $t_{\min}$ as the exact minimizer) as it is
not required to stay near the central path like the other algorithms.
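The halving interpretation can be traced through a short calculation. This is a sketch under two assumptions: the symbol $\delta > 0$ is introduced here only as a stand-in for the universal reduction constant, and the bound of Theorem 3.8.1 is taken in the form $\Phi(x, s) \ge \sqrt{\vartheta_f} \ln\langle x, s\rangle + \vartheta_f \ln \vartheta_f + 2 f(e)$.

```latex
\Phi(x_i, s_i) \le \Phi(x_0, s_0) - i\,\delta
\quad\Longrightarrow\quad
\ln\langle x_i, s_i\rangle
  \le \frac{\Phi(x_0, s_0) - \vartheta_f \ln \vartheta_f - 2 f(e) - i\,\delta}{\sqrt{\vartheta_f}} .
```

The upper bound on $\ln\langle x_i, s_i\rangle$ therefore drops by $\ln 2$ every $\sqrt{\vartheta_f}\,(\ln 2)/\delta = O(\sqrt{\vartheta_f})$ iterations, and $\langle x_i, s_i\rangle \le \epsilon$ is guaranteed once $i \ge \delta^{-1}\bigl(\sqrt{\vartheta_f} \ln(1/\epsilon) + \Phi(x_0, s_0) - \vartheta_f \ln \vartheta_f - 2 f(e)\bigr)$.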
The potential-reduction method described above was first introduced in the context
of linear complementarity problems by Kojima, Mizuno, and Yoshise [12].

3.8.3 The Analysis


It remains to prove the following theorem.

Theorem 3.8.2.

Proof. Recall from §3.6 the definitions

and

Also recall that

is an orthogonal (w.r.t. $\langle\ ,\ \rangle_w$) decomposition of $-(x + g_w(x))$.


Observe that

Hence, since $\eta := \rho / \langle x, s\rangle$,

For convenience, we scale the search directions, defining

Since (3.59) is an orthogonal decomposition,

Let

It suffices to prove

For all $x, s \in K^\circ$, define

Let

Since for all $x, s \in K^\circ$,

and, using (3.34),

we have

and hence,

Consequently, to establish (3.63) it suffices to show

Let

so that

The function $\psi_1$ is concave since the map $(x, s) \mapsto \rho \ln\langle x, s\rangle$ is concave when
restricted to the primal and dual feasible regions (as was discussed at the end of §3.8.1).
Hence,

that is,

Thus, using (3.59), (3.60), and (3.61), we have

Using (3.62) we have



Consequently, by Theorem 2.2.2, for $0 \le t < 1$,

Similarly, $\|H(w)^{-1} d_s\|_x < 1$ and hence

Thus, using (3.59) and (3.61),

In all, for $0 \le t < 1$,

Hence, by Theorem 3.5.11,

However, since $\|x\|_w^2 = \rho = \vartheta_f + \sqrt{\vartheta_f}$, $\|w\|_w = \sqrt{\vartheta_f}$, and $\vartheta_f \ge 1$, it is not difficult to
prove that

Thus,

Substitution yields

concluding the proof.


Bibliography

[1] E.J. Anderson and P. Nash, Linear Programming in Infinite Dimensional Spaces: Theory
and Applications, Wiley, Chichester, 1987.
[2] H.H. Bauschke, O. Güler, A.S. Lewis, and H.S. Sendov, "Hyperbolic polynomials and
convex analysis," Canadian Journal of Mathematics, 53 (2001), pp. 470-488.
[3] D. den Hertog, C. Roos, and J.-Ph. Vial, "A complexity reduction for the long-step
path-following algorithm for linear programming," SIAM Journal on Optimization, 2
(1992), pp. 71-87.
[4] J. Faraut and A. Korányi, Analysis on Symmetric Cones, Clarendon Press, Oxford,
1994.
[5] L. Faybusovich, "Euclidean Jordan algebras and interior-point algorithms," Positivity,
1 (1997), pp. 331-335.
[6] C. Gonzaga, "An algorithm for solving linear programming problems in $O(n^3 L)$
operations," in Progress in Mathematical Programming: Interior Point and Related
Methods, N. Megiddo, ed., Springer-Verlag, Berlin, 1989, pp. 1-28.
[7] C.C. Gonzaga, "Large step path-following methods for linear programming I. Barrier
function method," SIAM Journal on Optimization, 1 (1991), pp. 268-279.
[8] O. Güler, "Barrier functions in interior point methods," Mathematics of Operations
Research, 21 (1996), pp. 860-885.
[9] O. Güler, "Hyperbolic polynomials and interior point methods for convex programming,"
Mathematics of Operations Research, 22 (1997), pp. 350-377.
[10] N. Karmarkar, "A new polynomial time algorithm for linear programming,"
Combinatorica, 4 (1984), pp. 373-395.
[11] M. Kojima, S. Mizuno, and A. Yoshise, "A primal-dual interior point algorithm for
linear programming," in Progress in Mathematical Programming: Interior Point and
Related Methods, N. Megiddo, ed., Springer-Verlag, Berlin, 1989, pp. 29-47.
[12] M. Kojima, S. Mizuno, and A. Yoshise, "An $O(\sqrt{n} L)$ potential reduction method for
linear complementarity problems," Mathematical Programming, 50 (1991), pp. 331-342.
[13] N. Megiddo, "Pathways to the optimal set in linear programming," in Progress in
Mathematical Programming: Interior Point and Related Methods, N. Megiddo, ed.,
Springer-Verlag, Berlin, 1989, pp. 131-158.
[14] R.D.C. Monteiro and I. Adler, "Interior path following primal-dual algorithms. Part I:
Linear programming," Mathematical Programming, 44 (1989), pp. 27-41.
[15] Yu.E. Nesterov and A.S. Nemirovskii, Interior-Point Polynomial Algorithms in Convex
Programming, SIAM, Philadelphia, 1994.
[16] Yu.E. Nesterov and M.J. Todd, "Self-scaled barriers and interior-point methods for
convex programming," Mathematics of Operations Research, 22 (1997), pp. 1-42.
[17] Yu.E. Nesterov and M.J. Todd, "Primal-dual interior-point methods for self-scaled
cones," SIAM Journal on Optimization, 8 (1998), pp. 324-364.
[18] J. Renegar, "A polynomial-time algorithm based on Newton's method for linear
programming," Mathematical Programming, 40 (1988), pp. 59-94.
[19] R.T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[20] K. Tanabe, "Centered Newton method for mathematical programming," System Modelling
and Optimization: Proceedings of the 13th IFIP Conference, Tokyo, Japan,
Aug./Sept. 1987, Lecture Notes in Control and Information Sciences 113, M. Iri and
K. Yajima, eds., Springer-Verlag, Berlin, 1988, pp. 197-206.
[21] T. Terlaky (editor), Interior Point Methods of Mathematical Programming, Kluwer,
Dordrecht, The Netherlands, 1996.
[22] M.J. Todd, "Potential-reduction methods in mathematical programming," Mathematical
Programming, 76 (1996), pp. 3-45.
[23] M.J. Todd and Y. Ye, "A centered projective algorithm for linear programming,"
Mathematics of Operations Research, 15 (1990), pp. 508-529.
[24] S.J. Wright, Primal-Dual Interior-Point Methods, SIAM, Philadelphia, 1997.
[25] Y. Ye, Interior Point Algorithms: Theory and Analysis, John Wiley and Sons, New
York, 1997.