An Introduction To Continuous Optimization
Preface
The present book has been developed from course notes written by the
third author, and continuously updated and used in optimization courses
during the past several years at Chalmers University of Technology,
Göteborg (Gothenburg), Sweden.
A note to the instructor: The book serves to provide lecture and exercise material in a first course on optimization for second to fourth year
students at the university. The book's focus lies on providing a basis for
the analysis of optimization models and of candidate optimal solutions,
especially for continuous (even differentiable) optimization models. The
main part of the mathematical material therefore concerns the analysis
and algebra that underlie the workings of convexity and duality, and
necessary/sufficient local/global optimality conditions for unconstrained
and constrained optimization problems. Natural algorithms are then
developed from these principles, and their most important convergence
characteristics analyzed. The book answers many more questions of the
form "Why/why not?" than "How?".
This choice of focus is in contrast to books mainly providing numerical guidelines as to how these optimization problems should be
solved. The number of algorithms for linear and nonlinear optimization
problems (the two main topics covered in this book) is kept quite low;
those that are discussed are considered classical, and serve to illustrate
the basic principles for solving such classes of optimization problems and
their links to the fundamental theory of optimality. Any course based
on this book therefore should add project work on concrete optimization problems, including their modelling, analysis, solution by practical
algorithms, and interpretation.
A note to the student: The material assumes some familiarity with
linear algebra, real analysis, and logic. In linear algebra, we assume
an active knowledge of bases, norms, and matrix algebra and calculus.
In real analysis, we assume an active knowledge of sequences, the basic
topology of sets, real- and vector-valued functions and their calculus of
differentiation. We also assume a familiarity with basic predicate logic,
since the understanding of proofs requires it. A summary of the most
important background topics is found in Chapter 2, which also serves as
an introduction to the mathematical notation. The student is advised
to refresh any unfamiliar or forgotten material of this chapter before
reading the rest of the book.
We use only elementary mathematics in the main development of
the book; sections of supplementary material that provide an outlook
into more advanced topics and that require more advanced methods of
presentation are kept short, typically lack proofs, and are also marked
with an asterisk.
A detailed road map of the contents of the book's chapters is provided at the end of Chapter 1. Each chapter ends with a selected number
of exercises which either illustrate the theory and algorithms with numerical examples or develop the theory slightly further. In Appendix A
solutions are given to most of them, in a few cases in detail. (Those
exercises marked "exam" together with a date are examples of exam
questions that have been given in the course "Applied optimization" at
Göteborg University and Chalmers University of Technology since 1997.)
In our work on this book we have benefited from discussions with
Dr. Ann-Brith Strömberg, presently at the Fraunhofer–Chalmers Research Centre for Industrial Mathematics (FCC), Göteborg, and formerly at mathematics at Chalmers University of Technology, as well as
Dr. Fredrik Altenstedt, also formerly at mathematics at Chalmers University of Technology, and currently at Carmen Systems AB. We thank
the heads of undergraduate studies at mathematics, Göteborg University
and Chalmers University of Technology, Jan-Erik Andersson and Sven
Järner, respectively, for reducing our teaching duties while preparing
this book. We also thank Yumi Karlsson for helping us by typesetting
a main part of the first draft based on the handwritten notes; after the
fact, we now realize that having been helped with this first draft made
us confident that such a tremendous task as that of writing a textbook
would actually be possible. Finally, we thank all the students who gave
us critical remarks on the first versions during 2004 and 2005.
Göteborg, May 2005
Niclas Andreasson
Anton Evgrafov
Michael Patriksson
Contents

I Introduction  1

1 Modelling and classification  3

II Fundamentals

2 Analysis and algebra – A summary  31

3 Convex analysis  41
  3.1 Convexity of sets  41
  3.2 Polyhedral theory  42
    3.2.1 Convex hulls  42
    3.2.2 Polytopes  45
    3.2.3 Polyhedra  47
    3.2.4 The Separation Theorem and Farkas' Lemma  52
  3.3 Convex functions  57
  3.4 Application: the projection of a vector onto a convex set  66
  3.5 Notes and further reading  69
  3.6 Exercises  69

III Optimality Conditions  73

4 An introduction to optimality conditions  75

5 Optimality conditions  111
  5.1 Relations between optimality conditions and CQs at a glance  111
  5.2 A note of caution  112
  5.3 Geometric optimality conditions  114
  5.4 The Fritz John conditions  118
  5.5 The Karush–Kuhn–Tucker conditions  124
  5.6 Proper treatment of equality constraints  128
  5.7 Constraint qualifications  130
    5.7.1 Mangasarian–Fromovitz CQ (MFCQ)  131
    5.7.2 Slater CQ  131
    5.7.3 Linear independence CQ (LICQ)  132
    5.7.4 Affine constraints  132
  5.8 Sufficiency of the KKT conditions under convexity  133
  5.9 Applications and examples  135
  5.10 Notes and further reading  137
  5.11 Exercises  138

6 Lagrangian duality  141
  6.1 The relaxation theorem  141
  6.2 Lagrangian duality  142
    6.2.1 Lagrangian relaxation and the dual problem  142
    6.2.2 Global optimality conditions  147
    6.2.3 Strong duality for convex programs  149
    6.2.4 Strong duality for linear and quadratic programs  154
    6.2.5 Two illustrative examples  156
  6.3 Differentiability properties of the dual function  158
    6.3.1 Subdifferentiability of convex functions  158
    6.3.2 Differentiability of the Lagrangian dual function  162
  6.4 Subgradient optimization methods  164
    6.4.1 Convex problems  164
    6.4.2 Application to the Lagrangian dual problem  170
    6.4.3 The generation of ascent directions  173
  6.5 Obtaining a primal solution  174
    6.5.1 Differentiability at the optimal solution  175
    6.5.2 Everett's Theorem  176
  6.6 Sensitivity analysis  177
    6.6.1 Analysis for convex problems  177
    6.6.2 Analysis for differentiable problems  179
  6.7 Applications  181
    6.7.1 Electrical networks  181
    6.7.2 A Lagrangian relaxation of the traveling salesman problem  185
  6.8 Notes and further reading  189
  6.9 Exercises  190

IV Linear Programming  195

V Algorithms  265

11 Unconstrained optimization  267
  11.1 Introduction  267
  11.2 Descent directions  269
    11.2.1 Introduction  269
    11.2.2 Newton's method and extensions  271
  11.3 The line search problem  275
    11.3.1 A characterization of the line search problem  275
    11.3.2 Approximate line search strategies  276
  11.4 Convergent algorithms  279
  11.5 Finite termination criteria  281
  11.6 A comment on non-differentiability  283
  11.7 Trust region methods  284
  11.8 Conjugate gradient methods  285
    11.8.1 Conjugate directions  286
    11.8.2 Conjugate direction methods  287
    11.8.3 Generating conjugate directions  288
    11.8.4 Conjugate gradient methods  289
    11.8.5 Extension to non-quadratic problems  292
  11.9 A quasi-Newton method: DFP  293
  11.10 Convergence rates  296
  11.11 Implicit functions  296
  11.12 Notes and further reading  297
  11.13 Exercises  298

13 Constrained optimization  325
  13.1 Penalty methods  325
    13.1.1 Exterior penalty methods  326
    13.1.2 Interior penalty methods  330
    13.1.3 Computational considerations  333
    13.1.4 Applications and examples  334
  13.2 Sequential quadratic programming  337
    13.2.1 Introduction  337
    13.2.2 A penalty-function based SQP algorithm  340
    13.2.3 A numerical example on the MSQP algorithm  345
    13.2.4 On recent developments in SQP algorithms  346
  13.3 A summary and comparison  346
  13.4 Notes and further reading  347
  13.5 Exercises  348

VI Appendix  351

References  373

Index  385
Part I
Introduction
Modelling and
classification
1.1
The word "optimum" is Latin, and means "the ultimate ideal"; similarly,
"optimus" means "the best". Therefore, to optimize refers to trying to
bring whatever we are dealing with towards its ultimate state, that is,
towards its optimum. Let us take a closer look at what that means in
terms of an example, and at the same time bring the definition of the
term "optimization" forward, as the scientific field understands and uses
it.
Example 1.1 (a staff planning problem) Consider a hospital ward which
operates 24 hours a day. At different times of day, the staff requirement
differs. Table 1.1 shows the demand for reserve wardens during six work
shifts.
Table 1.1: Staff requirements at a hospital ward.

Shift     1     2     3      4      5      6
Hours     0–4   4–8   8–12   12–16  16–20  20–24
Demand    8     10    12     10     8      6
Each member of staff works in 8 hour shifts. The goal is to fulfill the
demand with the least total number of reserve wardens.
Consider now the following interpretation of the term to optimize:
To optimize = to do something as well as is possible.
∑_{j=1}^{6} xj.

x1 + x2 ≥ 10,
x2 + x3 ≥ 12,
x3 + x4 ≥ 10,
x4 + x5 ≥ 8,
x5 + x6 ≥ 6.

minimize ∑_{j=1}^{6} xj,
subject to x1 + x6 ≥ 8,   (last shift: 1)
x1 + x2 ≥ 10,  (last shift: 2)
x2 + x3 ≥ 12,  (last shift: 3)
x3 + x4 ≥ 10,  (last shift: 4)
x4 + x5 ≥ 8,   (last shift: 5)
x5 + x6 ≥ 6,   (last shift: 6)
xj ≥ 0,        j = 1, . . . , 6,
xj integer,    j = 1, . . . , 6.
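For readers who wish to experiment, this small integer program can be stated and solved in a few lines. The sketch below is our own illustration, assuming Python with SciPy's milp routine (SciPy 1.9 or later); the book itself prescribes no particular software. Here xj is the number of wardens whose first shift is j.

    import numpy as np
    from scipy.optimize import milp, LinearConstraint, Bounds

    demand = np.array([8, 10, 12, 10, 8, 6])   # Table 1.1, shifts 1..6
    # Shift s is covered by wardens starting in shift s and in shift s-1
    # (each works two consecutive 4-hour shifts, cyclically over the day).
    A = np.zeros((6, 6))
    for s in range(6):
        A[s, s] = 1
        A[s, (s - 1) % 6] = 1
    res = milp(c=np.ones(6),                                  # total staff
               constraints=LinearConstraint(A, lb=demand, ub=np.inf),
               integrality=np.ones(6),                        # xj integer
               bounds=Bounds(0, np.inf))                      # xj >= 0
    print(res.x, res.fun)

Note that first solving the LP relaxation and rounding is tempting but in general neither optimal nor feasible; above, the integrality requirements are handled directly by the solver.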
[Figure: the optimization modelling process. Reality is turned into an optimization model through communication, simplification, quantification, and limitation; the model and its data are processed by algorithms to produce results, which are subject to evaluation and interpretation, possibly leading to a modification of the model.]
1.2

At Chalmers, courses in optimization are mainly given at the mathematics department. "Mainly" is the important word here, because courses
that have a substantial content of optimization theory and/or methodology can be found also at other departments, such as computer science,
the mechanical, industrial and chemical engineering departments, and at
the Gothenburg School of Economics. The reason is that optimization
is so broad in its applications.
From the mathematical standpoint, optimization, or mathematical
programming as it is sometimes called, rests on several legs: analysis,
topology, algebra, discrete mathematics, etcetera, build the foundation
of the theory, and applied mathematics subjects such as numerical analysis and mathematical parts of computer science build the bridge to
the algorithmic side of the subject. On the other side, then, with optimization we solve problems in a huge variety of areas, in the technical,
natural, life and engineering sciences, and in economics.
Before moving on, we would just like to point out that the term
"program" has nothing to do with "computer program"; a program is
understood to be a "decision program", that is, a strategy or decision
rule. A "mathematical program" therefore is a mathematical problem
designed to produce a decision program.
The history of optimization is very long. Many, very often geometrical or mechanical, problems (and quite often related to warfare!) that
Archimedes, Euclid, Heron, and other masters from antiquity formulated
and also solved, are optimization problems. For example, we mention
the problem of maximizing the volume of a closed three-dimensional object (such as a sphere or a cylinder) built from a two-dimensional sheet
of metal with a given area.
The masters of two millennia later, like Bernoulli, Lagrange, Euler, and
Weierstrass developed variational calculus, studying problems in applied
physics (and still often with a mind towards warfare!) such as how to
find the best trajectory for a flying object.
The notion of optimality, and especially how to characterize an optimal solution, began to be developed at the same time. Characterizations
of various forms of optimal solutions are indeed a crucial part of any basic
optimization course. (See Section 1.7.)
The scientific subject operations research refers to the study of decision problems regarding operations, in the sense of controlling complex
1.3
We here develop a subset of problem classes that can be set up by contrasting certain aspects of a general optimization problem. We let

xj ∈ R: decision variables, j = 1, 2, . . . , n;
f : Rn → R ∪ {±∞}: objective function;
X ⊆ Rn: ground set defined logically/physically;
gi : Rn → R: constraint functions defining the restrictions
gi(x) ≥ bi, i ∈ I, (inequality constraints)
gi(x) = di, i ∈ E. (equality constraints)

The general optimization problem then is to

minimize f(x),  (1.1a)
subject to gi(x) ≥ bi,  i ∈ I,  (1.1b)
gi(x) = di,  i ∈ E,  (1.1c)
x ∈ X.  (1.1d)
2 Incidentally, several other laureates in economics have worked with the tools of
optimization: Paul A. Samuelson (1970, linear programming), Kenneth J. Arrow
(1972, game theory), Wassily Leontief (1973, linear transportation models), Gerard
Debreu (1983, game theory), Harry M. Markowitz (1990, quadratic programming in
finance), John F. Nash, Jr. (1994, game theory), William Vickrey (1996, econometrics), and Daniel L. McFadden (2000, microeconomics).
Constrained optimization I ∪ E ≠ ∅ and/or X ⊂ Rn.
Differentiable optimization f, gi, i ∈ I ∪ E, are continuously differentiable on X; further, X is closed and convex.
Non-differentiable optimization At least one of f, gi, i ∈ I ∪ E, is non-differentiable.
(CP) Convex programming f is convex; gi, i ∈ I, are concave; gi, i ∈ E, are affine; and X is closed and convex. (See Section 3.3 for definitions.)
Non-convex programming The complement of the above.
In Figure 1.3 we show how the problems NLP, IP, and LP are related.
[Figure 1.3: the relations among the problem classes NLP, IP, and LP.]
automatically is integer valued even without imposing any integrality
constraints, provided of course that the problem has any optimal solutions at all. We say that such problems have the integrality property.
An important example problem belonging to this category is the linear
single-commodity network flow problem with integer data; this class of
problems in turn includes as special cases such important problems as the
linear versions of the assignment problem, the transportation problem,
the maximum flow problem, and the shortest route problem.
Among the above list of problem classes, we distinguish, roughly only,
between two of the most important ones, as follows:
LP Linear programming ≈ applied linear algebra. LP is "easy", because there exist algorithms that can solve every LP problem instance efficiently in practice.
NLP Nonlinear programming ≈ applied analysis in several variables. NLP is "hard", because there does not exist an algorithm that can solve every NLP problem instance efficiently in practice. NLP is such a large problem area that it contains very hard problems as well as very easy problems. The largest class of NLP problems that are solvable with some algorithm in reasonable time is CP (of which LP is a special case).
Our problem formulation (1.1) does not cover the following:
infinite-dimensional problems (that is, problems formulated in function spaces rather than vector spaces);
implicit functions f and/or gi, i ∈ I ∪ E: then, no explicit formula can be written down; this is typical in engineering applications, where the value of, say, f(x) can be the result of a simulation (see Section 11.11 for more details);
multiple-objective optimization.
1.4

Conventions

f∗ := infimum_{x ∈ S} f(x)  (1.2)
denotes the infimum value of the function f over the set S; if and only
if the infimum value is attained at some point x∗ in S (and then both
f∗ and x∗ necessarily are finite) we can write that

f∗ := minimum_{x ∈ S} f(x),  (1.3)

and then we of course have that f(x∗) = f∗. (When considering maximization problems, we obtain the analogous definitions of the supremum
and the maximum.)
The second operation defines the set of optimal solutions to the problem at hand:

S∗ := arg minimum_{x ∈ S} f(x);  (1.4)

is a special case which moreover defines an often much more simple task.
Consider the problem instance where S = { x ∈ R | x ≥ 0 } and

f(x) := 1/x, if x > 0; +∞, otherwise;

here, f∗ = 0 but S∗ = ∅: the value 0 is not attained for a finite value of
x, so the problem has a finite infimum value but not an optimal solution.
These examples lead to our convention in reading the problem (1.2):
the statement "solve the problem (1.2)" means "find f∗ and an x∗ ∈ S∗,
or conclude that S∗ = ∅".
Hence, it is implicit in the formulation that we are interested both
in the infimum value and in (at least) one optimal solution if one exists.
Whenever we are certain that only one of them is of interest we will
state so explicitly. We are aware that the interpretation of (1.2) may be
considered vague, since no "operation" is visible; so, to summarize and
clarify our convention, it in fact includes two operations, (1.3) and (1.4).
There is a second reason for stating the optimization problem (1.1) in
the way it is, a reason which is computational. To solve the problem, we
almost always need to solve a sequence of relaxations/simplifications of
1.5
1.6
1.7
On optimality conditions
The most important topic of the book is the analysis of the local or global
optimality of a given feasible vector x∗ in the problem (1.2), and its links
to the construction of algorithms for finding such vectors. While locally
or globally optimal vectors are the ones preferred, the types of vectors
that one can expect to reach for a general problem are referred to as
stationary points; we define what we mean by x∗ ∈ S being a stationary
point in the problem (1.2) in non-mathematical terms as follows:

x∗ ∈ S is a stationary point in the problem (1.2) if, with the
use only of first-order information about the problem at x∗,
we cannot find a feasible descent direction at x∗.
In mathematical terms, this condition can be written as follows:

x∗ ∈ S is a stationary point in the problem (1.2) if −∇f(x∗) ∈ NS(x∗) holds, where NS(x∗) is the normal cone to S at x∗.

See Definition 4.25 for the definition of the normal cone.
In applications to all model problems considered in the book this condition collapses to something that is rather easy to check. In the most
general case, however, its use in formulating necessary optimality conditions requires further that the point x∗ satisfies a regularity condition
referred to as a constraint qualification (CQ).
The connection between local or global optimality, stationarity and
regularity is given by the following two implications, which constitute
perhaps the two most important ones in the entire book:

x∗ local min in (1.2) and x∗ regular  =⇒  x∗ stationary point in (1.2);  (1.5)

x∗ stationary point in (1.2) and the problem (1.2) convex  =⇒  x∗ global min in (1.2).  (1.6)
1.8
1.8.1
1.8.2
minimize f(x),  (1.7a)
subject to gi(x) ≤ 0,  i = 1, . . . , m.  (1.7b)

minimize_{(x,s)} f(x) + ν ∑_{i=1}^{m} si,  (1.8a)
subject to gi(x) ≤ si,  i = 1, . . . , m,  (1.8b)
si ≥ 0,  i = 1, . . . , m.

For any given x, the best choice of s in (1.8) solves the problem to

minimize ν ∑_{i=1}^{m} si,
subject to si ≥ gi(x),  i = 1, . . . , m,
si ≥ 0,  i = 1, . . . , m.

This problem is trivially solvable: si := maximum {0, gi(x)}, that is,
si takes on the role of a slack variable for the constraint. Using this
expression in the problem (1.8) we finally obtain the problem to

minimize_{x ∈ Rn} f(x) + ν ∑_{i=1}^{m} maximum {0, gi(x)}.  (1.9)

If the constraints instead are of the form gi(x) ≥ 0, then the resulting
penalty function is of the form ν ∑_{i=1}^{m} maximum {0, −gi(x)}.
We note that the use of the linear penalty term in (1.8a) resulted
in the penalty problem (1.9); other penalty terms than (1.8a) lead to
other penalty problems. See Section 13.1 for a thorough discussion on
and analysis of penalty functions and methods.
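To make the derivation concrete, here is a minimal numerical sketch of the penalty problem (1.9), assuming Python with SciPy and a single constraint; the problem instance and the values of ν are our own illustrative choices, not taken from the text.

    import numpy as np
    from scipy.optimize import minimize

    f = lambda x: (x[0] - 2)**2 + (x[1] - 2)**2   # objective
    g = lambda x: x[0] + x[1] - 2                 # constraint g(x) <= 0

    def solve_penalized(nu):
        # the problem (1.9): minimize f(x) + nu * max{0, g(x)}
        obj = lambda x: f(x) + nu * max(0.0, g(x))
        # Nelder-Mead, since max{0, .} makes the objective non-differentiable
        return minimize(obj, x0=np.zeros(2), method="Nelder-Mead").x

    for nu in [1.0, 10.0, 100.0]:
        print(nu, solve_penalized(nu))   # approaches the solution (1, 1)

Note that the max term makes (1.9) non-differentiable even when f and the gi are smooth; this trade-off is part of the analysis in Section 13.1.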
1.9
1.10
1.11
The subject of optimization, including both its basic theory and the
natural, basic, algorithmic development that is associated with solving
1.12
1.13
Exercises
Exercise 1.2 (modelling) A computer company has estimated the service hours needed during the next five months; see Table 1.2.
Table 1.2: Number of service hours per month; Exercise 1.2.

Month      # Service hours
January    6000
February   7000
March      8000
April      9500
May        11,500
the money went to other interior designs of the office space, so there is no
money left to buy more cable.
[Figure: the office floor plan for Exercise 1.3, with wall lengths l and b and connection points marked at l/2 and b/2.]
three partners all want to sit as close as possible to it. Therefore, they
decide to try to minimize the distance to the window for the workplace
that is the furthest away from it.
Formulate the problem of placing the three work places so that the
maximum distance to the panorama window is minimized, subject to all
the necessary constraints.
Exercise 1.4 (modelling, exam 010523) A large chain of department
stores wants to build distribution centers (warehouses) which will supply
30 department stores with goods. They have 10 possible locations to
choose between. To build a warehouse at location i, i = 1, . . . , 10, costs
ci MEUR and the capacity of a warehouse at that location would be ki
volume units per week. Department store j has a demand of ej volume
units per week. The distance between warehouse i and department store
j is dij km, i = 1, . . . , 10, j = 1, . . . , 30, and a certain warehouse can
only serve a department store if the distance is at most D km.
One wishes to minimize the cost of investing in the necessary distribution centers.
(a) Formulate a linear integer optimization model describing the optimization problem.
(b) Suppose each department store must be served from one of the
warehouses. What must be changed in the model?
Part II
Fundamentals
Analysis and algebra – A summary
The analysis of optimization problems and related optimization algorithms requires a basic understanding of formal logic, linear algebra,
and multidimensional analysis. This chapter is not intended as a substitute for the basic courses on these subjects, but rather gives a brief
review of the notation, definitions, and basic facts which will be used in
the subsequent chapters without any further notice. If you feel uncomfortable with the limited summaries presented in this chapter, consult any
of the abundant basic textbooks on these subjects.
2.1
Reductio ad absurdum
2.2

Linear algebra
The largest number of linearly independent vectors in Rn is n; any
collection of n linearly independent vectors in Rn is referred to as a
basis. The basis (v1, . . . , vn) is said to be orthogonal if (vi, vj) = 0 for
all i, j = 1, . . . , n, i ≠ j. If, in addition, it holds that ‖vi‖ = 1 for all
i = 1, . . . , n, the basis is called orthonormal.
Given the basis (v1, . . . , vn) in Rn, every vector v ∈ Rn can be written in a unique way as v = ∑_{i=1}^{n} αi vi, and the n-tuple (α1, . . . , αn)T will
be referred to as the coordinates of v in this basis. If the basis (v1, . . . , vn)
is orthonormal, then the coordinates αi are computed as αi = (v, vi),
i = 1, . . . , n.
The space Rn will typically be equipped with the standard basis
(e1, . . . , en), where

ei := (0, . . . , 0, 1, 0, . . . , 0)T ∈ Rn,

with i − 1 zeros before the 1 and n − i zeros after it.

‖A‖ := max_{v ∈ Rn: ‖v‖ = 1} ‖Av‖.
Even if A is not square, ATA as well as AAT are square and symmetric. If the columns of A are linearly independent, then ATA is
nonsingular. (Similarly, if the columns of AT are linearly independent,
then AAT is nonsingular.)
Sometimes, we will use the following simple fact: for every A ∈ Rk×n
with elements aij, i = 1, . . . , k, j = 1, . . . , n, it holds that aij = (ẽi, Aej),
where (ẽ1, . . . , ẽk) is the standard basis in Rk, and (e1, . . . , en) is the
standard basis in Rn.
We will say that A ∈ Rn×n is positive semidefinite (respectively,
positive definite), and denote this by A ⪰ 0 (respectively, A ≻ 0), if
and only if for all v ∈ Rn it holds that (v, Av) ≥ 0 (respectively, for all
v ∈ Rn, v ≠ 0n, it holds that (v, Av) > 0). The matrix A is positive
semidefinite (respectively, positive definite) if and only if its eigenvalues
are nonnegative (respectively, positive).
For two symmetric matrices A, B ∈ Rn×n we will write A ⪰ B
(respectively, A ≻ B) if and only if A − B ⪰ 0 (respectively, A − B ≻ 0).
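As a quick numerical companion to the eigenvalue characterization above, the following sketch (assuming Python with NumPy, our own choice of tool) tests positive semidefiniteness of a symmetric matrix:

    import numpy as np

    def is_psd(A, tol=1e-10):
        # For symmetric A: positive semidefinite iff all eigenvalues >= 0
        return np.all(np.linalg.eigvalsh(A) >= -tol)

    A = np.array([[2.0, -1.0], [-1.0, 2.0]])
    B = np.array([[1.0, 0.0], [0.0, 0.5]])
    print(is_psd(A), is_psd(A - B))   # both True here, i.e., A "⪰" 0 and A "⪰" B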
2.3

Analysis

f(x) = f(x0) + (∇f(x0), x − x0) + o(‖x − x0‖),  (2.1)

and moreover

lim_{t→0} o(t)/t = 0.  (2.2)
matrix, and a function o : R → R verifying (2.2), such that

f(x) = f(x0) + (∇f(x0), x − x0) + ½(x − x0, ∇²f(x0)(x − x0)) + o(‖x − x0‖²).  (2.3)
Sometimes it will be convenient to discuss vector-valued functions
f : S → Rk. We say that f = (f1, . . . , fk)T is continuous if every fi,
i = 1, . . . , k, is; similarly we define differentiability. In the latter case,
by ∇f ∈ Rn×k we denote a matrix with the columns (∇f1, . . . , ∇fk). Its
transpose is often referred to as the Jacobian of f.
We call a continuous function f : S → R continuously differentiable [notation: f ∈ C1(S)] if it is differentiable on S and the gradient
∇f : S → Rn is continuous on S. We call f : S → R twice continuously
differentiable [notation: f ∈ C2(S)] if it is continuously differentiable
and in addition every component of ∇f : S → Rn is continuously differentiable.
The following alternative forms of (2.1) and (2.3) will be useful sometimes. If f : S → R is continuously differentiable on S, and x0 ∈ S, then
for every x in some neighbourhood of x0 we have

f(x) = f(x0) + (∇f(ξ), x − x0),  (2.4)

where ξ is some point on the segment between x0 and x.
Convex analysis
3.1
Convexity of sets
Figure 3.1: A convex set. (For the intermediate vector λx1 + (1 − λ)x2 shown, the value of λ is 1/2.)
Two non-convex sets are shown in Figure 3.2.
Example 3.2 (convex and non-convex sets) By using the definition of a
convex set, the following can be established:
(a) The set Rn is a convex set.
(b) The empty set is a convex set.
Figure 3.2: Two non-convex sets.
We will not write the index "2", but instead use the 2-norm implicitly
whenever writing ‖ · ‖.]
(d) The set { x ∈ Rn | ‖x‖ = a } is non-convex for every a > 0.
(e) The set {0, 1, 2} is non-convex. (The second illustration in Figure 3.2 is such a case of a set of integral points in R2.)
Proposition 3.3 (intersection of convex sets) Suppose that Sk ⊆ Rn, k ∈ K, is any collection of convex sets, where K is an arbitrary index set. Then, the set S := ∩_{k∈K} Sk is a convex set.

Proof. Let both x1 and x2 belong to S. (If two such points cannot be
found, then the result holds vacuously.) Then, x1 ∈ Sk and x2 ∈ Sk for
all k ∈ K. Take λ ∈ (0, 1). Then, λx1 + (1 − λ)x2 ∈ Sk, k ∈ K, by the
convexity of the sets Sk. So, λx1 + (1 − λ)x2 ∈ ∩_{k∈K} Sk = S.
3.2

Polyhedral theory

3.2.1

Convex hulls
3.3(b)], that is, { λv1 + (1 − λ)v2 | λ ∈ R } = { λ1v1 + λ2v2 | λ1, λ2 ∈
R; λ1 + λ2 = 1 }. Another set naturally related to V is the line segment
between v1 and v2 [see Figure 3.3(c)], that is, { λv1 + (1 − λ)v2 | λ ∈
[0, 1] } = { λ1v1 + λ2v2 | λ1, λ2 ≥ 0; λ1 + λ2 = 1 }. Motivated by these
examples we define the affine hull and the convex hull of a set in Rn.

Definition 3.4 (affine hull) Let V := {v1, . . . , vk} ⊂ Rn. The affine
hull of V is the set

aff V := { λ1v1 + · · · + λkvk | λ1, . . . , λk ∈ R; ∑_{i=1}^{k} λi = 1 }.
Example 3.6 (affine hull, convex hull) (a) The affine hull of three or
more points in R2 not all lying on the same line is R2 itself. The convex
hull of five points in R2 is shown in Figure 3.4 (observe that the corners
of the convex hull of the points are some of the points themselves).
(b) The affine hull of three points not all lying on the same line in
R3 is the plane through the points.
(c) The affine hull of an affine space is the space itself and the convex
hull of a convex set is the set itself.
From the definition of convex hull of a finite set it follows that the
convex hull equals the set of all convex combinations of points in the set.
It turns out that this also holds for arbitrary sets.
Figure 3.3: (a) The set V. (b) The set aff V. (c) The set conv V.
Figure 3.4: The convex hull of five points in R2.
+ (1 )1 b1 + + (1 )m bm ,
44
Polyhedral theory
and since 1 + + k + (1 )1 + + (1 )m = 1, we have
that x1 + (1 )x2 Q, so Q is convex. Since Q is convex and V Q
it follows that conv V Q (from the definition of convex hull of an arbitrary set in Rn it follows that conv V is the smallest convex set that
contains V ). Therefore Q = conv V .
Proposition 3.7 shows that every point of the convex hull of a set
can be written as a convex combination of points from the set. It tells,
however, nothing about how many points are required. This is the
content of Carathéodory's Theorem.

Theorem 3.8 (Carathéodory's Theorem) Let x ∈ conv V, where V ⊆
Rn. Then, x can be expressed as a convex combination of n + 1 or fewer
points of V.
Proof. From Proposition 3.7 it follows that x = λ1a1 + · · · + λmam for
some a1, . . . , am ∈ V and λ1, . . . , λm ≥ 0 such that ∑_{i=1}^{m} λi = 1. We assume that this representation of x is chosen so that x cannot be expressed
as a convex combination of fewer than m points of V. It follows that
no two of the points a1, . . . , am are equal and that λ1, . . . , λm > 0. We
prove the theorem by showing that m ≤ n + 1. Assume that m > n + 1.
Then the set {a1, . . . , am} must be affinely dependent, so there exist
μ1, . . . , μm ∈ R, not all zero, such that ∑_{i=1}^{m} μiai = 0n and ∑_{i=1}^{m} μi = 0.
Let ε > 0 be such that λ1 + εμ1, . . . , λm + εμm are non-negative with at
least one of them zero (such an ε exists since the λ's are all positive and
at least one of the μ's must be negative). Then, x = ∑_{i=1}^{m} (λi + εμi)ai,
and if terms with zero coefficients are omitted this is a representation of
x with fewer than m points; this is a contradiction.
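The proof is constructive and can be turned into an algorithm that repeatedly eliminates one point from a redundant representation. The following sketch (Python with NumPy; an illustration of ours, not a routine from the text) implements exactly the reduction step of the proof:

    import numpy as np

    def caratheodory_reduce(points, lam, tol=1e-12):
        # Given x = sum_i lam[i]*points[i], a convex combination in R^n,
        # return an equivalent convex combination of at most n+1 points.
        pts = [np.asarray(p, dtype=float) for p in points]
        lam = [float(l) for l in lam]
        n = pts[0].size
        while True:
            keep = [i for i, l in enumerate(lam) if l > tol]
            pts = [pts[i] for i in keep]
            lam = [lam[i] for i in keep]
            m = len(pts)
            if m <= n + 1:
                return pts, lam
            # m > n+1: the points are affinely dependent; find mu != 0 with
            # sum_i mu_i * a_i = 0 and sum_i mu_i = 0 (null vector of M)
            M = np.vstack([np.column_stack(pts), np.ones(m)])
            mu = np.linalg.svd(M)[2][-1]
            if mu.min() > -tol:          # make sure some mu_i < 0
                mu = -mu
            # largest eps keeping lam + eps*mu >= 0 (one entry hits zero)
            eps = min(lam[i] / -mu[i] for i in range(m) if mu[i] < -tol)
            lam = [lam[i] + eps * mu[i] for i in range(m)]

Each pass zeroes at least one coefficient, so at most m − (n + 1) passes are needed.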
3.2.2
Polytopes
Definition 3.11 (extreme point) A point v of a convex set P is called
an extreme point if whenever v = λx1 + (1 − λ)x2, where x1, x2 ∈ P
and λ ∈ (0, 1), then v = x1 = x2.
Example 3.12 (extreme points) The set shown in Figure 3.3(c) has the
extreme points v 1 and v 2 . The set shown in Figure 3.4 has the extreme
points v 1 , v 2 , and v 3 . The set shown in Figure 3.3(b) does not have any
extreme points.
Lemma 3.13 Let V := {v1, . . . , vk} ⊂ Rn and let P be the polytope
conv V. Then, each extreme point of P lies in V.

Proof. Assume that w ∉ V is an extreme point of P. We have that
w = ∑_{i=1}^{k} λivi, for some λi ≥ 0 such that ∑_{i=1}^{k} λi = 1. At least one of
the λi's must be nonzero, say λ1. If λ1 = 1 then w = v1, a contradiction,
so λ1 ∈ (0, 1). We have that

w = λ1v1 + (1 − λ1) ∑_{i=2}^{k} (λi / (1 − λ1)) vi.

Since ∑_{i=2}^{k} λi/(1 − λ1) = 1 we have that ∑_{i=2}^{k} (λi/(1 − λ1)) vi ∈ P, but
w is an extreme point of P so w = v1, a contradiction.
Proposition 3.14 Let V := {v1, . . . , vk} ⊂ Rn and let P be the polytope conv V. Then P is equal to the convex hull of its extreme points.

Proof. Let Q be the set of extreme points of P. If vi ∈ Q for all i =
1, . . . , k we are done, so assume that v1 ∉ Q. Then v1 = λu + (1 − λ)w
for some λ ∈ (0, 1) and u, w ∈ P, u ≠ w. Further, u = ∑_{i=1}^{k} αivi and
w = ∑_{i=1}^{k} βivi, for some α1, . . . , αk, β1, . . . , βk ≥ 0 such that ∑_{i=1}^{k} αi =
∑_{i=1}^{k} βi = 1. Hence,

v1 = λ ∑_{i=1}^{k} αivi + (1 − λ) ∑_{i=1}^{k} βivi = ∑_{i=1}^{k} (λαi + (1 − λ)βi) vi.

Since u ≠ w we cannot have λα1 + (1 − λ)β1 = 1, so

v1 = ∑_{i=2}^{k} ((λαi + (1 − λ)βi) / (1 − (λα1 + (1 − λ)β1))) vi,

and since ∑_{i=2}^{k} (λαi + (1 − λ)βi)/(1 − (λα1 + (1 − λ)β1)) = 1 it follows
that conv V = conv (V \ {v1}). Similarly, every vi ∉ Q can be removed,
and we end up with a set T ⊆ V such that conv T = conv V and T ⊆ Q.
On the other hand, from Lemma 3.13 we have that every extreme point
of the set conv T lies in T and since conv T = conv V it follows that Q
is the set of extreme points of conv T, so Q ⊆ T. Hence, T = Q and we
are done.
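Numerically, the extreme points of a polytope given as conv V can be recovered with a convex-hull routine; the sketch below uses scipy.spatial.ConvexHull (our own choice; any hull code would do). In agreement with Lemma 3.13, the vertices returned are a subset of V:

    import numpy as np
    from scipy.spatial import ConvexHull

    # five points in R^2; the last one lies inside the square
    V = np.array([[0, 0], [2, 0], [2, 2], [0, 2], [1, 1]], dtype=float)
    hull = ConvexHull(V)
    print(sorted(hull.vertices))   # indices of the extreme points: 0, 1, 2, 3
    # conv V equals the convex hull of these extreme points (Proposition 3.14)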
3.2.3

Polyhedra

in (Ã, b̃) is denoted by m̃.

Theorem 3.17 (algebraic characterization of extreme points) Let x̄ ∈
P = { x ∈ Rn | Ax ≤ b }, where A ∈ Rm×n has rank A = n and
b ∈ Rm. Further, let Ãx̄ = b̃ be the equality subsystem of Ax̄ ≤ b.
Then, x̄ is an extreme point of P if and only if rank Ã = n.
[Figure: a polyhedron P in R2 bounded by the lines x1 = 2, 2x1 − x2 = 4, and x1 + x2 = 6.]
Proof. [=⇒] Suppose that x̄ is an extreme point of P. If Ax̄ < b,
then x̄ ± ε1n ∈ P if ε > 0 is sufficiently small. But x̄ =
½(x̄ + ε1n) + ½(x̄ − ε1n), which contradicts that x̄ is an extreme point,
so assume that at least one of the rows in Ax̄ ≤ b is fulfilled with equality.
If Ãx̄ = b̃ is the equality subsystem of Ax̄ ≤ b and rank Ã ≤ n − 1,
[Figure: an unbounded polyhedron P in R2, bounded by the lines 3x1 − x2 = 0, x1 − x2 = 2, and x1 + x2 = 2.]
Definition 3.20 (cone) A subset C of Rn is a cone if λx ∈ C whenever
x ∈ C and λ > 0.

Example 3.21 (cone) (a) The set { x ∈ Rn | Ax ≤ 0m }, where A ∈
Rm×n, is a cone. Since this set is a polyhedron, this type of cone is
usually called a polyhedral cone.
(b) Figure 3.7(a) illustrates a convex cone and Figure 3.7(b) illustrates a non-convex cone in R2.
of Ãx̄ ≤ b̃. We prove the theorem by induction on the rank of Ã.

Suppose that x̄ ∈ P + C for all x̄ ∈ Q with k ≤ rank Ã ≤ n, and choose
x̄ ∈ Q with rank Ã = k − 1. Then there exists a w ≠ 0n such that Ãw = 0m̃.
If |μ| is sufficiently small it follows that x̄ + μw ∈ Q. (Why?) If x̄ + μw ∈ Q for all μ ∈ R
we must have Aw = 0m, which implies rank A ≤ n − 1, a contradiction.
Suppose that there exists a largest μ+ such that x̄ + μ+w ∈ Q. Then if
Ã+(x̄ + μ+w) = b̃+ is the equality subsystem of A(x̄ + μ+w) ≤ b we must
have rank Ã+ ≥ k. (Why?) By the induction hypothesis it then follows
that x̄ + μ+w ∈ P + C. On the other hand, if x̄ + μw ∈ Q for all μ ≥ 0
then Aw ≤ 0m, so w ∈ C. Similarly, if x̄ + μ(−w) ∈ Q for all μ ≥ 0
then −w ∈ C, and if there exists a largest μ− such that x̄ + μ−(−w) ∈ Q
then x̄ + μ−(−w) ∈ P + C.
Above we got a contradiction if none of μ+ or μ− existed. If only
one of them exists, say μ+, then x̄ + μ+w ∈ P + C and −w ∈ C, and
it follows that x̄ ∈ P + C. Otherwise, if both μ+ and μ− exist, then
x̄ + μ+w ∈ P + C and x̄ + μ−(−w) ∈ P + C, and x̄ can be written as
a convex combination of these points, which gives x̄ ∈ P + C. We have
shown that x̄ ∈ P + C for all x̄ ∈ Q with k − 1 ≤ rank Ã ≤ n, and the
theorem follows by induction.
Example 3.23 (illustration of the Representation Theorem) Figure 3.8(a)
shows a bounded polyhedron. The interior point x̄ can be written as a
convex combination of the extreme point x5 and the point v on the
boundary, that is, there is a λ ∈ (0, 1) such that

x̄ = λx5 + (1 − λ)v.

The point v lies on the halfline { x ∈ R2 | x = x2 + μ(x1 − x2),
μ ≥ 0 }. All the points on this halfline are feasible, which gives that if the
polyhedron is given by { x ∈ R2 | Ax ≤ b } then

A(x2 + μ(x1 − x2)) = Ax2 + μA(x1 − x2) ≤ b,  μ ≥ 0.

But then we must have that A(x1 − x2) ≤ 02, since otherwise some component of μA(x1 − x2) tends to infinity as μ tends to infinity. Therefore
x1 − x2 lies in the cone C := { x ∈ R2 | Ax ≤ 02 }. Now there exists a
μ ≥ 0 such that

v = x2 + μ(x1 − x2),

and it follows that

x̄ = λx3 + (1 − λ)x2 + (1 − λ)μ(x1 − x2),

so since (1 − λ)μ ≥ 0 and x1 − x2 ∈ C, x̄ is the sum of a point in the
convex hull of the extreme points and a point in the polyhedral cone C.
Note that the representation of a vector x̄ in a polyhedron is normally
not uniquely determined; in the case of Figure 3.8(a), for example, we
can also represent x̄ as a convex combination of x1, x4, and x5.
[Figure 3.8: (a) a bounded polyhedron with extreme points x1, . . . , x5; (b) an unbounded polyhedron with extreme points x1, . . . , x4; in both, an interior point x̄ is marked.]

3.2.4

The Separation Theorem and Farkas' Lemma
[Figure: the vector y = (1.5, 1.5)T separated from the cone C by the hyperplane πTx = x1 + x2 = 2.]
such that

(π, α) = λ1(a1, b1) + · · · + λm(am, bm),

where ∑_{i=1}^{m} λi = 1. Therefore,

πTx̄ = λ1(a1)Tx̄ + · · · + λm(am)Tx̄ ≤ λ1b1 + · · · + λmbm = α.

But this is a contradiction, since πTx̄ > α. So x̄ ∈ P, which completes
the proof.
We introduce the concept of finitely generated cones. In the proof
of Farkas' Lemma below we will use that finitely generated cones are
convex and closed, and in order to show this fact we prove that finitely
generated cones are polyhedral sets.

Definition 3.27 (finitely generated cone) A finitely generated cone is
one that is generated by a finite set, that is, a cone of the form

cone {v1, . . . , vm} := { λ1v1 + · · · + λmvm | λ1, . . . , λm ≥ 0 },

where v1, . . . , vm ∈ Rn. Note that if A ∈ Rm×n, then the set { y ∈ Rm |
y = Ax; x ≥ 0n } is a finitely generated cone.

Recall that a cone that is a polyhedron is called a polyhedral cone.
We show that a finitely generated cone is always a polyhedral cone and
vice versa.

Theorem 3.28 A convex cone in Rn is finitely generated if and only if
it is polyhedral.
Proof. [=⇒] Assume that C is the finitely generated cone
cone {v1, . . . , vm},
where v1, . . . , vm ∈ Rn. From Theorem 3.26 we know that polytopes
are polyhedral sets, so conv {0n, v1, . . . , vm} is the solution set of some
linear inequalities

(a1)Tx ≤ b1, . . . , (ak)Tx ≤ bk.  (3.2)
Theorem 3.30 (Farkas' Lemma) Let A ∈ Rm×n and b ∈ Rm. Then,
exactly one of the systems

Ax = b,   (I)
x ≥ 0n,

and

ATπ ≤ 0n,   (II)
bTπ > 0,

has a solution, and the other one is inconsistent.  (3.3)
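Farkas' Lemma can be checked computationally with any LP solver: system (I) is a feasibility LP, and when it fails, a certificate π for system (II) can be found from a bounded auxiliary LP. A sketch of ours, assuming Python with SciPy (the normalization bTπ ≤ 1 is our own device to keep the auxiliary LP bounded):

    import numpy as np
    from scipy.optimize import linprog

    def farkas_alternative(A, b):
        m, n = A.shape
        # System (I): Ax = b, x >= 0 (zero objective: pure feasibility)
        r1 = linprog(c=np.zeros(n), A_eq=A, b_eq=b, bounds=[(0, None)] * n)
        if r1.success:
            return "I", r1.x
        # System (II): A^T pi <= 0, b^T pi > 0; normalize by b^T pi <= 1
        r2 = linprog(c=-b, A_ub=np.vstack([A.T, b]),
                     b_ub=np.concatenate([np.zeros(n), [1.0]]),
                     bounds=[(None, None)] * m)
        return "II", r2.x

    A = np.array([[1.0, 1.0], [0.0, 1.0]])
    print(farkas_alternative(A, np.array([2.0, 1.0])))   # (I) solvable
    print(farkas_alternative(A, np.array([0.0, 1.0])))   # only (II) solvable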
3.3

Convex functions

Definition 3.31 (convex function) Suppose that S ⊆ Rn. A function
f : Rn → R ∪ {+∞} is convex at x̄ ∈ S if

x ∈ S, λ ∈ (0, 1), λx̄ + (1 − λ)x ∈ S  =⇒  f(λx̄ + (1 − λ)x) ≤ λf(x̄) + (1 − λ)f(x).

The function f is convex on S if it is convex at every x̄ ∈ S.
In other words, a convex function is such that a linear interpolation
never is lower than the function itself.1
From the definition it follows that a function f : Rn → R ∪ {+∞} is
convex on a convex set S ⊆ Rn if and only if

x1, x2 ∈ S, λ ∈ (0, 1)  =⇒  f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2).
Definition 3.32 (concave function) Suppose that S ⊆ Rn. A function
f : Rn → R ∪ {−∞} is concave at x̄ ∈ S if −f is convex at x̄.
The function f is concave on S if it is concave at every x̄ ∈ S.

Definition 3.33 (strictly convex/concave function) A function f : Rn →
R ∪ {+∞} is strictly convex at x̄ ∈ S if

x ∈ S, x ≠ x̄, λ ∈ (0, 1), λx̄ + (1 − λ)x ∈ S  =⇒  f(λx̄ + (1 − λ)x) < λf(x̄) + (1 − λ)f(x).
In other words, a strictly convex function is such that a linear interpolation is strictly above the function itself.
Figure 3.10 illustrates a strictly convex function.
[Figure 3.10: a strictly convex function f; the chord value λf(x1) + (1 − λ)f(x2) lies strictly above f(λx1 + (1 − λ)x2).]

1 Words like "lower" and "above" should be understood in the sense of a comparison between the y-coordinates of the respective functions at the same x-coordinate.
Example 3.34 (convex functions) By using the definition of a convex
function, the following can be established:
(a) The function f : Rn → R defined by f(x) := ‖x‖ is convex on
Rn.
(b) Let c ∈ Rn, a ∈ R. The affine function x ↦ f(x) := cTx + a =
∑_{j=1}^{n} cjxj + a is both convex and concave on Rn. The affine functions
are also the only finite functions that are both convex and concave.
Figure 3.11 illustrates a non-convex function.
[Figure 3.11: a non-convex function f; for the points x1, x2 shown, the interpolated value λf(x1) + (1 − λ)f(x2) lies below f(λx1 + (1 − λ)x2).]
holds.
Proof. The proof is left as an exercise.
Proposition 3.36 (convexity of composite functions) Suppose that S ⊆
Rn and P ⊆ R. Let further g : S → R be a function which is convex on
S, and f : P → R be convex and non-decreasing [y ≥ x =⇒ f(y) ≥ f(x)]
on P. Then, the composite function f(g) is convex on the set { x ∈ S |
g(x) ∈ P }.

Proof. Let x1, x2 ∈ S ∩ { x ∈ Rn | g(x) ∈ P }, and λ ∈ (0, 1). Then,

f(g(λx1 + (1 − λ)x2)) ≤ f(λg(x1) + (1 − λ)g(x2)) ≤ λf(g(x1)) + (1 − λ)f(g(x2)),

where the first inequality follows from the convexity of g and the property of f being non-decreasing, and the second inequality from the convexity
of f.
The following example functions are important in the development
of penalty methods in linear and nonlinear optimization; their convexity
is crucial when developing a convergence theory for such algorithms.
Example 3.37 (convex composite functions) Suppose that the function
g : Rn → R is convex.
(a) The function x ↦ −log(−g(x)) is convex on the set { x ∈ Rn |
g(x) < 0 }. (This function will be of interest in the analysis of interior
point methods; see Section 13.1.)
(b) The function x ↦ −1/g(x) is convex on the set { x ∈ Rn | g(x) <
0 }.
[Note: This function is convex, but the above rule for composite
functions cannot be used. Utilize the definition of a convex function
instead. The domain of the function must here be limited, because
x ↦ 1/x is convex only for positive x.]
(c) The function x ↦ −1/log(−g(x)) is convex on the set { x ∈ Rn |
g(x) < −1 }.
[Note: This function is convex, but the above rule for composite
functions cannot be used. Utilize the definition of a convex function
instead. The domain of the function must here be limited, because
x ↦ 1/x is convex only for positive x.]
We next characterize the convexity of a function on Rn by the convexity of its epigraph in Rn+1 .
[Note: the graph of a function f : Rn → R is the boundary of epi f,
which still resides in Rn+1. See Figure 3.12 for an example, corresponding to the convex function in Figure 3.10.]

Definition 3.38 (epigraph) The epigraph of a function f : Rn → R ∪
{+∞} is the set

epi f := { (x, α) ∈ Rn × R | f(x) ≤ α }.  (3.4)

Its restriction to a subset S ⊆ Rn is

epiS f := { (x, α) ∈ S × R | f(x) ≤ α }.  (3.5)
Figure 3.12: A convex function and its epigraph.
Theorem 3.39 Suppose that S ⊆ Rn is a convex set. Then, the function f : Rn → R ∪ {+∞} is convex on S if, and only if, its epigraph
restricted to S is a convex set in Rn+1.

Proof. [=⇒] Suppose that f is convex on S. Let (x1, α1), (x2, α2) ∈
epiS f. Let λ ∈ (0, 1). By the convexity of f on S,

f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2) ≤ λα1 + (1 − λ)α2.
Theorem 3.40 (convexity characterizations in C1) Let f ∈ C1 on an
open convex set S.
(a) f is convex on S ⟺ f(y) ≥ f(x) + ∇f(x)T(y − x), for all
x, y ∈ S.
(b) f is convex on S ⟺ [∇f(x) − ∇f(y)]T(x − y) ≥ 0, for all
x, y ∈ S.

The result in (a) states, in words, that every tangent plane to the
function surface in Rn+1 lies on, or below, the epigraph of f, or, that
a first-order approximation is below f.
The result in (b) states that ∇f is monotone on S.
[Note: when n = 1, the result in (b) states that f is convex if and
only if its derivative f′ is non-decreasing.]
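The characterization in (a) is easy to probe numerically. The sketch below (Python with NumPy, our own illustration) samples random pairs of points for the convex log-sum-exp function and confirms the gradient inequality; the specific test function is our choice, not one from the text:

    import numpy as np

    def f(x):                      # log-sum-exp: a smooth convex function
        return np.log(np.exp(x).sum())

    def grad_f(x):                 # its gradient is the softmax vector
        e = np.exp(x - x.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    for _ in range(10000):
        x, y = rng.normal(size=3), rng.normal(size=3)
        # Theorem 3.40(a): f(y) >= f(x) + grad f(x)^T (y - x)
        assert f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-9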
Proof. (a) [=⇒] Take x1, x2 ∈ S and λ ∈ (0, 1). Then,

λf(x1) + (1 − λ)f(x2) ≥ f(λx1 + (1 − λ)x2),

so

f(x1) − f(x2) ≥ (1/λ)[f(x2 + λ(x1 − x2)) − f(x2)]   [λ > 0].

Let λ ↓ 0. Then, the right-hand side of the above inequality tends to the
directional derivative of f at x2 in the direction of (x1 − x2), so that in
the limit it becomes

f(x1) − f(x2) ≥ ∇f(x2)T(x1 − x2).

The result follows.
[⟸=] We have that

f(x1) ≥ f(λx1 + (1 − λ)x2) + (1 − λ)∇f(λx1 + (1 − λ)x2)T(x1 − x2),
f(x2) ≥ f(λx1 + (1 − λ)x2) + λ∇f(λx1 + (1 − λ)x2)T(x2 − x1).

Multiplying the first inequality by λ and the second by 1 − λ and adding
them yields λf(x1) + (1 − λ)f(x2) ≥ f(λx1 + (1 − λ)x2), as required.
(b) [=⇒] follows by adding the two gradient inequalities from (a) for
the pairs (x, y) and (y, x).
[⟸=] By the mean-value theorem,

f(x2) = f(x1) + ∇f(x)T(x2 − x1),  (3.6)

where x = λx1 + (1 − λ)x2 for some λ ∈ (0, 1). By assumption, [∇f(x) −
∇f(x1)]T(x − x1) ≥ 0, so (1 − λ)[∇f(x) − ∇f(x1)]T(x2 − x1) ≥ 0. From
this follows that ∇f(x)T(x2 − x1) ≥ ∇f(x1)T(x2 − x1). By using this
inequality and (3.6), we get that f(x2) ≥ f(x1) + ∇f(x1)T(x2 − x1).
We are done.
Figure 3.13 illustrates part (a) of Theorem 3.40.
Figure 3.13: A tangent plane, y ↦ f(x̄) + ∇f(x̄)(y − x̄), to the graph of a convex function.
differentiability of f,

f(x̄ + p) ≥ f(x̄) + ∇f(x̄)Tp,  (3.7)

f(x̄ + αp) = f(x̄) + α∇f(x̄)Tp + ½α²pT∇²f(x̄)p + o(α²).  (3.8)
Theorem 3.42 (convexity characterizations in C2, II) Let S ⊆ Rn be a
nonempty convex set and f : Rn → R be in C2 on Rn. Let C be the
subspace parallel to the affine hull of S. Then,

f is convex on S  ⟺  pT∇²f(x)p ≥ 0 for all x ∈ S and all p ∈ C.  (3.11)

Definition 3.43 (level set) For a function f : Rn → R and b ∈ R, the
level set of f with respect to the value b is levf(b) := { x ∈ Rn | f(x) ≤ b }.

[Figure: a function f and its level set levf(b) at the level b.]
Proposition 3.44 (convex level sets from convex functions) Suppose that
the function g : Rn → R is convex. Then, for every value of b ∈ R, the
level set levg(b) is a convex set. It is moreover closed.

Proof. The result follows immediately from the definitions of a convex
set and a convex function. Let x1, x2 both satisfy the constraint that
g(x) ≤ b holds, and let λ ∈ (0, 1). (If two such points x1, x2 cannot be
found, then the result holds vacuously.) Then, by the convexity of g,
g(λx1 + (1 − λ)x2) ≤ λb + (1 − λ)b = b, so the set levg(b) is convex.
The fact that a convex function which is defined on Rn is continuous
establishes that the set levg(b) is always closed. (Why?)
Definition 3.45 (convex problem) Suppose that the set X ⊆ Rn is
closed and convex. Suppose further that f : Rn → R is convex and
that the functions gi : Rn → R, i ∈ I, are convex. Suppose, finally, that
the functions gi : Rn → R, i ∈ E, are affine. Then, the problem to

minimize f(x),
subject to gi(x) ≤ 0,  i ∈ I,
gi(x) = 0,  i ∈ E,
x ∈ X,

is called a convex problem.
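As a concrete instance of Definition 3.45, the sketch below (Python with SciPy; the problem instance is our own) minimizes the convex function ‖x‖² subject to one convex inequality and one affine equality. Note that SciPy's "ineq" convention is fun(x) ≥ 0, so a constraint gi(x) ≤ 0 is passed as −gi:

    import numpy as np
    from scipy.optimize import minimize

    f = lambda x: x @ x                      # convex objective
    g1 = lambda x: x[0]**2 - x[1]            # convex: g1(x) <= 0
    h = lambda x: x[0] + x[1] - 1            # affine:  h(x)  = 0

    cons = [{"type": "ineq", "fun": lambda x: -g1(x)},
            {"type": "eq",   "fun": h}]
    res = minimize(f, x0=np.zeros(2), constraints=cons, method="SLSQP")
    print(res.x)    # close to (0.5, 0.5); for convex problems any local
                    # minimum is a global one (cf. the implication (1.6))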
3.4

Application: the projection of a vector onto a convex set
[Figure: two points w and z outside a convex set S, together with their Euclidean projections ProjS(w) and ProjS(z) on S.]
The vector w − ProjS(w) clearly is normal to the set S. The point z
has the Euclidean projection ProjS (z), but there are also several other
vectors with the same projection; the figure shows in a special shading
the set of vectors z which all have that same projection onto S. This set
is a cone, which we refer to as the normal cone to S at x = ProjS (z).
In the case of the point ProjS (w) the normal cone reduces to a ray
which of course is also a cone. (The difference between these two sets
is largely the consequence of the fact that there is only one constraint
active at ProjS (w), while there are two constraints active at ProjS (z);
when developing the KKT conditions in Chapter 5 we shall see how
strongly the active constraints influence the appearance of the optimality
conditions.)
We will also return to this image already in Section 4.6.3, because it
contains the building blocks of the optimality conditions for an optimization problem with an objective function in C 1 over a closed convex set.
For now, we will establish only one property of the projection operation
ProjS , namely that the distance function, distS , defined by
distS(x) := ‖x − ProjS(x)‖,  x ∈ Rn,  (3.12)
is a convex function on Rn . In particular, then, this function is continuous. (Later, we will establish also that the projection operation ProjS is
a well-defined operation whenever S is nonempty, closed and convex, and
that the operation has particularly nice continuity properties. Before we
can do so, however, we need to establish some results on the existence
of optimal solutions.)
Let x1, x2 ∈ Rn, and λ ∈ (0, 1). Then,

distS(λx1 + (1 − λ)x2) = ‖(λx1 + (1 − λ)x2) − ProjS(λx1 + (1 − λ)x2)‖
≤ ‖(λx1 + (1 − λ)x2) − [λ ProjS(x1) + (1 − λ) ProjS(x2)]‖
≤ λ‖x1 − ProjS(x1)‖ + (1 − λ)‖x2 − ProjS(x2)‖
= λ distS(x1) + (1 − λ) distS(x2),

where the first inequality holds because λ ProjS(x1) + (1 − λ) ProjS(x2) ∈ S
by the convexity of S, while ProjS(λx1 + (1 − λ)x2) is a point in S of
minimum distance; the second inequality is the triangle inequality.
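For simple sets the projection is available in closed form, which makes the convexity of distS easy to observe numerically. The following sketch (Python with NumPy; the box example is our own choice) projects onto a box by componentwise clipping and tests the inequality just derived on random segments:

    import numpy as np

    lo, hi = 0.0, 1.0                       # the box S = [0,1]^2

    def proj(x):                            # Euclidean projection onto S
        return np.clip(x, lo, hi)

    def dist(x):                            # the function (3.12)
        return np.linalg.norm(x - proj(x))

    rng = np.random.default_rng(1)
    for _ in range(10000):
        x1, x2 = rng.normal(size=2, scale=3), rng.normal(size=2, scale=3)
        lam = rng.uniform()
        z = lam * x1 + (1 - lam) * x2
        assert dist(z) <= lam * dist(x1) + (1 - lam) * dist(x2) + 1e-9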
3.5

Notes and further reading

3.6

Exercises
x2 1,
x3 2,
x2 + x3 2.
Exercise 3.4 (existence of extreme points in LPs) Let A ∈ Rm×n be such that
rank A = m, and let b ∈ Rm. Show that if the polyhedron

P := { x ∈ Rn | Ax = b; x ≥ 0n }

is nonempty, then it has at least one extreme point.
x1 x2 1;
x1 x2 0;
x1 x2 1 },
x1 x2 0 },
and P be the convex hull of the extreme points of Q. Show that the feasible
point x̄ = (1, 1)T can be written as

x̄ = p + c,

where p ∈ P and c ∈ C.
Exercise 3.6 (separation) Show that there is only one hyperplane in R3 which
separates the disjoint closed convex sets A and B defined by

A := { (0, x2, 1)T | x2 ∈ R },
B := { x ∈ R3 | x ≥ 03; x1x2 ≥ x3² }.
Exercise 3.8 (application of Farkas' Lemma) In a paper submitted for publication in an operations research journal, the author considered the set

P := { (xT, yT)T ∈ Rn+m | Ax + By ≤ b; x ≥ 0n; y ≥ 0m },
Exercise 3.11 (convex functions) Let a > 0. Consider the following functions
in one variable:
(a) f(x) := ln x, for x > 0;
(b) f(x) := −ln x, for x > 0;
(c) f(x) := −ln(1 − e^{−ax}), for x > 0;
(d) f(x) := ln(1 + e^{ax});
(e) f(x) := e^{ax};
(f) f(x) := x ln x, for x > 0.
Which of these functions are convex (or, strictly convex)?
Exercise 3.12 (convex functions) Consider the following functions:
(a) f(x) := ln(e^{x1} + e^{x2});
(b) f(x) := ln ∑_{j=1}^{n} e^{aj xj}, where aj, j = 1, . . . , n, are constants;
(c) f(x) := √(∑_{j=1}^{n} xj²);
(d) f(x) := −(∏_{j=1}^{n} xj)^{1/n}, for xj > 0, j = 1, . . . , n.
½y² + 3xy.
a convex set?
Exercise 3.16 (convex sets) Is the set defined by

S := { x ∈ R2 | x1 − x2² ≥ 1; x1³ + x2² ≤ 10; 2x1 + x2 ≤ 8; x1 ≥ 1; x2 ≥ 0 }

a convex set?
Exercise 3.17 (convex problem) Suppose that the function g : Rn → R is
convex on Rn and that d ∈ Rn. Is the problem to

maximize −∑_{j=1}^{n} xj²,
subject to −1/ln(−g(x)) ≤ 0,
dTx = 2,
g(x) ≤ −2,
x ≥ 0n

a convex problem?
Exercise 3.18 (convex problem) Is the problem to

maximize x1 ln x1,
subject to x1² + x2² ≤ 0,
x ≥ 02

a convex problem?
Part III
Optimality Conditions
An introduction to
optimality conditions
4.1
minimize f(x),  (4.1a)
subject to x ∈ S,  (4.1b)

where f : Rn → R is a given function and S ⊆ Rn.

Definition 4.1 (global minimum) Consider the problem (4.1). We say
that x∗ ∈ S is a global minimum of f over S if

f(x∗) ≤ f(x),  x ∈ S,  (4.2)

holds.
Let Bε(x∗) := { y ∈ Rn | ‖y − x∗‖ < ε } be the open Euclidean ball
with radius ε centered at x∗.

Definition 4.2 (local minimum) Consider the problem (4.1). Let x∗ ∈ S.
(a) We say that x∗ is a local minimum of f over S if there exists a
small enough ball intersected with S around x∗ such that it is a globally
optimal solution in that smaller set.
In other words, x∗ ∈ S is a local minimum of f over S if

∃ε > 0 such that f(x∗) ≤ f(x),  x ∈ S ∩ Bε(x∗).  (4.3)
λf(x̄) + (1 − λ)f(x∗) < f(x∗). Choosing λ > 0 small enough then leads
to a contradiction to the local optimality of x∗.

There is an intuitive image that can be seen from the proof design:
If x∗ is a local minimum, then f cannot go down-hill from x∗ in any
direction, but if x̄ has a lower value, then f has to go down-hill sooner
or later. No convex function can have this shape.
4.2
4.2.1
We first pave the way for a classic result from calculus: Weierstrass
Theorem.
Definition 4.5 (weakly coercive, coercive functions) Let S ⊆ Rn be a
nonempty and closed set, and f : S → R be a given function.
(a) We say that f is weakly coercive with respect to the set S if S is
bounded or for every N > 0 there exists an M > 0 such that f(x) ≥ N
whenever ‖x‖ ≥ M.
In other words, f is weakly coercive if either S is bounded or

lim_{‖x‖→∞, x∈S} f(x) = ∞

holds.
(b) We say that f is coercive with respect to the set S if S is bounded
or for every N > 0 there exists an M > 0 such that f(x)/‖x‖ ≥ N
whenever ‖x‖ ≥ M.
In other words, f is coercive if either S is bounded or

lim_{‖x‖→∞, x∈S} f(x)/‖x‖ = ∞

holds.
The weak coercivity of f : S → R is (for nonempty closed sets S)
equivalent to the property that f has bounded level sets restricted to S
(cf. Definition 3.43). (Why?)
A coercive function grows faster than any linear function. In fact, for
convex functions f, f being coercive is equivalent to x ↦ f(x) − aTx
xk → x̄  =⇒  f(x̄) ≤ lim inf_{k→∞} f(xk).

(b) The function f is said to be upper semi-continuous at x̄ ∈ S if
the value f(x̄) is greater than or equal to every limit of f as xk → x̄.
In other words, f is upper semi-continuous at x̄ ∈ S if

xk → x̄  =⇒  f(x̄) ≥ lim sup_{k→∞} f(xk).

Figure 4.3: A lower semi-continuous function in one variable.
Establish the following important relations:
1 For example, in Section 6.3.2 we suppose that the ground set X is compact in
order for the Lagrangian dual function q to be finite. It is possible to replace the
boundedness condition on X with a coercivity condition on f .
(The infimum of f over S is the lowest limit of all sequences of the form
{f(xk)} with {xk} ⊂ S, so such a sequence of vectors xk is what we
here are choosing.)
Due to the boundedness of S, the sequence {xk} must have limit
points, all of which lie in S because of the closedness of S. Let x̄ be an
arbitrary limit point of {xk}, corresponding to the subsequence K ⊆ Z+.
Then, by the lower semi-continuity of f,

f(x̄) ≤ lim inf_{k∈K} f(xk) = lim_{k∈K} f(xk) = inf_{x∈S} f(x).
4.2.2
Non-standard results
Ex = d },
(4.4)
Ep = 0 }.
(4.5)
1 T
x Qx + q T x,
2
x Rn ,
(4.6)
81
q T p 0 }.
qTp 0
holds.
The statement in (c) shows that the conditions for the existence of
an optimal solution in the case of convex quadratic programs are milder
than in the general convex case. In the latter case, we can state a slight
improvement over the Weierstrass Theorem 4.7 that if, in the problem
(4.1), f is convex on S where the latter is nonempty, closed and convex,
then the problem has a nonempty, convex and compact set of globally
optimal solutions if and only if recS recf = {0n }. The improvements in
the above results for polyhedral, in particular quadratic, programs stem
from the fact that convex polynomial functions cannot be lower bounded
and yet not have a global minimum.
[Note: Consider the special case of the problem (4.1) where f (x) :=
1/x and S := [1, +). It is clear that f is bounded from below on S,
in fact by the value zero which is the infimum of f over S, but it never
attains the value zero on S, and therefore this problem has no optimal
solution. Of course, f is not a polynomial function.]
3 Check that this cone actually is independent of the value of b under this only
requirement. Also confirm that if the level set levf (b) is (nonempty and) bounded
for some b R then it is bounded for every b R, thanks to the convexity of f .
82
qT p 0
holds.
Corollary 4.10 will in fact be established later on in Theorem 8.10, by
the use of polyhedral convexity, when we specialize our treatment of nonlinear optimization to that of linear optimization. Since we have already
established the Representation Theorem 3.22, proving Corollary 4.10 for
the case of LP will be easy: since the objective function is linear, every
feasible direction p recS with q T p < 0 leads to an unbounded solution
from any vector x S.
4.2.3
f (x),
subject to
x P,
(4.7)
83
4.3
(4.8)
x R
Note that
f (x) =
f (x) n
xj
j=1
(x )
so the requirement thus is that fx
= 0, j = 1, . . . , n.
j
Just as for the case n = 1, we refer to this condition as x being a
stationary point of f .
[Note: For n = 1, Theorem 4.14 reduces to: x R is a local minimum = f (x ) = 0.]
holds.
Proof. Since f is in C 1 around x, we can construct a Taylor expansion
of f , as above:
f (x + p) = f (x) + f (x)T p + o().
Since f (x)T p < 0, we obtain that f (x + p) < f (x) for all sufficiently
small values of > 0.
Notice that at a point x Rn there may be other descent directions
p Rn beside those satisfying that f (x)T p < 0; in Example 11.2(b) we
show how directions of negative curvature stemming from eigenvectors
corresponding to negative eigenvalues of the Hessian matrix 2 f (x) can
be utilized.
If f in addition is convex then the opposite implication in the above
proposition is true, thus making the descent property equivalent to the
property that the directional derivative is negative. Since this result can
be stated also for non-differentiable functions f (in which case we must
of course replace the expression f (x)T p with the classic expression
for the directional derivative, f (x; p) := lim0+ 1 [f (x + p) f (x)]),
we shall relegate the proof of this equivalence to our presentation of
the subdifferentiability analysis of convex functions in Section 6.3.1, in
particular to Proposition 6.18.
If f has stronger differentiability properties, then we can say even
more what a local optimum must be like.
Theorem 4.17 (necessary optimality conditions, C 2 case) Suppose that
f : Rn R is in C 2 on Rn . Then,
f (x ) = 0n
x is a local minimum of f =
2 f (x ) is positive semidefinite.
86
2 T 2
p f (x )p + o(2 ).
2
>0
> f (x )
for all small enough > 0. As p was arbitrary, the above implies that
x is a strict local minimum of f over Rn .
We naturally face the following question: When is a stationary point
a global minimum? The answer is given next. (Investigate the connection between this result and the Fundamental Theorem 4.3.)
87
f (x ) = 0n .
Proof. [=] This has already been shown in Theorem 4.14, since a
global minimum is a local minimum.
[=] The convexity of f yields that for every y Rn ,
f (y) f (x ) + f (x )T (y x ) = f (x ),
where the equality stems from the property that f (x ) = 0n .
4.4
(4.9a)
subject to x S,
(4.9b)
i E;
gi (x) 0,
i I },
Ap 0m },
as stated in (4.5). Note moreover that the above set recS represents the
cone C in the Representation Theorem 3.22.4
4 While that theorem was stated for sets defined only by linear inequalities, we can
always rewrite the equalities Ex = d as Ex d, Ex d; the corresponding
feasible directions are then given by Ep 0 , Ep 0 , that is, Ep = 0 .
89
xS
(4.10)
holds.
Proof. (a) We again utilize the Taylor expansion of f around x :
f (x + p) = f (x ) + f (x )T p + o().
The proof is by contradiction. As was shown in Proposition 4.16, if there
is a direction p for which it holds that f (x )T p < 0, then f (x +p) <
f (x ) for all sufficiently small values of > 0. It suffices here to state
that p should also be a feasible direction in order to reach a contradiction
to the local optimality of x .
(b) If S is convex then every feasible direction p can be written as a
positive scalar times the vector x x for some vector x S. (Why?)
The expression (4.10) then follows from the statement in (a).
The inequality (4.10) is sometimes referred to as a variational inequality. We will utilize it for several purposes: (i) to derive equivalent
optimality conditions involving a linear optimization problem as well as
the Euclidean projection operation ProjS introduced in Section 3.4; (ii)
to derive descent algorithms for the problem (4.9) in Section 12.2 and
12.4; (iii) to derive a near-optimality condition for convex optimization
problems in Section 4.5; and (iv) we will extend it to non-convex sets in
the form of the KarushKuhnTucker conditions in Chapter 5.
In Theorem 4.14 we established that for unconstrained C 1 optimization the necessary optimality condition is that f (x ) = 0n holds. Notice that that is exactly what becomes of the variational inequality (4.10)
when S = Rn , because the only way in which that inequality can hold
90
(4.10) holds.
Proof. [=] This has already been shown in Proposition 4.23(b), since
a global minimum is a local minimum.
[=] The convexity of f yields [cf. Theorem 3.40(a)] that for every
y S,
f (y) f (x ) + f (x )T (y x ) f (x ),
where the second inequality stems from (4.10).
First, we will provide the connection to the Euclidean projection of
a vector onto a convex set, discussed in Section 3.4. We claim that the
property (4.10) is equivalent to
x = ProjS [x f (x )],
(4.11)
1
kx zk2 .
2
(4.12)
The necessary optimality conditions for this problem, as stated in Proposition 4.23(b), is that
h(x)T (y x) 0,
y S,
(4.13)
91
y S,
that is, a statement identical to (4.10). The characterization (4.11) is interesting in that it states that if x is not stationary, then the projection
operation defined therein must provide a step away from x ; this step
will in fact yield a reduced value of f under some additional conditions
on the step length , and so it defines a descent algorithm for (4.9); see
Exercise 4.5, and the text in Section 12.4.
So far, we have two equivalent characterizations of a stationary point
of f at x : (4.10) and (4.11). The following one is based on a linear
optimization problem.
Notice that (4.10) states that f (x )T x f (x )T x for every
x S. Since we obtain an equality by setting x = x we see that x in
fact is a globally optimal solution to the problem to
minimize f (x )T x.
x S
(4.14)
x, with
then every direction of the form p := y
arg minimum f (x)T (y x),
y
y S
y S.
y S.
The interpretation of this inequality is that the angle between the two
vectors z x (the vector that points towards the point being projected)
and the vector y x (the vector that points towards any vector y S)
is 90 . So, the projection operation has the characterization
[z ProjS (z)]T (y ProjS (z)) 0,
y S.
(4.15)
x f (x )
1
0
0
1
y
S
Figure 4.4: Normal cone characterization of a stationary point.
Here, the point being projected is z = x f (x ), as used in the
characterization of stationarity.
What is left to complete the picture is to define the normal cone,
NS (x ), to S at x , depicted in Figure 4.4 in the lighter shade.
93
(4.17)
What this condition states geometrically is that the angle between the
negative gradient and any feasible direction is 90 , which, of course,
whenever f (x ) 6= 0n , is the same as stating that at x there exist
no feasible descent directions. The four conditions (4.10), (4.11), (4.14),
and (4.17) are equivalent, and so according to Theorem 4.24 they all are
also both necessary and sufficient for the global optimality of x as soon
as f is convex.
We remark that in the special case when S is an affine subspace (such
as the solution set of a number of linear equations, S := { x Rn | Ex =
d }), the statement (4.17) means that at a stationary point x , f (x )
is parallel to a normal of the subspace.
The normal cone inclusion (4.17) will later be extended to more general sets, where S is described by a finite collection of possibly nonconvex constraints. The extension will lead us to the famous Karush
KuhnTucker conditions in Chapter 5. [It turns out to be much more
convenient to extend (4.17) than the other three characterizations of
stationarity.]
We finish this section by proving a proposition on the behaviour of
the gradient of the objective function f on the solution set S to convex
problems of the form (4.1). The below result shows that f enjoys a
stability property, and it also extends the result from the unconstrained
case where the value of f always is zero on the solution set.
Proposition 4.26 (invariance of f on the solution set of convex programs) Suppose that S Rn is convex and that f : Rn R is convex
and in C 1 on S. Then, the value of f (x) is constant on the optimal
solution set S .
Further, suppose that x S . Then,
S = { x S | f (x )T (x x ) = 0 and f (x) = f (x ) }.
94
x Rn .
(4.18)
x Rn ,
4.5
(4.19)
The equality follows by definition; the first inequality stems from the
solves the linear minimization problem, while the vector x
fact that y
may not; the second inequality follows from the convexity of f on S [cf.
Theorem 3.40(a)]; the final inequality follows from the global optimality
of x and the feasibility of x.
From the above, we obtain a closed interval wherein we know that the
optimal value of the problem (4.9) lies. Let f := minimumx S f (x) =
arg minimumy S z(y),
f (x ). Then, for every x S and y
f [f (x) + z(
y ), f (x)].
(4.20)
95
4.6
4.6.1
Applications
Continuity of convex functions
A remarkable property of any convex function is that without any additional assumptions it can be shown to be continuous relative to any open
convex set in the intersection of its effective domain (that is, where the
function has a finite value) and its affine hull.5 We establish a slightly
weaker special case of this result below, in which relative interior is
replaced by interior for simplicity.
Theorem 4.27 (continuity of convex functions) Suppose that f : Rn
R {+} is a convex function, and consider an open convex subset S
of its effective domain. The function f is continuous on S.
S. To establish continuity of f at x
, we must show
Proof. Let x
k implies that
that given > 0, there exists > 0 such that kx x
|f (x) f (
x)| . We establish this property in two parts, by showing
(cf. Definition 4.6).
that f is both lower and upper semi-continuous at x
[upper semi-continuity] By the openness of S, there exists > 0 with
k , implying x S. Construct the value of the scalar as
kx x
follows:
:= maximum {max {|f (
x + ei ) f (
x)|, |f (
x ei ) f (
x)|}} ,
i{1,2,...,n}
(4.22)
5 In
96
Applications
where ei is the ith unit vector in Rn . If = 0 it follows that f is constant
and hence continuous there, so suppose that
in a neighbourhood of x
> 0. Let now
:= minimum
,
.
(4.23)
n n
k . For every i {1, 2, . . . , n},P
Choose an x with kx x
if xi xi
= ni=1 i z i ,
then define z i := ei , otherwise z i := ei . Then, x x
where i 0 for all i. Moreover,
k = kk.
kx x
(4.24)
1X
f (
x + i nz i )
n i=1
n
1X
x + i n(
x + z i )]
f [(1 i n)
n i=1
n
1X
[(1 i n)f (
x) + i nf (
x + z i )].
n i=1
Pn
Therefore, f (x) f (
x) i=1 i [f (
x + z i ) f (
x)]. From (4.22) it is
obvious that f (
x + z i ) f (
x) for each i; and since i 0, it follows
that
n
X
f (x) f (
x)
i .
(4.25)
i=1
Noting (4.23), (4.24), it follows that i /n, and (4.25) implies that
k
f (x) f (
x) . Hence, we have so far shown that kx x
implies that f (x) f (
x) . By Definition 4.6(b), f hence is upper
.
semi-continuous at x
k .
[lower semi-continuity] Let y := 2
x x, and note that ky x
Therefore, as above,
f (y) f (
x) .
(4.26)
97
4.6.2
Applications
4.6.3
Euclidean projection
x, y S,
(4.27)
holds.
Theorem 4.32 (the projection operation is non-expansive) Let S be a
nonempty, closed and convex set in Rn . For every x Rn , its projection
ProjS (x) is uniquely defined. The operator ProjS : Rn S is nonexpansive on Rn , and therefore in particular continuous.
Proof. The uniqueness of the operation is the result of the fact that
the function x 7 kx zk2 is both coercive and strictly convex on S,
so there exists a unique optimal solution to the projection problem for
every z Rn . (Cf. Weierstrass Theorem 4.7 and Proposition 4.11,
respectively.)
Next, take x1 , x2 Rn . Then, by the characterization (4.15) of the
Euclidean projection,
[x1 ProjS (x1 )]T (ProjS (x2 ) ProjS (x1 )) 0,
1
0
0
1
1
0
0
1
ProjS (x)
1
0
0
1
ProjS (y)
S
Figure 4.5: The projection operation is non-expansive.
4.6.4
Theory
Applications
Definition 4.33 (contractive operator) Let S Rn be a nonempty set
in Rn . Let f be a mapping from S to S. We say that f is contractive
on S if, as a result of applying the mapping f , the distance between any
two distinct vectors x and y in S decreases.
In other words, the operator f is contractive on S if there exists
[0, 1) such that
kf (x) f (y)k kx yk,
x, y S,
(4.28)
holds.
Clearly, a contractive operator is non-expansive.
In the below result we utilize the notion of a geometric convergence
rate; while its definition is in fact given in the result below, we also refer
to Sections 6.4.1 and 11.10 for more detailed discussions on convergence
rates.
Theorem 4.34 (fixed point theorems) Let S be a nonempty and closed
set in Rn .
(a) [Banachs Theorem] Let f : S S be a contraction mapping.
Then, f has a unique fixed point x S. Further, for every initial
vector x0 S, the iteration sequence {xk } defined by the fixed-point
iteration
xk+1 := f (xk ),
k = 0, 1, . . . ,
(4.29)
converges geometrically to the unique fixed point x . In particular,
kxk x k k kx0 x k,
k = 0, 1, . . . .
p
X
kxk+i xk+i1 k
i=1
p1
101
k = 1, 2, . . . .
Applications
102
Applications
f (x)
1
Figure 4.6: Consider the case S = [0, 1], and a continuous function
f : S S. Brouwers Theorem states that there exists an x S
with f (x ) = x . This is the same as saying that the continuous curve
starting at (0, f (0)) and ending at (1, f (1)) must pass through the line
y = x inside the square.
103
x S.
(4.30)
In order to turn it into a fixed point problem, we construct the following composite operator from Rn to S:
F (x) := ProjS (x f (x)),
x Rn ,
104
Applications
As an exercise, we consider the problem to find an x R such that
f (x) = 0, where f : R R is twice differentiable near a zero of f . The
classic NewtonRaphson algorithm has an iteration formula of the form
x0 R;
xk+1 = xk
f (xk )
,
f (xk )
k = 0, 1, . . . .
w W
w W
v V
(4.31)
In order to prove this theorem we can use the above existence theorem
for variational inequalities. Let
v
AT v
x=
; f (x) =
; S = V W.
w
Aw
It is a reasonably simple exercise to prove that the variational inequality (4.30) with the above identifications is equivalent to the saddle
point conditions, which can also be written as the existence of a pair
(v , w ) V W such that
(v )T Aw (v )T Aw v T Aw ,
(v, w) V W ;
105
4.7
Exercises
tion 12.4) which builds on the property (4.11); see also Exercise 4.5.
More on these algorithms will be said in Chapter 12.
(II) When the set S has an interior point, we may replace the constraints with an interior penalty function which has an asymptote
whenever approaching the boundary, thus automatically ensuring
that iterates stay (strictly) feasible. More on a class of methods
based on this penalty function is said in Chapter 13.
4.8
Exercises
(4.32)
subject to g(x) b,
where f : R R and g : R R are continuous functions, and b R. Suppose that this problem has a globally optimal solution, x , and that g(x ) < b
holds. Consider also the problem to
n
minimize f (x),
subject to
(4.33)
x Rn ,
in which we have removed the constraint. The question is under which circumstances this is vaild.
(a) Show by means of a counter-example that the vector x may not solve
(4.33); in other words, in general we cannot throw away a constraint that is
not active without affecting the optimal solution, even if it is inactive. Hence,
it would be wrong to call such constraints redundant.
(b) Suppose now that f is convex on Rn . Show that x solves (4.33).
Exercise 4.2 (unconstrained optimization, exam 020826) Consider the unconstrained optimization problem to minimize the function
f (x) :=
3 2
(x1 + x22 ) + (1 + a)x1 x2 (x1 + x2 ) + b
2
over R2 , where a and b are real-valued parameters. Find all values of a and b
such that the problem has a unique optimal solution.
Exercise 4.3 (spectral theory and unconstrained optimization) Let A be a symmetric n n matrix. For x Rn , x 6= 0n , consider the function (x) :=
xTTAx , and the related optimization problem to
x x
minimize
(x).
n
x 6=0
(P)
Determine the stationary points as well as the global minima in the problem (P). Interpret the result in terms of linear algebra.
107
p := ProjS [x f (x)] x.
Notice that
holds. Hence, p is zero if and only if x is stationary [according to the characterization in (4.11)], and if p is non-zero then it defines a feasible descent
direction with respect to f at x.
Exercise 4.6 (optimality conditions for a special problem) Suppose that f
C 1 on the set S := { x Rn | xj 0, j = 1, 2, . . . , n }, and consider
the problem of finding a minimum of f (x) over S. Develop the necessary
optimality conditions for this problem in a compact form.
Exercise 4.7 (optimality conditions for a special problem) Consider the problem to
maximize f (x) := xa1 1 xa2 2 xann ,
subject to
n
X
xj = 1,
j=1
xj 0,
j = 1, . . . , n,
108
Exercises
Suppose that S Rn and that f : Rn R is continuous on S.
(a) Suppose further that f is in C 1 on S. We say that the function f is
pseudo-convex on S if, for every x, y S,
f (x)T (y x) 0
f (y ) f (x).
x S.
(4.34)
xk+1
1
= g(xk ) :=
2
c
xk +
xk
k = 0, 1, . . . .
109
x 7 f (x) + kx v k over Rn .
f (v ) f (z ) < f (z ) + kz vk,
f (v ) + kv xk < f (z ) + kz xk.
Part (c) follows by the triangle inequality. Since v lies in M we have that
f (z ) + kz xk f (v ) + kv xk,
z Rn .
f (v ) + f + f (x) f (v ) + kv xk.
at an arbitrary point
z 7
kz f (z )k,
+,
if z S,
otherwise
:= kx f (x)k
and
:=
,
1
110
Optimality conditions
5.1
Optimality conditions are introduced as an attempt to construct an easily verifiable criterion that allows us to examine points in a feasible set,
one after another, and classify them into optimal and non-optimal ones.
Unfortunately, this is impossible in practice, and not only due to the
fact that there are far too many feasible points, but also because it is
impossible to construct such a universal criterion. It is usually possible
to construct either practical (that is, computationally verifiable) conditions that admit some mistakes in the characterization, or perfect ones
which are impossible to use in the computations. It is of course the first
group that is of practical value for us, and it may further be classified
into two distinct subgroups based on the type of mistakes allowed in the
decision-making process. Namely, optimality conditions encountered in
practice are divided into two classes, known as necessary and sufficient
conditions.
Necessary conditions must be satisfied at every locally optimal point;
on the other hand, we cannot guarantee that every point satisfying the
necessary optimality conditions is indeed locally optimal. On the contrary, sufficient optimality conditions provide such guarantees; however,
there may be some locally optimal points that violate the optimality
conditions. Arguably, it is much more important to be able to find a few
candidates for local minima that can be further investigated by other
means, than to eliminate some local (or even global) minima from the
beginning. Therefore, this chapter is dedicated to the development of
necessary optimality conditions. However, for convex optimization problems these conditions turn out to be sufficient.
Optimality conditions
Now, we can concentrate on what should be meant by easily verifiable
conditions. A human being can immediately state whether a given point
belongs to a simple set or not, by just glancing at a picture of it; for a numerical algorithm, a clear algebraic description of a set in terms of equalities and inequalities is vital. Therefore, we start our development with
geometric optimality conditions (Section 5.3), to gain an understanding
about the relationships between the gradient of the objective function
and the feasible set that must hold at every local minimum point. Given
a specific description of a feasible set in terms of inequalities, the geometric conditions immediately imply some relationships between the
gradients of the objective function and the constraints that are active
at the point under consideration (see Section 5.4); these conditions are
known as the Fritz John optimality conditions, and are rather weak (i.e.,
they can be satisfied by many points that have nothing in common with
locally optimal points). However, if we assume an additional regularity of
the system of inequalities and equalities that define our feasible set, then
the geometric optimality conditions imply stronger conditions, known as
the KarushKuhnTucker optimality conditions (see Section 5.5). The
additional regularity assumptions are known under the name constraint
qualifications (CQs), and they vary from very abstract and difficult to
check, but enjoyed by many feasible sets (such as, e.g., Abadies CQ,
see Definition 5.23) to more specific, easily verifiable but also somewhat
restrictive in many situations (such as the linear independence CQ, see
Definition 5.41, or the Slater CQ, see Definition 5.38). In Section 5.8
we show that for convex problems the KKT conditions are sufficient for
local, hence global, optimality.
The contents of this chapter are in principle summarized in the flowchart in Figure 5.1. Various optimality conditions and constraint qualifications that are discussed in this chapter constitute the nodes of the
flow-chart. Logical relationships between them are denoted with edges,
and the direction of the arrow shows the direction of the logical implication; each implication is further labeled with the result that establishes
it. We note that the KKT conditions follow from both geometric
conditions and constraint qualifications satisfied at a given point; also,
global optimality holds if both the KKT conditions are verified and the
optimization problem is convex.
5.2
A note of caution
A note of caution
x locally optimal
x globally optimal
Geometric OC
F (x ) TS (x ) =
Theorem 5.15
Theorem 5.45
Theorem 5.8
Theorem 5.33
Abadies CQ
TS (x ) = G(x ) H(x )
KKT OC (5.17)
Convexity
Proposition 5.36
Proposition 5.44
Affine constraints
Proposition 5.39
Optimality conditions
point to investigate further.
We will illustrate the importance of this with the story of the US Air
Forces controversial B-2 Stealth bomber program in the Reagan era of
the 1980s. There were many design variables, such as the various dimensions, the distribution of volume between the wing and the fuselage,
flying speed, thrust, fuel consumption, drag, lift, air density, etc., that
could be manipulated for obtaining the best range (i.e., the distance it
can fly starting with full tank, without refueling). The problem of maximizing the range subject to all the constraints was modeled as an NLP
in a secret Air Force study going back to the 1940s. A solution to the
necessary optimality conditions of this problem was found; it specified
values for the design variables that put almost all of the total volume in
the wing, leading to the flying wing design for the B-2 bomber. After
spending billions of dollars, building test planes, etc., it was found that
the design solution implemented works, but that its range was too low in
comparison with other bomber designs being experimented subsequently
in the US and abroad.
A careful review of the model was then carried out. The review indicated that all the formulas used, and the model itself, are perfectly valid.
However, the model was a nonconvex NLP, and the review revealed a
second solution to the system of necessary optimality conditions for it,
besides the one found and implemented as a result of earlier studies. The
second solution makes the wing volume much less than the total volume,
and seems to maximize the range; while the first solution that is implemented for the B-2 bomber seems to actually minimize the range. (The
second solution also looked like an aircraft should, while the flying wing
design was counter-intuitive.) In other words, the design implemented
was the aerodynamically worst possible choice of configuration, leading
to a very costly error. The aircraft does fly, but apparently, then, has
the only advantage that it is a stealth plane.
For an account, see Skeleton Alleged in the Stealth Bombers Closet,
Science, vol. 244, 12 May 1989 issue, pages 650651.
5.3
In this section we will discuss the optimality conditions for the following
optimization problem [cf. (4.1)]:
minimize f (x),
subject to x S,
(5.1)
Thus, this is nothing else but the cone containing all feasible directions
at x in the sense of Definition 4.20.
lim xk = x;
(5.3)
115
Optimality conditions
Thus, to construct a tangent cone we consider all the sequences {xk }
in S that converge to the given x Rn , and then calculate all the directions p Rn that are tangential to the sequences at x; such tangential
vectors are described as the limits of {k (xk x)} for arbitrary positive
sequences {k }. Note that to generate a nonzero vector p TS (x) the
sequence {k } must converge to +.
While it is possible that cl RS (x) = TS (x), or even that RS (x) =
TS (x), in general we have only the following proposition, and examples
that follow show that the two cones might be very different.
Proposition 5.3 (relationship between the radial and the tangent cones)
The tangent cone is a closed set, and the inclusion cl RS (x) TS (x)
holds for every x Rn .
Proof. Consider a sequence {pk } TS (x), and assume that {pk } p.
Since every pk TS (x), there exist xk S and k > 0, such that
kxk xk < k 1 and kk (xk x) pk k < k 1 . Then, clearly, {xk } x,
and, by the triangle inequality, kk (xk x) pk kk (xk x) pk k +
kpk pk, and the two terms in the right-hand side converge to 0, which
implies that p TS (x) and thus the latter set is closed.
In view of the closedness of the tangent cone, it is enough to show the
inclusion RS (x) TS (x). Let p RS (x). Then, for all large integers k
it holds that x + k 1 p S, and, therefore, setting xk = x + k 1 p and
k = k we see that p TS (x) as defined by Definition 5.2.
Example 5.4 Let S := { x R2 | x1 0; (x1 1)2 + x22 1 }. Then,
RS (02 ) = { p R2 | p1 > 0 }, and TS (02 ) = { p R2 | p1 0 }, i.e.,
TS (02 ) = cl RS (02 ) (see Figure 5.2).
Example 5.5 (complementarity constraint) Let S := { x R2 | x1
0; x2 0; x1 x2 0 }. In this case, S is a (non-convex) cone, and
RS (02 ) = TS (02 ) = S (see Figure 5.3).
Example 5.6 Let S := { x R2 | x31 +x2 0; x51 x2 0; x2 0 }.
Then, RS (02 ) = , and TS (02 ) = { p R2 | p1 0; p2 = 0 } (see
Figure 5.4).
Example 5.7 Let S := { x R2 | x2 0; (x1 1)2 + x22 = 1 }. Then,
RS (02 ) = , TS (02 ) = { p R2 | p1 = 0; p2 0 } (see Figure 5.5).
We already know that f decreases along any descent direction (cf. Definition 4.15), and that for a vector p Rn it is sufficient to verify the
inequality f (x )T p < 0 to be a descent direction for f at x Rn
116
S
1
S
TS (02 )
1
(a)
(b)
Figure 5.2: (a) The set S obtained as the intersection of the solution set
of two constraints; (b) the tangent cone TS (02 ) (see Example 5.4).
(see Proposition 4.16). Even though this condition is not necessary, it
is very easy to check in practice and therefore we will use it to develop
optimality conditions. Therefore, it would be convenient to define a cone
of such directions (which is empty if f (x ) happens to be 0n ):
F (x ) := { p Rn | f (x )T p < 0 }.
(5.4)
Now we have the necessary notation in order to state and prove the
main theorem of this section.
Theorem 5.8 (geometric necessary optimality conditions) Consider the
optimization problem (5.1). Then,
Optimality conditions
and thus p 6 F (x ).
Combining Proposition 5.3 and Theorem 5.8 we get that
x is a local minimum of f over S
F (x ) RS (x ) = ;
5.4
Theorem 5.8 gives a very elegant criterion for checking whether a given
point x S is a candidate for a local minimum of the problem (5.1),
118
(a)
(b)
Figure 5.4: (a) The set S; (b) the tangent cone TS (02 ) (see Example 5.6).
1
(a)
1
(b)
Figure 5.5: (a) The set S; (b) the tangent cone TS (02 ) (see Example 5.7).
but there is a catch: the set TS (x ) is close to impossible to compute
for general sets S! Therefore, in this section we will use an algebraic
characterization of the set S to compute other cones that we hope could
approximate TS (x ) in many practical situations.
Namely, we assume that the set S is defined as the solution set of a
system of differentiable inequality constraints defined by the functions
gi C 1 (Rn ), i = 1, . . . , m, such that
S := { x Rn | gi (x) 0,
i = 1, . . . , m }.
(5.5)
Optimality conditions
of the KKT conditions. Therefore, we keep this assumption for some
time, and state the KKT system that specifically distinguishes between
the inequality and equality constraints in Section 5.6. We will use the
symbol I(x) to denote the index set of active inequality constraints at
x Rn (see Definition 4.21), and |I(x)| to denote the cardinality of this
set, i.e., the number of active inequality constraints at x Rn .
In order to compute approximations to the tangent cone TS (x), similarly to Example 4.22 we consider cones associated with the active constraints at a given point:
(5.6)
(5.7)
and
The following proposition verifies that G(x) is an inner approximation for RS (x) (and, therefore, for TS (x) as well, see Proposition 5.3),
and G(x) is an outer approximation for TS (x).
0. Let us calculate G(02 ) and G(02 ). Both constraints are satisfied with
equality at the given point, so that I(02 ) = {1, 2}. Then, g1 (02 ) =
Example 5.13 (Example 5.6 continued) The set S is defined by the three
inequality constraints g1 (x) := x31 + x2 0, g2 (x) := x51 x2 0,
g3 (x) := x2 0, which are all active at x = 02 ; g1 (02 ) = (0, 1)T ,
Theorem 5.15 (Fritz John necessary optimality conditions) Let the set
S be defined by (5.5). If x S is a local minimum of f over S then
there exist multipliers 0 R, Rm , such that
0 f (x ) +
m
X
i=1
i gi (x ) = 0n ,
(5.8a)
i gi (x ) = 0,
i = 1, . . . , m,
(5.8b)
0 , i 0,
i = 1, . . . , m,
(5.8c)
T T
(0 , ) 6= 0
m+1
(5.8d)
121
Optimality conditions
In other words,
x local minimum of f over S = (0 , ) R Rm : (5.8) holds.
Proof. Combining the results of Lemma 5.10 with the geometric optimality conditions provided by Theorem 5.8, we conclude that there is
no direction p Rn such that f (x )T p < 0 and gi (x )T p < 0, i
I(x ). Define the matrix A with columns f (x ), gi (x ), i I(x );
Optimality conditions
The fact that 0 may be zero in the system (5.8) essentially means
that the objective function f plays no role in the optimality conditions.
This is of course a rather unexpected and unwanted situation, and the
rest of the chapter is dedicated to describing how one can avoid it.
Since the cone of feasible directions RS (x) may be a bad approxi
mation of the tangent cone TS (x), so may G(x) owing to Lemma 5.10.
Therefore, in the most general case we cannot improve on the conditions (5.8); however, it is possible to improve upon (5.8) if we assume
that the set S is regular in some sense, i.e., that either G(x) or G(x)
is a tight enough approximation of TS (x). Requirements of this type
are called constraint qualifications, and they will be discussed in more
detail in Section 5.7. However, to get a feeling of what can be achieved
with a regular constraint set S, we show that the multiplier 0 in the system (5.8) cannot vanish (i.e., the KKT conditions hold, see Section 5.5) if
ditions of Theorem 5.8, and assume that G(x ) 6= . Then, the multiplier
0 in (5.8) cannot be zero; dividing all equations by 0 we may assume
that it equals one.
Proof. Assume that 0 = 0 in (5.8), and define the matrix A with
columns g
Since A
= 0n , 0|I(x )| , and
i (x ), i I(x ).
Example 5.22 Out of the four Examples 5.45.7, only the first verifies
5.5
f (x ) +
m
X
i=1
i gi (x ) = 0n ,
i gi (x ) = 0,
m
0 .
(5.9a)
i = 1, . . . , m,
(5.9b)
(5.9c)
In other words,
x local minimum of f over S
Abadies CQ holds at x
= Rm : (5.9) holds.
Optimality conditions
0 has no solutions. By Farkas Lemma (cf. Theorem 3.30), the system A = f (x ), 0|I(x )| has a solution. Define the vector
I(x ) = , and i = 0, for i 6 I(x ). Then, the so defined verifies
the KKT conditions (5.9).
Remark 5.26 (terminology) Similarly to the case of the Fritz John necessary optimality conditions, the solutions to the system (5.9) are
known as Lagrange multipliers (or just multipliers) associated with a
given candidate x Rn for a local minimum. The conditions (5.9a)
and (5.9c) are known as the dual feasibility conditions, and (5.9b) as the
complementarity conditions, respectively; this terminology will become
more clear in Chapter 6. Owing to the complementarity constraints, the
multipliers i corresponding to inactive inequality constraints i 6 I(x )
must be zero. In general, the Lagrange multiplier i bears the important information about how sensitive a particular local minimum is with
respect to small changes in the constraint gi .
Remark 5.27 (geometric interpretation) The system of equations and
inequalities defining (5.9) can (and should) be interpreted geometrically
as f (x ) NS (x ) (see Figure 5.6), the latter cone being the normal
cone to S at x S (see Definition 4.25); according to the figure, the
normal cone to S at x is furthermore spanned by the gradients of the
active constraints at x .1
Notice the specific roles played by the different parts of the system (5.9) in this respect: the complementarity conditions (5.9b) force
i to be equal to 0 for the inactive constraints, whence the summation
in the left-hand side of the linear system (5.9a) involves the active constraints only. Further, the sign conditions in (5.9c) ensures that each
vector i gi (x ), i I(x ), is an outward normal to S at x .
Remark 5.28 Note that in the unconstrained case the KKT system (5.9)
reduces to the single requirement f (x ) = 0n , which we have already
encountered in Theorem 4.14.
It is possible to further develop the KKT theory (with some technical
complications) for twice differentiable functions as it has been done for
the unconstrained case in Theorem 4.17. We refer the interested reader
to [BSS93, Section 4.4].
1 Compare with the normal cone characterization (4.17) and Figure 4.4 in the case
of convex feasible sets: we could, roughly, say that the role of a constraint qualification
in the more general context of this chapter is to ensure that the normal cone to the
feasible set at the vector x is a finitely generated convex cone, which moreover is
generated by the gradients of the active constraints describing functions gi at x ,
thus extending the normal cone inclusion in (4.17) to more general sets.
126
f (x)
g2(x)
g1(x)
g1= 0
g2= 0
x
g3= 0
possesses solutions = (1 , 21 (1 1 ))T for every 0 1 1. Therefore, there are infinitely many multipliers, that all belong to a bounded
set.
Example 5.30 (Example 5.5 continued) This is one of the rare cases
when Abadies constraint qualification is violated, and nevertheless the
KKT system happens to be solvable:
1
1 0
+
0
0 1
0
= 02 ,
0
03 ,
127
Optimality conditions
admits solutions = (1, 0, 3 )T for every 3 0. That is, the set of
Lagrange multipliers is unbounded in this case.
Example 5.31 (Example 5.6 continued) Since, for this example, in the
Fritz John system the multiplier 0 is necessarily zero, the KKT system
admits no solutions:
1
0 0
0
+
= 02 ,
0
1 1 1
03 ,
5.6
i = 1, . . . , m;
j = 1, . . . , },
(5.10)
i = 1, . . . , m,
g i ,
gei := him ,
(5.11)
i = m + 1, . . . , m + ,
him ,
i = m + + 1, . . . , m + 2,
128
i = 1, . . . , m + 2 }.
(5.12)
e
Now, let G(x)
be defined by (5.7) for the inequality representation (5.12)
of S. We will use the old notation G(x) for the cone defined only by the
gradients of the functions defining the inequality constraints active at x
in the representation (5.10), and in addition define the null space of the
matrix defined by the gradients of the functions defining the equality
constraints:
H(x) := { p Rn | hi (x)T p = 0,
i = 1, . . . , }.
(5.13)
(5.14)
(5.15)
and thus Abadies constraint qualification (see Definition 5.23) for the
set (5.10) may be equivalently written as
Assuming that the latter constraint qualification holds we can write the
KKT system (5.9) for x S, corresponding to the inequality representation (5.12) (see Theorem 5.25):
m
X
i=1
i gi (x ) +
m+
X
i=m+1
i him (x )
m+2
X
i=m++1
i him (x )
+f (x ) = 0n ,
i gi (x ) = 0, i = 1, . . . , m,
(5.16a)
(5.16b)
i him (x ) = 0,
i = m + 1, . . . , m + ,
(5.16c)
i him (x ) = 0,
i = m + + 1, . . . , m + 2,
(5.16d)
0m+2 .
(5.16e)
e Rm R as
ej =
Define the pair of vectors (e
, )
ei = i , i = 1, . . . , m;
m+j m++j , j = 1, . . . , . We also note that the equations (5.16c)
and (5.16d) are superfluous, because x S implies that hj (x ) = 0,
e known
j = 1, . . . , . Therefore, we get the following system for (e
, ),
as the KKT necessary optimality conditions for the sets represented by
129
Optimality conditions
differentiable equality and inequality constraints:
f (x ) +
m
X
i=1
ei gi (x ) +
X
j=1
ej hj (x ) = 0n ,
(5.17a)
ei gi (x ) = 0, i = 1, . . . , m, (5.17b)
e 0m .
(5.17c)
e Rm R : (5.17) holds.
= (e
, )
Example 5.34 (Example 5.32 revisited) Let us write the system of KKT
conditions for the original representation of the set with one inequality
and one equality constraint (see Example 5.14). As has already been
mentioned, Abadies constraint qualification is satisfied, and therefore,
since an optimum exists, the KKT system is necessarily solvable:
1
0
2
+ 1
+ 1
= 02 ,
0
1
0
1 0,
5.7
Constraint qualifications
Constraint qualifications
5.7.1
MangasarianFromovitz CQ (MFCQ)
5.7.2
Slater CQ
Definition 5.38 (Slater CQ) We say that the system of constraints describing the feasible set S via (5.10) satisfies the Slater CQ, if the functions gi , i = 1, . . . , m, defining the inequality constraints are convex,
the functions hj , j = 1, . . . , , defining the equality constraints are affine
with linearly independent gradients hj (x), j = 1, . . . , , and, finally,
S such that gi (
that there exists x
x) < 0, for all i {1, . . . , m}.
Proposition 5.39 The Slater CQ implies the MFCQ.
Proof. Suppose the Slater CQ holds at x S. By the convexity of the
inequality constraints we get:
0 > gi (
x) = gi (
x) gi (x) gi (x)T (
x x),
131
Optimality conditions
for all i I(x). Furthermore, since the equality constraints are affine,
we have that
0 = hj (
x) hj (x) = hj (x)T (
x x),
x G(x) H(x).
j = 1, . . . , . Then, x
Example 5.40 Only Example 5.4 verifies the Slater CQ (which in particular explains why it satisfies MFCQ as well, see Example 5.37).
5.7.3
Proof.
[Sketch] Assume that G(x )H(x ) = , i.e., the system GT p <
|I(x )|
0
and H T p = 0 is unsolvable, where G and H are the matrices
having the gradients of the active inequality and equality constraints, respectively, as their columns. Using a separation result similar to Farkas
Lemma (cf. Theorem 3.30) one can show that the system G+H = 0n ,
0|I(x )| has a nonzero solution (T , T )T R|I(x )|+ , which contradicts the linear independence assumption.
In fact, the solution (, ) to the KKT system (5.17), if one exists,
is necessarily unique in this case, and therefore LICQ is a rather strong
assumption in many practical situations.
Example 5.43 Only Example 5.7 in the original description using both
inequality and equality constraints verifies the LICQ (which in particular
explains why it satisfies the MFCQ, see Example 5.37, and why the
Lagrange multipliers are unique in this case, see Example 5.34).
5.7.4
Affine constraints
Assume that both the functions gi , i = 1, . . . , m, defining the inequality constraints and the functions hj , j = 1, . . . , , defining the equality
constraints in the representation (5.10) are affine, that is, the feasible
132
5.8
(5.18)
(5.19)
for all j = 1, . . . , . Using the convexity of the objective function, equations (5.17a) and (5.17b), non-negativity of the Lagrange multipliers i ,
133
Optimality conditions
i I(x ), and equations (5.18) and (5.19) we obtain the inequality
f (x)f (x ) f (x )T (xx )
=
iI(x )
0.
i gi (x )T (xx )
X
j=1
j hj (x )T (xx )
f (x ) f (x) +
0 [(5.17c)+feas.]
=0 [(5.17b)]
=0 [feas.]
is unsolvable, and therefore the KKT conditions are not necessary without a CQ even for convex problems.
5.9
subject to xT x 1.
The only constraint of this problem is convex; furthermore, (0n )T 0n =
0 < 1, and thus Slaters CQ (Definition 5.38) is verified. Therefore, the
KKT conditions are necessary for the local optimality in this problem.
We will find all the possible KKT points, and then choose a globally
optimal point among them.
(xT Ax) = 2Ax (A is symmetric), and (xT x) = 2x. Thus,
the KKT system is as follows: xT x 1 and
2Ax + 2x = 0n ,
0,
(xT x 1) = 0.
Optimality conditions
1. Let 1 , . . . , k be all the positive eigenvalues of A (if any), and
define Xi := { x Rn | xT x = 1; Ax = i x } to be the set of
corresponding eigenvectors of length 1, i = 1, . . . , k. Then, (x, i )
is a KKT point with the corresponding multiplier for every x Xi ,
i = 1, . . . , k. Moreover, xT Ax = i xT x = i < 0, for every
x Xi , i = 1, . . . , k.
2. Define also X0 := { x Rn | xT x 1; Ax = 0n }. Then, the
pair (x, 0) is a KKT point with the corresponding multiplier for
every x X0 . We note that if the matrix A is nonsingular, then
X0 = {0n }. In any case, xT Ax = 0 for every x X0 .
Therefore, if the matrix A has any positive eigenvalue, then the global
minima of the problem we consider are the eigenvectors of length one,
corresponding to the largest positive eigenvalue; otherwise, every vector
x X0 is globally optimal.
Example 5.50 Similarly to the previous example, consider the following equality-constrained minimization problem associated with a symmetric matrix A Rnn :
minimize xT Ax,
subject to xT x = 1.
The gradient of the only equality constraint equals 2x, and since 0n is
infeasible, LICQ is satisfied at every feasible point (see Definition 5.41),
and the KKT conditions are necessary for local optimality. In this case,
the KKT system is extremely simple: xT x = 1 and
2Ax + 2x = 0n .
Let 1 < 2 < < k denote all distinct eigenvalues of A, and define
as before Xi := { x Rn | xT x = 1; Ax = i x } to be the set of corresponding eigenvectors of length 1, i = 1, . . . , k. Then, (x, i ) is a KKT
point with the corresponding multiplier for every x Xi , i = 1, . . . , k.
Furthermore, since xT Ax = i for every x Xi , i = 1, . . . , k, it
holds that every x Xk , that is, every eigenvector corresponding to the
largest eigenvalue, is globally optimal.
Considering the problem for AT A and using the spectral theorem,
we deduce the well known fact that kAk = max1ik { |i | }.
Example 5.51 Consider the problem of finding the projection of a given
point y onto the polyhedron { x Rn | Ax = b }, where A Rkn ,
b Rk . Thus, we consider the following minimization problem with
136
5.10
Optimality conditions
qualifications, some of which we considered in this chapter, may be found
in the works of Arrow, Hurwitz, and Uzawa [AHU61], Abadie [Aba67],
Mangasarian and Fromowitz [MaF67], Guignard [Gui69], Zangwill [Zan69],
and Evans [Eva70].
5.11
Exercises
3x1 + x2 6.
Check if the point x0 = (1, 2)T is a KKT point for this problem. Is this an
optimal solution? Which CQs are satisfied at the point x0 ?
Exercise 5.2 (optimality conditions, exam 020529) (a) Consider the following
optimization problem:
x2 ,
minimize
sin(x) 1.
subject to
(5.20)
Find every locally and every globally optimal solution. Write down the KKT
conditions. Are they necessary/sufficient for this problem?
(b) Do the locally/globally optimal solutions to the problem (5.20) satisfy
the FJ optimality conditions?
(c) Question the usefulness of the FJ optimality conditions by finding a
point (x, y), which satisfies the FJ conditions for the problem:
minimize
y,
subject to
x2 + y 2 1,
x3 y 4 ,
T x,
subject to Ax b.
minimize
State the KKT conditions for this problem. Verify that every KKT point
satisfies
T x = bT , where is a vector of KKT multipliers.
Exercise 5.4 (optimality conditions, exam 020826) (a) Consider the nonlinear
programming problem with equality constraints:
minimize f (x),
subject to hi (x) = 0,
138
i = 1, . . . , m,
(5.21)
Exercises
where f , h1 , . . . , hm are continuously differentiable functions.
Show that the problem (5.21) is equivalent to the following problem with
one inequality constraint:
minimize f (x),
m
X
subject to
hi (x)
2
i=1
(5.22)
0.
1 0, 2 0,
1 f1 (x ) + 2 f2 (x ) = 0n ,
1 + 2 = 1,
Assume that the matrix A has full row rank. Find the globally optimal solution
to this problem.
Exercise 5.6 Consider the following optimization problem:
n
X
minimize
cj xj ,
j=1
n
X
subject to
j=1
(5.23)
x2j 1,
xj 0,
j = 1, . . . , n.
Assume that min {c1 , . . . , cn } < 0, and let us introduce KKT multipliers 0
and j 0, j = 1, . . . , n for the inequality constraints.
(a) Show that the equalities
xj = min{0, cj }/(2 ),
=
1
2
n
X
j=1
j = max{0, cj },
j = 1, . . . , n,
!1/2
j = 1, . . . , n,
139
Optimality conditions
Exercise 5.7 (optimality conditions, exam 040308) Consider the following optimization problem:
RR
minimize f (x, y) :=
(x,y)
1
1
(x 2)2 + (y 1)2 ,
2
2
x y 0,
subject to
(5.24)
y 0,
y(x y) = 0.
(a) Find all points of global and local minima (you may do this graphically),
as well as all KKT points. Is this a convex problem? Are the KKT optimality
conditions necessary and/or sufficient for local optimality in this problem?
(b) Demonstrate that LICQ is violated at every feasible point of the problem (5.24). Show that instead of solving the problem (5.24) we can solve two
convex optimization problems that furthermore verify some constraint qualification, and then choose the best point out of the two.
(c) Generalize the procedure from the previous part to the more general
optimization problem to
minimize
g(x),
aTi x bi ,
subject to
i = 1, . . . , n,
xi 0,
i = 1, . . . , n,
xi (aT
i x bi ) = 0,
where x = (x1 , . . . , xn )T Rn , ai
is a convex differentiable function.
i = 1, . . . , n,
Rn , bi R, i = 1, . . . , n, and g : Rn R
Exercise 5.8 Determine the values of the parameter c for which the point
(x, y) = (4, 3) is an optimal solution to the following problem:
RR
minimize cx + y,
(x,y)
subject to x2 + y 2 25,
x y 1.
f (x) :=
n
X
x2j
j=1
subject to
n
X
cj
xj = D,
j=1
xj 0,
j = 1, . . . , n,
140
Lagrangian duality
VI
This chapter collects some basic results on Lagrangian duality, in particular as it applies to convex programs with a zero duality gap.
6.1
subject to x S,
(6.1a)
(6.1b)
subject to x SR ,
(6.2a)
(6.2b)
and
fR (xR ) = f (xR ),
(6.3)
Lagrangian duality
Proof. The result in (a) is obvious, as every solution feasible in (6.1)
is both feasible in (6.2) and has a lower objective value in the latter
problem. The result in (b) follows for similar reasons. For the result in
(c), we note that
f (xR ) = fR (xR ) fR (x) f (x),
x S,
6.2
Lagrangian duality
6.2.1
(6.4)
x X,
subject to
gi (x) 0,
i = 1, . . . , m,
(6.5)
that is, that f is bounded from below on the feasible set and the problem
has at least one feasible solution.
Definition 6.2 (Lagrange function, relaxation, multiplier) (a) For an arbitrary vector Rm , the Lagrange function is
L(x, ) := f (x) +
m
X
i=1
142
(6.6)
Lagrangian duality
(b) Consider the problem to
minimize L(x, ),
(6.7)
subject to x X.
Whenever is non-negative, the problem (6.7) is referred to as a Lagrangian relaxation.
(c) We call the vector Rm a Lagrange multiplier vector if it is
non-negative and if f = inf x X L(x, ) holds.
Note that the Lagrangian relaxation (6.7) is a relaxation, in terms of
Section 6.1.
Theorem 6.3 (Lagrange multipliers and global optima) Let be a Lagrange multiplier vector. Then, x is an optimal solution to (6.4) if and
only if x is feasible in (6.4) and
x arg min L(x, ),
x X
and i gi (x ) = 0, i = 1, . . . , m.
(6.8)
where the first inequality stems from the feasibility of x and the definition of a Lagrange multiplier vector. The second part of that definition
implies that f = inf x X L(x, ), so that equality holds throughout in
the above line of inequalities. Hence, (6.8) follows.
Conversely, if x is feasible and (6.8) holds, then by the use of the
definition of a Lagrange multiplier vector,
f (x ) = L(x , ) = minimum L(x, ) = f ,
x X
so x is a global optimum.
Let
q() := infimum L(x, )
x X
(6.9)
(6.10)
subject to 0m .
143
Lagrangian duality
For some , q() = is possible; if it is true for all 0m , then
q := supremum q()
0m
equals . (We can then say that the dual problem is infeasible.)
The effective domain of q is
Dq := { Rm | q() > } .
Theorem 6.4 (convex dual problem) The effective domain Dq of q is
convex, and q is concave on Dq .
Rm , and [0, 1]. We have that
Proof. Let x Rn , ,
= L(x, ) + (1 )L(x,
).
L(x, + (1 ))
Take the infimum over x X on both sides; then,
= inf {L(x, ) + (1 )L(x, )}
inf L(x, + (1 ))
x X
x X
)
inf L(x, ) + inf (1 )L(x,
x X
x X
x X
since [0, 1], and the sum of infimum values may be smaller than the
infimum of the sum, since in the former case we have the possibility to
choose different optimal solutions in the two problems. Hence,
q() + (1 )q()
q( + (1 ))
lie in Dq , then
holds. This inequality has two implications: if and
so Dq is convex; also, q is concave on Dq .
so does + (1 ),
That the Lagrangian dual problem always is convex (we indeed maximize a concave function) is good news, because it means that it can
be solved efficiently. What remains is to show how a Lagrangian dual
optimal solution can be used to generate a primal optimal solution.
Next, we establish that every feasible point in the Lagrangian dual
problem always underestimates the objective function value of every feasible point in the primal problem; hence, also their optimal values have
this relationship.
Theorem 6.5 (Weak Duality Theorem) (a) Let x and be feasible in
the problems (6.4) and (6.10), respectively. Then,
q() f (x).
144
Lagrangian duality
In particular,
q f .
(b) If q() = f (x), then the pair (x, ) is optimal in its respective
problem.
Proof. For all 0m and x X with g(x) 0m ,
q() = infimum L(z, ) f (x) + T g(x) f (x),
z X
so
q = supremum q()
0m
infimum
x X:g(x )0m
f (x) = f .
SR := X,
fR := L(, ).
(6.11a)
(6.11b)
(6.11c)
145
Lagrangian duality
Proof. By definition, a vector 0m is a Lagrange multiplier vector
if and only if f = q( ) q , the equality following from the definition
of q( ) and the inequality from the definition of q as the supremum of
q() over Rm
+ . By weak duality, this relation holds if and only if there
is no duality gap and is an optimal dual solution.
Above we have developed properties of the min-max problem for
finding
q := supremum infimum L(x, ).
0m
x X
0m
Fix x X. Then,
p(x) := supremum L(x, ) =
0m
f (x),
+,
if g(x) 0m ,
otherwise.
Lagrangian duality
6.2.2
The following result characterizes every optimal primal and dual solution. It is however applicable only in the presence of Lagrange multipliers; in other words, the below system (6.12) is consistent if and only if
there exists a Lagrange multiplier vector and there is no duality gap.
Theorem 6.7 (global optimality conditions in the absence of a duality
gap) The vector (x , ) is a pair of primal optimal solution and Lagrange multiplier vector if and only if
0m ,
x arg min L(x, ),
(Dual feasibility)
(Lagrangian optimality)
(6.12a)
(6.12b)
x X,
(Primal feasibility)
(6.12c)
i gi (x )
= 0,
x X
g(x ) 0m ,
i = 1, . . . , m.
Proof. Suppose that the pair (x ) satisfies (6.12). Then, from (6.12a)
we have that the Lagrangian problem to minimize L(x, ) over x X
is a (Lagrangian) relaxation of (6.4). Moreover, according to (6.12b)
x solves this problem, (6.12c) shows that x is feasible in (6.4), and
(6.12d) implies that L(x , ) = f (x ). The Relaxation Theorem 6.1
then yields that x is optimal in (6.4), which in turn implies that is
a Lagrange multiplier vector.
Conversely, if (x , ) is a pair of optimal primal solution and Lagrange multiplier vector, then they are primal and dual feasible, respectively. The relations (6.12b) and (6.12d) follow from Theorem 6.3.
Theorem 6.8 (global optimality and saddle points) The vector (x , )
is a pair of optimal primal solution and Lagrange multiplier vector if
and only if x X, 0m , and (x , ) is a saddle point of the
Lagrangian function on X Rm
+ , that is,
L(x , ) L(x , ) L(x, ),
(x, ) X Rm
+
(6.13)
holds.
Proof. We establish that (6.12) and (6.13) are equivalent; Theorem 6.7
then gives the result. The first inequality in (6.13) is equivalent to
g(x )T ( ) 0,
Rm
+,
(6.14)
147
Lagrangian duality
for the given pair (x , ) X Rm
+ . This variational inequality is
equivalent to stating that2
0m g(x ) 0m ,
(6.15)
m
T
148
Lagrangian duality
Lagrange multipliers in the investigation of the KKT system (5.9). In
contrast, the system (6.12) is normally investigated in the reverse order;
we formulate and solve the Lagrangian dual problem, thereby obtaining
an optimal dual vector . Starting from that vector, we investigate the
global optimality conditions stated in (6.12) to obtain, if possible, an
optimal primal vector x. In the section to follow, we show when this is
possible, and provide strong connections between the systems (5.9) and
(6.12) in the convex and differentiable case.
6.2.3
So far the results have been rather non-technical to achieve: the convexity of the Lagrangian dual problem comes with very few assumptions
on the original, primal problem, and the characterization of the primal
dual set of optimal solutions is simple and also quite easily established.
In order to establish strong duality, that is, to establish sufficient conditions under which there is no duality gap, however, takes much more.
In particular, as is the case with the KKT conditions we need regularity
conditions (that is, constraint qualifications), and we also need to utilize
separation theorems such as Theorem 4.29. Most importantly, however,
is that strong duality is deeply associated with the convexity of the original problem, and it is in particular under convexity that the primal and
dual optimal solutions are linked through the global optimality conditions provided in the previous section. We begin by concentrating on the
inequality constrained case, proving this result in detail. We will also
specialize the result to quadratic and linear optimization problems.
Consider the inequality constrained convex program (6.4), where f :
Rn R and gi (i = 1, . . . , m) are convex functions and X Rn is
a convex set. For this problem, we introduce the following regularity
condition, due to Slater (cf. Definition 5.38):
x X with g(x) < 0m .
(6.16)
Lagrangian duality
(6.12b) can equivalently be written as the variational inequality
x L(x , )T (x x ) 0,
x X.
(6.17)
m
X
i=1
i gi (x ) = 0n ,
(6.18)
(z T , w)T A.
(6.19)
0m ,
(6.20)
x X.
Lagrangian duality
A
(g(
x)T , f (
x))T
((0m )T , f )T
z
(T , 1)T
0m
From the Weak Duality Theorem 6.5 follows that is a Lagrange multiplier vector, and that there is no duality gap.
X satisfying (6.16) and a Lagrange multiplier
Take any vector x
vector . By the definition of a Lagrange multiplier vector, f
L(
x, ) holds, which implies that
m
X
i=1
[f (
x) f ]
.
mini=1,...,m {gi (
x)}
Lagrangian duality
(b) The result follows from Theorem 6.7.
(c) The first part follows from Theorem 4.24, as the Lagrangian function L(, ) is convex. The second part follows by identification.
Consider next the extension of the inequality constrained convex program (6.4) in which we seek to find
f := infimum f (x),
(6.21)
x X,
subject to
gi (x) 0,
T
j x
dj = 0,
i = 1, . . . , m,
j = 1, . . . , ,
(6.22)
q := supremum q(, ),
(6.23)
(,)
subject to 0m ,
where
q(, ) := infimum L(x, , ) := f (x) + T g(x) + T (Ex d),
x
subject to x X.
Theorem 6.10 (Strong Duality, general convex programs) Suppose that
in addition to the feasibility condition (6.5), Slaters constraint qualification (6.22) holds for the problem (6.21).
(a) The duality gap is zero and there exists at least one Lagrange
multiplier vector pair ( , ).
(b) If the infimum in (6.21) is attained at some x , then the triple
x X
152
(6.24a)
(6.24b)
(6.24c)
Lagrangian duality
(c) If further f and g are differentiable at x , then the condition
(6.24b) can equivalently be written as
x L(x , , )T (x x ) 0,
x X.
(6.25)
m
X
i=1
i gi (x ) +
j j = 0n , (6.26)
j=1
(6.27)
subject to
x X,
aT
i x bi
T
j x dj
0,
= 0,
i = 1, . . . , m,
j = 1, . . . , ,
a detailed proof, see [Ber99, Proposition 5.2.1]. (The special case where f is
moreover differentiable is covered in [Ber99, Proposition 3.4.2].)
153
Lagrangian duality
For convex programs where a Slater CQ holds, the Lagrange multipliers defined in this section, and those that appear in the Karush
KuhnTucker conditions, clearly are identical. Next, we specialize the
above to linear and quadratic programs.
6.2.4
The following result will be established and analyzed in detail in Chapter 10 on linear programming duality (cf. Theorem 10.6), but can in
fact also be established similarly to above. (See [BSS93, Theorem 2.7.3]
or [Ber99, Proposition 5.2.2], for example.) Its proof will however be
relegated to that of Theorem 10.6.
Theorem 6.12 (Strong Duality, linear programs) Assume, in addition to
the conditions of Theorem 6.11, that f is linear, so that (6.27) is a linear
program. Then, the primal and dual problems have optimal solutions
and there is no duality gap.
The above result states a strong duality result for a general linear
program. We next develop an explicit Lagrangian dual problem for a
linear program.
Let A Rmn , c Rn , and b Rm ; consider the linear program
minimize cT x,
(6.28)
subject to Ax = b,
x 0n .
If we let X := Rn+ , then the Lagrangian dual problem is to
maximize
bT ,
m
(6.29)
subject to AT c.
The reason why we can write it in this form is that
n
o
T
T
q() := infimum
c
x
+
(b
Ax)
= bT + infimum
(c AT )T x,
n
n
x 0
x 0
so that
q() =
bT , if AT c,
, otherwise.
Lagrangian duality
Further, why is it that here is not restricted in sign? Suppose we
were to split the system Ax = b into an inequality system of the form
Ax b,
Ax b.
Let ((+ )T , ( )T )T be the corresponding vector of multipliers, and
take the Lagrangian dual for this formulation. Then, we would have a
Lagrange function of the form
(x, + , ) 7 L(x, + , ) := cT x + (+ )T (b Ax),
and since + can take on any value in Rm we can simply replace it
with the unrestricted vector Rm . This motivates why the multiplier
for an equality constraint never is sign restricted; the same was the case,
as we saw in Section 5.6, for the multipliers in the KKT conditions.
As applied to this problem, Theorem 6.12 states that if both the
primal or dual problems have feasible solutions, then they both have
optimal solutions, satisfying strong duality (cT x = bT ). On the
other hand, if any of the two problems has an unbounded solution, then
the other problem is infeasible.
Consider next the quadratic programming problem to
1 T
T
minimize
x Qx + c x ,
(6.30)
x
2
subject to Ax b,
where Q Rnn , c Rn , A Rmn , and b Rm . We develop an
explicit dual problem under the assumption that Q is positive definite.
By Lagrangian relaxing the inequality constraints, we obtain that the
inner problem in x is solved by letting
x = Q1 (c + AT ).
(6.31)
Substituting this expression into the Lagrangian function yields the Lagrangian dual problem to
1
1
maximize T AQ1 AT (b+AQ1 c)T cT Q1 c , (6.32)
2
2
subject to 0m ,
Lagrangian duality
Theorem 6.13 (Strong Duality, quadratic programs) For the primal
dual pair of convex quadratic programs (6.30), (6.32), the following
holds:
(a) If both problems have feasible solutions, then both problems also
have optimal solutions, and the primal problem (6.30) also has a unique
optimal solution, given by (6.31) for any optimal Lagrange multiplier
vector, and in the two problems the optimal values are equal.
(b) If either of the two problems has an unbounded solution, then
the other one is infeasible.
(c) Suppose that Q is positive semidefinite, and that the feasibility
condition (6.5) holds. Then, both the problem (6.30) and its Lagrangian
dual have nonempty, closed and convex sets of optimal solutions, and
their optimal values are equal.
In the result (a) it is important to note that the Lagrangian dual
problem (6.32) is not necessarily strictly convex; the matrix AQ1 AT
need not be positive definite, especially so when A does not have full
rank. The result (c) extends the strong duality result from linear programming, since Q in (c) can be the zero matrix. In the case of (c) we of
course cannot write the Lagrangian dual problem in the form of (6.32)
because Q is not necessarily invertible.
6.2.5
Example 6.14 (an explicit, differentiable dual problem) Consider the problem to
minimize f (x) := x21 + x22 ,
x
subject to x1 + x2 4,
xj 0,
j = 1, 2.
x2 0
0.
Lagrangian duality
Note that q is strictly concave, and it is differentiable everywhere (due
to the fact that f, g are differentiable and x() is unique), by Danskins
Theorem 6.17(d).
We have that q () = 4 = 0 = 4. As = 4 0, it is the
optimum in the dual problem: = 4; x = (x1 ( ), x2 ( ))T = (2, 2)T .
Also, f (x ) = q( ) = 8.
This is an example where the dual function is differentiable, and
therefore we can utilize Proposition 6.29(c). In this case, the optimum
x is also unique, so it is automatically given as x = x().
Example 6.15 (an implicit, non-differentiable dual problem) Consider
the linear programming problem to
minimize f (x) := x1 x2 ,
x
0 x1 2,
0 x2 1.
3 + 5, 0 1/4,
2 + , 1/4 1/2,
=
3, 1/2 .
Lagrangian duality
be kinks in the function q where there are alternative solutions x(); as a
result, to obtain a primal optimal solution becomes more complex. The
DantzigWolfe algorithm, for example, represents a means by which to
automatize the process that we have just shown; the algorithm generates
extreme points of X() algorithmically, and constructs the best feasible
convex combination thereof, obtaining a primaldual optimal solution in
a finite number of iterations for linear programs.
The above examples motivate a deeper study of the differentiability properties of convex (or, concave) functions in general, and the Lagrangian dual objective function in particular.
6.3
6.3.1
Throughout this section we suppose that f : Rn R is a convex function, and study its subdifferentiability properties. We will later on apply
our findings to the Lagrangian dual function q, or, rather, its negative
q. We first remark that a finite convex function is automatically continuous (cf. Theorem 4.27).
Definition 6.16 (subgradient) Let f : Rn R be a convex function.
We say that a vector g Rn is a subgradient of f at x Rn if
f (y) f (x) + g T (y x),
y Rn .
(6.33)
(6.34)
Lagrangian duality
(d) [Danskins Theoremdirectional derivatives of a convex max function] Let Z be a compact subset of Rm , and let : Rn Z R be
continuous and such that (, z) : Rn R is convex for each z Z. Let
the function f : Rn R be given by
f (x) := maximum (x, z),
z Z
x Rn .
(6.35)
(6.36)
z Z(x )
where (x, z; p) is the directional derivative of (, z) at x in the direction of p, and Z(x) := { z Rm | (x, z) = f (x) }.
In particular, if Z(x) contains a single point z and (, z ) is differen ), where
tiable at x, then f is differentiable at x, and f (x) = x (x, z
(x ,
z)
) is the vector with components xi , i = 1, . . . , n.
x (x, z
If further (, z) is differentiable for all z Z and x (x, ) is continuous on Z for each x, then
f (x) = conv { x (x, z) | z Z(x) },
x Rn .
2. 0n f (x );
3. f (x ; p) 0 for all p Rn .
160
f (x)
x
5
4
3
2
1
p Rn }.
p Rn ,
Lagrangian duality
6.3.2
(6.37)
set {
x} for some R , and for some sequence {k } Rm with
, xk X(k ) for all k, then xk x
.
k
(c) Let Rm . If x X(), then g(x) is a subgradient to q at ,
that is, g(x) q().
(d) Let Rm . Then,
q() = conv { g(x) | x X() }.
The set q() is convex and compact. Moreover, if U is a bounded
set, then U q() is also bounded.
(e) The directional derivative of q at Rm in the direction of
p Rm is
q (; p) = minimum g T p.
g q()
Proof. (a) Theorem 6.4 stated the concavity of q on its effective domain.
Weierstrass Theorem 4.7 states that q is finite on Rm , which is then
also its effective domain. The continuity of q follows from that of any
finite concave function, as we have already seen in Theorem 4.27. The
closedness property of the solution set is a direct consequence of the
continuity of q (the upper level set then automatically is closed), and
complements the result of Theorem 6.9(a).
162
so that x
follows.
Rm be arbitrary and let x
X().
We have that
(c) Let
= infimum L(y,
) = f (x) +
T g(x)
q()
y X
)T g(x) q() + (
)T g(x),
= f (x) + T g(x) + (
which implies that g(x) q().
(d) The inclusion q() conv { g(x) | x X() } follows from (c)
and the convexity of q(). The opposite inclusion follows by applying
the Separation Theorem 3.24.5
(e) See Proposition 6.17(c).
The result in (c) is an independent proof of the concavity of q on Rm .
The result (d) is particularly interesting, because by Caratheodorys
Theorem 3.8 every subgradient of q at any point is the convex combination of a finite number (in fact, at most m + 1) of vectors of the form
g(xs ) with xs X(). Computationally, this has been utilized to devise
efficient (proximal) bundle methods for the Lagrangian dual problem as
well as to devise methods to recover primal optimal solutions.
Next, we establish the differentiability of the dual function under
additional assumptions.
Proposition 6.20 (differentiability of the dual function) Suppose that,
in the problem (6.4), the compactness condition (6.37) holds.
(a) Let Rm . The dual function q is differentiable at if and
only if { g(x) | x X() } is a singleton set, that is, if the value of the
vector of constraint functions is invariant over the set of solutions X()
to the Lagrangian subproblem. Then, we have that
q() = g(x),
5 See
163
Lagrangian duality
for every x X().
(b) The result in (a) holds in particular if the Lagrangian subproblem
has a unique solution, that is, X() is a singleton set. In particular, this
property is satisfied for 0m if further X is a convex set, f is strictly
convex on X, and gi (i = 1, . . . , m) are convex, in which case q C 1 .
Proof. (a) The concave function q is differentiable at the point (where
it is finite) if and only if its subdifferential q() there is a singleton, cf.
Proposition 6.17(c).
(b) Under either one of the assumptions stated, X() is a singleton,
whence the result follows from (a). Uniqueness follows from the convexity of the feasible set and strict convexity of the objective function,
according to Proposition 4.11. That q C 1 follows from the continuity
of g and Proposition 6.19(b).
6.4
We begin by establishing the convergence of classic subgradient optimization methods as applied to a general convex optimization problem.
6.4.1
Convex problems
(6.39a)
subject to x X,
(6.39b)
6 See
164
(6.40a)
(6.40b)
k = 0, 1, . . . ;
lim k = 0;
k = +.
(6.41)
k=0
X
(6.42)
2k < +.
k=0
The conditions in (6.41) allow for convergence to any point from any
starting point, since the total step is infinite, but convergence is therefore
also quite slow; the additional condition in (6.42) means fast sequences
are selected. An instance of the step length formulas which satisfies both
(6.41) and (6.42) is the following:
k = + /(k + 1),
k = 0, 1, . . . ,
where > 0, 0.
The third step length rule is
k = k
f (xk ) f
,
kg k k2
0 < 1 k 2 2 < 2,
(6.43)
where f is the optimal value of (6.39). We refer to this step length formula as the Polyak step, after the Russian mathematician Boris Polyak
who invented the subgradient method in the 1960s together with Ermolev and Shor.
How is convergence established for subgradient optimization methods? As shall be demonstrated in Chapters 11 and 12 convergence of
algorithms for problems with a differentiable objective function is typically based on generating descent directions, and step length rules that
result in the sequence {xk } of iterates being strictly descending in the
165
Lagrangian duality
value of f . For the non-differentiable problem at hand, generating descent directions is a difficult task, since it is not true that the negative of
an arbitrarily chosen subgradient of f at a non-optimal vector x defines
a descent direction.
In bundle methods one gathers information from more than one subgradient (hence the term bundle) around a current iteration point so
that a descent direction can be generated, followed by an inexact line
search. We concentrate here on the simpler methodology of subgradient
optimization methods, in which we apply the formula (6.40) where the
step length k is chosen based on very simple rules.
We establish below that if the step length is small enough, an iteration of the subgradient projection method leads to a vector that is closer
to the set of optimal solutions. This technical result also motivates the
construction of the Polyak step length rule, and hence shows that the
convergence of subgradient methods is based on the reduction of the Euclidean distance to the optimal solutions rather than on the reduction of
the value of the objective function f .
Proposition 6.22 (decreasing distance to the optimal solution set) Suppose that xk X is not optimal in (6.39), and that xk+1 is given by
(6.40) for some step length k > 0.
Then, for every optimal solution x in (6.39),
kxk+1 x k < kxk x k
holds for every step length k in the interval
k (0, 2[f (xk ) f ]/kg k k2 ).
(6.44)
= kxk x k2 2k (xk x )T g k + 2k kg k k2
kxk x k2 2k [f (xk ) f ] + 2k kg k k2
< kxk x k2 ,
where we have utilized the property that the Euclidean projection is nonexpansive (Theorem 4.32), the subgradient inequality (6.33) for convex
166
y X
(6.45a)
kx xk + k g k k
(6.45b)
2
2
= kx xk k + k 2g T
, (6.45c)
k (x xk ) + k kg k k
7 A proper function is a function which is finite at least at some vector and nowhere
attains the value . See also Section 1.4.
8 For any set S Rn the function is the indicator function of the set S, that
S
is, S (x) = 0 if x S; and S (x) = + if x 6 S. See also Section 13.1.
167
Lagrangian duality
where the inequality follows from the projection property. Now, suppose
2
2 gT
s (x xs ) + s kg s k <
(6.46)
for all s N (). Then, using (6.45) repeatedly, we obtain that for any
k N (),
k
X
2
kx xk+1 k2 <
x xN ()
s ,
s=N ()
and from (6.40) it follows that the right-hand side of this inequality tends
to minus infinity as k , which clearly is impossible. Therefore,
2
2 gT
k (x xk ) + k kg k k
(6.47)
for at least one k N (), say k = k(). From the definition of N (), it
follows that g T
k() (x xk() ) . From the definition of a subgradient
Thus, xk+1 X +B . Otherwise, (6.47) must hold and, using the same
arguments as above, we obtain that f (xk ) f + , i.e., xk X
x + B /2 . As
kxk+1 xk k = kProjX (xk k g k ) xk k kxk k g k xk k
= k kg k k /2
whenever k N (), it follows that xk+1 X + B /2 + B /2 = X + B .
By induction with respect to k k(), it follows that xk X + B
for all k k(). Since this holds for arbitrarily small values of > 0
and f is continuous, the theorem follows.
We next introduce the additional requirement (6.42); the resulting
algorithms convergence behaviour is now much more favourable, and
the proof is at the same time less technical.
168
kx xk k kx x0 k + 2
k1
X
s=0
s g T
s (x xs ) +
k1
X
s=0
2s kg s k .(6.48)
f (xs ) f f (xs ) + g T
s (x xs ) ,
s 0,
(6.49)
that g T
ki (x xki ) 0. From (6.49) it follows that f (xki ) f .
The boundedness of {xk } implies the existence of an limit point of the
subsequence {xki }, say x . From the continuity of f it follows that
x X .
To show that x is the only limit point of {xkP
}, let > 0 and choose
2s kg s k2 < + 2 c2 = .
kx xk k2
x xM()
+
2 2c
s=M()
Since this holds for arbitrarily small values of > 0, we are done.
Lagrangian duality
Theorem 6.25 (convergence of subgradient optimization methods, III)
Let {xk } be generated by the method (6.40), (6.43). If X is nonempty
then f (xk ) f and xk x X holds.
Proof. From Proposition 6.22 follows that the sequence {kxk x k}
is strictly decreasing for every x X , and therefore has a limit. By
construction of the step length, in which the step lengths are bounded
away from zero and 2[f (xk ) f ]/kg k k2 , it follows from the proof of
Proposition 6.22 that [f (xk ) f ]2 /kg k k2 0 must hold. Since {g k }
must be bounded due to the boundedness of {xk } [Proposition 6.17(a)],
we have that f (xk ) f . Further, xk is bounded, and due to the
continuity property of f every limit point must then belong to X .
It remains to show that there can be only one limit point. This
property follows from the monotone decrease of the distance kxk x k.
In detail, the proof is as follows. Suppose two subsequences of {xk }
exist, such that they converge to two different vectors in X :
xmi x1 ;
xli x2 ;
x1 6= x2 .
k = 0, 1, . . . .
See Section 6.8 for references to other subgradient algorithms than those
presented here.
6.4.2
{4 2x}, if x < 1;
if x = 1;
[1, 2] ,
h(x) = {1},
if 1 < x < 4;
[4, 1] , if x = 4;
{4 2x}, if x > 4.
Lagrangian duality
1
2
3
q()
4
5
q
Figure 6.5: The half-space defined by the subgradient g of q at . Note
that the subgradient is not an ascent direction.
(6.50)
q q(k )
,
kg k k2
0 < 1 k 2 2 < 2,
(6.51)
pose that the problem (6.4) is feasible, and that the compactness condition (6.37) and the Slater condition (6.16) hold.
(a) Let {k } be generated by the method (6.50), (6.41). Then,
q(k ) q , and distU (k ) 0.
(b) Let {k } be generated by the method (6.50), (6.41), (6.42). Then,
{k } converges to an optimal solution to (6.10).
(c) Let {k } be generated by the method (6.50), (6.51). Then, {k }
converges to an optimal solution to (6.10).
Proof. The results follow from Theorems 6.23, 6.24, and 6.25, respectively. Note that in the first two cases, boundedness conditions were
assumed for X and the sequence of subgradients. The corresponding
conditions for the Lagrangian dual problem are fulfilled under the CQs
imposed, since they imply that the search for an optimal solution is done
over a compact set; cf. Theorem 6.9(a) and its proof.
6.4.3
R such that f (
) < 0. According
the existence of some vector p
x; p
to the definition of the directional derivative and the compactness of
< 0 for every
f (
x), this is equivalent to the statement that g T p
g f (
x). In the context of Lagrangian duality we show below how we
can generate an ascent direction for q at some Rm .
Definition 6.27 (steepest ascent direction) Suppose that the problem
(6.4) is feasible, and that the compactness condition (6.37) holds. Consider the Lagrangian dual problem (6.10), and let Rm . A vector
Rm with k
p
pk 1 is a steepest ascent direction if
) = maximum q (; p)
q (; p
kpk1
holds.
Proposition 6.28 (the shortest subgradient yields the steepest ascent direction) Suppose that the problem (6.4) is feasible, and that the compactness condition (6.37) holds. Consider the Lagrangian dual problem
of steepest ascent with respect to q at is given
(6.10). The direction p
q() is the shortest subgradient in q() with respect
below, where g
to the Euclidean norm:
(
= 0m ,
0m , if g
=
p
g
6= 0m .
k , if g
kg
173
Lagrangian duality
infimum maximum g T p
g q()
kpk1
= infimum kgk
g q()
= k
g k.
(6.52)
such that q (; p
) = k
If we can construct a direction p
g k then by (6.52)
is the steepest ascent direction. If g
= 0m then for p
= 0m we
p
) = k
obviously have that q (; p
g k. Suppose then that g 6= 0m , and let
:= g
/k
p
gk. Note that
g T g
) = infimum g T p
= infimum
q (; p
gk
g q()
g q() k
1
=
infimum k
g k2 + g T (g g )
k
gk g q()
1
T (g g
).
= k
gk +
infimum g
k
g k g q()
(6.53)
6.5
It remains for us to show how an optimal dual solution can be translated into an optimal primal solution x . Obviously, convexity and
strong duality will be needed in general, if we are to be able to utilize
the primaldual optimality characterization in Theorem 6.7. It turns
out that the generation of a primal optimum is automatic if q is differentiable at , which is also the condition under which the famous
Lagrange multiplier method works. Unfortunately, in many cases, such
as for most non-strictly convex optimization problems (like linear programming), this will not be the case, and then the translation work
becomes more complex.
We start with the ideal case.
174
6.5.1
The following results summarize the optimality conditions for the Lagrangian dual problem (6.10), and their consequences for the availability
of a primal optimal solution in the absence of a duality gap.
Proposition 6.29 (optimality conditions for the dual problem) Suppose
that the compactness condition (6.37) holds in the problem (6.4). Suppose further that the vector is solves the Lagrangian dual problem.
(a) The dual optimal solution is characterized by the inclusion
0m q( ) + NRm
( ).
+
(6.54)
In other words, there then exists q( )an optimality-characterizing subgradient of q at such that
0m 0m .
(6.55)
k
X
i g(xi );
i=1
k
X
i=1
i = 1;
i 0, i = 1, . . . , k.
(6.56)
i i gi (xi ) = 0.
(6.57)
i=1
175
Lagrangian duality
Remark 6.30 (the non-coordinability phenomenon and decomposition algorithms) Many interesting problems do not comply with the conditions
in (c); for example, linear programming is one where the Lagrangian
dual problem often is non-differentiable at every dual optimal solution.9 This is sometimes called the non-coordinability phenomenon (cf.
[Las70, DiJ79]). It was in order to cope with this phenomenon that
DantzigWolfe decomposition ([DaW60, Las70]) and other column generation algorithms, Benders decomposition ([Ben62, Las70]) and generalized linear programming were developed; noticing that the convex combination of a finite number of candidate primal solutions are sufficient to
verify an optimal primaldual solution [cf. (6.57)], methodologies were
developed to generate those vectors algorithmically. See also [LPS99]
for overviews on the subject of generating primal optimal solutions from
dual optimal ones, and [BSS93, Theorem 6.5.2] for an LP procedure that
provides primal feasible solutions for convex programs.
Note that the equation (6.57) in (a) reduces to the complementar :=
ity condition that i gi (
x) = 0 holds, for the averaged solution, x
Pk
i
i=1 i x , whenever all the functions gi are affine.
6.5.2
Everetts Theorem
The next result shows that the solution to the Lagrangian subproblem
solves a perturbed version of the original problem. We state the result
for the general problem to find
f := infimum f (x),
(6.58)
subject to
x X,
gi (x) 0,
hj (x) = 0,
i = 1, . . . , m,
j = 1, . . . , ,
where f : Rn R, gi : Rn R, i = 1, 2, . . . , m, and hj : Rn R,
j = 1, 2, . . . , , are given functions, and X Rn .
176
Sensitivity analysis
(6.60)
subject to
x X,
gi (x) gi (
x),
hj (x) = hj (
x),
i I(
x),
j = 1, . . . , .
f (
x) + T g(
x) + T h(
x),
(6.61)
subject to
x X,
gi (x) 0,
hj (x) = 0,
i I(
x),
j = 1, . . . , .
6.6
6.6.1
Sensitivity analysis
Lagrangian duality
a convex set. Suppose that the problem (6.4) is feasible, and that the
compactness condition (6.37) and Slater condition (6.16) hold. This is
the classic case where there exist multiplier vectors , according to
Theorem 6.9, and strong duality holds.
For certain types of problems where the duality gap is zero and where
there exist primaldual optimal solutions, we have access to a beautiful
theory of sensitivity analysis. The classic meaning of the term is the
answer to the following question: what is the rate of change in f when
a constraint right-hand side changes? This question answers important
practical questions, like the following in manufacturing: If we buy one
unit of additional resource at a given price, or if the demand of a product
that we sell increases by a certain amount, then how much additional
profit do we make?
We will here provide a basic result which states when this sensitivity
analysis of the optimal objective value can be performed for the problem
(6.4), and establish that the answer is determined precisely by the value
of the Lagrange multiplier vector , provided that it is unique.
Definition 6.32 (perturbation function) Consider the function p : Rm
R {} defined by
p(u) := infimum f (x),
(6.62)
subject to
x X,
gi (x) ui ,
i = 1, . . . , m,
u Rm ;
=
=
infimum
{f (x) + ( )T g(x)}
infimum
{f (x) + ( )T u}
{ (u,x )P X|g(x )u }
{ (u,x )P X|g(x )u }
= infimum
u P
infimum
{x X|g(x )u }
{f (x) + ( )T u}.
u Rm ,
Sensitivity analysis
6.6.2
There exist local versions of the analysis valid also for non-convex problems, where we are interested in the effect of a problem perturbation
on a KKT point. A special such analysis was recently performed by
Bertsekas [Ber04], in which he shows that even when the problem is
non-convex and the Lagrange multipliers are not unique, a sensitivity
analysis is available as long as the functions defining the problem are
differentiable. Suppose then that in the problem (6.4) the functions f
and gi , i = 1, . . . , m are in C 1 and that X is nonempty. We generalize
the concept of a Lagrange multiplier vector to here mean that it is a
179
Lagrangian duality
vector associated with a local minimum x such that
f (x ) +
m
X
i=1
!T
i gi (x )
p 0,
p TX (x ),
(6.63a)
i 0,
i = 0,
i = 1, . . . , m,
i 6 I(x ),
(6.63b)
(6.63c)
i = 1, . . . , m,
(6.64)
i = 1, . . . , m,
holds.
Theorem 6.34 establishes the optimal rate of cost improvement with
respect to infeasible constraint perturbations (in effect, those that imply
an enlargement of the feasible set).
We finally remark that under stronger conditions still, the operator
u 7 x (u) assigning the (unique) optimal solution x to each perturbation vector u Rm is differentiable at u = 0m . Such a result is
180
Applications
reminiscent to the Implicit Function Theorem, which however only covers equality systems. If we are to study the sensitivity of x to changes
in the right-hand sides of inequality constraints as well, then the analysis
becomes complicated due to the fact that we must be able to predict if
some active constraints may become inactive in the process. In some
circumstances, different directions of change in the right-hand sides may
cause different subsets of the active constraints I(x ) at x to become
inactive, and this would most probably then be a non-differentiable point
of the operator u 7 x (u). A sufficient condition (but not necessary, at
least in the case of linear constraints) for this to not happen is when x
is strictly complementary, that is, when there exists a multiplier vector
with i > 0 for every i I(x ).
6.7
Applications
6.7.1
Electrical networks
An electrical network (or, circuit) is an interconnection of analog electrical elements such as resistors, inductors, capacitors, diodes, and transistors. Its size varies from the smallest integrated circuit to an entire
electricity distribution network. A circuit is a network that has at least
one closed loop. A network is a connection of 2 or more simple circuit
elements, and may not be a circuit. The goal when designing electrical
networks for signal processing is to apply a predefined operation on potential differences (measured in volts) or currents (measured in amperes).
Typical functions for these electrical networks are amplification, oscillation and analog linear algorithmic operations such as addition, subtraction, multiplication, and division. In the case of power distribution
networks, engineers design the circuit to transport energy as efficiently
as possible while at the same time taking into account economic factors,
network safety and redundancy. These networks use components such
as power lines, cables, circuit breakers, switches and transformers.
To design any electrical circuits, electrical engineers need to be able
181
Lagrangian duality
to predict the voltages and currents in the circuit. Linear circuits (that
is, an electrical network where all elements have a linear currentvoltage
relation) can be quite easily analyzed through the use of complex numbers and systems of linear equations,10 while nonlinear elements require
a more sophisticated analysis. The classic electrical laws describing
the equilibrium state of an electrical network are due to G. Kirchhoff
[Kir1847]; referred to as Kirchhoffs circuit laws they express in a mathematical form the conservation of charge and energy.11
Formally, we let an electrical circuit be described by branches (or,
links) connecting nodes. We present a simple example where the only
devices are voltage sources, resistors, and diodes. The resulting equilibrium conditions will be shown to be represented as the solution to a
strictly convex quadratic program. In general, devices such as resistors
can be non-linear, but linearity is assumed throughout this section.
A voltage source maintains a constant branch voltage vs irrespective of the branch current cs . The power absorbed by the device
is vs cs .
A diode permits the branch current cd to flow in one direction only,
but consumes no power regardless of the current or voltage on the
branch. Denoting the branch voltage by vd , the direction condition
can be stated as a complementarity condition:
cd 0;
vd 0;
vd cd = 0.
(6.65)
vr = Rr cr .
(6.66)
vr2
= Rr c2r ,
Rr
(6.67)
182
Applications
matrix of the form
0,
otherwise.
(6.68)
(6.69a)
T
ND
p
T
NR p
= vD ,
(6.69b)
= vR .
(6.69c)
cD 0;
vT
D cD = 0.
(6.70)
(6.71)
12 This law is also referred to as the first law, the point rule, the junction rule, and
the node law.
13 This law is a corollary to Ohms law, and is also referred to as the loop law.
183
Lagrangian duality
R being the diagonal matrix with elements equal to the values Rr .
Hence, (6.68)(6.71) represent the equilibrium conditions of the circuit. We will now describe the optimization problem whose optimality
conditions are, precisely, (6.68)(6.71) [note that v S is fixed]:
1 T
c RcR v T
S cS ,
2 R
subject to NS cS + ND cD + NR cR = 0,
cD 0.
minimize
(6.72)
(6.73)
T
ND
p v D = 0,
NRT p v R = 0,
v D 0.
Applications
Finally, let us note that by Theorem 6.13(a) the two problems (6.72)
and (6.73) have the same objective value at optimality. That is,
1
1 T
c RcR + v T
R1 v R v T
S cS = 0.
2 R
2 R
By (6.70)(6.71), the above equation reduces to
T
T
vT
S cS + v D cD + v R cR = 0,
6.7.2
6.7.2.1
continuous relaxation amounts to removing the integrality conditions, replacing, for example, xj {0, 1} by xj [0, 1].
185
Lagrangian duality
With these definitions, the undirected traveling salesman problem
(TSP) is to
minimize
x
subject to
cij xij ,
(6.74a)
(i,j)L
(i,j)L:{i,j}S
(i,j)L
X
xij |S| 1,
S N,
xij = n,
xij = 2,
iN :(i,j)L
(6.74b)
(6.74c)
j N,
(i, j) L.
(6.74d)
(6.74e)
Applications
6.7.2.2
X
X
X
q() = minimum
cij xij +
j 2
xij
x
=2
jN
jN
(i,j)L
j + minimum
x
iN :(i,j)L
(cij i j )xij .
(i,j)L
where the value of xij {0, 1} is the solution to the 1-MST solution
with link costs cij i j . We see from the direction formula that
X
new
xij () ,
j N,
:= j + 2
j
iN :(i,j)L
Lagrangian duality
(downwards) if there are too many (too few) links connected to node j
in the 1-MST. We are hence adjusting the node prices of the nodes in
such a way as to try to influence the 1-MST problem to always choose 2
links per node to connect to.
6.7.2.3
A feasibility heuristic
6.8
Lagrangian duality
The differentiability properties of convex functions were developed
largely by Rockafellar [Roc70], whose text we mostly follow.
Subgradient methods were developed in the Soviet Union in the
1960s, predominantly by Ermolev, Polyak, and Shor. Text book treatments of subgradient methods are found, for example, in [Sho85, HiL93,
Ber99]. Theorem 6.23 is essentially due to Ermolev [Erm66]; the proof
stems from [LPS96]. Theorem 6.24 is due to Shepilov [She76]; finally,
Theorem 6.25 is due to Polyak [Pol69].
Everetts Theorem 6.31 is due to Everett [Eve63].
Theorem 6.34 stems from [Ber04, Proposition 1.1].
That the equilibrium conditions of an electrical or hydraulic network
are attained as the minimum of the total energy loss were known more
than a century ago. Mathematical programming models for the electrical network equilibrium problems described in Section 6.7.1 date at least
as far back as to Duffin [Duf46, Duf47] and dAuriac [dAu47]. Duffin
constructs his objective function as a sum of integrals of resistance functions. The possibility of viewing the equilibrium problem in at least two
related, dual, ways as that of either finding the optimal flows of currents
or the optimal potentials was also known early in the analysis of electrical networks; these two principles are written out in [Cro36] in work on
pipe networks, and explicitly stated as a pair of primaldual quadratic
programming problems in [Den59]; we followed his development, as represented in [BSS93, Section 1.2.D].
The traveling salesman problem is an essential model problem in
combinatorial optimization. Excellent introductions to the field can be
found in [Law76, PaS82, NeW88, Wol98, Sch03]. It was the work in
[HWC74, Geo74, Fis81, Fis85], among others, in the 1970s and 1980s on
the traveling salesman problem and its relatives that made Lagrangian
relaxation and subgradient optimization popular, and it remains most
popular within the combinatorial optimization field.
6.9
Exercises
Exercise 6.1 (numerical example of Lagrangian relaxation) Consider the convex problem to
1
4
+
,
x1
x2
subject to x1 + x2 4,
minimize
x1 , x2 0.
(a) Lagrangian relax the first constraint, and write down the resulting
implicit dual objective function and the dual problem. Motivate why the
190
Exercises
relaxed problem always has a unique optimum, whence the dual objective
function is everywhere differentiable.
(b) Solve the implicit Lagrangian dual problem by utilizing that the gradient to a differentiable dual objective function can be expressed by using the
functions that are involved in the relaxed constraints and the unique solution
to the relaxed problem.
(c) Give an explicit dual problem (a dual problem only in terms of the
Lagrange multipliers). Solve it to confirm the results in (b).
(d) Find the original problems optimal solution.
(e) Show that strong duality holds.
Exercise 6.2 (global optimality conditions) Consider the problem to
minimize f (x) := x1 + 2x22 + 3x33 ,
subject to x1 + 2x2 + x3 3,
2x21 + x2 2,
2x1 + x3 = 2,
xj 0,
j = 1, 2, 3.
(a) Formulate the Lagrangian dual problem that results from Lagrangian
relaxing all but the sign constraints.
(b) State the global primaldual optimality conditions.
Exercise 6.3 (Lagrangian relaxation) Consider the problem to
minimize f (x) := x21 + 2x22 ,
subject to x1 + x2 2,
x21 + x22 5.
x = y AT (AAT )1 Ay .
191
Lagrangian duality
Derive this formula by utilizing Lagrangian duality. Motivate every step
by showing that the necessary properties are fulfilled.
[Note: This exercise is similar to that in Example 5.51, but utilizes Lagrangian duality rather than the KKT conditions to derive the projection
formula.]
Exercise 6.5 (Lagrangian relaxation, exam 040823) Consider the following linear optimization problem:
minimize f (x, y) := x 0.5y,
subject to
x + y 1,
2x + y 2,
(x, y)T R2+ .
(a) Show that the problem satisfies Slaters constraint qualification. Derive
the Lagrangian dual problem corresponding to the Lagrangian relaxation of
the two linear inequality constraints, and show that its set of optimal solutions
is convex and bounded.
(b) Calculate the set of subgradients of the Lagrangian dual function at
the dual points (1/4, 1/3)T and (1, 0)T .
Exercise 6.6 (Lagrangian relaxation) Provide an explicit form of the Lagrangian
dual problem for the problem to
minimize
subject to
n
m X
X
xij ln xij
i=1 j=1
m
X
i=1
n
X
xij = bj ,
j = 1, . . . , n,
xij = ai ,
i = 1, . . . , m,
xij 0,
i = 1, . . . , m,
j=1
j = 1, . . . , n,
where ai > 0, bj > 0 for all i, j, and where the linear equalities are Lagrangian
relaxed.
Exercise 6.7 (Lagrangian relaxation) Given is the problem to
minimize
x
subject to
x21
+ x2 8,
x1 [1, 3],
x2 [2, 5].
(6.75a)
(6.75b)
(6.75c)
(6.75d)
192
Exercises
Exercise 6.8 (Lagrangian duality for integer problems) Consider the primal
problem to
minimize f (x),
subject to
g(x) 0m ,
x X,
0
where
Rm .
(a) Suppose that the set X is finite (for example, consisting of a finite
number of integer vectors). Denote the elements of X by xp , p = 1, . . . , P .
Show that the dual objective function is piece-wise linear. How many linear
segments can it have, at most? Why is it not always built up by that many
segments?
[Note: This property holds regardless of any properties of f and g.]
(b) Illustrate the result in (a) on the linear 0/1 problem to find
z = maximum z = 5x1 + 8x2 + 7x3 + 9x4 ,
subject to
3x1 + 2x2 + 2x3 + 4x4 5,
2x1 + x2 + 2x3 + x4 = 3,
x1 , x2 , x3 , x4 {0, 1},
where the first constraint is considered complicating.
(c) Suppose that the function f and all components of g are linear, and
that the set X is a polytope (that is, a bounded polyhedron). Show that the
dual objective function is also in this case piece-wise linear. How many linear
pieces can it be built from, at most?
Exercise 6.9 (Lagrangian relaxation) Consider the problem to
minimize z = 2x1 + x2 ,
subject to
x1 + x2 5,
x1
4,
x2 4,
x1 , x2 0, integer.
Lagrangian relax the first constraint. Describe the Lagrangian function and
the dual problem. Calculate the Lagrangian dual function at these four points:
= 0, 1, 2, 3. Give the best lower and upper bounds on the optimal value of
the original problem that you have found.
193
Lagrangian duality
Exercise 6.10 (surrogate relaxation) Consider an optimization problem of the
form
minimize f (x),
subject to gi (x) 0,
i = 1, . . . , m,
(P )
x X,
T g (x) 0,
x X.
(S)
(1)
(2)
Surrogate relax the constraints (1) and (2) with multipliers 1 , 2 0 and
= (1, 2)T . Calculate s(
).
formulate the problem (S). Let
Consider again the original problem and Lagrangian relax the constraints
(1) and (2) with multipliers 1 , 2 0. Calculate the Lagrangian dual objec.
tive value at =
Compare the two results!
(c) [comparison with Lagrangian duality] Let 0m and
q() := minimum {f (x) + T g(x)}.
x X
0
holds.
194
0
Part IV
Linear Programming
Linear programming:
An introduction
VII
7.1
Small piece
Large piece
Chair, x2
Table, x1
7.2
(7.1)
Graphical solution
2x1 + 2x2 . But only 6 large pieces and 8 small pieces are available, so
we must have that
2x1 + x2 6,
2x1 + 2x2 8.
(7.2)
(7.3)
(7.4)
(Also, the number of chairs and tables produced must be integers, but
we will not take that into account here.)
Now the objective is to maximize the total income, so if we combine
the income function (7.1) and the constraints (7.2)(7.4) we get the
following linear programming model:
maximize
subject to
z = 1600x1 +1000x2 ,
2x1
2x1
x1 ,
7.3
(7.5)
+x2 6,
+2x2 8,
x2 0.
Graphical solution
7.4
Sensitivity analysis
x2
x = (2, 2)T
x1
z = 5200
z = 2600
z=0
Figure 7.2: Graphical solution of the manufacturing problem.
7.4.1
z = 1600x1 +1000x2 ,
2x1
2x1
x1 ,
200
+x2 7,
+2x2 8,
x2 0.
Sensitivity analysis
The feasible region is shown in Figure 7.3.
x2
2x1 + 2x2 = 12
2x1 + 2x2 = 10
2x1 + x2 = 7
2x1 + x2 = 8
x1
z=0
Figure 7.3: An increase in the number of large and small pieces available.
We see that the optimal solution becomes (3, 1)T and z = 5800,
which means that an additional large piece increases the income by
5800 5200 = 600. Hence the shadow price of the large pieces is 600.
The figure also illustrates what happens if the number of large pieces is
8. Then the optimal solution becomes (4, 0)T and z = 6400. But what
happens if we increase the number of large pieces further? From the
figure it follows that the optimal solution will not change (since x2 0
must apply), so an increase larger than 2 in the number of large pieces
gives no further income. This illustrates that the validity of the shadow
price depends on the actual increment; exactly when the shadow price
is valid is investigated in Theorem 10.8 and Remark 10.9.
7.4.2
Starting from the original setup, in the same manner as for the large
pieces it follows from Figure 7.3 that two additional small pieces give
the new optimal solution x = (1, 4)T and z = 5600, so the income per
201
7.4.3
Now assume that the price of tables is decreased from 1600 to 800. The
new linear program becomes
maximize z = 800x1 +1000x2 ,
subject to
2x1
2x1
x1 ,
+x2 6,
+2x2 8,
x2 0.
This new situation is illustrated in Figure 7.4, from which we see that
x2
x1
1600x1 + 1000x2 = 0
800x1 + 1000x2 = 0
7.5
7.5.1
Suppose that another manufacturer (let us call them Billy) produce book
shelves whose raw material is identical to those used for the tables and
chairs, that is, the small and large pieces. Billy wish to expand their
production, and are interested in acquiring the resources that our factory sits on. Let us ask ourselves two questions, which (as we shall see)
have identical answers: (1) what is the lowest bid (price) for the total
capacity at which we are willing to sell?; (2) what is the highest bid
(price) that Billy are prepared to offer for the resources? The answer to
those two questions is a measure of the wealth of the company in terms
of their resources.
7.5.2
A dual problem
w = 6y1 +8y2 ,
(7.6)
y2 0.
203
7.5.3
From the above we see that the dual optimal solution is identical to the
shadow prices for the resource (capacity) constraints. (This is indeed
a general conclusion in linear programming.) To motivate that this is
reasonable in our setting, we may consider Billy as a fictitious competitor
only, which we use together with the dual problem to measure the value
of our resources. This (fictitious) measure can be used to create internal
prices in a company in order to utilize limited resources as efficiently
as possible, especially if the resource is common to several independent
sub-units. The price that the dual optimal solution provides will then
be a price directive for the sub-units, that will make them utilize the
scarce resources in a manner which is optimal for the overall goal.
We note that the optimal value z = 5200 of the production agrees
with the total value w = 5200 of the resources in our company. (This
is also a general result in linear programming; see the Strong Duality
Theorem 10.6.) Billy will of course not pay more than what the resource
is worth, but can at the same time not offer less than the profit that our
company can make ourselves, since we would then not agree to sell. It
follows immediately that for each feasible production plan x and price
y, it holds that z w, since
z = 1600x1 + 1000x2 (2y1 + 2y2 )x1 + (y1 + 2y2 )x2
= y1 (2x1 + x2 ) + y2 (2x1 + 2x2 ) 6y1 + 8y2 = w,
where in the inequalities we utilize all the constraints of the primal and
dual problems. (Also this fact is general in linear programming; see the
Weak Duality Theorem 10.4.) So, each offer accepted (from our point
of view) must necessarily be an upper bound on our own possible profit,
and this upper bound is what Billy wish to minimize in the dual problem.
204
Linear programming
models
VIII
8.1
Many real world situations can be modelled as linear programs. However, the applicability of a linear program requires certain axioms to be
fulfilled. Hence, often approximations of the real world problem must be
made prior to the formulation of a linear program. The axioms underlying the use of linear programming models are:
proportionality (linearity: no economiesofscale; no fixed costs);
additivity (no substitutetimeeffects);
divisibility (continuity); and
determinism (no randomness).
1
9
14
2
16
29
3
28
19
centers are three steel plants. The unit costs of shipping ore from each
mine to each steel plant are given in Table 8.1.
Further, the amount of ore available at the mines and the Mtons of
ore required at each steel plant are given in the Tables 8.2 and 8.3.
Table 8.2: Amount of ore available at the mines (Mtons).
Mine 1
Mine 2
103
197
71
133
96
x21
x12
x13
x22
Plant 2
Mine 2
x23
Plant 3
Figure 8.1: Illustration of the transportation problem.
The items in this problem are the ore at various locations. Consider
the ore at mine 1. According to Table 8.2 there are only 103 Mtons
of it available, and the amount of ore shipped out of mine 1, which
is x11 + x12 + x13 , cannot exceed the amount available, leading to the
constraint
x11 + x12 + x13 103.
Likewise, if we consider ore at mine 2 we get the constraint
x21 + x22 + x23 197.
Further, at steel plant 1 according to Table 8.3 there are at least 71
Mtons of ore required, so the amount of ore shipped to steel plant 1 has
to be greater than or equal to this amount, leading to the constraint
x11 + x21 71.
In the same manner, for the steel plants 2 and 3 we get
x12 + x22 133,
x13 + x23 96.
208
i = 1, 2,
j = 1, 2, 3.
From Table 8.1 follows that the total cost (in KSEK) of shipping is
z = 9x11 + 16x12 + 28x13 + 14x21 + 29x22 + 19x23 .
Finally, since the objective is to minimize the total cost we get the
following linear programming model:
minimize
subject to
+x12
+x13
x21
x11
+x22
71,
133,
+x21
x12
+x22
x13
x11 ,
+x23
x12 ,
x13 ,
x21 ,
x22 ,
103,
197,
+x23 96,
x23
0.
The transportation problem may be given in a compact general formulation. Assume that we have N sources and M demand centers. For
i = 1, . . . , N and j = 1, . . . , M , introduce the variables
xij = amount of commodity shipped from source i to demand center j,
and let
z = total shipping cost.
Further for i = 1, . . . , N and j = 1, . . . , M introduce the shipping costs
cij = unit cost of shipping commodity from source i to demand center j.
Also, let
si = amount of commodity available at source i, i = 1, . . . , N,
dj = amount of commodity required at demand center j, j = 1, . . . , M.
Consider source i. The amount of commodity available is given by
si , which gives the constraint
M
X
j=1
xij si .
209
xij dj .
i = 1, . . . , N,
j = 1, . . . , M.
N X
M
X
cij xij ,
i=1 j=1
z=
N X
M
X
cij xij ,
i=1 j=1
subject to
M
X
j=1
N
X
i=1
xij si ,
i = 1, . . . , N,
xij dj ,
j = 1, . . . , M,
xij 0,
i = 1, . . . , N,
j = 1, . . . , M.
8.2
x 0n },
(8.1)
8.2.1
Standard form
z = cT x,
subject to
Ax = b,
x 0n ,
211
(8.2a)
(8.2b)
z = cT x,
subject to
Ax = b,
x2 0,
xj 0,
j = 3, . . . , n,
(8.4)
4x1 x2 +2y1 3,
x1 , x2
0.
= 1,
4x1 x2 +2y1
s2 = 3,
x1 , x2 ,
s1 , s2 0.
Finally, by introducing the variables y1+ and y1 we can handle the unrestricted variable y1 by substituting it by y1+ y1 wherever it occurs.
We arrive at the standard form to
minimize
subject to
4x1
x1 ,
x2 , y1 , y1 , s1 , s2
(8.5)
= 1,
= 3,
0.
8.2.2
z = cT x,
subject to
Ax = b,
x 0n ,
(8.6)
xB
xN
A = (B, N ),
xB
xN
1
B b
.
0nm
x2 ,
= 3,
2x4
= 1,
+x4 +x5 = 7,
x3 ,
x4 , x5 0.
The constraint matrix and the right-hand side vector are given by
1 0 1 0 0
3
A = 1 1 0 2 0 , b = 1 .
2 0
0
1 1
7
(a) The partition xB = (x2 , x3 , x4 )T , xN = (x1 , x5 )T ,
0 1 0
1 0
B = 1 0 2 , N = 1 0 ,
0
0
1
2 1
x2
15
x3 1 3
xB
B b
x=
=
=
2
x4 =
7 .
xN
0
x1
0
x5
0
This is, however, not a basic feasible solution (since x2 and x3 are negative).
(b) The partition, xB = (x1 , x2 , x5 )T , xN = (x3 , x4 )T ,
1 0 0
1 0
B = 1 1 0 , N = 0 2 ,
2 0 1
0
1
corresponds to the basic solution
x1
3
x2 1 2
xB
B b
x=
=
x5
=
=
2
1 .
xN
0
x3
0
x4
0
0
0 0
1 1
B = 1 2 0 , N = 1 0 ,
0
1 1
2 0
1 0 0
0 1
B = 1 2 0 , N = 1 0 ,
2 1 1
0
0
corresponds to the basic feasible solution
x1
3
x4 1 1
xB
B b
=
=
x=
2
x5 =
0 ,
xN
0
0
x2
x3
0
There are, of course, infinitely many ways to generate a certain polyhedral cone C. Assume that C = cone {d1 , . . . , dr }. If there exists a
vector di {d1 , . . . , dr } such that
di cone {d1 , . . . , dr } \ {di } ,
k
X
i v i +
i=1
r
X
j d j ,
j=1
Pk
i=1
i = 1, and 1 , . . . , r 0.
We have arrived at the important result that if there exists an optimal solution to a linear program in standard form then there exists an
optimal solution among the basic feasible solutions.
Theorem 8.10 (existence and properties of optimal solutions) Let the
sets P , V , and D be defined as in Theorem 8.9 and consider the linear
program
minimize
subject to
z = cT x,
x P.
(8.7)
k
X
i v i +
i=1
k
X
i=1
r
X
j d j ,
(8.8)
j=1
Pk
i=1
i cT v i +
i = 1, and 1 , . . . , r 0.
r
X
j cT dj .
(8.9)
j=1
k
X
i v i .
i=1
Further, let
a arg minimum cT v i .
i{1,...,k}
Then,
cT v a = cT v a
k
X
i =
i=1
k
X
i=1
i cT v a
k
X
i cT v i = cT x,
i=1
8.2.3
Consider the polytope in Figure 8.2. Clearly, every point on the line
segment joining the extreme points x and u cannot be written as a convex combination of any pair of points that are not on this line segment.
However, this is not true for the points on the line segment between the
extreme points x and w. The extreme points x and u are said to be
adjacent (while x and w are not adjacent).
Definition 8.12 (adjacent extreme points) Two extreme points x and
u of a polyhedron P are adjacent if each point y on the line segment
between x and u has the property that if
y = v + (1 )w,
220
u (adjacent to x)
x
Figure 8.2: Illustration of adjacent extreme points.
x 0n }.
1 1
ym+1 (B 1 )1 n1
(B ) b
,
y=
+
ym+1
0nm
0nm1
y 0n .
But this is in fact the line segment between u and v (if ym+1 = 0 then
y = u and if ym+1 = vm+1 then y = v). In other words, y 1 and y 2 are
on the line segment between u and v, and we are done.
Since the simplex method at each iteration performs exactly the
above replacement action the proposition actually shows that the simplex method at each non-degenerate iteration moves from one extreme
point to an adjacent.
Remark 8.14 Actually a converse of the implication in Proposition
8.13 also holds. Namely, if two extreme points u and v are adjacent,
then there exists a partition (B 1 , N 1 ) corresponding to u and a partition (B 2 , N 2 ) corresponding to v such that the columns of B 1 and B 2
are the same except for one. The proof is similar to that of Proposition
8.13.
222
8.3
The material in this chapter can be found in most books on linear programming, such as [Dan63, Chv83, Mur83, Sch86, DaT97, Pad99, Van01,
DaT03, DaM05].
8.4
Exercises
R
R
aT v b, for all v V ,
aT w b, for all w W .
(b) Construct, if possible, a sphere that separates the sets V and W , that
is, find a center xc Rn and a radius R 0 such that
kv xc k2 R,
v V,
kw x k2 R, for all w W .
for all
f (x) := ( T x + )/(dT x + ),
(8.10)
Ax b,
where
, d R , A R
, and b Rm . Further, assume that the polyhen
dron P := {x R | Ax b} is bounded and that dT x + > 0 for all x P .
Show that (8.10) can be solved by solving the linear program
n
mn
minimize
g(y, z) := T y + z,
subject to
Ay z b 0 ,
dT y + z = 1,
(8.11)
z 0.
[Hint: Suppose that y together with z are a solution to (8.11), and show
that z > 0 and that y /z is a solution to (8.10).]
223
x1 5x2 7x3 ,
z=
subject to
2,
x1
= 11,
x2 ,
x4 0.
(a) Show how to transform this problem into standard form by eliminating
one constraint and the unrestricted variable x3 .
(b) Why cannot this technique be used to eliminate variables with nonnegativity restrictions?
Exercise 8.6 (basic feasible solutions) Suppose that a linear program includes
a free variable xj . When transforming this problem into standard form, xj is
replaced by
xj = x+
j xj ,
x+
j , xj 0.
aij xj = bi ,
i = 1, . . . , m.
(8.12)
j=1
224
aij xj bi ,
aij xj
m
X
i=1
i = 1, . . . , m,
bi .
(8.13a)
(8.13b)
IX
This chapter presents the simplex method for solving linear programs.
In Section 9.1 the algorithm is presented. First, we assume that a basic feasible solution is known at the start of the algorithm, and then
we describe what to do when a BFS is not known from the beginning.
In Section 9.2 we discuss termination characteristics of the algorithm.
It turns out that if all the BFSs of the problem are non-degenerate,
then the algorithm terminates. However, if there exist degenerate BFSs
then there is a possibility that the algorithm cycles between degenerate
BFSs and hence never terminates. Fortunately, the simple Blands rule
eliminates cycling, and which we describe. We close the chapter by discussing the computational complexity of the simplex algorithm. In the
worst case, the algorithm visits all the extreme points of the problem,
and since the number of extreme points may be exponential in the dimension of the problem, the simplex algorithm does not belong to the
desirable polynomial complexity class. The simplex method is therefore
not theoretically satisfactory, but in practice it works very well and thus
it frequently appears in commercial linear programming codes.
9.1
The algorithm
z = cT x,
subject to
Ax = b,
x 0n ,
0,
9.1.1
A BFS is known
T T
Assume that a basic feasible solution x = (xT
B , xN ) corresponding to
the partition A = (B, N ) is known. Then we have that
xB
= BxB + N xN = b,
Ax = (B, N )
xN
or, equivalently,
xB = B 1 b B 1 N xN .
(9.1)
T T
Further, rearrange the components of c such that c = (cT
B , cN ) has the
T
T T
same ordering as x = (xB , xN ) . Then from (9.1) it follows that
T
cT x = cT
B xB + cN xN
1
= cT
b B 1 N xN ) + cT
B (B
N xN
1
T 1
= cT
b + (cT
N )xN .
BB
N cB B
(9.2)
The algorithm
we increase the j th component of the non-basic vector xN from 0 to 1,
then the change in the objective function value becomes
T 1
(
cN )j := (cT
N )j ,
N cB B
that is, the change in the objective function value resulting from a unit
increase of the non-basic variable (xN )j from zero is given by the j th
T
T 1
T
component of the vector c
N.
N := cN cB B
We call (
cN )j the reduced cost of the non-basic variable (xN )j for j =
T
= (
T
1, . . . , n m. Actually, we can define the reduced cost, c
cT
B, c
N) ,
of all the variables at the given BFS by
1
T 1
T
T := cT cT
A = (cT
(B, N )
c
B , cN ) cB B
BB
T 1
= ((0m )T , cT
N );
N cB B
B = 0m .
note that the reduced costs of the basic variables are c
If (
cN )j 0 for all j = 1, . . . , n m, then there exists no adjacent
extreme point such that the objective function value decreases and we
stop; x is then an optimal solution.
Proposition 9.1 (optimality in the simplex method) Let x be the basic feasible solution that corresponds to the partition A = (B, N ). If
(
cN )j 0 for all j = 1, . . . , n m, then x is an optimal solution.
1
Proof. Since cT
b is constant, it follows from (9.2) that the original
BB
linear program is equivalent to
minimize
T
c
N xN
z=
subject to
xB +B 1 N xN = B 1 b,
xB
0m ,
xN 0nm ,
cT
N xN
z=
B
(9.3)
N xN B
b,
nm
xN 0
227
minimum
1
i{ k | (B
N j )k >0 }
(B 1 b)i
.
(B 1 N j )i
minimum
i{ k | (B 1 N j )k >0 }
(B 1 b)i
,
(B 1 N j )i
The algorithm
The Simplex Algorithm:
T T
Step 0 (initialization: BFS). Let x = (xT
B , xN ) be a BFS corresponding to the partition A = (B, N ).
Step 1 (descent direction or termination: entering variable, pricing). Calculate the reduced costs of the non-basic variables:
T 1
(
cN )j := (cT
N )j ,
N cB B
j = 1, . . . , n m.
If (
cN )j 0 for all j = 1, . . . , n m then stop; x is then optimal.
Otherwise choose (xN )j , where
j arg minimum (
cN )j ,
j{1,...,nm}
i arg
minimum
i{ k | (B 1 N j )k >0 }
(B 1 b)i
,
(B 1 N j )i
B T y = cB ,
j{1,...,nm}
c T pj
,
kpj k
1
that is, the usual pricing rule based on cT pj = cT
N j ) + (cN )j =
B (B
(
cN )j is replaced by a rule wherein the reduced costs are scaled by the
length of the candidate search directions pj . (Other scaling factors can
of course be used.)
Remark 9.6 (initial basic feasible solution) Consider the linear program
minimize
subject to
z = cT x,
Ax b,
x 0n ,
230
(9.4)
The algorithm
where A Rmn , b 0m , and c Rn . By introducing slack variables
s Rm we get
minimize
z = cT x,
(9.5)
m
subject to
Ax +I s = b,
x
0n ,
s 0m .
minimum
1
i{ k | (B
N 3 )k >0 }
(B 1 b)i
= {1},
(B 1 N 3 )i
so we choose x5 to leave the basis. The new basic and non-basic vectors
are xB = (x3 , x6 , x7 )T and xN = (x1 , x2 , x5 , x4 )T , and the reduced costs
of the non-basic variables become
T 1
cT
N = (1, 4, 2, 6),
N cB B
minimum
1
i{ k | (B
N 2 )k >0 }
(B 1 b)i
= {2},
(B 1 N 2 )i
and hence x6 is the leaving variable. The new basic and non-basic vectors
become xB = (x3 , x2 , x7 )T and xN = (x1 , x6 , x5 , x4 )T , and the reduced
costs of the non-basic variables are
T 1
cT
N = (13/3, 8/3, 2/3, 6),
N cB B
0.
9.1.2
Often a basic feasible solution is not known initially. (In fact, only if
the origin is feasible in (9.4) we know a BFS immediately.) However, an
initial basic feasible solution can be found by solving a linear program
that is a pure feasibility problem. We call this the phase I problem.
Consider the following linear program in standard form:
minimize
subject to
z = cT x,
Ax = b,
x 0n .
232
(9.6)
The algorithm
In order to find a basic feasible solution we introduce the artificial variables a Rm and consider the phase I problem to
minimize w =
(1m )T a,
subject to
Ax +I m a = b,
x
(9.7)
0n ,
a 0m .
233
z = 2x1 ,
x1
x3
= 3,
x1 x2
2x4 = 1,
2x1
+x4 7,
x1 , x2 , x3 ,
x4 0.
By introducing a slack variable x5 we get the equivalent linear program in standard form:
minimize
subject to
z = 2x1 ,
x1
x3
x1 x2
2x4
(9.8)
= 3,
= 1,
2x1
+x4 +x5 = 7,
x1 , x2 , x3 , x4 , x5 0.
We cannot identify the identity matrix among the columns of the
constraint matrix of the problem (9.8), but the third unit vector e3 is
found in the column corresponding to the x5 -variable. Therefore, we
leave the problem (9.8) for a while, and instead introduce two artificial
variables a1 and a2 and consider the phase I problem to
minimize
subject to
w=
x1
x3
x1 x2
2x4
2x1
+x4 +x5
x1 , x2 ,
x3 ,
x4 ,
x5 ,
a1
+a1
+a2
= 3,
+a2 = 1,
= 7,
a1 ,
a2 0.
The algorithm
(B 1 b)i
= {2},
1
i{ k | (B N 1 )k >0 } (B
N 1 )i
so we choose a2 as the leaving variable. The new basic and non-basic vectors are xB = (a1 , x1 , x5 )T and xN = (a2 , x2 , x3 , x4 )T , and the reduced
costs of the non-basic variables become
arg
minimum
1
T 1
cT
N = (2, 1, 1, 2),
N cB B
minimum
T 1
cT
N = (1, 0, 0, 1),
N cB B
N = (0, 2),
cT
N = cN cB B
x4
1
x1 1 3
xB
B b
x5
=
=
=
x=
2
0
xN
0
x2
0
x3
0
minimum
1
i{ k | (B
N 1 )k >0 }
(B 1 b)i
= {1},
(B 1 N 1 )i
cT
N = (0, 2),
N = cN cB B
2
x2
x1 1 3
xB
B b
x5
=
=
x=
=
2
1
xN
0
x4
0
x3
0
is an alternative optimal basic feasible solution.
9.1.3
z = cT x,
subject to
Ax = b,
x 0n .
T T
Let x = (xT
B , xN ) be an optimal basic feasible solution that corresponds
to the partition A = (B, N ). If the reduced costs of the non-basic variables xN are all strictly positive, then x is the unique optimal solution.
T
c
N xN
z=
xB +B 1 N xN = B 1 b,
xB
0m ,
xN 0nm .
236
Termination
Now if the reduced costs of the non-basic variables are all strictly positive, that is,
cN > 0nm , it follows that a solution for which (xN )j > 0
for some j = 1, . . . , n m cannot be optimal. Hence
1
xB
B b
x=
=
xN
0nm
is the unique optimal solution.
9.2
Termination
minimum
1
i{ k | (B
N j )k >0 }
(B 1 b)i
> 0.
(B 1 N j )i
9.3
Computational complexity
9.4
The simplex method was developed by George Dantzig [Dan51]. The version of the simplex method presented is usually called the revised simplex
method, and was first described by Dantzig [Dan53] and Orchard-Hays
[Orc54]. The first book describing the simplex method was [Dan63].
In the (revised) simplex algorithm several computations are performed using B 1 . The major drawback in this approach is that roundoff
2 By eligible entering variables we mean the variables (x ) for which (
N )j < 0,
N j
and when we have chosen the entering variable j, the eligible leaving variables are
the variables (xB )i such that
i arg
238
minimum
i{ k | (B 1 N j )k >0 }
(B 1 b)i
.
(B 1 N j )i
Exercises
errors accumulate as the algorithm proceeds. This drawback can however be alleviated by using stable forms of LU decomposition or Cholesky
factorization. Most of the software packages for linear programming use
LU decomposition. Early references on numerically stable forms of the
simplex method are [BaG69, Bar71, Sau72, GiM73]. Books that discuss
the subject are [Mur83, NaS96].
The first example of cycling of the simplex algorithm was constructed
by Hoffman [Hof53]. Several methods have been developed for avoiding
cycling, such as the perturbation method of Charnes [Cha52], the lexicographic method of Dantzig, Orden and Wolfe [DOW55], and Blands rule
[Bla77]. In practice, however, cycling is rarely encountered. Instead, the
problem is stalling, which means that the value of the objective function
does not change (or changes very little) for a very large number of iterations3 before it eventually starts to make substantial progress again. So
in practice, we are interested in methods that primarily prevent stalling,
and only secondarily avoid cycling (see, e.g., [GMSW89]).
In 1972, Klee and Minty [KlM72] showed that there exist problems
of arbitrary size that cause the simplex method to examine every possible basis when the standard (steepest-descent) pricing rule is used, and
hence showed that the simplex method is an exponential algorithm in
the worst case. It is still an open question, however, whether there exists
a rule for choosing entering and leaving basic variables that makes the
simplex method polynomial. The first polynomial-time method for linear
programming was given by Khachiyan [Kha79, Kha80], by adapting the
ellipsoid method for nonlinear programming of Shor [Sho77] and Yudin
and Nemirovskii [YuN77]. Karmarkar [Kar84a, Kar84b] showed that interior point methods can be used in order to solve linear programming
problems in polynomial time.
General text books that discuss the simplex method are [Dan63,
Chv83, Mur83, Sch86, DaT97, Pad99, Van01, DaT03, DaM05].
9.5
Exercises
x1 x2 +2x3 1,
x1 ,
x2 ,
x3 0.
239
subject to
+x3 3,
2x1
x2 , x3 0.
(a) Solve this problem by using the simplex algorithm with phase I & II.
(b) Is the optimal solution obtained unique?
Exercise 9.3 (the simplex algorithm) Consider the linear program
z =
T x,
minimize
Ax = b,
x 0n .
subject to
Suppose that at a given step of the simplex algorithm, there is only one
possible entering variable, (xN )j . Also assume that the current BFS is nondegenerate. Show that (xN )j > 0 in any optimal solution.
Exercise 9.4 (cycling of the simplex algorithm) Consider the linear program
minimize
2
x5
5
3
+ x5
5
1
+ x5
5
2
+ x5
5
z=
subject to
x1
x2
x3
x4
x1 , x2 , x3 , x4 ,
x5 ,
2
x6
5
32
x6
5
9
x6
5
8
x6
5
+x6
9
+ x7 ,
5
24
+ x7 = 0,
5
3
+ x7 = 0,
5
1
+ x7 = 0,
5
= 1,
x6 ,
x7 0.
240
Linear programming
duality and sensitivity
analysis
10.1
Introduction
z = cT x,
Ax = b,
(10.1)
x 0n ,
where A Rmn , b Rm , and c Rn , and assume that this problem
T T
has been solved by the simplex algorithm. Let x = (xT
be
B , xN )
an optimal basic feasible solution corresponding to the partition A =
(B, N ). Introduce the vector y Rm through
1
(y )T := cT
.
BB
T
nm T
cT
) .
N (y ) N (0
T
T
T 1
Further, cT
B = (0m )T , so we have that
B (y ) B = cB cB B
cT (y )T A (0n )T ,
or equivalently,
AT y c.
bT y,
subject to
(10.2)
A y c,
y free.
10.2
10.2.1
Canonical form
When presenting the rules for constructing the linear programming dual
we will utilize the notation of canonical form. The canonical form is
connected with the directions of the inequalities of the problem and with
the objective. If the objective is to maximize the objective function, then
every inequality of type is said to be of canonical form. Similarly, if
the objective is to minimize the objective function, then every inequality
of type is said to be of canonical form. Further, we consider nonnegative variables to be variables in canonical form.
Remark 10.1 (mnemonic rule for canonical form) Consider the LP
minimize
subject to
z = x1
x1 1.
z = x1
subject to
x1 1,
10.2.2
dual/primal variable
canonical inequality 0
non-canonical inequality 0
unrestricted
equality
n
X
cj xj ,
j=1
subject to
n
X
j=1
n
X
j=1
n
X
aij xj bi ,
i C,
aij xj bi ,
i N C,
aij xj = bi ,
i E,
xj 0,
xj 0,
j P,
j N,
j=1
xj free,
j F,
where C stands for canonical, N C for non-canonical, E for equality, P for positive, N for negative, and F for free. Note that
244
m
X
bi yi ,
i=1
subject to
m
X
i=1
m
X
i=1
m
X
aij yi cj ,
j P,
aij yi cj ,
j N,
aij yi = cj ,
j F,
i=1
yi 0,
yi 0,
yi free,
i C,
i N C,
i E.
From this it is easily established that if we construct the dual of the dual
linear program, then we return to the original (primal) linear program.
Examples
In order to illustrate how to construct the dual linear program we present
two examples. The first example considers a linear program with matrix
block structure. This is a usual form of linear programs and it is particularly easy to construct the dual linear program. The other example deals
with the transportation problem presented in Section 8.1. The purpose
of constructing the dual to this problem is to show how to handle double
subscripted variables and indexed constraints.
Example 10.2 (the dual to a linear program of matrix block form) Consider the linear program
maximize
subject to
cT x+dT y,
Ax +By b,
x
Dy = e,
0n1 ,
y 0n2 ,
245
minimize
AT u
subject to
c,
B u+D v d,
u
0 m1 ,
v free.
z=
N X
M
X
cij xij ,
i=1 j=1
subject to
M
X
j=1
N
X
i=1
xij si ,
i = 1, . . . , N,
xij dj ,
j = 1, . . . , M,
xij 0,
i = 1, . . . , N,
j = 1, . . . , M.
w=
N
X
si u i +
i=1
subject to
ui + vj cij ,
ui 0,
vj 0,
246
M
X
dj vj ,
j=1
i = 1, . . . , N,
i = 1, . . . , N,
j = 1, . . . , M.
j = 1, . . . , M,
10.3
In this section we present some of the most fundamental duality theorems. Throughout the section we will consider the primal linear program
minimize
subject to
z = cT x,
Ax = b,
(P)
x 0n ,
where A Rmn , b Rm , and c Rn , and its dual linear program
maximize
subject to
w = bT y,
(D)
A y c,
y free.
10.3.1
= y T Ax = y T b
[c AT y,
[Ax = b]
x 0n ]
= b y,
and we are done.
247
(10.3)
for any optimal basic feasible solution to (P). If xB > 0m , then a small
change in b does not change the basis, and so the optimal value of (D)
(and (P)), namely
v(b) := bT y
is linear at, and locally around, the value b. If, however, some (xB )i = 0,
then in this degenerate case it could be that the basis changes in a nondifferentiable manner with b. We summarize:
Theorem 10.8 (shadow price) If, for a given vector b Rm , the optimal
solution to (P) corresponds to a non-degenerate basic feasible solution,
then its optimal value is differentiable at b, with
v(b)
= yi ,
bi
i = 1, . . . , m,
(I)
x0 ,
249
(II)
b y > 0,
bT y,
subject to
(10.4)
n
A y0 ,
y free,
(0n )T x,
Ax = b,
(10.5)
x 0n .
Since (II) is infeasible, y = 0m is an optimal solution to (10.4). Hence
the Strong Duality Theorem 10.6 implies that there exists an optimal
solution to (10.5). This solution is feasible in (I).
What we have proved above is the equivalence
(I) (II).
Logically, this is equivalent to the statement that
(I) (II).
We have hence established that precisely one of the two systems (I) and
(II) has a solution.
10.3.2
Complementary slackness
(10.7)
Further, by the Strong Duality Theorem 10.6 and the Weak Duality
Theorem 10.4, x and y are optimal if and only if cT x = bT y, so in fact
(10.7) holds with equality, that is,
cT x = (AT y)T x
xT (c AT y) = 0.
cT x
subject to
Ax b,
x 0n ,
(10.8)
and
minimize
subject to
bT y
(10.9)
A y c,
y 0m .
j = 1, . . . , n,
(10.10a)
i = 1 . . . , m,
(10.10b)
i = 1, . . . , N,
j = 1, . . . , M.
252
aij yi cj ,
z = 3x1 +2x2 ,
x1 +x2 80,
(10.11)
x2 0,
and
minimize
subject to
y1
y1
y1 ,
+2y2
+y2
y2 ,
(10.12)
+y3 3,
2,
y3 0.
y2 (2x1 + x2 100)
y3 (x1 40)
x1 (y1 + 2y2 + y3 3)
x2 (y1 + y2 2)
=0
=0
=0
y3 = 0
=0
=0
=
=
[x1 = 20 6= 40]
253
10.4
cT
N (0nm )T .
N := cN cB B
nm
X
j=1
so (xB )1 < 0 in the current basis and will be the leaving variable. If
(B 1 N )1j 0,
j = 1, . . . , n m,
(10.13)
(
cB )1 :=
(
cN )j := (
cN )j (
cN )k
(B 1 N )1j
,
(B 1 N )1k
j = 1, . . . , n m.
Since we want the new basis to be dual feasible it must hold that all of
the new reduced costs are non-negative, that is,
(
cN )j (
cN )k
(B 1 N )1j
,
(B 1 N )1k
j = 1, . . . , n m,
or, equivalently,
(
cN )k
(
cN )j
,
(B 1 N )1k
(B 1 N )1j
maximum
i{ j
| (B 1 N )
1j <0 }
(
cN )j
.
1
(B N )1j
We have now derived an infeasibility criterion and criteria for choosing the leaving and the entering variables, and are ready to state the
dual simplex algorithm.
The Dual Simplex Algorithm:
T T
Step 0 (initialization: DFS) Assume that x = (xT
B , xN ) is a dual feasible basis corresponding to the partition A = (B, N ).
j = 1, . . . , n m,
255
maximum
i{ j | (B 1 N )sj <0 }
(
cN )j
,
1
(B N )sj
x1 x2 x3 +x4 +x5 2,
x1 +x2 2x3 +2x4 3x5 4,
x1 ,
x2 ,
x3 ,
x4 ,
x5 0.
x1 x2 x3 +x4 +x5
x1 +x2 2x3 +2x4 3x5
x1 ,
256
x2 ,
x3 ,
x4 ,
=3,
+x7
=2,
+x8 = 4,
x5 , x6 , x7 , x8 0.
Sensitivity analysis
We see that the basis xB := (x6 , x7 , x8 )T is dual feasible, but primal
infeasible. Hence we use the dual simplex algorithm to solve the problem.
We have that
:= B 1 b = (3, 2, 4)T ,
b
so we choose (xB )1 = x6 to leave the basis. Further we have that the
reduced costs of xN := (x1 , x2 , x3 , x4 , x5 )T are
cT
N = (3, 4, 2, 1, 5),
and
(B 1 N )1 = (1, 2, 1, 1, 1),
cT
N = (5, 2, 0, 3, 7),
(B 1 N )2 = (1.5, 0.5, 0.5, 0.5, 0.5)T,
which gives that x3 is the entering variable. The new basis becomes
xB := (x2 , x3 , x8 )T . We get that
b := B 1 b = (1, 1, 5)T ,
which means that the optimality criterion (primal feasibility) is satisfied,
and an optimal solution to the original problem is given by
x = (x1 , x2 , x3 , x4 , x5 )T = (0, 1, 1, 0, 0)T.
Check that this is indeed true, for example by using Theorem 10.12.
10.5
Sensitivity analysis
z = cT x,
subject to
Ax = b,
x 0n ,
(10.14)
namely
257
10.5.1
(10.15)
x 0n .
The optimal solution x to the unperturbed problem (10.14) is obviously
a feasible solution to (10.15), but is it still optimal? To answer this question, we note that a basic feasible solution is optimal if the reduced costs
of the non-basic variables are greater than or equal to zero. The reduced
costs for the non-basic variables of the perturbed problem (10.15) are
T T
given by [let p = (pT
B , pN ) ]
T
T 1
T
c
N.
N = (cN + pN ) (cB + pB ) B
+ (
cN )j 0,
so in this case we only have to check that the perturbation is not less
than (
cN )j in order to guarantee that x is an optimal solution to the
perturbed problem.
258
Sensitivity analysis
Perturbations of a basic cost coefficient
If only one component of cB is perturbed, that is,
pB
ej
p=
=
,
pN
0nm
for some R and j {1, . . . , m}, then we have that x is an optimal
solution to the perturbed problem if
1
T
N (0nm )T
(cN )T (cT
B + ej )B
1
nm T
eT
N + cT
) .
j B
N (0
In this case all of the reduced costs of the non-basic variables may change,
and we must check that the perturbation multiplied by the j th row of
T
B 1 N plus the original reduced costs c
N is a vector whose components
all are greater than or equal to zero.
Perturbations that make x non-optimal
If the perturbation p is such that some of the reduced costs of the perturbed problem becomes strictly negative for the basis xB , then x is
perhaps not an optimal solution anymore. If this happens, let some of
the variables with strictly negative reduced cost enter the basis and continue the simplex algorithm until an optimal solution is found (or until
the unboundedness criterion is satisfied).
10.5.2
Now, assume that the right-hand side b of the linear program (10.14)
is perturbed by the vector p Rm , that is, we consider the perturbed
problem to
minimize
z = cT x,
subject to Ax = b + p,
(10.16)
x 0n .
The reduced costs of the unperturbed problem do not change as the
right-hand side is perturbed, so the basic feasible solution given by the
partition A = (B, N ) is optimal to the perturbed problem (10.16) if
259
B 1 ej + B 1 b 0m ,
10.6
Exercises
10.7
Exercises
0,
x2
x3
x4
0,
free.
z = T x,
Ax = b,
l x u.
A, b,
Exercise 10.3 (application of the Weak and Strong Duality Theorems) Consider the linear program
minimize
subject to
z = T x,
(P)
z =
T x,
Ax = b,
(P)
Ax = b,
x 0n ,
x 0n .
Show that if (P) has an optimal solution, then the perturbed problem (P)
).
cannot be unbounded (independently of b
Exercise 10.4 (application of the Weak and Strong Duality Theorems) Consider the linear program
minimize
subject to
z = T x,
(10.17)
Ax b.
261
minimize
(10.18)
Ax b,
x 0n .
subject to
z = T x,
subject to
Ax b,
x 0n .
(10.19)
y be optimal
z = (y )T Ax .
z=
subject to
x2 ,
x3 ,
= 3,
+x4 x5 2,
x4 ,
x5 0.
where
maximize
z = T x,
subject to
aT x b,
x 1n ,
x 0n ,
c1
c2
cn
.
a1
a2
an
Show that the feasible solution
xj = 1, j = 1, . . . , r 1,
where r is such that
262
Pr1
j=1
x given by
xr =
aj b and
Pr1
j=1
ar
Pr
j=1
aj
xj = 0, j = r + 1, . . . , n,
Exercises
Exercise 10.9 Prove Theorem 10.15.
Exercise 10.10 (KKT versus LP primaldual optimality conditions) Consider
the linear program
minimize
subject to
z = T x,
Ax b.
Show that the KKT conditions are equivalent to the LP primaldual optimality
conditions.
Exercise 10.11 (Lagrangian primaldual versus LP primaldual) Consider the
linear program
minimize
subject to
z = T x,
Ax b.
z=
subject to
x2 ,
x3 ,
x4 0.
Find the values of c3 and c4 such that the basic solution that corresponds
to the partition xB := (x1 , x2 )T is an optimal basic feasible solution to the
problem.
Exercise 10.14 (sensitivity analysis: perturbations in the right-hand side) Consider the linear program
minimize
subject to
z = x1 +2x2 +x3 ,
2x1 +x2 x3 7,
x1 +2x2 +3x3 3 + ,
x1 ,
x2 ,
x3 0.
263
(P)
subject to
Ax = b,
x 0n ,
maximize
w = bT y ,
(D)
subject to
A y
,
y free.
minimize
Show that if one of the problems (P) and (D) has a finite optimal solution,
then so does its dual, and their optimal objective function values are equal.
Exercise 10.16 (an LP duality paradox) For a standard primaldual pair of
LPs, consider the following string of inequalities:
maximum {
T x | Ax b;
x 0n } minimum { bT y | AT y
; y 0m }
maximum { bT y | AT y
; y 0m }
minimum {
T x | Ax b; x 0n }
maximum {
T x | Ax b; x 0n }
minimum { bT y | AT y
; y 0m }
maximum { bT y | AT y
; y 0m }
minimum {
T x | Ax b; x 0n }
maximum {
T x | Ax b; x 0n }.
Since equality must hold throughout, the range of
T x is a constant over the
primal polyhedron, and bT y is constant over the dual polyhedron, yet
, A,
and b are arbitrary. What is wrong in the above line of arguments?
[Note: This and other paradoxes in optimization are found on Harvey
Greenbergs page http://www.cudenver.edu/~hgreenbe/myths/myths.html.]
264
Part V
Algorithms
Unconstrained
optimization
11.1
XI
Introduction
(11.1)
x R
i = 1, . . . , m,
i = 1, . . . , m.
Unconstrained optimization
A minimization will then yield the best fit with respect to the data points
available. The following then is the resulting optimization problem to
be solved:
minimize
f (x) :=
5
x R
m
X
i=1
|fi (x)|2 =
m
X
[fi (x)]2 .
i=1
This type of problem is very often solved within numerical analysis and
mathematical statistics. Note that the 2-norm is not the only measure
of the residual used; sometimes the maximum norm is used.
What is the typical form of an algorithm in unconstrained optimization (in fact, for almost every problem class)? Take a look at Figure 11.1
of the level curves1 of a convex, quadratic function, and the algorithm
description below.
pk+1
5
4
3
2
pk
1
0
xk+1
5
1
2
3
1
xk
2
Descent algorithm:
Step 0 (initialization). Determine a starting point x0 Rn . Set k := 0.
level curve (or, iso-curve, or iso-cost line) is a set of the form { x Rn | f (x) =
k } for a fixed value of k R.
1A
268
Descent directions
Step 1 (descent direction). Determine a descent direction pk Rn .
Step 2 (line search). Determine a step length k > 0 such that f (xk +
k pk ) < f (xk ) holds.
Step 3 (update). Let xk+1 := xk + k pk .
Step 4 (termination check). If a termination criterion is fulfilled, then
stop! Otherwise, let k := k + 1 and go to step 1.
This type of algorithm is inherently local, since we cannot in general
use more than the information that can be calculated at the current point
xk , that is, f (xk ), f (xk ), and 2 f (xk ). As far as our local sight
is concerned, we sometimes call this type of method (for maximization
problems) the near-sighted mountain climber, reflecting the situation
in which the mountain climber is in a deep fog and can only check her
barometer for the height and feel the steepness of the slope under her
feet. Notice then that Figure 11.1 was plotted using several thousands of
function evaluations; in realityand definitely in higher dimension than
twowe never have this type of orienteering map.
We begin by analyzing Step 1, the most important step of the abovedescribed algorithm. Based on the result in Proposition 4.16 it makes
good sense to generate pk such that it is a direction of descent.
11.2
Descent directions
11.2.1
Introduction
pRn :kpk=1
(11.2)
2 We have that f (x)T p = kf (x)k kpk cos , where is the angle between
the vectors f (x) and p; this expression is clearly minimized by making cos =
1, that is, by letting p have the angle 180 with f (x); in other words, p =
f (x)/kf (x )k.
269
Unconstrained optimization
(b) Let f C 2 (N ) in some neighborhood N of xk . If f (xk ) = 0n
we cannot use the steepest descent direction anymore. However, we can
work with second order information provided by the Hessian to find a
descent direction in this case also, provided that f is non-convex at xk .
Assume that 2 f (xk ) is not positive semidefinite (otherwise, xk is likely
to be a local minimum; see Theorem 4.17). If 2 f (xk ) is indefinite we
call the stationary point xk a saddle point of f . Let p be an eigenvector
corresponding to a negative eigenvalue of 2 f (xk ). Then, we call p a
direction of negative curvature for f at xk , and p is a descent direction
since for all > 0 small enough, f (xk + p) f (xk ) = f (xk )T p +
2
2 T 2
2
2
2
2 p f (xk )p + o( ) = 2 kpk + o( ) < 0.
(c) Assume the conditions of (a), and let Q Rnn be an arbitrary
symmetric, positive definite matrix. Then p = Qf (xk ) is a descent
direction for f at xk : f (xk )T p = f (xk )T Qf (xk ) < 0, due to
the positive definiteness of Q. (This is of course true only if xk is nonstationary, as assumed.)
Pre-multiplying by Q may be interpreted as a scaling of f if we
choose a diagonal matrix Q; the use of more general matrices is of course
possible and leads to exceptionally good computational results for clever
choices of Q. Newton and quasi-Newton methods are based on constructing directions in this way. Note that setting Q = I n (the identity
matrix in Rnn ), we obtain the steepest descent direction.
To find some arbitrary direction of descent is not a very difficult
task as demonstrated by Example 11.2 [in fact, the situation when
f (xk ) = 0n appearing in (b) is quite an exotic one already, so typically one can always use directions constructed in (a), or, more generally
(c), as descent directions]. However, in order to secure the convergence
of numerical algorithms we must provide descent directions that behave well numerically. Typical requirements, additional to the basic
requirement of being a direction of descent, are:
|f (xk )T pk | s1 kf (xk )k2 ,
(11.3)
f (xk )T pk
s1 ,
kf (xk )k kpk k
(11.4)
or
Descent directions
For example, the first condition in (11.3) states that if the directional
derivative of f tends to zero then it must be that the gradient of f also
tends to zero, while the second condition makes sure that a bad direction
in terms of the directional derivative is not compensated by the search
direction becoming extremely long in norm. The first condition in (11.4)
is equivalent to the requirement that the cosine of the angle between
f (xk ) and pk is positive and bounded away from zero by the value
of s1 , that is, the angle must be acute and not too close to /2; this is
another way of saying that the direction pk must be steep enough. The
purpose of the second condition in (11.4) then is to ensure that if the
search direction vanishes then so does the gradient. Methods satisfying
(11.3), (11.4) are sometimes referred to as gradient related, since they
cannot be based on search directions that are very far from those of the
steepest descent method.
The choice pk = f (xk ) fulfills (11.3), (11.4) with s1 = s2 = 1.
Another example is as follows: set pk = Qk f (xk ), where Qk
Rnn is a symmetric and positive definite matrix such that mksk2
sT Qk s M ksk2 , for all s Rn , holds. [All eigenvalues of Qk lie in the
interval [m, M ] (0, ).] Then, the requirement (11.3) is verified with
s1 = m, s2 = M , and (11.4) holds with s1 = m/M , s2 = m.
11.2.2
What should a good descent direction accomplish? Roughly speaking, it should provide as large descent as possible, that is, minimize
f (x + p) f (x) over some large enough region of p around the origin.
In principle, this is the idea behind the optimization problem (11.2),
because, according to (2.1), f (x + p) f (x) f (x)T p.
Therefore, more insights into how the scaling matrices Q appearing
in Example 11.2(c) should be constructed and, in particular, reasons why
the steepest descent direction is not a very wise choice, can be gained if
we consider more general approximations than the ones given by (2.1).
Namely, assume that f C 1 near x, and that for some positive definite
matrix Q it holds that
1
f (x + p) f (x) x (p) := f (x)T p + pT Q1 p.
2
(11.5)
271
Unconstrained optimization
where x (p) is defined by (11.5). The closer x (p) approximates f (x +
p) f (x), the better we can expect the quality of the search directions
generated by the method described in Example 11.2(c) to be.
As was already mentioned, setting Q = I n , which absolutely fails
to take into account any information about f (that is, it is a one-sizefits-all approximation), gives us the steepest descent direction. (Cases
can easily be constructed such that the algorithm converges extremely
slowly; convergence can actually be so bad that the authors of the book
[BGLS03] decree that the steepest descent method should be forbidden!)
On the other hand, the best second-order approximation is given by
the Taylor expansion (2.3), and therefore we would like to set Q =
[2 f (x)]1 ; this is exactly the choice made in Newtons method.
Remark 11.3 (a motivation for the descent property in Newtons method)
The search direction in Newtons method is based on the solution of the
following linear system of equations: find p Rn such that
p x (p) := f (x) + 2 f (x)p = 0n .
Consider the case of n = 1. We should then solve
f (x) + f (x)p = 0.
(11.7)
272
Descent directions
type around x [that is, if f (x) > 0], and an ascent direction if it is of the
strictly concave type around x [that is, if f (x) < 0]. In other words, if
the objective function is (strictly) convex or concave, the Newton equation will give us the right direction, if it gives us a direction at all. In
the case when n > 1, Newtons method acts as a descent method if the
Hessian matrix 2 f (x) is positive definite, and as an ascent method if
it is negative definite, which is appropriate.
An essential problem arises if the above-described is not what we
want; for example, we may be interested in maximizing a function which
is neither convex nor concave, and around a current point the function
is of strictly convex type (that is, the Hessian is positive definite). In
this case the Newton direction will not point in an ascent direction,
but instead the opposite. How to solve a problem with a Newton-type
method in a non-convex world is the main topic of what follows. As
always, we consider minimization to be the direction of interest for f .
So, why might one want to choose a matrix Q different from the
best choice [2 f (x)]1 ? There are several reasons:
Lack of positive definiteness The matrix 2 f (x) may not be positive definite. As a result, the problem (11.6) may even lack solutions
and [2 f (x)]1 f (x) may in any case not be a descent direction.
This problem can be cured by adding to 2 f (x) a diagonal matrix
E, so that 2 f (x) + E is positive definite. For example, E = I n , for
smaller than all the non-positive eigenvalues of 2 f (x), may be used
because such a modification shifts the original eigenvalues of 2 f (x)
by > 0. The value of needed will automatically be found when
solving the Newton equation 2 f (x)p = f (x), since eigenvalues
of 2 f (x) are pivot elements in Gaussian-elimination procedures. This
modification bears the name LevenbergMarquardt.
[Note: as becomes large, p resembles more and more the steepest
descent direction.]
Lack of enough differentiability The function f might not be in
C 2 , or the matrix of second derivatives might be too costly to compute.
Either being the case, in quasi-Newton methods one approximates
the Newton equation by replacing 2 f (xk ) with a matrix B k that is
cheaper to compute, typically by only using values of f at the current
and some previous points.
Using a first-order Taylor expansion (2.1) for f (xk ) we know that
2 f (xk )(xk xk1 ) f (xk ) f (xk1 ),
273
Unconstrained optimization
so the matrix B k is taken to satisfy the similar system
B k (xk xk1 ) = f (xk ) f (xk1 ).
Notice that for n = 1, this corresponds to the secant method, in
which at iteration k we approximate the second derivative as
f (xk )
f (xk ) f (xk1 )
.
xk xk1
(B k sk )(B k sk )T
y yT
+ kT k ,
T
sk B k sk
y k sk
Linear system
pk = f (xk )
2 f (xk )pk = f (xk )
2
[ f (xk ) + k I n ]pk = f (xk )
B k pk = f (xk )
11.3
11.3.1
(11.8)
4 These
( ) 0;
( ) = 0;
0,
(11.10)
0.
(11.9)
275
Unconstrained optimization
that is,
f (xk + pk )T pk 0;
f (xk + pk )T pk = 0;
0,
pk
5
4
()
3
2
xk + pk
5
xk
4
1
0
1
2
3
1
4
2
xk + pk
11.3.2
First, we consider the case where f is quadratic; this is the only general
case where an accurate line search is practical.
Let f (x) := (1/2)xT Qx q T x + a, where Q Rnn is symmetric,
q Rn , and a R. Suppose we wish to minimize the function for this
Setting first = 0 in (11.9), then ( ) 0 follows. On the other hand, setting
= 2 in (11.9), then ( ) 0 follows. So, ( ) = 0 must hold. Also,
setting = + 1 in (11.9), we obtain that ( ) 0. This establishes that (11.10)
follows from (4.10). To establish the reverse conclusion and therefore prove that the
two conditions are the same, we note that if we satisfy (11.10), then it follows that
for every 0, ( )( ) = ( ) 0, and we are done.
276
51
0.618.
2
(11.11a)
that is,
f (xk + pk ) f (xk ) f (xk )T pk .
(11.11b)
277
Unconstrained optimization
()
R
(0) + (0)
(0) + (0)
Figure 11.3: The interval R accepted by the Armijo step length rule.
Convergent algorithms
provided only that f is bounded from below and pk is a direction of
descent.
11.4
Convergent algorithms
This section presents two basic convergence results for descent methods
under different step length rules.
Theorem 11.4 (convergence of a gradient related algorithm) Suppose
that f C 1 , and that for the initial point x0 the level set levf (f (x0 )) :=
{ x Rn | f (x) f (x0 ) } is bounded. Consider the iterative algorithm
defined by the description in Section 11.1. In this algorithm, we make
the following choices, valid for each iteration k:
pk satisfies the sufficient descent condition (11.4);
kpk k M , where M is some positive constant; and
the Armijo step length rule (11.11) is used.
k k.
Unconstrained optimization
above we must have that f (xk )T pk 0 holds, so by letting k tend to
infinity we obtain that
f (
x)T p = 0,
which again produces a contradiction to the initial claim because of
(11.4). We conclude that f (
x) = 0n must therefore hold.
The above proof can be repeated almost in verbatim to establish
that any step length rule that provides reduction in the value of f that
is at least as good as that guaranteed by the Armijo rule will inherit its
convergence properties. The main argument is based on the inequality
f (xk+1 xk ) f (xk ) f (
xk+1 ) f (xk )
k f (xk )T pk ,
k+1 and
where x
k are the next iterate and step length resulting from
the use of the Armijo rule, respectively. If we repeat the arguments in the
above proof, replacing k with
k , we obtain the same contradictions
to the condition (11.4). For example, this argument can be used to
establish the convergence of gradient related algorithms using exact line
searches.
is a
We further note that there is no guarantee that the limit point x
local minimum; it may also be a saddle point, that is, a stationary point
where 2 f (
x) is indefinite, if it exists.
Another result is cited below from [BeT00]. It allows the Armijo step
length rule to be replaced by a much simpler type of step length rule
which is also used to minimize a class of non-differentiable functions (cf.
Section 6.4). The proof requires the addition of a technical assumption:
Definition 11.5 (Lipschitz continuity) A C 1 function f : Rn R is said
to have a Lipschitz continuous gradient mapping on Rn if there exists a
scalar L 0 such that
kf (x) f (y)k Lkx yk
(11.12)
280
c1 > 0;
c2 > 0;
Pk
s=1
s = .
11.5
Unconstrained optimization
Notice that using the criterion 2. only might mean that we terminate
too soon if f is very flat; similarly, using only 3., we terminate prematurely if f is steep around the stationary point we are approaching. The
presence of the constant 1 is to remove the dependency of the criterion
on the absolute values of f and xk , particularly if they are near zero.
We also note that using the k k2 norm may not be good when n is
very large: suppose
that f (
x) = (, , . . . , )T = (1, 1, . . . , 1)T . Then,
kf (
x)k2 = n , which illustrates that the dimension of the problem
may enter the norm. Better then is to use the -norm: kf (
x )k :=
)
f (x
max1jn | xj | = ||, which does not depend on n.
Norms may have other bad effects. From
xk1 = (1.44453, 0.00093, 0.0000079)T,
xk = (1.44441, 0.00012, 0.0000011)T;
kxk1 xk k = k(0.00012, 0.00081, 0.0000068)Tk = 0.00081
follows. Here, the termination test would possibly pass, although the
number of significant digits is very small (the first significant digit is still
changing in two components of x!) Norms emphasize larger elements, so
small ones may have bad relative accuracy. This is a case where scaling
is needed.
Suppose we know that x = (1, 104 , 106 )T . If, by transforming
= (1, 1, 1)T , then the same
the space, we obtain the optimal solution x
relative accuracy would be possible to achieve for all variables. Let then
1 0
0
= Dx, where D := 0 104
0 .
x
0 0 106
Let f (x) := 21 xT Qx q T x, where
8
Q := 3 104
0
3 104
4 108
1010
0
1010
6 1012
11
and q := 8 104 .
7 106
x
(D
QD
)
x
(D
q)
x
,
with
2
D 1 QD1
282
8 3
= 3 4
0 1
0
1 ;
6
11
D1 q = 8 ,
7
A comment on non-differentiability
= (1, 1, 1)T . Notice the change in the condition number of the
and x
matrix!
The steepest descent algorithm takes only f (x) into account, not
2 f (x). Therefore, if the problem is badly scaled, it will suffer from a
poor convergence behaviour. Introducing elements of 2 f (x) into the
search direction helps in this respect. This is the precisely the effect of
using second-order (Newton-type) algorithms.
11.6
A comment on non-differentiability
x Rn ,
283
Unconstrained optimization
f (x)
x
Figure 11.4: A piece-wise affine convex function.
b) The step lengths must be chosen differently; exact line searches are
clearly forbidden, as we have just seen.
From such considerations, we may develop algorithms that find optima to non-differentiable problems. They are referred to as subgradient
algorithms, and are analyzed in Section 6.4.
11.7
f (xk ) f (xk + pk )
actual reduction
=
.
f (xk ) k (pk )
predicted reduction
11.8
When applied to nonlinear unconstrained optimization problems conjugate direction methods are methods intermediate between the steepest
285
Unconstrained optimization
xk
Figure 11.5: Trust region and line search step. The dashed ellipses
are two level curves of the quadratic model constructed at xk , while the
dotted circle is the boundary of the trust region. A step to the minimum
of the quadratic model is here clearly inferior to the step taken within
the trust region.
descent and Newton methods. The motivation behind them is similar to that for quasi-Newton methods: accelerating the steepest descent
method but avoid the evaluation, storage and inversion of the Hessian
matrix. They are analyzed for quadratic problems only; extensions to
non-quadratic problems utilize that close to an optimal solution every
problem is nearly quadratic. Even for non-quadratic problems, the last
few decades of developments have resulted in conjugate direction methods being among the most efficient general methodologies available.
11.8.1
Conjugate directions
1 T
x Qx q T x,
2
(11.13)
n1
X
i=0
pT
i q
pi .
pT
i Qpi
(11.16)
Two ideas are embedded in (11.16): by selecting a proper set of orthogonal vectors pi , and by taking the appropriate scalar product all terms
but i in (11.14) disappear. This could be accomplished by using any n orthogonal vectors, but (11.15) shows that by making them Q-orthogonal
we can express wi without knowing x .
11.8.2
k = 0, . . . , n 1,
Unconstrained optimization
{xk+1 } = arg minimum f (x)
x Mk
(11.17)
holds. To show this, note that by the exact line search rule, for all i,
f (xi + pi )
= f (xi+1 )T pi = 0.
=i
and for i = 0, 1, . . . , k 1,
k
X
= xi+1 +
j pj Qpi q T pi
j=i+1
xT
i+1 Qpi
q T pi
= f (xi+1 )T pi ,
11.8.3
288
i
X
ci+1
m pm ,
m=0
ci+1
m
choosing
so that pi+1 is Q-orthogonal to p0 , p1 , . . . , pi . This will
be true if, for each j = 0, 1, . . . , i,
!T
i
X
T
pT
ci+1
Qpj = 0.
i+1 Qpj = di+1 Qpj +
m pm
m=0
dT
i+1 Qpj
,
pT
j Qpj
j = 0, 1, . . . , i.
11.8.4
The conjugate gradient method applies the above GramSchmidt procedure to the vectors
d0 = f (x0 ),
d1 = f (x1 ),
...,
dn1 = f (xn1 ).
k1
X
j=0
f (xk )T Qpj
pj .
pT
j Qpj
(11.18)
289
Unconstrained optimization
It holds that p0 = f (x0 ), and termination occurs at step k if f (xk ) =
0n ; the latter happens exactly when pk = 0n . (Why?)
[Note: the search directions are based on negative gradients of f ,
f (xk ) = q Qxk , which are identical to the residual in the linear
system Qx = q that identifies the optimal solution to (11.13).]
The formula (11.18) can in fact be simplified. The reason is that,
because of the successive optimization over subspaces, f (xk ) is orthogonal to the subspace spanned by p0 , p1 , . . . , pk1 .
Theorem 11.10 (the conjugate gradient method) The directions of the
conjugate gradient method are generated by
p0 = f (x0 );
pk = f (xk ) + k pk1 ,
(11.19a)
k = 1, 2, . . . , n 1,
(11.19b)
where
k =
f (xk )T f (xk )
.
f (xk1 )T f (xk1 )
(11.19c)
f (xi )T Qpj =
1 T
p [f (xj+1 ) f (xj )].
j j
f (xk )T f (xk )
.
pT
k1 [f (xk ) f (xk1 )]
x k2Q
nk 1
nk + 1
2
kx0 x k2Q ,
Unconstrained optimization
This is in sharp contrast with the convergence rate of the steepest
descent algorithm, which is known to be
kxk+1
x k2Q
n 1
n + 1
2
kxk x k2Q ;
11.8.5
f (xk+1 ) f (x )
292
1
n + 1
n
2
[f (xk ) f (x )].
11.9
(11.22a)
(11.22b)
xk+1 = xk + k dk ;
(11.22c)
pk = k dk ;
q k = f (xk+1 ) f (xk );
Hk+1 = Hk +
pk pT
(Hk q k )(q T
k
k Hk )
;
T
T
pk q k
q k Hk q k
(11.22d)
(11.22e)
(11.22f)
Unconstrained optimization
We first demonstrate that the matrices Hk are positive definite. For
any x Rn we have
xT Hk+1 x = xT Hk x +
1/2
(xT pk )2
(xT Hk q k )2
.
pk q k
qT
k Hk q k
1/2
since
pT
k f (xk+1 ) = 0
(11.23)
and hence
xT Hk+1 x =
Both terms in the right-hand side are non-negative, the first because
of the CauchyBunyakowskiSchwarz inequality. We must finally show
that not both can be zero at the same time. The first term disappears
precisely when a and b are parallel. This in turn implies that x and q k
are parallel, say, x = q k for some R. But this would mean that
T
T
pT
k x = pk q k = k f (xk ) Hk f (xk ) 6= 0,
0 i < j k,
0ik
(11.24a)
(11.24b)
(11.25)
and
Hk+1 Qpk = Hk+1 q k = pk ,
(11.26)
0 i < k.
i < k,
(11.27)
0 i < k,
we have that
Hk+1 Qpi = Hk Qpi = pi ,
0 i < k.
Unconstrained optimization
11.10
Convergence rates
The local convergence rate is a statement about the speed in which one
iteration takes the guess closer to the solution.
Definition 11.12 (local convergence rate) Suppose that {xk } Rn and
that xk x . Consider for large k the quotients
qk :=
kxk+1 x k
.
kxk x k
qk
c,
kxk x k
c 0.
11.11
Implicit functions
x Rn
Simulation
y Rm
,
(forward difference)
xi
h
f (x)
f (x + hei ) f (x hei )
.
(central difference)
xi
2h
The value of h is typically set to a function of the machine precision; if
chosen too large, we get a bad approximation of the partial derivative,
while a too small value might result in numerical cancellation.
The automatic differentiation technique exploits the inherent structure of most practical functions, that they almost always are evaluated
through a sequence of elementary operations. Automatic differentiation
represents this structure in the form of a computational graph; when
forming partial derivatives this graph is utilized in the design of chain
rules for the automatic derivative calculation. In applications to simulation models, this means that differentiation is performed within the
simulation package, thereby avoiding some of the computational cost and
the potential instability inherent in difference formulas.
11.12
The material of this chapter is classic; text books covering similar material in more depth include [OrR70, DeS83, Lue84, Fle87, BSS93, BGLS03].
Line search methods were first developed by Newton [New1687], and the
steepest descent method is due to Cauchy [Cau1847]. The Armijo rule is
due to Armijo [Arm66], and the Wolfe condition is due to Wolfe [Wol69].
The classic book by Brent [Bre73] analyzes algorithms that do not use
derivatives, especially line search methods.
297
Unconstrained optimization
Rademachers Theorem [Rad19] states that a Lipschitz continuous
function is differentiable everywhere except on sets of Lebesgue measure
zero. The Lipschitz condition is due to Lipschitz [Lip1877]. Algorithms
for the minimization of non-differentiable convex functions are given in
[Sho85, HiL93, Ber99, BGLS03].
Trust region methods are given a thorough treatment in the book
[CGT00]. The material on the conjugate gradient and BFGS methods
was collected from [Lue84, Ber99]; another good source is [NoW99].
A popular class of algorithms for problems with an implicit objective
function is the class of pattern search methods. With such algorithms
the search for a good gradient-like direction is replaced by calculations of
the objective function along directions specified by a pattern of possible
points. For an introduction to the field, see [KLT03].
Automatic differentiation is covered in the monograph [Gri00].
11.13
Exercises
298
Exercises
Exercise 11.5 (steepest descent) Consider the problem to
f (x) := (2x21 x2 )2 + 3x21 x2 .
minimize
n
x
(a) Perform one iteration of the steepest descent method using an exact
line search, starting at x0 := (1/2, 5/4)T .
(b) Is the function convex around x1 ?
(c) Will it converge to a global optimum? Why/why not?
Exercise 11.6 (Newtons method with exact line search) Consider the problem
to
minimize
f (x) := (x1 + 2x2 3)2 + (x1 2)2 .
n
x
(a) Start from x0 := (0, 0)T , and perform one iteration of Newtons method
with an exact line search.
(b) Are there any descent directions from x1 ?
(c) Is x1 optimal? Why/why not?
Exercise 11.7 (Newtons method with Armijo line search) Consider the problem to
1
minimize
f (x) := (x1 2x2 )2 + x41 .
x n
2
(a) Start from x0 := (2, 1)T , and perform one iteration of Newtons method
with the Armijo rule, using the fraction requirement = 0.1.
(b) Determine the values of (0, 1) such that the step length = 1 will
be accepted.
Exercise 11.8 (Newtons method for nonlinear equations) Suppose the function f : Rn Rn is continuously differentiable and consider the following
system of nonlinear equations:
f (x) = 0n .
Newtons method for the solution of unconstrained optimization problems has
its correspondence for the above problem.
Given an iterate xk we construct the following linear approximation of the
nonlinear function:
f1 (x)T
B f2 (x)T C
B
C
f (x) = B
C
..
A
.
T
fn (x)
299
Unconstrained optimization
is the Jacobian of f at x. Assuming that f (x) is non-singular, this linear
system has a unique solution which defines the new iterate, xk+1 , that is,
f (x , x )
f (x1 , x2 ) = f1 (x1 , x2 ) =
2
1
2
0
.
0
x0 = (1, 0)T .
f1 (x1 , x2 )2 + f2 (x1 , x2 )2
minimize
n
x
1
kAx bk2 ,
2
300
Exercises
Exercise 11.12 (LevenbergMarquardt, exam 990308) Consider the unconstrained optimization problem to
minimize f (x) := qT x +
subject to
xR ,
1 T
x Qx,
2
(11.28a)
(11.28b)
where Q Rnn is symmetric and positive semidefinite but not positive definite. We attack the problem through a LevenbergMarquardt strategy, that
is, we utilize a Newton-type method where a multiple > 0 of the unit matrix
is added to the Hessian of f (that is, to the matrix Q) in order to guarantee
that the (modified) Newton equation is uniquely solvable. (See Section 11.2.2.)
This implies that, given an iteration point xk , the search direction pk is determined by solving the linear system
[2 f (xk ) + I n ]p = f (xk )
[Q + I ]p = (Qxk + q).
n
(11.29)
xk+1 := xk + pk ,
k = 0, 1, . . . ,
(11.30)
that is, the algorithm that is obtained by utilizing the Newton-like search
direction pk from (11.29) and the step length 1 in every iteration. Show that
this iterative step is the same as that to let xk+1 be given by the solution to
the problem to
ky xk k2 ,
2
Rn .
minimize f (y ) +
subject to
(11.31a)
(11.31b)
(b) Suppose that an optimal solution to (11.28) exists. Suppose also that
the sequence {xk } generated by the algorithm (11.30) converges to a point x .
(This can actually be shown to hold.) Show that x is optimal in (11.28).
[Note: This algorithm is in fact a special case of the proximal point algorithm. Suppose that f is a convex function on Rn and the variables are
constrained to a non-empty, closed and convex set S Rn .
We extend the iteration formula (11.31) to the following:
minimize f (y ) +
subject to
y S,
k
ky xk k2 ,
2
(11.32a)
(11.32b)
301
Unconstrained optimization
Exercise 11.13 (unconstrained optimization algorithms, exam 980819) Consider the unconstrained optimization problem to minimize f (x) over x Rn ,
where f : Rn R is in C 1 . Let {xk } be a sequence of iteration points generated by some algorithm for solving this problem, and suppose that it holds
that f (xk ) 0n , that is, the gradient value tends to zero (which of course
is a favourable behaviour of the algorithm). The question is what this means
in terms of the convergence of the more important sequence {xk }.
Consider therefore the sequence {xk }, and also the sequence {f (xk )} of
function values. Given the assumption that f (xk ) 0n , is it true that
{xk } and/or {f (xk )} converges or are even bounded? Provide every possible
case in terms of the convergence of these two sequences, and give examples,
preferably simple ones for n = 1.
Exercise 11.14 (conjugate directions) Prove Proposition 11.9.
Exercise 11.15 (conjugate gradient method) Apply the conjugate gradient
method to the system Qx = q , where
0
2
Q = 1
0
1
2
1
0
1A
2
0 1
and
q = 1A .
1
Exercise 11.16 (convergence of the conjugate gradient method, I) In the conjugate gradient method, prove that the vector pi can be written as a linear
combination of the set of vectors {q , Qq, Q2 q, . . . , Qi q}. Also prove that xi+1
minimizes the quadratic function Rn x 7 f (x) := 12 xT Qx q T x over all
the linear combinations of these vectors.
Exercise 11.17 (convergence of the conjugate gradient method, II) Use the
result of the previous problem to establish that the conjugate gradient method
converges in a number of iterations equal to the number of distinct eigenvalues
of the matrix Q.
302
Optimization over
convex sets
12.1
XII
(12.1a)
subject to x X,
(12.1b)
12.2
(12.2)
(12.3)
k = 0, 1, . . . ,
(12.4)
k ,
k K.
y X,
we obtain that
f (x )T (y x ) f (x )T (y x ) = 0,
y X.
Since the limit point was arbitrarily chosen, the first result follows.
The second part of the theorem follows from Theorem 4.24.
Figure 12.1 illustrates the LP problem in Step 1 at a non-stationary
point xk , the resulting extreme point y k , and search direction pk .
We have above established the result for the Armijo rule. By applying
the same technique as that discussed after Theorem 11.4 for gradient
related methods in unconstrained optimization, we can also establish
convergence to stationary points under the use of exact line searches.
Under additional technical assumptions we can establish that the
sequence f (xk )T pk 0; see Exercise 12.2.
Under the assumption that f is convex, several additional techniques
for choosing the step lengths are available; see the notes for references.
We refer to one such choice below.
306
f (xk )
15
xk
10
pk
yk
5
5
10
(12.5a)
(12.5b)
If the sequence {xk } is finite, then the last iterate solves (12.1). Otherwise, f (xk ) f , and the sequence {xk } converges to the set of
solutions to (12.1): distX (xk ) 0. In particular, any limit point of
{xk } solves (12.1).
(b) Suppose that f is Lipschitz continuous on X. In the Frank
Wolfe algorithm, suppose the step lengths k (0, 1] are chosen according to the quadratically convergent divergent step length rule (6.41),
(6.42). Then, the conclusions in (a) hold.
1 According
307
12.3
Consider the problem (12.1) under the same conditions as stated in Section 12.2. The simplicial decomposition algorithm builds on the Representation Theorem 3.22.
In the below description we let P denote the set of extreme points
of X. We also denote by Pk a subset of the extreme points which have
been generated prior to iteration k and which are kept in memory; an
element of this set is denoted by y i in order to not mix these extreme
points with the vectors y k solving the LP problem.
The simplicial decomposition algorithm works as follows:
Simplicial decomposition algorithm:
Step 0 (initialization). Generate the starting point x0 X (for example
by letting it be any extreme point in X). Set k := 0. Let Pb0 =
0 = x0 .
P0 := . Let x
Step 2 (multidimensional line search). Let k+1 be an approximate solution to the restricted master problem to
X
k +
k ) ,
minimize f x
i (y i x
(12.6a)
subject to
iPk+1
iPk+1
i 1,
i 0,
(12.6b)
i Pk+1 ,
(12.6c)
k Xk := conv ({
where x
xk1 } {y i | i Pk }).
P
k + iPk+1 ( k+1 )i (y i x
k ).
Step 3 (update). Let xk+1 := x
bk := { i Pk | ( k )i > 0 } .
P
k := x0 .
For all k, x
ik arg minimum {( k1 )i },
bk
iP
k := xk .
where ties are broken arbitrarily, and x
b
k := xk .
(d) [FrankWolfe]: For all k, Pk := and x
minimize f over the convex hull of them, and repeat until we either get
close enough to a stationary point or if the last LP did not give us a new
extreme point. (In the latter case we are at a stationary point! Why?)
Suppose instead that we drop every extreme point that got a zero
weight in the last restricted master problem, that is, we work according
to the principle in (b). We then remove all the extreme points that we
believe will not be useful in order to describe the optimal solution as a
convex combination of them.
The algorithm corresponding to the principle in (c) is normally called
the restricted simplicial decomposition algorithm; it allows us to drop
extreme points in order to keep the memory requirements below a certain
threshold. In order to do so, we may need to also throw away an extreme
point that had a positive weight at the optimal solution to the previous
restricted master problem, and we implement this by removing one with
the least weight.
The most extreme case of the principle in (c) is to throw away every
point that was previously generated, and keep only the most recent one.
(It corresponds to letting r = 1.) Then, according to the principle in
(d), we are back at the FrankWolfe algorithm!
309
X
minimize f
xk +
i y i ,
(12.7a)
(,)
subject to
iPk+1
i = 1,
(12.7b)
iPk+1
, i 0,
i Pk+1 .
(12.7c)
12.4
k = 1, . . . ,
(12.8)
311
xk (
/2)f (xk )
xk (
/4)f (xk )
xk
(12.11)
where
k := 2k f (xk )T (xk ProjX [xk k f (xk )]),
k = 0, 1, . . . .
2k
[f (xk ) f (xk+1 )]
[f (xk ) f (xk+1 )].
(12.12)
By (12.12),
k=0
[f (x0 ) f (x )] < .
(12.13)
k1
X
j=0
k ka0 xk2 +
X
j=0
k C < .
k kalk1 a
k +
kak a
k1
X
j=lk1
k +
j kalk1 a
j=lk1
j <
+ = .
2 2
.
We conclude that ak a
By the above lemma, we conclude that {xk } is convergent to a vector
x . This vector must be stationary, by Theorem 12.3, which means, by
convexity, that it is also globally optimal. We are done.
Suppose now that X = Rn . Then the gradient projection algorithm
reduces to the steepest descent method in unconstrained optimization,
and the Armijo step length rule along the projection arc reduces to
the classic Armijo rule. The above result then states that the steepest
descent algorithm converges to an optimal solution whenever f is convex
and there exist optimal solutions (see Theorem 11.7).
Finally, we consider the problem of performing the Euclidean projection. This is a strictly convex quadratic programming problem of the
form (4.12). We will show that we can utilize the phase I procedure of
the simplex method (see Section 9.1.2) in order to solve this problem. We
take a slightly more general viewpoint here, and present the algorithm
for a general strictly convex quadratic program.
315
(12.14)
x 0n ,
v x = 0,
x, y, v 0,
(12.15a)
(12.15b)
(12.15c)
(12.15d)
(12.15e)
where y and v are the vectors of Lagrange multipliers for the constraints
of (12.14). We introduce a slack variable vector s in (12.15b), and can
therefore write the above system equivalently as
Qx + AT y v = q,
(12.16a)
(12.16b)
(12.16c)
(12.16d)
(12.16e)
Ax + I s = b,
y s = 0,
v x = 0,
x, s, y, v 0.
12.5
12.5.1
Model analysis
317
h = d,
(12.17a)
(12.17b)
t(v )T (v v ) 0,
(12.18)
|R|
where F := { v R|L| | h R+ with T h = d and v = h } is the
set of demand-feasible link volumes.
In the case where t is integrable,5 the model (12.18) defines the firstorder optimality conditions for an optimization problem; assuming, further, that t is separable, that is, that tl is a function only of vl , l L,
the optimization problem has the form
X Z vl
tl (s) ds,
(12.19)
minimize f (v) :=
(h,v )
lL
subject to T h = d,
v = h,
h 0|R| .
This is the classic traffic assignment problem.
Since the feasible set of the problem (12.19) is a bounded polyhedron there exists a nonempty and bounded set of optimal link and route
5 If t is continuously differentiable, then integrability is equivalent to the symmetry
of its Jacobian matrix t(v ) everywhere. Integrability is a more general property
than this symmetry property, since t need not be always be differentiable.
318
(12.20)
T h = d,
subject to
v h = 0|L| ,
h 0|R| .
(12.21)
subject to T 0|R| ,
= t(v ).
Eliminating through = t(v ) the primaldual optimality conditions
are, precisely, the Wardrop conditions (12.17), together with the consistency condition v = h.
Suppose, in addition, that each link cost function tl is increasing; this
is a natural assumption considering that congestion on a link, that is, the
travel time, increases with its volume. According to Theorem 3.40(b)
this means that the function f is convex, and therefore the problem
(12.19) is a convex one. Therefore also the optimality conditions stated
in the variational inequality (12.18) or the (equivalent) Wardrop conditions (12.17) are both necessary and sufficient for a volume v to be an
equilibrium one.
If further tl is strictly increasing for every l L then the solution v
is unique (cf. Proposition 4.11).
12.5.2
10
SD/Grad. proj. 1
SD/Grad. proj. 2
SD/Newton
FrankWolfe
10
10
10
10
10
10
10
10
10
10
11
10
10
20
30
40
50
CPU time (s)
60
70
80
90
100
12.6
Algorithms for linearly constrained optimization problems are disappearing from modern text books on optimization. It is perhaps a sign of maturity, as we are now better at solving optimization problem with general constraints, and therefore do no longer have to especially consider
the class of linearly constrained optimization problems. Nevertheless we
feel that it provides a link between linear programming and nonlinear
optimization problems with general constraints, being a subclass of nonlinear optimization problems for which primal feasibility can be retained
throughout the procedure.
The FrankWolfe method was developed for QP problems in [FrW56],
and later for more general problems, including non-polyhedral sets, in
[Gil66] and [PsD78, Section III.3], among others. The latter source includes several convergence results for the method under different step
length rules, assuming that f is Lipschitz continuous, for example a
Newton-type step length rule. The convergence Theorem 12.1 for the
FrankWolfe algorithm was taken from [Pat98, Theorem 5.8]. The convergence result for convex problems given in Theorem 12.2 is due to
Dunn and Harshbarger [DuH78]. The version of the FrankWolfe algorithm produced by the selection k := 1/k is known as the method of
successive averages (MSA).
The simplicial decomposition algorithm was developed in [vHo77].
Restricted simplicial decomposition methods have been developed in
[HLV87, Pat98].
The gradient projection method presented here was first given in
[Gol64, LeP66]; see also the textbook [Ber99]. Theorem 12.4 is due to
[Ius03], while Lemma 12.5 is due to [BGIS95].
The traffic equilibrium models of Section 12.5 are described and analyzed more fully in [She85, Pat94].
Apart from the algorithms developed here, there are other classical
algorithms for linearly constrained problems, including the reduced gradient method, Rosens gradient projection method, active set methods,
and other sub-manifold methods. They are not treated here, as some of
321
12.7
Exercises
Rn , with modulus L.
f (x + p) f (x) f (x)T p +
L
kpk2
2
holds.
Proof. Let R and g() := f (x + p). The chain rule yields
pT f (x + p). Then,
f (x + p) f (x) = g(1) g(0) =
Z
dg
() d =
d
pT f (x) d +
p f (x) d +
pT f (x) + kpk
pT f (x) +
1
0
1
0
1
1
0
dg
()
d
pT f (x + p) d
pT [f (x + p) f (x)] d
kpkkf (x + p) f (x)k d
Lkpk d
L
kpk2 .
2
We are done.
Apply this result to the inequality resulting from applying the Armijo
rule at a given iteration k, with x replaced by xk and p replaced by k pk .
322
Exercises
Summing all these inequalities and utilizing that {kpk k} isP
a bounded sequence
2
thanks to the boundedness of X, conclude that the sum
k=0 [zk (y k )] must
be convergent and therefore zk (y k ) 0 must hold.]
Exercise 12.3 (numerical example of the FrankWolfe algorithm) Consider the
problem to
minimize f (x) := (x1 + 2x2 6)2 + (2x1 x2 2)2 ,
x1 3,
x1 , x2 0.
(a) Show that the problem is convex.
(b) Apply one step of the FrankWolfe algorithm, starting at the origin.
Provide an interval where f lies.
Exercise 12.4 (numerical example of the FrankWolfe algorithm) Consider the
problem to
maximize f (x) := x21 4x22 + 16x1 + 24x2 ,
subject to x1 + x2 6,
x1 x2 3,
x1 , x2 0.
1
2
x1
1
2
2
1 2
x2 ,
2
x1 1,
x2 1,
x1 , x2 0.
Apply two iterations of the FrankWolfe algorithm, starting at
(1, 1)T . Give upper and lower bounds on the optimal value.
x0 :=
323
324
Constrained
optimization
XIII
13.1
Penalty methods
(13.1)
(13.2)
(
0,
if x S,
S (x) =
+, otherwise.
Constrained optimization
practical point of view we would like to replace the additional term S
with a numerically better behaving function.
There are two alternative approaches achieving this. The first is
called the penalty, or the exterior penalty method, in which we add
a penalty to the objective function for points not lying in the feasible
set and thus violating some of the constraints. This method typically
generates a sequence of infeasible points, approaching optimal solutions
to the original problem from the outside (exterior) of the feasible set,
whence the name of the method. The function S is approximated from
below in these methods.
Alternatively, in the barrier, or interior point methods, we add a continuous barrier term that equals + everywhere except in the interior
of the feasible set and thus ensure that globally optimal solutions to the
approximating unconstrained problems do not escape the feasible set of
the original constrained problem. The method thus generates a sequence
of interior points, whose limit is an optimal solution to the original constrained problem. The function S is approximated from above in
these methods.
Clearly we would like to transfer nice properties of original constrained problems, such as convexity, smoothness, to penalized problems
as well. We easily achieve this by carefully choosing penalty functions;
use Exercises 13.1 and 13.2 to verify that convexity may be easily transferred to penalized problems.
13.1.1
i = 1, . . . , m;
j = 1, . . . , },
(13.3)
X
m
i=1
max{0, gi (x)} +
X
j=1
hj (x) , (13.4)
where the real number > 0 is called the penalty parameter. The
different treatment of inequality and equality constraints in the equation (13.4) stems from the fact that equality constraints are violated at
326
Penalty methods
(13.5)
Constrained optimization
f (x ) + S (x ) = f (x ) holds for every positive 1 2 . In fact, we
can establish an even stronger inequality, which will be used later; see
the following lemma.
Lemma 13.2 (penalization constitutes a relaxation) For every positive
1 2 it holds that f (x1 ) f (x2 ).
Proof. The claim is trivial for 1 = 2 , thus we assume that 1 <
2 . Since x1 minimizes f (x) + 1
S (x), and x2 is feasible in this
(unconstrained) optimization problem, it holds that
f (x1 ) + 1
S (x1 ) f (x2 ) + 1
S (x2 ).
Similarly,
(13.6)
f (x2 ) + 2
S (x2 ) f (x1 ) + 2
S (x1 ).
(13.7)
Penalty methods
Thus,
S (x ) converges to zero as converges to +, and, owing to the
continuity of
S , every limit point of the sequence {x } must be feasible
in (13.1).
denote an arbitrary limit point of {x }, that is,
Now, let x
,
lim xk = x
for some sequence {k } converging to infinity. Then, we have the following chain of inequalities:
f (
x) = lim f (xk ) lim {f (xk ) + k
S (xk )} f (x ),
k+
k+
where the last inequality follows from (13.7). However, owing to the
in (13.1) the reverse inequality f (x ) f (
feasibility of x
x) must also
hold. The two inequalities combined imply the required claim.
We emphasize that Theorem 13.3 establishes the convergence of globally optimal solutions only; the result may therefore be of limited practical value for nonconvex nonlinear programs. However, assuming more
regularity of the stationary points, such as LICQ (see Definition 5.41),
and using specific continuously differentiable penalty functions, such as
(s) := s2 , we can show that every limit point of sequences of stationary points of (13.5) also is stationary (i.e., a KKT point) in (13.1).
Furthermore, we easily obtain estimates of the corresponding Lagrange
).
multipliers (,
Theorem 13.4 (convergence of a penalty method) Let the objective
function f : Rn R and the functions gi : Rn R, i = 1, . . . , m,
and hj : Rn R, j = 1, . . . , , defining the inequality and equality
constraints of (13.1) be in C 1 (Rn ). Further assume that the penalty
function : R R+ is in C 1 and that (s) 0 for all s 0.
Consider a sequence {xk } of points that are stationary for the sequence of problems (13.5), for some positive sequence of penalty param , and that
eters {k } converging to +. Assume that limk+ xk = x
. Then, if x
is feasible in (13.1) it must also verify the
LICQ holds at x
KKT conditions.
In other words,
xk stationary in (13.5)
as k +
xk x
stationary in (13.1).
= x
LICQ holds at x
feasible in (13.1)
x
329
Constrained optimization
Proof. [Sketch] Owing to the optimality conditions (4.14) for unconstrained optimization we know that every point xk , k = 1, 2, . . . , necessarily satisfies the equation
[f (xk ) + k
S (xk )] = f (xk )
m
X
k [max{0, gi (xk )}]gi (xk )
+
(13.8a)
(13.8b)
i=1
(13.8c)
j=1
Let, as before, I(
x) denote the index set of active inequality constraints
at x. If i 6 I(
x) then gi (xk ) < 0 for all large k, and the terms corresponding to this index do not contribute to (13.8).
, we know that the vectors { gi (
Since LICQ holds at x
x), hj (
x) |
i I(
x), j = 1, . . . , } are linearly independent. Therefore, we can easily
show that the sequence {k [max{0, gi (xk )}]} must converge to some
limit
i as k + for all i I(
x). Similarly, limk+ k [hj (xk )] =
j , j = 1, . . . , . At last, since k [max{0, gi (xk )}] 0 for all k =
0|I(x )| .
1, 2, . . . , i I(
x) it follows that
Passing to the limit as k + in (13.8) we deduce that
f (
x) +
iI(
x)
i gi (
x) +
X
j=1
j hj (
x) = 0n ,
13.1.2
Penalty methods
located on the boundary of the feasible region, then the method generates
a sequence of interior points that converges to it.
In this section we assume that the feasible set S of the optimization
problem (13.1) has the following form:
S := { x Rn | gi (x) 0,
i = 1, . . . , m }.
(13.9)
For the method to work, we need to assume that there exists a strictly
Rn , that is, such that gi (
feasible point x
x) < 0, i = 1, . . . , m. Thus, in
contrast with the exterior penalty algorithms, we cannot include equality
constraints into the penalty term. While it is possible to extend the
discussion to allow for equality constraints, we prefer to keep the notation
simple and assume that equality constraints are not present.
To formulate a barrier problem, we consider the following approximation of S :
( P
m
i=1 [gi (x)], if gi (x) < 0, i = 1, . . . , m,
S (x)
S (x) :=
+,
otherwise,
(13.10)
and the function : R R+ is a continuous nonnegative function such that (sk ) + for all negative sequences {sk } converging to zero. Typical examples of are 1 (s) := s1 , and 2 (s) :=
log[min{1, s}]. Note that 2 is not differentiable at the point s = 1.
However, dropping the nonnegativity requirement on , the famous differentiable logarithmic barrier function e2 (s) := log(s) gives rise to
the same convergence theory as we are going to present.
Example 13.5 Consider the simple one-dimensional set S := { x R |
x 0 }. Choosing = 1 = s1 , the graph of the barrier function
(13.11)
Constrained optimization
3
10
=1
=0.1
=0.01
10
10
10
10
10
10
0.5
0.5
1.5
2.5
Penalty methods
In other words,
xk stationary in (13.11)
as k + = x
xk x
stationary in (13.1).
LICQ holds at x
Proof. [Sketch] Owing to the optimality conditions (4.14) for unconstrained optimization we know that every point xk , k = 1, 2, . . . , necessarily satisfies the equation
[f (xk ) + k
S (xk )] =
m
X
f (xk ) +
k [gi (xk )]gi (xk ) = 0n .
(13.12)
i=1
is
Because every point xk is strictly feasible in (13.1), the limit x
clearly feasible in (13.1). Let I(
x) denote the index set of active in.
equality constraints at x
If i 6 I(
x) then gi (xk ) < 0 for all large k, and k [gi (xk )] 0 as
k 0.
, we know that the vectors { gi (
Since LICQ holds at x
x) | i
I(
x) } are linearly independent. Therefore, we can easily show that the
sequence {k [gi (xk )]} must converge to some limit
i as k + for
all i I(
x). At last, since k [gi (xk )] 0 for all k = 1, 2, . . . , i I(
x),
0|I(x )| .
it follows that
Passing to the limit as k + in (13.12) we deduce that
X
i gi (
x) = 0n ,
f (
x) +
iI(
x)
that is, x
For example, if we use (s) := 1 (s) := 1/s, then (s) = 1/s2 in
Theorem 13.6, and the sequence {k /gi2 (xk )} converges to the Lagrange
multiplier
i corresponding to the constraint i (i = 1, . . . , m).
13.1.3
Computational considerations
Constrained optimization
relatively large for barriers), and then proceed step after step slightly
modifying the penalty parameter (e.g., multiplying it with some number
close to 1).
It is natural to use the optimal solution xk as a starting point for an
iterative algorithm used to solve the approximating problem corresponding to the next value k+1 of the penalty parameter. The idea behind
such a warm start is that, typically, k k+1 implies xk xk+1 .
In fact, in many cases we can perform only few (maybe, only one)
steps of an iterative algorithm starting at xk to obtain a satisfactory approximation xk+1 of an optimal solution corresponding to the
penalty parameter k+1 , and still preserve the convergence xk x ,
as k +, towards optimal solutions of the original constrained problem (13.1). This technique is especially applicable to convex optimization problems, and all the complexity estimates for interior penalty algorithms depend on this fact.
13.1.4
1 2
(x + x22 ) + 2x2 ,
2 1
(13.13)
subject to x2 = 0.
The problem is convex with affine constraints; therefore, the KKT conditions are both necessary and sufficient for the global optimality. The
KKT system is in this case reduces to: x2 = 0 and
x1
0
0
+
=
.
x2 + 2
1
0
The only solution to this system is x = 02 , = 2.
Let us use the exterior penalty method with quadratic penalty (s) :=
s2 to solve this problem. That is, we want to
minimize
1 2
(x + x22 ) + 2x2 + x22 ,
2 1
Penalty methods
and that
4
= 2 = ,
1 + 2
where is the Lagrange multiplier corresponding to the equality constraint x2 = 0.
lim [(x )2 ] = lim
(13.14)
x22
0,
1) = 0.
2x1
2x1
0
+
=
.
1
0
1 x21 x22 2x2
1 ( 2 + 1)2
1
1
= = ,
= lim
+0 2 2 + 1 2
2
+0
+0
where is the Lagrange multiplier corresponding to the inequality constraint x21 + x22 1 0.
335
Constrained optimization
Example 13.9 (linear programming) Consider an LP problem of the following form:
maximize bT y,
(13.15)
subject to AT y c,
where b, y Rm , c Rn , and A Rmn . Using standard linear
programming theory (see Theorem 10.15), we can write the primaldual
optimality conditions for this problem in the form:
AT y c,
Ax = b,
x 0n ,
(13.16)
xT (c AT y) = 0,
(13.17)
s0 ,
and the corresponding system of optimality conditions will be:
AT y + s = c,
Ax = b,
n
(13.18)
x 0 , s 0 , x s = 0.
Now, let us apply the barrier method to the optimization problem (13.17). It has equality constraints, which we do not move into the
penalty function, but rather leave them as they are. Thus, we consider
the following problem with equality constraints only:
minimize bT y
T
n
X
log(sj ),
j=1
(13.19)
subject to A y + s = c,
where we use the logarithmic barrier function, > 0 is a penalty parameter, and we have multiplied the original objective function with
336
b
m
/s1
A
0
,
(13.20)
..
+
n x =
I
0n
.
/sn
where x Rn is a vector of Lagrange multipliers for the equality constraints in the problem (13.19). Further, the system (13.20) can be
rewritten in the following more convenient form:
AT y + s = c,
(13.21)
Ax = b,
xj sj = ,
j = 1, . . . , n.
Recalling that due to the presence of the barrier the vector s must be
strictly feasible, that is, s > 0n , and that the penalty parameter
is positive, the last equation in (13.21) does in fact imply the strict
inequality x > 0n .
Therefore, comparing (13.21) and (13.18) we see that for linear programs the introduction of a logarithmic barrier amounts to a small perturbation of the complementarity condition. Namely, instead of the requirement
x 0n , s 0n , xj sj = 0, j = 1, . . . , n,
x > 0n , s > 0n , xj sj = ,
j = 1, . . . , n.
For the case n = 1 the difference between the two is shown in Figure 13.3.
Note the smoothing effect on the feasible set introduced by the interior
penalty algorithm. We can use Newtons method to solve the system of
nonlinear equations (13.21), but not (13.18).
13.2
13.2.1
Introduction
(13.22a)
j = 1, . . . , ,
(13.22b)
337
Constrained optimization
= .1
= .01
= .001
=0
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0.1
0.2
0.4
0.6
0.8
Figure 13.3: The approximation of the complementarity constraint resulting from the use of logarithmic barrier functions in linear programming.
j hj (x ) = 0n ,
(13.23a)
L(x , ) := h(x ) = 0 .
(13.23b)
j=1
pk
= L(xk , k ),
vk
that is,
2
x x L(xk , k ) h(xk ) pk
x L(xk , k )
=
,
vk
h(xk )
h(xk )T
0
(13.24)
339
Constrained optimization
In the next section, we will therefore develop a modification of the
above algorithm, which is globally convergent to stationary points. Moreover, we will work also with inequality constraints, which is not immediate to incorporate into the above Newton-like framework.
13.2.2
In order to introduce a penalty function into the discussion, let us consider first the following one:
P (x, ) := kx L(x, )k2 + kh(x)k2 .
(13.26)
(13.27a)
i = 1, . . . , m,
j = 1, . . . , ,
(13.27b)
(13.27c)
S (x) :=
m
X
i=1
X
j=1
|hj (x)|,
(13.28)
(13.29)
where > 0.
Proposition 13.10 (an exact penalty function) Suppose that x satisfies the KKT conditions (5.17) of the problem (13.27), together with
340
|j |, j = 1, . . . , },
X
X
minimize f (x) +
yi +
zj ,
(13.30a)
i=1
subject to
j=1
yi gi (x) and yi 0, i = 1, . . . , m,
(13.30b)
zj hj (x) and zj hj (x), j = 1, . . . , . (13.30c)
Analyzing the KKT conditions for this problem, we can construct multipliers for the problem (13.30) from the multiplier vectors ( , ) and
show that x is a globally optimal solution to it (note the convexity assumptions).
There are similar results also for more general, non-convex, problems
that establish that if x is a (strict) local minimum to (13.27) then it is
also a (strict) local minimum of the exact penalty function.
We must note, however, that the implication is in a somewhat unsatisfactory direction: there may exist local minima of Pe that do not
correspond to constrained local minima in the original problem, for any
value of . The theory is much more satisfactory in the convex case.
We develop a penalty SQP algorithm, known as the MSQP method
(as in Merit SQP, merit function being synonymous with objective function), for solving the general problem (13.27). Given an iterate xk Rn
and a vector (k , k ) Rm
+ R , suppose we choose a positive definite,
n
symmetric matrix B k R
; for example, it can be an approximation
of 2x x L(xk , k , k ). We then solve the following subproblem:
1 T
p B k p + f (xk )T p,
2
subject to gi (xk ) + gi (xk )T p 0,
minimize
(13.31a)
hj (xk ) + hj (xk ) p = 0,
i = 1, . . . , m,
(13.31b)
j = 1, . . . , .
(13.31c)
Note that if we were to utilize B k := 2x x L(xk , k , k ) then the problem (13.31) would be the optimization problem associated with a secondorder approximation of the KKT conditions for the original problem
341
Constrained optimization
(13.27); the close connection to quasi-Newton methods in unconstrained
optimization should be obvious.
We also took the liberty to replace the term x L(xk , k , k )T p by
the term f (xk )T p. This is without any loss of generality, as the KKT
conditions for the problem (13.31) imply that they can be interchanged
without any loss of generalitythe only difference in the two objectives
lies in the constant term which plays no role in the optimization.
A quasi-Newton type method based on the subproblem (13.31) followed by a unit step and a proper update of the matrix B k , as in the
BFGS algorithm, is locally convergent with a superlinear speed, just
as in the unconstrained case. But we are still interested in a globally
convergent version, whence we develop the theory of an algorithm that
utilizes the exact penalty function (13.29) in a line search rather than
taking a unit step. Our first result shows when the subproblem solution
provides a descent direction with respect to this function.
Lemma 13.11 (a descent property) Given xk Rn consider the strictly
convex quadratic problem (13.31), where B k Rnn is symmetric and
positive definite. Suppose that pk solves this problem together with the
multipliers and . Assume that pk 6= 0n . If
maximum {1 , . . . , m , |1 |, . . . , | |}
then the vector pk is a direction of descent with respect to the l1 penalty
function (13.29) at (xk , k , k ).
Proof. Using the KKT conditions of the problem (13.31) we obtain that
f (xk )T p = pT B k p
= pT B k p +
pT B k p +
m
X
i=1
m
X
i gi (xk )T p
i gi (xk ) +
i=1
m
X
i=1
X
j=1
j hj (xk )T p
j hj (xk )
j=1
X
j=1
|j ||hj (xk )|
X
X
pT B k p +
maximum {0, gi (xk )} +
|hj (xk )| .
i=1
j=1
X
j=1
Constrained optimization
Theorem 13.12 (convergence of the MSQP method) The algorithm MSQP either terminates finitely at a KKT point for the problem (13.27)
or it produces an infinite sequence {xk }. In the latter case, we assume
that {xk } lies in a compact set X Rn and that for every x X
and symmetric and positive definite matrix B k the QP (13.31) has a
unique solution, and also unique multiplier vectors and satisfying
max {1 , . . . , m , |1 |, . . . , | |}, where > 0 is the penalty parameter. Furthermore, assume that the sequence {B k } of matrices is bounded
and that every limit point of this sequence is positive definite (or, the
sequence {B 1
k } of matrices is bounded). Then, every limit point of
{xk } is a KKT point for the problem (13.27).
Proof. [Sketch] Clearly, the algorithm stops precisely at KKT points,
so we concentrate on the case where {xk } is an infinite sequence. We
can consider an iteration as a descent step wherein we first construct a
descent direction pk , followed by a line search in the continuous function
Pe , and followed by an update of the matrix B k . By the properties
stated, each of these steps is well defined.
Since the sequence {xk } is bounded, it has an limit point, say x .
Consider from now on this subsequence. By the assumptions stated, also
the sequence {pk } must be bounded. (Why?) Suppose that p is an
limit point of {pk } within the above-mentioned subsequence. Suppose
that it is non-zero. By assumption the sequence {B k } also has limit
points within this subsequence, all of which are positive definite. Suppose B is one such matrix. Then, by Lemma 13.11 the vector p
must define a descent direction for Pe . This contradicts the assumption
that x is an limit point. (Why?) Therefore, it must be the case that
p = 0n , in which case x is stationary, that is, a KKT point. We are
done.
Note that we here have not described any rules for selecting the value
of . Clearly, this is a difficult task, which must be decided upon from
experiments including the results from the above line searches with respect to the merit function Pe . Further, we have no guarantees that the
QP subproblems (13.31) are feasible; in the above theorem we assumed
that the problem is well-defined. Further still, Pe is only continuous and
directionally differentiable, whence we cannot utilize several of the step
length rules devised in Section 11.3. Local superlinear or quadratic convergence of this algorithm can actually be impaired due to the use of this
merit function, as it is possible to construct examples where a unit step
does not reduce its value even very close to an optimal solution. (This
is known as the Maratos effect, after [Mar78].) The Notes Section 13.4
lead to further reading on these issues.
344
13.2.3
subject to g1 (x) :=
2x21
x2 0,
g2 (x) := x1 + 5x2 5 0,
g3 (x) := x1 0,
g4 (x) := x2 0.
(13.32a)
(13.32b)
(13.32c)
(13.32d)
(13.32e)
p1 + 5p2 0,
p1 0,
1 p2 0.
(13.33a)
(13.33b)
(13.33c)
(13.33d)
(13.33e)
7 T
Solving this problem yields the solution p1 = ( 35
31 , 31 ) and the multiT
plier vector 1 (0, 1.032258, 0, 0) .
We next perform a line search in the exact penalty function:
We obtain that 1 0.5835726. (Note that the unconstrained minimum of 7 f (x0 + p1 ) is = 1, which however leads to an infeasible
point having a too high penalty.)
This produces the next iterate, x1 (0.6588722, 0.8682256)T.
345
Constrained optimization
We ask the reader to confirm that this is a near-optimal solution by
checking the KKT conditions , and to confirm that the next QP problem
verifies this.
We were able to find the optimal solution this quickly, due to the
facts that the problem is quadratic and that the value = 10 is large
enough. (Check the value of the Lagrange multipliers.)
13.2.4
We have seen that the SQP algorithm above has an inherent decision
problem, namely to choose the right value of the penalty parameter .
In recent years, there has been a development of algorithms where the
penalty parameter is avoided altogether. We call such methods filterSQP methods.
In such methods we borrow a term from multi-objective optimization,
and say that x1 dominates x2 if (x
1 ) (x
2 ) and f (x1 ) f (x2 )
[where
=
S is our measure of infeasibility], that is, if x1 is at least
as good, both in terms of feasibility and optimality. A filter is a list of
pairs (
i , fi ) such that
i <
j or fi < fj for all j 6= i in the list. By
adding elements to the filter, we build up an efficient frontier, that is,
the Pareto set in the bi-criterion problem of simultaneously finding low
objective values and reduce the infeasibility. The filter is used in place
of the penalty function, when the standard Newton-like step cannot be
computed, for example because the subproblem is infeasible.
This algorithm class is quickly becoming popular, and has already
been found to be among the best general algorithms in nonlinear programming, especially because it does not rely on any parameters that
need to be estimated.
13.3
Quite a few algorithms of the penalty and SQP type exist, of which
only a small number could be summarized here. Which are the relative
strengths and weaknesses of these methods?
First, we may contrast the two types of methods with regards to
their ill-conditioning. The barrier methods of Section 13.1.2 solve a sequence of unconstrained optimization problems that become more and
more ill-conditioned. In contrast, exact penalty methods need not be
ill-conditioned and moreover only one approximate problem is, at least
in principle, enough to solve the original problem. However, it is known
at least for linear and quadratic programming problems that the inherent ill-conditioning of barrier methods can be eliminated (we say the
346
13.4
Constrained optimization
The issue of feasibility of the SQP subproblems is taken up in [Fle87].
The boundedness of the subproblem solution is often ensured by combining SQP with a trust region method (cf. Section 11.7), such that
the QP subproblem is further constrained. The Maratos effect has been
overcome during the last decade of research; cf. [PaT91, Fac95]. An excellent paper which addresses most of the computational issues within an
SQP algorithm and provides a very good compromise in the form of the
SNOPT softare is [GMS05]. Filter-SQP algorithms offer a substantial
development over the standard SQP methods. Good references to this
rapidly developing class of methods are [FLT02, UUV04].
We recommend a visit to the NEOS Server for Optimization at
http://www-neos.mcs.anl.gov/neos/ for a continuously updated list
of optimization solvers, together with an excellent software guide for
several types of classes of optimization models.
13.5
Exercises
Exercise 13.1 (convexity, exterior penalty method) Assume that the problem (13.1) is convex. Show that with the choice (s) := s2 [where enters the
definition of the penalty function via (13.4)], for every > 0 the problem (13.5)
is convex.
Exercise 13.2 (convexity, interior penalty method) Assume that the problem (13.1) is convex. Show that with the choice (s) := log(s) [where
enters the definition of the penalty function via (13.10)], for every > 0 the
problem (13.11) is convex.
Exercise 13.3 (numerical example, exterior penalty method) Consider the problem to
minimize f (x) :=
1 2
x1 + x22 ,
2
subject to x1 = 1.
Apply the exterior penalty method with the standard quadratic penalty function.
Exercise 13.4 (numerical example, logarithmic barrier method) Consider the
problem to
minimize f (x) :=
1 2
x1 + x22 ,
2
subject to x1 1.
Apply the interior penalty method with a logarithmic penalty function on the
constraint.
348
Exercises
Exercise 13.5 (logarithmic barrier, exam 990827) Consider the problem to
1 2
x1 + x22 ,
2
subject to x1 + 2x2 10.
minimize f (x) :=
Attack this problem with a logarithmic barrier method. Describe explicitly the
trajectory the method follows, as a function of the barrier parameter. Confirm
that the limit point of the trajectory solves the problem.
Exercise 13.6 (logarithmic barrier method in linear programming) Consider the
linear programming problem to
minimize f (x) := y1 + y2 ,
subject to
y2 1,
y1 1,
y 02 .
Apply the interior penalty method with a logarithmic penalty function on the
non-negativity restrictions on the slack variables.
Exercise 13.7 (sequential linear programming) Consider the optimization problem to
minimize
f (x),
subject to
gi (x) 0,
hj (x) = 0,
(13.34a)
i = 1, . . . , m,
j = 1, . . . , ,
(13.34b)
(13.34c)
subject to
f (
x)T p,
gi (
x) + gi (x)T p 0,
hj (
x) + hj (x)T p = 0,
i = 1, . . . , m,
j = 1, . . . , .
(P)
x X,
349
Constrained optimization
and
l := infimum l(x),
subject to
(R)
x G.
If X G and l(x) f (x) for all x X we say that (R) is a relaxation of (P);
cf. Section 6.1. Conversely, (P) is then a restrification of (R).
Consider the problem of the form
f := infimum f (x),
subject to gi (x) = 0,
i = 1, . . . , m,
P (y )
= 0,
> 0,
if
if
y = 0m ,
y 6= 0m .
m
X
i gi (x) + P (g(x)),
i=1
where g (x) is the m-vector of gi (x) and where > 0. Show that this problem
is a relaxation of the original one.
[Note: Algorithms based on the relaxation (R)which linearly combines
the Lagrangian and a penalty functionare known as augmented Lagrangian
methods, and the function is known as the augmented Lagrangian function.
They constitute an alternative to exact penalty methods, in that they also
can be made convergent without having to let the penalty parameter tend
to infinity, in this case because of the Lagrangian term; in augmented Lagrangian algorithms the multiplier plays a much more active role than in
SQP methods.]
350
Part VI
Appendix
Answers to the
exercises
subject to
x2 0,
0 y 6.
5
X
(15000yj + 7500xj )
j=1
160y1 50x1
6000
160y2 50x2
7000
160y3 50x3
8000
160y4 50x4
9500
160y5 50x5
11, 500
0.95y1 + x1 =
y2
0.95y2 + x2 =
y3
0.95y3 + x3 =
y4
0.95y4 + x4 =
y5
y1 =
50
yj , xj Z+, j = 1, . . . , 5.
subject to
i, i = 1, . . . , 3: Work place,
k, k = 1, . . . , 2: Connection point,
and variables
z,
(A.1)
i = 1, . . . , 3,
(A.2)
i = 1, . . . , 3,
(A.3)
i = 1, . . . , 3, j = 1, . . . , 3, i 6= j,
t1,k
(A.4)
l
(xi )2 + (yi 0)2 a2i ,
2
i = 1, . . . , 3,
(A.5)
b 2
) a2i ,
2
i = 1, . . . , 3,
(A.6)
i = 1, . . . , 3,
354
i = 1, . . . , 3, k = 1, 2,
(A.7)
(A.8)
i = 1, . . . , 3.
(A.9)
i: Warehouses (i = 1, . . . , 10),
j: Department stores (j = 1, . . . , 30),
and variables:
aij :=
if dij D,
otherwise,
1,
0,
i = 1, . . . , 10, j = 1, . . . , 30.
10
X
ci yi ,
i=1
xij aij yi ,
subject to
30
X
j=1
ej xij ki yi ,
10
X
i = 1, . . . , 10,
j = 1, . . . , 30,
i = 1, . . . , 10,
xij = 1,
j = 1, . . . , 30,
xij 0,
j = 1, . . . , 30,
i=1
yi {0, 1},
i = 1, . . . , 10.
The first constraint makes sure that only warehouses that are built and
which lie sufficiently close to a department store can supply any goods to it.
The second constraint describes the capacity of each warehouse, and the
demand at the various department stores.
The third and fourth constraints describe that the total demand at a department store must be a non-negative (in fact, convex) combination of the
contributions from the different warehouses.
(b) Additional constraints: xij {0, 1} for all i and j.
355
Chapter 3: Convexity
Exercise 3.1 Use the definition of convexity (Definition 3.1).
Exercise 3.2 (a) S is a polyhedron. It is the parallelogram with the corners
a1 + a2 , a1 a2 , a1 + a2 , a1 a2 , that is, S = conv {a1 + a2 , a1 a2 , a1 +
a2 , a1 a2 } which is a polytope and hence a polyhedron.
(b) S is a polyhedron.
(c) S is not a polyhedron. Note that although S is defined as an intersection
of halfspaces it is not a polyhedron, since we need infinitely many halfspaces.
(d) S = {x Rn | 1n x 1n }, that is, a polyhedron.
(e) S is a polyhedron. By squaring both sides of the inequality, it follows
that 2(x0 x1 )T x kx1 k22 kx0 k22 , so S is in fact a halfspace.
(f) S is a polyhedron. Similarly as in e) above it follows that S is the
intersection of the halfspaces
2(x0 xi )T x kxi k22 kx0 k22 ,
i = 1, . . . , k.
A
b
D := A A , d := bA .
I n
0n
P . Now, if
Then P is defined by Dx d. Further, P is nonempty, so let x
x is not an extreme point of P , then the rank of equality subsystem is lower
than n. By using this it is possible to construct an x P such that the
rank of the equality subsystem of x is at least one larger than the rank of the
. If this argument is used repeatedly we end up with
equality subsystem of x
an extreme point of P .
Exercise 3.5 We have that
1
1
= 0.5
0
1
1
+ 0.5
+ 0.5
,
1
0
1
and since (0, 1)T , (1, 0)T Q and (1, 1)T C we are done.
Exercise 3.6 Assume that a1 , a2 , a3 , b R satisfy
x A,
a1 x1 + a2 x2 + a3 x3 b, x B.
a1 x1 + a2 x2 + a3 x3 b,
356
(A.10)
(A.11)
R2 .
f (x, y) =
1
4
(x, y)
2
2
2
1
x
x
+ (3, 1)
.
y
y
357
xT x
(Ax (x) x) = 0n .
358
p with Ap = 0m .
.
We will especially look at the vector p := x x
Next, by assumption, f (
x) < f (x ), which implies that (x x )T Q(x
x ) < 0 holds. We utilize this strict inequality together with the above to last
establish that, for every > 0,
x + (x x )) < f (x),
f (
. We are done.
which contradicts the local optimality of x
(b)
Exercise 4.5 Utilize the variational inequality characterization of the projection operation.
Exercise 4.6 Utilize Proposition 4.23(b) for this special case of feasible
set. We obtain the following necessary conditions for x 0n to be local
minimum:
0 xj
f (x )
0,
xj
j = 1, 2, . . . , n,
where (for real values a and b) a b means the condition that a b = 0 holds.
In other words, if xj > 0 then the partial derivative of f at x with respect
to xj must be zero; conversely, if this partial derivative if non-zero then the
value of xj must be zero. (This is called complementarity.)
Exercise 4.7 By P
a logarithmic transformation, we may instead maximize
the function f (x) = n
j=1 aj log xj . The optimal solution is
aj
xj = Pn
i=1
ai
j = 1, . . . , n.
359
Ax b,
0,
T
A = 0,
T
(Ax b) = 0.
Combining the last two equations we obtain
T x = bT .
ExerciseP 5.4 (a) Clearly,Pthe two problems are equivalent. On the other
m
2
hand, { m
i=1 [hi (x)] } = 2
i=1 hi (x)hi (x) = 0 at every feasible solution.
Therefore, MFCQ is violated at every feasible point of the problem (5.22)
(even though Slaters CQ, LICQ, or at least MFCQ might hold for the original
problem).
(b) The objective function is non-differentiable. Therefore, we rewrite the
problem as
minimize z,
subject to f1 (x) z 0,
f2 (x) z 0,
The problem verifies MFCQ (e.g., the direction (0T , 1)T G(x, z) for all feasible points (xT , z)T . Therefore, the KKT conditions are necessary for local
optimality; these conditions are exactly what we need.
Exercise 5.5 The problem is convex and a CQ is fulfilled, so we need to
find an arbitrary KKT point. The KKT system is as follows:
x + AT = 0,
Ax = b.
360
aTi x bi , and xi = 0, i 6 I.
Thus we (in principle) have reduced the original non-convex problem that violates LICQ to 2n convex problems.
Exercise
c 1.
5.8
(b) = 9/16.
T
(d) x = (4/3, 8/3) .
(e) f = q = 9/4.
Exercise 6.2
361
min
(1 ,2 ) 2
+
1 + 22 ,
,
y0
if 1 + 22 1 and 1 + 2 1/2,
otherwise.
m
n X
X
j=1 i=1
e(1+j +i )
n
X
j=1
bj j
362
m
X
i=1
a i i ,
= x1 = 1, x2 = 2,
= x1 = 1, x2 = 5/2,
= x1 = 3, x2 = 3,
f (3, 3) = 21, so 43/4 f
infeasible,
infeasible,
feasible,
21.
q(1) = 6;
q(2) = 43/4;
q(3) = 9.
Exercise 6.8 (a) The value of the Lagrangian dual function is given by
q() := minimump{1,...,P } {f (xp ) + T g(xp )}, which is the point-wise minimum of P affine functions. Therefore, q is piece-wise linear, with no more than
P pieces; the number is less if for some value(s) of more than one element
xp attains the minimum value of the Lagrangian.
(b)
(c) q() := minimumi{1,...,I} {f (xi ) + T g (xi ), where xi , i {1, . . . , I},
are the extreme points of the polytope X. The number of pieces of the dual
function is bounded by the number I of extreme points of X.
Exercise 6.9 The dual problem is to maximize q() over R+ , where
q() := 5 + minimum (2 )x1 + minimum (1 )x2 .
x1 {0,1,...,4}
x2 {0,1,...,4}
m
X
yi ,
i=1
subject to
y Ax b y ,
1n x 1n .
363
m
X
minimize
yi + t,
i=1
y Ax b y ,
subject to
t1n x t1n .
(v1 )T
B
..
B
.
B
B
k T
B(v )
:= B
1 T
B (w )
B
..
B
.
(w l )T
1
.. C
. C
C
C
1C
C,
1C
C
.. C
. A
1
x :=
a
b
Then from the rank assumption it follows that rank B = n + 1, which means
that x 6= 0n+1 implies that Bx 6= 0k+l . Hence the problem can be solved by
solving the linear program
minimize
(0n+1 )T x,
subject to
Bx 0k+l ,
(1k+l )T Bx = 1.
(b) Let = R2 kxc k22 . Then the problem can be solved by solving the
linear program
minimize
(0n )T xc + 0,
subject to
kv i k22 2(v i )T xc ,
kw
i 2
k2
2(w )
i T
i = 1, . . . , k,
x , i = 1, . . . , l,
c
and compute R as R = + kxc k22 (from the first set of inequalities in the
LP above it follows that + kxc k22 0 so this is well defined).
Exercise 8.3 Since P is bounded there exists no y 6= 0n such that Ay 0m .
Hence there exists no feasible solution to the system
Ay 0m ,
dT y = 1,
which implies that z > 0 in every feasible solution to (8.11).
364
Exercise 8.4 The problem can be transformed into the standard form:
minimize
subject to
z =
x1 5x+
2 +5x2 7x3 +7x3 ,
5x1 2x+
2 +2x2 +6x3 6x3 s1
3x1
+4x+
2
7x1
+3x+
2
x1 ,
x+
2,
4x
2
3x
2
x
2 ,
9x+
3
+5x+
3
= 15,
+9x
3
5x
3
x+
3,
= 9,
+s2 = 23,
x
3 , s1 , s2 0,
where x1 = x1 + 2, x2 = x+
2 x2 , x3 = x3 x3 , and z = z 2.
Exercise 8.6 Assume that the column in the constraint matrix corresponding to the variable x+
j is aj . Then the column in the constraint matrix corresponding to the variable x
j is aj . The statement follows from the definition
of a BFS, since aj and aj are linearly dependent.
Exercise 8.7 Let P be the set of feasible solutions to (8.12) and Q be the set
of feasible solutions to (8.13). Obviously P Q. In order to show that Q P
assume that there exists an x Q such that x
/ P and derive a contradiction.
w = a1 + a2 ,
3x1 2x2 + x3 s1 + a1 = 3,
x1 + x2 2x3 s2 + a2 = 1,
x1 , x2 , x3 , s1 , s2 , a1 , a2 0.
365
3x1 + 2x2 + x3 ,
subject to
2x1 + x3 s1 = 3,
2x1 + 2x2 + x3 = 5,
x1 , x2 , x3 , s1 0.
By solving the phase I problem with the simplex algorithm we get the feasible
basis xB = (x1 , x2 )T . Then by solving the phase II problem with the simplex
algorithm we get the optimal solution x = (x1 , x2 , x3 )T = (0, 1, 3)T .
(b) No, the set of all optimal solution is given by the set
{ x R3 | (0, 1, 3)T + (1 )(0, 0, 5)T ;
[0, 1] }.
Exercise 9.3 The reduced cost for all the variables except for xj must be
greater than or equal to 0. Hence it follows that the current basis is optimal
to the problem that arises if xj is fixed to zero. The assertion then follows
from the fact that the current basis is non-degenerate.
Exercise 9.4
0,
y3 0.
366
b T y 1 + lT y 2 + u T y 3 ,
subject to AT y 1 +I n y 2 +I n y 3 =
,
y2
0n ,
y 3 0n .
maximize
y 1 = 0m ,
y 2 = (max{0, c1 }, . . . , max{0, cn })T ,
y 3 = (min{0, c1 }, . . . , min{0, cn })T .
Ax b,
AT y
,
T x = bT y ,
x 0n ,
y 0m .
T x =
Exercise 10.7 The dual problem only contains two variables and hence can
be solved graphically. We get the optimal solution y = (2, 0)T . The complementary slackness conditions then implies that x1 = x2 = x3 = x5 = 0. Hence,
let xB = (x4 , x6 )T . The optimal solution is x = (x1 , x2 , x3 , x4 , x5 , x6 )T =
(0, 0, 0, 3, 0, 1)T .
Exercise 10.8 From the complementary slackness conditions and the fact
367
cr
,
ar
yj = cj
yj = 0,
cr
aj , j = 1, . . . , r 1,
ar
j = r, . . . , n,
is a dual feasible solution which together with the given primal solution fulfil
the LP primaldual optimality conditions.
Exercise 10.9
Exercise 10.10
Exercise 10.11
Exercise 10.12
Exercise 10.13 The basis
c4 8.
368
10
4
4
.
2
Further,
(Q +
I n )(y xk ) = xk q (Q + I n )xk = (Qxk + q).
369
370
x0
x 2
uniquely solvable:
x1
= 0;
x1 + 2x2 10
2x2
2
=0
x1 + 2x2 10
2
yields that x1 = x2 must hold; the resulting
pquadratic equation 3x1 10x1 =
0 has two roots, of which x1 () = 5/3 + 25/9 + /3 is strictly feasible. As
0, x1 () = x2 () tends to 10/3.
One then shows that x = ( 10
, 10
)T is a KKT point. The constraint is
3
3
371
subject to
f (
x)T p,
gi (
x)T p gi (x),
i = 1, . . . , m,
x)T p = 0,
j = 1, . . . , .
hj (
Letting 0m and R be the dual variable vector for the inequality and
equality constraints, respectively, we obtain the following dual program:
maximize
( , )
subject to
m
X
i gi (
x),
i=1
m
X
i=1
i gi (
x)
X
j=1
j hj (
x) = f (x),
i 0,
i = 1, . . . , m.
LP duality now establishes the result sought: First, suppose that the optimal value of the above primal problem over p is zero. Then, the same is true
for the dual problem. Hence, by the sign conditions i 0 and gi (x) 0, each
term in the sum must be zero. Hence, we established that complementarity
holds. Next, the two constraints in the dual problem are precisely the dual
372
References
[Aba67]
[AMO93]
[Arm66]
[AHU58]
[AHU61]
[Avr76]
[AvG96]
[Ban22]
[Bar71]
[BaG69]
[BSS93]
J. Abadie, On the KuhnTucker theorem, in Nonlinear Programming (NATO Summer School, Menton, 1964), North-Holland,
Amsterdam, 1967, pp. 1936.
R. K. Ahuja, T. L. Magnanti, and J. B. Orlin, Network
Flows: Theory, Algorithms, and Applications, Prentice-Hall, Englewood Cliffs, NJ, 1993.
L. Armijo, Minimization of functions having Lipschitz continuous
first partial derivatives, Pacific Journal of Mathematics, 16 (1966),
pp. 13.
K. J. Arrow, L. Hurwicz, and H. Uzawa, eds., Studies in
Linear and Non-Linear Programming, Stanford University Press,
Stanford, CA, 1958.
K. J. Arrow, L. Hurwicz, and H. Uzawa, Constraint qualifications in maximization problems, Naval Research Logistics Quarterly, 8 (1961), pp. 175191.
M. Avriel, Nonlinear Programming: Analysis and Methods,
Prentice Hall Series in Automatic Computation, Prentice Hall,
Englewood Cliffs, NJ, 1976.
M. Avriel and B. Golany, eds., Mathematical Programming
for Industrial Engineers, vol. 20 of Industrial Engineering, Marcel
Dekker, New York, NY, 1996.
S. Banach, Sur les operations dans les ensembles abstraits et leur
application aux equations integrales, Fundamenta Mathematicae,
3 (1922), pp. 133181.
R. H. Bartels, A stabilization of the simplex method, Numerische
Mathematik, 16 (1971), pp. 414434.
R. H. Bartels and G. H. Golub, The simplex method of linear programming using LU-decomposition, Communications of the
ACM, 12 (1969), pp. 266268 and 275278.
M. S. Bazaraa, H. D. Sherali, and C. M. Shetty, Nonlinear
Programming: Theory and Algorithms, John Wiley & Sons, New
York, NY, second ed., 1993.
References
[Ben62]
[Ber99]
[Ber04]
[BNO03]
[BeT89]
[BeT00]
[Bla77]
[BlO72]
[BGLS03]
[BoS00]
[BoL00]
[BHM77]
[Bre73]
[Bro09]
[Bro12]
[Bro70]
374
J. F. Benders, Partitioning procedures for solving mixed variables programming problems, Numerische Mathematik, 4 (1962),
pp. 238252.
D. P. Bertsekas, Nonlinear Programming, Athena Scientific,
Bellmont, MA, second ed., 1999.
, Lagrange multipliers with optimal sensitivity properties in
constrained optimization, Report LIDS 2632, Department of Electrical Engineering and Computer Science, Massachusetts Institute
of Technology, Cambridge, MA, 2004.
, and A. E. Ozdaglar, Convex
D. P. Bertsekas, A. Nedic
Analysis and Optimization, Athena Scientific, Belmont, MA, 2003.
D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed
Computation: Numerical Methods, Prentice Hall, London, U.K.,
1989.
D. P. Bertsekas and J. N. Tsitsiklis, Gradient convergence in
gradient methods with errors, SIAM Journal on Optimization, 10
(2000), pp. 627642.
R. G. Bland, New finite pivoting rules for the simplex method,
Mathematics of Operations Research, 2 (1977), pp. 103107.
E. Blum and W. Oettli, Direct proof of the existence theorem in
quadratic programming, Operations Research, 20 (1972), pp. 165
167.
J. F. Bonnans, J. C. Gilbert, C. Lemar
echal, and C. A.
bal, Numerical Optimization: Theoretical and PractiSagastiza
cal Aspects, Universitext, Springer-Verlag, Berlin, 2003. Translated from the original French edition, published by SpringerVerlag 1997.
J. F. Bonnans and A. Shapiro, Perturbation Analysis of Optimization Problems, Springer Series in Operations Research,
Springer-Verlag, New York, NY, 2000.
J. M. Borwein and A. S. Lewis, Convex Analysis and Nonlinear
Optimization: Theory and Examples, CMS Books in Mathematics,
Springer-Verlag, New York, NY, 2000.
S. P. Bradley, A. C. Hax, and T. L. Magnanti, Applied
Mathematical Programming, Addison-Wesley, Reading, MA, 1977.
R. P. Brent, Algorithms for Minimization Without Derivatives,
Prentice Hall Series in Automatic Computation, Prentice Hall,
Englewood Cliffs, NJ, 1973. Reprinted by Dover Publications,
Inc., Mineola, NY, 2002.
L. E. J. Brouwer, On continuous vector distributions on surfaces, Amsterdam Proceedings, 11 (1909).
, Uber
Abbildung von Mannigfaltigkeiten, Mathematische
Annalen, 71 (1912), pp. 97115.
C. G. Broyden, The convergence of single-rank quasi-Newton
methods, Mathematics of Computation, 24 (1970), pp. 365382.
References
[BGIS95]
[Car07]
C. Carath
eodory, Uber
den Variabilit
atsbereich der Koeffizienten von Potenzreihen, die gegebene Werte nicht annehmen, Mathematische Annalen, 64 (1907), pp. 95115.
, Uber
den Variabilit
atsbereich der Fourierschen Konstan[Car11]
ten von positiven harmonischen Funktionen, Rendiconti del Circolo Matematico di Palermo, 32 (1911), pp. 193217.
[Casetal02] E. Castillo, A. J. Conejo, P. R. G. Pedregal, and N. Alguacil, Building and Solving Mathematical Programming Models
in Engineering and Science, Pure and Applied Mathematics, John
Wiley & Sons, New York, NY, 2002.
[Cau1847] A. Cauchy, Methode generale pour la resolution des syst`emes
dequations simultanees, Comptes Rendus Hebdomadaires des
Seances de lAcademie des Sciences (Paris), Serie A, 25 (1847),
pp. 536538.
[Cha52]
A. Charnes, Optimality and degeneracy in linear programming,
Econometrica, 20 (1952), pp. 160170.
tal, Linear Programming, Freeman, New York, NY,
[Chv83]
V. Chva
1983.
[CGT00]
A. R. Conn, N. I. M. Gould, and Ph. L. Toint, Trust-Region
Methods, vol. 1 of MPS/SIAM Series on Optimization, SIAM and
Mathematical Programming Society, Philadelphia, PA, 2000.
[Cro36]
H. Cross, Analysis of flow in networks of conduits or conductors, Bulletin 286, Engineering Experiment Station, University of
Illinois, Urbana, IL, 1936.
[Dan51]
G. B. Dantzig, Maximization of a linear function of variables
subject to linear inequalities, in Activity Analysis of Production
and Allocation, Tj. C. Koopmans, ed., New York, NY, 1951, John
Wiley & Sons, pp. 339347.
[Dan53]
, Computational algorithm of the revised simplex method, Report RM 1266, The Rand Corporation, Santa Monica, CA, 1953.
[Dan57]
, Concepts, origins, and use of linear programming, in Proceedings of the First International Conference on Operational Research, Oxford, 1957, M. Davies, R. T. Eddison, and T. Page, eds.,
London, U.K., 1957, The English Universities Press, pp. 100108.
, Linear Programming and Extensions, Princeton University
[Dan63]
Press, Princeton, NJ, 1963.
[DaM05]
G. B. Dantzig and N. T. Mukund, Linear programming 3: Implementation, Springer Series in Operations Research, SpringerVerlag, New York, NY, 2005.
375
References
[DaO53]
[DOW55]
G. B. Dantzig, A. Orden, and P. Wolfe, The generalized simplex method for minimizing a linear form under linear inequality
restraints, Pacific Journal of Mathematics, 5 (1955), pp. 183195.
[DaT97]
G. B. Dantzig and M. N. Thapa, Linear programming 1: Introduction, Springer Series in Operations Research, Springer-Verlag,
New York, NY, 1997.
[DaT03]
[DaW60]
G. B. Dantzig and P. Wolfe, Decomposition principle for linear programs, Operations Research, 8 (1960), pp. 101111.
[dAu47]
[Dav59]
W. C. Davidon, Variable metric method for minimization, Report ANL-5990 Rev, Argonne National Laboratories, Argonne,
IL, 1959. Also published in SIAM Journal on Optimization, 1
(1991), pp. 117.
[DeF49]
B. De Finetti, Sulla stratificazioni convesse, Annali di Matematica Pura ed Applicata, 30 (1949), pp. 173183.
[Den59]
J. B. Dennis, Mathematical Programming and Electrical Networks, John Wiley & Sons, New York, NY, 1959.
[DeS83]
J. E. Dennis and R. E. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Prentice Hall,
Englewood Cliffs, NJ, 1983.
[DiJ79]
[Duf46]
[Duf47]
, Nonlinear networks, IIa, Bulletin of the American Mathematical Society, 53 (1947), pp. 963971.
[DuH78]
J. C. Dunn and S. Harshbarger, Conditional gradient algorithms with open loop step size rules, Journal of Mathematical
Analysis and Applications, 62 (1978), pp. 432444.
[Eav71]
[EHL01]
T. F. Edgar, D. M. Himmelblau, and L. S. Lasdon, Optimization of Chemical Processes, McGraw-Hill, New York, NY,
second ed., 2001.
376
References
[Eke74]
I. Ekeland, On the variational principle, Journal of Mathematical Analysis and Applications, 47 (1974), pp. 324353.
[Erm66]
Yu. M. Ermolev, Methods for solving nonlinear extremal problems, Kibernetika, 2 (1966), pp. 117. In Russian, translated into
English in Cybernetics, 2 (1966), pp. 114.
[Eva70]
J. P. Evans, On constraint qualifications in nonlinear programming, Naval Research Logistics Quarterly, 17 (1970), pp. 281286.
[Eve63]
[Fac95]
[Fal67]
J. Farkas, Uber
die Theorie der einfachen Ungleichungen, Journal f
ur die Reine und Angewandte Mathematik, 124 (1902), pp. 1
24.
[Far1902]
[Fen51]
[Fia83]
[FiM68]
[Fis81]
M. L. Fisher, The Lagrangian relaxation method for solving integer programming problems, Management Science, 27 (1981), pp. 1
18.
[Fis85]
[Fle70]
[Fle87]
[FLT02]
[FlP63]
377
References
[FlR64]
R. Fletcher and C. M. Reeves, Function minimization by conjugate gradients, Computer Journal, 7 (1964), pp. 149154.
[FrW56]
M. Frank and P. Wolfe, An algorithm for quadratic programming, Naval Research Logistics Quarterly, 3 (1956), pp. 95110.
[GKT51]
D. H. Gale, H. W. Kuhn, and A. W. Tucker, Linear programming and the theory of games, in Activity Analysis of Production
and Allocation, Tj. C. Koopmans, ed., New York, NY, 1951, Wiley, pp. 317329.
[GaJ79]
M. R. Garey and D. S. Johnson, Computers and Intractability:
A Guide to the Theory of NP-Completeness, Freeman, New York,
NY, 1979.
[GaA05]
S. I. Gass and A. A. Assad, An Annotated Timeline of Operations Research. An Informal History, vol. 75 of International Series
in Operations Research & Management Science, Kluwer Academic
Publishers, New York, NY, 2005.
[Geo74]
A. M. Geoffrion, Lagrangean relaxation for integer programming: Approaches to integer programming, Mathematical Programming Study, 2 (1974), pp. 82114.
[Gil66]
E. G. Gilbert, An iterative procedure for computing the minimum of a quadratic form on a convex set, SIAM Journal on Control, 4 (1966), pp. 6180.
[GiM73]
P. E. Gill and W. Murray, A numerically stable form of the
simplex algorithm, Linear Algebra and Its Applications, 7 (1973),
pp. 99138.
[GMS05]
P. E. Gill, W. Murray, and M. A. Saunders, SNOPT: An
SQP algorithm for large-scale constrained optimization, SIAM Review, 47 (2005), pp. 99131.
[GMSW89] P. E. Gill, W. Murray, M. A. Saunders, and M. H. Wright,
A practical anti-cycling procedure for linearly constrained optimization, Mathematical Programming, 45 (1989), pp. 437474.
[Gol70]
D. Goldfarb, A family of variable-metric methods derived by
variational means, Mathematics of Computation, 24 (1970),
pp. 2326.
[Gol64]
A. A. Goldstein, Convex programming in Hilbert space, Bulletin
of the American Mathematical Society, 70 (1964), pp. 709710.
[GrD03]
A. Granas and J. Dugundji, Fixed Point Theory, Springer
Monographs in Mathematics, Springer-Verlag, New York, NY,
1969.
[Gri00]
A. Griewank, Evaluating Derivatives: Principles and Techniques
of Algorithmic Differentiation, vol. 19 of Frontiers in Applied
Mathematics, SIAM, Philadelphia, PA, 2000.
[Gui69]
M. Guignard, Generalized KuhnTucker conditions for mathematical programming problems in a Banach space, SIAM Journal
on Control, 7 (1969), pp. 232241.
378
References
[Had10]
[Han75]
[HaH96]
[HLV87]
[HWC74]
[HiL93]
[Hof53]
[Ius03]
[Joh48]
[JoM74]
[Jos03]
[Kar84a]
[Kar84b]
[Kha79]
[Kha80]
379
References
[Kir1847]
[KlM72]
[KLT03]
[Kre78]
[KuT51]
[LaP05]
[LPS96]
[LPS99]
[Las70]
[Law76]
[LRS91]
[LeP66]
[Lip1877]
[Lue84]
380
G. Kirchhoff, Uber
die Ausfl
osung der Gleichungen auf welche
man bei der Untersuchungen der Linearen Vertheilung Galvanisher Str
ome gef
uhrt wird, Pogendorff Annalen Der Physik, 72
(1847), pp. 497508. English translation, IRE Transactions on
Circuit Theory, CT-5 (1958), pp. 48.
V. Klee and G. J. Minty, How good is the simplex algorithm?,
in Inequalities, III. Proceedings of the Third Symposium on Inequalities held at the University of California, Los Angeles, CA,
September 19, 1969; dedicated to the memory of Theodore S.
Motzkin, O. Shisha, ed., New York, NY, 1972, Academic Press,
pp. 159175.
T. G. Kolda, R. M. Lewis, and V. Torczon, Optimization
by direct search: New perspectives on some classical and modern
methods, SIAM Review, 45 (2003), pp. 385482.
E. Kreyszig, Introductory Functional Analysis with Applications,
John Wiley & Sons, New York, NY, 1978.
H. W. Kuhn and A. W. Tucker, Nonlinear programming, in
Proceedings of the Second Berkeley Symposium on Mathematical
Statistics and Probability, 1950, Berkeley and Los Angeles, CA,
1951, University of California Press, pp. 481492.
T. Larsson and M. Patriksson, Global optimality conditions
for discrete and nonconvex optimizationwith applications to Lagrangian heuristics and column generation, technical report, Department of Mathematics, Chalmers University of Technology,
Gothenburg, Sweden, 2005. To appear in Operations Research.
mberg, CondiT. Larsson, M. Patriksson, and A.-B. Stro
tional subgradient optimizationtheory and applications, European Journal of Operational Research, 88 (1996), pp. 382403.
, Ergodic, primal convergence in dual subgradient schemes
for convex programming, Mathematical Programming, 86 (1999),
pp. 283312.
L. S. Lasdon, Optimization Theory for Large Systems, Macmillan, New York, NY, 1970.
E. Lawler, Combinatorial Optimization: Networks and Matroids,
Holt, Rinehart and Winston, New York, NY, 1976.
J. K. Lenstra, A. H. G. Rinnooy Kan, and A. Schrijver,
eds., History of Mathematical Programming. A Collection of Personal Reminiscences, North-Holland, Amsterdam, 1991.
E. S. Levitin and B. T. Polyak, Constrained minimization
methods, USSR Computational Mathematics and Mathematical
Physics, 6 (1966), pp. 150.
R. Lipschitz, Lehrbuch der Analysis, Cohn & Sohn, Leipzig, 1877.
D. G. Luenberger, Linear and Nonlinear Programming, Addison
Wesley, Reading, MA, second ed., 1984. Reprinted by Kluwer
Academic Publishers, Boston, MA, 2003.
References
[Man65]
[Man69]
[Man88]
[MaF67]
O. L. Mangasarian and S. Fromovitz, The Fritz John necessary optimality conditions in the presence of equality and inequality constraints, Journal of Mathematical Analysis and Applications, 17 (1967), pp. 3747.
[Mar78]
N. Maratos, Exact penalty function algorithms for finite dimensional and control optimization problems, PhD thesis, Imperial
College of Science and Technology, University of London, London,
U.K., 1978.
[Max1865] J. C. Maxwell, A dynamical theory of the electromagnetic field,
Philosophical Transactions of the Royal Society of London, 155
(1865), pp. 459512.
[Min10]
[Min11]
[Mot36]
T. Motzkin, Beitr
age zur Theorie del linearen Ungleichungen,
Azriel, Israel, 1936.
K. G. Murty, Linear Programming, John Wiley & Sons, New
York, NY, 1983.
[Mur83]
[Mur95]
[Nas50]
[Nas51]
[NaS96]
[NeW88]
381
References
[Orc54]
[OrR70]
[Pad99]
[PaT91]
[PaS82]
[PaS88]
[PaV91]
[Pat94]
[Pat98]
[PoR69]
[Pol69]
[Pow78]
[PsD78]
[Rad19]
382
H. Rademacher, Uber
partielle und totale Differenzierbarkeit von
References
Funktionen mehrerer Variabeln under u
ber die Transformation der
Doppelintegrale, Mathematische Annalen, 79 (1919), pp. 340359.
[Rar98]
[Roc70]
[RoW97]
R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, vol. 317 of Grundlehren der mathematischen Wissenschaften,
Springer-Verlag, Berlin, 1997.
[Sau72]
[Sch86]
[Sch03]
[Sha70]
D. F. Shanno, Conditioning of quasi-Newton methods for function minimization, Mathematics of Computation, 24 (1970),
pp. 647656.
[She85]
[She76]
[Sho77]
N. Z. Shor, Cut-off method with space extension in convex programming problems, Cybernetics, 13 (1977), pp. 9496.
[Sho85]
[StW70]
J. Stoer and C. Witzgall, Convexity and Optimization in Finite Dimensions I, Springer-Verlag, Berlin, 1970.
[Tah03]
[UUV04]
M. Ulbrich, S. Ulbrich, and L. N. Vicente, A globally convergent primal-dual interior-point filter method for nonlinear programming, Mathematical Programming, 100 (2004), pp. 379410.
[Van01]
R. J. Vanderbei, Linear Programming. Foundations and Extensions, vol. 37 of International Series in Operations Research &
Management Science, Kluwer Academic Publishers, Boston, MA,
second ed., 2001.
383
References
[vHo77]
[vNe28]
[vNe47]
[vNM43]
[Wag75]
[War52]
[Wil99]
[Wil63]
[Wol69]
[Wol75]
[Wol98]
[YuN77]
[Zan69]
384
Index
Index
contractive operator, 101
convergence rate, 296
geometric, 101, 170
linear, 296
quadratic, 296
superlinear, 296
convex analysis, 4172
convex combination, 43
convex function, 57, 96, 159
convex hull, 43
convex programming, 12
convex set, 41
coordinates, 35
CQ, 124
Danskins Theorem, 160
DantzigWolfe algorithm, 158
decision science, 10
decision variable, 6
degenerate basic solution, 215
descent direction, 85, 165
descent lemma, 322
DFP method, 293
Diet problem, 10
differentiability, 163
differentiable function, 38
differentiable optimization, 12
Dijkstras algorithm, 320
diode, 182
direction of unboundedness, 226
directional derivative, 38, 86, 159
distance function, 67
divergent series step length rule, 165,
307
domination, 346
dual feasible basis, 254
dual infeasible basis, 254
dual linear program, 242
dual simplex algorithm, 255
dual simplex method, 254
duality gap, 145
effective domain, 96, 144
efficient frontier, 346
eigenvalue, 36
eigenvector, 36
Ekelands variational principle, 110
386
Index
hard constraint, 18
Hessian matrix, 39
I(x), 89
identity matrix I n , 36
ill-conditioning, 346
implicit function, 13, 40, 296
Implicit Function Theorem, 164
indicator function (S ), 167, 325
inequality constraint, 11
infimum, 14
infinite-dimensional optimization, 13
integer programming, 12, 13
integrable function, 318
integrality property, 13
interior, 38
interior penalty function, 107
interior point algorithm, 238, 330
337
interpolation, 277
iso-cost line, 268
iso-curve, 268
Jacobi method, 105
Jacobian, 39, 300, 318, 339
Karmarkars algorithm, 238
KarushKuhnTucker (KKT) conditions, 125, 133
Kirchhoffs laws, 182
Lagrange function, 142
Lagrange multiplier method, 158,
174
Lagrange multiplier vector, 143, 179
Lagrange multipliers, 122
Lagrangian dual function, 143
Lagrangian dual problem, 143
Lagrangian duality, 141194
Lagrangian relaxation, 18, 142, 143,
185
least-squares data fitting, 267
level curve, 268
level set (levg (b)), 65, 66, 78, 80,
151, 167, 279, 281, 313
LevenbergMarquardt, 273, 301
LICQ, 132
limit, 37
limit points, 37
line search, 275
approximate, 276
Armijo step length rule, 277,
298, 312
Golden section, 277
interpolation, 277
Newtons method, 277
linear convergence rate, 296
linear function, 40
linear independence, 34
linear programming, 11, 13, 154,
197264, 336337
linear programming duality, 241
264
linear space, 34
linear-fractional programming, 223
Lipschitz continuity, 280
local convergence, 339
local minimum, 76
local optimum, 76
necessary conditions, 85, 86,
90, 117, 121, 125, 130
sufficient conditions, 87
logarithmic barrier, 331
logical constraint, 5
lower semi-continuity, 79
Maratos effect, 344
mathematical model, 4
mathematical programming, 9
matrix, 35
matrix game, 105
matrix inverse, 36
matrix norm, 35
matrix product, 35
matrix transpose, 35
max function, 160
mean-value theorem, 39
merit function, 341
method of successive averages (MSA),
321
MFCQ, 131
minimax theorem, 105
minimum, 14
387
Index
minimum distance (distS ), 167
multi-objective optimization, 13, 346
near-optimality, 95
negative curvature, 270
neighbourhood, 38
Newtons method, 272, 277, 299
NewtonRaphson method, 105, 272
Nobel laureates, 10
non-basic variables, 215
non-convex programming, 12
non-coordinability, 176
non-differentiable function, 283
non-differentiable optimization, 12
non-expansive operator, 99
nonlinear programming, 11, 13
nonsingular matrix, 36
norm, 34
normal cone (NX ), 94, 126
NP-hard problem, 77, 186
objective function, 4
Ohms law, 183
open ball, 37
open set, 37
operations research, 9
optimal BFS, 226
optimal solution, 5
optimal value, 5
optimality, 9
optimality conditions, 8488, 90
94, 111140, 147149, 175,
227, 228, 252253
optimization under uncertainty, 13
optimize, 3
orthogonality, 34, 148
orthonormal basis, 35
parametric optimization, 137
Pareto set, 346
partial pricing, 230
pattern search methods, 298
penalty, 19
penalty function, 19
penalty parameter, 326
perturbation function (p(u)), 178
Phase I, 315
388
Q-orthogonal, 286
quadratic convergence rate, 296
quadratic function, 40, 64, 77
quadratic programming, 77, 156,
315
quasi-convex function, 109
quasi-Newton methods, 273, 293,
342
Rademachers Theorem, 283
rank-two update, 293
recession cone, 81, 82
reduced cost, 227
redundant constraint, 107
relaxation, 19, 141, 142, 185, 328
Relaxation Theorem, 141
Representation Theorem, 50, 218,
308
resistor, 182
restricted master problem, 308
restricted simplicial decomposition,
309
restrification, 350
Index
revised simplex method, 238
saddle point, 105, 147, 270
scalar product, 34
secant method, 274
sensitivity analysis, 137, 178, 179
sensitivity analysis for LP, 257
separation of convex sets, 98
Separation Theorem, 52, 98, 163
sequential linear programming (SLP),
349
sequential quadratic programming
(SQP), 337346
set covering problem, 18
shadow price, 249
shortest route, 320
simplex method, 10, 225240
simplicial decomposition algorithm,
308
slack variable, 7
Slater CQ, 131
SLP algorithm, 349
soft constraint, 18, 177
spectral theorem, 136
SQP algorithm, 337350
square matrix, 36
stalling, 239
standard basis, 35
stationary point, 16, 85, 91
steepest descent, 269
steepest-edge rule, 230
stochastic programming, 13
strict inequality, 81
strict local minimum, 76
strictly convex function, 58
strictly quasi-convex function, 277
strong duality, 149
Strong Duality Theorem, 149, 152
154, 156, 248
subdifferentiability, 162
subdifferential, 158
subgradient, 158, 284
subgradient optimization, 166
subgradient projection method, 165
superlinear convergence rate, 296
symmetric matrix, 36
389
Row
Reads
Should read
2
13
11
17
20
21
Figure 6.4
14
Figure 11.2(b)
Exercise 10.5
y 0m