
Milan Hladík

Discrete and Continuous Optimization


textbook

May 17, 2022


This is a textbook for the course Discrete and Continuous Optimization in the Computer Science program
at Charles University, Faculty of Mathematics and Physics. In particular, it serves for the continuous
optimization part.
I build mostly on the books by Bazaraa et al. [2006], Boyd and Vandenberghe [2004], and Luenberger
and Ye [2008].
You can report bugs or send your comments to: [email protected].

Contents

1 Introduction
1.1 Motivation examples
1.2 Continuous optimization: First steps
1.3 Linear regression

2 Unconstrained optimization

3 Convexity
3.1 Convex sets
3.2 Convex functions
3.3 The first and second order characterization of convex functions
3.4 Other rules for detecting convexity of a function

4 Convex optimization
4.1 Basic properties
4.2 Quadratic programming
4.3 Convex cone programming
4.3.1 Duality in convex cone programming
4.3.2 Second order cone programming
4.3.3 Semidefinite programming
4.4 Computational complexity
4.4.1 Good news – the ellipsoid method
4.4.2 Bad news – copositive programming
4.5 Applications
4.5.1 Robust PCA
4.5.2 Minimum volume enclosing ellipsoid

5 Karush–Kuhn–Tucker optimality conditions

6 Methods
6.1 Line search
6.2 Unconstrained problems
6.3 Constrained problems
6.3.1 Methods of feasible directions
6.3.2 Active-set methods
6.3.3 Penalty and barrier methods
6.4 Conjugate gradient method

7 Selected topics
7.1 Robust optimization
7.2 Concave programming

Appendix

Notation

Bibliography
Chapter 1

Introduction

An optimization problem (or a mathematical programming problem) reads


min f (x) subject to x ∈ M,
where f : Rn → R is the objective function and M ⊆ Rn is the feasible set. In general, this problem is
undecidable, that is, there is provably no algorithm that can solve it [Zhu, 2006]; another well-known
undecidable problem is the halting problem. On the other hand, there are effectively solvable sub-classes
of problems.
Depending on the character of the feasible set M , we distinguish two types:
• Discrete optimization.
The set M is (typically) finite, but usually too large to inspect and process all feasible solutions.
Usually |M| ≥ 2ⁿ.
Examples include the shortest path problem, the minimum spanning tree problem or the minimum
matching problem in a graph. These problems are effectively solvable. In contrast, some problems
in discrete optimization are NP-hard: integer linear programming, the travelling salesman problem,
the knapsack problem or finding the max cut in a graph.
• Continuous optimization.
Here, the feasible set M is uncountably infinite. Surprisingly, this may pay off: linear programming
is polynomially solvable, but the additional integrality requirement makes it NP-hard.
The typical problems are linear programming (LP) and diverse kinds of nonlinear programming
(such as convex programming, quadratic programming or semidefinite programming).

The relation between discrete and continuous optimization


Discrete and continuous optimization are not disjoint. In fact, they are closely related, and techniques from
one area are used in the other. To see this, consider integer programming: most methods are based on a
relaxation to a continuous problem and an iterative improvement.
Conversely, an integer condition can easily be reduced to a continuous one. For example, the condition
x ∈ {0, 1} is equivalent to x = x² (in practice, however, this is not used).
We will illustrate the relation between the two areas on the example of flows in networks.
Example 1.1 (Flows in networks). Consider a directed graph G = (V, E), where s ∈ V is the source
vertex and t ∈ V the terminal vertex. Each edge has a capacity, which is represented by a function
u : E → R+. The objective is to find a maximum flow from s to t. The flow coming into any intermediate
vertex needs to equal the flow going out of it (flow in = flow out, called the conservation law).
Including an artificial edge (t, s) in the graph, the maximum flow problem is then equivalently formulated
as finding the maximum flow through the additional edge

max xts subject to Σ_{j:(i,j)∈E} xij − Σ_{j:(j,i)∈E} xji = 0, ∀i ∈ V,
0 ≤ xij ≤ uij, ∀(i, j) ∈ E,


which is an integer linear programming problem. Denoting by A the incidence matrix of the graph G, the
problem has the compact form

max xts subject to Ax = 0, 0 ≤ x ≤ u.
The best known algorithms utilize the discrete nature of the problem. On the other hand, the LP formulation
is beneficial, too. Since the matrix A is totally unimodular, the resulting optimal solution is automatically
integral, provided the capacities are integral. Hence the problem is efficiently solvable by means of linear
programming, despite the integrality conditions.
Another advantage of the LP formulation is that we can easily modify it to different variants of the
problem. Consider for example the problem of finding a minimum-cost flow. Denote by cij the cost of
sending a unit of flow along the edge (i, j) ∈ E and by d > 0 the minimum required flow. Then the
problem reads as the LP problem

min Σ_{(i,j)∈E} cij xij subject to Ax = 0, 0 ≤ x ≤ u, xts ≥ d.
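These formulations translate directly into code. The following is a minimal sketch, assuming SciPy is available, that solves the maximum flow LP with scipy.optimize.linprog; the tiny graph and its capacities are made-up illustration data.

import numpy as np
from scipy.optimize import linprog

# Tiny digraph with s = 0, t = 3; the last edge is the artificial edge (t, s).
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 0)]
u = [3, 2, 1, 2, 3, np.inf]  # capacities; the artificial edge is uncapacitated

# Incidence matrix A: A[i, e] = +1 if edge e leaves vertex i, -1 if it enters it.
A = np.zeros((4, len(edges)))
for e, (i, j) in enumerate(edges):
    A[i, e] += 1.0
    A[j, e] -= 1.0

c = np.zeros(len(edges))
c[-1] = -1.0  # maximize x_ts, i.e., minimize -x_ts

res = linprog(c, A_eq=A, b_eq=np.zeros(4), bounds=[(0, cap) for cap in u])
print(res.x[-1])  # maximum flow value; 5.0 for this data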

1.1 Motivation examples


Example 1.2 (Theoretical: eigenvalues). Let A ∈ Rn×n be a symmetric matrix and λ1 ≥ · · · ≥ λn its
(real) eigenvalues. Consider the unit ball B in the space Rn, defined as B := {x ∈ Rn; ‖x‖2 ≤ 1}.
The maximal eigenvalue λ1 is attained as the maximal value of the quadratic form xᵀAx on the ball B, and
similarly the minimal eigenvalue λn is attained as the minimal value of the quadratic form xᵀAx on B.
Formally:

λ1 = max_{x:‖x‖2≤1} xᵀAx,   λn = min_{x:‖x‖2≤1} xᵀAx.

This is a statement of the Rayleigh–Ritz theorem. Let us prove it for λ1:

Inequality “≤”: Let x1 be an eigenvector corresponding to λ1 and normalized such that ‖x1‖2 = 1.
Then Ax1 = λ1 x1. Multiplying by x1ᵀ from the left yields

λ1 = λ1 x1ᵀx1 = x1ᵀAx1 ≤ max_{x:‖x‖2=1} xᵀAx.

Inequality “≥”: Let x ∈ Rn be an arbitrary vector such that ‖x‖2 = 1. Let A = QΛQᵀ be a spectral
decomposition of the matrix A. Denoting y := Qᵀx, we have ‖y‖2 = 1 and

xᵀAx = xᵀQΛQᵀx = yᵀΛy = Σ_{i=1}^n λi yi² ≤ λ1 Σ_{i=1}^n yi² = λ1 ‖y‖2² = λ1.
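This characterization is easy to check numerically. A small sketch, assuming NumPy, compares λ1 with the quadratic form sampled over random unit vectors:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A = (A + A.T) / 2                      # symmetrize

lam1 = np.linalg.eigvalsh(A).max()     # the exact largest eigenvalue

X = rng.standard_normal((100000, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)       # random unit vectors
sampled = np.einsum('ij,jk,ik->i', X, A, X).max()   # max of x^T A x over samples

print(lam1, sampled)  # the sampled maximum approaches lam1 from below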

Example 1.3 (Functional optimization). In principle, the number of variables need not be finite. For
example, in a functional problem, we want to find a function satisfying certain constraints and minimizing
a specified criterion. For illustration, imagine a problem of computing the best trajectory for a spacecraft
traveling from Earth to Mercury; the variable here is the curve of the trajectory described by a function,
and the objective is to minimize travel time. Certain simple functional problems can be solved analytically,
but in general they are solved by discretization of the unknown function and then application of classical
optimization methods.
Isoperimetric problems belong to this area, too. It is well known that the ball has the smallest surface
area of all shapes that enclose a given volume. But what happens when two volumes are given and we wish to
minimize the surface area (including the separating surface)? This problem is known as the double bubble
problem, and it was not solved until Hutchings et al. [2002]. The minimum area shape consists of two
spherical surfaces meeting at angles of 120° = (2/3)π. The separating surface is also spherical; it is a
disc in the case of two equally sized volumes. See the illustration in Figure 1.1.
Example 1.4 (When nature optimizes). Snell's law quantifies the bending of light as it passes through
a boundary between two media. The less dense the medium, the faster light travels. The trajectory of light
is such that it is traversed in the least time (the so-called Fermat's principle of least time). See the
illustration in Figure 1.2.

Figure 1.1: (Example 1.3) The double bubble problem.

Figure 1.2: (Example 1.4) Snell’s law and Fermat’s principle.

1.2 Continuous optimization: First steps


Local and global minima
A solution can be categorized in several types; see Figure 1.3. A point x∗ ∈ M is called

• a (global) minimum if f(x∗) ≤ f(x) for every x ∈ M,

• a strict (global) minimum if f(x∗) < f(x) for every x ∈ M, x ≠ x∗,

• a local minimum if f(x∗) ≤ f(x) for every x ∈ M ∩ Oε(x∗) for some ε > 0,

• a strict local minimum if f(x∗) < f(x) for every x ∈ M ∩ Oε(x∗), x ≠ x∗, for some ε > 0.

Naturally, to solve a problem min_{x∈M} f(x) means to find its minimum, called the optimal solution.
However, sometimes the problem is so hard that we are content with an approximate solution instead.
Beware that the minimal value of a function f(x) on a set M need not be attained. Consider for example
the problem min_{x∈R} x, which is unbounded from below, or the problem min_{x∈R} eˣ, which is bounded from
below but whose infimum 0 is not attained. A sufficient condition for the existence of a minimum is given
by the Weierstrass theorem.

Theorem 1.5 (Weierstrass). If f(x) is continuous and M compact, then f(x) attains a minimum on M.

Another problem appears when local minima exist. The basic methods for solving optimization problems
are iterative. They start at an initial point and move in a decreasing direction of the objective
function. When they approach a local minimum, they may get stuck and have to overcome this problem.

Figure 1.3: Local and global minima.

This phenomenon does not occur in linear programming, or more generally in convex optimization, since
each local minimum is a global one (see Theorem 4.2).
Notice that the concept of a local minimum can be used in discrete optimization, too. For instance, in
the minimum spanning tree problem we can define a local neighbourhood as the set of all spanning trees
obtained by replacing just one edge.

Classification
The feasible set M is often defined by a system of equations and inequalities

gj (x) ≤ 0, j = 1, . . . , J,
hℓ (x) = 0, ℓ = 1, . . . , L,

where gj (x), hℓ (x) : Rn → R. We will employ a short form

g(x) ≤ 0, h(x) = 0,

where g : Rn → RJ and h : Rn → RL . Depending on the type of the objective function and the feasible
set, we classify the optimization problems as follows:

• Linear programming. Functions f(x), gj(x), hℓ(x) are linear. We assume that the reader has a basic
background in linear programming.

• Unconstrained optimization. Here M = Rn .

• Convex optimization. Functions f (x), gj (x) are convex and hℓ (x) are linear.

Basic transformations
If one wants to find a maximum of f (x) on set M , then the problem is easily reduced to the minimization
problem
max_{x∈M} f(x) = − min_{x∈M} (−f(x)).

An equation constraint can be reduced to inequalities since h(x) = 0 is equivalent to h(x) ≤ 0, h(x) ≥ 0,
but this is not recommended in view of numerical issues.

Transformations of functions. The optimization problem

min f (x) subject to g(x) ≤ 0, h(x) = 0



can be transformed to
min ϕ(f (x)) subject to ψ(g(x)) ≤ 0, η(h(x)) = 0,
provided
• ϕ(z) is increasing on its domain, e.g., z^k, z^{1/k}, log(z);
• ψ(z) preserves nonnegativity, i.e., z ≤ 0 ⇔ ψ(z) ≤ 0, e.g., z³;
• η(z) preserves roots, i.e., z = 0 ⇔ η(z) = 0, e.g., z².
Both optimization problems then possess the same minima. The optimal values are different, but they can
be easily computed from the optimal solutions.
Example 1.6 (Geometric programming). The transformation turns out to be very convenient in geometric
programming, for instance. To illustrate it, consider the particular example

min x²y subject to 5xy³ ≤ 1, 7x⁻³y ≤ 1, x, y > 0.

Taking logarithms of the objective function and of both sides of the constraints yields

min 2 log(x) + log(y) subject to log(5) + log(x) + 3 log(y) ≤ 0, log(7) − 3 log(x) + log(y) ≤ 0.

The substitution x′ := log(x), y′ := log(y) then leads to the LP problem

min 2x′ + y′ subject to log(5) + x′ + 3y′ ≤ 0, log(7) − 3x′ + y′ ≤ 0.

Moving the objective function to the constraints. A frequently used transformation is to move
the objective function to the constraints, that is, the problem min_{x∈M} f(x) is transformed to

min z subject to f(x) ≤ z, x ∈ M.

The objective function is now linear, and all possible obstacles are hidden in the constraints.
Example 1.7 (a finite minimax). Consider the problem
min_{x∈M} max_{i=1,...,s} fi(x).

The problems of type min–max are very hard in general. However, in our situation, the outer objective
function is the maximum on a finite set. The problem thus can be written as
min z subject to fi (x) ≤ z, i = 1, . . . , s, x ∈ M.
In the original formulation, the outer objective function maxi=1,...,s fi (x) is nonsmooth. After the trans-
formation, the objective function is linear.
Surprisingly, the converse transformation can sometimes be convenient as well. Moving the constraints
into the objective function is addressed in Section 6.3.3.

Elimination of equations and variables. Consider the problem


min f (x) subject to g(x) ≤ 0, Ax = b,
where A ∈ Rm×n has full row rank. First, we solve the system of equations Ax = b. Suppose that the
solution set is not empty, so it has the form of x0 + Ker(A), where x0 is one (arbitrarily chosen) solution
and Ker(A) is the kernel of A. Construct matrix B ∈ Rn×(n−m) such that its columns form a basis of
Ker(A). Then any solution of Ax = b can be expressed as x = x0 + Bz, where z ∈ R^{n−m}. Substituting for
x results in the optimization problem

min f(x0 + Bz) subject to g(x0 + Bz) ≤ 0.
This approach eliminates the equations and reduces the dimension of the problem (i.e., the number of
variables) by m.
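A minimal numerical sketch of this elimination step, assuming NumPy/SciPy; the matrix A, the right-hand side b and the chosen objective are made-up illustration data:

import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, 0.0]])           # full row rank: m = 2, n = 3
b = np.array([1.0, 0.0])

x0 = np.linalg.lstsq(A, b, rcond=None)[0]  # one particular solution of Ax = b
B = null_space(A)                          # columns form a basis of Ker(A)

# Any feasible x is x0 + B z with z in R^{n-m}; e.g., minimize f(x) = ||x||^2:
z = np.linalg.lstsq(B, -x0, rcond=None)[0]
x = x0 + B @ z
print(np.allclose(A @ x, b))               # True: x is still feasible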

Figure 1.4: Linear regression.

1.3 Linear regression


The problem of linear regression is to find a linear dependence in data (a1, b1), . . . , (am, bm) ∈ Rn+1; see
Figure 1.4. Linear regression is widely used in many disciplines, including economics, biology and computer
science. In pattern recognition, for example, one wants a computer system to predict and make decisions
autonomously (e.g., spam filtering, book and movie recommendations, face recognition, credit card
fraud detection). Of course, the true dependence need not be linear and there exist models for nonlinear
regression; we focus on the linear case only.
Let the matrix A ∈ Rm×n consist of rows a1, . . . , am. Then the goal is to find a vector x ∈ Rn such
that Ax ≈ b. Since usually m ≫ n, the system of linear equations Ax = b is overdetermined and has no
solution. Therefore we seek an approximate solution.
Mathematically, we can model the problem as an optimization problem to find x ∈ Rn such that the
difference between the left and the right hand side is minimal in a certain norm:

min_{x∈Rn} ‖Ax − b‖.

The geometric interpretation of this problem is to find the projection of the vector b ∈ Rm onto the column
space S(A) of the matrix A. The typical choices are the following norms:
• Euclidean norm. The problem then reads min_{x∈Rn} ‖Ax − b‖2² = min_{x∈Rn} Σ_{i=1}^m (Ai∗x − bi)², that
is, it is the ordinary least squares problem. If the matrix A has full column rank, then the solution is
unique and has the form x∗ = (AᵀA)⁻¹Aᵀb. This approach is also justified by statistics: Suppose
that the dependence is really linear and the entries of the right-hand side vector b are affected by
independent and normally distributed errors. Then x∗ is the best linear unbiased estimator and also
the maximum likelihood estimator.

• Manhattan norm. The problem min_{x∈Rn} ‖Ax − b‖1 can be expressed as the linear program

min eᵀz subject to −z ≤ Ax − b ≤ z, z ∈ Rm, x ∈ Rn.

This case also has a statistical interpretation. The optimal solution produces the maximum likelihood
estimator as long as the noise follows the Laplace distribution. (A numerical sketch of this LP
reformulation follows the list.)

• Maximum norm. The problem min_{x∈Rn} ‖Ax − b‖∞ is also equivalent to an LP problem

min z subject to −ze ≤ Ax − b ≤ ze, z ∈ R, x ∈ Rn.
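The sketch below, assuming NumPy/SciPy and synthetic data, fits a line in the Euclidean norm and, via the LP reformulation above, in the Manhattan norm:

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
m, n = 50, 2
A = np.column_stack([rng.uniform(0, 1, m), np.ones(m)])  # columns: slope, intercept
b = 3 * A[:, 0] + 1 + rng.normal(0, 0.1, m)

x_ls = np.linalg.lstsq(A, b, rcond=None)[0]               # least squares fit

# l1 fit: min e^T z subject to -z <= Ax - b <= z, with variables (x, z).
c = np.concatenate([np.zeros(n), np.ones(m)])
A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
b_ub = np.concatenate([b, -b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * n + [(0, None)] * m)
x_l1 = res.x[:n]

print(x_ls, x_l1)  # both estimates are close to the true (3, 1)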

Outliers. An outlier is an observation that differs significantly from the others; see Figure 1.5. Usually
it is caused by some experimental error. An outlier spoils the linear tendency in the data and the resulting
estimator can be distorted. The Manhattan norm is less sensitive to outliers than the other norms, but
outliers can still cause problems.

Figure 1.5: Linear regression with an outlier.

If we expect or estimate that there are k ≪ m outliers in the data, then we can solve the linear regression
problem as

min ‖AI x − bI‖ subject to x ∈ Rn, I ⊆ {1, . . . , m}, |I| ≥ m − k,

where AI, bI denote the submatrices with rows indexed by I. Nevertheless, this is a hard combinatorial
optimization problem.

Cardinality. The cardinality of a vector x ∈ Rn is the number of its nonzero entries; it is denoted by

‖x‖0 = |{i; xi ≠ 0}|.

This notation resembles the vector ℓp-norm ‖x‖p = (Σ_{i=1}^n |xi|^p)^{1/p}. Indeed, the cardinality is
obtained by the limit transition, neglecting the p-th roots:

‖x‖0 = lim_{p→0+} Σ_{i=1}^n |xi|^p.

However, ‖x‖0 is not a vector norm.


In regression, we usually aim to explain b using a small number of variables (or regressors), that is,
we also want to minimize ‖x‖0. We join both criteria by a weighted sum, resulting in the formulation

min ‖Ax − b‖2 + γ‖x‖0,

where γ > 0 is a suitably chosen constant. Again, this is a hard combinatorial problem. That is why ‖x‖0
is approximated by the Manhattan norm (in some sense, it is the best approximation). As a consequence,
we get an effectively solvable optimization problem

min ‖Ax − b‖2 + γ‖x‖1.
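This convex surrogate can be minimized by a simple proximal-gradient iteration. The following is a from-scratch NumPy sketch for the squared variant min ‖Ax − b‖2² + γ‖x‖1, a common formulation chosen here as an assumption for simplicity:

import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrinks each entry towards zero.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, gamma, iters=500):
    # Step size 1/L, where L = 2 * lambda_max(A^T A) bounds the gradient's
    # Lipschitz constant.
    L = 2 * np.linalg.eigvalsh(A.T @ A).max()
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = 2 * A.T @ (A @ x - b)   # gradient of ||Ax - b||_2^2
        x = soft_threshold(x - grad / L, gamma / L)
    return x  # a larger gamma yields a sparser x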

Example 1.8 (Signal reconstruction). Consider the problem of signal reconstruction. Let a vector
x̃ ∈ Rn represent the unknown signal, and let y = x̃ + err represent the observed noisy signal. We want to
smooth the noisy signal and find a good approximation of x̃. To this end, we seek a vector x ∈ Rn
that is close to y and that is also smooth, i.e., without big oscillations.
This idea leads to the multi-objective optimization problem

min_{x∈Rn} ‖x − y‖2, |xi+1 − xi| ∀i.

A single-objective scalarization is obtained by a weighted sum of the objectives

min_{x∈Rn} ‖x − y‖2 + γ Σ_{i=1}^{n−1} |xi+1 − xi|,

where γ > 0 is a parameter. A smaller value of γ prioritizes the first objective and so the resulting signal is
closer to the observed signal, while a larger γ penalizes oscillations and produces more smoothed signals.
Denote by D ∈ R(n−1)×n the difference matrix with entries Dii = 1, Di,i+1 = −1 and zeros elsewhere.
Then the problem reads

min_{x∈Rn} ‖x − y‖2 + γ‖Dx‖1.

This can again be viewed as an approximation of a cardinality problem

min_{x∈Rn} ‖x − y‖2 + γ‖Dx‖0,

in which we aim to find a signal approximation in the form of a piecewise constant function. This approach
is called total variation reconstruction, and it is used when processing digital signals.
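A small sketch of both smoothing variants using CVXPY, assuming the cvxpy package is installed; the test signal is synthetic, and the squared data-fidelity term is an assumption made for convenience:

import numpy as np
import cvxpy as cp

n = 200
rng = np.random.default_rng(4)
y = np.sign(np.sin(np.linspace(0, 6, n))) + 0.2 * rng.standard_normal(n)

# Difference matrix D: D[i, i] = 1, D[i, i+1] = -1.
D = np.eye(n - 1, n) - np.eye(n - 1, n, k=1)
gamma = 2.0

x = cp.Variable(n)
# Total variation reconstruction: the l1 penalty preserves sharp transitions.
cp.Problem(cp.Minimize(cp.sum_squares(x - y) + gamma * cp.norm1(D @ x))).solve()
x_tv = x.value

# Quadratic smoothing for comparison: the squared penalty blurs the jumps.
cp.Problem(cp.Minimize(cp.sum_squares(x - y) + gamma * cp.sum_squares(D @ x))).solve()
x_quad = x.value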
A comparison by pictures is presented in Figure 1.6, originating from the website

http://stanford.edu/class/ee364a/lectures/approx.pdf

A similar approach is used for image analysis and processing, e.g., for deblurring of blurred images,
reconstruction of damaged images, etc. See website

http://www.imm.dtu.dk/~pcha/mxTV/

[Figure 1.6 here: two panels of plots. In each panel, the left column shows the original signal x and the
noisy signal xcor; the right column shows three reconstructions x̂ along the trade-off curve of
‖x̂ − xcor‖2 versus the smoothing penalty. Quadratic smoothing smooths out noise as well as the sharp
transitions in the signal; total variation smoothing preserves the sharp transitions.]

Figure 1.6: Example 1.8: In both pictures on the left-hand side, there is the original signal and beneath
it is the noisy signal. On the right-hand side, there are reconstructed signals with decreasing values of γ.
The top picture employs quadratic smoothing (i.e., kDxk2 instead of kDxk1 ), while the bottom picture
uses the total variation reconstruction, which better approximates the digital signal.
Chapter 2

Unconstrained optimization

An unconstrained optimization problem reads

min f (x) subject to x ∈ Rn .

The objective function f : Rn → R is either general or we impose some differentiability assumptions later
on. First we present the well-known first order necessary optimality condition.

Theorem 2.1 (First order necessary optimality condition). Let f (x) be differentiable and let x∗ ∈ Rn be
a local extremal point. Then ∇f (x∗ ) = o.

Proof. Without loss of generality assume that x∗ is a local minimum. Recall that for any i = 1, . . . , n

∇i f(x∗) = ∂f(x∗)/∂xi = lim_{h→0} [f(x∗1, . . . , x∗i−1, x∗i + h, x∗i+1, . . . , x∗n) − f(x∗)] / h.

The limit must be the same whether taken from the left or from the right. In the first case, since x∗ is a
local minimum, the numerator is nonnegative for small h > 0, so

∇i f(x∗) = lim_{h→0+} [f(x∗1, . . . , x∗i−1, x∗i + h, x∗i+1, . . . , x∗n) − f(x∗)] / h ≥ 0,

and in the second case analogously ∇i f(x∗) ≤ 0. Therefore ∇i f(x∗) = 0.

Obviously, the above condition is only a necessary condition for optimality since it cannot distinguish
between minima, maxima and inflection points; see Figure 2.1. A point with zero gradient is called a
stationary point.
We mention two second order optimality conditions, one is a necessary condition and one is a sufficient
condition.

Theorem 2.2 (Second order necessary optimality condition). Let f (x) be twice continuously differentiable
and let x∗ ∈ M be a local minimum. Then the Hessian matrix ∇2 f (x∗ ) is positive semidefinite.

Proof. The continuity of the second partial derivatives implies that for every λ ∈ R and y ∈ Rn there is
θ ∈ (0, 1) such that

f(x∗ + λy) = f(x∗) + λ∇f(x∗)ᵀy + (1/2)λ² yᵀ∇²f(x∗ + θλy)y. (2.1)
In other words, this is Taylor’s expansion with Lagrange remainder. Due to minimality of x∗ we have
f (x∗ + λy) ≥ f (x∗ ), and from Theorem 2.1 we have ∇f (x∗ ) = o. Hence

λ2 y T ∇2 f (x∗ + θλy)y ≥ 0.

By the limit transition λ → 0 we get y T ∇2 f (x∗ )y ≥ 0.

Theorem 2.3 (Second order sufficient optimality condition). Let f (x) be twice continuously differentiable.
If ∇f (x∗ ) = o and ∇2 f (x∗ ) is positive definite for a certain x∗ ∈ M , then x∗ is a strict local minimum.


Figure 2.1: Stationary points of a function f(x) include local minima, local maxima and inflection points.

Proof. We proceed similarly as in the proof of Theorem 2.2. In equation (2.1) we have for λ ≠ 0, y ≠ o
and sufficiently small λ

λ∇f(x∗)ᵀy = 0,  (1/2)λ² yᵀ∇²f(x∗ + θλy)y > 0.

Therefore f(x∗ + λy) > f(x∗).

We see that the gap between the necessary and the sufficient conditions is quite tight. However, the
example f(x) = −x⁴ shows that the gap is not zero: the point x = 0 is a strict local maximum (and not a
minimum), the sufficient condition is not satisfied (as expected), but the necessary condition is satisfied.

Example 2.4 (The least squares method). Consider a system of linear equations Ax = b, where A ∈
Rm×n, b ∈ Rm and the matrix A has rank n (cf. Section 1.3). Usually, m is much greater than n. Since
this system practically never has an exact solution, we seek an approximate solution by means of the
optimization problem

min_{x∈Rn} ‖Ax − b‖2.

Here we aim to find a vector x that minimizes the Euclidean norm of the difference between the
left and right hand sides of the system Ax = b. Since squaring is increasing on nonnegative numbers, the
minimum is attained at the same point as for the problem

min_{x∈Rn} ‖Ax − b‖2² = (Ax − b)ᵀ(Ax − b) = xᵀAᵀAx − 2bᵀAx + bᵀb.

We now check the assumptions of Theorem 2.3. The gradient of the objective function is 2AᵀAx − 2Aᵀb
(see the appendix, page 63). Setting it to zero, we get the condition AᵀAx = Aᵀb, whence x =
(AᵀA)⁻¹Aᵀb. The Hessian of the objective function is 2AᵀA, which is a positive definite matrix. Therefore
the point x = (AᵀA)⁻¹Aᵀb is a strict local minimum. Moreover, since the objective function is convex,
this solution is indeed the global minimum (as we will see later from Theorem 4.4).
If the matrix A does not have full column rank, then any solution of the system of linear equations
AᵀAx = Aᵀb is a candidate for an optimum. In fact, one can show that all these infinitely many solutions
of the system of equations are optimal solutions of our problem.
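A quick numerical check of this example (a NumPy sketch with random data): the stationary point given by the normal equations coincides with the library least-squares solution.

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 3))   # full column rank with probability 1
b = rng.standard_normal(20)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)    # (A^T A)^{-1} A^T b
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]  # library solution

print(np.allclose(x_normal, x_lstsq))           # True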
Chapter 3

Convexity

Convex sets and convex functions appeared more than 100 years ago; the topic was pioneered by Hölder
(1889), Jensen (1906), Minkowski (1910) and other famous mathematicians.

3.1 Convex sets


Definition 3.1. A set M ⊆ Rn is convex if for every x1 , x2 ∈ M and every λ1 , λ2 ≥ 0, λ1 + λ2 = 1, the
convex combination satisfies λ1 x1 + λ2 x2 ∈ M .

Example: The empty set ∅ or a singleton {x} are convex sets.


From the geometric point of view, the convexity of a set M means that for any two points in M the
set also includes the whole line segment connecting these two points. The line segment connecting points
x1 and x2 will be denoted

u(x1 , x2 ) := {x ∈ Rn ; x = λ1 x1 + λ2 x2 , λ1 , λ2 ≥ 0, λ1 + λ2 = 1}.

Convexity of a set can be equivalently characterized by using convex combinations of all k-tuples of
its points.

Theorem 3.2. Let k ≥ 2. Then a set M ⊆ Rn is convex if and only if for any x1, . . . , xk ∈ M and any
λ1, . . . , λk ≥ 0, Σ_{i=1}^k λi = 1, one has Σ_{i=1}^k λi xi ∈ M.

Proof. An exercise.

Obviously, the union of convex sets need not be convex. On the other hand, the intersection of convex
sets is always convex.

Theorem 3.3. If Mi ⊆ Rn , i ∈ I, are convex, then ∩i∈I Mi is convex.

Proof. Let x1 , x2 ∈ ∩i∈I Mi . Then for every i ∈ I we have x1 , x2 ∈ Mi , and hence also their convex
combination λ1 x1 + λ2 x2 ∈ Mi .

This property justifies introduction of the concept of the convex hull of a set M as the minimal (with
respect to inclusion) convex set containing M .

Definition 3.4. The convex hull of a set M ⊆ Rn is the intersection of all convex sets in Rn that include M.
We denote it by conv(M).

Now, convexity of a set can be characterized by yet another means.

Theorem 3.5. A set M ⊆ Rn is convex if and only if M = conv(M).

Proof. “⇒” Since M is convex, it is one of the convex sets whose intersection forms conv(M), so
conv(M) ⊆ M; the converse inclusion M ⊆ conv(M) is obvious.
“⇐” Due to Theorem 3.3, the set conv(M) is convex, so M is also convex.


Figure 3.1: Separation of sets M and N by a hyperplane aᵀx = b.

Recall that the relative interior of a set M ⊆ Rn is the interior of M when restricted to the smallest
affine subspace containing M. We denote it by ri(M).

Theorem 3.6. If M ⊆ Rn is convex, then ri(M) is convex.

Proof. Let x1, x2 ∈ ri(M). Then there exist relative ε-neighbourhoods Oε(x1), Oε(x2) ⊆ M. Consider
a convex combination x := λ1 x1 + λ2 x2 and a point y ∈ Oε(o). Then an arbitrary point of Oε(x) has
the form x + y = λ1 x1 + λ2 x2 + y = λ1(x1 + y) + λ2(x2 + y), which belongs to M thanks to the fact
that x1 + y, x2 + y ∈ M. Hence Oε(x) ⊆ M and x ∈ ri(M).

An important property of disjoint convex sets is their linear separability; see Figure 3.1.

Definition 3.7. Two nonempty sets M, N ⊆ Rn are separable if there exist a vector a ∈ Rn, a ≠ o, and a
number b ∈ R such that

aᵀx ≤ b ∀x ∈ M,
aᵀx ≥ b ∀x ∈ N,

but not

aᵀx = b ∀x ∈ M ∪ N.

We state one version of the separation theorem below. We omit the proof as it is included in another
course.
Theorem 3.8 (Separation theorem). Let M, N ⊆ Rn be nonempty and convex. Then they are separable
if and only if ri(M ) ∩ ri(N ) = ∅.
Let M ⊆ Rn be convex and closed. Using the separation property we can separate a boundary point
x∗ ∈ M and the set M by a hyperplane aᵀx = b; we call this hyperplane a supporting hyperplane of M.
We then have aᵀx∗ = b (i.e., the hyperplane contains the point x∗) and the set M lies in the positive
halfspace defined by the hyperplane, that is, aᵀx ≤ b for every x ∈ M.

Proposition 3.9. Let M ⊆ Rn be convex and closed. Then M is equal to the intersection of the positive
halfspaces determined by all supporting hyperplanes of M.

Proof. From the property aᵀx ≤ b ∀x ∈ M we get that M lies in the intersection of the halfspaces. We prove
the converse inclusion by contradiction: If there is x∗ ∉ M lying in the intersection of the halfspaces, then
we can separate it (or more precisely, its neighbourhood) from M by a supporting hyperplane. Thus we
have found a halfspace not containing x∗; a contradiction.

Figure 3.2: Outer approximation of a set M by supporting hyperplanes.

The above statement is not only of theoretical importance. Using supporting hyperplanes, we can
enclose the set M in a convex polyhedron with arbitrary precision; see Figure 3.2. This property is used
in certain algorithms, too; they start with an initial selection of supporting hyperplanes and then
iteratively include other ones when needed, in particular when one has to separate some points from the set M.

3.2 Convex functions


Convexity regards not only sets, but also functions.

Definition 3.10. Let M ⊆ Rn be a convex set. Then a function f : Rn → R is convex on M if for every
x1 , x2 ∈ M and every λ1 , λ2 ≥ 0, λ1 + λ2 = 1, one has

f (λ1 x1 + λ2 x2 ) ≤ λ1 f (x1 ) + λ2 f (x2 ).

If we have

f(λ1 x1 + λ2 x2) < λ1 f(x1) + λ2 f(x2)

for every convex combination with x1 ≠ x2 and λ1, λ2 > 0, then f is strictly convex on M.

Analogously we define a concave function: f (x) is concave if −f (x) is convex. Obviously, a function is
linear (or, more precisely, affine) if and only if it is both convex and concave.

Example 3.11. Any vector norm is a convex function because by definition for any x1, x2 ∈ Rn and
λ1, λ2 ≥ 0, λ1 + λ2 = 1,

‖λ1 x1 + λ2 x2‖ ≤ ‖λ1 x1‖ + ‖λ2 x2‖ = λ1‖x1‖ + λ2‖x2‖.

In particular, the smooth Euclidean norm ‖x‖2 is convex, as are the non-smooth norms ‖x‖1 and ‖x‖∞,
or any matrix norm.

Analogously as in Theorem 3.2 we can characterize convex functions by means of convex combinations
of k-tuples of points.

Theorem 3.12 (Jensen's inequality). Let k ≥ 2 and let M ⊆ Rn be convex. Then a function f : Rn → R
is convex on M if and only if for any x1, . . . , xk ∈ M and λ1, . . . , λk ≥ 0, Σ_{i=1}^k λi = 1, one has

f(Σ_{i=1}^k λi xi) ≤ Σ_{i=1}^k λi f(xi).


Figure 3.3: The epigraph E of a nonconvex function (on the left) and a convex function (on the right).

Proof. We proceed by mathematical induction on k. The statement is obvious for k = 2, so we turn
our attention to the induction step. Define α := Σ_{i=1}^{k−1} λi. Since α + λk = 1 and
Σ_{i=1}^{k−1} α⁻¹λi = 1, we get using the induction hypothesis

f(Σ_{i=1}^k λi xi) = f(α Σ_{i=1}^{k−1} α⁻¹λi xi + λk xk) ≤ α f(Σ_{i=1}^{k−1} α⁻¹λi xi) + λk f(xk)
≤ α Σ_{i=1}^{k−1} α⁻¹λi f(xi) + λk f(xk) = Σ_{i=1}^k λi f(xi).

The following observation is useful for practical verification of convexity of a function.

Theorem 3.13. A function f (x) is convex on M if and only if it is convex on each segment in M . That
is, the function g(t) = f (x + ty) is convex on the corresponding compact interval domain of variable t for
every x ∈ M and every y of norm 1.

Another characterization of convex functions is by means of epigraphs; see Figure 3.3.

Definition 3.14. The epigraph of a function f : Rn → R on a set M ⊆ Rn is the set

{(x, z) ∈ Rn+1 ; x ∈ M, z ≥ f (x)}.

Theorem 3.15 (Fenchel, 1951). Let M ⊆ Rn be a convex set. Then a function f : Rn → R is convex if
and only if its epigraph is a convex set.

Proof. “⇒” Denote by E the epigraph of f (x) on M , and let (x1 , z1 ), (x2 , z2 ) ∈ E be arbitrarily chosen.
Consider their convex combination

λ1 (x1 , z1 ) + λ2 (x2 , z2 ) = (λ1 x1 + λ2 x2 , λ1 z1 + λ2 z2 ).

Due to convexity of M we have λ1 x1 + λ2 x2 ∈ M, and convexity of f(x) then implies

f(λ1 x1 + λ2 x2) ≤ λ1 f(x1) + λ2 f(x2) ≤ λ1 z1 + λ2 z2,

so the convex combination λ1(x1, z1) + λ2(x2, z2) belongs to E.

“⇐” Let E be convex. For any x1 , x2 ∈ M we have (x1 , f (x1 )), (x2 , f (x2 )) ∈ E. Consider a convex
combination λ1 x1 + λ2 x2 ∈ M . Due to convexity of E we have

λ1 (x1 , f (x1 )) + λ2 (x2 , f (x2 )) = (λ1 x1 + λ2 x2 , λ1 f (x1 ) + λ2 f (x2 )) ∈ E,

whence f (λ1 x1 + λ2 x2 ) ≤ λ1 f (x1 ) + λ2 f (x2 ).

The following property, illustrated on Figure 3.4, is frequently used in optimization. We will see later
in Chapter 4 that the feasible set M of an optimization problem minx∈M f (x) is usually described by a
system of inequalities gj (x) ≤ 0, j = 1, . . . , J. If functions gj are convex, then the set M is convex, too.

Theorem 3.16. Let M ⊆ Rn be a convex set and f : Rn → R a convex function. For any b ∈ R the set
{x ∈ M ; f (x) ≤ b} is convex.


Figure 3.4: The set Mb := {x ∈ M ; f (x) ≤ b} illustrated for a convex function (on the left) and a
nonconvex function (on the right); see Theorem 3.16.

Figure 3.5: The tangent line to the graph of a convex function f (x) at point (x1 , f (x1 )).

Proof. For arbitrary x1, x2 ∈ {x ∈ M; f(x) ≤ b} consider a convex combination λ1 x1 + λ2 x2 ∈ M. From
convexity of the function f(x) we get

f(λ1 x1 + λ2 x2) ≤ λ1 f(x1) + λ2 f(x2) ≤ λ1 b + λ2 b = b.

Another nice property of convex functions is their continuity. We state this result without a proof,
which can be found e.g. in Lange [2016].

Theorem 3.17. Let M ⊆ Rn be a nonempty convex set of dimension n, and let f : Rn → R be a convex
function. Then f (x) is continuous and locally Lipschitz on int M .

3.3 The first and second order characterization of convex functions


The first order characterization of a convex function f(x) can be viewed visually; see Figure 3.5. The
tangent line to the graph of f(x) at any point (x1, f(x1)) must lie below the graph, that is,
f(x) ≥ f(x1) + ∇f(x1)ᵀ(x − x1).

Theorem 3.18 (The first order characterization of a convex function, Avriel 1976, Mangasarian 1969).
Let ∅ ≠ M ⊆ Rn be a convex set and let f(x) be a function differentiable on an open superset of M. Then
f(x) is convex on M if and only if for every x1, x2 ∈ M

f(x2) − f(x1) ≥ ∇f(x1)ᵀ(x2 − x1). (3.1)


24 Chapter 3. Convexity

Proof. “⇒” Let x1, x2 ∈ M and λ ∈ (0, 1) be arbitrary. Then

f((1 − λ)x1 + λx2) ≤ (1 − λ)f(x1) + λf(x2),
f((1 − λ)x1 + λx2) − f(x1) ≤ λ(f(x2) − f(x1)),
[f(x1 + λ(x2 − x1)) − f(x1)] / λ ≤ f(x2) − f(x1).

By the limit transition λ → 0 we get (3.1), utilizing the chain rule for the derivative of the composite
function g(λ) = f(x1 + λ(x2 − x1)) with respect to λ.
“⇐” Let x1, x2 ∈ M and consider a convex combination x = λ1 x1 + λ2 x2. By (3.1) we have

f(x1) − f(x) ≥ ∇f(x)ᵀ(x1 − x) = ∇f(x)ᵀ(x1 − (λ1 x1 + λ2 x2)) = λ2 ∇f(x)ᵀ(x1 − x2),
f(x2) − f(x) ≥ ∇f(x)ᵀ(x2 − x) = ∇f(x)ᵀ(x2 − (λ1 x1 + λ2 x2)) = λ1 ∇f(x)ᵀ(x2 − x1).

Multiplying the first inequality by λ1 and the second one by λ2 and summing up, we get

λ1(f(x1) − f(x)) + λ2(f(x2) − f(x)) ≥ 0,

or λ1 f(x1) + λ2 f(x2) ≥ f(x).

Remark 3.19. For strict convexity we have an analogous characterization:

∀x1, x2 ∈ M, x1 ≠ x2 : f(x2) − f(x1) > ∇f(x1)ᵀ(x2 − x1). (3.2)

Theorem 3.20 (The second order characterization of a convex function, Fenchel, 1951). Let ∅ ≠ M ⊆ Rn
be an open convex set of dimension n, and suppose that a function f : M → R is twice continuously
differentiable on M. Then f(x) is convex on M if and only if the Hessian ∇²f(x) is positive semidefinite
for every x ∈ M.

Proof. Let x∗ ∈ M be arbitrary. Due to the continuity of the second partial derivatives we have that for
every λ ∈ R and y ∈ Rn with x∗ + λy ∈ M, there is θ ∈ (0, 1) such that

f(x∗ + λy) = f(x∗) + λ∇f(x∗)ᵀy + (1/2)λ² yᵀ∇²f(x∗ + θλy)y. (3.3)
“⇒” From Theorem 3.18 we get

f(x∗ + λy) ≥ f(x∗) + λ∇f(x∗)ᵀy,

so that (3.3) implies

yᵀ∇²f(x∗ + θλy)y ≥ 0.

By the limit transition λ → 0 we have yᵀ∇²f(x∗)y ≥ 0.
“⇐” Due to the positive semidefiniteness of the Hessian we have yᵀ∇²f(x∗ + θλy)y ≥ 0 in the expression
(3.3). Hence

f(x∗ + λy) ≥ f(x∗) + λ∇f(x∗)ᵀy,

which shows convexity of f(x) in view of Theorem 3.18.

Remark 3.21. For strict convexity, we can state the following conditions:
(1) If f is strictly convex, then the Hessian ∇²f(x) is positive definite almost everywhere on M; at the
remaining points it is positive semidefinite.
(2) If the Hessian ∇²f(x) is positive definite on M, then f is strictly convex.
In the first item, we cannot claim positive definiteness everywhere on M: using reasoning analogous to
the proof of Theorem 3.20, the limit transition λ → 0 can turn the strict inequality into a non-strict one.

Example 3.22.
1. The function f(x) = x⁴ is strictly convex on R, but its second derivative f″(x) = 12x² vanishes at x = 0.
2. The function f(x) = x⁻² has a positive second derivative everywhere on R \ {0}, but it is not convex
there. The reason is that R \ {0} is not a convex set; the definition of a convex function is not
satisfied even when the convex combinations avoid zero. Therefore it is necessary that the
domain is a convex set. Hence f(x) is convex separately on (0, ∞) and on (−∞, 0).
Example 3.23. Consider a quadratic function f : Rn → R given by the formula f(x) = xᵀAx + bᵀx + c,
where A ∈ Rn×n is symmetric, b ∈ Rn and c ∈ R (see the appendix, page 63). Then
• f(x) is convex if and only if A is positive semidefinite,
• f(x) is strictly convex if and only if A is positive definite.
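By this criterion, convexity of a quadratic function reduces to an eigenvalue test. A small NumPy sketch, with a numerical tolerance added as an assumption:

import numpy as np

def is_convex_quadratic(A, tol=1e-10):
    # x^T A x + b^T x + c is convex iff the symmetric A has no negative eigenvalue.
    A = (A + A.T) / 2                  # symmetrize defensively
    return np.linalg.eigvalsh(A).min() >= -tol

print(is_convex_quadratic(np.array([[2.0, 0.0], [0.0, 1.0]])))   # True
print(is_convex_quadratic(np.array([[1.0, 0.0], [0.0, -1.0]])))  # False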

3.4 Other rules for detecting convexity of a function


In this section we discuss whether and how convexity is preserved under addition, products, composition
and other operations.
Theorem 3.24. Let f, g : R → R.
(1) If f (x), g(x) are both convex, nonnegative and nondecreasing (or both nonincreasing), then f (x) ·
g(x) is convex.
(2) If f (x) is convex, nonnegative and nondecreasing, and g(x) is concave, positive and nonincreasing,
then f (x)/g(x) is convex.
Proof. We prove the first property; the second one is analogous. Assuming for simplicity that f and g are
twice differentiable, we have (fg)″ = f″g + 2f′g′ + fg″ ≥ 0, since each term is nonnegative (note that
f′g′ ≥ 0 because f and g are monotone in the same direction). Therefore fg is convex in view of
Theorem 3.20.
Example 3.25. Both functions f(x) = x and g(y) = y are convex, but their product h(x, y) = xy is
not convex even when restricted to the domain (x, y) ∈ [0, ∞)²; it is strictly concave on the segment between
the points (1, 0) and (0, 1). Therefore, in order to apply Theorem 3.24, both functions need to be functions
of the same variable. Notice that the function f(x) · g(x) = x² is convex.
Theorem 3.26. Let f : Rn → Rk and g : Rk → R.
(1) If fi(x) is convex for each i = 1, . . . , k and g(y) is convex and nondecreasing in each coordinate,
then (g ◦ f)(x) = g(f(x)) is convex.
(2) If fi(x) is concave for each i = 1, . . . , k and g(y) is convex and nonincreasing in each coordinate,
then (g ◦ f)(x) = g(f(x)) is convex.
Proof. We will show the first property only; the second property is analogous.
Let x1 , x2 ∈ Rn . For a convex combination λ1 x1 + λ2 x2 we have
g(f (λ1 x1 + λ2 x2 )) ≤ g(λ1 f (x1 ) + λ2 f (x2 )) ≤ λ1 g(f (x1 )) + λ2 g(f (x2 )),
where the first inequality follows from convexity of f and monotonicity of g, and the second inequality is
due to convexity of g.
Example 3.27.
1. If f : Rn → R is convex, then e^{f(x)} is convex. For example, e^{eˣ}, e^{x²−x}, e^{x1−x2}, . . .
2. If f(x) ≥ 0 and convex, then f(x)^p is convex for every p ≥ 1.
3. If f(x) ≥ 0 and concave, then − log(f(x)) is convex.
Example 3.28. The monotonicity assumption in Theorem 3.26 is indeed necessary. For example, the
functions f(x) = x² − 1 and g(y) = y² are convex, but g(f(x)) = (x² − 1)² is not convex.
Remark 3.29. Checking convexity of a function is a hard problem in general. Ahmadi et al. [2013] proved
that it is an NP-hard problem for a class of multivariate polynomials of degree at most 4 (i.e., the sum of
degrees in each term is at most 4). For a “general” function, it is still an open problem whether convexity
testing is decidable.
Chapter 4

Convex optimization

The problem of convex optimization reads

min f (x) subject to x ∈ M,

where f : Rn → R is a convex function and M ⊆ Rn is a convex set. Often the feasible set M is described
in the following form:

M = {x ∈ Rn ; gj (x) ≤ 0, j = 1, . . . , J},

where gj(x) : Rn → R, j = 1, . . . , J, are convex functions. By Theorem 3.16 the set M is then convex. In
this chapter, however, we will deal with a general convex set M.

Example 4.1. An example of a convex optimization problem:

min x1 + x2 subject to x1² + x2² ≤ 2.

Another example:

min x1² + x2² + 2x2 subject to x1² + x2² ≤ 2.

4.1 Basic properties


Theorem 4.2 (Fenchel, 1951). For a convex optimization problem we have:
(1) Each local minimum is a global minimum.
(2) The optimal solution set is convex.
(3) If f(x) is a strictly convex function, then there is either a unique minimum or none.

Proof.
(1) Let x0 ∈ M be a local minimum and suppose to the contrary that there is x∗ ∈ M such that
f (x∗ ) < f (x0 ). Consider the convex combination x = λx∗ + (1 − λ)x0 ∈ M , λ ∈ (0, 1). Then

f (x) ≤ λf (x∗ ) + (1 − λ)f (x0 ) < λf (x0 ) + (1 − λ)f (x0 ) = f (x0 ).

This is in contradiction with local minimality of x0 since for arbitrarily small λ > 0 we have
f (x) < f (x0 ).
(2) Let x1 , x2 ∈ M be two optimal solutions and denote by z = f (x1 ) = f (x2 ) the optimal value. The
convex combination x = λ1 x1 + λ2 x2 ∈ M then satisfies

f (x) ≤ λ1 f (x1 ) + λ2 f (x2 ) = λ1 z + λ2 z = z,

that is, x is also an optimal solution.


(3) Suppose to the contrary that x1, x2 ∈ M, x1 ≠ x2, are two optimal solutions. Denote by z =
f(x1) = f(x2) the optimal value. The convex combination x = λ1 x1 + λ2 x2 ∈ M, λ1, λ2 > 0, then
satisfies

f(x) < λ1 f(x1) + λ2 f(x2) = λ1 z + λ2 z = z,

that is, x is better than an optimal solution; a contradiction.

Notice that a convex optimization problem need not possess an optimal solution. Consider, for example,
min_{x∈R} eˣ. This situation may happen even if the feasible set is compact:

Example 4.3. Consider the function f : [1, 2] → R defined by

f(x) = x for 1 < x ≤ 2, and f(1) = 2.

This function is convex, but not continuous, and the minimum on [1, 2] is not attained.
Theorem 4.4. Let ∅ ≠ M ⊆ Rn be an open convex set and f : M → R a convex differentiable function
on M. Then x∗ ∈ M is an optimal solution if and only if ∇f(x∗) = o.
Proof. “⇒” Let x∗ ∈ M be an optimal solution. Then it is a local minimum, too, and according to
Theorem 2.1 we have ∇f(x∗) = o.
“⇐” Let ∇f(x∗) = o. By Theorem 3.18 we have f(x) − f(x∗) ≥ ∇f(x∗)ᵀ(x − x∗) = 0 for any x ∈ M.
Therefore f(x) ≥ f(x∗) and x∗ is an optimal solution.

We cannot remove the assumption that M is open. For instance, for the problem min_{x∈[1,2]} x we have
M = [1, 2] convex and the objective function f(x) = x is differentiable on R, but its derivative at the
optimal point x∗ = 1 is f′(1) = 1 ≠ 0.
We can generalize the theorem as follows.
Theorem 4.5. Let ∅ ≠ M ⊆ Rn be a convex set and f : M′ → R a convex function differentiable on an
open set M′ ⊇ M. Then x∗ ∈ M is an optimal solution if and only if ∇f(x∗)ᵀ(y − x∗) ≥ 0 for every
y ∈ M.
Proof. “⇒” Suppose to the contrary that there is y ∈ M such that ∇f(x∗)ᵀ(y − x∗) < 0. Consider the
convex combination xλ = λy + (1 − λ)x∗ = x∗ + λ(y − x∗) ∈ M. Then

0 > ∇f(x∗)ᵀ(y − x∗) = lim_{λ→0+} [f(x∗ + λ(y − x∗)) − f(x∗)] / λ = lim_{λ→0+} [f(xλ) − f(x∗)] / λ.

Hence f(xλ) < f(x∗) for a sufficiently small λ > 0; a contradiction.
“⇐” By Theorem 3.18, for every y ∈ M we have f(y) − f(x∗) ≥ ∇f(x∗)ᵀ(y − x∗) ≥ 0. Therefore
f(y) ≥ f(x∗), and x∗ is an optimal solution.

The condition from Theorem 4.5 is particularly satisfied if ∇f (x∗ ) = o. This means that each stationary
point is a global minimum.
Example 4.6. The first problem in Example 4.1 reads

min x1 + x2 subject to x1² + x2² ≤ 2.

Obviously, the optimum is x∗ = (−1, −1)ᵀ. We can verify it by means of Theorem 4.5. First, compute
∇f(x∗) = (1, 1)ᵀ. Now, we have to show that for each feasible y we have

∇f(x∗)ᵀ(y − x∗) = (1, 1)(y1 + 1, y2 + 1)ᵀ ≥ 0,

or y1 + y2 ≥ −2. This is clearly true.
The second problem in Example 4.1 reads

min x1² + x2² + 2x2 subject to x1² + x2² ≤ 2.

We compute ∇f(x) = (2x1, 2x2 + 2)ᵀ, and this gradient is zero at the point x∗ = (0, −1). Since this point
satisfies the constraint, it is the optimum.

Example 4.7 (Rating system). Many methods have been developed to provide ratings of sport teams
or other entities. Here we present the following method [Langville and Meyer, 2012]. Consider n teams
that we want to rate by numbers r1, . . . , rn ∈ R. Let A ∈ Rn×n be a known scoring matrix, where
aij gives the scoring of team i against team j. This matrix is skew symmetric, that is A = −Aᵀ, since
aii = 0 and aij = −aji. The rating vector r = (r1, . . . , rn)ᵀ should reflect the scorings, so ideally we have
aij = ri − rj, or in matrix form A = reᵀ − erᵀ. This is hardly satisfied in practice, but we aim to find the
best approximation, which leads to the optimization formulation

min_{x∈Rn} f(x) = ‖A − (xeᵀ − exᵀ)‖².

We choose the Frobenius matrix norm, defined for M ∈ Rn×n as ‖M‖ = (Σ_{i,j} mij²)^{1/2} = (tr(MᵀM))^{1/2};
that is why we minimize the square of the norm. The objective function then reads

f(x) = tr[(A − (xeᵀ − exᵀ))ᵀ(A − (xeᵀ − exᵀ))]
= tr(AᵀA) − tr[Aᵀ(xeᵀ − exᵀ) + (xeᵀ − exᵀ)ᵀA] + tr[(xeᵀ − exᵀ)ᵀ(xeᵀ − exᵀ)]
= tr(AᵀA) − 4xᵀAe + 2n(xᵀx) − 2(eᵀx)².

The gradient and the Hessian read

∇f(x) = −4Ae + 4nx − 4eeᵀx,  ∇²f(x) = 4(nIn − eeᵀ).

Since the Hessian is positive semidefinite, the function f(x) is convex. The optimality condition ∇f(x) = 0
yields the system of linear equations

(nIn − eeᵀ)x = Ae.

The matrix has rank n − 1, and so the solution set is the line x = (1/n)Ae + αe, α ∈ R. The function f(x) is
constant on this line, so the whole line is the optimal solution set. In practice, we usually normalize the
rating vector such that eᵀr = 0. Since eᵀAe = 0, we obtain the resulting formula for the rating vector
r = (1/n)Ae.
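A tiny NumPy sketch of the resulting formula r = (1/n)Ae; the three-team scoring matrix is made-up illustration data:

import numpy as np

# Skew-symmetric scoring matrix: a_ij = scoring of team i against team j.
A = np.array([[ 0.0,  2.0, -1.0],
              [-2.0,  0.0,  3.0],
              [ 1.0, -3.0,  0.0]])

n = A.shape[0]
r = A @ np.ones(n) / n   # rating vector, normalized so that e^T r = 0
print(r)                 # higher rating = stronger team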

Naturally, for special problems in convex optimization we can derive special properties. In the following
sections we will discuss several particular classes of convex optimization problems.

4.2 Quadratic programming


A quadratic programming problem reads

min xᵀCx + dᵀx subject to x ∈ M,

where C ∈ Rn×n is symmetric, d ∈ Rn and M ⊆ Rn is a convex polyhedral set. If the matrix C is positive
semidefinite, then it is a convex problem, called a convex quadratic program.
Convex quadratic programs are effectively solvable in polynomial time [Floudas and Pardalos, 2009,
Section “Complexity Theory: Quadratic Programming”]. If C is not positive semidefinite, then the problem
is NP-hard; even finding a local minimum is NP-hard. It is interesting that NP-hardness remains
valid even for the subclass of problems defined by a matrix C having exactly one negative eigenvalue
[Pardalos and Vavasis, 1991; Vavasis, 1991]. NP-hardness of the subclass having C negative definite is
proved below; we formulate the problem equivalently as maximization of a convex quadratic function via
max_{x∈M} xᵀCx = − min_{x∈M} (−xᵀCx).

Theorem 4.8. The problem max_{x∈M} xᵀCx is NP-hard even when C is positive definite.

Proof. We will construct a reduction from the NP-complete problem Set-Partitioning: Given a set of
numbers α1, . . . , αn ∈ N, can we group them into two subsets such that the sums of the numbers in both
subsets are the same? Equivalently, is there x ∈ {±1}ⁿ such that Σ_{i=1}^n αi xi = 0? This problem can be
formulated as follows

max Σ_{i=1}^n xi² subject to Σ_{i=1}^n αi xi = 0, x ∈ [−1, 1]ⁿ.

The optimal value of this problem is n if and only if Set-Partitioning is solvable. This optimization
problem follows the template since the constraints are linear and the objective function has the form
xᵀCx + dᵀx for C = In and d = o.

Example 4.9 (Portfolio selection problem). This is a textbook example of an application of convex
quadratic programming. The pioneer in this area was Harry Markowitz, a Nobel Prize winner in Economics
in 1990, awarded for his results from 1952.
The problem is formulated as follows: a capital K is to be invested in n investments. The return of
investment i is ci. The mathematical formulation of the portfolio selection problem is the linear program

max cᵀx subject to eᵀx = K, x ≥ o.

The returns of investments are usually not known exactly and they are modelled as random quantities.
Suppose that the vector c is random, its expected value is c̃ := E c and its covariance matrix is Σ :=
cov c = E (c − c̃)(c − c̃)ᵀ, which is positive semidefinite (proof: for every x ∈ Rn we have xᵀΣx =
xᵀ(E (c − c̃)(c − c̃)ᵀ)x = E xᵀ(c − c̃)(c − c̃)ᵀx = E ((c − c̃)ᵀx)² ≥ 0). For a fixed vector x ∈ Rn, the expected
value of the objective cᵀx is E (cᵀx) = c̃ᵀx, and the variance of cᵀx is var(cᵀx) = xᵀΣx.
Maximizing the expected value of the reward leads to the linear programming problem

max c̃ᵀx subject to eᵀx = K, x ≥ o.

Taking into account the risks of the investments, we model the problem as the convex quadratic program

max c̃ᵀx − γxᵀΣx subject to eᵀx = K, x ≥ o,

where γ > 0 is the so-called risk aversion coefficient.
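A minimal CVXPY sketch of this model, assuming the cvxpy package is available; the expected returns c_tilde, covariance Sigma, capital K and aversion gamma are made-up illustration data:

import numpy as np
import cvxpy as cp

c_tilde = np.array([0.08, 0.05, 0.12])       # expected returns
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.02, 0.00],
                  [0.00, 0.00, 0.09]])       # positive semidefinite covariance
K, gamma = 1.0, 5.0                          # capital and risk aversion

x = cp.Variable(3)
objective = cp.Maximize(c_tilde @ x - gamma * cp.quad_form(x, Sigma))
cp.Problem(objective, [cp.sum(x) == K, x >= 0]).solve()
print(x.value)  # optimal allocation; a larger gamma shifts it to safer assets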

Example 4.10 (Quadrocopter trajectory planning). We need to plan a trajectory for a quadrocopter
fleet such that collisions are avoided and the fleet is transferred from an initial state to a terminal
state with minimum effort. In our model, time is discretized into time slots of length h. The variables are
the position pi(k), velocity vi(k) and acceleration ai(k) of quadrocopter i in time step k. The constraints
are:
• physical constraints: the relations between velocity and acceleration, position and velocity, . . .
(e.g., vi(k) = vi(k − 1) + h · ai(k − 1), pi(k) = pi(k − 1) + h · vi(k − 1), . . . )
• restrictions on the maximum velocity, acceleration and jerk (i.e., the derivative of acceleration),
• the initial and terminal state (positions etc.),
• the collision avoidance constraint, which is nonlinear (‖pi(k) − pj(k)‖2 ≥ r ∀i ≠ j), so we have to
linearize it.
The objective function is given by the sum of the norms of the accelerations over the particular time steps
(Σ_{i,k} ‖ai(k) + g‖2²). For more details see:
• https://www.youtube.com/watch?v=wwK7WvvUvlI
• F. Augugliaro, A.P. Schoellig, and R. D'Andrea, Generation of collision-free trajectories for a quadro-
copter fleet: A sequential convex programming approach, IEEE/RSJ International Conference on
Intelligent Robots and Systems, 2012, pp. 1917–1922.

The practical importance of this problem is underlined by the fact that collision free planning is a very
topical research problem in air traffic control of airports.

4.3 Convex cone programming


This section draws mainly from the book by Ben-Tal and Nemirovski [2001]. The motivation for cone
programming is as follows. The linear programming problem

min cᵀx subject to Ax ≥ b

can be generalized in several ways. In Section 4.2 we replaced linear functions with quadratic ones. Another
way is to generalize the relation “≥”.
Definition 4.11. A set ∅ ≠ K ⊆ Rn is a convex cone if two conditions are satisfied:
(1) for every α ≥ 0 and x ∈ K we have αx ∈ K,
(2) for every x, y ∈ K we have x + y ∈ K.
A cone is called pointed if it contains no complete line.

Proposition 4.12. If K is a pointed convex cone, then it induces
(1) a partial order by the definition x ≥K y ⇔ x − y ∈ K,
(2) a strict partial order by the definition x >K y ⇔ x − y ∈ int K.
From now on we consider only a pointed convex closed cone K with nonempty interior.
Example 4.13 (Examples of cones). The frequently used cones are:
• The nonnegative orthant Rn+ = {x ∈ Rn; x ≥ 0}. The corresponding partial order is the standard
entrywise inequality ≥ for vectors.
• The Lorentz cone (ice cream cone) L = {x ∈ Rn; xn ≥ (Σ_{i=1}^{n−1} xi²)^{1/2}} = {x ∈ Rn; xn ≥
‖(x1, . . . , xn−1)‖2}.
• The generalized Lorentz cone L = {x ∈ Rn; xn ≥ ‖(x1, . . . , xn−1)‖}, where ‖ · ‖ is an arbitrary norm.
• A convex polyhedral cone, characterized by a system Ax ≤ 0. This category involves, for example,
the nonnegative orthant or the generalized Lorentz cone with the Manhattan or maximum norm.
• The cone of positive semidefinite matrices.
Now we are ready to introduce cone programming. The cone programming problem reads

min cᵀx subject to Ax ≥K b. (4.1)
Example 4.14 (Examples of cone programs).
• For K = Rn+ we get the standard linear programming.
• Employing the Lorentz cone, we have a more interesting example
min cT x subject to kBx − ak2 ≤ dT x + f. (4.2)
This problem is not easily transformable to a problem with convex quadratic constraints since the
squaring of both sides yields quadratic constraints which need not be convex (convexity of the
constraint function was destroyed by the squaring).
• The cone constraints can be combined, so we can consider also problems such as
min cT x subject to Ax ≥ b, ‖Bx − a‖2 ≤ dT x + f.
The reason is that the Cartesian product of cones is again a cone. In our case we used the cone
K := Rn+ × L.
This problem belongs to a class called second order cone programming; see Section 4.3.2.
• The cone of positive semidefinite matrices leads to problems of the type

min cT x subject to ∑_{k=1}^n xk A(k) − B is positive semidefinite,

where A(1) , . . . , A(n) , B are symmetric matrices. Such problems are called semidefinite programs; see
Section 4.3.3.
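As a concrete illustration, problem (4.2) can be handed directly to a conic solver. The following is a minimal sketch, assuming the Python modeling library cvxpy and small illustrative data (chosen so that the feasible set is bounded); it is not part of the theory above, only a prototype:

# Sketch of the cone program (4.2): min c^T x s.t. ||Bx - a||_2 <= d^T x + f.
# Assumes cvxpy; all data below are illustrative only.
import cvxpy as cp
import numpy as np

c = np.array([1.0, 1.0])
B = np.eye(2)
a = np.zeros(2)
d = np.array([0.1, 0.0])
f = 1.0

x = cp.Variable(2)
prob = cp.Problem(cp.Minimize(c @ x),
                  [cp.norm(B @ x - a, 2) <= d @ x + f])
prob.solve()
print(prob.value, x.value)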

4.3.1 Duality in convex cone programming


Motivation. Recall the derivation of duality of a linear program min{cT x; Ax ≥ b}: Let x be a feasible
solution. Then for every y ≥ 0 we have y T Ax ≥ y T b. If y in addition satisfies y T A = cT , then we get
cT x = y T Ax ≥ y T b. In other words, y T b is a lower bound on the optimal value for every y ≥ 0 such that
AT y = c. This leads to the dual problem formulation and weak duality

min{cT x; Ax ≥ b} ≥ max{bT y; AT y = c, y ≥ 0}.

Now the question is, in the case of convex cone programming (4.1), which relation should replace
y ≥ 0? It is not hard to see that neither y ≥ 0 nor y ≥K 0 works well. In fact, we are interested in such y
for which we have y T a ≥ 0 for every a ≥K 0. Obviously, the set of such y forms a cone – this cone is
called the dual cone of K.
Definition 4.15. Let K ⊆ Rn be a cone. Then its dual cone is the cone

K∗ = {y ∈ Rn ; y T a ≥ 0 ∀a ∈ K}.

By using the dual cone, we formulate the dual problem to (4.1) as follows

max bT y subject to AT y = c, y ≥K∗ 0. (4.3)

Weak duality is then a direct adaptation of weak duality in linear programming.


Theorem 4.16 (Weak duality). We have: min{cT x; Ax ≥K b} ≥ max{bT y; AT y = c, y ≥K∗ 0}.
Proof. For every y ≥K∗ 0 such that AT y = c and for every x such that Ax ≥K b we have

cT x = y T Ax ≥ y T b.

In other words, the objective value of each feasible solution is an upper bound on every objective value of
the dual problem. Therefore the inequality holds true even for the extremal values.

Now we state some basic properties of dual cones. Some of them are illustrated on Figure 4.1. For
instance, Proposition 4.18(4) is illustrated by Figures 4.1a and 4.1c.
Example 4.17.
• Nonnegative orthant is self-dual, that is, (Rn+ )∗ = Rn+ (see Figure 4.1a).
• The Lorentz cone is self-dual as well, L∗ = L (see Figure 4.1b).
• The cone of positive semidefinite matrices is also self-dual; herein, the scalar product of positive
semidefinite matrices A, B is defined by ⟨A, B⟩ := tr(AB) = ∑i,j aij bij .

Proposition 4.18. We have:


(1) K∗ is a closed convex cone.
(2) If K is a closed convex cone, then (K∗ )∗ = K.
(3) If K1 , K2 are cones, then K1 × K2 is a cone and (K1 × K2 )∗ = K1∗ × K2∗ .
(4) If K1 ⊆ K2 are cones, then K1∗ ⊇ K2∗ .
Based on the above properties, we can see that a convex cone program can also have the form of

min cT x subject to Ax ≥ b, Bx ≥K d,

the dual problem of which is

max bT y + dT z subject to AT y + B T z = c, y ≥ 0, z ≥K∗ 0.

Similarly we can consider combinations of other cone constraints.


Can we state strong duality in convex cone programming? In general no, but under mild assumptions
strong duality holds.

Figure 4.1: Cones and their duals (for the sake of better visibility the dual cones are multiplied by −1,
i.e., rotated around the origin): (a) nonnegative orthant Rn+ and its dual (Rn+ )∗ for n = 2; (b) Lorentz
cone L and its dual L∗ ; (c) a cone and its dual in R2 ; (d) generalized Lorentz cone L and its dual L∗ .

Theorem 4.19 (Strong duality). The primal and dual optimal values are the same provided at least one
of the following conditions holds
(1) the primal problem is strictly feasible, that is, there is x such that Ax >K b,
(2) the dual problem is strictly feasible, that is, there is y >K∗ 0 such that AT y = c.

Proof. We present the basic idea of the proof of (1) without technical details; in view of duality the point
(2) is analogous.
Let f ∗ be the optimal value and assume that c ≠ 0 (otherwise f ∗ = 0 and we have strong duality with
y = 0). Define the set
M := {y = Ax − b; x ∈ Rn , cT x ≤ f ∗ }.
It is easy to see that M ∩ int(K) = ∅. Otherwise there is x such that Ax >K b and cT x ≤ f ∗ , so by a
small change of x in the direction of −c we obtain a super-optimal value.
Since both M and K are convex sets, we can separate them by a hyperplane λT y = 0 (the zero right-
hand side follows from the fact that K is a cone). Since K lies in the positive halfspace, we have λT y ≥ 0
for every y ∈ K, whence λ ∈ K∗ . Since M lies in the negative halfspace, we have λT y ≤ 0 for every y ∈ M;
so λT Ax ≤ λT b for every x such that cT x ≤ f ∗ . This can happen only if the normal vectors AT λ and c
are linearly dependent, that is, AT λ = µc for µ ≥ 0.

Figure 4.2: (Example 4.20) A second order cone program, for which the optimal value is not attained.

Observe that µ > 0. Otherwise, if µ = 0, then AT λ = 0 and also λT b ≥ 0. Due to strict feasibility of
the primal problem there is x̃ such that Ax̃ >K b. Since λ ≥K∗ 0, λ ≠ 0, we get by premultiplication that
λT (Ax̃ − b) > 0, or λT b < 0; a contradiction.
By normalizing µ ≡ 1 we obtain AT λ = c. This yields a dual feasible solution λ since it satisfies
AT λ = c, λ ≥K∗ 0. Moreover, we know that λT b ≥ λT Ax = cT x for every x such that cT x ≤ f ∗ (including
x such that cT x = f ∗ ), whence λT b ≥ f ∗ . In conjunction with weak duality we obtain equality.

Notice that even when strong duality holds and both primal and dual optimal values are (the same
and) finite, it may happen that the optimal value is not attained (formally, we should write “inf” instead
of “min”). The following example illustrates this situation.

Example 4.20. Consider the convex cone program of form (4.2)

min x1 subject to √((x1 − x2 )² + 1) ≤ x1 + x2 .

By squaring both sides of the inequality we get

min x1 subject to 4x1 x2 ≥ 1, x1 + x2 > 0.

Even though the problem is strictly feasible, the optimal value is not attained; see Figure 4.2.

The next example illustrates the situation when the assumption of Theorem 4.19 as well as strong
duality are not satisfied.

Example 4.21. Consider the second order cone program

min x2 subject to √(x1² + x2² ) ≤ x1 .

We express it equivalently as

min x2 subject to x2 = 0, x1 ≥ 0.

We can see that the optimal value is 0 and each feasible solution is optimal, that is, the optimal solution
set consists of the nonnegative part of the first axis.
To construct the dual problem, we rewrite the primal program into the canonical form

min x2 subject to (x1 , x2 , x1 )T ≥L 0.



The dual problem then reads

max 0 subject to y1 + y3 = 0, y2 = 1, y ≥L 0.
The inequality y ≥L 0 takes the form y3 ≥ √(y1² + y2² ), which together with y1 + y3 = 0 forces y2 = 0;
a contradiction with y2 = 1. Hence the dual problem is infeasible, even though the primal problem has a
finite optimal value.

4.3.2 Second order cone programming


Second order cone programming deals with convex cone programs with linear constraints and constraints
corresponding to the Lorentz cone. For the sake of simplicity we employ just one Lorentz cone, and so the
problem reads

min cT x subject to Ax ≥ b, Bx ≥L d. (4.4)

We partition

(B | d) = ( D   f )
          ( pT  q ),

so the condition Bx ≥L d takes the form ‖Dx − f ‖2 ≤ pT x − q. Thus we have an explicit description
of problem (4.4)

min cT x subject to Ax ≥ b, ‖Dx − f ‖2 ≤ pT x − q. (4.5)

Recall that the problem is not easily transformable to a convex quadratic problem, even when allowing
convex quadratic constraints. Thus we have a new class of optimization problems, which are efficiently
solvable and contain many interesting problems. Actually, a lot of functions and nonlinear conditions can
be expressed in the form of (4.5).

Example 4.22 (Examples of second order cone programs).


• Quadratic constraints. For example, the condition xT x ≤ z can be expressed as xT x + (1/4)(z − 1)² ≤
(1/4)(z + 1)², the square root of which gives ‖(xT , (1/2)(z − 1))‖2 ≤ (1/2)(z + 1).
• Hyperbola. The condition x · y ≥ 1 on y ≥ 0 can be expressed as (1/4)(x + y)² ≥ 1 + (1/4)(x − y)², the
square root of which (notice y ≥ 0) gives ‖(1, (1/2)(x − y))‖2 ≤ (1/2)(x + y).

On the other hand, the condition e^x ≤ z is not a second order cone constraint.
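Both reformulations above are exact equivalences, which can be checked numerically. A quick sanity-check sketch for the hyperbola case, assuming plain NumPy and random samples:

# Check: for y >= 0, the condition x*y >= 1 holds exactly when
# ||(1, (x - y)/2)||_2 <= (x + y)/2.
import numpy as np

rng = np.random.default_rng(1)
for _ in range(100_000):
    x, y = rng.uniform(-3, 3), rng.uniform(0, 3)
    if abs(x * y - 1.0) < 1e-9:   # skip numerically ambiguous boundary points
        continue
    soc = np.hypot(1.0, 0.5 * (x - y)) <= 0.5 * (x + y)
    assert soc == (x * y >= 1.0)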

The dual problem is

max bT y + dT z subject to AT y + B T z = c, y ≥ 0, z ≥L∗ 0.

Letting z = (uT , v)T we get

max bT y + dT z subject to AT y + B T z = c, y ≥ 0, v ≥ ‖u‖2 .

The dual problem is thus also a second order cone program.

4.3.3 Semidefinite programming


Employing the cone of positive semidefinite matrices in the convex cone programming problem (4.1), we
obtain the class of semidefinite programming problems
min cT x subject to ∑_{k=1}^n xk A(k) ⪰ B, (4.6)

where c ∈ Rn , the matrices A(1) , . . . , A(n) , B ∈ Rm×m are symmetric and the relation A ⪰ B means that
A − B is positive semidefinite.1)
How to construct the dual problem? According to (4.3), the dual problem has m² variables, so they
constitute a matrix of variables Y ∈ Rm×m . The dual objective function is ∑i,j bij yij , the equations
have the form ∑i,j a(k)ij yij = ck , and the condition Y ≥K∗ 0 takes the form Y ⪰ 0. In total, the dual
problem reads

max tr(BY ) subject to tr(A(k) Y ) = ck , k = 1, . . . , n, Y ⪰ 0. (4.7)

Example 4.23 (Examples of semidefinite programs).


• Linear constraints. Linear inequalities Ax ≤ b are expressed as the semidefinite condition

diag(b1 − A1∗ x, b2 − A2∗ x, . . . , bm − Am∗ x) ⪰ 0.

• Second order cone constraints. They can be expressed as semidefinite constraints. Basically, it is
sufficient to show it for the condition ‖x‖2 ≤ z; the others can be handled by a linear transformation.
We have

‖x‖2 ≤ z  ⇔  ( z · In   x )
             ( xT       z )  ⪰ 0. (4.8)

Proof. For z = 0 the equivalence holds, so we assume z > 0. We consider the matrix as the matrix of
a quadratic form and we transform it to a block diagonal matrix by using elementary row & column
transformations. Subtracting the (1/z)xT -multiple of the first block row from the second one, and applying
the same for the columns, we get

( z · In   x )     ( z · In   0             )
( xT       z )  ∼  ( 0        z − (1/z)xT x ).

This matrix is positive semidefinite if and only if z > 0 and xT x ≤ z², or after taking the square
root, ‖x‖2 ≤ z.
• Eigenvalues. Many conditions on eigenvalues can be expressed as semidefinite programs. For in-
stance, the largest eigenvalue λmax of a symmetric matrix A ∈ Rn×n :

λmax = min z subject to z · In ⪰ A.
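The last item is easy to verify computationally. A sketch, assuming cvxpy and an illustrative matrix:

# The largest eigenvalue of a symmetric A as the SDP:  min z  s.t.  z*I - A PSD.
import cvxpy as cp
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
z = cp.Variable()
prob = cp.Problem(cp.Minimize(z), [z * np.eye(3) - A >> 0])
prob.solve()
print(prob.value, np.linalg.eigvalsh(A).max())  # the two values should agree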

Example 4.24. Consider the portfolio selection problem (Example 4.9)

max cT x subject to eT x = K, x ≥ o,

where c is a random vector with the expected value c̃ := E c and the covariance matrix Σ := cov c =
E (c − c̃)(c − c̃)T . Assume that a portfolio x̃ is chosen, but for the covariance matrix we know only an
interval estimation Σ1 ≤ Σ ≤ Σ2 . What is the risk of portfolio x̃? The risk is given by the variance of the
reward cT x̃, which is equal to x̃T Σx̃. Thus the largest variance is computed by a semidefinite program

max x̃T Σx̃ subject to Σ1 ≤ Σ ≤ Σ2 , Σ ⪰ 0.

The objective function is linear in variable Σ, and the constraints are easily transformed to the basic form
(4.7) by means of Example 4.23.
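A sketch of this worst-case variance computation, assuming cvxpy; the portfolio and the interval bounds below are made up for illustration:

# Worst-case variance of a fixed portfolio over an interval of covariance
# matrices (Example 4.24).  Assumes cvxpy; illustrative data.
import cvxpy as cp
import numpy as np

x_t = np.array([0.5, 0.3, 0.2])         # a fixed portfolio x~
S1 = np.diag([0.010, 0.020, 0.015])     # entrywise lower bound Sigma_1
S2 = S1 + 0.010                         # entrywise upper bound Sigma_2

Sigma = cp.Variable((3, 3), symmetric=True)
risk = cp.quad_form(x_t, Sigma)         # x~^T Sigma x~, linear in Sigma
prob = cp.Problem(cp.Maximize(risk),
                  [Sigma >= S1, Sigma <= S2, Sigma >> 0])
prob.solve()
print(prob.value)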
1) Relation ⪰ defines a partial order, known also as the Löwner order. Karel Löwner was an American mathematician of
Czech origin (born near Prague).

4.4 Computational complexity


In general, convex optimization problems are considered to be tractable. Indeed, under some assumptions
(see Section 4.4.1), they are solvable in polynomial time. On the other hand, there are some intractable
convex optimization problems, too. In Section 4.4.2 we present one such class of optimization problems.

4.4.1 Good news – the ellipsoid method


A convex optimization problem minx∈M f (x) is solvable in polynomial time by the ellipsoid method under
general assumptions. This result is, however, rather theoretical. To solve the problem practically, other
methods, such as interior point methods, are usually more convenient.
The ellipsoid method is designed to find a feasible solution, but the same idea works to find an optimal
solution as well. Thus we focus on the problem of finding a point x ∈ M or determining that M is empty.
First, we construct a sufficiently large ellipsoid E covering the whole set M . Then we check if the center c
of ellipsoid E lies in M . If yes, we are done. If not, then we construct a hyperplane containing point c and
being disjoint to M (e.g., it can be a hyperplane tangent to M and shifted to contain c) such that M lies
in halfspace aT x ≤ b. Then we construct a smaller (minimum volume) ellipsoid covering the intersection
E ∩ {x; aT x ≤ b}. We repeat this process until we find a feasible point or prove M = ∅. The convergence
is guaranteed by the fact that the size of the ellipsoids exponentially decreases.
In order that the above algorithm is correct and runs in polynomial time, we need to ensure certain
conditions:

• The feasible set M shouldn’t be too flat or too large. There must exist “reasonably” large numbers
r, R > 0 such that M contains a ball of radius r and also M lies in the ball {x; ‖x‖2 ≤ R}.

• Separation oracle. For every x∗ ∈ Rn we need to check whether x∗ ∈ M in polynomial time. If x∗ ∉ M ,
then we need to find a vector a ≠ o such that aT x∗ ≥ supx∈M aT x. This gives us a hyperplane
aT x = aT x∗ satisfying aT x ≤ aT x∗ for every x ∈ M .
The separation oracle is often implemented as follows. If the description of M contains a violated
constraint g(x) ≤ 0, then we can take a := ∇g(x∗ ) since from g(x∗ ) > 0 and convexity of g we have
aT (x − x∗ ) = ∇g(x∗ )T (x − x∗ ) ≤ g(x) − g(x∗ ) < g(x) ≤ 0 for every x ∈ M .
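The per-iteration geometry is captured by an explicit formula. The following sketch (plain NumPy) performs one central-cut step on the ellipsoid E = {x : (x − c)T P −1 (x − c) ≤ 1}, given a separating vector a with M ⊆ {x : aT x ≤ aT c}; the update formulas are the standard ones, stated here without derivation:

# One central-cut step of the ellipsoid method (NumPy sketch).
import numpy as np

def ellipsoid_step(c, P, a):
    """Minimum-volume ellipsoid covering E intersected with {x : a^T x <= a^T c}."""
    n = len(c)
    Pa = P @ a
    c_new = c - Pa / ((n + 1) * np.sqrt(a @ Pa))
    P_new = (n**2 / (n**2 - 1.0)) * (P - (2.0 / (n + 1)) * np.outer(Pa, Pa) / (a @ Pa))
    return c_new, P_new

# usage: cut the unit ball at the origin with a = (1, 0)
c1, P1 = ellipsoid_step(np.zeros(2), np.eye(2), np.array([1.0, 0.0]))
print(c1, np.linalg.det(P1))   # the volume (proportional to sqrt(det P)) shrinks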

The ellipsoid method solves a problem up to a certain accuracy. Nevertheless, the optimal solution can
be an irrational number such as √2, which is not exactly representable in the standard computational
model. To quantify the accuracy we need a measure of infeasibility. For linear constraints, we can use the
value
min z subject to Ax − ze ≤ b, z ≥ 0,
and for a semidefinite constraint ∑_{k=1}^n xk A(k) ⪰ B we can use the measure

min z subject to ∑_{k=1}^n xk A(k) + zIm ⪰ B, z ≥ 0.

Example 4.25. In some cases, the ellipsoid method provides a polynomial algorithm for problems with
exponentially many or even infinitely many constraints. For example, let M be a unit ball described by
the tangent hyperplanes, that is,

M = {x ∈ Rn ; aT x ≤ 1, ∀a : ‖a‖2 = 1}.

To check if a given point x∗ ∈ Rn belongs to the set M , we do not need to process all the infinitely many
inequalities. It is sufficient to check the only possibly violated constraint, which is the one with a = x∗ /‖x∗ ‖2 .

4.4.2 Bad news – copositive programming


Not every convex optimization problem is tractable. Here we present a convex problem that is NP-hard.
Denote by
C := {A ∈ Rn×n ; A = AT , xT Ax ≥ 0 ∀x ≥ 0}

the convex cone of copositive matrices and by

C ∗ := conv{xxT ; x ≥ 0}

its dual cone of completely positive matrices. Obviously, the cone C contains both the nonnegative sym-
metric matrices and the positive semidefinite matrices, but it contains other matrices, too. Similarly, the
matrices in C ∗ are nonnegative and positive semidefinite, but not each such matrix belongs to C ∗ . Notice
that even to decide whether a given matrix is copositive is a co-NP-complete problem [Murty and Kabadi,
1987]. Checking complete positivity of a matrix is NP-hard [Dickinson and Gijben, 2014], but whether
the problem is in NP is not known yet.
Consider a copositive program [Dür, 2010]

min tr(CX) subject to tr(Ai X) = bi , i = 1, . . . , m, X ∈ C, (4.9)

where C, A1 , . . . , Am ∈ Rn×n and b1 , . . . , bm ∈ R. The objective function

tr(CX) = ∑i,j cij xij

is a linear function in the variables X and the equations are linear, too. The only nonlinear constraint is
X ∈ C, which makes the problem a convex cone program. Consider also a convex program with a
complete positivity condition on matrix X:

min tr(CX) subject to tr(Ai X) = bi , i = 1, . . . , m, X ∈ C ∗ . (4.10)

Both problems are convex, but NP-hard. We prove it for the latter.

Theorem 4.26. Problem (4.10) is NP-hard.

The proof is based on a reduction from the maximum independent set problem. Let G = (V, E) be a
graph with n vertices and let α denote the size of a maximum independent set in graph G, that is, the
size of a maximum set I ⊆ V such that i, j ∈ I ⇒ {i, j} ∉ E.

Theorem 4.27. We have

α = max tr(eeT X) subject to xij = 0 ∀{i, j} ∈ E, tr(X) = 1, X ∈ C ∗ . (4.11)

Proof. Consider the convex cone


{X ∈ C ∗ ; xij = 0 ∀{i, j} ∈ E}.
Its extreme directions are matrices of the form xxT , where x ≥ 0 and the support of vector x (i.e., the
indices of positive entries) corresponds to an independent set in graph G. The constraint tr(X) = 1 in
problem (4.11) then normalizes the vectors in this cone. Since the objective function of problem (4.11) is
linear, the optimal solution is attained in an extreme point. Therefore we can assume that the optimal
solution X ∗ takes the form of

X ∗ = x∗ x∗T , x∗ ≥ 0, ‖x∗ ‖ = 1,

since the condition tr(X ∗ ) = 1 is equivalent to 1 = √tr(X ∗ ) = √tr(x∗ x∗T ) = √tr(x∗T x∗ ) = √(x∗T x∗ ) =
‖x∗ ‖. The support of vector x∗ then corresponds to an independent set of size α(x∗ ). Denote by x̃ ∈ Rα(x∗)
the restriction of vector x∗ to its positive entries, so the zero entries are removed. Then the optimal value
h of problem (4.11) can be expressed as

h = max {(eT x̃)² ; ‖x̃‖ = 1, x̃ ≥ 0}

since tr(eeT X) = tr(eeT xxT ) = tr(eT xxT e) = (eT x)². It is not hard to see that the optimal solution of this
problem is a vector of identical entries, that is, x̃∗ = α(x∗ )−1/2 e. Hence h = (eT x̃∗ )² = (α(x∗ )−1/2 α(x∗ ))² =
α(x∗ ). That is why h equals the size of the maximum independent set in graph G.

4.5 Applications
4.5.1 Robust PCA
Let A ∈ Rm×n be a matrix representing certain data. The problem is to determine some essential infor-
mation hidden in the data. For example, if the matrix represents a picture, then we may want to recognize
some pattern (e.g., a face) or to perform some operations such as reconstruction of a damaged picture.
To this end the SVD decomposition of A may serve well, however, for some purposes it is not sufficient.
We will formulate the problem as the so called robust PCA (principal component analysis):
→ Decompose A = L + S such that L has low rank and S is sparse.
Then L represents the fundamental information in the data and S can be interpreted as a noise. This
problem is rather vaguely defined and that is why we consider the (approximate) optimization problem
formulation
min ‖L‖∗ + ‖S‖ℓ1 subject to A = L + S, (4.12)

where ‖S‖ℓ1 is the entrywise sum norm defined as

‖S‖ℓ1 := ∑i,j |sij |,

and ‖L‖∗ the nuclear norm defined as the sum of the singular values, that is,

‖L‖∗ := ∑i σi (L).

Notice that the nuclear norm is a good approximation of the matrix rank since it is the best convex
underestimator of the rank on a unit ball. Similarly, the entrywise sum norm is a good approximation of
matrix sparsity.
Problem (4.12) is a convex optimization problem since a norm is always convex. Hence the problem is
effectively solvable even though the best algorithms used are not so easy to describe by simple means.
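In a modeling language, formulation (4.12) takes only a few lines. The sketch below assumes cvxpy and builds a synthetic rank-one matrix corrupted by sparse spikes:

# Robust PCA via problem (4.12): nuclear norm plus entrywise l1 norm.
# Assumes cvxpy; synthetic illustrative data.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
A = np.outer(rng.standard_normal(20), rng.standard_normal(15))  # rank-one part
A[rng.random(A.shape) < 0.05] += 5.0                            # sparse "noise"

L = cp.Variable(A.shape)
S = cp.Variable(A.shape)
prob = cp.Problem(cp.Minimize(cp.normNuc(L) + cp.sum(cp.abs(S))),
                  [L + S == A])
prob.solve()
print(np.linalg.matrix_rank(L.value, tol=1e-6))  # should be small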

Foreground and background detection in a video


The Robust PCA technique can effectively be used to recognize foreground and background in a video
or a sequence of pictures. The columns of matrix A represent the particular video frames. Then we can
expect that matrix L corresponds to the background since it is static and the matrix has low rank. In
contrast, matrix S captures the foreground then.
For more details see:
• http://sites.google.com/site/rpcaforegrounddetection/
• E. Candes, X. Li, Y. Ma, J. Wright, Robust Principal Component Analysis?, J. ACM 58(3), 2011

4.5.2 Minimum volume enclosing ellipsoid


The aim is to find an ellipsoid with minimum volume and covering a given convex polyhedron [Todd,
2016]. Let x1 , . . . , xm ∈ Rn be vertices of a convex polyhedron that we want to enclose by an ellipsoid. For
simplicity we restrict to a full-dimensional ellipsoid centered in the origin. Such an ellipsoid is described
by xT Hx ≤ 1, where H ∈ Rn×n is a positive definite matrix (in short we write H ≻ 0). The volume of
the ellipsoid is proportional to det(H)−1/2 , so it shrinks as det(H) grows. Therefore the problem can be formulated as
min − det(H) subject to H ≻ 0, xiT Hxi ≤ 1, i = 1, . . . , m,
where the unknown matrix H is variable. This problem is not convex, so take the logarithm of the objective
function
min − log det(H) subject to H ≻ 0, xiT Hxi ≤ 1, i = 1, . . . , m.
Function − log det(H) is strictly convex on the set of positive definite matrices, so we have a convex
optimization problem, which is efficiently solvable.
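A sketch of this formulation, assuming cvxpy and a few made-up points x1 , . . . , xm :

# Minimum-volume origin-centered enclosing ellipsoid: maximize log det(H)
# subject to x_i^T H x_i <= 1.  Assumes cvxpy; illustrative points.
import cvxpy as cp
import numpy as np

pts = np.array([[1.0, 0.2], [-0.5, 1.0], [0.3, -1.2], [-1.1, -0.4]])
H = cp.Variable((2, 2), PSD=True)
constraints = [cp.quad_form(p, H) <= 1 for p in pts]
prob = cp.Problem(cp.Maximize(cp.log_det(H)), constraints)
prob.solve()
print(H.value)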
Chapter 5

Karush–Kuhn–Tucker optimality conditions

In this chapter we consider the following optimization problem

min f (x) subject to x ∈ M,

where f : Rn → R is a differentiable function and the feasible set M ⊆ Rn is described by the system

gj (x) ≤ 0, j = 1, . . . , J,
hℓ (x) = 0, ℓ = 1, . . . , L,

where gj (x), hℓ (x) : Rn → R.


By Theorem 2.1 we know that in the case M = Rn the necessary condition for optimality of x ∈ Rn is
∇f (x) = 0. If f (x) is convex, then the condition is also sufficient (Theorem 4.4).
This chapter generalizes the above condition to a constrained optimization problem, which results in
the so-called Karush–Kuhn–Tucker conditions. First we consider only equality constraints, and then we
extend the results to the general form.

Equality constraints
Consider for a while an equality constrained problem

min f (x) subject to h(x) = 0. (5.1)

Let x∗ be a feasible point. When is x∗ optimal? First we discuss the case when the constraints are linear.

Proposition 5.1. If x∗ ∈ Rn is a local optimum of

min f (x) subject to Ax = b,

then ∇f (x∗ ) ∈ R(A).

Proof. The feasible set is the solution set of the system Ax = b, so it is an affine subspace x∗ + Ker(A).
Let B be a matrix such that its columns form a basis of Ker(A). Then the feasible set can be expressed
as x = x∗ + Bv, v ∈ Rk . Substituting for x we obtain an unconstrained optimization problem

min f (x∗ + Bv) subject to v ∈ Rk .

By Theorem 2.1, the necessary condition for local optimality of v = 0 is zero gradient, that is, ∇f (x∗ )T B =
0T . In other words, ∇f (x∗ ) ∈ Ker(A)⊥ = R(A).

Now, the idea is based on linearization of possibly nonlinear functions hℓ . The equation hℓ (x) = 0 will
be replaced by the tangent hyperplane of the corresponding manifold at point x∗ :

∇hℓ (x∗ )T (x − x∗ ) = 0,


Figure 5.1: Linearization of nonlinear constraints – approximate curves by tangent lines: (a) degenerate
case: the intersection of the curves is a point, but the intersection of the tangent lines is a line; (b) regular
case: the intersection of the curves is a point as well as the intersection of the tangent lines.

so that the linearized constraints can be expressed as A(x − x∗ ) = 0. In order that x∗ is optimal, the
objective function gradient ∇f (x∗ ) must be perpendicular to the intersection of the tangent hyperplanes;
in other words, ∇f (x∗ ) must be a linear combination of the gradients ∇hℓ (x∗ ) of the tangent hyperplanes.
According to Proposition 5.1 we have ∇f (x∗ ) ∈ R(A). This leads to the condition
∇f (x∗ ) + ∑_{ℓ=1}^L ∇hℓ (x∗ )µℓ = 0.

As illustrated by Figure 5.1, this idea can fail, since a degenerate situation may appear. Thus we need
to avoid such a degenerate case. This can be achieved by assuming linear independence of the gradients
∇hℓ (x∗ ).
Theorem 5.2. Let ∇hℓ (x∗ ), ℓ = 1, . . . , L, be linearly independent. If x∗ is a local optimum, then there is
µ ∈ RL such that

∇f (x∗ ) + ∇h(x∗ )µ = 0.

Proof. See the basic calculus course.

Coefficients µ1 , . . . , µL are called Lagrange multipliers. The condition stated in the theorem is a nec-
essary condition. This is convenient for us since we can restrict the feasible set to a much smaller set of
candidates for optima – ideally the candidate is unique.

Equality and inequality constraints


Now we consider the general case with both equality and inequality constraints. The active set of a feasible
point x is the set of those inequalities that are satisfied as equations:

I(x) = {j; gj (x) = 0}.

Theorem 5.3 (KKT conditions). Let ∇hℓ (x∗ ), ℓ = 1, . . . , L, ∇gj (x∗ ), j ∈ I(x∗ ), be linearly independent.
If x∗ is a local optimum, then there exist λ ∈ RJ , λ ≥ 0, and µ ∈ RL such that

∇f (x∗ ) + ∇h(x∗ )µ + ∇g(x∗ )λ = 0, (5.2)
λT g(x∗ ) = 0. (5.3)

Remark. Condition (5.3) is called complementarity condition since it says that for every j = 1, . . . , J
we have λj = 0 or gj (x∗ ) = 0. If gj (x∗ ) < 0, then λj = 0 and hence variable λj does not act in the KKT
conditions; this corresponds to the situation that x∗ does not lie on the border of the set described by this
constraint. Conversely, if gj (x∗ ) = 0, then the complementarity makes no restriction on λj . In summary,
we can say that the complementarity condition forces us to consider the Lagrange multipliers λj for the
active constraints only.

Proof. (Main idea.) We linearize the problem such that the objective function and the constraint functions
are replaced by their tangent hyperplanes at point x∗ . This results in a linear programming problem

min ∇f (x∗ )T x subject to ∇gj (x∗ )T (x − x∗ ) ≤ 0, j ∈ I(x∗ ),


∇hℓ (x∗ )T (x − x∗ ) = 0, ℓ = 1, . . . , L.

Due to the linear independence assumption, the solution x∗ remains optimal (this is a small step for a
reader, but a giant leap in the proof). The dual problem to the linear program is
max ∑_{ℓ=1}^L (∇hℓ (x∗ )T x∗ )µℓ + ∑_{j∈I(x∗ )} (∇gj (x∗ )T x∗ )λj subject to

∇f (x∗ ) + ∑_{ℓ=1}^L ∇hℓ (x∗ )µℓ + ∑_{j∈I(x∗ )} ∇gj (x∗ )λj = 0,

λj ≥ 0, j ∈ I(x∗ ).

Since the primal problem has an optimum, the dual problem must be feasible. For j ∉ I(x∗ ) define λj := 0
and we have that also the problem

max (x∗ )T ∇g(x∗ )λ + (x∗ )T ∇h(x∗ )µ subject to

∇f (x∗ ) + ∇h(x∗ )µ + ∇g(x∗ )λ = 0,
λ ≥ 0

is feasible. Hence there exist λ ≥ 0, µ satisfying (5.2). Condition (5.3) is fulfilled since for j ∈ I(x∗ ) we
have gj (x∗ ) = 0 by definition, and for j ∉ I(x∗ ) we can put λj = 0.

Conditions (5.2)–(5.3) are called Karush–Kuhn–Tucker conditions [Karush, 1939; Kuhn and Tucker,
1951], or KKT conditions for short.
Since the linear independence assumption is hard to check in general (notice that x∗ is unknown),
alternative assumptions have been derived, too. Usually they are easier to verify, at the price of being
stronger. One commonly used assumption is Slater’s condition

∃x0 ∈ M : g(x0 ) < 0.

Theorem 5.4. Consider the optimization problem

min f (x) subject to g(x) ≤ 0, x ∈ M,

where f (x), gj (x) are convex functions and M is a convex set. Suppose that Slater’s condition is satisfied.
If x∗ is an optimum of the above problem, then there exists λ ≥ 0 such that x∗ is an optimum of the
problem

min f (x) + λT g(x) subject to x ∈ M, (5.4)

and, moreover, λT g(x∗ ) = 0.

Proof. Define the sets

A := {(r, s) ∈ RJ × R; r ≥ g(x), s ≥ f (x), x ∈ M },


B := {(r, s) ∈ RJ × R; r ≤ 0, s ≤ f (x∗ )};

see Figure 5.2. Both sets are convex, and their interiors are disjoint since otherwise there is a point
x ∈ M such that g(x) < 0 and f (x) < f (x∗ ). Therefore a separating hyperplane exists having the form of
λT r + λ0 s = c, where (λ, λ0 ) ≠ 0. The separability implies:

∀(r, s) ∈ A : λT r + λ0 s ≥ c,
∀(r, s) ∈ B : λT r + λ0 s ≤ c.

Figure 5.2: Illustration to the proof of Theorem 5.4: the sets A and B and the separating hyperplane
λT r + λ0 s = c.

Since (0, f (x∗ )) ∈ A∩B, this point lies on the hyperplane, and thus c = λ0 f (x∗ ). Analogously (g(x∗ ), f (x∗ )) ∈
A ∩ B, so this point also lies on the hyperplane, yielding
λT g(x∗ ) + λ0 f (x∗ ) = c = λ0 f (x∗ ),
which gives the complementarity constraint λT g(x∗ ) = 0.
For every i we have (−ei , f (x∗ )) ∈ B, so this point lies in the negative halfspace. This means that
−λT ei + λ0 f (x∗ ) ≤ c, from which λi ≥ 0. Therefore λ ≥ 0. Analogously we deduce λ0 ≥ 0: Since
(o, f (x∗ ) − 1) ∈ B, so λT o + λ0 (f (x∗ ) − 1) ≤ c, and hence λ0 ≥ 0.
Since g(x0 ) < 0, we have (r, f (x0 )) ∈ A for every r in the neighbourhood of 0. Hence the separating
hyperplane cannot be vertical, which means λ0 6= 0. Without loss of generality we normalize it such that
λ0 = 1. Let us prove it formally: Suppose to the contrary that λ0 = 0. Now, c = 0 and in view of λ ≠ 0
there is i such that λi > 0. Substituting the point (−εei , f (x0 )) ∈ A, where ε > 0 is small enough, into
the inequality, we get −ελi ≥ 0; a contradiction.
For every x ∈ M we have (g(x), f (x)) ∈ A, which fulfills
λT g(x) + f (x) ≥ c = λT g(x∗ ) + f (x∗ ).
This proves that x∗ is the optimum of (5.4).

Applying the optimality conditions from Theorem 2.1 to problem (5.4), we obtain the KKT conditions
as a corollary:
Corollary 5.5. Suppose that Slater’s condition is satisfied for the convex optimization problem
min f (x) subject to g(x) ≤ 0.
If x∗ is an optimum, then there exists λ ≥ 0 such that the KKT conditions are satisfied, i.e.,
∇f (x∗ ) + ∇g(x∗ )λ = 0, (5.5a)
λT g(x∗ ) = 0. (5.5b)
We obtain also a general form involving equality constraints.
Corollary 5.6. Suppose that Slater’s condition is satisfied for the convex optimization problem
min f (x) subject to g(x) ≤ 0, Ax = b.
If x∗ is an optimum, then there exist λ ≥ 0 and µ such that the KKT conditions are satisfied, i.e.,
∇f (x∗ ) + ∇g(x∗ )λ + AT µ = 0,
λT g(x∗ ) = 0.
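To see the conditions on a tiny instance, consider min x1² + x2² subject to g(x) = 1 − x1 − x2 ≤ 0; Slater’s condition holds (take x0 = (1, 1)), the optimum is x∗ = (1/2, 1/2) and λ = 1. A NumPy sketch verifying (5.5) numerically:

# Check stationarity and complementarity at the candidate optimum.
import numpy as np

x = np.array([0.5, 0.5])
lam = 1.0
grad_f = 2 * x                    # gradient of x1^2 + x2^2
grad_g = np.array([-1.0, -1.0])   # gradient of g(x) = 1 - x1 - x2
g = 1 - x[0] - x[1]

print(grad_f + lam * grad_g)      # stationarity (5.5a): [0. 0.]
print(lam * g)                    # complementarity (5.5b): 0.0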

Figure 5.3: (Example 5.7) Slater’s condition is not satisfied and the KKT conditions property (Corollary 5.5)
fails.

Example 5.7. If Slater’s condition is not satisfied, then the KKT conditions property (Corollary 5.5) can
fail. Consider an optimization problem minx∈M x1 illustrated in Figure 5.3. Two constraints describe the
feasible set having the form of a half-line starting from point x∗ . Point x∗ is optimal. The KKT conditions
read −∇f (x∗ ) = ∇g(x∗ )λ, but the point x∗ does not fulfill them since the gradients ∇g1 (x∗ ) = (0, −1)T
and ∇g2 (x∗ ) = (0, 1)T span a vertical line, not containing the opposite of the objective function gradient
−∇f (x∗ ) = (−1, 0)T .

In optimization, necessary optimality conditions are usually preferred to sufficient optimality conditions
since they often help to restrict the feasible set to a smaller set of candidate optimal solutions. Anyway,
sufficient optimality conditions are also of interest, and below we show that the KKT conditions do this
job under general assumptions.

Theorem 5.8 (Sufficient KKT conditions). Let x∗ ∈ Rn be a feasible solution of

min f (x) subject to g(x) ≤ 0,

let f (x) be a convex function, and let gj (x), j ∈ I(x∗ ), be convex functions, too. If KKT conditions (5.5)
are satisfied for x∗ with certain λ ≥ 0, then x∗ is an optimal solution.

Proof. Convexity of function f (x) implies f (x) − f (x∗ ) ≥ ∇f (x∗ )T (x − x∗ ) due to Theorem 3.18. Anal-
ogously, for functions gj (x), j ∈ I(x∗ ), we have gj (x) − gj (x∗ ) ≥ ∇gj (x∗ )T (x − x∗ ). KKT conditions give
∇f (x∗ ) = −∇g(x∗ )λ, from which premultiplying by (x − x∗ ) we get

f (x) − f (x∗ ) ≥ ∇f (x∗ )T (x − x∗ ) = −λT ∇g(x∗ )T (x − x∗ )
= −∑_{j∈I(x∗ )} λj ∇gj (x∗ )T (x − x∗ ) ≥ −∑_{j∈I(x∗ )} λj (gj (x) − gj (x∗ ))
= −∑_{j∈I(x∗ )} λj gj (x) ≥ 0.

Therefore f (x∗ ) is the optimal value and x∗ is an optimal solution.


Chapter 6

Methods

To solve an optimization problem is a very difficult task in general; indeed, it is undecidable (provably
there cannot exist an algorithm)! Thus we can hardly hope to solve optimally every problem. Many
algorithms thus produce approximate solutions only – KKT solutions, local optima etc. If the problem
is large and hard, then we often use heuristic methods (genetic and evolutionary algorithms, simulated
annealing, tabu search,. . . ). On the other hand, many hard optimization problems can be solved by using
global optimization techniques. However, they work in small dimensions only since their computational
complexity is high. The choice of a suitable method thus depends not only on the type of the problem,
but also on the dimensions, time restrictions etc.
In the following sections, we present selected methods for basic types of optimization problems.

6.1 Line search


By a line search we mean minimization of a univariate function f (x) : R → R, that is, we have n = 1.
Even this particular case is important since it often serves as an auxiliary sub-procedure in the general
case.
Our goal is to find a local minimum (or its approximation) in the neighbourhood of the current point.
We present two approaches: Armijo rule, which aims to move to a point of a significant decrease of the
objective function, and the Newton method, which converges to a local minimum under certain conditions.

Armijo rule
We assume that f (x) is differentiable and f ′ (0) < 0, so it locally decreases at x = 0. We want to decrease
the objective function by moving to the right from point x = 0. We wish to decrease it significantly, that
is, not to get stuck locally close to x = 0, but to move away from this current point if possible.
Consider the condition

f (x) ≤ f (0) + ε · f ′ (0) · x, (6.1)

where 0 < ε < 1 is a given parameter; usually we take ε ≈ 0.2.


The condition is used as follows: Choose a value of parameter β > 1 (e.g., β = 2 or β = 10) and an
arbitrary x > 0. Now
• if condition (6.1) is satisfied, then set x := βx and while the condition holds, repeat this process;
• if condition (6.1) is not satisfied, then set x := x/β and repeat until the condition holds.
This procedure ensures that we move to a point with smaller objective value and simultaneously we move
far from the initial point; see Figure 6.1a.
Armijo rule is also used as the termination condition within other line search methods: Condition (6.1)
cannot be violated (which ensures that x is not too large) and simultaneously the converse inequality

f (x) ≥ f (0) + ε′ · f ′ (0) · x,

must be satisfied for a certain parameter ε′ > ε (which ensures that x is not too small); see Figure 6.1b.
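A sketch of the doubling/halving procedure in Python; the test function and the parameter values are illustrative:

# Armijo rule: shrink the trial step until condition (6.1) holds, then
# expand while it keeps holding, to move far from the starting point.
# Assumes f'(0) < 0, so the shrink loop terminates.
def armijo_step(f, f0, fprime0, x=1.0, eps=0.2, beta=2.0):
    while f(x) > f0 + eps * fprime0 * x:                   # (6.1) violated
        x /= beta
    while f(beta * x) <= f0 + eps * fprime0 * beta * x:    # still holds
        x *= beta
    return x

# usage on f(t) = (t - 1)^2 viewed from t = 0, where f(0) = 1, f'(0) = -2
f = lambda t: (t - 1.0) ** 2
print(armijo_step(f, f0=1.0, fprime0=-2.0))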


Figure 6.1: Armijo rule: (a) Armijo rule – seeking the intersection point; (b) Armijo rule as the termination
condition.

Figure 6.2: Newton method: approximation of f (x) at point xk by a quadratic function q(x), and move to
its minimum xk+1 .

Newton method
It is the classical Newton method for finding a root of f ′ (x) = 0. Here we need f (x) to be twice differen-
tiable.
This method is iterative and we construct a sequence of points x0 = 0, x1 , x2 , . . . that, under some
assumptions, converges to a local minimum. The basic idea is to approximate function f (x) by a function
q(x) such that both have the same value and the same first and second derivatives at the current point xk
(in the kth iteration). Thus we want q(xk ) = f (xk ), q ′ (xk ) = f ′ (xk ) and q ′′ (xk ) = f ′′ (xk ); see Figure 6.2.
This suggests that it is suitable for q(x) to be a quadratic polynomial. Such a quadratic function is unique
and it is described by

q(x) = f (xk ) + f ′ (xk )(x − xk ) + (1/2)f ′′ (xk )(x − xk )².
(Proof: put x := xk .) The minimum of the quadratic function q(x) is at the stationary point (where the
derivative is zero), so
0 = f ′ (xk ) + f ′′ (xk )(x − xk ).

From this we get

x = xk − f ′ (xk )/f ′′ (xk ),

which is the current point xk+1 of the subsequent iteration.



Figure 6.3: In blue: the contours of the convex quadratic function f (x) = x1² + 4x2² . In red: the iterations
of the steepest descent method with initial point (5, −1)T .

6.2 Unconstrained problems


Consider the optimization problem

min f (x) subject to x ∈ Rn ,

where f (x) is a differentiable function.


A basic approach is an iterative method, generating a sequence of points x0 , x1 , x2 , . . . , which, under
certain assumptions, converge to a local minimum. The initial point x0 can be chosen arbitrarily, unless
we have some additional information that we can utilize. The iterations terminate when the objective
function values at points xk get stabilized.

Gradient methods1)
In kth iteration, the current point is xk . We determine a direction dk in which the objective function
locally decreases, that is, ∇f (xk )T dk < 0. Now we call a line search method applied to the function
ϕ(α) := f (xk + αdk ). Denote by αk the output value. Then the next point is set as xk+1 := xk + αk dk .
How to choose dk ? The simplest way is the steepest descent method, which takes dk := −∇f (xk ), that
is, the direction in which the objective function locally decreases most rapidly. This choice need not be
the best one; see Figure 6.3, which illustrates the slow convergence even for the simple convex quadratic
function f (x) = x21 + 4x22 . There are advanced methods that take into account also the Hessian ∇2 f (xk )
or its approximation and they combine the steepest descent direction and the directions of the previous
iteration(s); see also the conjugate gradient methods in Section 6.4.
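The zig-zagging behaviour of Figure 6.3 is easy to reproduce. The sketch below (plain NumPy) uses the exact line search step for a quadratic (1/2)xT Qx, namely α = (gT g)/(gT Qg) with g = ∇f (xk ), in place of a general line search:

# Steepest descent on f(x) = x1^2 + 4*x2^2, starting from (5, -1).
import numpy as np

Q = np.diag([2.0, 8.0])             # Hessian of x1^2 + 4*x2^2
x = np.array([5.0, -1.0])
for k in range(20):
    g = Q @ x                       # gradient
    alpha = (g @ g) / (g @ Q @ g)   # exact minimizer along -g
    x = x - alpha * g
print(x)                            # approaches the minimum (0, 0), zig-zagging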

Example 6.1 (Learning of neural networks). Basically, the steepest descent method is used in learning of
artificial neural networks (for an introduction see Higham and Higham [2019]). The goal of the learning is
to set up weights of inputs of particular neurons such that the neural network performs best on the training
data. Mathematically speaking, the variables are the weights of inputs of the neurons. The objective
function that we minimize is the distance between the actual output vector and the ideal output vector.
It is hard to find the optimal solution since this optimization problem is nonlinear, nonconvex and high-
dimensional. That is why the problem is solved iteratively and at each step the weights are refined by
means of the steepest descent. To compute the gradient of the objective function is also computationally
1)
The history of gradient methods dates back to 1847, when L.A. Cauchy introduced a gradient-like method to solve the
astronomical problem of calculating the orbit of a celestial body.

Figure 6.4: The Hessian ∇2 f (xk ) is not positive definite.

demanding since there are usually large training data, so we simplify further and we approximate the
gradient by its partial value based on the gradient of a randomly chosen training sample point. This
approach is called stochastic gradient descent.
Example 6.2. Optimization techniques are also used to solve problems that are not optimization problems
in the essence. Consider for example the problem of solving a system of linear equations Ax = b, where A
is a positive definite matrix. Then the optimal solution of the convex quadratic program
min_{x∈Rn} (1/2)xT Ax − bT x

is the point A−1 b, the same as the solution of the equations, since at this point the gradient ∇f (x) = Ax − b
of the objective function f (x) = (1/2)xT Ax − bT x vanishes. Thus we can solve linear equations by using
optimization techniques. This is really used in practice, in particular for large and sparse systems. There
exist several ways how to choose the vector dk in this context. For instance, the conjugate gradient method
combines the gradient and the previous direction, so it takes a linear combination of ∇f (xk ) and dk−1 ;
see Section 6.4.

Newton method
This works in a similar fashion as in the univariate case. We approximate the objective function by a
quadratic function, whose minimum is the current point of the subsequent iteration.
In step k, the current point is xk and at this point we approximate f (x) by using the Taylor expansion

f (x) ≈ f (xk ) + ∇f (xk )T (x − xk ) + (1/2)(x − xk )T ∇2 f (xk )(x − xk ).
This gives us a quadratic function. If its Hessian matrix ∇2 f (xk ) is positive definite, then its minimum is
unique and it is the point with zero gradient. This leads us to the system

∇f (xk ) + ∇2 f (xk )(x − xk ) = 0,

from which we express the solution

x = xk − (∇2 f (xk ))−1 ∇f (xk ).

This point is set as the current point xk+1 of the next iteration.
Comment. The expression y := (∇2 f (xk ))−1 ∇f (xk ) is evaluated by solving the system of linear equa-
tions ∇2 f (xk )y = ∇f (xk ), not by inverting the matrix.
The advantage of this method is a rapid convergence (if we are close to the minimum). The drawback
is that the Hessian ∇2 f (xk ) need not be positive definite; see example on Figure 6.4. Another drawback
is that the evaluation of the Hessian might be computationally demanding. Therefore, diverse variants of
this method exist (quasi-Newton methods) that approximate the Hessian matrix or regularize it.
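A sketch of the plain (undamped) Newton iteration in NumPy; note that the Newton system is solved rather than inverted, as the comment above recommends, and the test function is illustrative:

# Newton method for min f(x); each step solves  hess(x) * dx = -grad(x).
import numpy as np

def newton(grad, hess, x, iters=20):
    for _ in range(iters):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# usage on the convex function f(x) = x1^4 + x2^2 (minimum at the origin)
grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0]**2, 0.0], [0.0, 2.0]])
print(newton(grad, hess, np.array([2.0, 1.0])))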

6.3 Constrained problems


Consider the optimization problem

min f (x) subject to x ∈ M,

where f : Rn → R is a differentiable function and the feasible set M ⊆ Rn is characterized by the system

gj (x) ≤ 0, j = 1, . . . , J,
hℓ (x) = 0, ℓ = 1, . . . , L,

where gj (x), hℓ (x) : Rn → R.


The solution methods are again iterative, where we construct a sequence of points x0 , x1 , x2 , . . . The
initial point x0 is chosen randomly, unless we have some additional knowledge about the problem. The
iterations terminate when the objective function values at points xk get stabilized.

6.3.1 Methods of feasible directions


These methods naturally generalize the gradient methods from unconstrained optimization. The basic
idea is the same and the only difference is in the line search, when we must stay within the feasible set M .
The equality constraints h(x) = 0 are hard to deal with in this case.
These methods are particularly convenient when M is a convex polyhedron. So in this section we
assume that M = {x ∈ Rn ; Ax ≤ b}.

Method by Frank and Wolfe [1956]


Let xk be the current feasible point in kth iteration. A feasible descent direction dk is computed by an
auxiliary linear program
min ∇f (xk )T x subject to Ax ≤ b.
Denote by x∗k its optimal solution. Then we take dk := x∗k − xk . This direction is feasible since x∗k ∈ M .
Moreover, dk corresponds to a steep descent since the objective function ∇f (xk )T (x − xk ) yields the
derivative of function f at point xk in the direction of x − xk (the term ∇f (xk )T xk is negligible since it
is constant).

Method by Zoutendijk [1960]


This method is similar to the previous one, but the auxiliary problem has the form

min ∇f (xk )T x subject to Ax ≤ b, ‖x − xk ‖ ≤ 1.

If we use the Euclidean norm, then we are seeking the steepest descent direction that is feasible. In
order that the auxiliary problem is easy to solve, we usually employ the maximum or the Manhattan
norm. For the latter, for example, the problem takes the form of a linear program, in which ‖x − xk ‖ ≤ 1
is replaced by
eT z ≤ 1, x − xk ≤ z, −x + xk ≤ z.

6.3.2 Active-set methods


These methods reduce the problem to a sequence of optimization problems with equality constraints only.
Let xk be a current feasible solution and let

W := {j; gj (xk ) = 0}

be the active set. Then we solve an auxiliary problem

min f (x) subject to h(x) = 0, gj (x) = 0, j ∈ W.



Figure 6.5: At point x∗ we have ∇f (x∗ ) − ∇g1 (x∗ ) + ∇g2 (x∗ ) = 0, so index 1 is removed from the active
set and index 2 remains there.

If we move to the boundary of M during the computation and another constraint becomes active, then
we include it in W . If we reach a local minimum x∗ during the computation of this auxiliary problem,
then we assume that the KKT conditions are satisfied, that is, there exist λ and µ such that

∇f (x∗ ) + ∇h(x∗ )µ + ∑_{j∈W} λj ∇gj (x∗ ) = 0.

Now, if λj ≥ 0, then j remains in W ; otherwise the index j is removed from W . This treatment is based
on the interpretation of Lagrange multipliers as the negative derivatives of the objective function with
respect to the right-hand side of the constraints. Hence, λj < 0 implies that locally a decrease of gj (x)
makes a decrease of f (x); see Figure 6.5.
The schema of this method resembles the simplex method in linear programming, in which we move
from one feasible basis to another and dynamically change the active set. Therefore, the active-set method
is primarily used in optimization problems with linear constraints.

6.3.3 Penalty and barrier methods


These methods transform the problem in such a way that the constraint functions are added to the objec-
tive function and the problem is reduced to an unconstrained problem (in fact, to a series of unconstrained
problems). This transformation works such that we pay a penalization in the form of higher objective val-
ues for infeasible points (penalty methods), or we force the computed points to stay in the interior of the
feasible set by increasing the objective function values to infinity on its boundary (barrier methods).

Penalty methods
Consider the problem
min f (x) subject to x ∈ M,
where f (x) is a continuous function and M ≠ ∅ is a closed set. A penalty function is any continuous
nonnegative function q : Rn → R satisfying the conditions:
• q(x) = 0 for every x ∈ M ,
• q(x) > 0 for every x ∉ M .
Penalty methods are based on a transformation of the problem to an unconstrained problem

min f (x) + c · q(x) subject to x ∈ Rn ,

where c > 0 is a parameter.



Penalty methods are implemented such that c is not constant, but is increased during the iterations.
Too high a value of c at the beginning leads to a numerically ill-conditioned problem. That is why in
practice the values from a suitable sequence ck > 0, where ck →k→∞ ∞, are used.
Theorem 6.3. Let xk be an optimal solution of problem

min f (x) + ck · q(x) subject to x ∈ Rn .

If xk →k→∞ x∗ , then x∗ is an optimal solution of the original problem minx∈M f (x).

Proof. If x∗ ∉ M , then for k ∗ large enough we have xk ∉ M for all k ≥ k ∗ , and thus q(xk ) stays bounded
away from zero. Hence f (xk ) + ck · q(xk ) →k→∞ ∞, which contradicts optimality of xk , since the optimal
value of the penalized problem is bounded above by inf x∈M f (x) (the penalty vanishes on M ).
Consider now the case of x∗ ∈ M and suppose to the contrary that x∗ is not optimal. Then there is a
point x′ ∈ M such that f (x′ ) < f (x∗ ). Since the penalization is zero within the feasible set M , we get

f (x′ ) + ck · q(x′ ) < f (x∗ ) + ck · q(x∗ )

for every k ∈ N. Due to continuity we have for sufficiently large k

f (x′ ) + ck · q(x′ ) < f (xk ) + ck · q(xk ),

which is a contradiction to the optimality of xk in iteration k.

Example 6.4. For constraints of type g(x) ≤ 0 we often use the penalty function

q(x) := ∑_{j=1}^J (gj (x)+ )² = ∑_{j=1}^J max(0, gj (x))²,

which preserves smoothness of the objective function, and for constraints of type h(x) = 0 we can use the
penalty function

q(x) := ∑_{ℓ=1}^L hℓ (x)².
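Putting the pieces together, a sketch of the penalty method with an increasing sequence ck ; it assumes NumPy/SciPy, and the constrained problem is illustrative:

# Quadratic penalty method for  min (x1-2)^2 + (x2-2)^2  s.t.  x1 + x2 <= 1.
# The exact optimum is (1/2, 1/2); each subproblem is unconstrained.
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 2) ** 2 + (x[1] - 2) ** 2
g = lambda x: x[0] + x[1] - 1                    # constraint g(x) <= 0
x = np.zeros(2)
for c in [1.0, 10.0, 100.0, 1000.0]:
    obj = lambda x, c=c: f(x) + c * max(0.0, g(x)) ** 2
    x = minimize(obj, x).x                       # warm start from previous x
print(x)                                         # approaches (0.5, 0.5)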

Barrier methods
Consider again the problem
min f (x) subject to x ∈ M,
where f (x) is a continuous function. Suppose that M is a connected set satisfying M = cl(int M ), that
is, it is equal to the closure of its interior. A barrier function is any continuous nonnegative function
q : int M → R such that q(x) → ∞ as x → ∂M . This means that when x approaches the boundary of M ,
the barrier function grows to infinity.
The original problem is then transformed to an unconstrained problem
min f (x) + (1/c)q(x) subject to x ∈ Rn , (6.2)
where c > 0 is a parameter.
The algorithm is similar to penalty methods, that is, we iteratively seek optimal solutions of
auxiliary problems as c → ∞. A drawback of this method is that we have to know an initial feasible
solution at the beginning. The advantage is its simplicity.
The pioneers of these methods are Fiacco and McCormick [1968].
Example 6.5. For constraints of type g(x) ≤ 0 we often use the barrier function in the form

q(x) := −∑_{j=1}^J 1/gj (x)
54 Chapter 6. Methods

or in the form

q(x) := −∑_{j=1}^J log(−gj (x)).

The latter is utilized in the popular interior point methods, whose implementations can solve linear pro-
grams and certain convex optimization problems (such as quadratic programs) in polynomial time. For
example, the linear program
min cT x subject to Ax ≤ b

is transformed to the problem


min cT x − (1/c) ∑_{i=1}^m log(bi − Ai∗ x).

For the semidefinite condition X ⪰ 0 we can use the barrier function

q(X) := − log(det(X)).

Under certain assumptions the optimal solutions of the auxiliary problems converge to the optimum
of the original problem.

Theorem 6.6. Let ck > 0 be a sequence of numbers such that ck →k→∞ ∞. Let xk be an optimal solution
of the problem

min f (x) + (1/ck )q(x) subject to x ∈ Rn .
If xk →k→∞ x∗ , then x∗ is an optimal solution of the original problem minx∈M f (x).

Proof. Suppose to the contrary that x∗ is not optimal, that is, there is x′ ∈ M such that f (x′ ) < f (x∗ ).
Due to continuity of f (x) there is x′′ ∈ int M such that f (x′′ ) < f (x∗ ). Then for k large enough we have

f (x′′ ) + (1/ck )q(x′′ ) < f (x∗ ) + (1/ck )q(x∗ ).

For k large enough we also have

f (x′′ ) + (1/ck )q(x′′ ) < f (xk ) + (1/ck )q(xk ),

which is a contradiction to the optimality of xk in step k.

For convex optimization problems under general assumptions (e.g., strictly convex barrier function
and M bounded) the optimal solution x(c) of (6.2) is unique and the points x(c), c > 0, draw a smooth
curve, called the central path, whose limit as c → ∞ is the optimal solution of the original problem.
Certain algorithms use the same principle: For the increasing values of c they find (approximation
of) the optimal solutions x(c). With a small change of c the point x(c) moves continuously, so it is easy
and fast to reoptimize and find the new optimum. For theoretical analysis of polynomiality of certain
convex optimization problems short steps are used, but in practice larger steps are convenient. Typically,
we increase c with a factor of 1.1.
A natural question is, why not choose a large value of c right at the beginning? First, numerical issues
then cause trouble. Second, such a choice does not make the algorithm faster. The Newton method (or
other methods used to solve (6.2)) is slow if we start far from the optimum. Therefore tracing the central
path using fast steps is the most convenient way. Notice that we have some difficulties at the beginning,
but this issue can be overcome.
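The following sketch traces the central path on a tiny linear program with the logarithmic barrier of Example 6.5, taking one damped Newton step per barrier parameter and increasing the parameter by the factor 1.1 mentioned above (plain NumPy; all data illustrative):

# Central path for  min x1 + x2  s.t.  x >= 0, x1 + x2 <= 1  (optimum (0,0)).
import numpy as np

c = np.array([1.0, 1.0])
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, 1.0])

x = np.array([0.3, 0.3])              # strictly feasible start
t = 1.0
for _ in range(200):
    s = b - A @ x                     # slacks, positive in the interior
    grad = t * c + A.T @ (1.0 / s)    # gradient of t*c^T x - sum_i log s_i
    hess = A.T @ np.diag(1.0 / s**2) @ A
    dx = np.linalg.solve(hess, -grad)
    alpha = 1.0                       # damp only to stay strictly feasible
    while np.any(b - A @ (x + alpha * dx) <= 0):
        alpha /= 2.0
    x = x + alpha * dx
    t *= 1.1
print(x)                              # slowly approaches the vertex (0, 0)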

6.4 Conjugate gradient method


This method was derived to solve a system of linear equations Ax = b, where matrix A ∈ Rn×n is
positive definite. Its authors are Hestenes and Stiefel [1952], and it belongs to both optimization textbooks
[Luenberger and Ye, 2008] and textbooks on numerical mathematics [Liesen and Strakoš, 2013]. Even
though the method is iterative, it converges to the solution in at most n steps. Since it does not transform
matrix A and has low space complexity, it is convenient for very large systems in particular.
The basic idea is to consider the quadratic function

f (x) = (1/2)xT Ax − bT x.
Since A is positive definite, the function is strictly convex and attains the unique minimum. The minimum
is the point, in which the gradient ∇f (x) = Ax − b is zero. Hence the minimum of function f (x) is the
same as the solution of Ax = b. In this way we reduced the problem of solving linear equations to an
optimization problem.
We will describe the method in a simplified way. First, instead of the standard basis of the space Rn we
consider an orthonormal basis d1 , . . . , dn and the inner product ⟨x, y⟩ := xT Ay instead of the standard one;
to avoid confusion, the corresponding orthogonality is called A-orthogonality and the orthonormal basis
is called A-orthonormal. We will show later on how to choose the basis d1 , . . . , dn . Denote by x∗ := A−1 b
the solution we are seeking, and denote by xk the approximate solution obtained in the kth iteration. At
the beginning, the initial point x1 is chosen arbitrarily.

Basic scheme. We express vector x∗ − x1 as a linear combination of the basis vectors

x∗ − x1 = ∑_{k=1}^n αk dk .

The basic scheme of the method is simple – imagine we move from a vertex of a box to the opposite vertex
by using the (mutually perpendicular) edges:

Iterate xk+1 := xk + αk dk , k = 1, . . .
To implement the method we need to determine the basis d1 , . . . , dn and show how to compute coefficients
αk effectively. Denote gk := ∇f (xk ) = Axk − b, which represents not only the gradient at point xk in kth
iteration, but also the residual, that is, the difference between the left and right-hand sides of the system
(when the residual is 0, then we get the solution). Notice that for any j ∈ {1, . . . , k},
xk+1 = xk + αk dk = xk−1 + αk−1 dk−1 + αk dk = . . . = xj + ∑_{i=j}^k αi di .

Computation of αk . Since d1 , . . . , dn is an A-orthonormal basis, the coordinates αk are the Fourier
coefficients and we compute them easily as αk = ⟨dk , x∗ − x1 ⟩. The problem is that x∗ is unknown. Since
xk − x1 = ∑_{i=1}^{k−1} αi di , vector xk − x1 is A-orthogonal to dk , that is, ⟨dk , xk − x1 ⟩ = 0. We derive

αk = ⟨dk , x∗ − x1 ⟩ = ⟨dk , x∗ − xk + xk − x1 ⟩ = ⟨dk , x∗ − xk ⟩ + ⟨dk , xk − x1 ⟩
= ⟨dk , x∗ − xk ⟩ = dTk A(x∗ − xk ) = dTk (b − Axk ) = −dTk gk .
Proposition 6.7. Vector xk+1 is the minimum of f (x) on the affine subspace x1 + span{d1 , . . . , dk }, that
is, dTj gk+1 = 0 for j = 1, . . . , k (i.e., gk+1 ⊥ dj , meaning the standard orthogonality).

Proof. It is sufficient to show that vector ∇f (xk+1 ) = gk+1 is perpendicular to the subspace x1 +
span{d1 , . . . , dk }, that is, it is perpendicular to every vector d1 , . . . , dk . Write

gk+1 = Axk+1 − b = A(xj + ∑_{i=j}^k αi di ) − b.

For any j ∈ {1, . . . , k} we have

dTj gk+1 = dTj (A(xj + ∑_{i=j}^k αi di ) − b) = dTj (Axj − b) + dTj A(∑_{i=j}^k αi di )
= dTj gj + αj = 0.
56 Chapter 6. Methods

The choice of basis d1 , . . . , dn . We choose the basis such that span{d1 , . . . , dk } = span{g1 , . . . , gk }
for every k = 1, . . . , n. At the beginning we naturally put d1 := −g1 /√⟨g1 , g1 ⟩. In the (k + 1)st iteration we
construct vector dk+1 from vector −gk+1 by making it orthogonal to the subspace span{d1 , . . . , dk }.

Proposition 6.8. We have span{g1 , . . . , gk } = span{g1 , Ag1 , . . . , Ak−1 g1 }.

Proof. We prove it by mathematical induction on k. By definition and from the induction hypothesis we
have gk = Axk − b, where

xk ∈ x1 + span{g1 , . . . , gk−1 } = x1 + span{g1 , Ag1 , . . . , Ak−2 g1 }.

Hence
gk ∈ Ax1 − b + span{Ag1 , A2 g1 , . . . , Ak−1 g1 } ⊆ span{g1 , Ag1 , A2 g1 , . . . , Ak−1 g1 }.

In fact, we have equality since gk does not belong to span{g1 , . . . , gk−1 }. Otherwise, according to Propo-
sition 6.7, vector gk is orthogonal to this subspace and gk = Axk − b must be zero, meaning that xk is the
solution x∗ .

Since gk+1 is orthogonal (in the standard sense) to vectors d1 , . . . , dk , it is also orthogonal to g1 , . . . , gk ,
and by Proposition 6.8 it is A-orthogonal to vectors g1 , . . . , gk−1 , too. Thus, in order to compute dk+1 , it
is sufficient to make −gk+1 orthogonal to vector dk . This is performed by the following statement. Notice
that the resulting value of dk+1 is not normalized, so we have to normalize it afterwards.

Proposition 6.9. We have dk+1 = −gk+1 + βk+1 dk , where βk+1 = ⟨dk , gk+1 ⟩.

Proof. We already know that ⟨gk+1 , di ⟩ = 0 for i = 1, . . . , k − 1. Hence dk+1 has the form of dk+1 =
−gk+1 + βk+1 dk for certain βk+1 . From the equality 0 = ⟨dk , dk+1 ⟩ = dTk A(−gk+1 + βk+1 dk ) we derive the
value of βk+1 = dTk Agk+1 /dTk Adk = ⟨dk , gk+1 ⟩/⟨dk , dk ⟩ = ⟨dk , gk+1 ⟩.

Summary. Now we have all the ingredients to explicitly write the algorithm:
1: choose x1 ∈ Rn and put d0 := 0,
2: for k = 1, . . . , n do

gk := Axk − b,
βk := dTk−1 Agk ,
dk := −gk + βk dk−1 , dk := dk /√(dTk Adk ),
αk := −dTk gk ,
xk+1 := xk + αk dk ,

Remark 6.10. A few comments on the conjugate gradient method:


(1) The method has low memory requirements and makes no operations on matrix A. The method is
beneficial particularly when matrix A is large and sparse. The running time of one iteration is
relatively low. Moreover, not all n iterations need to be performed in general – we can achieve the
solution or a tight approximation of it much sooner.
(2) Often the method is presented without the normalization of vector dk . Then the expressions with
dk must be adjusted accordingly.
(3) If we choose x1 = 0, then span{d1 , . . . , dk } = span{b, Ab, A2 b, . . . , Ak−1 b} is called the Krylov
subspace and the theory behind is very interesting [Liesen and Strakoš, 2013].

The basic idea of the conjugate gradient method can also be used to minimize a general nonlinear function $f(x)$ over the space $\mathbb{R}^n$. Herein, the key idea is to construct the improving direction $d_k$ as a linear combination of the gradient $g_k$ and the previous direction $d_{k-1}$. Vector $g_k$ is then the gradient of function $f(x)$ at point $x_k$, and the coefficients are computed analogously. The resulting method is the Fletcher–Reeves method (1964). There exist several variants, which differ in the values of the coefficients $\beta_k$; a small sketch follows.
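For illustration, here is a minimal sketch of the Fletcher–Reeves iteration; the backtracking (Armijo) line search and the steepest-descent restart safeguard are our own simplifications, not part of the original method:

    import numpy as np

    def fletcher_reeves(f, grad, x, iters=200, tol=1e-8):
        g = grad(x)
        d = -g
        for _ in range(iters):
            if np.linalg.norm(g) < tol:
                break
            if g @ d >= 0:                  # safeguard: restart with -g
                d = -g
            t = 1.0                         # backtracking line search along d
            while f(x + t * d) > f(x) + 1e-4 * t * (g @ d):
                t *= 0.5
            x = x + t * d
            g_new = grad(x)
            beta = (g_new @ g_new) / (g @ g)    # Fletcher-Reeves coefficient
            d = -g_new + beta * d
            g = g_new
        return x

    # test on a simple smooth convex function
    f = lambda x: (x[0] - 1)**2 + 10 * (x[1] + 2)**2
    grad = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
    print(fletcher_reeves(f, grad, np.zeros(2)))    # close to [1, -2]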
There are also methods employing Krylov subspaces for solving systems $Ax = b$ where matrix $A$ is not necessarily symmetric positive definite. For example, let us mention GMRES (generalized minimal residual method, Saad & Schultz, 1986), which in the $k$th iteration computes the vector $x_k$ minimizing the Euclidean norm of the residual (i.e., $\|Ax - b\|_2$) over the subspace $\mathrm{span}\{b, Ab, A^2 b, \dots, A^{k-1} b\}$.
Chapter 7

Selected topics

7.1 Robust optimization


In practice, data are often inexact or subject to various uncertainties. This motivates us to seek solutions that are robust. There is no single precise definition, but basically a robust solution is one that remains feasible (and ideally optimal) under the considered data perturbations; see Ben-Tal et al. [2009]. We present two approaches to robustness, the interval one and the ellipsoidal one.

Interval uncertainty (I)


Consider first a linear program in the form

min cT x subject to Ax ≤ b, x ≥ 0.

Suppose that $A$ and $b$ are not known exactly and the only information we have are interval estimates of their values. That is, we know an interval matrix $[\underline{A}, \overline{A}]$ and an interval right-hand side vector $[\underline{b}, \overline{b}]$. We say that a vector $x$ is a robust feasible solution if it fulfills the inequality $Ax \le b$ for each $A \in [\underline{A}, \overline{A}]$ and $b \in [\underline{b}, \overline{b}]$. Due to the nonnegativity of the variables, $x$ is robust feasible if and only if $\overline{A} x \le \underline{b}$. Hence the robust counterpart of the linear program reads

min $c^T x$ subject to $\overline{A} x \le \underline{b}$, $x \ge 0$.

Example 7.1 (Catfish diet problem). This example comes from http://www.fao.org/3/x5738e/x5738e0h.htm. It is a simplified example of an optimization model of finding a minimum cost catfish diet in Thailand.
The mathematical formulation reads

min cT x subject to Ax ≥ b, x ≥ 0, (7.1)

where variable xj stands for the number of units of food j to be consumed by the catfish, bi is the required
minimal amount of nutrient i, cj is the price per unit of food j, and aij is the amount of nutrient i
contained in one unit of food j. The data are recorded in Table 7.1. Thus we have
 
    2.15
9 65 44 12 0 30  8.0 
 
A = 1.10 3.90 2.57 1.99
 0  , b = 250 , c = 
 
 6.0 .

0.02 3.7 0.3 0.1 38.0 0.5  2.0 
0.4

Since the nutritive values are not known exactly, we assume that their accuracy is 5%. Hence the exact value of each entry of matrix $A$ lies in the interval $[0.95 \cdot a_{ij},\, 1.05 \cdot a_{ij}]$. According to the lines described above, the robust counterpart is obtained by setting the constraint matrix to $\underline{A}$, that is,
$$\underline{A} = \begin{pmatrix} 8.550 & 61.75 & 41.800 & 11.400 & 0.00 \\ 1.045 & 3.705 & 2.4415 & 1.8905 & 0.00 \\ 0.019 & 3.515 & 0.2850 & 0.0950 & 36.1 \end{pmatrix}.$$


               Cost (THB/kg)   Protein (%)   Energy (Mcal/kg)   Calcium (%)
  Maize             2.15            9              1.10             0.02
  Fishmeal          8.0            65              3.90             3.7
  Soymeal           6.0            44              2.57             0.3
  Ricebran          2.0            12              1.99             0.1
  Limestone         0.4             0              0                38.0
  Demand min                       30            250                0.5

Table 7.1: (Example 7.1) Catfish diet problem: nutritive value of foods and the nutritional demands.
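The robust counterpart is still an ordinary linear program, so any LP solver applies. Below is a small sketch using scipy (the formulation is ours: the constraint $\underline{A} x \ge b$ is flipped to $-\underline{A} x \le -b$ for the solver):

    import numpy as np
    from scipy.optimize import linprog

    A = np.array([[9.0, 65, 44, 12, 0],
                  [1.10, 3.90, 2.57, 1.99, 0],
                  [0.02, 3.7, 0.3, 0.1, 38.0]])
    b = np.array([30.0, 250, 0.5])
    c = np.array([2.15, 8.0, 6.0, 2.0, 0.4])

    # Worst case of A x >= b over A in [0.95 A, 1.05 A] with x >= 0
    # is attained at the lower endpoint 0.95 A.
    A_low = 0.95 * A
    res = linprog(c, A_ub=-A_low, b_ub=-b)   # default bounds give x >= 0
    print(res.x, res.fun)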

Interval uncertainty (II)


Consider now a linear program with variables unrestricted in sign:

min $c^T x$ subject to $Ax \le b$.

Let $a^T x \le d$ be a selected inequality. Let intervals $[\underline{a}, \overline{a}] = ([\underline{a}_1, \overline{a}_1], \dots, [\underline{a}_n, \overline{a}_n])^T$ and $[\underline{d}, \overline{d}]$ be given. A solution $x$ is a robust solution of the selected inequality if it satisfies
$$a^T x \le d \quad \forall a \in [\underline{a}, \overline{a}],\ \forall d \in [\underline{d}, \overline{d}],$$
or, equivalently,
$$\max_{a \in [\underline{a}, \overline{a}]} a^T x \le \underline{d}.$$

Lemma 7.2. Denote by $a_\Delta = \frac{1}{2}(\overline{a} - \underline{a})$ the vector of interval radii and by $a_c = \frac{1}{2}(\underline{a} + \overline{a})$ the vector of interval midpoints. Then
$$\max_{a \in [\underline{a}, \overline{a}]} a^T x = a_c^T x + a_\Delta^T |x|.$$

Proof. For every $a \in [\underline{a}, \overline{a}]$ we have
$$a^T x = a_c^T x + (a - a_c)^T x \le a_c^T x + |a - a_c|^T |x| \le a_c^T x + a_\Delta^T |x|.$$
The inequality is attained as an equality for a certain $a \in [\underline{a}, \overline{a}]$. If $x \ge 0$, then $a_c^T x + a_\Delta^T |x| = a_c^T x + a_\Delta^T x = \overline{a}^T x$. If $x \le 0$, then $a_c^T x + a_\Delta^T |x| = a_c^T x - a_\Delta^T x = \underline{a}^T x$. Otherwise we apply this idea entrywise, so that the inequality is attained as an equality for an $a$ each entry of which is the left or right endpoint of the corresponding interval.

We use this lemma to express the robust solution constraint as
$$a_c^T x + a_\Delta^T |x| \le \underline{d}.$$

The left-hand side function is convex, but not smooth. Nevertheless, we can rewrite the constraint as a linear constraint by introducing an auxiliary variable $y \in \mathbb{R}^n$:
$$a_c^T x + a_\Delta^T y \le \underline{d}, \quad x \le y, \quad -x \le y.$$

Therefore linearity is preserved – the robust solutions of interval linear programs are also described by
linear constraints.
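As a quick numerical sanity check of Lemma 7.2 (a sketch of our own; the brute force works because a linear function attains its maximum over a box at one of its $2^n$ vertices):

    import numpy as np
    from itertools import product

    rng = np.random.default_rng(0)
    a_lo = rng.uniform(-1, 1, 3)
    a_hi = a_lo + rng.uniform(0, 1, 3)
    x = rng.uniform(-2, 2, 3)

    a_c = (a_lo + a_hi) / 2            # interval midpoints
    a_d = (a_hi - a_lo) / 2            # interval radii
    # enumerate all vertices of the interval box and maximize a^T x
    brute = max(np.array(v) @ x for v in product(*zip(a_lo, a_hi)))
    assert np.isclose(brute, a_c @ x + a_d @ np.abs(x))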

Example 7.3 (Robust classification). Consider two classes of data: the first one comprises given points $x_1, \dots, x_p \in \mathbb{R}^n$, and the second one contains given points $y_1, \dots, y_q \in \mathbb{R}^n$. We wish to construct a classifier that is able to predict to which class a new input belongs. A basic linear classifier is based on data separation by a widest separating band. Mathematically, we seek a hyperplane $a^T x + b = 1$ such that the first set of points belongs to the positive halfspace, the second set of points belongs to the negative halfspace, and the separating band is maximal. This leads to a convex quadratic program (see Figure 7.1a)

min $\|a\|_2$ subject to $a^T x_i + b \ge 1\ \forall i$, $a^T y_j + b \le -1\ \forall j$.


Figure 7.1: (Example 7.3) A linear classifier for real data and the robust linear classifier for interval data. (a) The widest separating band for real data, with separating hyperplane $a^T x + b = 1$. (b) The widest separating band for interval data.

Suppose now that the data are not measured exactly and one knows them only with a specified accuracy. Hence we are given vectors of intervals $[\underline{x}_i, \overline{x}_i] = [(x_c)_i - (x_\Delta)_i, (x_c)_i + (x_\Delta)_i]$, $i = 1, \dots, p$, and $[\underline{y}_j, \overline{y}_j] = [(y_c)_j - (y_\Delta)_j, (y_c)_j + (y_\Delta)_j]$, $j = 1, \dots, q$, comprising the true data. Using the approach described above, the robust counterpart model reads (see Figure 7.1b)

min $\|a\|_2$ subject to $(x_c)_i^T a - (x_\Delta)_i^T a' + b \ge 1\ \forall i$, $(y_c)_j^T a + (y_\Delta)_j^T a' + b \le -1\ \forall j$, $\pm a \le a'$.

Again, it is a convex quadratic program (in variables $a, a' \in \mathbb{R}^n$ and $b \in \mathbb{R}$).
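A sketch of how the robust counterpart might look in cvxpy (assuming cvxpy is available; the variable names and the random interval data are ours):

    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(1)
    n, p, q = 2, 20, 20
    Xc = rng.normal(2.0, 1.0, (p, n));  Xd = 0.1 * np.ones((p, n))   # midpoints, radii
    Yc = rng.normal(-2.0, 1.0, (q, n)); Yd = 0.1 * np.ones((q, n))

    a = cp.Variable(n); ap = cp.Variable(n); b = cp.Variable()
    cons = [Xc @ a - Xd @ ap + b >= 1,     # worst case over the x-boxes
            Yc @ a + Yd @ ap + b <= -1,    # worst case over the y-boxes
            a <= ap, -a <= ap]             # ap plays the role of |a|
    prob = cp.Problem(cp.Minimize(cp.norm(a, 2)), cons)
    prob.solve()
    print(a.value, b.value)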

Ellipsoidal uncertainty
Consider again a linear program with variables unrestricted in sign:

min $c^T x$ subject to $Ax \le b$.

Let $a^T x \le d$ be a selected inequality. Consider an ellipsoid
$$E = \{a \in \mathbb{R}^n;\ a = p + P u,\ \|u\|_2 \le 1\},$$
which is expressed as the image of a unit ball under a linear (or, more precisely, affine) mapping. A point $x$ is a robust solution of the selected inequality if it satisfies
$$a^T x \le d \quad \forall a \in E,$$
or, equivalently,
$$\max_{a \in E} a^T x \le d.$$

Lemma 7.4. We have
$$\max_{a \in E} a^T x = p^T x + \|P^T x\|_2.$$

Proof. Write
$$\max_{a \in E} a^T x = \max_{\|u\|_2 \le 1} (p + P u)^T x = p^T x + \max_{\|u\|_2 \le 1} (P^T x)^T u
= p^T x + (P^T x)^T \frac{P^T x}{\|P^T x\|_2} = p^T x + \|P^T x\|_2.$$

Using the lemma, we can express the robust solution constraint as
$$p^T x + \|P^T x\|_2 \le d.$$

The left-hand side function is convex and smooth (except at points where $P^T x = 0$) – indeed, it is a second order cone constraint.
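A quick numerical check of Lemma 7.4 (a sketch of our own; random sampling of the unit sphere only approaches the true maximum from below):

    import numpy as np

    rng = np.random.default_rng(2)
    p = rng.normal(size=3); P = rng.normal(size=(3, 3)); x = rng.normal(size=3)

    u = rng.normal(size=(200000, 3))
    u /= np.linalg.norm(u, axis=1, keepdims=True)   # points on the unit sphere
    sampled = (p + u @ P.T) @ x                     # a^T x for a = p + P u
    print(sampled.max())                            # slightly below ...
    print(p @ x + np.linalg.norm(P.T @ x))          # ... the closed form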

Example 7.5. Consider again the portfolio selection problem (Example 4.9)

max $c^T x$ subject to $e^T x = K$, $x \ge o$, (7.2)

where $c$ is a random Gaussian vector with expected value $\tilde{c} := \mathrm{E}\, c$ and covariance matrix $\Sigma := \mathrm{cov}\, c = \mathrm{E}\,(c - \tilde{c})(c - \tilde{c})^T$. The level sets of the density function are ellipsoids, so it is natural to work with them. For a random vector $c$ we have that the probability $P(c - \tilde{c} \in E_\eta) = \eta$, where $E_\eta$ is a certain ellipsoid (concretely, $E_\eta = \{d \in \mathbb{R}^n;\ d = F^{-1}(\eta)\sqrt{\Sigma}\, u,\ \|u\|_2 \le 1\}$, where $F^{-1}(\eta)$ is the quantile function of the normal distribution and $\sqrt{\Sigma}$ is the positive semidefinite square root of matrix $\Sigma$, i.e., $(\sqrt{\Sigma})^2 = \Sigma$). One of the possible ways to solve (7.2) is to consider the deterministic counterpart

max $z$ subject to $P(c^T x \ge z) \ge \eta$, $e^T x = K$, $x \ge o$,

where $\eta \in [\frac{1}{2}, 1]$ is a fixed value, e.g., $\eta = 0.95$. Obviously, the condition $P(c^T x \ge z) \ge \eta$ is fulfilled if $d^T x \ge z$ holds for every $d \in E_\eta + \tilde{c}$. Hence we can approximate the problem as

max $z$ subject to $d^T x \ge z\ \ \forall d \in E_\eta + \tilde{c}$, $e^T x = K$, $x \ge o$.

This optimization problem involves ellipsoidal uncertainty, so we can equivalently write it as

max $z$ subject to $\tilde{c}^T x - F^{-1}(\eta)\|\sqrt{\Sigma}\, x\|_2 \ge z$, $e^T x = K$, $x \ge o$.

Since $F^{-1}(\eta) \ge 0$ for any $\eta \ge \frac{1}{2}$, it is a second order cone programming problem.
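A cvxpy sketch of the resulting problem (after eliminating $z$, we maximize $\tilde{c}^T x - F^{-1}(\eta)\|\sqrt{\Sigma}\, x\|_2$ directly); the toy data are ours, and we use a Cholesky factor $L$ in place of the symmetric square root, which is legitimate since $\|L^T x\|_2^2 = x^T \Sigma x$:

    import numpy as np
    import cvxpy as cp
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    n, K, eta = 5, 1.0, 0.95
    c_mean = rng.uniform(0.9, 1.2, n)               # estimated expected returns
    G = rng.normal(size=(n, n))
    Sigma = 0.01 * (G @ G.T) + 0.01 * np.eye(n)     # toy covariance matrix

    kappa = norm.ppf(eta)                           # F^{-1}(eta), >= 0 for eta >= 1/2
    L = np.linalg.cholesky(Sigma)
    x = cp.Variable(n)
    objective = cp.Maximize(c_mean @ x - kappa * cp.norm(L.T @ x, 2))
    problem = cp.Problem(objective, [cp.sum(x) == K, x >= 0])
    problem.solve()
    print(x.value)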

7.2 Concave programming


Concave programming means minimizing a concave function on a convex set, or equivalently maximizing
a convex function
max f (x) subject to x ∈ M,
where M ⊆ Rn is a convex set and f : Rn → R is a convex function.
Theorem 7.6. Let M be a bounded convex polyhedron. Then the optimal solution exists and it is attained
in at least one of the vertices of M .
Remark. The theorem can be extended as follows: Any continuous convex function on a compact set $M$ attains its maximum in an extreme point of $M$; see Avriel et al. [1988]. This property holds even more generally, when function $f(x)$ is so-called quasiconvex.

Proof. Let $v_1, \dots, v_m$ be the vertices of $M$, and without loss of generality assume that $f(v_1) = \max_{i=1,\dots,m} f(v_i)$. Then every point $x \in M$ can be expressed as a convex combination $x = \sum_{i=1}^m \alpha_i v_i$, where $\alpha_i \ge 0$ and $\sum_{i=1}^m \alpha_i = 1$. Now
$$f(x) = f\Big(\sum_{i=1}^m \alpha_i v_i\Big) \le \sum_{i=1}^m \alpha_i f(v_i) \le \sum_{i=1}^m \alpha_i f(v_1) = f(v_1).$$
Therefore $v_1$ is an optimum.

This property holds in linear programming, too. For computing an optimal solution, however, it is
not very convenient since polyhedron M may contain many vertices, and we do not know which one is
optimal. By Theorem 4.8, concave programming is NP-hard.
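Nevertheless, for tiny instances the vertex inspection suggested by Theorem 7.6 can be carried out directly. A brute-force sketch (entirely our own, exponential in the dimension, so for illustration only) enumerates candidate vertices of $\{x : Ax \le b\}$ as intersections of $n$ active constraints:

    import numpy as np
    from itertools import combinations

    def maximize_convex_over_polytope(f, A, b):
        # Enumerate candidate vertices as intersections of n constraints,
        # keep the feasible ones, and return the best vertex found.
        m, n = A.shape
        best, best_val = None, -np.inf
        for idx in combinations(range(m), n):
            Ai, bi = A[list(idx)], b[list(idx)]
            if np.linalg.matrix_rank(Ai) < n:
                continue
            v = np.linalg.solve(Ai, bi)
            if np.all(A @ v <= b + 1e-9) and f(v) > best_val:
                best, best_val = v, f(v)
        return best, best_val

    # maximize the convex function ||x||_2^2 over x >= 0, x_1 + x_2 <= 2
    A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
    b = np.array([0.0, 0.0, 2.0])
    print(maximize_convex_over_polytope(lambda x: x @ x, A, b))  # value 4 at (2,0) or (0,2)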
Typical problems resulting in concave programming comprise
• Fixed charge problems. The objective function has the form $f(x) = \sum_{i=1}^k f_i(x_i)$, where $f_i(x_i) = 0$ for $x_i = 0$ and $f_i(x_i) = c_i + g_i(x_i)$ for $x_i > 0$. Herein, $f_i(x_i)$ represents a price (e.g., the price for the transport of goods of size $x_i$). Hence the price is naturally zero when $x_i = 0$. When $x_i > 0$, we pay a fixed charge $c_i$ plus the price $g_i(x_i)$ depending on the size of $x_i$. We can assume that $g_i(x_i)$ is concave since the larger $x_i$, the smaller the relative price per unit of goods (e.g., due to discounts).

• Multiplicative programming. The objective function has the form $f(x) = \prod_{i=1}^k x_i$. This is not a concave function in general, but its logarithm gives a concave function $\log(f(x)) = \sum_{i=1}^k \log(x_i)$. Such problems appear in geometry, where, for example, we minimize the volume of a body (e.g., a cuboid) subject to some constraints (e.g., the cuboid contains specified points).
Appendix

Derivative of matrix expressions. Let A ∈ Rn×n , b ∈ Rn and c ∈ R. Consider the quadratic function
f : Rn → R defined as

f (x) = xT Ax + bT x + c.

Its gradient reads

∇f (x) = (A + AT )x + b,

and the Hessian matrix takes the form

∇2 f (x) = A + AT .

In particular, if matrix A is symmetric, then

∇f (x) = 2Ax + b, ∇2 f (x) = 2A.

Proof. First we consider the linear term:
$$\frac{\partial}{\partial x_k} b^T x = \frac{\partial}{\partial x_k} \sum_{i=1}^n b_i x_i = b_k,$$
whence $\nabla b^T x = b$.
For the quadratic term, we get
$$\frac{\partial}{\partial x_k} x^T A x = \frac{\partial}{\partial x_k} \sum_{i=1}^n \sum_{j=1}^n a_{ij} x_i x_j
= \frac{\partial}{\partial x_k} \Big( a_{kk} x_k^2 + \sum_{i \neq k} (a_{ik} + a_{ki}) x_i x_k + \sum_{i,j \neq k} a_{ij} x_i x_j \Big)$$
$$= 2 a_{kk} x_k + \sum_{i \neq k} (a_{ik} + a_{ki}) x_i = \sum_{i=1}^n (a_{ik} + a_{ki}) x_i = \big((A + A^T) x\big)_k.$$
Hence the gradient reads $\nabla x^T A x = (A + A^T) x$. Since this is a linear function, the particular coordinates are differentiated in the same way as for the linear term. Therefore $\nabla^2 x^T A x = A + A^T$.
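The formulas are easy to verify numerically; a small finite-difference sketch (entirely our own):

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.normal(size=(4, 4)); b = rng.normal(size=4); x = rng.normal(size=4)
    f = lambda x: x @ A @ x + b @ x          # take c = 0 for simplicity

    grad = (A + A.T) @ x + b                 # the closed-form gradient
    eps = 1e-6
    num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(4)])     # central differences
    print(np.max(np.abs(num - grad)))        # tiny, i.e., the two agree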

Notation

Sets and numbers


N, Z, Q, R the set of natural numbers, integers, rational numbers, and reals, respectively
U +V sumset (Minkowski sum), U + V = {u + v; u ∈ U, v ∈ V }
U +v particular sumset, U + v = U + {v} = {u + v; u ∈ U }
conv(M ) the convex hull of a set M
int(M ) the topological interior of a set M
K∗ the dual cone to cone K, see Definition 4.15
r+            the positive part of a real number, $r^+ = \max(r, 0)$

Matrices and vectors


tr(A)         the trace of a matrix A, $\mathrm{tr}(A) = \sum_i a_{ii}$
S(A) the column space of a matrix A
AT the transposition of a matrix A
A≥B componentwise inequality, i.e., aij ≥ bij
A>B componentwise strict inequality, i.e., aij > bij
A ⪰ B         matrix A − B is positive semidefinite
A≻B matrix A − B is positive definite
Ai∗ the ith row of a matrix A
A∗j the jth column of a matrix A
diag(v) the diagonal matrix with entries v1 , . . . , vn
0n , 0 zero matrix (all entries are zero)
In , I the identity matrix (the diagonal matrix with ones on the diagonal)
ei            the ith canonical unit vector, $e_i = I_{*i} = (0, \dots, 0, 1, 0, \dots, 0)^T$
e the vector of ones, e = (1, . . . , 1)T

Functions
‖x‖p          the $\ell_p$-norm of a vector $x \in \mathbb{R}^n$, $\|x\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$
‖x‖1          the Manhattan norm of a vector $x \in \mathbb{R}^n$, $\|x\|_1 = \sum_{i=1}^n |x_i|$
‖x‖2          the Euclidean norm of a vector $x \in \mathbb{R}^n$, $\|x\|_2 = \sqrt{\sum_{i=1}^n x_i^2}$
‖x‖∞          the maximum norm of a vector $x \in \mathbb{R}^n$, $\|x\|_\infty = \max_{i=1,\dots,n} |x_i|$
P(c)          the probability of a random event c

Bibliography

A. A. Ahmadi, A. Olshevsky, P. A. Parrilo, and J. N. Tsitsiklis. NP-hardness of deciding convexity of


quartic polynomials and related problems. Math. Program., 137(1-2):453–476, 2013.
https://arxiv.org/pdf/1012.1908. 25

M. Avriel, W. E. Diewert, S. Schaible, and I. Zang. Generalized concavity, volume 36 of Mathematical


Concepts and Methods in Science and Engineering. Plenum Press, New York, 1988. 62

M. S. Bazaraa, H. D. Sherali, and C. M. Shetty. Nonlinear Programming. Theory and Algorithms. 3rd ed.
John Wiley & Sons., NJ, 2006. 3

A. Ben-Tal and A. Nemirovski. Lectures on modern convex optimization. Analysis, algorithms, and engi-
neering applications. SIAM, Philadelphia, PA, 2001.
https://www2.isye.gatech.edu/~nemirovs/Lect_ModConvOpt.pdf. 31

A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009.
https://www2.isye.gatech.edu/~nemirovs/FullBookDec11.pdf. 59

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.


http://web.stanford.edu/~boyd/cvxbook/. 3

S. Boyd, S.-J. Kim, L. Vandenberghe, and A. Hassibi. A tutorial on geometric programming. Optim.
Eng., 8(1):67, 2007.

P. J. Dickinson and L. Gijben. On the computational complexity of membership problems for the com-
pletely positive cone and its dual. Comput. Optim. Appl., 57(2):403–415, 2014.
http://dx.doi.org/10.1007/s10589-013-9594-z. 38

M. Dür. Copositive programming – a survey. In M. Diehl, F. Glineur, E. Jarlebring, and W. Michiels,


editors, Recent Advances in Optimization and its Applications in Engineering: The 14th Belgian-French-
German Conference on Optimization, pages 3–20. Springer, Berlin, Heidelberg, 2010.
http://www.optimization-online.org/DB_FILE/2009/11/2464.pdf. 38

A. Fiacco and G. McCormick. Nonlinear Programming: Sequential Unconstrained Minimization Tech-


niques. Wiley, New York, 1968. 53

C. A. Floudas and P. M. Pardalos, editors. Encyclopedia of Optimization. 2nd ed. Springer, New York,
2009. 29

M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Res. Logist. Quart., 3(1-2):
95–110, 1956. 51

M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. J. Res. Natl.
Bur. Stand., 49(6):409–436, 1952.
https://doi.org/10.6028/jres.049.044. 55

C. F. Higham and D. J. Higham. Deep learning: An introduction for applied mathematicians. SIAM Rev.,
61(4):860–891, 2019. 49


M. Hutchings, F. Morgan, M. Ritoré, and A. Ros. Proof of the double bubble conjecture. Ann. Math.,
155(2):459–489, 2002.
https://arxiv.org/pdf/math/0406017. 8

W. Karush. Minima of functions of several variables with inequalities as side constraints. M.Sc. disserta-
tion, Department of Mathematics, University of Chicago, Chicago, IL, USA, 1939. 43

H. W. Kuhn and A. W. Tucker. Nonlinear programming. In Proceedings of the Second Berkeley Symposium
on Mathematical Statistics and Probability, 1950, pages 481–492, Berkeley, 1951. University of California
Press. 43

K. Lange. MM Optimization Algorithms, volume 147 of Other Titles Appl. Math. SIAM, Philadelphia,
PA, 2016. 23

A. N. Langville and C. D. Meyer. Who’s #1? The science of rating and ranking. Princeton University
Press, Princeton, NJ, 2012. 29

J. Liesen and Z. Strakoš. Krylov Subspace Methods, Principles and Analysis. Oxford University Press,
Oxford, 2013. 55, 56

D. Luenberger and Y. Ye. Linear and Nonlinear Programming. Springer, New York, third edition, 2008.
3, 55

K. G. Murty and S. N. Kabadi. Some NP-complete problems in quadratic and nonlinear programming.
Math. Program., 39(2):117–129, 1987.
https://doi.org/10.1007/BF02592948. 38

Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. SIAM,


Philadelphia, 1994.

P. M. Pardalos and S. A. Vavasis. Quadratic programming with one negative eigenvalue is NP-hard. J.
Glob. Optim., 1(1):15–22, 1991.
https://doi.org/10.1007/BF00120662. 29

M. J. Todd. Minimum-Volume Ellipsoids: Theory and Algorithms, volume 23 of MOS-SIAM Series on


Optimization. SIAM & Mathematical Optimization Society, Philadelphia, PA, 2016. 39

S. A. Vavasis. Nonlinear Optimization: Complexity Issues. Oxford University Press, New York, 1991. 29

W. Zhu. Unsolvability of some optimization problems. Appl. Math. Comput., 174(2):921–926, 2006.
https://doi.org/10.1016/j.amc.2005.05.025. 7

G. Zoutendijk. Methods of feasible directions: A study in linear and nonlinear programming. PhD thesis,
University of Amsterdam, Amsterdam, Netherlands, 1960. 51
