MOSEK Modeling Cookbook
Release 3.2.3
MOSEK ApS
02 June 2022
Contents
1 Preface
2 Linear optimization
2.1 Introduction
2.2 Linear modeling
2.3 Infeasibility in linear optimization
2.4 Duality in linear optimization
6 Semidefinite optimization
6.1 Introduction to semidefinite matrices
6.2 Semidefinite modeling
6.3 Semidefinite optimization case studies
7 Practical optimization
7.1 Conic reformulations
7.2 Avoiding ill-posed problems
7.3 Scaling
7.4 The huge and the tiny
7.5 Semidefinite variables
7.6 The quality of a solution
7.7 Distance to a cone
8.1 Dual cone
8.2 Infeasibility in conic optimization
8.3 Lagrangian and the dual problem
8.4 Weak and strong duality
8.5 Applications of conic duality
8.6 Semidefinite duality and LMIs
Bibliography
Index
Chapter 1
Preface
Content
We begin with a comprehensive chapter on linear optimization, including modeling ex-
amples, duality theory and infeasibility certificates for linear problems. Linear problems
are optimization problems of the form
minimize 𝑐𝑇 𝑥
subject to 𝐴𝑥 = 𝑏,
𝑥 ≥ 0.
Conic optimization is a generalization of linear optimization which handles problems of
the form:
minimize 𝑐𝑇 𝑥
subject to 𝐴𝑥 = 𝑏,
𝑥 ∈ 𝐾,
where 𝐾 is a convex cone. Various families of convex cones allow formulating different
types of nonlinear constraints. The following chapters present modeling with four types
of convex cones:
• quadratic cones,
• power cone,
• exponential cone,
• semidefinite cone.
Chapter 2
Linear optimization
In this chapter we discuss various aspects of linear optimization. We first introduce the
basic concepts of linear optimization and discuss the underlying geometric interpretations.
We then give examples of the most frequently used reformulations or modeling tricks used
in linear optimization, and finally we discuss duality and infeasibility theory in some
detail.
2.1 Introduction
2.1.1 Basic notions
The most basic type of optimization is linear optimization. In linear optimization we
minimize a linear function given a set of linear constraints. For example, we may wish to
minimize a linear function
$$x_1 + 2x_2 - x_3$$
subject to the constraints
$$x_1 + x_2 + x_3 = 1, \quad x_1, x_2, x_3 \geq 0.$$
The function we minimize is often called the objective function; in this case we have
a linear objective function. The constraints are also linear and consist of both linear
equalities and inequalities. We typically use more compact notation
minimize 𝑥1 + 2𝑥2 − 𝑥3
subject to 𝑥1 + 𝑥2 + 𝑥3 = 1, (2.1)
𝑥1 , 𝑥2 , 𝑥3 ≥ 0,
and we call (2.1) a linear optimization problem. The domain where all constraints are
satisfied is called the feasible set; the feasible set for (2.1) is shown in Fig. 2.1.
For this simple problem we see by inspection that the optimal value of the problem
is $-1$ obtained by the optimal solution
$$(x_1, x_2, x_3) = (0, 0, 1).$$
[Fig. 2.1: The feasible set of problem (2.1).]
Linear optimization problems are typically formulated using matrix notation. The stan-
dard form of a linear minimization problem is:
minimize 𝑐𝑇 𝑥
subject to 𝐴𝑥 = 𝑏, (2.2)
𝑥 ≥ 0.
There are many other formulations for linear optimization problems; we can have different
types of constraints,
𝐴𝑥 = 𝑏, 𝐴𝑥 ≥ 𝑏, 𝐴𝑥 ≤ 𝑏, 𝑙𝑐 ≤ 𝐴𝑥 ≤ 𝑢𝑐 ,
𝑙𝑥 ≤ 𝑥 ≤ 𝑢𝑥
or we may have no bounds on some 𝑥𝑖 , in which case we say that 𝑥𝑖 is a free variable.
All these formulations are equivalent in the sense that by simple linear transformations
and introduction of auxiliary variables they represent the same set of problems. The
important feature is that the objective function and the constraints are all linear in 𝑥.
2.2 Linear modeling
In this section we present useful reformulation techniques and standard tricks which
allow constructing more complicated models using linear optimization. It is also a guide
through the types of constraints which can be expressed using linear (in)equalities.
2.2.1 Maximum
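The inequality $t \geq \max\{x_1, \ldots, x_n\}$ is equivalent to a simultaneous sequence of $n$ inequalities
$$t \geq x_i, \quad i = 1, \ldots, n,$$
and similarly $t \leq \min\{x_1, \ldots, x_n\}$ is equivalent to
$$t \leq x_i, \quad i = 1, \ldots, n.$$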
Of course the same reformulation applies if each 𝑥𝑖 is not a single variable but a linear
expression. In particular, we can consider convex piecewise-linear functions $f : \mathbb{R}^n \to \mathbb{R}$
defined as the maximum of affine functions (see Fig. 2.6):
$$f(x) := \max_{i=1,\ldots,m} \{a_i^T x + b_i\}.$$
Fig. 2.6: A convex piecewise-linear function (solid lines) of a single variable 𝑥. The
function is defined as the maximum of 3 affine functions.
The epigraph 𝑓 (𝑥) ≤ 𝑡 (see Sec. 12) has an equivalent formulation with 𝑚 inequalities:
𝑎𝑇𝑖 𝑥 + 𝑏𝑖 ≤ 𝑡, 𝑖 = 1, . . . , 𝑚.
Piecewise-linear functions have many uses in linear optimization; either we have a convex
piecewise-linear formulation from the onset, or we may approximate a more complicated
(nonlinear) problem using piecewise-linear approximations, although with modern non-
linear optimization software it is becoming both easier and more efficient to directly
formulate and solve nonlinear problems without piecewise-linear approximations.
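To make the epigraph reformulation concrete, here is a minimal Python sketch for a function of a single variable. The coefficient pairs in `pieces` are hypothetical values chosen only for illustration; they are not taken from Fig. 2.6. The sketch evaluates the maximum of the affine pieces and checks membership in the epigraph through the individual linear inequalities.

```python
def pwl(x, pieces):
    """Evaluate f(x) = max_i (a_i * x + b_i) for a scalar x."""
    return max(a * x + b for a, b in pieces)


def in_epigraph(x, t, pieces):
    """Check f(x) <= t through the m linear inequalities a_i * x + b_i <= t."""
    return all(a * x + b <= t for a, b in pieces)


# Hypothetical (a_i, b_i) pairs describing three affine pieces.
pieces = [(-1.0, 0.0), (0.0, -0.5), (1.0, -2.0)]

print(pwl(3.0, pieces))               # value of the piecewise-linear function at x = 3
print(in_epigraph(3.0, 1.0, pieces))  # (x, t) = (3, 1) lies on the graph, so in the epigraph
```

Both functions agree by construction: `in_epigraph(x, t, pieces)` holds exactly when `pwl(x, pieces) <= t`.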
2.2.2 Absolute value
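The absolute value of a scalar variable is a special case of maximum,
$$|x| := \max\{x, -x\},$$
so we can model the epigraph $|x| \leq t$ using two inequalities
$$-t \leq x \leq t.$$
The same idea extends to the $\ell_1$ norm $\|x\|_1 := |x_1| + \cdots + |x_n|$ of a vector $x \in \mathbb{R}^n$. To model the epigraph
$$\|x\|_1 \leq t, \qquad (2.3)$$
we introduce the system
$$|x_i| \leq z_i, \quad i = 1, \ldots, n, \qquad \sum_{i=1}^n z_i = t, \qquad (2.4)$$
with additional (auxiliary) variable $z \in \mathbb{R}^n$. Clearly (2.3) and (2.4) are equivalent, in the
sense that they have the same projection onto the space of $x$ and $t$ variables. Therefore,
we can model (2.3) using linear (in)equalities
$$-z_i \leq x_i \leq z_i, \quad \sum_{i=1}^n z_i = t, \qquad (2.5)$$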
with auxiliary variables 𝑧. Similarly, we can describe the epigraph of the norm of an
affine function of 𝑥,
‖𝐴𝑥 − 𝑏‖1 ≤ 𝑡
as
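$$-z_i \leq a_i^T x - b_i \leq z_i, \quad \sum_{i=1}^n z_i = t,$$
where $a_i^T$ denotes the $i$-th row of $A$. As an application, suppose we are given an underdetermined linear system
$$Ax = b$$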
where 𝐴 ∈ R𝑚×𝑛 and 𝑚 ≪ 𝑛. The basis pursuit problem
minimize   $\|x\|_1$
subject to $Ax = b$,   (2.6)
uses the ℓ1 norm of 𝑥 as a heuristic for finding a sparse solution (one with many zero
elements) to 𝐴𝑥 = 𝑏, i.e., it aims to represent 𝑏 as a linear combination of few columns
of 𝐴. Using (2.5) we can pose the problem as a linear optimization problem,
minimize 𝑒𝑇 𝑧
subject to −𝑧 ≤ 𝑥 ≤ 𝑧, (2.7)
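           $Ax = b$,
where $e = (1, \ldots, 1)^T$.
The $\ell_\infty$ norm of a vector $x \in \mathbb{R}^n$, defined as $\|x\|_\infty := \max_{i=1,\ldots,n} |x_i|$,
is another example of a simple piecewise-linear function. Using Sec. 2.2.2 we model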
‖𝑥‖∞ ≤ 𝑡
as
−𝑡 ≤ 𝑥𝑖 ≤ 𝑡, 𝑖 = 1, . . . , 𝑛.
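Similarly, the epigraph of the norm of an affine function of $x$,
$$\|Ax - b\|_\infty \leq t,$$
can be described as
$$-t \leq a_i^T x - b_i \leq t, \quad i = 1, \ldots, n.$$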
Example 2.2 (Dual norms). It is interesting to note that the ℓ1 and ℓ∞ norms are
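dual. For any norm $\|\cdot\|$ on $\mathbb{R}^n$, the dual norm $\|\cdot\|_*$ is defined as
$$\|x\|_* = \max_{\|v\| \leq 1} x^T v.$$
Let us verify that the dual of the $\ell_\infty$ norm is the $\ell_1$ norm. Consider
$$\|x\|_{*,\infty} = \max_{\|v\|_\infty \leq 1} x^T v.$$
Obviously the maximum is attained for
$$v_i = \begin{cases} +1, & x_i \geq 0, \\ -1, & x_i < 0, \end{cases}$$
i.e., $\|x\|_{*,\infty} = \sum_i |x_i| = \|x\|_1$. Similarly, consider the dual of the $\ell_1$ norm,
$$\|x\|_{*,1} = \max_{\|v\|_1 \leq 1} x^T v.$$
Here the maximum is attained by placing all the weight of $v$ on an entry of $x$ with the
largest absolute value, so that $\|x\|_{*,1} = \max_i |x_i| = \|x\|_\infty$.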
2.2.5 Homogenization
Consider the linear-fractional problem
minimize   $\dfrac{a^T x + b}{c^T x + d}$
subject to $c^T x + d > 0$,   (2.8)
           $Fx = g$.
Perhaps surprisingly, it can be turned into a linear problem if we homogenize the linear
constraint, i.e. replace it with 𝐹 𝑦 = 𝑔𝑧 for a single variable 𝑧 ∈ R. The full new
optimization problem is
minimize   $a^T y + bz$
subject to $c^T y + dz = 1$,
           $Fy = gz$,   (2.9)
           $z \geq 0$.
If $x$ is a feasible point in (2.8) then $z = (c^T x + d)^{-1}$, $y = xz$ is feasible for (2.9) with the
same objective value. Conversely, if $(y, z)$ is feasible for (2.9) then $x = y/z$ is feasible in
(2.8) and has the same objective value, at least when $z \neq 0$. If $z = 0$ and $x$ is any feasible
point for (2.8) then $x + ty$, $t \to +\infty$ is a sequence of solutions of (2.8) whose objective
values converge to the value of (2.9). We leave it for the reader to check those statements.
In either case we showed an equivalence between the two problems.
Note that, as the sketch of proof above suggests, the optimal value in (2.8) may not
be attained, even though the one in the linear problem (2.9) always is. For example,
consider a pair of problems constructed as above:
minimize   $x_1/x_2$
subject to $x_2 > 0$,
           $x_1 + x_2 = 1$,
and
minimize   $y_1$
subject to $y_1 + y_2 = z$,
           $y_2 = 1$,
           $z \geq 0$.
Both have an optimal value of $-1$, but in the first (linear-fractional) problem we can only
approach it arbitrarily closely.
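The mapping between the two problems can be checked numerically. The following Python sketch uses exact rational arithmetic and the data of the small example above (objective $x_1/x_2$, constraint $x_1 + x_2 = 1$); the helper names `dot` and `homogenize` and the sample point are illustrative assumptions.

```python
from fractions import Fraction as Fr


def dot(u, v):
    """Inner product of two equally long sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def homogenize(x, c, d):
    """Map a feasible x of (2.8) to (y, z) with z = 1/(c^T x + d) and y = x*z."""
    z = 1 / (dot(c, x) + d)
    return [xi * z for xi in x], z


# Data of the example: minimize x1/x2 subject to x2 > 0, x1 + x2 = 1.
a, b = [Fr(1), Fr(0)], Fr(0)
c, d = [Fr(0), Fr(1)], Fr(0)
F, g = [Fr(1), Fr(1)], Fr(1)

x = [Fr(-9), Fr(10)]                     # a feasible point chosen for illustration
y, z = homogenize(x, c, d)

frac_obj = (dot(a, x) + b) / (dot(c, x) + d)   # objective of (2.8)
lin_obj = dot(a, y) + b * z                    # objective of (2.9)
print(frac_obj, lin_obj)
```

Moving the sample point further along the line $x_1 + x_2 = 1$ (larger $x_2$) brings both objective values closer to $-1$, in line with the remark that the fractional problem only approaches its optimal value.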
2.2.6 Sum of largest elements
Suppose 𝑥 ∈ R𝑛 and that 𝑚 is a positive integer. Consider the problem
minimize   $mt + \sum_i u_i$
subject to $u_i + t \geq x_i$, $i = 1, \ldots, n$,   (2.10)
           $u_i \geq 0$, $i = 1, \ldots, n$,
with new variables $t \in \mathbb{R}$, $u \in \mathbb{R}^n$. It is easy to see that fixing a value for $t$ determines
the rest of the solution. For the sake of simplifying notation let us assume for a moment
that 𝑥 is sorted:
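$$x_1 \geq x_2 \geq \cdots \geq x_n.$$
If $t$ lies between $x_{k+1}$ and $x_k$, then in the optimal solution $u_i = x_i - t$ for $i \leq k$ and
$u_i = 0$ for $i \geq k+1$. The objective value for such $t$ is therefore
$$\mathrm{obj}_t = x_1 + \cdots + x_k + t(m - k),$$
which is a linear function minimized at one of the endpoints of $[x_{k+1}, x_k]$. Now we can
compute
$$\mathrm{obj}_{x_{k+1}} - \mathrm{obj}_{x_k} = (k - m)(x_k - x_{k+1}).$$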
It follows that obj𝑥𝑘 has a minimum for 𝑘 = 𝑚, and therefore the optimum value of (2.10)
is simply
𝑥1 + · · · + 𝑥𝑚 .
Since the assumption that 𝑥 is sorted was only a notational convenience, we conclude
that in general the optimization model (2.10) computes the sum of 𝑚 largest entries in
𝑥. In Sec. 2.4 we will show a conceptual way of deriving this model.
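The argument above can be replayed in a few lines of Python. This is only a sketch under the assumption $1 \leq m \leq n$: for a fixed $t$ the best choice is $u_i = \max(x_i - t, 0)$, and since the resulting objective is piecewise linear in $t$ with break points at the entries of $x$, it suffices to try those values of $t$. The function names and the sample vector are hypothetical.

```python
def objective(t, x, m):
    """Objective of (2.10) for fixed t, using the optimal u_i = max(x_i - t, 0)."""
    return m * t + sum(max(xi - t, 0) for xi in x)


def sum_largest_via_lp(x, m):
    """Minimize over t by trying every break point t = x_i (assumes 1 <= m <= len(x))."""
    return min(objective(t, x, m) for t in x)


x = [3, -1, 7, 4, 0]   # sample data, not sorted on purpose
m = 2

print(sum_largest_via_lp(x, m))              # optimum of (2.10)
print(sum(sorted(x, reverse=True)[:m]))      # direct sum of the m largest entries
```

The two printed numbers coincide, and the input does not need to be sorted, which matches the remark that sorting was only a notational convenience.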
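2.3 Infeasibility in linear optimization
We denote the feasible set of the standard-form problem (2.2) by
$$\mathcal{F}_p = \{x \in \mathbb{R}^n \mid Ax = b,\ x \geq 0\}. \qquad (2.11)$$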
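Example 2.3 (Linear infeasible problem). Consider an optimization problem with variables
$x_1, x_2, x_3 \geq 0$ and the three equality constraints listed below. Multiplying them by the
coefficients shown on the right and adding up gives the fourth line:
$$\begin{array}{rcll} x_1 + x_2 + 2x_3 & = & 1, & / \cdot 1 \\ -2x_1 - x_2 + x_3 & = & -0.5, & / \cdot 2 \\ -x_1 + 5x_3 & = & -0.1, & / \cdot (-1) \\ \hline -2x_1 - x_2 - x_3 & = & 0.1. & \end{array}$$
This clearly proves infeasibility: for $x \geq 0$ the left-hand side is nonpositive while the
right-hand side is positive, which is impossible.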
Lemma 2.1 (Farkas’ lemma). Given $A$ and $b$ as in (2.11), exactly one of the two statements is true:
1. There exists $x \geq 0$ such that $Ax = b$.
2. There exists $y$ such that $A^T y \leq 0$ and $b^T y > 0$.
Farkas’ lemma implies that either the problem (2.11) is feasible or there is a certificate
of infeasibility $y$. In other words, every time we classify a model as infeasible, we can certify
this fact by providing an appropriate $y$, as in Example 2.3.
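Checking a certificate is much cheaper than solving the problem. The sketch below verifies the multipliers $y = (1, 2, -1)$ from Example 2.3 by testing the two conditions $A^T y \leq 0$ and $b^T y > 0$ (the form of the certificate used in Lemma 2.4) in exact arithmetic. The function name `certifies_infeasibility` is a hypothetical helper, not part of any library.

```python
from fractions import Fraction as Fr

# Equality constraints of Example 2.3, A x = b with x >= 0.
A = [[1, 1, 2],
     [-2, -1, 1],
     [-1, 0, 5]]
b = [Fr(1), Fr(-1, 2), Fr(-1, 10)]
y = [1, 2, -1]   # multipliers shown next to the constraints


def certifies_infeasibility(A, b, y):
    """Return (ok, A^T y, b^T y); ok is True when A^T y <= 0 and b^T y > 0."""
    rows, cols = len(A), len(A[0])
    ATy = [sum(A[i][j] * y[i] for i in range(rows)) for j in range(cols)]
    bTy = sum(bi * yi for bi, yi in zip(b, y))
    return all(v <= 0 for v in ATy) and bTy > 0, ATy, bTy


ok, ATy, bTy = certifies_infeasibility(A, b, y)
print(ok, ATy, bTy)
```

The vector `ATy` reproduces the coefficients of the combined constraint, and `bTy` its right-hand side.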
2.3.2 Locating infeasibility
As we already discussed, the infeasibility certificate 𝑦 gives coefficients of a linear combi-
nation of the constraints which is infeasible “in an obvious way”, that is positive on one
side and negative on the other. In some cases, 𝑦 may be very sparse, i.e. it may have
very few nonzeros, which means that already a very small subset of the constraints is
the root cause of infeasibility. This may be interesting if, for example, we are debugging
a large model which we expected to be feasible and infeasibility is caused by an error
in the problem formulation. Then we only have to consider the sub-problem formed by
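constraints with index set $\{j \mid y_j \neq 0\}$.
As a cautionary note, the certificate need not be sparse. Consider, for example, the constraints
$$0 \leq x_1 \leq x_2 \leq \cdots \leq x_n \leq -1.$$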
Any problem with those constraints is infeasible, but dropping any one of the inequal-
ities creates a feasible subproblem.
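2.4 Duality in linear optimization
Consider a linear optimization problem in standard form:
minimize   $c^T x$
subject to $Ax = b$,   (2.12)
           $x \geq 0$.
We denote the optimal objective value in (2.12) by $p^\star$. There are three possibilities:
• The problem is infeasible. By convention $p^\star = +\infty$.
• $p^\star$ is finite, in which case the problem has an optimal solution.
• $p^\star = -\infty$, meaning that there are feasible solutions with $c^T x$ decreasing to $-\infty$, in
which case we say the problem is unbounded.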
Lagrange function
We associate with (2.12) a so-called Lagrangian function 𝐿 : R𝑛 × R𝑚 × R𝑛+ → R that
augments the objective with a weighted combination of all the constraints,
𝐿(𝑥, 𝑦, 𝑠) = 𝑐𝑇 𝑥 + 𝑦 𝑇 (𝑏 − 𝐴𝑥) − 𝑠𝑇 𝑥.
The variables 𝑦 ∈ R𝑚 and 𝑠 ∈ R𝑛+ are called Lagrange multipliers or dual variables. For
any feasible 𝑥* ∈ ℱ𝑝 and any (𝑦 * , 𝑠* ) ∈ R𝑚 × R𝑛+ we have
𝐿(𝑥* , 𝑦 * , 𝑠* ) = 𝑐𝑇 𝑥* + (𝑦 * )𝑇 · 0 − (𝑠* )𝑇 𝑥* ≤ 𝑐𝑇 𝑥* .
Note that we used the nonnegativity of $s^*$, or in general of any Lagrange multiplier
associated with an inequality constraint. The dual function is defined as the minimum of
$L(x, y, s)$ over $x$. Thus the dual function of (2.12) is
$$g(y, s) = \min_x L(x, y, s) = \min_x\, x^T(c - A^T y - s) + b^T y = \begin{cases} b^T y, & c - A^T y - s = 0, \\ -\infty, & \text{otherwise}. \end{cases}$$
Dual problem
For every (𝑦, 𝑠) the value of 𝑔(𝑦, 𝑠) is a lower bound for 𝑝⋆ . To get the best such bound
we maximize 𝑔(𝑦, 𝑠) over all (𝑦, 𝑠) and get the dual problem:
maximize 𝑏𝑇 𝑦
subject to 𝑐 − 𝐴𝑇 𝑦 = 𝑠, (2.13)
𝑠 ≥ 0.
The optimal value of (2.13) will be denoted 𝑑⋆ . As in the case of (2.12) (which from now
on we call the primal problem), the dual problem can be infeasible (𝑑⋆ = −∞), have an
optimal solution (−∞ < 𝑑⋆ < +∞) or be unbounded (𝑑⋆ = +∞). Note that the roles of
−∞ and +∞ are now reversed because the dual is a maximization problem.
Example 2.5 (Dual of basis pursuit). As an example, let us derive the dual of the
basis pursuit formulation (2.7). It would be possible to add auxiliary variables and
constraints to force that problem into the standard form (2.12) and then just apply
the dual transformation as a black box, but it is both easier and more instructive to
directly write the Lagrangian:
𝐿(𝑥, 𝑧, 𝑦, 𝑢, 𝑣) = 𝑒𝑇 𝑧 + 𝑢𝑇 (𝑥 − 𝑧) − 𝑣 𝑇 (𝑥 + 𝑧) + 𝑦 𝑇 (𝑏 − 𝐴𝑥)
where 𝑒 = (1, . . . , 1)𝑇 , with Lagrange multipliers 𝑦 ∈ R𝑚 and 𝑢, 𝑣 ∈ R𝑛+ . The dual
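function
$$g(y, u, v) = \min_{x, z} L(x, z, y, u, v) = \min_{x, z}\, z^T(e - u - v) + x^T(u - v - A^T y) + y^T b$$
is only bounded below if $e = u + v$ and $A^T y = u - v$, thus the dual problem is
maximize   $b^T y$
subject to $e = u + v$,
           $A^T y = u - v$,   (2.14)
           $u, v \geq 0$.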
It is not hard to observe that an equivalent formulation of (2.14) is simply
maximize   $b^T y$
subject to $\|A^T y\|_\infty \leq 1$,   (2.15)
which should be associated with duality between norms discussed in Example 2.2.
Example 2.6 (Dual of a maximization problem). We can similarly derive the dual of
problem (2.13). If we write it simply as
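maximize   $b^T y$
subject to $c - A^T y \geq 0$,
then the Lagrangian is
$$L(y, u) = b^T y + u^T(c - A^T y) = y^T(b - Au) + c^T u$$
with $u \in \mathbb{R}^n_+$, so that now $L(y, u) \geq b^T y$ for any feasible $y$. Maximizing $L(y, u)$ over $y$
gives a finite value $c^T u$ only if $Au = b$, and minimizing this upper bound over $u$ leads to
the problem
minimize   $c^T u$
subject to $Au = b$,
           $u \geq 0$,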
so, as expected, the dual of the dual recovers the original primal problem.
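For any primal feasible $x$ and dual feasible $(y, s)$ we have
$$c^T x - b^T y = c^T x - (Ax)^T y = x^T(c - A^T y) = x^T s \geq 0,$$
so the dual objective value is a lower bound on the objective value of the primal. In
particular, any dual feasible point $(y^*, s^*)$ gives a lower bound:
$$b^T y^* \leq p^\star.$$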
Lemma 2.3 (Strong duality). If at least one of 𝑑⋆ , 𝑝⋆ is finite then 𝑑⋆ = 𝑝⋆ .
Proof. Suppose −∞ < 𝑝⋆ < ∞; the proof in the dual case is analogous. For any 𝜀 > 0
consider the feasibility problem with variable 𝑥 ≥ 0 and constraints
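$$\begin{bmatrix} -c^T \\ A \end{bmatrix} x = \begin{bmatrix} -p^\star + \varepsilon \\ b \end{bmatrix}, \quad \text{that is} \quad \begin{array}{l} c^T x = p^\star - \varepsilon, \\ Ax = b. \end{array}$$
Optimality of $p^\star$ implies that the above problem is infeasible. By Lemma 2.1 there exists
$\hat{y} = [y_0\ y]^T$ such that
$$-c\, y_0 + A^T y \leq 0 \quad \text{and} \quad (-p^\star + \varepsilon)\, y_0 + b^T y > 0.$$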
If 𝑦0 = 0 then 𝐴𝑇 𝑦 ≤ 0 and 𝑏𝑇 𝑦 > 0, which by Lemma 2.1 again would mean that the
original primal problem was infeasible, which is not the case. Hence we can rescale so
that 𝑦0 = 1 and then we get
𝑐 − 𝐴𝑇 𝑦 ≥ 0 and 𝑏𝑇 𝑦 ≥ 𝑝⋆ − 𝜀.
The first inequality above implies that 𝑦 is feasible for the dual problem. By letting 𝜀 → 0
we obtain 𝑑⋆ ≥ 𝑝⋆ .
We can exploit strong duality to freely choose between solving the primal or dual
version of any linear problem.
Example 2.7 (Sum of largest elements). Suppose that 𝑥 is now a constant vector.
Consider the following problem with variable 𝑧:
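maximize   $x^T z$
subject to $\sum_i z_i = m$,
           $0 \leq z \leq 1$.
The maximum is attained when $z$ indicates the positions of the $m$ largest entries in $x$, and
the objective value is then their sum. To derive the dual we write the Lagrangian
$$L(z, s, t, u) = x^T z + t(m - e^T z) + s^T z + u^T(e - z) = z^T(x - te + s - u) + tm + u^T e$$
with $u, s \in \mathbb{R}^n_+$. Since $s_i \geq 0$ is not otherwise constrained, the condition
$x_i - t + s_i - u_i = 0$ is the same as $u_i + t \geq x_i$, and the dual problem becomes
minimize   $mt + \sum_i u_i$
subject to $u_i + t \geq x_i$, $i = 1, \ldots, n$,
           $u_i \geq 0$, $i = 1, \ldots, n$,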
which is exactly the problem (2.10) we studied in Sec. 2.2.6. Strong duality now implies
that (2.10) computes the sum of 𝑚 biggest entries in 𝑥.
2.4.3 Duality and infeasibility: summary
We can now expand the discussion of infeasibility certificates in the context of duality.
Farkas’ lemma (Lemma 2.1) can be dualized and the two versions can be summarized as
follows:
Lemma 2.4 (Primal and dual Farkas’ lemma). For a primal-dual pair of linear problems
we have the following equivalences:
i. The primal problem (2.12) is infeasible if and only if there is 𝑦 such that 𝐴𝑇 𝑦 ≤ 0
and 𝑏𝑇 𝑦 > 0.
ii. The dual problem (2.13) is infeasible if and only if there is 𝑥 ≥ 0 such that 𝐴𝑥 = 0
and 𝑐𝑇 𝑥 < 0.
Weak and strong duality for linear optimization now lead to the following conclusions:
• If the problem is primal feasible and has finite objective value (−∞ < 𝑝⋆ < ∞)
then so is the dual and 𝑑⋆ = 𝑝⋆ . We sometimes refer to this case as primal and dual
feasible. The dual solution certifies the optimality of the primal solution and vice
versa.
• If the primal problem is feasible but unbounded (𝑝⋆ = −∞) then the dual is infeasi-
ble (𝑑⋆ = −∞). Part (ii) of Farkas’ lemma provides a certificate of this fact, that is
a vector 𝑥 with 𝑥 ≥ 0, 𝐴𝑥 = 0 and 𝑐𝑇 𝑥 < 0. In fact it is easy to give this statement
a geometric interpretation. If 𝑥0 is any primal feasible point then the infinite ray
𝑡 → 𝑥0 + 𝑡𝑥, 𝑡 ∈ [0, ∞)
belongs to the feasible set $\mathcal{F}_p$ because $A(x_0 + tx) = b$ and $x_0 + tx \geq 0$. Along this
ray the objective value is unbounded below:
$$c^T(x_0 + tx) = c^T x_0 + t\,(c^T x) \to -\infty.$$
• If the primal problem is infeasible (𝑝⋆ = ∞) then a certificate of this fact is provided
by part (i). The dual problem may be unbounded (𝑑⋆ = ∞) or infeasible (𝑑⋆ =
−∞).
Example 2.8 (Primal-dual infeasibility). Weak and strong duality imply that the only
case when $d^\star \neq p^\star$ is when both primal and dual problem are infeasible ($d^\star = -\infty$,
$p^\star = \infty$), for example:
minimize 𝑥
subject to 0 · 𝑥 = 1.
2.4.4 Dual values as shadow prices
Dual values are related to shadow prices, as they measure, under some nondegeneracy
assumption, the sensitivity of the objective value to a change in the constraint. Consider
again the primal and dual problem pair (2.12) and (2.13) with feasible sets ℱ𝑝 and ℱ𝑑
and with a primal-dual optimal solution (𝑥* , 𝑦 * , 𝑠* ).
Suppose we change one of the values in 𝑏 from 𝑏𝑖 to 𝑏′𝑖 . This corresponds to moving
one of the hyperplanes defining ℱ𝑝 , and in consequence the optimal solution (and the
objective value) may change. On the other hand, the dual feasible set ℱ𝑑 is not affected.
Assuming that the solution (𝑦 * , 𝑠* ) was a unique vertex of ℱ𝑑 this point remains optimal
for the dual after a sufficiently small change of 𝑏. But then the change of the dual
objective is
𝑦𝑖* (𝑏′𝑖 − 𝑏𝑖 )
and by strong duality the primal objective changes by the same amount.
Example 2.9 (Student diet). An optimization student wants to save money on the
diet while remaining healthy. A healthy diet requires at least 𝑃 = 6 units of protein,
𝐶 = 15 units of carbohydrates, 𝐹 = 5 units of fats and 𝑉 = 7 units of vitamins. The
student can choose from the following products:
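            P     C     F     V     price
takeaway    3     3     2     1     5
vegetables  1     2     0     4     1
bread       0.5   4     1     0     2
Writing $x_1, x_2, x_3$ for the amounts of takeaway, vegetables and bread, the cheapest
healthy diet solves
minimize   $5x_1 + x_2 + 2x_3$
subject to $3x_1 + x_2 + 0.5x_3 \geq 6$,
           $3x_1 + 2x_2 + 4x_3 \geq 15$,
           $2x_1 + x_3 \geq 5$,
           $x_1 + 4x_2 \geq 7$,
           $x_1, x_2, x_3 \geq 0$.
If $y_1, y_2, y_3, y_4$ are the dual variables associated with the four inequality constraints
then the (unique) primal-dual optimal solution to this problem is approximately:
$$(x, y) = ((1, 1.5, 3), (0.42, 0, 1.78, 0.14))$$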
with optimal cost 𝑝⋆ = 12.5. Note 𝑦2 = 0 indicates that the second constraint is not
binding. Indeed, we could increase 𝐶 to 18 without affecting the optimal solution. The
remaining constraints are binding.
Improving the intake of protein by 1 unit (increasing 𝑃 to 7) will increase the cost
by 0.42, while doing the same for fat will cost an extra 1.78 per unit. If the student
had extra money to improve one of the parameters then the best choice would be to
increase the intake of vitamins, with shadow price of just 0.14.
If one month the student only had 12 units of money and was willing to relax one
of the requirements then the best choice is to save on fats: the necessary reduction of
$F$ is smallest, namely $0.5 \cdot 1.78^{-1} = 0.28$. Indeed, with the new value of $F = 4.72$ the
same problem solves to $p^\star = 12$ and $x = (1.08, 1.48, 2.56)$.
We stress that a truly balanced diet problem should also include upper bounds.
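The numbers in this example can be verified without a solver. The sketch below builds the constraint data from the product table and checks a candidate primal-dual pair in exact arithmetic. The values $x = (1, 3/2, 3)$ and $y = (3/7, 0, 25/14, 1/7)$ are an assumption on our part: they are exact fractions consistent with the rounded shadow prices 0.42, 1.78, 0.14 and the cost 12.5 quoted in the text. The dual is taken in the standard form for a covering problem: maximize $b^T y$ subject to $A^T y \leq c$, $y \geq 0$.

```python
from fractions import Fraction as Fr

# Rows: protein, carbohydrates, fats, vitamins.
# Columns: takeaway, vegetables, bread.
A = [[Fr(3), Fr(1), Fr(1, 2)],
     [Fr(3), Fr(2), Fr(4)],
     [Fr(2), Fr(0), Fr(1)],
     [Fr(1), Fr(4), Fr(0)]]
b = [Fr(6), Fr(15), Fr(5), Fr(7)]   # required P, C, F, V
c = [Fr(5), Fr(1), Fr(2)]           # prices

# Candidate solution (assumed exact values, see the lead-in).
x = [Fr(1), Fr(3, 2), Fr(3)]
y = [Fr(3, 7), Fr(0), Fr(25, 14), Fr(1, 7)]

Ax = [sum(aij * xj for aij, xj in zip(row, x)) for row in A]
ATy = [sum(A[i][j] * y[i] for i in range(4)) for j in range(3)]

primal_ok = all(v >= bi for v, bi in zip(Ax, b)) and all(xj >= 0 for xj in x)
dual_ok = all(v <= cj for v, cj in zip(ATy, c)) and all(yi >= 0 for yi in y)
primal_cost = sum(cj * xj for cj, xj in zip(c, x))
dual_value = sum(bi * yi for bi, yi in zip(b, y))

print(primal_ok, dual_ok, primal_cost, dual_value)
print(Ax)   # nutrient intake; the carbohydrate entry shows the slack in the second constraint
```

Equal primal and dual objective values for a feasible pair certify optimality by weak duality, and the carbohydrate intake in `Ax` explains why $C$ could be raised to 18 without changing the solution.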
Chapter 3
Conic quadratic optimization
This chapter extends the notion of linear optimization with quadratic cones. Conic
quadratic optimization, also known as second-order cone optimization, is a straightfor-
ward generalization of linear optimization, in the sense that we optimize a linear func-
tion under linear (in)equalities with some variables belonging to one or more (rotated)
quadratic cones. We discuss the basic concept of quadratic cones, and demonstrate the
surprisingly large flexibility of conic quadratic modeling.
3.1 Cones
Since this is the first place where we introduce a non-linear cone, it seems suitable to
make our most important definition:
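A set $K \subseteq \mathbb{R}^n$ is called a convex cone if
• for every $x, y \in K$ we have $x + y \in K$,
• for every $x \in K$ and $\alpha \geq 0$ we have $\alpha x \in K$.
For example, a linear subspace of $\mathbb{R}^n$, the positive orthant $\mathbb{R}^n_{\geq 0}$ or any ray (half-line)
starting at the origin are convex cones. We leave it for the reader to check
that the intersection of convex cones is a convex cone; this property enables us to assemble
complicated optimization models from individual conic bricks.
We define the $n$-dimensional quadratic cone as
$$\mathcal{Q}^n = \left\{ x \in \mathbb{R}^n \mid x_1 \geq \sqrt{x_2^2 + x_3^2 + \cdots + x_n^2} \right\}. \qquad (3.1)$$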
The geometric interpretation of a quadratic (or second-order) cone is shown in Fig. 3.1
for a cone with three variables, and illustrates how the boundary of the cone resembles
an ice-cream cone. The 1-dimensional quadratic cone simply states nonnegativity 𝑥1 ≥ 0.
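Fig. 3.1: Boundary of quadratic cone $x_1 \geq \sqrt{x_2^2 + x_3^2}$ and rotated quadratic cone
$2x_1 x_2 \geq x_3^2$, $x_1, x_2 \geq 0$.
An $n$-dimensional rotated quadratic cone is defined as
$$\mathcal{Q}_r^n = \left\{ x \in \mathbb{R}^n \mid 2x_1 x_2 \geq x_3^2 + \cdots + x_n^2,\ x_1, x_2 \geq 0 \right\}. \qquad (3.2)$$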
As the name indicates, there is a simple relationship between quadratic and rotated
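quadratic cones. Define an orthogonal transformation
$$T_n := \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{2} & 0 \\ 1/\sqrt{2} & -1/\sqrt{2} & 0 \\ 0 & 0 & I_{n-2} \end{bmatrix}. \qquad (3.3)$$
Then it is easy to verify that
$$x \in \mathcal{Q}^n \iff T_n x \in \mathcal{Q}_r^n,$$
and since $T_n$ is orthogonal we call $\mathcal{Q}_r^n$ a rotated cone; the transformation corresponds to
a rotation of $\pi/4$ in the $(x_1, x_2)$ plane. For example if $x \in \mathcal{Q}^3$ and
$$\begin{bmatrix} z_1 \\ z_2 \\ z_3 \end{bmatrix} = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 \\ 0 & 0 & 1 \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} \frac{1}{\sqrt{2}}(x_1 + x_2) \\ \frac{1}{\sqrt{2}}(x_1 - x_2) \\ x_3 \end{bmatrix}$$
then
$$2 z_1 z_2 = x_1^2 - x_2^2 \geq x_3^2 = z_3^2, \quad z_1, z_2 \geq 0,$$
so that $z \in \mathcal{Q}_r^3$.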
Thus, one could argue that we only need quadratic cones 𝒬𝑛 , but there are many examples
where using an explicit rotated quadratic cone 𝒬𝑛𝑟 is more natural, as we will see next.
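The correspondence between the two cones is easy to test numerically. In the sketch below, `rotate` applies the transformation that mixes the first two coordinates by $1/\sqrt{2}$ and leaves the others unchanged, and the two membership tests implement the inequalities $x_1 \geq \sqrt{x_2^2 + \cdots + x_n^2}$ and $2x_1 x_2 \geq x_3^2 + \cdots + x_n^2$, $x_1, x_2 \geq 0$. The function names and the tolerance `tol` are illustrative assumptions; the tolerance only absorbs floating-point rounding on the boundary.

```python
import math


def in_quadratic_cone(x, tol=1e-9):
    """x_1 >= sqrt(x_2^2 + ... + x_n^2), up to a small tolerance."""
    return x[0] >= math.sqrt(sum(v * v for v in x[1:])) - tol


def in_rotated_cone(x, tol=1e-9):
    """2 x_1 x_2 >= x_3^2 + ... + x_n^2 and x_1, x_2 >= 0, up to a small tolerance."""
    return (x[0] >= -tol and x[1] >= -tol
            and 2 * x[0] * x[1] >= sum(v * v for v in x[2:]) - tol)


def rotate(x):
    """Apply T_n: mix the first two coordinates, keep the remaining ones."""
    s = math.sqrt(2.0)
    return [(x[0] + x[1]) / s, (x[0] - x[1]) / s] + list(x[2:])


inside = [5.0, 3.0, 4.0]    # on the boundary of the quadratic cone: 5 = sqrt(9 + 16)
outside = [1.0, 2.0, 0.0]   # violates x_1 >= |x_2|

print(in_quadratic_cone(inside), in_rotated_cone(rotate(inside)))
print(in_quadratic_cone(outside), in_rotated_cone(rotate(outside)))
```

Because the matrix is symmetric and orthogonal, applying `rotate` twice returns the original point, so the same function maps the rotated cone back to the quadratic cone.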
3.2 Conic quadratic modeling
In the following we describe several convex sets that can be modeled using conic quadratic
formulations or, as we call them, are conic quadratic representable.
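For instance, the epigraph of the absolute value is exactly the two-dimensional quadratic cone,
$$|x| \leq t \iff (t, x) \in \mathcal{Q}^2.$$
The epigraph of the squared Euclidean norm can be described as the intersection of a
rotated quadratic cone with an affine hyperplane,
$$x_1^2 + \cdots + x_n^2 = \|x\|_2^2 \leq t \iff (1/2, t, x) \in \mathcal{Q}_r^{n+2}.$$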
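Assume $Q \in \mathbb{R}^{n \times n}$ is a symmetric positive semidefinite matrix. The convex inequality
$$(1/2) x^T Q x + c^T x + r \leq 0$$
may be rewritten as
$$t + c^T x + r = 0, \quad x^T Q x \leq 2t. \qquad (3.4)$$
Since $Q$ is symmetric positive semidefinite, the epigraph
$$x^T Q x \leq 2t \qquad (3.5)$$
is a convex set and there exists a matrix $F \in \mathbb{R}^{k \times n}$ such that
$$Q = F^T F \qquad (3.6)$$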
(see Sec. 6 for properties of semidefinite matrices). For instance 𝐹 could be the Cholesky
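factorization of $Q$. Then
$$x^T Q x = x^T F^T F x = \|Fx\|_2^2$$
and (3.5) is equivalent to $(t, 1, Fx) \in \mathcal{Q}_r^{2+k}$. Frequently $Q$ has the structure
$$Q = I + F^T F$$
where $I$ is the identity matrix, so
$$x^T Q x = x^T x + x^T F^T F x = \|x\|_2^2 + \|Fx\|_2^2$$
and hence
$$\left( t, 1, \begin{bmatrix} I \\ F \end{bmatrix} x \right) \in \mathcal{Q}_r^{2+n+k}$$
is a conic quadratic representation of (3.5) in this case.
A second-order cone constraint is occasionally specified as
$$\|Ax + b\|_2 \leq c^T x + d, \qquad (3.7)$$
where $A \in \mathbb{R}^{m \times n}$ and $c \in \mathbb{R}^n$. The formulation (3.7) is simply
$$(c^T x + d, Ax + b) \in \mathcal{Q}^{m+1}, \qquad (3.8)$$
or equivalently
$$s = Ax + b, \quad t = c^T x + d, \quad (t, s) \in \mathcal{Q}^{m+1}. \qquad (3.9)$$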
As will be explained in Sec. 8, we refer to (3.8) as the dual form and (3.9) as the primal
form. An alternative characterization of (3.7) is
$$\|Ax + b\|_2^2 - (c^T x + d)^2 \leq 0, \quad c^T x + d \geq 0,$$
which shows that certain quadratic inequalities are conic quadratic representable.
3.2.5 Simple sets involving power functions
Some power-like inequalities are conic quadratic representable, even though it need not
be obvious at first glance. For example, we have
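$$|t| \leq \sqrt{x},\ x \geq 0 \iff (x, 1/2, t) \in \mathcal{Q}_r^3,$$
or in a similar fashion
$$t \geq \frac{1}{x},\ x \geq 0 \iff (x, t, \sqrt{2}) \in \mathcal{Q}_r^3.$$
For a more complicated example, consider the constraint
$$t \geq x^{3/2}, \quad x \geq 0.$$
This is equivalent to a conjunction of constraints with an additional variable $s$,
$$(s, t, x),\ (x, 1/8, s) \in \mathcal{Q}_r^3,$$
because
$$2st \geq x^2,\ 2 \cdot \tfrac{1}{8} x \geq s^2 \implies 4s^2 t^2 \cdot \tfrac{1}{4} x \geq x^4 \cdot s^2 \implies t \geq x^{3/2}.$$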
In practice power-like inequalities representable with similar tricks can often be expressed
much more naturally using the power cone (see Sec. 4), so we will not dwell on these
examples much longer.
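As a sanity check of the derivation for $t \geq x^{3/2}$, the following sketch tests the two inequalities $2st \geq x^2$ and $2 \cdot \frac{1}{8} \cdot x \geq s^2$ for a given pair $(t, x)$. Since the representation only asks for some auxiliary $s$ to exist, the code picks the largest value allowed by the second inequality, $s = \sqrt{x}/2$; this choice, the function names and the tolerance are assumptions made for the illustration.

```python
import math


def in_rotated_cone3(a, b, c, tol=1e-12):
    """Membership of (a, b, c) in the 3-dimensional rotated quadratic cone."""
    return a >= 0 and b >= 0 and 2 * a * b >= c * c - tol


def power_32_representable(t, x):
    """Check t >= x^(3/2), x >= 0 through two rotated cones, using s = sqrt(x)/2."""
    if x < 0:
        return False
    s = math.sqrt(x) / 2          # largest s with 2 * (1/8) * x >= s^2
    return in_rotated_cone3(s, t, x) and in_rotated_cone3(x, 1 / 8, s)


print(power_32_representable(8.0, 4.0))   # 8 = 4^(3/2), boundary point
print(power_32_representable(7.9, 4.0))   # slightly below the curve
```

Choosing the largest feasible $s$ is without loss of generality here, because a larger $s$ only makes the first inequality $2st \geq x^2$ easier to satisfy when $t \geq 0$.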
In both cases the argument boils down to the observation that the target function is a
sum of an affine expression and the inverse of a positive affine expression, see Sec. 3.2.5.
3.2.7 Harmonic mean
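Consider next the hypograph of the harmonic mean,
$$\left( \frac{1}{n} \sum_{i=1}^n x_i^{-1} \right)^{-1} \geq t \geq 0, \quad x \geq 0.$$
It is not obvious that the inequality defines a convex set, nor that it is conic quadratic
representable. However, we can rewrite it in the form
$$\sum_{i=1}^n \frac{t^2}{x_i} \leq nt,$$
which suggests the conic representation $2x_i z_i \geq t^2$, $x_i, z_i \geq 0$, $2\sum_{i=1}^n z_i = nt$, that is
$(x_i, z_i, t) \in \mathcal{Q}_r^3$ for $i = 1, \ldots, n$.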
𝑥𝑇 𝐴𝑥 ≤ 0
is equivalent to
$$\sum_{j=2}^n \alpha_j (q_j^T x)^2 \leq \alpha_1 (q_1^T x)^2. \qquad (3.13)$$
3.2.9 Ellipsoidal sets
The set
ℰ = {𝑥 ∈ R𝑛 | ‖𝑃 (𝑥 − 𝑐)‖2 ≤ 1}
describes an ellipsoid centred at $c$. It has a natural conic quadratic representation, i.e.,
$$x \in \mathcal{E} \iff (1, P(x - c)) \in \mathcal{Q}^{n+1}.$$
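The worst-case objective can be evaluated as
$$\sup_{\|u\|_2 \leq 1} (g + Fu)^T x = g^T x + \sup_{\|u\|_2 \leq 1} u^T F^T x = g^T x + \|F^T x\|_2,$$
where we used that $\sup_{\|u\|_2 \leq 1} v^T u = (v^T v)/\|v\|_2 = \|v\|_2$. Thus the robust problem (3.17)
is equivalent to
minimize   $g^T x + \|F^T x\|_2$
subject to $Ax = b$,
           $x \geq 0$,
which can be posed as the conic quadratic problem
minimize   $g^T x + t$
subject to $Ax = b$,
           $(t, F^T x) \in \mathcal{Q}^{n+1}$,   (3.18)
           $x \geq 0$.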
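Consider an investment in $n$ assets whose returns form a random vector $r$ with known mean
$$\mu = \mathbf{E} r$$
and covariance
$$\Sigma = \mathbf{E}(r - \mu)(r - \mu)^T.$$
If $x_i$ denotes the amount invested in asset $i$, the return of our investment is also a random
variable $y = r^T x$ with mean (or expected return)
$$\mathbf{E} y = \mu^T x$$
and variance (or risk)
$$\mathbf{E}(y - \mathbf{E} y)^2 = x^T \Sigma x.$$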
We then wish to rebalance our portfolio to achieve a compromise between risk and ex-
pected return, e.g., we can maximize the expected return given an upper bound 𝛾 on the
tolerable risk and a constraint that our total investment is fixed,
maximize   $\mu^T x$
subject to $x^T \Sigma x \leq \gamma$,
           $e^T x = 1$,   (3.19)
           $x \geq 0$.
Suppose we factor $\Sigma = GG^T$ (e.g., using a Cholesky or an eigenvalue decomposition). We
then get a conic formulation
maximize   $\mu^T x$
subject to $(\sqrt{\gamma}, G^T x) \in \mathcal{Q}^{n+1}$,
           $e^T x = 1$,   (3.20)
           $x \geq 0$.
In practice both the average return and covariance are estimated using historical data. A
recent trend is then to formulate a robust version of the portfolio optimization problem
to combat the inherent uncertainty in those estimates, e.g., we can constrain 𝜇 to an
ellipsoidal uncertainty set as in Sec. 3.3.2.
It is also common that the data for a portfolio optimization problem is already given
in the form of a factor model $\Sigma = F^T F$ or $\Sigma = I + F^T F$, and a conic quadratic formulation
as in Sec. 3.3.1 is most natural. For more details see Sec. 10.3.
𝑦 = 𝑧𝑥.
Since a positive 𝑧 can be chosen arbitrarily and (𝜇 − 𝑟𝑓 𝑒)𝑇 𝑥 > 0, we can without loss of
generality assume that
(𝜇 − 𝑟𝑓 𝑒)𝑇 𝑦 = 1.
Thus, we obtain the following conic problem for maximizing the Sharpe ratio,
minimize 𝑡
subject to (𝑡, 𝐺𝑇 𝑦) ∈ 𝒬𝑘+1 ,
𝑒𝑇 𝑦 = 𝑧,
(𝜇 − 𝑟𝑓 𝑒)𝑇 𝑦 = 1,
𝑦, 𝑧 ≥ 0,
3.3.5 A resource constrained production and inventory problem
The resource constrained production and inventory problem [Zie82] can be formulated as
follows:
minimize   $\sum_{j=1}^n (d_j x_j + e_j / x_j)$
subject to $\sum_{j=1}^n r_j x_j \leq b$,   (3.21)
           $x_j \geq 0$, $j = 1, \ldots, n$,
where 𝑛 denotes the number of items to be produced, 𝑏 denotes the amount of common
resource, and 𝑟𝑗 is the consumption of the limited resource to produce one unit of item 𝑗.
The objective function represents inventory and ordering costs. Let $c_j^p$ denote the holding
cost per unit of product $j$ and $c_j^r$ denote the rate of holding costs, respectively. Further,
let
$$d_j = \frac{c_j^p c_j^r}{2}$$
so that
$$d_j x_j$$
is the average holding cost for product $j$. If $D_j$ denotes the total demand for product $j$
and $c_j^o$ the ordering cost per order of product $j$ then let
$$e_j = c_j^o D_j$$
and hence
$$\frac{e_j}{x_j} = \frac{c_j^o D_j}{x_j}$$
is the average ordering cost for product $j$. In summary, the problem finds the optimal
batch size such that the inventory and ordering cost are minimized while satisfying the
constraints on the common resource. Given $d_j, e_j \geq 0$ problem (3.21) is equivalent to the
conic quadratic problem
minimize   $\sum_{j=1}^n (d_j x_j + e_j t_j)$
subject to $\sum_{j=1}^n r_j x_j \leq b$,
           $(t_j, x_j, \sqrt{2}) \in \mathcal{Q}_r^3$, $j = 1, \ldots, n$.
It is not always possible to produce a fractional number of items. In such a case $x_j$ should
be constrained to be integer; see Sec. 9.
Chapter 4
The power cone
So far we studied quadratic cones and their applications in modeling problems involving,
directly or indirectly, quadratic terms. In this part we expand the quadratic and rotated
quadratic cone family with power cones, which provide a convenient language to express
models involving powers other than 2. We must stress that although the power cones
include the quadratic cones as special cases, at the current state-of-the-art they require
more advanced and less efficient algorithms.
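The basic building block we need to consider is the three-dimensional power cone
$$\mathcal{P}_3^{\alpha, 1-\alpha} = \left\{ x \in \mathbb{R}^3 : x_1^\alpha x_2^{1-\alpha} \geq |x_3|,\ x_1, x_2 \geq 0 \right\},$$
where $0 < \alpha < 1$. More generally, we can also consider power cones with “long left-hand
side”. That is, for $m < n$ and a sequence of positive exponents $\alpha_1, \ldots, \alpha_m$ with
$\alpha_1 + \cdots + \alpha_m = 1$, we have the most general power cone object defined as
$$\mathcal{P}_n^{\alpha_1, \cdots, \alpha_m} = \left\{ x \in \mathbb{R}^n : \prod_{i=1}^m x_i^{\alpha_i} \geq \sqrt{\sum_{i=m+1}^n x_i^2},\ x_1, \ldots, x_m \geq 0 \right\}. \qquad (4.4)$$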
The left-hand side is nothing but the weighted geometric mean of the 𝑥𝑖 , 𝑖 = 1, . . . , 𝑚
with weights 𝛼𝑖 . As we will see later, also this most general cone can be modeled as a
composition of three-dimensional cones 𝒫3𝛼,1−𝛼 , so in a sense that is the basic object of
interest.
There are some notable special cases we are familiar with. If we let $\alpha \to 0$ then in the
limit we get $\mathcal{P}_n^{0,1} = \mathbb{R}_+ \times \mathcal{Q}^{n-1}$. If $\alpha = \frac{1}{2}$ then we have a rescaled version of the rotated
quadratic cone, precisely:
$$(x_1, x_2, \ldots, x_n) \in \mathcal{P}_n^{\frac{1}{2}, \frac{1}{2}} \iff (x_1/\sqrt{2}, x_2/\sqrt{2}, x_3, \ldots, x_n) \in \mathcal{Q}_r^n.$$
Fig. 4.1: The boundary of 𝒫3𝛼,1−𝛼 seen from a point inside the cone for 𝛼 =
0.1, 0.2, 0.35, 0.5.
4.2 Sets representable using the power cone
In this section we give basic examples of constraints which can be expressed using power
cones.
4.2.1 Powers
For all values of $p \neq 0, 1$ we can bound $x^p$ depending on the convexity of $f(x) = x^p$.
• For $p > 1$ the inequality $t \geq |x|^p$ is equivalent to $t^{1/p} \geq |x|$ and hence corresponds
to
$$t \geq |x|^p \iff (t, 1, x) \in \mathcal{P}_3^{1/p, 1-1/p}.$$
For instance $t \geq |x|^{1.5}$ is equivalent to $(t, 1, x) \in \mathcal{P}_3^{2/3, 1/3}$.
• For $0 < p < 1$ the function $f(x) = x^p$ is concave for $x \geq 0$ and so we get
$$|t| \leq x^p,\ x \geq 0 \iff (x, 1, t) \in \mathcal{P}_3^{p, 1-p}.$$
• For $p < 0$ the function $f(x) = x^p$ is convex for $x > 0$ and in this range the inequality
$t \geq x^p$ is equivalent to
$$t \geq x^p \iff t^{1/(1-p)} x^{-p/(1-p)} \geq 1 \iff (t, x, 1) \in \mathcal{P}_3^{1/(1-p), -p/(1-p)}.$$
For example $t \geq \frac{1}{\sqrt{x}}$ is the same as $(t, x, 1) \in \mathcal{P}_3^{2/3, 1/3}$.
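Let $p \geq 1$. The $p$-norm cone is the set of points $(t, x) \in \mathbb{R}^{n+1}$ with
$t \geq \|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p}$.
For $p = 2$ this is precisely the quadratic cone. We can model the $p$-norm cone by writing
the inequality $t \geq \|x\|_p$ as:
$$t \geq \sum_i |x_i|^p / t^{p-1}$$
and bounding each summand with a power cone. This leads to the following model:
$$r_i t^{p-1} \geq |x_i|^p \quad ((r_i, t, x_i) \in \mathcal{P}_3^{1/p, 1-1/p}), \qquad \sum_i r_i = t. \qquad (4.6)$$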
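The $p$-norm model can be checked on a small instance. The sketch below computes, for given $x$ and $t > 0$, the smallest auxiliary values $r_i = |x_i|^p / t^{p-1}$ and tests membership of $(r_i, t, x_i)$ in the three-dimensional power cone with exponents $(1/p, 1 - 1/p)$. The helper names, the sample vector and the tolerance are hypothetical choices for the illustration.

```python
def in_power_cone3(x1, x2, x3, alpha, tol=1e-9):
    """x1^alpha * x2^(1-alpha) >= |x3| with x1, x2 >= 0, up to a small tolerance."""
    return x1 >= 0 and x2 >= 0 and x1 ** alpha * x2 ** (1 - alpha) >= abs(x3) - tol


def pnorm_witness(x, t, p):
    """Smallest r_i satisfying r_i * t^(p-1) >= |x_i|^p (requires t > 0)."""
    return [abs(xi) ** p / t ** (p - 1) for xi in x]


x, p = [3.0, -4.0], 2.0
t = 5.0                      # equals the 2-norm of x
r = pnorm_witness(x, t, p)

print(r, sum(r))             # the r_i add up to t exactly when t = ||x||_p
print(all(in_power_cone3(ri, t, xi, 1 / p) for ri, xi in zip(r, x)))
```

For a value of $t$ below $\|x\|_p$ the smallest admissible $r_i$ already sum to more than $t$, so the model has no feasible point, as expected.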
When 0 < 𝑝 < 1 or 𝑝 < 0 the formula for ‖𝑥‖𝑝 gives a concave, rather than convex
function on $\mathbb{R}^n_+$ and in this case it is possible to model the set
$$\left\{ (t, x) : 0 \leq t \leq \left( \sum_i x_i^p \right)^{1/p},\ x_i \geq 0 \right\}, \quad p < 1,\ p \neq 0.$$
4.2.3 The most general power cone
Consider the most general version of the power cone with “long left-hand side” defined in
(4.4). We show that it can be expressed using the basic three-dimensional cones 𝒫3𝛼,1−𝛼 .
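Clearly it suffices to consider a short right-hand side, that is to model the cone
$$\mathcal{P}_{m+1}^{\alpha_1, \ldots, \alpha_m} = \left\{ x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_m^{\alpha_m} \geq |z|,\ x_1, \ldots, x_m \geq 0 \right\}, \qquad (4.7)$$
where $\sum_i \alpha_i = 1$. Denote $s = \alpha_1 + \cdots + \alpha_{m-1}$. We split (4.7) into two constraints
$$x_1^{\alpha_1/s} \cdots x_{m-1}^{\alpha_{m-1}/s} \geq |t|,\ x_1, \ldots, x_{m-1} \geq 0, \qquad t^s x_m^{\alpha_m} \geq |z|,\ x_m \geq 0, \qquad (4.8)$$
and this way we expressed $\mathcal{P}_{m+1}^{\alpha_1, \ldots, \alpha_m}$ using two power cones $\mathcal{P}_m^{\alpha_1/s, \ldots, \alpha_{m-1}/s}$ and $\mathcal{P}_3^{s, \alpha_m}$.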
Proceeding by induction gives the desired splitting.
$$(x_1, x_2, \ldots, x_m, 1, z) \in \mathcal{P}_{m+2}^{\alpha_1/\beta, \ldots, \alpha_m/\beta, s}$$
with $s = 1 - \sum_i \alpha_i/\beta$.
portfolio:
maximize   $\mu^T x$
subject to $\sqrt{x^T \Sigma x} \leq \gamma$,   (4.10)
           $x_i \geq 0$, $i = 1, \ldots, n$,
In a realistic model we would have to consider transaction costs which decrease the
expected return. In particular if a really large volume is traded then the trade itself will
affect the price of the asset, a phenomenon called market impact. It is typically modeled
by decreasing the expected return of $i$-th asset by a slippage cost proportional to $x_i^\beta$ for
some $\beta > 1$, so that the objective function changes to
maximize   $\mu^T x - \sum_i \delta_i x_i^\beta$.
A popular choice is $\beta = 3/2$. This objective can easily be modeled with a power cone as
in Sec. 4.2:
maximize   $\mu^T x - \delta^T t$
subject to $(t_i, 1, x_i) \in \mathcal{P}_3^{1/\beta, 1-1/\beta}$   $(t_i \geq x_i^\beta)$,
           $\cdots$
In particular if $\beta = 3/2$ the inequality $t_i \geq x_i^{3/2}$ has conic representation
$(t_i, 1, x_i) \in \mathcal{P}_3^{2/3, 1/3}$.
maximize   $t$
subject to $t \leq (x_1 \cdots x_n)^{1/n}$,
           $x_i \geq 0$,   (4.11)
           $(p_1 + e_1 x_1, \ldots, p_n + e_n x_n) \in K$, $\forall e_1, \ldots, e_n \in \{0, 1\}$,
where the last constraint states that all vertices of the cuboid are in 𝐾. The optimal
volume is then 𝑣 = 𝑡𝑛 . Modeling the geometric mean with power cones was discussed in
Sec. 4.2.4.
Maximizing the volume of an arbitrary (not necessarily axis-parallel) cuboid inscribed
in 𝐾 is no longer a convex problem. However, it can be solved by maximizing the solution
to (4.11) over all sets 𝑇 (𝐾) where 𝑇 is a rotation (orthogonal matrix) in R𝑛 . In practice
one can approximate the global solution by sampling sufficiently many rotations 𝑇 or
using more advanced methods of optimization over the orthogonal group.
Fig. 4.2: The maximal volume cuboid inscribed in the regular icosahedron takes up approximately 0.388 of the volume of the icosahedron (the exact value is $3(1 + \sqrt{5})/25$).
that is a point which minimizes the sum of distances to all the given points. Here ‖ · ‖
can be any norm on R𝑛 . The most classical case is the Euclidean norm, where the
geometric median is the solution to the basic facility location problem minimizing total
transportation cost from one depot to given destinations.
For a general 𝑝-norm ‖𝑥‖𝑝 with 1 ≤ 𝑝 < ∞ (see Sec. 4.2.2) the geometric median is
the solution to the obvious conic problem:
\[ \begin{array}{ll} \text{minimize} & \sum_i t_i \\ \text{subject to} & t_i \ge \|y - x_i\|_p, \\ & y \in \mathbb{R}^n. \end{array} \tag{4.12} \]
In Sec. 4.2.2 we showed how to model the 𝑝-norm bound using 𝑛 power cones.
The Fermat-Torricelli point of a triangle is the Euclidean geometric median of its vertices, and a classical theorem in planar geometry (due to Torricelli, posed by Fermat) states that it is the unique point inside the triangle from which each edge is visible at the angle of $120^\circ$ (or a vertex if the triangle has an angle of $120^\circ$ or more). Using (4.12) we can compute the $p$-norm analogues of the Fermat-Torricelli point. Some examples are shown in Fig. 4.3.
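The $120^\circ$ property can be observed numerically. The sketch below does not solve the conic problem (4.12); it swaps in the classical Weiszfeld fixed-point iteration, which is a different and simpler method that applies to the Euclidean case only. The triangle coordinates and function names are illustrative assumptions.

```python
import math

def geometric_median(points, iters=2000):
    """Weiszfeld iteration for the Euclidean geometric median of planar points."""
    y = [sum(p[0] for p in points) / len(points),
         sum(p[1] for p in points) / len(points)]   # start at the centroid
    for _ in range(iters):
        num_x = num_y = den = 0.0
        for px, py in points:
            d = math.hypot(y[0] - px, y[1] - py)
            if d < 1e-15:            # iterate landed on a data point
                return (px, py)
            num_x += px / d
            num_y += py / d
            den += 1.0 / d
        y = [num_x / den, num_y / den]
    return (y[0], y[1])

def angle_at(y, p, q):
    """Angle p-y-q in degrees."""
    ax, ay = p[0] - y[0], p[1] - y[1]
    bx, by = q[0] - y[0], q[1] - y[1]
    c = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

# A triangle whose angles are all below 120 degrees (illustrative data).
triangle = [(0.0, 0.0), (4.0, 0.0), (1.0, 3.0)]
y = geometric_median(triangle)
angles = [angle_at(y, triangle[i], triangle[(i + 1) % 3]) for i in range(3)]
print(y, angles)
```

For this triangle each of the three computed angles should be close to 120 degrees, in agreement with the theorem.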
Fig. 4.3: The geometric median of three triangle vertices in various 𝑝-norms.
𝑔˜ : [𝑦1 , 𝑦𝑛 ] → R+
with break points at (𝑦𝑖 , 𝑥𝑖 ), 𝑖 = 1, . . . , 𝑛, where the variables 𝑥𝑖 > 0 are estimators for
𝑔(𝑦𝑖 ). The slope of the 𝑖-th linear segment of 𝑔˜ is
\[ \frac{x_{i+1} - x_i}{y_{i+1} - y_i}. \]
Hence the convexity requirement leads to the constraints
\[ \frac{x_{i+1} - x_i}{y_{i+1} - y_i} \le \frac{x_{i+2} - x_{i+1}}{y_{i+2} - y_{i+1}}, \quad i = 1, \ldots, n - 2. \]
Recall the area under the density function must be 1. Hence,
\[ \sum_{i=1}^{n-1} (y_{i+1} - y_i) \left( \frac{x_{i+1} + x_i}{2} \right) = 1 \]
Chapter 5
Thus the exponential cone is the closure in R3 of the set of points which satisfy
which immediately shows that 𝐾exp is in fact a cone, i.e. 𝛼𝑥 ∈ 𝐾exp for 𝑥 ∈ 𝐾exp and
𝛼 ≥ 0. Convexity of 𝐾exp follows from the fact that the Hessian of 𝑓 (𝑥, 𝑦) = 𝑦 exp(𝑥/𝑦),
namely
\[ D^2(f) = e^{x/y} \begin{bmatrix} y^{-1} & -x y^{-2} \\ -x y^{-2} & x^2 y^{-3} \end{bmatrix} \]
Fig. 5.1: The boundary of the exponential cone $K_{\exp}$. On the left, the red isolines are graphs of $x_2 \to x_2 \log(x_1/x_2)$ for fixed $x_1$, see (5.3). On the right, they are graphs of $x_3 \to x_2 e^{x_3/x_2}$ for fixed $x_2$.
5.2.1 Exponential
The epigraph $t \ge e^x$ is a section of $K_{\exp}$:
5.2.2 Logarithm
Similarly, we can express the hypograph 𝑡 ≤ log 𝑥, 𝑥 ≥ 0:
5.2.3 Entropy
The entropy function 𝐻(𝑥) = −𝑥 log 𝑥 can be maximized using the following representa-
tion which follows directly from (5.3):
5.2.4 Relative entropy
The relative entropy or Kullback-Leibler divergence of two probability distributions is defined in terms of the function $D(x, y) = x \log(x/y)$. It is convex, and the minimization problem $t \ge D(x, y)$ is equivalent to
Because of this reparametrization the exponential cone is also referred to as the relative
entropy cone, leading to a class of problems known as REPs (relative entropy problems).
Having the relative entropy function available makes it possible to express epigraphs of
other functions appearing in REPs, for instance:
𝑢 + 𝑣 ≤ 1,
(𝑢, 1, 𝑥 − 𝑡) ∈ 𝐾exp , (5.8)
(𝑣, 1, −𝑡) ∈ 𝐾exp .
5.2.6 Log-sum-exp
We can generalize the previous example to a log-sum-exp (logarithm of sum of exponen-
tials) expression
\[ t \ge \log(e^{x_1} + \cdots + e^{x_n}). \]
\[ e^{x_1 - t} + \cdots + e^{x_n - t} \le 1, \]
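The two displayed forms of the log-sum-exp bound can be compared numerically. The sketch below is only a plain check of the inequality, not a conic model; the function names and the sample vector are illustrative assumptions.

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(exp(x_1) + ... + exp(x_n))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def satisfies_sum_form(t, xs, tol=1e-12):
    """Check exp(x_1 - t) + ... + exp(x_n - t) <= 1."""
    return sum(math.exp(x - t) for x in xs) <= 1.0 + tol

xs = [0.3, -1.2, 2.5]
t_star = log_sum_exp(xs)                          # smallest feasible t
print(t_star, satisfies_sum_form(t_star, xs))
print(satisfies_sum_form(t_star - 0.1, xs))       # slightly too small: infeasible
```

Each term $e^{x_i - t}$ can then be bounded by its own variable $u_i$ through a single exponential cone, with $\sum_i u_i \le 1$ as a linear constraint.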
5.2.7 Log-sum-inv
The following type of bound has applications in capacity optimization for wireless network design:
\[ t \ge \log\left( \frac{1}{x_1} + \cdots + \frac{1}{x_n} \right), \quad x_i > 0. \]
Since the logarithm is increasing, we can model this using a log-sum-exp and an exponential as:
\[ \begin{array}{l} t \ge \log(e^{y_1} + \cdots + e^{y_n}), \\ x_i \ge e^{-y_i}, \quad i = 1, \ldots, n. \end{array} \]
Alternatively, one can also rewrite the original constraint in equivalent form:
\[ e^{-t} \le s \le \left( \frac{1}{x_1} + \cdots + \frac{1}{x_n} \right)^{-1}, \quad x_i > 0, \]
and then model the right-hand side inequality using the technique from Sec. 3.2.7. This
approach requires only one exponential cone.
𝑊 (𝑥)𝑒𝑊 (𝑥) = 𝑥.
It is the real branch of a more general function which appears in applications such as diode
modeling. The 𝑊 function is concave. Although there is no explicit analytic formula for
𝑊 (𝑥), the hypograph {(𝑥, 𝑡) : 0 ≤ 𝑥, 0 ≤ 𝑡 ≤ 𝑊 (𝑥)} has an equivalent description:
\[ x \ge t e^{t} = t e^{t^2/t} \]
and so it can be modeled with a mix of exponential and quadratic cones (see Sec. 3.1.2):
5.2.10 Other simple sets
Here are a few more typical sets which can be expressed using the exponential and
quadratic cones. The presentations should be self-explanatory; we leave the simple veri-
fications to the reader.
where the exponents 𝑎𝑖 are arbitrary real numbers and 𝑐 > 0. A posynomial (positive
polynomial) is a sum of monomials. Thus the difference between a posynomial and a
standard notion of a multi-variate polynomial known from algebra or calculus is that (i)
posynomials can have arbitrary exponents, not just integers, but (ii) they can only have
positive coefficients.
For example, the following functions are monomials (in variables 𝑥, 𝑦, 𝑧):
\[ xy, \quad 2x^{1.5} y^{-1} x^{0.3}, \quad 3\sqrt{xy/z}, \quad 1 \tag{5.12} \]
minimize 𝑓0 (𝑥)
subject to 𝑓𝑖 (𝑥) ≤ 1, 𝑖 = 1, . . . , 𝑚, (5.14)
𝑥𝑗 > 0, 𝑗 = 1, . . . , 𝑛,
where 𝑓0 , . . . , 𝑓𝑚 are posynomials and 𝑥 = (𝑥1 , . . . , 𝑥𝑛 ) is the variable vector.
A geometric program (5.14) can be modeled in exponential conic form by making a
substitution
\[ x_j = e^{y_j}, \quad j = 1, \ldots, n. \]
where 𝑎𝑖,𝑘 ∈ R𝑛 and 𝑐𝑖,𝑘 ∈ R for all 𝑖, 𝑘. These are now log-sum-exp constraints we
already discussed in Sec. 5.2.6. In particular, the problem (5.15) is convex, as opposed
to the posynomial formulation (5.14).
Example
We demonstrate this reduction on a simple example. Take the geometric problem
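\[ \begin{array}{ll} \text{minimize} & x + y^2 z \\ \text{subject to} & 0.1\sqrt{x} + 2y^{-1} \le 1, \\ & z^{-1} + y x^{-2} \le 1. \end{array} \]
By substituting $x = e^u$, $y = e^v$, $z = e^w$ we get
\[ \begin{array}{ll} \text{minimize} & t \\ \text{subject to} & \log(e^{u} + e^{2v+w}) \le t, \\ & \log(e^{0.5u + \log 0.1} + e^{-v + \log 2}) \le 0, \\ & \log(e^{-w} + e^{v - 2u}) \le 0, \end{array} \]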
and using the log-sum-exp reduction from Sec. 5.2.6 we write an explicit conic problem:
minimize 𝑡
subject to (𝑝1 , 1, 𝑢 − 𝑡), (𝑞1 , 1, 2𝑣 + 𝑤 − 𝑡) ∈ 𝐾exp , 𝑝1 + 𝑞1 ≤ 1,
(𝑝2 , 1, 0.5𝑢 + log 0.1), (𝑞2 , 1, −𝑣 + log 2) ∈ 𝐾exp , 𝑝2 + 𝑞2 ≤ 1,
(𝑝3 , 1, −𝑤), (𝑞3 , 1, 𝑣 − 2𝑢) ∈ 𝐾exp , 𝑝3 + 𝑞3 ≤ 1.
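The substitution step can be sanity-checked by evaluating both formulations at the same point. This is a minimal sketch; the test point $(u, v, w)$ is arbitrary and the function names are illustrative.

```python
import math

def posynomials(x, y, z):
    """Objective and constraint posynomials of the example in the original variables."""
    f0 = x + y**2 * z
    f1 = 0.1 * math.sqrt(x) + 2.0 / y
    f2 = 1.0 / z + y / x**2
    return f0, f1, f2

def log_sum_exp_forms(u, v, w):
    """The same three functions after x = e^u, y = e^v, z = e^w and taking logarithms."""
    g0 = math.log(math.exp(u) + math.exp(2 * v + w))
    g1 = math.log(math.exp(0.5 * u + math.log(0.1)) + math.exp(-v + math.log(2.0)))
    g2 = math.log(math.exp(-w) + math.exp(v - 2 * u))
    return g0, g1, g2

u, v, w = 0.7, 1.1, -0.4
f = posynomials(math.exp(u), math.exp(v), math.exp(w))
g = log_sum_exp_forms(u, v, w)
gaps = [abs(math.log(fi) - gi) for fi, gi in zip(f, g)]
print(gaps)
```

Since $\log$ is increasing, $f_i(x) \le 1$ holds exactly when the corresponding log-sum-exp expression is at most 0.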
Monomials
If $m(x)$ is a monomial then the constraint $m(x) = c$ is equivalent to two posynomial inequalities $m(x) c^{-1} \le 1$ and $m(x)^{-1} c \le 1$, so it can be expressed in the language of geometric programs. In practice it should be added to the model (5.15) as a linear constraint
\[ \sum_k a_k y_k = \log c. \]
• If $f, g$ are posynomials, $m$ is a monomial and we know that $m(x) \ge g(x)$ then the constraint $\frac{f(x)}{m(x) - g(x)} \le t$ is equivalent to $t^{-1} f(x) m(x)^{-1} + g(x) m(x)^{-1} \le 1$.
where 𝑚𝑘 (𝑥) are monomials. After the change of variables 𝑥 = 𝑒𝑦 we get a slightly
modified version of (5.15):
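\[ \begin{array}{ll} \text{minimize} & t + b^T y \\ \text{subject to} & \sum_k \exp(a_{0,k,*}^T y + \log c_{0,k}) \le t, \\ & \log\left( \sum_k \exp(a_{i,k,*}^T y + \log c_{i,k}) \right) \le 0, \quad i = 1, \ldots, m, \end{array} \]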
(note the lack of one logarithm) which can still be expressed with exponential cones.
5.3.3 Geometric programming case studies
Frobenius norm diagonal scaling
Suppose we have a matrix $M \in \mathbb{R}^{n \times n}$ and we want to rescale the coordinate system using a diagonal matrix $D = \mathrm{Diag}(d_1, \ldots, d_n)$ with $d_i > 0$. In the new basis the linear transformation given by $M$ will now be described by the matrix $D M D^{-1} = (d_i M_{ij} d_j^{-1})_{i,j}$. To choose $D$ which leads to a “small” rescaling we can for example minimize the Frobenius norm
\[ \|D M D^{-1}\|_F^2 = \sum_{ij} \left( (D M D^{-1})_{ij} \right)^2 = \sum_{ij} M_{ij}^2 d_i^2 d_j^{-2}. \]
Minimizing the last sum is an example of a geometric program with variables 𝑑𝑖 (and
without constraints).
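As a small illustration, the squared Frobenius norm of the rescaled matrix can be evaluated both directly and through the posynomial in $d_i$. The $2 \times 2$ matrix and the two scalings below are made-up data, and the function names are illustrative.

```python
def scaled_frobenius_sq(M, d):
    """||D M D^{-1}||_F^2 computed from the rescaled entries d_i * M_ij / d_j."""
    n = len(M)
    scaled = [[d[i] * M[i][j] / d[j] for j in range(n)] for i in range(n)]
    return sum(v * v for row in scaled for v in row)

def posynomial_objective(M, d):
    """The same quantity written as the posynomial sum_ij M_ij^2 d_i^2 d_j^-2."""
    n = len(M)
    return sum(M[i][j]**2 * d[i]**2 * d[j]**(-2) for i in range(n) for j in range(n))

M = [[1.0, 100.0], [0.01, 2.0]]     # badly scaled example matrix
d_identity = [1.0, 1.0]
d_balanced = [1.0, 100.0]           # makes both off-diagonal entries equal to 1

print(scaled_frobenius_sq(M, d_identity), scaled_frobenius_sq(M, d_balanced))
```

With the balanced scaling the squared norm drops from about 10005 to 7, and the two ways of computing it agree.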
For example, if (𝑛0 , 𝑛1 , 𝑛2 ) = (30, 53, 16) then the above problem solves with 𝑝 = 0.29.
An Olympiad problem
The 26th Vojtěch Jarník International Mathematical Competition, Ostrava 2016. Let $a, b, c$ be positive real numbers with $a + b + c = 1$. Prove that
\[ \left( \frac{1}{a} + \frac{1}{bc} \right) \left( \frac{1}{b} + \frac{1}{ac} \right) \left( \frac{1}{c} + \frac{1}{ab} \right) \ge 1728 \]
Using the tricks introduced in Sec. 5.3.2 we formulate this problem as a geometric pro-
gram:
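\[ \begin{array}{ll} \text{minimize} & pqr \\ \text{subject to} & p^{-1} a^{-1} + p^{-1} b^{-1} c^{-1} \le 1, \\ & q^{-1} b^{-1} + q^{-1} a^{-1} c^{-1} \le 1, \\ & r^{-1} c^{-1} + r^{-1} a^{-1} b^{-1} \le 1, \\ & a + b + c \le 1. \end{array} \]
Unsurprisingly, the optimal value of this program is 1728, achieved for $(a, b, c, p, q, r) = \left( \frac{1}{3}, \frac{1}{3}, \frac{1}{3}, 12, 12, 12 \right)$.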
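The claimed optimum is easy to probe numerically. The sketch below evaluates the left-hand side at $a = b = c = 1/3$ and at random points of the simplex; it is a sampling check under an arbitrary seed, not a proof and not a geometric programming solve.

```python
import random

def lhs(a, b, c):
    """Left-hand side of the inequality."""
    return (1 / a + 1 / (b * c)) * (1 / b + 1 / (a * c)) * (1 / c + 1 / (a * b))

value_at_centre = lhs(1 / 3, 1 / 3, 1 / 3)   # each factor equals 3 + 9 = 12

random.seed(0)
worst = float("inf")
for _ in range(10000):
    u, v, w = (random.uniform(0.01, 1.0) for _ in range(3))
    s = u + v + w
    worst = min(worst, lhs(u / s, v / s, w / s))   # normalized so that a + b + c = 1

print(value_at_centre, worst)
```

No sampled point should fall below the value attained at the centre of the simplex.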
Maximizing the minimal SINR over all receivers (max min𝑖 𝑠𝑖 ), subject to bounded
power output of the transmitters, is equivalent to the geometric program with variables
𝑝1 , . . . , 𝑝𝑛 , 𝑡:
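\[ \begin{array}{ll} \text{minimize} & t^{-1} \\ \text{subject to} & p_{\min} \le p_j \le p_{\max}, \quad j = 1, \ldots, n, \\ & t \left( \sigma_i + \sum_{j \ne i} G_{ij} p_j \right) G_{ii}^{-1} p_i^{-1} \le 1, \quad i = 1, \ldots, n. \end{array} \tag{5.17} \]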
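\[ \begin{array}{ll} \text{minimize} & s_1^{-1} \cdots s_n^{-1} \\ \text{subject to} & p_{\min} \le p_j \le p_{\max}, \quad j = 1, \ldots, n, \\ & s_i \left( \sigma_i + \sum_{j \ne i} G_{ij} p_j \right) G_{ii}^{-1} p_i^{-1} \le 1, \quad i = 1, \ldots, n. \end{array} \tag{5.18} \]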
5.4.1 Risk parity portfolio
Consider a simple version of the Markowitz portfolio optimization problem introduced in Sec. 3.3.3, where we simply ask to minimize the risk $r(x) = \sqrt{x^T \Sigma x}$ of a fully-invested long-only portfolio:
\[ \begin{array}{ll} \text{minimize} & \sqrt{x^T \Sigma x} \\ \text{subject to} & \sum_{i=1}^n x_i = 1, \\ & x_i \ge 0, \quad i = 1, \ldots, n, \end{array} \tag{5.19} \]
where $\Sigma$ is a symmetric positive definite covariance matrix. We can derive from the first-order optimality conditions that the solution to (5.19) satisfies $\frac{\partial r}{\partial x_i} = \frac{\partial r}{\partial x_j}$ whenever $x_i, x_j > 0$, i.e. marginal risk contributions of positively invested assets are equal. In practice this often leads to concentrated portfolios, whereas it would benefit diversification to consider portfolios where all assets have the same total contribution to risk:
\[ x_i \frac{\partial r}{\partial x_i} = x_j \frac{\partial r}{\partial x_j}, \quad i, j = 1, \ldots, n. \tag{5.20} \]
We call (5.20) the risk parity condition. It indeed models equal risk contribution from all the assets, because as one can easily check $\frac{\partial r}{\partial x_i} = \frac{(\Sigma x)_i}{\sqrt{x^T \Sigma x}}$ and
\[ r(x) = \sum_i x_i \frac{\partial r}{\partial x_i}. \]
Risk parity portfolios satisfying condition (5.20) can be found with an auxiliary optimization problem:
\[ \begin{array}{ll} \text{minimize} & \sqrt{x^T \Sigma x} - c \sum_i \log x_i \\ \text{subject to} & x_i > 0, \quad i = 1, \ldots, n, \end{array} \tag{5.21} \]
for any $c > 0$. More precisely, the gradient of the objective function in (5.21) is zero when $\frac{\partial r}{\partial x_i} = c/x_i$ for all $i$, implying the parity condition (5.20) holds. Since (5.20) is scale-invariant, we can rescale any solution of (5.21) and get a fully-invested risk parity portfolio with $\sum_i x_i = 1$.
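The conic form of problem (5.21) is:
\[ \begin{array}{ll} \text{minimize} & t - c e^T s \\ \text{subject to} & (t, \Sigma^{1/2} x) \in \mathcal{Q}^{n+1}, \quad (t \ge \sqrt{x^T \Sigma x}), \\ & (x_i, 1, s_i) \in K_{\exp}, \quad (s_i \le \log x_i). \end{array} \tag{5.22} \]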
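The stationarity argument behind (5.21) can be checked by hand in a special case. The sketch below assumes a diagonal covariance matrix $\Sigma = \mathrm{Diag}(\sigma_1^2, \ldots, \sigma_n^2)$ with made-up variances; in that case one can verify that $x_i = c\sqrt{n}/\sigma_i$ makes the gradient of (5.21) vanish. This closed form is specific to the diagonal assumption and is not taken from the text; a general $\Sigma$ requires solving (5.22) with a conic solver.

```python
import math

def risk(x, sigma2):
    """r(x) = sqrt(x^T Sigma x) for Sigma = Diag(sigma2)."""
    return math.sqrt(sum(s * xi * xi for s, xi in zip(sigma2, x)))

def marginal_risk(x, sigma2):
    """dr/dx_i = (Sigma x)_i / r(x) for diagonal Sigma."""
    r = risk(x, sigma2)
    return [s * xi / r for s, xi in zip(sigma2, x)]

sigma2 = [0.04, 0.09, 0.25]   # illustrative variances
c = 1.0
n = len(sigma2)

# Candidate stationary point of (5.21) in the diagonal case.
x = [c * math.sqrt(n) / math.sqrt(s) for s in sigma2]

grad = [m - c / xi for m, xi in zip(marginal_risk(x, sigma2), x)]   # gradient of (5.21)
contrib = [xi * m for xi, m in zip(x, marginal_risk(x, sigma2))]    # x_i * dr/dx_i
weights = [xi / sum(x) for xi in x]                                 # rescaled to sum to 1
print(grad, contrib, weights)
```

All total risk contributions come out equal (to $c$), and the rescaled weights are inversely proportional to the standard deviations, as expected for uncorrelated assets.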
where ℐ defines additional constraints on the probability distribution 𝑝 (these are known
as prior information). In the absence of complete information about 𝑝 the maximum
entropy principle of Jaynes posits to choose the distribution which maximizes uncertainty,
that is entropy, subject to what is known. Practitioners think of the solution to (5.23) as
the most random or most conservative of distributions consistent with ℐ.
Maximization of the entropy function 𝐻(𝑥) = −𝑥 log 𝑥 was explained in Sec. 5.2.
Often one has an a priori distribution $q$, and one tries to minimize the distance between $p$ and $q$, while remaining consistent with $\mathcal{I}$. In this case it is standard to minimize the Kullback-Leibler divergence
\[ \mathcal{D}_{KL}(p \| q) = \sum_i p_i \log(p_i/q_i) \]
\[ \begin{array}{ll} \text{minimize} & \sum_i p_i \log(p_i/q_i) \\ \text{subject to} & \sum_i p_i = 1, \\ & p_i \ge 0, \\ & p \in \mathcal{I}. \end{array} \tag{5.24} \]
where we assume for simplicity that 𝐴 = Diag(𝑎1 , . . . , 𝑎𝑛 ) with 𝑎𝑖 < 0 with initial
condition x(0)𝑖 = 𝑥𝑖 . The resulting dynamical system x(𝑡) = x(0) exp(𝐴𝑡) converges to
0 and one can ask, for instance, for the time it takes to approach the limit up to distance
𝜀. The resulting optimization problem is
\[ \begin{array}{ll} \text{minimize} & t \\ \text{subject to} & \sqrt{\sum_i (x_i \exp(a_i t))^2} \le \varepsilon, \end{array} \]
minimize 𝑡
subject to (𝜀, 𝑥1 𝑞1 , . . . , 𝑥𝑛 𝑞𝑛 ) ∈ 𝒬𝑛+1 ,
(𝑞𝑖 , 1, 𝑎𝑖 𝑡) ∈ 𝐾exp , 𝑖 = 1, . . . , 𝑛.
See Fig. 5.2 for an example. Other criteria for the target set of the trajectories are also
possible. For example, polyhedral constraints
\[ c^T \mathbf{x} \le d, \quad c \in \mathbb{R}^n_+, \ d \in \mathbb{R}_+ \]
are also expressible in exponential conic form for starting points $\mathbf{x}(0) \in \mathbb{R}^n_+$, since they correspond to log-sum-exp constraints of the form
\[ \log\left( \sum_i \exp(a_i t + \log(c_i x_i)) \right) \le \log d. \]
Fig. 5.2: With 𝐴 = Diag(−0.3, −0.06) and starting point 𝑥(0) = (2.2, 1.3) the trajectory
reaches distance 𝜀 = 0.5 from origin at time 𝑡 ≈ 15.936.
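The time quoted in the caption can be reproduced without a conic solver, because for $a_i < 0$ the distance to the origin is strictly decreasing in $t$. The sketch below uses plain bisection as a stand-in for the conic model; the function names and the upper bracket `t_hi` are illustrative choices.

```python
import math

def distance(t, a, x0):
    """Euclidean norm of the state x_i * exp(a_i * t)."""
    return math.sqrt(sum((xi * math.exp(ai * t))**2 for ai, xi in zip(a, x0)))

def time_to_reach(eps, a, x0, t_hi=1e3, iters=200):
    """Smallest t with distance(t) <= eps, by bisection (assumes distance(0) > eps)."""
    t_lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (t_lo + t_hi)
        if distance(mid, a, x0) > eps:
            t_lo = mid
        else:
            t_hi = mid
    return t_hi

t_star = time_to_reach(0.5, [-0.3, -0.06], [2.2, 1.3])
print(round(t_star, 3))
```

The result should agree with the value reported for Fig. 5.2 up to rounding.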
and choosing label $y = 0$ if $h_\theta(x) < \frac{1}{2}$ and $y = 1$ for $h_\theta(x) \ge \frac{1}{2}$. Here $h_\theta(x)$ is interpreted as the probability that $x$ belongs to class 1. The optimal parameter vector $\theta$ should be learned from the training set, so as to maximize the likelihood function:
\[ \prod_i h_\theta(x_i)^{y_i} (1 - h_\theta(x_i))^{1 - y_i}. \]
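\[ \begin{array}{ll} \text{minimize} & \sum_i t_i + \lambda r \\ \text{subject to} & t_i \ge -\log(h_\theta(x_i)) = \log(1 + \exp(-\theta^T x_i)) \quad \text{if } y_i = 1, \\ & t_i \ge -\log(1 - h_\theta(x_i)) = \log(1 + \exp(\theta^T x_i)) \quad \text{if } y_i = 0, \\ & r \ge \|\theta\|_2, \end{array} \]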
involving softplus type constraints (see Sec. 5.2) and a quadratic cone. See Fig. 5.3 for
an example.
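The two identities used in the constraints can be verified directly. The sketch assumes the standard logistic function $h_\theta(x) = 1/(1 + \exp(-\theta^T x))$, whose definition is not shown in this excerpt; the parameter and feature values are made up.

```python
import math

def h(theta, x):
    """Logistic function 1 / (1 + exp(-theta^T x)) (assumed definition of h_theta)."""
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))

def softplus(u):
    """log(1 + exp(u))."""
    return math.log1p(math.exp(u))

theta = [0.5, -1.0]
x = [2.0, 0.3]
s = sum(t * xi for t, xi in zip(theta, x))   # theta^T x

loss_y1 = -math.log(h(theta, x))             # term for a sample with label y = 1
loss_y0 = -math.log(1.0 - h(theta, x))       # term for a sample with label y = 0
print(loss_y1, softplus(-s), loss_y0, softplus(s))
```

Both loss terms coincide with softplus expressions in $\theta^T x$, which is why each sample contributes one softplus constraint.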
Fig. 5.3: Logistic regression example with none, medium and strong regularization (small,
medium, large 𝜆). The two-dimensional dataset was converted into a feature vector
$x \in \mathbb{R}^{28}$ using monomial coordinates of degrees at most 6. Without regularization we get
obvious overfitting.
Chapter 6
Semidefinite optimization
In this chapter we extend the conic optimization framework introduced before with sym-
metric positive semidefinite matrix variables.
\[ z^T X z \ge 0, \quad \forall z \in \mathbb{R}^n. \]
\[ \mathcal{S}^n_+ = \{ X \in \mathcal{S}^n \mid z^T X z \ge 0, \ \forall z \in \mathbb{R}^n \}. \tag{6.1} \]
For brevity we will often use the shorter notion semidefinite instead of symmetric positive semidefinite, and we will write $X \succeq Y$ ($X \preceq Y$) as shorthand notation for $(X - Y) \in \mathcal{S}^n_+$ ($(Y - X) \in \mathcal{S}^n_+$). As inner product for semidefinite matrices, we use the standard trace inner product for general matrices, i.e.,
\[ \langle A, B \rangle := \mathrm{tr}(A^T B) = \sum_{ij} a_{ij} b_{ij}. \]
It is easy to see that (6.1) indeed specifies a convex cone; it is pointed (with origin $X = 0$), and $X, Y \in \mathcal{S}^n_+$ implies that $(\alpha X + \beta Y) \in \mathcal{S}^n_+$, $\alpha, \beta \ge 0$. Let us review a few equivalent definitions of $\mathcal{S}^n_+$. It is well-known that every symmetric matrix $A$ has a spectral factorization
\[ A = \sum_{i=1}^n \lambda_i q_i q_i^T, \]
where $q_i \in \mathbb{R}^n$ are the (orthogonal) eigenvectors and $\lambda_i$ are eigenvalues of $A$. Using the spectral factorization of $A$ we have
\[ x^T A x = \sum_{i=1}^n \lambda_i (x^T q_i)^2, \]
which shows that $x^T A x \ge 0 \Leftrightarrow \lambda_i \ge 0$, $i = 1, \ldots, n$. In other words,
\[ x^T A x = x^T V^T V x = \|V x\|_2^2, \]
i.e., if $A = V^T V$ then $x^T A x \ge 0$ for all $x$. On the other hand, from the positive spectral factorization $A = Q \Lambda Q^T$ we have $A = V^T V$ with $V = \Lambda^{1/2} Q^T$, where $\Lambda^{1/2} = \mathrm{Diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n})$. We thus have the equivalent characterization
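In a completely analogous way we define the cone of symmetric positive definite matrices as
\[ \begin{array}{rcl} \mathcal{S}^n_{++} & = & \{ X \in \mathcal{S}^n \mid z^T X z > 0, \ \forall z \in \mathbb{R}^n \} \\ & = & \{ X \in \mathcal{S}^n \mid \lambda_i(X) > 0, \ i = 1, \ldots, n \} \\ & = & \{ X \in \mathcal{S}^n_+ \mid X = V^T V \text{ for some } V \in \mathbb{R}^{n \times n}, \ \mathrm{rank}(V) = n \}, \end{array} \]
\[ X = \begin{bmatrix} x_1 & x_3 \\ x_3 & x_2 \end{bmatrix} \]
\[ x_1 x_2 \ge x_3^2, \quad x_1, x_2 \ge 0, \]
and
\[ (t, x) \in \mathcal{Q}^{n+1} \iff \begin{bmatrix} t & x^T \\ x & tI \end{bmatrix} \in \mathcal{S}^{n+1}_+, \]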
where the latter equivalence follows immediately from Lemma 6.1. Thus both the linear
and quadratic cone are embedded in the semidefinite cone. In practice, however, linear
and quadratic cones should never be described using semidefinite constraints, which would
result in a large performance penalty by squaring the number of variables.
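For the $2 \times 2$ case the correspondence between semidefiniteness and the quadratic inequality can be tested directly, since the eigenvalues of a symmetric $2 \times 2$ matrix have a closed form. The sample points below are arbitrary and chosen away from the boundary; the function names are illustrative.

```python
import math

def min_eigenvalue_2x2(x1, x2, x3):
    """Smallest eigenvalue of the symmetric matrix [[x1, x3], [x3, x2]]."""
    mean = 0.5 * (x1 + x2)
    radius = math.hypot(0.5 * (x1 - x2), x3)
    return mean - radius

def in_rotated_cone(x1, x2, x3):
    """The quadratic description: x1 * x2 >= x3^2 with x1, x2 >= 0."""
    return x1 >= 0 and x2 >= 0 and x1 * x2 >= x3 * x3

samples = [(2.0, 3.0, 1.0), (1.0, 1.0, 2.0), (4.0, 0.5, -1.0),
           (-1.0, 2.0, 0.0), (0.5, 0.5, 0.7)]
results = [(min_eigenvalue_2x2(*s) >= 0, in_rotated_cone(*s)) for s in samples]
print(results)
```

Both tests agree on every sample, while the quadratic test needs no eigenvalue computation at all, which mirrors the performance remark above.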
Example 6.1. As a more interesting example, consider the symmetric matrix
\[ A(x, y, z) = \begin{bmatrix} 1 & x & y \\ x & 1 & z \\ y & z & 1 \end{bmatrix} \tag{6.4} \]
(shown in Fig. 6.1) is called a spectrahedron and is perhaps the simplest bounded
semidefinite representable set, which cannot be represented using (finitely many) linear
or quadratic cones. To gain a geometric intuition of 𝑆, we note that
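\[ x^2 + y^2 + z^2 - 2xyz = 1, \]
or equivalently as
\[ \begin{bmatrix} x \\ y \end{bmatrix}^T \begin{bmatrix} 1 & -z \\ -z & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = 1 - z^2. \]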
For 𝑧 = 0 this describes a circle in the (𝑥, 𝑦)-plane, and for −1 ≤ 𝑧 ≤ 1 it characterizes
an ellipse (for a fixed 𝑧).
6.1.2 Properties of semidefinite matrices
Many useful properties of (semi)definite matrices follow directly from the definitions
(6.1)-(6.3) and their definite counterparts.
• The diagonal elements of $A \in \mathcal{S}^n_+$ are nonnegative. Let $e_i$ denote the $i$th standard basis vector (i.e., $[e_i]_j = 0$, $j \ne i$, $[e_i]_i = 1$). Then $A_{ii} = e_i^T A e_i$, so (6.1) implies that $A_{ii} \ge 0$.
• Any principal submatrix of 𝐴 ∈ 𝒮+𝑛 (𝐴 restricted to the same set of rows as columns)
is positive semidefinite; this follows by restricting the Grammian characterization
𝐴 = 𝑉 𝑇 𝑉 to a submatrix of 𝑉 .
where strict positivity follows from the assumption that 𝑈 has full column-rank,
i.e., 𝑈 𝑉 𝑇 ̸= 0.
• The inverse of a positive definite matrix is positive definite. This follows from the positive spectral factorization $A = Q \Lambda Q^T$, which gives us
\[ A^{-1} = Q \Lambda^{-1} Q^T \]
\[ X = \begin{bmatrix} A & B^T \\ B & C \end{bmatrix}. \]
Let us find necessary and sufficient conditions for $X \succ 0$. We know that $A \succ 0$ and $C \succ 0$ (since any principal submatrix must be positive definite). Furthermore, we can simplify the analysis using a nonsingular transformation
\[ L = \begin{bmatrix} I & 0 \\ F & I \end{bmatrix} \]
to diagonalize $X$ as $L X L^T = D$, where $D$ is block-diagonal. Note that $\det(L) = 1$, so $L$ is indeed nonsingular. Then $X \succ 0$ if and only if $D \succ 0$. Expanding $L X L^T = D$, we get
\[ \begin{bmatrix} A & A F^T + B^T \\ F A + B & F A F^T + F B^T + B F^T + C \end{bmatrix} = \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix}. \]
Since $\det(A) \ne 0$ (by assuming that $A \succ 0$) we see that $F = -B A^{-1}$ and direct substitution gives us
\[ \begin{bmatrix} A & 0 \\ 0 & C - B A^{-1} B^T \end{bmatrix} = \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix}. \]
In the last part we have thus established the following useful result.
Lemma 6.1 (Schur complement lemma). A symmetric matrix
\[ X = \begin{bmatrix} A & B^T \\ B & C \end{bmatrix} \]
is positive definite if and only if
\[ A \succ 0, \quad C - B A^{-1} B^T \succ 0. \]
𝐴(𝑥) = 𝐴0 + 𝑥1 𝐴1 + · · · + 𝑥𝑛 𝐴𝑛 . (6.5)
𝐴0 + 𝑥1 𝐴1 + · · · + 𝑥𝑛 𝐴𝑛 ⪰ 0 (6.6)
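with
\[ A_0 = I, \quad A_1 = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad A_2 = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 0 \end{bmatrix}, \quad A_3 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}. \]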
Alternatively, we can describe the linear matrix inequality 𝐴(𝑥, 𝑦, 𝑧) ⪰ 0 as
𝑋 ∈ 𝒮+3 , 𝑥11 = 𝑥22 = 𝑥33 = 1,
i.e., as a semidefinite variable with fixed diagonal; these two alternative formulations
illustrate the difference between primal and dual form of semidefinite problems, see Sec.
8.6.
Sum of eigenvalues
The sum of the eigenvalues corresponds to
\[ \sum_{i=1}^m \lambda_i(A) = \mathrm{tr}(A). \]
Largest eigenvalue
The largest eigenvalue can be characterized in epigraph form 𝜆1 (𝐴) ≤ 𝑡 as
𝑡𝐼 − 𝐴 ⪰ 0. (6.7)
To verify this, suppose we have a spectral factorization 𝐴 = 𝑄Λ𝑄𝑇 where 𝑄 is orthogonal
and Λ is diagonal. Then 𝑡 is an upper bound on the largest eigenvalue if and only if
𝑄𝑇 (𝑡𝐼 − 𝐴)𝑄 = 𝑡𝐼 − Λ ⪰ 0.
Thus we can minimize the largest eigenvalue of 𝐴.
Smallest eigenvalue
The smallest eigenvalue can be described in hypograph form 𝜆𝑚 (𝐴) ≥ 𝑡 as
𝐴 ⪰ 𝑡𝐼, (6.8)
i.e., we can maximize the smallest eigenvalue of 𝐴.
Eigenvalue spread
The eigenvalue spread can be modeled in epigraph form
𝜆1 (𝐴) − 𝜆𝑚 (𝐴) ≤ 𝑡
by combining the two linear matrix inequalities in (6.7) and (6.8), i.e.,
\[ \begin{array}{l} zI \preceq A \preceq sI, \\ s - z \le t. \end{array} \tag{6.9} \]
Spectral radius
The spectral radius 𝜌(𝐴) := max𝑖 |𝜆𝑖 (𝐴)| can be modeled in epigraph form 𝜌(𝐴) ≤ 𝑡 using
two linear matrix inequalities
−𝑡𝐼 ⪯ 𝐴 ⪯ 𝑡𝐼.
𝜇𝐼 ⪯ 𝐴 ⪯ 𝜇𝑡𝐼,
from which we recover the solution 𝑥 = 𝑧/𝜈. In essence, we first normalize the spectrum
by the smallest eigenvalue, and then minimize the largest eigenvalue of the normalized
linear matrix inequality. Compare Sec. 2.2.5.
6.2.3 Log-determinant
Consider again a symmetric positive-definite matrix $A \in \mathcal{S}^m_+$. The determinant
\[ \det(A) = \prod_{i=1}^m \lambda_i(A) \]
is neither convex nor concave, but $\log\det(A)$ is concave and we can write the inequality
\[ t \le \log\det(A) \]
On the other hand the optimal value det(𝐴) is attained for 𝑍 = 𝐿𝐷 if 𝐴 = 𝐿𝐷𝐿𝑇 is the
LDL factorization of 𝐴.
The last inequality in problem (6.11) can of course be modeled using the exponential cone as in Sec. 5.2. Note that we can replace that bound with $t \le (\prod_i Z_{ii})^{1/m}$ to get instead the model of $t \le \det(A)^{1/m}$ using Sec. 4.2.4.
6.2.4 Singular value optimization
We next consider a non-square matrix 𝐴 ∈ R𝑚×𝑝 . Assume 𝑝 ≤ 𝑚 and denote the singular
values of 𝐴 by
\[ \sigma_i(A) = \sqrt{\lambda_i(A^T A)}, \tag{6.12} \]
and if 𝐴 is square and symmetric then 𝜎𝑖 (𝐴) = |𝜆𝑖 (𝐴)|. We show next how to optimize
several functions of the singular values.
\[ A^T A \preceq t^2 I, \]
The largest singular value 𝜎1 (𝐴) is also called the spectral norm or the ℓ2 -norm of 𝐴,
‖𝐴‖2 := 𝜎1 (𝐴).
It turns out that the nuclear norm corresponds to the sum of the singular values,
\[ \|X\|_* = \sum_{i=1}^m \sigma_i(X) = \sum_{i=1}^n \sqrt{\lambda_i(X^T X)}, \tag{6.15} \]
\[ = \sup_{\|Y\|_2 \le 1} \mathrm{tr}(\Sigma^T Y) = \sup_{|y_i| \le 1} \sum_{i=1}^p \sigma_i y_i = \sum_{i=1}^p \sigma_i. \]
with the dual problem (see Example 8.8)
In other words, using strong duality we can characterize the epigraph ‖𝐴‖* ≤ 𝑡 with
\[ \begin{bmatrix} U & A^T \\ A & V \end{bmatrix} \succeq 0, \quad \mathrm{tr}(U + V)/2 \le t. \tag{6.18} \]
For a symmetric matrix the nuclear norm corresponds to the sum of absolute values of
eigenvalues, and for a semidefinite matrix it simply corresponds to the trace of the matrix.
\[ A^T B^{-1} A \preceq C \]
if and only if
\[ \begin{bmatrix} C & A^T \\ A & B \end{bmatrix} \succeq 0. \]
\[ v(t) = \left( 1, t, \ldots, t^{2n} \right). \]
if and only if it can be written as a sum of squared polynomials of degree $n$ (or less), i.e., for some $q_1, q_2 \in \mathbb{R}^{n+1}$
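When there is no ambiguity, we drop the superscript on $H_i$. For example, for $n = 2$ we have
\[ H_0 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad H_1 = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad \ldots \quad H_4 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}. \]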
To verify that (6.19) and (6.21) are equivalent, we first note that
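\[ u(t) u(t)^T = \sum_{i=0}^{2n} H_i v_i(t), \]
i.e.,
\[ \begin{bmatrix} 1 \\ t \\ \vdots \\ t^n \end{bmatrix} \begin{bmatrix} 1 \\ t \\ \vdots \\ t^n \end{bmatrix}^T = \begin{bmatrix} 1 & t & \ldots & t^n \\ t & t^2 & \ldots & t^{n+1} \\ \vdots & \vdots & \ddots & \vdots \\ t^n & t^{n+1} & \ldots & t^{2n} \end{bmatrix}. \]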
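i.e., we have $f(t) = x^T v(t)$ with $x_i = \langle X, H_i \rangle$, $X = (q_1 q_1^T + q_2 q_2^T) \succeq 0$. Conversely, assume that (6.21) holds. Then
\[ f(t) = \sum_{i=0}^{2n} \langle H_i, X \rangle v_i(t) = \left\langle X, \sum_{i=0}^{2n} H_i v_i(t) \right\rangle = \langle X, u(t) u(t)^T \rangle \ge 0 \]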
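The mapping $x_i = \langle X, H_i \rangle$ simply sums the entries of $X$ along its $i$-th anti-diagonal, as the matrices $H_0, H_1, H_4$ shown earlier suggest. The sketch below builds $X = q_1 q_1^T + q_2 q_2^T$ from two made-up coefficient vectors and checks that the resulting polynomial equals the sum of the two squares; the helper names are illustrative.

```python
def hankel_coefficients(X):
    """x_i = <X, H_i>: sum of the entries of X on the i-th anti-diagonal, i = 0..2n."""
    n = len(X) - 1
    return [sum(X[j][i - j] for j in range(max(0, i - n), min(i, n) + 1))
            for i in range(2 * n + 1)]

def outer_sum(q1, q2):
    """X = q1 q1^T + q2 q2^T, positive semidefinite by construction."""
    m = len(q1)
    return [[q1[i] * q1[j] + q2[i] * q2[j] for j in range(m)] for i in range(m)]

def poly(coeffs, t):
    """Evaluate coeffs[0] + coeffs[1] * t + coeffs[2] * t^2 + ..."""
    return sum(c * t**k for k, c in enumerate(coeffs))

q1, q2 = [1.0, -2.0, 0.5], [0.0, 1.0, 3.0]    # n = 2, illustrative data
X = outer_sum(q1, q2)
x = hankel_coefficients(X)                    # coefficients of a degree-4 polynomial

for t in [-2.0, -0.5, 0.0, 1.3]:
    print(t, poly(x, t), poly(q1, t)**2 + poly(q2, t)**2)
```

The two printed values coincide at every $t$ and are nonnegative, which is the easy direction of the equivalence.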
• Even degree. Let 𝑛 = 2𝑚 and denote
We choose 𝑤1 (𝑡) = 1 and 𝑤2 (𝑡) = (𝑡 − 𝑎)(𝑏 − 𝑡) and note that 𝑤2 (𝑡) ≥ 0 on [𝑎, 𝑏].
Then 𝑓 (𝑡) ≥ 0, ∀𝑡 ∈ [𝑎, 𝑏] if and only if
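\[ X \in \mathcal{H}^n \iff X \in \mathbb{C}^{n \times n}, \ X^H = X, \tag{6.25} \]
In other words,
\[ X \in \mathcal{H}^n_+ \iff \begin{bmatrix} \Re X & -\Im X \\ \Im X & \Re X \end{bmatrix} \in \mathcal{S}^{2n}_+. \tag{6.26} \]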
6.2.8 Nonnegative trigonometric polynomials
As a complex-valued variation of the sum-of-squares representation we consider trigono-
metric polynomials; optimization over cones of nonnegative trigonometric polynomials
has several important engineering applications. Consider a trigonometric polynomial
evaluated on the complex unit-circle
\[ f(z) = x_0 + 2\Re\left( \sum_{i=1}^n x_i z^{-i} \right), \quad |z| = 1 \tag{6.27} \]
\[ v(z) = (1, z, \ldots, z^n). \]
The Riesz-Fejer Theorem states that a trigonometric polynomial $f(z)$ in (6.27) is nonnegative (i.e., $x \in K^n_{0,\pi}$) if and only if for some $q \in \mathbb{C}^{n+1}$
\[ x_i = \langle X, T_i \rangle, \quad i = 0, \ldots, n, \quad X \in \mathcal{H}^{n+1}_+ \tag{6.29} \]
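i.e.,
\[ \begin{bmatrix} 1 \\ z \\ \vdots \\ z^n \end{bmatrix} \begin{bmatrix} 1 \\ z \\ \vdots \\ z^n \end{bmatrix}^H = \begin{bmatrix} 1 & z^{-1} & \ldots & z^{-n} \\ z & 1 & \ldots & z^{1-n} \\ \vdots & \vdots & \ddots & \vdots \\ z^n & z^{n-1} & \ldots & 1 \end{bmatrix}. \]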
Next assume that (6.28) is satisfied. Then
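\[ \begin{array}{rcl} f(z) & = & \langle q q^H, v(z) v(z)^H \rangle \\ & = & \langle q q^H, I \rangle + \langle q q^H, \sum_{i=1}^n T_i v_i(z) \rangle + \langle q q^H, \sum_{i=1}^n T_i^T \overline{v_i(z)} \rangle \\ & = & \langle q q^H, I \rangle + \sum_{i=1}^n \langle q q^H, T_i \rangle v_i(z) + \sum_{i=1}^n \langle q q^H, T_i^T \rangle \overline{v_i(z)} \\ & = & x_0 + 2\Re\left( \sum_{i=1}^n x_i v_i(z) \right) \end{array} \]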
Nonnegativity on a subinterval
We next sketch a few useful extensions. An extension of the Riesz-Fejer Theorem states
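that a trigonometric polynomial $f(z)$ of degree $n$ is nonnegative on $I(a, b) = \{ z \mid z = e^{jt}, \ t \in [a, b] \subseteq [0, \pi] \}$ if and only if it can be written as a weighted sum of squared trigonometric polynomials
for $X_1 \in \mathcal{H}^{n+1}_+$, $X_2 \in \mathcal{H}^n_+$, i.e.,
\[ \begin{array}{rl} K^n_{0,\alpha} = \{ x \in \mathbb{R} \times \mathbb{C}^n \mid & x_i = \langle X_1, T_i^{n+1} \rangle + \langle X_2, T_{i+1}^n \rangle + \langle X_2, T_{i-1}^n \rangle \\ & - 2\cos(\alpha) \langle X_2, T_i^n \rangle, \ X_1 \in \mathcal{H}^{n+1}_+, \ X_2 \in \mathcal{H}^n_+ \}. \end{array} \tag{6.31} \]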
Similarly 𝑓 (𝑧) ≥ 0, ∀𝑧 ∈ 𝐼(𝛼, 𝜋) if and only if
i.e.,
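\[ \begin{array}{rl} K^n_{\alpha,\pi} = \{ x \in \mathbb{R} \times \mathbb{C}^n \mid & x_i = \langle X_1, T_i^{n+1} \rangle + \langle X_2, T_{i+1}^n \rangle + \langle X_2, T_{i-1}^n \rangle \\ & + 2\cos(\alpha) \langle X_2, T_i^n \rangle, \ X_1 \in \mathcal{H}^{n+1}_+, \ X_2 \in \mathcal{H}^n_+ \}. \end{array} \tag{6.32} \]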
\[ S = \{ X \in \mathcal{S}^n_+ \mid X_{ii} = 1, \ i = 1, \ldots, n \} \]
(shown in Fig. 6.1 for 𝑛 = 3). For 𝐴 ∈ 𝒮 𝑛 the nearest correlation matrix is
i.e., the projection of $A$ onto the set $S$. To pose this as a conic optimization we define the linear operator
\[ \mathrm{svec}(U) = (U_{11}, \sqrt{2} U_{21}, \ldots, \sqrt{2} U_{n1}, U_{22}, \sqrt{2} U_{32}, \ldots, \sqrt{2} U_{n2}, \ldots, U_{nn}), \]
which extracts and scales the lower-triangular part of $U$. We then get a conic formulation of the nearest correlation problem exploiting symmetry of $A - X$,
minimize 𝑡
subject to ‖𝑧‖2 ≤ 𝑡,
svec(𝐴 − 𝑋) = 𝑧, (6.33)
diag(𝑋) = 𝑒,
𝑋 ⪰ 0.
This is an example of a problem with both conic quadratic and semidefinite constraints
in primal form. We can add different constraints to the problem, for example a bound 𝛾
on the smallest eigenvalue by replacing 𝑋 ⪰ 0 with 𝑋 ⪰ 𝛾𝐼.
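The scaling by $\sqrt{2}$ in svec is what makes the Euclidean norm of $z$ equal to the Frobenius norm of $A - X$. The sketch below checks this on a small symmetric example; the matrices are made-up data (the trial matrix `X` is only given a unit diagonal and is not claimed to be the nearest correlation matrix).

```python
import math

def svec(U):
    """Column-wise lower triangle of symmetric U, off-diagonal entries scaled by sqrt(2)."""
    n = len(U)
    out = []
    for j in range(n):
        for i in range(j, n):
            out.append(U[i][j] if i == j else math.sqrt(2.0) * U[i][j])
    return out

def frobenius_norm(U):
    return math.sqrt(sum(v * v for row in U for v in row))

A = [[1.0, 0.9, -0.8], [0.9, 1.0, 0.2], [-0.8, 0.2, 1.0]]
X = [[1.0, 0.5, -0.5], [0.5, 1.0, 0.1], [-0.5, 0.1, 1.0]]   # trial matrix, unit diagonal
D = [[A[i][j] - X[i][j] for j in range(3)] for i in range(3)]

z = svec(D)
norm_z = math.sqrt(sum(v * v for v in z))
print(len(z), norm_z, frobenius_norm(D))
```

The vector $z$ has only $n(n+1)/2$ entries, so the quadratic cone constraint $\|z\|_2 \le t$ is smaller than one over all $n^2$ entries of $A - X$.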
𝑆 = {𝑥 ∈ R𝑛 | 𝑎𝑇𝑖 𝑥 ≤ 𝑏𝑖 , 𝑖 = 1, . . . , 𝑚}.
The ellipsoid
ℰ := {𝑥 | 𝑥 = 𝐶𝑢 + 𝑑, ‖𝑢‖ ≤ 1}
‖𝐶𝑎𝑖 ‖2 + 𝑎𝑇𝑖 𝑑 ≤ 𝑏𝑖 , 𝑖 = 1, . . . , 𝑚.
maximize det(𝐶)
subject to ‖𝐶𝑎𝑖 ‖2 + 𝑎𝑇𝑖 𝑑 ≤ 𝑏𝑖 , 𝑖 = 1, . . . , 𝑚,
𝐶 ⪰ 0.
In Sec. 6.2.3 we show how to maximize the determinant of a positive definite matrix.
Minimal enclosing ellipsoid
Next consider a polytope given as the convex hull of a set of points,
𝑆 ′ = conv{𝑥1 , 𝑥2 , . . . , 𝑥𝑚 }, 𝑥𝑖 ∈ R𝑛 .
The ellipsoid
ℰ ′ := {𝑥 | ‖𝑃 𝑥 − 𝑐‖2 ≤ 1}
has $\mathrm{Vol}(\mathcal{E}') \approx \det(P)^{-1/n}$, so the minimum-volume enclosing ellipsoid is the solution to
maximize det(𝑃 )
subject to ‖𝑃 𝑥𝑖 − 𝑐‖2 ≤ 1, 𝑖 = 1, . . . , 𝑚,
𝑃 ⪰ 0.
𝑓 (𝑡) = 𝑥0 + 𝑥1 𝑡 + 𝑥2 𝑡2 + · · · + 𝑥𝑛 𝑡𝑛 .
Often we wish to fit such a polynomial to a given set of measurements or control points
𝑓 (𝑡𝑗 ) ≈ 𝑦𝑗 , 𝑗 = 1, . . . , 𝑚.
We can then express the desired curve-fit compactly as
𝐴𝑥 ≈ 𝑦,
i.e., as a linear expression in the coefficients $x$. When the degree of the polynomial equals the number of measurements, $n = m$, the matrix $A$ is square and non-singular (provided there are no duplicate rows), so we can solve
𝐴𝑥 = 𝑦
to find a polynomial that passes through all the control points (𝑡𝑖 , 𝑦𝑖 ). Similarly, if 𝑛 > 𝑚
there are infinitely many solutions satisfying the underdetermined system 𝐴𝑥 = 𝑦. A
typical choice in that case is the least-norm solution
On the other hand, if 𝑛 < 𝑚 we generally cannot find a solution to the overdetermined
system 𝐴𝑥 = 𝑦, and we typically resort to a least-squares solution
𝑓 (𝑡) := 𝑥0 + 𝑥1 𝑡 + · · · + 𝑥𝑛 𝑡𝑛 ≥ 0, ∀𝑡 ∈ [𝑎, 𝑏]
or
\[ -(x_1, 2x_2, \ldots, n x_n) \in K^{n-1}_{a,b}, \]
respectively.
• Convexity or concavity. Convexity (or concavity) of 𝑓 (𝑡) corresponds to 𝑓 ′′ (𝑡) ≥ 0
(or 𝑓 ′′ (𝑡) ≤ 0), i.e.,
\[ (2x_2, 6x_3, \ldots, (n-1) n x_n) \in K^{n-2}_{a,b}, \]
or
\[ -(2x_2, 6x_3, \ldots, (n-1) n x_n) \in K^{n-2}_{a,b}, \]
respectively.
As an example, we consider fitting a smooth polynomial
𝑓𝑛 (𝑡) = 𝑥0 + 𝑥1 𝑡 + · · · + 𝑥𝑛 𝑡𝑛
to the points {(−1, 1), (0, 0), (1, 1)}, where smoothness is implied by bounding |𝑓𝑛′ (𝑡)|.
More specifically, we wish to solve the problem
minimize 𝑧
subject to |𝑓𝑛′ (𝑡)| ≤ 𝑧, ∀𝑡 ∈ [−1, 1]
𝑓𝑛 (−1) = 1, 𝑓𝑛 (0) = 0, 𝑓𝑛 (1) = 1,
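or equivalently
\[ \begin{array}{ll} \text{minimize} & z \\ \text{subject to} & z - f_n'(t) \ge 0, \quad \forall t \in [-1, 1] \\ & f_n'(t) + z \ge 0, \quad \forall t \in [-1, 1] \\ & f_n(-1) = 1, \quad f_n(0) = 0, \quad f_n(1) = 1. \end{array} \]
Finally, we use the characterizations $K^n_{a,b}$ to get a conic problem
\[ \begin{array}{ll} \text{minimize} & z \\ \text{subject to} & (z - x_1, -2x_2, \ldots, -n x_n) \in K^{n-1}_{-1,1} \\ & (x_1 + z, 2x_2, \ldots, n x_n) \in K^{n-1}_{-1,1} \\ & f_n(-1) = 1, \quad f_n(0) = 0, \quad f_n(1) = 1. \end{array} \]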
In Fig. 6.3 we show the graphs for the resulting polynomials of degree 2, 4 and 8, respectively. The second degree polynomial is uniquely determined by the three constraints $f_2(-1) = 1$, $f_2(0) = 0$, $f_2(1) = 1$, i.e., $f_2(t) = t^2$. Also, we obviously have a lower bound on the largest derivative $\max_{t \in [-1,1]} |f_n'(t)| \ge 1$. The computed fourth degree polynomial is given by
\[ f_4(t) = \frac{3}{2} t^2 - \frac{1}{2} t^4 \]
after rounding coefficients to rational numbers. Furthermore, the largest derivative is given by
\[ f_4'(1/\sqrt{2}) = \sqrt{2}, \]
and $f_4''(t) < 0$ on $(1/\sqrt{2}, 1]$ so, although not visibly clear, the polynomial is nonconvex on $[-1, 1]$. In Fig. 6.4 we show the graphs of the corresponding polynomials where we added a convexity constraint $f_n''(t) \ge 0$, i.e.,
\[ (2x_2, 6x_3, \ldots, (n-1) n x_n) \in K^{n-2}_{-1,1}. \]
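The stated properties of the fourth degree polynomial can be confirmed with a few lines of arithmetic. The sketch below only evaluates the given $f_4$ on a grid; it does not solve the conic problem, and the grid resolution is an arbitrary choice.

```python
import math

def f4(t):
    return 1.5 * t**2 - 0.5 * t**4

def df4(t):
    """First derivative 3t - 2t^3."""
    return 3.0 * t - 2.0 * t**3

def d2f4(t):
    """Second derivative 3 - 6t^2."""
    return 3.0 - 6.0 * t**2

grid = [-1.0 + 2.0 * k / 20000 for k in range(20001)]   # points in [-1, 1]
max_slope = max(abs(df4(t)) for t in grid)
t_peak = 1.0 / math.sqrt(2.0)

print(f4(-1.0), f4(0.0), f4(1.0))        # interpolation conditions
print(max_slope, df4(t_peak))            # largest derivative on [-1, 1]
print(d2f4(0.9))                         # negative: nonconvex near the endpoints
```

For comparison, $f_2(t) = t^2$ has largest derivative 2 on $[-1, 1]$, so the quartic improves the smoothness measure from 2 to $\sqrt{2}$.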
[Fig. 6.3: graphs of $f_2(t)$, $f_4(t)$ and $f_8(t)$ plotted against $t$.]
[Fig. 6.4: graphs of $f_2(t)$, $f_4(t)$ and $f_8(t)$ plotted against $t$.]
We often wish a transfer function where 𝐻(𝜔) ≈ 1 for 0 ≤ 𝜔 ≤ 𝜔𝑝 and 𝐻(𝜔) ≈ 0 for
𝜔𝑠 ≤ 𝜔 ≤ 𝜋 for given constants 𝜔𝑝 , 𝜔𝑠 . One possible formulation for achieving this is
minimize 𝑡
subject to 0 ≤ 𝐻(𝜔) ∀𝜔 ∈ [0, 𝜋]
1 − 𝛿 ≤ 𝐻(𝜔) ≤ 1 + 𝛿 ∀𝜔 ∈ [0, 𝜔𝑝 ]
𝐻(𝜔) ≤ 𝑡 ∀𝜔 ∈ [𝜔𝑠 , 𝜋],
which corresponds to minimizing $H(\omega)$ on the interval $[\omega_s, \pi]$ while allowing $H(\omega)$ to depart from unity by a small amount $\delta$ on the interval $[0, \omega_p]$. Using the results from Sec. 6.2.8 (in particular (6.30), (6.31) and (6.32)), we can pose this as a conic optimization problem
\[ \begin{array}{ll} \text{minimize} & t \\ \text{subject to} & x \in K^n_{0,\pi}, \\ & (x_0 - (1 - \delta), x_{1:n}) \in K^n_{0,\omega_p}, \\ & -(x_0 - (1 + \delta), x_{1:n}) \in K^n_{0,\omega_p}, \\ & -(x_0 - t, x_{1:n}) \in K^n_{\omega_s,\pi}, \end{array} \tag{6.34} \]
which is a semidefinite optimization problem. In Fig. 6.5 we show 𝐻(𝜔) obtained by
solving (6.34) for 𝑛 = 10, 𝛿 = 0.05, 𝜔𝑝 = 𝜋/4 and 𝜔𝑠 = 𝜔𝑝 + 𝜋/8.
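[Fig. 6.5: plot of $H(\omega)$ against $\omega$, with the levels $1 + \delta$, $1 - \delta$ and $t^\star$ and the frequencies $\omega_p$, $\omega_s$, $\pi$ marked.]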
which is, in fact, equivalent to a rank constraint on a semidefinite variable,
𝑋 = 𝑥𝑥𝑇 , diag(𝑋) = 𝑥.
minimize 𝑐𝑇 𝑥
subject to 𝐴𝑥 = 𝑏,
(6.36)
diag(𝑋) = 𝑥,
𝑋 ⪰ 𝑥𝑥𝑇 ,
0 ≤ 𝑋𝑖𝑗 ≤ 1, 𝑖 = 1, . . . , 𝑛, 𝑗 = 1, . . . , 𝑛,
𝑋𝑖𝑖 ≥ 𝑋𝑖𝑗 , 𝑖 = 1, . . . , 𝑛, 𝑗 = 1, . . . , 𝑛,
and so on. This will usually have a dramatic impact on solution times and memory requirements. Already constraining a semidefinite matrix to be doubly nonnegative ($X_{ij} \ge 0$) introduces additional $n^2$ linear inequality constraints.
𝐼 = {(𝑢, 𝑣) ∈ 𝐸 | 𝑢 ∈ 𝑆, 𝑣 ∈ 𝑇 }.
The capacity of a cut is then defined as the number of edges in the cut-set, |𝐼|.
Fig. 6.6: Undirected graph. The cut {𝑣2 , 𝑣4 , 𝑣5 } has capacity 9 (thick edges).
\[ \begin{array}{ll} \text{minimize} & x^T A x \\ \text{subject to} & x \in \{-1, +1\}^n, \end{array} \tag{6.38} \]
Chapter 7
Practical optimization
Example 7.1. In Sec. 5.2.6 it was shown that the logarithm of sum of exponentials
\[ t \ge \log(e^{x_1} + \cdots + e^{x_n}), \]
\[ t \ge s \log(e^{x_1/s} + \cdots + e^{x_n/s}), \]
$c_0$, $A(x/s, r) + b \in K] \Leftrightarrow [t \ge c(x, \bar{r}) + c_0 s$, $A(x, \bar{r}) + bs \in K]$, where $\bar{r}$ is taken as a new
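1. If $f$ is convex and $g$ is convex, then $t \ge f(r)$, $r \ge g(x)$ is equivalent to
\[ t \ge \begin{cases} f(g(x)) & \text{if } g(x) \in \mathcal{F}^+, \\ f_{\min} & \text{otherwise}, \end{cases} \]
2. If $f$ is convex and $g$ is concave, then $t \ge f(r)$, $r \le g(x)$ is equivalent to
\[ t \ge \begin{cases} f(g(x)) & \text{if } g(x) \in \mathcal{F}^-, \\ f_{\min} & \text{otherwise}, \end{cases} \]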
• The inequality $2st \ge g(x)^2$ can be rewritten as $2st \ge r^2$ and $r \ge g(x)$ for all nonnegative convex functions $g(x)$. This is because $f(r) = r^2$ is nondecreasing on $r \ge 0$.
Example 7.3. Some amount of work is generally required to find the correct reformu-
lation of a nested nonlinear constraint. For instance, suppose that for 𝑥, 𝑦, 𝑧 ≥ 0 with
𝑥𝑦𝑧 > 1 we want to write
\[ t \ge \frac{1}{xyz - 1}. \]
A natural first attempt is:
\[ t \ge \frac{1}{r}, \quad r \le xyz - 1, \]
which corresponds to Relaxation 2 with 𝑓 (𝑟) = 1/𝑟 and 𝑔(𝑥, 𝑦, 𝑧) = 𝑥𝑦𝑧 − 1 > 0. The
function 𝑓 is indeed convex and nonincreasing on all of 𝑔(𝑥, 𝑦, 𝑧), and the inequality
𝑡𝑟 ≥ 1 is moreover representable with a rotated quadratic cone. Unfortunately 𝑔 is
not concave. We know that a monomial like $xyz$ appears in connection with the power cone, but that requires a homogeneous constraint such as $xyz \ge u^3$. This gives us an idea to try
\[ t \ge \frac{1}{r^3}, \quad r^3 \le xyz - 1, \]
which is $f(r) = 1/r^3$ and $g(x, y, z) = (xyz - 1)^{1/3} > 0$. This provides the right balance:
all conditions for validity and exactness of Relaxation 2 are satisfied. Introducing
another variable 𝑢 we get the following model:
We refer to Sec. 4 to verify that all the constraints above are representable using power
cones. We leave it as an exercise to find other conic representations, based on other
transformations of the original inequality.
Suppose 𝑓 (𝑥) is convex (in particular each piece is convex by itself) and 𝑓𝑖 (𝛼𝑖 ) = 𝑓𝑖+1 (𝛼𝑖 ).
In representing the epigraph 𝑡 ≥ 𝑓 (𝑥) it is helpful to proceed via the equivalent represen-
tation:
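\[ \begin{array}{l} t = \sum_{i=1}^k t_i - \sum_{i=1}^{k-1} f_i(\alpha_i), \\ x = \sum_{i=1}^k x_i - \sum_{i=1}^{k-1} \alpha_i, \\ t_i \ge f_i(x_i) \ \text{for } i = 1, \ldots, k, \\ x_1 \le \alpha_1, \quad \alpha_{i-1} \le x_i \le \alpha_i \ \text{for } i = 2, \ldots, k-1, \quad \alpha_{k-1} \le x_k. \end{array} \]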
Proof. In the special case when 𝑘 = 2 and 𝛼1 = 𝑓1 (𝛼1 ) = 𝑓2 (𝛼1 ) = 0 the epigraph of 𝑓 is
equal to the Minkowski sum {(𝑡1 +𝑡2 , 𝑥1 +𝑥2 ) : 𝑡𝑖 ≥ 𝑓𝑖 (𝑥𝑖 )}. In general the epigraph over
two consecutive pieces is obtained by shifting them to (0, 0), computing the Minkowski
sum and shifting back. Finally more than two pieces can be joined by continuing this
argument by induction.
As the reformulation grows in size with the number of pieces, it is preferable to
keep this number low. Trivially, if 𝑓𝑖 (𝑥) = 𝑓𝑖+1 (𝑥), these two pieces can be merged.
Replacing 𝑓𝑖 (𝑥) and 𝑓𝑖+1 (𝑥) by max(𝑓𝑖 (𝑥), 𝑓𝑖+1 (𝑥)) is sometimes an invariant change
that facilitates this merge. For instance, it always works for affine functions 𝑓𝑖 (𝑥) and
𝑓𝑖+1 (𝑥). Finally, if 𝑓 (𝑥) is symmetric around some point 𝛼, we can represent its epigraph
via a piecewise function, with only half the number of pieces, defined in terms of a new
variable 𝑧 = |𝑥 − 𝛼| + 𝛼.
\[
\begin{array}{l}
t \ge t_1 + t_2 + t_3 - 2, \\
x = x_1 + x_2 + x_3, \\
t_1 = -2x_1 - 1, \quad t_2 \ge x_2^2, \quad t_3 = 2x_3 - 1, \\
x_1 \le -1, \quad -1 \le x_2 \le 1, \quad 1 \le x_3,
\end{array}
\]
In this particular example, however, unless the absolute value 𝑧 from 𝑧 ≥ |𝑥| is used
elsewhere, the cost of introducing it does not outweigh the savings achieved by going
from three pieces in 𝑥 to two pieces in 𝑧.
possible by defining a condition number. This is an attractive, but not very practical met-
ric, as its evaluation requires solving several auxiliary optimization problems. Therefore
we only make the modest recommendations to avoid problems
Example 7.5 (Near linear dependencies). To give some idea about the perils of near
linear dependence, consider this pair of optimization problems:
maximize 𝑥
subject to 2𝑥 − 𝑦 ≥ −1,
2.0001𝑥 − 𝑦 ≤ 0,
and
maximize 𝑥
subject to 2𝑥 − 𝑦 ≥ −1,
1.9999𝑥 − 𝑦 ≤ 0.
The first problem has a unique optimal solution (𝑥, 𝑦) = (10⁴, 2 · 10⁴ + 1), while the
second problem is unbounded. This is caused by the fact that the hyperplanes (here:
straight lines) defining the constraints in each problem are almost parallel. Moreover,
we can consider another modification:
maximize 𝑥
subject to 2𝑥 − 𝑦 ≥ −1,
2.001𝑥 − 𝑦 ≤ 0,
with optimal solution (𝑥, 𝑦) = (10³, 2 · 10³ + 1), which shows how a small perturbation
of the coefficients induces a large change of the solution.
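The sensitivity in Example 7.5 can be reproduced with exact rational arithmetic. The sketch below is only an illustration, and the helper name `max_x` is hypothetical. It relies on the observation that the two constraints read 𝑎𝑥 ≤ 𝑦 ≤ 2𝑥 + 1, so for 𝑎 > 2 the largest feasible 𝑥 is 1/(𝑎 − 2), while for 𝑎 ≤ 2 the problem is unbounded.

```python
from fractions import Fraction

def max_x(a):
    """Maximize x subject to 2x - y >= -1 and a*x - y <= 0.

    The constraints read a*x <= y <= 2x + 1, so a feasible y exists
    iff (a - 2) * x <= 1. Returns (x, y) at the optimum, or None
    when the problem is unbounded (a <= 2).
    """
    a = Fraction(a)
    if a <= 2:
        return None
    x = 1 / (a - 2)
    return x, 2 * x + 1

# The three coefficients used in the example.
for a in ("2.0001", "1.9999", "2.001"):
    print(a, max_x(a))
```

Changing the coefficient in the fourth decimal place moves the optimum by a factor of ten or removes it altogether, which is the behaviour described above.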
Typical examples of problems with nearly linearly dependent constraints are dis-
cretizations of continuous processes, where the constraints invariably become more cor-
related as we make the discretization finer; as such there may be nothing wrong with
the discretization or problem formulation, but we should expect numerical difficulties for
sufficiently fine discretizations.
One should also be careful not to specify problems whose optimal value is only achieved
in the limit. A trivial example is
minimize 𝑒^𝑥, 𝑥 ∈ R.
The infimum value 0 is not attained for any finite 𝑥, which can lead to unspecified be-
haviour of the solver. More examples of ill-posed conic problems are discussed in Fig. 7.1
and Sec. 8. In particular, Fig. 7.1 depicts some troublesome problems in two dimensions.
In (a) minimizing 𝑦 subject to 𝑦 ≥ 1/𝑥 on the nonnegative orthant has infimum zero, which is
approached as 𝑥 → ∞ but never attained. In (b) the only feasible point is (0, 0), but
objectives maximizing 𝑥 are not blocked by the purely vertical normal vectors of active
constraints at this point, falsely suggesting local progress could be made. In (c) the
intersection of the two subsets is empty, but the distance between them is zero. Finally
in (d) minimizing 𝑥 on 𝑦 ≥ 𝑥² is unbounded below as 𝑦 → ∞, but there is no improving ray
to follow. Although seemingly unrelated, the four cases are actually primal-dual pairs: (a)
with (b) and (c) with (d). In fact, the missing normal vector property in (b), desired to
certify optimality, can be attributed to (a) not attaining the best among objective values at
distance zero to feasibility, and the missing positive distance property in (c), desired to
certify infeasibility, is because (d) has no improving ray.
7.3 Scaling
Another difficulty encountered in practice involves models that are badly scaled. Loosely
speaking, we consider a model to be badly scaled if
• variables are measured on very different scales,
• constraints or bounds are measured on very different scales.
For example if one variable 𝑥1 has units of molecules and another variable 𝑥2 measures
temperature, we might expect an objective function such as
𝑥1 + 10¹² 𝑥2
and in finite precision the second term dominates the objective and renders the contri-
bution from 𝑥1 insignificant and unreliable.
A similar situation (from a numerical point of view) is encountered when using pe-
nalization or big-𝑀 strategies. Assume we have a standard linear optimization problem
(2.12) with the additional constraint that 𝑥1 = 0. We may eliminate 𝑥1 completely from
the model, or we might add an additional constraint, but suppose we choose to formulate
a penalized problem instead
minimize 𝑐𝑇 𝑥 + 10¹² 𝑥1
subject to 𝐴𝑥 = 𝑏,
𝑥 ≥ 0,
reasoning that the large penalty term will force 𝑥1 = 0. However, if ‖𝑐‖ is small we
have the exact same problem, namely that in finite precision the penalty term will com-
pletely dominate the objective and render the contribution 𝑐𝑇 𝑥 insignificant or unreliable.
Therefore, the penalty term should be chosen carefully.
Example 7.7 (Explicit redundant bounds). Consider again the problem (2.12), but
with additional (redundant) constraints 𝑥𝑖 ≤ 𝛾. This is a common approach for some
optimization practitioners. The problem we solve is then
minimize 𝑐𝑇 𝑥
subject to 𝐴𝑥 = 𝑏,
𝑥 ≥ 0,
𝑥 ≤ 𝛾𝑒,
with the dual problem
maximize 𝑏𝑇 𝑦 − 𝛾𝑒𝑇 𝑧
subject to 𝐴𝑇 𝑦 + 𝑠 − 𝑧 = 𝑐,
𝑠, 𝑧 ≥ 0.
Suppose we do not know a priori an upper bound on ‖𝑥‖∞ , so we choose a very large
𝛾 = 10¹² reasoning that this will not change the optimal solution. Note that the large
variable bound becomes a penalty term in the dual problem; in finite precision such a
large bound will effectively destroy the accuracy of the solution.
Example 7.8 (Big-M). Suppose we have a mixed integer model which involves a big-M
constraint (see Sec. 9), for instance
𝑎𝑇 𝑥 ≤ 𝑏 + 𝑀 (1 − 𝑧)
as in Sec. 9.1.3. The constant 𝑀 must be big enough to guarantee that the constraint
becomes redundant when 𝑧 = 0, or we risk shrinking the feasible set, perhaps to the
point of creating an infeasible model. On the other hand, the user might set 𝑀 to a
huge, safe value, say 𝑀 = 10¹². While this is mathematically correct, it will lead to a
model with poorer linear relaxations and most likely increase solution time (sometimes
dramatically). It is therefore very important to put some effort into estimating a
reasonable value of 𝑀 which is safe, but not too big.
Sum of squares
The sum of squares 𝑥1² + · · · + 𝑥𝑛² can be bounded above using the rotated quadratic cone:
\[ t \ge x_1^2 + \cdots + x_n^2 \iff \left(\tfrac{1}{2}, t, x\right) \in \mathcal{Q}_r^n, \]
but in most cases it is better to bound the square root with a quadratic cone:
\[ t' \ge \sqrt{x_1^2 + \cdots + x_n^2} \iff (t', x) \in \mathcal{Q}^n. \]
The latter has slightly better numerical properties: 𝑡′ is roughly comparable with 𝑥 and
is measured on the same scale, while 𝑡 grows as the square of 𝑥 which will sooner lead to
numerical instabilities.
Exponential cone
Using the function 𝑒^𝑥 for 𝑥 ≤ −30 or 𝑥 ≥ 30 would be rather questionable if 𝑒^𝑥 is supposed
to represent any realistic value. Note that 𝑒³⁰ ≈ 10¹³ and that 𝑒^30.0000001 − 𝑒³⁰ ≈ 10⁶, so a
tiny perturbation of the exponent produces a huge change of the result. Ideally, 𝑥 should
have a single-digit absolute value. For similar reasons, it is even more important to have
well-scaled data. For instance, in floating point arithmetic we have the “equality”
since the smaller summand disappears in comparison to the other one, 10¹⁷ times bigger.
For the same reason it is advised to replace explicit inequalities involving exp(𝑥) with
log-sum-exp variants (see Sec. 5.2.6). For example, suppose we have a constraint such as
which has the advantage that 𝑡′ is of the same order of magnitude as 𝑥𝑖 , and that the
exponents 𝑥𝑖 − 𝑡′ are likely to have much smaller absolute values than simply 𝑥𝑖 .
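The effects described in this subsection are easy to observe in double precision. The following sketch is an illustration under stated assumptions: the data vector `xs` is hypothetical, and the shifted form assumes the substitution 𝑡 = exp(𝑡′), so that a bound on a sum of exponentials is evaluated through exponents 𝑥𝑖 − 𝑡′ of moderate size.

```python
import math

# A summand that is about 1e17 times smaller is absorbed in double precision.
absorbed = (1e17 + 1.0 == 1e17)

# A perturbation of 1e-7 in the exponent changes exp(30) by about 1e6.
jump = math.exp(30.0000001) - math.exp(30.0)

def log_sum_exp(xs):
    """Return log(sum(exp(x))) with the exponents shifted by max(xs)."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

xs = [1000.0, 1000.5, 999.0]   # hypothetical data; math.exp(1000.0) overflows
t_prime = log_sum_exp(xs)

# In the shifted form the sum of exp(x_i - t') equals 1 at the tight bound.
shifted_sum = sum(math.exp(x - t_prime) for x in xs)
print(absorbed, jump, t_prime, shifted_sum)
```

Here `t_prime` has the same order of magnitude as the entries of `xs`, whereas the unshifted value exp(1000) cannot even be represented.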
Power cone
The power cone is not reliable when one of the exponents is very small. For example,
consider the function 𝑓 (𝑥) = 𝑥^0.01, which behaves almost like the indicator function of
𝑥 > 0 in the sense that
Now suppose 𝑥 = 10⁻²⁰. Is the constraint 𝑓 (𝑥) > 0.5 satisfied? In principle yes, but in
practice 𝑥 could have been obtained as a solution to an optimization problem and it may
in fact represent 0 up to numerical error. The function 𝑓 (𝑥) is sensitive to changes in 𝑥
well below standard numerical accuracy. The point 𝑥 = 0 does not satisfy 𝑓 (𝑥) > 0.5 but
it is only 10⁻³⁰ from doing so.
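A few lines of Python make these magnitudes concrete. This is only a sketch; the name `threshold` is introduced here for illustration and denotes the value of 𝑥 at which 𝑥^0.01 equals 0.5.

```python
def f(x):
    """The function f(x) = x**0.01 from the text."""
    return x ** 0.01

x = 1e-20
threshold = 0.5 ** 100      # f(x) > 0.5 exactly when x > 0.5**100

print(f(x))         # about 0.63, so the constraint f(x) > 0.5 holds
print(f(0.0))       # 0.0, so the constraint fails at x = 0
print(threshold)    # slightly below 1e-30
```

The constraint therefore switches between satisfied and violated within a distance of roughly 10⁻³⁰ from zero, far below any practical solver tolerance.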
• efficiency drops with growing number of semidefinite terms involved in linear con-
straints. This can have much bigger impact on the solution time than increasing
the dimension of the semidefinite variable.
Let us consider a few examples.
Block matrices
Given two matrix variables 𝑋, 𝑌 ⪰ 0 do not assemble them in a block matrix and write
the constraints as
\[ \begin{bmatrix} X & 0 \\ 0 & Y \end{bmatrix} \succeq 0. \]
This increases the dimension of the problem and, even worse, introduces unnecessary
constraints for a large portion of entries of the block matrix.
Schur complement
Suppose we want to model a relaxation of a rank-one constraint:
\[ \begin{bmatrix} X & x \\ x^T & 1 \end{bmatrix} \succeq 0, \]
where 𝑥 ∈ R𝑛 and 𝑋 ∈ 𝒮+𝑛 . The correct way to do this is to set up a matrix variable
𝑌 ∈ 𝒮+𝑛+1 with only a linear number of constraints:
𝑌𝑖,𝑛+1 = 𝑥𝑖 , 𝑖 = 1, . . . , 𝑛
𝑌𝑛+1,𝑛+1 = 1,
𝑌 ⪰ 0,
and use the upper-left 𝑛 × 𝑛 part of 𝑌 as the original 𝑋. Going the other way around, i.e.
starting with a variable 𝑋 and aligning it with the corner of another, bigger semidefinite
matrix 𝑌, introduces 𝑛(𝑛 + 1)/2 equality constraints and will quickly lead to formidable
solution times.
Sparse LMIs
Suppose we want to model a problem with a sparse linear matrix inequality (see Sec.
6.2.1) such as:
\[
\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & A_0 + \sum_{i=1}^{k} A_i x_i \succeq 0,
\end{array}
\]
and the linear constraint requires a full set of 𝑛(𝑛 + 1)/2 equalities, many of which are
just 𝑋𝑘,𝑙 = 0, regardless of the sparsity of 𝐴𝑖 . However the dual problem (see Sec. 8.6) is:
maximize −⟨𝐴0 , 𝑍⟩
subject to ⟨𝐴𝑖 , 𝑍⟩ = 𝑐𝑖 , 𝑖 = 1, . . . , 𝑘,
𝑍 ⪰ 0,
and the number of nonzeros in linear constraints is just the joint number of nonzeros in 𝐴𝑖 .
This means that large, sparse LMIs should almost always be dualized and entered in that
form for efficiency reasons.
−1.4142135623730951.
The true objective value is −√2, so the approximate objective value is wrong by the
amount
\[ 1.4142135623730951 - \sqrt{2} \approx 10^{-16}. \]
Most likely this difference is irrelevant for all practical purposes. Nevertheless, in general
a solution obtained using floating point arithmetic is only an approximation. Most (if
not all) commercial optimization software uses double precision floating point arithmetic,
implying that about 16 digits are correct in the computations performed internally by
the software. This also means that irrational numbers such as √2 and 𝜋 can only be
stored accurately within 16 digits.
Verifying feasibility
A good practice after solving an optimization problem is to evaluate the reported solution.
At the very least this process should be carried out during the initial phase of building
a model, or if the reported solution is unexpected in some way. The first step in that
process is to check that the solution is feasible; in case of the small example (7.1) this
amounts to checking that:
𝑥1 + 𝑥2 = 0.0000000000000000 + 1.4142135623730951 ≤ 3,
𝑥2² = 2.0000000000000004 ≤ 2, (?)
𝑥1 = 0.0000000000000000 ≥ 0,
𝑥2 = 1.4142135623730951 ≥ 0,
which demonstrates that one constraint is slightly violated due to computations in finite
precision. It is up to the user to assess the significance of a specific violation; for example
a violation of one unit in
𝑥1 + 𝑥2 ≤ 1
may or may not be more significant than a violation of one unit in
𝑥1 + 𝑥2 ≤ 10⁹.
The right-hand side of 10⁹ may itself be the result of a computation in finite precision,
and may only be known with, say, 3 digits of accuracy. Therefore, a violation of 1 unit is
not significant since the true right-hand side could just as well be 1.001 · 10⁹. Certainly a
violation of 4 · 10⁻¹⁶ as in our example is within the numerical tolerance for zero and we
would accept the solution as feasible. In practice it would be standard to see violations
of order 10⁻⁸.
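The feasibility check above is mechanical and can be scripted. The sketch below evaluates the four conditions at the reported point; the tolerance `tol` of 10⁻⁸ is an assumption chosen to match the typical violation size mentioned in the text, not a value prescribed by any solver.

```python
import math

x1, x2 = 0.0, math.sqrt(2.0)

# Residuals r = g(x) for constraints written as g(x) <= 0.
residuals = {
    "x1 + x2 <= 3": x1 + x2 - 3.0,
    "x2^2 <= 2": x2 * x2 - 2.0,
    "x1 >= 0": -x1,
    "x2 >= 0": -x2,
}

tol = 1e-8   # assumed feasibility tolerance
for name, r in residuals.items():
    print(f"{name}: violation {max(r, 0.0):.1e}, accepted={r <= tol}")
```

Only the quadratic constraint shows a positive violation, and it is many orders of magnitude below the tolerance.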
Verifying optimality
Another question is how to verify that a feasible approximate solution is actually optimal.
The answer is provided by duality theory, which is discussed in Sec. 2.4 for linear problems
and in Sec. 8 in full generality. Following Sec. 8 we can derive an equivalent version of
the dual problem as:
maximize −3𝑦1 − 𝑦2 − 𝑦3
subject to 2𝑦2 𝑦3 ≥ (𝑦1 + 1)²,
𝑦1 ≥ 0.
This problem has a feasible point (𝑦1 , 𝑦2 , 𝑦3 ) = (0, 1/√2, 1/√2) with objective value
−√2, matching the objective value of the primal problem. By duality theory this proves
that both the primal and dual solutions are optimal. Again, in finite precision the dual
objective value may be reported as, for example
−1.4142135623730950
leading to a duality gap of about 10⁻¹⁶ between the approximate primal and dual objective
values. It is up to the user to assess the significance of any duality gap and accept or
reject the solution as a sufficiently good approximation of the optimal value.
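The optimality certificate can be checked numerically in the same spirit. The sketch below takes the primal objective value reported earlier and the dual point given in the text, tests dual feasibility up to an assumed tolerance `tol`, and computes the duality gap.

```python
import math

# Primal objective value as reported in the text.
primal_obj = -1.4142135623730951

# Dual feasible point (y1, y2, y3) = (0, 1/sqrt(2), 1/sqrt(2)).
y1 = 0.0
y2 = y3 = 1.0 / math.sqrt(2.0)
dual_obj = -3.0 * y1 - y2 - y3

tol = 1e-8   # assumed tolerance for feasibility checks
dual_feasible = (2.0 * y2 * y3 - (y1 + 1.0) ** 2 >= -tol) and (y1 >= -tol)
gap = abs(primal_obj - dual_obj)
print(dual_feasible, dual_obj, gap)
```

The conic constraint holds with equality in exact arithmetic, so in floating point it may be violated by a rounding error; this is why the comparison uses a tolerance rather than an exact test.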
Example 7.9 (Surprisingly small distance). Let 𝐾 = 𝒫₃^{0.1,0.9} be the power cone defined
by the inequality 𝑥^0.1 𝑦^0.9 ≥ |𝑧|. The point
is clearly not in the cone, in fact 𝑥^0.1 𝑦^0.9 = 0 while |𝑧| = 500, so the violation of the
conic inequality is seemingly large. However, the point
Distance to certain cones
For some basic cone types the projection problem (7.2) can be solved analytically. De-
noting [𝑥]+ = max(0, 𝑥) we have that:
• For 𝑥 ∈ R the projection onto the nonnegative half-line is projR+ (𝑥) = [𝑥]+ .
• For 𝑋 = ∑_{𝑖=1}^{𝑛} 𝜆𝑖 𝑞𝑖 𝑞𝑖^𝑇 the projection onto the semidefinite cone is
\[ \mathrm{proj}_{\mathcal{S}_+^n}(X) = \sum_{i=1}^{n} [\lambda_i]_+ \, q_i q_i^T. \]
Chapter 8
In Sec. 2 we introduced duality and related concepts for linear optimization. Here we
present a more general version of this theory for conic optimization and we illustrate it
with examples. Although this chapter is self-contained, we recommend familiarity with
Sec. 2, of which some of this material is a natural extension.
Duality theory is a rich and powerful area of convex optimization, and central to
understanding sensitivity analysis and infeasibility issues. Furthermore, it provides a
simple and systematic way of obtaining non-trivial lower bounds on the optimal value for
many difficult non-convex problems.
From now on we consider a conic problem in standard form
minimize 𝑐𝑇 𝑥
subject to 𝐴𝑥 = 𝑏, (8.1)
𝑥 ∈ 𝐾,
𝐾 = 𝐾1 × · · · × 𝐾𝑚
ℱ𝑝 = {𝑥 ∈ R𝑛 | 𝐴𝑥 = 𝑏} ∩ 𝐾
𝑝⋆ = inf{𝑐𝑇 𝑥 : 𝑥 ∈ ℱ𝑝 },
Example 8.1 (Unattained problem value). The conic quadratic problem
minimize 𝑥
subject to (𝑥, 𝑦, 1) ∈ 𝒬3𝑟 ,
𝐾 * = {𝑦 ∈ R𝑛 : 𝑦 𝑇 𝑥 ≥ 0 ∀𝑥 ∈ 𝐾}. (8.2)
For simplicity we write 𝑦 𝑇 𝑥, denoting the standard Euclidean inner product, but every-
thing in this section applies verbatim to the inner product of matrices in the semidefinite
context. In other words 𝐾 * consists of vectors which form an acute (or right) angle with
every vector in 𝐾. We see easily that 𝐾 * is in fact a convex cone. The main example,
associated with linear optimization, is the dual of the positive orthant:
Lemma 8.1 (Dual of linear cone).
(R𝑛+ )* = R𝑛+ .
Proof. This is obvious for 𝑛 = 1, and then we use the simple fact
\[ (K_1 \times \cdots \times K_m)^* = K_1^* \times \cdots \times K_m^*, \]
Proof. We only sketch some parts of the proof and indicate geometric intuitions. First,
let us show (𝒬𝑛 )* = 𝒬𝑛 . For (𝑡, 𝑥), (𝑠, 𝑦) ∈ 𝒬𝑛 we have
𝑠𝑡 + 𝑦 𝑇 𝑥 ≥ ‖𝑦‖2 ‖𝑥‖2 + 𝑦 𝑇 𝑥 ≥ 0
using the inequality exp(𝑡) ≥ 1 + 𝑡. We refer to the literature for full proofs of all the
statements.
Finally it is nice to realize that (𝐾 * )* = 𝐾 and that the cones {0} and R are each
other’s duals. We leave it to the reader to check these facts.
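Self-duality of the quadratic cone can be spot-checked numerically. The sketch below is an illustration, not a proof: it samples pairs of points in 𝒬⁴ with hypothetical helper functions and confirms that their inner products are nonnegative, and it exhibits one point outside the cone whose inner product with a cone point is negative, so that point cannot belong to the dual cone.

```python
import math
import random

def in_quadratic_cone(v, tol=1e-12):
    """Membership test for the quadratic cone: v[0] >= ||v[1:]||_2."""
    return v[0] >= math.sqrt(sum(c * c for c in v[1:])) - tol

def random_cone_point(n, rng):
    """Sample a point of Q^n by lifting a random vector above its norm."""
    tail = [rng.uniform(-1.0, 1.0) for _ in range(n - 1)]
    head = math.sqrt(sum(c * c for c in tail)) + rng.uniform(0.0, 1.0)
    return [head] + tail

def inner(a, b):
    return sum(p * q for p, q in zip(a, b))

rng = random.Random(0)
pairs = [(random_cone_point(4, rng), random_cone_point(4, rng))
         for _ in range(1000)]
min_inner = min(inner(p, q) for p, q in pairs)

outside = [1.0, -2.0, 0.0, 0.0]   # not in Q^4 since 1 < 2
witness = [1.0, 1.0, 0.0, 0.0]    # in Q^4
print(min_inner, inner(outside, witness))
```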
Problems which fall under option 2. (limit-feasible) are ill-posed: an arbitrarily small
perturbation of input data puts the problem in either category 1. or 3. This fringe
case should therefore not appear in practice, and if it does, it signals issues with the
optimization model which should be addressed.
Example 8.2 (Limit-feasible model). Here is an example of an ill-posed limit-feasible
model created by fixing one of the root variables of a rotated quadratic cone to 0.
minimize 𝑢
subject to (𝑢, 𝑣, 𝑤) ∈ 𝒬3𝑟 ,
𝑣 = 0,
𝑤 ≥ 1.
The problem is clearly infeasible, but the sequence (𝑢𝑛 , 𝑣𝑛 , 𝑤𝑛 ) = (𝑛, 1/𝑛, 1) ∈ 𝒬3𝑟 with
(𝑣𝑛 , 𝑤𝑛 ) → (0, 1) makes it limit-feasible as in alternative 2. of Lemma 8.3. There is no
infeasibility certificate as in alternative 3.
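The limit-feasible behaviour of Example 8.2 can be traced with exact arithmetic. The sketch assumes the convention 2𝑢𝑣 ≥ 𝑤², 𝑢, 𝑣 ≥ 0 for the rotated quadratic cone; the helper name `in_rotated_cone` is hypothetical.

```python
from fractions import Fraction

def in_rotated_cone(u, v, w):
    """Membership in the rotated quadratic cone: 2*u*v >= w^2, u, v >= 0."""
    return u >= 0 and v >= 0 and 2 * u * v >= w * w

# Points of the sequence (n, 1/n, 1) stay in the cone while v_n tends to 0.
for n in (1, 10, 100, 1000):
    u, v, w = Fraction(n), Fraction(1, n), Fraction(1)
    print(n, in_rotated_cone(u, v, w), float(v))

# With v fixed to 0 and w = 1 no value of u gives membership.
print(any(in_rotated_cone(Fraction(u), 0, 1) for u in range(0, 10**4, 97)))
```

Every point of the sequence is in the cone, but the constraint 𝑣 = 0 is only met in the limit, where 𝑢 has to grow without bound.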
Having cleared this detail we continue with the proof and example for the actually
useful part of conic Farkas’ lemma.
Proof. Consider the set
𝑏𝑇 𝑦 > 0, (𝐴𝑥)𝑇 𝑦 ≤ 0 ∀𝑥 ∈ 𝐾.
and the same argument works in the limit if 𝑥𝑛 is a sequence as in 2. That proves the
lemma.
Therefore Farkas’ lemma implies that (up to certain ill-posed cases) either the problem
(8.1) is feasible (first alternative) or there is a certificate of infeasibility 𝑦 (last alternative).
In other words, every time we classify a model as infeasible, we can certify this fact by
providing an appropriate 𝑦. Note that when 𝐾 = R𝑛+ we recover precisely the linear
version, Lemma 2.1.
\[
\begin{array}{ll}
\text{minimize} & x_1 \\
\text{subject to} & -x_1 + x_2 - x_4 = 1, \\
& 2x_3 - 3x_4 = -1, \\
& x_1 \ge \sqrt{x_2^2 + x_3^2}, \\
& x_4 \ge 0.
\end{array}
\]
It can be expressed in the standard form (8.1) with
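\[
A = \begin{bmatrix} -1 & 1 & 0 & -1 \\ 0 & 0 & 2 & -3 \end{bmatrix}, \quad
b = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \quad
K = \mathcal{Q}^3 \times \mathbb{R}_+, \quad
c = [1, 0, 0, 0]^T.
\]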
𝑥2 − 1 ≥ 𝑥2 − 1 − 𝑥4 = 𝑥1 ≥ |𝑥2 |.
For any feasible 𝑥* ∈ ℱ𝑝 and any (𝑦 * , 𝑠* ) ∈ R𝑚 × 𝐾 * we have
𝐿(𝑥* , 𝑦 * , 𝑠* ) = 𝑐𝑇 𝑥* + (𝑦 * )𝑇 · 0 − (𝑠* )𝑇 𝑥* ≤ 𝑐𝑇 𝑥* . (8.5)
Note that we used the definition of the dual cone to conclude that (𝑠* )𝑇 𝑥* ≥ 0. The dual
function is defined as the minimum of 𝐿(𝑥, 𝑦, 𝑠) over 𝑥. Thus the dual function of (8.1)
is
\[
g(y, s) = \min_x L(x, y, s) = \min_x \; x^T(c - A^T y - s) + b^T y =
\begin{cases} b^T y, & c - A^T y - s = 0, \\ -\infty, & \text{otherwise.} \end{cases}
\]
Dual problem
From (8.5) every (𝑦, 𝑠) ∈ R𝑚 × 𝐾 * satisfies 𝑔(𝑦, 𝑠) ≤ 𝑝⋆ , i.e. 𝑔(𝑦, 𝑠) is a lower bound for
𝑝⋆ . To get the best such bound we maximize 𝑔(𝑦, 𝑠) over all (𝑦, 𝑠) ∈ R𝑚 × 𝐾 * and get
the dual problem:
maximize 𝑏𝑇 𝑦
subject to 𝑐 − 𝐴𝑇 𝑦 = 𝑠, (8.6)
𝑠 ∈ 𝐾 *,
or simply:
maximize 𝑏𝑇 𝑦
subject to 𝑐 − 𝐴𝑇 𝑦 ∈ 𝐾 * . (8.7)
The optimal value of (8.6) will be denoted 𝑑⋆ . As in the case of (8.1) (which from now
on we call the primal problem), the dual problem can be infeasible (𝑑⋆ = −∞), have an
optimal solution (−∞ < 𝑑⋆ < +∞) or be unbounded (𝑑⋆ = +∞). As before, the value
𝑑⋆ is defined as a supremum of 𝑏𝑇 𝑦 and may not be attained. Note that the roles of −∞
and +∞ are now reversed because the dual is a maximization problem.
Example 8.4 (More general constraints). We can just as easily derive the dual of a
problem with more general constraints, without necessarily having to transform the
problem to standard form beforehand. Imagine, for example, that some solver accepts
conic problems of the form:
\[
\begin{array}{ll}
\text{minimize} & c^T x + c^f \\
\text{subject to} & l^c \le Ax \le u^c, \\
& l^x \le x \le u^x, \\
& x \in K.
\end{array}
\tag{8.8}
\]
Then we define a Lagrangian with one set of dual variables for each constraint appearing
in the problem:
\[
\begin{array}{rcl}
L(x, s_l^c, s_u^c, s_l^x, s_u^x, s_n^x) & = & c^T x + c^f - (s_l^c)^T(Ax - l^c) - (s_u^c)^T(u^c - Ax) \\
& & - (s_l^x)^T(x - l^x) - (s_u^x)^T(u^x - x) - (s_n^x)^T x \\
& = & x^T(c - A^T s_l^c + A^T s_u^c - s_l^x + s_u^x - s_n^x) \\
& & + (l^c)^T s_l^c - (u^c)^T s_u^c + (l^x)^T s_l^x - (u^x)^T s_u^x + c^f
\end{array}
\]
and that gives a dual problem
\[
\begin{array}{ll}
\text{maximize} & (l^c)^T s_l^c - (u^c)^T s_u^c + (l^x)^T s_l^x - (u^x)^T s_u^x + c^f \\
\text{subject to} & c + A^T(-s_l^c + s_u^c) - s_l^x + s_u^x - s_n^x = 0, \\
& s_l^c, s_u^c, s_l^x, s_u^x \ge 0, \\
& s_n^x \in K^*.
\end{array}
\]
Example 8.5 (Dual of simple portfolio). Consider a simplified version of the portfolio
optimization problem, where we maximize expected return subject to an upper bound
on the risk and no other constraints:
maximize 𝜇𝑇 𝑥
subject to ‖𝐹 𝑥‖2 ≤ 𝛾,
𝐿(𝑥, 𝑣, 𝑤) = 𝜇𝑇 𝑥 + 𝑣𝛾 + 𝑤𝑇 𝐹 𝑥 = 𝑥𝑇 (𝜇 + 𝐹 𝑇 𝑤) + 𝑣𝛾
with (𝑣, 𝑤) ∈ (𝒬𝑚+1 )* = 𝒬𝑚+1 . Note that we chose signs to have 𝐿(𝑥, 𝑣, 𝑤) ≥ 𝜇𝑇 𝑥
since we are dealing with a maximization problem. The dual is now determined by the
conditions 𝐹 𝑇 𝑤 + 𝜇 = 0 and 𝑣 ≥ ‖𝑤‖2 , so it can be formulated as
minimize 𝛾‖𝑤‖2
subject to 𝐹 𝑇 𝑤 = −𝜇. (8.10)
Note that it is actually more natural to view problem (8.9) as the dual form and
problem (8.10) as the primal. Indeed we can write the constraint in (8.9) as
\[
\begin{bmatrix} \gamma \\ 0 \end{bmatrix} - \begin{bmatrix} 0 \\ -F \end{bmatrix} x \in (\mathcal{Q}^{m+1})^*
\]
which fits naturally into the form (8.7). Having done this we can recover the dual as
a minimization problem in the standard form (8.1). We leave it to the reader to check
that we get the same answer as above.
so the dual objective value is a lower bound on the objective value of the primal. In
particular, any dual feasible point (𝑦 * , 𝑠* ) gives a lower bound:
𝑏𝑇 𝑦 * ≤ 𝑝 ⋆
Lemma 8.4 (Weak duality). 𝑑⋆ ≤ 𝑝⋆ .
It follows that if 𝑏𝑇 𝑦 * = 𝑐𝑇 𝑥* then 𝑥* is optimal for the primal, (𝑦 * , 𝑠* ) is optimal for
the dual and 𝑏𝑇 𝑦 * = 𝑐𝑇 𝑥* is the common optimal objective value. This way we can use
the optimal dual solution to certify optimality of the primal solution and vice versa.
Complementary slackness
Moreover, (8.11) asserts that 𝑏𝑇 𝑦 * = 𝑐𝑇 𝑥* is equivalent to orthogonality
(𝑠* )𝑇 𝑥* = 0
i.e. complementary slackness. It is not hard to verify what complementary slackness
means for particular types of cones, for example
• for 𝑠, 𝑥 ∈ R+ we have 𝑠𝑥 = 0 iff 𝑠 = 0 or 𝑥 = 0,
• vectors (𝑠1 , 𝑠˜), (𝑥1 , 𝑥˜) ∈ 𝒬^{𝑛+1} are orthogonal iff (𝑠1 , 𝑠˜) and (𝑥1 , −𝑥˜) are parallel,
• vectors (𝑠1 , 𝑠2 , 𝑠˜), (𝑥1 , 𝑥2 , 𝑥˜) ∈ 𝒬r^{𝑛+2} are orthogonal iff (𝑠1 , 𝑠2 , 𝑠˜) and (𝑥2 , 𝑥1 , −𝑥˜) are
parallel.
One implicit case is worth special attention: complementary slackness for a linear
inequality constraint 𝑎𝑇 𝑥 ≤ 𝑏 with a non-negative dual variable 𝑦 asserts that (𝑎𝑇 𝑥* −
𝑏)𝑦 * = 0. This can be seen by directly writing down the appropriate Lagrangian for this
type of constraint. Alternatively, we can introduce a slack variable 𝑢 = 𝑏 − 𝑎𝑇 𝑥 with a
conic constraint 𝑢 ≥ 0 and let 𝑦 be the dual conic variable. In particular, if a constraint
is non-binding in the optimal solution (𝑎𝑇 𝑥* < 𝑏) then the corresponding dual variable
𝑦 * = 0. If 𝑦 * > 0 then it can be related to a shadow price, see Sec. 2.4.4 and Sec. 8.5.2.
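The quadratic cone case of complementary slackness can be illustrated with a small integer example. The vectors below are hypothetical; both lie on the boundary of 𝒬³, and 𝑠 is chosen as a positive multiple of (𝑥1, −𝑥˜).

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Two points on the boundary of Q^3 (first entry equals the norm of the rest).
x = (5, 3, 4)
s = (10, -6, -8)            # equals 2 * (x1, -x_tilde)

reflected = (x[0], -x[1], -x[2])
ratio = s[0] / reflected[0]
parallel = all(si == ratio * ri for si, ri in zip(s, reflected))

print(dot(s, x), parallel)  # orthogonal and parallel to the reflection

# A cone point that is not parallel to the reflection is not orthogonal to x.
s_other = (5, 3, 4)
print(dot(s_other, x))
```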
Strong duality
The obvious question is now whether 𝑑⋆ = 𝑝⋆ , that is if optimality of a primal solution
can always be certified by a dual solution with matching objective value, as for linear
programming. This turns out not to be the case for general conic problems.
\[
\begin{array}{ll}
\text{minimize} & x_3 \\
\text{subject to} & x_1 \ge \sqrt{x_2^2 + x_3^2}, \\
& x_2 \ge x_1, \quad x_3 \ge -1.
\end{array}
\]
The only feasible points are (𝑥, 𝑥, 0), so 𝑝⋆ = 0. The dual problem is
\[
\begin{array}{ll}
\text{maximize} & -y_2 \\
\text{subject to} & y_1 \ge \sqrt{y_1^2 + (1 - y_2)^2},
\end{array}
\]
with feasible set {𝑥1 = 0, 𝑥2 ≥ 0} and optimal value 𝑝⋆ = 0. The dual problem can be
formulated as
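\[
\begin{array}{ll}
\text{maximize} & -z_2 \\
\text{subject to} & \begin{bmatrix} z_1 & (1 - z_2)/2 & 0 \\ (1 - z_2)/2 & 0 & 0 \\ 0 & 0 & z_2 \end{bmatrix} \in \mathcal{S}_+^3,
\end{array}
\]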
which has a feasible set {𝑧1 ≥ 0, 𝑧2 = 1} and dual optimal value 𝑑⋆ = −1.
To ensure strong duality for conic problems we need an additional regularity assumption.
As with the conic version of Farkas’ lemma (Lemma 8.3), we stress that this is a
technical condition to eliminate ill-posed problems which should not be formed in practice.
In particular, we invite the reader to think that strong duality holds for all well-formed
conic problems one is likely to come across in applications, and that having a
duality gap signals issues with the model formulation.
We say problem (8.1) is very nicely posed if for all values of 𝑐0 the feasibility problem
𝑐𝑇 𝑥 = 𝑐0 , 𝐴𝑥 = 𝑏, 𝑥 ∈ 𝐾
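\[
\begin{bmatrix} -c^T \\ A \end{bmatrix} x = \begin{bmatrix} -p^\star + \varepsilon \\ b \end{bmatrix}, \ x \in K,
\quad \text{that is} \quad
c^T x = p^\star - \varepsilon, \ Ax = b, \ x \in K.
\]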
Optimality of 𝑝⋆ implies that the above problem is infeasible. By Lemma 8.3 and because
we assumed very-nice-posedness there exists 𝑦ˆ = [𝑦0 𝑦]𝑇 such that
If 𝑦0 = 0 then −𝐴𝑇 𝑦 ∈ 𝐾 * and 𝑏𝑇 𝑦 > 0, which by Lemma 8.3 again would mean that
the original problem was infeasible, which is not the case. Hence we can rescale so that
𝑦0 = 1 and then we get
𝑐 − 𝐴𝑇 𝑦 ∈ 𝐾 * and 𝑏𝑇 𝑦 ≥ 𝑝⋆ − 𝜀.
The first constraint means that 𝑦 is feasible for the dual problem. By letting 𝜀 → 0 we
obtain 𝑑⋆ ≥ 𝑝⋆ .
There are more direct conditions which guarantee strong duality, such as below.
Lemma 8.6 (Slater constraint qualification). Suppose that (8.1) is strictly feasible: there
exists a point 𝑥 ∈ int(𝐾) in the interior of 𝐾 such that 𝐴𝑥 = 𝑏. Then strong duality
holds if 𝑝⋆ is finite. Moreover, if both primal and dual problem are strictly feasible then
𝑝⋆ and 𝑑⋆ are attained.
We omit the proof which can be found in standard texts. Note that the first problem
from Example 8.6 does not satisfy Slater constraint qualification: the only feasible points
lie on the boundary of the cone (we say the problem has no interior). That problem
is not very nicely posed either: the point (𝑥1 , 𝑥2 , 𝑥3 ) = (0.5𝑐0²𝜀⁻¹ + 𝜀, 0.5𝑐0²𝜀⁻¹, 𝑐0 ) ∈ 𝒬³
violates the inequality 𝑥2 ≥ 𝑥1 by an arbitrarily small 𝜀, so the problem is infeasible but
limit-feasible (second alternative in Lemma 8.3).
minimize 𝑡
subject to (𝑡, 𝐴𝑥 − 𝑏) ∈ 𝒬𝑚+1 ,
𝑢 = 1, 𝐴𝑇 𝑣 = 0, (𝑢, 𝑣) ∈ 𝑄𝑚+1 .
The problem exhibits strong duality with both the primal and dual values attained in
the optimal solution. The primal solution clearly satisfies 𝑡 = ‖𝐴𝑥 − 𝑏‖2 , and so comple-
mentary slackness for the quadratic cone implies that the vectors (𝑢, −𝑣) = (1, −𝑣) and
(𝑡, 𝐴𝑥−𝑏) are parallel. As a consequence the constraint 𝐴𝑇 𝑣 = 0 becomes 𝐴𝑇 (𝐴𝑥−𝑏) = 0
or simply
𝐴𝑇 𝐴𝑥 = 𝐴𝑇 𝑏
so if 𝐴𝑇 𝐴 is invertible then 𝑥 = (𝐴𝑇 𝐴)−1 𝐴𝑇 𝑏. This is the so-called normal equation for
least-squares regression, which we now obtained as a consequence of strong duality.
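The normal equation can be verified on a tiny instance using exact rational arithmetic. The data and the helper `normal_equations_2` are hypothetical; the helper only handles matrices with two columns and solves the 2 × 2 system by Cramer's rule.

```python
from fractions import Fraction as F

# Hypothetical data: fit b ~ x0 + x1 * t at t = 0, 1, 2.
A = [[F(1), F(0)], [F(1), F(1)], [F(1), F(2)]]
b = [F(1), F(2), F(2)]

def normal_equations_2(A, b):
    """Solve A^T A x = A^T b for a matrix A with two columns."""
    g00 = sum(r[0] * r[0] for r in A)
    g01 = sum(r[0] * r[1] for r in A)
    g11 = sum(r[1] * r[1] for r in A)
    h0 = sum(r[0] * bi for r, bi in zip(A, b))
    h1 = sum(r[1] * bi for r, bi in zip(A, b))
    det = g00 * g11 - g01 * g01      # nonzero iff A^T A is invertible
    return [(h0 * g11 - g01 * h1) / det, (g00 * h1 - g01 * h0) / det]

x = normal_equations_2(A, b)
residual = [sum(aij * xj for aij, xj in zip(row, x)) - bi
            for row, bi in zip(A, b)]
# Optimality condition from the text: A^T (A x - b) = 0.
gradient = [sum(row[j] * ri for row, ri in zip(A, residual)) for j in range(2)]
print(x, gradient)
```

The residual 𝐴𝑥 − 𝑏 is orthogonal to the columns of 𝐴, which is the condition 𝐴ᵀ(𝐴𝑥 − 𝑏) = 0 derived above from complementary slackness.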
\[
\begin{array}{ll}
\text{maximize} & \alpha^T x - \tfrac{1}{2} c\, x^T \Sigma x \\
\text{subject to} & Ax \le b,
\end{array}
\tag{8.12}
\]
where the linear part represents any set of additional constraints: total budget, sector
constraints, diversification constraints, individual relations between positions etc. In the
absence of additional constraints the solution to the unconstrained maximization problem
is easy to derive using basic calculus and equals
\[ \hat{x} = c^{-1} \Sigma^{-1} \alpha. \]
We would like to understand the difference 𝑥* − 𝑥ˆ, where 𝑥* is the solution of (8.12), and
in particular to measure which of the linear constraints actually cause 𝑥* to deviate from
𝑥ˆ and to what degree. This can be quantified using the dual variables.
The conic version of problem (8.12) is
maximize 𝛼𝑇 𝑥 − 𝑐𝑟
subject to 𝐴𝑥 ≤ 𝑏,
(1, 𝑟, 𝐹 𝑥) ∈ 𝒬r
with dual
minimize 𝑏𝑇 𝑦 + 𝑠
subject to 𝐴𝑇 𝑦 = 𝛼 + 𝐹 𝑇 𝑢,
(𝑠, 𝑐, 𝑢) ∈ 𝒬r ,
𝑦 ≥ 0.
(𝑠* , 𝑐, 𝑢* ) = 𝛽(𝑟* , 1, −𝐹 𝑥* )
𝐴𝑇 𝑦 * = 𝛼 − 𝑐𝐹 𝑇 𝐹 𝑥*
or equivalently
\[ x^* = \hat{x} - c^{-1} \Sigma^{-1} A^T y^*. \]
This equation splits the difference between the constrained and unconstrained solutions
into contributions from individual constraints, where the weights are precisely the dual
variables 𝑦 * . For example, if a constraint is not binding (𝑎𝑇𝑖 𝑥* − 𝑏𝑖 < 0) then by com-
plementary slackness 𝑦𝑖* = 0 and, indeed, a non-binding constraint has no effect on the
change in solution.
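A one-dimensional instance shows the decomposition at work. Everything in the sketch is a hypothetical illustration: a single asset with variance `sigma2`, a single constraint 𝑥 ≤ 𝑏 (so 𝐴 = [1]), and a multiplier obtained from the stationarity condition 𝛼 − 𝑐Σ𝑥* − 𝐴ᵀ𝑦* = 0.

```python
# Hypothetical scalar instance: maximize alpha*x - 0.5*c*sigma2*x^2  s.t.  x <= b.
alpha, c, sigma2, b = 3.0, 2.0, 0.5, 1.0

x_hat = alpha / (c * sigma2)           # unconstrained maximizer
x_star = min(x_hat, b)                 # constrained maximizer (concave objective)
y_star = alpha - c * sigma2 * x_star   # multiplier of x <= b from stationarity

# Decomposition from the text with A = [1] and Sigma = sigma2.
reconstructed = x_hat - y_star / (c * sigma2)
print(x_hat, x_star, y_star, reconstructed)
```

With these numbers the constraint is binding, the multiplier is positive, and the shift from `x_hat` to `x_star` is explained entirely by `y_star`. Raising `b` above `x_hat` makes the constraint non-binding and the multiplier zero.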
minimize ⟨𝐶, 𝑋⟩
subject to ⟨𝐴𝑖 , 𝑋⟩ = 𝑏𝑖 , 𝑖 = 1, . . . , 𝑚, (8.13)
𝑋 ∈ 𝒮+𝑛 .
We can quickly repeat the derivation of the dual problem in this notation. The Lagrangian
is
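\[
\begin{array}{rcl}
L(X, y, S) & = & \langle C, X \rangle - \sum_i y_i (\langle A_i, X \rangle - b_i) - \langle S, X \rangle \\
& = & \langle C - \sum_i y_i A_i - S, X \rangle + b^T y
\end{array}
\]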
and we get the dual problem
\[
\begin{array}{ll}
\text{maximize} & b^T y \\
\text{subject to} & C - \sum_{i=1}^{m} y_i A_i \in \mathcal{S}_+^n.
\end{array}
\tag{8.14}
\]
Obviously we should only use these transformations when necessary; if we have a problem
that is more naturally interpreted in either primal or dual form, we should be careful to
recognize that structure.
Example 8.8 (Sum of singular values revisited). In Sec. 6.2.4, and specifically in
(6.16), we expressed the problem of minimizing the sum of singular values of a non-
symmetric matrix 𝑋. Problem (6.16) can be written as an LMI:
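\[
\begin{array}{ll}
\text{maximize} & \sum_{i,j} X_{ij} z_{ij} \\
\text{subject to} & I - \sum_{i,j} z_{ij} \begin{bmatrix} 0 & e_j e_i^T \\ e_i e_j^T & 0 \end{bmatrix} \succeq 0.
\end{array}
\]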
Treating this as the dual and going back to the primal form we get:
which is equivalent to the claimed (6.17). The dual formulation has the advantage of
being linear in 𝑋.
Chapter 9
In other chapters of this cookbook we have considered different classes of convex problems
with continuous variables. In this chapter we consider a much wider range of non-convex
problems by allowing integer variables. This technique is extremely useful in practice,
and already for linear programming it covers a vast range of problems. We introduce
different building blocks for integer optimization, which make it possible to model useful
non-convex dependencies between variables in conic problems. It should be noted that
mixed integer optimization problems are very hard (technically NP-hard), and for many
practical cases an exact solution may not be found in reasonable time.
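\[
\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & Ax = b, \\
& x \in K, \\
& x_i \in \mathbb{Z}, \quad \forall i \in \mathcal{I},
\end{array}
\tag{9.1}
\]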
where 𝐾 is a cone and ℐ ⊆ {1, . . . , 𝑛} denotes the set of variables that are constrained
to be integers.
Two major techniques are typical for mixed integer optimization. The first one is the
use of binary variables, also known as indicator variables, which only take values 0 and
1, and indicate the absence or presence of a particular event or choice. This restriction
can of course be modeled in the form (9.1) by writing:
0 ≤ 𝑥 ≤ 1 and 𝑥 ∈ Z.
The other, known as big-M, refers to the fact that some relations can only be modeled
linearly if one assumes some fixed bound 𝑀 on the quantities involved, and this constant
enters the model formulation. The choice of 𝑀 can affect the performance of the model,
see Example 7.8.
9.1.1 Implication of positivity
Often we have a real-valued variable 𝑥 ∈ R satisfying 0 ≤ 𝑥 < 𝑀 for a known upper
bound 𝑀 , and we wish to model the implication
𝑥 > 0 =⇒ 𝑧 = 1. (9.2)
Example 9.1 (Fixed setup cost). Assume that production of a specific item 𝑖 costs 𝑢𝑖
per unit, but there is an additional fixed charge of 𝑤𝑖 if we produce item 𝑖 at all. For
instance, 𝑤𝑖 could be the cost of setting up a production plant, initial cost of equipment
etc. Then the cost of producing 𝑥𝑖 units of product 𝑖 is given by the discontinuous
function
\[
c_i(x_i) = \begin{cases} w_i + u_i x_i, & x_i > 0, \\ 0, & x_i = 0. \end{cases}
\]
If we let 𝑀 denote an upper bound on the quantities we can produce, we can then
minimize the total production cost of 𝑛 products under some affine constraint 𝐴𝑥 = 𝑏
with
minimize 𝑢𝑇 𝑥 + 𝑤𝑇 𝑧
subject to 𝐴𝑥 = 𝑏,
𝑥𝑖 ≤ 𝑀 𝑧𝑖 , 𝑖 = 1, . . . , 𝑛
𝑥 ≥ 0,
𝑧 ∈ {0, 1}𝑛 ,
[Figure: total cost versus quantities produced, with fixed charge 𝑤𝑖 and unit cost 𝑢𝑖.]
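The fixed setup cost model of Example 9.1 can be checked by enumerating the binary variable for a single product. The function name and the numbers in the sketch are hypothetical.

```python
def production_cost(x, u, w, M):
    """Cheapest value of u*x + w*z over z in {0, 1} with 0 <= x <= M*z.

    Returns None when x exceeds the bound M and no z is feasible.
    """
    feasible = [u * x + w * z for z in (0, 1) if 0 <= x <= M * z]
    return min(feasible) if feasible else None

# Hypothetical data: unit cost 2, setup cost 10, capacity bound M = 5.
for x in (0, 1, 3, 5, 6):
    print(x, production_cost(x, u=2, w=10, M=5))
```

For 𝑥 = 0 the minimization picks 𝑧 = 0 and the cost is zero, while any positive quantity forces 𝑧 = 1 and incurs the setup cost, reproducing the discontinuous cost function.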
𝑧 = 1 =⇒ 𝑎𝑇 𝑥 ≤ 𝑏.
Suppose we know in advance an upper bound 𝑎𝑇 𝑥 − 𝑏 ≤ 𝑀 . Then we can write the above
as a linear inequality
𝑎𝑇 𝑥 ≤ 𝑏 + 𝑀 (1 − 𝑧).
Introducing binary variables 𝑧1 , . . . , 𝑧𝑘 , we can use Sec. 9.1.3 to write a linear model
𝑧1 + · · · + 𝑧𝑘 ≥ 1,
𝑧1 , . . . , 𝑧𝑘 ∈ {0, 1},
𝑎𝑇𝑖 𝑥 ≤ 𝑏𝑖 + 𝑀 (1 − 𝑧𝑖 ), 𝑖 = 1, . . . , 𝑘.
Note that 𝑧𝑗 = 1 implies that the 𝑗-th constraint is satisfied, but not vice-versa. Achieving
that effect is described in the next section.
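For scalar 𝑥 the big-M model of a disjunction is small enough to verify by enumerating all indicator vectors. The sketch below is illustrative; the function name, the constraint data and the value of 𝑀 are assumptions made for the example.

```python
from itertools import product

def satisfies_disjunction(x, constraints, M):
    """Is there z in {0,1}^k with sum(z) >= 1 making the big-M model feasible?

    constraints is a list of scalar pairs (a_i, b_i) encoding a_i * x <= b_i.
    """
    for z in product((0, 1), repeat=len(constraints)):
        if sum(z) >= 1 and all(a * x <= b + M * (1 - zi)
                               for (a, b), zi in zip(constraints, z)):
            return True
    return False

# Hypothetical disjunction on 0 <= x <= 5: either x <= 1 or x >= 3.
constraints = [(1, 1), (-1, -3)]
M = 4   # a_i * x - b_i <= 4 holds for every x in [0, 5]
for x in (0.5, 2, 4):
    print(x, satisfies_disjunction(x, constraints, M))
```

Points in the gap between the two regions are rejected, and points in either region are accepted with the matching indicator set to one.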
9.1.5 Constraint satisfaction
Say we want to define an optimization model that will behave differently depending on
which of the inequalities
𝑎𝑇 𝑥 ≤ 𝑏 or 𝑎𝑇 𝑥 ≥ 𝑏
is satisfied. Suppose we have lower and upper bounds for 𝑎𝑇 𝑥 − 𝑏 in the form of 𝑚 ≤
𝑎𝑇 𝑥 − 𝑏 ≤ 𝑀 . Then we can write a model
𝑏 + (𝑚 − 𝜖)𝑧 + 𝜖 ≤ 𝑎𝑇 𝑥 (9.6)
|𝑥| = 𝑡. (9.7)
It defines a non-convex set, hence it is not conic representable. If we split 𝑥 into positive
and negative part 𝑥 = 𝑥+ − 𝑥− , where 𝑥+ , 𝑥− ≥ 0, then |𝑥| = 𝑥+ + 𝑥− as long as either
𝑥+ = 0 or 𝑥− = 0. That last alternative can be modeled with a binary variable, and we
get a model of (9.7):
\[
\begin{array}{l}
x = x^+ - x^-, \\
t = x^+ + x^-, \\
0 \le x^+, x^-, \\
x^+ \le M z, \\
x^- \le M(1 - z), \\
z \in \{0, 1\},
\end{array}
\tag{9.8}
\]
where the constant 𝑀 is an a priori known upper bound on |𝑥| in the problem.
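Model (9.8) can be sanity-checked by enumerating the binary variable. The helper below is a hypothetical sketch: for each 𝑧 the bounds force one of the two parts to zero, the equation 𝑥 = 𝑥⁺ − 𝑥⁻ fixes the other, and the function collects every value of 𝑡 that the model allows.

```python
def abs_via_big_m(x, M):
    """Return the set of t values allowed by model (9.8) for a given x."""
    values = set()
    for z in (0, 1):
        # z = 1 forces x_minus = 0, z = 0 forces x_plus = 0.
        x_plus, x_minus = (x, 0) if z == 1 else (0, -x)
        if 0 <= x_plus <= M * z and 0 <= x_minus <= M * (1 - z):
            values.add(x_plus + x_minus)
    return values

for x in (3, -4, 0, 12):
    print(x, abs_via_big_m(x, M=10))
```

For |𝑥| ≤ 𝑀 the only admissible value is 𝑡 = |𝑥|, and for |𝑥| > 𝑀 the model becomes infeasible, which is why 𝑀 must be a valid bound.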
where 𝑥 ∈ R𝑛 is a decision variable and 𝑐 is a constant. Such constraints arise for instance
in fully invested portfolio optimization scenarios (with short-selling). As before, we split
𝑥 into a positive and negative part, using a sequence of binary variables to guarantee
that at most one of them is nonzero:
\[
\begin{array}{l}
x = x^+ - x^-, \\
0 \le x^+, x^-, \\
x^+ \le c z, \\
x^- \le c(e - z), \\
\sum_i x_i^+ + \sum_i x_i^- = c, \\
z \in \{0, 1\}^n, \quad x^+, x^- \in \mathbb{R}^n.
\end{array}
\tag{9.10}
\]
9.1.8 Maximum
The exact equality 𝑡 = max{𝑥1 , . . . , 𝑥𝑛 } can be expressed by introducing a sequence of
mutually exclusive indicator variables 𝑧1 , . . . , 𝑧𝑛 , with the intention that 𝑧𝑖 = 1 picks the
variable 𝑥𝑖 which actually achieves maximum. Choosing a safe bound 𝑀 we get a model:
𝑥𝑖 ≤ 𝑡 ≤ 𝑥𝑖 + 𝑀 (1 − 𝑧𝑖 ), 𝑖 = 1, . . . , 𝑛,
𝑧1 + · · · + 𝑧𝑛 = 1, (9.11)
𝑧 ∈ {0, 1}𝑛 .
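Model (9.11) can be tested by enumerating the unit vectors 𝑧. The sketch uses hypothetical data and a hypothetical helper `feasible_t`; the bound 𝑀 is chosen at least as large as the spread of the data.

```python
def feasible_t(t, xs, M):
    """Is t feasible for model (9.11) with some indicator vector z?"""
    n = len(xs)
    for k in range(n):                       # z is the k-th unit vector
        z = [1 if i == k else 0 for i in range(n)]
        if all(x <= t <= x + M * (1 - zi) for x, zi in zip(xs, z)):
            return True
    return False

xs = [2.0, 7.0, 5.0]      # hypothetical data
M = 10.0                  # safe bound: M >= max(xs) - min(xs)
for t in (6.0, 7.0, 8.0):
    print(t, feasible_t(t, xs, M))
```

Only 𝑡 equal to the largest entry passes: smaller values violate 𝑥𝑖 ≤ 𝑡 and larger ones violate the upper bound for whichever index is selected.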
• (sign) 𝑥 ∈ {−1, 1} ⇐⇒ 𝑥 = 2𝑧 − 1, 𝑧 ∈ {0, 1},
𝑥𝑦 = 0.
|𝑥| ≤ 𝑀 𝑧,
|𝑦| ≤ 𝑀 (1 − 𝑧),
𝑧 ∈ {0, 1},
for a suitable constant 𝑀 . The absolute value can be omitted if both 𝑥, 𝑦 are nonnegative.
Otherwise it can be modeled as in Sec. 2.2.2.
−𝑀 (1 − 𝑧) ≤ 𝑥𝑖 ≤ 𝑀 𝑧, 𝑖 = 1, . . . , 𝑛.
[Figure: graph of 𝑓 (𝑥) against 𝑥 with breakpoints 𝛼1 , 𝛼2 , 𝛼3 , 𝛼4 , 𝛼5 .]
where 𝜆𝑗 𝛼𝑗 + 𝜆𝑗+1 𝛼𝑗+1 = 𝑥 and 𝜆𝑗 + 𝜆𝑗+1 = 1. If we add a constraint that only two
(adjacent) variables 𝜆𝑗 , 𝜆𝑗+1 can be nonzero, we can characterize every value 𝑓 (𝑥) over
the entire interval [𝛼1 , 𝛼5 ] as some convex combination,
𝑓 (𝑥) = ∑_{𝑗=1}^{5} 𝜆𝑗 𝑓 (𝛼𝑗 ).
The condition that only two adjacent variables can be nonzero is sometimes called an
SOS2 constraint. If we introduce indicator variables 𝑧𝑖 for each pair of adjacent variables
(𝜆𝑖 , 𝜆𝑖+1 ), we can model an SOS2 constraint as:
𝜆1 ≤ 𝑧1 , 𝜆2 ≤ 𝑧1 + 𝑧2 , 𝜆3 ≤ 𝑧2 + 𝑧3 , 𝜆4 ≤ 𝑧4 + 𝑧3 , 𝜆5 ≤ 𝑧4
𝑧1 + 𝑧2 + 𝑧3 + 𝑧4 = 1, 𝑧 ∈ {0, 1}4 ,
𝑥 = ∑_{𝑗=1}^{𝑛} 𝜆𝑗 𝛼𝑗 ,
∑_{𝑗=1}^{𝑛} 𝜆𝑗 𝑓 (𝛼𝑗 ) ≤ 𝑡,
𝜆1 ≤ 𝑧1 , 𝜆𝑗 ≤ 𝑧𝑗 + 𝑧𝑗−1 , 𝑗 = 2, . . . , 𝑛 − 1,
𝜆𝑛 ≤ 𝑧𝑛−1 , (9.14)
𝜆 ≥ 0, ∑_{𝑗=1}^{𝑛} 𝜆𝑗 = 1, ∑_{𝑗=1}^{𝑛−1} 𝑧𝑗 = 1, 𝑧 ∈ {0, 1}^{𝑛−1} ,
for a piecewise-linear function 𝑓 (𝑥) with 𝑛 terms. This approach is often called the
lambda-method.
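The lambda-method can be illustrated by constructing, for a given 𝑥, the weights 𝜆 and the segment indicator 𝑧, and then checking the linear SOS2 constraints. The sketch below uses plain Python with 0-based indices; the helper names and the sample data are hypothetical.

```python
def lambda_representation(x, alpha):
    """Return (lam, z) for x in [alpha[0], alpha[-1]]; alpha must be strictly increasing."""
    n = len(alpha)
    if not alpha[0] <= x <= alpha[-1]:
        raise ValueError("x lies outside the breakpoint range")
    # Pick the first segment [alpha[j], alpha[j+1]] that contains x.
    j = next(k for k in range(n - 1) if x <= alpha[k + 1])
    w = (x - alpha[j]) / (alpha[j + 1] - alpha[j])
    lam = [0.0] * n
    lam[j], lam[j + 1] = 1.0 - w, w
    z = [0] * (n - 1)
    z[j] = 1
    return lam, z


def check_sos2_constraints(lam, z):
    """Check lam_1 <= z_1, lam_j <= z_j + z_{j-1}, lam_n <= z_{n-1} and the sum conditions."""
    n = len(lam)
    ok = lam[0] <= z[0] and lam[n - 1] <= z[n - 2]
    ok = ok and all(lam[j] <= z[j] + z[j - 1] for j in range(1, n - 1))
    ok = ok and all(value >= 0 for value in lam)
    ok = ok and abs(sum(lam) - 1.0) < 1e-12 and sum(z) == 1
    return ok


def evaluate(x, alpha, f_alpha):
    """Evaluate the piecewise-linear interpolant as sum_j lam_j f(alpha_j)."""
    lam, _ = lambda_representation(x, alpha)
    return sum(l * f for l, f in zip(lam, f_alpha))


# Hypothetical breakpoints and function values.
alpha = [0.0, 1.0, 2.0, 3.0, 4.0]
f_alpha = [0.0, 2.0, 1.0, 3.0, 0.0]
```

A weight vector with two non-adjacent nonzero entries, such as (0.5, 0, 0.5, 0, 0), fails `check_sos2_constraints` for every choice of 𝑧, which is exactly the effect the indicator variables are meant to achieve.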
For the function in Fig. 9.2 we can reduce the number of integer variables by using a
Gray encoding
00 10 11 01
α1 α2 α3 α4 α5
of the intervals [𝛼𝑗 , 𝛼𝑗+1 ] and an indicator variable 𝑦 ∈ {0, 1}2 to represent the four
different values of Gray code. We can then describe the constraints on 𝜆 using only two
indicator variables,
(𝑦1 = 0) → 𝜆3 = 0,
(𝑦1 = 1) → 𝜆1 = 𝜆5 = 0,
(𝑦2 = 0) → 𝜆4 = 𝜆5 = 0,
(𝑦2 = 1) → 𝜆1 = 𝜆2 = 0,
which leads to a more efficient characterization of the epigraph 𝑓 (𝑥) ≤ 𝑡,
𝑥 = ∑_{𝑗=1}^{5} 𝜆𝑗 𝛼𝑗 ,
∑_{𝑗=1}^{5} 𝜆𝑗 𝑓 (𝛼𝑗 ) ≤ 𝑡,
𝜆3 ≤ 𝑦1 , 𝜆1 + 𝜆5 ≤ (1 − 𝑦1 ), 𝜆4 + 𝜆5 ≤ 𝑦2 , 𝜆1 + 𝜆2 ≤ (1 − 𝑦2 ),
𝜆 ≥ 0, ∑_{𝑗=1}^{5} 𝜆𝑗 = 1, 𝑦 ∈ {0, 1}^2 ,
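One way to convince oneself that two indicator variables suffice is to list, for each Gray code 𝑦 = (𝑦1 , 𝑦2 ), which weights 𝜆𝑗 are allowed to be positive by the four linear constraints. The following plain-Python sketch does this; `allowed_support` is a hypothetical helper name.

```python
from itertools import product


def allowed_support(y1, y2):
    """Return the 1-based indices j for which lambda_j may be positive."""
    # Upper bound implied on each lambda_j (together with lambda >= 0) by
    #   lambda_3 <= y1,            lambda_1 + lambda_5 <= 1 - y1,
    #   lambda_4 + lambda_5 <= y2, lambda_1 + lambda_2 <= 1 - y2.
    upper = {
        1: min(1 - y1, 1 - y2),
        2: 1 - y2,
        3: y1,
        4: y2,
        5: min(1 - y1, y2),
    }
    return sorted(j for j, bound in upper.items() if bound > 0)


supports = {y: allowed_support(*y) for y in product((0, 1), repeat=2)}
```

In this sketch each of the four codes leaves exactly one adjacent pair (𝜆𝑗 , 𝜆𝑗+1 ) free, in the order 00, 10, 11, 01 of the Gray encoding, so the constraints reproduce the SOS2 condition.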
𝑥 = ∑_{𝑖=1}^{6} 𝜆𝑖 𝑣𝑖 ,
∑_{𝑖=1}^{6} 𝜆𝑖 𝑓 (𝑣𝑖 ) ≤ 𝑡,
𝜆1 ≤ 𝑧1 + 𝑧2 , 𝜆2 ≤ 𝑧1 , 𝜆3 ≤ 𝑧2 + 𝑧3 ,
𝜆4 ≤ 𝑧1 + 𝑧2 + 𝑧3 + 𝑧4 , 𝜆5 ≤ 𝑧3 + 𝑧4 , 𝜆6 ≤ 𝑧4 , (9.15)
𝜆 ≥ 0, ∑_{𝑖=1}^{6} 𝜆𝑖 = 1, ∑_{𝑖=1}^{4} 𝑧𝑖 = 1,
𝑧 ∈ {0, 1}^4 .
𝑥 = 𝜆1 𝛼1 + (𝜆2 + 𝜆3 + 𝜆4 )𝛼2 + 𝜆5 𝛼3 ,
𝜆1 𝑓 (𝛼1 ) + 𝜆2 𝑓− (𝛼2 ) + 𝜆3 𝑓 (𝛼2 ) + 𝜆4 𝑓+ (𝛼2 ) + 𝜆5 𝑓 (𝛼3 ) ≤ 𝑡,
𝜆1 + 𝜆2 ≤ 𝑧1 , 𝜆3 ≤ 𝑧2 , 𝜆4 + 𝜆5 ≤ 𝑧3 , (9.16)
𝜆 ≥ 0, ∑_{𝑖=1}^{5} 𝜆𝑖 = 1, ∑_{𝑖=1}^{3} 𝑧𝑖 = 1, 𝑧 ∈ {0, 1}^3 ,
where we have a different decision variable for the intervals [𝛼1 , 𝛼2 ), [𝛼2 , 𝛼2 ], and (𝛼2 , 𝛼3 ].
As a special case this gives us an alternative characterization of fixed charge models
considered in Sec. 9.1.1.
[Figure: graph of 𝑓 (𝑥) against 𝑥 with breakpoints 𝛼1 , 𝛼2 , 𝛼3 .]
minimize ∑_𝑗 𝑟_𝑗^𝛼
subject to 𝑟𝑗 ≥ ‖𝑝𝑗 − 𝑥𝑖 ‖2 − 𝑀 (1 − 𝑧𝑖𝑗 ), 𝑖 = 1, . . . , 𝑛, 𝑗 = 1, . . . , 𝑘, (9.17)
∑_𝑗 𝑧𝑖𝑗 ≥ 1, 𝑖 = 1, . . . , 𝑛,
𝑝𝑗 ∈ R2 , 𝑧𝑖𝑗 ∈ {0, 1}.
The objective can be realized by summing power bounds 𝑡𝑗 ≥ 𝑟_𝑗^𝛼 or by directly bounding
the 𝛼-norm of (𝑟1 , . . . , 𝑟𝑘 ). The latter approach would be recommended for large 𝛼.
This is a type of clustering problem. For 𝛼 = 1, 2, respectively, we are minimizing the
perimeter and area of the covered region. In practical applications the power exponent 𝛼
can be as large as 6 depending on various factors (for instance terrain). In the linear cost
model (𝛼 = 1) typical solutions contain a small number of huge disks covering most of
the clients. For increasing 𝛼 large ranges are penalized more heavily and the disks tend
to be more balanced.
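For fixed disk centers and a fixed assignment of clients to disks, the smallest feasible radii and the resulting objective value in (9.17) are easy to compute, which is useful for sanity-checking a solver's answer. The sketch below assumes, for simplicity, that each client is assigned to exactly one disk; the function name and the data are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def covering_cost(clients, centers, assignment, alpha):
    """Return (cost, radii) where r_j is the smallest radius covering the clients assigned to j."""
    radii = [0.0] * len(centers)
    for i, j in enumerate(assignment):
        # z_ij = 1 activates r_j >= ||p_j - x_i||_2.
        radii[j] = max(radii[j], math.dist(clients[i], centers[j]))
    cost = sum(r ** alpha for r in radii)
    return cost, radii


# Hypothetical instance: three clients on a line.
clients = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
two_disks = covering_cost(clients, [(1.0, 0.0), (10.0, 0.0)], [0, 0, 1], alpha=2)
one_disk = covering_cost(clients, [(5.0, 0.0)], [0, 0, 0], alpha=2)
```

Evaluating the same configurations for several values of 𝛼 shows how strongly a single large radius is penalized as the exponent grows.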
9.2.2 Avoiding small trades
The standard portfolio optimization model admits a number of mixed-integer extensions
aimed at avoiding solutions with very small trades. To fix attention consider the model
maximize 𝜇𝑇 𝑥
subject to (𝛾, 𝐺𝑇 𝑥) ∈ 𝒬𝑛+1 ,
𝑒𝑇 𝑥 = 𝑤 + 𝑒𝑇 𝑥0 , (9.18)
𝑥 ≥ 0,
with initial holdings 𝑥0 , initial cash amount 𝑤, expected returns 𝜇, risk bound 𝛾 and
decision variable 𝑥. Here 𝑒 is the all-ones vector. Let ∆𝑥𝑗 = 𝑥𝑗 − 𝑥0𝑗 denote the change
of position in asset 𝑗.
Transaction costs
A transaction cost involved with nonzero ∆𝑥𝑗 could be modeled as
𝑇𝑗 (𝑥𝑗 ) = { 0,              ∆𝑥𝑗 = 0,
           { 𝛼𝑗 ∆𝑥𝑗 + 𝛽𝑗 ,  ∆𝑥𝑗 ≠ 0,
similarly to the problem from Example 9.1. Including transaction costs will now lead to
the model:
maximize 𝜇𝑇 𝑥
subject to (𝛾, 𝐺𝑇 𝑥) ∈ 𝒬𝑛+1 ,
𝑒𝑇 𝑥 + 𝛼𝑇 𝑥 + 𝛽 𝑇 𝑧 = 𝑤 + 𝑒𝑇 𝑥0 , (9.19)
𝑥 − 𝑥0 ≤ 𝑀 𝑧, 𝑥0 − 𝑥 ≤ 𝑀 𝑧,
𝑥 ≥ 0, 𝑧 ∈ {0, 1}𝑛 ,
Cardinality constraints
Another option is to fix an upper bound 𝑘 on the number of nonzero trades. The meaning
of 𝑧 is the same as before:
maximize 𝜇𝑇 𝑥
subject to (𝛾, 𝐺𝑇 𝑥) ∈ 𝒬𝑛+1 ,
𝑒𝑇 𝑥 = 𝑤 + 𝑒𝑇 𝑥0 ,
𝑥 − 𝑥0 ≤ 𝑀 𝑧, 𝑥0 − 𝑥 ≤ 𝑀 𝑧, (9.20)
𝑒𝑇 𝑧 ≤ 𝑘,
𝑥 ≥ 0, 𝑧 ∈ {0, 1}𝑛 .
Trading size constraints
We can also demand a lower bound on nonzero trades, that is |∆𝑥𝑗 | ∈ {0} ∪ [𝑎, 𝑏] for all
𝑗. To this end we combine the techniques from Sec. 9.1.6 and Sec. 9.1.2 writing 𝑝𝑗 , 𝑞𝑗 for
the indicators of ∆𝑥𝑗 > 0 and ∆𝑥𝑗 < 0, respectively:
maximize 𝜇𝑇 𝑥
subject to (𝛾, 𝐺𝑇 𝑥) ∈ 𝒬𝑛+1 ,
𝑒𝑇 𝑥 = 𝑤 + 𝑒𝑇 𝑥0 ,
𝑥 − 𝑥0 = 𝑥+ − 𝑥− ,
𝑥+ ≤ 𝑀 𝑝, 𝑥− ≤ 𝑀 𝑞, (9.21)
𝑎(𝑝 + 𝑞) ≤ 𝑥+ + 𝑥− ≤ 𝑏(𝑝 + 𝑞),
𝑝 + 𝑞 ≤ 𝑒,
𝑥, 𝑥+ , 𝑥− ≥ 0, 𝑝, 𝑞 ∈ {0, 1}𝑛 .
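The trading-size part of (9.21) can be illustrated by recovering 𝑥+ , 𝑥− , 𝑝 and 𝑞 from a candidate portfolio 𝑥 and checking whether every nonzero trade has size in [𝑎, 𝑏]. This is a minimal plain-Python sketch; the function name and the tolerance `tol` are hypothetical, and the risk and budget constraints are not checked here.

```python
def trade_indicators(x, x0, a, b, tol=1e-9):
    """Return (x_plus, x_minus, p, q) for the trading-size constraints, or None if violated."""
    x_plus, x_minus, p, q = [], [], [], []
    for x_j, x0_j in zip(x, x0):
        delta = x_j - x0_j
        if abs(delta) <= tol:
            # No trade: p_j = q_j = 0 forces x_plus_j = x_minus_j = 0.
            row = (0.0, 0.0, 0, 0)
        elif a - tol <= abs(delta) <= b + tol:
            # Exactly one of p_j, q_j is set, so p_j + q_j <= 1 holds.
            row = (delta, 0.0, 1, 0) if delta > 0 else (0.0, -delta, 0, 1)
        else:
            return None  # nonzero trade with size outside [a, b]
        x_plus.append(row[0])
        x_minus.append(row[1])
        p.append(row[2])
        q.append(row[3])
    return x_plus, x_minus, p, q
```

For example, with 𝑎 = 0.2 a change of 0.05 in a single asset is rejected, while a change of 0.5 is accepted with the buy indicator set.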
where 𝑘 is the number of segments we want to consider. The quality of the fit is measured
with least squares as
∑_𝑖 (𝑓 (𝑥𝑖 ) − 𝑦𝑖 )^2 .
Note that we do not specify the locations of nodes (breakpoints), i.e. points where 𝑓 (𝑥)
changes slope. Finding them is part of the fitting problem.
As in Sec. 9.1.8 we introduce binary variables 𝑧𝑖𝑗 indicating that 𝑓 (𝑥𝑖 ) = 𝑎𝑗 𝑥𝑖 + 𝑏𝑗 ,
i.e. it is the 𝑗-th linear function that achieves maximum at the point 𝑥𝑖 . Following Sec.
9.1.8 we now have a mixed integer conic quadratic problem
minimize ‖𝑦 − 𝑓 ‖2
subject to 𝑎𝑗 𝑥𝑖 + 𝑏𝑗 ≤ 𝑓𝑖 , 𝑖 = 1, . . . , 𝑛, 𝑗 = 1, . . . , 𝑘,
𝑓𝑖 ≤ 𝑎𝑗 𝑥𝑖 + 𝑏𝑗 + 𝑀 (1 − 𝑧𝑖𝑗 ), 𝑖 = 1, . . . , 𝑛, 𝑗 = 1, . . . , 𝑘, (9.22)
∑_𝑗 𝑧𝑖𝑗 = 1, 𝑖 = 1, . . . , 𝑛,
𝑧𝑖𝑗 ∈ {0, 1}, 𝑖 = 1, . . . , 𝑛, 𝑗 = 1, . . . , 𝑘,
𝑧𝑖,𝑗+1 + 𝑧𝑖+𝑖′ ,𝑗 ≤ 1,
which indicate that each linear segment covers a contiguous subset of the sample and
additionally force these segments to come in the order of increasing 𝑗 as 𝑖 increases from
left to right. The last statement is an example of symmetry breaking.
Fig. 9.5: Convex piecewise linear fit with 𝑘 = 2, 3, 4 segments.
Chapter 10
Quadratic optimization
minimize (1/2) 𝑥𝑇 𝑄𝑥 + 𝑐𝑇 𝑥
subject to 𝐴𝑥 = 𝑏, (10.1)
𝑥 ≥ 0,
with 𝑄 ∈ 𝒮+𝑛 is conceptually a simple extension of a standard linear optimization problem
with a quadratic term 𝑥𝑇 𝑄𝑥. Note the important requirement that 𝑄 is symmetric
positive semidefinite; otherwise the objective function would not be convex.
• The problem is infeasible, i.e., {𝑥 | 𝐴𝑥 = 𝑏, 𝑥 ≥ 0} = ∅.
[Figure: geometric illustration with labels 𝑎0 , 𝑎1 , 𝑎2 , 𝑎3 , 𝑎4 and the point 𝑥⋆ .]
Possibly because of its simple geometric interpretation and similarity to linear opti-
mization, quadratic optimization has been more widely adopted by optimization practi-
tioners than conic quadratic optimization.
𝑄𝑥 = 𝐴𝑇 𝑦 + 𝑠 − 𝑐,
or alternatively, using the optimality condition 𝑄𝑥 = 𝐴𝑇 𝑦 + 𝑠 − 𝑐 we can write
maximize 𝑏𝑇 𝑦 − (1/2) 𝑥𝑇 𝑄𝑥
subject to 𝑄𝑥 = 𝐴𝑇 𝑦 − 𝑐 + 𝑠, (10.4)
𝑠 ≥ 0.
Note that this is an unusual dual problem in the sense that it involves both primal and
dual variables.
Weak duality, strong duality under Slater constraint qualification and Farkas infeasi-
bility certificates work similarly as in Sec. 8. In particular, note that the constraints in
both (10.1) and (10.4) are linear, so Lemma 2.4 applies and we have:
i. The primal problem (10.1) is infeasible if and only if there is 𝑦 such that 𝐴𝑇 𝑦 ≤ 0
and 𝑏𝑇 𝑦 > 0.
ii. The dual problem (10.4) is infeasible if and only if there is 𝑥 ≥ 0 such that 𝐴𝑥 = 0,
𝑄𝑥 = 0 and 𝑐𝑇 𝑥 < 0.
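Both certificates are cheap to verify numerically once a candidate vector is available. The sketch below checks conditions i. and ii. in plain Python with a small tolerance; the function names, the tolerance and the tiny data sets are hypothetical and chosen only for illustration.

```python
def is_primal_infeasibility_certificate(A, b, y, tol=1e-9):
    """Check A^T y <= 0 and b^T y > 0, certifying that {x | Ax = b, x >= 0} is empty."""
    m, n = len(A), len(A[0])
    aty = [sum(A[i][j] * y[i] for i in range(m)) for j in range(n)]
    bty = sum(b_i * y_i for b_i, y_i in zip(b, y))
    return all(value <= tol for value in aty) and bty > tol


def is_dual_infeasibility_certificate(A, Q, c, x, tol=1e-9):
    """Check x >= 0, Ax = 0, Qx = 0 and c^T x < 0."""
    Ax = [sum(a * x_j for a, x_j in zip(row, x)) for row in A]
    Qx = [sum(q * x_j for q, x_j in zip(row, x)) for row in Q]
    ctx = sum(c_j * x_j for c_j, x_j in zip(c, x))
    return (all(x_j >= -tol for x_j in x)
            and all(abs(value) <= tol for value in Ax)
            and all(abs(value) <= tol for value in Qx)
            and ctx < -tol)


# Hypothetical primal-infeasible data: x1 + x2 = -1 with x >= 0.
A_inf, b_inf, y_cert = [[1.0, 1.0]], [-1.0], [-1.0]

# Hypothetical dual-infeasible data: the objective decreases without bound along x2.
A_unb = [[1.0, 0.0]]
Q_unb = [[1.0, 0.0], [0.0, 0.0]]
c_unb = [0.0, -1.0]
x_cert = [0.0, 1.0]
```

The second data set shows why 𝑄𝑥 = 0 is part of the certificate: the direction 𝑥 must lie in the null space of 𝑄, otherwise the quadratic term would eventually dominate the linear decrease.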
minimize 𝑟 + 𝑐𝑇 𝑥
subject to 𝐴𝑥 = 𝑏,
𝑥 ≥ 0, (10.5)
(1, 𝑟, 𝐹 𝑥) ∈ 𝒬_r^{𝑘+2} ,
maximize 𝑏𝑇 𝑦 − 𝑢
subject to −𝐹 𝑇 𝑣 = 𝐴𝑇 𝑦 − 𝑐 + 𝑠,
𝑠 ≥ 0, (10.6)
(𝑢, 1, 𝑣) ∈ 𝒬_r^{𝑘+2} .
Note that in an optimal primal-dual solution we have 𝑟 = (1/2)‖𝐹 𝑥‖_2^2 , hence the
complementary slackness for 𝒬_r^{𝑘+2} demands 𝑣 = −𝐹 𝑥 and −𝐹 𝑇 𝑣 = 𝐹 𝑇 𝐹 𝑥 = 𝑄𝑥, as well as
𝑢 = (1/2)‖𝑣‖_2^2 = (1/2) 𝑥𝑇 𝑄𝑥. This explains why some of the dual variables in (10.6) and (10.4)
have the same names: they are in fact equal, and so both the primal and dual solution
to the original quadratic problem can be recovered from the primal-dual solution to the
conic reformulation.
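The relations between the conic and the quadratic formulation can be verified numerically on a small example. The sketch below assumes 𝑄 = 𝐹 𝑇 𝐹 and uses plain Python; the matrix `F`, the point `x` and the helper names are hypothetical.

```python
def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def conic_dual_point(F, x):
    """Return (r, u, v) with r = 0.5*||Fx||^2, v = -Fx and u = 0.5*||v||^2."""
    Fx = matvec(F, x)
    r = 0.5 * sum(t * t for t in Fx)
    v = [-t for t in Fx]
    u = 0.5 * sum(t * t for t in v)
    return r, u, v


F = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]]   # hypothetical factor with k = 3, n = 2
x = [1.0, -1.0]
n = len(x)

r, u, v = conic_dual_point(F, x)
Fx = matvec(F, x)
Ft = [list(col) for col in zip(*F)]
Q = [[sum(row[i] * row[j] for row in F) for j in range(n)] for i in range(n)]  # Q = F^T F

# Inner product of (1, r, Fx) and (u, 1, v); it vanishes under complementary slackness.
slack = 1.0 * u + r * 1.0 + sum(a * b for a, b in zip(Fx, v))
minus_Ft_v = [-t for t in matvec(Ft, v)]
Qx = matvec(Q, x)
half_xQx = 0.5 * sum(x_i * q_i for x_i, q_i in zip(x, Qx))
```

Here `slack` is zero, `minus_Ft_v` coincides with `Qx`, and `u` equals `half_xQx`, matching the three identities stated above.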
minimize (1/2) 𝑥𝑇 𝑄0 𝑥 + 𝑐𝑇0 𝑥 + 𝑟0
subject to (1/2) 𝑥𝑇 𝑄𝑖 𝑥 + 𝑐𝑇𝑖 𝑥 + 𝑟𝑖 ≤ 0, 𝑖 = 1, . . . , 𝑚, (10.7)
where 𝑄𝑖 ∈ 𝒮+𝑛 . This corresponds to minimizing a convex quadratic function over an
intersection of convex quadratic sets such as ellipsoids or affine halfspaces. Note the
important requirement 𝑄𝑖 ⪰ 0 for all 𝑖 = 0, . . . , 𝑚, so that the objective function is convex
and the constraints characterize convex sets. For example, neither of the constraints
(1/2) 𝑥𝑇 𝑄𝑖 𝑥 + 𝑐𝑇𝑖 𝑥 + 𝑟𝑖 = 0, (1/2) 𝑥𝑇 𝑄𝑖 𝑥 + 𝑐𝑇𝑖 𝑥 + 𝑟𝑖 ≥ 0
characterizes a convex set in general, and therefore neither can be included.
• Either the dual problem (10.10) is feasible or there exists 𝑥 ∈ R𝑛 satisfying
To see why the certificate proves infeasibility, suppose for instance that (10.12) and
(10.7) are simultaneously satisfied. Then
0 < ∑_𝑖 𝜆𝑖 [1; 𝑥]𝑇 [2𝑟𝑖 , 𝑐𝑇𝑖 ; 𝑐𝑖 , 𝑄𝑖 ] [1; 𝑥] = 2 ∑_𝑖 𝜆𝑖 ((1/2) 𝑥𝑇 𝑄𝑖 𝑥 + 𝑐𝑇𝑖 𝑥 + 𝑟𝑖 ) ≤ 0
minimize 𝑡0 + 𝑐𝑇0 𝑥 + 𝑟0
subject to (1, 𝑡0 , 𝐹0 𝑥) ∈ 𝒬_r^{𝑘0+2} , (10.14)
(1, −𝑐𝑇𝑖 𝑥 − 𝑟𝑖 , 𝐹𝑖 𝑥) ∈ 𝒬_r^{𝑘𝑖+2} , 𝑖 = 1, . . . , 𝑚.
The primal and dual solution of (10.14) recovers the primal and dual solution of (10.7)
similarly as for quadratic optimization in Sec. 10.1.3. Let us see for example, how a
(conic) primal infeasibility certificate for (10.14) implies an infeasibility certificate in the
form (10.12). Infeasibility in (10.14) involves only the last set of conic constraints. We
can derive the infeasibility certificate for those constraints from Lemma 8.3 or by directly
writing the Lagrangian
𝐿 = −∑_𝑖 (𝑢𝑖 + 𝑣𝑖 (−𝑐𝑇𝑖 𝑥 − 𝑟𝑖 ) + 𝑤𝑇𝑖 𝐹𝑖 𝑥)
  = 𝑥𝑇 (∑_𝑖 𝑣𝑖 𝑐𝑖 − ∑_𝑖 𝐹𝑇𝑖 𝑤𝑖 ) + (∑_𝑖 𝑣𝑖 𝑟𝑖 − ∑_𝑖 𝑢𝑖 ) .
The dual maximization problem is unbounded (i.e. the primal problem is infeasible) if
we have
∑_𝑖 𝑣𝑖 𝑐𝑖 = ∑_𝑖 𝐹𝑇𝑖 𝑤𝑖 ,
∑_𝑖 𝑣𝑖 𝑟𝑖 > ∑_𝑖 𝑢𝑖 ,
2𝑢𝑖 𝑣𝑖 ≥ ‖𝑤𝑖 ‖^2 , 𝑢𝑖 , 𝑣𝑖 ≥ 0.
10.3 Example: Factor model
Recall from Sec. 3.3.3 that the standard Markowitz portfolio optimization problem is
maximize 𝜇𝑇 𝑥
subject to 𝑥𝑇 Σ𝑥 ≤ 𝛾,
𝑒𝑇 𝑥 = 1, (10.15)
𝑥 ≥ 0,
where 𝜇 ∈ R𝑛 is a vector of expected returns for 𝑛 different assets and Σ ∈ 𝒮+𝑛 denotes
the corresponding covariance matrix. Problem (10.15) maximizes the expected return
of an investment given a budget constraint and an upper bound 𝛾 on the allowed risk.
Alternatively, we can minimize the risk given a lower bound 𝛿 on the expected return of
investment, i.e., we can solve the problem
minimize 𝑥𝑇 Σ𝑥
subject to 𝜇𝑇 𝑥 ≥ 𝛿,
𝑒𝑇 𝑥 = 1, (10.16)
𝑥 ≥ 0.
Both problems (10.15) and (10.16) are equivalent in the sense that they describe the same
Pareto-optimal trade-off curve by varying 𝛿 and 𝛾.
Next consider a factorization
Σ = 𝑉 𝑇𝑉 (10.17)
for some 𝑉 ∈ R𝑘×𝑛 . We can then rewrite both problems (10.15) and (10.16) in conic
quadratic form as
maximize 𝜇𝑇 𝑥
subject to (1/2, 𝛾, 𝑉 𝑥) ∈ 𝒬_r^{𝑘+2} ,
𝑒𝑇 𝑥 = 1, (10.18)
𝑥 ≥ 0,
and
minimize 𝑡
subject to (1/2, 𝑡, 𝑉 𝑥) ∈ 𝒬_r^{𝑘+2} ,
𝜇𝑇 𝑥 ≥ 𝛿, (10.19)
𝑒𝑇 𝑥 = 1,
𝑥 ≥ 0,
Data matrix
Σ might be specified directly in the form (10.17), where 𝑉 is a normalized data-matrix
with 𝑘 observations of market data (for example daily returns) of the 𝑛 assets. When the
observation horizon 𝑘 is shorter than 𝑛, which is typically the case, the conic representa-
tion is both more parsimonious and has better numerical properties.
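The advantage of working with 𝑉 directly can be seen in a few lines: the risk 𝑥𝑇 Σ𝑥 equals ‖𝑉 𝑥‖_2^2 , so the 𝑛 × 𝑛 matrix Σ never has to be formed, and the risk constraint becomes a membership test for the rotated quadratic cone. The sketch below is plain Python; the function names and the small data matrix are hypothetical.

```python
def apply_data_matrix(V, x):
    """Return V x for a k x n matrix V stored as a list of rows."""
    return [sum(v_ij * x_j for v_ij, x_j in zip(row, x)) for row in V]


def risk_from_data_matrix(V, x):
    """Return x^T (V^T V) x, computed as ||V x||_2^2 without forming V^T V."""
    return sum(value * value for value in apply_data_matrix(V, x))


def in_rotated_cone(x1, x2, rest, tol=1e-12):
    """Membership test for the rotated quadratic cone: 2*x1*x2 >= ||rest||^2, x1, x2 >= 0."""
    return x1 >= 0 and x2 >= 0 and 2.0 * x1 * x2 + tol >= sum(v * v for v in rest)


# Hypothetical data: k = 2 observations of n = 3 assets.
V = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
x = [0.5, 0.25, 0.25]
risk = risk_from_data_matrix(V, x)
```

With this data the constraint (1/2, 𝛾, 𝑉 𝑥) ∈ 𝒬_r^{𝑘+2} holds exactly when 𝛾 is at least `risk`, which is the same condition as 𝑥𝑇 Σ𝑥 ≤ 𝛾.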
Factor model
For a factor model we have
Σ = 𝐷 + 𝑈 𝑇 𝑅𝑈
Chapter 11
Bibliographic notes
The material on linear optimization is very basic, and can be found in any textbook.
For further details, we suggest a few standard references [Chvatal83], [BT97] and [PS98],
which all cover much more than is discussed here. [NW06] gives a more modern treatment
of both theory and algorithmic aspects of linear optimization.
Material on conic quadratic optimization is based on the paper [LVBL98] and the
books [BenTalN01], [BV04]. The papers [AG03], [ART03] contain additional theoretical
and algorithmic aspects.
For more theory behind the power cone and the exponential cone we recommend the
thesis [Cha09].
Much of the material about semidefinite optimization is based on the paper [VB96]
and the books [BenTalN01], [BKVH07]. The section on optimization over nonnegative
polynomials is based on [Nes99], [Hac03]. We refer to [LR05] for a comprehensive survey
on semidefinite optimization and relaxations in combinatorial optimization.
The chapter on conic duality follows the exposition in [GartnerM12].
Mixed-integer optimization is based on the books [NW88], [Wil93]. Modeling of piece-
wise linear functions is described in the survey paper [VAN10].
Chapter 12
R and Z denote the sets of real numbers and integers, respectively. R𝑛 denotes the set of
𝑛-dimensional vectors of real numbers (and similarly for Z𝑛 and {0, 1}𝑛 ); in most cases
we denote such vectors by lower case letters, e.g., 𝑎 ∈ R𝑛 . A subscripted value 𝑎𝑖 then
refers to the 𝑖-th entry in 𝑎, i.e.,
𝑎 = (𝑎1 , 𝑎2 , . . . , 𝑎𝑛 ).
The symbol 𝑒 denotes the all-ones vector 𝑒 = (1, . . . , 1)𝑇 , whose length always follows
from the context.
All vectors are interpreted as column-vectors. For 𝑎, 𝑏 ∈ R𝑛 we use the standard inner
product,
⟨𝑎, 𝑏⟩ := 𝑎1 𝑏1 + 𝑎2 𝑏2 + · · · + 𝑎𝑛 𝑏𝑛 ,
which we also write as 𝑎𝑇 𝑏 := ⟨𝑎, 𝑏⟩. We let R𝑚×𝑛 denote the set of 𝑚 × 𝑛 matrices, and
we use upper case letters to represent them, e.g., 𝐵 ∈ R𝑚×𝑛 is organized as
    ⎡ 𝑏11  𝑏12  . . .  𝑏1𝑛 ⎤
𝐵 = ⎢ 𝑏21  𝑏22  . . .  𝑏2𝑛 ⎥
    ⎢  ⋮    ⋮    ⋱     ⋮  ⎥
    ⎣ 𝑏𝑚1  𝑏𝑚2  . . .  𝑏𝑚𝑛 ⎦
A set 𝑆 ⊆ R𝑛 is convex if and only if for any 𝑥, 𝑦 ∈ 𝑆 and 𝜃 ∈ [0, 1] we have
𝜃𝑥 + (1 − 𝜃)𝑦 ∈ 𝑆
A function 𝑓 : R𝑛 ↦→ R is convex if and only if its domain dom(𝑓 ) is convex and for all
𝜃 ∈ [0, 1] we have
Fig. 12.1: The shaded region is the epigraph of the function 𝑓 (𝑥) = − log(𝑥).
minimize 𝑡
subject to 𝑓 (𝑥) ≤ 𝑡
maximize 𝑡
subject to 𝑓 (𝑥) ≥ 𝑡,
Bibliography
[Cha09] Peter Robert Chares. Cones and interior-point algorithms for structured convex
optimization involving powers and exponentials. PhD thesis, Ecole polytechnique
de Louvain, Université catholique de Louvain, 2009.
[Nes99] Yu. Nesterov. Squared functional systems and optimization problems. In
H. Frenk, K. Roos, T. Terlaky, and S. Zhang, editors, High Performance
Optimization. Kluwer Academic Publishers, 1999.