
Convex Optimization with

Applications to Image Processing


J. Lellmann

Lecture Notes

Mathematical Tripos, Part III


University of Cambridge, Michaelmas 2013

Please send questions and corrections to:


[email protected]

Last updated 13/02/2014


Table of contents
1 Introduction
  1.1 Motivation
  1.2 Course Layout
  1.3 Suggested Reading
2 Existence
  2.1 Extended real-valued functions
  2.2 Existence of minimizers
3 Convexity
4 Cones and Generalized Inequalities
5 Subgradients
6 Conjugate Functions
  6.1 The Legendre-Fenchel Transform
  6.2 Duality Correspondences
7 Duality in Optimization
8 Numerical Optimality
  8.1 The smooth and the non-smooth case
  8.2 The numerical primal-dual gap
  8.3 Infeasibilities
9 First-Order Methods
  9.1 Forward and backward steps
  9.2 Primal-dual methods
10 Interior-Point Methods
11 Using Solvers and Discretization Issues
12 Support Vector Machines
  12.1 Introduction to machine learning
  12.2 Linear Classifiers
    Primal problem
    Dual problem and optimality conditions
    Evaluating the linear function
  12.3 The Kernel Trick
13 Total Variation and Applications
  13.1 Functions of Bounded Variation
  13.2 Infimal Convolution and TGV
  13.3 Meyer's G-Norm
  13.4 Non-local regularization
14 Relaxation
Bibliography

Chapter 1
Introduction

1.1 Motivation
Why optimization in image processing? It is a convenient strategy to design algorithms
1. by defining what the result should look like, not how to find it,
2. that can be analyzed,
3. that are modular and can be easily modified as requirements change.
A classical example is image denoising, because the problem is very clear: we are given an
input image I: Ω ⊆ Rd → [0, 1] that is corrupted by noise – such as sensor noise or JPEG
artifacts – and would like to reconstruct the uncorrupted image u as accurately as possible.
Simply convolving the image with a Gaussian kernel or computing the average in a
small window around each point results in an image that is much too smooth, i.e., much
of the structure in the original image is lost.
A common first approach is to use “ad hoc” methods. For example, we might notice
that natural images tend to have a high amount of self-similarity, so it seems reasonable
to try to find similar patches in the image, and for each point compute the average over
the corresponding pixels in similar patches.
Such methods may work, and are often simple to implement and fast. The disadvantage
is that if they do not work, it may be very hard to precisely point out why. Similarly it is
often difficult to make specific statements about the quality of their output – which features
does it remove or preserve? Does it remove features below a certain size? How strong can
the noise be if we want it effectively removed? Does it keep edges?
A very useful way out of this dilemma is to follow a variational approach. We postulate
that the output of our method is the minimizer of a function

min_u f(u).
This abstracts from the actual implementation of the algorithm, and allows us to focus on
modelling the problem, i.e., finding a function f such that the minimizer has the desired
properties.
A prototypical variational model is the following:
min_{u: Ω→R} f(u) ≔ (1/2) ∫_Ω ‖u − I‖² dx + λ ∫_Ω ‖∇u‖² dx.
The left term is commonly called the data term, and serves to keep u close to the observed
input image I. The right term, which we refer to as the regularization term, is weighted
by a scalar λ > 0 and advocates solutions that are smooth. Finding suitable regularizers
is often the more difficult part, as they encode what we know about the characteristics of
the desired solution – our prior knowledge that does not depend on the actual data.
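To make this concrete, here is a small numerical sketch (not from the notes; the 1-D test signal, step size and iteration count are ad-hoc illustrative choices) that minimizes a discretized version of this quadratic model by plain gradient descent:

```python
def energy(u, I, lam=1.0):
    """Discrete quadratic energy: 0.5*sum (u_i - I_i)^2 + lam*sum (u_{i+1} - u_i)^2."""
    data = 0.5 * sum((ui - Ii) ** 2 for ui, Ii in zip(u, I))
    reg = lam * sum((u[i + 1] - u[i]) ** 2 for i in range(len(u) - 1))
    return data + reg

def quad_denoise(I, lam=1.0, steps=3000, tau=0.1):
    """Gradient descent on the discrete quadratic denoising energy."""
    n = len(I)
    u = list(I)
    for _ in range(steps):
        g = []
        for j in range(n):
            gj = u[j] - I[j]                        # data term gradient
            if j > 0:
                gj += 2 * lam * (u[j] - u[j - 1])   # difference to the left
            if j < n - 1:
                gj += 2 * lam * (u[j] - u[j + 1])   # difference to the right
            g.append(gj)
        u = [uj - tau * gj for uj, gj in zip(u, g)]
    return u

I = [0.0] * 20 + [1.0] * 20        # a sharp 0 -> 1 edge
u = quad_denoise(I)
# the edge in u is spread out: quadratic regularization over-smooths jumps
```

Running this on the step signal illustrates the over-smoothing discussed next: the minimizer replaces the sharp edge by a gradual transition.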


Apart from this intuitive explanation, there is also a statistical interpretation: min-
imizing f is the same as maximizing e^{−f(x)}. In the above case, this is the product of a
Gaussian density with mean I and another density that encodes the regularizer. The
minimizer of the variational model is therefore the image u that is most likely with respect
to a certain density that depends on the input image.
A drawback of the above model is that it tends to over-smooth edges. The reason can
be easily seen by inspecting the regularizer: it clearly prefers continuous transitions between
two values c₁ and c₂ to a "jump" of the same height (going from 0 to 1 via 1/2 is penalized
by (1/2)² + (1/2)² = 1/2, but a sharp transition from 0 to 1 is penalized by 1² = 1). We
have to make sure that both cost the same, or we will get smoothing!
This leads to the Rudin-Osher-Fatemi (ROF) model:
min_{u: Ω→R} f(u) ≔ (1/2) ∫_Ω ‖u − I‖² dx + λ ∫_Ω ‖∇u‖ dx   (1.1)

(the notation is not very precise, since the ROF model requires a generalization of the
gradient to discontinuous functions, but it is enough to illustrate the point). The idea is
that monotone transitions between two values have exactly the same regularization cost,
regardless of how sharp the transition is.
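This comparison between the quadratic penalty and the TV penalty is easy to check numerically on a discrete signal (an illustrative sketch, not part of the notes): for sums of squared differences a ramp is strictly cheaper than a jump, while for sums of absolute differences every monotone transition between the same two values costs the same.

```python
def quad_penalty(u):
    """Discrete analogue of the integral of ||grad u||^2: sum of squared differences."""
    return sum((u[i + 1] - u[i]) ** 2 for i in range(len(u) - 1))

def tv_penalty(u):
    """Discrete analogue of the integral of ||grad u|| (total variation)."""
    return sum(abs(u[i + 1] - u[i]) for i in range(len(u) - 1))

jump = [0.0, 0.0, 1.0, 1.0]   # sharp transition from 0 to 1
ramp = [0.0, 0.5, 1.0, 1.0]   # transition from 0 to 1 via 1/2

# quadratic: ramp costs (1/2)^2 + (1/2)^2 = 1/2, jump costs 1^2 = 1
# TV: both cost exactly 1
```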
For the ROF model it is possible to show that if the input image I assumes only the
values 0 and 1 then jumps will be preserved. This highlights an important point: it is often
relatively easy to explain unexpected results from basic properties of the objective, and
then to slightly modify the objective to get rid of these properties.
Taking the idea one step further, we arrive at the TV − L1 model:
min_{u: Ω→R} f(u) ≔ ∫_Ω ‖u − I‖ dx + λ ∫_Ω ‖∇u‖ dx.

Compared to ROF, this has the advantage that it does not reduce the contrast of the
image. In fact, it is possible to show that characteristic functions of discs with radius of at
least λ/2 will be preserved, and discs with radius strictly less than λ/2 will be completely
removed – we have a good intuition of what happens to features of a certain size.
The drawback of variational methods is that the model description in terms of f does
not include any hint on how to implement the numerical minimization. In fact, there can be
many obstacles such as large-scale problems, f being non-differentiable, or having multiple
local minima.
This raises an important question: let us assume that the solver hits the stopping
criterion and returns a potential minimizer u of f , but we find that u is not what we would
like the output to look like. It could be that the model f is not appropriate, but it could
also be that the model is fine but the solver just did not find a good minimizer.
This is where the concept of convexity comes into play:
convexity ⇒ local minimizer = global minimizer.

If f is convex (and in fact the ROF model (1.1) is), then every local minimizer is also a
global minimizer. Therefore, if the output is not as expected (and the solver returns a
local minimum), we know that the only correct way to improve the situation is to modify
the model. Therefore convexity decouples the modelling and implementation/optimization
stage.
A difficulty when implementing convex solvers for image processing tasks is that we
often have to deal with problems that are

non-smooth and large-scale.



This causes several problems: we cannot evaluate the gradient everywhere, and much less
the Hessian. Even if we could, the gradient is not Lipschitz-continuous and does not carry
any information about how close we are to the minimum. If we decide to replace the non-
smooth terms by a smoothed version (such as ‖∇u‖ ≈ (u_x² + u_y² + ε²)^{1/2}), close to the non-
smoothness the gradient can be a very bad approximation of the behaviour of the function.
But the good news is:

convexity ⇒ (very often) efficiently solvable.

In the last decades very efficient general-purpose and dedicated convex solvers have been
developed, so that even problems with 10 million variables can often be solved in reasonable
time. This is not the case for general non-convex problems: minimizing

min_{x∈R^n} f(x)   s.t.   x_i (1 − x_i) = 0,   i = 1, …, n,

is already challenging for n = 50 and practically impossible in general for n = 1000, because
in general f needs to be evaluated at all 2^n feasible points.
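The constraint x_i (1 − x_i) = 0 forces each x_i ∈ {0, 1}, so without further structure a solver can only enumerate the 2^n feasible points. A small brute-force sketch (the separable objective is an arbitrary illustrative choice):

```python
from itertools import product

def brute_force_binary_min(f, n):
    """Minimize f over the feasible set {0,1}^n of x_i (1 - x_i) = 0 by enumeration."""
    best_x, best_val, evals = None, float("inf"), 0
    for x in product((0, 1), repeat=n):
        evals += 1
        v = f(x)
        if v < best_val:
            best_x, best_val = x, v
    return best_x, best_val, evals

# an arbitrary separable objective; each coordinate prefers the binary value closest to 0.7
f = lambda x: sum((xi - 0.7) ** 2 for xi in x)
x_opt, v_opt, evals = brute_force_binary_min(f, 10)   # already 1024 evaluations
```

Doubling n doubles nothing about the formula but squares nothing either: it multiplies the number of evaluations by 2, so n = 1000 is hopeless.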

1.2 Course Layout


1. Introduction: variational approach, data term and regularizer, separation of model-
ling and implementation
2. Existence: extended real-valued functions, lower semi-continuity, level-boundedness,
existence of minimizers of extended real-valued functions.
3. Convexity: convex sets and functions, epigraphs and effective domain, Jensen
inequality, characterization of convex functions and convex calculus, local min-
imizers are global minimizers, derivative tests, projections, convex hulls.
4. Cones and Generalized Inequalities: cones, generalized inequalities, conic programs,
standard-, second-order-, and semidefinite cones.
5. Subgradients: set-valued mappings, subdifferential, generalized Fermat condition,
normal cones, subdifferentials as normal cones of the epigraph, relative interior,
subdifferential calculus.
6. Conjugate Functions: convex hull and closure of a function, envelope representation
of convex sets and functions, Legendre-Fenchel transform and properties, inversion
rule for the subdifferential, conjugate calculus, support functions.
7. Duality in Optimization: primal and dual problems, perturbation formulation, weak
and strong duality, necessary and sufficient primal-dual optimality conditions, dual
for conic problems, Lagrangians, saddle-point formulation of the optimality condi-
tions.
8. Numerical Optimality: difficulties in non-smooth optimization, numerical primal-
dual gap, infeasibilities.
9. First-Order Methods: forward and backward steps, proximal step formulation, for-
ward and backward stepping, splitting principle, gradient-projection, primal-dual
methods, Augmented Lagrangian.
10. Interior-Point Methods: canonical barriers, primal-dual central path, duality gap
on the central path, tracing the central path.

11. Using Solvers and Discretization Issues: transforming problems into normal forms,
representing multi-dimensional data in vector form, implementing linear operators,
discretizing variational problems, adjoint differential operators
12. Support Vector Machines: supervised and unsupervised machine learning, linear
maximum-margin classifiers, reformulation as convex problem, dual problem and
optimality conditions, computing primal from dual solutions, support vectors,
kernel trick
13. Total Variation: functions of bounded variation, dual formulation of the total vari-
ation, coarea formula, higher-order total variation, infimal convolution, conjugate
of infimal convolution, total generalized variation, Meyer's G-norm, non-local regu-
larizers.
14. Relaxation: non-convexity in image processing, Chan-Vese, Mumford-Shah, convex
relaxation, generalized coarea condition, thresholding theorem, discretized energy,
anisotropy.

1.3 Suggested Reading


• Rockafellar, Wets: Variational Analysis, 2009: The notation and structure of the
variational analysis part of this course follows Rockafellar’s excellent book. The
book can be a little abstract at times as it develops a much more general theory
that also covers nonconvex functions. The classic predecessor (Convex Analysis, by
Rockafellar alone) contains several simpler proofs for the convex case.
• Boyd, Vandenberghe: Convex Optimization, 2004: A good overview with many
applications, examples and exercises. Focuses on the classical KKT representa-
tion of optimality conditions. An excellent read, and available online for free.
• Ben-Tal, Nemirovski: Lectures on Modern Convex Optimization, 2001: A good intu-
itive introduction to interior point methods including some complexity analysis.
• Paragios, Chen, Faugeras: Handbook of Mathematical Models in Computer Vision,
2006: This is a good complement to the course as it covers the modelling aspect and
is a good reference and includes many of the standard methods in modern image
processing.
Chapter 2
Existence

2.1 Extended real-valued functions


In the literature, optimization problems are commonly formulated using an objective func-
tion f₀: R^n → R and constraint functions f₁, …, f_m: R^n → R, e.g.,

min_x f₀(x)   s.t.   x ∈ C,    C = {x ∈ R^n | f_i(x) ≤ 0, i = 1, …, m}.

By allowing +∞ as the value of the objective function we can rewrite this in a very compact
form:

min_x f(x),

where f: R^n → R ∪ {±∞}, with the definition x ∉ C ⇔ f(x) = +∞.

Definition 2.1. (extended real line) We define R̄ ≔ R ∪ {+∞, −∞} with the rules:
1. ∞ + c = ∞, −∞ + c = −∞ for all c ∈ R,
2. 0 · ∞ = 0, 0 · (−∞) = 0,
3. inf R = sup ∅ = −∞, inf ∅ = sup R = +∞,
4. +∞ − ∞ = −∞ + ∞ = +∞ (sometimes; careful: −∞ = λ (∞ − ∞) ≠ λ ∞ − λ ∞ = ∞
   for λ < 0).

Remark 2.2. (rules for extended real values)

inf_x {f(x) + g(x)} ≥ inf_x f(x) + inf_x g(x),   inf λf = λ inf f for λ ≥ 0,   inf f = −sup(−f),
sup_x {f(x) + g(x)} ≤ sup_x f(x) + sup_x g(x),   sup λf = λ sup f for λ ≥ 0,   sup f = −inf(−f),
inf_{x,y} {f(x) + g(y)} = inf_x f(x) + inf_y g(y).

The last rule does not hold for sup! Example: f(x) = x, g(y) = −∞; then

sup_{x,y} {f(x) + g(y)} = sup_x (x − ∞) = sup (−∞) = −∞ ≠ +∞ = +∞ − ∞ = sup f + sup g.

Another special case to watch out for is that inf C ≤ sup C does not hold if C = ∅. Also
a ≥ b ⇔ a − b ≥ 0 always holds, but a ≥ b ⇔ b − a ≤ 0 does not!
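These conventions differ from IEEE floating-point arithmetic: in Python, 0 * math.inf and math.inf - math.inf are nan rather than 0 and +∞. A small helper implementing the conventions of Def. 2.1 explicitly (an illustrative sketch, not from the notes):

```python
import math

INF = math.inf

def ext_mul(a, b):
    """Multiplication with the convention 0 * (+-inf) = 0."""
    if a == 0 or b == 0:
        return 0.0
    return a * b

def ext_add(a, b):
    """Addition with the 'inf-addition' convention +inf - inf = -inf + inf = +inf."""
    if a == INF or b == INF:
        return INF
    if a == -INF or b == -INF:
        return -INF
    return a + b

# inf over the empty set is +inf, sup over the empty set is -inf:
inf_empty = min([], default=INF)
sup_empty = max([], default=-INF)
```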

Definition 2.3. (indicator function) For C ⊆ R^n, denote

δ_C: R^n → R̄,   δ_C(x) ≔ 0 if x ∈ C,   +∞ if x ∉ C.


Example 2.4. (constrained minimization via addition of indicator function) Assume f:
R^n → R, C ⊆ R^n, C ≠ ∅. Then

x′ minimizes f over C ⇔ x′ minimizes f + δ_C over R^n.
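A sketch of this reformulation on a toy instance (f and C are arbitrary illustrative choices): minimizing f + δ_C over a grid on the whole line gives the same result as minimizing f over the grid points inside C.

```python
import math

def indicator(contains):
    """delta_C from Def. 2.3, with C given by a membership predicate."""
    return lambda x: 0.0 if contains(x) else math.inf

f = lambda x: (x - 2.0) ** 2              # unconstrained minimum at x = 2
in_C = lambda x: -1.0 <= x <= 1.0         # C = [-1, 1]
g = lambda x: f(x) + indicator(in_C)(x)   # f + delta_C

grid = [i / 100.0 for i in range(-300, 301)]              # crude search over [-3, 3]
x_unconstrained_form = min(grid, key=g)                   # minimize f + delta_C
x_constrained_form = min((x for x in grid if in_C(x)), key=f)
```

Both searches return x = 1, the point of C closest to the unconstrained minimizer.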

Definition 2.5. (argmin, effective domain, proper) For f: R^n → R̄, denote

1. dom f ≔ {x ∈ R^n | f(x) < +∞} (effective domain, set of feasible solutions),

2. arg min f ≔ ∅ if f ≡ +∞, and arg min f ≔ {x ∈ R^n | f(x) = inf f} otherwise
   (set of minimizers/optimal solutions),

3. f is "proper" :⇔ dom f ≠ ∅ and f(x) > −∞ ∀x ∈ R^n (i.e., f ≢ +∞ and f > −∞).

By the definition of the arg min, if x ∈ arg min f then f(x) < +∞; however f(x) = −∞
is possible. Proper functions are the "interesting" functions for minimization: if f ≡ +∞
then the problem does not have any solution, and if there are x such that f(x) = −∞ then
arg min f consists of exactly these points.

Definition 2.6. (epigraphs, level sets) For f: R^n → R̄, denote

epi f ≔ {(x, α) ∈ R^n × R | f(x) ≤ α}.

Epigraphs are an alternative way to define functions and are often a convenient way to
derive properties of functions using theorems about sets. While every epigraph is a set, not
every set is the epigraph of a function – in fact a set C is an epigraph iff for every x there
is an α ∈ R̄ such that C ∩ ({x} × R) = {x} × [α, +∞), i.e., all vertical one-dimensional
sections must be closed upper half-lines. If f is proper then epi f is not empty and does
not include a complete vertical line.

2.2 Existence of minimizers


Definition 2.7. (lower-semicontinuity) For f: R^n → R̄, define

lim inf_{x→x′} f(x) ≔ lim_{δ↘0} inf_{‖x−x′‖₂ ≤ δ} f(x) = lim_{k→∞} inf_{‖x−x′‖₂ ≤ 1/k} f(x).

f is "lower semi-continuous (lsc) at x′" :⇔

f(x′) ≤ lim inf_{x→x′} f(x). (2.1)

f is "lsc on R^n" (or just "lsc") :⇔ f is lsc at every x′ ∈ R^n.

Proposition 2.8. (sequence characterization of lower semi-continuity)

lim inf_{x→x′} f(x) = min {α ∈ R̄ | ∃(x^k) → x′: f(x^k) → α}. (2.2)

In particular, f is lsc at x′ iff

f(x′) ≤ lim inf_{k→∞} f(x^k) for all convergent sequences x^k → x′. (2.3)

Proof. First part:


"≤": We show that lim inf_{x→x′} f(x) ≤ α for all α ∈ S ≔ {α ∈ R̄ | ∃(x^k) → x′: f(x^k) → α}.
For such an α take (x^k) → x′ from the definition of S. Then f(x^k) → α, so for all j there
exists k_j such that x^{k_j} ∈ B_{1/j}(x′). Therefore

inf_{x ∈ B_{1/j}(x′)} f(x) ≤ f(x^{k_j})
⇒ lim inf_{x→x′} f(x) = lim_{j→∞} inf_{x ∈ B_{1/j}(x′)} f(x) ≤ lim_{j→∞} f(x^{k_j}) = lim_{k→∞} f(x^k) = α.

"≥": We show that there exists a sequence x^k → x′ such that f(x^k) → lim inf_{x→x′} f(x): if this
is true, then "≥" holds and we have also shown that the "min" notation is justified, i.e., S
contains a minimal element.
For every k we can find x^k ∈ B_{1/k}(x′) such that

f(x^k) ≤ inf_{x ∈ B_{1/k}(x′)} f(x) + 1/k

(this requires that the lim inf is finite, but a similar argument works if it is −∞). Since
f(x^k) ≥ inf_{x ∈ B_{1/k}(x′)} f(x), this implies

inf_{x ∈ B_{1/k}(x′)} f(x) ≤ f(x^k) ≤ inf_{x ∈ B_{1/k}(x′)} f(x) + 1/k.

Since this holds for all k we can take k → ∞ and obtain that

lim_{k→∞} f(x^k) = lim_{k→∞} inf_{x ∈ B_{1/k}(x′)} f(x) = lim inf_{x→x′} f(x).

By definition x^k → x′, so we have found the desired sequence.

Second part: This is based on the fact that we can identify the limits of converging
sequences with the lim infs of arbitrary sequences, because we can always find a subse-
quence converging to the lim inf.
"⇐": By the first part, take any (x^k) from the definition of S such that f(x^k) → α′ ≔ min S.
Then x^k → x′, therefore by (2.3) we get f(x′) ≤ lim inf_{k→∞} f(x^k) = α′ = lim inf_{x→x′} f(x).
"⇒": Take any convergent sequence x^k → x′ in (2.3). Choose a subsequence such that
f(x^{k_j}) → lim inf_{k→∞} f(x^k) as j → ∞. Then (x^{k_j})_{j=1}^∞ is in S, therefore by (2.2)
and (2.1)

lim inf_{k→∞} f(x^k) = lim_{j→∞} f(x^{k_j}) ≥ α′ = lim inf_{x→x′} f(x) ≥ f(x′). □


Example 2.9.

1. f(x) ≔ 1 for x > 0, f(x) ≔ 0 for x ≤ 0 is lsc,

2. f(x) ≔ 1 for x ≥ 0, f(x) ≔ 0 for x < 0 is not lsc (but f is lsc at all x ≠ 0),

3. f = δ_C is lsc on the open sets int C and ext C. It is lsc on R^n iff C is closed:
   lim inf_{x→x′} δ_C(x) is always 0 for x′ ∈ bnd C, therefore δ_C is lsc iff δ_C(x) = 0 for all
   x ∈ bnd C, which is the case iff C is closed.
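The first two cases can be probed numerically (an illustrative sketch only: a finite sample inside a small ball stands in for the inf in Def. 2.7):

```python
def f_lsc(x):
    """Example 2.9, case 1: 1 for x > 0, 0 for x <= 0 (lsc)."""
    return 1.0 if x > 0 else 0.0

def f_not_lsc(x):
    """Example 2.9, case 2: 1 for x >= 0, 0 for x < 0 (not lsc at 0)."""
    return 1.0 if x >= 0 else 0.0

def approx_liminf(f, x0=0.0, radius=1e-3, samples=201):
    """Crude stand-in for liminf_{x -> x0} f(x): inf over a sample of B_radius(x0)."""
    pts = [x0 - radius + 2 * radius * i / (samples - 1) for i in range(samples)]
    return min(f(x) for x in pts)

# f_lsc satisfies f(0) <= liminf at 0; f_not_lsc violates it there
lsc_holds = f_lsc(0.0) <= approx_liminf(f_lsc)
lsc_fails = f_not_lsc(0.0) > approx_liminf(f_not_lsc)
```

Both functions have liminf 0 at the origin (approached from the left); only the first takes the value 0 there.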

Theorem 2.10. (semicontinuity and the epigraph) Let f: R^n → R̄. Then the following
properties are equivalent:
1. f is lsc on R^n,
2. epi f is closed in R^n × R,
3. the sublevel sets lev_{≤α} f ≔ {x ∈ R^n | f(x) ≤ α} are closed in R^n for all α ∈ R̄.

Proof. Idea (1 ⇔ 2): epi f can only fail to be closed along vertical lines.
1.⇒2.: Assume (x^k, α^k) ∈ epi f, (x^k, α^k) → (x, α), α ∈ R. Then f(x^k) ≤ α^k (because
(x^k, α^k) ∈ epi f), and lim inf_{k→∞} f(x^k) ≤ lim inf_{k→∞} α^k = α. The second part of
Prop. 2.8 gives f(x) ≤ lim inf_{k→∞} f(x^k); therefore f(x) ≤ α and (x, α) ∈ epi f.
2.⇒3.: We have

epi f closed
⇒ epi f ∩ (R^n × {α′}) closed for all α′ ∈ R
⇔ {(x, α) ∈ R^n × R | α = α′, f(x) ≤ α} closed for all α′ ∈ R
⇔ {x ∈ R^n | f(x) ≤ α′} = lev_{≤α′} f closed in R^n for all α′ ∈ R.

This shows 3. for all α′ ∈ R. For α′ = +∞ we have lev_{≤+∞} f = R^n, which is always closed.
For the case α′ = −∞ we note that

lev_{≤−∞} f = ⋂_{k∈N} lev_{≤−k} f

is also closed as the countable intersection of closed sets.

3.⇒1.: For any x′, from Prop. 2.8 we get a sequence x^k → x′ with f(x^k) →
lim inf_{x→x′} f(x) =: C. If C = +∞ we are done, since then f(x′) ≤ lim inf_{x→x′} f(x) =
+∞ always holds.
Assume now C ∈ R. Then for every ε > 0 we can find K(ε) such that f(x^k) ≤ C + ε for
all k ≥ K(ε). This implies

x^k ∈ lev_{≤C+ε} f

for all ε > 0 and k ≥ K(ε). Since all level sets are closed by assumption and the sequence
converges to x′, we get

x′ ∈ lev_{≤C+ε} f ∀ε > 0
⇔ f(x′) ≤ C + ε ∀ε > 0,

therefore

f(x′) ≤ C = lim inf_{x→x′} f(x),

which shows that f is lsc at x′.
If C = −∞ then we use a similar argument, replacing lev_{≤C+ε} by lev_{≤−1/ε}. □

Definition 2.11. f: R^n → R̄ is "level-bounded" :⇔ lev_{≤α} f is bounded for all α ∈ R.

Example 2.12. f₁(x) ≔ x² is level-bounded. f₂(x) ≔ x is not bounded below and not
level-bounded. f₃(x) ≔ 1/|x| (with the +∞ extension at 0) is bounded below and not
level-bounded. f₄(x) ≔ min {1, |x|} is bounded from below and not level-bounded, since
lev_{≤α} f = R^n for α ≥ 1.

Proposition 2.13. f: R^n → R̄ is level-bounded if and only if f(x^k) → +∞ for all sequences
(x^k) satisfying ‖x^k‖₂ → +∞.

Proof. "⇒": Assume we have ‖x^k‖₂ → +∞. Then for any α ∈ R there is K(α) such that
x^k ∉ lev_{≤α} f for k ≥ K(α), because all these sets are bounded. Therefore f(x^k) > α for all
k ≥ K(α). This holds for all α, therefore f(x^k) → +∞.
"⇐": If f is not level-bounded then there is an α such that lev_{≤α} f is unbounded.
Thus we can find a sequence (x^k) in this set with ‖x^k‖₂ → ∞ and f(x^k) ≤ α. Therefore
f(x^k) ↛ +∞. □

The property in Prop. 2.13 is also often referred to as coercivity in the literature.

Theorem 2.14. (existence of minimizers) Assume f: R^n → R̄ is lsc, level-bounded, and
proper. Then

inf_x f(x) ∈ (−∞, +∞),

and arg min f is nonempty and compact.

Proof. f is proper ⇒ f ≢ +∞ ⇒ inf f < +∞. We have

arg min f = {x ∈ R^n | f(x) ≤ inf f}
= {x ∈ R^n | f(x) ≤ α ∀α ∈ R: α > inf f}
= ⋂_{α∈R, α>inf f} lev_{≤α} f.

For any α in the intersection, lev_{≤α} f is closed since f is lsc (Thm. 2.10, 3.). But f is also
level-bounded, therefore lev_{≤α} f is bounded. Together lev_{≤α} f is compact.
We can also make the intersection countable (use α_k = inf f + 1/k if inf f ∈ R, and α_k = −k
if inf f = −∞). All sets in the intersection are nonempty and compact. Therefore Cantor's
intersection theorem states that their intersection is also nonempty (it is also compact).
(Why? Take any sequence with x^k ∈ lev_{≤α_k} f; such x^k exist since the sets are nonempty.
(x^k) has a converging subsequence because it lies completely in lev_{≤α_1} f, which is compact.
Denote the limit by x′. For every k we can find a "tail" of the subsequence that lies completely
in lev_{≤α_k} f (and still converges to x′), and because all these sets are closed we have
x′ ∈ lev_{≤α_k} f. Therefore x′ is also contained in the intersection.)
The only thing that remains to be shown is that inf f > −∞. Assume that inf f = −∞.
By the previous part, arg min f ≠ ∅. Therefore there exists x ∈ arg min f, but then
f(x) = inf f = −∞, which contradicts the properness of f. □

Remark 2.15. The proof of the theorem does not require full level-boundedness; it suffices
to have lev_{≤α} f bounded and nonempty for at least one α ∈ R: closedness of the sublevel
sets follows from f being lsc, and boundedness is only required for all lev_{≤α′} f with α′ ≤ α,
for which it automatically holds because lev_{≤α′} f ⊆ lev_{≤α} f.
An example of such a function is f(x) = 1 − e^{−|x|}, which is bounded from above by 1
and therefore not level-bounded, but it is lsc, proper, and attains its minimum at x = 0. All
the sets lev_{≤α} f are bounded for α < 1, and, as one would expect, arg min f = {0} is non-
empty, closed and convex.

Proposition 2.16. (lower semi-continuity of sums and scalar multiples)

1. f, g lsc and proper ⇒ f + g lsc,
2. f lsc, λ ≥ 0 ⇒ λf lsc,
3. f: R^n → R̄ lsc and g: R^m → R^n continuous ⇒ f ∘ g lsc.

Proof. Exercise. □
Chapter 3
Convexity
Definition 3.1. (convex sets and functions)
1. f: R^n → R̄ is "convex" :⇔

f((1 − τ) x + τ y) ≤ (1 − τ) f(x) + τ f(y)  ∀x, y ∈ R^n, τ ∈ (0, 1). (3.1)

2. C ⊆ R^n is "convex" :⇔ δ_C is convex ⇔ (1 − τ) x + τ y ∈ C ∀x, y ∈ C, τ ∈ (0, 1).

3. f: R^n → R̄ is "strictly convex" :⇔ f convex and (3.1) holds strictly for all x ≠ y
with f(x), f(y) ∈ R and for all τ ∈ (0, 1).

Remark 3.2. C ⊆ R^n is convex iff for every two points x, y ∈ C the whole line segment
between them is contained in C:

{(1 − τ) x + τ y | τ ∈ (0, 1)} ⊆ C ∀x, y ∈ C.

If −f is convex then we say that f is concave. Note that this does not correspond to
reversing the inequality sign in (3.1) alone, due to the +∞ − ∞ = +∞ convention (Def. 2.1):
one would also have to change the convention to +∞ − ∞ = −∞ in addition to reversing
the sign. In order to avoid this problem we prefer saying that −f is convex over saying
that f is concave.

Example 3.3.
1. R^n is convex,
2. {x ∈ R^n | x ≥ 0} is convex,
3. {x ∈ R^n | ‖x‖₂ ≤ 1} is convex,
4. {x ∈ R^n | ‖x‖₂ ≤ 1, x ≠ 0} is not convex,
5. the half-spaces {x | aᵀx + b ≥ 0} are convex,
6. f(x) = aᵀx + b is convex (the inequality holds as an equality) but not strictly convex,
7. f(x) = ‖x‖₂² is strictly convex,
8. f(x) = ‖x‖₂ is convex but not strictly convex.

Definition 3.4. (convex combination) Assume x₀, …, x_m ∈ R^n and λ₀, …, λ_m ≥ 0 with
∑_{i=0}^m λ_i = 1. We call the linear combination

∑_{i=0}^m λ_i x_i

a "convex combination" of the points x₀, …, x_m.


Theorem 3.5. (convex combinations and Jensen's inequality)

1. f: R^n → R̄ convex ⇔

f(∑_{i=0}^m λ_i x_i) ≤ ∑_{i=0}^m λ_i f(x_i)  for all m ≥ 0, x_i ∈ R^n, λ_i ≥ 0, ∑_{i=0}^m λ_i = 1. (3.2)

2. C ⊆ R^n convex ⇔ C contains all convex combinations of its elements.

Proof.
1. "⇐": For m = 1, (3.2) is the convexity condition.
"⇒": Induction: m = 0 is trivial, and the case m = 1 follows from convexity.
Assume that (3.2) holds for all m′ < m for some m ≥ 2, and consider an arbitrary
convex combination x = λ₀ x₀ + ⋯ + λ_m x_m.
W.l.o.g. assume that all λ_i > 0 (if not we can remove λ_i from the sum together
with the corresponding x_i) and all λ_i < 1 (if not then all other λ_i = 0 and we have
the trivial case m = 0). Then

x = λ₀ x₀ + ⋯ + λ_m x_m
  = (1 − λ_m) ∑_{i=0}^{m−1} (λ_i / (1 − λ_m)) x_i + λ_m x_m.

Thus

f(x) ≤ (1 − λ_m) f(∑_{i=0}^{m−1} (λ_i / (1 − λ_m)) x_i) + λ_m f(x_m)   [case m′ = 1]
     ≤ (1 − λ_m) ∑_{i=0}^{m−1} (λ_i / (1 − λ_m)) f(x_i) + λ_m f(x_m)   [case m′ = m − 1]
     = ∑_{i=0}^m λ_i f(x_i).

2. C is convex iff δ_C is convex, and

δ_C(∑_{i=0}^m λ_i x_i) ≤ ∑_{i=0}^m λ_i δ_C(x_i)

holds iff there exists x_i ∉ C or ∑_{i=0}^m λ_i x_i ∈ C. Therefore (3.2) holds for δ_C iff C contains
all convex combinations of its elements, and 2. follows from 1. □
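Jensen's inequality (3.2) is easy to spot-check numerically; the sketch below (the functions and sample points are arbitrary illustrative choices) verifies it for the convex f(x) = x² and shows that it fails for the non-convex g(x) = −x²:

```python
import random

def jensen_holds(f, xs, lams, tol=1e-12):
    """Check f(sum_i lam_i x_i) <= sum_i lam_i f(x_i) for one convex combination."""
    x_bar = sum(l * x for l, x in zip(lams, xs))
    return f(x_bar) <= sum(l * f(x) for l, x in zip(lams, xs)) + tol

rng = random.Random(0)
xs = [rng.uniform(-5.0, 5.0) for _ in range(6)]
w = [rng.random() for _ in range(6)]
lams = [wi / sum(w) for wi in w]          # nonnegative weights summing to 1

convex_ok = jensen_holds(lambda x: x * x, xs, lams)      # holds for convex x^2
concave_ok = jensen_holds(lambda x: -x * x, xs, lams)    # fails for -x^2
```

The gap between the two sides of (3.2) for x² is exactly the weighted variance of the sample points, which is strictly positive here.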


Proposition 3.6. (effective domain of convex functions)

f: R^n → R̄ convex ⇒ dom f convex.

Proof. Let x, y ∈ dom f, τ ∈ (0, 1). Then f(x), f(y) < +∞ and by convexity f((1 − τ) x +
τ y) ≤ (1 − τ) f(x) + τ f(y) < +∞, therefore (1 − τ) x + τ y ∈ dom f. □

Proposition 3.7. (convexity of the epigraph)

1. f: R^n → R̄ convex ⇔ epi f is convex in R^n × R.
2. f: R^n → R̄ convex ⇔ {(x, α) ∈ R^n × R | f(x) < α} (strict epigraph set) is convex in
R^n × R.

Proof.
1.

epi f convex ⇔ (1 − τ)(x₀, α₀) + τ(x₁, α₁) ∈ epi f ∀(x₀, α₀), (x₁, α₁) ∈ epi f, τ ∈ (0, 1)
⇔ f((1 − τ) x₀ + τ x₁) ≤ (1 − τ) α₀ + τ α₁ ∀x₀, x₁, ∀α₀ ≥ f(x₀), α₁ ≥ f(x₁), τ ∈ (0, 1)
⇔ f((1 − τ) x₀ + τ x₁) ≤ (1 − τ) f(x₀) + τ f(x₁) ∀x₀, x₁, ∀τ ∈ (0, 1)

(last equivalence: "⇒" follows immediately with α_i = f(x_i), "⇐" follows because
τ ∈ (0, 1) and therefore (1 − τ) > 0 and τ > 0).
2. Similar, with strict inequalities. □


Proposition 3.8. (convexity of sublevel sets) f: R^n → R̄ convex ⇒ lev_{≤α} f convex for all
α ∈ R̄.

Proof. α = +∞ is trivial (lev_{≤+∞} f = R^n). Let α < +∞, x, y ∈ lev_{≤α} f and τ ∈ (0, 1). Then

f((1 − τ) x + τ y) ≤ (1 − τ) f(x) + τ f(y)   [f convex]
                  ≤ (1 − τ) α + τ α          [x, y ∈ lev_{≤α} f]
                  = α.

Therefore (1 − τ) x + τ y ∈ lev_{≤α} f. □

Theorem 3.9. (global optimality) Assume f: R^n → R̄ is convex. Then

1. arg min f is convex.
2. x is a local minimizer of f ⇒ x is a global minimizer of f.
3. f strictly convex and proper ⇒ f has at most one global minimizer.

Proof.
1. If f ≢ +∞ then arg min f = lev_{≤ inf f} f and we can use Prop. 3.8. If f ≡ +∞ then
arg min f = ∅, which is convex.
2. Assume x is a local minimizer and y ∈ R^n with f(y) < f(x), i.e., x is not a
global minimizer. By convexity:

f((1 − τ) x + τ y) ≤ (1 − τ) f(x) + τ f(y)  ∀τ ∈ (0, 1)
⇒ f(x + τ(y − x)) < (1 − τ) f(x) + τ f(x)  ∀τ ∈ (0, 1)
⇒ f(x + τ(y − x)) < f(x)  ∀τ ∈ (0, 1).

⇒ x cannot be a local minimizer.

3. Assume x, y are global minimizers. Then f(x) = f(y) = inf f ∈ R (f is proper,
Thm. 2.14). If x ≠ y then by strict convexity:

f((1 − τ) x + τ y) < (1 − τ) f(x) + τ f(y)  ∀τ ∈ (0, 1)
⇒ f((1 − τ) x + τ y) < inf f  ∀τ ∈ (0, 1).

This is impossible, therefore x = y. □

Proposition 3.10. (operations that preserve convexity) Let I be an arbitrary index set.
Then
1. f_i, i ∈ I convex ⇒ f(x) ≔ sup_{i∈I} f_i(x) is convex,
2. f_i, i ∈ I strictly convex, I finite ⇒ f(x) ≔ sup_{i∈I} f_i(x) is strictly convex,
3. C_i, i ∈ I convex ⇒ ⋂_{i∈I} C_i convex,
4. f_k, k ∈ N convex ⇒ f(x) ≔ lim sup_{k→∞} f_k(x) is convex.

Proof. Exercise.
1. From the definition of convexity and a_i ≤ b_i ⇒ sup_{i∈I} a_i ≤ sup_{i∈I} b_i.
2. Same with strict inequalities for the finite sup.
3. δ_{⋂_{i∈I} C_i} = sup_{i∈I} δ_{C_i} and 1.
4. Similar to 1. □

Example 3.11.
1. The union of convex sets is generally not convex (but can be): C = [0, 1], D = [2, 3].
2. f: R^n → R convex and C convex ⇒ f + δ_C is convex ⇒ the set of minimizers of f on
C is convex (Thm. 3.9).
3. f(x) = |x| = max {x, −x} is convex (Prop. 3.10).
4. f(x) = ‖x‖₂ = sup_{‖y‖₂ ≤ 1} yᵀx is convex (Prop. 3.10) (similarly: f(x) = ‖x‖_p is convex;
look at the dual norm ‖·‖_q with 1/p + 1/q = 1). It is not strictly convex: set x = 0, y ≠ 0,
then f(x/2 + y/2) = f(y/2) = f(x)/2 + f(y)/2.

Theorem 3.12. (derivative tests) Assume C ⊆ R^n is open and convex, and f: C → R (i.e.,
real-valued!) is differentiable. Then the following conditions are equivalent:
1. f is [strictly] convex,
2. ⟨y − x, ∇f(y) − ∇f(x)⟩ ≥ 0 for all x, y ∈ C [and > 0 if x ≠ y],
3. f(x) + ⟨y − x, ∇f(x)⟩ ≤ f(y) for all x, y ∈ C [and < f(y) if x ≠ y],
4. if f is additionally twice differentiable: ∇²f(x) is positive semidefinite for all x ∈ C.
If f is twice differentiable and ∇²f is positive definite, then f is strictly convex. The
converse does not hold.

Proof. Exercise; reduce to the one-dimensional case. □

Remark 3.13. The second condition is a monotonicity condition on the gradient: in one
dimension it becomes

(y − x) (f′(y) − f′(x)) ≥ 0,

which is equivalent to f′ being nondecreasing: the derivative must not decrease when
moving towards larger values of x. The third condition says that f is never below any
of its local linear approximations. The fourth condition in one dimension means f′′ ≥ 0,
the graph is “curved upwards”.
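These conditions lend themselves to a quick numerical sanity check. The following sketch (an illustration, not part of the notes: the test function f(x) = exp(x) and the sample grid are arbitrary choices) verifies conditions 2. and 3. for a smooth convex function:

```python
import math

# Numerical check of Thm. 3.12 for the convex function f(x) = exp(x).
f = math.exp
df = math.exp  # f'(x) = exp(x)

xs = [i / 4 for i in range(-8, 9)]
for x in xs:
    for y in xs:
        # 2. monotonicity of the gradient: (y - x)(f'(y) - f'(x)) >= 0
        assert (y - x) * (df(y) - df(x)) >= 0
        # 3. f lies above its linearizations: f(x) + f'(x)(y - x) <= f(y)
        assert f(x) + df(x) * (y - x) <= f(y) + 1e-12
```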

Proposition 3.14.
1. (nonnegative linear combination) Assume f1, …, fm: Rn → R̄ are convex, λ1, …,
λm ≥ 0. Then

f := ∑_{i=1}^m λi fi

is convex. If at least one of the fi with λi > 0 is strictly convex, then f is strictly
convex.
2. (separable sum) Assume fi: R^{ni} → R̄, i = 1, …, m are convex. Then

f: R^{n1} × ⋯ × R^{nm} → R̄,   f(x1, …, xm) := ∑_{i=1}^m fi(xi)

is convex. If all fi are strictly convex, then f is strictly convex.
3. (linear composition) Assume f: Rm → R̄ is convex, A ∈ Rm×n, and b ∈ Rm. Then

g(x) := f(A x + b)

is convex.
Proof.
1. From Def. 3.1:

f((1 − τ) x + τ y) = ∑_{i=1}^m λi fi((1 − τ) x + τ y)
                   ≤[<] ∑_{i=1}^m λi ((1 − τ) fi(x) + τ fi(y))
                   = (1 − τ) f(x) + τ f(y).

2. From Def. 3.1:

f((1 − τ) x + τ y) = ∑_{i=1}^m fi((1 − τ) xi + τ yi)
                   ≤[<] ∑_{i=1}^m ((1 − τ) fi(xi) + τ fi(yi))
                   = (1 − τ) f(x) + τ f(y).

3. From Def. 3.1:

g((1 − τ) x + τ y) = f(A ((1 − τ) x + τ y) + b)
                   = f((1 − τ)(A x + b) + τ (A y + b))
                   ≤ (1 − τ) f(A x + b) + τ f(A y + b)   (f convex)
                   = (1 − τ) g(x) + τ g(y). □

Proposition 3.15. (convexity properties of sets)
1. C1, …, Cm convex ⇒ C1 × ⋯ × Cm convex.
2. C ⊆ Rn convex, A ∈ Rm×n, b ∈ Rm, L(x) := A x + b ⇒ L(C) convex.
3. C ⊆ Rm convex, A ∈ Rm×n, b ∈ Rm, L(x) := A x + b ⇒ L⁻¹(C) convex.
4. C1, C2 convex ⇒ C1 + C2 convex.
5. C convex, λ ∈ R ⇒ λ C convex.

Proof. Exercise. □

Definition 3.16. For any set S ⊆ Rn and any point y ∈ Rn, we define the “projection of
y onto S” as

Π_S(y) := arg min_{x∈S} ‖x − y‖2.

Proposition 3.17. Assume that C ⊆ Rn is convex, closed, and C ≠ ∅. Then Π_C is single-
valued, i.e., the projection of y onto C is unique for every y ∈ Rn.

Proof. We can rewrite the problem using a more convenient (differentiable) objective:

Π_C(y) = arg min_{x∈C} (1/2) ‖x − y‖2²
       = arg min_x (1/2) ‖x − y‖2² + δ_C(x).

Quick proof:
• x ↦ (1/2) ‖x − y‖2² is lsc (it is continuous), level-bounded (f(x) → +∞ as ‖x‖2 →
+∞) and proper (never −∞, not always +∞).
δ_C(x) is lsc because C is closed (Ex. 2.9, alternatively Thm. 2.10: epi δ_C is closed
⇒ δ_C lsc).
⇒ f(x) := (1/2) ‖x − y‖2² + δ_C(x) is lsc, level-bounded and proper
⇒ arg min f ≠ ∅ by Thm. 2.14.
• x ↦ (1/2) ‖x − y‖2² is strictly convex (the gradient is x − y, thus ⟨y − x, ∇f(y) − ∇f(x)⟩ =
‖y − x‖2² > 0 if x ≠ y, which yields strict convexity using Thm. 3.12). We could also
use the second-order criterion in Thm. 3.12 and verify that ∇²((1/2) ‖· − y‖2²) = I is
positive definite.

Alternative proof without using differentiability: s ↦ (s − t)² is strictly convex
(prove this directly) ⇒ x ↦ ‖x − y‖2² is strictly convex (sum of strictly convex
functions) ⇒ x ↦ (1/2) ‖x − y‖2² is strictly convex (positive multiple of a strictly convex
function).
δ_C is convex because C is convex (Rem. 3.2) and proper because C ≠ ∅ (otherwise
δ_C would be ≡ +∞)
⇒ f is strictly convex (sum of a strictly convex and a convex function)
⇒ f has at most one minimizer by Thm. 3.9. □
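The uniqueness of the projection can be illustrated numerically. A minimal sketch for the closed convex set C = [0,1]² (the coordinate-wise clamping formula is specific to boxes and is our own choice, not part of the proposition):

```python
import random

# For a box, the Euclidean projection is clamping each coordinate.
def project_box(y):
    return [min(max(t, 0.0), 1.0) for t in y]

def sqdist(a, b):
    return sum((s - t) ** 2 for s, t in zip(a, b))

random.seed(0)
y = [2.5, -0.7]
p = project_box(y)           # -> [1.0, 0.0]
# No feasible point is closer to y than the projection:
for _ in range(1000):
    z = [random.random(), random.random()]
    assert sqdist(y, p) <= sqdist(y, z)
```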

Definition 3.18. (convex hull) For an arbitrary set S ⊆ Rn,

con S := ⋂_{C convex, S⊆C} C

is the “convex hull” of S.

Remark 3.19. The convex hull con S is the smallest convex set that contains S: con S
is convex by Prop. 3.10 (intersection of convex sets) and every set C that is convex and
contains S also contains con S by definition.

Theorem 3.20. (convex hulls from convex combinations) Assume S ⊆ Rn, then

con S = { ∑_{i=0}^p λi xi | xi ∈ S, λi ≥ 0, ∑_{i=0}^p λi = 1, p ≥ 0 }.

Proof. We denote the right-hand side by D and need to show that con S = D.
“⊇”: S ⊆ con S. con S is convex ⇒ (Thm. 3.5): con S contains all convex combinations
of points in con S ⇒ con S contains all convex combinations of points in S ⇔ D ⊆ con S.
“⊆”: if x, y ∈ D then for some xi, yi, λi, µi:

(1 − τ) x + τ y = (1 − τ) ∑_{i=0}^{mx} λi xi + τ ∑_{i=0}^{my} µi yi
               = ∑_{i=0}^{mx} (1 − τ) λi xi + ∑_{i=0}^{my} τ µi yi,

with

∑_{i=0}^{mx} (1 − τ) λi + ∑_{i=0}^{my} τ µi = (1 − τ) ∑_{i=0}^{mx} λi + τ ∑_{i=0}^{my} µi
                                            = (1 − τ) · 1 + τ · 1
                                            = 1.

⇒ convex combinations of elements in D are in D.
⇒ (Thm. 3.5) D is convex.
⇒ (Def. 3.18, S ⊆ D) con S ⊆ D. □
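A small numerical illustration (our own example, not from the notes): random convex combinations of the four corners of the unit square always land in its convex hull [0,1]².

```python
import random

# Thm. 3.20: convex combinations of S = {(0,0),(1,0),(0,1),(1,1)}
# fill out con S = [0,1]^2.
random.seed(1)
S = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
for _ in range(1000):
    w = [random.random() for _ in S]
    total = sum(w)
    lam = [wi / total for wi in w]          # lambda_i >= 0, sum = 1
    x = sum(l * p[0] for l, p in zip(lam, S))
    y = sum(l * p[1] for l, p in zip(lam, S))
    assert 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0  # point lies in con S
```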

Definition 3.21. (closure, interior) For a set C ⊆ Rn, denote

cl C := {x ∈ Rn | for all (open) neighborhoods N of x we have N ∩ C ≠ ∅},
int C := {x ∈ Rn | there exists an (open) neighborhood N of x such that N ⊆ C},
bnd C := cl C \ int C.

Remark 3.22. The closure is to closedness what the convex hull is to convexity:

cl C = ⋂_{S closed, C⊆S} S.
Chapter 4
Cones and Generalized Inequalities
Definition 4.1. (cones) K ⊆ Rn is a “cone” :⇔

0 ∈ K,   λ x ∈ K   ∀x ∈ K, λ > 0.

A cone K is “pointed” :⇔

x1 + ⋯ + xm = 0, xi ∈ K ∀i ∈ {1, …, m} ⇒ x1 = ⋯ = xm = 0.

Note that cones can also be nonconvex, such as the cone K = (R≥0 × {0}) ∪ ({0} × R≥0).

Proposition 4.2. (convex cones) Assume K ⊆ Rn is an arbitrary set. Then the following
conditions are equivalent:
1. K is a convex cone,
2. K is a cone and K + K ⊆ K,
3. K ≠ ∅ and ∑_{i=0}^m αi xi ∈ K for all xi ∈ K and αi ≥ 0 (not necessarily summing to 1).

Proof. Exercise, idea: x = ∑_{i=0}^m αi xi ∈ K, αi ≥ 0 ⇔ ∑_{i=0}^m (αi / ∑_j αj) xi ∈ K, and the latter is a
convex combination with coefficients summing to 1 (if there is at least one αi ≠ 0). □

Proposition 4.3. (pointed cones) Assume K is a convex cone. Then

K pointed ⇔ K ∩ (−K) = {0}.

Proof. “⇒”: x ∈ K ∩ (−K) ⇒ x, −x ∈ K (Def. 4.1). Then 0 = x + (−x) ⇒ x = −x = 0.
“⇐”: K not pointed ⇒ there exist x1 + ⋯ + xm = 0, xi ∈ K, not all zero (w.l.o.g. x1 ≠ 0)
⇒ x1 + (x2 + ⋯ + xm) = 0. Therefore x2 + ⋯ + xm = −x1, and x2 + ⋯ + xm ∈ K (Prop. 4.2)
⇒ x1 ∈ K ∩ (−K) ⇒ K ∩ (−K) ≠ {0}. □

Proposition 4.4. (generalized inequalities) For a closed convex cone K ⊆ Rn we define
the “generalized inequality”

x ≥_K y :⇔ x − y ∈ K.

Then
1. x ≥_K x (reflexivity),
2. x ≥_K y, y ≥_K z ⇒ x ≥_K z (transitivity),
3. x ≥_K y ⇒ −y ≥_K −x,
4. x ≥_K y, λ ≥ 0 ⇒ λ x ≥_K λ y,
5. x ≥_K y, x′ ≥_K y′ ⇒ x + x′ ≥_K y + y′,

6. If xk → x and yk → y with xk ≥_K yk for all k ∈ N, then x ≥_K y.
7. x ≥_K y, y ≥_K x ⇒ x = y for all x, y ∈ Rn (antisymmetry) holds iff K is pointed.
If “≥” is a relation on Rn satisfying 1.–6., then it can be represented as ≥_K for a closed
convex cone K.

Proof. 1.–6. from the definition of a cone. Converse: recover K = {x ∈ Rn | x ≥ 0}; then
x ≥ y ⇔ x − y ≥ 0 (using 5.) ⇔ x − y ∈ K ⇔ x ≥_K y. Then show that K is a closed convex cone.
Antisymmetry: x ≥_K y, y ≥_K x ⇔ x − y ∈ K, y − x ∈ K ⇔ x − y ∈ K ∩ (−K). This holds for
all x, y iff K ∩ (−K) = {0}, which by Prop. 4.3 is equivalent to K being pointed. □

Definition 4.5. (conic program) For any pointed, closed, convex cone K ⊆ Rm, a matrix
A ∈ Rm×n and vectors c ∈ Rn, b ∈ Rm, we define the “conic program” or “conic problem”
(CP)

inf_x c⊤x   s.t.   A x ≥_K b.

Example 4.6. (standard cone) The “standard cone”

K_n^LP := {x ∈ Rn | x1, …, xn ≥ 0}

is a pointed, closed, convex cone. The associated conic program is the “linear program” (LP)

inf_x c⊤x   s.t.   A x ≥ b.

This is surprisingly powerful: for example, the problem

min_x |x1 − x2|   s.t.   x1 = −1, x2 ≥ 0

could be written as

min_{x,y} y   s.t.   y ≥ |x1 − x2|, x1 ≥ −1, x1 ≤ −1, x2 ≥ 0,

then

min_{x,y} y   s.t.   y ≥ x1 − x2, y ≥ x2 − x1, x1 ≥ −1, −x1 ≥ 1, x2 ≥ 0,

and finally

min_{(x1,x2,y)∈R³} (0, 0, 1) (x1, x2, y)⊤

s.t.  [ −1   1   1 ]            [  0 ]
      [  1  −1   1 ]  [ x1 ]    [  0 ]
      [  1   0   0 ]  [ x2 ]  ≥ [ −1 ]
      [ −1   0   0 ]  [ y  ]    [  1 ]
      [  0   1   0 ]            [  0 ]
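The encoding above can be checked by hand. A minimal sketch (the candidate point is our own choice; with an LP solver available one would instead pass A, b, c to it): the point (x1, x2, y) = (−1, 0, 1) is feasible and attains objective value 1, matching min |x1 − x2| = 1 under x1 = −1, x2 ≥ 0.

```python
# The LP data from the reformulation above.
A = [[-1, 1, 1],
     [ 1, -1, 1],
     [ 1, 0, 0],
     [-1, 0, 0],
     [ 0, 1, 0]]
b = [0, 0, -1, 1, 0]
c = [0, 0, 1]

x = [-1.0, 0.0, 1.0]                     # candidate (x1, x2, y)
Ax = [sum(aij * xj for aij, xj in zip(row, x)) for row in A]
assert all(v >= bi for v, bi in zip(Ax, b))      # A x >= b componentwise
assert sum(ci * xi for ci, xi in zip(c, x)) == 1.0  # objective value
```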

Example 4.7. (second-order cone) The “second-order cone” (also called “Lorentz cone” or
“ice-cream cone”)

K_n^SOCP := { x ∈ Rn | xn ≥ √(x1² + ⋯ + x_{n−1}²) }

is a pointed, closed, convex cone. Conic programs with K = K_{n1}^SOCP × ⋯ × K_{nl}^SOCP are
called “second-order conic programs” (SOCP).
Example:

min_x ‖x‖2   s.t.   x1 + x2 ≥ 1.

This can be rewritten as a second-order cone program:

min_{x,y} y   s.t.   y ≥ ‖x‖2,   x1 + x2 − 1 ≥ 0
⇔   I3 (x, y)⊤ ≥_{K3^SOCP} 0,   (1 1 0) (x, y)⊤ ≥_{K1^SOCP} (1).

Example 4.8. (positive semidefinite cone) The “positive semidefinite cone”

K_n^SDP := {X ∈ Rn×n | X symmetric positive semidefinite}

is a pointed, closed, convex cone. Conic programs with K = K_{n1}^SDP × ⋯ × K_{nl}^SDP are
called “semidefinite programs” (SDP):

inf_{x∈Rn} c⊤x   s.t.   A x − b positive semidefinite.

Here A is a linear operator A: Rn → Rm×m, and b ∈ Rm×m. Often x and c are also written
as matrices X, C ∈ Rn×n with the inner product ⟨C, X⟩ := ∑_{i,j} Cij Xij replacing c⊤x.

Proof. The set of symmetric matrices R^{n×n}_sym is a closed convex cone: every set

Kij := {A ∈ Rn×n | Aij = Aji}

is a closed convex cone. Finite intersections of closed convex cones are closed convex cones,
so R^{n×n}_sym = ⋂_{i≠j} Kij is a closed convex cone.
Closedness: x⊤Ak x ≥ 0 for all x and k implies x⊤A x ≥ 0 for Ak → A.
Cone: follows immediately, 0 ∈ K and A psd ⇒ λ A psd if λ ≥ 0 (nonnegative eigenvalues).
Convex cone: A, B ∈ K. Then A, B are symmetric. For every x ∈ Rn: x⊤A x ≥ 0,
x⊤B x ≥ 0 (A, B pos. semidef.) ⇒ x⊤(A + B) x ≥ 0 ⇒ A + B pos. semidef. ⇒ A + B ∈ K.
Therefore (Prop. 4.2) K is a convex cone.
Pointed: A ∈ K, A ∈ −K ⇒ A symmetric, all eigenvalues ≥ 0 and all eigenvalues ≤ 0 ⇒ A = 0. □
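For 2×2 symmetric matrices the cone properties can be checked with an elementary criterion (psd ⇔ trace ≥ 0 and det ≥ 0; this characterization is specific to the 2×2 case and is our own addition):

```python
# Matrix [[a, b], [b, c]] is psd iff trace >= 0 and det >= 0 (2x2 only).
def is_psd2(a, b, c):
    return a + c >= 0 and a * c - b * b >= -1e-12

A = (2.0, 1.0, 1.0)            # psd
B = (1.0, -1.0, 3.0)           # psd
assert is_psd2(*A) and is_psd2(*B)
# A + B is psd (the cone is closed under addition, Prop. 4.2),
# and so is 5 * A (it is a cone):
assert is_psd2(*(x + y for x, y in zip(A, B)))
assert is_psd2(*(5.0 * x for x in A))
```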
Chapter 5
Subgradients
Definition 5.1. (set-valued mappings) Let X, U be sets. Then a mapping

S: X → 2^U

is a “set-valued mapping S: X ⇒ U”.

Remark 5.2. The set-valued mappings correspond to the relations on X × U:

S ↦ gph S = {(x, u) | u ∈ S(x)} =: R ⊆ X × U,   R ↦ S(x) = {u | (x, u) ∈ R = gph S}.

Definition 5.3. (domain, range, inverse) Let S: X ⇒ U. Then

S⁻¹(u) := {x ∈ X | u ∈ S(x)},
dom S := {x ∈ X | S(x) ≠ ∅},
rge S := {u ∈ U | ∃x ∈ X: u ∈ S(x)}.

Remark 5.4. gph S⁻¹ is the “transpose” of the graph of S:

u ∈ S(x) ⇔ (x, u) ∈ gph S ⇔ (u, x) ∈ gph S⁻¹ ⇔ x ∈ S⁻¹(u).

This also shows that (S⁻¹)⁻¹ = S always holds ((S⁻¹)⁻¹(x) = {u | x ∈ S⁻¹(u)} = {u | u ∈
S(x)} = S(x)).

Definition 5.5. (subgradients) For any f: Rn → R̄ and x ∈ Rn,

∂f(x) := {v ∈ Rn | f(x) + ⟨v, y − x⟩ ≤ f(y) ∀y ∈ Rn}

is the set of “subgradients of f at x”. The set-valued mapping ∂f: Rn ⇒ Rn is the “subdif-
ferential mapping of f”.

Note that for constant functions f ≡ c ∈ R we get ∂f(x) = {0} for all x ∈ Rn. However,
for the constant functions f ≡ +∞ and f ≡ −∞, from +∞ ≤ +∞ and −∞ ≤ −∞ (in fact
equality holds), we conclude that

∂(+∞)(x) = Rn,   ∂(−∞)(x) = Rn   ∀x ∈ Rn.

Proposition 5.6. (subgradients of differentiable functions) Assume that f, g: Rn → R̄ are
convex. Then:
1. if f is differentiable at x, then ∂f(x) = {∇f(x)},
2. if f is differentiable at x and g(x) ∈ R, then ∂(f + g)(x) = ∂g(x) + ∇f(x).

Proof. 1. follows from 2. using g ≡ 0, since then ∂g(x) = {0} for all x.
To show 2., we first show ∂(f + g)(x) ⊇ ∂g(x) + ∇f(x). If v ∈ ∂g(x), then

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩,
g(y) ≥ g(x) + ⟨v, y − x⟩.

The first inequality uses the same argument as in Thm. 3.12 (we cannot apply the theorem
directly because we do not know if f is differentiable in a neighborhood of x, but the proof
is the same; specifically, for all t > 0 we have

(f(y) − f(x) − ⟨∇f(x), y − x⟩) / ‖y − x‖2
  = (t f(y) − t f(x) − ⟨∇f(x), t (y − x)⟩) / (t ‖y − x‖2)
  = ((1 − t) f(x) + t f(y) − f(x) − ⟨∇f(x), t (y − x)⟩) / (t ‖y − x‖2)
  ≥ (f(x + t (y − x)) − f(x) − ⟨∇f(x), t (y − x)⟩) / (t ‖y − x‖2)   (f convex).

This holds for all t > 0, and the last term converges to 0 as t ↘ 0 by the definition of
differentiability. Thus f(y) − f(x) − ⟨∇f(x), y − x⟩ ≥ 0.)
Adding the inequalities for f and g we obtain

f(y) + g(y) ≥ f(x) + g(x) + ⟨v + ∇f(x), y − x⟩,

which shows v + ∇f(x) ∈ ∂(f + g)(x).


To show ∂(f + g)(x) ⊆ ∂g(x) + ∇f(x), assume that v ∈ ∂(f + g)(x). Then (f + g)(y) −
(f + g)(x) ≥ ⟨v, y − x⟩ for any y ∈ Rn per definition. Thus in particular

lim inf_{z→x} (f(z) + g(z) − f(x) − g(x) − ⟨v, z − x⟩) / ‖z − x‖2 ≥ 0.

Because f is differentiable in x, this lim inf is exactly the same as the lim inf of

(−f(z) + f(x) + ⟨∇f(x), z − x⟩ + f(z) − f(x) + g(z) − g(x) − ⟨v, z − x⟩) / ‖z − x‖2.

This is possible because f(z) and f(x) must be finite close to x by the definition of
differentiability, and we can simplify:

lim inf_{z→x} (g(z) − g(x) − ⟨v − ∇f(x), z − x⟩) / ‖z − x‖2 ≥ 0.

Now consider z(t) := (1 − t) x + t y. Then the above equation states

lim inf_{t↘0} (g(z(t)) − g(x) − ⟨v − ∇f(x), z(t) − x⟩) / ‖z(t) − x‖2 ≥ 0.

From the definition of convexity we get

g(z(t)) ≤ (1 − t) g(x) + t g(y).

Thus

lim inf_{t↘0} ((1 − t) g(x) + t g(y) − g(x) − ⟨v − ∇f(x), z(t) − x⟩) / ‖z(t) − x‖2 ≥ 0.

We assumed that g(x) is finite, so (1 − t) g(x) − g(x) = −t g(x) (if g(x) = +∞ this gives
∞ − ∞ = −∞, which is wrong). Also z(t) − x = t (y − x), and we get

lim inf_{t↘0} (−t g(x) + t g(y) − ⟨v − ∇f(x), t (y − x)⟩) / (t ‖y − x‖2) ≥ 0
⇔ g(y) − g(x) − ⟨v − ∇f(x), y − x⟩ ≥ 0
⇔ g(y) ≥ g(x) + ⟨v − ∇f(x), y − x⟩.

Since y was arbitrary this shows v − ∇f(x) ∈ ∂g(x), so v ∈ ∂g(x) + ∇f(x), and finally
∂(f + g)(x) ⊆ ∂g(x) + ∇f(x). □


Theorem 5.7. (Generalized Fermat) Assume f: Rn → R̄ is proper. Then

x ∈ arg min f ⇔ 0 ∈ ∂f(x).

Proof.

0 ∈ ∂f(x)
⇔ f(x) + ⟨0, y − x⟩ ≤ f(y) ∀y ∈ Rn
⇔ f(x) ≤ f(y) ∀y ∈ Rn
⇔ x ∈ arg min f   (f proper).

In the last equivalence the properness is required since arg min (+∞) = ∅ by definition, but
∂(+∞)(x) = Rn ∋ 0 for all x. □

Definition 5.8. For a convex set C ⊆ Rn and x ∈ C, the “normal cone” N_C(x) at x is
defined as

N_C(x) := {v ∈ Rn | ⟨v, y − x⟩ ≤ 0 ∀y ∈ C}.

By convention, N_C(x) := ∅ for x ∉ C.

It can be easily seen that N_C(x) is in fact a cone if x ∈ C.

Proposition 5.9. (subdifferential of indicator functions) Assume C ⊆ Rn is convex with
C ≠ ∅, then

∂δ_C(x) = N_C(x).

Proof.
For x ∈ C:

∂δ_C(x) = {v ∈ Rn | δ_C(x) + ⟨v, y − x⟩ ≤ δ_C(y) ∀y ∈ Rn}
        = {v ∈ Rn | 0 + ⟨v, y − x⟩ ≤ 0 ∀y ∈ C} = N_C(x).

For x ∉ C: Since C ≠ ∅ we can find a y ∈ C, i.e., δ_C(y) = 0. But then for any v ∈ ∂δ_C(x) we
would have

δ_C(x) + ⟨v, y − x⟩ = +∞ > 0 = δ_C(y).

Thus v ∉ ∂δ_C(x) for every v, and we conclude ∂δ_C(x) = ∅. □

Proposition 5.10. Assume C ⊆ Rn is closed and convex with C ≠ ∅, and x ∈ Rn. Then

y = Π_C(x) ⇔ x − y ∈ N_C(y).

Proof. From Prop. 3.17 we know that y = Π_C(x) is the unique minimizer of

f(y′) := (1/2) ‖y′ − x‖2² + δ_C(y′).

This is the case iff 0 ∈ ∂f(y) (Thm. 5.7). From Prop. 5.6 and Prop. 5.9 we know that
∂f(y) = y − x + N_C(y), thus

y = Π_C(x) ⇔ y ∈ arg min f
           ⇔ 0 ∈ ∂f(y)
           ⇔ 0 ∈ y − x + N_C(y)
           ⇔ x − y ∈ N_C(y). □

A consequence of Prop. 5.10 is that by looking at the normal cone at a fixed y ∈ C,
we can find all points x that get projected onto y.
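The equivalence in Prop. 5.10 is easy to test numerically. A sketch for C = [0, 1] ⊆ R (the set and sample points are our own illustrative choices):

```python
# y = Pi_C(x) iff x - y in N_C(y), i.e. (x - y)(z - y) <= 0 for all z in C.
def proj(x):                       # projection onto C = [0, 1]
    return min(max(x, 0.0), 1.0)

for x in (-0.5, 0.3, 1.7):
    y = proj(x)
    zs = [i / 10 for i in range(11)]          # sample points of C
    assert all((x - y) * (z - y) <= 1e-12 for z in zs)
```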

Proposition 5.11. (subdifferentials as normal cones) Assume f: Rn → R̄ is proper and
convex. Then

∂f(x) = ∅                                             if x ∉ dom f,
∂f(x) = {v ∈ Rn | (v, −1) ∈ N_{epi f}(x, f(x))}       if x ∈ dom f.

Also, if x ∈ dom f then

N_{dom f}(x) = {v ∈ Rn | (v, 0) ∈ N_{epi f}(x, f(x))}.

Proof. x ∉ dom f: f proper ⇒ there exists y such that f(y) ∈ R. If v ∈ ∂f(x) then by the
definition of the subdifferential

v⊤(y − x) + f(x) ≤ f(y)
⇒ +∞ ≤ f(y) ∈ R.

This is impossible, thus ∂f(x) = ∅.
x ∈ dom f:

v ∈ ∂f(x)
⇔ v⊤(y − x) + f(x) ≤ f(y) ∀y ∈ Rn
⇔ v⊤(y − x) + f(x) ≤ α ∀y ∈ dom f, ∀α ≥ f(y), α ∈ R
⇔ v⊤(y − x) + (−1)(α − f(x)) ≤ 0 ∀(y, α) ∈ epi f
⇔ (v, −1) ∈ N_{epi f}(x, f(x)).

Second part:

v ∈ N_{dom f}(x)
⇔ v⊤(y − x) ≤ 0 ∀y ∈ dom f
⇔ v⊤(y − x) + 0 · (α − f(x)) ≤ 0 ∀y ∈ dom f, ∀α ≥ f(y)
⇔ v⊤(y − x) + 0 · (α − f(x)) ≤ 0 ∀(y, α) ∈ epi f
⇔ (v, 0) ∈ N_{epi f}(x, f(x)). □


Example 5.12. The subdifferential of f: R → R̄, f(x) = |x| is

∂f(x) = {1} for x > 0,   {−1} for x < 0,   [−1, 1] for x = 0.
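This formula can be checked directly against Def. 5.5. A small sketch (the finite sample grid ys only approximates the quantifier over all y, so this is an illustration, not a proof):

```python
# v is a subgradient of f(x) = |x| at x iff |x| + v (y - x) <= |y| for all y.
def is_subgradient(v, x, ys):
    return all(abs(x) + v * (y - x) <= abs(y) + 1e-12 for y in ys)

ys = [i / 10 for i in range(-30, 31)]
assert is_subgradient(1.0, 2.0, ys)           # x > 0: v = 1 works
assert not is_subgradient(0.9, 2.0, ys)       # ... and nothing else
assert is_subgradient(-0.5, 0.0, ys)          # x = 0: any v in [-1, 1]
assert not is_subgradient(1.1, 0.0, ys)
```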


Computing subdifferentials is generally not an easy task. Fortunately, in many cases
they can be found by combining the subdifferentials of simpler functions, but this requires
a little more caution than for differentiable functions.

Definition 5.13. (relative interior) For any set C ⊆ Rn, we define the “affine hull” and
the “relative interior”

aff C := ⋂_{A affine, C⊆A} A,
rint C := {x ∈ Rn | there exists an (open) neighborhood N of x with N ∩ aff C ⊆ C}.

Proposition 5.14. (rules for subdifferentials)
1. Assume f: Rn → R̄ is convex. Then

g(x) = f(x + y) ⇒ ∂g(x) = ∂f(x + y),   y ∈ Rn,
g(x) = f(λ x) ⇒ ∂g(x) = λ ∂f(λ x),   λ ≠ 0,
g(x) = λ f(x) ⇒ ∂g(x) = λ ∂f(x),   λ > 0.

[Example why the second rule fails for λ = 0: take f(0) = 1, f(1) = −∞; then ∂f(0) is
empty, but f(0 ·) ≡ f(0) = 1 is constant, so ∂(f(0 ·))(x) = {0}.]
2. Assume f: Rn → R̄ is proper and convex, and A ∈ Rn×m such that

{A y | y ∈ Rm} ∩ rint dom f ≠ ∅.

If x ∈ dom (f ◦ A) = {y ∈ Rm | A y ∈ dom f}, then

∂(f ◦ A)(x) = A⊤ ∂f(A x).

3. Assume f1, …, fm: Rn → R̄ are proper and convex, and

rint dom f1 ∩ ⋯ ∩ rint dom fm ≠ ∅.

If x ∈ dom f, then

∂(f1 + ⋯ + fm)(x) = ∂f1(x) + ⋯ + ∂fm(x).

Proof. 1. can be shown directly. The proof of 2. is surprisingly difficult and we refer
to [Roc70, 23.8, 23.9]. 3. then follows from 2. with A = (I | ⋯ |I)⊤ and f(x1, …, xm) =
f1(x1) + ⋯ + fm(xm). □

Remark 5.15. The conditions on the relative interiors in Prop. 5.14 are important: in 3., it
is easy to see that the inclusion “⊇” always holds, but “⊆” requires more. Set

f1 := δ_{B1(0)},
f2 := δ_C,   C := {1} × R.

Then ∂f1(1, 0) = R≥0 × {0} and ∂f2(1, 0) = N_C(1, 0) = R × {0}. Thus

(∂f1 + ∂f2)(1, 0) = R × {0}.

But f1 + f2 = δ_{{(1,0)}}, thus ∂(f1 + f2)(1, 0) = R². This is only possible because f1, f2 do not
satisfy the condition on the relative interiors in Prop. 5.14.
Chapter 6
Conjugate Functions

6.1 The Legendre-Fenchel Transform


Definition 6.1. (convex hull of a function) For f: Rn → R̄,

con f(x) := sup_{g≤f, g convex} g(x)

is the “convex hull” of f.

Remark 6.2. The supremum on the right-hand side is convex (Prop. 3.10), majorized
by f, and clearly majorizes every convex function majorized by f. Therefore con f is the
greatest convex function majorized by f.

Definition 6.3. (closure of a function) For f: Rn → R̄, the “(lower) closure” cl f is defined
as

cl f(x) := lim inf_{y→x} f(y).

Proposition 6.4. (closure and the epigraph) For f: Rn → R̄ we have

epi(cl f) = cl(epi f).

Moreover, if f is convex then cl f is convex.

Proof.

(x, α) ∈ cl(epi f) ⇔ ∃xk → x, αk → α, f(xk) ≤ αk
                   ⇔ lim inf_{y→x} f(y) ≤ α ⇔ (x, α) ∈ epi(cl f).

This holds because if there exists such a sequence, then the lim inf is ≤ α, and if the lim inf
is ≤ α then take a sequence xk with f(xk) → lim inf. Then (x, lim inf_{y→x} f(y)) ∈ cl(epi f),
and therefore in particular (x, α) ∈ cl(epi f) since α ≥ lim inf_{y→x} f(y).
Second part: if f is convex then epi f is convex. For x, y ∈ epi(cl f) = cl(epi f) we have
xk → x, yk → y with xk, yk ∈ epi f. For every τ ∈ (0, 1), z := (1 − τ) x + τ y is the limit of
(1 − τ) xk + τ yk, and all these points are in epi f because epi f is convex (f is convex).
Thus z is in cl(epi f) and therefore in epi(cl f). This shows that epi(cl f) is convex and
therefore cl f is convex. □

Note that the corresponding statement for the convex hull does not hold, i.e.,

epi(con f) ≠ con(epi f)

in general. One example is the function

f(x) = 1 for x > 0,   f(x) = 0 for x < 0.

Proposition 6.5. (closure, alternative definition) For f: Rn → R̄, we have

(cl f)(x) = sup_{g≤f, g lsc} g(x).

Proof. “≤”: epi(cl f) is closed by Prop. 6.4 and therefore cl f is lsc by Thm. 2.10. Also
cl f ≤ f (i.e., lim inf_{y→x} f(y) ≤ f(x)) because of the definition of the lim inf (or take the
constant sequence yk = x). Together this shows “≤”.
“≥”: If g ≤ f and g is lsc, then

g(x) ≤ lim inf_{y→x} g(y) ≤ lim inf_{y→x} f(y) = (cl f)(x),

using first that g is lsc and then g ≤ f. □

Theorem 6.6. (envelope representation of sets) Assume that C ⊆ Rn is closed and convex.
Then

C = ⋂_{(b,β): C⊆H_{b,β}} H_{b,β},   where H_{b,β} := {x ∈ Rn | ⟨x, b⟩ − β ≤ 0}.

C is thus the intersection of all closed half-spaces containing it.

Proof. If C = Rn or C = ∅ we are done (for C = Rn there are no such half-spaces and the
empty intersection is Rn; for C = ∅ the intersection of all half-spaces is empty).
If x ∈ C then x is also contained in the intersection.
If x ∉ C then set y := Π_C(x). By Prop. 5.10 we know that v := x − y ∈ N_C(y). This means

⟨v, z − y⟩ ≤ 0 ∀z ∈ C
⇔ ⟨v, z⟩ − ⟨v, y⟩ ≤ 0 ∀z ∈ C   (set b := v, β := ⟨v, y⟩)
⇔ C ⊆ H_{b,β}.

But ⟨v, x − y⟩ = ⟨x − y, x − y⟩ = ‖x − y‖2² > 0 because y ∈ C and x ∉ C and therefore x ≠ y.
Thus

x ∉ H_{b,β},

which shows that x is not contained in the intersection. □

In fact not all the half-spaces are needed in Thm. 6.6. If we assume C ≠ ∅ then it
is enough to intersect all supporting half-spaces, i.e., the half-spaces whose associated
hyperplane touches C. The following theorem shows that if C is the epigraph of a proper
lsc convex function, we do not need the vertical half-spaces.

Theorem 6.7. (envelope representation of convex functions) Assume f: Rn → R̄ is proper,
lsc and convex. Then

f(x) = sup_{g affine, g≤f} g(x).
Proof. [RW04, 12.1] f is lsc ⇒ epi f is closed (Thm. 2.10); f is convex ⇒ epi f is convex
(Prop. 3.7). Therefore epi f is the intersection of all half-spaces containing it (in Rn × R)
by Thm. 6.6:

epi f = ⋂_{((b,c),β)∈S} H_{(b,c),β} =: I1,

where H_{(b,c),β} = {(x, α) ∈ Rn × R | ⟨(x, α), (b, c)⟩ − β ≤ 0} and S denotes the set of
half-spaces containing epi f.
Claim:

g affine ⇔ ∃(b, c), β with c < 0, epi g = H_{(b,c),β}.

“⇒”: g(x) = ⟨b, x⟩ − β. Then (x, α) ∈ epi g ⇔ ⟨b, x⟩ − β ≤ α ⇔ ⟨(b, −1), (x, α)⟩ − β ≤ 0 ⇔ (x,
α) ∈ H_{(b,−1),β}.
“⇐”: (x, α) ∈ H_{(b,c),β} with c < 0 ⇔ ⟨b, x⟩ + α c − β ≤ 0 ⇔ ⟨b, x⟩ − β ≤ (−c) α. Since c < 0
this is equivalent to ⟨−b/c, x⟩ − (−β/c) ≤ α ⇔ (x, α) ∈ epi g with g(x) = ⟨−b/c, x⟩ + β/c.
Using the claim, since

epi( sup_{g affine, g≤f} g ) = ⋂_{g affine, g≤f} epi g

and g ≤ f ⇔ epi f ⊆ epi g, we only need to show that

I1 := ⋂_{((b,c),β)∈S} H_{(b,c),β} = ⋂_{((b,c),β)∈S, c<0} H_{(b,c),β} =: I2

(the left-hand side is epi f, the right-hand side is epi(sup)).
The direction “⊆” is clear. For “⊇” we need to show that if (x̄, ᾱ) ∉ I1 then (x̄, ᾱ) ∉ I2,
i.e., there exist ((b, c), β) ∈ S with c < 0 such that (x̄, ᾱ) ∉ H_{(b,c),β}.
Assume that (x̄, ᾱ) ∉ I1. Then there exists ((b1, c1), β1) ∈ S such that (x̄, ᾱ) ∉ H_{(b1,c1),β1}.
If c1 > 0 then

epi f ⊆ H_{(b1,c1),β1} = {(x, α) | ⟨b1, x⟩ + α c1 − β1 ≤ 0}
      = {(x, α) | α ≤ (1/c1)(β1 − ⟨b1, x⟩)}   (using c1 > 0).

If this were true then epi f would be contained in a lower half-space, which cannot hold
since epi f ≠ ∅ (f is proper). This shows that c1 ≤ 0.
If c1 < 0 then (x̄, ᾱ) ∉ I2 by definition. Therefore the only difficult case is c1 = 0.
Assume c1 = 0. Then we define

g1(x) := ⟨b1, x⟩ − β1.

If x ∈ dom f then (x, f(x)) ∈ epi f, therefore (x, f(x)) ∈ H_{(b1,c1),β1} and g1(x) = ⟨x,
b1⟩ − β1 ≤ 0 on dom f.
Now take any ((b′, c′), β′) ∈ S with c′ < 0. Such a triple exists: if not, c = 0 holds for all
((b, c), β) ∈ S, which means that epi f is only bounded by vertical hyperplanes, and therefore
f is −∞ on all of dom f, which contradicts the properness of f.
We then define the associated affine function

g2(x) := ⟨−b′/c′, x⟩ − (−β′/c′) =: ⟨b2, x⟩ − β2.

Since epi f ⊆ H_{(b′,c′),β′} = epi g2 we get (x, f(x)) ∈ epi g2 and therefore g2(x) ≤ f(x)
∀x ∈ dom f.
For λ > 0 we construct the function

gλ := λ g1 + g2.

The crucial point is that gλ(x) ≤ f(x) for x ∈ dom f by the above considerations. On the
other hand, because (x̄, ᾱ) ∉ H_{(b1,c1),β1} we know that g1(x̄) > 0, and, since dom g2 = Rn,
we can choose λ large enough such that gλ(x̄) > ᾱ, i.e., (x̄, ᾱ) ∉ epi gλ. Then gλ ≤ f and
therefore

epi f ⊆ epi gλ = H_{(λ b1 + b2, −1), λ β1 + β2}.

The half-space on the right is included in the intersection in I2 because it has c = −1 < 0.
But (x̄, ᾱ) ∉ epi gλ, therefore (x̄, ᾱ) cannot be contained in I2. □

Definition 6.8. (Legendre-Fenchel transform) Let f: Rn → R̄, then

f*: Rn → R̄,   f*(v) := sup_{x∈Rn} {⟨v, x⟩ − f(x)}

is the “conjugate of f”. The mapping f ↦ f* is the “Legendre-Fenchel transform”.

Remark 6.9. The intuition here is that for some b, β, we have

⟨b, x⟩ − β ≤ f(x) ∀x ⇔ ⟨b, x⟩ − f(x) ≤ β ∀x
                    ⇔ β ≥ sup_x {⟨b, x⟩ − f(x)}
                    ⇔ β ≥ f*(b)
                    ⇔ (b, β) ∈ epi f*.

The conjugate thus characterizes all the affine functions majorized by f by providing the
offset β for any given linear part b. Also, for any given slope b, all affine functions ⟨b, x⟩ − β
with β > f*(b) are majorized by ⟨b, x⟩ − f*(b) and can therefore be left out when taking
the intersection in Thm. 6.7. This means that the Legendre-Fenchel transform describes a
reduced set of affine functions that is sufficient to reconstruct f, by returning the offset
β for a given slope b.
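The transform can be approximated numerically by replacing the supremum over Rn with a maximum over a finite grid (a crude but instructive discretization; the grid bounds are our own choice). For f(x) = x²/2 the conjugate is again v²/2:

```python
# Discrete Legendre-Fenchel transform: f*(v) = sup_x { v x - f(x) },
# approximated by a maximum over a grid on [-5, 5].
xs = [i / 100 for i in range(-500, 501)]

def conjugate(f, v):
    return max(v * x - f(x) for x in xs)

f = lambda x: 0.5 * x * x
for v in (-2.0, -0.5, 0.0, 1.0, 3.0):
    assert abs(conjugate(f, v) - 0.5 * v * v) < 1e-3   # f* = f here
```

The approximation is only valid for slopes v whose maximizer lies inside the grid; for |v| > 5 the discrete maximum no longer matches the true supremum.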

Theorem 6.10. (Legendre-Fenchel transform) Assume f: Rn → R̄. Then

f* = (con f)* = (cl f)* = (cl con f)*

and

f** := (f*)* ≤ f.

If con f is proper, then f* and f** are proper, lsc and convex, and

f** = cl con f.

If f: Rn → R̄ is proper, lsc and convex, then

f** = f.

Proof. First statement: we have

(v, β) ∈ epi f* ⇔ β ≥ f*(v)
                ⇔ β ≥ ⟨v, x⟩ − f(x) ∀x ∈ Rn
                ⇔ ⟨v, x⟩ − β ≤ f(x) ∀x ∈ Rn.   (6.1)

The (v, β) ∈ epi f* thus describe all the affine functions majorized by f.
We claim that for every affine function h we have

h ≤ f ⇔ h ≤ cl f ⇔ h ≤ con f ⇔ h ≤ cl con f.

Indeed, if h ≤ cl con f or h ≤ cl f or h ≤ con f, then h ≤ f (by definition). If h ≤ f then
h ≤ cl f, h ≤ con f and h ≤ cl con f, because h is lsc and convex and cl f, con f and cl con f
are the largest functions in their class ≤ f (for cl con f: h ≤ f ⇒ h ≤ con f ⇒ h ≤ cl con f by
the arguments for con f, cl f applied to f, con f).
Using the claim, in the last line of (6.1) we can replace f by cl f, con f, or cl con f to get

epi f* = epi (cl f)* = epi (con f)* = epi (cl con f)*
⇒ f* = (cl f)* = (con f)* = (cl con f)*.

For the inequality: we have

f**(y) = sup_v {⟨v, y⟩ − f*(v)} = sup_v {⟨v, y⟩ − sup_x {⟨v, x⟩ − f(x)}}
       = sup_v {⟨v, y⟩ + inf_x {f(x) − ⟨v, x⟩}}
       ≤ sup_v {⟨v, y⟩ + f(y) − ⟨v, y⟩}   (set x = y)
       = f(y).

Second statement: Assume con f is proper. We claim that cl con f is proper, lsc and
convex. Lower semicontinuity is clear, convexity follows from Prop. 6.4. For the properness,
observe that cl con f ≤ con f (because the epigraph increases). Because con f < +∞ this
shows cl con f < +∞. It remains to show that cl con f(x) > −∞ always (example sheets).
From the claim we know that cl con f is proper, lsc, convex and can apply Thm. 6.7:

cl con f(x) = sup_{g affine, g≤cl con f} g(x)
            = sup_{(v,β)∈epi (cl con f)*} {⟨v, x⟩ − β}
            = sup_{(v,β)∈epi f*} {⟨v, x⟩ − β}
            = sup_{(v,β)∈epi f*} {⟨v, x⟩ − f*(v)}
            = sup_{v∈Rn} {⟨v, x⟩ − f*(v)}
            = f**(x).

To show that f* is proper, lsc and convex: we know that f* = (cl con f)* is the conjugate
of a proper lsc convex function. epi f* is closed and convex as the intersection of closed
convex sets (see (6.1), the sets run over x) ⇒ f* is lsc and convex. Properness: con f is proper
⇒ there is at least one x′ such that con f(x′) < +∞, thus f*(v) = (con f)*(v) ≥ ⟨v, x′⟩ − con f(x′),
i.e., f* is lower-bounded by an affine function and can therefore never take the
value −∞. Therefore the only way for f* not to be proper is f* ≡ +∞. But then by the
previous arguments

cl con f = f** = (+∞)* = −∞.

This is not possible because we showed that cl con f is proper (see above, [RW04, 2.32]).
Last statement: f is convex, therefore con f = f. Thus con f is proper. Also con f = f
is lsc, therefore f** = cl con f = cl f = f. □
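The statement f** = cl con f can be visualized numerically with the grid transform. A sketch for the nonconvex function f(x) = min(|x − 1|, |x + 1|), whose convex hull vanishes on [−1, 1] (the grid and tolerances are our own illustrative choices, and the approximation is only reliable away from the grid boundary):

```python
# Approximate f** on a grid and compare with cl con f.
xs = [i / 100 for i in range(-400, 401)]

def conj(f, v):
    return max(v * x - f(x) for x in xs)

f = lambda x: min(abs(x - 1.0), abs(x + 1.0))
fstar = {v: conj(f, v) for v in xs}                  # grid f*
fss = lambda x: max(v * x - fstar[v] for v in xs)    # grid f**

assert abs(fss(0.0)) < 1e-2        # con f is 0 on [-1, 1], but f(0) = 1
assert abs(fss(2.0) - 1.0) < 1e-2  # f** agrees with f outside [-1, 1]
```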

The condition that con f is proper does not appear to be easy to validate at first. It
turns out that we can just compute f*, and if we obtain neither +∞ nor −∞ then con f
must be proper, as the following proposition shows:

Proposition 6.11. Assume f: Rn → R̄. Then
1. con f is not proper ⇒ f* ≡ +∞ or f* ≡ −∞.
2. in particular: f* proper ⇒ con f proper.

Proof. The first statement follows in the same way as in the proof of the second statement
in Thm. 6.10: if con f is not proper then con f ≡ +∞ or there is x′ s.t. con f(x′) = −∞. If
con f ≡ +∞ then

f*(v) = (con f)*(v) = sup_x {⟨v, x⟩ − con f(x)} = sup_x {−∞} = −∞.

If there is x′ s.t. con f(x′) = −∞ then

f*(v) = (con f)*(v) = sup_x {⟨v, x⟩ − con f(x)} ≥ ⟨v, x′⟩ − con f(x′) = +∞.

The second statement is the contrapositive: if f* is proper then f* is neither ≡ +∞ nor
≡ −∞, so by 1. con f must be proper. □

6.2 Duality Correspondences

The following beautifully symmetric theorem is at the core of many of the later proofs.

Theorem 6.12. (inversion rule for subdifferentials) Assume f: Rn → R̄ is proper, lsc, convex.
Then

∂f* = (∂f)⁻¹,

specifically

v ∈ ∂f(x) ⇔ f(x) + f*(v) = ⟨v, x⟩ ⇔ x ∈ ∂f*(v).

Moreover,

∂f(x) = arg max_{v′} {⟨v′, x⟩ − f*(v′)},
∂f*(v) = arg max_{x′} {⟨v, x′⟩ − f(x′)}.

Proof. [RW04, 11.3] We have

f(x) + f*(v) = ⟨v, x⟩
⇔ f*(v) = ⟨v, x⟩ − f(x)
⇔ x ∈ arg max_{x′} {⟨v, x′⟩ − f(x′)}   (6.2)
⇔ 0 ∈ ∂(−⟨v, ·⟩ + f)(x)   (Thm. 5.7)
⇔ v ∈ ∂f(x)   (Prop. 5.6).

This shows the second statement (apply the equivalences to f* and use f** = f from
Thm. 6.10; f is proper, lsc, convex). The first statement is just a reformulation. The third
statement then follows from the middle line of (6.2). □
Thm. 6.12 provides a way to compute the subdifferentials of f and f* by solving an
optimization problem. This is useful in theory, but can be hard in practice. In fact,

∂f*(0) = arg max_x {−f(x)} = arg min_x f(x),

i.e., computing the subdifferential of f* at a single point is generally as hard as finding the
whole set of minimizers of f.

Proposition 6.13. (duality correspondences) For proper, lsc, convex f: Rn → R̄ we have

(f(·) − ⟨a, ·⟩)* = f*(· + a),
(f(· + b))* = f*(·) − ⟨·, b⟩,
(f(·) + c)* = f*(·) − c,
(λ f(·))* = λ f*(·/λ)   (λ > 0),
(λ f(·/λ))* = λ f*(·)   (λ > 0).

Proof. Follows directly from the definition. □
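These identities are easy to spot-check with the grid transform from above. A sketch for the shift rule (f(x) = x²/2, b = 0.7 are arbitrary illustrative choices; the grid must be large enough that the maximizers stay inside it):

```python
# Check (f(. + b))* = f*(.) - <., b> on a grid.
xs = [i / 100 for i in range(-800, 801)]

def conj(f, v):
    return max(v * x - f(x) for x in xs)

f = lambda x: 0.5 * x * x
b = 0.7
g = lambda x: f(x + b)          # shifted function
for v in (-1.0, 0.0, 0.5, 2.0):
    assert abs(conj(g, v) - (conj(f, v) - v * b)) < 1e-3
```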

Proposition 6.14. (conjugation in product spaces) Let fi: R^{ni} → R̄, i = 1, …, m be proper
and

f(x1, …, xm) := f1(x1) + ⋯ + fm(xm).

Then

f*(v1, …, vm) = f1*(v1) + ⋯ + fm*(vm).

Proof. Follows directly from the definition. □

Definition 6.15. (support functions) For any set S ⊆ Rn we define the “support function”

σ_S(v) := sup_{x∈S} ⟨v, x⟩ = (δ_S)*(v).

Definition 6.16. (positively homogeneous functions) A function f: Rn → R̄ is said to
be “positively homogeneous” iff 0 ∈ dom f and

f(λ x) = λ f(x)   ∀x ∈ Rn, ∀λ > 0.

Proposition 6.17. (support functions, polar cones) The set of positively homogeneous
proper lsc convex functions and the set of closed convex nonempty sets are in one-to-one
correspondence through the Legendre-Fenchel transform:

δ_C ↔ σ_C,   x ∈ ∂σ_C(v) ⇔ x ∈ C, σ_C(v) = ⟨v, x⟩ ⇔ v ∈ N_C(x).

In particular, the set of closed convex cones is in one-to-one correspondence with itself: for
any cone K define the “polar cone” (also sometimes referred to as the “dual cone”) as

K* := {v ∈ Rn | ⟨v, x⟩ ≤ 0 ∀x ∈ K}.

Then

δ_K ↔ δ_{K*},   x ∈ N_{K*}(v) ⇔ v ∈ N_K(x) ⇔ x ∈ K, v ∈ K*, ⟨x, v⟩ = 0
⇔ 0 ≤_K x ⊥ v ≥_{K*} 0.
Proof. (example sheet) Support functions: If f(x) = δ_C(x) then

f*(v) = σ_C(v).

C nonempty closed convex ⇒ δ_C proper lsc convex ⇒ by Thm. 6.10, f* is proper lsc
convex. Also f*(0) = sup_{x∈C} ⟨0, x⟩ = 0 and

f*(λ v) = sup_{x∈C} ⟨λ v, x⟩ = λ sup_{x∈C} ⟨v, x⟩ = λ f*(v),   λ > 0,

therefore f* = σ_C is positively homogeneous.
On the other hand, assume that g is a positively homogeneous proper lsc convex function.
We know that g(0) = 0, therefore g*(x) ≥ 0. Then for any λ > 0 we have

g*(x) = sup_{v∈Rn} {⟨x, λ v⟩ − g(λ v)} = sup_{v∈Rn} λ (⟨x, v⟩ − g(v)) = λ g*(x)

using positive homogeneity. Thus g*(x) ∈ {0, +∞}, and g* is an indicator function of some
set C, which must be closed, convex and nonempty because g* is proper lsc convex.
One-to-one correspondence: from (δ_C)** = δ_C.
From Thm. 6.12 we get

x ∈ ∂σ_C(v) ⇔ v ∈ ∂(σ_C)*(x) ⇔ v ∈ ∂δ_C(x) ⇔ v ∈ N_C(x)   (Prop. 5.9),
x ∈ ∂σ_C(v) ⇔ δ_C(x) + σ_C(v) = ⟨v, x⟩ ⇔ x ∈ C, σ_C(v) = ⟨v, x⟩.

Cones: We claim: g is a positively homogeneous lsc convex proper indicator function ⇔
g = δ_K with K a closed convex cone. For λ > 0 we have x ∈ K ⇔ λ x ∈ K, therefore

δ_K(λ x) = 0 if λ x ∈ K, +∞ otherwise
         = 0 if x ∈ K, +∞ otherwise
         = δ_K(x) = λ δ_K(x),

using δ_K(x) ∈ {0, +∞} in the last step. Thus δ_K is positively homogeneous. Other direction:
let g = δ_C be a positively homogeneous lsc convex indicator function; then 0 ∈ dom g, thus
0 ∈ C, and x ∈ C ⇒ g(x) = 0 ⇒ g(λ x) = λ g(x) = 0 (by positive homogeneity) ⇒ λ x ∈ C,
thus C is a cone.
Take any such positively homogeneous proper lsc convex indicator function δ_K. By the
first part (forward direction), (δ_K)* = σ_K is positively homogeneous proper lsc convex.
Since δ_K itself is positively homogeneous proper lsc convex, by the first part (backward
direction) its conjugate (δ_K)* is also an indicator function.
⇒ (δ_K)* = δ_{K′} for some closed convex cone K′. We have

(δ_K)*(v) < +∞ ⇔ sup_{x∈K} ⟨v, x⟩ < +∞
               ⇔ sup_{x∈K} ⟨v, x⟩ ≤ 0   (K cone)
               ⇔ v ∈ {v′ | ⟨v′, x⟩ ≤ 0 ∀x ∈ K} = K*,

thus K′ = K* and (δ_K)* = δ_{K*}. □
Chapter 7
Duality in Optimization

Definition 7.1. (primal and dual optimization problems, perturbation formulation)
Assume f: R^n × R^m → R̄ is proper, lsc, and convex. We define the “primal” and “dual”
problems

    inf_{x∈R^n} ϕ(x), ϕ(x) := f(x, 0),        sup_{y∈R^m} ψ(y), ψ(y) := −f*(0, y),

and the “inf-projections”

    p(u) := inf_x f(x, u),    q(v) := inf_y f*(v, y) = −sup_y {−f*(v, y)}.

f is sometimes called a “perturbation function” for ϕ, and p the associated “marginal func-
tion”.
A typical example is f(x, u) = (1/2)‖x − I‖₂² + δ_{≥0}(Ax − b + u). The extra variables u are
used to perturb the constraints Ax ≥ b to Ax ≥ b − u.

Proposition 7.2. Assume f satisfies the assumptions in Def. 7.1. Then

1. ϕ and −ψ are lsc and convex.
2. p, q are convex.
3. p(0) and p**(0) are the optimal values of the primal and dual problems:

       p(0) = inf_x ϕ(x),    p**(0) = sup_y ψ(y).

4. The primal and dual problems are feasible iff the domain of their associated marginal
   function contains 0:

       inf_x ϕ(x) < +∞ ⇔ 0 ∈ dom p,
       sup_y ψ(y) > −∞ ⇔ 0 ∈ dom q.

Proof.
1. f proper lsc convex ⇒ f* is proper lsc convex (Thm. 6.10; f is proper) ⇒ ϕ and
   −ψ are lsc and convex (not necessarily proper!).
2. Convexity of p: We consider the strict epigraph set of p:

       E := {(u, α) ∈ R^m × R | p(u) = inf_{x∈R^n} f(x, u) < α}
          = {(u, α) ∈ R^m × R | ∃x: f(x, u) < α}
          = A({(x, u, α) ∈ R^n × R^m × R | f(x, u) < α}) = A(E′),

   where A is the linear (coordinate projection) mapping A(x, u, α) := (u, α) and E′
   denotes the set in the previous line. The first equality requires the strict
   inequality, i.e., we cannot just use epi p for the argument. E′ is the strict
   epigraph of f and thus convex (Prop. 3.7) ⇒ A(E′) is convex (Prop. 3.15). Thus p
   is convex (Prop. 3.7).
   The same argument applied to q and f* shows that q is convex.
3. First part by definition:

       p(0) = inf_x f(x, 0) = inf_x ϕ(x).

   Second part: we know

       p*(y) = sup_u {⟨y, u⟩ − p(u)}
             = sup_u {⟨y, u⟩ − inf_x f(x, u)}
             = sup_{u,x} {⟨y, u⟩ − f(x, u)}
             = sup_{u,x} {⟨(0, y), (x, u)⟩ − f(x, u)}
             = f*(0, y) = −ψ(y).

   Therefore

       p**(0) = sup_y {⟨0, y⟩ − p*(y)} = sup_y ψ(y).

4. From the definitions: 0 ∈ dom p ⇔ p(0) < +∞ ⇔ inf_x f(x, 0) < +∞ ⇔ inf ϕ < +∞;
   similarly for q.                                                                  □
Theorem 7.3. (weak and strong duality) Assume f satisfies the assumptions in Def. 7.1.
Then “weak duality” always holds:

    inf_x ϕ(x) ≥ sup_y ψ(y),                                                    (7.1)

and under certain conditions the infimum and supremum are equal and finite (“strong
duality”):

    p(0) ∈ R and p lsc at 0  ⇔  inf_x ϕ(x) = sup_y ψ(y) ∈ R.

The difference inf ϕ − sup ψ is the “duality gap”.

Proof. From [ET99, Prop. 2.1]: For the inequality, by Thm. 6.10 and Prop. 7.2,

    inf_x ϕ(x) = p(0) ≥ p**(0) = sup_y ψ(y).

Equality holds if and only if p(0) = p**(0). We thus have to show

    p(0) ∈ R and p lsc at 0  ⇔  p(0) = p**(0) ∈ R.

“⇐”: Since p**(0) ≤ cl p(0) ≤ p(0) holds for arbitrary p, the right-hand side implies
lim inf_{y→0} p(y) = cl p(0) = p(0) ∈ R, thus p is lsc at 0.
“⇒”: We claim that if the left-hand side holds then cl p is proper lsc convex. Convexity
and lower semi-continuity are clear from Prop. 7.2 and the definition of the closure. cl p must
then also be proper: cl p is not constant +∞ because cl p(0) ≤ p(0) < +∞. If there were
y s.t. cl p(y) = −∞ then (cl p convex) cl p((1 − t)·0 + t·y) ≤ (1 − t) cl p(0) + t cl p(y) = −∞
for all t ∈ (0, 1). Since by assumption cl p(0) = p(0) ∈ R, this means cl p(t·y) = −∞
for all t ∈ (0, 1). Moreover, t·y → 0 for t → 0 and cl p is lsc (in particular at 0), thus
cl p(0) ≤ lim inf_{t→0} cl p(t·y) = −∞. But this would mean p(0) = −∞, because p lsc at 0
implies p(0) = cl p(0), contradicting p(0) ∈ R. Thus cl p must be proper, lsc, and convex.
An alternative way of proving that cl p is proper is to use the fact that any improper
convex lsc function is constant +∞ or −∞ (example sheets), which contradicts p(0) ∈ R.
Because p* = (cl p)* always holds (Thm. 6.10),

    p**(0) = ((cl p)*)*(0) = (cl p)**(0) = cl p(0) = p(0),

where the middle equality uses Thm. 6.10 (cl p proper lsc convex) and the last uses that p is lsc at 0.
Together with the first part this shows inf ϕ = sup ψ = p(0), which is finite by assumption. □

Proposition 7.4. (primal-dual optimality conditions) Assume f satisfies the assumptions
in Def. 7.1. Then we have the “primal-dual optimality conditions”

    (0, y′) ∈ ∂f(x′, 0)
        ⇔  { x′ ∈ arg min_x ϕ(x),  y′ ∈ arg max_y ψ(y),  inf_x ϕ(x) = sup_y ψ(y) }
        ⇔  (x′, 0) ∈ ∂f*(0, y′).                                                (7.2)

The set of “primal-dual optimal points” (x′, y′) satisfying (7.2) is either empty or equal to
(arg min ϕ) × (arg max ψ).

Proof. We know from Prop. 6.12 that

    (0, y′) ∈ ∂f(x′, 0) ⇔ (x′, 0) ∈ ∂f*(0, y′)
                        ⇔ f(x′, 0) + f*(0, y′) = ⟨x′, 0⟩ + ⟨0, y′⟩
                        ⇔ f(x′, 0) = −f*(0, y′) ∈ R
                        ⇔ ϕ(x′) = ψ(y′) ∈ R.

Because inf ϕ ≥ sup ψ always holds and ϕ(x′) = ψ(y′) shows inf ϕ ≤ sup ψ, this is equivalent to

    inf_x ϕ(x) = sup_y ψ(y) ∈ R,    x′ ∈ arg min ϕ,    y′ ∈ arg max ψ.

This is again equivalent to

    inf_x ϕ(x) = sup_y ψ(y),    x′ ∈ arg min ϕ,    y′ ∈ arg max ψ,

since equality with an infinite value would imply either ϕ(x′) = +∞ or ψ(y′) = −∞, both
of which are explicitly excluded through the definition of the arg min.
If the set of (x′, y′) that satisfy the conditions is non-empty, then inf ϕ = sup ψ must
hold with a finite value as seen above. Thus (x′, y′) satisfy the conditions iff x′ ∈ arg min ϕ
and y′ ∈ arg max ψ, which proves the last statement. □
Proposition 7.5. (sufficient conditions for strong duality) Assume f satisfies the assump-
tions in Def. 7.1. Then

    0 ∈ int dom p or 0 ∈ int dom q    ⇒  inf_x ϕ(x) = sup_y ψ(y),              (S′)
    0 ∈ int dom p and 0 ∈ int dom q   ⇒  inf_x ϕ(x) = sup_y ψ(y) ∈ R           (S)

(note that in the first case equality may hold with the value +∞ or −∞) and

    0 ∈ int dom p and inf_x ϕ(x) ∈ R  ⇔  arg max_y ψ(y) nonempty and bounded    (P),
    0 ∈ int dom q and sup_y ψ(y) ∈ R  ⇔  arg min_x ϕ(x) nonempty and bounded    (D).

In particular, if any of the conditions (S), (P), (D) holds, then strong duality holds, i.e.,
inf ϕ = sup ψ ∈ R. Moreover, if (S) holds, or (P) and (D) both hold, then there exist x′,
y′ satisfying the primal-dual optimality conditions (7.2).
Also,

    (P) ⇒ ∂p(0) = arg max_y ψ(y),
    (D) ⇒ ∂q(0) = arg min_x ϕ(x).

Proof. Assume 0 ∈ int dom p. If p(0) = −∞ then p**(0) ≤ p(0) = −∞, thus sup ψ = p**(0) =
p(0) = inf ϕ = −∞. The only other possibility is p(0) ∈ R since 0 ∈ int dom p. Since cl p = p
on int dom p (example sheets) we know that if 0 ∈ int dom p then p is lsc at 0, and we can
apply Thm. 7.3 to get p**(0) = p(0) ∈ R, which shows inf ϕ = sup ψ (now with a finite value).
We can apply the same argument to f′(x, y) := f*(y, x); the approach is completely
symmetric:

    ϕ′(x) = f′(x, 0) = f*(0, x) = −ψ(x),
    ψ′(y) = −f′*(0, y) = −f(y, 0) = −ϕ(y)    (f proper, lsc, convex),
    p′(u) = inf_x f′(x, u) = inf_x f*(u, x) = q(u),
    q′(v) = inf_y f′*(v, y) = inf_y f(y, v) = p(v),
    inf ϕ′ = −sup ψ,
    sup ψ′ = −inf ϕ.

Then 0 ∈ int dom q ⇒ 0 ∈ int dom p′. From the first part we know that then inf ϕ′ = sup ψ′,
but this means −sup ψ = −inf ϕ, thus inf ϕ = sup ψ.
If both 0 ∈ int dom p and 0 ∈ int dom q hold, then additionally +∞ > p(0) ≥ p**(0) = sup ψ =
−q(0) > −∞, thus the value is finite.
Non-emptiness and boundedness: See [RW04, Thm. 11.39 proof]; the idea is to show
that 0 ∈ int dom p if and only if ψ is proper (lsc convex) and level-bounded.
Subdifferential: If (P) holds then 0 ∈ int dom p and p(0) ∈ R. Then we know that
cl p(0) = p(0) ∈ R. cl p is then proper (and lsc convex) by the proof of Thm. 7.3, thus by
Thm. 6.12

    ∂(cl p)(0) = arg max_y {⟨0, y⟩ − (cl p)*(y)}
               = arg max {−(cl p)*}
               = arg max {−p*}    (Thm. 6.10)
               = arg max ψ.

But

    ∂p(0) = {v | p(0) + ⟨v, x⟩ ≤ p(x) ∀x}
          = {v | cl p(0) + ⟨v, x⟩ ≤ p(x) ∀x}    (cl p(0) = p(0))
          = {v | cl p(0) + ⟨v, x⟩ ≤ cl p(x) ∀x}
          = ∂(cl p)(0).

The second-to-last equality follows because the affine functions majorized by p are
exactly the affine functions majorized by cl p. Together we get ∂p(0) = arg max_y ψ(y).
Using duality, a similar argument shows the corresponding statement for q. □

Proposition 7.6. Assume k: R^n → R̄ and h: R^m → R̄ are both proper, lsc, convex, and
A ∈ R^{m×n}, b ∈ R^m, c ∈ R^n. For

    f(x, u) := ⟨c, x⟩ + k(x) + h(Ax − b + u)

the primal and dual problems are of the form

    inf_x ϕ(x),    ϕ(x) := ⟨c, x⟩ + k(x) + h(Ax − b),
    sup_y ψ(y),    ψ(y) := −⟨b, y⟩ − h*(y) − k*(−A⊤y − c)

with

    int dom p = int(dom h − A dom k) + b,
    int dom q = int(dom k* − (−A⊤) dom h*) + c,

and the optimality conditions

    { x′ ∈ arg min_x ϕ(x),  y′ ∈ arg max_y ψ(y),  inf_x ϕ(x) = sup_y ψ(y) }
        ⇔  { −A⊤y′ − c ∈ ∂k(x′),  y′ ∈ ∂h(Ax′ − b) }
        ⇔  { Ax′ − b ∈ ∂h*(y′),  x′ ∈ ∂k*(−A⊤y′ − c) }.

Proof. f is convex (Prop. 3.14: sums of convex functions and pre-compositions x ↦ f(Ax + b)
with affine mappings are convex) and lsc (Prop. 2.16: sums of proper lsc functions are lsc,
and the composition of an lsc function with a continuous function is lsc). f is also proper:
f(x, u) = −∞ would imply that k or h is not proper, and for any x ∈ dom k ≠ ∅ there exists
u s.t. Ax − b + u ∈ dom h, because dom h ≠ ∅ and u is otherwise unconstrained.
With this choice we have ϕ(x) = f(x, 0) and (this is an important trick!)

    f*(v, y) = sup_{x,u} {⟨x, v⟩ + ⟨u, y⟩ − ⟨c, x⟩ − k(x) − h(Ax − b + u)}
             = sup_{x,w} {⟨x, v⟩ + ⟨w − Ax + b, y⟩ − ⟨c, x⟩ − k(x) − h(w)}    (w = Ax − b + u)
             = sup_{x,w} {⟨x, −A⊤y + v − c⟩ − k(x) + ⟨w, y⟩ − h(w) + ⟨b, y⟩}
             = ⟨b, y⟩ + sup_x {⟨x, −A⊤y + v − c⟩ − k(x)} + sup_w {⟨w, y⟩ − h(w)}
             = ⟨b, y⟩ + k*(−A⊤y + v − c) + h*(y).

Thus ψ(y) = −f*(0, y) (the case v = 0). Because f is proper, lsc, convex we can apply
Prop. 7.4 and Prop. 7.5. We have

    u ∈ dom p ⇔ inf_x f(x, u) < +∞ ⇔ inf_x {⟨c, x⟩ + k(x) + h(Ax − b + u)} < +∞
              ⇔ ∃x: ⟨c, x⟩ + k(x) + h(Ax − b + u) < +∞
              ⇔ ∃x ∈ dom k: h(Ax − b + u) < +∞
              ⇔ ∃x ∈ dom k: Ax − b + u ∈ dom h
              ⇔ ∃x ∈ dom k: u ∈ dom h − Ax + b
              ⇔ u ∈ dom h − A dom k + b.

Thus

    0 ∈ int dom p ⇔ 0 ∈ int(dom h − A dom k + b) ⇔ b ∈ int(A dom k − dom h).

Similarly for int dom q.
Subdifferential of f: we have

    f(x, u) = g(x, u + Ax)    with    g(x, w) := ⟨c, x⟩ + k(x) + h(w − b).

Then, by the definition of the subdifferential (separable sum ⇒ product of subdifferentials),

    ∂g(x, w) = (c + ∂k(x)) × ∂h(w − b).

Also, by Prop. 5.14 (chain rule) with the linear mapping

    F(x, u) := [ I 0 ; A I ] (x, u) = (x, Ax + u)

and f = g ∘ F we get

    ∂f(x, u) = F⊤(∂g(F(x, u)))
             = {(c + v + A⊤y, y) | v ∈ ∂k(x), y ∈ ∂h(Ax + u − b)}.

Therefore

    (0, y′) ∈ ∂f(x′, 0) ⇔ ∃v, y: 0 = c + v + A⊤y,  y′ = y,  v ∈ ∂k(x′),  y ∈ ∂h(Ax′ − b + 0)
                        ⇔ −A⊤y′ − c ∈ ∂k(x′)  and  y′ ∈ ∂h(Ax′ − b).

Similarly for f*. □

Example 7.7. (conic problems) Assume that K, L are pointed closed convex cones with
polar cones K*, L* and consider the problem

    inf_x ⟨c, x⟩  s.t.  Ax − b ≥_{L*} 0,  x ≥_K 0,

in alternative notation

    inf_x ⟨c, x⟩ + δ_K(x) + δ_{L*}(Ax − b).

By Prop. 7.6 (with k = δ_K and h = δ_{L*}) the dual problem is

    sup_y ψ(y),    ψ(y) := −⟨b, y⟩ − h*(y) − k*(−A⊤y − c),

which can be rewritten as

    sup_y −⟨b, y⟩ − (δ_{L*})*(y) − (δ_K)*(−A⊤y − c)
        = sup_y −⟨b, y⟩ − δ_L(y) − δ_{K*}(−A⊤y − c).

Thus the dual problem is

    sup_y −⟨b, y⟩  s.t.  −A⊤y − c ≥_{K*} 0,  y ≥_L 0.

For self-dual cones with K* = −K, such as K = R^n_{≥0} (and the same for L), we obtain

    inf_x ⟨c, x⟩   s.t.  Ax ≤_L b,  x ≥_K 0,
    sup_y −⟨b, y⟩  s.t.  −A⊤y ≤_K c,  y ≥_L 0.

One important special case is Linear Programming duality, where K = R^n_{≥0} and
L* = {0}, i.e., L = R^m:

    inf_x ⟨c, x⟩   s.t.  Ax = b,  x ≥ 0,
    sup_y −⟨b, y⟩  s.t.  −A⊤y ≤ c.

Note that K = −K*, but the dual constraint on y disappears because L = R^m.
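The Linear Programming duality above can be verified numerically. The sketch below uses a tiny instance chosen purely for illustration (it does not appear in the notes); the primal optimum x′ = (1, 0) and dual optimum y′ = −1 were computed by hand, and weak duality is additionally checked on randomly sampled feasible pairs.

```python
import numpy as np

# Toy LP: minimize <c, x>  s.t.  A x = b, x >= 0.
# Hand-computed optima: x' = (1, 0) with value 1; the dual
# sup -<b, y> s.t. -A^T y <= c is solved by y' = -1, also with value 1.
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

x_opt = np.array([1.0, 0.0])
y_opt = np.array([-1.0])

# Primal feasibility: A x = b, x >= 0.
assert np.allclose(A @ x_opt, b) and np.all(x_opt >= 0)
# Dual feasibility: -A^T y <= c.
assert np.all(-A.T @ y_opt <= c + 1e-12)

primal_value = c @ x_opt
dual_value = -b @ y_opt
# Strong duality: the optimal values coincide (zero duality gap).
assert np.isclose(primal_value, dual_value)

# Weak duality holds for *any* feasible pair: sample random feasible points.
rng = np.random.default_rng(0)
for _ in range(100):
    t = rng.uniform(0, 1)
    x = np.array([t, 1 - t])            # feasible: x >= 0, x1 + x2 = 1
    y = np.array([rng.uniform(-1, 0)])  # feasible: -y <= 1 and -y <= 2
    assert c @ x >= -b @ y - 1e-12      # <c, x> >= -<b, y>
```

The same pattern extends to any conic pair: feasibility of (x, y) alone already certifies the value ordering, while equality of the two objectives certifies optimality.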

Proposition 7.8. (Lagrangian) Assume f: R^n × R^m → R̄ is proper, lsc, convex. We define
the associated Lagrangian as

    l(x, y) := −f(x, ·)*(y),

i.e.,

    l(x, y) = inf_u {f(x, u) − ⟨y, u⟩}.

Then l(·, y) is convex for every y, −l(x, ·) is lsc and convex for every x, and

    f(x, ·) = (−l(x, ·))*,
    (v, y) ∈ ∂f(x, u)  ⇔  v ∈ ∂_x l(x, y) and u ∈ ∂_y(−l)(x, y).

Proof. Denote g(x, y, u) := f(x, u) − ⟨y, u⟩. f is proper lsc convex ⇒ g is proper lsc
convex. Thus

    l(·, y) = inf_u g(·, y, u)

is convex (but does not have to be proper for a specific y), see the proof of Prop. 7.2. Define
f_x := f(x, ·); then −l(x, ·) = (f_x)* and f_x is either ≡ +∞ or proper lsc convex, therefore
−l(x, ·) is either ≡ −∞ or proper lsc convex by Thm. 6.10, but in any case lsc convex as claimed.
We have, by definition of the subgradient,

    (v, y) ∈ ∂f(x, u) ⇔ f(x′, u′) ≥ f(x, u) + ⟨v, x′ − x⟩ + ⟨y, u′ − u⟩  ∀x′, u′
                      ⇔ f(x′, u′) − ⟨y, u′⟩ ≥ f(x, u) + ⟨v, x′ − x⟩ − ⟨y, u⟩  ∀x′, u′
                      ⇔ inf_{u′} {f(x′, u′) − ⟨y, u′⟩} ≥ f(x, u) − ⟨y, u⟩ + ⟨v, x′ − x⟩  ∀x′.    (7.3)

Setting x′ = x, the last line implies

    inf_{u′} {f(x, u′) − ⟨y, u′⟩} ≥ f(x, u) − ⟨y, u⟩,

which is again equivalent to

    inf_{u′} {f(x, u′) − ⟨y, u′⟩} = f(x, u) − ⟨y, u⟩,

because “≤” always holds for the infimum (set u′ = u). Thus we can continue (7.3) via

    (7.3) ⇔ { inf_{u′} {f(x′, u′) − ⟨y, u′⟩} ≥ f(x, u) − ⟨y, u⟩ + ⟨v, x′ − x⟩  ∀x′,
              inf_{u′} {f(x, u′) − ⟨y, u′⟩} = f(x, u) − ⟨y, u⟩ }
          ⇔ { inf_{u′} {f(x′, u′) − ⟨y, u′⟩} ≥ inf_{u′} {f(x, u′) − ⟨y, u′⟩} + ⟨v, x′ − x⟩  ∀x′,
              inf_{u′} {f(x, u′) − ⟨y, u′⟩} = f(x, u) − ⟨y, u⟩ }
          ⇔ { l(x′, y) ≥ l(x, y) + ⟨v, x′ − x⟩  ∀x′,
              f(x, u′) ≥ f(x, u) + ⟨y, u′ − u⟩  ∀u′ }
          ⇔ { y ∈ ∂_u f(x, u),
              v ∈ ∂_x l(x, y) }.

Because f_x is ≡ +∞ or proper lsc convex, by Prop. 6.12 (inversion of subdifferentials under
the Legendre-Fenchel transform) the first condition is equivalent to u ∈ ∂(f_x)*(y), which is the
same as

    u ∈ ∂(−l(x, ·))(y) = ∂_y(−l)(x, y).

Therefore we get

    (7.3)  ⇔  { u ∈ ∂_y(−l)(x, y),  v ∈ ∂_x l(x, y) },

which shows the assertion. □

The Lagrangian opens a particularly nice way to formulate the optimality conditions:

Definition 7.9. (saddle points) For any function l: R^n × R^m → R̄ we say that (x′, y′) is
a saddle point of l iff

    l(x, y′) ≥ l(x′, y′) ≥ l(x′, y)    ∀x, y.

The set of all saddle points is denoted by sp l.

Remark 7.10. The condition (x′, y′) ∈ sp l is equivalent to

    inf_x l(x, y′) = l(x′, y′) = sup_y l(x′, y),

i.e., x′ minimizes l(·, y′) and y′ maximizes l(x′, ·).
The direction “⇐” is clear; “⇒” follows since

    inf_x l(x, y′) ≤ l(x′, y′)

always holds (set x = x′); together with the saddle-point condition we get equality (similarly
for the supremum).
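The saddle-point inequalities and the equivalent inf/sup formulation are easy to check numerically. The following sketch uses the bilinear toy Lagrangian l(x, y) = x·y on [−1, 1] × [−1, 1] (an illustrative choice of mine, not an example from the notes), whose saddle point is (x′, y′) = (0, 0):

```python
import numpy as np

# Grid check of Def. 7.9 / Rem. 7.10 for l(x, y) = x*y on [-1, 1]^2.
grid = np.linspace(-1.0, 1.0, 201)
xp, yp = 0.0, 0.0  # candidate saddle point

def l(x, y):
    return x * y

# Saddle-point inequalities: l(x, y') >= l(x', y') >= l(x', y) for all x, y.
assert all(l(x, yp) >= l(xp, yp) for x in grid)
assert all(l(xp, yp) >= l(xp, y) for y in grid)

# Equivalent formulation from Rem. 7.10:
#   inf_x l(x, y') = l(x', y') = sup_y l(x', y).
assert np.isclose(min(l(x, yp) for x in grid), l(xp, yp))
assert np.isclose(max(l(xp, y) for y in grid), l(xp, yp))
```

Note that for this l the value inf_x sup_y l = sup_y inf_x l = 0, consistent with the minimax formulation in Prop. 7.11 below: here the order of inf and sup can be exchanged.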

Proposition 7.11. Assume f is proper, lsc, convex with associated Lagrangian l. Then

    ϕ(x) = sup_y l(x, y),
    ψ(y) = inf_x l(x, y),

and the primal and dual problems can be written as

    inf_x ϕ(x) = inf_x sup_y l(x, y),
    sup_y ψ(y) = sup_y inf_x l(x, y).

Moreover, we have the optimality condition

    { x′ ∈ arg min_x ϕ(x),  y′ ∈ arg max_y ψ(y),  inf_x ϕ(x) = sup_y ψ(y) }
        ⇔  (x′, y′) ∈ sp l
        ⇔  { 0 ∈ ∂_x l(x′, y′),  0 ∈ ∂_y(−l)(x′, y′) }.

Proof. For fixed y (without any assumptions),

    inf_x l(x, y) = inf_x {−sup_u {⟨y, u⟩ − f(x, u)}}
                  = −sup_{x,u} {⟨(x, u), (0, y)⟩ − f(x, u)}
                  = −f*(0, y)
                  = ψ(y).

For fixed x,

    sup_y l(x, y) = sup_y {−(f_x)*(y)}
                  = sup_y {⟨0, y⟩ − (f_x)*(y)}
                  = (f_x)**(0)
                  = f_x(0)
                  = f(x, 0)
                  = ϕ(x).

The equality (f_x)** = f_x holds because f_x is either proper lsc convex or ≡ +∞, in which case
(f_x)* ≡ −∞ and (f_x)** ≡ +∞.
By Prop. 7.4 the optimality condition is equivalent to (0, y′) ∈ ∂f(x′, 0), which by
Prop. 7.8 is equivalent to having 0 ∈ ∂_x l(x′, y′) and 0 ∈ ∂_y(−l)(x′, y′). Since l(·, y′) and −l(x′, ·) are
convex, this is exactly the saddle-point condition (x′, y′) ∈ sp l (Def. 7.9), and by Rem. 7.10
and the first part of the proposition equivalent to ϕ(x′) = l(x′, y′) = ψ(y′). □
Note that Prop. 7.11 only gives sufficient conditions for optimality. The following
theorem answers the question of how to construct f from a given Lagrangian such that f
is proper lsc convex and for every minimizer x ′ we can find a dual feasible point y ′ such
that (x ′, y ′) is a saddle point.
Proposition 7.12. Assume X ⊆ R^n and Y ⊆ R^m are nonempty, closed, convex, and

    L: X × Y → R

is a continuous function with L(·, y) convex for every y and −L(x, ·) convex for every x.
Then

    l(x, y) := L(x, y) + δ_X(x) − δ_Y(y),

with the convention +∞ − ∞ = +∞ on the right, is the Lagrangian to

    f(x, u) := sup_y {l(x, y) + ⟨u, y⟩} = (−l(x, ·))*(u).

f is proper, lsc, and convex, i.e., Prop. 7.11 applies, with primal and dual problems

    inf_{x∈X} sup_{y∈Y} L(x, y) = inf_x ϕ(x),    ϕ(x) := δ_X(x) + sup_{y∈Y} L(x, y),      (7.4)
    sup_{y∈Y} inf_{x∈X} L(x, y) = sup_y ψ(y),    ψ(y) := −δ_Y(y) + inf_{x∈X} L(x, y).

Moreover, if X and Y are bounded, then sp l is nonempty and bounded.


Proof. For every x, −l(x, ·) is either −∞ (if x  X) or −L(x, ·) + δY (·) which is proper
(L finite, Y  ∅), lsc (L continuous, Y closed) and convex (L convex, Y convex). Either
way we have
−l(x, ·) = (−l(x, ·))∗∗ = f (x, ·)∗
as required for l to be the Lagrangian to f (Prop. 7.8). We have f (x, u) the supremum of
convex functions (ranging over y) and therefore convex.

For the lower-semicontinuity, consider the mapping g y: (x, u) L(x, u) + hu, y i. For
every y ∈ Y , g y is lower semi-continuous. The pointwise supremum of an arbitrary family
of lsc functions is again lsc; this follows because their epigraphs are closed and therefore
their intersection is also closed. This shows that the mapping
(x, u)  sup {L(x, y) + hu, yi} = sup {L(x, y) + hu, y i − δY (y)}
y ∈Y y
50 Duality in Optimization

is lsc, and therefore f (x, u) = supy {L(x, y) + hu, y i − δY (y)} + δX (x) is lsc as well (the last
step requires X to be closed and makes use of the convention +∞ − ∞ = 0).
Properness: if f (x, u) ≡ +∞ then l(x, ·) = −(f (x, ·))∗ ≡ +∞ for all x, which is not
possible because it would mean X = ∅. For any fixed x, f (x, ·) is either +∞ or x ∈ X, in
which case −l(x, ·) = −L(x, ·) + δY . This is proper lsc convex (Y  ∅). Thus f (x, ·) must
be proper lsc convex (Thm. 6.10), which means it cannot assume the value −∞. Thus
f (x, u)  −∞ always; together f is jointly proper. All in all, the conditions in Prop. 7.11
are fulfilled.
The relations in (7.4) follow directly from the definition of l if one takes some caution
to respect the convention +∞ − ∞ = +∞:
ψ(y) = inf {L(x, y) + δX (x) − δY (y)}
x 
= inf L(x, y) − δY (y).
x∈X

The inf on the left side is always finite because X  ∅ and L is finite. Also
ϕ(x) = sup {L(x, y) + δX (x) − δY (y)}
y
+∞, x X,
=
sup yL(x, y) − δY (y), x ∈ X
 
= sup L(x, y) + δX (x).
y∈Y

To show that the set of saddle points is nonempty, consider


p(u) = inf sup {L(x, y) + hu, y i} 6 sup sup {L(x, y) + hu, y i}
x∈X y ∈Y x∈X y∈Y

(note that the inequality requires X , Y  ∅). X and Y are compact, therefore the right side
is bounded and we get dom p = Rm; similarly dom q = Rn. In particular 0 ∈ int dom p and
0 ∈ int dom q, and from Prop. 7.5 we obtain that the optimality conditions have a solution
with a finite value. By Prop. 7.11 this implies
sp l = (argmin ϕ) × (arg max ψ)  ∅.
Both sets are bounded because X and Y are bounded, therefore sp l is bounded as well. 

Example 7.13. For

    l(x, y) := ⟨c, x⟩ + k(x) − ⟨b, y⟩ − h*(y) + ⟨Ax, y⟩

with k, h proper lsc convex we get

    ϕ(x) = sup_y l(x, y) = ⟨c, x⟩ + k(x) + h(Ax − b),
    ψ(y) = inf_x l(x, y) = −⟨b, y⟩ − h*(y) − sup_x {⟨x, −A⊤y − c⟩ − k(x)}
                         = −⟨b, y⟩ − h*(y) − k*(−A⊤y − c),

and

    f(x, u) = sup_y {l(x, y) + ⟨u, y⟩}
            = sup_y {⟨c, x⟩ + k(x) − ⟨b, y⟩ − h*(y) + ⟨Ax, y⟩ + ⟨u, y⟩}
            = ⟨c, x⟩ + k(x) + h(Ax − b + u).

Example 7.14. (shrinkage) We consider the problem

    inf_x (1/2)‖x − a‖₂² + λ‖x‖₁.                                              (7.5)

We can reformulate the problem in “inf sup” form by rewriting the 1-norm using its dual
norm: with the definition Y := {y | ‖y‖∞ ≤ λ} we have λ‖x‖₁ = sup_{y∈Y} ⟨x, y⟩, thus

    inf_x sup_{y∈Y} (1/2)‖x − a‖₂² + ⟨x, y⟩.

We directly get the dual

    sup_{y∈Y} inf_x (1/2)‖x − a‖₂² + ⟨x, y⟩.

The advantage of this formulation is that the Lagrangian is differentiable in x, so the
(unconstrained!) inner problem can be solved explicitly by setting its gradient to zero, and we
find that it is solved by x = a − y. Substituting this we obtain

    sup_{y∈Y} −(1/2)‖y‖₂² + ⟨a, y⟩ = sup_{y∈Y} −(1/2)‖y − a‖₂² + (1/2)‖a‖₂².

This means that the dual problem is solved by computing the projection of a onto Y,

    y′ = Π_Y(a),

which can be computed explicitly and separately for each component. We can even obtain
a primal solution from y′: we know that x′ minimizes l(·, y′), so

    x′ = arg min_x (1/2)‖x − a‖₂² + ⟨x, y′⟩ = arg min_x (1/2)‖x − a‖₂² + ⟨x, Π_Y(a)⟩.

The minimizer is unique, therefore we obtain the primal solution x′ = a − Π_Y(a). This
operation is known as shrinkage, because it shrinks the value of a towards zero.
Note that recovering the primal solution from the dual in this way is only possible
because of the uniqueness of the primal solution: in the general case, not every minimizer
x′′ of l(·, y′) leads to a saddle point (x′′, y′), as it may still violate the second half of the
saddle-point condition.
The fact that problem (7.5) can be solved explicitly – and therefore exactly – and
includes both the smooth 2-norm as well as the non-smooth 1-norm has made it a very
popular sub-step in methods for solving more complicated problems, see
Chapter 9.
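A minimal numerical sketch of the shrinkage operation (assuming NumPy; the vector a and the value of λ are arbitrary test data):

```python
import numpy as np

def shrinkage(a, lam):
    """Solve min_x 0.5*||x - a||_2^2 + lam*||x||_1 via the dual projection.

    The dual solution is y' = Pi_Y(a), the projection onto
    Y = {y : ||y||_inf <= lam} (componentwise clamping), and the primal
    solution is recovered as x' = a - Pi_Y(a).
    """
    y = np.clip(a, -lam, lam)   # projection onto Y, componentwise
    return a - y                # x' = a - Pi_Y(a)

a = np.array([3.0, -0.5, 0.2, -2.0])
x = shrinkage(a, lam=1.0)

# The same operation in closed form: soft-thresholding of each component.
assert np.allclose(x, np.sign(a) * np.maximum(np.abs(a) - 1.0, 0.0))
```

Components of a with magnitude below λ are set exactly to zero, which is why this operation appears in sparse reconstruction methods.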

Example 7.15. (complementarity conditions) We consider Lagrangians of the form

    l(x, y) := ⟨c, x⟩ + k(x) + ⟨Ax − b, y⟩ − δ_L(y),

with a closed convex cone L and a proper lsc convex function k, with the associated primal
and dual problems

    inf_x ⟨c, x⟩ + k(x)              s.t.  Ax ≥_{L*} b,
    sup_y −⟨b, y⟩ − k*(−A⊤y − c)     s.t.  y ≥_L 0.

Looking at the optimality conditions in terms of the Lagrangian as in Prop. 7.11,

    0 ∈ ∂_x l(x, y)    ⇔  0 ∈ c + ∂k(x) + A⊤y,
    0 ∈ ∂_y(−l)(x, y)  ⇔  0 ∈ −(Ax − b) + N_L(y),

we can apply the second part of Prop. 6.17 to rewrite the last condition via
v ∈ N_L(y) ⇔ v ∈ L*, y ∈ L, ⟨v, y⟩ = 0, and get

    0 ∈ c + A⊤y + ∂k(x),
    0 ≤_L y ⊥ (Ax − b) ≥_{L*} 0.

The orthogonality constraint in the second condition is also known as a complementarity
condition. To see why, set A = I, b = 0, and L = R^n_{≥0}. The condition is then

    0 ≤ y ⊥ x ≤ 0.

Because of the sign constraints, no term of the inner product ⟨y, x⟩ can be positive; therefore
⟨y, x⟩ = 0 is equivalent to y_i x_i = 0 for all i – the variables x_i and y_i are complementary in
the sense that if one of them is nonzero, the other one must be zero (they could still both
be zero, however).
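Complementarity can be observed on a small equality-constrained LP (a hand-solved toy instance, not from the notes): at a primal-dual optimal pair of min ⟨c, x⟩ s.t. Ax = b, x ≥ 0, the primal variable x and the slack s = c + A⊤y are componentwise complementary.

```python
import numpy as np

# Toy LP instance chosen for illustration:
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
x = np.array([1.0, 0.0])    # primal optimal (computed by hand)
y = np.array([-1.0])        # dual optimal (computed by hand)

# Optimality requires -A^T y - c in N_{x>=0}(x); equivalently the slack
# s = c + A^T y satisfies s >= 0 together with complementarity.
s = c + A.T @ y
assert np.all(s >= 0) and np.all(x >= 0)

# Complementarity: x_i and s_i are never both nonzero.
assert np.isclose(x @ s, 0.0)
assert np.all(np.isclose(x * s, 0.0))
```

Here s = (0, 1): the first component of x is active (nonzero) with zero slack, while the second has positive slack and is forced to zero.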
Chapter 8
Numerical Optimality

When designing optimization methods an important question is what stopping criterion
to choose. A very common pitfall is the following:

1. Fix a δ > 0 and x⁰.
2. Iterate for k = 1, 2, …:
   a. compute x^{k+1} from x^k,
   b. stop if |ϕ(x^{k+1}) − ϕ(x^k)| < δ, or if ‖x^{k+1} − x^k‖ < δ,
   c. k ← k + 1.

This approach suffers from all kinds of problems:

• It is very dependent on the scaling of the problem or the data by constant
  factors.
• It stops when the solver is slow, which – without further knowledge about the
  algorithm – does not imply that the iterate is close to the solution. In fact the trivial
  update x^{k+1} ← x^k “converges” after one iteration, but the solution is useless.
• We do not get any information about how close x^k is to the minimizer x′ or how
  close f(x^k) is to f(x′).
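The failure mode is easy to reproduce. The sketch below runs subgradient descent on f(x) = |x| with a deliberately tiny step size (all parameter values are chosen for illustration only); the naive criterion fires immediately, far away from the minimizer x′ = 0:

```python
def naive_subgradient_descent(x0, step, delta, max_iter=100000):
    """Subgradient descent on f(x) = |x| with the flawed stopping rule
    |f(x_{k+1}) - f(x_k)| < delta."""
    f = abs
    x = x0
    for _ in range(max_iter):
        g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)  # a subgradient of |x|
        x_new = x - step * g
        if abs(f(x_new) - f(x)) < delta:   # the problematic criterion
            return x_new
        x = x_new
    return x

# A tiny step size makes successive objective values nearly equal, so the
# method "converges" after one step, nowhere near the true minimizer:
x = naive_subgradient_descent(x0=100.0, step=1e-6, delta=1e-4)
assert abs(x - 100.0) < 1e-3     # stopped right away ...
assert abs(x) > 99.0             # ... far from x' = 0
```

The same run with a sensible step size would of course progress, but the point is that the criterion itself carries no information about distance to optimality.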
Ultimately we would like our method to find a solution within a guaranteed distance of
the optimal solution:

Definition 8.1. For ϕ: R^n → R̄, a point x is an ε-optimal solution if

    ϕ(x) − inf ϕ ≤ ε.

This is the “ideal” stopping criterion – it guarantees that energy-wise x is not much
worse than a true minimizer x′. Unfortunately ϕ(x′) is generally unknown. What can we
do?

8.1 The smooth and the non-smooth case

We first consider the smooth case. Assume that f is convex with a unique minimizer,
and f ∈ C². The usual (local) convergence analysis for iterative methods hinges on two
important assumptions.

1. f is strongly convex: ∇²f(x) ≥ σ₋ I for some σ₋ > 0 and all x, i.e., the eigenvalues
   of the Hessian are uniformly bounded away from zero.
   We get (Taylor)

       f(y) = f(x) + ⟨∇f(x), y − x⟩ + (1/2)(y − x)⊤ ∇²f(z)(y − x)

   for some z = (1 − t)x + t·y, t ∈ [0, 1]. Thus

       f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (σ₋/2)‖y − x‖₂².

   This is a better bound than convexity gives; convexity alone corresponds to σ₋ = 0.
   a. The right-hand side is minimized by ȳ = x − (1/σ₋)∇f(x); substituting this gives

          f(y) ≥ f(x) − (1/(2σ₋))‖∇f(x)‖₂².

      In particular for x = x^k and y = x′ (a minimizer) we get

          f(x^k) − f(x′) ≤ (1/(2σ₋))‖∇f(x^k)‖₂².

      This means that f(x^k) approaches f(x′) as ‖∇f(x^k)‖₂ → 0.
   b. Another consequence is (x′ is optimal!)

          0 ≥ f(x′) − f(x^k) ≥ ⟨∇f(x^k), x′ − x^k⟩ + (σ₋/2)‖x^k − x′‖₂²
            ≥ −‖∇f(x^k)‖₂ ‖x^k − x′‖₂ + (σ₋/2)‖x^k − x′‖₂²,

      thus

          ‖x^k − x′‖₂ ≤ (2/σ₋)‖∇f(x^k)‖₂,

      meaning that also x^k approaches x′ as ‖∇f(x^k)‖₂ → 0.
2. The Hessian of f is uniformly bounded from above: ∇²f(x) ≤ σ₊ I for some σ₊ > 0
   and all x. Then (again using Taylor, and minimizing both sides over y)

       f(x^k) − f(x′) ≥ (1/(2σ₊))‖∇f(x^k)‖₂².

   This gives the reverse statement: ‖∇f(x^k)‖₂ approaches 0 as f(x^k) → f(x′).
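Both bounds can be checked numerically on a simple strongly convex quadratic (an illustrative instance of my own choosing, where σ₋ and σ₊ are the extreme eigenvalues of the constant Hessian):

```python
import numpy as np

# f(x) = 0.5 * x^T Q x with Q = diag(1, 10): sigma- = 1, sigma+ = 10,
# minimizer x' = 0 and f(x') = 0.
Q = np.diag([1.0, 10.0])
sigma_minus, sigma_plus = 1.0, 10.0

def f(x): return 0.5 * x @ Q @ x
def grad(x): return Q @ x

rng = np.random.default_rng(2)
for _ in range(100):
    x = rng.normal(size=2)
    g2 = np.sum(grad(x) ** 2)
    # ||grad||^2/(2*sigma+) <= f(x) - f(x') <= ||grad||^2/(2*sigma-)
    assert g2 / (2 * sigma_plus) <= f(x) + 1e-12
    assert f(x) <= g2 / (2 * sigma_minus) + 1e-12
    # distance bound: ||x - x'|| <= (2/sigma-) * ||grad||
    assert np.linalg.norm(x) <= (2 / sigma_minus) * np.sqrt(g2) + 1e-12
```

For a quadratic these bounds are tight in the eigendirections of Q, which matches the role of κ = σ₊/σ₋ discussed next.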
Together we can upper- and lower-bound f(x^k) − f(x′) using quadratic upper and lower bounds
on f. The convergence speed of gradient-based methods depends heavily on the condition
number

    κ := σ₊/σ₋.

In general convex optimization, problems are usually

• non-differentiable (⇒ σ₊ = +∞) and
• have affine regions (⇒ σ₋ = 0)

(this is most easily visualized using a simple function such as f(x) = |x|). Therefore they
do not have a good “classical” condition number (not even locally; again consider f(x) = |x|).
One problem is that the subgradients usually carry no information about how close x^k
is to x′ (or even just how close f(x^k) is to f(x′)). Another viewpoint is the following:

    Subgradients in convex optimization only provide a lower bound for the function, and
    may be a very bad linear approximation!

Some issues that arise from this:

1. Since f is not differentiable, it can have many subgradients at any given point.
   Evaluating whole subdifferentials is generally very hard. Which subgradient do we
   pick, and how? (The smallest one? That again amounts to solving
   a minimization problem over the set of subgradients. Or a random one? Then we
   have to deal with non-deterministic methods.)
2. Even if x^k → x′ and f(x^k) → f(x′), we can have ∂f(x^k) ≡ {v} with v ≠ 0 until x^k = x′
   holds exactly (in the smooth case, point 2. shows that ‖∇f(x^k)‖₂ → 0). In practice we
   only have finite precision, so this condition cannot be checked exactly.
3. The minimizer can be non-unique, e.g., for f(x) = max{x − 1, −x − 1, 0}. Even if
   f(x^k) → f(x′), this does not say anything about convergence of the sequence x^k. We
   could possibly pass to a convergent subsequence if f is level-bounded, but that
   is only useful in theory rather than in an actual implementation.
4. Even if all conditions hold and σ₊ and σ₋ exist, they are of little value in prac-
   tice, because they are often unknown or can only be estimated very roughly, and
   therefore provide conservative estimates that are useless in practice. This can lead
   to slow convergence or bad estimates if we would like to know how close one specific
   iterate x^k is to the minimizer rather than get an asymptotic convergence rate.

We need a more reliable practical criterion to check optimality.

8.2 The numerical primal-dual gap

Proposition 8.2. (numerical primal-dual gap) Assume (x^k, y^k) is a primal-dual feasible
pair (or “primal-dual feasible”), i.e., x^k ∈ dom ϕ and y^k ∈ dom ψ. Then

    ϕ(x^k) ≥ ψ(y^k)

and

    0 ≤ ϕ(x^k) − inf ϕ ≤ ϕ(x^k) − ψ(y^k) =: γ(x^k, y^k) =: γ.

The quantity γ is the “numerical primal-dual gap”. If γ < ε then x^k is an ε-optimal solution
with “optimality certificate” y^k.

Proof. The first inequality follows directly from weak duality. The second inequality
follows from ϕ(x′) ≥ ψ(y^k), which again holds due to weak duality. □

This means that if for a given xk we can find y k such that ϕ(xk) − ψ(y k) 6 ε we have
proof that xk is an ε-optimal solution, with y k acting as a certificate of optimality! This
is an important concept – solvers can prove that their solution is ε-optimal.
If strong duality holds (i.e., inf ϕ = sup ψ ∈ R), then such (xk , y k) always exist for
arbitrarily small ε > 0, as we can take any minimizing/maximizing pair of sequences for the
primal/dual problems. Existence of a certificate for ε = 0 requires existence of a primal-
dual optimal pair.
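As a concrete sketch, the shrinkage problem (7.5) from Ex. 7.14 yields computable certificates: any dual feasible y bounds the suboptimality of any primal x (the data below are arbitrary test values):

```python
import numpy as np

# Certify near-optimality for  min_x 0.5*||x - a||^2 + lam*||x||_1
# via the gap gamma = phi(x) - psi(y), with y feasible for the dual
# constraint ||y||_inf <= lam.
a = np.array([3.0, -0.5, 0.2, -2.0]); lam = 1.0

def phi(x): return 0.5 * np.sum((x - a) ** 2) + lam * np.sum(np.abs(x))
def psi(y): return a @ y - 0.5 * np.sum(y ** 2)   # dual objective, y feasible

y_opt = np.clip(a, -lam, lam)       # dual optimal: projection onto Y
x_opt = a - y_opt                   # primal optimal (shrinkage)

# At the optimum the gap vanishes: a certificate for epsilon = 0.
assert np.isclose(phi(x_opt) - psi(y_opt), 0.0)

# A slightly perturbed pair is still certified, with a small positive gap:
x = x_opt + 1e-3
y = np.clip(y_opt - 1e-3, -lam, lam)   # keep y dual feasible
gap = phi(x) - psi(y)
assert 0.0 <= gap < 1e-2               # x is (1e-2)-optimal, certified by y
```

Note that the certificate is one-sided: a small gap proves near-optimality, while a large gap may simply mean the dual point is poor.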
The numerical primal-dual gap is translation invariant: if f′(x, u) := f(x, u) + c for
some c ∈ R (thus ϕ′(x) = ϕ(x) + c and ψ′(y) = ψ(y) + c; verify this using ψ(y) = −f*(0, y)),
then

    ϕ(x^k) − ψ(y^k) ≤ ε  ⇔  ϕ′(x^k) − ψ′(y^k) ≤ ε.

It is however not scale invariant: if f′(x, u) := c·f(x, u) for c > 0, then ϕ′(x) = c·ϕ(x) and

    ψ′(y) = −f′*(0, y) = −sup_{x,u} {⟨u, y⟩ − c·f(x, u)} = −c·sup_{x,u} {⟨u, y/c⟩ − f(x, u)}
          = −c·f*(0, y/c) = c·ψ(y/c),

so even if we get the correspondingly scaled iterates (x^k, c·y^k), then

    ϕ′(x^k) − ψ′(c·y^k) = c·(ϕ(x^k) − ψ(y^k)) ≤ c·ε,

i.e., if we stop on the numerical primal-dual gap we get different solutions although the
problem is effectively the same!
Therefore in practice the normalized gap is often used instead:

Definition 8.3. (normalized gap) We define the normalized numerical primal-dual gap as

    γ̄ := γ̄(x^k, y^k) := (ϕ(x^k) − ψ(y^k)) / ψ(y^k).

We get

    γ̄ ≥ (ϕ(x^k) − ψ(y^k)) / ψ(y′)    (y′ dual optimal)
      ≥ (ϕ(x^k) − ψ(y^k)) / ϕ(x′)    (weak duality)
      ≥ (ϕ(x^k) − ϕ(x′)) / ϕ(x′)     (weak duality).

Therefore a small normalized gap γ̄ ≤ ε guarantees that x^k is (ε·ϕ(x′))-optimal. γ̄ is scale-
invariant due to the normalization. It is however not translation invariant – by adding a
large constant c to ϕ and ψ we can make it arbitrarily small:

    γ̄′ = (ϕ(x^k) + c − (ψ(y^k) + c)) / (ψ(y^k) + c) = (ϕ(x^k) − ψ(y^k)) / (ψ(y^k) + c).

The normalized gap also requires ϕ(x′) > 0. Nevertheless, it is an excellent stopping
criterion and widely used in practice.
Common values are in the range of ε = 10⁻⁴ for inaccurate solutions – which are often
almost identical to high-accuracy solutions in image processing – to ε = 10⁻⁸ if precise
solutions are required.
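The different invariance properties of the raw gap γ and the normalized gap γ̄ can be seen directly with plain numbers (illustrative values, not taken from the notes):

```python
import numpy as np

# Raw vs. normalized gap under scaling and translation:
phi_k, psi_k = 4.20, 4.10        # primal/dual values at the current iterates
gamma = phi_k - psi_k
gamma_bar = (phi_k - psi_k) / psi_k

for c in [1.0, 10.0, 1000.0]:
    # Scaling the whole problem by c scales both objectives by c ...
    gamma_c = c * phi_k - c * psi_k
    gamma_bar_c = (c * phi_k - c * psi_k) / (c * psi_k)
    assert np.isclose(gamma_c, c * gamma)        # raw gap scales with c ...
    assert np.isclose(gamma_bar_c, gamma_bar)    # ... the normalized one does not

# ... but adding a large constant fools the *normalized* gap instead:
c = 1e6
assert (phi_k + c - (psi_k + c)) / (psi_k + c) < 1e-6
```

Neither quantity is invariant under both operations, which is one reason stopping criteria should be documented together with the problem formulation they are applied to.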

8.3 Infeasibilities

When computing the numerical primal-dual gap there is an important detail: we need to
work with the full extended real-valued ϕ and ψ. This is very easy to lose track of, as the
following example shows.

Example 8.4. Consider

    inf_{x∈[1,2]} |x| = inf_{x∈[1,2]} sup_{y∈[−1,1]} x·y
                      = sup_{y∈[−1,1]} inf_{x∈[1,2]} x·y
                      = sup_{y∈[−1,1]} min{y, 2y}.

An optimal primal-dual pair is (x′, y′) = (1, 1); then ϕ(x′) = |x′| = 1 = min{1, 2·1} = ψ(y′),
i.e., the numerical duality gap is zero.
Assume we have points x^k = 1 + δ_k, y^k = 1 + δ_k with δ_k ↓ 0; then (x^k, y^k) → (x′, y′), and

    |x^k| − min{y^k, 2y^k} = 1 + δ_k − (1 + δ_k) = 0.

This means the numerical primal-dual gap is computed as zero, although none of the
iterates x^k are primal optimal. In fact, we could produce a negative gap by setting y^k =
1 + 2δ_k, which would clearly violate weak duality.
The reason for this is that we did not use the complete primal and dual objectives to
compute the gap. In fact

    ϕ(x) = |x| + δ_{[1,2]}(x),    ψ(y) = min{y, 2y} − δ_{[−1,1]}(y).

In our example we get ϕ(x^k) = 1 + δ_k, but ψ(y^k) = −∞ because the y^k are not dual feasible.
The numerical primal-dual gap is therefore not zero but +∞!

Unfortunately we cannot accurately deal with this issue – the primal or dual constraints
could be non-trivial equality constraints, and in the general case cannot be fulfilled exactly
using finite precision arithmetic. Also, we would like to have an indicator of how the primal-
dual sequence converges even if it is infeasible.
One way to deal with this issue is to split the primal and dual objectives:

Definition 8.5. Assume that

  ϕ(x) = ϕ0(x) + Σ_{i=1}^{np} δ_{gi(x)≤0}(x),
  ψ(y) = ψ0(y) − Σ_{i=1}^{nd} δ_{hi(y)≤0}(y),

where dom ϕ0 = dom ψ0 = Rn and gi: Rn → R, hi: Rm → R are suitable continuous real-
valued convex functions, i.e., the primal and dual constraints are of the form

  gi(x) ≤ 0, i ∈ {1, …, np},
  hi(y) ≤ 0, i ∈ {1, …, nd}.

Then the primal and dual infeasibilities are defined as

  ηp := max {0, g1(xk), …, g_{np}(xk)},
  ηd := max {0, h1(yk), …, h_{nd}(yk)}.

Writing the objectives in such a way is often natural. We can then use the stopping
criterion

  max {γ̄0, ηp, ηd} ≤ ε,   γ̄0 := (ϕ0(xk) − ψ0(yk)) / ψ0(yk).
This enforces not only a small primal-dual gap but also that the primal and dual iterates
are close to being feasible.
For Ex. 8.4 we get
γ̄0 = (1 + δk − (1 + δk))/(1 + δk) = 0,
η p = max {0, 1 − xk , xk − 2} = 0,
ηd = max {0, y k − 1, −1 − y k } = δk.
This makes it obvious that although the apparent numerical duality gap is zero, it cannot
be fully trusted because the dual infeasibility is not zero.
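To make the pitfall concrete, here is a small numerical sketch (Python; not part of the original notes, and the step δk = 10⁻³ is an assumed value). It compares the naive gap computed from ϕ0, ψ0 with the full extended-real-valued gap for Example 8.4, together with the infeasibilities of Definition 8.5:

```python
import math

def naive_gap(x, y):
    # gap using only the finite parts phi0, psi0 -- ignores the constraints
    return abs(x) - min(y, 2 * y)

def full_gap(x, y):
    # gap using the full extended real-valued objectives phi, psi
    phi = abs(x) if 1 <= x <= 2 else math.inf
    psi = min(y, 2 * y) if -1 <= y <= 1 else -math.inf
    return phi - psi

delta = 1e-3
xk, yk = 1 + delta, 1 + delta     # y^k is dual infeasible (y^k > 1)
print(naive_gap(xk, yk))          # 0.0 -- looks converged
print(full_gap(xk, yk))           # inf -- y^k violates y <= 1

# infeasibilities as in Def. 8.5, with g1(x) = 1-x, g2(x) = x-2, h1(y) = y-1, h2(y) = -1-y
eta_p = max(0.0, 1 - xk, xk - 2)
eta_d = max(0.0, yk - 1, -1 - yk)
print(eta_p, eta_d)
```

The naive gap alone would suggest convergence; the nonzero ηd reveals that the dual iterate cannot yet be trusted.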
Chapter 9
First-Order Methods

9.1 Forward and backward steps


Traditional gradient descent for differentiable functions:

  xk+1 = xk − τk ∇f(xk)

for some step size sequence τk.
Idea: Basic gradient descent discretizes the gradient flow

  (d/dt) x(t) = −∇f(x(t))

using explicit (forward) Euler steps with step length τk. For non-smooth f this is generally
not possible, but what if we change it to

  (d/dt) x(t) ∈ −∂f(x(t))?
Indeed it is possible to show that if f: Rn → R̄ is proper lsc convex with arg min f ≠ ∅, and
x0 ∈ cl dom f, then there exists a unique path x(t), t ∈ [0, ∞) with x(0) = x0, ∂f(x(t)) ≠ ∅
for all t and

  (d/dt) x(t) ∈ −∂f(x(t)) for a.e. t,

where t ↦ x(t) is absolutely continuous on all intervals [δ, ∞) and x(t) converges to a
minimizer of f (Bruck 1975).
In order to discretize this, there are two straightforward choices:

Definition 9.1. For f: Rn → R̄ we define

1. the “forward step”:
   Fτk f(xk) := (I − τk ∂f) xk,
2. the “backward step”:
   Bτk f(xk) := (I + τk ∂f)−1 xk.

These are called “forward” and “backward” steps, since they correspond to forward and
backward Euler discretizations of the gradient flow. The optimization problem
can be seen as finding a zero of the set-valued mapping ∂f : Rn ⇒ Rn, i.e.,
find x ∈ Rn s.t. 0 ∈ ∂f (x).
The forward and backward steps then amount to
xk+1 ∈ xk − τk ∂f (xk), xk+1 ∈ xk − τk ∂f (xk+1).


We see that the only difference is that in the backward step, the new iterate xk+1 is used
to compute the subdifferential.
The backward step is also known as proximal step, because it can be reformulated as
minimizing f together with an additional quadratic prox term that keeps the new iterate
close to the previous iterate:

Proposition 9.2. (proximal steps) If f: Rn → R̄ is proper lsc convex and τ > 0, then the
backward step is

  Bτ f(x) = arg min_y { (1/2) ‖y − x‖₂² + τ f(y) }

and therefore unique.

Proof.

  y ∈ Bτ f(x) ⇔ y ∈ (I + τ ∂f)−1(x)
              ⇔ x ∈ (I + τ ∂f)(y)
              ⇔ 0 ∈ y − x + τ ∂f(y)
              ⇔ y ∈ arg min_y { (1/2) ‖y − x‖₂² + τ f(y) }.  □

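As a quick numerical sanity check of Prop. 9.2 (an illustrative Python sketch, not part of the notes): for the one-dimensional choice f(y) = |y| the backward step has the well-known closed form of soft-thresholding, which we can compare against a brute-force minimization of the prox objective on a fine grid:

```python
import numpy as np

def prox_abs(x, tau):
    # backward step B_tau f for f(y) = |y|: soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_by_minimization(x, tau, grid):
    # brute-force the arg min in Prop. 9.2 on a grid of candidate points
    vals = 0.5 * (grid - x) ** 2 + tau * np.abs(grid)
    return grid[np.argmin(vals)]

grid = np.linspace(-3, 3, 600001)   # grid spacing 1e-5
for x in [-2.0, -0.3, 0.0, 0.7, 2.5]:
    assert abs(prox_abs(x, 0.5) - prox_by_minimization(x, 0.5, grid)) < 1e-4
print("soft-thresholding matches the prox definition")
```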

If we are given a problem

  inf_x f(x)

without any additional knowledge about the structure of f, the most straightforward
approaches are to apply forward or backward steps to f:

Example 9.3. (forward and backward stepping) Assume f is proper lsc convex and
arg min f ≠ ∅.
1. Forward stepping:
xk+1 ∈ Fτk f (xk).
Convergence:
• Only restricted convergence results available (line search for smooth functions;
  the basic result for non-smooth functions requires knowledge of inf f).
• Also: convergence of xk to a minimizer if there is α > 0 such that 0 < inf τk ≤
  sup τk < 2α and [Eck89, 3.3]

    ⟨y − x, h − g⟩ ≥ α ‖h − g‖₂²  ∀ g ∈ ∂f(x), h ∈ ∂f(y).

  In the smooth case this implies the Lipschitz condition:

    ‖y − x‖₂ ‖∇f(y) − ∇f(x)‖₂ ≥ ⟨y − x, ∇f(y) − ∇f(x)⟩ ≥ α ‖∇f(y) − ∇f(x)‖₂²
    ⇒ ‖∇f(y) − ∇f(x)‖₂ ≤ (1/α) ‖y − x‖₂.   (9.1)
Advantages:
• requires very little information (just a single subgradient)

• if inf τk > 0 then fixed points are minimizers of f [Eck89, 3.6].


Disadvantages:
• sequence not unique, depends on selection of subgradient
• can get stuck if xk becomes infeasible
2. Backward stepping:
xk+1 = Bτk f (xk).
Convergence:
• Σk τk² = +∞ is sufficient (i.e., no upper bound on the step size!)
Advantages:
• unique sequence
• cannot get stuck, infeasible starting point possible
Disadvantages:
• substeps are as hard as the original problem (but strictly convex)

The last point is generally the dealbreaker for backward steps. But we often encounter
problems where the objective is a sum of several terms, each of which is easy to minimize:
f (x) = g(x) + h(x).
The question is, can we find a minimum of f by combining suitable alternating minimiza-
tion steps of g and h?

Example 9.4. (splitting principle) Assume f = g + h such that ∂f = ∂g + ∂h, with f, g, h
proper lsc convex and arg min f ≠ ∅.
1. Backward-Backward:
xk+1 = Bτk h Bτk g(xk).
Convergence:
• in the mean, i.e.,

    yk := (Σ_{k′=0}^{k} τk′ x^{k′+1}) / (Σ_{k′=0}^{k} τk′)

  converges to a minimizer of f if τk > 0, Σk τk = +∞ and Σk τk² < +∞.
Advantages:
• few restrictions on step size
Disadvantages:
• step size needs to approach 0
• fixed points (i.e., x with x = Bτh Bτg(x) for some fixed τ ) are generally not
minimizers of f !
• convergence in the mean raises numerical difficulties for large number of steps
2. Forward-Backward:
xk+1 ∈ Bτk h Fτk g(xk).

A very popular special case is gradient projection, where f(x) = g(x) + δC(x), g is
differentiable, and C ≠ ∅ is closed and convex. Then

  xk+1 ∈ arg min_x { (1/2) ‖x − (xk − τk ∇g(xk))‖₂² + δC(x) }
       = ΠC(xk − τk ∇g(xk)).

Convergence:
• gradient projection: if (9.1) holds and 0 < inf τk ≤ sup τk < 2α.
• general case: in the mean if Σk τk = +∞ and Σk τk² < +∞, and the chosen
  sequence of subgradients yk ∈ Fτk g(xk) is bounded (this is not automatic and needs
  to be shown separately).
Advantages:
• fixed points are the minimizers of f
• requires backward steps on h only, and can therefore deal with relatively
  complicated functions g (think g(x) = ‖A x − b‖₂², where backward steps would
  involve solving large linear systems)
Disadvantages:
• sequence not unique, may get stuck
• can escape from a minimizer! (if ∂g(x′) ≠ {0}, i.e., contains non-zero subgradients)
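A minimal numerical illustration of the gradient-projection special case (a Python sketch; the objective g(x) = ½(x − 3)² and the set C = [0, 1] are assumed toy choices, not from the notes):

```python
def grad_projection(x0, grad, proj, tau, iters):
    x = x0
    for _ in range(iters):
        x = proj(x - tau * grad(x))   # forward step on g, then projection onto C
    return x

# toy problem: minimize g(x) = 0.5*(x - 3)^2 over C = [0, 1]; the minimizer is x = 1
x = grad_projection(x0=0.0,
                    grad=lambda x: x - 3.0,
                    proj=lambda x: min(max(x, 0.0), 1.0),
                    tau=0.5, iters=50)
print(x)   # -> 1.0
```

Here ∇g(x) = x − 3 satisfies the monotonicity condition above with α = 1, so any fixed step τ < 2 is admissible.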

Example 9.5. (ISTA) Assume we would like to reconstruct an image y ∈ Rn from the
observation b = R y, where R is a linear operator. This could be a blurring operator, a
super-resolution operator, or just the identity for denoising of a given image.
As prior knowledge about the image to be recovered we make the first assumption that

  y = W x,

where W ∈ Rn×m is a set of basis vectors (possibly overcomplete). One choice is to use a
wavelet basis consisting of scaled and translated copies of a “mother wavelet”, for which
the product x ↦ W x can be evaluated quickly although W is large and non-sparse. The
second and key assumption is that x is sparse, i.e., it has only a few non-zeros: we assume
that the image can be represented using only a few of the basis functions.
An approach to recover y is to set A = R W and find

  x ∈ arg min_x { (1/2) ‖A x − b‖₂² + λ ‖x‖₁ }.

The first term ensures that the resulting image is close to the observation, the second term
promotes sparsity of x. If we apply forward-backward splitting with f = g + h, g(x) =
(1/2) ‖A x − b‖₂², h(x) = λ ‖x‖₁, we get

  xk+1 = arg min_x { (1/2) ‖x − (xk − τk A⊤ (A xk − b))‖₂² + τk λ ‖x‖₁ }.
This is known as ISTA (iterative shrinkage-thresholding algorithm) and has several strong points:
• It only requires us to do matrix multiplications involving A and A⊤, as opposed to
solving linear equation systems involving A.
9.2 Primal-dual methods 63

• The proximal step is separable, and can be solved explicitly using shrinkage (see
example sheets).
• Convergence: O(1/k), where k is the number of steps. A variant (FISTA) employs
  over-relaxation and an adaptive choice of the τk to achieve O(1/k²). This is –
  although far from a linear convergence rate – often enough in practice and widely
  used.
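The iteration can be sketched in a few lines (a hedged Python illustration; the operator A, the sparse x_true, λ = 0.05 and the random seed are all assumed toy data, not from the notes):

```python
import numpy as np

def ista(A, b, lam, tau, iters):
    # forward-backward splitting on 0.5*||Ax - b||_2^2 + lam*||x||_1
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        v = x - tau * (A.T @ (A @ x - b))                        # forward step on g
        x = np.sign(v) * np.maximum(np.abs(v) - tau * lam, 0.0)  # shrinkage (prox of h)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100)
x_true[[3, 17, 60]] = [1.0, -2.0, 1.5]
b = A @ x_true                                  # noise-free observations
tau = 0.9 / np.linalg.norm(A, 2) ** 2           # step size below 1/L, L = ||A||_2^2
x = ista(A, b, lam=0.05, tau=tau, iters=5000)
print(np.flatnonzero(np.abs(x) > 0.5))          # indices of the large entries
```

With noise-free data and only three active coefficients, the iterates recover the support of x_true; note that only matrix-vector products with A and A⊤ are needed.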

9.2 Primal-dual methods


A slightly different approach is to not try to minimize the function, but to solve the primal-
dual optimality conditions instead. This is particularly fruitful in convex optimization,
as a primal-dual solution not only yields a global minimizer of f , but also an optimality
certificate in form of the dual feasible (and optimal, in case of a solution) point.
Consider the primal and dual problems with associated Lagrangian
  inf_x ϕ(x),  ϕ(x) = g(x) + h(A x),
  sup_y ψ(y),  ψ(y) = −g*(−A⊤ y) − h*(y),
  l(x, y) = g(x) − h*(y) + ⟨A x, y⟩.
The optimality (saddle-point) conditions are

  x′ ∈ arg min_x l(x, y′) = arg min_x {g(x) + ⟨A⊤ y′, x⟩},
  y′ ∈ arg max_y l(x′, y) = arg min_y {h*(y) + ⟨−A x′, y⟩}.

An obvious strategy is to alternate between improving the primal and dual objectives,
keeping the other variable fixed. In the following we generally assume that strong duality
holds.

Example 9.6. (primal-dual methods)


1. PFBS (proximal forward-backward splitting): We apply a full minimization step with
respect to y, followed by a backward step on x. The minimization with respect to
y can be rewritten as follows:
  yk+1 ∈ arg min_y {h*(y) + ⟨−A xk, y⟩}
  ⇔ 0 ∈ ∂h*(yk+1) − A xk
  ⇔ yk+1 ∈ ∂h(A xk)   (using ∂h* = (∂h)−1).
The backward step is then

  xk+1 ∈ arg min_x { g(x) + ⟨A⊤ yk+1, x⟩ + (1/(2 τk)) ‖x − xk‖₂² }
       = arg min_x { g(x) + (1/(2 τk)) ‖x − (xk − τk A⊤ yk+1)‖₂² }.
By the chain rule (under sufficient regularity assumptions) A⊤ y k+1 ∈ ∂(h ◦ A)(xk),
so PFBS actually amounts to a forward step on h ◦ A followed by a backward step
on g – it is in fact the same as forward-backward splitting in the operator splitting
approach!

Convergence:
• if ∇(h ◦ A) is Lipschitz continuous (in particular differentiable!) with con-
stant L, and 0 < inf τk 6 sup τk < 2/L.
Advantages:
• requires proximal steps on g only
• compared to forward-backward splitting, provides an alternative way to com-
pute the subgradient by minimizing h∗ and generates a sequence of dual
iterates y k that can be used to compute the numerical primal-dual gap.
Disadvantages:
• step size restriction, need to estimate L
2. Modified PDHG (primal-dual hybrid gradient): We alternate between backward
steps with respect to x and y with an additional over-relaxation:
y k+1 = Bσk (−l(x̄k ,·))(y k),
xk+1 = Bτk(l(·,yk+1))(xk),
x̄ k+1 = xk+1 + θk (xk+1 − xk).
Convergence:
• in the mean if σk ≡ σ, τk ≡ τ, θk ≡ θ = 1 and σ τ < 1/‖A‖₂², with rate O(1/k).
• in the mean if g or h* is strongly convex with constant α: O(1/k²) for ‖xk − x′‖₂²
  (needs an adaptive θk).
• in the mean if g and h* are strongly convex with constants α, β: O(ω^k) with
  ω = (1 + θ) / (2 (1 + √(σ β)/‖A‖₂)), i.e., a linear convergence rate.

Advantages:
• few requirements for convergence, flexible
• efficiently decouples the linear operator A (only matrix-vector products with A and A⊤ are needed)
Disadvantages:
• step size bound
• convergence in the mean (although convergence in (xk , y k) is observed)
• even if kAk is known, the ratio σ/τ may still need to be tuned for good
convergence
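To make the scheme concrete, here is modified PDHG applied to the toy problem of Example 8.4, min_{x∈[1,2]} |x|, written as g(x) + h(A x) with g = δ[1,2], h = |·| and A = 1 (a Python sketch, not from the notes; the prox of σ h* is the projection onto [−1,1], and the prox of τ g the projection onto [1,2]):

```python
# modified PDHG for min_{x in [1,2]} |x|; optimal primal-dual pair is (1, 1)
sigma = tau = 0.5                # sigma * tau < 1/||A||^2 = 1
x, xbar, y = 2.0, 2.0, 0.0
for _ in range(200):
    y = min(max(y + sigma * xbar, -1.0), 1.0)      # backward step in y: project onto [-1,1]
    x_new = min(max(x - tau * y, 1.0), 2.0)        # backward step in x: project onto [1,2]
    xbar = 2 * x_new - x                           # over-relaxation with theta = 1
    x = x_new
print(x, y)   # -> 1.0 1.0
```

The iterates reach the saddle point (1, 1) after a handful of steps, matching the optimal pair found in Example 8.4.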

Example 9.7. (Augmented Lagrangian): We decouple h from the linear operator A by adding
an artificial variable z:

  inf_{x,z} ϕ(x, z),  ϕ(x, z) = g(x) + h(z) + δ_{Ax=z}(x, z),
  l((x, z), y) = g(x) + h(z) + ⟨A x − z, y⟩.
The dual variables y only occur as multipliers. Note the optimality conditions:

  { 0 ∈ ∂(x,z) l((x, z), y)        { 0 ∈ ∂g(x) + A⊤ y,  0 ∈ ∂h(z) − y
  { 0 ∈ ∂y (−l)((x, z), y)    ⇔    { A x = z

                              ⇔    { 0 ∈ ∂g(x) + A⊤ y        ⇔    { 0 ∈ ∂g(x) + A⊤ y
                                   { y ∈ ∂h(A x)    (subgrad.       { 0 ∈ ∂h*(y) − A x.
                                                     inversion)

This shows that solutions ((x, z), y) of the new problem are exactly the triples ((x, A x), y)
where (x, y) is a primal-dual solution for l! In particular, y is always a dual solution for
the original problem, although we did not explicitly introduce it as such.
We augment the Lagrangian by adding a quadratic penalty:

  l̄((x, z), y) := g(x) + h(z) + ⟨A x − z, y⟩ + (1/(2σ)) ‖A x − z‖₂².

This does not change the set of minimizers, because A x = z holds for every minimizer.
We then alternate between minimization with respect to x and z and a gradient ascent
step in y with step size 1/σ:

  xk+1 = arg min_x { g(x) + (1/(2σ)) ‖A x − zk + σ yk‖₂² },
  zk+1 = arg min_z { h(z) + (1/(2σ)) ‖z − A xk+1 − σ yk‖₂² },
  yk+1 = yk + (1/σ) (A xk+1 − zk+1).
This is known as the alternating direction method of multipliers (ADMM).
Convergence:
• for any σ > 0 we have xk → x′ and yk → y′, where (x′, y′) is a primal-dual optimal pair;
  additionally zk → A x′
Advantages:
• very few assumptions required for convergence
Disadvantages:
• We need to solve problems involving the term kA x − ·k22, which can be difficult due
to the coupling of the variables in x.
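A minimal ADMM sketch (Python; the data c, λ = 1 and σ = 1 are assumed toy choices, not from the notes) for g(x) = ½‖x − c‖₂², h = λ‖·‖₁ and A = I, where both subproblems have closed forms and the limit is the soft-thresholding of c:

```python
import numpy as np

def soft(v, t):
    # soft-thresholding = prox of t*||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# ADMM on min_x 0.5*||x - c||_2^2 + lam*||x||_1, split as g(x) + h(z) with x = z
c = np.array([2.0, -0.3, 0.5, -1.5])
lam, sigma = 1.0, 1.0
x = np.zeros_like(c); z = np.zeros_like(c); y = np.zeros_like(c)
for _ in range(200):
    x = (sigma * c + z - sigma * y) / (1 + sigma)   # x-subproblem (quadratic, closed form)
    z = soft(x + sigma * y, sigma * lam)            # z-subproblem (shrinkage)
    y = y + (x - z) / sigma                         # multiplier update
print(x)   # converges to soft(c, lam)
```

Here A = I, so the coupling difficulty mentioned above does not arise; for a general A the x-subproblem involves ‖A x − ·‖₂² and is correspondingly harder.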
Chapter 10
Interior-Point Methods
Setting: We would like to solve

  inf ⟨c, x⟩ s.t. A x − b ≥K 0,

equivalently inf ⟨c, x⟩ + δK(A x − b), where K is a proper closed convex cone.
Idea: We would like to use Newton’s method

  xk+1 = xk − (∇²f(xk))−1 ∇f(xk),

since it is usually very fast (locally quadratic convergence rate: ‖xk+1 − x′‖ ≤ c ‖xk − x′‖²).
Problem: the constraints!
Idea: We replace the indicator function δK(z) by an approximation F(z) such that
F(z) → +∞ as z → bd K, and solve

  inf_x t ⟨c, x⟩ + F(A x − b).

As t → +∞ the minimizers should approach the minimizers of the original problem.
Nevertheless, they all lie in the interior of the constraint set K, hence the name “interior-
point methods”, as opposed to methods that travel on the boundary of the constraint set.
Two questions:
• Can we guarantee that we get a solution with a certain accuracy, i.e., how do we
  need to choose t?
• As t → +∞ the problems become potentially more difficult, as the optimal point
  moves toward the boundary, and the conditioning of F becomes worse the closer one
  allows points to come to the boundary. Can we proceed iteratively, i.e., if we know
  a solution for a certain tk, can we quickly find a solution for a larger tk+1?

Definition 10.1. (canonical barrier) For a cone K we define the “canonical barriers”
F = FK and associated parameters θF:
• K = Kn^LP = {x ∈ Rn | x1, …, xn ≥ 0}: we have the canonical barrier

    F(x) = −Σ_{i=1}^{n} log xi,  θF = n,

• K = Kn^SOCP = {x ∈ Rn | xn ≥ √(x1² + … + x_{n−1}²)}:

    F(x) = −log(xn² − x1² − … − x_{n−1}²),  θF = 2,

• K = Kn^SDP := {X ∈ Rn×n | X symmetric positive semidefinite}:

    F(X) = −log det X,  θF = n.


• K = K 1 × K 2 , then FK (x1, x2) = FK 1(x1) + FK 2(x2) with θF = θF 1 + θF 2.

Proposition 10.2. [Nem 6.3.1, 6.3.2, Boyd 11.6.1] If F is a canonical barrier for K, then
F is smooth on dom F = int K and strictly convex,

  F(t x) = F(x) − θF log t  ∀ x ∈ dom F, t > 0,

and for x ∈ dom F we have
1. −∇F(x) ∈ dom F,
2. ⟨∇F(x), x⟩ = −θF,
3. −∇F(−∇F(x)) = x,
4. ∇F(t x) = (1/t) ∇F(x).

Proof. [Boyd 11.6.1] Differentiate the identity F(t x) = F(x) − θF log t with respect to t at t = 1. □
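The identities of Prop. 10.2 are easy to check numerically for the LP barrier (an illustrative Python sketch, not part of the notes; the test point x is an arbitrary choice):

```python
import numpy as np

# Prop. 10.2 for the LP barrier F(x) = -sum(log x_i), with grad F(x) = -1/x, theta_F = n
grad_F = lambda x: -1.0 / x

x = np.array([0.5, 2.0, 3.0])
n = len(x)
t = 4.0
assert np.all(-grad_F(x) > 0)                      # 1. -grad F(x) lies in dom F = int K
assert np.isclose(grad_F(x) @ x, -n)               # 2. <grad F(x), x> = -theta_F
assert np.allclose(-grad_F(-grad_F(x)), x)         # 3. inversion identity
assert np.allclose(grad_F(t * x), grad_F(x) / t)   # 4. scaling identity
print("all barrier identities hold")
```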

Again we consider (assuming that A has full column rank, i.e., linearly independent columns;
Nem p.53, Ass. A)

  inf ⟨c, x⟩ s.t. A x − b ≥K 0.
The dual problem is

  sup ⟨−b, y⟩ s.t. −A⊤ y = c, y ≤K* 0.

We replace y by −y:

  sup ⟨b, y⟩ s.t. A⊤ y = c, y ≥K* 0.

We then assume that K is self-dual (K* = K) and get

  sup ⟨b, y⟩ s.t. A⊤ y = c, y ≥K 0.
In the remainder of this chapter we generally assume that the primal and dual problems
are strictly feasible, i.e., there exist x, y so that A x − b ∈ int K, A⊤ y = c, and y ∈ int K.
This guarantees that we can actually find interior points, and that strong duality holds.

Proposition 10.3. (primal-dual central path) The primal central path is the mapping

  t ↦ x(t) = arg min_x {t ⟨c, x⟩ + F(A x − b)}.

The dual central path is the mapping

  t ↦ y(t) = arg min_y {−t ⟨b, y⟩ + F(y) + δ_{A⊤y=c}(y)}.

The primal-dual central path is the mapping t ↦ z(t) := (x(t), y(t)). These central paths
exist and are unique for all t > 0. Also,

  y(t) = −(1/t) ∇F(A x(t) − b),

and (x, y) is on the central path, i.e., (x, y) = (x(t), y(t)) for some t > 0, if and only if

  A x − b ∈ dom F,
  A⊤ y = c,
  t y + ∇F(A x − b) = 0.

Proof. Existence, uniqueness: see Nem. 6.4.2 (strict convexity etc.).



Second part: From Prop. 10.2 we get that y := −(1/t) ∇F(A x − b) ∈ dom F, because
A x − b ∈ dom F (x is strictly feasible) and dom F is a cone. Also, if x is on the primal
central path, then

  0 = t c + A⊤ ∇F(A x − b)
  ⇔ c = −(1/t) A⊤ ∇F(A x − b).

Therefore

  A⊤ y = −(1/t) A⊤ ∇F(A x − b) = c.

Together this shows that y is feasible.
The dual optimality conditions are therefore

  0 ∈ −t b + ∇F(y) + N_{A⊤y=c}(y),  N_{A⊤y=c}(y) = rge A.

We check:

  −t b + ∇F(y) = −t b + ∇F(−(1/t) ∇F(A x − b))
               = −t b + t ∇F(−∇F(A x − b))     (Prop. 10.2)
               = −t b − t (A x − b)            (Prop. 10.2)
               = −t A x ∈ rge A.

Therefore y is a dual solution, and in fact the dual solution (because it is unique).
Last part: Multiplying the condition t y + ∇F(A x − b) = 0 by A⊤, we get

  0 = t A⊤ y + A⊤ ∇F(A x − b) = t c + A⊤ ∇F(A x − b),

which is the optimality condition for the primal central point. □

Proposition 10.4. (duality gap along the path) For feasible x, y (i.e., A x − b ∈ K,
A⊤ y = c, y ∈ K), the duality gap is

  ϕ(x) − ψ(y) = ⟨y, A x − b⟩.

Moreover, for points (x(t), y(t)) on the central path, the duality gap is

  ϕ(x(t)) − ψ(y(t)) = θF / t.
Proof.

  ϕ(x) − ψ(y) = ⟨c, x⟩ − ⟨b, y⟩ = ⟨A⊤ y, x⟩ − ⟨b, y⟩ = ⟨y, A x − b⟩.

  ϕ(x(t)) − ψ(y(t)) = ⟨y(t), A x(t) − b⟩
                    = ⟨−t−1 ∇F(A x(t) − b), A x(t) − b⟩   (Prop. 10.3)
                    = θF / t.                             (Prop. 10.2)  □

Remark: This has an intriguing consequence: If we are given x, y on the central path,
we can immediately find the unique t to which they belong by computing the numerical gap!

We do not have to compute exact solutions, it is essentially enough to stay close to the
central path:

Proposition 10.5. (near the central path) We define

  ‖v‖x* := (v⊤ ∇²F(A x − b)−1 v)^{1/2},

z := (x, y), so that z(t) is the primal-dual central path, and

  dist(z, z(t)) := ‖t y + ∇F(A x − b)‖x*.

Then for A x − b ∈ dom F, y ∈ dom F, A⊤ y = c, we have

  dist(z, z(t)) ≤ 1 ⇒ ϕ(x) − ψ(y) ≤ 2 (ϕ(x(t)) − ψ(y(t))) = 2 θF / t.
Proof. See Nemirovski. 

(path tracing) Idea: From Prop. 10.3 we know that a primal-dual optimal pair for a fixed t
can be found by solving

  A⊤ y = c,  t y + ∇F(A x − b) = 0.

Idea: apply Newton's method to this nonlinear equation system; we linearize

  ∇F(A xk+1 − b) = ∇F(A xk − b + A ∆x) ≈ ∇F(A xk − b) + ∇²F(A xk − b) A ∆x

and get the following linear equation system for the steps ∆x and ∆y:

  A⊤ (yk + ∆y) = c,
  tk+1 (yk + ∆y) + ∇F(A xk − b) + ∇²F(A xk − b) A ∆x = 0.

Multiplying the last line by A⊤ from the left (to eliminate y) and substituting, we get

  ∆x = H−1 (−tk+1 c − A⊤ ∇F(A xk − b)),  H := A⊤ ∇²F(A xk − b) A,
  ∆y = −(tk+1)−1 (∇F(A xk − b) + ∇²F(A xk − b) A ∆x) − yk.
Here we need the assumption that A has full column rank in order for H to be regular.
We apply a step in the direction of (∆x, ∆y):
(xk+1, y k+1) = (xk , y k) + τk (∆x, ∆y) (10.1)
with the step size τk found using line search, or a full Newton step (τk = 1):

Theorem 10.6. Assume 0 < ρ ≤ κ < 1/10, tk > 0 fixed and zk = (xk, yk) strictly feasible, i.e.,
A xk − b ∈ dom F, yk ∈ dom F, such that

  dist(zk, z(tk)) < κ.

If we apply a full Newton step in (10.1) with τk = 1 and tk+1 := (1 + ρ/√θF) tk to generate
zk+1, then xk+1, yk+1 are strictly primal and dual feasible, and

  dist(zk+1, z(tk+1)) < κ

as well.

(The idea is to bound the step for tk so as to keep the Newton method within its region
of quadratic convergence. The factor 1/10 comes from a nasty expression involving ρ and κ.)
Advantages:
• We can update the penalty after a single Newton step!
• In particular, this means that

    ϕ(xk) − ψ(yk) ≤ 2 θF / tk = (2 θF / t0) (1 + ρ/√θF)^{−k}.

  We have guaranteed global linear convergence with explicit bounds!
• In practice, we can often do much larger steps for t (we could imagine computing t
from the duality gap!)
• Rapid convergence if the Newton steps can be solved in reasonable time.
Disadvantages:
• Relatively complicated.
• Many detail problems: finding a feasible point, computing the Newton step (or
Quasi-Newton approximation)
• Requires problem-specific code for medium-scale problems to be efficient.
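To tie the pieces together, here is a deliberately simplified path-following sketch (Python, not from the notes) for the toy problem min ⟨c, x⟩ s.t. x ≥ b, i.e., A = I and K the nonnegative orthant, with the canonical barrier of Definition 10.1 and the stopping rule θF/t ≤ ε motivated by Prop. 10.4. The data c, b, the factor 1.5 and the fixed number of Newton steps are all assumed choices, standing in for the careful update of Theorem 10.6:

```python
import numpy as np

# toy conic program: min <c,x> s.t. x >= b componentwise (A = I, K = R^n_+);
# since c > 0 the optimum is x* = b
c = np.array([1.0, 2.0, 0.5])
b = np.array([1.0, -1.0, 2.0])
n = len(c)                          # theta_F for the barrier F(u) = -sum(log u_i)

x = b + 1.0                         # strictly feasible starting point
t = 1.0
while n / t > 1e-8:                 # duality gap along the central path: theta_F / t
    for _ in range(5):              # a few damped Newton steps on t*<c,x> + F(x - b)
        s = x - b
        grad = t * c - 1.0 / s
        dx = -s**2 * grad           # Newton direction: the Hessian is diag(1/s^2)
        step = 1.0
        while np.any(s + step * dx <= 0):   # damping: stay strictly feasible
            step *= 0.5
        x = x + step * dx
    t *= 1.5                        # increase the penalty parameter and re-center
print(x)                            # close to the optimum b
```

Because each t-increase starts from a point near the new central point, a handful of Newton steps per stage suffices, exactly the phenomenon that Theorem 10.6 makes precise.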
Chapter 11
Using Solvers and Discretization Issues
See cvx_demo.zip, available online.

Chapter 12
Support Vector Machines

12.1 Introduction to machine learning


General problem: Given a sample set of feature vectors X = {x1, , xn }, taken from some
feature space F ⊆ Rm, find a classifier function hθ: Rm → C from a class of functions
{hθ |θ ∈ Θ} that “best” maps each feature vector into a corresponding class from the set C,
where usually C is a finite subset of Z.
• In unsupervised learning, only the feature vectors X are known, together with some
prior knowledge about the structure of the feature space and the distribution of
the data. The task is to automatically find classes and cluster the samples into sets
that are similar with respect to some criterion.
A typical application is clustering: an online shop might be interested to auto-
matically classify their customers into groups to better target the individual group
interests, or they might want to improve categorization of their products on the web-
site. In image processing, one might want to separate foreground and background of
an image, only knowing that they look “different”, but not what they look like – it
might be a white bird against a blue sky, but it also might be a squirrel on a tree.
Other applications include the detection of “unusual” events, e.g., in a video stream,
or identifying and mapping the data points to few low-dimensional subspaces as in
Principal Component Analysis (PCA).
Generally, there is no a priori knowledge on what the individual classes should
be, and often the number of classes isn’t even known. This makes unsupervised
learning a very hard problem.
• In supervised learning, the number of classes is generally known, and there is a set
of training data T = {(x1, y1), …, (xn, yn)}, consisting of pairs of a feature vector
x i ∈ F and a class label y i ∈ C.
Usually such data is obtained by hand-labeling data or using historical data.
For example, we might want to predict whether it is advisable to buy a particular
stock based on the recent development of the stock price. In image processing, we
might be interested in finding a way to detect cancer cells from regular cells based
on a library of cell images that have been classified beforehand by an expert, classify
brain regions or scans, detect vertebrae, faces or body parts, or generally
identify objects in images for search indexing or finding related images.
In this chapter we will only consider supervised learning. Most supervised learning tech-
niques can be summarized in the following energy-based framework:
For a given training data set T = {(x1, y1), …, (xn, yn)}, find the classifier function hθ
from a family {hθ | θ ∈ Θ} such that the total loss

  f(θ, T) := Σ_{i=1}^{n} g(yi, hθ(xi)) + R(θ)   (12.1)


is minimized:
• The function g penalizes a deviation of the predicted class label hθ(x i) from the
actual (known) class label y i.
• The regularizer R can make the problem well-defined if there is not enough input
data to uniquely identify θ, and can counter-act the problem of overfitting: If the
size of the training data set is small compared to the parameter space Θ, there is
a danger of fitting the classifier to the specific structure of the training data set,
instead of the distribution underlying the training data – in effect, the classifier will
work very well for the training data, but will perform poorly on feature vectors that
are not in the training data set.
In this chapter we will mostly deal with the two-class case, more precisely, hθ: Rm → {−1,
1}; however, most concepts can be generalized to multiple classes. For the two-class case,
the set (hθ)−1({0}) defines the decision boundary, i.e., the set that separates the two class
regions (hθ)−1({1}) and (hθ)−1({−1}).

12.2 Linear Classifiers


Primal problem.
A simple but very powerful idea is to look for linear classifiers of the form

  hθ: Rm → {−1, 1},  hθ(x) := sgn(⟨w, x⟩ + b),  where θ = (w, b).   (12.2)

This corresponds to finding a separating hyperplane H := {x | ⟨w, x⟩ + b = 0} that separates
the two sets of points. Even if this is possible, there will be ambiguity as to how exactly to
choose the hyperplane, so regularization is needed.
We will first look at maximum margin classifiers. Here the margin – the minimal
(signed, in the direction of the desired class label yi) distance between the points xi in T and
the hyperplane H – is maximized:

  sup_{w,b} min_{i∈{1,…,n}} yi · (⟨w/‖w‖₂, xi⟩ + b/‖w‖₂).

We rewrite this as

  sup_{w,b,c} c s.t. c ≤ yi · (⟨w/‖w‖₂, xi⟩ + b/‖w‖₂), i ∈ {1, …, n}
  = sup_{w,b,c} c s.t. ‖w‖₂ c ≤ yi · {⟨w, xi⟩ + b}, i ∈ {1, …, n}.

We now substitute c′ = c ‖w‖₂ and obtain

  sup_{w,b,c′} c′/‖w‖₂ s.t. c′ ≤ yi · {⟨w, xi⟩ + b}, i ∈ {1, …, n}.

For any solution (w, b, c′), any scalar multiple (λ w, λ b, λ c′) with λ > 0 will also be a
solution. Assuming that there exists at least one solution with c′ > 0 (i.e., the data sets
can in fact be exactly separated), we can thus pick one of the solutions by forcing c′ = 1,
and obtain

  sup_{w,b} 1/‖w‖₂ s.t. 1 ≤ yi · {⟨w, xi⟩ + b}, i ∈ {1, …, n}.   (12.3)

Finally, maximizing ‖w‖₂−1 is equivalent to minimizing (1/2) ‖w‖₂², so we obtain

  inf_{w,b} (1/2) ‖w‖₂² s.t. 1 ≤ yi · {⟨w, xi⟩ + b}, i ∈ {1, …, n}.   (12.4)

This is a convex problem with quadratic objective, and can be rewritten in SDP form (an
SOCP formulation is equally possible but leads to a different dual problem). In terms of
the framework in (12.1), the quadratic part (1/2) ‖w‖₂² constitutes the regularizer R, while the
indicator function of the constraints is the loss function g – in effect, the loss for a correct
classification is always 0, while incorrect classifications have infinite loss and are therefore
prohibited.
Dual problem and optimality conditions.
Rewriting the primal problem as

  inf_{w,b} k(w, b) + h(M (w; b) − 1),  k(w, b) = (1/2) ‖w‖₂²,  h(z) = δ≥0(z),   (12.5)

  M = ( y1 · (x1)⊤  y1
        ⋮
        yn · (xn)⊤  yn ),

and using (note the conjugate of k with respect to b)

  k*(u, c) = (1/2) ‖u‖₂² + δ0(c),  h*(v) = δ≤0(v),
from Ex. 7.13 and Prop. 7.6 we obtain the saddle-point form

  inf_{w,b} sup_z { k(w, b) + ⟨M (w; b) − 1, z⟩ − h*(z) }
  = inf_{w,b} sup_z { (1/2) ‖w‖₂² + ⟨M (w; b) − 1, z⟩ − δ≤0(z) },   (12.6)
and the dual problem

  sup_z − Σ_{i=1}^{n} zi − δ≤0(z) − (1/2) ‖−Σ_{i=1}^{n} yi xi zi‖₂² − δ0(Σ_{i=1}^{n} yi zi).
We get the dual formulation of the linear SVM,

  inf_{z∈Rn} (1/2) ‖Σ_{i=1}^{n} yi xi zi‖₂² + e⊤ z   (12.7)
  s.t. z ≤ 0,
       Σ_{i=1}^{n} yi zi = 0.
Support vector machines are generally solved in their dual form (we will see later why
exactly). As in Ex. 7.15, the primal-dual optimality conditions can be stated in terms of
the Lagrangian in (12.6) in complementarity form,

  0 ∈ (w; 0) + M⊤ z,   (12.8)
  0 ≤ M (w; b) − 1 ⊥ z ≤ 0.   (12.9)

Again as in Ex. 7.15, as all terms in the last condition cannot be positive due to the preceding
two conditions, the complementarity condition reduces to n scalar equalities of the form

  (yi · {⟨w, xi⟩ + b} − 1) · zi = 0.   (12.10)

Therefore, any point xi with zi ≠ 0 must satisfy yi (⟨xi, w⟩ + b) = 1, i.e.,

  zi ≠ 0 ⇒ yi · (⟨xi, w/‖w‖₂⟩ + b/‖w‖₂) = 1/‖w‖₂,

which by (12.3) is the minimum distance between any point xj and the separating hyper-
plane. Therefore, the xi with zi ≠ 0 have minimum distance to the hyperplane – these are
the support vectors.
Evaluating the linear function.
In order to classify a new incoming sample x using hθ in (12.2), we need to compute
the linear term ⟨w, x⟩ + b. Fortunately, given a dual solution z, we can easily recover w
from the optimality conditions (12.8). In fact,

  ⟨w, x⟩ = ⟨−Σ_{i=1}^{n} zi yi xi, x⟩ = −Σ_i zi yi ⟨xi, x⟩.   (12.11)

To recover b, we can use (12.10) and plug in any support vector (having zi ≠ 0), or take
the mean over all support vectors to increase numerical stability.

12.3 The Kernel Trick


The trouble with linear classifiers is that in practice, data can very rarely be separated by
a hyperplane, i.e., the decision boundary must be of a more complicated shape. Extending
the maximum-margin approach as in the previous section to non-linear decision boundaries
is entirely non-trivial, and generally impossible except for special cases. But fortunately
it turns out that we actually do not need to – instead, we transform the samples xi before
applying a linear classifier:
We transform the original training samples T = {(x1, y1), …, (xn, yn)} using a nonlinear
embedding η: F → F̄, where F ⊆ Rm is the original feature space and F̄ ⊆ Rm̄ is some
higher-dimensional feature space, usually with m̄ ≫ m:

  T̄ := {(x̄1, y1), …, (x̄n, yn)},  x̄i := η(xi).
The SVM approach then gives us a linear classifier function h̄θ(x̄) = sgn(⟨w, x̄⟩ − b) in the
space F̄, which induces the nonlinear decision function

  hθ(x) = sgn(⟨w, η(x)⟩ − b)
in the original space. For example, for two-dimensional data we could define

  η(x) := (x1², x2², √2 x1 x2, √2 x1, √2 x2, 1).   (12.12)

The decision boundary can then be any polynomial of degree 2 or less:

  hθ(x) = sgn(w1 x1² + w2 x2² + w3 √2 x1 x2 + w4 √2 x1 + w5 √2 x2 + w6 − b).

Unfortunately, such spaces can become huge very quickly for moderately large m – for
polynomials of degree 2 or less, we already need on the order of

  (m choose 2) = m (m − 1)/2

terms. As this defines the size of the optimization problem (12.5) and of the coefficient
vector w, it becomes too large very quickly.
The first key insight is the following: the size of z in the dual problem (12.7) does
not depend on the size m of the feature space – the size of z is given by the number of
samples n instead!

Unfortunately, the objective

  inf_{z∈Rn} (1/2) ‖Σ_{i=1}^{n} yi η(xi) zi‖₂² + e⊤ z

still requires us to map the xi into the high-dimensional space F̄. We start by rewriting the
quadratic part as

  (1/2) ‖Σ_{i=1}^{n} yi η(xi) zi‖₂² = (1/2) Σ_{i,j=1}^{n} yi yj ⟨η(xi), η(xj)⟩ zi zj
                                    = (1/2) Σ_{i,j=1}^{n} yi yj κ(xi, xj) zi zj,   κ(x, x′) := ⟨η(x), η(x′)⟩.   (12.13)
This is where the second key insight comes into play: by carefully choosing η, the kernel κ
can be evaluated without computing η. For η as in (12.12),

  κ(x, x′) = x1² x1′² + … + 2 x2 x2′ + 1 = (⟨x, x′⟩ + 1)²

does the trick. Once we have solved the dual problem, the inner product ⟨w, η(x)⟩ is,
according to (12.11),

  ⟨w, η(x)⟩ = −Σ_{i=1}^{n} zi yi ⟨x̄i, η(x)⟩ = −Σ_{i=1}^{n} zi yi κ(xi, x).   (12.14)
From a solution z ∈ Rn of the dual problem we can therefore compute b and evaluate the
decision function hθ(x) = sgn(⟨w, η(x)⟩ − b) without ever evaluating η! This concept is known
as the kernel trick.
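The claim is easy to verify numerically (a small Python check, not part of the notes; the test points are arbitrary): the explicit embedding η of (12.12) reproduces the kernel (⟨x, x′⟩ + 1)² exactly:

```python
import numpy as np

# explicit embedding (12.12) for two-dimensional data
def eta(x):
    r2 = np.sqrt(2.0)
    return np.array([x[0]**2, x[1]**2, r2 * x[0] * x[1], r2 * x[0], r2 * x[1], 1.0])

# the corresponding polynomial kernel of degree 2
kappa = lambda x, xp: (x @ xp + 1.0) ** 2

rng = np.random.default_rng(1)
for _ in range(100):
    x, xp = rng.standard_normal(2), rng.standard_normal(2)
    assert np.isclose(eta(x) @ eta(xp), kappa(x, xp))
print("kernel identity <eta(x), eta(x')> = (<x,x'> + 1)^2 verified")
```

Evaluating κ costs O(m) here, while the explicit inner product in F̄ already needs six terms for m = 2 – the gap widens rapidly with m.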
In order to evaluate hθ we still need to store the training vectors xi in order to compute
the terms κ(xi, x). But from (12.14) we see that we can discard any vectors with zi = 0,
and only need to store the vectors xi with zi ≠ 0. Any such vector must be a support vector
and thus minimize the distance to the separating hyperplane, which in practice is a very
rare occurrence – in order to evaluate (12.14), we will usually have to store only very few
of the xi, rather than the whole training data set.
We have shown that by solving the dual problem, we can find and evaluate an optimal
non-linear classifier in the potentially very high-dimensional space F̄ without ever having
to evaluate η. In fact, looking at (12.13), we find that we do not even have to know η
explicitly: We can substitute any kernel κ, as long as the dual problem stays convex. A
sufficient condition for this is that the matrix M ∈ Rn×n, Mij = κ(xi , x j ) is symmetric and
positive semidefinite for all choices of n and {x1, …, xn} (the yi correspond to multiplication
with diagonal matrices and do not affect positive semidefiniteness).
Additionally, the extended feature space F̄ can be infinite-dimensional. It can be shown
that for any positive definite κ, an embedding η into a so-called reproducing kernel Hilbert
space can be found that defines κ through η and its inner product (see for example [?]).
Within these bounds, the kernel can be chosen freely to suit the application. Popular
choices are:
  κ(x, x′) = ⟨x, x′⟩^q   (monomials of degree q),
  κ(x, x′) = (⟨x, x′⟩ + 1)^q   (polynomials of degree q or less),
  κ(x, x′) = exp(−(1/2) ‖x − x′‖₂²/σ²)   (Gaussian kernel),
and many more, see [Bis06, Chapt. 6.3] for an overview.
Chapter 13
Total Variation and Applications

13.1 Functions of Bounded Variation


In this section we will have a more detailed look at one of the most often used non-smooth
regularizers, the total variation (TV) of a function. Details can be found in [AFP00].
As a motivation, for C1 functions we can formulate the regularizer

  f(u) = ∫Ω ‖∇u(x)‖₂ dx.   (13.1)

We have already seen in practice that this regularizer can be very useful. To motivate the
choice of the term “total variation” for f, consider the one-dimensional case, and assume
that u is monotonically non-decreasing between two points a and b, i.e., u′(x) ≥ 0 for all
x ∈ [a, b]. Then

∫_a^b ‖∇u(x)‖₂ dx = ∫_a^b |u′(x)| dx = ∫_a^b u′(x) dx = u(b) − u(a).
The same argument can be made if u is non-increasing, in which case we get u(a) − u(b).
This means that if u is monotonic between two points a and b, then f(u) = |u(a) − u(b)|,
and the behaviour of u between the points is completely irrelevant, as long as it is
monotonic. Hence f counts the differences between extreme points of u, which gives
rise to the term “variation” of u.
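This endpoint property is easy to verify numerically with the discrete total variation Σ_i |u_{i+1} − u_i| (a small sketch; the sample functions are arbitrary):

```python
import numpy as np

def tv1d(u):
    # discrete 1D total variation: sum of absolute successive differences
    return np.abs(np.diff(u)).sum()

x = np.linspace(0.0, 1.0, 200)
u = x ** 3 + x                 # monotonically non-decreasing on [0, 1]
v = np.sin(6 * np.pi * x)      # oscillating: every rise and fall is counted

tv_u = tv1d(u)                 # equals u(b) - u(a) for monotone u
tv_v = tv1d(v)                 # much larger than |v(b) - v(a)|
```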
For general functions that are not necessarily differentiable (and may even be discon-
tinuous), we use the following definition. We generally assume Ω to be an open Lipschitz
domain in Rn.

Definition 13.1. For u ∈ L¹(Ω, Rᵐ), the total variation of u is defined as

TV(u) ≔ sup_{v ∈ C_c¹(Ω, R^{m×n}), ‖v‖_∞ ≤ 1} ∫_Ω ⟨u, Div v⟩ dx,  (13.2)

where v(x) = (v¹(x), …, vᵐ(x))ᵀ, Div v = (div v¹, …, div vᵐ), and for v ∈ C_c¹(Ω, R^{m×n}),

‖v‖_∞ = sup_{x∈Ω} ‖v(x)‖₂.

The space of functions of bounded variation is defined as

BV(Ω, Rᵐ) ≔ {u ∈ L¹(Ω, Rᵐ) | TV(u) < +∞}.

This can be understood as follows: If u ∈ C 1, i.e., ∇u exists as a function, then


∫_Ω ‖∇u(x)‖₂ dx = ∫_Ω sup_{v(x) ∈ R^{m×n}, ‖v(x)‖₂ ≤ 1} ⟨∇u(x), v(x)⟩ dx,

which is just rewriting the norm using its dual norm. It is possible to show that the
supremum and the integral can be swapped and v can be restricted to Cc1 functions:
∫_Ω ‖∇u(x)‖₂ dx = sup_{v ∈ C_c¹(Ω, R^{m×n}), ‖v‖_∞ ≤ 1} ∫_Ω ⟨∇u, v⟩ dx.

As all v have compact support, we can use the divergence theorem applied to ui v i (which
is just partial integration):
0 = ∫_Ω div(u_i vⁱ) dx  ⇒  ∫_Ω ⟨∇u_i, vⁱ⟩ dx = −∫_Ω u_i div vⁱ dx,

and get
∫_Ω ‖∇u(x)‖₂ dx = sup_{v ∈ C_c¹(Ω, R^{m×n}), ‖v‖_∞ ≤ 1} ∫_Ω ⟨u, Div v⟩ dx = TV(u).

This is also sometimes referred to as the dual formulation of the total variation. The space
BV can also be defined as the space of all u ∈ L¹ such that the gradient of u exists as a finite
Radon measure on Ω, denoted by Du, but we will not go into these details. The importance
of formulation (13.2), as opposed to (13.1), is that it does not require u to be differentiable,
or even continuous.
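In a discrete 1D setting with forward differences D, the dual formulation reads TV(u) = sup{Σ_i v_i (Du)_i : |v_i| ≤ 1}, and the supremum is attained at v = sign(Du); a quick numerical check (arbitrary random signal):

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.standard_normal(50)
Du = np.diff(u)                       # forward differences

tv_primal = np.abs(Du).sum()          # primal value of the discrete TV
v_opt = np.sign(Du)                   # feasible dual variable, |v_i| <= 1
tv_dual = v_opt @ Du                  # attains the supremum

# any other feasible v gives a value no larger than tv_primal
v_rand = np.clip(rng.standard_normal(Du.size), -1.0, 1.0)
lower = v_rand @ Du
```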
An important property of the total variation is that for characteristic functions of sets,
it reduces to boundary length/area of the set:

Proposition 13.2. Assume A ⊆ Ω is a set whose boundary is C¹ and satisfies
H^{n−1}(Ω ∩ ∂A) < ∞. Define

1_A(x) ≔ 1 if x ∈ A, and 1_A(x) ≔ 0 if x ∉ A.

Then

TV(1_A) = H^{n−1}(Ω ∩ ∂A).

Proof. The idea is that

TV(1_A) = sup_{v ∈ C_c¹(Ω, Rⁿ), ‖v‖_∞ ≤ 1} ∫_Ω 1_A div v dx
        = sup_{v ∈ C_c¹(Ω, Rⁿ), ‖v‖_∞ ≤ 1} ∫_A div v dx
        = sup_{v ∈ C_c¹(Ω, Rⁿ), ‖v‖_∞ ≤ 1} ∫_{∂A} ⟨v, n⟩ ds

by Gauss’ theorem. This immediately shows the upper bound TV(1_A) ≤ H^{n−1}(Ω ∩ ∂A). The
lower bound is slightly more difficult: intuitively we need v(x) = n(x) on the boundary, but
we have to show that v can be made sufficiently smooth, for example by constructing it as a
suitable L¹ vector field and approximating it by C_c¹ functions; see [AFP00] for general
remarks. □

Penalizing boundary length is a very natural way of favouring smooth contours, which
makes the total variation a good candidate for the problem of finding an optimal set, such
as in segmentation problems.

Theorem 13.3. (Coarea Formula) If u ∈ BV(Ω), then

TV(1_{x | u(x) > t}) < +∞  for L¹-a.e. t ∈ R,

and

TV(u) = ∫_{−∞}^{+∞} TV(1_{u>t}) dt.

Proof. See [AFP00, Thm. 3.4]. 

This can often be used to reduce the study of functions of bounded variation to the
study of their (super-)levelsets {u > t}.
Analogous to the definition of BV, we can define higher-order BV spaces by applying
this idea to the gradients of W k,1 functions as follows. For simplicity we assume the scalar
valued case, i.e., u: Ω → R.

Definition 13.4. For Ω ⊆ R^d and k ≥ 1, we define the space BV^k(Ω) as

BV^k ≔ {u ∈ W^{k−1,1} | ∇^{k−1}u ∈ BV(Ω, R^{d^{k−1}})}

and the higher-order total variation as

TV^k(u) = sup_{v ∈ C_c^k(Ω, Sym^k(R^d)), ‖v‖_∞ ≤ 1} ∫_Ω u div^k v dx  (13.3)
        = TV(∇^{k−1}u),

where Sym^k(R^d) is the space of symmetric tensors v of order k with arguments in R^d:

v = (v_{i₁,…,i_k})_{i₁,…,i_k = 1,…,d},

and v(z¹, …, z^k) = v(z^{π(1)}, …, z^{π(k)}) for all permutations π of {1, …, k}. The k-divergence
is defined as

div^k v = Σ_{(i₁,…,i_k) ∈ {1,…,d}^k} ∂_{x_{i₁}} ⋯ ∂_{x_{i_k}} v_{i₁,…,i_k}.

For k = 1 this reduces to the usual total variation, as then v ∈ C_c¹(Ω, R^d), the symmetry
condition disappears, and

div¹ v = Σ_{i₁ ∈ {1,…,d}} ∂_{x_{i₁}} v_{i₁} = div v.

Note that all functions in BV^k must have a (k − 1)-th derivative in L¹, which implies a
certain regularity (see, e.g., the Sobolev embedding theorem).
In practice, TV^k can again be implemented by discretizing the k-th order derivatives
using a matrix G^k ∈ R^{d^k n × n} and using a special norm defined as the sum of 2-norms over
the rows of G^k u:

TV^k(u) = ‖G^k u‖_* ≔ Σ_{i=1}^n ‖(G^k u)_i‖₂,

or the equivalent dual formulation

TV^k(u) = sup {⟨u, (G^k)ᵀ v⟩ | v ∈ R^{d^k × n}, v_{·i} ∈ Sym^k(R^d), ‖v_{·i}‖₂ ≤ 1}.

Usually finite-difference discretizations of G^k will lead to a symmetric tensor (e.g., for
d = 2 we would expect the discretization to only produce symmetric Hessians). If this is
the case, then (G^k u)_i ∈ Sym^k(R^d), which is a linear subspace of R^{d^k}, and for an arbitrary,
possibly non-symmetric tensor v ∈ R^{d^k} we can replace the inner product ⟨G^k u, v⟩ by
⟨G^k u, Π_{Sym^k(R^d)}(v)⟩. Since ‖Π_{Sym^k(R^d)}(v)‖₂ ≤ ‖v‖₂ (the projection onto a linear subspace can only
decrease the norm), this shows that we can drop the symmetry constraint on v, and get

TV^k(u) = sup {⟨u, (G^k)ᵀ v⟩ | v ∈ R^{d^k × n}, ‖v_{·i}‖₂ ≤ 1},  (13.4)

which is very similar to the formulation for the “plain” TV regularizer, and can in most
cases be easily substituted.
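As a quick 1D illustration of TV^k for k = 2: with second differences discretizing G², TV² vanishes on affine functions and only measures kinks in the slope, while the ordinary TV does not (a sketch; the sample functions are arbitrary):

```python
import numpy as np

def tv2_1d(u):
    # second-order TV: sum of absolute second differences
    return np.abs(u[2:] - 2.0 * u[1:-1] + u[:-2]).sum()

x = np.linspace(0.0, 1.0, 50)
affine = 2.0 * x + 3.0          # affine: second differences vanish
kink = np.abs(x - 0.5)          # piecewise affine: the slope jumps once

tv2_affine = tv2_1d(affine)     # ~0: affine parts are not penalized
tv2_kink = tv2_1d(kink)         # > 0: the kink is detected
tv_affine = np.abs(np.diff(affine)).sum()  # = 2: TV does penalize the slope
```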

13.2 Infimal Convolution and TGV


We consider the following reformulation of the basic ROF problem:
   
inf_u {½‖u − g‖₂² + λ TV(u)} = inf_{u,w: u+w=g} {½‖w‖₂² + λ TV(u)}.  (13.5)
Solving (13.5) can be seen as decomposing the given data g into two components u and
w, where w (the “noise”) is small with respect to the L2 norm – essentially assuming that
the noise follows a Gaussian distribution – and u has a small total variation, i.e., it “looks
like” a natural image as measured by the regularizer.
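The decomposition (13.5) can be computed with a few lines of NumPy via projected gradient descent on the dual of the ROF problem (a 1D sketch with forward differences; the step size, iteration count, and λ are ad-hoc choices, not part of the lecture):

```python
import numpy as np

def Dt(p):
    # adjoint of the forward difference D: contribution p_{i-1} - p_i at node i
    return np.concatenate([[-p[0]], p[:-1] - p[1:], [p[-1]]])

def rof_1d(g, lam, iters=2000):
    """min_u 0.5*||u - g||^2 + lam * sum_i |u_{i+1} - u_i|, solved via its dual
    min_{|p_i| <= 1} 0.5*||g - lam*D^T p||^2 with projected gradient descent
    (step 1/L, where L <= 4*lam^2 bounds the Lipschitz constant)."""
    p = np.zeros(g.size - 1)
    for _ in range(iters):
        u = g - lam * Dt(p)
        p = np.clip(p + np.diff(u) / (4.0 * lam), -1.0, 1.0)
    return g - lam * Dt(p)

# noisy step signal: u recovers the piecewise constant "structure",
# w = g - u is the "noise" component of the decomposition
rng = np.random.default_rng(2)
g = np.concatenate([np.zeros(30), np.ones(30)]) + 0.05 * rng.standard_normal(60)
u = rof_1d(g, lam=0.2)
w = g - u
```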
This decomposition can be easily extended to more terms – in particular, as the total
variation favours piecewise constant images in practice, small affine changes in the image
may end up in the noise variable w instead, which is not ideal. On the other hand, TV2
favours affine regions, but cannot handle discontinuities. Adding TV2 to the decomposition
allows discontinuities as well as affine structures:
 
inf_{u,v,w: u+v+w=g} {½‖w‖₂² + λ TV(u) + μ TV²(v)}.

We can then separately examine the piecewise constant (“cartoon”) part u, the piecewise
affine part v, and the noise w, or use u + v to get a denoised version of g.

Definition 13.5. (infimal convolution) For functions f₁, …, f_k: X → R̄ on an arbitrary set X, we
define the inf-convolution (f₁ □ ⋯ □ f_k): X → R̄ as

(f₁ □ ⋯ □ f_k)(x) = inf_{z¹, …, z^k: z¹ + ⋯ + z^k = x} (f₁(z¹) + ⋯ + f_k(z^k)).

In this primal formulation, infimal convolutions are difficult to deal with, as even evalu-
ating them involves solving an optimization problem. Fortunately, they have a very concise
dual representation:

Proposition 13.6. (conjugates of infimal convolutions) Assume f₁, …, f_k: Rⁿ → R̄ are
proper, lsc, convex. Then

(f₁ □ ⋯ □ f_k) = (f₁* + ⋯ + f_k*)*.

Proof. See example sheets. 
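As an illustration of the definition, the inf-convolution can be evaluated by brute force on a grid (not an efficient method, just a sketch): the inf-convolution of f₁(z) = |z| and f₂(z) = z²/2 is the Huber function, quadratic near the origin and growing linearly with slope 1 further out.

```python
import numpy as np

def infconv(f1, f2, x, z):
    # (f1 box f2)(x) = inf_z { f1(z) + f2(x - z) }, minimized over a grid of z
    return np.min(f1(z)[None, :] + f2(x[:, None] - z[None, :]), axis=1)

x = np.linspace(-3.0, 3.0, 61)
z = np.linspace(-10.0, 10.0, 4001)
h = infconv(np.abs, lambda t: 0.5 * t ** 2, x, z)

# closed form of this inf-convolution: the Huber function
huber = np.where(np.abs(x) <= 1.0, 0.5 * x ** 2, np.abs(x) - 0.5)
```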



Example 13.7. If f, g: Rⁿ → R are proper, lsc, convex, and positively homogeneous – as
is the case if f and g are norms or seminorms – then by Prop. 6.17 we know that f = (δ_C)*
and g = (δ_D)* for some closed convex non-empty sets C, D ⊆ Rⁿ. Then

(f □ g)* = δ_C + δ_D = δ_{C∩D},

and

(f □ g) = (δ_{C∩D})*.
For example, for discretizations G ∈ R^{2n×n} and H ∈ R^{4n×n} of the gradient and Hessian,
we can write

TV(u) = sup_{‖vⁱ‖₂ ≤ 1 ∀i} ⟨u, −Gᵀv⟩ = sup {⟨u, v′⟩ | v′ = Gᵀv, ‖vⁱ‖₂ ≤ 1 ∀i}
      = (δ_C)*,  C ≔ {v′ ∈ Rⁿ | ∃v ∈ R^{2×n}: v′ = Gᵀv, ‖vⁱ‖₂ ≤ 1 ∀i},  (13.6)
TV²(u) = (δ_D)*,  D ≔ {w′ ∈ Rⁿ | ∃w ∈ R^{4×n}: w′ = Hᵀw, ‖wⁱ‖₂ ≤ 1 ∀i}.
Then the combined regularizer

h = (TV □ TV²)

has the set representation

h = (δ_E)*,  E ≔ C ∩ D = {z ∈ Rⁿ | ∃v ∈ R^{2×n}, w ∈ R^{4×n}: z = Gᵀv = Hᵀw, ‖vⁱ‖₂ ≤ 1, ‖wⁱ‖₂ ≤ 1 ∀i}.
Solving a quadratic optimization problem with h as a regularizer amounts to a backward
step on h:

arg min_u {½‖u − g‖₂² + λ h(u)} = B_{λh}(g).

From the example sheets we know that B_{λf}(x) = x − λ B_{λ⁻¹f*}(x/λ), thus

B_{λh}(g) = g − λ B_{λ⁻¹δ_E}(g/λ) = g − λ Π_E(g/λ).
This can be interpreted as the splitting

g = u + y,  u = B_{λh}(g) = g − λ Π_E(g/λ),  y = g − u = λ Π_E(g/λ).
This has a very natural interpretation: For a given image g, we compute the noise by
(nonlinearly) projecting onto the set E – effectively, E defines the noise that we would like
to allow – and the original image by subtracting the noise.
The infimal convolution of two regularizers corresponds to intersecting their sets
of “allowed” noise.

In practice it was found that the combined (TV □ TV²) regularizer can often be
improved upon by using the Total Generalized Variation (TGV) [SST11, BKP10] instead:

Definition 13.8. (Total Generalized Variation) For u ∈ L¹, k ≥ 1, and α = (α₀, …, α_{k−1}) > 0,
the Total Generalized Variation is defined as

TGV^k_α(u) = sup {∫_Ω u div^k v dx | v ∈ C_c^k(Ω, Sym^k(R^d)), ‖div^j v‖_∞ ≤ α_j, 0 ≤ j ≤ k − 1}

with the convention that div0 v = v.

Comparing this to the definition of TV^k in (13.3), the difference lies in the additional
constraints on the lower-order divergences of v.

Consider the case k = 2 with α = (1, 1), and assume that the discretization of the gradient
can be decomposed as G² = G₂G₁, where G₁ ∈ R^{2n×n} discretizes the first-order derivatives
of u, and G₂ ∈ R^{4n×2n} discretizes the first-order derivatives of a vector field (in particular
of ∇u) with suitable boundary conditions. A suitable discretization of div¹ is then −G₂ᵀ,
which leads to (again assuming a symmetric discretization of the gradient)

TGV²(u) = sup {⟨u, (G²)ᵀv⟩ | v ∈ C₁, −G₂ᵀv ∈ C₂},
C₁ = {v ∈ R^{4×n} | ‖v_{·i}‖₂ ≤ 1 ∀i},
C₂ = {z ∈ R^{2×n} | ‖z_{·i}‖₂ ≤ 1 ∀i}.

We can rewrite this as follows:

TGV²(u) = sup_v {⟨u, (G₂G₁)ᵀv⟩ − δ_{C₁}(v) − δ_{C₂}(−G₂ᵀv)}
        = sup_v {⟨−G₁u, −G₂ᵀv⟩ − δ_{C₁}(v) − δ_{C₂}(−G₂ᵀv)}
        = sup_{v,w} {−⟨G₁u, w⟩ − δ_{C₁}(v) − δ_{C₂}(w) − δ_{{0}}(G₂ᵀv + w)}
        = sup_{v,w} {−⟨G₁u, w⟩ − δ_{C₁}(v) − δ_{C₂×{0}}((w, G₂ᵀv + w))}.

This has standard duality form, and we obtain

TGV²(u) = inf_{y₁,y₂} {δ*_{C₁}(−G₂y₂) + δ_{{0}}(y₁ + y₂ − G₁u) + δ*_{C₂}(y₁)}
        = inf_{y₁,y₂: y₁+y₂ = G₁u} {‖y₁‖_{C₂} + ‖G₂y₂‖_{C₁}}.  (13.7)

This is also known as the “differentiation cascade” formulation of TGV², and can be
extended to TGV^k. Comparing this to the standard infimal convolution,

(TV □ TV²)(u) = inf_{z¹,z²: z¹+z² = u} (TV(z¹) + TV²(z²))
             = inf_{z¹,z²: z¹+z² = u} (‖G₁z¹‖_{C₂} + ‖G₂G₁z²‖_{C₁}),  (13.8)

we see that TGV² is again a certain kind of infimal convolution, but instead of splitting
u into multiple components, the gradient of u is split. The cascading formulation is also
a convenient way of converting the sup definition of TGV² into a form that can be handled
by most solvers.

13.3 Meyer’s G-Norm


Meyer’s G-norm [Mey01, AC04] was introduced as a regularizer adapted to textured
regions, and is defined as

‖u‖_G = inf {‖v‖_∞ | div v = u, v ∈ L^∞(R^d)}.

After discretization, we again get

‖u‖_G = inf_v {‖v‖_∞ + δ_{{0}}(Gᵀv + u)}.
v

In saddle-point form:

‖u‖_G = inf_v {‖v‖_∞ + δ_{{0}}(Gᵀv + u)}
      = inf_v sup_w {‖v‖_∞ + ⟨w, Gᵀv + u⟩}
      = sup_w inf_v {‖v‖_∞ + ⟨w, Gᵀv + u⟩}
      = sup_w {⟨w, u⟩ − sup_v (⟨−Gw, v⟩ − ‖v‖_∞)}
      = sup_w {⟨w, u⟩ − (‖·‖_∞)*(−Gw)}
      = sup_w {⟨w, u⟩ | Σ_i ‖(Gw)ⁱ‖₂ ≤ 1}
      = sup_w {⟨w, u⟩ | TV(w) ≤ 1},

since (‖·‖_∞)* is the indicator of the unit ball of the dual norm Σ_i ‖·ⁱ‖₂, and Σ_i ‖(Gw)ⁱ‖₂
is exactly TV(w).
w
The G-norm is the dual norm to the total variation, i.e., the norm associated with the
TV unit ball. In particular,

‖·‖_G = (δ_{B_TV})*,  B_TV ≔ {u | TV(u) ≤ 1}.

In the same way, we get (compare (13.6))

TV(u) = sup {⟨u, −Gᵀv⟩ | ‖v‖_∞ ≤ 1}
      = sup {⟨u, w⟩ | ∃v: ‖v‖_∞ ≤ 1, w = −Gᵀv}
      = sup {⟨u, w⟩ | ‖w‖_G ≤ 1}
      = (δ_{B_G})*(u),  B_G ≔ {u | ‖u‖_G ≤ 1}.
Consider the problem of finding

arg min_u ½‖u − f‖₂²  s.t.  u ∈ λ B_G,

i.e., separating the “texture” component of f under the assumption that the texture “level”,
as measured in the G-norm, is at most λ. The solution is just the projection Π_{λB_G}(f),
which we can already relate to the ROF problem: from the example sheets we know that
B_f = I − B_{f*} (the Moreau decomposition), and since TV = (δ_{B_G})* we have λ TV = (δ_{λB_G})*, so

Π_{λB_G}(f) = f − B_{λTV}(f) = f − arg min_u {½‖u − f‖₂² + λ TV(u)}.
This explains why the G-norm is a good candidate for regularizing data containing texture:
solving the L²–‖·‖_G problem removes the part of an image that can be explained by a
“structure” component with low total variation.

In the same way, we can rewrite TV-constrained problems as

arg min_u {½‖u − f‖₂² s.t. TV(u) ≤ λ} = f − arg min_u {½‖u − f‖₂² + λ ‖u‖_G}.

13.4 Non-local regularization


Total variation regularization is inherently local, in the sense that for each point a small
neighbourhood is considered to contain enough information to determine the regularization
cost at that point. While this is very convenient for analysis, on real-world images this
assumption may not always be justified: Many images contain textured regions that are
highly oscillatory, but in a very regular sense – such as striped or checkerboard patterns.
Such regions would be highly penalized by a total variation regularizer, even if they contain
no noise.

In order to better cope with such situations, nonlocal regularizers have been pro-
posed [BCM05, GO08]. The idea is to measure the regularity of the image at a given
point not by the dissimilarity of its gray value to its immediate neighbours, but to other
points in the image that are in a similar location of the repeating pattern. These points
can be potentially far away, which gives rise to the name, non-local regularization.
In a discrete setting, this can be achieved as follows.
Definition 13.9. Denote Ω = {1, …, n}. For u ∈ Rⁿ, points x, y ∈ Ω, and a nonnegative
weighting function w: Ω² → R_{≥0}, we define the non-local partial derivative ∂_y u(x) as

∂_y u(x) = (u(y) − u(x)) w(x, y).

The non-local gradient of u at x for the weighting function w is the n-vector

∇_w: Rⁿ → R^{n×n},  ∇_w u(x) = (∂_y u(x))_{y∈Ω}.

The difference between ∇w and a usual discrete gradient is that ∇wu(x) is an n-vector
of all “partial derivatives” to all other points in the image, instead of a (usually) 2-vector
of the partial derivatives in x- and y-direction.
We can equally define a corresponding non-local divergence, which sums up all partial
derivatives:

div_w v(x) = Σ_{y∈Ω} (v(x, y) − v(y, x)) w(x, y).

With the usual Euclidean inner products on Rⁿ and R^{n×n}, it can be seen that for symmetric
w we have a discrete “divergence theorem”,

⟨−div_w v, u⟩ = ⟨v, ∇_w u⟩.
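This adjointness identity can be verified directly on random data; note that with the definitions above it relies on w being symmetric, w(x, y) = w(y, x), which holds for the patch-based weights used in practice (a sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 12
u = rng.standard_normal(n)           # "image" on Omega = {1, ..., n}
v = rng.standard_normal((n, n))      # dual field v(x, y)
w = rng.random((n, n))
w = 0.5 * (w + w.T)                  # symmetrize the weights

# (grad_w u)(x, y) = (u(y) - u(x)) w(x, y)
grad_w = (u[None, :] - u[:, None]) * w
# (div_w v)(x) = sum_y (v(x, y) - v(y, x)) w(x, y)
div_w = ((v - v.T) * w).sum(axis=1)

lhs = np.sum(-div_w * u)             # <-div_w v, u>
rhs = np.sum(v * grad_w)             # <v, grad_w u>
```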
We can now define regularizers based on the non-local gradient:

J(u) = Σ_{x∈Ω} g(∇_w u(x)).  (13.9)
Most gradient-based regularizers that involve norms can be immediately extended to this
setting. There is some freedom; for example, in the TV case we could define the gradient-
or difference-based versions

TV_NL^g(u) = Σ_{x∈Ω} ‖∇_w u(x)‖₂  or  TV_NL^d(u) = Σ_{x∈Ω} ‖∇_w u(x)‖₁.
For the ordinary total variation regularization, the 2-norm approach gives better results as
it better respects the isotropy of the total variation, but non-local regularization is inher-
ently anisotropic in any case due to the weighting function w, so there is no obvious “better”
candidate.
A relevant difference in practice is that while TV_NL^d only contains terms of the form
|(u(y) − u(x)) w(x, y)| and is separable otherwise, TV_NL^g is a sum of 2-norms of n-
dimensional vectors and is therefore much less separable, which makes it less attractive
from an optimization viewpoint. Nevertheless, both regularizers are convex and can be
reformulated in SOCP (TV_NL^g) or LP (TV_NL^d) form.
In the same way, it is possible to generalize many other gradient-based regularizers, such
as the p-norm or the G-norm, to nonlocal gradients.
The defining feature is the choice of the weighting function w. First we note that (13.9)
includes most usual discretizations of the total variation: setting w(x, y) = 1/h (where h is
the grid spacing) if x and y are neighbours, and w(x, y) = 0 otherwise, leaves the nonlocal
gradient essentially as the local gradient, extended with zeros.

The intuition is that w(x, y) should be large if the neighbourhoods of x and y are
similar, as measured by a patch distance. A classical choice is the patch distance

d_u(x, y) = ∫ K_σ(t) (u(y + t) − u(x + t))² dt,

which is just the L² distance weighted by a Gaussian K_σ centered at zero with variance
σ². While it would be possible to set, e.g., w(x, y) = 1/(ε + d_u(x, y)), this leaves us with a
very large optimization problem: for TV_NL^d, the objective would contain at least n² non-
smooth terms. Even for moderately-sized images with n ≈ 100000 this is clearly not feasible.
Therefore the weights are usually pruned first. The original choice is to define the set

A(x) ≔ arg min_{A ⊆ S(x), |A| = k} Σ_{y∈A} d_u(x, y)

for a given search neighbourhood S(x); A(x) then consists of the k points around x with
smallest patch distance. The weights are then simply set as

w(x, y) = 1 if y ∈ A(x) or x ∈ A(y), and w(x, y) = 0 otherwise.
The reason for introducing the search neighbourhood S(x) is that computing du(x, y) for
all pairs of points x, y can already be too expensive.
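A sketch of this construction for a 1D signal (the patch radius, search radius, and k are illustrative choices; a real implementation would need to be considerably more efficient):

```python
import numpy as np

def nl_weights(u, patch_rad=2, search_rad=10, k=5):
    """Binary non-local weights: w(x, y) = 1 iff y is among the k nearest
    neighbours of x in patch distance within the search window (symmetrized)."""
    n = u.size
    pad = np.pad(u, patch_rad, mode="edge")
    patches = np.stack([pad[i:i + 2 * patch_rad + 1] for i in range(n)])
    w = np.zeros((n, n))
    for x in range(n):
        lo, hi = max(0, x - search_rad), min(n, x + search_rad + 1)
        cand = np.array([y for y in range(lo, hi) if y != x])
        d = np.array([np.sum((patches[x] - patches[y]) ** 2) for y in cand])
        for y in cand[np.argsort(d)[:k]]:
            w[x, y] = 1.0
    return np.maximum(w, w.T)          # "y in A(x) or x in A(y)"

t = np.arange(60)
u = np.sin(2 * np.pi * t / 10.0)       # periodic texture with period 10
w = nl_weights(u)
# points one period apart have (near-)identical patches and get connected
```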
Generally non-local regularizers tend to work very well in practice, but they also require
more parameters to be chosen. In particular, the search window weights Kσ should be large
enough for the noise error to average out when comparing patches, but small enough to get
a good resolution. The main obstacle when implementing such methods is computational,
i.e., how to quickly compute the patch distances and how to prune the weights in a way
that keeps the optimization problem tractable.
Chapter 14
Relaxation
Many real-world problems are inherently non-convex. A typical example is the problem
of segmenting an image into two regions: For given image data g, find a set C ⊆ Ω that
best describes the foreground in the sense that it fits to the given data, but also adheres
to some prior knowledge about the typical shape of the foreground.
A typical energy is the Chan-Vese model [CV01],

f_CV(C, c₁, c₂) ≔ ∫_C (g − c₁)² dx + ∫_{Ω∖C} (g − c₂)² dx + λ H^{d−1}(∂C),

where H^{d−1}(∂C) is the perimeter (i.e., the length or area of the boundary ∂C), and
minimization is performed over (C, c₁, c₂). The constants c₁ and c₂ describe the typical
value of g inside (c₁) and outside (c₂) of C, i.e., the problem consists in identifying the
foreground and background regions together with model parameters (c₁, c₂) for each region.
The Chan-Vese model is in fact a special case of the Mumford-Shah model [MS89], one
of the best-studied – but still not fully understood – models in image processing:
f_MS(K, u) = ∫_Ω (g − u)² dx + μ ∫_{Ω∖K} ‖∇u‖₂² dx + ν H^{d−1}(K),

where K ⊆ Ω is closed and u is differentiable outside of K, i.e., u ∈ C¹(Ω ∖ K). Essentially,
this corresponds to the L² − L² denoising approach (?), with the exception that u is allowed
to be discontinuous on a “boundary” set K that should be “small” as measured by the
Hausdorff term.
If one requires that u is piecewise constant, and furthermore allows at most two values
c₁ and c₂, the term involving ‖∇u‖₂ vanishes, and the Mumford-Shah energy simplifies to
the Chan-Vese energy.
An efficient way of reformulating f_CV so that it can be handled numerically is to
represent the set C using its indicator function 1_C, and to require 1_C ∈ BV(Ω). As the total
variation of an indicator function is just the perimeter of the set, we obtain

f_CV(C, c₁, c₂) = ∫_Ω 1_C (g − c₁)² + (1 − 1_C)(g − c₂)² dx + λ TV(1_C).
We can now introduce a function u: Ω → {0, 1}, u ∈ BV(Ω), and minimize

inf_{u: Ω→{0,1}, u∈BV(Ω), c₁,c₂} ∫_Ω u (g − c₁)² + (1 − u)(g − c₂)² dx + λ TV(u)
= inf_{u: Ω→{0,1}, u∈BV(Ω), c₁,c₂} ∫_Ω u ((g − c₁)² − (g − c₂)²) + (g − c₂)² dx + λ TV(u).

There are two difficulties to overcome:


1. The optimization problem is of combinatorial nature, i.e., the constraint set consists
of a discrete set of points. This makes it difficult to apply optimization methods
that rely on taking small steps towards a minimizer based on derivative information.


2. Even if the constraint set was convex, the objective is not jointly convex in (u, c1, c2).
The second problem cannot be easily overcome. However, if we fix either u or c1, c2, then the
problem is convex in the other variable. In the remainder of this chapter we will therefore
assume that c1 and c2 are known and fixed. A possible way to obtain a (at least local)
solution for unknown c1, c2 is to alternate between minimization with respect to u and c1, c2.
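For fixed u, the minimization over (c₁, c₂) has a closed form: setting the derivatives of the (now decoupled) quadratic terms to zero gives the u-weighted and (1 − u)-weighted means of g (a discrete sketch):

```python
import numpy as np

def update_means(u, g):
    # argmin over (c1, c2) of sum_x u*(g - c1)^2 + (1 - u)*(g - c2)^2
    c1 = np.sum(u * g) / np.sum(u)
    c2 = np.sum((1.0 - u) * g) / np.sum(1.0 - u)
    return c1, c2

g = np.array([0.1, 0.0, 0.2, 0.9, 1.0, 0.8])   # toy "image"
u = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])   # current labeling
c1, c2 = update_means(u, g)                    # region means of g
```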
The first point can be addressed by a relaxation approach: Instead of our original,
nonconvex constraint set

{u ∈ BV(Ω)|u(x) ∈ {0, 1} a.e.}

we pass on to the convex hull of the constraint set:

{u ∈ BV(Ω)|u(x) ∈ [0, 1] a.e.}.

We obtain the energy

inf_{u: Ω→[0,1], u∈BV(Ω)} ∫_Ω u ((g − c₁)² − (g − c₂)²) + (g − c₂)² dx + λ TV(u).

This is in fact a convex problem. Moreover, we can set s(x) ≔ (g − c₁)² − (g − c₂)² and
ignore the constant term (g − c₂)², as it does not change the set of minimizers, to obtain

inf_{u: Ω→[0,1], u∈BV(Ω)} f(u),  f(u) ≔ ⟨u, s⟩ + λ TV(u).  (14.1)

This is even more appealing once one realizes that this problem is still convex if we replace
the terms (g − c_i)² by arbitrary (integrable) functions h₁ and h₂, and set s ≔ h₁ − h₂. We
have found a variation of the problem that is always convex, no matter how complicated
the data term is.
There is a slightly ambiguous understanding of what is meant by a convex relaxation
approach – while it is frequently used to loosely describe the process of constructing some
convex problem from a non-convex problem, in other publications it is understood in the
strict sense of replacing a non-convex function f by its convex hull con f .
The cost of removing the non-convexity is that minimizers of f are not necessarily
indicator functions anymore. We can say the following:

Proposition 14.1. Assume c1, c2 are fixed. If u is a minimizer of f, and u(x) ∈ {0, 1}
a.e. (in particular, u = 1C for some set C ⊆ Ω), then C is a minimizer of fCV(·, c1, c2).

Proof. Assume that u = 1_C for some set C; then

f(1_C) = f_CV(C, c₁, c₂).

Since 1_C is a minimizer of f, for every set C′ we have

f_CV(C′, c₁, c₂) = f(1_{C′}) ≥ f(1_C) = f_CV(C, c₁, c₂),

i.e., C must be a minimizer of f_CV. □

Unfortunately, the minimizer u of f may not be a characteristic function.

Definition 14.2. Denote C ≔ BV(Ω, [0, 1]) ≔ {u ∈ BV(Ω) | u(x) ∈ [0, 1] a.e.}. For u ∈ C
and α ∈ [0, 1], define

ū_α ≔ 1_{u>α},  1_{u>α}(x) = 1 if u(x) > α, and 1_{u>α}(x) = 0 if u(x) ≤ α.

We say that a functional f: C → R satisfies the generalized coarea condition iff

f(u) = ∫₀¹ f(ū_α) dα  ∀u ∈ C.

Proposition 14.3. Assume that s ∈ L^∞(Ω) and Ω is bounded. Then the function f in
(14.1) satisfies the generalized coarea condition.

Proof. For f(u) = TV(u), the condition is exactly the coarea formula (Thm. 13.3). As
the condition is additive, we only have to show it for the linear part:

∫_Ω s(x) u(x) dx = ∫_Ω s(x) ∫₀¹ 1_{u(x)>α} dα dx = ∫₀¹ ∫_Ω s(x) 1_{u(x)>α} dx dα.

Swapping the order of integration using Fubini’s theorem requires |s(x) 1_{u(x)>α}| to be
integrable on Ω × [0, 1], which holds since ∫_Ω |s(x) 1_{u(x)>α}| dx ≤ ‖s‖_∞ |Ω| < ∞ due to
s ∈ L^∞(Ω) and the boundedness of Ω. □

The most important result in this section is the following, generalized from [CEN06]:

Theorem 14.4. (Thresholding Theorem) Assume that f: BV(Ω, [0, 1]) → R satisfies the
generalized coarea condition, and u* satisfies

u* ∈ arg min_{u∈BV(Ω,[0,1])} f(u).

Then for almost every α ∈ [0, 1], the thresholded function ū*_α = 1_{u*>α} satisfies

ū*_α ∈ arg min_{u∈BV(Ω,{0,1})} f(u).

Proof. We define the set of α violating the assertion, S ≔ {α ∈ [0, 1] | f(ū*_α) ≠ f(u*)}. Since
ū*_α ∈ C_{0,1} ≔ BV(Ω, {0, 1}) and C_{0,1} ⊆ C, we have for any minimizer u*_{0,1} of f over C_{0,1},

f(u*) ≤ f(u*_{0,1}) ≤ f(ū*_α),  (14.2)

thus S = {α ∈ [0, 1] | f(u*) < f(ū*_α)}. Moreover, if α ∉ S, then f(u*) = f(u*_{0,1}) = f(ū*_α) by
(14.2). Therefore, in order to show the theorem it suffices to show that S is an L¹-zero set.

Assume the contrary, i.e., L¹(S) > 0. Then there must be ε > 0 such that

S_ε ≔ {α ∈ [0, 1] | f(u*) ≤ f(ū*_α) − ε}  (14.3)

also has nonzero measure, since otherwise S would be a countable union of zero-measure
sets, S = ⋃_{n∈N} S_{1/n}, and would consequently have zero measure as well. Then

f(u*) = ∫_{[0,1]∖S_ε} f(u*) dα + ∫_{S_ε} f(u*) dα  (14.4)
     ≤ ∫_{[0,1]∖S_ε} f(ū*_α) dα + ∫_{S_ε} f(u*) dα  (u* optimal)  (14.5)
     ≤ ∫_{[0,1]∖S_ε} f(ū*_α) dα + ∫_{S_ε} (f(ū*_α) − ε) dα  (definition of S_ε)  (14.6)
     = ∫₀¹ f(ū*_α) dα − ε L¹(S_ε)  (linearity).  (14.7)

But we assumed L¹(S_ε) > 0, therefore

f(u*) < ∫₀¹ f(ū*_α) dα.  (14.8)
0

This is a contradiction to the generalized coarea condition, therefore L¹(S) = 0 and the assertion follows. □

At the heart of the proof is the generalized coarea condition. It has the following
intuitive interpretation:

1. The function u may be written in the form of a “generalized convex combination”
of (an infinite number of) extreme points E_u ≔ {ū_α | α ∈ [0, 1]} of the constraint
set, i.e., the unit ball in BV(Ω). As shown in [Fle57] based on a result by Choquet
[Cho56], and noted in [Str83, p. 127], extreme points of this constraint set are (mul-
tiples of, but in this case equal to) indicator functions.

2. The extreme points (ū_α) and the coefficients in this convex combination can be explicitly
found. In fact, the coefficients are all equal to 1/|[0, 1]| = 1, i.e., u is the barycenter
of the points in E_u.

3. For any convex f, the inequality

∫₀¹ f(ū_α) dα ≥ f(u)  (14.9)

always holds. The generalized coarea condition is therefore equivalent to the reverse
inequality.
In fact, the original proof of the coarea formula [FR60] relies on showing (14.9) and on
the fact that equality holds for piecewise linear u [FR60, (1.5c)]. Approximating an arbitrary
u ∈ BV(Ω) by a sequence of piecewise linear functions, this result is then transported to
the general case.

Remark 14.5. It is important to understand that Thm. 14.4 in its current form only
holds before discretization: assume the problem is discretized according to

min_{u∈Rⁿ} f(u),  f(u) ≔ ⟨u, s⟩ + λ ‖Gu‖_*,  s.t. u_i ∈ [0, 1],  (14.10)

where ‖Gu‖_* is a suitable norm implementing the total variation. From Thm. 14.4 one
might naturally assume that if u* solves (14.10), then for almost every α ∈ [0, 1], the
thresholded minimizer ū*_α solves the combinatorial problem

min_{u∈Rⁿ} f(u)  s.t.  u_i ∈ {0, 1}.  (14.11)

This is generally not the case! The reason is that in order to transfer Thm. 14.4 to the
discretized problem, the discretized objective function needs to satisfy the generalized
coarea condition. While the linear term does not pose a problem, the usual “isotropic”
discretizations of the total variation, such as Σ_i ‖G_i u‖₂ with G_i implementing forward,
central, or finite differences on a staggered grid, do not have this property.
In finite dimensions, the integral

∫₀¹ f(ū_α) dα

is also known as the Lovász extension of an energy f: {0, 1}ⁿ → R. Energies with a
convex Lovász extension are called “submodular” and play a central role in combinatorial
optimization, as they form a large problem class that can be solved efficiently.
Submodular energies that consist of a sum of terms involving at most two variables
ui can be solved by computing a minimal cut through a graph, which can be achieved in
polynomial time by solving the dual problem with specialized “maximum-flow” methods
[BVZ01, Ber98].

The energy (14.10) can be made to satisfy a generalized coarea property, for example
by choosing a forward-difference discretization for G and setting

‖Gu‖_* = Σ_i ‖G_i u‖₁ = Σ_{i,j} (|u_{i+1,j} − u_{i,j}| + |u_{i,j+1} − u_{i,j}|)

(try to verify this as an interesting exercise). Unfortunately this discretization is no longer
isotropic – i.e., it does not converge to the actual total variation as the grid
spacing goes to zero – and shows a bias towards horizontal and vertical edges.
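The exercise can be checked numerically: for values u_{i,j} ∈ [0, 1], the map α ↦ 1_{u>α} is constant between consecutive values of u, so the integral ∫₀¹ ‖G 1_{u>α}‖_* dα can be evaluated exactly by sampling at midpoints (a sketch on a small random image):

```python
import numpy as np

def tv_aniso(u):
    # forward-difference l1 ("anisotropic") total variation
    return np.abs(np.diff(u, axis=0)).sum() + np.abs(np.diff(u, axis=1)).sum()

rng = np.random.default_rng(4)
u = rng.random((6, 6))                 # values in [0, 1]

# alpha -> 1_{u > alpha} is constant between consecutive values of u,
# so the integral over [0, 1] can be computed exactly from midpoints
levels = np.concatenate([[0.0], np.sort(np.unique(u)), [1.0]])
integral = 0.0
for a, b in zip(levels[:-1], levels[1:]):
    integral += (b - a) * tv_aniso((u > 0.5 * (a + b)).astype(float))
```

The same check fails for the isotropic discretization Σ_i ‖G_i u‖₂, which is exactly the point of the remark above.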
This appears to limit the usefulness of the thresholding theorem for practical purposes.
However, in practice it turns out that very often thresholded minimizers of (14.10) for non-
submodular (but isotropic) discretizations generate better segmentations – in terms of visual
appearance, not in terms of the energy – than solving the combinatorial problem. Also,
in many cases it can be advantageous to drop the requirement that the discretized solution
should only assume the values {0, 1} and allow intermediate values in [0, 1] – later pro-
cessing steps such as computing statistics on the geometry (center, area, boundary length,
variance, higher-order moments, etc.) and extraction of a boundary contour can actually
benefit from the increased accuracy that comes from allowing non-integer labels.
Bibliography
[AC04] Jean-François Aujol and Antonin Chambolle. Dual norms and image decomposition models.
Tech. Rep. 5130, INRIA, 2004.
[AFP00] L. Ambrosio, N. Fusco and D. Pallara. Functions of Bounded Variation and Free Discontinuity
Problems. Clarendon Press, 2000.
[BCM05] A. Buades, B. Coll and J.-M. Morel. A non-local algorithm for image denoising. In
Comp. Vis. Patt. Recogn. 2005.
[Ber98] D. P. Bertsekas. Network Optimization: Continuous and Discrete Models. Athena Scientific,
1998.
[Bis06] C. M. Bishop. Pattern Recognition and Machine Learning . Springer, 2006.
[BKP10] K. Bredies, K. Kunisch and T. Pock. Total generalized variation. J. Imaging Sci., 3(3):492–
526, 2010.
[BVZ01] Y. Boykov, O. Veksler and R. Zabih. Fast approximate energy minimization via graph cuts.
Patt. Anal. Mach. Intell., 23(11):1222–1239, 2001.
[CEN06] T. F. Chan, S. Esedoḡlu and M. Nikolova. Algorithms for finding global minimizers of image
segmentation and denoising models. J. Appl. Math., 66(5):1632–1648, 2006.
[Cho56] G. Choquet. Existence des représentations intégrales au moyen des points extrémaux dans les
cônes convexes. C. R. Acad. Sci., 243:669–702, 1956.
[CV01] T. F. Chan and L. A. Vese. Active contours without edges. IEEE Trans. Image Proc., 10(2):266–
277, 2001.
[Eck89] J. Eckstein. Splitting Methods for Monotone Operators with Application to Parallel Optimization.
PhD thesis, MIT, 1989.
[ET99] I. Ekeland and R. Témam. Convex analysis and variational problems. SIAM, 1999.
[Fle57] W. H. Fleming. Functions with generalized gradient and generalized surfaces. Annali di Matematica Pura ed Applicata, 44(1), 1957.
[FR60] W. H. Fleming and R. Rishel. An integral formula for total gradient variation. Archiv der Math-
ematik , 11(1):218–222, 1960.
[GO08] G. Gilboa and S. Osher. Nonlocal operators with applications to image processing. Multiscale
Mod. Simul., 7:1005–1028, 2008.
[Mey01] Y. Meyer. Oscillating Patterns in Image Processing and Nonlinear Evolution Equations,
volume 22 of Univ. Lect. Series. AMS, 2001.
[MS89] D. Mumford and J. Shah. Optimal approximations by piecewise smooth functions and associated
variational problems. Comm. Pure Appl. Math., 42:577–685, 1989.
[Roc70] R. T. Rockafellar. Convex Analysis. Princeton Univ. Press, 1970.
[RW04] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis. Springer, 2nd edition, 2004.
[SST11] S. Setzer, G. Steidl and T. Teuber. Infimal convolution regularizations with discrete l1-type
functionals. Comm. Math. Sci., 9(3):797–872, 2011.
[Str83] G. Strang. Maximal flow through a domain. Math. Prog., 26:123–143, 1983.

