Convex Optimization IP
Lecture Notes
Chapter 1
Introduction
1.1 Motivation
Why optimization in image processing? It is a convenient strategy to design algorithms
1. by defining what the result should look like, not how to find it,
2. that can be analyzed,
3. that are modular and can be easily modified as requirements change.
A classical example is image denoising, because the problem is very clear: we are given an
input image I: Ω ⊆ Rd → [0, 1] that is corrupted by noise – such as sensor noise or JPEG
artifacts – and would like to reconstruct the uncorrupted image u as accurately as possible.
Simply convolving the image with a Gaussian kernel or computing the average in a
small window around each point results in an image that is much too smooth, i.e., much
of the structure in the original image is lost.
A common first approach is to use “ad hoc” methods. For example, we might notice
that natural images tend to have a high amount of self-similarity, so it seems reasonable
to try to find similar patches in the image, and for each point compute the average over
the corresponding pixels in similar patches.
Such methods may work, and are often simple to implement and fast. The disadvantage
is that if they do not work, it may be very hard to precisely point out why. Similarly it is
often difficult to make specific statements about the quality of their output – which features
does it remove or preserve? Does it remove features below a certain size? How strong can
the noise be if we want it effectively removed? Does it keep edges?
A very useful way out of this dilemma is to follow a variational approach. We postulate
that the output of our method is the minimizer of a function
    min_u f(u).
This abstracts from the actual implementation of the algorithm, and allows us to focus on
modelling the problem, i.e., finding a function f such that the minimizer has the desired
properties.
A prototypical variational model is the following:
    min_{u: Ω→R} f(u) ≔ (1/2) ∫_Ω ‖u − I‖² dx + λ ∫_Ω ‖∇u‖² dx.
The left term is commonly called the data term, and serves to keep u close to the observed
input image I. The right term, which we refer to as the regularization term, is weighted
by a scalar λ > 0 and advocates solutions that are smooth. Finding suitable regularizers
is often the more difficult part, as they encode what we know about the characteristics of
the desired solution – our prior knowledge that does not depend on the actual data.
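The quadratic model above can be minimized with plain gradient descent. The following is a minimal 1-D sketch of my own (the discretization, step size, noise level, and λ are illustrative choices, not from the text):

```python
import random

def denoise_quadratic(I, lam=2.0, step=0.05, iters=2000):
    """Gradient descent on f(u) = 1/2*sum (u_i - I_i)^2 + lam*sum (u_{i+1} - u_i)^2."""
    u = list(I)
    n = len(u)
    for _ in range(iters):
        grad = [u[i] - I[i] for i in range(n)]        # gradient of the data term
        for i in range(n - 1):                        # gradient of the regularizer
            d = 2.0 * lam * (u[i + 1] - u[i])
            grad[i] -= d
            grad[i + 1] += d
        u = [u[i] - step * grad[i] for i in range(n)]
    return u

# A noisy step edge: the quadratic regularizer removes noise but smears the jump,
# which is exactly the over-smoothing of edges discussed for this model.
random.seed(0)
I = [0.0 + random.gauss(0, 0.05) for _ in range(20)] + \
    [1.0 + random.gauss(0, 0.05) for _ in range(20)]
u = denoise_quadratic(I)
```

Swapping in a different regularizer only changes a few lines of the gradient computation, which illustrates the modularity argued for in the motivation.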
Apart from this intuitive explanation, there is also a statistical interpretation: minimizing f is the same as maximizing e^{−f(x)}. In the above case, this is the product of a
Gaussian density with mean I and another density that encodes the regularizer. The
minimizer of the variational model is therefore the image u that is most likely with respect
to a certain density that depends on the input image.
A drawback of the above model is that it tends to over-smooth edges. The reason can
be easily seen by inspecting the regularizer: it clearly prefers continuous transitions between
two values c1 and c2 to a “jump” of the same height (going from 0 to 1 via 1/2 is penalized
by (1/2)² + (1/2)² = 1/2, but a sharp transition from 0 to 1 is penalized by 1² = 1). We
have to make sure that both cost the same, otherwise we will get smoothing!
This leads to the Rudin-Osher-Fatemi (ROF) model:
    min_{u: Ω→R} f(u) ≔ (1/2) ∫_Ω ‖u − I‖² dx + λ ∫_Ω ‖∇u‖ dx    (1.1)
(the notation is not very precise, since the ROF model requires a generalization of the
gradient to discontinuous functions, but it is enough to illustrate the point). The idea is
that monotone transitions between two values have exactly the same regularization cost,
regardless of how sharp the transition is.
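The penalty comparison above can be checked numerically on a tiny 1-D signal. The discrete total variation and squared-gradient sums below are my own illustrative discretizations of the two regularizers, not the chapter's exact notation:

```python
def tv(u):
    """Discrete total variation: sum of absolute successive differences."""
    return sum(abs(u[i + 1] - u[i]) for i in range(len(u) - 1))

def grad_sq(u):
    """Squared-gradient penalty as in the quadratic model."""
    return sum((u[i + 1] - u[i]) ** 2 for i in range(len(u) - 1))

step = [0.0, 0.0, 1.0, 1.0]          # sharp jump from 0 to 1
ramp = [0.0, 0.5, 1.0, 1.0]          # same transition via the value 1/2

print(tv(step), tv(ramp))            # both 1.0: TV does not prefer smearing the edge
print(grad_sq(step), grad_sq(ramp))  # 1.0 vs 0.5: the quadratic penalty prefers the ramp
```

This reproduces exactly the (1/2)² + (1/2)² = 1/2 versus 1² = 1 computation from the text: under TV both transitions cost 1, so the model has no incentive to smooth the jump.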
For the ROF model it is possible to show that if the input image I assumes only the
values 0 and 1 then jumps will be preserved. This highlights an important point: it is often
relatively easy to explain unexpected results from basic properties of the objective, and
then to slightly modify the objective to get rid of these properties.
Taking the idea one step further, we arrive at the TV − L1 model:
    min_{u: Ω→R} f(u) ≔ ∫_Ω ‖u − I‖ dx + λ ∫_Ω ‖∇u‖ dx.
Compared to ROF, this has the advantage that it does not reduce the contrast of the
image. In fact, it is possible to show that characteristic functions of discs with radius at
least λ/2 will be preserved, and discs with radius strictly less than λ/2 will be completely
removed – we have a good intuition of what happens to features of a certain size.
The drawback of variational methods is that the model description in terms of f does
not include any hint on how to implement the numerical minimization. In fact, there can be
many obstacles such as large-scale problems, f being non-differentiable, or having multiple
local minima.
This raises an important question: let us assume that the solver hits the stopping
criterion and returns a potential minimizer u of f , but we find that u is not what we would
like the output to look like. It could be that the model f is not appropriate, but it could
also be that the model is fine but the solver just did not find a good minimizer.
This is where the concept of convexity comes into play:
convexity ⇒ local minimizer = global minimizer.
If f is convex (and in fact the ROF model (1.1) is), then every local minimizer is also a
global minimizer. Therefore, if the output is not as expected (and the solver returns a
local minimum), we know that the only correct way to improve the situation is to modify
the model. Therefore convexity decouples the modelling and implementation/optimization
stage.
A difficulty when implementing convex solvers for image processing tasks is that we
often have to deal with problems that are large-scale and non-smooth.
This causes several problems: we cannot evaluate the gradient everywhere, and much less
the Hessian. Even if we could, the gradient is not Lipschitz-continuous and does not carry
any information about how close we are to the minimum. If we decide to replace the
non-smooth terms by a smoothed version (such as ‖∇u‖ ≈ (u_x² + u_y² + ε²)^{1/2}), then close to the
non-smoothness the gradient can be a very bad approximation of the behaviour of the function.
But the good news is:
In the last decades very efficient general-purpose and dedicated convex solvers have been
developed, so that even problems with 10 million variables can often be solved in reasonable
time. This is not the case for general non-convex problems: minimizing
is already challenging for n = 50 and practically impossible in general for n = 1000, because
at a minimum f needs to be evaluated at 2^n points.
11. Using Solvers and Discretization Issues: transforming problems into normal forms,
representing multi-dimensional data in vector form, implementing linear operators,
discretizing variational problems, adjoint differential operators
12. Support Vector Machines: supervised and unsupervised machine learning, linear
maximum-margin classifiers, reformulation as convex problem, dual problem and
optimality conditions, computing primal from dual solutions, support vectors,
kernel trick
13. Total Variation: functions of bounded variation, dual formulation of the total vari-
ation, coarea formula, higher-order total variation, infimal convolution, conjugate
of infimal convolution, total generalized variation, Meyer’s G-norm, non-local regularizers.
14. Relaxation: non-convexity in image processing, Chan-Vese, Mumford-Shah, convex
relaxation, generalized coarea condition, thresholding theorem, discretized energy,
anisotropy.
Definition 2.1. (extended real line) We define R̄ ≔ R ∪ {+∞, −∞} with the rules:
1. ∞ + c = ∞, −∞ + c = −∞ for all c ∈ R,
2. 0 · ∞ = 0, 0 · (−∞) = 0,
3. inf R = sup ∅ = −∞, inf ∅ = sup R = +∞,
4. +∞ − ∞ = −∞ + ∞ = +∞ (sometimes; careful: −∞ = λ(∞ − ∞) ≠ λ·∞ − λ·∞ = +∞
if λ < 0).
The last rule does not hold for sup! Example: f(x) = x, g(y) = −∞, then

    sup_{x,y} {f(x) + g(y)} = sup_{x,y} (x − ∞) = sup_{x,y} (−∞) = −∞ ≠ +∞ = +∞ − ∞ = sup_x f + sup_y g.
Another special case to watch out for is that inf C ≤ sup C does not hold if C = ∅. Also
a ≥ b ⇔ a − b ≥ 0 always holds, but a ≥ b ⇔ b − a ≤ 0 does not!
2. arg min f ≔ ∅ if f ≡ +∞, and arg min f ≔ {x ∈ Rn | f(x) = inf f} otherwise
(set of minimizers/optimal solutions),
3. f is “proper” :⇔ dom f ≠ ∅ and f(x) > −∞ ∀x ∈ Rn (i.e., f ≢ +∞ and f > −∞).
By the definition of the arg min, if x ∈ arg min f then f(x) < +∞; however, f(x) = −∞
is possible. Proper functions are the “interesting” functions for minimization: if f ≡ +∞
then the problem does not have any solution, and if there are x such that f(x) = −∞ then
arg min f consists of exactly these points.
Epigraphs are an alternative way to define functions and are often a convenient way to
derive properties of functions using theorems about sets. While every epigraph is a set, not
every set is the epigraph of a function – in fact a set C is an epigraph iff for every x there
is an α ∈ R̄ such that C ∩ ({x} × R) = {x} × [α, +∞), i.e., all vertical one-dimensional sections
must be closed upper half-lines. If f is proper then epi f is not empty and does not include a
complete vertical line.
“≤”: We show that lim inf_{x→x′} f(x) ≤ α for all α ∈ S ≔ {α ∈ R̄ | ∃(x_k) → x′: f(x_k) → α}.
For such an α take (x_k) → x′ from the definition of S. Then f(x_k) → α, and for all j there
exists k_j such that x_{k_j} ∈ B_{1/j}(x′). Therefore

    inf_{x ∈ B_{1/j}(x′)} f(x) ≤ f(x_{k_j})

    ⇒ lim inf_{x→x′} f(x) = lim_{j→∞} inf_{x ∈ B_{1/j}(x′)} f(x) ≤ lim_{j→∞} f(x_{k_j}) = lim_{k→∞} f(x_k) = α.
“>”: We show that there exists a sequence xk → x ′ such that f (xk) → lim infx→x ′ f (x): if this
is true, then “>” holds and we also have shown that the “min” notation is justified, i.e., S
contains a minimal element.
For every k we can find x_k ∈ B_{1/k}(x′) such that

    f(x_k) ≤ inf_{x ∈ B_{1/k}(x′)} f(x) + 1/k

(this requires that the lim inf is finite, but we can make a similar argument by adding −∞
if it is −∞). Since f(x_k) ≥ inf_{x ∈ B_{1/k}(x′)} f(x) this implies

    inf_{x ∈ B_{1/k}(x′)} f(x) ≤ f(x_k) ≤ inf_{x ∈ B_{1/k}(x′)} f(x) + 1/k.

Since this holds for all k we can take k → ∞ and obtain that

    lim_{k→∞} f(x_k) = lim_{k→∞} inf_{x ∈ B_{1/k}(x′)} f(x) = lim inf_{x→x′} f(x).
Example 2.9.
1. f(x) ≔ 1 for x > 0 and f(x) ≔ 0 for x ≤ 0 is lsc,
3. f = δ_C is lsc on the open sets int C and ext C. It is lsc on Rn iff C is closed:
lim inf_{x→x′} δ_C(x) is always 0 for x′ ∈ bnd C, therefore δ_C is lsc iff δ_C(x) = 0 for all
x ∈ bnd C, which is the case iff C is closed.
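The first example can be checked numerically via the characterization of the lim inf as the limit of infima over shrinking balls (the Lemma above). The helper below is a crude sampling-based approximation of my own, not a rigorous computation:

```python
def f(x):
    # Example 2.9.1: a lower semicontinuous step function
    return 1.0 if x > 0 else 0.0

def liminf_at(g, x0, radii=tuple(10.0 ** -k for k in range(1, 8)), samples=1000):
    """Crude numerical lim inf: inf of g over shrinking balls B_r(x0), then the limit."""
    vals = [min(g(x0 + r * (2 * i / samples - 1)) for i in range(samples + 1))
            for r in radii]
    # the infima are nondecreasing as r shrinks, so the limit is the largest value
    return max(vals)
```

At the jump point x′ = 0, every ball contains points x ≤ 0 with f(x) = 0, so the lim inf is 0 = f(0), and the lsc inequality f(x′) ≤ lim inf_{x→x′} f(x) holds with equality.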
Theorem 2.10. (semicontinuity and the epigraph) Let f : Rn → R̄. Then the following
properties are equivalent:
1. f is lsc on Rn,
2. epi f is closed in Rn × R,
Proof. Idea (1 ⇔ 2): epi f can only be not closed along vertical lines.
1.⇒2.: Assume (x_k, α_k) ∈ epi f, (x_k, α_k) → (x, α), α ∈ R. Then f(x_k) ≤ α_k (because
(x_k, α_k) ∈ epi f), and lim inf_{k→∞} f(x_k) ≤ lim inf_{k→∞} α_k = α. The second part of the Lemma
gives f(x) ≤ lim inf_{k→∞} f(x_k); therefore f(x) ≤ … ≤ α and (x, α) ∈ epi f.
2.⇒3.: We have

    epi f closed
    ⇒ epi f ∩ (Rn × {α′}) closed for all α′ ∈ R
    ⇔ {(x, α) ∈ Rn × R | α = α′, f(x) ≤ α} closed for all α′ ∈ R
    ⇔ {x ∈ Rn | f(x) ≤ α′} = lev_{≤α′} f closed in Rn for all α′ ∈ R.

This shows 3. for all α′ ∈ R. For α′ = +∞ we have lev_{≤+∞} f = Rn, which is always closed.
For the case α′ = −∞ we note that

    lev_{≤−∞} f = ⋂_{k∈N} lev_{≤−k} f,
Example 2.12. f₁(x) ≔ x² is level-bounded. f₂(x) ≔ x is not bounded below and not
level-bounded. f₃(x) ≔ 1/|x| (with the +∞ extension at 0) is bounded below and not
level-bounded. f₄(x) ≔ min {1, |x|} is bounded from below and not level-bounded, since
lev_{≤α} f₄ = Rn for α ≥ 1.
Proof. “⇒”: Assume we have ‖x_k‖₂ → +∞. Then for any α ∈ R there is K(α) such that
x_k ∉ lev_{≤α} f for k ≥ K(α), because all these sets are bounded. Therefore f(x_k) > α for all
k ≥ K(α). This holds for all α, therefore f(x_k) → +∞.
The property in Prop. 2.13 is also often referred to as coercivity in the literature.
For any α in the intersection, lev_{≤α} f is closed since f is lsc (Thm. 2.10, 3.). But f is also
level-bounded, therefore lev_{≤α} f is bounded. Together, lev_{≤α} f is compact.
We can also make the intersection countable (use α_k = inf f + 1/k if inf f ∈ R, and α_k = −k
if inf f = −∞). All sets in the intersection are nonempty and compact. Therefore Cantor’s
intersection theorem states that their intersection is also nonempty (and compact).
(Why? Take any sequence with x_k ∈ lev_{≤α_k} f; we can do this explicitly in Rn through the inf.
(x_k) has a converging subsequence because it lies completely in lev_{≤α₁} f, which is compact.
Denote the limit by x′. For every k we can find a “tail” of (x_k) that lies completely in
lev_{≤α_k} f (and still converges to x′), and because all these sets are closed we have x′ ∈ lev_{≤α_k} f.
Therefore x′ is also contained in the intersection.)
The only thing that remains to be shown is that inf f > −∞. Assume that inf f = −∞.
By the previous part, arg min f ≠ ∅. Therefore there exists x ∈ arg min f, but then
f(x) = inf f = −∞, which contradicts the properness of f.
Remark 2.15. The proof of the theorem does not require full level-boundedness; it suffices
to have lev_{≤α} f bounded and nonempty for at least one α ∈ R: closedness of the sublevel sets
follows from f being lsc, and boundedness is only required for all lev_{≤α′} f with α′ ≤ α, for
which it automatically holds because lev_{≤α′} f ⊆ lev_{≤α} f.
An example for such a function is f(x) = 1 − e^{−|x|}, which is bounded from above by 1
and therefore not level-bounded, but it is lsc, proper, and attains its minimum at x = 0. All
the sets lev_{≤α} f are bounded for α < 1, and, as one would expect, arg min f = {0} is
nonempty, closed and convex.
Proof. Exercise.
Chapter 3
Convexity
Definition 3.1. (convex sets and functions)
1. f : Rn → R̄ is “convex” : ⇔
Remark 3.2. C ⊆ Rn is convex iff for every two points x, y ∈ C the whole line segment
between them is contained in C:
If −f is convex then we say that f is concave. Note that this does not correspond to
reversing the inequality sign in (3.1) alone, due to the +∞ − ∞ = +∞ convention (Def. 2.1);
it also requires changing the convention to +∞ − ∞ = −∞ in addition to reversing the
sign. In order to avoid this problem we prefer saying that −f is convex to saying that f
is concave.
Example 3.3.
1. Rn is convex,
2. {x ∈ Rn | x ≥ 0} is convex,
3. {x ∈ Rn | ‖x‖₂ ≤ 1} is convex,
4. {x ∈ Rn | ‖x‖₂ ≤ 1, x ≠ 0} is not convex,
5. the half-spaces {x | a⊤x + b ≥ 0} are convex,
6. f(x) = a⊤x + b is convex (the inequality holds with equality) but not strictly convex,
7. f(x) = ‖x‖₂² is strictly convex,
8. f(x) = ‖x‖₂ is convex but not strictly convex.
Proof.
1. “⇐”: For m = 1, (3.2) is the convexity condition.
“⇒”: Induction: m = 0 is trivial, and the case m = 1 follows from convexity.
Assume that (3.2) holds for all m′ < m for some m ≥ 2, and consider an arbitrary
convex combination x = λ₀x₀ + ⋯ + λ_m x_m.
W.l.o.g. assume that all λ_i > 0 (otherwise we can remove λ_i from the sum together
with the corresponding x_i) and all λ_i < 1 (otherwise all other λ_i = 0 and we have
the trivial case m = 0). Then

    x = λ₀x₀ + ⋯ + λ_m x_m = (1 − λ_m) Σ_{i=0}^{m−1} (λ_i / (1 − λ_m)) x_i + λ_m x_m.
Thus

    f(x) ≤ (1 − λ_m) f( Σ_{i=0}^{m−1} (λ_i / (1 − λ_m)) x_i ) + λ_m f(x_m)      (case m′ = 1)
         ≤ (1 − λ_m) Σ_{i=0}^{m−1} (λ_i / (1 − λ_m)) f(x_i) + λ_m f(x_m)       (case m′ = m − 1)
         = Σ_{i=0}^{m} λ_i f(x_i).
Proof. Let x, y ∈ dom f, τ ∈ (0, 1). Then f(x), f(y) < +∞ and by convexity f((1 − τ)x +
τy) ≤ (1 − τ)f(x) + τf(y) < ∞, therefore (1 − τ)x + τy ∈ dom f.
Proof.
1.
    epi f convex ⇔ (1 − τ)(x₀, α₀) + τ(x₁, α₁) ∈ epi f  ∀(x₀, α₀), (x₁, α₁) ∈ epi f, τ ∈ (0, 1)
                 ⇔ f((1 − τ)x₀ + τx₁) ≤ (1 − τ)α₀ + τα₁  ∀x₀, x₁, ∀α₀ ≥ f(x₀), α₁ ≥ f(x₁), τ ∈ (0, 1)
                 ⇔ f((1 − τ)x₀ + τx₁) ≤ (1 − τ)f(x₀) + τf(x₁)  ∀x₀, x₁, ∀τ ∈ (0, 1)

(last equivalence: “⇒” follows immediately with α_i = f(x_i), “⇐” follows because
τ ∈ (0, 1) and therefore (1 − τ) > 0 and τ > 0).
2. similar with strict inequalities.
2. similar with strict inequalities.
Proof. α = +∞ is trivial (lev_{≤+∞} f = Rn). Let α < +∞, x, y ∈ lev_{≤α} f and τ ∈ (0, 1), then

    f((1 − τ)x + τy) ≤ (1 − τ)f(x) + τf(y)      (f convex)
                     ≤ (1 − τ)α + τα            (x, y ∈ lev_{≤α} f)
                     = α.

Therefore (1 − τ)x + τy ∈ lev_{≤α} f.
Proof.
1. If f ≢ +∞ then arg min f = lev_{≤inf f} f and we can use Prop. 3.8. If f ≡ +∞ then
arg min f = ∅, which is convex.
2. Assume x is a local minimizer and y ∈ Rn, and assume f (y) < f (x), i.e., x is not a
global minimizer. By convexity:
Proposition 3.10. (operations that preserve convexity) Let I be an arbitrary index set.
Then
1. f_i, i ∈ I convex ⇒ f(x) ≔ sup_{i∈I} f_i(x) is convex,
2. f_i, i ∈ I strictly convex, I finite ⇒ f(x) ≔ sup_{i∈I} f_i(x) is strictly convex,
3. C_i, i ∈ I convex ⇒ ⋂_{i∈I} C_i convex,
4. f_k, k ∈ N convex ⇒ f(x) ≔ lim sup_{k→∞} f_k(x) is convex.
Proof. Exercise.
1. from the definition of convexity and a_i ≤ b_i ⇒ sup_{i∈I} a_i ≤ sup_{i∈I} b_i.
2. same with strict inequalities for the finite sup.
3. δ_{⋂_{i∈I} C_i} = sup_{i∈I} δ_{C_i} and 1.
4. similar to 1.
Example 3.11.
1. The union of convex sets is generally not convex (but can be): C = [0, 1], D = [2, 3].
2. f : Rn → R convex and C convex ⇒ f + δC is convex ⇒ set of minimizers of f on
C is convex (Thm. 3.9).
3. f(x) = |x| = max {x, −x} is convex (Prop. 3.10).
4. f(x) = ‖x‖₂ = sup_{‖y‖₂≤1} y⊤x is convex (Prop. 3.10) (similarly: f(x) = ‖x‖_p is convex,
look at the dual norm ‖·‖_q with 1/p + 1/q = 1). It is not strictly convex: set x = 0, y ≠ 0,
then f(x/2 + y/2) = f(y/2) = f(x)/2 + f(y)/2.
Theorem 3.12. (derivative tests) Assume C ⊆ Rn is open and convex, and f : C → R (i.e.,
real-valued!) is differentiable. Then the following conditions are equivalent:
1. f is [strictly] convex,
2. ⟨y − x, ∇f(y) − ∇f(x)⟩ ≥ 0 for all x, y ∈ C [and > 0 if x ≠ y],
3. f(x) + ⟨y − x, ∇f(x)⟩ ≤ f(y) for all x, y ∈ C [and < f(y) if x ≠ y],
4. if f is additionally twice differentiable: ∇²f(x) is positive semidefinite for all x ∈ C.
If f is twice differentiable and ∇²f is positive definite, then f is strictly convex. The
converse does not hold.
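The monotonicity and tangent-line tests can be verified numerically for a concrete function; the random spot-check below for f(x) = x² on R is a sketch of my own, not part of the proof:

```python
import random

# f(x) = x^2 on R: check gradient monotonicity (condition 2) and the
# tangent-line inequality (condition 3) at random pairs of points.
f = lambda x: x * x
df = lambda x: 2 * x   # derivative, playing the role of the gradient in 1-D

random.seed(1)
for _ in range(1000):
    x, y = random.uniform(-5, 5), random.uniform(-5, 5)
    assert (y - x) * (df(y) - df(x)) >= 0            # condition 2: monotone derivative
    assert f(x) + df(x) * (y - x) <= f(y) + 1e-12    # condition 3: tangents lie below f
```

For this f both checks are exact identities: (y − x)(2y − 2x) = 2(y − x)² ≥ 0, and the tangent inequality reduces to (x − y)² ≥ 0.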
Remark 3.13. The second condition is a monotonicity condition on the gradient: in one
dimension it becomes
which is equivalent to y ≥ x ⇒ f′(y) ≥ f′(x). This means that the derivative must increase
when going towards larger values of x. The third condition says that f is never below any
of its local linear approximations. The fourth condition in one dimension means f″ ≥ 0,
i.e., the graph is “curved upwards”.
Proposition 3.14.
1. (nonnegative linear combination) Assume f₁, …, f_m: Rn → R̄ are convex, λ₁, …,
λ_m ≥ 0. Then

    f ≔ Σ_{i=1}^{m} λ_i f_i

is convex. If at least one of the f_i with λ_i > 0 is strictly convex, then f is strictly
convex.
2. (separable sum) Assume f_i: R^{n_i} → R̄, i = 1, …, m are convex. Then

    f: R^{n₁} × ⋯ × R^{n_m} → R̄,    f(x₁, …, x_m) ≔ Σ_{i=1}^{m} f_i(x_i)

is convex. If all f_i are strictly convex, then f is strictly convex.
3. (linear composition) Assume f: Rm → R̄ is convex, A ∈ R^{m×n}, and b ∈ Rm. Then

    g(x) ≔ f(Ax + b)

is convex.
Proof.
1. From Def. 3.1:

    f((1 − τ)x + τy) = Σ_{i=1}^{m} λ_i f_i((1 − τ)x + τy)
                     ≤[<] Σ_{i=1}^{m} λ_i ((1 − τ)f_i(x) + τf_i(y))
                     = (1 − τ)f(x) + τf(y).

2. From Def. 3.1:

    f((1 − τ)x + τy) = Σ_{i=1}^{m} f_i((1 − τ)x_i + τy_i)
                     ≤[<] Σ_{i=1}^{m} ((1 − τ)f_i(x_i) + τf_i(y_i))
                     = (1 − τ)f(x) + τf(y).

3. From Def. 3.1:

    g((1 − τ)x + τy) = f(A((1 − τ)x + τy) + b)
                     = f((1 − τ)(Ax + b) + τ(Ay + b))
                     ≤ (1 − τ)f(Ax + b) + τf(Ay + b)      (f convex)
                     = (1 − τ)g(x) + τg(y).
Proof. Exercise.
Definition 3.16. For any set S ⊆ Rn and any point x ∈ Rn, we define the “projection of
x onto S” as
    Π_S(y) ≔ arg min_{x∈S} ‖x − y‖₂.
Proof. We can rewrite the problem using a more convenient (differentiable) objective:

    Π_C(y) = arg min_{x∈C} (1/2)‖x − y‖₂²
           = arg min_x (1/2)‖x − y‖₂² + δ_C(x).
Quick proof:
• x ↦ (1/2)‖x − y‖₂² is lsc (it is continuous), level-bounded (f(x) → +∞ as ‖x‖₂ → +∞)
and proper (never −∞, not always +∞),
• δ_C(x) is lsc because C is closed (Ex. 2.9; alternatively Thm. 2.10: epi δ_C is closed
⇒ δ_C lsc),
⇒ f(x) ≔ (1/2)‖x − y‖₂² + δ_C(x) is lsc, level-bounded and proper.
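For simple sets the projection is available in closed form. The sketch below uses the box [0, 1]^n as an example set C (my choice for illustration; the coordinatewise clamping formula is the well-known closed form for boxes) and spot-checks the minimizing property:

```python
import random

def project_box(y, lo=0.0, hi=1.0):
    """Projection onto the box [lo, hi]^n: clamp each coordinate independently."""
    return [min(max(t, lo), hi) for t in y]

random.seed(2)
y = [random.uniform(-2, 3) for _ in range(5)]
p = project_box(y)

# p minimizes ||x - y||_2 over the box: no sampled feasible x is closer to y
for _ in range(200):
    x = [random.uniform(0, 1) for _ in range(5)]
    assert sum((a - b) ** 2 for a, b in zip(p, y)) <= \
           sum((a - b) ** 2 for a, b in zip(x, y)) + 1e-12
```

The box is a product of intervals, so the separable-sum structure makes the projection decompose coordinatewise.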
Remark 3.19. The convex hull con S is the smallest convex set that contains S: con S
is convex by Prop. 3.10 (intersection of convex sets) and every set C that is convex and
contains S also contains con S by definition.
Theorem 3.20. (convex hulls from convex combinations) Assume S ⊆ Rn, then

    con S = { Σ_{i=0}^{p} λ_i x_i | x_i ∈ S, λ_i ≥ 0, Σ_{i=0}^{p} λ_i = 1, p ≥ 0 }.
Proof. We denote the right-hand side by D and need to show that con S = D.
“⊇”: S ⊆ con S. con S is convex ⇒ (Thm 3.5): con S contains all convex combinations
of points in con S ⇒ con S contains all convex combinations of points in S ⇔ D ⊆ con S.
“⊆”: if x, y ∈ D then for some x_i, y_i, λ_i, μ_i:

    (1 − τ)x + τy = (1 − τ) Σ_{i=0}^{m_x} λ_i x_i + τ Σ_{i=0}^{m_y} μ_i y_i
                  = Σ_{i=0}^{m_x} ((1 − τ)λ_i) x_i + Σ_{i=0}^{m_y} (τμ_i) y_i,

with

    Σ_{i=0}^{m_x} (1 − τ)λ_i + Σ_{i=0}^{m_y} τμ_i = (1 − τ) Σ_{i=0}^{m_x} λ_i + τ Σ_{i=0}^{m_y} μ_i
                                                  = (1 − τ) · 1 + τ · 1
                                                  = 1.

⇒ convex combinations of elements in D are in D.
⇒ (Thm 3.5) D is convex.
⇒ (Def. 3.18, S ⊆ D) con S ⊆ D.
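The characterization of con S via convex combinations can be illustrated on a concrete finite set; the example below (corners of the unit square, my choice) samples random convex combinations and checks that they stay inside con S, which for this S is the full unit square:

```python
import random

# S: the four corners of the unit square; con S is the whole square [0,1]^2
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

random.seed(3)
for _ in range(500):
    w = [random.random() for _ in corners]
    s = sum(w)
    lam = [wi / s for wi in w]   # lam_i >= 0 and sum(lam) == 1: a convex combination
    x = sum(l * c[0] for l, c in zip(lam, corners))
    y = sum(l * c[1] for l, c in zip(lam, corners))
    assert 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0   # the combination lies in con S
```

Conversely, every point of the square is such a combination (e.g. the bilinear weights of its coordinates), matching the theorem's equality rather than just an inclusion.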
Remark 3.22. The closure is to closedness what the convex hull is to convexity:

    cl C = ⋂_{S closed, C ⊆ S} S.
Chapter 4
Cones and Generalized Inequalities
Definition 4.1. (cones) K ⊆ Rn is a “cone” :⇔

    0 ∈ K,    λx ∈ K  ∀x ∈ K, λ > 0.

A cone K is “pointed” :⇔

    x₁ + ⋯ + x_m = 0, x_i ∈ K ∀i ∈ {1, …, m}  ⇒  x₁ = ⋯ = x_m = 0.

Note that cones can also be nonconvex, such as the cone K = (R_{≥0} × {0}) ∪ ({0} × R_{≥0}).
Proposition 4.2. (convex cones) Assume K ⊆ Rn is an arbitrary set. Then the following
conditions are equivalent:
1. K is a convex cone,
2. K is a cone and K + K ⊆ K,
3. K ≠ ∅ and Σ_{i=0}^{m} α_i x_i ∈ K for all x_i ∈ K and α_i ≥ 0 (not necessarily summing to 1).

Proof. Exercise; idea: x = Σ_{i=0}^{m} α_i x_i ∈ K for α_i ≥ 0 ⇔ Σ_{i=0}^{m} (α_i / Σ_j α_j) x_i ∈ K, and the
latter is a convex combination with coefficients summing to 1 (if there is at least one α_i ≠ 0).
Proof. 1.–6. follow from the definition of a cone. Converse: recover K = {x ∈ Rn | x ≥ 0}; then by 5.,
x ≥ y ⇔ x − y ≥ 0 ⇔ x − y ∈ K ⇔ x ≥_K y. Then show that K is a closed convex cone.
Antisymmetry: x ≥_K y, y ≥_K x ⇔ x − y ∈ K, y − x ∈ K ⇔ x − y ∈ K ∩ (−K). This holds for
all x, y iff K ∩ (−K) = {0}, which by Prop. 4.3 is equivalent to K being pointed.
Definition 4.5. (conic program) For any pointed, closed, convex cone K ⊆ Rm, a matrix
A ∈ Rm×n and vectors c ∈ Rn, b ∈ Rm, we define the “conic program” or “conic problem”
(CP)
    inf_x c⊤x    s.t.  Ax ≥_K b.
then

    min_{x,y} y    s.t.  y ≥ x₁ − x₂,  y ≥ x₂ − x₁,  x₁ ≥ −1,  −x₁ ≥ 1,  x₂ ≥ 0,

and finally

    min_{(x₁,x₂,y)∈R³} (0, 0, 1) (x₁, x₂, y)⊤

    s.t.  ⎡ −1   1   1 ⎤            ⎡  0 ⎤
          ⎢  1  −1   1 ⎥ ⎡ x₁ ⎤    ⎢  0 ⎥
          ⎢  1   0   0 ⎥ ⎢ x₂ ⎥ ≥  ⎢ −1 ⎥
          ⎢ −1   0   0 ⎥ ⎣ y  ⎦    ⎢  1 ⎥
          ⎣  0   1   0 ⎦            ⎣  0 ⎦ .
Example 4.7. (second-order cone) The “second-order cone” (also called “Lorentz cone” or
“ice-cream cone”) is

    K_n^{SOCP} ≔ { x ∈ Rn | x_n ≥ √(x₁² + ⋯ + x_{n−1}²) }.
Note that for constant functions f ≡ c ∈ R we get ∂f(x) = {0} for all x ∈ Rn. However,
for the constant functions f ≡ +∞ and f ≡ −∞, from +∞ ≥ +∞ and −∞ ≥ −∞ (in fact
equality holds), we conclude that

    ∂(+∞)(x) = Rn,    ∂(−∞)(x) = Rn    ∀x ∈ Rn.
Proof. 1. follows from 2. using g ≡ 0, since then ∂g(x) = {0} for all x.
To show 2., we first show ∂(f + g)(x) ⊇ ∂g(x) + ∇f(x). If v ∈ ∂g(x), then

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩,
    g(y) ≥ g(x) + ⟨v, y − x⟩
The first inequality uses the same argument as in Thm. 3.12 (we cannot apply the theorem
directly because we do not know if f is differentiable in a neighborhood of x, but the proof
is the same; specifically, for all t > 0 we have

    (f(y) − f(x) − ⟨∇f(x), y − x⟩) / ‖y − x‖₂
      = (t f(y) − t f(x) − ⟨∇f(x), t(y − x)⟩) / (t ‖y − x‖₂)
      = ((1 − t) f(x) + t f(y) − f(x) − ⟨∇f(x), t(y − x)⟩) / (t ‖y − x‖₂)
      ≥ (f(x + t(y − x)) − f(x) − ⟨∇f(x), t(y − x)⟩) / (t ‖y − x‖₂)      (f convex).

This holds for all t > 0, and the last term converges to 0 as t ↘ 0 by the definition of
differentiability. Thus f(y) − f(x) − ⟨∇f(x), y − x⟩ ≥ 0.)
Adding the inequalities for f and g we obtain
This is possible because f (z) and f (x) must be finite close to x from the definition of
differentiability, and we can simplify:
    lim inf_{z→x} (g(z) − g(x) − ⟨v − ∇f(x), z − x⟩) / ‖z − x‖₂ ≥ 0.
Now consider z(t) ≔ (1 − t)x + ty. Then the above equation states

    lim inf_{t↘0} (g(z(t)) − g(x) − ⟨v − ∇f(x), z(t) − x⟩) / ‖z(t) − x‖₂ ≥ 0.
From the definition of convexity we get
Thus

    lim inf_{t↘0} ((1 − t)g(x) + t g(y) − g(x) − ⟨v − ∇f(x), z(t) − x⟩) / ‖z(t) − x‖₂ ≥ 0.

We assumed that g(x) is finite, so (1 − t)g(x) − g(x) = −t g(x) (if g(x) = +∞ this would give
∞ − ∞ = −∞, which is wrong). Also z(t) − x = t(y − x), and we get

    lim inf_{t↘0} (−t g(x) + t g(y) − ⟨v − ∇f(x), t(y − x)⟩) / (t ‖y − x‖₂) ≥ 0
    ⇔ lim inf_{t↘0} {g(y) − g(x) − ⟨v − ∇f(x), y − x⟩} ≥ 0
    ⇔ g(y) ≥ g(x) + ⟨v − ∇f(x), y − x⟩.
Since y was arbitrary this shows v − ∇f (x) ∈ ∂g(x), so v ∈ ∂g(x) + ∇f (x), and finally
∂(f + g) ⊆ ∂g + ∇f .
Proof.

    0 ∈ ∂f(x)
    ⇔ f(x) + ⟨0, y − x⟩ ≤ f(y)  ∀y ∈ Rn
    ⇔ f(x) ≤ f(y)  ∀y ∈ Rn
    ⇔ x ∈ arg min f      (using that f is proper).

In the last equivalence the properness is required, since arg min(+∞) = ∅ by definition, but
∂(+∞)(x) = Rn ∋ 0 for all x.
Definition 5.8. For a convex set C ⊆ Rn and x ∈ C, the “normal cone” NC (x) at x is
defined as
    N_C(x) ≔ {v ∈ Rn | ⟨v, y − x⟩ ≤ 0 ∀y ∈ C}.
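The defining inequality can be checked directly for a concrete set; the example below uses C = [0, 1]² and the corner x = (1, 1), where the normal cone is the nonnegative quadrant (my choice of example, easily verified by hand):

```python
import random

# C = [0,1]^2, x = (1,1). Claim: N_C(x) = {v : v >= 0 componentwise}.
# Check the defining inequality <v, y - x> <= 0 for sampled v >= 0 and y in C.
x = (1.0, 1.0)
random.seed(4)
for _ in range(500):
    v = (random.uniform(0, 2), random.uniform(0, 2))  # candidate normal, v >= 0
    y = (random.random(), random.random())            # arbitrary point of C
    assert v[0] * (y[0] - x[0]) + v[1] * (y[1] - x[1]) <= 0.0
```

The check is exact: every y ∈ C satisfies y − x ≤ 0 componentwise, so the inner product with any v ≥ 0 is ≤ 0.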
Proof.
For x ∈ C:
For x ∉ C: since C ≠ ∅ we can find a y ∈ C, i.e., δ_C(y) = 0. But then for any v ∈ ∂δ_C(x) we
have
Proposition 5.10. Assume C ⊆ Rn is closed and convex with C ≠ ∅, and x ∈ Rn. Then

    y = Π_C(x) ⇔ x − y ∈ N_C(y).

Proof. From Prop. 3.17 we know that y = Π_C(x) is the unique minimizer of

    f(y′) ≔ (1/2)‖y′ − x‖₂² + δ_C(y′).
This is the case iff 0 ∈ ∂f (y) (Thm. 5.7). From Prop. 5.6 and Prop. 5.9 we know that
∂f (y) = y − x + NC (y), thus
A consequence of Prop. 5.10 is that by looking at the normal cone for a fixed x ∈ C,
we can find all points y that get projected into x.
Proof. x ∉ dom f: f proper ⇒ there exists y such that f(y) ∈ R. If v ∈ ∂f(x) then by the
definition of the subdifferential

    v⊤(y − x) + f(x) ≤ f(y)  ⇒  +∞ ≤ f(y) ∈ R,

a contradiction, therefore ∂f(x) = ∅.
First part: for x ∈ dom f,

    v ∈ ∂f(x)
    ⇔ v⊤(y − x) + f(x) ≤ f(y)  ∀y ∈ Rn
    ⇔ v⊤(y − x) + f(x) ≤ α  ∀y ∈ dom f, ∀α ≥ f(y), α ∈ R
    ⇔ v⊤(y − x) + (−1)(α − f(x)) ≤ 0  ∀(y, α) ∈ epi f
    ⇔ (v, −1) ∈ N_{epi f}(x, f(x)).
Second part:
    v ∈ N_{dom f}(x)
    ⇔ v⊤(y − x) ≤ 0  ∀y ∈ dom f
    ⇔ v⊤(y − x) + 0 · (α − f(x)) ≤ 0  ∀y ∈ dom f, ∀α ≥ f(y)
    ⇔ v⊤(y − x) + 0 · (α − f(x)) ≤ 0  ∀(y, α) ∈ epi f
    ⇔ (v, 0) ∈ N_{epi f}(x, f(x)).
Definition 5.13. (relative interior) For any set C ⊆ Rn, we define the “affine hull” and
the “relative interior”
    aff C ≔ ⋂_{A affine, C ⊆ A} A,

    rint C ≔ {x ∈ Rn | there exists an (open) neighborhood N of x with N ∩ aff C ⊆ C}.
Proof. 1. can be shown directly. The proof for 2. is surprisingly difficult and we refer
to [Roc70, 23.8, 23.9]. 3. then follows from 2. with A = (I | ⋯ | I)⊤ and f(x₁, …, x_m) =
f₁(x₁) + ⋯ + f_m(x_m).
Remark 5.15. The conditions on the relative interiors in Prop. 5.14 are important: in 3., it
is easy to see that the direction “⊇” always holds. But the direction “⊆” requires more. Set

    f₁ ≔ δ_{B₁(0)},    f₂ ≔ δ_C,  C ≔ {1} × R.

Then ∂f₁(1, 0) = R_{≥0} × {0} and ∂f₂(1, 0) = N_C(1, 0) = R × {0}. Thus

    (∂f₁ + ∂f₂)(1, 0) = R × {0}.

But f₁ + f₂ = δ_{{(1,0)}}, thus ∂(f₁ + f₂)(1, 0) = R². This is only possible because f₁, f₂ do not
satisfy the condition on the relative interiors in Prop. 5.14.
Chapter 6
Conjugate Functions
Remark 6.2. The supremum on the right-hand side is convex (Prop. 3.10), majorized
by f and clearly majorizes every convex function majorized by f . Therefore con f is the
greatest convex function majorized by f .
Definition 6.3. (closure of a function) For f : Rn → R̄, the “(lower) closure” cl f is defined
as
    cl f(x) ≔ lim inf_{y→x} f(y).
Proof.

    (x, α) ∈ cl(epi f) ⇔ ∃ x_k → x, α_k → α with f(x_k) ≤ α_k
                       ⇔ lim inf_{y→x} f(y) ≤ α ⇔ cl f(x) ≤ α ⇔ (x, α) ∈ epi(cl f).

This holds because if there exists such a sequence, then the lim inf is ≤ α; and if the lim inf
is ≤ α, then take a sequence x_k with f(x_k) → lim inf. Then (x, lim inf_{y→x} f(y)) ∈ cl(epi f),
and therefore in particular (x, α) ∈ cl(epi f) since α ≥ lim inf_{y→x} f(y).
Second part: if f is convex then epi f is convex. For x, y ∈ epi(cl f) = cl(epi f) we have
x_k → x, y_k → y with x_k, y_k ∈ epi f. For every τ ∈ (0, 1), z ≔ (1 − τ)x + τy is the limit of
(1 − τ)x_k + τy_k, and all these points are in epi f because epi f is convex (f is convex).
Thus z is in cl(epi f) and therefore in epi(cl f). This shows that epi(cl f) is convex and
therefore cl f is convex.
Note that the corresponding statement for the convex closure does not hold, i.e., in general

    epi(con f) ≠ con(epi f).
Proof. “≤”: epi(cl f) is closed by Prop. 6.4 and therefore cl f is lsc by Thm. 2.10. Also
cl f ≤ f (i.e., lim inf_{y→x} f(y) ≤ f(x)), because of the definition of the lim inf (or take the
constant sequence y_k = x). Together this shows “≤”.
“≥”: If g ≤ f and g is lsc, then

    g(x) ≤ lim inf_{y→x} g(y) ≤ lim inf_{y→x} f(y) = (cl f)(x)      (g lsc, g ≤ f).
Theorem 6.6. (envelope representation of sets) Assume that C ⊆ Rn is closed and convex.
Then
    ⟨v, z − y⟩ ≤ 0  ∀z ∈ C
    ⇔ ⟨v, z⟩ − ⟨v, y⟩ ≤ 0  ∀z ∈ C      (set b ≔ v, β ≔ ⟨v, y⟩)
    ⇔ C ⊆ H_{b,β},

while x ∉ H_{b,β}.
In fact not all the half-spaces are needed in Thm. 6.6. If we assume C ≠ ∅ then it
is enough to intersect all supporting half-spaces, i.e., the half-spaces whose associated
hyperplane touches C. The following theorem shows that if C is the epigraph of a proper
lsc convex function, we do not need the vertical half-spaces.
Proof. [RW04, 12.1] f is lsc ⇒ epi f is closed (Thm. 2.10); f is convex ⇒ epi f is convex
(Prop. 3.7). Therefore epi f is the intersection of all half-spaces containing it (in Rn × R)
by Thm. 6.6:

    epi f = ⋂_{((b,c),β)∈S} H_{(b,c),β} =: I₁,

where H_{(b,c),β} = {(x, α) ∈ Rn × R | ⟨(x, α), (b, c)⟩ − β ≤ 0} and S denotes the set of
half-spaces containing epi f.
Claim: H_{(b,c),β} with c < 0 is the epigraph of an affine function, and conversely.
“⇒”: g(x) = ⟨b, x⟩ − β. Then (x, α) ∈ epi g ⇔ ⟨b, x⟩ − β ≤ α ⇔ ⟨(b, −1), (x, α)⟩ − β ≤ 0 ⇔ (x,
α) ∈ H_{(b,−1),β}.
“⇐”: (x, α) ∈ H_{(b,c),β} with c < 0 ⇔ ⟨b, x⟩ + αc − β ≤ 0 ⇔ ⟨b, x⟩ − β ≤ (−c)α. Since c < 0
this is equivalent to ⟨−b/c, x⟩ − (−β/c) ≤ α ⇔ (x, α) ∈ epi g with g(x) = ⟨−b/c, x⟩ + β/c.
Using the claim, and since

    epi ( sup_{g affine, g≤f} g ) = ⋂_{g affine, g≤f} epi g,

it suffices to show

    I₁ ≔ ⋂_{((b,c),β)∈S} H_{(b,c),β} = ⋂_{((b,c),β)∈S, c<0} H_{(b,c),β} =: I₂

in order to conclude epi f = epi ( sup_{g affine, g≤f} g ).
The direction “⊆” is clear. For “⊇” we need to show that if (x̄, ᾱ) ∉ I₁ then (x̄, ᾱ) ∉ I₂,
i.e., there exist ((b, c), β) ∈ S with c < 0 such that (x̄, ᾱ) ∉ H_{(b,c),β}.
Assume that (x̄, ᾱ) ∉ I₁. Then there exist ((b₁, c₁), β₁) ∈ S such that (x̄, ᾱ) ∉ H_{(b₁,c₁),β₁}.
If c₁ > 0 then
If x ∈ dom f then (x, f(x)) ∈ epi f, therefore (x, f(x)) ∈ H_{(b₁,c₁),β₁} and g₁(x) = ⟨x,
b₁⟩ − β₁ ≤ 0 on dom f.
Now take any ((b′, c′), β′) ∈ S with c′ < 0. Such a triple exists: if not, c = 0 holds for all
((b, c), β) ∈ S, which means that epi f is only bounded by vertical hyperplanes, and therefore
f is −∞ on all of dom f, which contradicts the properness of f.
We then define the associated affine function
Since epi f ⊆ H_{(b′,c′),β′} = epi g₂ we get (x, f(x)) ∈ epi g₂ and therefore g₂(x) ≤ f(x)
∀x ∈ dom f.
Last statement: f is convex, therefore con f = f. Thus con f is proper. Also con f = f
is lsc, therefore f** = cl con f = cl f = f.
The condition that con f is proper does not appear to be easy to validate at first. It
turns out that we can just compute f*, and if we obtain neither +∞ nor −∞ then con f
must be proper, as the following proposition shows:
Proof. The first statement follows in the same way as in the proof of the second statement
in Thm. 6.10: if con f is not proper then con f ≡ +∞ or there is x′ s.t. con f(x′) = −∞. If
con f ≡ +∞ then

    f*(v) = (con f)*(v) = sup_x {⟨v, x⟩ − con f(x)} = sup_x (−∞) = −∞.

If there is x′ s.t. con f(x′) = −∞ then

    f*(v) = (con f)*(v) = sup_x {⟨v, x⟩ − (con f)(x)} ≥ ⟨v, x′⟩ − (con f)(x′) = +∞.
The second statement is a special case (f proper ⇒ f ∉ {−∞, +∞}).
Theorem 6.12. (inversion rule for subdifferentials) Assume f : Rn → R̄ proper, lsc, convex.
Then
    ∂f* = (∂f)⁻¹,

specifically

    v ∈ ∂f(x) ⇔ f(x) + f*(v) = ⟨v, x⟩ ⇔ x ∈ ∂f*(v).
Moreover,

    ∂f(x) = arg max_{v′} {⟨v′, x⟩ − f*(v′)},
    ∂f*(v) = arg max_{x′} {⟨v, x′⟩ − f(x′)},

i.e., computing the subdifferential of f* at a single point is generally as hard as finding the
whole set of minimizers of f.
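The Fenchel equality in the theorem can be verified on the standard example f(x) = x²/2, which is its own conjugate (f*(v) = sup_x {vx − x²/2} = v²/2, attained at x = v; the hand computation is the assumption here, not part of the text):

```python
# f(x) = x^2/2 and its conjugate f*(v) = v^2/2, computed by hand.
# Thm. 6.12: v is a subgradient of f at x  <=>  f(x) + f*(v) = v*x.
f = lambda x: 0.5 * x * x
f_star = lambda v: 0.5 * v * v

for x in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    v = x   # here the subdifferential is the singleton {f'(x)} = {x}
    assert abs(f(x) + f_star(v) - v * x) < 1e-12   # Fenchel equality holds
    # for an arbitrary pair only Young's inequality f(y) + f*(v) >= v*y holds:
    assert f(1.0) + f_star(v) - v * 1.0 >= -1e-12
```

Equality in Young's inequality is exactly the subgradient relation, mirroring v ∈ ∂f(x) ⇔ x ∈ ∂f*(v) for this self-conjugate f.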
Proposition 6.14. (conjugation in product spaces) Let fi: Rni → R̄, i = 1, …, m be proper and f (x1, …, xm) ≔ f1(x1) + … + fm(xm). Then f ∗(v1, …, vm) = f1∗(v1) + … + fm∗(vm).
Definition 6.15. (support functions) For any set S ⊆ Rn we define the “support function” σS(v) ≔ sup_{x∈S} ⟨v, x⟩. A function f is called “positively homogeneous” iff
f (λ x) = λ f (x) ∀x ∈ Rn , ∀λ > 0.
Proposition 6.17. (support functions, polar cones) The set of positively homogeneous proper lsc convex functions and the set of closed convex nonempty sets are in one-to-one correspondence through the Legendre-Fenchel transform:
(δC)∗ = σC , x ∈ ∂σC (v) ⇔ x ∈ C , σC (v) = ⟨v, x⟩ ⇔ v ∈ NC (x).
In particular, the set of closed convex cones is in one-to-one correspondence with itself: for
any cone K define the “polar cone” (also sometimes referred to as the “dual cone”) as
K ∗ ≔ {v ∈ Rn | ⟨v, x⟩ ≤ 0 ∀x ∈ K }.
Then
(δK)∗ = δK ∗ , x ∈ NK ∗(v) ⇔ v ∈ NK (x) ⇔ x ∈ K , v ∈ K ∗, ⟨x, v⟩ = 0 ⇔ 0 ≤K x ⊥ v ≥K ∗ 0.
Thus g ∗(x) ∈ {0, +∞}, and g ∗ is an indicator function of some set C, which must be closed, convex, and nonempty because g ∗ is proper, lsc, and convex.
One-to-one correspondence: from (δC )∗∗ = δC .
From Thm. 6.10 we get
x ∈ ∂σC (v) ⇔ v ∈ ∂(σC )∗(x) ⇔ v ∈ ∂δC (x) ⇔ v ∈ NC (x) (Prop. 5.9),
x ∈ ∂σC (v) ⇔ δC (x) + σC (v) = ⟨v, x⟩ ⇔ x ∈ C , σC (v) = ⟨v, x⟩.
Cones: We claim: g is a positively homogeneous lsc convex proper indicator function ⇔ g = δK with K a closed convex cone. For λ > 0 we have x ∈ K ⇔ λ x ∈ K, therefore
δK (λ x) = { 0 if λ x ∈ K, +∞ otherwise } = { 0 if x ∈ K, +∞ otherwise } = δK (x) = λ δK (x),
where the last equality uses δK (x) ∈ {0, +∞}. Thus δK is positively homogeneous. Other direction: let g = δC be a positively homogeneous lsc convex indicator function; then 0 ∈ dom g, thus 0 ∈ C, and x ∈ C ⇒ g(x) = 0 ⇒ g(λ x) = λ g(x) = 0 (positive homogeneity) ⇒ λ x ∈ C, thus C is a cone.
Take any such pos. hom. lsc convex indicator function δK . By the first part (forward direction) (δK )∗ is positively homogeneous lsc convex. But δK is itself positively homogeneous lsc convex, therefore by the first part (backward direction) (δK )∗ is an indicator function, i.e., (δK )∗ = δK ′ for some closed convex set K ′. We have
(δK )∗(v) < +∞ ⇔ sup_{x∈K} ⟨v, x⟩ < +∞
⇔ sup_{x∈K} ⟨v, x⟩ ≤ 0 (K cone)
⇔ v ∈ {v ′ | ⟨v ′, x⟩ ≤ 0 ∀x ∈ K } = K ∗,
thus K ′ = K ∗ and (δK )∗ = δK ∗.
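The polar-cone correspondence can be illustrated with a brute-force membership test (an ad-hoc sketch; K = R²≥0 sampled on a finite grid, for which the test is exact on the chosen candidates):

```python
import itertools

# v ∈ K* iff <v, x> <= 0 for all x ∈ K; for K = R^2_{>=0} this gives K* = R^2_{<=0}.
def in_polar(v, cone_samples):
    return all(v[0] * x[0] + v[1] * x[1] <= 1e-12 for x in cone_samples)

# grid of sample points of the nonnegative orthant (includes the unit vectors)
K_samples = [(0.5 * i, 0.5 * j) for i in range(5) for j in range(5)]

for v in itertools.product([-1.0, -0.5, 0.0, 0.5, 1.0], repeat=2):
    expected = v[0] <= 0 and v[1] <= 0          # membership in R^2_{<=0}
    assert in_polar(v, K_samples) == expected
```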
Chapter 7
Duality in Optimization
Definition 7.1. (primal and dual optimization problems, perturbation formulation)
Assume f : Rn × Rm → R̄ is proper, lsc, and convex. We define the “primal” and “dual”
problems
inf_{x∈Rn} ϕ(x), ϕ(x) ≔ f (x, 0), sup_{y∈Rm} ψ(y), ψ(y) ≔ −f ∗(0, y).
f is sometimes called a “perturbation function” for ϕ, and p(u) ≔ inf_x f (x, u) the associated “marginal function”; analogously q(v) ≔ inf_y f ∗(v, y) is the marginal function of the dual.
A typical example is f (x, u) = ½ ‖x − I ‖₂² + δ≥0(A x − b + u). The extra variables u are used to perturb the constraints A x ≥ b to A x ≥ b − u.
4. The primal and dual problems are feasible iff the domain of their associated marginal
function contains 0:
inf_x ϕ(x) < +∞ ⇔ 0 ∈ dom p,
sup_y ψ(y) > −∞ ⇔ 0 ∈ dom q.
Proof.
1. f proper lsc convex ⇒ f ∗ is proper lsc convex (Thm. 6.10, con f is proper) ⇒ ϕ,
ψ lsc convex (not necessarily proper!).
2. Convexity of p: We consider the strict epigraph set of p:
E ≔ {(u, α) ∈ Rm × R | p(u) = inf_{x∈Rn} f (x, u) < α}
= {(u, α) ∈ Rm × R | ∃x: f (x, u) < α}
= A({(x, u, α) ∈ Rn × Rm × R | f (x, u) < α}) ≕ A(E ′),
where A is the linear (coordinate projection) mapping A(x, u, α) 8 (u, α). The
first equality requires the strict inequality, i.e., we cannot just use epi p for the
argument. E ′ is the strict epigraph of f and thus convex (Prop. 3.7) ⇒ A(E ′) is
convex (Prop. 3.15). Thus p is convex (Prop. 3.7).
The same argument applied to q and f ∗ shows that q is convex.
3. First part by definition: p(0) = inf_x f (x, 0) = inf ϕ. Analogously, q(0) = inf_y f ∗(0, y) = −sup ψ.
4. From the definitions: 0 ∈ dom p ⇔ p(0) < +∞ ⇔ infxf (x, 0) < +∞ ⇔ inf ϕ < +∞,
similar for q.
Theorem 7.3. (weak and strong duality) Assume f satisfies the assumptions in Def. 7.1.
Then “weak duality” always holds:
inf_x ϕ(x) ≥ sup_y ψ(y),
and under certain conditions the infimum and supremum are equal and finite (“strong duality”):
inf_x ϕ(x) = sup_y ψ(y) ∈ R.
“⇐”: Since p∗∗(0) ≤ cl p(0) ≤ p(0) holds for arbitrary p, the right-hand side implies lim inf_{y→0} p(y) = cl p(0) = p(0) ∈ R, thus p is lsc in 0.
“⇒”: We claim that if the left-hand side holds then cl p is proper lsc convex. Convexity
and lower semi-continuity is clear from Prop. 7.2 and the definition of the closure. cl p must
then also be proper: cl p is not constant +∞ because cl p(0) 6 p(0) < +∞. If there were
y s.t. cl p(y) = −∞ then (cl p convex) cl p((1 − t) 0 + t y) 6 (1 − t) cl p(0) + t cl p(y) = −∞
for all t ∈ (0, 1). Since by assumption cl p(0) = p(0) ∈ R, this means cl p(t y) = −∞
for all t ∈ (0, 1). Moreover, t y → 0 for t → 0 and cl p is lsc (in particular in 0), thus
cl p(0) 6 lim inft→0cl p(t y) = −∞. But this would mean p(0) = −∞, because p is lsc in 0,
which implies p(0) = cl p(0). Thus cl p must be proper, lsc, and convex.
An alternative way of proving that cl p is proper is to use the fact that any improper,
convex, lsc function is constant +∞ or −∞ (example sheets), which contradicts p(0) ∈ R.
Because p∗ = (cl p)∗ always holds (Thm. 6.10),
(p∗)∗(0) = ((cl p)∗)∗(0) = (cl p)∗∗(0) = cl p(0) = p(0),
using Thm. 6.10 (cl p proper lsc convex) in the third equality and p lsc in 0 in the last.
Together with the first part this shows inf ϕ = sup ψ = p(0), which is finite by assumption.
Proposition 7.4. (primal-dual optimality conditions) Assume f satisfies the assumptions
in Def. 7.1. Then we have the “primal-dual optimality conditions”
(0, y ′) ∈ ∂f (x ′, 0) ⇔ { x ′ ∈ arg min_x ϕ(x), y ′ ∈ arg max_y ψ(y), inf_x ϕ(x) = sup_y ψ(y) } ⇔ (x ′, 0) ∈ ∂f ∗(0, y ′). (7.2)
The set of “primal-dual optimal points” (x ′, y ′) satisfying (7.2) is either empty or equal to (arg min ϕ) × (arg max ψ).
Proof. We know from Thm. 6.12 that
(0, y ′) ∈ ∂f (x ′, 0) ⇔ (x ′, 0) ∈ ∂f ∗(0, y ′)
⇔ f (x ′, 0) + f ∗(0, y ′) = hx ′, 0i + h0, y ′i
⇔ f (x ′, 0) = −f ∗(0, y ′) ∈ R
⇔ ϕ(x ′) = ψ(y ′) ∈ R.
Because inf ϕ ≥ sup ψ always holds and ϕ(x ′) = ψ(y ′) shows inf ϕ ≤ sup ψ, this is equivalent to
inf_x ϕ(x) = sup_y ψ(y) ∈ R, x ′ ∈ arg min ϕ, y ′ ∈ arg max ψ.
This is again equivalent to
inf_x ϕ(x) = sup_y ψ(y), x ′ ∈ arg min ϕ, y ′ ∈ arg max ψ,
since equality with an infinite value would imply either ϕ(x ′) = +∞ or ψ(y ′) = −∞, both
of which are explicitly excluded through the definition of the arg min .
If the set of x ′, y ′ that satisfy the conditions is non-empty, then inf ϕ = sup ψ must
hold with a finite value as seen above. Thus x ′, y ′ satisfy the conditions iff x ′ ∈ arg minϕ
and y ′ ∈ arg max ψ, which proves the last statement.
Proposition 7.5. (sufficient conditions for strong duality) Assume f satisfies the assump-
tions in Def. 7.1. Then
0 ∈ int dom p or 0 ∈ int dom q ⇒ inf_x ϕ(x) = sup_y ψ(y), (S ′)
0 ∈ int dom p and 0 ∈ int dom q ⇒ inf_x ϕ(x) = sup_y ψ(y) ∈ R, (S)
(note that in the first case equality may hold with the value +∞ or −∞) and
0 ∈ int dom p and inf_x ϕ(x) ∈ R ⇔ arg max_y ψ(y) nonempty and bounded, (P)
0 ∈ int dom q and sup_y ψ(y) ∈ R ⇔ arg min_x ϕ(x) nonempty and bounded. (D)
In particular, if any of the conditions (S), (P ), (D) holds, then strong duality holds, i.e.,
inf ϕ = sup ψ ∈ R. Moreover, if (S) holds, or (P ) and (D) both hold, then there exist x ′,
y ′ satisfying the primal-dual optimality conditions ( 7.2).
Also, ∂p(0) = arg max_y ψ(y) (and the corresponding statement holds for q).
Proof. Assume 0 ∈ int dom p. If p(0) = −∞ then p∗∗(0) 6 p(0) = −∞, thus sup ψ = p∗∗(0) =
p(0) = inf ϕ = −∞. The only other possibility is p(0) ∈ R since 0 ∈ int dom p. Since cl p = p
on int dom p (example sheets) we know that if 0 ∈ int dom p then p is lsc in 0 and we can
apply Thm. 7.3 to get p∗∗(0) = p(0) ∈ R, which shows inf ϕ = sup ψ (now with a finite value).
We can apply the same argument to f ′(x, y) ≔ f ∗(y, x); the approach is completely symmetric:
Then 0 ∈ int dom q ⇒ 0 ∈ int dom p ′. From the first part we know that then inf ϕ ′ = sup ψ ′,
but this means −supψ = −inf ϕ, thus inf ϕ = sup ψ.
If both 0 ∈ int dom p, 0 ∈ int dom q hold, then additionally +∞ > p(0) > p∗∗(0) = sup ψ =
−q(0) > −∞, thus the value is finite.
Non-emptyness and boundedness: See [RW04, Thm. 11.39 proof]; the idea is to show
that 0 ∈ int dom p if and only if ψ is proper (lsc convex) and level-bounded.
Subdifferential: If (P ) holds then 0 ∈ int dom p and p(0) ∈ R. Then we know that
cl p(0) = p(0) ∈ R. cl p is then proper (and lsc convex) by the proof of Thm. 7.3, thus by
Thm. 6.12
But
The second-to-last inequality follows because the affine functions majorized by p are
exactly the affine functions majorized by cl p. Together we get ∂p(0) = arg max y ψ(y).
Using duality, a similar argument shows the corresponding statement for q.
Proposition 7.6. Assume k: Rn → R̄ and h: Rm → R̄ are both proper, lsc, convex, and
A ∈ Rm×n, b ∈ Rm, c ∈ Rn. For
f (x, u) ≔ ⟨c, x⟩ + k(x) + h(A x − b + u)
the primal and dual problems are of the form
inf_x ϕ(x), ϕ(x) ≔ ⟨c, x⟩ + k(x) + h(A x − b),
sup_y ψ(y), ψ(y) ≔ −⟨b, y⟩ − k∗(−A⊤ y − c) − h∗(y),
with
int dom p = int(dom h − A dom k) + b,
int dom q = int(dom k∗ − (−A⊤) dom h∗) + c,
and the optimality conditions
{ −A⊤ y ′ − c ∈ ∂k(x ′), y ′ ∈ ∂h(A x ′ − b) } ⇔ { x ′ ∈ arg min_x ϕ(x), y ′ ∈ arg max_y ψ(y), inf_x ϕ(x) = sup_y ψ(y) } ⇔ { A x ′ − b ∈ ∂h∗(y ′), x ′ ∈ ∂k∗(−A⊤ y ′ − c) }.
Proof. f is convex (Prop. 3.14: sums of convex functions are convex, f (A x + b) is convex) and lsc (Prop. 2.16: sums of proper lsc functions are lsc, and the composition of an lsc function with a continuous function is lsc). f is also proper: f (x, u) = −∞ would imply that k or h is not proper, and for some x ∈ dom k ≠ ∅ there must exist u s.t. A x − b + u ∈ dom h, because dom h ≠ ∅ and there are no other constraints on u.
With this choice we have ϕ(x) = f (x, 0) and (this is an important trick!)
f ∗(v, y) = sup_{x,u} {⟨x, v⟩ + ⟨u, y⟩ − ⟨c, x⟩ − k(x) − h(A x − b + u)}
= sup_{x,w} {⟨x, v⟩ + ⟨w − A x + b, y⟩ − ⟨c, x⟩ − k(x) − h(w)}   (substituting w = A x − b + u)
= sup_{x,w} {⟨x, −A⊤ y + v − c⟩ − k(x) + ⟨w, y⟩ − h(w) + ⟨b, y⟩}
= ⟨b, y⟩ + sup_x {⟨x, −A⊤ y + v − c⟩ − k(x)} + sup_w {⟨w, y⟩ − h(w)}
= ⟨b, y⟩ + k∗(−A⊤ y + v − c) + h∗(y).
Thus ψ(y) = −f ∗(0, y) (the case v = 0). Because f is proper, lsc, convex we can apply Prop. 7.4 and Prop. 7.5. We have
u ∈ dom p ⇔ inf_x f (x, u) < +∞ ⇔ inf_x {⟨c, x⟩ + k(x) + h(A x − b + u)} < +∞
⇔ ∃x: ⟨c, x⟩ + k(x) + h(A x − b + u) < +∞
⇔ ∃x ∈ dom k: h(A x − b + u) < +∞
⇔ ∃x ∈ dom k: A x − b + u ∈ dom h
⇔ ∃x ∈ dom k: u ∈ dom h − A x + b
⇔ u ∈ dom h − A dom k + b.
Thus
0 ∈ int dom p ⇔ 0 ∈ int(dom h − A dom k + b) ⇔ b ∈ int(A dom k − dom h).
Similarly for int dom q.
Subdifferential of f : we have
f (x, u) = g(x, u + A x) with g(x, w) ≔ ⟨c, x⟩ + k(x) + h(w − b).
Then by the definition of the subdifferential (separable sum ⇒ product of subdifferentials)
∂g(x, w) = (c + ∂k(x)) × ∂h(w − b).
Also, by Prop. 5.14 (chain rule) with the linear map
F (x, u) ≔ ( I 0 ; A I ) (x, u)
and f = g ◦ F we get the stated subdifferential conditions; similarly for f ∗.
Example 7.7. (conic problems) Assume that K , L are pointed closed convex cones with
polar cones K ∗, L∗ and consider the problem
inf_x ⟨c, x⟩ s.t. A x − b ≥L∗ 0, x ≥K 0,
in alternative notation
inf_x {⟨c, x⟩ + δK (x) + δL∗(A x − b)}.
By Prop. 7.6 (with k = δK and h = δL∗) the dual problem is
sup_y ψ(y), ψ(y) ≔ −⟨b, y⟩ − h∗(y) − k∗(−A⊤ y − c).
One important special case is Linear Programming duality, where K = Rn≥0 and L∗ = {0}:
inf_x ⟨c, x⟩ s.t. A x = b, x ≥ 0,
sup_y −⟨b, y⟩ s.t. −A⊤ y ≤ c.
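Strong duality for this LP pair can be observed on a tiny hand-made instance (all data below are invented for illustration; the feasible sets are low-dimensional enough for a grid search):

```python
# min <c, x> s.t. x1 + x2 = 1, x >= 0   with c = (1, 2):
# the dual is  max -<b, y> s.t. -A^T y <= c   (A = (1 1), b = 1).
c = (1.0, 2.0)

# primal: feasible points are x = (s, 1 - s), s in [0, 1]
primal = min(c[0] * (s / 100) + c[1] * (1 - s / 100) for s in range(101))

# dual: -y <= c1 and -y <= c2, maximize -b*y = -y
dual = max(-y / 100 for y in range(-100, 301)
           if -y / 100 <= c[0] and -y / 100 <= c[1])

assert abs(primal - 1.0) < 1e-9   # attained at x = (1, 0)
assert abs(dual - 1.0) < 1e-9     # attained at y = -1: strong duality
```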
Proposition 7.8. (Lagrangian) For f as in Def. 7.1 define the associated “Lagrangian”
l(x, y) ≔ inf_u {f (x, u) − ⟨y, u⟩}.
Then l(·, y) is convex for every y, −l(x, ·) is lsc and convex for every x, and
f (x, ·) = (−l(x, ·))∗,
(v, y) ∈ ∂f (x, u) ⇔ v ∈ ∂x l(x, y) and u ∈ ∂y (−l)(x, y).
Proof. Denote g(x, y, u) ≔ f (x, u) − ⟨y, u⟩. f is proper lsc convex ⇒ g is proper lsc convex. Thus
l(·, y) = inf_u g(·, y, u)
is convex (but does not have to be proper for a specific y), see the proof of Prop. 7.2. Define fx(y) ≔ f (x, y); then −l(x, ·) = fx∗(·) and fx is either ≡ +∞ or proper lsc convex, therefore −l(x, ·) is either ≡ −∞ or proper lsc convex by Thm. 6.10, but always lsc convex as claimed.
We have, by definition of the subgradient,
(v, y) ∈ ∂f (x, u) ⇔ f (x ′, u ′) ≥ f (x, u) + ⟨v, x ′ − x⟩ + ⟨y, u ′ − u⟩ ∀x ′, u ′
⇔ f (x ′, u ′) − ⟨y, u ′⟩ ≥ f (x, u) + ⟨v, x ′ − x⟩ − ⟨y, u⟩ ∀x ′, u ′
⇔ inf_{u ′} {f (x ′, u ′) − ⟨y, u ′⟩} ≥ f (x, u) − ⟨y, u⟩ + ⟨v, x ′ − x⟩ ∀x ′, (7.3)
because inf_{u ′} {f (x, u ′) − ⟨y, u ′⟩} ≤ f (x, u) − ⟨y, u⟩ always holds (set u ′ = u). Thus we can continue (7.3) via
(7.3) ⇔ { inf_{u ′} {f (x ′, u ′) − ⟨y, u ′⟩} ≥ f (x, u) − ⟨y, u⟩ + ⟨v, x ′ − x⟩ ∀x ′, inf_{u ′} {f (x, u ′) − ⟨y, u ′⟩} = f (x, u) − ⟨y, u⟩ }
⇔ { inf_{u ′} {f (x ′, u ′) − ⟨y, u ′⟩} ≥ inf_{u ′} {f (x, u ′) − ⟨y, u ′⟩} + ⟨v, x ′ − x⟩ ∀x ′, inf_{u ′} {f (x, u ′) − ⟨y, u ′⟩} = f (x, u) − ⟨y, u⟩ }
⇔ { l(x ′, y) ≥ l(x, y) + ⟨v, x ′ − x⟩ ∀x ′, f (x, u ′) ≥ f (x, u) + ⟨y, u ′ − u⟩ ∀u ′ }
⇔ { v ∈ ∂x l(x, y), y ∈ ∂u f (x, u) }.
Because fx is +∞ or proper lsc convex, by Thm. 6.12 (inversion of subdifferentials under
Legendre-Fenchel transform) the first condition is equivalent to u ∈ ∂fx∗(y), which is the
same as
u ∈ ∂(−l(x, ·))∗∗(y) = ∂ y(−l)(x, y).
Therefore we get
(7.3) ⇔ { u ∈ ∂y (−l)(x, y), v ∈ ∂x l(x, y) },
which shows the assertion.
The Lagrangian opens a particularly nice way to formulate the optimality conditions:
always holds (set x = x ′); together with the saddle-point condition we get equality (similarly
for the supremum).
Proposition 7.11. Assume f is proper, lsc, convex with associated Lagrangian l. Then
For fixed x,
sup_y l(x, y) = sup_y −fx∗(y) = sup_y {⟨0, y⟩ − fx∗(y)} = fx∗∗(0) = fx(0) = f (x, 0) = ϕ(x).
The equality fx∗∗ = fx holds because fx is either proper lsc convex or ≡ +∞, in which case fx∗ ≡ −∞ and fx∗∗ ≡ +∞.
By Prop. 7.4 the optimality condition is equivalent to (0, y ′) ∈ ∂f (x ′, 0), which by Prop. 7.8 is equivalent to having 0 ∈ ∂x l(x ′, y ′) and 0 ∈ ∂y (−l)(x ′, y ′). Since l and −l are convex in the respective variables, this is exactly the saddle-point condition (x ′, y ′) ∈ sp l (Def. 7.9) and, by Rem. 7.10 and the first part of the proposition, equivalent to ϕ(x ′) = l(x ′, y ′) = ψ(y ′).
Note that Prop. 7.11 only gives sufficient conditions for optimality. The following
theorem answers the question of how to construct f from a given Lagrangian such that f
is proper lsc convex and for every minimizer x ′ we can find a dual feasible point y ′ such
that (x ′, y ′) is a saddle point.
Proposition 7.12. Assume X ⊆ Rn and Y ⊆ Rm are nonempty, closed, convex, and
L: X ×Y → R
is a continuous function with L(·, y) convex for every y and −L(x, ·) convex for every x.
Then
l(x, y) ≔ L(x, y) + δX (x) − δY (y),
with the convention +∞ − ∞ = +∞ on the right, is the Lagrangian to
f (x, u) ≔ sup_y {l(x, y) + ⟨u, y⟩} = (−l(x, ·))∗(u).
f is proper, lsc, and convex, i.e., Prop. 7.11 applies with primal and dual problems
inf_{x∈X} sup_{y∈Y} L(x, y) = inf_x ϕ(x), ϕ(x) ≔ δX (x) + sup_{y∈Y} L(x, y), (7.4)
sup_{y∈Y} inf_{x∈X} L(x, y) = sup_y ψ(y), ψ(y) ≔ −δY (y) + inf_{x∈X} L(x, y).
is lsc, and therefore f (x, u) = supy {L(x, y) + hu, y i − δY (y)} + δX (x) is lsc as well (the last
step requires X to be closed and makes use of the convention +∞ − ∞ = 0).
Properness: if f ≡ +∞ then l(x, ·) = −(f (x, ·))∗ ≡ +∞ for all x, which is not possible because it would mean X = ∅. For any fixed x, f (x, ·) is either ≡ +∞ or x ∈ X, in which case −l(x, ·) = −L(x, ·) + δY . This is proper lsc convex (Y ≠ ∅). Thus f (x, ·) must be proper lsc convex (Thm. 6.10), which means it cannot assume the value −∞. Thus f (x, u) ≠ −∞ always; together f is jointly proper. All in all, the conditions in Prop. 7.11 are fulfilled.
The relations in (7.4) follow directly from the definition of l if one takes some caution
to respect the convention +∞ − ∞ = +∞:
ψ(y) = inf_x {L(x, y) + δX (x) − δY (y)} = inf_{x∈X} L(x, y) − δY (y).
The inf on the left side is always < +∞ because X ≠ ∅ and L is finite. Also
ϕ(x) = sup_y {L(x, y) + δX (x) − δY (y)}
= { +∞ if x ∉ X, sup_y {L(x, y) − δY (y)} if x ∈ X }
= sup_{y∈Y} L(x, y) + δX (x).
(note that the inequality requires X, Y ≠ ∅). X and Y are compact, therefore the right side
is bounded and we get dom p = Rm; similarly dom q = Rn. In particular 0 ∈ int dom p and
0 ∈ int dom q, and from Prop. 7.5 we obtain that the optimality conditions have a solution
with a finite value. By Prop. 7.11 this implies
sp l = (arg min ϕ) × (arg max ψ) ≠ ∅.
Both sets are bounded because X and Y are bounded, therefore sp l is bounded as well.
y ′ = ΠY (a),
which can be computed explicitly and separately for each component. We can even obtain
a primal solution from y ′: we know that x ′ minimizes l(·, y ′), so
x ′ = arg min_x { ½ ‖x − a‖₂² + ⟨x, y ′⟩ } = arg min_x { ½ ‖x − a‖₂² + ⟨x, ΠY (a)⟩ }.
The solution is unique, therefore we obtain the primal solution x ′ = a − ΠY (a). This
operation is known as shrinkage, because it shrinks the value of a towards zero.
Note that recovering the primal solution from the dual in this way is only possible
because of the uniqueness of the primal solution: in the general case, not every minimizer
x ′′ of l(·, y ′) leads to a saddle point (x ′′, y ′), as it may still violate the second half of the
saddle-point condition.
The fact that problem (7.5) can be solved explicitly – and therefore exactly – and
includes both the smooth 2-norm as well as the non-smooth 1-norm has made it a very
popular problem to be solved as a sub-step for solving more complicated problems, see
Chapter 9.
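In one dimension the recovered primal solution x′ = a − ΠY(a) is plain soft-thresholding; the sketch below (ad-hoc names, invented data) compares it against a brute-force minimization of the primal objective ½(x − a)² + λ|x|:

```python
def shrink(a, lam):
    # x' = a - Π_[-λ,λ](a): project a onto [-λ, λ], then subtract
    return a - max(-lam, min(lam, a))

def brute(a, lam):
    # grid search over the primal objective (1/2)(x - a)^2 + λ|x|
    grid = [i / 1000.0 for i in range(-5000, 5001)]
    return min(grid, key=lambda x: 0.5 * (x - a) ** 2 + lam * abs(x))

for a in [-2.0, -0.3, 0.0, 0.4, 1.5]:
    assert abs(shrink(a, 0.5) - brute(a, 0.5)) < 1e-3
```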
with a closed convex cone L and proper lsc convex function k, with the associated primal
and dual problems
0 ∈ ∂xl(x, y) ⇔ 0 ∈ c + ∂k(x) + A⊤ y,
0 ∈ ∂ y(−l)(x, y) ⇔ 0 ∈ −(A x − b) + NL(y),
we can apply the second part of Prop. 6.17 to rewrite the last condition via v ∈ NL(y) ⇔ v ∈ L∗, y ∈ L, ⟨v, y⟩ = 0, and get
0 ∈ c + A⊤ y + ∂k(x),
0 ≤L y ⊥ (A x − b) ≥L∗ 0.
The orthogonality constraint in the second equation is also known as a complementarity
condition. To see why, set A = I, b = 0, and L = R>0. The equation is then
0 6 y⊥x 6 0.
Because of the inequalities, no term of the inner product ⟨y, x⟩ can be positive, therefore
hy, xi = 0 is equivalent to yi xi = 0 for all i – the variables xi and yi are complementary in
the sense that if one of them is nonzero, the other one must be zero (they could still both
be zero, however).
Chapter 8
Numerical Optimality
When designing optimization methods an important question is what stopping criterion
to choose. A very common pitfall is the following:
1. Fix a δ > 0, x0.
2. iterate for k = 1, 2, …:
a. compute xk+1 from xk,
b. stop if |ϕ(xk+1) − ϕ(xk)| < δ, or if kxk+1 − xk k < δ,
c. k ← k + 1.
This approach suffers from all kinds of problems:
• It is very dependent on the scaling of the problem or the data by constant factors.
• It stops when the solver is slow , which – without further knowledge about the
algorithm – does not imply that the iterate is close to the solution. In fact the trivial
update xk+1 ← xk “converges” after one iteration but the solution is useless.
• We do not get any information about how close xk is to the minimizer x ′ or how
close f (xk) is to f (x ′).
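The second point can be made concrete with a toy sequence (invented for illustration): the partial sums of the harmonic series have steps 1/k → 0, so the criterion ‖xk+1 − xk‖ < δ fires even though the iterates do not converge to anything:

```python
delta = 1e-4
x, k = 0.0, 0
while True:
    k += 1
    step = 1.0 / k        # "update" with vanishing step sizes
    x += step             # x_k = 1 + 1/2 + ... + 1/k diverges
    if step < delta:      # the naive criterion fires anyway
        break

assert k == 10001 and step < delta   # stopped, but nowhere near any limit
```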
Ultimately we would like our method to find a solution within a guaranteed distance of the optimal energy:
ϕ(x) − inf ϕ ≤ ε.
This is the “ideal” stopping criterion – it guarantees that energy-wise x is not much
worse than a true minimizer x ′. Unfortunately ϕ(x ′) is generally unknown. What can we
do?
1. The Hessian of f is uniformly bounded from below: ∇²f ≥ σ− I for some σ− > 0 and all x. We get (Taylor)
f (y) = f (x) + ⟨∇f (x), y − x⟩ + ½ (y − x)⊤ ∇²f (z) (y − x)
for some z = (1 − t) x + t y, t ∈ [0, 1]. Thus
f (y) ≥ f (x) + ⟨∇f (x), y − x⟩ + (σ−/2) ‖y − x‖₂².
This is a better bound than convexity gives; convexity alone corresponds to σ− = 0.
a. The right-hand side is minimized by ȳ = x − (1/σ−) ∇f (x); substituting this gives
f (y) ≥ f (x) − (1/(2 σ−)) ‖∇f (x)‖₂².
In particular for x = xk and y = x ′ (a minimizer) we get
f (xk) − f (x ′) ≤ (1/(2 σ−)) ‖∇f (xk)‖₂².
This means that f (xk) approaches f (x ′) as ‖∇f (xk)‖₂ → 0.
b. Another consequence is (x ′ is optimal!)
0 ≥ f (x ′) − f (xk) ≥ ⟨∇f (xk), x ′ − xk⟩ + (σ−/2) ‖xk − x ′‖₂²
≥ −‖∇f (xk)‖₂ ‖xk − x ′‖₂ + (σ−/2) ‖xk − x ′‖₂²,
thus
‖xk − x ′‖₂ ≤ (2/σ−) ‖∇f (xk)‖₂,
meaning that also xk approaches x ′ as ‖∇f (xk)‖₂ → 0.
2. The Hessian of f is uniformly bounded from above: ∇²f ≤ σ+ I for some σ+ > 0 and all x. Then (again using Taylor and minimizing both sides over y)
f (xk) − f (x ′) ≥ (1/(2 σ+)) ‖∇f (xk)‖₂².
This gives the reverse statement: ‖∇f (xk)‖₂ approaches 0 as f (xk) → f (x ′).
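For a simple quadratic both estimates are tight, which gives a quick sanity check (a sketch with invented numbers, where σ− = σ+ = σ):

```python
# f(x) = (σ/2) x^2 has ∇²f = σ, so σ- = σ+ = σ and
# f(x) - inf f = ||∇f(x)||^2 / (2σ) holds with equality.
sigma = 2.0
f = lambda x: 0.5 * sigma * x * x     # inf f = f(0) = 0
grad = lambda x: sigma * x

for x in [-3.0, 0.5, 10.0]:
    gap = f(x)                               # f(x) - f(x')
    bound = grad(x) ** 2 / (2.0 * sigma)     # upper and lower bound coincide
    assert abs(gap - bound) < 1e-12
```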
Together we can upper- and lower-bound f (x ′) using quadratic upper and lower bounds on f . The convergence speed of gradient-based methods depends heavily on the condition number
κ ≔ σ+/σ−.
In general convex optimization, problems are usually
• non-differentiable (⇒σ+ = +∞) and
• have affine regions (⇒σ− = 0)
(this is most easily visualized using a simple function such as f (x) = |x|). Therefore they
do not have a good “classical” condition (not even locally, again consider f (x) = |x|).
One problem is that the subgradients usually carry no information about how close xk
is to x ′ (or even just how close f (xk) is to f (x ′)). Another viewpoint is the following:
Subgradients in convex optimization only provide a lower bound for the function, and
may be a very bad linear approximation!
8.2 The numerical primal-dual gap
Proof. The first inequality follows directly from weak duality. The second inequality
follows from ϕ(x ′) > ψ(y k) which again holds due to weak duality.
This means that if for a given xk we can find y k such that ϕ(xk) − ψ(y k) 6 ε we have
proof that xk is an ε-optimal solution, with y k acting as a certificate of optimality! This
is an important concept – solvers can prove that their solution is ε-optimal.
If strong duality holds (i.e., inf ϕ = sup ψ ∈ R), then such (xk , y k) always exist for
arbitrarily small ε > 0, as we can take any minimizing/maximizing pair of sequences for the
primal/dual problems. Existence of a certificate for ε = 0 requires existence of a primal-
dual optimal pair.
The numerical primal-dual gap is translation invariant, i.e., if f ′(x, u) ≔ f (x, u) + c for some c ∈ R (thus ϕ ′(x) = ϕ(x) + c and ψ ′(y) = ψ(y) + c; verify this using ψ(y) = −f ∗(0, y)), then
ϕ(xk) − ψ(y k) ≤ ε ⇔ ϕ ′(xk) − ψ ′(y k) ≤ ε.
It is however not scale invariant: if f ′(x, u) = c f (x, u) for c > 0, then ϕ ′(x) = c ϕ(x) and
ψ ′(y) = −f ′∗(0, y) = −sup_{x,u} {⟨u, y⟩ − c f (x, u)} = −c sup_{x,u} {⟨u, y/c⟩ − f (x, u)} = −c f ∗(0, y/c) = c ψ(y/c),
so for the scaled iterates (xk , c y k) we get
ϕ ′(xk) − ψ ′(c y k) ≤ c ε,
i.e., if we stop on the numerical primal-dual gap we get different solutions although the problem is effectively the same!
Therefore in practice the normalized gap is often used instead:
Definition 8.3. (normalized gap) We define the normalized numerical primal-dual gap as
γ̄ ≔ γ̄(xk , y k) ≔ (ϕ(xk) − ψ(y k)) / ψ(y k).
We get
γ̄ ≥ (ϕ(xk) − ψ(y k)) / ψ(y ′)   (y ′ dual optimal)
≥ (ϕ(xk) − ψ(y k)) / ϕ(x ′)   (weak duality)
≥ (ϕ(xk) − ϕ(x ′)) / ϕ(x ′)   (weak duality).
Therefore a small normalized gap γ̄ 6 ε guarantees that xk is ε ϕ(x ′)-optimal. γ̄ is scale-
invariant due to the normalization. It is however not translation-invariant – by adding a
large constant c to ϕ, ψ we can make it arbitrarily small:
γ̄ ′ = (ϕ(xk) + c − (ψ(y k) + c)) / (ψ(y k) + c) = (ϕ(xk) − ψ(y k)) / (ψ(y k) + c).
The normalized gap also requires ϕ(x ′) > 0. Nevertheless, it is an excellent stopping
criterion and widely used in practice.
Common values are in the range of ε = 10⁻⁴ for inaccurate solutions – which are often almost identical to high-accuracy solutions in image processing – to ε = 10⁻⁸ if precise solutions are required.
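As a sketch of how a dual certificate is used in practice (the concrete problem, data, and dual below are invented for illustration; they correspond to the one-dimensional shrinkage problem, whose dual can be read off from the Lagrangian l(x, y) = ½(x − a)² + xy − δ[−λ,λ](y)):

```python
a, lam = 2.0, 0.5
phi = lambda x: 0.5 * (x - a) ** 2 + lam * abs(x)   # primal objective
psi = lambda y: a * y - 0.5 * y * y if abs(y) <= lam else float("-inf")

x_k = 1.49            # some approximate primal iterate
y_k = 0.5             # dual certificate (here: the projection of a onto [-λ, λ])
gap = phi(x_k) - psi(y_k)

# By weak duality gap >= phi(x_k) - inf phi, so x_k is provably gap-optimal
# without ever knowing inf phi.
assert 0.0 <= gap < 1e-3
```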
8.3 Infeasibilities
When computing the numerical primal-dual gap there is an important detail: We need to
work with the full extended real-valued ϕ and ψ. This is very easy to lose track of, as the
following example shows.
Unfortunately we cannot accurately deal with this issue – the primal or dual constraints
could be non-trivial equality constraints, and in the general case cannot be fulfilled exactly
using finite precision arithmetic. Also, we would like to have an indicator of how the primal-
dual sequence converges even if it is infeasible.
One way to deal with this issue is to split the primal and dual objectives:
Writing the objectives in such a way is often natural. We can then use the stopping
criterion
max {γ̄0, η p , ηd } ≤ ε, γ̄0 ≔ (ϕ0(xk) − ψ0(y k)) / ψ0(y k).
This enforces not only a small primal-dual gap but also that the primal and dual iterates
are close to being feasible.
For Ex. 8.4 we get
γ̄0 = (1 + δk − (1 + δk))/(1 + δk) = 0,
η p = max {0, 1 − xk , xk − 2} = 0,
ηd = max {0, y k − 1, −1 − y k } = δk.
This makes it obvious that although the apparent numerical duality gap is zero, it cannot
be fully trusted because the dual infeasibility is not zero.
Chapter 9
First-Order Methods
These are called “forward” and “backward” steps, since they correspond to forward and
backward Euler discretizations of the gradient descent PDE. The optimization problem
can be seen as finding a zero of the set-valued mapping ∂f : Rn ⇒ Rn, i.e.,
find x ∈ Rn s.t. 0 ∈ ∂f (x).
The forward and backward steps then amount to
xk+1 ∈ xk − τk ∂f (xk), xk+1 ∈ xk − τk ∂f (xk+1).
We see that the only difference is that in the backward step, the new iterate xk+1 is used
to compute the subdifferential.
The backward step is also known as proximal step, because it can be reformulated as
minimizing f together with an additional quadratic prox term that keeps the new iterate
close to the previous iterate:
Proposition 9.2. (proximal steps) If f : Rn → R̄ is proper lsc convex with τ > 0, then the
backward step is
Bτ f (x) = arg min_y { ½ ‖y − x‖₂² + τ f (y) }
and therefore unique.
Proof.
y ∈ Bτ f (x) ⇔ y ∈ (I + τ ∂f )⁻¹(x)
⇔ x ∈ (I + τ ∂f )(y)
⇔ 0 ∈ y − x + τ ∂f (y)
⇔ y ∈ arg min { ½ ‖y − x‖₂² + τ f (y) }.
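For f(y) = ½y² the resolvent (I + τ∂f)⁻¹ is available in closed form, which makes the equivalence easy to check (an illustrative sketch with ad-hoc names):

```python
def backward_step(x, tau):
    # ∂f(y) = y, so x ∈ (I + τ ∂f)(y) gives y = x / (1 + τ)
    return x / (1.0 + tau)

def backward_brute(x, tau):
    # direct minimization of (1/2)(y - x)^2 + τ (1/2) y^2 on a grid
    grid = [i / 1000.0 for i in range(-5000, 5001)]
    return min(grid, key=lambda y: 0.5 * (y - x) ** 2 + tau * 0.5 * y * y)

for x in [-3.0, 0.2, 4.0]:
    assert abs(backward_step(x, 0.5) - backward_brute(x, 0.5)) < 1e-3
```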
Example 9.3. (forward and backward stepping) Assume f is proper lsc convex and arg min f ≠ ∅.
1. Forward stepping:
xk+1 ∈ Fτk f (xk).
Convergence:
• Only restricted convergence results available (line search for smooth func-
tions; but basic result for non-smooth functions requires knowledge of inf f ).
• Also: convergence of xk to minimizer if there is α > 0 such that 0 < inf τk 6
sup τk < 2 α and [Eck89, 3.3]
hy − x, h − g i > α kh − gk22 ∀ g ∈ ∂f (x), h ∈ ∂f (y).
In the smooth case this implies the Lipschitz condition:
‖y − x‖₂ ‖∇f (y) − ∇f (x)‖₂ ≥ ⟨y − x, ∇f (y) − ∇f (x)⟩ ≥ α ‖∇f (y) − ∇f (x)‖₂²
⇒ ‖∇f (y) − ∇f (x)‖₂ ≤ (1/α) ‖y − x‖₂. (9.1)
Advantages:
• requires very little information (just a single subgradient)
9.1 Forward and backward steps
The last point is generally the dealbreaker for backward steps. But we often encounter
problems where the objective is a sum of several terms, each of which is easy to minimize:
f (x) = g(x) + h(x).
The question is, can we find a minimum of f by combining suitable alternating minimiza-
tion steps of g and h?
gradient
projection
Convergence:
• gradient-projection: if (9.1) holds and 0 < inf τk 6 sup τk < 2 α.
• general case: in the mean if Σk τk = +∞ and Σk τk² < +∞, and the chosen sequence of subgradients y k ∈ Fτk g(xk) is bounded (this is not clear and needs to be shown separately)
Advantages:
• fixed points are the minimizers of f
• requires backward steps on h only, and can therefore deal with relatively
complicated functions g (think g(x) = kA x − bk22 where backward steps
involve solving large systems)
Disadvantages:
• sequence not unique, may get stuck
• can escape from minimizer! (if ∂g(x ′) ≠ {0}, i.e., it contains non-zero subgradients)
Example 9.5. (ISTA) Assume we would like to reconstruct an image y ∈ Rn from the
observation b = R y, where R is a linear operator. This could be a blurring operator, a
super-resolution operator, or just the identity for denoising of a given image.
As prior knowledge about the image to be recovered we make the first assumption that
y = W x,
where W ∈ Rn×m is a set of basis vectors (possibly overcomplete). One choice is to use a wavelet basis consisting of scaled and translated copies of a “mother wavelet”, for which the product x ↦ W x can be evaluated quickly although W is large and non-sparse. The second and key assumption is that x is sparse, i.e., it has only a few non-zeros: we assume
that the image can be represented using only a few of the basis functions.
An approach to recover y is to set A = R W and find
x ∈ arg min_x { ½ ‖A x − b‖₂² + λ ‖x‖₁ }.
The left term ensures that the resulting image is close to the observation, the right term
promotes sparsity of x. If we apply forward-backward splitting with f = g + h, g(x) = ½ ‖A x − b‖₂², h(x) = λ ‖x‖₁, we get
xk+1 = arg min_x { ½ ‖x − (xk − τk A⊤(A xk − b))‖₂² + τk λ ‖x‖₁ }.
This is known as ISTA (iterative shrinkage-thresholding algorithm) and has several strong points:
• It only requires us to do matrix multiplications involving A and A⊤, as opposed to
solving linear equation systems involving A.
9.2 Primal-dual methods
• The proximal step is separable, and can be solved explicitly using shrinkage (see
example sheets).
• Convergence: O(1/k), where k is the number of steps. A variant (FISTA) employs over-relaxation and an adaptive choice of the τk to achieve O(1/k²). This is – although far from a linear convergence rate – often enough in practice and widely used.
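A minimal ISTA loop might look as follows (a sketch on an invented diagonal problem, so the exact solution is known coordinate-wise; not a tuned implementation):

```python
def soft(v, t):
    # componentwise shrinkage: prox of t*||.||_1
    return [max(abs(vi) - t, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]

A = [1.0, 2.0]        # A = diag(1, 2), so ||A||^2 = 4
b = [1.0, 0.2]
lam, tau = 0.5, 0.2   # step size tau < 1 / ||A||^2 is safe here

x = [0.0, 0.0]
for _ in range(500):
    grad = [A[i] * (A[i] * x[i] - b[i]) for i in range(2)]   # A^T (A x - b)
    x = soft([x[i] - tau * grad[i] for i in range(2)], tau * lam)

# separable exact solution: x_i minimizes (1/2)(A_i x - b_i)^2 + λ|x_i|,
# which gives x = (0.5, 0) for this data
assert abs(x[0] - 0.5) < 1e-6 and abs(x[1]) < 1e-12
```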
An obvious strategy is to do alternate between improving the primal and dual objectives,
keeping the other variable fixed. In the following we generally assume that strong duality
holds.
Convergence:
• if ∇(h ◦ A) is Lipschitz continuous (in particular differentiable!) with con-
stant L, and 0 < inf τk 6 sup τk < 2/L.
Advantages:
• requires proximal steps on g only
• compared to forward-backward splitting, provides an alternative way to com-
pute the subgradient by minimizing h∗ and generates a sequence of dual
iterates y k that can be used to compute the numerical primal-dual gap.
Disadvantages:
• step size restriction, need to estimate L
2. Modified PDHG (primal-dual hybrid gradient): We alternate between backward
steps with respect to x and y with an additional over-relaxation:
y k+1 = Bσk (−l(x̄k ,·))(y k),
xk+1 = Bτk(l(·,yk+1))(xk),
x̄ k+1 = xk+1 + θk (xk+1 − xk).
Convergence:
• in the mean, if σk ≡ σ, τk ≡ τ , θk ≡ θ = 1 and σ τ < 1/‖A‖², with rate O(1/k).
• O(1/k²) for ‖xk − x ′‖₂² if g or h∗ is strongly convex with constant α (needs an adaptive choice of θk).
• O(ω k) for some ω < 1 depending on α, β, and ‖A‖, i.e., a linear convergence rate, if g and h∗ are strongly convex with constants α, β.
Advantages:
• few requirements for convergence, flexible
• efficiently decouples the linear operator A (only matrix-vector products with A and A⊤ are needed)
Disadvantages:
• step size bound
• convergence in the mean (although convergence in (xk , y k) is observed)
• even if kAk is known, the ratio σ/τ may still need to be tuned for good
convergence
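A minimal sketch of the modified PDHG iteration on the one-dimensional problem min_x ½(x − a)² + λ|x| (so A = I and ‖A‖ = 1; the data and step sizes below are illustrative):

```python
a, lam = 2.0, 0.5
sigma = tau = 0.9                 # sigma * tau < 1 / ||A||^2 = 1
x = xbar = y = 0.0
for _ in range(200):
    # backward step in y: prox of σ h* with h* = δ_[-λ,λ] is the projection
    y = max(-lam, min(lam, y + sigma * xbar))
    # backward step in x: prox of τ g with g = (1/2)(. - a)^2
    x_new = (x - tau * y + tau * a) / (1.0 + tau)
    xbar = 2.0 * x_new - x        # over-relaxation with θ = 1
    x = x_new

assert abs(x - 1.5) < 1e-6        # primal solution: shrinkage of a
assert abs(y - 0.5) < 1e-6        # dual solution: projection of a onto [-λ, λ]
```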
Example 9.7. (Augmented Lagrangian) We shift h into the primal objective by adding an artificial variable z:
inf_{x,z} ϕ(x, z), ϕ(x, z) = g(x) + h(z) + δ{A x=z}(x, z),
l((x, z), y) = g(x) + h(z) + ⟨A x − z, y⟩.
The dual variables y only occur as multipliers. Note the optimality conditions:
{ 0 ∈ ∂(x,z) l((x, z), y), 0 ∈ ∂y (−l)((x, z), y) } ⇔ { 0 ∈ ∂g(x) + A⊤ y, 0 ∈ ∂h(z) − y, A x = z }
⇔ { 0 ∈ ∂g(x) + A⊤ y, y ∈ ∂h(A x) } ⇔ { 0 ∈ ∂g(x) + A⊤ y, 0 ∈ ∂h∗(y) − A x }   (subgradient inversion).
This shows that solutions ((x, z), y) of the new problem are exactly the triples ((x, A x), y)
where (x, y) is a primal-dual solution for l! In particular, y is always a dual solution for
the original problem, although we did not explicitly introduce it as such.
We augment the Lagrangian by adding a quadratic penalty:
l̄((x, z), y) ≔ g(x) + h(z) + ⟨A x − z, y⟩ + (1/(2σ)) ‖A x − z‖₂².
This does not change the set of minimizers, because A x = z holds for every minimizer.
We then alternate between minimization with respect to x and z and gradient ascent for
y with step size σ:
xk+1 = arg min_x { g(x) + (1/(2σ)) ‖A x − z k + σ y k‖₂² },
z k+1 = arg min_z { h(z) + (1/(2σ)) ‖z − A xk+1 − σ y k‖₂² },
y k+1 = y k + (1/σ) (A xk+1 − z k+1).
This is known as the alternating direction method of multipliers (ADMM).
Convergence:
• for any σ > 0 we have xk → x ′ and y k → y ′ where (x ′, y ′) primal-dual optimal pair;
additionally z k → A x ′
Advantages:
• very few assumptions required for convergence
Disadvantages:
• We need to solve problems involving the term kA x − ·k22, which can be difficult due
to the coupling of the variables in x.
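The three ADMM updates can be sketched on the same one-dimensional problem as before (g(x) = ½(x − a)², h(z) = λ|z|, A = I; the data and σ are invented for illustration):

```python
a, lam, sigma = 2.0, 0.5, 1.0

def soft(v, t):
    return max(abs(v) - t, 0.0) * (1.0 if v >= 0 else -1.0)

x = z = y = 0.0
for _ in range(100):
    x = (sigma * a + z - sigma * y) / (sigma + 1.0)   # minimize over x
    z = soft(x + sigma * y, lam * sigma)              # minimize over z (shrinkage)
    y = y + (x - z) / sigma                           # gradient ascent in y

assert abs(x - 1.5) < 1e-6 and abs(z - 1.5) < 1e-6    # x = z = shrinkage of a
```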
Chapter 10
Interior-Point Methods
Setting: We would like to solve

    inf_x ⟨c, x⟩  s.t.  A x − b ≥_K 0,

or equivalently

    inf_x ⟨c, x⟩ + δ_K(A x − b),

where K is a proper closed convex cone.
Idea: We would like to use Newton's method,

    x^{k+1} = x^k − (∇²f(x^k))^{−1} ∇f(x^k),

since it is usually very fast (locally quadratic convergence rate ‖x^{k+1} − x′‖ ≤ c ‖x^k − x′‖²).
Problem: the constraints!
Idea: We replace the indicator function δ_K(z) by an approximation F(z) such that F(z) → +∞ as z → bd K, and solve

    inf_x  t ⟨c, x⟩ + F(A x − b).

As t → +∞ the minimizers should approach the minimizers of the original problem. Nevertheless, they all lie in the interior of the constraint set K, hence the name "interior-point methods", as opposed to methods that travel on the boundary of the constraint set.
Two questions:
• Can we guarantee that we get a solution with a certain accuracy, i.e., how do we need to choose t?
• As t → +∞ the problems potentially become more difficult, as the optimal point moves toward the boundary, and the conditioning of F becomes worse the closer one allows points to get to the boundary. Can we proceed iteratively, i.e., if we know a solution for a certain t_k, can we quickly find a solution for a larger t_{k+1}?
Definition 10.1. (canonical barrier) For a cone K we define the "canonical barriers" F = F_K and associated parameters θ_F:

• for K = K_n^LP = {x ∈ R^n | x₁, …, x_n ≥ 0}, we have the canonical barrier

    F(x) = −∑_{i=1}^n log x_i,   θ_F = n,

• for K = K_n^SOCP = {x ∈ R^n | x_n ≥ √(x₁² + ⋯ + x_{n−1}²)}, we have the canonical barrier

    F(x) = −log(x_n² − x₁² − ⋯ − x_{n−1}²),   θ_F = 2.
Proposition 10.2. [Nem 6.3.1, 6.3.2, Boyd 11.6.1] If F is a canonical barrier for K, then
F is smooth on dom F = int K and strictly convex,
F (t x) = F (x) − θF log t ∀x ∈ dom F ,
and for x ∈ dom F we have
1. −∇F (x) ∈ dom F,
2. h∇F (x), xi = −θF,
3. −∇F (−∇F (x)) = x,
4. ∇F(t x) = (1/t) ∇F(x).
Again we consider (assuming that A has full column rank, i.e., linearly independent columns; Nem p.53, Ass. A)

    inf_x ⟨c, x⟩  s.t.  A x − b ≥_K 0.
The dual problem is
sup h−b, y i s.t. −A⊤ y = c, y >K ∗ 0.
We replace y by −y:
sup hb, y i s.t. A⊤ y = c, y 6K ∗ 0.
We then assume that K is self-dual (K ∗ = K) and get
sup hb, y i s.t. A⊤ y = c, y >K 0.
In the remainder of this chapter we generally assume that the primal and dual problems
are strictly feasible, i.e., there exist x, y so that A x − b ∈ int K, A⊤ y = c, and y ∈ int K.
This guarantees that we can actually find an interior point, and that strong duality holds.
Proposition 10.3. (primal-dual central path) The primal central path is the mapping

    t ↦ x(t) = argmin_x { t ⟨c, x⟩ + F(A x − b) }.

The dual central path is the mapping

    t ↦ y(t) = argmin_y { −t ⟨b, y⟩ + F(y) + δ_{A^⊤ y = c} }.

The primal-dual central path is the mapping t ↦ z(t) := (x(t), y(t)). These central paths exist and are unique for all t > 0. Also,

    y(t) = −(1/t) ∇F(A x(t) − b),

and (x, y) is on the central path, i.e., (x, y) = (x(t), y(t)) for some t > 0, if and only if

    A x − b ∈ dom F,   A^⊤ y = c,   t y + ∇F(A x − b) = 0.
Proposition 10.4. (duality gap along the path) For feasible x, y (i.e., A x − b ∈ K, A^⊤ y = c, y ∈ K), the duality gap is

    ϕ(x) − ψ(y) = ⟨y, A x − b⟩.

Moreover, for points (x(t), y(t)) on the central path, the duality gap is

    ϕ(x(t)) − ψ(y(t)) = θ_F / t.
Proof.

    ϕ(x) − ψ(y) = ⟨c, x⟩ − ⟨b, y⟩ = ⟨A^⊤ y, x⟩ − ⟨b, y⟩ = ⟨y, A x − b⟩.

    ϕ(x(t)) − ψ(y(t)) = ⟨y(t), A x(t) − b⟩
                      = ⟨−t^{−1} ∇F(A x(t) − b), A x(t) − b⟩   (Prop. 10.3)
                      = θ_F / t.                               (Prop. 10.2)
Remark: This has an intriguing consequence: If we are given x, y on the central path, we can immediately find the unique t to which they belong by computing the duality gap: t = θ_F / (ϕ(x) − ψ(y)).
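Prop. 10.4 can be checked numerically on the simplest cone program, where the central path is available in closed form. The instance below is a toy example of my own (A = I, b = 0, K = R^n_+, so θ_F = n): the centering condition t c − ∇F(x) = 0 gives x_i(t) = 1/(t c_i) directly.

```python
import numpy as np

# Toy LP: min <c, x> s.t. x >= 0, i.e. A = I, b = 0, K = R^n_+,
# with barrier F(x) = -sum_i log(x_i) and theta_F = n.
c = np.array([1.0, 2.0, 0.5])
n = len(c)

for t in [1.0, 10.0, 100.0]:
    # primal central path: t*c_i - 1/x_i = 0  =>  x_i(t) = 1/(t*c_i)
    x_t = 1.0 / (t * c)
    # dual point (Prop. 10.3): y(t) = -(1/t)*grad F(x(t)) = 1/(t*x(t))
    y_t = 1.0 / (t * x_t)
    gap = y_t @ x_t                  # <y, A x - b> with A = I, b = 0
    assert abs(gap - n / t) < 1e-12  # duality gap = theta_F / t
```

Conversely, t can be recovered from the gap as t = n / gap, as in the remark above.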
We do not have to compute exact solutions; it is essentially enough to stay close to the central path:

(path tracing) Idea: From Prop. 10.3 we know that a primal-dual optimal pair for a fixed t can be found by solving

    A^⊤ y = c,   t y + ∇F(A x − b) = 0.
Idea: Newton's method for the nonlinear equation system; we linearize:

    ∇F(A x^{k+1} − b) = ∇F(A x^k − b + A Δx) ≈ ∇F(A x^k − b) + ∇²F(A x^k − b) A Δx

and get the following linear equation system for the steps Δx and Δy:

    A^⊤ (y^k + Δy) = c,
    t^{k+1} (y^k + Δy) + ∇F(A x^k − b) + ∇²F(A x^k − b) A Δx = 0.

Multiply the last line by A^⊤ from the left (to eliminate y) and substitute; we get

    Δx = H^{−1} (−t^{k+1} c − A^⊤ ∇F(A x^k − b)),   H := A^⊤ ∇²F(A x^k − b) A,
    Δy = −(t^{k+1})^{−1} (∇F(A x^k − b) + ∇²F(A x^k − b) A Δx) − y^k.
Here we need the assumption that A has full column rank in order for H to be regular.
We apply a step in the direction of (Δx, Δy):

    (x^{k+1}, y^{k+1}) = (x^k, y^k) + τ_k (Δx, Δy)   (10.1)

with the step size τ_k found using line search or a full Newton step (τ_k = 1):
Theorem 10.6. Assume 0 < ρ ≤ κ < 1/10, t^k > 0 fixed, and z^k = (x^k, y^k) strictly feasible, i.e., A x^k − b ∈ dom F, y^k ∈ dom F, such that

    dist(z^k, z(t^k)) < κ.

(The idea is to bound the step for t^k so as to keep the Newton method within its region of quadratic convergence. The factor 1/10 comes from a nasty expression involving ρ and κ.)
Advantages:
• We can update the penalty after a single Newton step!
• In particular, this means that

    ϕ(x^k) − ψ(y^k) ≤ 2 θ_F / t^k = (2 θ_F / t^0) (1 + ρ/√θ_F)^{−k}.
We have guaranteed global linear convergence with explicit bounds!
• In practice, we can often do much larger steps for t (we could imagine computing t
from the duality gap!)
• Rapid convergence if the Newton steps can be solved in reasonable time.
Disadvantages:
• Relatively complicated.
• Many detail problems: finding a feasible point, computing the Newton step (or
Quasi-Newton approximation)
• Requires problem-specific code for medium-scale problems to be efficient.
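The full path-tracing scheme is intricate, but a basic barrier method already shows the structure: center with damped Newton steps, then increase t and repeat. Below is an illustrative numpy sketch for the orthant case K = R^m_+; the problem data and the aggressive update t ← 10t are my choices (the short-step analysis of Thm. 10.6 would increase t much more conservatively).

```python
import numpy as np

def barrier_lp(c, A, b, x0, t0=1.0, mu=10.0, eps=1e-6):
    """Basic barrier method for  min <c,x>  s.t.  A x - b >= 0  (K = R^m_+):
    damped Newton on  f_t(x) = t*<c,x> - sum_i log((A x - b)_i), then t <- mu*t.
    Stops once the duality-gap bound theta_F / t = m / t drops below eps."""
    m = A.shape[0]
    x, t = x0.astype(float), float(t0)
    while m / t >= eps:
        def ft(xx):                       # barrier objective (inf if infeasible)
            s = A @ xx - b
            return np.inf if np.min(s) <= 0 else t * (c @ xx) - np.sum(np.log(s))
        for _ in range(50):               # centering by damped Newton
            s = A @ x - b
            grad = t * c - A.T @ (1.0 / s)
            hess = A.T @ ((1.0 / s**2)[:, None] * A)
            dx = np.linalg.solve(hess, -grad)
            if -grad @ dx < 1e-12:        # squared Newton decrement: centered
                break
            tau = 1.0                     # backtracking line search
            while ft(x + tau * dx) > ft(x) + 0.25 * tau * (grad @ dx):
                tau *= 0.5
            x = x + tau * dx
        t *= mu
    return x

# hypothetical instance: min x1 + x2  s.t.  x >= 0  and  x1 + x2 >= 1;
# the central path converges to the analytic center (0.5, 0.5) of the
# optimal face, with optimal value 1
c = np.array([1.0, 1.0])
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, 1.0])
x = barrier_lp(c, A, b, x0=np.array([2.0, 2.0]))
```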
Chapter 11
Using Solvers and Discretization Issues
See cvx_demo.zip, available online.
Chapter 12
Support Vector Machines
is minimized:
• The function g penalizes a deviation of the predicted class label hθ(x i) from the
actual (known) class label y i.
• The regularizer R can make the problem well-defined if there is not enough input
data to uniquely identify θ, and can counter-act the problem of overfitting: If the
size of the training data set is small compared to the parameter space Θ, there is
a danger of fitting the classifier to the specific structure of the training data set,
instead of the distribution underlying the training data – in effect, the classifier will
work very well for the training data, but will perform poorly on feature vectors that
are not in the training data set.
In this chapter we will mostly deal with the two-class case, more precisely, h_θ: R^m → {−1, 1}; however, most concepts can be generalized to multiple classes. For the two-class case, the set h_θ^{−1}({0}) defines the decision boundary, i.e., the set that separates the two class regions h_θ^{−1}({1}) and h_θ^{−1}({−1}).
For any solution (w, b, c ′), any scalar multiple (λ w, λ b, λ c ′) with λ > 0 will also be a
solution. Assuming that there exists at least one solution with c ′ > 0 (i.e., the data sets
can in fact be exactly separated), we can thus pick one of the solutions by forcing c ′ = 1,
and obtain
    sup_{w,b}  1/‖w‖₂  s.t.  1 ≤ y^i (⟨w, x^i⟩ + b),  i ∈ {1, …, n}.   (12.3)
Finally, maximizing ‖w‖₂^{−1} is equivalent to minimizing ½ ‖w‖₂², so we obtain

    inf_{w,b}  ½ ‖w‖₂²  s.t.  1 ≤ y^i (⟨w, x^i⟩ + b),  i ∈ {1, …, n}.   (12.4)
This is a convex problem with a quadratic objective, and can be rewritten in SDP form (an SOCP formulation is equally possible but leads to a different dual problem). In terms of the framework in (12.1), the quadratic part ½ ‖w‖₂² constitutes the regularizer R, while the indicator function of the constraints is the loss function g – in effect, the loss for a correct classification is always 0, while incorrect classifications have infinite loss and are therefore prohibited.
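For two training points the solution of (12.4) can be written down directly, which makes a good sanity check: the optimal hyperplane is the perpendicular bisector of the segment joining the points, and both constraints are active. The data below are a made-up example.

```python
import numpy as np

# hypothetical two-sample training set: one positive, one negative point
xp = np.array([2.0, 1.0])   # label +1
xm = np.array([0.0, 0.0])   # label -1

# for two points, the minimizer of (12.4) is the perpendicular bisector
# of the segment [xm, xp], scaled so that both constraints are active:
d = xp - xm
w = 2.0 * d / (d @ d)
b = -w @ (xp + xm) / 2.0

# both constraints 1 <= y^i (<w, x^i> + b) hold with equality
# (both points are support vectors)
margin = 1.0 / np.linalg.norm(w)   # geometric margin = ||xp - xm|| / 2
```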
Dual problem and optimality conditions. Rewriting the primal problem as

    inf_{w,b}  k(w, b) + h( M (w, b) − 1 ),   k(w, b) = ½ ‖w‖₂²,   h(z) = δ_{≥0}(z),   (12.5)

where the rows of M collect the constraints:

        ( y¹ · (x¹)^⊤   y¹ )
    M = (      ⋮         ⋮  )
        ( yⁿ · (xⁿ)^⊤   yⁿ ).
Unfortunately, such spaces can become huge very quickly for moderately large m – for polynomials of degree 2 or less, we already need

    (m choose 2) = m (m − 1)/2

terms for the mixed products alone. As this defines the size of the optimization problem (12.5) and of the coefficient vector w, it becomes too large very quickly.
The first key insight is the following: The size of z in the dual problem (12.7) does not depend on the dimension m of the feature space – the size of z is determined by the number of samples n instead!
The dual problem,

    inf_{z ∈ R^n}  ½ ‖ ∑_{i=1}^n y^i η(x^i) z_i ‖₂² + e^⊤ z,

still requires mapping the x^i into the high-dimensional space F̄. We start by rewriting the quadratic part as

    ½ ‖ ∑_{i=1}^n y^i η(x^i) z_i ‖₂² = ½ ∑_{i,j=1}^n y^i y^j ⟨η(x^i), η(x^j)⟩ z_i z_j
                                     = ½ ∑_{i,j=1}^n y^i y^j κ(x^i, x^j) z_i z_j,   (12.13)

where we define the kernel

    κ(x, x′) := ⟨η(x), η(x′)⟩.
This is where the second key insight comes into play: By carefully choosing η, the kernel κ can be evaluated without computing η: For η as in (12.12),

    κ(x, x′) = x₁² x′₁² + ⋯ + 2 x₂ x′₂ + 1 = (⟨x, x′⟩ + 1)²

does the trick. Once we have solved the dual problem, the inner product ⟨w, η(x)⟩ is, according to (12.11),

    ⟨w, η(x)⟩ = −∑_{i=1}^n z_i y^i ⟨x̄^i, η(x)⟩ = −∑_{i=1}^n z_i y^i κ(x^i, x).   (12.14)
From a solution z ∈ Rn of the dual problem we can therefore compute b and evaluate the
decision function hθ(x) = hw, η(x)i − b without ever evaluating η! This concept is known
as the kernel trick.
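The kernel identity above is easy to verify numerically. The feature map below is one standard scaling of the degree-≤2 monomials in R² (the √2 factors are my choice of normalization; they are exactly what makes the identity hold):

```python
import numpy as np

def eta(x):
    """Explicit feature map for degree-<=2 polynomials in R^2; the sqrt(2)
    scalings (an assumed normalization) make the kernel identity exact."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

def kappa(x, xp):
    return (x @ xp + 1.0) ** 2   # polynomial kernel of degree 2

# kappa(x, x') = <eta(x), eta(x')> for random test points
rng = np.random.default_rng(0)
for _ in range(5):
    x, xp = rng.standard_normal(2), rng.standard_normal(2)
    assert abs(kappa(x, xp) - eta(x) @ eta(xp)) < 1e-10
```

Evaluating κ costs O(m), while evaluating η and the inner product in F̄ costs O(m²) here, which is the point of the trick.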
In order to evaluate h_θ we still need to store the training vectors x^i in order to compute the terms κ(x^i, x). But from (12.14) we see that we can discard any vectors with z_i = 0, and only need to store vectors x^i with z_i ≠ 0. Any such vector must be a support vector and thus minimize the distance to the separating hyperplane, which in practice is a very rare occurrence – in order to evaluate (12.14), we will usually have to store only very few of the x^i, rather than the whole training data set.
We have shown that by solving the dual problem, we can find and evaluate an optimal
non-linear classifier in the potentially very high-dimensional space F̄ without ever having
to evaluate η. In fact, looking at (12.13), we find that we do not even have to know η
explicitly: We can substitute any kernel κ, as long as the dual problem stays convex. A
sufficient condition for this is that the matrix M ∈ R^{n×n}, M_{ij} = κ(x^i, x^j), is symmetric and positive semidefinite for all choices of n and {x¹, …, xⁿ} (the y^i correspond to multiplication with diagonal matrices and do not affect positive semidefiniteness).
Additionally, the extended feature space F̄ can be infinite-dimensional. It can be shown
that for any positive definite κ, an embedding η into a so-called reproducing kernel Hilbert
space can be found that defines κ through η and its inner product (see for example [?]).
Within these bounds, the kernel can be chosen freely to suit the application. Popular
choices are:
    κ(x, x′) = ⟨x, x′⟩^q                    (monomials of degree q),
    κ(x, x′) = (⟨x, x′⟩ + 1)^q              (polynomials of degree q or less),
    κ(x, x′) = exp( −½ ‖x − x′‖₂² / σ² )    (Gaussian kernel),
and many more, see [Bis06, Chapt. 6.3] for an overview.
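The positive-semidefiniteness condition can be spot-checked for the Gaussian kernel with a quick numpy experiment (random sample points of my choosing; this is of course no substitute for the general proof):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """M_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for the rows x_i of X."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared distances
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))        # 30 sample points in R^5
M = gaussian_kernel_matrix(X)
assert np.allclose(M, M.T)                            # symmetric
assert np.min(np.linalg.eigvalsh(M)) > -1e-8          # positive semidefinite
```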
Chapter 13
Total Variation and Applications
The same argument can be made if u is non-increasing, in which case we get u(a) − u(b). This means that if u is monotone between two points a and b, then f(u) = |u(a) − u(b)|, and the behaviour of u between the points is completely irrelevant, as long as it is monotone. Hence f counts the differences between extreme points of u, which gives rise to the term "variation" of u.
For general functions that are not necessarily differentiable (and may even be discon-
tinuous), we use the following definition. We generally assume Ω to be an open Lipschitz
domain in Rn.
where v(x) = (v¹(x), …, v^m(x))^⊤, Div v = (div v¹, …, div v^m), and for v ∈ C_c¹(Ω, R^{m×n}),

    ‖v‖_∞ = sup_{x ∈ Ω} ‖v(x)‖₂.
which is just rewriting the norm using its dual norm. It is possible to show that the supremum and the integral can be swapped and v can be restricted to C_c¹ functions:

    ∫_Ω ‖∇u(x)‖₂ dx = sup_{v ∈ C_c¹(Ω, R^{m×n}), ‖v‖_∞ ≤ 1} ∫_Ω ⟨∇u, v⟩ dx.

As all v have compact support, we can use the divergence theorem applied to u_i v^i (which is just partial integration):

    0 = ∫_Ω div(u_i v^i) dx   ⇒   ∫_Ω ⟨∇u_i, v^i⟩ dx = −∫_Ω u_i div v^i dx,

and get

    ∫_Ω ‖∇u(x)‖₂ dx = sup_{v ∈ C_c¹(Ω, R^{m×n}), ‖v‖_∞ ≤ 1} ∫_Ω ⟨u, Div v⟩ dx = TV(u).
This is also sometimes referred to as the dual formulation of the total variation. The space BV can also be defined as the space of all u ∈ L¹ such that the gradient of u exists as a finite Radon measure on Ω, denoted by Du, but we will not go into these details. The importance of formulation (13.2) as opposed to (13.1) is that it does not require u to be differentiable, or even continuous.
An important property of the total variation is that for characteristic functions of sets,
it reduces to boundary length/area of the set:
Then
by Gauss' theorem. This immediately shows the upper bound TV(1_A) ≤ H^{n−1}(∂A). The lower bound is slightly more difficult: intuitively we need v(x) = n on the boundary, but we have to show that v can be made sufficiently smooth, for example by constructing it as a suitable L¹ vector field and approximating it using C_c¹ functions; see [AFP00] for general remarks.
Penalizing boundary length is a very natural way of favouring smooth contours, which
makes the total variation a good candidate for the problem of finding an optimal set, such
as in segmentation problems.
and

    TV(u) = ∫_{−∞}^{+∞} TV(1_{u > t}) dt.
This can often be used to reduce the study of functions of bounded variation to the
study of their (super-)levelsets {u > t}.
Analogous to the definition of BV, we can define higher-order BV spaces by applying
this idea to the gradients of W k,1 functions as follows. For simplicity we assume the scalar
valued case, i.e., u: Ω → R.
where Sym^k(R^d) is the space of symmetric tensors v of order k with arguments in R^d. For k = 1 this reduces to the usual total variation, as then v ∈ C_c^∞(Ω, R^d), the symmetry condition disappears, and

    div¹ v = ∑_{i₁ ∈ {1,…,d}} ∂_{x_{i₁}} v_{i₁} = div v.
Note that all functions in BV^k must have a (k − 1)-th derivative in L¹, which implies a certain regularity (see, e.g., the Sobolev embedding theorem).
In practice, TV^k can again be implemented by discretizing the k-th order derivatives using a matrix G_k ∈ R^{n d^k × n} and using a special norm defined as the sum of 2-norms over the rows of G_k u:

    TV^k(u) = ‖G_k u‖_∗ := ∑_{i=1}^n ‖(G_k u)_i‖₂,

which is very similar to the formulation for the "plain" TV regularizer, and can in most cases be easily substituted.
We can then separately examine the piecewise constant (“cartoon”) part u, the piecewise
affine part v, and the noise w, or use u + v to get a denoised version of g.
Definition 13.5. (infimal convolution) For functions f₁, …, f_n: X → R̄ on an arbitrary set X, we define the inf-convolution (f₁ □ ⋯ □ f_n): X → R̄ as
In this primal formulation, infimal convolutions are difficult to deal with, as even evalu-
ating them involves solving an optimization problem. Fortunately, they have a very concise
dual representation:
In practice it was found that the combined (TV □ TV²) regularizer can often be improved upon by using the Total Generalized Variation (TGV) [SST11, BKP10] instead:
Comparing this to the definition of TVk in (13.3), the difference is in the additional
constraints on the lower-order divergences on v.
Consider the case n = 2 and α = (1, 1), and assume that the discretization of the second-order derivatives can be decomposed as G₂ = Ḡ₂ G₁, where G₁ ∈ R^{2n×n} discretizes the first-order derivatives of u, and Ḡ₂ ∈ R^{4n×2n} discretizes the first-order derivatives of a vector field (in particular ∇u) with suitable boundary conditions. The suitable discretization for div¹ is then −Ḡ₂^⊤, which leads to (again assuming symmetric discretization of the gradient)
This is also known as the “differentiation cascade” formulation of TGV2, and can be
extended to TGVk. Comparing this to the standard infimal convolution,
we see that TGV2 is again a certain kind of infimal convolution, but instead of splitting
u into multiple components, the gradient of u is split. The cascading formulation is also
a convenient way of converting the dual energy in (13.8) into a form that can be handled
by most solvers.
In saddle-point form:

    ‖u‖_G = inf_v { δ_C^∗(v) + δ_{−G^⊤ v = u} }
          = inf_v sup_w { δ_C^∗(v) + ⟨w, G^⊤ v + u⟩ }
          = sup_w inf_v { δ_C^∗(v) + ⟨w, G^⊤ v + u⟩ }
          = sup_w ( −sup_v { ⟨v, G w⟩ − δ_C^∗(v) − ⟨w, u⟩ } )
          = sup_w { ⟨w, u⟩ − δ_C(G w) }
          = sup_w { ⟨w, u⟩ | ‖G w‖_∗ ≤ 1 }
          = sup_w { ⟨w, u⟩ | TV(w) ≤ 1 }.

The G-norm is the dual norm to the total variation, i.e., the norm associated with the unit ball with respect to TV. In particular,

    ‖·‖_G = δ_{B_TV}^∗,   B_TV := {u | TV(u) ≤ 1}.

In the same way, we get (compare (13.6))

    TV(u) = sup { ⟨u, −G^⊤ v⟩ | ‖v‖_∞ ≤ 1 }
          = sup { ⟨u, w⟩ | ∃v: ‖v‖_∞ ≤ 1, w = −G^⊤ v }
          = sup { ⟨u, w⟩ | ‖w‖_G ≤ 1 }
          = δ_{B_G}^∗(u),   B_G := {u | ‖u‖_G ≤ 1}.
Consider the problem of finding

    argmin_u ½ ‖u − f‖₂²  s.t.  u ∈ λ B_G,

i.e., separating the "texture" component of f with the assumption that the texture "level" as measured in the G-norm is at most λ. The solution is just the projection Π_{λ B_G}(f), which we can already relate to the ROF problem: From the example sheets we know that prox_f = I − prox_{f^∗}, and since TV = δ_{B_G}^∗ we have (λ TV)^∗ = δ_{λ B_G}, so

    Π_{λ B_G}(f) = f − prox_{λ TV}(f) = f − argmin_u { ½ ‖u − f‖₂² + λ TV(u) }.
This explains why the G-norm is a good candidate for regularizing data containing texture: solving the L²–‖·‖_G problem removes the part of an image that can be explained by a "structure" component with low total variation.
In the same way, we can rewrite TV-constrained problems as

    argmin_u { ½ ‖u − f‖₂²  s.t.  TV(u) ≤ λ } = f − argmin_u { ½ ‖u − f‖₂² + λ ‖u‖_G }.
In order to better cope with such situations, nonlocal regularizers have been proposed [BCM05, GO08]. The idea is to measure the regularity of the image at a given point not by the dissimilarity of its gray value to its immediate neighbours, but by its dissimilarity to other points in the image that are in a similar location of the repeating pattern. These points can potentially be far away, which gives rise to the name non-local regularization.
In a discrete setting, this can be achieved as follows.
Definition 13.9. Denote Ω = {1, …, n}. For u ∈ R^n, points x, y ∈ Ω, and a nonnegative weighting function w: Ω² → R_{≥0}, we define the non-local partial derivative ∂_y u(x) as

    ∂_y u(x) = (u(y) − u(x)) w(x, y).

The non-local gradient of u at x for the weighting function w is the n-vector

    ∇_w: R^n → R^{n×n},   ∇_w u(x) = (∂_y u(x))_{y ∈ Ω}.
The difference between ∇w and a usual discrete gradient is that ∇wu(x) is an n-vector
of all “partial derivatives” to all other points in the image, instead of a (usually) 2-vector
of the partial derivatives in x- and y-direction.
We can equally define a corresponding non-local divergence, which sums up all partial derivatives:

    div_w v(x) = ∑_{y ∈ Ω} (v(x, y) − v(y, x)) w(x, y).
With the usual Euclidean inner products in Rn and Rn×n, it can be seen that we have a
discrete “divergence theorem”,
    ⟨−div_w v, u⟩ = ⟨v, ∇_w u⟩.
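This discrete divergence theorem is a direct computation, and can be verified numerically for random data, as below. Note that the identity in this form requires symmetric weights, w(x, y) = w(y, x) — a property the pruned weights constructed later in this section do satisfy.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 12
u = rng.standard_normal(n)          # "image" u in R^n
v = rng.standard_normal((n, n))     # dual field v in R^{n x n}
w = rng.random((n, n))
w = (w + w.T) / 2.0                 # assume symmetric weights w(x,y) = w(y,x)

# non-local gradient: (grad_w u)[x, y] = (u[y] - u[x]) * w[x, y]
grad_w_u = (u[None, :] - u[:, None]) * w

# non-local divergence: (div_w v)[x] = sum_y (v[x,y] - v[y,x]) * w[x,y]
div_w_v = np.sum((v - v.T) * w, axis=1)

# discrete divergence theorem: <-div_w v, u> = <v, grad_w u>
assert abs((-div_w_v) @ u - np.sum(v * grad_w_u)) < 1e-9
```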
We can now define regularizers based on the non-local gradient.
    J(u) = ∑_{x ∈ Ω} g(∇_w u(x)).   (13.9)
Most gradient-based regularizers that involve norms can be immediately extended to this
setting. There is some freedom, for example in the TV case we could define the gradient-
or difference-based versions
    TV_NL^g(u) = ∑_{x ∈ Ω} ‖∇_w u(x)‖₂   or   TV_NL^d(u) = ∑_{x ∈ Ω} ‖∇_w u(x)‖₁.
For the ordinary total variation regularization, the 2-norm approach gives better results as
it better respects the isotropy of the total variation, but non-local regularization is inher-
ently anisotropic in any case due to the weighting function w, so there is no obvious “better”
candidate.
A relevant difference in practice is that while TV_NL^d only contains terms of the form |(u(y) − u(x)) w(x, y)| and is separable otherwise, TV_NL^g is a sum of 2-norms of n-dimensional vectors and is therefore much less separable, which makes it less attractive from an optimization viewpoint. Nevertheless, both regularizers are convex and can be reformulated in SOCP (TV_NL^g) or LP (TV_NL^d) form.
In the same way, it is possible to generalize many other gradient-based regularizers, such as the p-norm or the G-norm, to nonlocal gradients.
The defining feature is the choice of the weighting function w. First we note that (13.9) includes most usual discretizations of the total variation – setting w(x, y) = 1/h (where h is the grid spacing) if x and y are neighbours, and w(x, y) = 0 otherwise, leaves the nonlocal gradient essentially as the local gradient, extended with zeros.
The intuition is that w(x, y) should be large if the neighbourhoods of x and y are similar, as measured by a patch distance. A classical choice is to consider the patch distance

    d_u(x, y) = ∫_Ω K_σ(t) (u(y + t) − u(x + t))² dt,

which is just the L² distance weighted by a Gaussian K_σ centered at zero with variance σ². While it would be possible to set, e.g., w(x, y) = 1/(ε + d_u(x, y)), this leaves us with a very large optimization problem: for TV_NL^d, the objective would contain at least n² non-smooth terms. Even for moderately-sized images with n ≈ 100000 this is clearly not feasible.
Therefore the weights are usually pruned beforehand. The original choice is to define the set

    A(x) := argmin { ∑_{y ∈ A} d_u(x, y) | A ⊆ S(x), |A| = k }

for a given search neighborhood S(x), which consists of the k points around x with smallest distance. The weights are then simply set as

    w(x, y) = 1 if y ∈ A(x) or x ∈ A(y),   and   w(x, y) = 0 otherwise.
The reason for introducing the search neighbourhood S(x) is that computing du(x, y) for
all pairs of points x, y can already be too expensive.
Generally non-local regularizers tend to work very well in practice, but they also require
more parameters to be chosen. In particular, the search window weights Kσ should be large
enough for the noise error to average out when comparing patches, but small enough to get
a good resolution. The main obstacle when implementing such methods is computational,
i.e., how to quickly compute the patch distances and how to prune the weights in a way
that keeps the optimization problem tractable.
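The pruning step can be sketched as follows for a 1-D signal; all concrete parameter values here (patch radius, search radius, k) are illustrative choices of mine.

```python
import numpy as np

def nonlocal_weights(u, patch=2, search=10, k=4):
    """Pruned non-local weights for a 1-D signal: patch distance = squared
    l2 distance of patches of radius `patch`, search window of radius
    `search`, keep the k nearest neighbours per point, then symmetrize
    via w(x,y) = 1 if y in A(x) or x in A(y)."""
    n = len(u)
    up = np.pad(u, patch, mode='edge')
    # row i holds the patch of radius `patch` centered at point i
    patches = np.stack([up[i:i + 2 * patch + 1] for i in range(n)])
    w = np.zeros((n, n), dtype=bool)
    for x in range(n):
        lo, hi = max(0, x - search), min(n, x + search + 1)
        cand = [y for y in range(lo, hi) if y != x]
        d = [np.sum((patches[x] - patches[y])**2) for y in cand]
        for y in np.array(cand)[np.argsort(d)[:k]]:   # k nearest patches
            w[x, y] = True
    return w | w.T      # y in A(x) or x in A(y)

# periodic toy signal: the nearest patches lie roughly one period away
u = np.sin(np.linspace(0, 6 * np.pi, 60))
w = nonlocal_weights(u)
```

The resulting weight matrix is symmetric with at least k nonzeros per row, keeping the number of non-smooth terms in TV_NL^d at O(n k) rather than n².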
Chapter 14
Relaxation
Many real-world problems are inherently non-convex. A typical example is the problem
of segmenting an image into two regions: For given image data g, find a set C ⊆ Ω that
best describes the foreground in the sense that it fits to the given data, but also adheres
to some prior knowledge about the typical shape of the foreground.
A typical energy is the Chan-Vese model [CV01],

    f_CV(C, c₁, c₂) := ∫_C (g − c₁)² dx + ∫_{Ω∖C} (g − c₂)² dx + λ H^{d−1}(C),
where Hd−1(C) is the perimeter (i.e., length or area) of the boundary ∂C, and minimization
is performed over (C , c1, c2). The constants c1 and c2 describe the typical value of g inside
(c1) and outside (c2) of C, i.e., the problem consists in identifying the foreground and
background region together with model parameters (c1, c2) for each region.
The Chan-Vese model is in fact a special case of the Mumford-Shah model [MS89], one
of the best-studied – but still not fully understood – models in image processing:
    f_MS(K, u) = ∫_Ω (g − u)² dx + µ ∫_{Ω∖K} ‖∇u‖₂² dx + ν H^{d−1}(K),
2. Even if the constraint set was convex, the objective is not jointly convex in (u, c1, c2).
The second problem cannot be easily overcome. However, if we fix either u or c1, c2, then the
problem is convex in the other variable. In the remainder of this chapter we will therefore
assume that c₁ and c₂ are known and fixed. A possible way to obtain an (at least local) solution for unknown c₁, c₂ is to alternate between minimization with respect to u and to c₁, c₂.
The first point can be addressed by a relaxation approach: Instead of our original,
nonconvex constraint set
This is in fact a convex problem. Moreover, we can set s(x) := (g − c₁)² − (g − c₂)² and ignore the constant term (g − c₂)², as it does not change the set of minimizers, and obtain
This is even more appealing once one realizes that this problem is still convex if we replace
the terms (g − ci)2 by arbitrary (integrable) functions h1 and h2, and set s 8 h1 − h2. We
have found a variation of the problem that is always convex, no matter how complicated
the data term is.
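A small numerical illustration of the relaxed problem, using the primal-dual method of Chapter 9 as the solver. The 1-D setting, the forward-difference TV, and all parameter values are my own choices.

```python
import numpy as np

def relaxed_segmentation(g, c1, c2, lam, iters=5000):
    """Primal-dual (PDHG) sketch for the relaxed 1-D two-region model
         min_{u in [0,1]^n}  <s, u> + lam * sum_i |u_{i+1} - u_i|,
    with s = (g - c1)^2 - (g - c2)^2."""
    n = len(g)
    s = (g - c1)**2 - (g - c2)**2
    G = np.zeros((n - 1, n))                  # forward-difference matrix
    idx = np.arange(n - 1)
    G[idx, idx], G[idx, idx + 1] = -1.0, 1.0
    u = np.full(n, 0.5)
    ub = u.copy()
    v = np.zeros(n - 1)
    tau = sigma = 0.49                        # tau*sigma*||G||^2 < 1
    for _ in range(iters):
        v = np.clip(v + sigma * (G @ ub), -lam, lam)   # prox of (lam|.|_1)^*
        u_new = np.clip(u - tau * (G.T @ v + s), 0.0, 1.0)
        ub = 2.0 * u_new - u                  # extrapolation
        u = u_new
    return u

# noisy piecewise-constant signal: foreground ~1 on [30, 70), background ~0
rng = np.random.default_rng(3)
g = np.concatenate([np.zeros(30), np.ones(40), np.zeros(30)])
g = g + 0.2 * rng.standard_normal(100)
u = relaxed_segmentation(g, c1=1.0, c2=0.0, lam=0.5)
seg = u > 0.5          # threshold the relaxed solution to obtain a set
```

On clean data like this, thresholding the relaxed minimizer recovers the foreground interval almost exactly.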
There is a slightly ambiguous understanding of what is meant by a convex relaxation
approach – while it is frequently used to loosely describe the process of constructing some
convex problem from a non-convex problem, in other publications it is understood in the
strict sense of replacing a non-convex function f by its convex hull con f .
The cost of removing the non-convexity is that minimizers of f are not necessarily
indicator functions anymore. We can say the following:
Proposition 14.1. Assume c1, c2 are fixed. If u is a minimizer of f, and u(x) ∈ {0, 1}
a.e. (in particular, u = 1C for some set C ⊆ Ω), then C is a minimizer of fCV(·, c1, c2).
Definition 14.2. Denote C := BV(Ω, [0, 1]) := {u ∈ BV(Ω) | u(x) ∈ [0, 1] a.e.}. For u ∈ C and α ∈ [0, 1], define

    ū_α := 1_{u > α},   1_{u > α}(x) = 1 if u(x) > α, and 0 if u(x) ≤ α.
Proposition 14.3. Assume that s ∈ L∞(Ω) and Ω is bounded. Then the function f in
( 14.1) satisfies the generalized coarea condition.
Proof. For f(u) = TV(u), the condition is exactly the coarea formula (Thm. 13.3). As the condition is additive, we only have to show it for the linear part:

    ∫_Ω s(x) u(x) dx = ∫_Ω s(x) ∫₀¹ 1_{u(x)>α} dα dx = ∫₀¹ ∫_Ω s(x) 1_{u(x)>α} dx dα.

Swapping the integrals using Fubini's theorem requires showing that s(x) 1_{u(x)>α} is integrable, which holds since ∫_Ω |s(x) 1_{u(x)>α}| dx ≤ ‖s‖_∞ |Ω| < ∞ due to s ∈ L^∞(Ω) and Ω bounded.
The most important result in this section is the following, generalized from [CEN06]:
Theorem 14.4. (Thresholding Theorem) Assume that f: BV(Ω, [0, 1]) → R satisfies the generalized coarea condition, and u^∗ satisfies

    u^∗ ∈ argmin_{u ∈ BV(Ω, [0,1])} f(u).

Then for almost every α ∈ [0, 1], the thresholded function ū_α^∗ = 1_{u^∗ > α} satisfies

    ū_α^∗ ∈ argmin_{u ∈ BV(Ω, {0,1})} f(u).
Proof. We define the set of α violating the assertion, S := {α ∈ [0, 1] | f(ū_α^∗) ≠ f(u^∗)}. Since ū_α^∗ ∈ C_{0,1} and C_{0,1} ⊆ C, we have for any minimizer u_{0,1}^∗ of f over C_{0,1},

    f(u^∗) ≤ f(u_{0,1}^∗) ≤ f(ū_α^∗),   (14.2)

thus S = {α ∈ [0, 1] | f(u^∗) < f(ū_α^∗)}. Moreover, if α ∉ S, then f(u^∗) = f(u_{0,1}^∗) = f(ū_α^∗) by (14.2). Therefore, in order to show the theorem it suffices to show that S is an L¹-zero set.

Assume the contrary holds, i.e., L¹(S) > 0. Then there must be ε > 0 such that

    S_ε := {α ∈ [0, 1] | f(u^∗) ≤ f(ū_α^∗) − ε}   (14.3)

also has nonzero measure, since otherwise S would be a countable union of zero-measure sets, S = ∪_{n ∈ N} S_{1/n}, and would consequently have zero measure as well. Then
    f(u^∗) = ∫_{[0,1]∖S_ε} f(u^∗) dα + ∫_{S_ε} f(u^∗) dα                            (14.4)
           ≤ ∫_{[0,1]∖S_ε} f(ū_α^∗) dα + ∫_{S_ε} f(u^∗) dα        (u^∗ optimal)      (14.5)
           ≤ ∫_{[0,1]∖S_ε} f(ū_α^∗) dα + ∫_{S_ε} (f(ū_α^∗) − ε) dα   (definition of S_ε)   (14.6)
           = ∫₀¹ f(ū_α^∗) dα − ε L¹(S_ε).                          (linearity)       (14.7)

But we assumed L¹(S_ε) > 0, therefore

    f(u^∗) < ∫₀¹ f(ū_α^∗) dα,   (14.8)

which contradicts the generalized coarea condition ∫₀¹ f(ū_α^∗) dα = f(u^∗).
At the heart of the proof is the generalized coarea condition. It has the following intuitive interpretation:
1. The function u may be written in the form of a “generalized convex combination”
of (an infinite number of) extreme points Eu 8 {ūα |α ∈ [0, 1]} of the constraint
set, i.e. the unit ball in BV(Ω). As shown in [Fle57] based on a result by Choquet
[Cho56], and noted in [Str83, p.127], extreme points of this constraint set are (multiples of, but in this case equal to) indicator functions.
2. The extreme points (ūα) and coefficients in this convex combination can be explicitly
found. In fact, the coefficients are all equal to 1/|[0, 1]| = 1, i.e. u is the barycenter
of the points in Eu.
3. For any convex f, the inequality

    ∫₀¹ f(ū_α) dα ≥ f(u)   (14.9)

always holds. The generalized coarea condition is therefore equivalent to the reverse inequality.
In fact, the original proof of the coarea formula [FR60] relies on showing (14.9) and using the fact that the coarea formula holds for piecewise linear u [FR60, (1.5c)]. Approximating an arbitrary u ∈ BV(Ω) by a sequence of piecewise linear functions, this result is then transported to the general case.
Remark 14.5. It is important to understand that Thm. 14.4 in its current form only holds before discretization: Assume the problem is discretized according to

    min_{u ∈ [0,1]^n} ⟨s, u⟩ + λ ‖G u‖_∗,   (14.10)

where ‖G u‖_∗ is a suitable norm implementing the total variation. From Thm. 14.4 one might naturally assume that if u^∗ solves (14.10), then for almost every α ∈ [0, 1], the thresholded minimizer ū_α^∗ solves the combinatorial problem

    min_{u ∈ {0,1}^n} ⟨s, u⟩ + λ ‖G u‖_∗.
This is generally not the case! The reason is that in order to transfer Thm. 14.4 to the discretized problem, the discretized objective function needs to satisfy the generalized coarea condition. While the linear term does not pose a problem, the usual "isotropic" discretizations of the total variation, such as ∑_i ‖G_i u‖₂ with G_i implementing forward, central, or finite differences on a staggered grid, do not have this property.
In finite dimensions, the integral

    ∫₀¹ f(ū_α) dα

is also known as the Lovász extension of an energy f: {0, 1}^n → R. Energies with a convex Lovász extension are called "submodular", and play a central role in combinatorial optimization, as a large problem class that can be solved efficiently.
Submodular energies that consist of a sum of terms involving at most two variables
ui can be solved by computing a minimal cut through a graph, which can be achieved in
polynomial time by solving the dual problem with specialized “maximum-flow” methods
[BVZ01, Ber98].
The energy (14.10) can be made to satisfy a generalized coarea property, for example by choosing a forward difference discretization for G, and setting

    ‖G u‖_∗ = ∑_i ‖G_i u‖₁ = ∑_{i,j} ( |u_{(i+1),j} − u_{i,j}| + |u_{i,(j+1)} − u_{i,j}| ).
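The anisotropic (ℓ¹, forward-difference) discretization indeed satisfies a discrete coarea formula: for values in [0, 1], applying |a − b| = ∫₀¹ |1_{a>α} − 1_{b>α}| dα to every difference term gives ∑_i |u_{i+1} − u_i| = ∫₀¹ TV₁(1_{u>α}) dα. A quick 1-D numpy check (the integration over α is exact because the integrand is piecewise constant between the values of u):

```python
import numpy as np

def tv1(u):               # anisotropic discrete TV (forward differences, l1)
    return np.sum(np.abs(np.diff(u)))

rng = np.random.default_rng(4)
u = rng.random(20)        # values in [0, 1]

# TV1(1_{u > alpha}) is piecewise constant in alpha, with breakpoints only
# at the values of u, so the integral over alpha can be computed exactly by
# sampling each interval between consecutive breakpoints at its midpoint
levels = np.concatenate([[0.0], np.sort(np.unique(u)), [1.0]])
integral = 0.0
for a, b in zip(levels[:-1], levels[1:]):
    alpha = 0.5 * (a + b)
    integral += (b - a) * tv1((u > alpha).astype(float))

# discrete coarea formula: the integral equals TV1(u)
assert abs(integral - tv1(u)) < 1e-9
```

Running the same experiment with the isotropic 2-D discretization ∑_i ‖G_i u‖₂ would break the equality, which is exactly the point of Remark 14.5.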