Lecture4 APG

This document summarizes key concepts from a lecture on convex analysis, including: (1) basic definitions of norms, inner products, and projections onto closed convex sets; (2) extended real-valued functions and properties such as convexity; (3) indicator functions and their relationship to convex sets; (4) dual and polar cones and conditions for self-duality. It then covers the proximal operator and the (accelerated) proximal gradient method.


APG

DSA5103 Lecture 4

Yangjing Zhang
05-Sep-2023
NUS
Today’s content

1. Basic convex analysis


2. Proximal operator
3. (Accelerated) proximal gradient method

lecture4 1/54
Basic convex analysis
Norms

A vector norm on R^n is a function ‖·‖ : R^n → R satisfying the following properties:
(1) ‖x‖ ≥ 0 ∀ x ∈ R^n, and ‖x‖ = 0 ⟺ x = 0
(2) ‖αx‖ = |α|‖x‖ ∀ α ∈ R, x ∈ R^n
(3) ‖x + y‖ ≤ ‖x‖ + ‖y‖ ∀ x, y ∈ R^n

Example.
1. ‖x‖_1 = Σ_{i=1}^n |x_i|  (ℓ1 norm)
2. ‖x‖_2 = (Σ_{i=1}^n |x_i|²)^{1/2}  (ℓ2 norm)
3. ‖x‖_p = (Σ_{i=1}^n |x_i|^p)^{1/p}, where 1 ≤ p < ∞  (ℓp norm)
4. ‖x‖_∞ = max_{1≤i≤n} |x_i|  (ℓ∞ norm)
5. ‖x‖_{W,p} = ‖Wx‖_p, where W is a fixed nonsingular matrix, 1 ≤ p ≤ ∞

lecture4 2/54
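As a quick numerical illustration of these definitions, here is a small sketch assuming NumPy is available; the vector x and the nonsingular matrix W below are arbitrary choices for illustration, not part of the lecture.

import numpy as np

x = np.array([1.0, -2.0, 3.0])

l1 = np.sum(np.abs(x))                 # ||x||_1 = 6
l2 = np.sqrt(np.sum(x**2))             # ||x||_2 = sqrt(14)
p = 3
lp = np.sum(np.abs(x)**p)**(1.0 / p)   # ||x||_p for p = 3
linf = np.max(np.abs(x))               # ||x||_inf = 3

W = np.array([[2.0, 0.0, 0.0],         # a fixed nonsingular matrix (illustrative)
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
lW2 = np.linalg.norm(W @ x, 2)         # ||x||_{W,2} = ||Wx||_2

print(l1, l2, lp, linf, lW2)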
Inner product

For the space of m × n matrices, R^{m×n}, we define the standard inner product: for any A, B ∈ R^{m×n},

    ⟨A, B⟩ = Tr(Aᵀ B) = Σ_{i=1}^m Σ_{j=1}^n A_ij B_ij

Recap: the trace of a square matrix C ∈ R^{n×n} is Tr(C) = Σ_{i=1}^n C_ii.

For the space of n-vectors, R^n, we define the standard inner product: for any x, y ∈ R^n,

    ⟨x, y⟩ = xᵀ y = Σ_{i=1}^n x_i y_i

lecture4 3/54
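A small check, assuming NumPy, that the trace formula and the entrywise sum give the same inner product; the random matrices are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))
B = rng.standard_normal((3, 2))

inner_trace = np.trace(A.T @ B)     # <A, B> via the trace
inner_sum = np.sum(A * B)           # <A, B> via the entrywise sum

print(np.isclose(inner_trace, inner_sum))  # True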
Inner product

From the definition of inner products, the following properties hold for any matrices A, B, C ∈ R^{m×n} and any scalars a, b ∈ R:

• ‖A‖²_F := ⟨A, A⟩ ≥ 0
• ⟨A, A⟩ = 0 if and only if A = 0
• ⟨C, aA + bB⟩ = a⟨C, A⟩ + b⟨C, B⟩
• ⟨A + B, A + B⟩ = ⟨A, A⟩ + 2⟨A, B⟩ + ⟨B, B⟩
  (i.e., ‖A + B‖²_F = ‖A‖²_F + 2⟨A, B⟩ + ‖B‖²_F)

lecture4 4/54
Projection onto a closed convex set

Theorem (Projection theorem). Let C be a closed convex set.

(1) For every z, there exists a unique minimizer of

    min_{x∈C} (1/2)‖x − z‖²,

denoted ΠC(z) and called the projection of z onto C.
(2) x* := ΠC(z) is the projection of z onto C if and only if

    ⟨z − x*, x − x*⟩ ≤ 0  ∀ x ∈ C.

lecture4 5/54
Projection onto a closed convex set

Example.
1. C = R^n_+ = {x ∈ R^n | x_i ≥ 0, ∀ i = 1, 2, . . . , n}  (nonnegative orthant)

    ΠC(z) = Π_{R^n_+}(z) = max{z, 0}  (componentwise)

2. C = {x ∈ R^n | ‖x‖_2 ≤ 1}  (ℓ2-norm ball)

    ΠC(z) = z if ‖z‖_2 ≤ 1, and z/‖z‖_2 if ‖z‖_2 > 1; equivalently ΠC(z) = z / max{‖z‖_2, 1}

Figure 1: C = {x ∈ R² | ‖x‖_2 ≤ 1}, the ℓ2-norm ball


lecture4 6/54
Projection onto a closed convex set

Example.
3. C = S^n_+, the cone of n × n symmetric positive semidefinite matrices:

    ΠC(A) = Π_{S^n_+}(A) = Q diag(max{λ_1, 0}, . . . , max{λ_n, 0}) Qᵀ

for a given A ∈ S^n with eigenvalue decomposition

    A = Q diag(λ_1, . . . , λ_n) Qᵀ

lecture4 7/54
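A minimal NumPy sketch of these three projections (nonnegative orthant, ℓ2 ball, PSD cone); the test inputs are arbitrary illustrations.

import numpy as np

def proj_nonneg(z):
    # Projection onto the nonnegative orthant: clip negative entries to zero
    return np.maximum(z, 0.0)

def proj_l2_ball(z, r=1.0):
    # Projection onto the l2 ball of radius r: rescale if outside
    nrm = np.linalg.norm(z)
    return z if nrm <= r else (r / nrm) * z

def proj_psd(A):
    # Projection onto the PSD cone: clip negative eigenvalues
    lam, Q = np.linalg.eigh((A + A.T) / 2)   # symmetrize for numerical safety
    return Q @ np.diag(np.maximum(lam, 0.0)) @ Q.T

z = np.array([1.5, -0.3])
print(proj_nonneg(z))          # [1.5 0. ]
print(proj_l2_ball(z))         # rescaled to the unit ball (||z||_2 > 1 here)

A = np.array([[1.0, 2.0], [2.0, -3.0]])
print(proj_psd(A))             # nearest PSD matrix in Frobenius norm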
arg min

arg min_x f(x) denotes the solution set, i.e., the set of x for which f(x) attains its minimum (argument of the minimum).

[Figure: two example functions f.]
Left: min_x f(x) = 0, arg min_x f(x) = {2}
Right: min_x f(x) = 0, arg min_x f(x) = [−1, 1]

    ΠC(z) = arg min_{x∈C} (1/2)‖x − z‖²
lecture4 8/54
Extended real-valued function

Definition.
Let X be a Euclidean space (e.g., X = R^n or R^{m×n}). Let f : X → (−∞, +∞] be an extended real-valued function.¹
(1) The (effective) domain of f is defined to be the set

    dom(f) := {x ∈ X | f(x) < +∞}.

(2) f is said to be proper if dom(f) ≠ ∅.
(3) f is said to be closed if its epigraph

    epi(f) := {(x, α) ∈ X × R | f(x) ≤ α}

is closed.
(4) f is said to be convex if its epigraph is convex.
¹ Here f is allowed to take the value +∞, but not the value −∞.
lecture4 9/54
Extended real-valued function

• For a real-valued function f : X → R, this definition of convexity,

    epi(f) is convex,  (1)

coincides with the one we have used:

    f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)  ∀ x, y ∈ dom(f), λ ∈ [0, 1].  (2)

[Exercise] (1) ⟺ (2)
• A convex function f : D ⊆ R^n → R can be extended to a convex function on all of R^n by setting f(x) = +∞ for x ∉ D.

lecture4 10/54
Example: extended real-valued function

    f(x) = 0 if x = 0;  1 if x > 0;  +∞ if x < 0

• dom(f) = [0, +∞), so f is proper
• epi(f) = ({0} × [0, +∞)) ∪ ((0, +∞) × [1, +∞)) is closed, i.e., f is closed
• epi(f) is not convex, i.e., f is not convex

lecture4 11/54
Indicator function

Let C be a nonempty set in X. The indicator function of C is

    δC(x) = 0 if x ∈ C;  +∞ if x ∉ C.

Writing f = δC (the figure shows such an f : R² → (−∞, +∞]):
• dom(f) = C, so f is proper
• epi(f) = C × [0, +∞) is closed if C is closed, i.e., δC(·) is closed if C is closed
• epi(f) is convex if C is convex, i.e., δC(·) is convex if C is convex

lecture4 12/54
Dual/polar cone

Definition (Cone)
A set C ⊆ X is called a cone if λx ∈ C when x ∈ C and λ ≥ 0.

Definition (Dual and polar cone)


The dual cone of a set C ⊆ X (not necessarily convex) is defined by

    C* = {y ∈ X | ⟨x, y⟩ ≥ 0 ∀ x ∈ C}.

The polar cone of C is C° = −C*.

If C* = C, then C is said to be self-dual.

• C* is always a convex cone, even if C is neither convex nor a cone.

lecture4 13/54
Dual/polar cone

Figure 2: Left: a set C and its dual cone C*. Right: a set C and its polar cone C°. The dual cone and the polar cone are symmetric to each other with respect to the origin. Image from the internet.

lecture4 14/54
Example: self-dual cones

1. X = R^n. C = R^n_+ is a self-dual closed convex cone.
   Proof: C* = {y ∈ R^n | ⟨x, y⟩ ≥ 0 ∀ x ∈ R^n_+} = R^n_+.

2. X = S^n. C = S^n_+ is a self-dual closed convex cone (PSD cone).
   Proof: we want to show that {B ∈ S^n | ⟨A, B⟩ ≥ 0 ∀ A ∈ S^n_+} = S^n_+.
   [LHS ⊆ RHS] Take B ∈ C*. For any x ∈ R^n, we have xxᵀ ∈ S^n_+ and thus ⟨xxᵀ, B⟩ = xᵀBx ≥ 0, which implies B ∈ S^n_+.
   [RHS ⊆ LHS] Take B ∈ S^n_+. Compute its eigenvalue decomposition B = Σ_{i=1}^n λ_i v_i v_iᵀ, λ_i ≥ 0. For any A ∈ S^n_+, we have ⟨A, B⟩ = ⟨A, Σ_{i=1}^n λ_i v_i v_iᵀ⟩ = Σ_{i=1}^n λ_i v_iᵀ A v_i ≥ 0.

lecture4 15/54
Example: self-dual cones
3. X = R^n. C := {x ∈ R^n | √(x_2² + · · · + x_n²) ≤ x_1, x_1 ≥ 0} is a self-dual closed convex cone (second-order cone).
   Proof: we want to show that {y ∈ R^n | ⟨x, y⟩ ≥ 0 ∀ x ∈ C} = C.
   [RHS ⊆ LHS] Take y ∈ C. For any x ∈ C,

       ⟨x, y⟩ = x_1 y_1 + x_2 y_2 + · · · + x_n y_n ≥ x_1 y_1 − √(Σ_{i=2}^n x_i²) √(Σ_{i=2}^n y_i²) ≥ 0.

   The first inequality follows from the Cauchy–Schwarz inequality, and the last inequality follows from the fact that x, y ∈ C.
   [LHS ⊆ RHS] Take y ∈ C*. If [y_2; . . . ; y_n] = 0, we take x = [1; 0; . . . ; 0] ∈ C; then ⟨x, y⟩ = y_1 ≥ 0, so y ∈ C. Otherwise, we take

       x = [√(Σ_{i=2}^n y_i²); −y_2; . . . ; −y_n] ∈ C,

   then

       ⟨x, y⟩ = y_1 √(Σ_{i=2}^n y_i²) − y_2² − · · · − y_n² ≥ 0  ⇒  y_1 ≥ √(Σ_{i=2}^n y_i²)  ⇒  y ∈ C.

lecture4 16/54
Example: self-dual cones

Figure 3: Left: second-order cone {x ∈ R² | x_1 ≥ |x_2|}. Right: second-order cone {x ∈ R³ | x_1 ≥ √(x_2² + x_3²)}. Image from the internet.

lecture4 17/54
Normal cone

Definition (Normal cone)
Let C be a convex set in X and x̄ ∈ C. The normal cone of C at x̄ ∈ C is defined by

    N_C(x̄) := {z ∈ X | ⟨z, x − x̄⟩ ≤ 0 ∀ x ∈ C}.

By convention, we let N_C(x̄) = ∅ if x̄ ∉ C.

lecture4 18/54
Example: normal cones

N_C(x̄) := {z ∈ X | ⟨z, x − x̄⟩ ≤ 0 ∀ x ∈ C}

Example 1. C = [0, 1] ⊆ R

    N_C(x̄) = (−∞, 0] if x̄ = 0;  [0, +∞) if x̄ = 1;  {0} if x̄ ∈ (0, 1);  ∅ if x̄ ∉ C

lecture4 19/54
Example: normal cones

N_C(x̄) := {z ∈ X | ⟨z, x − x̄⟩ ≤ 0 ∀ x ∈ C}

Example 2. C = {x ∈ R² | ‖x‖ ≤ 1} ⊆ R²

    N_C(x̄) = {λx̄ | λ ≥ 0} if ‖x̄‖ = 1;  {0} if ‖x̄‖ < 1;  ∅ if x̄ ∉ C

On page 24, we show that u ∈ N_C(x̄) ⟺ x̄ = Π_C(x̄ + u). If ‖x̄‖ = 1 and u ≠ 0, then x̄ = Π_C(x̄ + u) implies that ‖x̄ + u‖ > 1 and x̄ = Π_C(x̄ + u) = (x̄ + u)/‖x̄ + u‖. It follows that u = (‖x̄ + u‖ − 1)x̄ with ‖x̄ + u‖ − 1 > 0. That is, N_C(x̄) = {λx̄ | λ ≥ 0}, as it is a cone.

lecture4 20/54
Example: normal cones

N_C(x̄) := {z ∈ X | ⟨z, x − x̄⟩ ≤ 0 ∀ x ∈ C}

Example 3. C = a triangle ⊆ R²

[Exercise] (1) C = {x ∈ R² | x_1 + x_2 ≤ 1, x_1 ≥ 0, x_2 ≥ 0}. Find N_C(x̄) for x̄ = [0; 1], x̄ = [0.5; 0.5], x̄ = [0.1; 0.2].
(2) C = R^n_+. Find N_C(x̄) for x̄ = [1; 1; 0; 0; . . . ; 0].

lecture4 21/54
Normal cone

Proposition. Let C ⊆ X be a nonempty convex set and x̄ ∈ C. Then
(1) N_C(x̄) is a closed convex cone.
(2) If x̄ ∈ int(C) (x̄ is an interior point of C), then N_C(x̄) = {0}.
(3) If C is a cone, then N_C(x̄) ⊆ C°.

lecture4 22/54
Normal cone

Proof*: (1) Closedness is easy to prove [Exercise]. First we prove that it is a cone. Consider z ∈ N_C(x̄) and λ ≥ 0. By definition, we have ⟨z, x − x̄⟩ ≤ 0 ∀ x ∈ C ⇒ ⟨λz, x − x̄⟩ ≤ 0 ∀ x ∈ C. Thus λz ∈ N_C(x̄). Next we show that if z_1, z_2 ∈ N_C(x̄), then z_1 + z_2 ∈ N_C(x̄). By definition, we have ⟨z_1, x − x̄⟩ ≤ 0 and ⟨z_2, x − x̄⟩ ≤ 0 for all x ∈ C. Thus ⟨z_1 + z_2, x − x̄⟩ ≤ 0 for all x ∈ C, which implies that z_1 + z_2 ∈ N_C(x̄); together with closure under nonnegative scaling, this shows N_C(x̄) is convex.
(2) Let z ∈ N_C(x̄). Since x̄ ∈ int(C), there exists ε > 0 such that x̄ + tz ∈ C for |t| < ε. By the definition of the normal cone, 0 ≥ ⟨z, (x̄ + tz) − x̄⟩ = t‖z‖². By considering both positive and negative t, we get ‖z‖² = 0. Hence z = 0.
(3) Suppose C is a cone. Then x̄ + x ∈ C for any x ∈ C. Thus for any z ∈ N_C(x̄), we have ⟨z, x⟩ = ⟨z, (x + x̄) − x̄⟩ ≤ 0 ∀ x ∈ C. Hence z ∈ C°. That is, N_C(x̄) ⊆ C°.

lecture4 23/54
Normal cone

Proposition. Let C ⊆ X be a nonempty closed convex set. Then for any y ∈ C,

    u ∈ N_C(y) ⟺ y = Π_C(y + u)

Proof. “⇒” Suppose u ∈ N_C(y). Then ⟨u, x − y⟩ ≤ 0 for all x ∈ C. Thus

    ⟨(y + u) − y, x − y⟩ ≤ 0 for all x ∈ C,

which implies that y = Π_C(y + u) (by the projection theorem).
“⇐” Suppose y = Π_C(y + u). Then we know that

    ⟨u, x − y⟩ = ⟨(y + u) − y, x − y⟩ ≤ 0 for all x ∈ C,

which implies that u ∈ N_C(y).

lecture4 24/54
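A small numerical illustration, assuming NumPy, of this equivalence for the unit ℓ2-norm ball, whose normal cone at a boundary point x̄ is {λx̄ | λ ≥ 0} (Example 2 above); the specific vectors below are arbitrary.

import numpy as np

def proj_ball(z):
    # Projection onto the unit l2 ball
    nrm = np.linalg.norm(z)
    return z if nrm <= 1 else z / nrm

y = np.array([0.6, 0.8])            # ||y|| = 1, a boundary point of the ball
u_in = 2.0 * y                      # a nonnegative multiple of y: u_in is in N_C(y)
u_out = np.array([0.8, -0.6])       # orthogonal to y: not in the normal cone at y

print(np.allclose(proj_ball(y + u_in), y))    # True:  y = Pi_C(y + u_in)
print(np.allclose(proj_ball(y + u_out), y))   # False: y != Pi_C(y + u_out)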
Subdifferential

Definition. Let f : X → (−∞, +∞] be a convex function.
We call v a subgradient of f at x ∈ dom(f) if

    f(z) ≥ f(x) + ⟨v, z − x⟩  ∀ z ∈ X.

The set of all subgradients at x is called the subdifferential of f at x, denoted

    ∂f(x) = {v | f(z) ≥ f(x) + ⟨v, z − x⟩ ∀ z ∈ X}.

By convention, ∂f(x) = ∅ for any x ∉ dom(f).

lecture4 25/54
Subdifferential and optimization

Subgradient is an extension of gradient

• If f is differentiable at x, then ∂f(x) = {∇f(x)}.
  Proof: If v ∈ ∂f(x), then f(x + h) ≥ f(x) + ⟨v, h⟩ ∀ h. Taking h = t(v − ∇f(x)), t > 0, and using the first-order Taylor expansion of f, we get t‖v − ∇f(x)‖² ≤ o(t) ∀ t > 0, which implies v = ∇f(x).

Theorem. Let f : X → (−∞, +∞] be a proper convex function. Then x̄ ∈ X is a global minimizer of min_{x∈X} f(x) if and only if 0 ∈ ∂f(x̄).

Proof. By the subgradient inequality, 0 ∈ ∂f(x̄) means f(z) ≥ f(x̄) + ⟨0, z − x̄⟩ = f(x̄) ∀ z ∈ X, i.e., x̄ is a global minimizer.
lecture4 26/54
Example: subdifferential

Example 1. f(x) = |x|, x ∈ R.

    ∂f(x) = {−1} if x < 0;  [−1, 1] if x = 0;  {1} if x > 0

[Figure: f(x) = |x| on [−1, 1].]

The ℓ1 norm promotes sparsity: for example, in lasso

    minimize_{β∈R^p} (1/2)‖Xβ − Y‖² + λ‖β‖_1

Optimality condition: 0 ∈ Xᵀ(Xβ − Y) + λ∂(‖·‖_1)(β), i.e., componentwise

    β_i < 0:  0 = [Xᵀ(Xβ − Y)]_i − λ
    β_i = 0:  0 ∈ [Xᵀ(Xβ − Y)]_i + λ[−1, 1]   (∗)
    β_i > 0:  0 = [Xᵀ(Xβ − Y)]_i + λ

(∗) ⟺ β_i = 0 and |[Xᵀ(Xβ − Y)]_i| ≤ λ.  Sparsity!


lecture4 27/54
Example: subdifferential

Example 2. f(x) = max{x² − 1, 0}, x ∈ R.

    ∂f(x) = {2x} if x < −1 or x > 1;  {0} if −1 < x < 1;  [−2, 0] if x = −1;  [0, 2] if x = 1

[Figure: f(x) = max{x² − 1, 0} on [−1.5, 1.5].]

lecture4 28/54
Example: subdifferential

Example 3. Let C be a convex set.

    ∂δC(x) = N_C(x) if x ∈ C;  ∅ if x ∉ C.

Proof: Take x ∈ C.

    v ∈ ∂δC(x)
    ⟺ δC(z) ≥ δC(x) + ⟨v, z − x⟩ ∀ z
    ⟺ 0 ≥ ⟨v, z − x⟩ ∀ z ∈ C
    ⟺ v ∈ N_C(x)

lecture4 29/54
Lipschitz continuous

Definition (Lipschitz continuity)
A function F : R^n → R^m is said to be locally Lipschitz continuous if for any bounded open set O ⊆ R^n, there exists a constant L (depending on O) such that

    ‖F(x) − F(y)‖ ≤ L‖x − y‖  ∀ x, y ∈ O.

If the above holds with O = R^n, then F is said to be globally Lipschitz continuous.

Example.
(1) f(x) = |x|, x ∈ R, is globally Lipschitz continuous with Lipschitz constant L = 1.
(2) f(x) = x², x ∈ R, is locally Lipschitz continuous but not globally Lipschitz continuous.

lecture4 30/54
Fenchel conjugate

Definition. Let f : X → [−∞, +∞]. The (Fenchel) conjugate of f is defined by

    f*(y) = sup{⟨y, x⟩ − f(x) | x ∈ X},  y ∈ X.

Remark
• f* is always closed and convex, even if f is neither convex nor closed.
• If f : X → (−∞, +∞] is a closed proper convex function, then (f*)* = f.

lecture4 31/54
Fenchel conjugate

Proposition. Let f : X → (−∞, +∞] be a closed proper convex function. The following are equivalent:
(1) f(x) + f*(y) = ⟨x, y⟩
(2) y ∈ ∂f(x)
(3) x ∈ ∂f*(y)

• “y ∈ ∂f(x) ⟺ x ∈ ∂f*(y)” means that ∂f* is the inverse of ∂f in the sense of multi-valued mappings.

lecture4 32/54
Example: conjugate function

Example 1. Let C ⊆ X be a nonempty convex set. Compute the conjugate of the indicator function of C.
Solution. Recall that the indicator function of C is

    δC(x) = 0 if x ∈ C;  +∞ if x ∉ C.

Its conjugate

    δ*_C(y) = sup{⟨y, x⟩ − δC(x) | x ∈ X} = sup{⟨y, x⟩ | x ∈ C}

is called the support function of C.

lecture4 33/54
Example: conjugate function

Example 2. Let f(x) = ‖x‖_1, x ∈ R^n. Compute f*.
Solution. f*(y) = sup_x {⟨y, x⟩ − ‖x‖_1}.
Case 1: if ‖y‖_∞ ≤ 1, we have ⟨y, x⟩ − ‖x‖_1 ≤ ‖x‖_1 ‖y‖_∞ − ‖x‖_1 ≤ 0, and the value 0 is attained at x = 0. Therefore, f*(y) = 0.
Case 2: if ‖y‖_∞ > 1, there exists k with |y_k| > 1. For a positive integer m, we construct

    x̄ = [0; . . . ; 0; m sign(y_k); 0; . . . ; 0]

and therefore f*(y) ≥ ⟨y, x̄⟩ − ‖x̄‖_1 = m(|y_k| − 1) → +∞ as m → +∞. Therefore, f*(y) = +∞.
In conclusion, f*(y) = δC(y) with C = {y ∈ R^n | ‖y‖_∞ ≤ 1}.
[Exercise] Let f(x) = λ‖x‖_p, x ∈ R^n, 1 < p < ∞, λ > 0. Show that f*(y) = δC(y) with C = {y ∈ R^n | ‖y‖_q ≤ λ}, 1/p + 1/q = 1.

lecture4 34/54
Proximal operator
Moreau envelope and proximal mapping

Let f : X → (−∞, +∞] be a closed proper convex function. We define

• Moreau envelope (Moreau–Yosida regularization) of f at x:

    M_f(x) = min_y { f(y) + (1/2)‖y − x‖² }

• Proximal mapping of f at x:

    P_f(x) = arg min_y { f(y) + (1/2)‖y − x‖² }

• M_f(x) is differentiable, with gradient ∇M_f(x) = x − P_f(x)
  (the Moreau envelope is a way to smooth a possibly non-differentiable convex function)
• P_f(x) exists and is unique
• M_f(x) ≤ f(x)
• arg min_x f(x) = arg min_x M_f(x)

lecture4 35/54
Example

Let C ⊆ X be a nonempty closed convex set and let f(x) = δC(x) be the indicator function of C. Its proximal mapping is

    P_f(x) = arg min_{y∈X} { δC(y) + (1/2)‖y − x‖² } = arg min_{y∈C} (1/2)‖y − x‖² = ΠC(x)

Its Moreau envelope is

    M_f(x) = (1/2)‖x − ΠC(x)‖²

lecture4 36/54
Example

Let f(x) = λ|x|, x ∈ R. Its Moreau envelope (known as the Huber function) is

    M_f(x) = (1/2)x²      if |x| ≤ λ,
             λ|x| − λ²/2  if |x| > λ.

Its proximal mapping (known as soft thresholding) is

    P_f(x) = sign(x) max{|x| − λ, 0}

[Figure: the soft thresholding map P_f, and f together with its Moreau envelope M_f, plotted over [−2, 2].]

We can see that M_f is a smoothing of f, M_f ≤ f, and arg min f(x) = arg min M_f(x).
lecture4 37/54
Soft thresholding

The soft thresholding operator S_λ : R^n → R^n is defined componentwise as

    S_λ(x) = [sign(x_1) max{|x_1| − λ, 0}; sign(x_2) max{|x_2| − λ, 0}; . . . ; sign(x_n) max{|x_n| − λ, 0}]

for any x = [x_1; . . . ; x_n] ∈ R^n and λ > 0.

Example. Given x = [1.5; −0.4; 3; −2; 0.8],

    S_0.5(x) = [1; 0; 2.5; −1.5; 0.3],   S_2(x) = [0; 0; 1; 0; 0].

lecture4 38/54
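A minimal NumPy sketch of S_λ that reproduces the example above.

import numpy as np

def soft_threshold(x, lam):
    # S_lambda(x): componentwise sign(x_i) * max(|x_i| - lambda, 0)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([1.5, -0.4, 3.0, -2.0, 0.8])
print(soft_threshold(x, 0.5))   # [ 1.   0.   2.5 -1.5  0.3]
print(soft_threshold(x, 2.0))   # [ 0.  0.  1.  0.  0.]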
Moreau envelope and proximal mapping

Theorem (Moreau decomposition)
Let f : X → (−∞, +∞] be a closed proper convex function and f* be its conjugate. For any x ∈ X, it holds that

    x = P_f(x) + P_{f*}(x)
    (1/2)‖x‖² = M_f(x) + M_{f*}(x)

Example. Let C ⊆ X be a nonempty closed convex cone, f(x) = δC(x), and f*(x) = δ*_C(x) = δ_{C°}(x). Therefore

    x = ΠC(x) + Π_{C°}(x).

• M_f(·) is always differentiable even though f may be non-differentiable
• P_f(·) is important in many optimization algorithms (e.g., the accelerated proximal gradient methods introduced later)
• For many widely used regularizers, P_f(·) and M_f(·) have explicit expressions
lecture4 39/54
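Two quick numerical checks of the Moreau decomposition, assuming NumPy: first with f = δ_C for the cone C = R^n_+ (so the proximal mappings are the projections onto C and its polar C° = −R^n_+), and then with f = λ‖·‖_1, whose conjugate is the indicator of the ℓ∞-ball of radius λ (so P_{f*} is the projection onto that ball). The test vector and λ are arbitrary.

import numpy as np

x = np.array([1.5, -0.4, 3.0, -2.0, 0.8])

# Check x = Pi_C(x) + Pi_{C°}(x) for C = R^n_+, whose polar cone is -R^n_+
p_cone = np.maximum(x, 0.0)          # projection onto R^n_+
p_polar = np.minimum(x, 0.0)         # projection onto -R^n_+
print(np.allclose(x, p_cone + p_polar))       # True

# Check x = P_f(x) + P_{f*}(x) for f = lam*||.||_1
lam = 0.5
prox_f = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)   # soft thresholding
prox_fstar = np.clip(x, -lam, lam)                        # projection onto {||y||_inf <= lam}
print(np.allclose(x, prox_f + prox_fstar))                # True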
lecture4 40/54
(Accelerated) proximal gradient method
A proximal point view of gradient methods

To minimize a differentiable function, min_β f(β), the gradient method iterates

    β^(k+1) = β^(k) − α_k ∇f(β^(k)).

The gradient step can be written equivalently as

    β^(k+1) = arg min_β { f(β^(k)) + ⟨∇f(β^(k)), β − β^(k)⟩ + (1/(2α_k))‖β − β^(k)‖² },

where the first two terms are the linear approximation of f at β^(k) and the last term is the proximal term.

lecture4 41/54
Optimizing composite functions

    minimize_{β∈R^p} f(β) + g(β)

• f : R^p → R is convex and differentiable, and ∇f is L-Lipschitz continuous
• g : R^p → (−∞, +∞] is closed proper convex, non-differentiable
• For example, in lasso,

    minimize_{β∈R^p} (1/2)‖Xβ − Y‖² + λ‖β‖_1,   with f(β) = (1/2)‖Xβ − Y‖² and g(β) = λ‖β‖_1

• Since g is non-differentiable, we cannot apply gradient methods
lecture4 42/54
Proximal gradient step

    minimize_{β∈R^p} f(β) + g(β)

Gradient step (suppose g(β) is absent):

    β^(k+1) = β^(k) − α_k ∇f(β^(k)),

which can be written equivalently as

    β^(k+1) = arg min_β { f(β^(k)) + ⟨∇f(β^(k)), β − β^(k)⟩ + (1/(2α_k))‖β − β^(k)‖² }.

Proximal gradient step:

    β^(k+1) = arg min_β { f(β^(k)) + ⟨∇f(β^(k)), β − β^(k)⟩ + g(β) + (1/(2α_k))‖β − β^(k)‖² }
lecture4 43/54
Proximal gradient step

Proximal gradient step:

    β^(k+1) = arg min_β { f(β^(k)) + ⟨∇f(β^(k)), β − β^(k)⟩ + g(β) + (1/(2α_k))‖β − β^(k)‖² }

After ignoring constant terms and completing the square, the above step can be written equivalently as

    β^(k+1) = arg min_β { (1/(2α_k))‖β − (β^(k) − α_k ∇f(β^(k)))‖² + g(β) }
            = P_{α_k g}(β^(k) − α_k ∇f(β^(k)))

Derivation* (completing the square):

    ⟨∇f(β^(k)), β⟩ + (1/(2α_k))‖β − β^(k)‖²
    = ⟨∇f(β^(k)) − (1/α_k)β^(k), β⟩ + (1/(2α_k))‖β‖² + constant
    = (1/α_k)⟨α_k ∇f(β^(k)) − β^(k), β⟩ + (1/(2α_k))‖β‖² + constant
    = (1/(2α_k))‖β − (β^(k) − α_k ∇f(β^(k)))‖² + constant
lecture4 44/54
Proximal gradient methods

Algorithm (Proximal gradient (PG) method)
Choose β^(0) and a constant step length α > 0. Set k ← 0.
repeat until convergence
    β^(k+1) = P_{αg}(β^(k) − α∇f(β^(k)))
    k ← k + 1
end (repeat)
return β^(k)

lecture4 45/54
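A minimal sketch of the PG loop, assuming NumPy, written generically with a gradient oracle grad_f and a proximal mapping prox_g supplied by the caller; the lasso instance at the bottom, the iteration cap, and the tolerance are illustrative choices rather than part of the lecture.

import numpy as np

def proximal_gradient(grad_f, prox_g, beta0, alpha, max_iter=1000, tol=1e-8):
    # Generic PG loop: beta <- prox_{alpha*g}(beta - alpha*grad_f(beta))
    beta = beta0.copy()
    for _ in range(max_iter):
        beta_new = prox_g(beta - alpha * grad_f(beta), alpha)
        if np.linalg.norm(beta_new - beta) < tol:   # simple convergence check
            return beta_new
        beta = beta_new
    return beta

# Example: minimize (1/2)||X beta - Y||^2 + lam*||beta||_1 (lasso)
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10)); Y = rng.standard_normal(50); lam = 0.1
grad_f = lambda b: X.T @ (X @ b - Y)
prox_g = lambda v, a: np.sign(v) * np.maximum(np.abs(v) - a * lam, 0.0)  # soft thresholding
alpha = 1.0 / np.linalg.norm(X.T @ X, 2)   # 1/L with L = largest eigenvalue of X^T X
beta_hat = proximal_gradient(grad_f, prox_g, np.zeros(10), alpha)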
Proximal gradient methods

(Informal) In convex problems (f and g are convex), the iteration complexity of the PG method is O(1/k):

    f(β^(k)) + g(β^(k)) − min_{β∈R^p} { f(β) + g(β) }  ≤  O(1/k),

where the minimum is the optimal value.

If we adopt the stopping condition

    f(β^(k)) + g(β^(k)) − optimal value ≤ 10⁻⁴,

we need O(10⁴) PG iterations.

lecture4 46/54
Nesterov’s accelerated method

Nesterov’s idea: include a momentum term for acceleration

    β̄^(k) = β^(k) + ((t_k − 1)/t_{k+1}) (β^(k) − β^(k−1))      [the second term is the momentum term]
    β^(k+1) = P_{αg}(β̄^(k) − α∇f(β̄^(k)))

{t_k} is a positive sequence such that t_0 = t_1 = 1 and t²_{k+1} − t_{k+1} ≤ t²_k

• e.g., t_k = 1 ∀ k. In this case the momentum term is 0, and the method reduces to PG without acceleration
• e.g., t_{k+1} = (1 + √(1 + 4t²_k))/2  ⇒  t_0 = 1, t_1 = 1, t_2 = (1 + √5)/2, . . .
• there are many other sequences satisfying the condition
• Next we simply take t_0 = t_1 = 1, t_{k+1} = (1 + √(1 + 4t²_k))/2

lecture4 47/54
Sequence t_k

Take t_0 = t_1 = 1, t_{k+1} = (1 + √(1 + 4t²_k))/2.

[Figure: left, the sequence t_k versus k (for k up to 80); right, values in [0, 1] versus k.]

lecture4 48/54
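A short sketch in plain Python that generates this sequence and the resulting momentum coefficient (t_k − 1)/t_{k+1}, which starts at 0 and increases toward 1.

import math

t = [1.0, 1.0]                       # t_0 = t_1 = 1
for k in range(1, 80):
    t.append((1.0 + math.sqrt(1.0 + 4.0 * t[k] ** 2)) / 2.0)

momentum = [(t[k] - 1.0) / t[k + 1] for k in range(len(t) - 1)]
print(t[:5])         # [1.0, 1.0, 1.618..., 2.19..., 2.75...]
print(momentum[:5])  # starts at 0 and increases toward 1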
Accelerated proximal gradient methods

Algorithm (Accelerated proximal gradient (APG) method)
Choose β^(0) and a constant step length α > 0. Set t_0 = t_1 = 1, k ← 0.
repeat until convergence
    β̄^(k) = β^(k) + ((t_k − 1)/t_{k+1}) (β^(k) − β^(k−1))
    β^(k+1) = P_{αg}(β̄^(k) − α∇f(β̄^(k)))
    k ← k + 1
    t_{k+1} = (1 + √(1 + 4t²_k))/2
end (repeat)
return β^(k)

The algorithmic framework follows [1]; it is built on Nesterov’s accelerated method from 1983 [2].
lecture4 49/54
Accelerated proximal gradient methods

(Informal) In convex problems (f and g are convex), the iteration complexity of the APG method is O(1/k²):

    f(β^(k)) + g(β^(k)) − min_{β∈R^p} { f(β) + g(β) }  ≤  O(1/k²),

where the minimum is the optimal value.

If we adopt the stopping condition

    f(β^(k)) + g(β^(k)) − optimal value ≤ 10⁻⁴,

we need O(10²) APG iterations.

lecture4 50/54
Accelerated proximal gradient methods

• Backtracking line search is also applicable for finding the step length α_k.
• For simplicity, we take a constant step length. It should satisfy α ∈ (0, 1/L], where L is the Lipschitz constant of ∇f(·) (typically unknown).
• APG methods enjoy the same computational cost per iteration as PG methods.
• Iteration complexity: APG O(1/k²); PG O(1/k).

lecture4 51/54
APG for lasso

    minimize_{β∈R^p} (1/2)‖Xβ − Y‖² + λ‖β‖_1   (lasso problem, with f(β) = (1/2)‖Xβ − Y‖² and g(β) = λ‖β‖_1)

Then ∇f(β) = Xᵀ(Xβ − Y) with Lipschitz constant L = λ_max(XᵀX).
Choose step length α = 1/L. APG iterations:

    β̄^(k) = β^(k) + ((t_k − 1)/t_{k+1}) (β^(k) − β^(k−1))
    β^(k+1) = S_{λ/L}(β̄^(k) − (1/L) Xᵀ(Xβ̄^(k) − Y))

APG is also applicable to “logistic regression + lasso regularization”.
lecture4 52/54
APG for lasso

Design a stopping criterion for the lasso problem

    minimize_{β∈R^p} (1/2)‖Xβ − Y‖² + λ‖β‖_1,   with f(β) = (1/2)‖Xβ − Y‖² and g(β) = λ‖β‖_1

We know that β is a global minimizer of the lasso problem if and only if

    0 ∈ ∇f(β) + ∂g(β),

which is equivalent to [check]

    β = P_g(β − ∇f(β)).

Therefore, we can choose a tolerance ε > 0 and stop the method at β^(k) once the stopping criterion

    ‖β^(k) − S_λ(β^(k) − Xᵀ(Xβ^(k) − Y))‖ < ε

is satisfied.

lecture4 53/54
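Putting the pieces together, here is a compact sketch of APG for lasso, assuming NumPy, that uses the update from the previous slide and this stopping criterion; the synthetic data, iteration cap, and tolerance ε are illustrative choices.

import numpy as np

def soft_threshold(v, tau):
    # S_tau(v) = sign(v) * max(|v| - tau, 0), applied componentwise
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def apg_lasso(X, Y, lam, max_iter=5000, eps=1e-6):
    L = np.linalg.eigvalsh(X.T @ X).max()        # Lipschitz constant of grad f
    p = X.shape[1]
    beta = beta_prev = np.zeros(p)
    t, t_next = 1.0, 1.0                          # t_0 = t_1 = 1
    for _ in range(max_iter):
        beta_bar = beta + ((t - 1.0) / t_next) * (beta - beta_prev)   # momentum step
        grad = X.T @ (X @ beta_bar - Y)
        beta_prev, beta = beta, soft_threshold(beta_bar - grad / L, lam / L)
        t, t_next = t_next, (1.0 + np.sqrt(1.0 + 4.0 * t_next**2)) / 2.0
        # stopping criterion: beta is (nearly) a fixed point of the prox-gradient map
        resid = beta - soft_threshold(beta - X.T @ (X @ beta - Y), lam)
        if np.linalg.norm(resid) < eps:
            break
    return beta

# illustrative synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20); beta_true[:3] = [2.0, -1.0, 0.5]
Y = X @ beta_true + 0.01 * rng.standard_normal(100)
print(apg_lasso(X, Y, lam=1.0))   # the estimate should be sparse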
Restart

• A strategy to speed up APG is to restart the algorithm after a fixed number of iterations,
• using the latest iterate as the starting point of the new round of APG iterations.
• A reasonable choice is to perform a restart every 100 or 200 iterations.

lecture4 54/54
References i

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[2] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
