
ELE 522: Large-Scale Optimization for Data Science

Subgradient methods

Yuxin Chen
Princeton University, Fall 2019
Outline

• Steepest descent

• Subgradients

• Projected subgradient descent


◦ Convex and Lipschitz problems
◦ Strongly convex and Lipschitz problems

• Convex-concave saddle point problems

Subgradient methods 4-2


Nondifferentiable problems

Differentiability of the objective function f is essential for the validity of gradient methods

However, there is no shortage of interesting cases (e.g. ℓ1 minimization, nuclear norm minimization) where non-differentiability is present at some points



Generalizing steepest descent?

$$\text{minimize}_{x}\ f(x) \quad \text{subject to} \quad x \in \mathcal{C}$$

• find a search direction $d^t$ that minimizes the directional derivative

$$d^t \in \arg\min_{d:\,\|d\|_2 \le 1} f'(x^t; d)$$

where $f'(x; d) := \lim_{\alpha \downarrow 0} \frac{f(x + \alpha d) - f(x)}{\alpha}$

• updates

$$x^{t+1} = x^t + \eta_t d^t$$


Issues

• Finding the steepest descent direction (or even finding a descent direction) may involve expensive computation

• Stepsize rules are tricky to choose: for certain popular stepsize rules (like exact line search), steepest descent might converge to non-optimal points


Wolfe’s example


$$f(x_1, x_2) = \begin{cases} 5\,(9x_1^2 + 16x_2^2)^{1/2}, & \text{if } x_1 > |x_2| \\ 9x_1 + 16|x_2|, & \text{if } x_1 \le |x_2| \end{cases}$$

• (0, 0) is a non-differentiable point

• if one starts from $x^0 = (16/9,\ 1)$ and uses exact line search, then
  ◦ $\{x^t\}$ are all differentiable points
  ◦ $x^t \to (0, 0)$ as $t \to \infty$


• even though it never hits non-differentiable points, steepest descent with exact line search gets stuck around a non-optimal point (i.e. (0, 0))

• problem: steepest descent directions may undergo large / discontinuous changes when close to convergence limits


(Projected) subgradient method

Practically, a popular choice is a "subgradient-based method"

$$x^{t+1} = \mathcal{P}_{\mathcal{C}}\big(x^t - \eta_t\, g^t\big) \tag{4.1}$$

where $g^t$ is any subgradient of $f$ at $x^t$

• the focus of this lecture

• caution: this update rule does not necessarily yield reduction w.r.t. the objective values
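As a concrete illustration of update rule (4.1), here is a minimal Python sketch of one projected subgradient step. The Euclidean-ball projection and the ℓ1 objective are illustrative choices of mine, not prescribed by the slides:

```python
import numpy as np

def project_ball(x, radius=1.0):
    # Euclidean projection onto {z : ||z||_2 <= radius},
    # a concrete stand-in for the abstract projection P_C
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else (radius / nrm) * x

def subgradient_step(x, subgrad, eta, project):
    # update rule (4.1): x^{t+1} = P_C(x^t - eta_t * g^t)
    return project(x - eta * subgrad(x))

# example: f(x) = ||x||_1, for which sign(x) is a subgradient
x = np.array([2.0, -1.0])
x_new = subgradient_step(x, np.sign, eta=0.5, project=project_ball)
```

Note that nothing here guarantees $f(x^{t+1}) < f(x^t)$; the caution above applies to this sketch as well.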


Subgradients

We say $g$ is a subgradient of $f$ at the point $x$ if

$$f(z) \ge f(x) + g^\top (z - x), \quad \forall z \tag{4.2}$$

where the right-hand side is a linear under-estimate of $f$

• the set of all subgradients of $f$ at $x$ is called the subdifferential of $f$ at $x$, denoted by $\partial f(x)$
Example: f(x) = |x|

$$\partial f(x) = \begin{cases} \{-1\}, & \text{if } x < 0 \\ [-1, 1], & \text{if } x = 0 \\ \{1\}, & \text{if } x > 0 \end{cases}$$
Example: a subgradient of norms at 0

Let $f(x) = \|x\|$ for any norm $\|\cdot\|$. Then for any $g$ obeying $\|g\|_* \le 1$,

$$g \in \partial f(0)$$

where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$ (i.e. $\|x\|_* := \sup_{z: \|z\| \le 1} \langle z, x \rangle$)

Proof: it suffices to prove that

$$f(z) \ge f(0) + \langle g, z - 0 \rangle\ \ \forall z \quad \Longleftrightarrow \quad \langle g, z \rangle \le \|z\|\ \ \forall z$$

This follows from the generalized Cauchy-Schwarz inequality: $\langle g, z \rangle \le \|g\|_* \|z\| \le \|z\|$
Example: max{f1(x), f2(x)}

$f(x) = \max\{f_1(x), f_2(x)\}$ where $f_1$ and $f_2$ are differentiable

$$\partial f(x) = \begin{cases} \{f_1'(x)\}, & \text{if } f_1(x) > f_2(x) \\ [f_1'(x), f_2'(x)], & \text{if } f_1(x) = f_2(x) \\ \{f_2'(x)\}, & \text{if } f_1(x) < f_2(x) \end{cases}$$
Basic rules

• scaling: ∂(αf ) = α∂f (for α > 0)

• summation: ∂(f1 + f2 ) = ∂f1 + ∂f2



Example: the ℓ1 norm

$$f(x) = \|x\|_1 = \sum_{i=1}^{n} \underbrace{|x_i|}_{=: f_i(x)}$$

since

$$\partial f_i(x) = \begin{cases} \{\mathrm{sgn}(x_i)\, e_i\}, & \text{if } x_i \ne 0 \\ [-1, 1] \cdot e_i, & \text{if } x_i = 0 \end{cases}$$

we have

$$\sum_{i:\, x_i \ne 0} \mathrm{sgn}(x_i)\, e_i \in \partial f(x)$$
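This ℓ1 subdifferential claim is easy to sanity-check numerically; a small sketch (note that `np.sign` silently picks 0 from the interval [−1, 1] at zero coordinates):

```python
import numpy as np

def l1_subgradient(x):
    # sum_{i: x_i != 0} sgn(x_i) e_i, choosing 0 at coordinates where x_i = 0
    return np.sign(x)

# verify the subgradient inequality f(z) >= f(x) + <g, z - x> at random z
rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.0])
g = l1_subgradient(x)
ok = all(
    np.abs(z).sum() >= np.abs(x).sum() + g @ (z - x) - 1e-12
    for z in rng.standard_normal((100, 3))
)
```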


Basic rules (cont.)

• affine transformation: if $h(x) = f(Ax + b)$, then

$$\partial h(x) = A^\top \partial f(Ax + b)$$


Example: ‖Ax + b‖1

$$h(x) = \|Ax + b\|_1$$

letting $f(x) = \|x\|_1$ and $A = [a_1, \cdots, a_m]^\top$, we have

$$g = \sum_{i:\, a_i^\top x + b_i \ne 0} \mathrm{sgn}\big(a_i^\top x + b_i\big)\, e_i \in \partial f(Ax + b)$$

$$\Longrightarrow \quad A^\top g = \sum_{i:\, a_i^\top x + b_i \ne 0} \mathrm{sgn}\big(a_i^\top x + b_i\big)\, a_i \in \partial h(x)$$


Basic rules (cont.)

• chain rule: suppose $f$ is convex, and $g$ is differentiable, nondecreasing, and convex. Let $h = g \circ f$; then

$$\partial h(x) = g'(f(x))\, \partial f(x)$$

• composition: suppose $f(x) = h(f_1(x), \cdots, f_n(x))$, where the $f_i$'s are convex, and $h$ is differentiable, nondecreasing, and convex. Let $q = \nabla h(y)\big|_{y = [f_1(x), \cdots, f_n(x)]}$ and $g_i \in \partial f_i(x)$. Then

$$q_1 g_1 + \cdots + q_n g_n \in \partial f(x)$$


Basic rules (cont.)

• pointwise maximum: if $f(x) = \max_{1 \le i \le k} f_i(x)$, then

$$\partial f(x) = \mathrm{conv}\Big(\bigcup \big\{\partial f_i(x) \,\big|\, f_i(x) = f(x)\big\}\Big)$$

i.e. the convex hull of the subdifferentials of all active functions

• pointwise supremum: if $f(x) = \sup_{\alpha \in \mathcal{F}} f_\alpha(x)$, then

$$\partial f(x) = \mathrm{closure}\Big(\mathrm{conv}\Big(\bigcup \big\{\partial f_\alpha(x) \,\big|\, f_\alpha(x) = f(x)\big\}\Big)\Big)$$


Example: piece-wise linear functions

$$f(x) = \max_{1 \le i \le m}\ \big(a_i^\top x + b_i\big)$$

pick any $a_j$ s.t. $a_j^\top x + b_j = \max_i\, a_i^\top x + b_i$; then

$$a_j \in \partial f(x)$$
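For piecewise linear functions this rule is straightforward to implement: evaluate all affine pieces, pick an active one, and return its slope. A small sketch (the matrix, offsets, and evaluation point are made up for illustration):

```python
import numpy as np

def pwl_subgradient(A, b, x):
    # f(x) = max_i a_i^T x + b_i ; any active row a_j is a subgradient
    j = int(np.argmax(A @ x + b))
    return A[j]

A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([0.0, 0.5, 0.0])
x = np.array([2.0, 1.0])
g = pwl_subgradient(A, b, x)  # slope of the piece achieving the max at x
```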


Example: the ℓ∞ norm

$$f(x) = \|x\|_\infty = \max_{1 \le i \le n} |x_i|$$

if $x \ne 0$, pick any index $j$ obeying $|x_j| = \max_i |x_i|$ to obtain

$$\mathrm{sgn}(x_j)\, e_j \in \partial f(x)$$


Example: the maximum eigenvalue

$$f(x) = \lambda_{\max}\big(x_1 A_1 + \cdots + x_n A_n\big)$$

where $A_1, \cdots, A_n$ are real symmetric matrices

Rewrite

$$f(x) = \sup_{y:\, \|y\|_2 = 1}\ y^\top \big(x_1 A_1 + \cdots + x_n A_n\big)\, y$$

as the supremum of affine functions of $x$. Therefore, taking $y$ as the leading eigenvector of $x_1 A_1 + \cdots + x_n A_n$, we have

$$\big[y^\top A_1 y, \cdots, y^\top A_n y\big]^\top \in \partial f(x)$$
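This recipe translates directly into code: form the matrix, take a leading eigenvector, and return the vector of quadratic forms. A sketch using `numpy.linalg.eigh` (the example matrices are my own):

```python
import numpy as np

def lambda_max_subgradient(mats, x):
    # f(x) = lambda_max(sum_i x_i A_i); with y a leading (unit) eigenvector,
    # the vector (y^T A_1 y, ..., y^T A_n y) lies in the subdifferential
    M = sum(xi * A for xi, A in zip(x, mats))
    w, V = np.linalg.eigh(M)   # eigenvalues in ascending order
    y = V[:, -1]               # leading eigenvector
    return np.array([y @ A @ y for A in mats]), w[-1]

A1 = np.array([[1.0, 0.0], [0.0, -1.0]])
A2 = np.array([[0.0, 1.0], [1.0, 0.0]])
x = np.array([1.0, 0.0])
g, fval = lambda_max_subgradient([A1, A2], x)
```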


Example: the nuclear norm

Let $X \in \mathbb{R}^{m \times n}$ with SVD $X = U \Sigma V^\top$ and

$$f(X) = \sum_{i=1}^{\min\{m, n\}} \sigma_i(X)$$

where $\sigma_i(X)$ is the $i$-th largest singular value of $X$

Rewrite

$$f(X) = \sup_{\text{orthonormal } A, B} \big\langle AB^\top, X \big\rangle =: \sup_{\text{orthonormal } A, B} f_{A,B}(X)$$

Recognizing that $f_{A,B}(X)$ is maximized by $A = U$ and $B = V$, and that $\nabla f_{A,B}(X) = AB^\top$, we have

$$U V^\top \in \partial f(X)$$
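One can check numerically that $UV^\top$ from the thin SVD satisfies the subgradient inequality for the nuclear norm; a sketch with random matrices of my choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3))

# U V^T from the thin SVD is a subgradient of the nuclear norm at X
U, s, Vt = np.linalg.svd(X, full_matrices=False)
G = U @ Vt
nuc = s.sum()  # nuclear norm ||X||_* = sum of singular values

# check the subgradient inequality ||Z||_* >= ||X||_* + <G, Z - X>
Z = rng.standard_normal((4, 3))
lhs = np.linalg.svd(Z, compute_uv=False).sum()
rhs = nuc + np.sum(G * (Z - X))
```

Here $\langle G, X\rangle = \|X\|_*$ exactly, so the inequality reduces to the dual-norm bound $\langle G, Z\rangle \le \|Z\|_*$.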


Negative subgradients are not necessarily descent directions

Example: f(x) = |x1| + 3|x2|

At x = (1, 0):

• $g_1 = (1, 0) \in \partial f(x)$, and $-g_1$ is a descent direction

• $g_2 = (1, 3) \in \partial f(x)$, but $-g_2$ is not a descent direction

Reason: lack of continuity; one can change directions significantly without violating the validity of subgradients
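The claim is easy to verify numerically by stepping a small amount along each negative subgradient:

```python
# f(x) = |x1| + 3|x2|; at x = (1, 0) both (1, 0) and (1, 3) are subgradients,
# but only the negative of the first is a descent direction
f = lambda x1, x2: abs(x1) + 3 * abs(x2)

x = (1.0, 0.0)
t = 1e-3  # small step along each negative subgradient

along_g1 = f(x[0] - t * 1.0, x[1] - t * 0.0)  # step along -g1 = -(1, 0)
along_g2 = f(x[0] - t * 1.0, x[1] - t * 3.0)  # step along -g2 = -(1, 3)
```

Stepping along $-g_1$ decreases $f$ below $f(1,0) = 1$, while stepping along $-g_2$ increases it.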
Since $f(x^t)$ is not necessarily monotone, we will keep track of the best point so far

$$f^{\mathrm{best},t} := \min_{1 \le i \le t} f(x^i)$$

We also denote by $f^{\mathrm{opt}} := \min_x f(x)$ the optimal objective value


Convex and Lipschitz problems

Clearly, we cannot analyze all nonsmooth functions. A nice (and widely encountered) class to start with is Lipschitz functions, i.e. the set of all $f$ obeying

$$|f(x) - f(z)| \le L_f\, \|x - z\|_2 \qquad \forall\, x, z$$


Fundamental inequality for projected subgradient methods

We'd like to optimize $\|x^{t+1} - x^*\|_2^2$, but we don't have access to $x^*$

Key idea (majorization-minimization): find another function that majorizes $\|x^{t+1} - x^*\|_2^2$, and optimize the majorizing function

Lemma 4.1
The projected subgradient update rule (4.1) obeys

$$\|x^{t+1} - x^*\|_2^2 \le \underbrace{\|x^t - x^*\|_2^2}_{\text{fixed}} - 2\eta_t \big(f(x^t) - f^{\mathrm{opt}}\big) + \eta_t^2\, \|g^t\|_2^2 \tag{4.3}$$

where the right-hand side serves as the majorizing function


Proof of Lemma 4.1

$$\begin{aligned}
\|x^{t+1} - x^*\|_2^2 &= \big\|\mathcal{P}_{\mathcal{C}}\big(x^t - \eta_t g^t\big) - \mathcal{P}_{\mathcal{C}}(x^*)\big\|_2^2 \\
&\le \|x^t - \eta_t g^t - x^*\|_2^2 \qquad (\text{nonexpansiveness of projection}) \\
&= \|x^t - x^*\|_2^2 - 2\eta_t \langle x^t - x^*,\, g^t \rangle + \eta_t^2 \|g^t\|_2^2 \\
&\le \|x^t - x^*\|_2^2 - 2\eta_t \big(f(x^t) - f(x^*)\big) + \eta_t^2 \|g^t\|_2^2
\end{aligned}$$

where the last line uses the subgradient inequality $f(x^*) - f(x^t) \ge \langle x^* - x^t,\, g^t \rangle$


Polyak's stepsize rule

The majorizing function in (4.3) suggests a stepsize (Polyak '87)

$$\eta_t = \frac{f(x^t) - f^{\mathrm{opt}}}{\|g^t\|_2^2} \tag{4.4}$$

which leads to the error reduction

$$\|x^{t+1} - x^*\|_2^2 \le \|x^t - x^*\|_2^2 - \frac{\big(f(x^t) - f(x^*)\big)^2}{\|g^t\|_2^2} \tag{4.5}$$

• useful if $f^{\mathrm{opt}}$ is known

• the estimation error is monotonically decreasing with Polyak's stepsize
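A minimal sketch of the (unconstrained) subgradient method with Polyak's stepsize (4.4) on $f(x) = \|x\|_1$, where $f^{\mathrm{opt}} = 0$ is known; the problem choice is mine, and $\mathcal{P}_{\mathcal{C}}$ is the identity here:

```python
import numpy as np

# minimize f(x) = ||x||_1 with Polyak's stepsize; f_opt = 0 is known
f = lambda x: np.abs(x).sum()
x = np.array([3.0, -2.0, 1.0])
errors = [np.linalg.norm(x)]      # distance to the minimizer x* = 0
for _ in range(200):
    g = np.sign(x)                # a subgradient of ||.||_1 at x
    if f(x) == 0:
        break
    eta = f(x) / (g @ g)          # (f(x^t) - f_opt) / ||g^t||_2^2
    x = x - eta * g
    errors.append(np.linalg.norm(x))
```

Consistent with (4.5), the recorded distances to $x^*$ decrease monotonically.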


Example: projection onto intersection of convex sets

[figure: two intersecting convex sets]

Let $C_1, C_2$ be closed convex sets and suppose $C_1 \cap C_2 \ne \emptyset$

find $x \in C_1 \cap C_2$

⇕

$$\text{minimize}_x\ \max\big\{\mathrm{dist}_{C_1}(x),\ \mathrm{dist}_{C_2}(x)\big\}$$

where $\mathrm{dist}_C(x) := \min_{z \in C} \|x - z\|_2$
Example: projection onto intersection of convex sets

For this problem, the subgradient method with Polyak's stepsize rule is equivalent to alternating projection:

$$x^{t+1} = \mathcal{P}_{C_1}(x^t), \qquad x^{t+2} = \mathcal{P}_{C_2}(x^{t+1})$$
Example: projection onto intersection of convex sets

Proof: Use the subgradient rule for pointwise max functions to get

$$g^t \in \partial\, \mathrm{dist}_{C_i}(x^t), \qquad \text{where } i = \arg\max_{j=1,2} \mathrm{dist}_{C_j}(x^t)$$

If $\mathrm{dist}_{C_i}(x^t) \ne 0$, then one has

$$g^t = \nabla\, \mathrm{dist}_{C_i}(x^t) = \frac{x^t - \mathcal{P}_{C_i}(x^t)}{\mathrm{dist}_{C_i}(x^t)}$$

which follows since $\nabla\big(\tfrac{1}{2}\mathrm{dist}_{C_i}^2(x^t)\big) = x^t - \mathcal{P}_{C_i}(x^t)$ (homework) and $\nabla\big(\tfrac{1}{2}\mathrm{dist}_{C_i}^2(x^t)\big) = \mathrm{dist}_{C_i}(x^t) \cdot \nabla\, \mathrm{dist}_{C_i}(x^t)$


Example: projection onto intersection of convex sets

Proof (cont.): Adopting Polyak's stepsize rule and recognizing that $\|g^t\|_2 = 1$, we arrive at

$$x^{t+1} = x^t - \eta_t g^t = x^t - \underbrace{\frac{\mathrm{dist}_{C_i}(x^t)}{\|g^t\|_2^2}}_{=\eta_t} \cdot \frac{x^t - \mathcal{P}_{C_i}(x^t)}{\mathrm{dist}_{C_i}(x^t)} = \mathcal{P}_{C_i}(x^t)$$

where $i = \arg\max_{j=1,2} \mathrm{dist}_{C_j}(x^t)$
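The equivalence can be observed directly by running alternating projections, which land in the intersection. A sketch with a unit ball and a halfspace (the two sets are chosen by me for illustration):

```python
import numpy as np

# C1 = unit Euclidean ball, C2 = halfspace {x : x[0] >= 0.5};
# the subgradient method with Polyak's stepsize reduces to
# alternating projections between the two sets
def proj_ball(x):
    n = np.linalg.norm(x)
    return x if n <= 1 else x / n

def proj_halfspace(x):
    y = x.copy()
    y[0] = max(y[0], 0.5)
    return y

x = np.array([-2.0, 2.0])
for _ in range(100):
    x = proj_halfspace(proj_ball(x))
```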


Convergence rate with Polyak’s stepsize

Theorem 4.2 (Convergence of projected subgradient method with Polyak's stepsize)

Suppose $f$ is convex and $L_f$-Lipschitz continuous. Then the projected subgradient method (4.1) with Polyak's stepsize rule obeys

$$f^{\mathrm{best},t} - f^{\mathrm{opt}} \le \frac{L_f\, \|x^0 - x^*\|_2}{\sqrt{t+1}}$$

• sublinear convergence rate $O(1/\sqrt{t})$


Proof of Theorem 4.2

We have seen from (4.5) that

$$\big(f(x^t) - f(x^*)\big)^2 \le \Big\{\|x^t - x^*\|_2^2 - \|x^{t+1} - x^*\|_2^2\Big\}\, \|g^t\|_2^2 \le \Big\{\|x^t - x^*\|_2^2 - \|x^{t+1} - x^*\|_2^2\Big\}\, L_f^2$$

Applying this recursively for all iterations (from the 0th to the $t$th) and summing up yields

$$\sum_{k=0}^{t} \big(f(x^k) - f(x^*)\big)^2 \le \Big\{\|x^0 - x^*\|_2^2 - \|x^{t+1} - x^*\|_2^2\Big\}\, L_f^2$$

$$\Longrightarrow \quad (t+1)\, \big(f^{\mathrm{best},t} - f^{\mathrm{opt}}\big)^2 \le \|x^0 - x^*\|_2^2\, L_f^2$$

which concludes the proof
Other stepsize choices?

Unfortunately, Polyak's stepsize rule requires knowledge of $f^{\mathrm{opt}}$, which is often unknown a priori

We therefore often need simpler rules for setting stepsizes


Convex and Lipschitz problems

Theorem 4.3 (Subgradient methods for convex and Lipschitz functions)

Suppose $f$ is convex and $L_f$-Lipschitz continuous. Then the projected subgradient update rule (4.1) obeys

$$f^{\mathrm{best},t} - f^{\mathrm{opt}} \le \frac{\|x^0 - x^*\|_2^2 + L_f^2 \sum_{i=0}^{t} \eta_i^2}{2 \sum_{i=0}^{t} \eta_i}$$


Implications: stepsize rules

• Constant stepsize $\eta_t \equiv \eta$:

$$\lim_{t \to \infty} f^{\mathrm{best},t} - f^{\mathrm{opt}} \le \frac{L_f^2\, \eta}{2}$$

i.e. may converge to non-optimal points

• Diminishing stepsize obeying $\sum_t \eta_t^2 < \infty$ and $\sum_t \eta_t \to \infty$:

$$\lim_{t \to \infty} f^{\mathrm{best},t} = f^{\mathrm{opt}}$$

i.e. converges to optimal points


Implications: stepsize rules

• Optimal choice? $\eta_t = \frac{1}{\sqrt{t}}$:

$$f^{\mathrm{best},t} - f^{\mathrm{opt}} \lesssim \frac{\|x^0 - x^*\|_2^2 + L_f^2 \log t}{\sqrt{t}}$$

i.e. attains $\varepsilon$-accuracy within about $O(1/\varepsilon^2)$ iterations (ignoring the log factor)
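A sketch of the method with the diminishing stepsize $\eta_t = 1/\sqrt{t+1}$ on $f(x) = \|x\|_1$ (the problem is my illustrative choice; no knowledge of $f^{\mathrm{opt}}$ is used). The running best value decays even though individual iterates keep oscillating around the minimizer:

```python
import numpy as np

# subgradient method with eta_t = 1/sqrt(t+1) on f(x) = ||x||_1,
# tracking the running best objective value f^{best,t}
f = lambda x: np.abs(x).sum()
x = np.array([5.0, -3.0])
best = f(x)
for t in range(5000):
    x = x - np.sign(x) / np.sqrt(t + 1)   # step along a negative subgradient
    best = min(best, f(x))
```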


Proof of Theorem 4.3

Applying Lemma 4.1 recursively gives

$$\|x^{t+1} - x^*\|_2^2 \le \|x^0 - x^*\|_2^2 - 2 \sum_{i=0}^{t} \eta_i \big(f(x^i) - f^{\mathrm{opt}}\big) + \sum_{i=0}^{t} \eta_i^2\, \|g^i\|_2^2$$

Rearranging terms, we are left with

$$2 \sum_{i=0}^{t} \eta_i \big(f(x^i) - f^{\mathrm{opt}}\big) \le \|x^0 - x^*\|_2^2 - \|x^{t+1} - x^*\|_2^2 + \sum_{i=0}^{t} \eta_i^2 \|g^i\|_2^2 \le \|x^0 - x^*\|_2^2 + L_f^2 \sum_{i=0}^{t} \eta_i^2$$

$$\Longrightarrow \quad f^{\mathrm{best},t} - f^{\mathrm{opt}} \le \frac{\sum_{i=0}^{t} \eta_i \big(f(x^i) - f^{\mathrm{opt}}\big)}{\sum_{i=0}^{t} \eta_i} \le \frac{\|x^0 - x^*\|_2^2 + L_f^2 \sum_{i=0}^{t} \eta_i^2}{2 \sum_{i=0}^{t} \eta_i}$$


Strongly convex and Lipschitz problems

If $f$ is strongly convex, then the convergence guarantee can be improved to $O(1/t)$, as long as the stepsize diminishes at rate $O(1/t)$

Theorem 4.4 (Subgradient methods for strongly convex and Lipschitz functions)

Let $f$ be $\mu$-strongly convex and $L_f$-Lipschitz continuous over $\mathcal{C}$. If $\eta_t = \frac{2}{\mu(t+1)}$, then

$$f^{\mathrm{best},t} - f^{\mathrm{opt}} \le \frac{2 L_f^2}{\mu} \cdot \frac{1}{t+1}$$


Proof of Theorem 4.4

When $f$ is $\mu$-strongly convex, Lemma 4.1 can be improved to (exercise)

$$\|x^{t+1} - x^*\|_2^2 \le (1 - \mu \eta_t)\|x^t - x^*\|_2^2 - 2\eta_t \big(f(x^t) - f^{\mathrm{opt}}\big) + \eta_t^2 \|g^t\|_2^2$$

$$\Longrightarrow \quad f(x^t) - f^{\mathrm{opt}} \le \frac{1 - \mu \eta_t}{2 \eta_t}\|x^t - x^*\|_2^2 - \frac{1}{2 \eta_t}\|x^{t+1} - x^*\|_2^2 + \frac{\eta_t}{2}\|g^t\|_2^2$$

Since $\eta_t = 2/(\mu(t+1))$, we have

$$f(x^t) - f^{\mathrm{opt}} \le \frac{\mu(t-1)}{4}\|x^t - x^*\|_2^2 - \frac{\mu(t+1)}{4}\|x^{t+1} - x^*\|_2^2 + \frac{1}{\mu(t+1)}\|g^t\|_2^2$$

and hence

$$t \big(f(x^t) - f^{\mathrm{opt}}\big) \le \frac{\mu t(t-1)}{4}\|x^t - x^*\|_2^2 - \frac{\mu t(t+1)}{4}\|x^{t+1} - x^*\|_2^2 + \frac{1}{\mu}\|g^t\|_2^2$$


Proof of Theorem 4.4 (cont.)

Summing over all iterations up to $t$ (the $k = 0$ term on the left vanishes), we get

$$\sum_{k=0}^{t} k \big(f(x^k) - f^{\mathrm{opt}}\big) \le -\frac{\mu t(t+1)}{4}\|x^{t+1} - x^*\|_2^2 + \frac{1}{\mu} \sum_{k=1}^{t} \|g^k\|_2^2 \le \frac{t}{\mu} L_f^2$$

$$\Longrightarrow \quad f^{\mathrm{best},t} - f^{\mathrm{opt}} \le \frac{t L_f^2 / \mu}{\sum_{k=1}^{t} k} = \frac{2 L_f^2}{\mu} \cdot \frac{1}{t+1}$$


Summary: subgradient methods

                                          stepsize rule    convergence rate    iteration complexity
  convex & Lipschitz problems             η_t ≍ 1/√t       O(1/√t)             O(1/ε²)
  strongly convex & Lipschitz problems    η_t ≍ 1/t        O(1/t)              O(1/ε)


Convex-concave saddle point problems

$$\text{minimize}_{x \in \mathcal{X}}\ \max_{y \in \mathcal{Y}}\ f(x, y)$$

• $f(x, y)$: convex in $x$ and concave in $y$

• $\mathcal{X}, \mathcal{Y}$: bounded closed convex sets

• arises in game theory, robust optimization, generative adversarial networks (GANs), ...

• under mild conditions, it is equivalent to its dual formulation

$$\text{maximize}_{y \in \mathcal{Y}}\ \min_{x \in \mathcal{X}}\ f(x, y)$$


Saddle points

[figure: surface plot of a convex-concave function and its saddle point]

The optimal point $(x^*, y^*)$ obeys

$$f(x^*, y) \le f(x^*, y^*) \le f(x, y^*), \qquad \forall x \in \mathcal{X},\ y \in \mathcal{Y}$$
t t t t
Projected subgradient method

A natural strategy is to apply the subgradient-based approach

$$\begin{bmatrix} x^{t+1} \\ y^{t+1} \end{bmatrix} = \mathcal{P}_{\mathcal{X} \times \mathcal{Y}}\left(\begin{bmatrix} x^t \\ y^t \end{bmatrix} - \eta_t \begin{bmatrix} g_x^t \\ -g_y^t \end{bmatrix}\right) \tag{4.6}$$

i.e. a projection of (subgradient descent on $x^t$, subgradient ascent on $y^t$), where $g_x^t \in \partial_x f(x^t, y^t)$ and $-g_y^t \in \partial_y \big(-f(x^t, y^t)\big)$


Performance metric

One way to measure the quality of a solution is via the following error metric (think of it as a certain "duality gap")

$$\varepsilon(x, y) := \Big(\max_{\tilde{y} \in \mathcal{Y}} f(x, \tilde{y}) - f^{\mathrm{opt}}\Big) + \Big(f^{\mathrm{opt}} - \min_{\tilde{x} \in \mathcal{X}} f(\tilde{x}, y)\Big) = \max_{\tilde{y} \in \mathcal{Y}} f(x, \tilde{y}) - \min_{\tilde{x} \in \mathcal{X}} f(\tilde{x}, y)$$

where $f^{\mathrm{opt}} := f(x^*, y^*)$ with $(x^*, y^*)$ the optimal solution


Convex-concave and Lipschitz problems

Theorem 4.5 (Subgradient methods for saddle point problems)

Suppose $f$ is convex in $x$ and concave in $y$, and is $L_f$-Lipschitz continuous over $\mathcal{X} \times \mathcal{Y}$. Let $D_{\mathcal{X}}$ (resp. $D_{\mathcal{Y}}$) be the diameter of $\mathcal{X}$ (resp. $\mathcal{Y}$). Then the projected subgradient method (4.6) obeys

$$\varepsilon(\hat{x}^t, \hat{y}^t) \le \frac{D_{\mathcal{X}}^2 + D_{\mathcal{Y}}^2 + L_f^2 \sum_{\tau=0}^{t} \eta_\tau^2}{2 \sum_{\tau=0}^{t} \eta_\tau}$$

where $\hat{x}^t = \frac{\sum_{\tau=0}^{t} \eta_\tau x^\tau}{\sum_{\tau=0}^{t} \eta_\tau}$ and $\hat{y}^t = \frac{\sum_{\tau=0}^{t} \eta_\tau y^\tau}{\sum_{\tau=0}^{t} \eta_\tau}$

• similar to our theory for convex problems

• suggests a varying stepsize $\eta_t \asymp 1/\sqrt{t}$


Iterate averaging

Notably, it is crucial to output the weighted averages $(\hat{x}^t, \hat{y}^t)$ of the iterates of the subgradient method; the original iterates $(x^t, y^t)$ might not converge

Example (bilinear game): $f(x, y) = xy$

• When $\eta_t \to 0$ (continuous limit), $(x^t, y^t)$ exhibits cycling behavior around $(x^*, y^*) = (0, 0)$ without converging to it
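The bilinear game makes this concrete: with $f(x, y) = xy$ on $[-1, 1]^2$ (the domain is my illustrative choice), the raw descent-ascent iterates keep cycling away from $(0, 0)$, while the weighted averages from Theorem 4.5 approach the saddle point. A sketch:

```python
import numpy as np

# f(x, y) = x * y on X = Y = [-1, 1]; saddle point at (0, 0).
# Raw gradient descent-ascent cycles, the weighted averages converge.
clip = lambda v: np.clip(v, -1.0, 1.0)   # projection onto [-1, 1]
x, y = 0.5, 0.5
sum_eta = sum_x = sum_y = 0.0
for t in range(20000):
    eta = 1.0 / np.sqrt(t + 1)
    gx, gy = y, x                        # grad_x f = y, grad_y f = x
    x, y = clip(x - eta * gx), clip(y + eta * gy)  # simultaneous update
    sum_eta += eta
    sum_x += eta * x
    sum_y += eta * y
x_avg, y_avg = sum_x / sum_eta, sum_y / sum_eta
```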


Proof of Theorem 4.5

By the convexity-concavity of $f$,

$$f(x^t, y^t) - f(x, y^t) \le \langle g_x^t,\, x^t - x \rangle, \qquad x \in \mathcal{X}$$
$$f(x^t, y) - f(x^t, y^t) \le \langle g_y^t,\, y - y^t \rangle, \qquad y \in \mathcal{Y}$$

Adding these two inequalities yields

$$f(x^t, y) - f(x, y^t) \le \langle g_x^t,\, x^t - x \rangle - \langle g_y^t,\, y^t - y \rangle, \qquad x \in \mathcal{X},\ y \in \mathcal{Y}$$

Therefore, invoking the convexity-concavity of $f$ once again gives

$$\varepsilon(\hat{x}^t, \hat{y}^t) = \max_{y \in \mathcal{Y}} f(\hat{x}^t, y) - \min_{x \in \mathcal{X}} f(x, \hat{y}^t)$$
$$\le \frac{1}{\sum_{\tau=0}^{t} \eta_\tau}\left\{\max_{y \in \mathcal{Y}} \sum_{\tau=0}^{t} \eta_\tau f(x^\tau, y) - \min_{x \in \mathcal{X}} \sum_{\tau=0}^{t} \eta_\tau f(x, y^\tau)\right\}$$
$$\le \frac{1}{\sum_{\tau=0}^{t} \eta_\tau}\ \max_{x \in \mathcal{X},\, y \in \mathcal{Y}} \sum_{\tau=0}^{t} \eta_\tau \Big(\langle g_x^\tau,\, x^\tau - x \rangle - \langle g_y^\tau,\, y^\tau - y \rangle\Big) \tag{4.7}$$


Proof of Theorem 4.5 (cont.)

It then suffices to control the RHS of (4.7), as follows:

Lemma 4.6

$$\max_{x \in \mathcal{X},\, y \in \mathcal{Y}} \sum_{\tau=0}^{t} \eta_\tau \Big(\langle g_x^\tau,\, x^\tau - x \rangle - \langle g_y^\tau,\, y^\tau - y \rangle\Big) \le \frac{D_{\mathcal{X}}^2 + D_{\mathcal{Y}}^2 + L_f^2 \sum_{\tau=0}^{t} \eta_\tau^2}{2}$$

This lemma together with (4.7) immediately establishes Theorem 4.5


Proof of Lemma 4.6

For any $x \in \mathcal{X}$ we have

$$\begin{aligned}
\|x^{\tau+1} - x\|_2^2 &= \|\mathcal{P}_{\mathcal{X}}(x^\tau - \eta_\tau g_x^\tau) - \mathcal{P}_{\mathcal{X}}(x)\|_2^2 \\
&\le \|x^\tau - \eta_\tau g_x^\tau - x\|_2^2 \qquad (\text{nonexpansiveness of projection onto convex } \mathcal{X}) \\
&= \|x^\tau - x\|_2^2 - 2\eta_\tau \langle x^\tau - x,\, g_x^\tau \rangle + \eta_\tau^2 \|g_x^\tau\|_2^2
\end{aligned}$$

$$\Longrightarrow \quad 2\eta_\tau \langle x^\tau - x,\, g_x^\tau \rangle \le \|x^\tau - x\|_2^2 - \|x^{\tau+1} - x\|_2^2 + \eta_\tau^2 \|g_x^\tau\|_2^2$$

Similarly, for any $y \in \mathcal{Y}$ one has

$$-2\eta_\tau \langle y^\tau - y,\, g_y^\tau \rangle \le \|y^\tau - y\|_2^2 - \|y^{\tau+1} - y\|_2^2 + \eta_\tau^2 \|g_y^\tau\|_2^2$$

Combining these two inequalities and using Lipschitz continuity yields

$$2\eta_\tau \langle g_x^\tau,\, x^\tau - x \rangle - 2\eta_\tau \langle g_y^\tau,\, y^\tau - y \rangle \le \|x^\tau - x\|_2^2 + \|y^\tau - y\|_2^2 - \|x^{\tau+1} - x\|_2^2 - \|y^{\tau+1} - y\|_2^2 + \eta_\tau^2 L_f^2$$


Proof of Lemma 4.6 (cont.)

Summing up these inequalities over $\tau = 0, \cdots, t$ gives

$$2 \sum_{\tau=0}^{t} \Big(\eta_\tau \langle g_x^\tau,\, x^\tau - x \rangle - \eta_\tau \langle g_y^\tau,\, y^\tau - y \rangle\Big) \le \|x^0 - x\|_2^2 + \|y^0 - y\|_2^2 - \|x^{t+1} - x\|_2^2 - \|y^{t+1} - y\|_2^2 + L_f^2 \sum_{\tau=0}^{t} \eta_\tau^2$$
$$\le \|x^0 - x\|_2^2 + \|y^0 - y\|_2^2 + L_f^2 \sum_{\tau=0}^{t} \eta_\tau^2 \le D_{\mathcal{X}}^2 + D_{\mathcal{Y}}^2 + L_f^2 \sum_{\tau=0}^{t} \eta_\tau^2$$

as claimed

Remark: this lemma does NOT rely on the convexity-concavity of $f(\cdot, \cdot)$


Reference

[1] "Convex optimization, EE364B lecture notes," S. Boyd, Stanford.
[2] "Convex optimization and algorithms," D. Bertsekas, 2015.
[3] "First-order methods in optimization," A. Beck, Vol. 25, SIAM, 2017.
[4] "Convex optimization: algorithms and complexity," S. Bubeck, Foundations and Trends in Machine Learning, 2015.
[5] "Optimization methods for large-scale systems, EE236C lecture notes," L. Vandenberghe, UCLA.
[6] "Introduction to optimization," B. Polyak, Optimization Software, 1987.
[7] "Robust stochastic approximation approach to stochastic programming," A. Nemirovski, A. Juditsky, G. Lan, A. Shapiro, SIAM Journal on Optimization, 2009.