
ELE 522: Large-Scale Optimization for Data Science

Subgradient methods

Yuxin Chen
Princeton University, Fall 2019
Outline

• Steepest descent

• Subgradients

• Projected subgradient descent


◦ Convex and Lipschitz problems
◦ Strongly convex and Lipschitz problems

• Convex-concave saddle point problems

Subgradient methods 4-2


Nondifferentiable problems

Differentiability of the objective function f is essential for the validity of gradient methods

However, there is no shortage of interesting cases (e.g. ℓ1 minimization, nuclear norm minimization) where non-differentiability is present at some points



Generalizing steepest descent?

$$\text{minimize}_{x}\ f(x) \quad \text{subject to} \quad x \in \mathcal{C}$$

• find a search direction $d^t$ that minimizes the directional derivative

$$d^t \in \arg\min_{d:\,\|d\|_2 \le 1} f'(x^t; d)$$

where $f'(x; d) := \lim_{\alpha \downarrow 0} \frac{f(x + \alpha d) - f(x)}{\alpha}$

• updates

$$x^{t+1} = x^t + \eta_t d^t$$


Issues

• Finding the steepest descent direction (or even finding a descent direction) may involve expensive computation

• Stepsize rules are tricky to choose: for certain popular stepsize rules (like exact line search), steepest descent might converge to non-optimal points


Wolfe’s example


$$f(x_1, x_2) = \begin{cases} 5\,(9x_1^2 + 16x_2^2)^{1/2}, & \text{if } x_1 > |x_2| \\ 9x_1 + 16|x_2|, & \text{if } x_1 \le |x_2| \end{cases}$$

• (0, 0) is a non-differentiable point

• if one starts from $x^0 = (16/9,\ 1)$ and uses exact line search, then
  ◦ $\{x^t\}$ are all differentiable points
  ◦ $x^t \to (0, 0)$ as $t \to \infty$


• even though it never hits non-differentiable points, steepest descent with exact line search gets stuck around a non-optimal point (i.e. (0, 0))

• problem: steepest descent directions may undergo large / discontinuous changes when close to convergence limits


(Projected) subgradient method

Practically, a popular choice is a "subgradient-based method"

$$x^{t+1} = \mathcal{P}_{\mathcal{C}}\big(x^t - \eta_t\, g^t\big) \tag{4.1}$$

where $g^t$ is any subgradient of $f$ at $x^t$

• the focus of this lecture

• caution: this update rule does not necessarily yield reduction w.r.t. the objective values
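As a concrete illustration of update rule (4.1), here is a minimal Python sketch of one projected subgradient step. The Euclidean-ball projection and the ℓ1 objective are illustrative choices of mine, not prescribed by the slides:

```python
import numpy as np

def project_ball(x, radius=1.0):
    # Euclidean projection onto {z : ||z||_2 <= radius},
    # a concrete stand-in for the abstract projection P_C
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else (radius / nrm) * x

def subgradient_step(x, subgrad, eta, project):
    # update rule (4.1): x^{t+1} = P_C(x^t - eta_t * g^t)
    return project(x - eta * subgrad(x))

# example: f(x) = ||x||_1, for which sign(x) is a subgradient
x = np.array([2.0, -1.0])
x_new = subgradient_step(x, np.sign, eta=0.5, project=project_ball)
```

Note that nothing here guarantees $f(x^{t+1}) < f(x^t)$; the caution above applies to this sketch as well.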


Subgradients

We say $g$ is a subgradient of $f$ at the point $x$ if

$$f(z) \ge f(x) + g^\top (z - x), \quad \forall z \tag{4.2}$$

where the right-hand side is a linear under-estimate of $f$

• the set of all subgradients of $f$ at $x$ is called the subdifferential of $f$ at $x$, denoted by $\partial f(x)$
Example: f(x) = |x|

$$\partial f(x) = \begin{cases} \{-1\}, & \text{if } x < 0 \\ [-1, 1], & \text{if } x = 0 \\ \{1\}, & \text{if } x > 0 \end{cases}$$
Example: a subgradient of norms at 0

Let $f(x) = \|x\|$ for any norm $\|\cdot\|$. Then for any $g$ obeying $\|g\|_* \le 1$,

$$g \in \partial f(0)$$

where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$ (i.e. $\|x\|_* := \sup_{z: \|z\| \le 1} \langle z, x \rangle$)

Proof: it suffices to prove that

$$f(z) \ge f(0) + \langle g, z - 0 \rangle\ \ \forall z \quad \Longleftrightarrow \quad \langle g, z \rangle \le \|z\|\ \ \forall z$$

This follows from the generalized Cauchy-Schwarz inequality: $\langle g, z \rangle \le \|g\|_* \|z\| \le \|z\|$
Example: max{f1(x), f2(x)}

$f(x) = \max\{f_1(x), f_2(x)\}$ where $f_1$ and $f_2$ are differentiable

$$\partial f(x) = \begin{cases} \{f_1'(x)\}, & \text{if } f_1(x) > f_2(x) \\ [f_1'(x), f_2'(x)], & \text{if } f_1(x) = f_2(x) \\ \{f_2'(x)\}, & \text{if } f_1(x) < f_2(x) \end{cases}$$
Basic rules

• scaling: ∂(αf ) = α∂f (for α > 0)

• summation: ∂(f1 + f2 ) = ∂f1 + ∂f2



Example: the ℓ1 norm

$$f(x) = \|x\|_1 = \sum_{i=1}^{n} \underbrace{|x_i|}_{=: f_i(x)}$$

since

$$\partial f_i(x) = \begin{cases} \{\mathrm{sgn}(x_i)\, e_i\}, & \text{if } x_i \ne 0 \\ [-1, 1] \cdot e_i, & \text{if } x_i = 0 \end{cases}$$

we have

$$\sum_{i:\, x_i \ne 0} \mathrm{sgn}(x_i)\, e_i \in \partial f(x)$$
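This ℓ1 subdifferential claim is easy to sanity-check numerically; a small sketch (note that `np.sign` silently picks 0 from the interval [−1, 1] at zero coordinates):

```python
import numpy as np

def l1_subgradient(x):
    # sum_{i: x_i != 0} sgn(x_i) e_i, choosing 0 at coordinates where x_i = 0
    return np.sign(x)

# verify the subgradient inequality f(z) >= f(x) + <g, z - x> at random z
rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.0])
g = l1_subgradient(x)
ok = all(
    np.abs(z).sum() >= np.abs(x).sum() + g @ (z - x) - 1e-12
    for z in rng.standard_normal((100, 3))
)
```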


Basic rules (cont.)

• affine transformation: if $h(x) = f(Ax + b)$, then

$$\partial h(x) = A^\top \partial f(Ax + b)$$


Example: ‖Ax + b‖1

$$h(x) = \|Ax + b\|_1$$

letting $f(x) = \|x\|_1$ and $A = [a_1, \cdots, a_m]^\top$, we have

$$g = \sum_{i:\, a_i^\top x + b_i \ne 0} \mathrm{sgn}\big(a_i^\top x + b_i\big)\, e_i \in \partial f(Ax + b)$$

$$\Longrightarrow \quad A^\top g = \sum_{i:\, a_i^\top x + b_i \ne 0} \mathrm{sgn}\big(a_i^\top x + b_i\big)\, a_i \in \partial h(x)$$


Basic rules (cont.)

• chain rule: suppose $f$ is convex, and $g$ is differentiable, nondecreasing, and convex. Let $h = g \circ f$; then

$$\partial h(x) = g'(f(x))\, \partial f(x)$$

• composition: suppose $f(x) = h(f_1(x), \cdots, f_n(x))$, where the $f_i$'s are convex, and $h$ is differentiable, nondecreasing, and convex. Let $q = \nabla h(y)\big|_{y = [f_1(x), \cdots, f_n(x)]}$ and $g_i \in \partial f_i(x)$. Then

$$q_1 g_1 + \cdots + q_n g_n \in \partial f(x)$$


Basic rules (cont.)

• pointwise maximum: if $f(x) = \max_{1 \le i \le k} f_i(x)$, then

$$\partial f(x) = \mathrm{conv}\Big(\bigcup \big\{\partial f_i(x) \,\big|\, f_i(x) = f(x)\big\}\Big)$$

i.e. the convex hull of the subdifferentials of all active functions

• pointwise supremum: if $f(x) = \sup_{\alpha \in \mathcal{F}} f_\alpha(x)$, then

$$\partial f(x) = \mathrm{closure}\Big(\mathrm{conv}\Big(\bigcup \big\{\partial f_\alpha(x) \,\big|\, f_\alpha(x) = f(x)\big\}\Big)\Big)$$


Example: piece-wise linear functions

$$f(x) = \max_{1 \le i \le m}\ \big(a_i^\top x + b_i\big)$$

pick any $a_j$ s.t. $a_j^\top x + b_j = \max_i\, a_i^\top x + b_i$; then

$$a_j \in \partial f(x)$$
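For piecewise linear functions this rule is straightforward to implement: evaluate all affine pieces, pick an active one, and return its slope. A small sketch (the matrix, offsets, and evaluation point are made up for illustration):

```python
import numpy as np

def pwl_subgradient(A, b, x):
    # f(x) = max_i a_i^T x + b_i ; any active row a_j is a subgradient
    j = int(np.argmax(A @ x + b))
    return A[j]

A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([0.0, 0.5, 0.0])
x = np.array([2.0, 1.0])
g = pwl_subgradient(A, b, x)  # slope of the piece achieving the max at x
```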


Example: the ℓ∞ norm

$$f(x) = \|x\|_\infty = \max_{1 \le i \le n} |x_i|$$

if $x \ne 0$, pick any index $j$ obeying $|x_j| = \max_i |x_i|$ to obtain

$$\mathrm{sgn}(x_j)\, e_j \in \partial f(x)$$


Example: the maximum eigenvalue

$$f(x) = \lambda_{\max}\big(x_1 A_1 + \cdots + x_n A_n\big)$$

where $A_1, \cdots, A_n$ are real symmetric matrices

Rewrite

$$f(x) = \sup_{y:\, \|y\|_2 = 1}\ y^\top \big(x_1 A_1 + \cdots + x_n A_n\big)\, y$$

as the supremum of affine functions of $x$. Therefore, taking $y$ as the leading eigenvector of $x_1 A_1 + \cdots + x_n A_n$, we have

$$\big[y^\top A_1 y, \cdots, y^\top A_n y\big]^\top \in \partial f(x)$$
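This recipe translates directly into code: form the matrix, take a leading eigenvector, and return the vector of quadratic forms. A sketch using `numpy.linalg.eigh` (the example matrices are my own):

```python
import numpy as np

def lambda_max_subgradient(mats, x):
    # f(x) = lambda_max(sum_i x_i A_i); with y a leading (unit) eigenvector,
    # the vector (y^T A_1 y, ..., y^T A_n y) lies in the subdifferential
    M = sum(xi * A for xi, A in zip(x, mats))
    w, V = np.linalg.eigh(M)   # eigenvalues in ascending order
    y = V[:, -1]               # leading eigenvector
    return np.array([y @ A @ y for A in mats]), w[-1]

A1 = np.array([[1.0, 0.0], [0.0, -1.0]])
A2 = np.array([[0.0, 1.0], [1.0, 0.0]])
x = np.array([1.0, 0.0])
g, fval = lambda_max_subgradient([A1, A2], x)
```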


Example: the nuclear norm

Let $X \in \mathbb{R}^{m \times n}$ with SVD $X = U \Sigma V^\top$ and

$$f(X) = \sum_{i=1}^{\min\{m, n\}} \sigma_i(X)$$

where $\sigma_i(X)$ is the $i$-th largest singular value of $X$

Rewrite

$$f(X) = \sup_{\text{orthonormal } A, B} \big\langle AB^\top, X \big\rangle =: \sup_{\text{orthonormal } A, B} f_{A,B}(X)$$

Recognizing that $f_{A,B}(X)$ is maximized by $A = U$ and $B = V$, and that $\nabla f_{A,B}(X) = AB^\top$, we have

$$U V^\top \in \partial f(X)$$
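One can check numerically that $UV^\top$ from the thin SVD satisfies the subgradient inequality for the nuclear norm; a sketch with random matrices of my choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3))

# U V^T from the thin SVD is a subgradient of the nuclear norm at X
U, s, Vt = np.linalg.svd(X, full_matrices=False)
G = U @ Vt
nuc = s.sum()  # nuclear norm ||X||_* = sum of singular values

# check the subgradient inequality ||Z||_* >= ||X||_* + <G, Z - X>
Z = rng.standard_normal((4, 3))
lhs = np.linalg.svd(Z, compute_uv=False).sum()
rhs = nuc + np.sum(G * (Z - X))
```

Here $\langle G, X\rangle = \|X\|_*$ exactly, so the inequality reduces to the dual-norm bound $\langle G, Z\rangle \le \|Z\|_*$.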


Negative subgradients are not necessarily descent directions

Example: f(x) = |x1| + 3|x2|

At x = (1, 0):

• $g_1 = (1, 0) \in \partial f(x)$, and $-g_1$ is a descent direction

• $g_2 = (1, 3) \in \partial f(x)$, but $-g_2$ is not a descent direction

Reason: lack of continuity; one can change directions significantly without violating the validity of subgradients
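The claim is easy to verify numerically by stepping a small amount along each negative subgradient:

```python
# f(x) = |x1| + 3|x2|; at x = (1, 0) both (1, 0) and (1, 3) are subgradients,
# but only the negative of the first is a descent direction
f = lambda x1, x2: abs(x1) + 3 * abs(x2)

x = (1.0, 0.0)
t = 1e-3  # small step along each negative subgradient

along_g1 = f(x[0] - t * 1.0, x[1] - t * 0.0)  # step along -g1 = -(1, 0)
along_g2 = f(x[0] - t * 1.0, x[1] - t * 3.0)  # step along -g2 = -(1, 3)
```

Stepping along $-g_1$ decreases $f$ below $f(1,0) = 1$, while stepping along $-g_2$ increases it.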
Since $f(x^t)$ is not necessarily monotone, we will keep track of the best point so far

$$f^{\mathrm{best},t} := \min_{1 \le i \le t} f(x^i)$$

We also denote by $f^{\mathrm{opt}} := \min_x f(x)$ the optimal objective value


Convex and Lipschitz problems

Clearly, we cannot analyze all nonsmooth functions. A nice (and widely encountered) class to start with is Lipschitz functions, i.e. the set of all $f$ obeying

$$|f(x) - f(z)| \le L_f\, \|x - z\|_2 \qquad \forall\, x, z$$


Fundamental inequality for projected subgradient methods

We'd like to optimize $\|x^{t+1} - x^*\|_2^2$, but we don't have access to $x^*$

Key idea (majorization-minimization): find another function that majorizes $\|x^{t+1} - x^*\|_2^2$, and optimize the majorizing function

Lemma 4.1
The projected subgradient update rule (4.1) obeys

$$\|x^{t+1} - x^*\|_2^2 \le \underbrace{\|x^t - x^*\|_2^2}_{\text{fixed}} - 2\eta_t \big(f(x^t) - f^{\mathrm{opt}}\big) + \eta_t^2\, \|g^t\|_2^2 \tag{4.3}$$

where the right-hand side serves as the majorizing function


Proof of Lemma 4.1

$$\begin{aligned}
\|x^{t+1} - x^*\|_2^2 &= \big\|\mathcal{P}_{\mathcal{C}}\big(x^t - \eta_t g^t\big) - \mathcal{P}_{\mathcal{C}}(x^*)\big\|_2^2 \\
&\le \|x^t - \eta_t g^t - x^*\|_2^2 \qquad (\text{nonexpansiveness of projection}) \\
&= \|x^t - x^*\|_2^2 - 2\eta_t \langle x^t - x^*,\, g^t \rangle + \eta_t^2 \|g^t\|_2^2 \\
&\le \|x^t - x^*\|_2^2 - 2\eta_t \big(f(x^t) - f(x^*)\big) + \eta_t^2 \|g^t\|_2^2
\end{aligned}$$

where the last line uses the subgradient inequality $f(x^*) - f(x^t) \ge \langle x^* - x^t,\, g^t \rangle$


Polyak's stepsize rule

The majorizing function in (4.3) suggests a stepsize (Polyak '87)

$$\eta_t = \frac{f(x^t) - f^{\mathrm{opt}}}{\|g^t\|_2^2} \tag{4.4}$$

which leads to the error reduction

$$\|x^{t+1} - x^*\|_2^2 \le \|x^t - x^*\|_2^2 - \frac{\big(f(x^t) - f(x^*)\big)^2}{\|g^t\|_2^2} \tag{4.5}$$

• useful if $f^{\mathrm{opt}}$ is known

• the estimation error is monotonically decreasing with Polyak's stepsize
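A minimal sketch of the (unconstrained) subgradient method with Polyak's stepsize (4.4) on $f(x) = \|x\|_1$, where $f^{\mathrm{opt}} = 0$ is known; the problem choice is mine, and $\mathcal{P}_{\mathcal{C}}$ is the identity here:

```python
import numpy as np

# minimize f(x) = ||x||_1 with Polyak's stepsize; f_opt = 0 is known
f = lambda x: np.abs(x).sum()
x = np.array([3.0, -2.0, 1.0])
errors = [np.linalg.norm(x)]      # distance to the minimizer x* = 0
for _ in range(200):
    g = np.sign(x)                # a subgradient of ||.||_1 at x
    if f(x) == 0:
        break
    eta = f(x) / (g @ g)          # (f(x^t) - f_opt) / ||g^t||_2^2
    x = x - eta * g
    errors.append(np.linalg.norm(x))
```

Consistent with (4.5), the recorded distances to $x^*$ decrease monotonically.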


Example: projection onto intersection of convex sets

[figure: two intersecting convex sets]

Let $C_1, C_2$ be closed convex sets and suppose $C_1 \cap C_2 \ne \emptyset$

find $x \in C_1 \cap C_2$

⇕

$$\text{minimize}_x\ \max\big\{\mathrm{dist}_{C_1}(x),\ \mathrm{dist}_{C_2}(x)\big\}$$

where $\mathrm{dist}_C(x) := \min_{z \in C} \|x - z\|_2$
Example: projection onto intersection of convex sets

For this problem, the subgradient method with Polyak's stepsize rule is equivalent to alternating projection:

$$x^{t+1} = \mathcal{P}_{C_1}(x^t), \qquad x^{t+2} = \mathcal{P}_{C_2}(x^{t+1})$$
Example: projection onto intersection of convex sets

Proof: Use the subgradient rule for pointwise max functions to get

$$g^t \in \partial\, \mathrm{dist}_{C_i}(x^t), \qquad \text{where } i = \arg\max_{j=1,2} \mathrm{dist}_{C_j}(x^t)$$

If $\mathrm{dist}_{C_i}(x^t) \ne 0$, then one has

$$g^t = \nabla\, \mathrm{dist}_{C_i}(x^t) = \frac{x^t - \mathcal{P}_{C_i}(x^t)}{\mathrm{dist}_{C_i}(x^t)}$$

which follows since $\nabla\big(\tfrac{1}{2}\mathrm{dist}_{C_i}^2(x^t)\big) = x^t - \mathcal{P}_{C_i}(x^t)$ (homework) and $\nabla\big(\tfrac{1}{2}\mathrm{dist}_{C_i}^2(x^t)\big) = \mathrm{dist}_{C_i}(x^t) \cdot \nabla\, \mathrm{dist}_{C_i}(x^t)$


Example: projection onto intersection of convex sets

Proof (cont.): Adopting Polyak's stepsize rule and recognizing that $\|g^t\|_2 = 1$, we arrive at

$$x^{t+1} = x^t - \eta_t g^t = x^t - \underbrace{\frac{\mathrm{dist}_{C_i}(x^t)}{\|g^t\|_2^2}}_{=\eta_t} \cdot \frac{x^t - \mathcal{P}_{C_i}(x^t)}{\mathrm{dist}_{C_i}(x^t)} = \mathcal{P}_{C_i}(x^t)$$

where $i = \arg\max_{j=1,2} \mathrm{dist}_{C_j}(x^t)$
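The equivalence can be observed directly by running alternating projections, which land in the intersection. A sketch with a unit ball and a halfspace (the two sets are chosen by me for illustration):

```python
import numpy as np

# C1 = unit Euclidean ball, C2 = halfspace {x : x[0] >= 0.5};
# the subgradient method with Polyak's stepsize reduces to
# alternating projections between the two sets
def proj_ball(x):
    n = np.linalg.norm(x)
    return x if n <= 1 else x / n

def proj_halfspace(x):
    y = x.copy()
    y[0] = max(y[0], 0.5)
    return y

x = np.array([-2.0, 2.0])
for _ in range(100):
    x = proj_halfspace(proj_ball(x))
```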


Convergence rate with Polyak’s stepsize

Theorem 4.2 (Convergence of projected subgradient method with Polyak's stepsize)

Suppose $f$ is convex and $L_f$-Lipschitz continuous. Then the projected subgradient method (4.1) with Polyak's stepsize rule obeys

$$f^{\mathrm{best},t} - f^{\mathrm{opt}} \le \frac{L_f\, \|x^0 - x^*\|_2}{\sqrt{t+1}}$$

• sublinear convergence rate $O(1/\sqrt{t})$


Proof of Theorem 4.2

We have seen from (4.5) that

$$\big(f(x^t) - f(x^*)\big)^2 \le \Big\{\|x^t - x^*\|_2^2 - \|x^{t+1} - x^*\|_2^2\Big\}\, \|g^t\|_2^2 \le \Big\{\|x^t - x^*\|_2^2 - \|x^{t+1} - x^*\|_2^2\Big\}\, L_f^2$$

Applying this recursively for all iterations (from the 0th to the $t$th) and summing up yields

$$\sum_{k=0}^{t} \big(f(x^k) - f(x^*)\big)^2 \le \Big\{\|x^0 - x^*\|_2^2 - \|x^{t+1} - x^*\|_2^2\Big\}\, L_f^2$$

$$\Longrightarrow \quad (t+1)\, \big(f^{\mathrm{best},t} - f^{\mathrm{opt}}\big)^2 \le \|x^0 - x^*\|_2^2\, L_f^2$$

which concludes the proof
Other stepsize choices?

Unfortunately, Polyak's stepsize rule requires knowledge of $f^{\mathrm{opt}}$, which is often unknown a priori

We therefore often need simpler rules for setting stepsizes


Convex and Lipschitz problems

Theorem 4.3 (Subgradient methods for convex and Lipschitz functions)

Suppose $f$ is convex and $L_f$-Lipschitz continuous. Then the projected subgradient update rule (4.1) obeys

$$f^{\mathrm{best},t} - f^{\mathrm{opt}} \le \frac{\|x^0 - x^*\|_2^2 + L_f^2 \sum_{i=0}^{t} \eta_i^2}{2 \sum_{i=0}^{t} \eta_i}$$


Implications: stepsize rules

• Constant stepsize $\eta_t \equiv \eta$:

$$\lim_{t \to \infty} f^{\mathrm{best},t} - f^{\mathrm{opt}} \le \frac{L_f^2\, \eta}{2}$$

i.e. may converge to non-optimal points

• Diminishing stepsize obeying $\sum_t \eta_t^2 < \infty$ and $\sum_t \eta_t \to \infty$:

$$\lim_{t \to \infty} f^{\mathrm{best},t} = f^{\mathrm{opt}}$$

i.e. converges to optimal points


Implications: stepsize rules

• Optimal choice? $\eta_t = \frac{1}{\sqrt{t}}$:

$$f^{\mathrm{best},t} - f^{\mathrm{opt}} \lesssim \frac{\|x^0 - x^*\|_2^2 + L_f^2 \log t}{\sqrt{t}}$$

i.e. attains $\varepsilon$-accuracy within about $O(1/\varepsilon^2)$ iterations (ignoring the log factor)
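A sketch of the method with the diminishing stepsize $\eta_t = 1/\sqrt{t+1}$ on $f(x) = \|x\|_1$ (the problem is my illustrative choice; no knowledge of $f^{\mathrm{opt}}$ is used). The running best value decays even though individual iterates keep oscillating around the minimizer:

```python
import numpy as np

# subgradient method with eta_t = 1/sqrt(t+1) on f(x) = ||x||_1,
# tracking the running best objective value f^{best,t}
f = lambda x: np.abs(x).sum()
x = np.array([5.0, -3.0])
best = f(x)
for t in range(5000):
    x = x - np.sign(x) / np.sqrt(t + 1)   # step along a negative subgradient
    best = min(best, f(x))
```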


Proof of Theorem 4.3

Applying Lemma 4.1 recursively gives

$$\|x^{t+1} - x^*\|_2^2 \le \|x^0 - x^*\|_2^2 - 2 \sum_{i=0}^{t} \eta_i \big(f(x^i) - f^{\mathrm{opt}}\big) + \sum_{i=0}^{t} \eta_i^2\, \|g^i\|_2^2$$

Rearranging terms, we are left with

$$2 \sum_{i=0}^{t} \eta_i \big(f(x^i) - f^{\mathrm{opt}}\big) \le \|x^0 - x^*\|_2^2 - \|x^{t+1} - x^*\|_2^2 + \sum_{i=0}^{t} \eta_i^2 \|g^i\|_2^2 \le \|x^0 - x^*\|_2^2 + L_f^2 \sum_{i=0}^{t} \eta_i^2$$

$$\Longrightarrow \quad f^{\mathrm{best},t} - f^{\mathrm{opt}} \le \frac{\sum_{i=0}^{t} \eta_i \big(f(x^i) - f^{\mathrm{opt}}\big)}{\sum_{i=0}^{t} \eta_i} \le \frac{\|x^0 - x^*\|_2^2 + L_f^2 \sum_{i=0}^{t} \eta_i^2}{2 \sum_{i=0}^{t} \eta_i}$$


Strongly convex and Lipschitz problems

If $f$ is strongly convex, then the convergence guarantee can be improved to $O(1/t)$, as long as the stepsize diminishes at rate $O(1/t)$

Theorem 4.4 (Subgradient methods for strongly convex and Lipschitz functions)

Let $f$ be $\mu$-strongly convex and $L_f$-Lipschitz continuous over $\mathcal{C}$. If $\eta_t = \frac{2}{\mu(t+1)}$, then

$$f^{\mathrm{best},t} - f^{\mathrm{opt}} \le \frac{2 L_f^2}{\mu} \cdot \frac{1}{t+1}$$


Proof of Theorem 4.4

When $f$ is $\mu$-strongly convex, Lemma 4.1 can be improved to (exercise)

$$\|x^{t+1} - x^*\|_2^2 \le (1 - \mu \eta_t)\|x^t - x^*\|_2^2 - 2\eta_t \big(f(x^t) - f^{\mathrm{opt}}\big) + \eta_t^2 \|g^t\|_2^2$$

$$\Longrightarrow \quad f(x^t) - f^{\mathrm{opt}} \le \frac{1 - \mu \eta_t}{2 \eta_t}\|x^t - x^*\|_2^2 - \frac{1}{2 \eta_t}\|x^{t+1} - x^*\|_2^2 + \frac{\eta_t}{2}\|g^t\|_2^2$$

Since $\eta_t = 2/(\mu(t+1))$, we have

$$f(x^t) - f^{\mathrm{opt}} \le \frac{\mu(t-1)}{4}\|x^t - x^*\|_2^2 - \frac{\mu(t+1)}{4}\|x^{t+1} - x^*\|_2^2 + \frac{1}{\mu(t+1)}\|g^t\|_2^2$$

and hence

$$t \big(f(x^t) - f^{\mathrm{opt}}\big) \le \frac{\mu t(t-1)}{4}\|x^t - x^*\|_2^2 - \frac{\mu t(t+1)}{4}\|x^{t+1} - x^*\|_2^2 + \frac{1}{\mu}\|g^t\|_2^2$$


Proof of Theorem 4.4 (cont.)

Summing over all iterations up to $t$ (the $k = 0$ term on the left vanishes), we get

$$\sum_{k=0}^{t} k \big(f(x^k) - f^{\mathrm{opt}}\big) \le -\frac{\mu t(t+1)}{4}\|x^{t+1} - x^*\|_2^2 + \frac{1}{\mu} \sum_{k=1}^{t} \|g^k\|_2^2 \le \frac{t}{\mu} L_f^2$$

$$\Longrightarrow \quad f^{\mathrm{best},t} - f^{\mathrm{opt}} \le \frac{t L_f^2 / \mu}{\sum_{k=1}^{t} k} = \frac{2 L_f^2}{\mu} \cdot \frac{1}{t+1}$$


Summary: subgradient methods

                                          stepsize rule    convergence rate    iteration complexity
  convex & Lipschitz problems             η_t ≍ 1/√t       O(1/√t)             O(1/ε²)
  strongly convex & Lipschitz problems    η_t ≍ 1/t        O(1/t)              O(1/ε)


Convex-concave saddle point problems

$$\text{minimize}_{x \in \mathcal{X}}\ \max_{y \in \mathcal{Y}}\ f(x, y)$$

• $f(x, y)$: convex in $x$ and concave in $y$

• $\mathcal{X}, \mathcal{Y}$: bounded closed convex sets

• arises in game theory, robust optimization, generative adversarial networks (GANs), ...

• under mild conditions, it is equivalent to its dual formulation

$$\text{maximize}_{y \in \mathcal{Y}}\ \min_{x \in \mathcal{X}}\ f(x, y)$$


Saddle points

[figure: surface plot of a convex-concave function and its saddle point]

The optimal point $(x^*, y^*)$ obeys

$$f(x^*, y) \le f(x^*, y^*) \le f(x, y^*), \qquad \forall x \in \mathcal{X},\ y \in \mathcal{Y}$$
t t t t
Projected subgradient method

A natural strategy is to apply the subgradient-based approach

$$\begin{bmatrix} x^{t+1} \\ y^{t+1} \end{bmatrix} = \mathcal{P}_{\mathcal{X} \times \mathcal{Y}}\left(\begin{bmatrix} x^t \\ y^t \end{bmatrix} - \eta_t \begin{bmatrix} g_x^t \\ -g_y^t \end{bmatrix}\right) \tag{4.6}$$

i.e. a projection of (subgradient descent on $x^t$, subgradient ascent on $y^t$), where $g_x^t \in \partial_x f(x^t, y^t)$ and $-g_y^t \in \partial_y \big(-f(x^t, y^t)\big)$


Performance metric

One way to measure the quality of a solution is via the following error metric (think of it as a certain "duality gap")

$$\varepsilon(x, y) := \Big(\max_{\tilde{y} \in \mathcal{Y}} f(x, \tilde{y}) - f^{\mathrm{opt}}\Big) + \Big(f^{\mathrm{opt}} - \min_{\tilde{x} \in \mathcal{X}} f(\tilde{x}, y)\Big) = \max_{\tilde{y} \in \mathcal{Y}} f(x, \tilde{y}) - \min_{\tilde{x} \in \mathcal{X}} f(\tilde{x}, y)$$

where $f^{\mathrm{opt}} := f(x^*, y^*)$ with $(x^*, y^*)$ the optimal solution


Convex-concave and Lipschitz problems

Theorem 4.5 (Subgradient methods for saddle point problems)

Suppose $f$ is convex in $x$ and concave in $y$, and is $L_f$-Lipschitz continuous over $\mathcal{X} \times \mathcal{Y}$. Let $D_{\mathcal{X}}$ (resp. $D_{\mathcal{Y}}$) be the diameter of $\mathcal{X}$ (resp. $\mathcal{Y}$). Then the projected subgradient method (4.6) obeys

$$\varepsilon(\hat{x}^t, \hat{y}^t) \le \frac{D_{\mathcal{X}}^2 + D_{\mathcal{Y}}^2 + L_f^2 \sum_{\tau=0}^{t} \eta_\tau^2}{2 \sum_{\tau=0}^{t} \eta_\tau}$$

where $\hat{x}^t = \frac{\sum_{\tau=0}^{t} \eta_\tau x^\tau}{\sum_{\tau=0}^{t} \eta_\tau}$ and $\hat{y}^t = \frac{\sum_{\tau=0}^{t} \eta_\tau y^\tau}{\sum_{\tau=0}^{t} \eta_\tau}$

• similar to our theory for convex problems

• suggests a varying stepsize $\eta_t \asymp 1/\sqrt{t}$


Iterate averaging

Notably, it is crucial to output the weighted averages $(\hat{x}^t, \hat{y}^t)$ of the iterates of the subgradient method; the original iterates $(x^t, y^t)$ might not converge

Example (bilinear game): $f(x, y) = xy$

• When $\eta_t \to 0$ (continuous limit), $(x^t, y^t)$ exhibits cycling behavior around $(x^*, y^*) = (0, 0)$ without converging to it
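The bilinear game makes this concrete: with $f(x, y) = xy$ on $[-1, 1]^2$ (the domain is my illustrative choice), the raw descent-ascent iterates keep cycling away from $(0, 0)$, while the weighted averages from Theorem 4.5 approach the saddle point. A sketch:

```python
import numpy as np

# f(x, y) = x * y on X = Y = [-1, 1]; saddle point at (0, 0).
# Raw gradient descent-ascent cycles, the weighted averages converge.
clip = lambda v: np.clip(v, -1.0, 1.0)   # projection onto [-1, 1]
x, y = 0.5, 0.5
sum_eta = sum_x = sum_y = 0.0
for t in range(20000):
    eta = 1.0 / np.sqrt(t + 1)
    gx, gy = y, x                        # grad_x f = y, grad_y f = x
    x, y = clip(x - eta * gx), clip(y + eta * gy)  # simultaneous update
    sum_eta += eta
    sum_x += eta * x
    sum_y += eta * y
x_avg, y_avg = sum_x / sum_eta, sum_y / sum_eta
```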


Proof of Theorem 4.5

By the convexity-concavity of $f$,

$$f(x^t, y^t) - f(x, y^t) \le \langle g_x^t,\, x^t - x \rangle, \qquad x \in \mathcal{X}$$
$$f(x^t, y) - f(x^t, y^t) \le \langle g_y^t,\, y - y^t \rangle, \qquad y \in \mathcal{Y}$$

Adding these two inequalities yields

$$f(x^t, y) - f(x, y^t) \le \langle g_x^t,\, x^t - x \rangle - \langle g_y^t,\, y^t - y \rangle, \qquad x \in \mathcal{X},\ y \in \mathcal{Y}$$

Therefore, invoking the convexity-concavity of $f$ once again gives

$$\varepsilon(\hat{x}^t, \hat{y}^t) = \max_{y \in \mathcal{Y}} f(\hat{x}^t, y) - \min_{x \in \mathcal{X}} f(x, \hat{y}^t)$$
$$\le \frac{1}{\sum_{\tau=0}^{t} \eta_\tau}\left\{\max_{y \in \mathcal{Y}} \sum_{\tau=0}^{t} \eta_\tau f(x^\tau, y) - \min_{x \in \mathcal{X}} \sum_{\tau=0}^{t} \eta_\tau f(x, y^\tau)\right\}$$
$$\le \frac{1}{\sum_{\tau=0}^{t} \eta_\tau}\ \max_{x \in \mathcal{X},\, y \in \mathcal{Y}} \sum_{\tau=0}^{t} \eta_\tau \Big(\langle g_x^\tau,\, x^\tau - x \rangle - \langle g_y^\tau,\, y^\tau - y \rangle\Big) \tag{4.7}$$


Proof of Theorem 4.5 (cont.)

It then suffices to control the RHS of (4.7), as follows:

Lemma 4.6

$$\max_{x \in \mathcal{X},\, y \in \mathcal{Y}} \sum_{\tau=0}^{t} \eta_\tau \Big(\langle g_x^\tau,\, x^\tau - x \rangle - \langle g_y^\tau,\, y^\tau - y \rangle\Big) \le \frac{D_{\mathcal{X}}^2 + D_{\mathcal{Y}}^2 + L_f^2 \sum_{\tau=0}^{t} \eta_\tau^2}{2}$$

This lemma together with (4.7) immediately establishes Theorem 4.5


Proof of Lemma 4.6

For any $x \in \mathcal{X}$ we have

$$\begin{aligned}
\|x^{\tau+1} - x\|_2^2 &= \|\mathcal{P}_{\mathcal{X}}(x^\tau - \eta_\tau g_x^\tau) - \mathcal{P}_{\mathcal{X}}(x)\|_2^2 \\
&\le \|x^\tau - \eta_\tau g_x^\tau - x\|_2^2 \qquad (\text{nonexpansiveness of projection onto convex } \mathcal{X}) \\
&= \|x^\tau - x\|_2^2 - 2\eta_\tau \langle x^\tau - x,\, g_x^\tau \rangle + \eta_\tau^2 \|g_x^\tau\|_2^2
\end{aligned}$$

$$\Longrightarrow \quad 2\eta_\tau \langle x^\tau - x,\, g_x^\tau \rangle \le \|x^\tau - x\|_2^2 - \|x^{\tau+1} - x\|_2^2 + \eta_\tau^2 \|g_x^\tau\|_2^2$$

Similarly, for any $y \in \mathcal{Y}$ one has

$$-2\eta_\tau \langle y^\tau - y,\, g_y^\tau \rangle \le \|y^\tau - y\|_2^2 - \|y^{\tau+1} - y\|_2^2 + \eta_\tau^2 \|g_y^\tau\|_2^2$$

Combining these two inequalities and using Lipschitz continuity yields

$$2\eta_\tau \langle g_x^\tau,\, x^\tau - x \rangle - 2\eta_\tau \langle g_y^\tau,\, y^\tau - y \rangle \le \|x^\tau - x\|_2^2 + \|y^\tau - y\|_2^2 - \|x^{\tau+1} - x\|_2^2 - \|y^{\tau+1} - y\|_2^2 + \eta_\tau^2 L_f^2$$


Proof of Lemma 4.6 (cont.)

Summing up these inequalities over $\tau = 0, \cdots, t$ gives

$$2 \sum_{\tau=0}^{t} \Big(\eta_\tau \langle g_x^\tau,\, x^\tau - x \rangle - \eta_\tau \langle g_y^\tau,\, y^\tau - y \rangle\Big) \le \|x^0 - x\|_2^2 + \|y^0 - y\|_2^2 - \|x^{t+1} - x\|_2^2 - \|y^{t+1} - y\|_2^2 + L_f^2 \sum_{\tau=0}^{t} \eta_\tau^2$$
$$\le \|x^0 - x\|_2^2 + \|y^0 - y\|_2^2 + L_f^2 \sum_{\tau=0}^{t} \eta_\tau^2 \le D_{\mathcal{X}}^2 + D_{\mathcal{Y}}^2 + L_f^2 \sum_{\tau=0}^{t} \eta_\tau^2$$

as claimed

Remark: this lemma does NOT rely on the convexity-concavity of $f(\cdot, \cdot)$


Reference

[1] "Convex optimization, EE364B lecture notes," S. Boyd, Stanford.
[2] "Convex optimization and algorithms," D. Bertsekas, 2015.
[3] "First-order methods in optimization," A. Beck, Vol. 25, SIAM, 2017.
[4] "Convex optimization: algorithms and complexity," S. Bubeck, Foundations and Trends in Machine Learning, 2015.
[5] "Optimization methods for large-scale systems, EE236C lecture notes," L. Vandenberghe, UCLA.
[6] "Introduction to optimization," B. Polyak, Optimization Software, 1987.
[7] "Robust stochastic approximation approach to stochastic programming," A. Nemirovski, A. Juditsky, G. Lan, A. Shapiro, SIAM Journal on Optimization, 2009.