
Optimization for Machine Learning

Lecture 8: Subgradient method; Accelerated gradient


6.881: MIT

Suvrit Sra
Massachusetts Institute of Technology

16 Mar, 2021
First-order methods

Subgradient method

Unconstrained convex problem

$$\min_x\ f(x)$$

1. Start with some guess $x^0$; set $k = 0$
2. If $0 \in \partial f(x^k)$, stop; output $x^k$
3. Otherwise, generate the next guess $x^{k+1}$
4. Repeat the above procedure until $f(x^k) \le f(x^*) + \epsilon$
Subgradient method

$$x^{k+1} = x^k - \eta_k g^k, \quad \text{where } g^k \in \partial f(x^k) \text{ is any subgradient}$$

Stepsize $\eta_k > 0$ must be chosen.

• The method generates a sequence $\{x^k\}_{k \ge 0}$
• Does this sequence converge to an optimal solution $x^*$?
• If yes, then how fast?
• What if we have constraints: $x \in C$?
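Before turning to those questions, here is a minimal Python sketch of the iteration itself (not from the slides); the helper `subgrad_f`, the stepsize schedule, and the 1-D example are illustrative assumptions.

```python
import numpy as np

def subgradient_method(subgrad_f, x0, stepsize, num_iters):
    """Run x_{k+1} = x_k - eta_k * g_k, with g_k any element of df(x_k).

    subgrad_f(x) returns some subgradient of f at x; stepsize(k) returns eta_k > 0.
    """
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for k in range(num_iters):
        g = subgrad_f(x)              # any subgradient works
        x = x - stepsize(k) * g       # note: not a descent step in general
        iterates.append(x.copy())
    return iterates

# Example: f(x) = |x| in 1-D; sign(x) is a valid subgradient (0 at x = 0).
traj = subgradient_method(lambda x: np.sign(x), x0=[5.0],
                          stepsize=lambda k: 1.0 / (k + 1), num_iters=50)
```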
Example

$$\min_x\ \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1, \qquad x^{k+1} = x^k - \eta_k\bigl(A^T(Ax^k - b) + \lambda\,\mathrm{sgn}(x^k)\bigr)$$

[Plots: objective value (log scale, roughly $10^0$ to $10^2$) versus iteration (0 to 100), for a basic run and for a more careful implementation.]
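A hedged Python sketch of this update (the stepsize schedule and problem data are illustrative assumptions; the slides' plots come from the course's own implementation):

```python
import numpy as np

def lasso_subgradient(A, b, lam, eta0=1e-3, num_iters=100):
    """Subgradient method for 0.5*||Ax - b||_2^2 + lam*||x||_1."""
    x = np.zeros(A.shape[1])
    objective = []
    for k in range(num_iters):
        g = A.T @ (A @ x - b) + lam * np.sign(x)      # subgradient of the objective
        x = x - (eta0 / np.sqrt(k + 1)) * g           # diminishing stepsize (a choice)
        objective.append(0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum())
    return x, objective
```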
Exercise

Exercise: Experiment with a deep neural network classifier where we want to learn sparse weights. In particular, experiment with the following loss function:
$$\min_x\ L(x) := \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(y_i,\ NN(x, a_i)\bigr) + \lambda\|x\|_1.$$
Implement a stochastic subgradient update to minimize $L$.

(Hint: If we pretend that the loss part is differentiable, then we can invoke Clarke's rule: $\partial^{\circ} L = \nabla\mathrm{loss} + \lambda\,\partial\mathrm{reg}$.)
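One possible way to set up the exercise is sketched below in PyTorch; the toy MLP, the synthetic data, and all hyperparameters are assumptions made for illustration, not the course's reference solution.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()
A = torch.randn(1000, 20)                 # synthetic inputs a_i (an assumption)
y = (A[:, 0] > 0).long()                  # synthetic labels y_i
lam, lr, batch = 1e-3, 1e-2, 32

for step in range(500):
    idx = torch.randint(0, A.shape[0], (batch,))
    loss = loss_fn(model(A[idx]), y[idx])     # smooth part of L on a minibatch
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            # Clarke's rule: gradient of the loss plus lam * sign(p),
            # where sign(p) is a subgradient of the l1 regularizer.
            p -= lr * (p.grad + lam * torch.sign(p))
```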
Subgradient method – stepsizes

• Constant: set $\eta_k = \eta > 0$ for $k \ge 0$
• Normalized: $\eta_k = \eta/\|g^k\|_2$ (so that $\|x^{k+1} - x^k\|_2 = \eta$)
• Square summable: $\sum_k \eta_k^2 < \infty$, $\sum_k \eta_k = \infty$
• Diminishing: $\lim_k \eta_k = 0$, $\sum_k \eta_k = \infty$
• Adaptive stepsizes (not covered)

Not a descent method!
Could use the best $f^k$ so far: $f_{\min}^k := \min_{0 \le i \le k} f^i$
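For concreteness, the schedules above can be written as small Python helpers (a sketch; the constants are arbitrary):

```python
import numpy as np

constant    = lambda k: 0.1                    # eta_k = eta > 0
sq_summable = lambda k: 1.0 / (k + 1)          # sum eta_k^2 < inf, sum eta_k = inf
diminishing = lambda k: 1.0 / np.sqrt(k + 1)   # eta_k -> 0, sum eta_k = inf

def normalized(eta, g):
    """Normalized stepsize eta_k = eta / ||g_k||_2, so each step has length eta."""
    return eta / max(np.linalg.norm(g), 1e-12)

def track_best(f_min, f_val):
    """Since the method is not descent, track f_min^k = min_{i <= k} f(x^i)."""
    return min(f_min, f_val)
```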
Convergence (sketch)
Convergence analysis

Assumptions
• Min is attained: $f^* := \inf_x f(x) > -\infty$, with $f(x^*) = f^*$
• Bounded subgradients: $\|g\|_2 \le G$ for all $g \in \partial f$
• Bounded domain: $\|x^0 - x^*\|_2 \le R$

Convergence results for: $f_{\min}^k := \min_{0 \le i \le k} f^i$
Subgradient method – convergence

Lyapunov function: distance to $x^*$ (instead of $f - f^*$)

$$\begin{aligned}
\|x^{k+1} - x^*\|_2^2 &= \|x^k - \eta_k g^k - x^*\|_2^2 \\
&= \|x^k - x^*\|_2^2 + \eta_k^2\|g^k\|_2^2 - 2\langle \eta_k g^k,\ x^k - x^*\rangle \\
&\le \|x^k - x^*\|_2^2 + \eta_k^2\|g^k\|_2^2 - 2\eta_k\bigl(f(x^k) - f^*\bigr),
\end{aligned}$$

since $f^* = f(x^*) \ge f(x^k) + \langle g^k,\ x^* - x^k\rangle$.

Apply the same argument to $\|x^k - x^*\|_2^2$ recursively:
$$\|x^{k+1} - x^*\|_2^2 \le \|x^0 - x^*\|_2^2 + \sum_{t=1}^{k}\eta_t^2\|g^t\|_2^2 - 2\sum_{t=1}^{k}\eta_t\bigl(f^t - f^*\bigr).$$

Now use our convenient assumptions!
Subgradient method – convergence

$$\|x^{k+1} - x^*\|_2^2 \le R^2 + G^2\sum_{t=1}^{k}\eta_t^2 - 2\sum_{t=1}^{k}\eta_t\bigl(f^t - f^*\bigr).$$

• To get a bound on the last term, simply notice that (for $t \le k$)
  $$f^t \ge f_{\min}^t \ge f_{\min}^k, \qquad \text{since } f_{\min}^t := \min_{0 \le i \le t} f(x^i)$$
• Plugging this in yields the bound
  $$2\sum_{t=1}^{k}\eta_t\bigl(f^t - f^*\bigr) \ge 2\bigl(f_{\min}^k - f^*\bigr)\sum_{t=1}^{k}\eta_t.$$
• So that we finally have
  $$0 \le \|x^{k+1} - x^*\|_2^2 \le R^2 + G^2\sum_{t=1}^{k}\eta_t^2 - 2\bigl(f_{\min}^k - f^*\bigr)\sum_{t=1}^{k}\eta_t$$
  $$f_{\min}^k - f^* \le \frac{R^2 + G^2\sum_{t=1}^{k}\eta_t^2}{2\sum_{t=1}^{k}\eta_t}$$
Subgradient method – convergence

$$f_{\min}^k - f^* \le \frac{R^2 + G^2\sum_{t=1}^{k}\eta_t^2}{2\sum_{t=1}^{k}\eta_t}$$

Exercise: Analyze $\lim_{k\to\infty} f_{\min}^k - f^*$ for the different choices of stepsize that we mentioned.

Constant step $\eta_k = \eta$: we obtain
$$f_{\min}^k - f^* \le \frac{R^2 + G^2 k\eta^2}{2k\eta} \to \frac{G^2\eta}{2} \quad \text{as } k \to \infty.$$

Square summable, not summable ($\sum_k \eta_k^2 < \infty$, $\sum_k \eta_k = \infty$): as $k \to \infty$, the numerator stays finite while the denominator $\to \infty$, so $f_{\min}^k \to f^*$.

In practice, a fair bit of stepsize tuning is needed, e.g. $\eta_t = a/(b + t)$.
Subgradient method – convergence

• Suppose we want $f_{\min}^k - f^* \le \epsilon$; how big should $k$ be?
• Optimize the bound over $\eta_t$: we want
  $$f_{\min}^k - f^* \le \frac{R^2 + G^2\sum_{t=1}^{k}\eta_t^2}{2\sum_{t=1}^{k}\eta_t} \le \epsilon$$
• For fixed $k$, the best possible stepsize is a constant $\eta$; minimizing
  $$\frac{R^2 + G^2 k\eta^2}{2k\eta} \quad \text{over } \eta \quad \text{gives} \quad \eta = \frac{R}{G\sqrt{k}}$$
• Then, after $k$ steps, $f_{\min}^k - f^* \le RG/\sqrt{k}$.
• For accuracy $\epsilon$, we need at least $(RG/\epsilon)^2 = O(1/\epsilon^2)$ steps
• (quite slow, but this already hits the lower bound!)
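A quick numeric sanity check of the $RG/\sqrt{k}$ bound (not from the slides), using $f(x) = |x|$ in one dimension, where $G = 1$ and $R = |x^0|$:

```python
import numpy as np

R, G, k = 5.0, 1.0, 10_000
eta = R / (G * np.sqrt(k))          # tuned constant stepsize
x, f_min = 5.0, np.inf
for _ in range(k):
    f_min = min(f_min, abs(x))
    x -= eta * np.sign(x)
print(f_min, "<=", R * G / np.sqrt(k))   # observed gap versus the RG/sqrt(k) bound
```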
Exercise: Support vector machines

• Let $D := \{(x_i, y_i) \mid x_i \in \mathbb{R}^n,\ y_i \in \{\pm 1\}\}$
• We wish to find $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ such that
  $$\min_{w,b}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\max\bigl[0,\ 1 - y_i(w^T x_i + b)\bigr]$$
• Derive and implement a subgradient method (a sketch is given below)
• Plot the evolution of the objective function
• Experiment with different values of $C > 0$
• Plot and keep track of $f_{\min}^k := \min_{0 \le t \le k} f(x^t)$
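A hedged sketch of one possible solution in Python (the diminishing stepsize and the interface are assumptions; `X` holds the $x_i$ as rows and `y` the labels):

```python
import numpy as np

def svm_subgradient(X, y, C=1.0, num_iters=1000):
    """Subgradient method for 0.5*||w||^2 + C*sum_i max(0, 1 - y_i*(w^T x_i + b))."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    f_min, history = np.inf, []
    for k in range(num_iters):
        active = y * (X @ w + b) < 1                 # points with nonzero hinge loss
        gw = w - C * (y[active][:, None] * X[active]).sum(axis=0)
        gb = -C * y[active].sum()
        eta = 1.0 / (k + 1)
        w, b = w - eta * gw, b - eta * gb
        f = 0.5 * w @ w + C * np.maximum(0, 1 - y * (X @ w + b)).sum()
        f_min = min(f_min, f)
        history.append(f_min)
    return w, b, history
```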
Exercise: Geometric median

• Let $a \in \mathbb{R}^n$ be a given vector.
• Let $f(x) = \sum_i |x - a_i|$, i.e., $f : \mathbb{R} \to \mathbb{R}_+$
• Implement different subgradient methods to minimize $f$ (a sketch follows below)
• Also keep track of $f_{\mathrm{best}}^k := \min_{0 \le i < k} f(x^i)$

Exercise: Implement the above. Plot the $f(x^k)$ values; also try to guess which optimum is being found.
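A minimal sketch for this exercise (the data vector `a` and the stepsize schedule are illustrative choices); the optimum it finds is the median of `a`:

```python
import numpy as np

a = np.array([1.0, 2.0, 2.5, 7.0, 9.0])    # example data (an assumption)
x, f_best, values = 0.0, np.inf, []
for k in range(200):
    g = np.sign(x - a).sum()                # a subgradient of f at x
    x -= (1.0 / np.sqrt(k + 1)) * g         # diminishing stepsize
    f = np.abs(x - a).sum()
    f_best = min(f_best, f)
    values.append(f)
print(x, np.median(a))                      # x should approach the median
```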
Optimization with simple constraints

$$\min\ f(x) \quad \text{s.t.} \quad x \in C$$

Previously:
$$x^{t+1} = x^t - \eta_t g^t$$
This could be infeasible!
Use projection.
Projected subgradient method

$$x^{k+1} = P_C\bigl(x^k - \eta_k g^k\bigr), \quad \text{where } g^k \in \partial f(x^k) \text{ is any subgradient}$$

• Projection = closest feasible point:
  $$P_C(x) = \arg\min_{y \in C}\ \|x - y\|_2$$
  (Assume $C$ is closed and convex; then the projection is unique.)
• Great as long as the projection is "easy"
• Same questions as before: Does it converge? For which stepsizes? How fast?
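As an illustration (not from the slides), a small Python sketch of the projected subgradient iteration, instantiated with projection onto the Euclidean unit ball; the test objective is an arbitrary choice:

```python
import numpy as np

def projected_subgradient(subgrad_f, project_C, x0, stepsize, num_iters):
    """x_{k+1} = P_C(x_k - eta_k * g_k); project_C is the Euclidean projection onto C."""
    x = project_C(np.asarray(x0, dtype=float))
    for k in range(num_iters):
        x = project_C(x - stepsize(k) * subgrad_f(x))
    return x

# Example: minimize ||x - c||_1 over the unit Euclidean ball.
c = np.array([2.0, -3.0, 0.5])
proj_ball = lambda z: z if np.linalg.norm(z) <= 1 else z / np.linalg.norm(z)
x_hat = projected_subgradient(lambda x: np.sign(x - c), proj_ball,
                              x0=np.zeros(3), stepsize=lambda k: 1.0 / (k + 1),
                              num_iters=500)
```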
Key idea: Projection Theorem

Let $C$ be nonempty, closed, and convex.

Recall the optimality condition: $y^* = P_C(z)$ iff
$$\langle z - y^*,\ y - y^*\rangle \le 0 \quad \text{for all } y \in C.$$

Verify: the projection is nonexpansive:
$$\|P_C(x) - P_C(z)\|_2 \le \|x - z\|_2 \quad \text{for all } x, z \in \mathbb{R}^n.$$
Convergence analysis

Assumptions
• Min is attained: $f^* := \inf_x f(x) > -\infty$, with $f(x^*) = f^*$
• Bounded subgradients: $\|g\|_2 \le G$ for all $g \in \partial f$
• Bounded domain: $\|x^0 - x^*\|_2 \le R$

Analysis
• Let $z^{t+1} = x^t - \eta_t g^t$ (the unprojected step).
• Then $x^{t+1} = P_C(z^{t+1})$.
• Recall the analysis of the unconstrained method:
  $$\|z^{t+1} - x^*\|_2^2 = \|x^t - \eta_t g^t - x^*\|_2^2 \le \|x^t - x^*\|_2^2 + \eta_t^2\|g^t\|_2^2 - 2\eta_t\bigl(f(x^t) - f^*\bigr)$$
• We only need to relate this to $\|x^{t+1} - x^*\|_2^2$; the rest is as before.
Convergence analysis: Key idea

• Using nonexpansiveness of the projection:
  $$\|x^{t+1} - x^*\|_2^2 = \|P_C(x^t - \eta_t g^t) - P_C(x^*)\|_2^2 \le \|x^t - \eta_t g^t - x^*\|_2^2 \le \|x^t - x^*\|_2^2 + \eta_t^2\|g^t\|_2^2 - 2\eta_t\bigl(f(x^t) - f^*\bigr)$$

Same convergence results as in the unconstrained case:
• within a neighborhood of the optimum for a constant stepsize
• converges for diminishing, non-summable stepsizes
Examples of simple projections

• Nonnegativity: $x \ge 0$, with $P_C(z) = [z]_+$
• $\ell_\infty$-ball: $\|x\|_\infty \le 1$
  Projection: $\min \|x - z\|_2$ s.t. $x \le 1$ and $x \ge -1$
  $P_{\|x\|_\infty \le 1}(z) = y$, where $y_i = \mathrm{sgn}(z_i)\min\{|z_i|, 1\}$
• Linear equality constraints $Ax = b$ (where $A \in \mathbb{R}^{n\times m}$ has rank $n$):
  $$P_C(z) = z - A^T(AA^T)^{-1}(Az - b) = \bigl(I - A^T(AA^T)^{-1}A\bigr)z + A^T(AA^T)^{-1}b$$
• Simplex: $x^T \mathbf{1} = 1$ and $x \ge 0$; doable in $O(n)$ time, similarly the $\ell_1$-norm ball (see the sketch below)
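For instance, the simplex projection can be sketched as follows; this is the standard sorting-based $O(n \log n)$ variant, while the $O(n)$ method mentioned on the slide uses median-finding and is omitted here:

```python
import numpy as np

def project_simplex(z):
    """Euclidean projection of z onto {x : x >= 0, sum(x) = 1} (sorting-based)."""
    u = np.sort(z)[::-1]                       # sort in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(z) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(z + theta, 0.0)

x = project_simplex(np.array([0.5, 2.0, -1.0]))
print(x, x.sum())                              # nonnegative entries summing to 1
```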
Some remarks

• Why care?
  - simple
  - low-memory
  - a stochastic version is possible

Another perspective:
$$x^{k+1} = \arg\min_{x \in C}\ \langle x, g^k\rangle + \frac{1}{2\eta_k}\|x - x^k\|^2$$

Mirror Descent version:
$$x^{k+1} = \arg\min_{x \in C}\ \langle x, g^k\rangle + \frac{1}{\eta_k}D_\varphi(x, x^k)$$
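As an illustration of the mirror-descent view (this particular instantiation is an assumption, not something spelled out on the slide): with the negative-entropy mirror map on the simplex, $D_\varphi$ is the KL divergence and the update has a closed form, the exponentiated-gradient step.

```python
import numpy as np

def mirror_descent_simplex(subgrad_f, x0, stepsize, num_iters):
    """Entropic mirror descent on the simplex: x_{k+1} is proportional to x_k * exp(-eta_k g_k)."""
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        x = x * np.exp(-stepsize(k) * subgrad_f(x))
        x /= x.sum()                      # renormalize back onto the simplex
    return x

# Example: minimize <c, x> over the simplex; mass concentrates on argmin(c).
c = np.array([3.0, 1.0, 2.0])
x = mirror_descent_simplex(lambda x: c, np.ones(3) / 3, lambda k: 0.5, num_iters=200)
```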
Accelerated gradient

Gradient methods – upper bounds

Theorem (Upper bound I). Let $f \in C_L^1$. Then,
$$\min_k \|\nabla f(x^k)\| \le \epsilon \quad \text{in } O(1/\epsilon^2) \text{ iterations.}$$

Theorem (Upper bound II). Let $f \in S_{L,\mu}^1$. Then,
$$f(x^k) - f(x^*) \le \frac{L}{2}\left(\frac{\kappa - 1}{\kappa + 1}\right)^{2k}\|x^0 - x^*\|_2^2.$$

Theorem (Upper bound III). Let $f \in C_L^1$ be convex. Then,
$$f(x^k) - f(x^*) \le \frac{2L\bigl(f(x^0) - f(x^*)\bigr)\|x^0 - x^*\|_2^2}{2L\|x^0 - x^*\|_2^2 + k\bigl(f(x^0) - f(x^*)\bigr)} \le \frac{2L\|x^0 - x^*\|_2^2}{k + 4}.$$
Gradient methods – lower bounds

Theorem (Carmon-Duchi-Hinder-Sidford 2017). There is an $f \in C_L^1$ such that achieving $\|\nabla f(x)\| \le \epsilon$ requires $\Omega(\epsilon^{-2})$ gradient evaluations.

Theorem (Nesterov). There exists $f \in S_{L,\mu}^\infty$ ($\mu > 0$, $\kappa > 1$) s.t.
$$f(x^k) - f(x^*) \ge \frac{\mu}{2}\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{2k}\|x^0 - x^*\|_2^2.$$

Theorem (Nesterov). For any $x^0 \in \mathbb{R}^n$ and $1 \le k \le \tfrac{1}{2}(n - 1)$, there is a convex $f \in C_L^1$ s.t.
$$f(x^k) - f(x^*) \ge \frac{3L\|x^0 - x^*\|_2^2}{32(k + 1)^2}, \qquad \|x^k - x^*\|_2^2 \ge \frac{1}{8}\|x^0 - x^*\|_2^2.$$
Accelerated gradient methods

Upper bounds: (i) $O(1/k)$; and (ii) a linear rate involving $\kappa$

Lower bounds: (i) $\Omega(1/k^2)$; and (ii) a linear rate involving $\sqrt{\kappa}$

Challenge: Close this gap!

Nesterov (1983) closed the gap.
Background: ravine method

Long, narrow ravines slow down GD.

Gel'fand-Tsetlin (1961): the ravine method.

Intuition: descending to the bottom of the ravine is not hard, but moving along the narrow ravine is harder. Thus, mix two types of steps: a gradient step and a "ravine step".

Simplest form of the ravine method:
$$x^{k+1} = y^k - \alpha\nabla f(y^k), \qquad y^{k+1} = x^{k+1} + \beta\bigl(x^{k+1} - x^k\bigr)$$
Background: Heavy-ball method

Polyak's Momentum Method (1964):
$$x^{k+1} = x^k - \eta_k\nabla f(x^k) + \beta_k\bigl(x^k - x^{k-1}\bigr)$$

Theorem. Let $f = \tfrac{1}{2}x^T A x + b^T x \in S_{L,\mu}^1$. Then, choosing
$$\eta_k = \frac{4}{(\sqrt{L} + \sqrt{\mu})^2}, \qquad \beta_k = q^2, \qquad q = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1},$$
the heavy-ball method satisfies $\|x^k - x^*\| = O(q^k)$.

Motivated originally by the so-called "ravine method" of Gelfand-Tsetlin (1961), which runs the iteration
$$z^k = x^k - \eta_k\nabla f(x^k), \qquad x^{k+1} = z^k + \beta_k\bigl(z^k - z^{k-1}\bigr)$$
Background: Heavy-ball method

Polyak's Momentum Method (1964):
$$x^{k+1} = x^k - \eta_k\nabla f(x^k) + \beta_k\bigl(x^k - x^{k-1}\bigr)$$

Can be viewed as a discretization of the second-order ODE
$$\ddot{x} + a\dot{x} + b\nabla f(x) = 0$$
(analogy: the movement of a heavy ball in a potential field $f(x)$ is governed not only by $\nabla f(x)$ but also by a momentum term).

Why does momentum help?
Explore: check out https://distill.pub/2017/momentum/

What about the general convex case?
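A small Python sketch of the heavy-ball iteration on a quadratic, using the parameter choice from the theorem above (the test matrix is an arbitrary assumption):

```python
import numpy as np

def heavy_ball(A, b, x0, num_iters=200):
    """Polyak momentum on f(x) = 0.5*x^T A x + b^T x with the tuned eta, beta."""
    evals = np.linalg.eigvalsh(A)
    L, mu = evals[-1], evals[0]
    kappa = L / mu
    eta = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
    beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(num_iters):
        grad = A @ x + b
        x, x_prev = x - eta * grad + beta * (x - x_prev), x
    return x

A = np.diag([1.0, 100.0])                       # ill-conditioned quadratic, kappa = 100
x_hat = heavy_ball(A, b=np.array([1.0, 1.0]), x0=np.zeros(2))
```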
Nesterov’s AGM

Nesterov's AGM

Nesterov's (1983) method:
$$x^{k+1} = y^k - \tfrac{1}{L}\nabla f(y^k), \qquad y^{k+1} = x^{k+1} + \beta_k\bigl(x^{k+1} - x^k\bigr)$$

Essentially the same as the ravine method!

$$\beta_k = \frac{\alpha_k - 1}{\alpha_{k+1}}, \qquad 2\alpha_{k+1} = 1 + \sqrt{4\alpha_k^2 + 1}, \qquad \alpha_0 = 1,$$
$$f(x^k) - f(x^*) \le \frac{2L\|y^0 - x^*\|^2}{(k + 2)^2}.$$

In the strongly convex case, we instead use $\beta_k = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$. This leads to $O(\sqrt{\kappa}\log(1/\epsilon))$ iterations to ensure $f(x^k) - f(x^*) \le \epsilon$.

(Remark: Nemirovski proposed a method that achieves the optimal complexity, but it required a 2D line-search. Nesterov's method was the real breakthrough and remains a fascinating topic to study even today.)
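A sketch of the iteration in Python (the quadratic test problem and the iteration count are assumptions for illustration):

```python
import numpy as np

def nesterov_agm(grad_f, L, x0, num_iters=500):
    """Nesterov (1983): gradient step from y^k, then momentum step with beta_k."""
    x = np.asarray(x0, dtype=float)
    y, alpha = x.copy(), 1.0
    for _ in range(num_iters):
        x_next = y - grad_f(y) / L                          # x^{k+1} = y^k - (1/L) grad f(y^k)
        alpha_next = (1.0 + np.sqrt(4.0 * alpha ** 2 + 1.0)) / 2.0
        beta = (alpha - 1.0) / alpha_next
        y = x_next + beta * (x_next - x)                    # y^{k+1}
        x, alpha = x_next, alpha_next
    return x

# Example: f(x) = 0.5*||Ax - b||^2 with L = largest eigenvalue of A^T A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A.T @ A)[-1]
x_hat = nesterov_agm(lambda x: A.T @ (A @ x - b), L, np.zeros(2))
```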
Analyzing Nesterov's method

(The ravine method worked well and sparked numerous heuristics for selecting its parameters and improving its behavior. However, its convergence was never proved. It inspired Polyak's heavy-ball method, which in turn seems to have inspired Nesterov's AGM.)

Some ways to analyze AGM:
• Nesterov's estimate sequence method
• Approaches based on potential (Lyapunov) functions
• Derivation based on viewing AGM as an approximate proximal point method
• Using "linear coupling," mixing a primal-dual view
• Analysis based on SDPs

See the discussion in the paper by Ahn and Sra (2020), referenced below.
Potential analysis – sketch

• Choose a potential that judges closeness of the iterates to the optimum
• Ensure the potential decreases with each iteration
• AGM does not satisfy $f(x^{k+1}) \le f(x^k)$, so...

Slightly more general AGM iteration:
$$\begin{aligned}
x^{k+1} &\leftarrow y^k + \alpha_{k+1}\bigl(z^k - y^k\bigr) \\
y^{k+1} &\leftarrow x^{k+1} - \gamma_{k+1}\nabla f(x^{k+1}) \\
z^{k+1} &\leftarrow x^{k+1} + \beta_{k+1}\bigl(z^k - x^{k+1}\bigr) - \eta_{k+1}\nabla f(x^{k+1})
\end{aligned}$$

Mixing intuition from "descent" and "ravines":
$$\Phi_k := A_k\bigl(f(y^k) - f(x^*)\bigr) + B_k\|z^k - x^*\|^2$$

Pick the parameters $A_k, B_k, \eta_k, \gamma_k, \alpha_k, \beta_k$ to ensure that $\Phi_k - \Phi_{k-1} \le 0$. It turns out that a "simple" choice does the job!
Potential analysis – sketch

Using the shorthand
$$\Delta_\gamma := \gamma(1 - L\gamma/2), \quad \nabla := \nabla f(x^{k+1}), \quad X := x^{k+1} - x^*, \quad W := z^k - x^{k+1},$$
and using smoothness and convexity, show that $\Phi_{k+1} - \Phi_k$ is upper-bounded by
$$c_1\|W\|^2 + c_2\|X\|^2 + c_3\|\nabla\|^2 + c_4\langle W, X\rangle + c_5\langle W, \nabla\rangle + c_6\langle X, \nabla\rangle,$$
where
$$\begin{aligned}
c_1 &:= \beta^2 B_{k+1} - B_k - \tfrac{\mu}{2}\tfrac{\alpha^2}{(1-\alpha)^2}A_k, & c_2 &:= B_{k+1} - B_k - \tfrac{\mu}{2}\bigl(A_{k+1} - A_k\bigr), \\
c_3 &:= \eta^2 B_{k+1} - \Delta_\gamma A_{k+1}, & c_4 &:= 2\bigl(\beta B_{k+1} - B_k\bigr), \\
c_5 &:= \tfrac{\alpha}{1-\alpha}A_k - 2\beta\eta B_{k+1}, & c_6 &:= \bigl(A_{k+1} - A_k\bigr) - 2\eta B_{k+1}.
\end{aligned}$$

Now choose the parameters to ensure $\Phi_{k+1} - \Phi_k \le 0$. Finally, this leads to a bound of the form
$$f(y^k) - f(x^*) = O\bigl((1 - \xi_1)\cdots(1 - \xi_k)\bigr),$$
where the sequence $\{\xi_k\}$ fully characterizes convergence.

Ref: See details in the paper: Ahn, Sra (2020). From Nesterov's Estimate Sequence to Riemannian Acceleration.
