Optimization for Machine Learning (6.881, MIT)
Lecture 8: Subgradient Method, Accelerated Gradient
Suvrit Sra
Massachusetts Institute of Technology
16 Mar, 2021
First-order methods
Subgradient method
Unconstrained convex problem

  \min_x \; f(x)
Subgradient method

  x^{k+1} = x^k - \eta_k g^k, \quad where g^k \in \partial f(x^k) is any subgradient.
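A minimal sketch of the update in code (illustrative, not from the slides): f_subgrad is a hypothetical oracle returning some element of \partial f(x), and stepsize(k) is a placeholder schedule.

    import numpy as np

    def subgradient_method(f_subgrad, x0, stepsize, num_iters=100):
        """Subgradient method: x_{k+1} = x_k - eta_k * g_k.

        f_subgrad(x) must return *some* element of the subdifferential at x;
        stepsize(k) returns eta_k (e.g., a diminishing schedule).
        """
        x = np.asarray(x0, dtype=float)
        iterates = [x.copy()]
        for k in range(num_iters):
            g = f_subgrad(x)             # any g_k in the subdifferential of f at x_k
            x = x - stepsize(k) * g      # subgradient step
            iterates.append(x.copy())
        return iterates

Note that, unlike gradient descent, this is not a descent method: the objective can increase along the way, which is why the guarantees below are stated for the best value seen so far, f_{\min}^k.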
Example

  \min_x \; \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1

  x^{k+1} = x^k - \eta_k \bigl( A^\top (A x^k - b) + \lambda\,\mathrm{sgn}(x^k) \bigr)

[Figure: objective value (log scale, roughly 10^0 to 10^2) versus iteration count, 0 to 100.]
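A hedged Python sketch of this example on synthetic data (the data, regularization weight, and stepsize below are illustrative choices, not the ones behind the slide's plot):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 50))
    b = rng.standard_normal(200)
    lam = 0.1

    def lasso_subgrad(x):
        # A^T (A x - b) is the gradient of the smooth part;
        # sign(x) gives one valid subgradient of ||x||_1 (choosing 0 at x_i = 0).
        return A.T @ (A @ x - b) + lam * np.sign(x)

    x = np.zeros(A.shape[1])
    f_best = np.inf
    for k in range(100):
        eta = 1e-3 / np.sqrt(k + 1)      # diminishing stepsize (illustrative)
        x = x - eta * lasso_subgrad(x)
        f_val = 0.5 * np.linalg.norm(A @ x - b)**2 + lam * np.abs(x).sum()
        f_best = min(f_best, f_val)      # track the best value f_min^k seen so far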
Subgradient method – stepsizes

- Diminishing:  \lim_{k \to \infty} \eta_k = 0,  \sum_k \eta_k = \infty
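As Python callables, a few schedules of the kinds discussed in this lecture (the constants are arbitrary placeholders):

    import numpy as np

    constant        = lambda k: 1e-2                    # eta_k = eta (not diminishing)
    diminishing     = lambda k: 1e-1 / np.sqrt(k + 1)   # eta_k -> 0 and sum_k eta_k = infinity
    square_summable = lambda k: 1e-1 / (k + 1)          # additionally sum_k eta_k^2 < infinity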
Convergence (sketch)
Convergence analysis

Assumptions
- Min is attained: f^* := \inf_x f(x) > -\infty, with f(x^*) = f^*
- Bounded subgradients: \|g\|_2 \le G for all g \in \partial f
- Bounded domain: \|x^0 - x^*\|_2 \le R

Convergence results are stated for f_{\min}^k := \min_{0 \le i \le k} f^i.
Subgradient method – convergence

  \|x^{k+1} - x^*\|_2^2 \;\le\; R^2 + G^2 \sum_{t=1}^{k} \eta_t^2 \;-\; 2 \sum_{t=1}^{k} \eta_t (f^t - f^*).

- To get a bound on the last term, simply notice that (for t \le k)
    f^t \;\ge\; f_{\min}^t \;\ge\; f_{\min}^k,   since f_{\min}^t := \min_{0 \le i \le t} f(x^i).
- Plugging this in yields the bound
    2 \sum_{t=1}^{k} \eta_t (f^t - f^*) \;\ge\; 2 (f_{\min}^k - f^*) \sum_{t=1}^{k} \eta_t.

  f_{\min}^k - f^* \;\le\; \frac{R^2 + G^2 \sum_{t=1}^{k} \eta_t^2}{2 \sum_{t=1}^{k} \eta_t}
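For completeness, here is where the first inequality comes from (the standard one-step expansion; this step is not spelled out on the slide):

  \|x^{t+1} - x^*\|_2^2 = \|x^t - \eta_t g^t - x^*\|_2^2
    = \|x^t - x^*\|_2^2 + \eta_t^2 \|g^t\|_2^2 - 2 \eta_t \langle g^t, x^t - x^* \rangle
    \le \|x^t - x^*\|_2^2 + \eta_t^2 G^2 - 2 \eta_t (f^t - f^*),

using the subgradient inequality \langle g^t, x^t - x^* \rangle \ge f(x^t) - f(x^*) and \|g^t\|_2 \le G. Telescoping over t and bounding the initial distance by R gives the displayed bound.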
Subgradient method – convergence

  f_{\min}^k - f^* \;\le\; \frac{R^2 + G^2 \sum_{t=1}^{k} \eta_t^2}{2 \sum_{t=1}^{k} \eta_t}

Exercise: Analyze \lim_{k \to \infty} f_{\min}^k - f^* for the different choices of stepsize that we mentioned.

Constant step \eta_k = \eta: we obtain

  f_{\min}^k - f^* \;\le\; \frac{R^2 + G^2 k \eta^2}{2 k \eta} \;\to\; \frac{G^2 \eta}{2}  as k \to \infty.

Square summable, not summable: \sum_k \eta_k^2 < \infty, \sum_k \eta_k = \infty.
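As a worked instance of the exercise (an added example, not on the slide), take \eta_t = c/t, which is square summable but not summable:

  \sum_{t=1}^{k} \eta_t^2 \le \frac{c^2 \pi^2}{6}, \qquad \sum_{t=1}^{k} \eta_t \ge c \ln(k+1),

so the bound gives

  f_{\min}^k - f^* \;\le\; \frac{R^2 + G^2 c^2 \pi^2 / 6}{2 c \ln(k+1)} \;\to\; 0 \quad as k \to \infty,

i.e., convergence to the optimum, though only at a logarithmic rate for this particular schedule.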
Subgradient method – convergence

- Suppose we want f_{\min}^k - f^* \le \varepsilon; how big should k be?
- Optimize the bound over \eta_t: we want

    f_{\min}^k - f^* \;\le\; \frac{R^2 + G^2 \sum_{t=1}^{k} \eta_t^2}{2 \sum_{t=1}^{k} \eta_t} \;\le\; \varepsilon.

  With a constant step the bound is \frac{R^2 + G^2 k \eta^2}{2 k \eta}, which is minimized by \eta = \frac{R}{G \sqrt{k}}.

- Then, after k steps, f_{\min}^k - f^* \le RG/\sqrt{k}.
- For accuracy \varepsilon, we need at least (RG/\varepsilon)^2 = O(1/\varepsilon^2) steps.
- (Quite slow, but it already hits the lower bound!)
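The one-line calculus behind this choice (not spelled out on the slide): writing \phi(\eta) = \frac{R^2}{2 k \eta} + \frac{G^2 \eta}{2}, we have \phi'(\eta) = -\frac{R^2}{2 k \eta^2} + \frac{G^2}{2} = 0 at \eta = \frac{R}{G \sqrt{k}}, where \phi\bigl(\tfrac{R}{G\sqrt{k}}\bigr) = \frac{RG}{\sqrt{k}}. Requiring RG/\sqrt{k} \le \varepsilon then gives k \ge (RG/\varepsilon)^2.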
Exercise: Support vector machines

Exercise: Geometric median
Optimization with simple constraints

Previously:
  x^{t+1} = x^t - \eta_t g^t
This could be infeasible! Use projection.
Projected subgradient method

  x^{k+1} = P_C\bigl(x^k - \eta_k g^k\bigr), \quad where g^k \in \partial f(x^k) is any subgradient.
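A hedged Python sketch of the projected variant; the Euclidean-ball projection is just one example of a simple constraint set C, chosen here for illustration:

    import numpy as np

    def project_ball(z, radius=1.0):
        # Euclidean projection onto C = {x : ||x||_2 <= radius}.
        nrm = np.linalg.norm(z)
        return z if nrm <= radius else (radius / nrm) * z

    def projected_subgradient(f_subgrad, project, x0, stepsize, num_iters=100):
        # x_{k+1} = P_C(x_k - eta_k g_k); project() implements P_C.
        x = np.asarray(x0, dtype=float)
        for k in range(num_iters):
            g = f_subgrad(x)
            x = project(x - stepsize(k) * g)
        return x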
Key idea: Projection Theorem

  Let y^* = P_C(z) for a closed convex set C. Then \langle z - y^*, \, y - y^* \rangle \le 0 for all y \in C.
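A consequence worth recording (standard, and used implicitly in the analysis below): for any x^* \in C,

  \|P_C(z) - x^*\|_2^2 = \|z - x^*\|_2^2 - \|z - P_C(z)\|_2^2 + 2 \langle z - P_C(z), \, x^* - P_C(z) \rangle \le \|z - x^*\|_2^2,

since the inner product is \le 0 by the projection theorem (take y = x^*). So the projection step never increases the distance to a solution, and the unconstrained analysis carries over.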
Convergence analysis

Assumptions
- Min is attained: f^* := \inf_x f(x) > -\infty, with f(x^*) = f^*
- Bounded subgradients: \|g\|_2 \le G for all g \in \partial f
- Bounded domain: \|x^0 - x^*\|_2 \le R

Analysis
- Let z^{t+1} = x^t - \eta_t g^t denote the unprojected step.
- Then x^{t+1} = P_C(z^{t+1}).
- Recall the analysis of the unconstrained method:
Convergence analysis: Key idea

  \|x^t - \eta_t g^t - x^*\|_2^2 \;\le\; \|x^t - x^*\|_2^2 + \eta_t^2 \|g^t\|_2^2 - 2 \eta_t (f(x^t) - f^*)
  ...
Examples of simple projections
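The slide's examples were not preserved in this extraction; a few standard simple projections, as a hedged Python sketch (the simplex routine is the usual sort-and-threshold method):

    import numpy as np

    def project_box(z, lo, hi):
        # Projection onto the box {x : lo <= x <= hi} is componentwise clipping.
        return np.clip(z, lo, hi)

    def project_nonnegative(z):
        # Projection onto the nonnegative orthant {x : x >= 0}.
        return np.maximum(z, 0.0)

    def project_simplex(z):
        # Projection onto the probability simplex {x : x >= 0, sum(x) = 1}.
        u = np.sort(z)[::-1]                    # sort in decreasing order
        css = np.cumsum(u)
        idx = np.arange(1, len(z) + 1)
        rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
        theta = (1.0 - css[rho]) / (rho + 1)
        return np.maximum(z + theta, 0.0)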
Some remarks

- Why care?
    - simple
    - low-memory
    - stochastic version possible

Another perspective:

  x^{k+1} = \arg\min_{x \in C} \; \langle x, g^k \rangle + \frac{1}{2 \eta_k} \|x - x^k\|^2

Mirror Descent version:

  x^{k+1} = \arg\min_{x \in C} \; \langle x, g^k \rangle + \frac{1}{\eta_k} D_\varphi(x, x^k)
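A hedged sketch of the mirror-descent version for one concrete choice (an added illustration, not from the slides): C the probability simplex and \varphi the negative entropy, for which the update has the closed-form "exponentiated gradient" step.

    import numpy as np

    def mirror_descent_simplex(f_subgrad, x0, stepsize, num_iters=100):
        # Mirror descent with phi(x) = sum_i x_i log x_i on the simplex.
        # The argmin of <x, g_k> + (1/eta_k) D_phi(x, x_k) over the simplex
        # is the multiplicative update below.
        x = np.asarray(x0, dtype=float)
        for k in range(num_iters):
            g = f_subgrad(x)
            w = x * np.exp(-stepsize(k) * g)
            x = w / w.sum()
        return x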
Accelerated gradient
Gradient methods – upper bounds
Gradient methods – lower bounds

  f(x^k) - f(x^*) \;\ge\; \frac{\mu}{2} \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{2k} \|x^0 - x^*\|_2^2,

  f(x^k) - f(x^*) \;\ge\; \frac{3 L \|x^0 - x^*\|_2^2}{32 (k+1)^2}, \qquad \|x^k - x^*\|_2 \ge \tfrac{1}{8} \|x^0 - x^*\|_2.
Accelerated gradient methods

Background: ravine method
Background: Heavy-ball method

  \eta_k = \frac{4}{(\sqrt{L} + \sqrt{\mu})^2}, \qquad \beta_k = q^2, \qquad q = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}
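A hedged Python sketch with these parameters, written for the standard form of Polyak's heavy-ball update (the slide's statement of the iteration did not survive extraction, so the update rule below is an assumption):

    import numpy as np

    def heavy_ball(grad_f, L, mu, x0, num_iters=100):
        # Polyak's heavy-ball method:
        #   x_{k+1} = x_k - eta * grad_f(x_k) + beta * (x_k - x_{k-1}),
        # with the constant parameters shown above.
        kappa = L / mu
        q = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
        eta, beta = 4.0 / (np.sqrt(L) + np.sqrt(mu))**2, q**2
        x_prev = x = np.asarray(x0, dtype=float)
        for k in range(num_iters):
            x_next = x - eta * grad_f(x) + beta * (x - x_prev)
            x_prev, x = x, x_next
        return x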
Nesterov's AGM

Nesterov's (1983) method

  x^{k+1} = y^k - \tfrac{1}{L} \nabla f(y^k)
  y^{k+1} = x^{k+1} + \beta_k (x^{k+1} - x^k)

  \beta_k = \frac{\alpha_k - 1}{\alpha_{k+1}}, \qquad 2 \alpha_{k+1} = 1 + \sqrt{4 \alpha_k^2 + 1}, \qquad \alpha_0 = 1

  f(x^k) - f(x^*) \;\le\; \frac{2 L \|y^0 - x^*\|^2}{(k+2)^2}.

In the strongly convex case we instead use \beta_k = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}. This leads to O(\sqrt{\kappa}\,\log(1/\varepsilon)) iterations to ensure f(x^k) - f(x^*) \le \varepsilon.

(Remark: Nemirovski proposed a method that achieves optimal complexity, but it required a 2D line-search. Nesterov's method was the real breakthrough and remains a fascinating topic to study even today.)
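A minimal Python sketch of Nesterov's (1983) method as stated above (grad_f and the smoothness constant L are assumed inputs):

    import numpy as np

    def nesterov_agm(grad_f, L, x0, num_iters=100):
        # Nesterov's accelerated gradient method:
        #   x_{k+1} = y_k - (1/L) grad_f(y_k)
        #   y_{k+1} = x_{k+1} + beta_k (x_{k+1} - x_k),
        # with beta_k = (alpha_k - 1)/alpha_{k+1},
        # 2 alpha_{k+1} = 1 + sqrt(4 alpha_k^2 + 1), alpha_0 = 1.
        x = np.asarray(x0, dtype=float)
        y = x.copy()
        alpha = 1.0
        for k in range(num_iters):
            x_next = y - grad_f(y) / L
            alpha_next = (1.0 + np.sqrt(4.0 * alpha**2 + 1.0)) / 2.0
            beta = (alpha - 1.0) / alpha_next
            y = x_next + beta * (x_next - x)
            x, alpha = x_next, alpha_next
        return x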
Analyzing Nesterov's method

(The ravine method worked well and sparked numerous heuristics for selecting its parameters and improving its behavior. However, its convergence was never proved. It inspired Polyak's heavy-ball method, which in turn seems to have inspired Nesterov's AGM.)

Some ways to analyze AGM
- Nesterov's estimate sequence method
- Approaches based on potential (Lyapunov) functions
- Derivation based on viewing AGM as an approximate PPM (proximal point method)
- Using "linear coupling," mixing a primal-dual view
- Analysis based on SDPs

See discussion in the paper.
Potential analysis – sketch

Using the shorthand:

Ref: Ahn, K., Sra, S. (2020). From Nesterov's Estimate Sequence to Riemannian Acceleration.