
Optimization for Machine Learning

Lecture 8: Subgradient method; Accelerated gradient


6.881: MIT

Suvrit Sra
Massachusetts Institute of Technology

16 Mar, 2021
First-order methods

Subgradient method

Unconstrained convex problem

$$\min_x\ f(x)$$

1. Start with some guess $x^0$; set $k = 0$
2. If $0 \in \partial f(x^k)$, stop; output $x^k$
3. Otherwise, generate the next guess $x^{k+1}$
4. Repeat the above procedure until $f(x^k) \le f(x^*) + \epsilon$
Subgradient method

$$x^{k+1} = x^k - \eta_k g^k, \quad \text{where } g^k \in \partial f(x^k) \text{ is any subgradient}$$

Stepsize $\eta_k > 0$ must be chosen.

• The method generates a sequence $\{x^k\}_{k \ge 0}$
• Does this sequence converge to an optimal solution $x^*$?
• If yes, then how fast?
• What if we have constraints: $x \in C$?
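Before turning to those questions, here is a minimal Python sketch of the iteration itself (not from the slides); the helper `subgrad_f`, the stepsize schedule, and the 1-D example are illustrative assumptions.

```python
import numpy as np

def subgradient_method(subgrad_f, x0, stepsize, num_iters):
    """Run x_{k+1} = x_k - eta_k * g_k, with g_k any element of df(x_k).

    subgrad_f(x) returns some subgradient of f at x; stepsize(k) returns eta_k > 0.
    """
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for k in range(num_iters):
        g = subgrad_f(x)              # any subgradient works
        x = x - stepsize(k) * g       # note: not a descent step in general
        iterates.append(x.copy())
    return iterates

# Example: f(x) = |x| in 1-D; sign(x) is a valid subgradient (0 at x = 0).
traj = subgradient_method(lambda x: np.sign(x), x0=[5.0],
                          stepsize=lambda k: 1.0 / (k + 1), num_iters=50)
```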
Example

$$\min_x\ \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1, \qquad x^{k+1} = x^k - \eta_k\bigl(A^T(Ax^k - b) + \lambda\,\mathrm{sgn}(x^k)\bigr)$$

[Plots: objective value (log scale, roughly $10^0$ to $10^2$) versus iteration (0 to 100), for a basic run and for a more careful implementation.]
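A hedged Python sketch of this update (the stepsize schedule and problem data are illustrative assumptions; the slides' plots come from the course's own implementation):

```python
import numpy as np

def lasso_subgradient(A, b, lam, eta0=1e-3, num_iters=100):
    """Subgradient method for 0.5*||Ax - b||_2^2 + lam*||x||_1."""
    x = np.zeros(A.shape[1])
    objective = []
    for k in range(num_iters):
        g = A.T @ (A @ x - b) + lam * np.sign(x)      # subgradient of the objective
        x = x - (eta0 / np.sqrt(k + 1)) * g           # diminishing stepsize (a choice)
        objective.append(0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum())
    return x, objective
```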
Exercise

Exercise: Experiment with a deep neural network classifier where we want to learn sparse weights. In particular, experiment with the following loss function:
$$\min_x\ L(x) := \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(y_i,\ NN(x, a_i)\bigr) + \lambda\|x\|_1.$$
Implement a stochastic subgradient update to minimize $L$.

(Hint: If we pretend that the loss part is differentiable, then we can invoke Clarke's rule: $\partial^{\circ} L = \nabla\mathrm{loss} + \lambda\,\partial\mathrm{reg}$.)
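One possible way to set up the exercise is sketched below in PyTorch; the toy MLP, the synthetic data, and all hyperparameters are assumptions made for illustration, not the course's reference solution.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()
A = torch.randn(1000, 20)                 # synthetic inputs a_i (an assumption)
y = (A[:, 0] > 0).long()                  # synthetic labels y_i
lam, lr, batch = 1e-3, 1e-2, 32

for step in range(500):
    idx = torch.randint(0, A.shape[0], (batch,))
    loss = loss_fn(model(A[idx]), y[idx])     # smooth part of L on a minibatch
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            # Clarke's rule: gradient of the loss plus lam * sign(p),
            # where sign(p) is a subgradient of the l1 regularizer.
            p -= lr * (p.grad + lam * torch.sign(p))
```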
Subgradient method – stepsizes

• Constant: set $\eta_k = \eta > 0$ for $k \ge 0$
• Normalized: $\eta_k = \eta/\|g^k\|_2$ (so that $\|x^{k+1} - x^k\|_2 = \eta$)
• Square summable: $\sum_k \eta_k^2 < \infty$, $\sum_k \eta_k = \infty$
• Diminishing: $\lim_k \eta_k = 0$, $\sum_k \eta_k = \infty$
• Adaptive stepsizes (not covered)

Not a descent method!
Could use the best $f^k$ so far: $f_{\min}^k := \min_{0 \le i \le k} f^i$
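For concreteness, the schedules above can be written as small Python helpers (a sketch; the constants are arbitrary):

```python
import numpy as np

constant    = lambda k: 0.1                    # eta_k = eta > 0
sq_summable = lambda k: 1.0 / (k + 1)          # sum eta_k^2 < inf, sum eta_k = inf
diminishing = lambda k: 1.0 / np.sqrt(k + 1)   # eta_k -> 0, sum eta_k = inf

def normalized(eta, g):
    """Normalized stepsize eta_k = eta / ||g_k||_2, so each step has length eta."""
    return eta / max(np.linalg.norm(g), 1e-12)

def track_best(f_min, f_val):
    """Since the method is not descent, track f_min^k = min_{i <= k} f(x^i)."""
    return min(f_min, f_val)
```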
Convergence (sketch)
Convergence analysis

Assumptions
• Min is attained: $f^* := \inf_x f(x) > -\infty$, with $f(x^*) = f^*$
• Bounded subgradients: $\|g\|_2 \le G$ for all $g \in \partial f$
• Bounded domain: $\|x^0 - x^*\|_2 \le R$

Convergence results for: $f_{\min}^k := \min_{0 \le i \le k} f^i$
Subgradient method – convergence

Lyapunov function: distance to $x^*$ (instead of $f - f^*$)

$$\begin{aligned}
\|x^{k+1} - x^*\|_2^2 &= \|x^k - \eta_k g^k - x^*\|_2^2 \\
&= \|x^k - x^*\|_2^2 + \eta_k^2\|g^k\|_2^2 - 2\langle \eta_k g^k,\ x^k - x^*\rangle \\
&\le \|x^k - x^*\|_2^2 + \eta_k^2\|g^k\|_2^2 - 2\eta_k\bigl(f(x^k) - f^*\bigr),
\end{aligned}$$

since $f^* = f(x^*) \ge f(x^k) + \langle g^k,\ x^* - x^k\rangle$.

Apply the same argument to $\|x^k - x^*\|_2^2$ recursively:
$$\|x^{k+1} - x^*\|_2^2 \le \|x^0 - x^*\|_2^2 + \sum_{t=1}^{k}\eta_t^2\|g^t\|_2^2 - 2\sum_{t=1}^{k}\eta_t\bigl(f^t - f^*\bigr).$$

Now use our convenient assumptions!
Subgradient method – convergence

$$\|x^{k+1} - x^*\|_2^2 \le R^2 + G^2\sum_{t=1}^{k}\eta_t^2 - 2\sum_{t=1}^{k}\eta_t\bigl(f^t - f^*\bigr).$$

• To get a bound on the last term, simply notice that (for $t \le k$)
  $$f^t \ge f_{\min}^t \ge f_{\min}^k, \qquad \text{since } f_{\min}^t := \min_{0 \le i \le t} f(x^i)$$
• Plugging this in yields the bound
  $$2\sum_{t=1}^{k}\eta_t\bigl(f^t - f^*\bigr) \ge 2\bigl(f_{\min}^k - f^*\bigr)\sum_{t=1}^{k}\eta_t.$$
• So that we finally have
  $$0 \le \|x^{k+1} - x^*\|_2^2 \le R^2 + G^2\sum_{t=1}^{k}\eta_t^2 - 2\bigl(f_{\min}^k - f^*\bigr)\sum_{t=1}^{k}\eta_t$$
  $$f_{\min}^k - f^* \le \frac{R^2 + G^2\sum_{t=1}^{k}\eta_t^2}{2\sum_{t=1}^{k}\eta_t}$$
Subgradient method – convergence

$$f_{\min}^k - f^* \le \frac{R^2 + G^2\sum_{t=1}^{k}\eta_t^2}{2\sum_{t=1}^{k}\eta_t}$$

Exercise: Analyze $\lim_{k\to\infty} f_{\min}^k - f^*$ for the different choices of stepsize that we mentioned.

Constant step $\eta_k = \eta$: we obtain
$$f_{\min}^k - f^* \le \frac{R^2 + G^2 k\eta^2}{2k\eta} \to \frac{G^2\eta}{2} \quad \text{as } k \to \infty.$$

Square summable, not summable ($\sum_k \eta_k^2 < \infty$, $\sum_k \eta_k = \infty$): as $k \to \infty$, the numerator stays finite while the denominator $\to \infty$, so $f_{\min}^k \to f^*$.

In practice, a fair bit of stepsize tuning is needed, e.g. $\eta_t = a/(b + t)$.
Subgradient method – convergence

• Suppose we want $f_{\min}^k - f^* \le \epsilon$; how big should $k$ be?
• Optimize the bound over $\eta_t$: we want
  $$f_{\min}^k - f^* \le \frac{R^2 + G^2\sum_{t=1}^{k}\eta_t^2}{2\sum_{t=1}^{k}\eta_t} \le \epsilon$$
• For fixed $k$, the best possible stepsize is a constant $\eta$; minimizing
  $$\frac{R^2 + G^2 k\eta^2}{2k\eta} \quad \text{over } \eta \quad \text{gives} \quad \eta = \frac{R}{G\sqrt{k}}$$
• Then, after $k$ steps, $f_{\min}^k - f^* \le RG/\sqrt{k}$.
• For accuracy $\epsilon$, we need at least $(RG/\epsilon)^2 = O(1/\epsilon^2)$ steps
• (quite slow, but this already hits the lower bound!)
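A quick numeric sanity check of the $RG/\sqrt{k}$ bound (not from the slides), using $f(x) = |x|$ in one dimension, where $G = 1$ and $R = |x^0|$:

```python
import numpy as np

R, G, k = 5.0, 1.0, 10_000
eta = R / (G * np.sqrt(k))          # tuned constant stepsize
x, f_min = 5.0, np.inf
for _ in range(k):
    f_min = min(f_min, abs(x))
    x -= eta * np.sign(x)
print(f_min, "<=", R * G / np.sqrt(k))   # observed gap versus the RG/sqrt(k) bound
```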
Exercise: Support vector machines

• Let $D := \{(x_i, y_i) \mid x_i \in \mathbb{R}^n,\ y_i \in \{\pm 1\}\}$
• We wish to find $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ such that
  $$\min_{w,b}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\max\bigl[0,\ 1 - y_i(w^T x_i + b)\bigr]$$
• Derive and implement a subgradient method (a sketch is given below)
• Plot the evolution of the objective function
• Experiment with different values of $C > 0$
• Plot and keep track of $f_{\min}^k := \min_{0 \le t \le k} f(x^t)$
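A hedged sketch of one possible solution in Python (the diminishing stepsize and the interface are assumptions; `X` holds the $x_i$ as rows and `y` the labels):

```python
import numpy as np

def svm_subgradient(X, y, C=1.0, num_iters=1000):
    """Subgradient method for 0.5*||w||^2 + C*sum_i max(0, 1 - y_i*(w^T x_i + b))."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    f_min, history = np.inf, []
    for k in range(num_iters):
        active = y * (X @ w + b) < 1                 # points with nonzero hinge loss
        gw = w - C * (y[active][:, None] * X[active]).sum(axis=0)
        gb = -C * y[active].sum()
        eta = 1.0 / (k + 1)
        w, b = w - eta * gw, b - eta * gb
        f = 0.5 * w @ w + C * np.maximum(0, 1 - y * (X @ w + b)).sum()
        f_min = min(f_min, f)
        history.append(f_min)
    return w, b, history
```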
Exercise: Geometric median

• Let $a \in \mathbb{R}^n$ be a given vector.
• Let $f(x) = \sum_i |x - a_i|$, i.e., $f : \mathbb{R} \to \mathbb{R}_+$
• Implement different subgradient methods to minimize $f$ (a sketch follows below)
• Also keep track of $f_{\mathrm{best}}^k := \min_{0 \le i < k} f(x^i)$

Exercise: Implement the above. Plot the $f(x^k)$ values; also try to guess which optimum is being found.
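A minimal sketch for this exercise (the data vector `a` and the stepsize schedule are illustrative choices); the optimum it finds is the median of `a`:

```python
import numpy as np

a = np.array([1.0, 2.0, 2.5, 7.0, 9.0])    # example data (an assumption)
x, f_best, values = 0.0, np.inf, []
for k in range(200):
    g = np.sign(x - a).sum()                # a subgradient of f at x
    x -= (1.0 / np.sqrt(k + 1)) * g         # diminishing stepsize
    f = np.abs(x - a).sum()
    f_best = min(f_best, f)
    values.append(f)
print(x, np.median(a))                      # x should approach the median
```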
Optimization with simple constraints

$$\min\ f(x) \quad \text{s.t.} \quad x \in C$$

Previously:
$$x^{t+1} = x^t - \eta_t g^t$$
This could be infeasible!
Use projection.
Projected subgradient method

$$x^{k+1} = P_C\bigl(x^k - \eta_k g^k\bigr), \quad \text{where } g^k \in \partial f(x^k) \text{ is any subgradient}$$

• Projection = closest feasible point:
  $$P_C(x) = \arg\min_{y \in C}\ \|x - y\|_2$$
  (Assume $C$ is closed and convex; then the projection is unique.)
• Great as long as the projection is "easy"
• Same questions as before: Does it converge? For which stepsizes? How fast?
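As an illustration (not from the slides), a small Python sketch of the projected subgradient iteration, instantiated with projection onto the Euclidean unit ball; the test objective is an arbitrary choice:

```python
import numpy as np

def projected_subgradient(subgrad_f, project_C, x0, stepsize, num_iters):
    """x_{k+1} = P_C(x_k - eta_k * g_k); project_C is the Euclidean projection onto C."""
    x = project_C(np.asarray(x0, dtype=float))
    for k in range(num_iters):
        x = project_C(x - stepsize(k) * subgrad_f(x))
    return x

# Example: minimize ||x - c||_1 over the unit Euclidean ball.
c = np.array([2.0, -3.0, 0.5])
proj_ball = lambda z: z if np.linalg.norm(z) <= 1 else z / np.linalg.norm(z)
x_hat = projected_subgradient(lambda x: np.sign(x - c), proj_ball,
                              x0=np.zeros(3), stepsize=lambda k: 1.0 / (k + 1),
                              num_iters=500)
```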
Key idea: Projection Theorem

Let $C$ be nonempty, closed, and convex.

Recall the optimality condition: $y^* = P_C(z)$ iff
$$\langle z - y^*,\ y - y^*\rangle \le 0 \quad \text{for all } y \in C.$$

Verify: the projection is nonexpansive:
$$\|P_C(x) - P_C(z)\|_2 \le \|x - z\|_2 \quad \text{for all } x, z \in \mathbb{R}^n.$$
Convergence analysis

Assumptions
• Min is attained: $f^* := \inf_x f(x) > -\infty$, with $f(x^*) = f^*$
• Bounded subgradients: $\|g\|_2 \le G$ for all $g \in \partial f$
• Bounded domain: $\|x^0 - x^*\|_2 \le R$

Analysis
• Let $z^{t+1} = x^t - \eta_t g^t$ (the unprojected step).
• Then $x^{t+1} = P_C(z^{t+1})$.
• Recall the analysis of the unconstrained method:
  $$\|z^{t+1} - x^*\|_2^2 = \|x^t - \eta_t g^t - x^*\|_2^2 \le \|x^t - x^*\|_2^2 + \eta_t^2\|g^t\|_2^2 - 2\eta_t\bigl(f(x^t) - f^*\bigr)$$
• We only need to relate this to $\|x^{t+1} - x^*\|_2^2$; the rest is as before.
Convergence analysis: Key idea

• Using nonexpansiveness of the projection:
  $$\|x^{t+1} - x^*\|_2^2 = \|P_C(x^t - \eta_t g^t) - P_C(x^*)\|_2^2 \le \|x^t - \eta_t g^t - x^*\|_2^2 \le \|x^t - x^*\|_2^2 + \eta_t^2\|g^t\|_2^2 - 2\eta_t\bigl(f(x^t) - f^*\bigr)$$

Same convergence results as in the unconstrained case:
• within a neighborhood of the optimum for a constant stepsize
• converges for diminishing, non-summable stepsizes
Examples of simple projections

• Nonnegativity: $x \ge 0$, with $P_C(z) = [z]_+$
• $\ell_\infty$-ball: $\|x\|_\infty \le 1$
  Projection: $\min \|x - z\|_2$ s.t. $x \le 1$ and $x \ge -1$
  $P_{\|x\|_\infty \le 1}(z) = y$, where $y_i = \mathrm{sgn}(z_i)\min\{|z_i|, 1\}$
• Linear equality constraints $Ax = b$ (where $A \in \mathbb{R}^{n\times m}$ has rank $n$):
  $$P_C(z) = z - A^T(AA^T)^{-1}(Az - b) = \bigl(I - A^T(AA^T)^{-1}A\bigr)z + A^T(AA^T)^{-1}b$$
• Simplex: $x^T \mathbf{1} = 1$ and $x \ge 0$; doable in $O(n)$ time, similarly the $\ell_1$-norm ball (see the sketch below)
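For instance, the simplex projection can be sketched as follows; this is the standard sorting-based $O(n \log n)$ variant, while the $O(n)$ method mentioned on the slide uses median-finding and is omitted here:

```python
import numpy as np

def project_simplex(z):
    """Euclidean projection of z onto {x : x >= 0, sum(x) = 1} (sorting-based)."""
    u = np.sort(z)[::-1]                       # sort in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(z) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(z + theta, 0.0)

x = project_simplex(np.array([0.5, 2.0, -1.0]))
print(x, x.sum())                              # nonnegative entries summing to 1
```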
Some remarks

• Why care?
  - simple
  - low-memory
  - a stochastic version is possible

Another perspective:
$$x^{k+1} = \arg\min_{x \in C}\ \langle x, g^k\rangle + \frac{1}{2\eta_k}\|x - x^k\|^2$$

Mirror Descent version:
$$x^{k+1} = \arg\min_{x \in C}\ \langle x, g^k\rangle + \frac{1}{\eta_k}D_\varphi(x, x^k)$$
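As an illustration of the mirror-descent view (this particular instantiation is an assumption, not something spelled out on the slide): with the negative-entropy mirror map on the simplex, $D_\varphi$ is the KL divergence and the update has a closed form, the exponentiated-gradient step.

```python
import numpy as np

def mirror_descent_simplex(subgrad_f, x0, stepsize, num_iters):
    """Entropic mirror descent on the simplex: x_{k+1} is proportional to x_k * exp(-eta_k g_k)."""
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        x = x * np.exp(-stepsize(k) * subgrad_f(x))
        x /= x.sum()                      # renormalize back onto the simplex
    return x

# Example: minimize <c, x> over the simplex; mass concentrates on argmin(c).
c = np.array([3.0, 1.0, 2.0])
x = mirror_descent_simplex(lambda x: c, np.ones(3) / 3, lambda k: 0.5, num_iters=200)
```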
Accelerated gradient

Gradient methods – upper bounds

Theorem (Upper bound I). Let $f \in C_L^1$. Then,
$$\min_k \|\nabla f(x^k)\| \le \epsilon \quad \text{in } O(1/\epsilon^2) \text{ iterations.}$$

Theorem (Upper bound II). Let $f \in S_{L,\mu}^1$. Then,
$$f(x^k) - f(x^*) \le \frac{L}{2}\left(\frac{\kappa - 1}{\kappa + 1}\right)^{2k}\|x^0 - x^*\|_2^2.$$

Theorem (Upper bound III). Let $f \in C_L^1$ be convex. Then,
$$f(x^k) - f(x^*) \le \frac{2L\bigl(f(x^0) - f(x^*)\bigr)\|x^0 - x^*\|_2^2}{2L\|x^0 - x^*\|_2^2 + k\bigl(f(x^0) - f(x^*)\bigr)} \le \frac{2L\|x^0 - x^*\|_2^2}{k + 4}.$$
Gradient methods – lower bounds

Theorem (Carmon-Duchi-Hinder-Sidford 2017). There is an $f \in C_L^1$ such that achieving $\|\nabla f(x)\| \le \epsilon$ requires $\Omega(\epsilon^{-2})$ gradient evaluations.

Theorem (Nesterov). There exists $f \in S_{L,\mu}^\infty$ ($\mu > 0$, $\kappa > 1$) s.t.
$$f(x^k) - f(x^*) \ge \frac{\mu}{2}\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{2k}\|x^0 - x^*\|_2^2.$$

Theorem (Nesterov). For any $x^0 \in \mathbb{R}^n$ and $1 \le k \le \tfrac{1}{2}(n - 1)$, there is a convex $f \in C_L^1$ s.t.
$$f(x^k) - f(x^*) \ge \frac{3L\|x^0 - x^*\|_2^2}{32(k + 1)^2}, \qquad \|x^k - x^*\|_2^2 \ge \frac{1}{8}\|x^0 - x^*\|_2^2.$$
Accelerated gradient methods

Upper bounds: (i) $O(1/k)$; and (ii) a linear rate involving $\kappa$

Lower bounds: (i) $\Omega(1/k^2)$; and (ii) a linear rate involving $\sqrt{\kappa}$

Challenge: Close this gap!

Nesterov (1983) closed the gap.
Background: ravine method

Long, narrow ravines slow down GD.

Gel'fand-Tsetlin (1961): the ravine method.

Intuition: descending to the bottom of the ravine is not hard, but moving along the narrow ravine is harder. Thus, mix two types of steps: a gradient step and a "ravine step".

Simplest form of the ravine method:
$$x^{k+1} = y^k - \alpha\nabla f(y^k), \qquad y^{k+1} = x^{k+1} + \beta\bigl(x^{k+1} - x^k\bigr)$$
Background: Heavy-ball method

Polyak's Momentum Method (1964):
$$x^{k+1} = x^k - \eta_k\nabla f(x^k) + \beta_k\bigl(x^k - x^{k-1}\bigr)$$

Theorem. Let $f = \tfrac{1}{2}x^T A x + b^T x \in S_{L,\mu}^1$. Then, choosing
$$\eta_k = \frac{4}{(\sqrt{L} + \sqrt{\mu})^2}, \qquad \beta_k = q^2, \qquad q = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1},$$
the heavy-ball method satisfies $\|x^k - x^*\| = O(q^k)$.

Motivated originally by the so-called "ravine method" of Gelfand-Tsetlin (1961), which runs the iteration
$$z^k = x^k - \eta_k\nabla f(x^k), \qquad x^{k+1} = z^k + \beta_k\bigl(z^k - z^{k-1}\bigr)$$
Background: Heavy-ball method

Polyak's Momentum Method (1964):
$$x^{k+1} = x^k - \eta_k\nabla f(x^k) + \beta_k\bigl(x^k - x^{k-1}\bigr)$$

Can be viewed as a discretization of the second-order ODE
$$\ddot{x} + a\dot{x} + b\nabla f(x) = 0$$
(analogy: the movement of a heavy ball in a potential field $f(x)$ is governed not only by $\nabla f(x)$ but also by a momentum term).

Why does momentum help?
Explore: check out https://distill.pub/2017/momentum/

What about the general convex case?
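A small Python sketch of the heavy-ball iteration on a quadratic, using the parameter choice from the theorem above (the test matrix is an arbitrary assumption):

```python
import numpy as np

def heavy_ball(A, b, x0, num_iters=200):
    """Polyak momentum on f(x) = 0.5*x^T A x + b^T x with the tuned eta, beta."""
    evals = np.linalg.eigvalsh(A)
    L, mu = evals[-1], evals[0]
    kappa = L / mu
    eta = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
    beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(num_iters):
        grad = A @ x + b
        x, x_prev = x - eta * grad + beta * (x - x_prev), x
    return x

A = np.diag([1.0, 100.0])                       # ill-conditioned quadratic, kappa = 100
x_hat = heavy_ball(A, b=np.array([1.0, 1.0]), x0=np.zeros(2))
```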
Nesterov’s AGM

Nesterov's AGM

Nesterov's (1983) method:
$$x^{k+1} = y^k - \tfrac{1}{L}\nabla f(y^k), \qquad y^{k+1} = x^{k+1} + \beta_k\bigl(x^{k+1} - x^k\bigr)$$

Essentially the same as the ravine method!

$$\beta_k = \frac{\alpha_k - 1}{\alpha_{k+1}}, \qquad 2\alpha_{k+1} = 1 + \sqrt{4\alpha_k^2 + 1}, \qquad \alpha_0 = 1,$$
$$f(x^k) - f(x^*) \le \frac{2L\|y^0 - x^*\|^2}{(k + 2)^2}.$$

In the strongly convex case, we instead use $\beta_k = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$. This leads to $O(\sqrt{\kappa}\log(1/\epsilon))$ iterations to ensure $f(x^k) - f(x^*) \le \epsilon$.

(Remark: Nemirovski proposed a method that achieves the optimal complexity, but it required a 2D line-search. Nesterov's method was the real breakthrough and remains a fascinating topic to study even today.)
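A sketch of the iteration in Python (the quadratic test problem and the iteration count are assumptions for illustration):

```python
import numpy as np

def nesterov_agm(grad_f, L, x0, num_iters=500):
    """Nesterov (1983): gradient step from y^k, then momentum step with beta_k."""
    x = np.asarray(x0, dtype=float)
    y, alpha = x.copy(), 1.0
    for _ in range(num_iters):
        x_next = y - grad_f(y) / L                          # x^{k+1} = y^k - (1/L) grad f(y^k)
        alpha_next = (1.0 + np.sqrt(4.0 * alpha ** 2 + 1.0)) / 2.0
        beta = (alpha - 1.0) / alpha_next
        y = x_next + beta * (x_next - x)                    # y^{k+1}
        x, alpha = x_next, alpha_next
    return x

# Example: f(x) = 0.5*||Ax - b||^2 with L = largest eigenvalue of A^T A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A.T @ A)[-1]
x_hat = nesterov_agm(lambda x: A.T @ (A @ x - b), L, np.zeros(2))
```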
Analyzing Nesterov's method

(The ravine method worked well and sparked numerous heuristics for selecting its parameters and improving its behavior. However, its convergence was never proved. It inspired Polyak's heavy-ball method, which in turn seems to have inspired Nesterov's AGM.)

Some ways to analyze AGM:
• Nesterov's estimate sequence method
• Approaches based on potential (Lyapunov) functions
• Derivation based on viewing AGM as an approximate proximal point method
• Using "linear coupling," mixing a primal-dual view
• Analysis based on SDPs

See the discussion in the paper by Ahn and Sra (2020), referenced below.
Potential analysis – sketch

• Choose a potential that judges closeness of the iterates to the optimum
• Ensure the potential decreases with each iteration
• AGM does not satisfy $f(x^{k+1}) \le f(x^k)$, so...

Slightly more general AGM iteration:
$$\begin{aligned}
x^{k+1} &\leftarrow y^k + \alpha_{k+1}\bigl(z^k - y^k\bigr) \\
y^{k+1} &\leftarrow x^{k+1} - \gamma_{k+1}\nabla f(x^{k+1}) \\
z^{k+1} &\leftarrow x^{k+1} + \beta_{k+1}\bigl(z^k - x^{k+1}\bigr) - \eta_{k+1}\nabla f(x^{k+1})
\end{aligned}$$

Mixing intuition from "descent" and "ravines":
$$\Phi_k := A_k\bigl(f(y^k) - f(x^*)\bigr) + B_k\|z^k - x^*\|^2$$

Pick the parameters $A_k, B_k, \eta_k, \gamma_k, \alpha_k, \beta_k$ to ensure that $\Phi_k - \Phi_{k-1} \le 0$. It turns out that a "simple" choice does the job!
Potential analysis – sketch

Using the shorthand
$$\Delta_\gamma := \gamma(1 - L\gamma/2), \quad \nabla := \nabla f(x^{k+1}), \quad X := x^{k+1} - x^*, \quad W := z^k - x^{k+1},$$
and using smoothness and convexity, show that $\Phi_{k+1} - \Phi_k$ is upper-bounded by
$$c_1\|W\|^2 + c_2\|X\|^2 + c_3\|\nabla\|^2 + c_4\langle W, X\rangle + c_5\langle W, \nabla\rangle + c_6\langle X, \nabla\rangle,$$
where
$$\begin{aligned}
c_1 &:= \beta^2 B_{k+1} - B_k - \tfrac{\mu}{2}\tfrac{\alpha^2}{(1-\alpha)^2}A_k, & c_2 &:= B_{k+1} - B_k - \tfrac{\mu}{2}\bigl(A_{k+1} - A_k\bigr), \\
c_3 &:= \eta^2 B_{k+1} - \Delta_\gamma A_{k+1}, & c_4 &:= 2\bigl(\beta B_{k+1} - B_k\bigr), \\
c_5 &:= \tfrac{\alpha}{1-\alpha}A_k - 2\beta\eta B_{k+1}, & c_6 &:= \bigl(A_{k+1} - A_k\bigr) - 2\eta B_{k+1}.
\end{aligned}$$

Now choose the parameters to ensure $\Phi_{k+1} - \Phi_k \le 0$. Finally, this leads to a bound of the form
$$f(y^k) - f(x^*) = O\bigl((1 - \xi_1)\cdots(1 - \xi_k)\bigr),$$
where the sequence $\{\xi_k\}$ fully characterizes convergence.

Ref: See details in the paper: Ahn, Sra (2020). From Nesterov's Estimate Sequence to Riemannian Acceleration.
