
Optimization Methods for Large-Scale Machine Learning

Frank E. Curtis, Lehigh University

presented at

East Coast Optimization Meeting


George Mason University
Fairfax, Virginia

April 2, 2021


References

- Léon Bottou, Frank E. Curtis, and Jorge Nocedal.
  Optimization Methods for Large-Scale Machine Learning.
  SIAM Review, 60(2):223–311, 2018.

- Frank E. Curtis and Katya Scheinberg.
  Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning.
  In INFORMS Tutorials in Operations Research, chapter 5, pages 89–114. Institute for Operations Research and the Management Sciences (INFORMS), 2017.


Motivating questions

- How do optimization problems arise in machine learning applications, and what makes them challenging?
- What have been the most successful optimization methods for large-scale machine learning, and why?
- What recent advances have been made in the design of algorithms, and what are open questions in this research area?


Outline

GD and SG

GD vs. SG

Beyond SG

Noise Reduction Methods

Second-Order Methods

Conclusion


Learning problems and (surrogate) optimization problems

Learn a prediction function h : X → Y to solve

    max_{h∈H} ∫_{X×Y} 1[h(x) ≈ y] dP(x, y)

Various meanings for h(x) ≈ y depending on the goal:

- Binary classification, with y ∈ {−1, +1}: y · h(x) > 0.
- Regression, with y ∈ R^{n_y}: ‖h(x) − y‖ ≤ δ.

Parameterizing h by w ∈ R^d, we aim to solve

    max_{w∈R^d} ∫_{X×Y} 1[h(w; x) ≈ y] dP(x, y)

Now, common practice is to replace the indicator with a smooth loss...


Stochastic optimization

Over a parameter vector w ∈ R^d and given

    ℓ(·; y) ∘ h(w; x)   (loss w.r.t. "true label" ∘ prediction w.r.t. "features"),

consider the unconstrained optimization problem

    min_{w∈R^d} f(w), where f(w) = E_{(x,y)}[ℓ(h(w; x), y)].

Given a training set {(x_i, y_i)}_{i=1}^n, the approximate problem is

    min_{w∈R^d} f_n(w), where f_n(w) = (1/n) Σ_{i=1}^n ℓ(h(w; x_i), y_i).
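As a concrete illustration of the composition ℓ(h(w; x), y) and of the empirical risk f_n, here is a minimal sketch assuming a linear prediction function and the squared loss; the model, loss, and synthetic data are illustrative choices of mine, not part of the slides.

```python
import numpy as np

def h(w, x):
    """Prediction function h(w; x); a linear model w^T x is assumed here."""
    return x @ w

def ell(prediction, y):
    """Loss ell(h(w; x), y); the squared loss is assumed here."""
    return 0.5 * (prediction - y) ** 2

def empirical_risk(w, X, Y):
    """f_n(w) = (1/n) * sum_i ell(h(w; x_i), y_i) over the training set."""
    return np.mean([ell(h(w, x), y) for x, y in zip(X, Y)])

# Tiny synthetic training set {(x_i, y_i)}_{i=1}^n
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

print(empirical_risk(np.zeros(d), X, Y))   # f_n at w = 0
```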


Text classification

Figure: First page of Bottou, Curtis, and Nocedal, "Optimization Methods for Large-Scale Machine Learning", SIAM Review, 60(2):223–311, 2018, shown alongside example document topics (e.g., "math", "poetry").

Regularized logistic regression problem:

    min_{w∈R^d} (1/n) Σ_{i=1}^n log(1 + exp(−(wᵀx_i)y_i)) + (λ/2)‖w‖₂²
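A minimal sketch of the regularized logistic regression objective above and its gradient, using NumPy on synthetic data; the function names and data are mine, not the paper's.

```python
import numpy as np

def objective(w, X, y, lam):
    """(1/n) sum_i log(1 + exp(-(w^T x_i) y_i)) + (lam/2) ||w||_2^2."""
    margins = y * (X @ w)                          # y_i * w^T x_i
    return np.logaddexp(0.0, -margins).mean() + 0.5 * lam * np.dot(w, w)

def gradient(w, X, y, lam):
    """Gradient of the objective above."""
    margins = y * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(margins))          # = exp(-m_i) / (1 + exp(-m_i))
    return -(X * (y * sigma)[:, None]).mean(axis=0) + lam * w

# Synthetic binary classification data with labels y_i in {-1, +1}
rng = np.random.default_rng(1)
n, d = 200, 10
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))

w = np.zeros(d)
print(objective(w, X, y, lam=0.1), np.linalg.norm(gradient(w, X, y, lam=0.1)))
```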


Image / speech recognition

What pixel combinations represent the number 4?

What sounds are these? (“Here comes the sun” – The Beatles)


Deep neural networks


    h(w; x) = a_l(W_l(· · · a_2(W_2(a_1(W_1 x + ω_1)) + ω_2) · · · ))

Figure: Illustration of a DNN with five inputs x_1, ..., x_5, two hidden layers of four nodes each (h_{1j} and h_{2j}), and three outputs h_1, h_2, h_3; [W_k]_{ij} denotes an entry of the k-th weight matrix.
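A small sketch of the forward pass h(w; x) above. The layer sizes follow the illustration, while the choice of activation a_k = tanh and the presence of a bias ω_k at every layer (including the last) are assumptions made here for concreteness.

```python
import numpy as np

def dnn_forward(weights, biases, x, activation=np.tanh):
    """Compute h(w; x) = a_l(W_l(... a_2(W_2(a_1(W_1 x + omega_1)) + omega_2) ...))."""
    h = x
    for W, omega in zip(weights, biases):
        h = activation(W @ h + omega)   # one layer: affine map followed by a_k
    return h

# Layer sizes matching the figure: 5 inputs, two hidden layers of 4, 3 outputs
rng = np.random.default_rng(2)
sizes = [5, 4, 4, 3]
weights = [rng.standard_normal((n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]

x = rng.standard_normal(sizes[0])
print(dnn_forward(weights, biases, x))
```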


Tradeoffs of large-scale learning

Bottou, Bousquet (2008) and Bottou (2010)

Notice that we went from our true problem

    max_{h∈H} ∫_{X×Y} 1[h(x) ≈ y] dP(x, y)

to saying that we'll find our solution h ≡ h(w; ·) by (approximately) solving

    min_{w∈R^d} (1/n) Σ_{i=1}^n ℓ(h(w; x_i), y_i).

Three sources of error (see the decomposition sketched below):

- approximation
- estimation
- optimization
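Following Bottou and Bousquet, these three errors can be written as a decomposition of the total expected risk. The sketch below uses notation of my own choosing: R(h) is the expected risk of h, h_* is the best possible predictor, h_H the best predictor in H, h_n the empirical risk minimizer over H, and h̃_n the approximate solution actually returned by the optimizer.

```latex
\underbrace{\mathbb{E}[R(\tilde h_n) - R(h_*)]}_{\text{total error}}
  = \underbrace{R(h_{\mathcal H}) - R(h_*)}_{\text{approximation}}
  + \underbrace{\mathbb{E}[R(h_n) - R(h_{\mathcal H})]}_{\text{estimation}}
  + \underbrace{\mathbb{E}[R(\tilde h_n) - R(h_n)]}_{\text{optimization}}
```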


Approximation error
Choice of prediction function family H has important implications; e.g.,

    H_C := {h ∈ H : Ω(h) ≤ C}.

Figure: Misclassification rate (training and testing) versus the capacity parameter C (left) and versus training time (right).


Problems of interest

Let's focus on the expected loss/risk problem

    min_{w∈R^d} f(w), where f(w) = E_{(x,y)}[ℓ(h(w; x), y)]

and the empirical loss/risk problem

    min_{w∈R^d} f_n(w), where f_n(w) = (1/n) Σ_{i=1}^n ℓ(h(w; x_i), y_i).

For this talk, let's assume

- f is continuously differentiable, bounded below, and potentially nonconvex;
- ∇f is L-Lipschitz continuous, i.e., ‖∇f(w) − ∇f(w̄)‖₂ ≤ L‖w − w̄‖₂.


Gradient descent

Aim: Find a stationary point, i.e., w with ∇f(w) = 0.

Algorithm GD : Gradient Descent

1: choose an initial point w_0 ∈ R^d and stepsize α > 0
2: for k ∈ {0, 1, 2, ...} do
3:   set w_{k+1} ← w_k − α∇f(w_k)
4: end for

Figure: At the current iterate, f lies between the quadratic upper bound
f(w_k) + ∇f(w_k)ᵀ(w − w_k) + (L/2)‖w − w_k‖₂² and, when f is c-strongly convex,
the quadratic lower bound f(w_k) + ∇f(w_k)ᵀ(w − w_k) + (c/2)‖w − w_k‖₂²; the GD
step with α = 1/L minimizes the upper bound.
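A minimal sketch of Algorithm GD on a strongly convex quadratic test problem; the quadratic, the stepsize α = 1/L, and the iteration count are illustrative choices. With α ∈ (0, 1/L] the optimality gap should contract by at least the factor 1 − αc per iteration, matching the linear rate in Theorem GD below.

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha, num_iters):
    """Algorithm GD: w_{k+1} = w_k - alpha * grad f(w_k) with a fixed stepsize."""
    w, iterates = w0.copy(), [w0.copy()]
    for _ in range(num_iters):
        w = w - alpha * grad_f(w)
        iterates.append(w.copy())
    return iterates

# Strongly convex quadratic f(w) = 0.5 w^T A w - b^T w (assumed test problem)
rng = np.random.default_rng(3)
Q = rng.standard_normal((10, 10))
A = Q @ Q.T + np.eye(10)
b = rng.standard_normal(10)
c, L = np.linalg.eigvalsh(A)[0], np.linalg.eigvalsh(A)[-1]   # strong convexity / Lipschitz constants

f = lambda w: 0.5 * w @ A @ w - b @ w
grad_f = lambda w: A @ w - b
w_star = np.linalg.solve(A, b)

iterates = gradient_descent(grad_f, np.zeros(10), alpha=1.0 / L, num_iters=50)
gaps = np.array([f(w) - f(w_star) for w in iterates])
print("observed contraction:", (gaps[1:] / gaps[:-1]).max(), " bound 1 - alpha*c:", 1.0 - c / L)
```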

GD theory

Theorem GD

If α ∈ (0, 1/L], then Σ_{k=0}^∞ ‖∇f(w_k)‖₂² < ∞, which implies {∇f(w_k)} → 0.
If, in addition, f is c-strongly convex, then for all k ≥ 1:

    f(w_k) − f_* ≤ (1 − αc)^k (f(w_0) − f_*).

Proof.
    f(w_{k+1}) ≤ f(w_k) + ∇f(w_k)ᵀ(w_{k+1} − w_k) + (L/2)‖w_{k+1} − w_k‖₂²
              ≤ f(w_k) − (α/2)‖∇f(w_k)‖₂²       (due to the stepsize choice α ≤ 1/L)
              ≤ f(w_k) − αc(f(w_k) − f_*)        (since ‖∇f(w_k)‖₂² ≥ 2c(f(w_k) − f_*) under c-strong convexity)
    ⟹ f(w_{k+1}) − f_* ≤ (1 − αc)(f(w_k) − f_*).


GD illustration

Figure: GD with fixed stepsize


Stochastic gradient method (SG)

Invented by Herbert Robbins and Sutton Monro in 1951.

Sutton Monro, former Lehigh faculty member


Stochastic gradient descent

Approximate gradient only; e.g., choose a random index i_k so that E[∇_w ℓ(h(w; x_{i_k}), y_{i_k}) | w] = ∇f(w).

Algorithm SG : Stochastic Gradient

1: choose an initial point w_0 ∈ R^d and stepsizes {α_k} > 0
2: for k ∈ {0, 1, 2, ...} do
3:   set w_{k+1} ← w_k − α_k g_k, where g_k ≈ ∇f(w_k)
4: end for

Not a descent method!

... but one can guarantee eventual descent in expectation (with E_k[g_k] = ∇f(w_k)):

    f(w_{k+1}) ≤ f(w_k) + ∇f(w_k)ᵀ(w_{k+1} − w_k) + (L/2)‖w_{k+1} − w_k‖₂²
              = f(w_k) − α_k ∇f(w_k)ᵀg_k + (α_k²L/2)‖g_k‖₂²
    ⟹ E_k[f(w_{k+1})] ≤ f(w_k) − α_k‖∇f(w_k)‖₂² + (α_k²L/2) E_k[‖g_k‖₂²].

Markov process: w_{k+1} depends only on w_k and the random choice at iteration k.
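A minimal sketch of Algorithm SG for the finite-sum problem min_w f_n(w): at iteration k a uniformly random index i_k is drawn, so g_k = ∇_w ℓ(h(w_k; x_{i_k}), y_{i_k}) satisfies E_k[g_k] = ∇f_n(w_k). The least-squares loss, the constant stepsize, and the synthetic data are my assumptions.

```python
import numpy as np

def stochastic_gradient(sample_grad, w0, stepsize, n, num_iters, seed=0):
    """Algorithm SG: w_{k+1} = w_k - alpha_k * g_k, with g_k built from one random sample."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for k in range(num_iters):
        i_k = rng.integers(n)                 # uniform index => E_k[g_k] = grad f_n(w_k)
        w = w - stepsize(k) * sample_grad(w, i_k)
    return w

# Least-squares example: f_n(w) = (1/n) sum_i 0.5 (w^T x_i - y_i)^2 (assumed loss/model)
rng = np.random.default_rng(4)
n, d = 1000, 20
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
sample_grad = lambda w, i: (X[i] @ w - y[i]) * X[i]   # gradient of the one-sample loss

w = stochastic_gradient(sample_grad, np.zeros(d), stepsize=lambda k: 0.01, n=n, num_iters=20000)
print("final empirical risk:", 0.5 * np.mean((X @ w - y) ** 2))
```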


SG theory

Theorem SG

If E_k[‖g_k‖₂²] ≤ M + ‖∇f(w_k)‖₂², then:

    α_k = 1/L     ⟹  E[ (1/k) Σ_{j=1}^k ‖∇f(w_j)‖₂² ] ≤ M
    α_k = O(1/k)  ⟹  E[ Σ_{j=1}^k α_j ‖∇f(w_j)‖₂² ] < ∞.

If, in addition, f is c-strongly convex, then:

    α_k = 1/L     ⟹  E[f(w_k) − f_*] ≤ O( (αL)(M/c)/2 )
    α_k = O(1/k)  ⟹  E[f(w_k) − f_*] = O( (L/c)(M/c)/k ).

(*Assumed unbiased gradient estimates; see the paper for more generality.)


Why O(1/k)?

Mathematically:

    Σ_{k=1}^∞ α_k = ∞   while   Σ_{k=1}^∞ α_k² < ∞

Graphically (sequential version of constant stepsize result):
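Numerically, the same effect can be seen on a toy one-dimensional problem (my own setup, not the slide's figure): with a constant stepsize the optimality gap stalls at a noise-dependent level, while α_k = O(1/k) keeps reducing it, albeit sublinearly.

```python
import numpy as np

def sgd_1d(stepsize, num_iters, noise_std=1.0, seed=0):
    """SGD on f(w) = 0.5 w^2 with noisy gradient estimates g_k = w_k + noise (E_k[g_k] = f'(w_k))."""
    rng = np.random.default_rng(seed)
    w, gaps = 5.0, []
    for k in range(num_iters):
        g = w + noise_std * rng.standard_normal()
        w = w - stepsize(k) * g
        gaps.append(0.5 * w * w)              # f(w_k) - f_*, since f_* = 0
    return np.array(gaps)

constant = sgd_1d(lambda k: 0.1, 10000)                 # constant: sum alpha_k^2 diverges
diminishing = sgd_1d(lambda k: 1.0 / (k + 10), 10000)   # sum alpha_k = inf, sum alpha_k^2 < inf
print("constant stepsize, average gap over last 100 iters:   ", constant[-100:].mean())
print("diminishing stepsize, average gap over last 100 iters:", diminishing[-100:].mean())
```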


SG illustration

Figure: SG with fixed stepsize (left) vs. diminishing stepsizes (right)


Why SG over GD for large-scale machine learning?


GD: E[f_n(w_k) − f_{n,*}] = O(ρ^k)   (linear convergence)
SG: E[f_n(w_k) − f_{n,*}] = O(1/k)   (sublinear convergence)

So why SG?

Motivation     Explanation
Intuitive      data "redundancy"
Empirical      SG vs. L-BFGS with a batch gradient (figure below)
Theoretical    E[f_n(w_k) − f_{n,*}] = O(1/k) and E[f(w_k) − f_*] = O(1/k)

Figure: Empirical risk versus number of accessed data points (×10^5) for SGD and batch L-BFGS.
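A sketch in the spirit of the comparison above, on synthetic logistic regression data rather than the experiment from the paper, and with plain batch gradient descent standing in for L-BFGS as the full-gradient baseline. Both methods are charged one unit per accessed training example, so a full-gradient step costs n accesses while an SG step costs one; the stepsizes and data are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 5000, 20
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n))

def risk(w):
    """Empirical logistic risk f_n(w) (no regularization here)."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def grad(w, idx):
    """Logistic-loss gradient averaged over the examples indexed by idx."""
    m = y[idx] * (X[idx] @ w)
    s = 1.0 / (1.0 + np.exp(m))
    return -(X[idx] * (y[idx] * s)[:, None]).mean(axis=0)

budget = 4 * n                          # total number of accessed data points for each method
w_gd, w_sg = np.zeros(d), np.zeros(d)
for _ in range(budget // n):            # batch GD: each iteration touches all n examples
    w_gd -= 1.0 * grad(w_gd, np.arange(n))
for _ in range(budget):                 # SG: each iteration touches a single example
    w_sg -= 0.2 * grad(w_sg, rng.integers(n, size=1))
print("batch gradient risk:", risk(w_gd), "  SG risk:", risk(w_sg))
```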


Work complexity
Time, not data, as limiting factor; Bottou, Bousquet (2008) and Bottou (2010).
        Convergence rate                      Time per iteration      Time for ε-optimality
GD:     E[f_n(w_k) − f_{n,*}] = O(ρ^k)   +    O(n)               ⟹    n log(1/ε)
SG:     E[f_n(w_k) − f_{n,*}] = O(1/k)   +    O(1)               ⟹    1/ε

Considering the total (estimation + optimization) error as

    E = E[f(w_n) − f(w_*)] + E[f(w̃_n) − f(w_n)] ∼ 1/n + ε

and a time budget T, one finds:

- SG: Process as many samples as possible (n ∼ T), leading to E ∼ 1/T.
- GD: With n ∼ T/log(1/ε), minimizing E yields ε ∼ 1/T and E ∼ log(T)/T + 1/T.
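A sketch of the budget calculation behind the GD line, spelling out the slide's claim (with \mathcal{E} denoting the total error E above): within a time budget T, GD pays O(n) per iteration, so reaching ε-optimality requires n log(1/ε) ≲ T; balancing the two terms of the error then gives

```latex
\mathcal{E} \sim \frac{1}{n} + \epsilon
\quad\text{subject to}\quad n \log(1/\epsilon) \lesssim T
\;\;\Longrightarrow\;\;
n \sim \frac{T}{\log(1/\epsilon)}, \quad \epsilon \sim \frac{1}{T}, \quad
\mathcal{E} \sim \frac{\log T}{T} + \frac{1}{T},
```

whereas SG, paying O(1) per iteration, can take n ∼ T samples and T steps, giving \mathcal{E} ∼ 1/T.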


End of the story?

SG is great! Let's keep proving how great it is!

- SG is "stable with respect to inputs"
- SG avoids "steep minima"
- SG avoids "saddle points"
- ... (many more)

No, we should want more...

- SG requires a lot of "hyperparameter" tuning
- Sublinear convergence is not satisfactory
- ... "linearly" convergent method eventually wins
- ... with higher budget, faster computation, parallel?, distributed?
Also, any “gradient”-based method is not scale invariant.
