Optimization Methods For Large-Scale Machine Learning - 2021
presented at
April 2, 2021
References
Motivating questions
Outline
GD and SG
GD vs. SG
Beyond SG
Second-Order Methods
Conclusion
Stochastic optimization
$$\min_{w \in \mathbb{R}^d} f_n(w), \quad \text{where} \quad f_n(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(w; x_i), y_i).$$
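As a concrete illustration (not from the slides), here is a minimal sketch of how $f_n(w)$ and its gradient can be formed, assuming a linear prediction function $h(w; x) = w^T x$ and logistic loss with labels in $\{-1, +1\}$ — both modeling choices are assumptions for the example, not fixed by the slides:

```python
import numpy as np

def empirical_risk_and_grad(w, X, y):
    """Sketch: f_n(w) = (1/n) * sum_i loss(h(w; x_i), y_i), with h(w; x) = w^T x
    and logistic loss, labels y_i in {-1, +1} (illustrative assumptions)."""
    n = X.shape[0]
    margins = y * (X @ w)                         # y_i * h(w; x_i) for each sample
    fn = np.mean(np.logaddexp(0.0, -margins))     # average logistic loss, numerically stable
    # d/dw log(1 + exp(-m_i)) = -y_i * x_i / (1 + exp(m_i))
    grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / n
    return fn, grad
```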
Text classification
Abstract. This paper provides a review and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications. Through case studies on text classification and the training of deep neural networks, we discuss how optimization problems arise in machine learning and what makes them challenging. A major theme of our study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient (SG) method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter. Based on this viewpoint, we present a comprehensive theory of a straightforward, yet versatile SG algorithm, discuss its practical behavior, and highlight opportunities for designing algorithms with improved performance. This leads to a discussion about the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research on techniques that diminish noise in the stochastic directions and methods that make use of second-order derivative approximations.

Key words. numerical optimization, machine learning, stochastic gradient methods, algorithm complexity analysis, noise reduction methods, second-order methods

DOI. 10.1137/16M1080173
What sounds are these? (“Here comes the sun” – The Beatles)
[Figure: a fully connected feed-forward network with input layer $x_1, \dots, x_5$, two hidden layers $h^1_1, \dots, h^1_4$ and $h^2_1, \dots, h^2_4$, and output layer $h_1, h_2, h_3$; the weight matrices $W_1 \in \mathbb{R}^{5 \times 4}$, $W_2 \in \mathbb{R}^{4 \times 4}$, and $W_3 \in \mathbb{R}^{4 \times 3}$ connect successive layers.]
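The forward pass suggested by the diagram can be written compactly. The sketch below assumes the layer widths shown in the figure (5 inputs, two hidden layers of 4 units, 3 outputs) and a sigmoid activation; the activation and the absence of bias terms are assumptions, since the figure only shows the wiring:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2, W3):
    """Forward pass matching the pictured wiring: x in R^5 -> 4 hidden -> 4 hidden -> 3 outputs.
    Sigmoid activations and no bias terms are illustrative assumptions."""
    h1 = sigmoid(W1.T @ x)   # [W1]_{ij} weights input x_i into hidden unit h^1_j
    h2 = sigmoid(W2.T @ h1)  # [W2]_{ij} weights h^1_i into h^2_j
    return W3.T @ h2         # [W3]_{ij} weights h^2_i into output h_j

# Example with the dimensions from the figure and random weights:
rng = np.random.default_rng(0)
x = rng.standard_normal(5)
W1 = rng.standard_normal((5, 4))
W2 = rng.standard_normal((4, 4))
W3 = rng.standard_normal((4, 3))
print(forward(x, W1, W2, W3))  # a vector in R^3
```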
Approximation error
Choice of prediction function family H has important implications; e.g.,
$$H_C := \{h \in H : \Omega(h) \le C\}.$$
[Figure: testing and training error curves, plotted as functions of the complexity bound $C$ and of training time.]
Problems of interest
Gradient descent
Aim: Find a stationary point, i.e., $w$ with $\nabla f(w) = 0$.
[Figure: successive gradient descent iterates $w_k$ on the graph of $f$, showing the current value $f(w_k)$ and the different shapes of $f(w)$ that are consistent with the local gradient information.]
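A minimal sketch of the full-gradient iteration discussed here, assuming a fixed stepsize `alpha` and a user-supplied gradient function (the names and stopping rule are placeholders, not taken from the slides):

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha, max_iters=1000, tol=1e-8):
    """Sketch of gradient descent: w_{k+1} = w_k - alpha * grad_f(w_k).
    Stops once the gradient norm is small, i.e., near a stationary point."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(w)
        if np.linalg.norm(g) <= tol:
            break
        w = w - alpha * g
    return w
```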
GD theory

Theorem GD
If $\alpha \in (0, 1/L]$, then $\sum_{k=0}^{\infty} \|\nabla f(w_k)\|_2^2 < \infty$, which implies $\{\nabla f(w_k)\} \to 0$.
If, in addition, $f$ is $c$-strongly convex, then for all $k \ge 1$:
$$f(w_k) - f_* \le (1 - \alpha c)^{k-1}\,(f(w_1) - f_*).$$

Proof.
$$\begin{aligned}
f(w_{k+1}) &\le f(w_k) + \nabla f(w_k)^T (w_{k+1} - w_k) + \tfrac{1}{2} L \|w_{k+1} - w_k\|_2^2 \\
&\le f(w_k) - \tfrac{1}{2} \alpha \|\nabla f(w_k)\|_2^2 && \text{(due to stepsize choice)} \\
&\le f(w_k) - \alpha c\,(f(w_k) - f_*) && \text{(strong convexity: } \tfrac{1}{2}\|\nabla f(w_k)\|_2^2 \ge c\,(f(w_k) - f_*)\text{)} \\
\implies\quad f(w_{k+1}) - f_* &\le (1 - \alpha c)(f(w_k) - f_*).
\end{aligned}$$
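As a hedged numerical check of this linear rate (not part of the slides), one can run the iteration on a diagonal strongly convex quadratic, where $L$ and $c$ are the largest and smallest eigenvalues of the Hessian, and watch the per-step contraction stay below $1 - \alpha c$:

```python
import numpy as np

# Quadratic f(w) = 0.5 * w^T A w with A positive definite, so L = 10, c = 1, f_* = 0 at w = 0.
A = np.diag([1.0, 4.0, 10.0])
L, c = 10.0, 1.0
alpha = 1.0 / L

f = lambda w: 0.5 * w @ A @ w
w = np.array([1.0, 1.0, 1.0])
prev = f(w)
for k in range(5):
    w = w - alpha * (A @ w)        # gradient step; grad f(w) = A w
    ratio = f(w) / prev            # theory: ratio <= 1 - alpha * c = 0.9
    print(f"k={k}: f(w)={f(w):.3e}, ratio={ratio:.3f}")
    prev = f(w)
```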
GD illustration
Approximate gradient only; e.g., random $i_k$ so that $\mathbb{E}[\nabla_w \ell(h(w; x_{i_k}), y_{i_k}) \mid w] = \nabla f(w)$.
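A minimal sketch of one SG step under these assumptions, reusing the illustrative linear model and logistic loss from the earlier sketch; the uniform sampling of $i_k$ is what makes the sampled gradient an unbiased estimate:

```python
import numpy as np

def sg_step(w, X, y, alpha, rng):
    """One stochastic gradient step: draw i_k uniformly at random, so the gradient of
    loss(h(w; x_{i_k}), y_{i_k}) is an unbiased estimate of grad f_n(w).
    Linear model h(w; x) = w^T x with logistic loss is an illustrative assumption."""
    i = rng.integers(X.shape[0])           # random index i_k
    m = y[i] * (X[i] @ w)                  # margin y_{i_k} * h(w; x_{i_k})
    g = -y[i] * X[i] / (1.0 + np.exp(m))   # gradient of the single-sample loss
    return w - alpha * g
```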
SG theory
Theorem SG
If $\mathbb{E}_k[\|g_k\|_2^2] \le M + \|\nabla f(w_k)\|_2^2$, then:
$$\alpha_k = \frac{1}{L} \ \implies\ \mathbb{E}\!\left[\frac{1}{k} \sum_{j=1}^{k} \|\nabla f(w_j)\|_2^2\right] \le M$$
$$\alpha_k = O\!\left(\frac{1}{k}\right) \ \implies\ \mathbb{E}\!\left[\sum_{j=1}^{k} \alpha_j \|\nabla f(w_j)\|_2^2\right] < \infty.$$
Why O(1/k)?
Mathematically:
$$\sum_{k=1}^{\infty} \alpha_k = \infty \quad \text{while} \quad \sum_{k=1}^{\infty} \alpha_k^2 < \infty$$
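A standard concrete choice (an illustration, not prescribed by the slides) is $\alpha_k = \gamma/k$ for some $\gamma > 0$, which satisfies both conditions:
$$\sum_{k=1}^{\infty} \frac{\gamma}{k} = \infty \qquad \text{while} \qquad \sum_{k=1}^{\infty} \frac{\gamma^2}{k^2} = \frac{\gamma^2 \pi^2}{6} < \infty.$$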
SG illustration
Outline
GD and SG
GD vs. SG
Beyond SG
Second-Order Methods
Conclusion
So why SG?
Motivation  | Explanation
Intuitive   | data "redundancy"
Empirical   | SG vs. L-BFGS with batch gradient (below)
Theoretical | $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$ and $\mathbb{E}[f(w_k) - f_*] = O(1/k)$
[Figure: empirical risk versus accessed data points ($\times 10^5$), comparing SGD and L-BFGS with batch gradients; SGD reaches a much lower empirical risk for the same number of accessed data points.]
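A hedged sketch of the kind of comparison behind such a plot, on synthetic logistic-regression data; the dataset, model, and stepsizes are illustrative assumptions, and a single plain full-gradient step stands in for the batch method (not L-BFGS) so that both methods access the same number of data points:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 20
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = np.sign(X @ w_true + 0.5 * rng.standard_normal(n))   # labels in {-1, +1}

def risk(w):                                              # empirical risk, logistic loss
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def full_grad(w):
    m = y * (X @ w)
    return -(X.T @ (y / (1.0 + np.exp(m)))) / n

# Equal budget of n accessed data points: n single-sample SG steps vs. one full-gradient step.
w_sgd = np.zeros(d)
for _ in range(n):
    i = rng.integers(n)
    m = y[i] * (X[i] @ w_sgd)
    w_sgd -= 0.1 * (-y[i] * X[i] / (1.0 + np.exp(m)))     # SG step, stepsize 0.1 (assumed)

w_gd = np.zeros(d) - 1.0 * full_grad(np.zeros(d))         # one batch step, stepsize 1.0 (assumed)

print("empirical risk after n accessed points:",
      "SGD", round(risk(w_sgd), 3), "| batch", round(risk(w_gd), 3))
```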
Work complexity
Time, not data, as limiting factor; Bottou, Bousquet (2008) and Bottou (2010).
     | Convergence rate                               | Time per iteration | Time for ε-optimality
GD:  | $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(\rho^k)$   | $O(n)$             | $\implies n \log(1/\epsilon)$
SG:  | $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$      | $O(1)$             | $\implies 1/\epsilon$
SG:
$$E \sim \frac{1}{T}.$$
GD: With $n \sim T/\log(1/\epsilon)$, minimizing $E$ yields $\epsilon \sim 1/T$ and
$$E \sim \frac{\log(T)}{T} + \frac{1}{T}.$$
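As a rough worked illustration with made-up numbers, say $n = 10^6$ training examples and target accuracy $\epsilon = 10^{-3}$, the ε-optimality column above gives
$$\text{GD: } n \log(1/\epsilon) = 10^{6}\,\ln(10^{3}) \approx 7 \times 10^{6}, \qquad \text{SG: } 1/\epsilon = 10^{3},$$
so at this scale the SG estimate is smaller by several orders of magnitude.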
Outline
GD and SG
GD vs. SG
Beyond SG
Second-Order Methods
Conclusion