
Better Theory for SGD in the Nonconvex World

Ahmed Khaled¹,² and Peter Richtárik¹

¹ KAUST, Thuwal, Saudi Arabia
² Cairo University, Giza, Egypt

July 27, 2020


arXiv:2002.03329v3 [math.OC] 24 Jul 2020

Abstract
Large-scale nonconvex optimization problems are ubiquitous in modern machine learning, and among
practitioners interested in solving them, Stochastic Gradient Descent (SGD) reigns supreme. We revisit
the analysis of SGD in the nonconvex setting and propose a new variant of the recently introduced
expected smoothness assumption which governs the behaviour of the second moment of the stochastic
gradient. We show that our assumption is both more general and more reasonable than assumptions
made in all prior work. Moreover, our results yield the optimal O(ε⁻⁴) rate for finding a stationary point
of nonconvex smooth functions, and recover the optimal O(ε⁻¹) rate for finding a global solution if the
Polyak-Łojasiewicz condition is satisfied. We compare against convergence rates under convexity and
prove a theorem on the convergence of SGD under Quadratic Functional Growth and convexity, which
might be of independent interest. Moreover, we perform our analysis in a framework which allows for
a detailed study of the effects of a wide array of sampling strategies and minibatch sizes for finite-sum
optimization problems. We corroborate our theoretical results with experiments on real and synthetic
data.

1 Introduction
In this work we study the complexity of stochastic gradient descent (SGD) for solving unconstrained optimization
problems of the form

    min_{x ∈ ℝ^d} f(x),    (1)

where f : ℝ^d → ℝ is possibly nonconvex and satisfies the following smoothness and regularity conditions.
Assumption 1. Function f is bounded from below by an infimum f inf ∈ R, differentiable, and ∇f is
L-Lipschitz:
    ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖  for all x, y ∈ ℝ^d.
Motivating this problem is perhaps unnecessary. Indeed, the training of modern deep learning models reduces
to nonconvex optimization problems, and the state-of-the-art methods for solving them are all variants of
SGD (Goodfellow et al., 2016; Sun, 2020). SGD is a randomized first-order method performing iterations of
the form
    x_{k+1} = x_k − γ_k g(x_k),    (2)

where g(x) is an unbiased estimator of the gradient ∇f(x) (i.e., E[g(x)] = ∇f(x)), and γ_k is an appropriately
chosen learning rate. Since f can have many local minima and/or saddle points, solving (1) to global optimality
is intractable (Nemirovsky and Yudin, 1983; Vavasis, 1995). However, the problem becomes tractable if one
scales down the requirements on the point of interest from global optimality to some relaxed version thereof,
such as stationarity or local optimality. In this paper we are interested in the fundamental problem of finding
an ε-stationary point, i.e., we wish to find a random vector x ∈ ℝ^d for which E[‖∇f(x)‖²] ≤ ε², where E[·]
is the expectation over the randomness of the algorithm.
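To make the iteration concrete, the following is a minimal sketch of (2) in Python/NumPy. The quadratic objective and the additive Gaussian noise model are illustrative assumptions for this sketch only, not part of the paper's setup.

    import numpy as np

    rng = np.random.default_rng(0)
    d, K, gamma = 10, 1000, 0.1

    def grad_f(x):
        # Illustrative smooth objective f(x) = 0.5*||x||^2, so grad f(x) = x.
        return x

    def g(x):
        # Unbiased stochastic gradient: true gradient plus zero-mean noise.
        return grad_f(x) + 0.01 * rng.standard_normal(d)

    x = rng.standard_normal(d)
    for k in range(K):
        x = x - gamma * g(x)  # the SGD step (2), here with a constant stepsize

    print("final squared gradient norm:", np.linalg.norm(grad_f(x)) ** 2)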

1.1 Modelling stochasticity
Since unbiasedness alone is not enough to conduct a complexity analysis of SGD, it is necessary to impose
further assumptions on the connection between the stochastic gradient g(x) and the true gradient ∇f(x).
The most commonly used assumptions take the form of various structured bounds on the second moment of
g(x). We argue (see Section 3) that the bounds proposed in the literature are often too strong and unrealistic,
as they do not fully capture how randomness in g(x) arises in practice. Indeed, existing bounds are primarily
constructed in order to facilitate analysis, and their match with reality often takes a back seat. In order to
obtain meaningful theoretical insights into the workings of SGD, it is very important to model this randomness
both correctly, so that the assumptions we impose are provably satisfied, and accurately, so as to obtain
bounds that are as tight as possible.

1.2 Sources of stochasticity


Practical applications of SGD typically involve the training of supervised machine learning models via empirical
risk minimization (Shalev-Shwartz and Ben-David, 2014), which leads to optimization problems of a finite-sum
structure:
    min_{x ∈ ℝ^d} f(x) := (1/n) Σ_{i=1}^n f_i(x).    (3)

In a single-machine setup, n is the number of training data points, and f_i(x) represents the loss of model x
on data point i. In this setting, data access is expensive, and g(x) is typically constructed via subsampling
techniques such as minibatching (Dekel et al., 2012) and importance sampling (Needell et al., 2016). In the
rather general arbitrary sampling paradigm (Gower et al., 2019), one may choose an arbitrary random subset
S ⊆ [n] of examples, and subsequently g(x) is assembled from the information stored in the gradients ∇f_i(x)
for i ∈ S only. This leads to formulas of the form
    g(x) = Σ_{i∈S} v_i ∇f_i(x),    (4)

where the v_i are appropriately defined random variables ensuring unbiasedness.
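A minimal sketch of one concrete instance of the estimator (4), assuming sampling with replacement from probabilities q_i; the synthetic quadratic losses are illustrative, not part of the paper's setup.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, tau = 100, 5, 10

    # Illustrative component gradients: grad f_i(x) = A_i x - b_i.
    A = rng.standard_normal((n, d, d))
    b = rng.standard_normal((n, d))

    def grad_fi(i, x):
        return A[i] @ x - b[i]

    def g(x, q):
        # Draw tau indices i ~ q with replacement; the reweighting by 1/(n*q_i)
        # makes the estimator unbiased: E[g(x)] = (1/n) * sum_i grad f_i(x).
        S = rng.choice(n, size=tau, p=q)
        return sum(grad_fi(i, x) / (n * q[i]) for i in S) / tau

    x = rng.standard_normal(d)
    q = np.full(n, 1.0 / n)  # uniform probabilities; importance sampling would reweight these
    full_grad = np.mean([grad_fi(i, x) for i in range(n)], axis=0)
    avg_est = np.mean([g(x, q) for _ in range(5000)], axis=0)
    print("estimation error:", np.linalg.norm(avg_est - full_grad))  # small, by unbiasedness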


In a distributed setting, n corresponds to the number of machines (e.g., the number of mobile devices in federated
learning) and f_i(x) represents the loss of model x on all the training data stored on machine i. In this setting,
communication is expensive, and modern gradient-type methods therefore rely on various randomized gradient
compression mechanisms such as quantization (Gupta et al., 2015), sparsification (Wangni et al., 2018), and
dithering (Alistarh et al., 2017). Given an appropriately chosen (unbiased) randomized compression map
Q : ℝ^d → ℝ^d, the local gradients ∇f_i(x) are first compressed to Q_i(∇f_i(x)), where Q_i is an independent
instantiation of Q sampled by machine i in each iteration, and then communicated to a master node,
which performs aggregation (Khirirat et al., 2018). This gives rise to SGD with a stochastic gradient of the form

    g(x) = (1/n) Σ_{i=1}^n Q_i(∇f_i(x)).    (5)

In many applications, each f_i has a finite-sum structure of its own, reflecting the empirical risk over
the training data stored on that device. In such situations, it is often assumed that compression is not
applied to exact gradients, but to stochastic gradients coming from subsampling (Gupta et al., 2015; Ben-Nun
and Hoefler, 2019; Horváth et al., 2019). This further complicates the structure of the stochastic gradient.
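As an illustration, here is a minimal sketch of the estimator (5) with a rand-k sparsifier, which keeps k random coordinates and rescales by d/k; this particular compressor is unbiased and satisfies the ω-compression property (18) of Section 4.6 with ω = d/k − 1. The setup (n machines, fixed local gradients) is synthetic.

    import numpy as np

    rng = np.random.default_rng(2)
    n, d, k = 8, 20, 5  # n machines, dimension d, keep k coordinates

    def rand_k(v):
        # Rand-k sparsification: zero out all but k random coordinates and
        # rescale by d/k, which makes the compressor unbiased.
        mask = np.zeros(d)
        mask[rng.choice(d, size=k, replace=False)] = d / k
        return mask * v

    # Illustrative local gradients at a fixed point x, one per machine.
    local_grads = rng.standard_normal((n, d))

    # The estimator (5): each machine compresses independently; the master averages.
    g = np.mean([rand_k(gi) for gi in local_grads], axis=0)

    true_grad = local_grads.mean(axis=0)
    print("single-draw deviation from the exact average:", np.linalg.norm(g - true_grad))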

2 Contributions
The highly specific and elaborate structure of the stochastic gradient g(x) used in practice, such as that coming
from subsampling as in (4) or compression as in (5), raises questions about appropriate theoretical modelling of
its second moment. As we shall explain in Section 3, existing approaches do not offer a satisfactory treatment.

Figure 1: Assumption hierarchy on the second moment of the stochastic gradient of nonconvex functions.
An arrow indicates implication: (M-SG) ⇒ (E-SG) ⇒ (RG) ⇒ (ES), with (BV) ⇒ (RG), (GC) ⇒ (RG), and
(SS) ⇒ (ES). Our newly proposed ES assumption is the most general. The statement is formalized as Theorem 1.

Indeed, as we show through a simple example, none of the existing assumptions is satisfied even in the
simple scenario of subsampling from the sum of two functions (see Proposition 1).
Our work is motivated by the need for more accurate modelling of the stochastic gradient in nonconvex
optimization problems, which we argue leads to a more accurate and informative analysis of SGD in the
nonconvex world — a problem of the highest importance in modern deep learning.
The key contributions of our work are:
• Inspired by recent developments in the analysis of SGD in the (strongly) convex setting (Richtárik and
Takáč, 2017; Gower et al., 2020; 2019), we propose a new assumption, which we call expected smoothness
(ES), for modelling the second moment of the stochastic gradient, specifically focusing on nonconvex
problems (see Section 4). In particular, we assume that there exist constants A, B, C ≥ 0 such that
    E[‖g(x)‖²] ≤ 2A(f(x) − f^inf) + B‖∇f(x)‖² + C.


• We show in Section 4.3 that (ES) is the weakest, and hence the most general, among all assumptions in
the existing literature we are aware of (see Figure 1), including assumptions such as bounded variance
(BV) (Ghadimi and Lan, 2013), maximal strong growth (M-SG) (Schmidt and Le Roux, 2013), expected
strong growth (E-SG) (Vaswani et al., 2019), relaxed growth (RG) (Bottou et al., 2018), and gradient confusion
(GC) (Sankararaman et al., 2019), which we review in Section 3.
• Moreover, we prove that unlike existing assumptions, which typically implicitly assume that stochas-
ticity comes from perturbation (see Section 4.4), (ES) automatically holds under standard and weak
assumptions made on the loss function in settings such as subsampling (see Section 4.5) and compression
(see Section 4.6). In this sense, (ES) is not an assumption but an inequality which provably holds and can
be used to capture the convergence of SGD accurately and precisely. For instance, to the best of
our knowledge, while the combination of gradient compression and subsampling is not covered by any
prior analysis of SGD for nonconvex objectives, our results can be applied to this setting.
• We recover the optimal O(ε⁻⁴) rate for general smooth nonconvex problems and the optimal O(ε⁻¹) rate under the
PL condition (see Section 5). However, our rates are informative enough for us to be able to deduce, for
the first time in the literature on nonconvex SGD, importance sampling probabilities and formulas for
the optimal minibatch size (see Section 6).

3 Existing Models of Stochastic Gradient


Ghadimi and Lan (2013) analyze SGD under the assumption that f is lower bounded and that the stochastic
gradients g(x) are unbiased and have bounded variance:

    E[‖g(x) − ∇f(x)‖²] ≤ σ².    (BV)

Note that due to unbiasedness, this is equivalent to

    E[‖g(x)‖²] ≤ ‖∇f(x)‖² + σ².    (6)

With an appropriately chosen constant stepsize γ, their results imply an O(ε⁻⁴) rate of convergence.
In the context of finite-sum problems with uniform sampling, where at each step an index i ∈ [n] is sampled
uniformly at random and the stochastic gradient estimator g(x) = ∇f_i(x) is used, the
maximal strong growth condition requires the inequality

    ‖g(x)‖² ≤ α‖∇f(x)‖²    (M-SG)

to hold almost surely for some α ≥ 0. Tseng (1998) used the maximal strong growth condition early on to
establish the convergence of the incremental gradient method, a closely related variant of SGD. Schmidt and
Le Roux (2013) prove linear convergence of SGD for strongly convex objectives under (M-SG).
We may also assume that (M-SG) holds in expectation rather than uniformly, leading to expected strong
growth:

    E[‖g(x)‖²] ≤ α‖∇f(x)‖²    (E-SG)

for some α > 0. Like maximal strong growth, variants of this condition have been used in convergence results for
incremental gradient methods (Solodov, 1998). Vaswani et al. (2019) prove that under (E-SG), SGD converges
to an ε-stationary point in O(ε⁻²) steps. This assumption is quite strong and necessitates interpolation: if
∇f(x) = 0, then g(x) = 0 almost surely. This is typically not true in the distributed, finite-sum case, where
the functions f_i can be very different; see, e.g., (McMahan et al., 2017).
Bottou et al. (2018) consider the relaxed growth condition, a version of (E-SG) featuring an
additive constant:

    E[‖g(x)‖²] ≤ α‖∇f(x)‖² + β.    (RG)

In view of (6), (RG) can also be seen as a slight generalization of the bounded variance assumption (BV).
Bertsekas and Tsitsiklis (2000) establish the almost-sure convergence of SGD under (RG), and Bottou et al.
(2018) give an O(ε⁻²) rate of convergence for nonconvex objectives to a neighborhood of a stationary point
whose radius is proportional to β. Unfortunately, (RG) is quite difficult to verify in practice, and it fails
even for some simple problems, as the following proposition shows.

Proposition 1. There is a simple finite-sum minimization problem with two functions for which (RG) is not
satisfied.
The proof of this proposition and all subsequent proofs are relegated to the supplementary material.
In a recent development, Sankararaman et al. (2019) postulate a gradient confusion bound for finite-sum
problems and SGD with uniform single-element sampling. This bound requires the existence of η > 0 such
that

    ⟨∇f_i(x), ∇f_j(x)⟩ ≥ −η    (GC)

holds for all i ≠ j (and all x ∈ ℝ^d). For general nonconvex objectives, they prove convergence to a
neighborhood only. For functions satisfying the PL condition (Assumption 5), they prove linear convergence
to a neighborhood of a stationary point.
Lei et al. (2019) analyze SGD for f of the form

    f(x) = E_{ξ∼D}[f_ξ(x)].    (7)

They assume that f_ξ is nonnegative and that ∇f_ξ is almost surely α-Hölder continuous in x, and then use
g(x) = ∇f_ξ(x) for ξ sampled i.i.d. from D. Specialized to the case of L-smoothness, their assumption reads

    ‖g(x) − g(y)‖ ≤ L‖x − y‖  and  f_ξ(x) ≥ 0    (SS)

almost surely for all x, y ∈ ℝ^d. We term this condition sure-smoothness (SS). They establish the sample
complexity O(ε⁻⁴ log(ε⁻¹)). Unfortunately, their results do not recover full gradient descent, and their analysis
does not extend easily to compression and subsampling.
We summarize the relations between the aforementioned assumptions and our own (introduced in Section 4)
in Figure 1. These relations are formalized as Theorem 1.

4 ES in the Nonconvex World
In this section, we first briefly review the notion of expected smoothness as recently proposed in several
contexts different from ours, and use this development to motivate our definition of expected smoothness (ES)
for nonconvex problems. We then proceed to show that (ES) is the weakest of all the previously proposed
assumptions modelling the behaviour of the second moment of the stochastic gradient for nonconvex problems
reviewed in Section 3, thus substantiating Figure 1. Finally, we show how (ES) provides a correct and accurate model for
the behaviour of the stochastic gradient arising not only from classical perturbation, but also from subsampling
and compression.

4.1 Brief history of expected smoothness: from convex quadratic to convex optimization
Our starting point is the work of Richtárik and Takáč (2017) who, motivated by the desire to obtain deeper
insights into the workings of the sketch and project algorithms developed by Gower and Richtárik (2015),
study the behaviour of SGD applied to a reformulation of a consistent linear system as a stochastic convex
quadratic optimization problem of the form
    min_{x ∈ ℝ^d} f(x) := E[f_ξ(x)].    (8)

The above problem encodes a linear system in the sense that f (x) is nonnegative and equal to zero if and only
if x solves the linear system. The distribution behind the randomness in their reformulation (8) plays the role
of a parameter which can be tuned in order to target specific properties of SGD, such as convergence rate or
cost per iteration. The stochastic gradient g(x) = ∇f_ξ(x) satisfies the identity E[‖g(x)‖²] = 2f(x), which
plays a key role in the analysis. Since in their setting g(x*) = 0 almost surely for any minimizer x* (which
suggests that their problem is over-parameterized), the above identity can be written in the equivalent form

    E[‖g(x) − g(x*)‖²] = 2(f(x) − f(x*)).    (9)

Equation (9) is the first instance of the expected smoothness property/inequality we are aware of. Using tools
from matrix analysis, Richtárik and Takáč (2017) are able to obtain identities for the expected iterates of SGD.
Kovalev et al. (2018) study the same method for spectral distributions, and for some of these they establish
stronger identities of the form E[‖x_k − x*‖²] = α_k ‖x_0 − x*‖², suggesting that the property (9) has the
capacity to enable a perfectly precise mean-square analysis of SGD in this setting.
Expected smoothness as an inequality was later used to analyze the JacSketch method (Gower et al., 2020),
which is a general variance-reduced SGD that includes the widely-used SAGA algorithm (Defazio et al., 2014)
as a special case. By carefully considering and optimizing over smoothness, Gower et al. (2020) obtain the
currently best-known convergence rate for SAGA. Assuming strong convexity and the existence of a global
minimizer x*, their assumption in our language reads

    E[‖g(x) − g(x*)‖²] ≤ 2A(f(x) − f(x*)),    (C-ES)

where A ≥ 0, g(x) is a stochastic gradient and the expectation is w.r.t. the randomness embedded in g.
We refer to the above condition by the name convex expected smoothness (C-ES) as it provides a good
model of the stochastic gradient of convex objectives. (C-ES) was subsequently used to analyze SGD for
quasi-strongly convex functions by Gower et al. (2019), which allowed the authors to study a wide array of
subsampling strategies with great accuracy, as well as to provide the first formulas for the optimal minibatch
size for SGD in the strongly convex regime. These rates are tight up to non-problem-specific constants in the
setting of convex stochastic optimization (Nguyen et al., 2019).
Independently, and motivated by extending the analysis of subgradient methods for convex objectives, Grimmer
(2019) also studied the convergence of SGD in a setting similar to (C-ES) (in fact, their assumption is (10)
below). Grimmer (2019) develops a quite general and elegant theory that includes projection operators
and non-smooth functions, but the convergence rate obtained for smooth strongly convex functions shows
a suboptimal dependence on the condition number.

4.2 Expected smoothness for nonconvex optimization
Given the utility of (C-ES), it is natural to ask: can we extend (C-ES) beyond convexity? The first
problem we face is that x* is ill-defined for nonconvex optimization problems, which may not have
any global minima. However, Gower et al. (2019) use (C-ES) only through the following direct consequence:

    E[‖g(x)‖²] ≤ 4A(f(x) − f(x*)) + 2σ²,    (10)

where σ² := E[‖g(x*)‖²]. We thus propose to remove x*: we merely ask for a global lower bound f^inf on
the function f rather than a global minimizer, and we dispense with the interpretation of σ² as the variance
at the optimum x*, merely asking for the existence of some such constant. This yields the new condition

    E[‖g(x)‖²] ≤ 4A(f(x) − f^inf) + C,    (11)
for some A, C ≥ 0. While (11) may be satisfactory for the analysis of convex problems, it does not enable us
to easily recover the convergence of full gradient descent or SGD under strong growth in the case of nonconvex
objectives. As we shall see in further sections, the fix is to add a third term to the bound, which finally leads
to our (ES) assumption:
Assumption 2 (Expected smoothness). The second moment of the stochastic gradient satisfies
    E[‖g(x)‖²] ≤ 2A(f(x) − f^inf) + B‖∇f(x)‖² + C    (ES)

for some A, B, C ≥ 0 and all x ∈ ℝ^d.


In the rest of this section, we turn to the generality of Assumption 2 and its use in correct and accurate
modelling of sources of stochasticity arising in practice.

4.3 Expected smoothness as the weakest assumption


As discussed in Section 3, assumptions on the stochastic gradients abound in the literature on SGD. If we
hope for a correct and tighter theory, we should at least be able to recover the convergence of SGD under
those assumptions. Our next theorem, described informally below and stated and proved formally in the
supplementary material, establishes exactly this.

Theorem 1 (Informal). The expected smoothness assumption (Assumption 2) is the weakest condition
among the assumptions in Section 3.

4.4 Perturbation
One of the simplest models of stochasticity is the case of additive zero-mean noise with bounded variance,
that is
    g(x) = ∇f(x) + ξ,

where ξ is a random variable satisfying E[ξ] = 0 and E[‖ξ‖²] ≤ σ². Since the bounded variance condition
(BV) is clearly satisfied, Theorem 1 shows that our Assumption 2 is satisfied as well. While this model can be
useful for modelling noise artificially injected into the full gradient (Ge et al., 2015; Fang et al., 2019), it is
unreasonably strong for practical sources of noise: indeed, as we saw in Proposition 1, it does not hold for
subsampling with just two functions. It is furthermore unable to model the rather simple multiplicative
noise which arises in the case of gradient compression operators.

4.5 Subsampling
Now consider f having the finite-sum structure (3). In order to develop a general theory of SGD for a wide
array of subsampling strategies, we follow the stochastic reformulation formalism pioneered by Richtárik and
Takáč (2017) and Gower et al. (2020), in the form proposed in (Gower et al., 2019). Given a sampling vector
v ∈ ℝ^n drawn from some user-defined distribution D (where a sampling vector is one such that E_D[v_i] = 1
for all i ∈ [n]), we define the random function f_v(x) := (1/n) Σ_{i=1}^n v_i f_i(x). Noting that
E_{v∼D}[f_v(x)] = f(x), we reformulate (3) as the stochastic minimization problem

    min_{x ∈ ℝ^d} E_{v∼D}[f_v(x)] = E_{v∼D}[(1/n) Σ_{i=1}^n v_i f_i(x)],    (12)

where we assume access only to unbiased estimates of ∇f(x) through the stochastic realizations

    ∇f_v(x) = (1/n) Σ_{i=1}^n v_i ∇f_i(x).    (13)

That is, given the current point x, we sample v ∼ D and set g(x) := ∇f_v(x). We will now show that (ES) is
satisfied under very mild and natural assumptions on the functions f_i and the sampling vectors v. In this
sense, (ES) is not an additional assumption; it is an inequality that is automatically satisfied.
Assumption 3. Each f_i is bounded from below by f_i^inf and is L_i-smooth; that is, for all x, y ∈ ℝ^d we have

    f_i(y) ≤ f_i(x) + ⟨∇f_i(x), y − x⟩ + (L_i/2)‖y − x‖².
To show that Assumption 2 is an automatic consequence of Assumption 3, we rely on the following crucial
lemma.
Lemma 1. Let f be a function for which Assumption 1 is satisfied. Then for all x ∈ ℝ^d we have

    ‖∇f(x)‖² ≤ 2L(f(x) − f^inf).    (14)

This lemma shows up in several recent works and is often used in conjunction with other assumptions such as
bounded variance (Li and Orabona, 2019) and convexity (Stich and Karimireddy, 2019). Lei et al. (2019)
also use a version of it to prove the convergence of SGD for nonconvex objectives, and we compare our
results against theirs in Section 5. Armed with Lemma 1, we can prove that Assumption 2 holds for all
non-degenerate distributions D.
Proposition 2. Suppose that Assumption 3 holds and that E[v_i²] is finite for all i. Let
Δ^inf := (1/n) Σ_{i=1}^n (f^inf − f_i^inf). Then Δ^inf ≥ 0 and Assumption 2 holds with
A = max_i L_i E[v_i²], B = 0, and C = 2AΔ^inf.
 
The condition that E[v_i²] be finite is a very mild condition on D and is satisfied by virtually all practical
subsampling schemes in the literature. However, the generality of Proposition 2 comes at a cost: the bounds
are too pessimistic. By making more specific (and practical) choices of the sampling distribution D, we can
get much tighter bounds. We do this next by considering some representative sampling distributions, without
aiming to be exhaustive:

• Sampling with replacement. An n-sided die is rolled a total of τ > 0 times, and the number of times
side i comes up is recorded as S_i. We then define

    v_i = S_i / (τ q_i),    (15)

where q_i is the probability that the i-th side of the die comes up and Σ_{i=1}^n q_i = 1. In this case, the
number of stochastic gradients queried is always τ, though some of them may be repeated.

• Independent sampling without replacement. We generate a random subset S ⊆ {1, 2, …, n} and define

    v_i = 1_{i∈S} / p_i,    (16)

where 1_{i∈S} = 1 if i ∈ S and 0 otherwise, and p_i = Prob(i ∈ S) > 0. We assume that each i is
included in S with probability p_i independently of all the others. In this case, the number of stochastic
gradients queried, |S|, is not fixed, but has expectation E[|S|] = Σ_{i=1}^n p_i.

• τ-nice sampling without replacement. This is similar to the previous sampling, but we generate the
random subset S ⊆ {1, 2, …, n} by choosing uniformly from all subsets of size τ, for an integer τ ∈ [1, n].
We define v_i as in (16), and it is easy to see that p_i = τ/n for all i. A code sketch of all three
constructions follows below.
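The following minimal sketch constructs the sampling vectors for the three distributions above (NumPy; the sizes are illustrative); a quick Monte Carlo check confirms that E[v_i] = 1 in each case.

    import numpy as np

    rng = np.random.default_rng(3)
    n, tau = 10, 4
    q = np.full(n, 1.0 / n)   # die probabilities (sampling with replacement)
    p = np.full(n, tau / n)   # inclusion probabilities (independent sampling)

    def v_with_replacement():
        # Roll an n-sided die tau times; S_i counts occurrences of side i, as in (15).
        S = np.bincount(rng.choice(n, size=tau, p=q), minlength=n)
        return S / (tau * q)

    def v_independent():
        # Include each i independently with probability p_i, as in (16).
        return (rng.random(n) < p) / p

    def v_tau_nice():
        # Choose a subset of size tau uniformly at random; here p_i = tau/n.
        v = np.zeros(n)
        v[rng.choice(n, size=tau, replace=False)] = n / tau
        return v

    for sampler in (v_with_replacement, v_independent, v_tau_nice):
        mean_v = np.mean([sampler() for _ in range(20000)], axis=0)
        print(sampler.__name__, "mean of v_i ~", np.round(mean_v.mean(), 2))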

These sampling distributions were considered in the context of SGD for convex objective functions in (Gorbunov
et al., 2020; Gower et al., 2019). We show next that Assumption 2 is satisfied for these distributions with
much better constants than the generic Proposition 2 would suggest.
Proposition 3. Suppose that Assumptions 1 and 3 hold and let Δ^inf = (1/n) Σ_{i=1}^n (f^inf − f_i^inf). Then:

(i) For independent sampling with replacement, Assumption 2 is satisfied with A = max_i L_i/(τnq_i),
B = 1 − 1/τ, and C = 2AΔ^inf.

(ii) For independent sampling without replacement, Assumption 2 is satisfied with A = max_i (1 − p_i)L_i/(p_i n),
B = 1, and C = 2AΔ^inf.

(iii) For τ-nice sampling without replacement, Assumption 2 is satisfied with A = (n − τ)/(τ(n − 1)) · max_i L_i,
B = n(τ − 1)/(τ(n − 1)), and C = 2AΔ^inf.

4.6 Compression
We now further show that our framework is general enough to capture the convergence of stochastic gradient
quantization or compression schemes. Consider the finite-sum problem (3) and let g_i(x) be stochastic gradients
such that E[g_i(x)] = ∇f_i(x). We construct an estimator g(x) via

    g(x) = (1/n) Σ_{i=1}^n Q_i(g_i(x)),    (17)

where the Q_i are sampled independently for all i and across all iterations. Clearly, this generalizes (5). We
consider the class of ω-compression operators:
Assumption 4. We say that a stochastic operator Q = Q_ξ : ℝ^d → ℝ^d is an ω-compression operator if

    E_ξ[Q_ξ(x)] = x,    E_ξ[‖Q_ξ(x) − x‖²] ≤ ω‖x‖².    (18)

Assumption 4 is mild and is satisfied by many compression operators in the literature, including random
dithering (Alistarh et al., 2017), random sparsification, block quantization (Horváth et al., 2019), and others.
The next proposition shows that if the stochastic gradients g_i(x) themselves satisfy Assumption 2 with
their respective functions f_i, then g(x) also satisfies Assumption 2.

Proposition 4. Suppose that a stochastic gradient estimator g(x) is constructed via (17), where each Q_i is
an ω_i-compressor satisfying Assumption 4. Suppose further that each stochastic gradient g_i(x) satisfies
E[g_i(x)] = ∇f_i(x) and Assumption 2 with constants (A_i, B_i, C_i). Then there exist constants A, B, C ≥ 0
such that g(x) satisfies Assumption 2 with (A, B, C).

To the best of our knowledge, the combination of gradient compression and subsampling is not covered by any
analysis of SGD for nonconvex objectives. Hence, Proposition 4 shows that Assumption 2 is indeed versatile
enough to model practical and diverse sources of stochasticity well.

5 SGD in the Nonconvex World
5.1 General convergence theory
Our main convergence result relies on the following key lemma.
Lemma 2. Suppose that Assumptions 1 and 2 are satisfied. Choose a constant stepsize γ > 0 such that
γ ≤ 1/(BL). Then

    (1/2) Σ_{k=0}^{K−1} w_k r_k + (w_{K−1}/γ) δ_K ≤ (w_{−1}/γ) δ_0 + (LC/2) Σ_{k=0}^{K−1} w_k γ,

where r_k := E[‖∇f(x_k)‖²], w_k := w_{−1}/(1 + Lγ²A)^{k+1} for an arbitrary w_{−1} > 0, and
δ_k := E[f(x_k)] − f^inf.

Lemma 2 bounds a weighted sum of the expected squared gradient norms over the entire run of the algorithm.
This idea of weighting different iterates has been used in the analysis of SGD in the convex case (Rakhlin
et al., 2012; Shamir and Zhang, 2013; Stich, 2019), typically with the goal of returning a weighted average
x̄_K of the iterates at the end. In contrast, we use the weighting only to facilitate the proof.
Theorem 2. Suppose that Assumptions 1 and 2 hold. Suppose that a stepsize γ > 0 is chosen such that
γ ≤ 1/(LB). Letting δ_0 := f(x_0) − f^inf, we have

    min_{0≤k≤K−1} E[‖∇f(x_k)‖²] ≤ LCγ + (2(1 + Lγ²A)^K / (γK)) δ_0.
While the bound of Theorem 2 allows for a possible exponential blow-up, we can show that by carefully
controlling the stepsize we can nevertheless attain an ε-stationary point given O(ε⁻⁴) stochastic gradient
evaluations. This dependence is in fact optimal for SGD without extra assumptions on second-order smoothness
or on the stochastic gradient noise (Drori and Shamir, 2019). We use a stepsize similar to that of Ghadimi
and Lan (2013).
Corollary 1. Fix ε > 0. Choose the stepsize γ > 0 as γ = min{ 1/√(LAK), 1/(LB), ε/(2LC) }. Then provided that

    K ≥ (12δ_0 L/ε²) max{ B, 12δ_0 A/ε², 2C/ε² },    (19)

we have min_{0≤k≤K−1} E[‖∇f(x_k)‖] ≤ ε.
As a start, the iteration complexity given by (19) recovers full gradient descent: plugging in B = 1 and
A = C = 0 shows that a total of 12δ_0 Lε⁻² iterations is required to reach an ε-stationary point.
This is the standard rate of convergence for gradient descent on nonconvex objectives (Beck, 2017), up to
absolute (non-problem-specific) constants.
Plugging in A = C = 0 and letting B be any nonnegative constant recovers the fast convergence of SGD under
expected strong growth (E-SG). Our bounds are similar to those of Lei et al. (2019), but improve upon them
by recovering full gradient descent, assuming smoothness only in expectation, and attaining the optimal
O(ε⁻⁴) rate without logarithmic terms.
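A small sketch of how the stepsize and iteration bound of Corollary 1 could be computed from given constants; the numerical values are purely illustrative, and terms whose constant is zero are treated as imposing no constraint.

    import math

    def corollary1_stepsize(L, A, B, C, K, eps):
        # gamma = min{ 1/sqrt(LAK), 1/(LB), eps/(2LC) }, skipping vacuous terms.
        candidates = []
        if A > 0:
            candidates.append(1.0 / math.sqrt(L * A * K))
        if B > 0:
            candidates.append(1.0 / (L * B))
        if C > 0:
            candidates.append(eps / (2 * L * C))
        return min(candidates)

    def corollary1_iterations(L, A, B, C, delta0, eps):
        # The bound (19): K >= (12*delta0*L/eps^2) * max{B, 12*delta0*A/eps^2, 2*C/eps^2}.
        return (12 * delta0 * L / eps**2) * max(B, 12 * delta0 * A / eps**2, 2 * C / eps**2)

    L, A, B, C, delta0, eps = 1.0, 0.5, 1.0, 0.1, 1.0, 0.1  # illustrative constants
    K = math.ceil(corollary1_iterations(L, A, B, C, delta0, eps))
    print("K =", K, "gamma =", corollary1_stepsize(L, A, B, C, K, eps))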

5.2 Convergence under the Polyak-Łojasiewicz condition

One of the popular generalizations of strong convexity in the literature is the Polyak-Łojasiewicz (PL) condition
(Karimi et al., 2016; Lei et al., 2019). We first define this condition and then establish the convergence of SGD
for functions satisfying it and our (ES) assumption. In the rest of this section, we assume that the function f
has a minimizer and denote f* := min f.
Assumption 5. We say that a differentiable function f satisfies the Polyak-Łojasiewicz condition if for all
x ∈ ℝ^d,

    (1/2)‖∇f(x)‖² ≥ μ(f(x) − f*).

We rely on the following lemma, in which we use the stepsize sequence recently introduced by Stich (2019),
but without iterate averaging, as averaging in general may not make sense for nonconvex models.

Lemma 3. Consider a sequence (r_t)_t satisfying

    r_{t+1} ≤ (1 − aγ_t) r_t + cγ_t²,    (20)

where γ_t ≤ 1/b for all t ≥ 0, and a, c ≥ 0 with a ≤ b. Fix K > 0 and let k_0 = ⌈K/2⌉. Then choosing the stepsize as

    γ_t = 1/b                    if K ≤ b/a or t < k_0,
    γ_t = 2/(a(s + t − k_0))     if K ≥ b/a and t ≥ k_0,

with s = 2b/a, gives

    r_K ≤ exp(−aK/(2b)) r_0 + 9c/(a²K).
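A direct transcription of this stepsize schedule, with illustrative values of a, b, and K:

    def lemma3_stepsize(t, K, a, b):
        # Constant stepsize 1/b for short horizons or the first half of the run,
        # then a decreasing O(1/t) stepsize on the second half.
        k0 = -(-K // 2)  # ceil(K/2)
        if K <= b / a or t < k0:
            return 1.0 / b
        s = 2 * b / a
        return 2.0 / (a * (s + t - k0))

    a, b, K = 0.1, 10.0, 400  # illustrative constants
    gammas = [lemma3_stepsize(t, K, a, b) for t in range(K)]
    print(gammas[0], gammas[K // 2 - 1], gammas[K // 2], gammas[-1])  # constant, then decaying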

Using the stepsize scheme of Lemma 3, we can show that SGD finds a global solution at a 1/K rate,
where K is the total number of iterations.

Theorem 3. Suppose that Assumptions 1, 2, and 5 hold. Suppose that SGD is run for K > 0 iterations
with the stepsize sequence (γ_k)_k of Lemma 3, with γ_k ≤ min{ μ/(2AL), 1/(2BL) } for all k. Then

    E[f(x_K) − f*] ≤ 9κ_f C/(2μK) + exp(−K/(2κ_f max{κ_S, B})) (f(x_0) − f*),

where κ_f := L/μ is the condition number of f and κ_S := A/μ is the stochastic condition number.
The next corollary recovers the O(ε⁻¹) convergence rate known for strongly convex functions, which is the
optimal dependence on the accuracy ε (Nguyen et al., 2019).

Corollary 2. In the setting of Theorem 3, fix ε > 0 and let r_0 := f(x_0) − f*. Then E[f(x_K) − f*] ≤ ε
as long as

    K ≥ κ_f max{ 2κ_S log(2r_0/ε), 2B log(2r_0/ε), 9C/(2με) }.
While the dependence on ε is optimal, the situation is different when we consider the dependence on problem
constants, and in particular on κ_S: Corollary 2 shows a possibly multiplicative dependence on κ_f κ_S.
This is different for objectives where we assume convexity, as we show next. It is known that the PL condition
implies the quadratic functional growth (QFG) condition (Necoara et al., 2019), and is in fact equivalent to it
for convex and smooth objectives (Karimi et al., 2016). We adopt this assumption in conjunction with the
convexity of f for our next result.
Assumption 6. We say that a convex function f satisfies the quadratic functional growth condition if

    f(x) − f* ≥ (μ/2)‖x − π(x)‖²    (21)

for all x ∈ ℝ^d, where f* is the minimum value of f and π(x) := argmin_{y∈X*} ‖x − y‖² is the projection
onto the set of minimizers X* := {y ∈ ℝ^d | f(y) = f*}.
There are only a handful of results under QFG (Drusvyatskiy and Lewis, 2018; Necoara et al., 2019; Grimmer,
2019), and only one applies to our setting (Grimmer, 2019). For QFG in conjunction with convexity and
expected smoothness, we can prove convergence in function values similar to Theorem 3.

Theorem 4. Suppose that Assumptions 2 and 6 hold with f convex. Choose γ ≤ min{ 1/(4L), 1/(4(BL + A)) }
according to Lemma 3. Then

    E[f(x_K) − f*] ≤ 18κ_f C/(μK) + 2κ_f exp(−K/M) (f(x_0) − f*),

where M := 8 max{ 4κ_f, 4Bκ_f + κ_S }.

Theorem 4 allows stepsizes O(1/L), which are much larger than the O(1/(κL)) stepsizes in (Nguyen et al.,
2019; Grimmer, 2019). Hence, it improves upon the prior results of Nguyen et al. (2019) and Grimmer (2019)
in the context of finite-sum problems where the individual functions fi are smooth but possibly nonconvex,
and the average f is strongly convex or satisfies Assumption 6.
The following straightforward corollary of Theorem 4 shows that when convexity is assumed, we can get a
dependence on the sum of the condition numbers κf + κS rather than their product. This is a significant
difference from the nonconvex setting, and it is not known whether it is an artifact of our analysis or an
inherent difference.
Corollary 3. In the setting of Theorem 4, fix ε > 0. Then E[f(x_K) − f*] ≤ ε as long as

    K ≥ max{ 36κ_f C/(με), M log(4κ_f r_0/ε) },

where M := 32Bκ_f + 8κ_S and r_0 := f(x_0) − f*.

6 Importance Sampling and Optimal Minibatch Size


As an example application of our results, we consider importance sampling: choosing the sampling distribution
to maximize convergence speed. We consider independent sampling with replacement with minibatch size τ .
Plugging the bounds on A, B, and C from Proposition 3 into the sample complexity from Corollary 1 yields

    K ≥ (12δ_0 L/ε²) max{ 1 − τ⁻¹, (D/ε²) max_i L_i/(τnq_i) },    (22)

where D = max{12δ_0, 4Δ^inf}. Optimizing (22) over (q_i)_{i=1}^n yields the sampling distribution

    q_i* = L_i / Σ_{j=1}^n L_j.    (23)

The same sampling distribution has appeared in the literature before (Zhao and Zhang, 2015; Needell et al.,
2016), and our work is the first to give it justification for SGD on nonconvex objectives. Plugging the
distribution (23) into (22) and considering the total number of stochastic gradient evaluations K × τ, we get

    Kτ ≥ (12δ_0 L/ε²) max{ τ − 1, DL̄/ε² },

where L̄ := (1/n) Σ_{i=1}^n L_i. This is minimized over the minibatch size τ whenever τ ≤ τ* = 1 + DL̄ε⁻².
Similar expressions for importance sampling and minibatch size can be obtained for other sampling
distributions, as in (Gower et al., 2019).
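A small sketch of these two formulas; the smoothness constants L_i are illustrative inputs.

    import numpy as np

    def importance_probabilities(L_is):
        # The distribution (23): q_i proportional to L_i.
        L_is = np.asarray(L_is, dtype=float)
        return L_is / L_is.sum()

    def optimal_minibatch_bound(D, L_bar, eps):
        # Largest tau that does not increase the total work: tau* = 1 + D*L_bar/eps^2.
        return 1 + D * L_bar / eps**2

    L_is = np.array([1.0, 2.0, 10.0, 0.5])  # illustrative smoothness constants
    print("q* =", importance_probabilities(L_is))
    print("tau* =", optimal_minibatch_bound(D=4.0, L_bar=L_is.mean(), eps=0.5))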

7 Experiments
7.1 Linear regression with a nonconvex regularizer
We first consider a linear regression problem with nonconvex regularization to test the importance sampling
scheme given in Section 6:

    min_{x ∈ ℝ^d} f(x) := (1/n) Σ_{i=1}^n l_i(x) + λ Σ_{j=1}^d x_j²/(1 + x_j²),    (24)

where l_i(x) = (⟨a_i, x⟩ − y_i)², the vectors a_1, …, a_n ∈ ℝ^d are generated synthetically (as described
below), and λ = 0.1. We use n = 1000 and d = 50, and initialize x = 0. We sample minibatches of size τ = 10
with replacement and use γ = 0.1/√(LAK), where K = 5000 is the number of iterations and A is as in
Proposition 3. Similar to Needell and Ward (2017), we illustrate the utility of importance sampling by
sampling a_i from a zero-mean Gaussian of variance i, without normalizing the a_i. Since L_i ∝ ‖a_i‖², we
can expect importance sampling to outperform uniform sampling in this case. However, when we normalize,
the two methods should not be very different, and Figure 2 (showing a single evaluation run) confirms this.

Figure 2: Results on regularized linear regression with (right) and without (left) normalization. Normalization
means forcing ‖a_i‖ = 1.

Table 1: Fitted constants in the regularized logistic regression problem. Predicted constants are per
Proposition 3. The residual is the mean square error; see Section 14.1 in the supplementary material for more
discussion of this table.

                    2A       B      C       Residual
    ES, Predicted   9        0      0.994   6.113
    ES, Fit         10.09    0      0.373   0.413

                    α        β      Residual
    RG, Fit         0.38     2.09   0.57
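A minimal sketch of this experiment, using the stated choices (n = 1000, d = 50, λ = 0.1, τ = 10, K = 5000, γ = 0.1/√(LAK)); the targets y_i and the exact treatment of L and A are illustrative assumptions (here L_i ≈ 2‖a_i‖², ignoring the regularizer's curvature).

    import numpy as np

    rng = np.random.default_rng(4)
    n, d, lam, tau, K = 1000, 50, 0.1, 10, 5000

    # a_i ~ zero-mean Gaussian with variance i (no normalization), so that
    # L_i is roughly proportional to ||a_i||^2 and importance sampling should help.
    A = np.stack([np.sqrt(i + 1.0) * rng.standard_normal(d) for i in range(n)])
    y = rng.standard_normal(n)  # illustrative targets

    def grad_fi(i, x):
        # Gradient of l_i(x) + regularizer, with l_i(x) = (<a_i, x> - y_i)^2.
        reg = 2 * lam * x / (1 + x**2) ** 2  # derivative of x_j^2/(1+x_j^2), coordinatewise
        return 2 * (A[i] @ x - y[i]) * A[i] + reg

    L_is = 2 * np.einsum("ij,ij->i", A, A)   # rough smoothness constants of the losses
    q = L_is / L_is.sum()                    # importance sampling distribution (23)
    L_est = L_is.mean()
    A_es = np.max(L_is / (tau * n * q))      # constant A from Proposition 3(i)
    gamma = 0.1 / np.sqrt(L_est * A_es * K)

    x = np.zeros(d)
    for _ in range(K):
        S = rng.choice(n, size=tau, p=q)
        g = sum(grad_fi(i, x) / (n * q[i]) for i in S) / tau
        x -= gamma * g
    print("final mean squared loss:", np.mean((A @ x - y) ** 2))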

7.2 Logistic regression with a nonconvex regularizer


We now consider the regularized logistic regression problem from (Tran-Dinh et al., 2019), with the aim of
testing the fit of our Assumption 2 compared to other assumptions. The problem has the same form as (24),
but with the logistic loss l_i(x) = log(1 + exp(−a_iᵀx)) for given a_1, …, a_n ∈ ℝ^d, and λ = 0.5. We run
experiments on the a9a dataset (n = 32561 and d = 123) from LIBSVM (Chang and Lin, 2011). We fix τ = 1
and run SGD for K = 500 iterations with a stepsize γ = 1/√(LAK), as in the previous experiment. We use
uniform sampling with replacement and measure the average squared stochastic gradient norm
(1/n) Σ_{i=1}^n ‖∇f_i(x_k)‖² every five iterations, in addition to the loss and the squared gradient norm. We
then run nonnegative linear least squares to fit the data for expected smoothness (ES) and compare to relaxed
growth (RG). We also compare with theoretically estimated constants for (ES). The results are in Table 1,
where we see a tight fit between our theory and the observations. The experimental setup and estimation
details are explained more thoroughly in Section 14.1 in the supplementary material.
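A sketch of the estimation procedure described above: regress the measured second moments on the features (f(x_k) − f^inf, ‖∇f(x_k)‖², 1) under nonnegativity constraints. We use scipy.optimize.nnls; the logged quantities here are synthetic stand-ins for the actual measurements.

    import numpy as np
    from scipy.optimize import nnls

    rng = np.random.default_rng(5)
    T, f_inf = 100, 0.0

    # Synthetic stand-ins for quantities logged along an SGD run.
    f_vals = np.sort(rng.random(T))[::-1] + 0.5   # decreasing loss values f(x_k)
    grad_sq = np.sort(rng.random(T))[::-1]        # squared gradient norms
    second_moment = 9.0 * (f_vals - f_inf) + 1.0 + 0.05 * rng.random(T)

    # Fit E||g||^2 ~ 2A*(f - f_inf) + B*||grad f||^2 + C with 2A, B, C >= 0.
    features = np.column_stack([f_vals - f_inf, grad_sq, np.ones(T)])
    coef, resid = nnls(features, second_moment)
    print(f"2A = {coef[0]:.3f}, B = {coef[1]:.3f}, C = {coef[2]:.3f}, residual = {resid:.3f}")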

Acknowledgements
We thank anonymous reviewers for their helpful suggestions. Part of this work was done while the first author
was an intern at KAUST.

References
Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-Efficient
SGD via Gradient Quantization and Encoding. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,
R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30,
pages 1709–1720. Curran Associates, Inc., 2017.

Amir Beck. First-Order Methods in Optimization. Society for Industrial and Applied Mathematics,
Philadelphia, PA, 2017. doi: 10.1137/1.9781611974997.

Tal Ben-Nun and Torsten Hoefler. Demystifying Parallel and Distributed Deep Learning: An In-Depth
Concurrency Analysis. ACM Computing Surveys, 52(4), August 2019. ISSN 0360-0300. doi: 10.1145/3320060.

Dimitri P. Bertsekas and John N. Tsitsiklis. Gradient Convergence in Gradient Methods with Errors. SIAM
Journal on Optimization, 10(3):627–642, January 2000. doi: 10.1137/s1052623497331063.

Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization Methods for Large-Scale Machine Learning.
SIAM Review, 60(2):223–311, 2018. doi: 10.1137/16M1080173.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions
on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A Fast Incremental Gradient Method
with Support for Non-Strongly Convex Composite Objectives. In Proceedings of the 27th International
Conference on Neural Information Processing Systems - Volume 1, NIPS'14, pages 1646–1654, Cambridge,
MA, USA, 2014. MIT Press.

Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal Distributed Online Prediction Using
Mini-Batches. Journal of Machine Learning Research, 13:165–202, January 2012. ISSN 1532-4435.

Yoel Drori and Ohad Shamir. The Complexity of Finding Stationary Points with Stochastic Gradient Descent.
arXiv preprint arXiv:1910.01845, 2019.

Dmitriy Drusvyatskiy and Adrian S. Lewis. Error Bounds, Quadratic Growth, and Linear Convergence of
Proximal Methods. Mathematics of Operations Research, 43(3):919–948, 2018.

Cong Fang, Zhouchen Lin, and Tong Zhang. Sharp Analysis for Nonconvex SGD Escaping from Saddle Points.
In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning
Theory, volume 99 of Proceedings of Machine Learning Research, pages 1192–1234, Phoenix, USA, 25–28
Jun 2019. PMLR.

Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping From Saddle Points — Online Stochastic
Gradient for Tensor Decomposition. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings
of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages
797–842, Paris, France, 03–06 Jul 2015. PMLR.

Saeed Ghadimi and Guanghui Lan. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic
Programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013. doi: 10.1137/120880811.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. The MIT Press, 2016. ISBN 0262035618.

Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. A Unified Theory of SGD: Variance Reduction,
Sampling, Quantization and Coordinate Descent. In Silvia Chiappa and Roberto Calandra, editors,
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume
108 of Proceedings of Machine Learning Research, pages 680–690, Online, 26–28 Aug 2020. PMLR.

Robert M. Gower, Peter Richtárik, and Francis Bach. Stochastic quasi-gradient methods: variance reduction
via Jacobian sketching. Mathematical Programming, pages 1–58, 2020. ISSN 0025-5610. doi: 10.1007/
s10107-020-01506-0.

Robert Mansel Gower and Peter Richtárik. Randomized iterative methods for linear systems. SIAM Journal
on Matrix Analysis and Applications, 36(4):1660–1690, 2015.

Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik.
SGD: General Analysis and Improved Rates. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,
Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine
Learning Research, pages 5200–5209, Long Beach, California, USA, 09–15 Jun 2019. PMLR.

Benjamin Grimmer. Convergence Rates for Deterministic and Stochastic Subgradient Methods without
Lipschitz Continuity. SIAM Journal on Optimization, 29(2):1350–1365, Jan 2019. ISSN 1095-7189. doi:
10.1137/18m117306x.

Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep Learning with Limited
Numerical Precision. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37,
ICML'15, pages 1737–1746. JMLR.org, 2015.

Samuel Horváth, Dmitry Kovalev, Konstantin Mishchenko, Sebastian Stich, and Peter Richtárik. Stochastic
distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115,
2019.

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear Convergence of Gradient and Proximal-Gradient
Methods Under the Polyak-Łojasiewicz Condition. In European Conference on Machine Learning and
Knowledge Discovery in Databases - Volume 9851, ECML PKDD 2016, pages 795–811, Berlin, Heidelberg,
2016. Springer-Verlag.

Sarit Khirirat, Hamid Reza Feyzmahdavian, and Mikael Johansson. Distributed learning with compressed
gradients. arXiv preprint arXiv:1806.06573, 2018.

Dmitry Kovalev, Eduard Gorbunov, Elnur Gasanov, and Peter Richtárik. Stochastic spectral and conjugate
descent methods. In Advances in Neural Information Processing Systems, volume 31, pages 3358–3367,
2018.

Yunwen Lei, Ting Hu, Guiying Li, and Ke Tang. Stochastic Gradient Descent for Nonconvex Learning Without
Bounded Gradient Assumptions. IEEE Transactions on Neural Networks and Learning Systems, pages 1–7,
2019. ISSN 2162-2388. doi: 10.1109/TNNLS.2019.2952219.

Xiaoyu Li and Francesco Orabona. On the Convergence of Stochastic Gradient Descent with Adaptive
Stepsizes. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of Machine Learning
Research, volume 89 of Proceedings of Machine Learning Research, pages 983–992. PMLR, 16–18 Apr 2019.

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-
Efficient Learning of Deep Networks from Decentralized Data. In Aarti Singh and Jerry Zhu, editors,
Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of
Proceedings of Machine Learning Research, pages 1273–1282, Fort Lauderdale, FL, USA, 20–22 Apr 2017.
PMLR.

I. Necoara, Y. Nesterov, and F. Glineur. Linear Convergence of First Order Methods for Non-Strongly Convex
Optimization. Mathematical Programming, 175(1-2):69–107, May 2019. ISSN 0025-5610.

Deanna Needell and Rachel Ward. Batched Stochastic Gradient Descent with Weighted Sampling. In
Gregory E. Fasshauer and Larry L. Schumaker, editors, Approximation Theory XV: San Antonio 2016,
pages 279–306, Cham, 2017. Springer International Publishing.

Deanna Needell, Nathan Srebro, and Rachel Ward. Stochastic gradient descent, weighted sampling, and the
randomized Kaczmarz algorithm. Mathematical Programming, 155(1):549–573, Jan 2016. ISSN 1436-4646.
doi: 10.1007/s10107-015-0864-7.

Arkadi Nemirovsky and David B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley,
New York, 1983. ISBN 9780471103455.

Phuong Ha Nguyen, Lam Nguyen, and Marten van Dijk. Tight Dimension Independent Lower Bound on the
Expected Convergence Rate for Diminishing Step Sizes in SGD. In H. Wallach, H. Larochelle, A. Beygelzimer,
F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32,
pages 3660–3669. Curran Associates, Inc., 2019.

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making Gradient Descent Optimal for Strongly
Convex Stochastic Optimization. In Proceedings of the 29th International Conference on Machine Learning,
ICML'12, pages 1571–1578, Madison, WI, USA, 2012. Omnipress. ISBN 9781450312851.

Peter Richtárik and Martin Takáč. Stochastic Reformulations of Linear Systems: Algorithms and Convergence
Theory. arXiv preprint arXiv:1706.01108, 2017.

Karthik A. Sankararaman, Soham De, Zheng Xu, W. Ronny Huang, and Tom Goldstein. The Impact of
Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent. arXiv
preprint arXiv:1904.06963, 2019.

Mark Schmidt and Nicolas Le Roux. Fast Convergence of Stochastic Gradient Descent under a Strong Growth
Condition. arXiv preprint arXiv:1308.6370, 2013.

Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: from theory to algorithms.
Cambridge University Press, 2014. ISBN 1107057132.

Ohad Shamir and Tong Zhang. Stochastic Gradient Descent for Non-Smooth Optimization: Convergence
Results and Optimal Averaging Schemes. In Proceedings of the 30th International Conference on Machine
Learning - Volume 28, ICML'13, pages I-71–I-79. JMLR.org, 2013.

M. V. Solodov. Incremental Gradient Algorithms with Stepsizes Bounded Away from Zero. Computational
Optimization and Applications, 11(1):23–35, 1998. doi: 10.1023/a:1018366000512.

Sebastian U. Stich. Unified Optimal Analysis of the (Stochastic) Gradient Method. arXiv preprint
arXiv:1907.04232, 2019.

Sebastian U. Stich and Sai Praneeth Karimireddy. The Error-Feedback Framework: Better Rates for SGD
with Delayed Gradients and Compressed Communication. arXiv preprint arXiv:1909.05350, 2019.

Ruo-Yu Sun. Optimization for Deep Learning: An Overview. Journal of the Operations Research Society of
China, 8(2):249–294, Jun 2020. ISSN 2194-6698. doi: 10.1007/s40305-020-00309-6.

Quoc Tran-Dinh, Nhan H. Pham, Dzung T. Phan, and Lam M. Nguyen. Hybrid Stochastic Gradient Descent
Algorithms for Stochastic Nonconvex Optimization. arXiv preprint arXiv:1905.05920, 2019.

Paul Tseng. An Incremental Gradient(-Projection) Method with Momentum Term and Adaptive Stepsize
Rule. SIAM Journal on Optimization, 8(2):506–531, May 1998. doi: 10.1137/s1052623495294797.

Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and Faster Convergence of SGD for Over-Parameterized
Models and an Accelerated Perceptron. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings
of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 1195–1204.
PMLR, 16–18 Apr 2019.

Stephen A. Vavasis. Complexity Issues in Global Optimization: A Survey. In R. Horst and P. M. Pardalos,
editors, Handbook of Global Optimization. Nonconvex Optimization and Its Applications, volume 2. Springer,
Boston, Massachusetts, 1995. doi: 10.1007/978-1-4615-2025-2_2.

Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient Sparsification for Communication-Efficient
Distributed Optimization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and
R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1299–1309. Curran
Associates, Inc., 2018.

Peilin Zhao and Tong Zhang. Stochastic Optimization with Importance Sampling for Regularized Loss
Minimization. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37,
ICML'15, pages 1–9. JMLR.org, 2015.

Supplementary Material

Contents

1  Introduction
   1.1  Modelling stochasticity
   1.2  Sources of stochasticity

2  Contributions

3  Existing Models of Stochastic Gradient

4  ES in the Nonconvex World
   4.1  Brief history of expected smoothness: from convex quadratic to convex optimization
   4.2  Expected smoothness for nonconvex optimization
   4.3  Expected smoothness as the weakest assumption
   4.4  Perturbation
   4.5  Subsampling
   4.6  Compression

5  SGD in the Nonconvex World
   5.1  General convergence theory
   5.2  Convergence under the Polyak-Łojasiewicz condition

6  Importance Sampling and Optimal Minibatch Size

7  Experiments
   7.1  Linear regression with a nonconvex regularizer
   7.2  Logistic regression with a nonconvex regularizer

8  Basic Facts and Notation

9  Relations Between Assumptions
   9.1  Proof of Proposition 1
   9.2  Formal Statement and Proof of Theorem 1

10 Proofs for Sections 4.5 and 4.6
   10.1  Proof of Lemma 1
   10.2  Proof of Proposition 2
   10.3  Proof of Proposition 3
   10.4  Proof of Proposition 4

11 Proofs for General Smooth Objectives
   11.1  Proof of Lemma 2
   11.2  A descent lemma
   11.3  Proof of Theorem 2
   11.4  Proof of Corollary 1

12 Proofs Under Assumption 5
   12.1  Proof of Lemma 3
   12.2  Proof of Theorem 3

13 Proofs Under Assumption 6
   13.1  A Lemma for Full Gradient Descent
   13.2  Proof of Theorem 4

14 Experimental Details
   14.1  Logistic regression with a nonconvex regularizer

8 Basic Facts and Notation


We will use the following facts from probability theory: if X is a random vector and Y is a constant vector,
then

    E[‖Y − X‖²] = ‖Y − E[X]‖² + E[‖X − E[X]‖²],    (25)
    E[‖X − E[X]‖²] = E[‖X‖²] − ‖E[X]‖².    (26)

A consequence of (26) is the following:

    E[‖X − E[X]‖²] ≤ E[‖X‖²].    (27)

We will also make use of the following facts from linear algebra: for any a, b ∈ ℝ^d and any ζ > 0,

    2⟨a, b⟩ = ‖a‖² + ‖b‖² − ‖a − b‖²,    (28)
    ‖a‖² ≤ (1 + ζ)‖a − b‖² + (1 + ζ⁻¹)‖b‖².    (29)

For vectors X_1, X_2, …, X_n in ℝ^d, the convexity of the squared norm ‖·‖² and a straightforward application
of Jensen's inequality yield

    ‖(1/n) Σ_{i=1}^n X_i‖² ≤ (1/n) Σ_{i=1}^n ‖X_i‖².    (30)

For an L-smooth function f we have, for all x, y ∈ ℝ^d,

    f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2)‖x − y‖²,    (31)
    ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.    (32)

Given a point x ∈ ℝ^d and a stepsize γ > 0, we define the one-step gradient descent mapping

    T_γ(x) := x − γ∇f(x).    (33)
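These identities are easy to verify numerically; the distribution of X below is an arbitrary illustrative choice.

    import numpy as np

    rng = np.random.default_rng(6)
    d, N = 3, 200000
    X = rng.standard_normal((N, d)) + np.array([1.0, -2.0, 0.5])  # random vector X
    Y = np.array([0.3, 0.3, 0.3])                                  # constant vector Y
    EX = X.mean(axis=0)

    # (25): E||Y - X||^2 = ||Y - E[X]||^2 + E||X - E[X]||^2.
    lhs25 = np.mean(np.sum((Y - X) ** 2, axis=1))
    rhs25 = np.sum((Y - EX) ** 2) + np.mean(np.sum((X - EX) ** 2, axis=1))

    # (26): E||X - E[X]||^2 = E||X||^2 - ||E[X]||^2.
    lhs26 = np.mean(np.sum((X - EX) ** 2, axis=1))
    rhs26 = np.mean(np.sum(X ** 2, axis=1)) - np.sum(EX ** 2)

    print(abs(lhs25 - rhs25) < 1e-8, abs(lhs26 - rhs26) < 1e-8)  # both True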

9 Relations Between Assumptions
9.1 Proof of Proposition 1
Proof. We define the function f : ℝ → ℝ as

    f(x) = x²/2 if |x| < 1,  and  f(x) = |x| − 1/2 otherwise.

Then f is 1-smooth and lower bounded by 0. We consider SGD with the stochastic gradient

    g(x) = ∇f(x) + √|x| with probability 1/2,  and  g(x) = ∇f(x) − √|x| with probability 1/2.

Suppose that (RG) holds. Then there exist constants α and β such that

    E[‖g(x)‖²] ≤ α‖∇f(x)‖² + β.

Consider x = 2max{1, 2(α + β)}. Then |x| > 1 and hence ∇f(x) = 1 by the definition of f. Specializing
(RG) we get

    E[‖g(x)‖²] ≤ α + β.    (34)

On the other hand,

    E[‖g(x)‖²] = (1/2)((1 + √x)² + (1 − √x)²) ≥ (1/2)(1 + √x)² ≥ (1/2)|x| = max{1, 2(α + β)}.    (35)

We see that (34) and (35) are in clear contradiction. It follows that (RG) does not hold. We now show that
(ES) holds: first, suppose that |x| ≥ 1. Then

    E[‖g(x)‖²] = (1/2)((1 + √|x|)² + (1 − √|x|)²) = (1/2)(2 + 2|x|) = 1 + |x| = 3/2 + (f(x) − f^inf),    (36)

where in the last line we used that when |x| ≥ 1 we have f(x) − f^inf = |x| − 1/2. Now suppose that |x| ≤ 1.
Then

    E[‖g(x)‖²] = (1/2)((x + √|x|)² + (x − √|x|)²) = x² + |x| ≤ 1 + 1 = 2.    (37)

Combining (36) and (37), we have for all x ∈ ℝ that

    E[‖g(x)‖²] ≤ (f(x) − f^inf) + 2.

It follows that (ES) is satisfied with A = 1/2, B = 0, and C = 2. □
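The contradiction is also easy to see numerically: ‖∇f(x)‖² stays at 1 for |x| ≥ 1 while E[‖g(x)‖²] = 1 + |x| grows without bound, so no pair (α, β) can satisfy (RG). A short sketch:

    import numpy as np

    def grad_f(x):
        return x if abs(x) < 1 else np.sign(x)

    def second_moment(x):
        # E||g(x)||^2 with g(x) = grad f(x) +/- sqrt(|x|), each with probability 1/2.
        s = np.sqrt(abs(x))
        return 0.5 * ((grad_f(x) + s) ** 2 + (grad_f(x) - s) ** 2)

    for x in [1.0, 10.0, 100.0, 1000.0]:
        print(x, grad_f(x) ** 2, second_moment(x))  # gradient stays bounded, moment grows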

9.2 Formal Statement and Proof of Theorem 1
Theorem 1 (Formal). Suppose that f : ℝ^d → ℝ is L-smooth. Then the following relations hold:

1. The maximal strong growth condition (M-SG) implies the expected strong growth condition (E-SG).
2. The expected strong growth condition (E-SG) implies the relaxed growth condition (RG).
3. Bounded stochastic gradient variance (BV) implies the relaxed growth condition (RG).
4. The gradient confusion bound (GC) for finite-sum problems f = (1/n) Σ_{i=1}^n f_i implies the relaxed
growth condition (RG).
5. The relaxed growth condition (RG) implies the expected smoothness condition (ES).
6. If f has the expectation structure (7), then the sure-smoothness assumption (SS) implies the expected
smoothness condition (ES).
Proof. 1. Suppose that (M-SG) holds. Then

    E[‖g(x)‖²] ≤ E[α‖∇f(x)‖²] = α‖∇f(x)‖².

Hence (E-SG) holds.

2. If (E-SG) holds, then taking β = 0 shows that (RG) holds.

3. If (BV) holds, then by (6), taking α = 1 and β = σ² shows that (RG) holds.
4. We expand the definition of ‖∇f(x)‖²:

    ‖∇f(x)‖² = ⟨ Σ_{i=1}^n ∇f_i(x)/n, Σ_{j=1}^n ∇f_j(x)/n ⟩
             = (1/n²) Σ_{i=1}^n Σ_{j=1}^n ⟨∇f_i(x), ∇f_j(x)⟩
             = (1/n²) ( Σ_{i=1}^n ‖∇f_i(x)‖² + Σ_{i=1}^n Σ_{j=1, j≠i}^n ⟨∇f_i(x), ∇f_j(x)⟩ )
             ≥ (1/n) ( E_i[‖∇f_i(x)‖²] − η(n² − n)/n )
             = (1/n) E_i[‖∇f_i(x)‖²] − η(1 − 1/n).

Rearranging, we get

    E_i[‖∇f_i(x)‖²] ≤ n‖∇f(x)‖² + η(n − 1).

Hence (RG) holds with α = n and β = η(n − 1).
5. Putting A = 0, B = α and C = β shows that (ES) is satisfied.
6. Note that f_ξ(x) is almost surely bounded from below by 0, and that, by (SS), f_ξ is almost surely
L-smooth. Then, using Lemma 1, we have almost surely

    ‖∇f_ξ(x)‖² ≤ 2L f_ξ(x).

Let f^inf be the infimum of f (which necessarily exists, since f(x) = E[f_ξ(x)] ≥ 0). Then

    ‖∇f_ξ(x)‖² ≤ 2L(f_ξ(x) − f^inf) + 2Lf^inf.

Taking expectations and using that f(x) = E_{ξ∼D}[f_ξ(x)], we see that (ES) is satisfied with A = L, B = 0,
and C = 2Lf^inf. □

10 Proofs for Section 4.5 and 4.6
10.1 Proof of Lemma 1
Proof. If we let x⁺ = x − (1/L)∇f(x), then using the L-smoothness of f we get

    f(x⁺) ≤ f(x) + ⟨∇f(x), x⁺ − x⟩ + (L/2)‖x⁺ − x‖².

Using that f^inf ≤ f(x⁺) and the definition of x⁺, we have

    f^inf ≤ f(x⁺) ≤ f(x) − (1/L)‖∇f(x)‖² + (1/2L)‖∇f(x)‖² = f(x) − (1/2L)‖∇f(x)‖².

Rearranging we get the claim. □
10.2 Proof of Proposition 2


Proof. We start with the definition of ∇f_v(·), then use the convexity of the squared norm ‖·‖² and the linearity of expectation:

    E[‖∇f_v(x)‖²] = E[‖(1/n) ∑_{i=1}^n v_i ∇f_i(x)‖²]
                  ≤ (1/n) ∑_{i=1}^n E[‖v_i ∇f_i(x)‖²] = (1/n) ∑_{i=1}^n E[v_i²] ‖∇f_i(x)‖²,

where the inequality is (30). We now use Lemma 1 and the nonnegativity of f_i(x) − f_i^inf (since f_i^inf is a lower bound on f_i):

    E[‖∇f_v(x)‖²] ≤ (2/n) ∑_{i=1}^n L_i E[v_i²] (f_i(x) − f_i^inf)
                  ≤ (2 max_i (L_i E[v_i²])/n) ∑_{i=1}^n (f_i(x) − f_i^inf)
                  = (2 max_i (L_i E[v_i²])/n) ∑_{i=1}^n (f_i(x) − f^inf + f^inf − f_i^inf)
                  = 2 max_i (L_i E[v_i²]) (f(x) − f^inf) + 2 max_i (L_i E[v_i²]) (1/n) ∑_{i=1}^n (f^inf − f_i^inf).

It remains to notice that since f(x) = (1/n) ∑_{i=1}^n f_i(x) ≥ (1/n) ∑_{i=1}^n f_i^inf, the quantity (1/n) ∑_{i=1}^n f_i^inf is a lower bound on f, and hence

    Δ^inf := f^inf − (1/n) ∑_{i=1}^n f_i^inf ≥ 0,

as f^inf is the infimum (greatest lower bound) of {f(x) | x ∈ R^d} by definition. □
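To see Δ^inf at work on a concrete instance, consider the toy finite sum f_i(x) = (x − c_i)²/2: each f_i^inf = 0, while f^inf = Var(c)/2 ≥ 0, so Δ^inf = Var(c)/2. A small check (ours, with arbitrary data):

```python
# Toy check (ours) that Delta_inf = f^inf - (1/n) sum_i f_i^inf is nonnegative:
# with f_i(x) = (x - c_i)^2 / 2 we have f_i^inf = 0 and f^inf = Var(c)/2.
import numpy as np

rng = np.random.default_rng(0)
c = rng.normal(size=10)     # minimizers of the individual f_i
f_inf = 0.5 * np.var(c)     # f is minimized at mean(c) with value Var(c)/2
delta_inf = f_inf - 0.0     # each f_i^inf equals 0
print(f"Delta_inf = {delta_inf:.4f} >= 0")
```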

10.3 Proof of Proposition 3


Proof. From the definition of ∇f_v and the linearity of expectation,

    E[‖∇f_v(x)‖²] = E[‖(1/n) ∑_{i=1}^n v_i ∇f_i(x)‖²] = E[⟨(1/n) ∑_{i=1}^n v_i ∇f_i(x), (1/n) ∑_{j=1}^n v_j ∇f_j(x)⟩]
                  = (1/n²) ∑_{i=1}^n ∑_{j=1}^n E[v_i v_j] ⟨∇f_i(x), ∇f_j(x)⟩.    (38)

We now consider each of the specific samplings individually:

(i) Independent sampling with replacement: we write

    v_i = S_i/(τ q_i) = (1/(τ q_i)) ∑_{k=1}^τ S_{i,k},    (39)

where S_{i,k} = 1 if the k-th dice roll resulted in i, and S_{i,k} = 0 otherwise; clearly E[S_{i,k}] = q_i. Then,

    E[v_i v_j] = (1/(τ² q_i q_j)) ∑_{k=1}^τ ∑_{t=1}^τ E[S_{i,k} S_{j,t}].    (40)

It is straightforward to see by the independence of the dice rolls that

    E[S_{i,k} S_{j,t}] = q_i q_j if k ≠ t;  q_i if k = t and i = j;  0 if k = t and i ≠ j.

Using this in (40) yields

    E[v_i v_j] = 1 − 1/τ if i ≠ j,  and  E[v_i v_j] = 1 − 1/τ + 1/(τ q_i) otherwise.
Plugging the last equality into (38), we get

    E[‖∇f_v(x)‖²] = (1 − 1/τ)(1/n²) ∑_{i=1}^n ∑_{j≠i} ⟨∇f_i(x), ∇f_j(x)⟩ + (1/n²) ∑_{i=1}^n (1 − 1/τ + 1/(τ q_i)) ‖∇f_i(x)‖²
                  = (1 − 1/τ)(1/n²) ∑_{i=1}^n ∑_{j=1}^n ⟨∇f_i(x), ∇f_j(x)⟩ + (1/n²) ∑_{i=1}^n ‖∇f_i(x)‖²/(τ q_i)    (41)
                  = (1 − 1/τ) ⟨∑_{i=1}^n ∇f_i(x)/n, ∑_{j=1}^n ∇f_j(x)/n⟩ + (1/n²) ∑_{i=1}^n ‖∇f_i(x)‖²/(τ q_i)
                  = (1 − 1/τ) ‖∇f(x)‖² + (1/n²) ∑_{i=1}^n ‖∇f_i(x)‖²/(τ q_i).

Using Lemma 1 on each f_i,

    E[‖∇f_v(x)‖²] ≤ (1 − 1/τ) ‖∇f(x)‖² + (2/n) ∑_{i=1}^n (L_i/(nτ q_i)) (f_i(x) − f_i^inf)
                  ≤ (1 − 1/τ) ‖∇f(x)‖² + 2 max_i (L_i/(nτ q_i)) (1/n) ∑_{i=1}^n (f_i(x) − f_i^inf).

It remains to use that (1/n) ∑_{i=1}^n (f_i(x) − f_i^inf) = f(x) − f^inf + Δ^inf; the nonnegativity of Δ^inf was proved in Proposition 2.
(ii) Independent sampling without replacement: it is not difficult to see that

    E[v_i v_j] = 1 if i ≠ j,  and  E[v_i v_j] = 1/p_i if i = j.
Using this in (38),

    E[‖∇f_v(x)‖²] = (1/n²) ∑_{i=1}^n ∑_{j≠i} ⟨∇f_i(x), ∇f_j(x)⟩ + (1/n²) ∑_{i=1}^n (1/p_i) ‖∇f_i(x)‖²
                  = (1/n²) ∑_{i=1}^n ∑_{j=1}^n ⟨∇f_i(x), ∇f_j(x)⟩ + (1/n²) ∑_{i=1}^n (1/p_i − 1) ‖∇f_i(x)‖².

Continuing similarly to the previous sampling (from (41) onward), we get the required claim.
(iii) τ-nice sampling without replacement: it is not difficult to see by elementary combinatorics that

    Prob(i ∈ S and j ∈ S) = τ/n if i = j,  and  τ(τ − 1)/(n(n − 1)) otherwise.

From this we can easily compute E[v_i v_j]. Substituting into (38) and proceeding as in the previous two cases yields the proposition's claim. □
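The E[v_i v_j] formulas above are straightforward to validate by Monte Carlo. Below is a sketch (ours; the probabilities q and trial count are arbitrary choices) for independent sampling with replacement:

```python
# Monte Carlo check (ours) of E[v_i v_j] for independent sampling with replacement:
# expected 1 - 1/tau off the diagonal and 1 - 1/tau + 1/(tau q_i) on it.
import numpy as np

n, tau, trials = 5, 3, 200_000
q = np.array([0.1, 0.15, 0.2, 0.25, 0.3])          # sampling probabilities, sum to 1
rng = np.random.default_rng(1)
rolls = rng.choice(n, size=(trials, tau), p=q)     # tau independent dice rolls per trial
S = np.stack([(rolls == i).sum(axis=1) for i in range(n)], axis=1)
v = S / (tau * q)                                  # the sampling vector v
empirical = (v[:, :, None] * v[:, None, :]).mean(axis=0)
theory = np.full((n, n), 1 - 1 / tau) + np.diag(1 / (tau * q))
print("max abs deviation:", np.abs(empirical - theory).max())  # small, up to MC error
```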


10.4 Proof of Proposition 4


Proof. First, it is easy to see that g(x_k) is an unbiased estimator, using the tower property of expectation and Assumption 4:

    E[g(x_k)] = (1/n) ∑_{i=1}^n E[Q_{i,k}(g_i(x_k))] = (1/n) ∑_{i=1}^n E[E[Q_{i,k}(g_i(x_k)) | g_i(x_k)]]
              = (1/n) ∑_{i=1}^n E[g_i(x_k)] = (1/n) ∑_{i=1}^n ∇f_i(x_k) = ∇f(x_k).
Second, to show that Assumption 2 is satisfied, we have, for expectation conditional on g_1(x_k), g_2(x_k), ..., g_n(x_k):

    E[‖g(x_k)‖²] = E[‖(1/n) ∑_{i=1}^n Q_{i,k}(g_i(x_k))‖²]
                 = E[‖(1/n) ∑_{i=1}^n (Q_{i,k}(g_i(x_k)) − g_i(x_k))‖²] + ‖(1/n) ∑_{i=1}^n g_i(x_k)‖²,

where the second equality is (26).
Since Q_{1,k}, Q_{2,k}, ..., Q_{n,k} are independent by definition, the variance decomposes:

    E[‖g(x_k)‖²] = (1/n²) ∑_{i=1}^n E[‖Q_{i,k}(g_i(x_k)) − g_i(x_k)‖²] + ‖(1/n) ∑_{i=1}^n g_i(x_k)‖²
                 ≤ (1/n²) ∑_{i=1}^n ω_i ‖g_i(x_k)‖² + ‖(1/n) ∑_{i=1}^n g_i(x_k)‖²,    (42)

where the inequality is (18).
We now take expectation with respect to the randomness in the g_i(x_k). We consider the second term in (42), where once again the variance decomposes by the independence of g_1(x_k), g_2(x_k), ..., g_n(x_k):

    E[‖(1/n) ∑_{i=1}^n g_i(x_k)‖²] = E[‖(1/n) ∑_{i=1}^n (g_i(x_k) − ∇f_i(x_k))‖²] + ‖∇f(x_k)‖²    (by (26))
                                   = (1/n²) ∑_{i=1}^n E[‖g_i(x_k) − ∇f_i(x_k)‖²] + ‖∇f(x_k)‖²
                                   ≤ (1/n²) ∑_{i=1}^n E[‖g_i(x_k)‖²] + ‖∇f(x_k)‖²,    (43)

where the last inequality is (27).
Combining (42) and (43),

    E[‖g(x_k)‖²] ≤ (1/n²) ∑_{i=1}^n (1 + ω_i) E[‖g_i(x_k)‖²] + ‖∇f(x_k)‖².    (44)
For the first term in (44), we have, using Assumption 2 and then Lemma 1,

    (1/n²) ∑_{i=1}^n (1 + ω_i) E[‖g_i(x_k)‖²] ≤ (1/n²) ∑_{i=1}^n (1 + ω_i) (2A_i (f_i(x_k) − f_i^inf) + B_i ‖∇f_i(x_k)‖² + C_i)
                                              ≤ (1/n²) ∑_{i=1}^n (1 + ω_i) (2(A_i + B_i L_i)(f_i(x_k) − f_i^inf) + C_i)
                                              ≤ (1/n) ∑_{i=1}^n 2A (f_i(x_k) − f_i^inf) + (1/n²) ∑_{i=1}^n (1 + ω_i) C_i
                                              = 2A (f(x_k) − f^inf) + C,    (45)

where A := max_i (1 + ω_i)(A_i + B_i L_i)/n and C := 2AΔ^inf + (1/n²) ∑_{i=1}^n (1 + ω_i) C_i, and where Δ^inf is defined as in Proposition 2. Combining (45) with (44) we finally get

    E[‖g(x_k)‖²] ≤ 2A (f(x_k) − f^inf) + ‖∇f(x_k)‖² + C,

which shows that Assumption 2 is satisfied (with B = 1). □
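A concrete compression operator covered by this proposition is Rand-k sparsification, which keeps k random coordinates rescaled by d/k; it is unbiased with variance parameter ω = d/k − 1. A quick empirical check (ours):

```python
# Empirical check (ours) that Rand-k sparsification is unbiased and satisfies
# E||Q(x) - x||^2 = omega ||x||^2 with omega = d/k - 1.
import numpy as np

d, k, trials = 10, 3, 100_000
rng = np.random.default_rng(2)
x = rng.normal(size=d)

def rand_k(x):
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = (d / k) * x[idx]   # rescale kept coordinates so that E[Q(x)] = x
    return out

samples = np.stack([rand_k(x) for _ in range(trials)])
omega = d / k - 1
print("bias norm:      ", np.linalg.norm(samples.mean(axis=0) - x))
print("empirical var:  ", ((samples - x) ** 2).sum(axis=1).mean())
print("omega * ||x||^2:", omega * (x ** 2).sum())
```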

11 Proofs for General Smooth Objectives
11.1 Proof of Lemma 2
Proof. We start with the L-smoothness of f, which implies

    f(x_{k+1}) ≤ f(x_k) + ⟨∇f(x_k), x_{k+1} − x_k⟩ + (L/2)‖x_{k+1} − x_k‖²
              = f(x_k) − γ⟨∇f(x_k), g(x_k)⟩ + (Lγ²/2)‖g(x_k)‖².    (46)
Taking expectations in (46) conditional on x_k, and using Assumption 2, we get

    E[f(x_{k+1}) | x_k] = f(x_k) − γ‖∇f(x_k)‖² + (Lγ²/2) E[‖g(x_k)‖²]
                        ≤ f(x_k) − γ‖∇f(x_k)‖² + (Lγ²/2)(2A(f(x_k) − f^inf) + B‖∇f(x_k)‖² + C)
                        = f(x_k) − γ(1 − LBγ/2)‖∇f(x_k)‖² + Lγ²A(f(x_k) − f^inf) + Lγ²C/2,

where the inequality is (ES).

Subtracting f^inf from both sides gives

    E[f(x_{k+1}) | x_k] − f^inf ≤ (1 + Lγ²A)(f(x_k) − f^inf) − γ(1 − LBγ/2)‖∇f(x_k)‖² + Lγ²C/2;

taking expectation again, using the tower property, and rearranging, we get

    γ(1 − LBγ/2) E[‖∇f(x_k)‖²] ≤ (1 + Lγ²A) E[f(x_k) − f^inf] − E[f(x_{k+1}) − f^inf] + Lγ²C/2.
Letting δ_k := E[f(x_k) − f^inf] and r_k := E[‖∇f(x_k)‖²], we can rewrite the last inequality as

    γ(1 − LBγ/2) r_k ≤ (1 + Lγ²A) δ_k − δ_{k+1} + Lγ²C/2.

Our choice of stepsize guarantees that 1 − LBγ/2 ≥ 1/2. As such,

    (γ/2) r_k ≤ (1 + Lγ²A) δ_k − δ_{k+1} + Lγ²C/2.    (47)
We now follow Stich (2019) and define a weighting sequence w_0, w_1, w_2, ..., w_K that is used to weight the terms in (47). However, unlike Stich (2019), we are interested in the weighting sequence solely as a proof technique, and it does not show up in the final bounds. Fix w_{−1} > 0 and define w_k = w_{k−1}/(1 + Lγ²A) for all k ≥ 0. Multiplying (47) by w_k/γ,

    (1/2) w_k r_k ≤ (w_k(1 + Lγ²A)/γ) δ_k − (w_k/γ) δ_{k+1} + LC w_k γ/2
                  = (w_{k−1}/γ) δ_k − (w_k/γ) δ_{k+1} + LC w_k γ/2.
Summing up both sides for k = 0, 1, ..., K − 1 and telescoping, we have

    (1/2) ∑_{k=0}^{K−1} w_k r_k ≤ (w_{−1}/γ) δ_0 − (w_{K−1}/γ) δ_K + (LC γ/2) ∑_{k=0}^{K−1} w_k.

Rearranging we get the lemma's statement. □

11.2 A descent lemma
Lemma 4. Under Assumption 1 we have, for any γ > 0,

    f(x) − f(T_γ(x)) ≤ (γ/2)(3Lγ + 2)‖∇f(x)‖²,    (48)

where T_γ(x) := x − γ∇f(x).
Proof. Using the L-smoothness of f,

    f(x) − f(T_γ(x)) ≤ ⟨∇f(T_γ(x)), x − T_γ(x)⟩ + (L/2)‖x − T_γ(x)‖²
                     = γ⟨∇f(T_γ(x)), ∇f(x)⟩ + (Lγ²/2)‖∇f(x)‖²    (by (33))
                     = (γ/2)(‖∇f(T_γ(x))‖² + ‖∇f(x)‖² − ‖∇f(T_γ(x)) − ∇f(x)‖²) + (Lγ²/2)‖∇f(x)‖²    (by (28))
                     ≤ (γ/2)(ζ‖∇f(T_γ(x)) − ∇f(x)‖² + (2 + ζ^{−1})‖∇f(x)‖²) + (Lγ²/2)‖∇f(x)‖²    (by (29))
                     ≤ (γ/2)(L²ζ‖T_γ(x) − x‖² + (2 + ζ^{−1})‖∇f(x)‖²) + (Lγ²/2)‖∇f(x)‖²    (by (32))
                     = (γ/2)(L²ζγ² + ζ^{−1} + 2 + Lγ)‖∇f(x)‖²    (by (33)).

Putting ζ = 1/(Lγ) yields

    f(x) − f(T_γ(x)) ≤ (γ/2)(3Lγ + 2)‖∇f(x)‖². □
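Inequality (48) holds for any stepsize, which is easy to spot-check, for instance on a random L-smooth quadratic (a sketch of ours; any smooth test function would do):

```python
# Spot-check (ours) of the descent bound (48) on f(x) = x^T H x / 2:
# the one-step decrease never exceeds (gamma/2)(3 L gamma + 2)||grad f(x)||^2.
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(5, 5))
H = M.T @ M                              # positive semidefinite Hessian
L = np.linalg.eigvalsh(H).max()          # smoothness constant

f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x
for gamma in [0.1 / L, 1 / L, 10 / L]:
    x = rng.normal(size=5)
    decrease = f(x) - f(x - gamma * grad(x))
    bound = 0.5 * gamma * (3 * L * gamma + 2) * (grad(x) @ grad(x))
    print(f"gamma*L = {gamma * L:5.1f}: decrease = {decrease:9.4f} <= bound = {bound:9.4f}")
```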
2

11.3 Proof of Theorem 2


Proof. We start with Lemma 2,

    (1/2) ∑_{k=0}^{K−1} w_k r_k ≤ (1/2) ∑_{k=0}^{K−1} w_k r_k + (w_{K−1}/γ) δ_K ≤ (w_{−1}/γ) δ_0 + (LC γ/2) ∑_{k=0}^{K−1} w_k.

Let W_K = ∑_{k=0}^{K−1} w_k. Dividing both sides by W_K we have

    (1/2) min_{0≤k≤K−1} r_k ≤ (1/(2W_K)) ∑_{k=0}^{K−1} w_k r_k ≤ (w_{−1} δ_0)/(W_K γ) + LCγ/2.    (49)

Note that

    W_K = ∑_{k=0}^{K−1} w_k ≥ K min_{0≤i≤K−1} w_i = K w_{K−1} = K w_{−1}/(1 + Lγ²A)^K.    (50)

Using this in (49),

    (1/2) min_{0≤k≤K−1} r_k ≤ ((1 + Lγ²A)^K/(γK)) δ_0 + LCγ/2.

Multiplying both sides by 2 yields the theorem's claim. □

11.4 Proof of Corollary 1
Proof. From Theorem 2, under the condition that γ ≤ 1/(BL),

    min_{0≤k≤K−1} E[‖∇f(x_k)‖²] ≤ LCγ + (2(1 + Lγ²A)^K/(γK))(f(x_0) − f^inf).    (51)

Using the fact that 1 + x ≤ exp(x), we have that

    (1 + Lγ²A)^K ≤ exp(Lγ²AK) ≤ exp(1) ≤ 3,

where the second inequality holds because γ ≤ 1/√(LAK) by assumption. Substituting in (51) we get

    min_{0≤k≤K−1} E[‖∇f(x_k)‖²] ≤ LCγ + (6/(γK))(f(x_0) − f^inf).    (52)

To make the right-hand side of (52) smaller than ε², we require that the first term satisfies

    LCγ ≤ ε²/2  ⟹  γ ≤ ε²/(2LC).

Similarly, for the second term, we get that the number of iterations must satisfy

    6δ_0/(γK) ≤ ε²/2  ⟹  K ≥ 12δ_0/(γε²).    (53)
Note that we have three requirements on the stepsize γ so far:

    γ ≤ 1/(BL),    γ ≤ ε²/(2LC),    γ ≤ 1/√(LAK).

Plugging each of the previous bounds on the stepsize into (53),

    K ≥ 12δ_0 BL/ε²,    K ≥ 24δ_0 LC/ε⁴,    K ≥ 12δ_0 √(LAK)/ε².    (54)

Since K appears on both sides of the last requirement, by cancellation and squaring we simplify (54) to

    K ≥ 12δ_0 BL/ε²,    K ≥ 24δ_0 LC/ε⁴,    K ≥ 12²δ_0² LA/ε⁴.

Finally, we collect the terms into a single bound:

    K ≥ (12δ_0 L/ε²) max{B, 2C/ε², 12δ_0 A/ε²}. □
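The stepsize and iteration count prescribed by this proof can be computed mechanically. A small helper (ours; names are illustrative, not from the paper):

```python
# Sketch (ours) instantiating Corollary 1: iteration count K and stepsize
# gamma = min{1/(B L), eps^2/(2 L C), 1/sqrt(L A K)} from the proof.
import math

def corollary1_schedule(L, A, B, C, delta0, eps):
    K = math.ceil((12 * delta0 * L / eps**2)
                  * max(B, 2 * C / eps**2, 12 * delta0 * A / eps**2))
    gamma = min(1 / (B * L) if B > 0 else math.inf,
                eps**2 / (2 * L * C) if C > 0 else math.inf,
                1 / math.sqrt(L * A * K) if A > 0 else math.inf)
    return K, gamma

K, gamma = corollary1_schedule(L=1.0, A=1.0, B=1.0, C=1.0, delta0=1.0, eps=0.1)
print(f"K = {K}, gamma = {gamma:.2e}")
```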

12 Proofs Under Assumption 5
12.1 Proof of Lemma 3
Proof. This proof loosely follows Lemma 3 in (Stich, 2019), without using the (s_t)_t sequence, as it is not relevant to our bounds, and without exponential averaging. If K ≤ b/a, then starting from (20) we have, for γ = 1/b,

    r_K ≤ (1 − aγ) r_{K−1} + γ²c.

Recursing the above inequality we get

    r_K ≤ (1 − aγ)^K r_0 + γ²c ∑_{i=0}^{K−1} (1 − aγ)^i
        ≤ (1 − aγ)^K r_0 + γc/a
        ≤ exp(−aγK) r_0 + γc/a
        = exp(−aK/b) r_0 + c/(ab)
        ≤ exp(−aK/b) r_0 + c/(a²K),    (55)

where in the last line we used that K ≤ b/a. If K ≥ b/a, then first we have

    r_{k_0} ≤ exp(−aK/(2b)) r_0 + c/(ab).    (56)

Then for t > k_0,

    r_t ≤ (1 − aγ_t) r_{t−1} + cγ_t² = ((s + t − k_0 − 2)/(s + t − k_0)) r_{t−1} + 4c/(a²(s + t − k_0)²).

Multiplying both sides by (s + t − k_0)²,

    (s + t − k_0)² r_t ≤ (s + t − k_0 − 2)(s + t − k_0) r_{t−1} + 4c/a²
                       = ((s + t − k_0 − 1)² − 1) r_{t−1} + 4c/a²
                       ≤ (s + t − k_0 − 1)² r_{t−1} + 4c/a².

Let w_t = (s + t − k_0)². Then,

    w_t r_t ≤ w_{t−1} r_{t−1} + 4c/a².
Summing up for t = k_0 + 1, k_0 + 2, ..., K and telescoping we get

    w_K r_K ≤ w_{k_0} r_{k_0} + 4c(K − k_0)/a² = s² r_{k_0} + 4c(K − k_0)/a².

Dividing both sides by w_K and using that s ≥ 0, so that w_K = (s + K − k_0)² ≥ (K − k_0)²,

    r_K ≤ s² r_{k_0}/w_K + 4c(K − k_0)/(a² w_K) ≤ s² r_{k_0}/(K − k_0)² + 4c/(a²(K − k_0)).    (57)

By the definition of k_0 we have K − k_0 ≥ K/2. Plugging this estimate into (57),

    r_K ≤ 4s² r_{k_0}/K² + 8c/(a²K).    (58)

Using (56) and the fact that K² ≥ b²/a² = 4s², we have

    (4s²/K²) r_{k_0} ≤ (4s²/K²)(exp(−aK/(2b)) r_0 + c/(ab))    (59)
                     = (4s²/K²) exp(−aK/(2b)) r_0 + 4cs²/(abK²)    (60)
                     ≤ exp(−aK/(2b)) r_0 + 4cs²/(abK²).    (61)

For the second term in (61), note that since K ≥ b/a,

    4s²/(bK) = (4s²/b)(1/K) ≤ (4s²/b)(a/b) = 1/a.    (62)

We substitute this into (61),

    (4s²/K²) r_{k_0} ≤ exp(−aK/(2b)) r_0 + c/(a²K).    (63)

Substituting (63) in (58),

    r_K ≤ exp(−aK/(2b)) r_0 + 9c/(a²K).    (64)

It remains to take the maximum of the two bounds (64) and (55). □
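The recursion and its two-phase stepsize can also be simulated directly. The sketch below (ours; the exact indexing of the decreasing phase is one reasonable reading of the proof, with k_0 = K/2 and s = b/(2a)) iterates the recursion with equality and compares against the bound (64):

```python
# Simulation (ours) of r_{t+1} = (1 - a*gamma_t) r_t + c*gamma_t^2 under the
# two-phase stepsize of Lemma 3, versus the bound exp(-aK/(2b)) r_0 + 9c/(a^2 K).
import math

a, b, c, r0, K = 0.5, 10.0, 2.0, 5.0, 1000   # requires K >= b/a
k0, s = K // 2, b / (2 * a)
r = r0
for t in range(K):
    gamma = 1 / b if t < k0 else 2 / (a * (s + t + 1 - k0))
    r = (1 - a * gamma) * r + c * gamma**2
bound = math.exp(-a * K / (2 * b)) * r0 + 9 * c / (a**2 * K)
print(f"r_K = {r:.4f} <= bound = {bound:.4f}")
```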

12.2 Proof of Theorem 3


Proof. Using the L-smoothness of f,

    f(x_{k+1}) − f* ≤ f(x_k) − f* + ⟨∇f(x_k), x_{k+1} − x_k⟩ + (L/2)‖x_{k+1} − x_k‖²
                   = f(x_k) − f* − γ_k⟨∇f(x_k), g(x_k)⟩ + (Lγ_k²/2)‖g(x_k)‖².
Taking expectation conditional on x_k and using Assumption 2,

    E[f(x_{k+1}) − f*] ≤ f(x_k) − f* − γ_k‖∇f(x_k)‖² + (Lγ_k²/2) E[‖g(x_k)‖²]
                       ≤ f(x_k) − f* − γ_k(1 − LBγ_k/2)‖∇f(x_k)‖² + Lγ_k²A(f(x_k) − f*) + Lγ_k²C/2,

where the second inequality uses (ES).
Using 1 − LBγ_k/2 ≥ 3/4 and Assumption 5,

    E[f(x_{k+1}) − f*] ≤ (1 − 3γ_kµ/2 + Lγ_k²A)(f(x_k) − f*) + Lγ_k²C/2.

Our choice of stepsize implies Lγ_kA ≤ µ/2, hence

    E[f(x_{k+1}) − f*] ≤ (1 − γ_kµ)(f(x_k) − f*) + Lγ_k²C/2.

Taking unconditional expectations and letting r_k = E[f(x_k) − f*], we have

    r_{k+1} ≤ (1 − γ_kµ) r_k + Lγ_k²C/2.

Applying Lemma 3 with a = µ, b = max{2κ_f A, 2LB}, and c = LC/2 yields

    E[f(x_K) − f*] ≤ exp(−µK/max{2κ_f A, 2LB})(f(x_0) − f*) + 9LC/(2µ²K).

Using the definition of κ_S yields the theorem's statement. □

13 Proofs Under Assumption 6
13.1 A Lemma for Full Gradient Descent
Lemma 5. Under Assumption 6 and for T_γ(x) = x − γ∇f(x) with γ ≤ 1/L, we have

    ‖T_γ(x) − π(x)‖² ≤ ‖x − π(x)‖² − (1 − Lγ)γ²‖∇f(x)‖² − 2γ(f(T_γ(x)) − f*).    (65)
Proof. This is the second-to-last step in the proof of Theorem 12 in (Necoara et al., 2019), and we reproduce it for completeness. Let y ∈ R^d. Then, using the definition of T_γ(x),

    ‖T_γ(x) − y‖² = ‖T_γ(x) − x + x − y‖²    (66)
                  = ‖x − y‖² + 2⟨x − y, T_γ(x) − x⟩ + ‖T_γ(x) − x‖²
                  = ‖x − y‖² + 2⟨T_γ(x) − y, T_γ(x) − x⟩ − ‖T_γ(x) − x‖²
                  = ‖x − y‖² − 2γ⟨∇f(x), T_γ(x) − y⟩ − ‖T_γ(x) − x‖²    (by (33))
                  = ‖x − y‖² − 2γ(⟨∇f(x), T_γ(x) − y⟩ + (L/2)‖T_γ(x) − x‖² + (1/(2γ) − L/2)‖T_γ(x) − x‖²)
                  = ‖x − y‖² + (Lγ − 1)‖T_γ(x) − x‖² − 2γ(⟨∇f(x), T_γ(x) − y⟩ + (L/2)‖T_γ(x) − x‖²).    (67)

We isolate the third term in (67):

    ⟨∇f(x), T_γ(x) − y⟩ + (L/2)‖T_γ(x) − x‖² = ⟨∇f(x), x − y⟩ + ⟨∇f(x), T_γ(x) − x⟩ + (L/2)‖T_γ(x) − x‖².    (68)

Since f is L-smooth,

    f(T_γ(x)) − f(x) ≤ ⟨∇f(x), T_γ(x) − x⟩ + (L/2)‖T_γ(x) − x‖².

Using this in (68) and using convexity,

    ⟨∇f(x), T_γ(x) − y⟩ + (L/2)‖T_γ(x) − x‖² ≥ ⟨∇f(x), x − y⟩ + f(T_γ(x)) − f(x)
                                              ≥ f(x) − f(y) + f(T_γ(x)) − f(x) = f(T_γ(x)) − f(y).    (69)

Using (69) in (67),

    ‖T_γ(x) − y‖² ≤ ‖x − y‖² − (1 − Lγ)‖T_γ(x) − x‖² − 2γ(f(T_γ(x)) − f(y)).

It remains to use that T_γ(x) − x = −γ∇f(x) and to put y = π(x). □
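Inequality (65) can be spot-checked on a strongly convex quadratic, where the projection π(x) onto the solution set is simply the unique minimizer (a sketch of ours):

```python
# Spot-check (ours) of inequality (65) on f(x) = (x - x*)^T H (x - x*)/2 with f* = 0.
import numpy as np

rng = np.random.default_rng(4)
M = rng.normal(size=(4, 4))
H = M.T @ M + 0.1 * np.eye(4)            # positive definite Hessian
L = np.linalg.eigvalsh(H).max()
x_star = rng.normal(size=4)              # unique minimizer, so pi(x) = x_star
f = lambda x: 0.5 * (x - x_star) @ H @ (x - x_star)
grad = lambda x: H @ (x - x_star)

gamma = 1 / L                            # the lemma requires gamma <= 1/L
x = rng.normal(size=4)
T = x - gamma * grad(x)
lhs = np.sum((T - x_star) ** 2)
rhs = (np.sum((x - x_star) ** 2)
       - (1 - L * gamma) * gamma**2 * np.sum(grad(x) ** 2)
       - 2 * gamma * f(T))
print(f"lhs = {lhs:.4f} <= rhs = {rhs:.4f}")
```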

13.2 Proof of Theorem 4


Proof. For expectation conditional on x_k, we have

    E[‖x_{k+1} − π(x_k)‖²] = E[‖x_k − π(x_k) − γ_k g(x_k)‖²]
                           = ‖x_k − γ_k∇f(x_k) − π(x_k)‖² + γ_k² E[‖g(x_k) − ∇f(x_k)‖²]    (by (25))
                           = ‖x_k − γ_k∇f(x_k) − π(x_k)‖² + γ_k²(E[‖g(x_k)‖²] − ‖∇f(x_k)‖²)    (by (26))
                           ≤ ‖x_k − π(x_k)‖² − (2 − Lγ_k)γ_k²‖∇f(x_k)‖² − 2γ_k(f(T_{γ_k}(x_k)) − f*)    (70)
                             + γ_k² E[‖g(x_k)‖²],

where the last inequality uses (65).
We can decompose the function decrease term and then use Lemma 4 as follows:

    −2γ_k(f(T_{γ_k}(x_k)) − f*) = −2γ_k(f(x_k) − f*) + 2γ_k(f(x_k) − f(T_{γ_k}(x_k)))
                                ≤ −2γ_k(f(x_k) − f*) + γ_k²(3Lγ_k + 2)‖∇f(x_k)‖²,    (71)

where the inequality is (48). Let r_k = ‖x_k − π(x_k)‖². Continuing from (70),

    E[‖x_{k+1} − π(x_k)‖²] ≤ r_k − (2 − Lγ_k)γ_k²‖∇f(x_k)‖² − 2γ_k(f(T_{γ_k}(x_k)) − f*) + γ_k² E[‖g(x_k)‖²]
                           ≤ r_k − 2γ_k(f(x_k) − f*) + 4Lγ_k³‖∇f(x_k)‖² + γ_k² E[‖g(x_k)‖²],    (72)

where the second inequality uses (71).
Using Assumption 2 and Lemma 1 applied to f,

    E[‖g(x_k)‖²] ≤ 2A(f(x_k) − f*) + B‖∇f(x_k)‖² + C
                 ≤ (2A + 2BL)(f(x_k) − f*) + C = 2ρ(f(x_k) − f*) + C,    (73)

where ρ := A + BL. We now use Lemma 1 to bound the squared gradient term:

    E[‖x_{k+1} − π(x_k)‖²] ≤ r_k − 2γ_k(f(x_k) − f*) + 4Lγ_k³‖∇f(x_k)‖² + γ_k² E[‖g(x_k)‖²]    (by (72))
                           ≤ r_k − 2γ_k(1 − 4L²γ_k²)(f(x_k) − f*) + γ_k² E[‖g(x_k)‖²]    (by (14))
                           ≤ r_k − 2γ_k(1 − 4L²γ_k² − γ_kρ)(f(x_k) − f*) + γ_k²C,    (74)

where the last inequality uses (73).
Now note that our choice of γ_k guarantees that 1 − 4L²γ_k² − γ_kρ ≥ 1/2. Using this and Assumption 6 in (74), we get

    E[‖x_{k+1} − π(x_k)‖²] ≤ r_k − γ_k(f(x_k) − f*) + γ_k²C ≤ (1 − γ_kµ/2) r_k + γ_k²C.

We now use the property of projections that E[‖x_{k+1} − π(x_{k+1})‖²] ≤ E[‖x_{k+1} − y‖²] for any y ∈ X*. Specializing this with y = π(x_k), we get

    E[‖x_{k+1} − π(x_{k+1})‖²] ≤ (1 − γ_kµ/2)‖x_k − π(x_k)‖² + γ_k²C.

Taking unconditional expectation we get

    E[r_{k+1}] ≤ (1 − γ_kµ/2) E[r_k] + γ_k²C.
Using Lemma 3 with a = µ/2, b = max{4L, 4BL + 4A}, and c = C, we recover

    E[r_K] ≤ exp(−µK/(2 max{4L, 4BL + 4A})) r_0 + 36C/(µ²K).    (75)

Equation (75) gives a convergence guarantee in terms of the distance of the iterates from the set of optima. We may convert this to convergence in function values using the following consequences of the quadratic functional growth property and smoothness: r_0 ≤ (2/µ)(f(x_0) − f*) and f(x_K) − f* ≤ (L/2) r_K. Hence,

    E[f(x_K) − f*] ≤ (κ_f/2) exp(−µK/(2 max{4L, 4BL + 4A}))(f(x_0) − f*) + 18κ_f C/(µK). □

[Figure 3 consists of three panels, each titled "SGD with Uniform Sampling" and plotted against iterations 0 to 500: the squared full gradient norm ‖∇f‖² (log scale), the loss f(x), and the average stochastic gradient norm (1/n) ∑_{i=1}^n ‖∇f_i‖².]

Figure 3: Results from SGD applied to the regularized logistic regression problem.

14 Experimental Details
14.1 Logistic regression with a nonconvex regularizer
Experimental Setup: The optimization problem considered is

    min_{x∈R^d} (1/n) ∑_{i=1}^n log(1 + exp(−a_i^T x)) + λ ∑_{j=1}^d x_j²/(1 + x_j²),

where a_1, a_2, ..., a_n ∈ R^d are from the a9a dataset with n = 32561 and d = 123. We set λ = 1/2. We run SGD for K = 500 iterations with minibatch size 1 and uniform sampling.
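A minimal sketch (ours, not the authors' released code) of this setup follows; loading the a9a features into a dense array A of shape (n, d), e.g. via sklearn.datasets.load_svmlight_file, is assumed and not shown:

```python
# Sketch (ours) of SGD with uniform single-example sampling on the regularized
# logistic regression problem above; A is the (n, d) feature matrix (assumed loaded).
import numpy as np

def grad_i(x, a_i, lam):
    # gradient of f_i(x) = log(1 + exp(-a_i^T x)) + lam * sum_j x_j^2 / (1 + x_j^2)
    t = a_i @ x
    g_loss = -a_i / (1 + np.exp(t))
    g_reg = lam * 2 * x / (1 + x**2) ** 2
    return g_loss + g_reg

def run_sgd(A, lam=0.5, K=500, gamma=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(K):
        i = rng.integers(n)              # uniform sampling, minibatch size 1
        x -= gamma * grad_i(x, A[i], lam)
    return x
```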
Data Collected: We collect a snapshot once every five iterates (so x_0, x_5, x_10, ..., x_500), consisting of the squared full gradient norm ‖∇f(x_k)‖², the loss on the entire dataset f(x_k), and the average stochastic gradient norm E[‖g(x_k)‖²] = (1/n) ∑_{i=1}^n ‖∇f_i(x_k)‖². These are plotted in Figure 3. The full gradient norms go to zero, as expected, but the stochastic gradient norms decrease and then quickly plateau around 0.5, and the full loss shows a similar pattern. We note that Lemma 2 can explain the plateau pattern: using it, one can show that, provided the stepsize is set correctly, the loss at the last point does not exceed the initial loss in expectation.
Hyperparameter Estimation: We estimate the individual Lipschitz constants as L_i ≈ ‖a_i‖²/4 + 2λ. We estimate the global Lipschitz constant L by their average L̄ = (1/n) ∑_{i=1}^n L_i. We estimate A for independent sampling as in Proposition 3 by A = max_i L_i/(τ n q_i) = max_i L_i, as τ = 1 and q_i = 1/n for all i. We estimate f^inf as the minimum value encountered over the entire SGD run, and estimate f_i^inf by running gradient descent for 200 iterations on each f_i and reporting the minimum value encountered. The stepsize we use is then γ = 1/√(LKA).
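In code, these estimates amount to a few lines (a sketch of ours mirroring the recipe above):

```python
# Hyperparameter estimates (ours): L_i ~ ||a_i||^2/4 + 2*lam, L ~ mean(L_i),
# A = max_i L_i (since tau = 1 and q_i = 1/n), and gamma = 1/sqrt(L*K*A).
import numpy as np

def estimate_hyperparameters(A_mat, lam=0.5, K=500):
    L_i = np.sum(A_mat**2, axis=1) / 4 + 2 * lam
    L_bar = L_i.mean()
    A_const = L_i.max()
    gamma = 1 / np.sqrt(L_bar * K * A_const)
    return L_bar, A_const, gamma
```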
Data Fitting: For Assumption 2, the parameters A, B, C ≥ 0 model the average stochastic gradient norms in terms of the loss and the full gradient. To find the best such parameters empirically, we solve the following optimization problem, minimizing the squared residual error:

    min_{z∈R³, z≥0} g(z) = ‖Wz − b‖²,    (76)

where z = (z_1, z_2, z_3) are the constants to be fit, corresponding to 2A, B, and C respectively. W is an m × 3 matrix (for m the number of snapshots, equal to 100 here) such that W_{i,1} = f(x_i) − f^inf, W_{i,2} = ‖∇f(x_i)‖², and W_{i,3} = 1. Moreover, b ∈ R^m is the vector of average stochastic gradient norms, i.e. b_k = (1/n) ∑_{i=1}^n ‖∇f_i(x_k)‖². Problem (76) is a nonnegative linear least squares problem and can be solved efficiently by linear algebra toolboxes. To model relaxed growth, we solve a similar problem but with only the parameters B and C (and with corresponding changes to the matrix W).
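A sketch (ours) of the fitting step using SciPy's nonnegative least squares solver, one standard toolbox choice for problem (76):

```python
# Fit (ES) constants by nonnegative least squares (ours): columns of W correspond
# to the coefficients 2A, B, and C, and b holds the average stochastic gradient norms.
import numpy as np
from scipy.optimize import nnls

def fit_es(losses, grad_norms_sq, avg_stoch_norms_sq, f_inf):
    W = np.column_stack([losses - f_inf, grad_norms_sq, np.ones_like(losses)])
    z, rnorm = nnls(W, avg_stoch_norms_sq)   # solves min ||Wz - b|| s.t. z >= 0
    two_A, B, C = z
    return two_A / 2, B, C, rnorm
```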
Discussion of Table 1: Table 1 shows that (ES) can achieve a smaller residual error than (RG) by relying on the function values to model the stochastic gradients. The value of the first coefficient is quite close to the theoretically estimated value of L_max. The value of the coefficient C, however, is difficult to estimate, because according to our theory it depends on the global greatest lower bounds of the losses f_i and f. In the case of convex objectives, however, this constant can be computed accurately given knowledge of the minimizers of all the f_i and of f.
