Handbook of Convergence Theorems
Handbook of Convergence Theorems
Abstract
This is a handbook of simple proofs of the convergence of gradient and stochastic gradient
descent type methods. We consider functions that are Lipschitz, smooth, convex, strongly
convex, and/or Polyak-Lojasiewicz functions. Our focus is on “good proofs” that are also
simple. Each section can be consulted separately. We start with proofs of gradient descent, then
on stochastic variants, including minibatching and momentum. Then move on to nonsmooth
problems with the subgradient method, the proximal gradient descent and their stochastic
variants. Our focus is on global convergence rates and complexity rates. Some slightly less
common proofs found here include that of SGD (Stochastic gradient descent) with a proximal
step in 11, with momentum in Section 7, and with mini-batching in Section 6.
1 Introduction
Here we collect our favourite convergence proofs for gradient and stochastic gradient based meth-
ods. Our focus has been on simple proofs, that are easy to copy and understand, and yet achieve
the best convergence rate for the setting.
Disclaimer: Theses notes are not proper review of the literature. Our aim is to have an easy
to reference handbook. Most of these proofs are not our work, but rather a collection of known
proofs. If you find these notes useful, feel free to cite them, but we kindly ask that you cite the
original sources as well that are given either before most theorems or in the bibliographic notes at
the end of each section.
1
How to use these notes
We recommend searching for the theorem you want in the table of contents, or in the in Table 1a
just below, then going directly to the section to see the proof. You can then follow the hyperlinks
for the assumptions and properties backwards as needed. For example, if you want to know about
the proof of Gradient Descent in the convex and smooth case you can jump ahead to Section 3.1.
There you will find you need a property of convex function given in Lemma 2.8. These notes were
not made to be read linearly: it would be impossibly boring.
Acknowledgements
The authors would like to thank Shuvomoy Das Gupta and Benjamin Grimmer, for pointing out
typos and errors in an earlier version of this document.
2
Contents
1 Introduction 1
3 Gradient Descent 14
3.1 Convergence for convex and smooth functions . . . . . . . . . . . . . . . . . . . . . 14
3.2 Convergence for strongly convex and smooth functions . . . . . . . . . . . . . . . . 17
3.3 Convergence for Polyak-Lojasiewicz and smooth functions . . . . . . . . . . . . . . 18
3.4 Bibliographic notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
6 Minibatch SGD 32
6.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.2 Convergence for convex and smooth functions . . . . . . . . . . . . . . . . . . . . . 34
6.3 Rates for strongly convex and smooth functions . . . . . . . . . . . . . . . . . . . . 35
6.4 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3
7 Stochastic Momentum 37
7.1 The many ways of writing momentum . . . . . . . . . . . . . . . . . . . . . . . . . 37
7.2 Convergence for convex and smooth functions . . . . . . . . . . . . . . . . . . . . . 39
7.3 Bibliographic notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
A Appendix 68
A.1 Lemmas for Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
A.2 A nonconvex PL function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4
Table 1: Where to find the corresponding theorem and complexity for all the algorithms and
assumptions. GD = Gradient Descent, SGD = Stochastic Gradient Descent, mini-SGD = SGD
with mini-batching, momentum = SGD with momentum, also known as stochastic heavy ball,
prox-GD is proximal GD, and proximal SGD. The X’s are settings which are currently not covered
in the handbook.
(b) Table of the complexity of each algorithm, where in each cell we give the number of iterations
required to guarantee E kxt+1 − x∗ k2 ≤ in the strongly convex setting, or E [f (xt ) − inf f ] ≤
in the convex and PL settings. Further σf∗ is defined in (36), ∆∗f in (35), D := kx0 − x∗ k,
δf := f (x0 ) − inf f . For composite functions F = f + g we have δF := F (x0 ) − inf F and σF∗ is
defined in (73). For the mini-batch-SGD with fixed batch size b ∈ N, we have σb∗ defined in (55)
and Lb defined in (54).
5
2 Theory : Smooth functions and convexity
2.1 Differentiability
2.1.1 Notations
where we write F(x) = (f1 (x), . . . , fp (x)). Consequently DF(x) is a matrix with DF(x) ∈ Rp×d .
2 ∂2f
∇ f (x) i,j = (x), for i, j = 1, . . . , d.
∂xi ∂xj
Remark 2.4 (Hessian and eigenvalues). If f is twice differentiable, then its Hessian is always
a symmetric matrix (Schwarz’s Theorem). Therefore, the Hessian matrix ∇2 f (x) admits d
eigenvalues (Spectral Theorem).
Lemma 2.6. Let F : Rd → Rp be differentiable, and L > 0. Then F is L-Lipschitz if and only
if
for all x ∈ Rd , kDF(x)k ≤ L
Proof. ⇒ Assume that F is L-Lipschitz. Let x ∈ Rd , and let us show that kDF(x)k ≤ L. This
is equivalent to show that kDF(x)vk ≤ L, for any v ∈ Rd such that kvk = 1. For a given v, the
directional derivative is given by
F(x + tv) − F(x)
DF(x)v = lim .
t↓0 t
6
Taking the norm in this equality, and using our assumption that F is L-Lipschitz, we indeed obtain
⇐ Assume now that kDF(z)k ≤ L for every vector z ∈ Rd , and let us show that F is L-Lipschitz.
For this, fix x, y ∈ Rd , and use the Mean-Value Inequality (see e.g. [8, Theorem 17.2.2]) to write
!
kF(y) − F(x)k ≤ sup kDF(z)k ky − xk ≤ Lky − xk.
z∈[x,y]
2.2 Convexity
for all x, y ∈ Rd , for all t ∈ [0, 1], f (tx + (1 − t)y) ≤ tf (x) + (1 − t)f (y). (1)
The next two lemmas characterize the convexity of a function with the help of first and second-
order derivatives. These properties will be heavily used in the proofs.
Lemma 2.9. Let f : Rd → R be convex and twice differentiable. Then, for all x ∈ Rd , for every
eigenvalue λ of ∇2 f (x), we have λ ≥ 0.
Proof. Since f is convex we can use (2) twice (permuting the roles of x and y) and summing the
resulting two inequalities, to obtain that
∇f (x + tv) − ∇f (x) 1
h∇2 f (x)v, vi = hlim , vi = lim 2 h∇f (x + tv) − ∇f (x), (x + tv) − xi ≥ 0,
t→0 t t→0 t
7
where the first equality follows because the gradient is a continuous function and the last inequality
follows from (3). Now we can conclude : if λ is an eigenvalue of ∇2 f (x), take any non zero
eigenvector v ∈ Rd and write
λkvk2 = hλv, vi = h∇2 f (x)v, vi ≥ 0.
Example 2.10 (Least-squares is convex). Let Φ ∈ Mn,d (R) and y ∈ Rn , and let f (x) =
1 2 2 >
2 kΦx − yk be the corresponding least-squares function. Then f is convex, since ∇ f (x) ≡ Φ Φ
is positive semi-definite.
Definition 2.11. Let f : Rd → R ∪ {+∞}, and µ > 0. We say that f is µ-strongly convex if,
for every x, y ∈ Rd , and every t ∈ [0, 1] we have that
t(1 − t)
µ kx − yk2 + f (tx + (1 − t)y) ≤ tf (x) + (1 − t)f (y).
2
We say that µ is the strong convexity constant of f .
The lemma below shows that it is easy to craft a strongly convex function : just add a multiple
of k·k2 to a convex function. This happens for instance when using Tikhonov regularization (a.k.a.
ridge regularization) in machine learning or inverse problems.
Lemma 2.12. Let f : Rd → R, and µ > 0. The function f is µ-strongly convex if and only if
there exists a convex function g : Rd → R such that f (x) = g(x) + µ2 kxk2 .
Proof. Given f and µ, define g(x) := f (x) − µ2 kxk2 . We need to prove that f is µ-strongly convex
if and only if g is convex. We start from Definition 2.11 and write (we note zt = (1 − t)x + ty):
f is µ-strongly convex
µ
⇔ ∀t ∀x, y, f (zt ) + t(1 − t)kx − yk2 ≤ (1 − t)f (x) + tf (y)
2
µ µ µ µ
⇔ ∀t ∀x, y, g(zt ) + kzt k2 + t(1 − t)kx − yk2 ≤ (1 − t)g(x) + tg(y) + (1 − t) kxk2 + t kyk2 .
2 2 2 2
Let us now gather all the terms multiplied by µ to find that
1 1 1 1
kzt k2 + t(1 − t)kx − yk2 − (1 − t) kxk2 − t kyk2
2 2 2 2
= (1 − t) kxk + t kyk + 2t(1 − t)hx, yi + t(1 − t)kxk2 + t(1 − t)kyk2 − 2t(1 − t)hx, yi
2 2 2 2
− (1 − t)kx|2 − tkyk2
kxk2 (1 − t)2 + t(1 − t) − (1 − t) + kyk2 t2 + t(1 − t) − t
=
= 0.
So we see that all the terms in µ disappear, and what remains is exactly the definition for g to be
convex.
8
Lemma 2.13. If f : Rd → R is a continuous strongly convex function, then f admits a unique
minimizer.
Now we present some useful variational inequalities satisfied by strongly convex functions.
Proof. Define g(x) := f (x) − µ2 kxk2 . According to Lemma 2.12, g is convex. It is also clearly
differentiable by definition. According to the sum rule, we have ∇f (x) = ∇g(x) + µx. Therefore
we can use the convexity of g with Definition 8.4 to write
µ µ µ
f (y) − f (x) − h∇f (x), y − xi ≥ kyk2 − kxk2 − hµx, y − xi = ky − xk2 .
2 2 2
Lemma 2.15. Let f : Rd → R be a twice differentiable µ-strongly convex function. Then, for
all x ∈ Rd , for every eigenvalue λ of ∇2 f (x), we have λ ≥ µ.
Proof. Define g(x) := f (x) − µ2 kxk2 , which is convex according to Lemma 2.12. It is also twice
differentiable, by definition, and we have ∇2 f (x) = ∇2 g(x) + µId. So the eigenvalues of ∇2 f (x)
are equal to the ones of ∇2 g(x) plus µ. We can conclude by using Lemma 2.9.
2.4 Polyak-Lojasiewicz
We just say that f is Polyak-Lojasiewicz (PL for short) if there exists µ > 0 such that f is
µ-Polyak-Lojasiewicz.
9
Lemma 2.18. Let f : Rd → R be differentiable, and µ > 0. If f is µ-strongly convex, then f is
µ-Polyak-Lojasiewicz.
Proof. Let x∗ be a minimizer of f (see Lemma 2.13), such that f (x∗ ) = inf f. Multiplying (4) by
minus one and substituting y = x∗ as the minimizer, we have that
µ ∗
f (x) − f (x∗ ) ≤ h∇f (x), x − x∗ i −
kx − xk2
2
1 √ 1 1
= − k µ(x − x∗ ) − √ ∇f (x)k2 + k∇f (x)k2
2 µ 2µ
1
≤ k∇f (x)k2 .
2µ
It is important to note that the Polyak-Lojasiewicz property can hold without strong convexity
or even convexity, as illustrated in the next examples.
• Let f (t) = t2 + 3 sin(t)2 . It is an exercise to verify that f is PL, while not being convex
(see Lemma A.3 for more details).
• If Ω ⊂ Rd is a closed set and f (x) = dist(x; Ω)2 is the squared distance function to this set,
then it can be shown that f is PL. See Figure 1 for an example, and [9] for more details.
Figure 1: Graph of a PL function f : R2 → R. Note that the function is not convex, but that the
only critical points are the global minimizers (displayed as a white curve).
1
Example 2.21 (PL for nonlinear models). Let f (x) = 2 kΦ(x) − yk2 , where Φ : Rd → Rn is
10
differentiable. Then f is PL if DΦ> (x) is uniformly injective:
there exists µ > 0 such that for all x ∈ Rd , λmin (DΦ(x)DΦ(x)> ) ≥ µ. (6)
k∇f (x)k2 = kDΦ(x)> (Φ(x) − y)k2 ≥ µkΦ(x) − yk2 = 2µf (x) ≥ 2µ(f (x) − inf f ).
Note that assumption (6) requires d ≥ n, which holds if Φ represents an overparametrized neural
network. For more refined arguments, including less naive assumptions and exploiting the neural
network structure of Φ, see [23].
One must keep in mind that the PL property is rather strong, as it is a global property and
requires the following to be true, which is typical of convexity.
Remark 2.23 (Local Lojasiewicz inequalities). In this document we focus only on the Polyak-
Lojasiewicz inequality, for simplicity. Though there exists a much larger family of Lojasiewicz
inequalities, which by and large cover most functions used in practice.
• The inequality can be more local. For instance by requiring that (5) holds only on some
subset Ω ⊂ Rd instead of the whole Rd . For instance, logistic functions typically verify
(5) on every bounded set, but not on the whole space. The same can be said about the
empirical risk associated to wide enough neural networks [23].
• While PL describes functions that grow like x 7→ µ2 kxk2 , there are p-Lojasiewicz inequalities
p−1
describing functions that grow like x 7→ µ p kxkp and satisfy f (x) − inf f ≤ qµ 1
k∇f (x)kq
on some set Ω, with p1 + 1q = 1.
• The inequality can be even more local, by dropping the property that every critical point is
a global minimum. For this we do not look at the growth of f (x)−inf f , but of f (x)−f (x∗ )
instead, where x∗ ∈ Rd is a critical point of interest. This can be written as
1 1 1
for all x ∈ Ω, f (x) − f (x∗ ) ≤ k∇f (x)kq , where + = 1. (7)
qµ p q
A famous result [4, Corollary 16] shows that any semi-algebraic function (e.g. sums and products
of polynomials by part functions) verifies (7) at every x∗ ∈ Rd for some p ≥ 1, µ > 0, and Ω
being an appropriate neighbourhood of x∗ . This framework includes for instance quadratic losses
evaluating a Neural Network with ReLU as activations.
11
2.5 Smoothness
Definition 2.24. Let f : Rd → R, and L > 0. We say that f is L-smooth if it is differentiable
and if ∇f : Rd → Rd is L-Lipschitz:
As for the convexity (and strong convexity), we give two characterizations of the smoothness by
means of first and second order derivatives.
Proof. Let x, y ∈ Rd be fixed. Let φ(t) := f (x + t(y − x)). Using the Fundamental Theorem of
Calculus on φ, we can write that
Z 1
f (y) = f (x) + h∇f (x + t(y − x)), y − xi dt.
t=0
Z 1
= f (x) + h∇f (y), x − yi + h∇f (x + t(y − x)) − ∇f (x), y − xi dt.
t=0
Z 1
≤ f (x) + h∇f (x), y − xi + k∇f (x + t(y − x)) − ∇f (x)kky − xkdt
t=0
(8)
Z 1
≤ f (x) + h∇f (x), y − xi + Ltky − xk2 dt
t=0
L
≤ f (x) + h∇f (x), y − xi + ky − xk2 .
2
Lemma 2.26. Let f : Rd → R be a twice differentiable L-smooth function. Then, for all x ∈ Rd ,
for every eigenvalue λ of ∇2 f (x), we have |λ| ≤ L.
Proof. Use Lemma 2.6 with F = ∇f , together with the fact that D(∇f )(x) = ∇2 f (x). We obtain
that, for all x ∈ Rd , we have k∇2 f (x)k ≤ L. Therefore, for every eigenvalue λ of ∇2 f (x), we can
write for a nonzero eigenvector v ∈ Rd that
12
Remark 2.27. From Lemmas 2.25 and 2.14 we see that if a function is L-smooth and µ-strongly
convex then µ ≤ L.
Some direct consequences of the smoothness are given in the following lemma. You can compare
(12) with Lemma 2.18.
Proof. The first inequality (11) follows by inserting y = x − λ∇f (x) in (9) since
L
f (x − λ∇f (x)) ≤ f (x) − λh∇f (x), ∇f (x)i + kλ∇f (x)k2
2
λL
= f (x) − λ 1 − k∇f (x)k2 .
2
Assume now inf f > −∞. By using (11) with λ = 1/L, we get (12) up to a multiplication by −1 :
1
inf f − f (x) ≤ f (x − L1 ∇f (x)) − f (x) ≤ − k∇f (x)k2 . (13)
2L
There are many problems in optimization where the function is both smooth and convex. Such
functions enjoy properties which are strictly better than a simple combination of their convex and
smooth properties.
Lemma 2.29. If f : Rd → R is convex and L-smooth, then for all x, y ∈ Rd we have that
1
k∇f (y) − ∇f (x)k2 ≤ f (y) − f (x) − h∇f (x), y − xi , (14)
2L
1
k∇f (x) − ∇f (y)k2 ≤ h∇f (y) − ∇f (x), y − xi (Co-coercivity) (15)
L
Proof. To prove (14), fix x, y ∈ Rd and start by using the convexity and the smoothness of f to
write, for every z ∈ Rd ,
13
To get the tightest upper bound on the right hand side, we can minimize the right hand side with
respect to z, which gives
1
z = y − (∇f (y) − ∇f (x)).
L
Substituting this z in gives, after reorganizing the terms:
L
f (x) − f (y) ≤ h∇f (x), x − zi + h∇f (y), z − yi + kz − yk2 .
2
1 1
= h∇f (x), x − yi − k∇f (y) − ∇f (x)k2 + k∇f (y) − ∇f (x)k2
L 2L
1
= h∇f (x), x − yi − k∇f (y) − ∇f (x)k2 .
2L
This proves (14). To obtain (15), apply (14) twice by interchanging the roles of x and y
1
k∇f (y) − ∇f (x)k2 ≤ f (y) − f (x) − h∇f (x), y − xi ,
2L
1
k∇f (x) − ∇f (y)k2 ≤ f (x) − f (y) − h∇f (y), x − yi ,
2L
and sum those two inequalities.
3 Gradient Descent
Problem 3.1 (Differentiable Function). We want to minimize a differentiable function f : Rd →
R. We require that the problem is well-posed, in the sense that argmin f 6= ∅.
Algorithm 3.2 (GD). Let x0 ∈ Rd , and let γ > 0 be a step size. The Gradient Descent
(GD) algorithm defines a sequence (xt )t∈N satisfying
Remark 3.3 (Vocabulary). Stepsizes are often called learning rates in the machine learning
community.
We will now prove that the iterates of (GD) converge. In Theorem 3.4 we will prove sub-
linear convergence under the assumption that f is convex. In Theorem 3.6 we will prove linear
convergence (a faster form of convergence) under the stronger assumption that f is µ–strongly
convex.
Theorem 3.4. Consider the Problem (Differentiable Function) and assume that f is convex
and L-smooth, for some L > 0. Let (xt )t∈N be the sequence of iterates generated by the (GD)
algorithm, with a stepsize satisfying 0 < γ ≤ L1 . Then, for all x∗ ∈ argmin f , for all t ∈ N we
14
have that
kx0 − x∗ k2
f (xt ) − inf f ≤ . (17)
2γt
For this theorem we give two proofs. The first proof uses an energy function, that we will also use
later on. The second proof is a direct proof taken from [5].
Proof of Theorem 3.4 with Lyapunov arguments. Let x∗ ∈ argmin f be any minmizer of f .
First, we will show that f (xt ) is decreasing. Indeed we know from (11), and from our assumption
γL ≤ 1, that
γL
f (xt+1 ) − f (xt ) ≤ −γ(1 − )k∇f (xt )k2 ≤ 0. (18)
2
Second, we will show that kxt − x∗ k2 is also decreasing. For this we expand the squares to write
1 t+1 1 t −1 t+1
kx − x∗ k2 − kx − x∗ k2 = kx − xt k2 − h∇f (xt ), xt+1 − x∗ i
2γ 2γ 2γ
−1 t+1
= kx − xt k2 − h∇f (xt ), xt+1 − xt i + h∇f (xt ), x∗ − xt i.
2γ
(19)
Now to bound the right hand side we use the convexity of f and (2) to write
To bound the other inner product we use the smoothness of L and (9) which gives
L t+1
−h∇f (xt ), xt+1 − xt i ≤kx − xt k2 + f (xt ) − f (xt+1 ).
2
By using the two above inequalities in (19) we obtain
1 t+1 1 t −1 t+1
kx − x∗ k2 − kx − x∗ k2 ≤ kx − xt k2 − (f (xt+1 ) − inf f ),
2γ 2γ 2γ
≤ −(f (xt+1 ) − inf f ). (20)
Let us now combine the two positive decreasing quantities f (xt ) − inf f and 1
2γ kx
t − x∗ k2 , and
introduce the following Lyapunov energy, for all t ∈ N:
1 t
Et := kx − x∗ k2 + t(f (xt ) − inf f ).
2γ
We want to show that it is decreasing with time. For this we start by writing
1 t+1 1 t
Et+1 − Et = (t + 1)(f (xt+1 ) − f (xt )) − t(f (xt ) − inf f ) + kx − x∗ k2 − kx − x∗ k2
2γ 2γ
1 t+1 1 t
= f (xt+1 ) − inf f + t(f (xt+1 ) − f (xt )) + kx − x∗ k2 − kx − x∗ k2 . (21)
2γ 2γ
Combining now (21), (18) and (20), we finally obtain (after cancelling terms) that
1 t+1 1 t
Et+1 − Et ≤ f (xt+1 ) − inf f + kx − x∗ k2 − kx − x∗ k2 using (18)
2γ 2γ
≤ f (xt+1 ) − inf f − (f (xt+1 ) − inf f ) using (20)
≤ 0.
15
Thus Et is decreasing. Therefore we can write that
1 0
t(f (xt ) − inf f ) ≤ Et ≤ E0 = kx − x∗ k2 ,
2γ
and the conclusion follows after dividing by t.
Proof of Theorem 3.4 with direct arguments. Let f be convex and L–smooth. It follows
that
Calling upon (11) and subtracting f (x∗ ) from both sides gives
1
f (xt+1 ) − f (x∗ ) ≤ f (xt ) − f (x∗ ) − k∇f (xt )k2 . (24)
2L
Applying convexity we have that
Let δt = f (xt ) − f (x∗ ). Since δt+1 ≤ δt , and by manipulating (26) we have that
1
×δ δ δt 1 1 δt+1 ≤δt 1 1
t t+1
δt+1 ≤ δt − βδt2 ⇔ β ≤ − ⇔ β≤ − .
δt+1 δt+1 δt δt+1 δt
Summing up both sides over t = 0, . . . , T − 1 and using telescopic cancellation we have that
1 1 1
Tβ ≤ − ≤ .
δT δ0 δT
Re-arranging the above we have that
1 2Lkx0 − x∗ k2
f (xT ) − f (x∗ ) = δT ≤ = .
β(T − 1) T
16
Corollary 3.5 (O(1/t) Complexity). Under the assumptions of Theorem 3.4, for a given > 0
and γ = L we have that
L kx0 − x∗ k2
t≥ =⇒ f (xt ) − inf f ≤ (27)
2
Theorem 3.6. Consider the Problem (Differentiable Function) and assume that f is µ-strongly
convex and L-smooth, for some L ≥ µ > 0. Let (xt )t∈N be the sequence of iterates generated by
the (GD) algorithm, with a stepsize satisfying 0 < γ ≤ L1 . Then, for x∗ = argmin f and for all
t ∈ N:
kxt+1 − x∗ k2 ≤ (1 − γµ)t+1 kx0 − x∗ k2 . (28)
1
Remark 3.7. Note that with the choice γ = L, the iterates enjoy a linear convergence with a
rate of (1 − µ/L).
Below we provide two different proofs for this Theorem 3.6. The first one makes use of first-
order variational inequalities induced by the strong convexity and smoothness of f . The second
one (assuming further that f is twice differentiable) exploits the fact that the eigenvalues of the
Hessian of f are in between µ and L.
Proof of Theorem 3.6 with first-order properties. From (GD) we have that
kxt+1 − x∗ k2 ≤ (1 − γµ)kxt − x∗ k2 .
Proof of Theorem 3.6 with the Hessian. Let T : Rd → Rd be defined by T (x) = x − γ∇f (x),
so that we can write an iteration of Gradient Descent as xt+1 = T (xt ). Note that the minimizer
x∗ verifies ∇f (x∗ ) = 0, so it is a fixed point of T in the sense that T (x∗ ) = x∗ . This means that
kxt+1 − x∗ k = kT (xt ) − T (x∗ )k. Now we want to prove that
17
Indeed, unrolling the recurrence from (30) would provide the desired bound (28).
We see that (30) is true as long as T is θ-Lipschitz, with θ = (1 − λµ). From Lemma 2.6, we
know that is equivalent to proving that the norm of the differential of T is bounded by θ. It is
easy to compute this differential : DT (x) = Id − λ∇2 f (x). If we note v1 (x) ≤ · · · ≤ vd (x) the
eigenvalues of ∇2 f (x), we know by Lemmas 2.15 and 2.26 that µ ≤ vi (x) ≤ L. Since we assume
λL ≤ 1, we see that 0 ≤ 1 − λvi (x) ≤ 1 − λµ. So we can write
which allows us to conclude that (30) is true. To conclude the proof of Theorem 3.6, take the
squares in (30) and use the fact that θ ∈]0, 1[⇒ θ2 ≤ θ.
The linear convergence rate in Theorem 3.6 can be transformed into a complexity result as we
show next.
Corollary 3.8 (O (log(1/)) Complexity). Under the same assumptions as Theorem 3.6, for a
given > 0, we have that if γ = 1/L then
L 1
t ≥ log ⇒ kxt+1 − x∗ k2 ≤ kx0 − x∗ k2 . (31)
µ
The proof of this lemma follows by applying Lemma A.1 in the appendix.
Theorem 3.9. Consider the Problem (Differentiable Function) and assume that f is µ-Polyak-
Lojasiewicz and L-smooth, for some L ≥ µ > 0. Consider (xt )t∈N a sequence generated by the
(GD) algorithm, with a stepsize satisfying 0 < γ ≤ L1 . Then:
Proof. We can use Lemma 2.25, together with the update rule of (GD), to write
L t+1
f (xt+1 ) ≤ f (xt ) + h∇f (xt ), xt+1 − xt i + kx − xt k2
2
Lγ 2
= f (xt ) − γk∇f (xt )k2 + k∇f (xt )k2
2
γ
= f (xt ) − (2 − Lγ) k∇f (xt )k2
2
γ
≤ f (x ) − k∇f (xt )k2 ,
t
2
18
where in the last inequality we used our hypothesis on the stepsize that γL ≤ 1. We can now use
the Polyak-Lojasiewicz property (recall Definition 2.17) to write:
The conclusion follows after subtracting inf f on both sides of this inequality, and using recursion.
Corollary 3.10 (log(1/) Complexity). Under the same assumptions as Theorem 3.9, for a given
> 0, we have that if γ = 1/L then
L 1
t ≥ log ⇒ f (xt ) − inf f ≤ (f (x0 ) − inf f ). (32)
µ
The proof of this lemma follows by applying Lemma A.1 in the appendix.
where fi : Rd → R. We require that the problem is well-posed, in the sense that argmin f 6= ∅
and that the fi ’s are bounded from below.
Assumption 4.2 (Sum of Convex). We consider the Problem (Sum of Functions) where each
fi : Rd → R is assumed to be convex.
Assumption 4.3 (Sum of Lmax –Smooth). We consider the Problem (Sum of Functions) where
def def
each fi : Rd → R is assumed to be Li -smooth. We will note Lmax = max Li , and Lavg =
1,...,n
19
1 Pn
n i=1 Li . We will also note Lf the Lipschitz constant of ∇f .
Note that, in the above Assumption (Sum of Lmax –Smooth), the existence of Lf is not an
assumption but the consequence of the smoothness of the fi ’s. Indeed:
Lemma 4.4. Consider the Problem (Sum of Functions). If the fi ’s are Li -smooth, then f is
Lavg -smooth.
= Lavg ky − xk.
2. V [X] ≤ E kXk2 .
Proof. Item 2 is a direct consequence of the first with y = 0. To prove item 1, we use that
20
Lemma 4.7. If Assumptions (Sum of Lmax –Smooth) and (Sum of Convex) hold, then f is
Lmax -smooth in expectation, in the sense that
1
for all x, y ∈ Rd , E k∇fi (y) − ∇fi (x)k2 ≤ f (y) − f (x) − h∇f (x), y − xi .
(33)
2Lmax
Proof. Using (14) in Lemma 2.29 applied to fi , together with the fact that Li ≤ Lmax , allows us
to write
1
k∇fi (y) − ∇fi (x)k2 ≤ fi (y) − fi (x) + h∇fi (x), y − xi .
2Lmax
1 1 P
To conclude, multiply the above inequality by n, and sum over i, using the fact that n i fi =f
and n1 i ∇fi = ∇f .
P
Lemma 4.8. If Assumptions (Sum of Lmax –Smooth) and (Sum of Convex) hold, then, for every
x ∈ Rd and every x∗ ∈ argmin f , we have that
1
E k∇fi (x) − ∇fi (x∗ )k2 ≤ f (x) − inf f.
(34)
2Lmax
Proof. Apply Lemma 4.7 with x = x∗ and y = x, since f (x∗ ) = inf f and ∇f (x∗ ) = 0.
4.3.1 Interpolation
Definition 4.9. Consider the Problem (Sum of Functions). We say that interpolation holds
if there exists a common x∗ ∈ Rd such that fi (x∗ ) = inf fi for all i = 1, . . . , n. In this case, we
say that interpolation holds at x∗ .
Proof. Let interpolation hold at x∗ ∈ Rd . By Definition 4.9, this means that x∗ ∈ argmin fi .
Therefore, for every x ∈ Rd ,
n n n
1X 1X 1X
f (x∗ ) = fi (x∗ ) = inf fi ≤ fi (x) = f (x).
n n n
i=1 i=1 i=1
21
This proves that x∗ ∈ argmin f .
Interpolation means that there exists some x∗ that simultaneously achieves the minimum of
all loss functions fi . In terms of learning problems, this means that the model perfectly fits every
data point. This is illustrated below with a couple of examples.
Example 4.11 (Least-squares and interpolation). Consider a regression problem with data
1
(φi , yi )ni=1 ⊂ Rd × R, and let f (x) = 2m kΦx − yk2 be the corresponding least-squares func-
tion with Φ = (φi )ni=1 and y = (yi )ni=1 . This is a particular case of Problem (Sum of Functions),
with fi (x) = 12 (hφi , xi − yi )2 . We see here that interpolation holds if and only if there exists
x∗ ∈ Rd such that hφi , x∗ i = yi . In other words, we can find an hyperplane in Rd × R passing
through each data point (φi , yi ). This is why we talk about interpolation.
For this linear model, note that interpolation holds if and only if y is in the range of Φ, which
is always true if Φ is surjective. This generically holds when d > n, which is usually called the
overparametrized regime.
Here we introduce different measures of how far from interpolation we are. We start with a first
quantity measuring how the infimum of f and the fi ’s are related.
Definition 4.13. Consider the Problem (Sum of Functions). We define the function noise as
n
def 1X
∆∗f = inf f − inf fi . (35)
n
i=1
Example 4.14 (Function noise for least-squares). Let f be a least-squares as in Example 4.11.
It is easy to see that inf fi = 0, implying that the function noise is exactly ∆∗f = inf f . We see in
this case that ∆∗f = 0 if and only if interpolation holds (see also the next Lemma). If the function
noise ∆∗f = inf f is nonzero, it can be seen as a measure of how far we are from interpolation.
1. ∆∗f ≥ 0.
22
2. Interpolation holds if and only if ∆∗f = 0.
Proof.
If instead we have ∆∗f = 0, then we can write for some x∗ ∈ argmin f that
n n
1X 1X
0 = ∆∗f = f (x∗ ) − inf fi = (fi (x∗ ) − inf fi ) .
n n
i=1 i=1
Clearly we have fi (x∗ ) − inf fi ≥ 0, so this sum being 0 implies that fi (x∗ ) − inf fi = 0 for
all i = 1, . . . , n. In other words, interpolation holds.
We can also measure how close we are to interpolation using gradients instead of function
values.
Definition 4.16. Let Assumption (Sum of Lmax –Smooth) hold. We define the gradient noise
as
def
σf∗ = ∗ inf V [∇fi (x∗ )] , (36)
x ∈argmin f
V[X] := E[ kX − E[X] k2 ].
Lemma 4.17. Let Assumption (Sum of Lmax –Smooth) hold. It follows that
1. σf∗ ≥ 0.
2. If Assumption (Sum of Convex) holds, then σf∗ = V [∇fi (x∗ )] for every x∗ ∈ argmin f .
Proof.
23
1. From Definition 4.5 we have that the variance V [∇fi (x∗ )] is nonnegative, which implies
σf∗ ≥ 0.
2. Let x∗ , x0 ∈ argmin f , and let us show that V [∇fi (x∗ )] = V [∇fi (x0 )]. Since Assumptions
(Sum of Lmax –Smooth) and (Sum of Convex) hold, we can use the expected smoothness via
Lemma 4.8 to obtain
1
E k∇fi (x0 ) − ∇fi (x∗ )k2 ≤ f (x0 ) − inf f = inf f − inf f = 0.
2Lmax
This means that E k∇fi (x0 ) − ∇fi (x∗ )k2 = 0, which in turns implies that, for every i =
1, . . . , n, we have k∇fi (x0 ) − ∇fi (x∗ )k = 0. In other words, ∇fi (x0 ) = ∇fi (x∗ ), and thus
V [∇fi (x∗ )] = V [∇fi (x0 )] .
3. If interpolation holds, then there exists (see Lemma 4.10) x ∈ argmin f such that x∗ ∈
argmin fi for every i = 1, . . . , n. From Fermat’s theorem, this implies that ∇fi (x∗ ) = 0
and ∇f (x∗ ) = 0. Consequently V [∇fi (x∗ )] = E k∇fi (x∗ ) − ∇f (x∗ )k2 = 0. This proves
that σf∗ = 0. Now, if Assumption (Sum of Convex) holds and σf∗ = 0, then we can use the
previous item to say that for any x∗ ∈ argmin f we have V [∇fi (x∗ )] = 0. By definition of the
variance and the fact that ∇f (x∗ ) = 0, this implies that for every i = 1, . . . , n, ∇fi (x∗ ) = 0.
Using again the convexity of the fi ’s, we deduce that x∗ ∈ argmin fi , which means that
interpolation holds.
Both σf∗ and ∆∗f measure how far we are from interpolation. Furthermore, these two constants
are related through the following bounds.
Proof.
1. Let x∗ ∈ argmin f . Using Lemma 2.28, we can write k∇fi (x∗ )k2 ≤ 2Lmax (fi (x∗ ) − inf fi ) for
each i. The conclusion follows directly after taking the expectation on this inequality, and
using the fact that E k∇fi (x∗ )k2 = V [∇fi (x∗ )] ≥ σf∗ .
2. This is exactly the same proof, except that we use Lemma 2.18 instead of 2.28.
Here we provide two lemmas which allow to exchange variance-like terms like E k∇fi (x)k2 with
interpolation constants and function values. This is important since E k∇fi (x)k2 actually controls
24
Lemma 4.19 (Variance transfer : function noise). If Assumption (Sum of Lmax –Smooth) holds,
then for all x ∈ Rd we have
k∇fi (x)k2 ≤ 2Lmax (fi (x) − inf fi ) = 2Lmax (fi (x) − fi (x∗ )) + 2Lmax (fi (x∗ ) − inf fi ), (37)
for each i. The conclusion follows directly after taking expectation over the above inequality.
Lemma 4.20 (Variance transfer : gradient noise). If Assumptions (Sum of Lmax –Smooth) and
(Sum of Convex) hold, then for all x ∈ Rd we have that
Proof. Let x∗ ∈ argmin f , so that σf∗ = V k∇fi (x∗ )k2 according to Lemma 4.17. Start by
writing
k∇fi (x)k2 ≤ 2k∇fi (x) − ∇fi (x∗ )k2 + 2k∇fi (x∗ )k2 .
Taking the expectation over the above inequality, then applying Lemma 4.8 gives the result.
Remark 5.2 (Unbiased estimator of the gradient). An important feature of the (SGD) Algo-
rithm is that at each iteration we follow the direction −∇fit (xt ), which is an unbiaised estimator
of −∇f (xt ). Indeed, since
n
X 1
E ∇fi (xt ) | xt = ∇fi (xt ) = ∇f (xt ).
(41)
n
i=1
25
each γt is less than 2L1max . The particular cases of constant stepsizes and of decreasing stepsizes
are dealt with in Theorem 5.5.
Theorem 5.3. Let Assumptions (Sum of Lmax –Smooth) and (Sum of Convex) hold. Consider
(xt )t∈N a sequence generated by the (SGD) algorithm, with a stepsize satisfying 0 < γt < 2L1max .
It follows that
Pt−1 2
kx0 − x∗ k2 k=0 γk
t
σf∗ ,
E f (x̄ ) − inf f ≤ Pt−1 + Pt−1 (42)
2 k=0 γk (1 − 2γk Lmax ) k=0 γk (1 − 2γk Lmax )
Proof. Let x∗ ∈ argmin f , so we have σf∗ = V[∇fi (x∗ )] (see Lemma 4.17). We will note Ek [·]
instead of Ek · | xk , for simplicity. Let us start by analyzing the behaviour of kxk − x∗ k2 . By
26
Remark 5.4 (On the choice of stepsizes for (SGD)). Looking at the bound obtained in Theorem
5.3, we see that the first thing we want is ∞
P
s=0 γs = +∞ so that the first term (a.k.a the bias
term) vanishes. This can be achieved with constant stepsizes, or with stepsizes of the form t1α
with α < 1 (see Theorem 5.5 below). The second term (a.k.a the variance term) is less trivial to
analyse.
• If interpolation holds (see Definition 4.9), then the variance term σf∗ is zero. This means
1
that the expected values converge to zero at a rate of the order Pt−1 . For constant
s=0 γs
1 1 1
stepsizes this gives a O t rate. For decreasing stepsizes γt = tα this gives a O t1−α
rate. We see that the best among those rates is obtained when α = 0 and the decay in the
stepsize is slower. In other words when the stepsize is constant. Thus when interpolation
holds the problem is so easy that the stochastic algorithm behaves like the deterministic
one and enjoys a 1/t rate with constant stepsize, as in Theorem 3.4.
• If interpolation does not hold the expected values will be asymptotically controlled by
Pt−1 2
γs
Ps=0
t−1 .
s=0 γs
We see that we want γs to decrease as slowly as possible (so that the denominator is
big) but at the same time that γs vanishes as fast as possible (so that the numerator is
small). So a trade-off must be found. For constant stepsizes, this term becomes a constant
O(1), and thus (SPGD) does not converge for constant stepsizes. For decreasing stepsizes
γt = t1α , this term becomes (omitting logarithmic terms) O t1α if 0 < α ≤ 12 , and O t1−α
1
if 12 ≤ α < 1. So the best compromise for this bound is to take α = 1/2. This case is
detailed in the next Theorem.
Theorem 5.5. Let Assumptions (Sum of Lmax –Smooth) and (Sum of Convex) hold. Consider
(xt )t∈N a sequence generated by the (SGD) algorithm with a stepsize γt > 0.
1
1. If γt = γ < 2Lmax , then for every t ≥ 1 we have that
kx0 − x∗ k2 1 γ
E f (x̄t ) − f (x∗ ) ≤ σ∗ ,
+ (43)
2γ(1 − 2γLmax ) t 1 − 2γLmax f
def 1 Pt−1
where x̄t = t k=0 x
k.
2. If γt = √γ with γ ≤ 1
t+1 2Lmax , then for every t we have that
h i kx0 − x∗ k2 γ 2 log(k)
log(k)
k ∗ ∗
E f (x̄ ) − f (x ) ≤ √ + √ σf = O √ . (44)
2γ k γ k k
27
def Pt−1 def γk (1−2γk Lmax )
where x̄t = k
k=0 pt,k x , with pt,k = Pt−1 .
i=0 γi (1−2γi Lmax )
Now using (45) and (46), together with the fact that γLmax ≤ 12 , we have that for k ≥ 2
k−1 √ √ √
X √ √
γi (1 − 2γi Lmax ) ≥ 2γ k− 2 − γLmax log(k) ≥ 2γ k − 2 − log( k) .
i=0
√
Because X 7→ X2 − ln(X) is increasing for X ≥ 2, we know that X2 − ln(X) ≥ 2 for X large
√
enough (for instance X ≥ 7). So by taking X = k, we deduce from the above inequality
that for k ≥ 72 we have
k−1
X √
γi (1 − 2γi Lmax ) ≥ γ k.
i=0
It remains to use the above inequality in (42) and (45) again to arrive at
Pk−1
h
∗
i kx0 − x∗ k2 γt2
k
E f (x̄ ) − f (x ) ≤ Pk−1 + Pk−1 t=0
σf∗
2 i=0 γi (1 − 2γi Lmax ) i=0 γi (1 − 2γi Lmax )
kx0 − x∗ k2 γ 2 log(k) ∗
≤ √ + √ σf
2γ k γ k
log(k)
=O √ .
k
Corollary 5.6 (O(1/2 ) Complexity). Consider the setting of Theorem 5.5. Let ≥ 0 and let
γ = 2L1max √1t . It follows that for t ≥ 4 that
2
σf∗
1 ∗ 2
0
=⇒ E f (x̄t ) − f (x∗ ) ≤
t≥ 2 2Lmax kx − x k + (47)
Lmax
28
1 √
Proof. From (43) plugging γ = we have
2Lmax t
√
∗
2Lmax tkx0 − x∗ k2 1 1 1
t
σf∗ .
E f (x̄ ) − f (x ) ≤ 1 + √
2(1 − t )
√ t 2Lmax t 1 − √1t
Now note that for t ≥ 4 we have that
1
≤ 2.
1 − √1t
Using this in the above we have that
σf∗
1
E f (x̄t ) − f (x∗ ) ≤ √ 0 ∗ 2
2Lmax kx − x k + .
t Lmax
Consequently (47) follows by demanding that the above inequality is less than .
Theorem 5.7. Let Assumptions (Sum of Lmax –Smooth) and (Sum of Convex) hold, and assume
further that f is µ-strongly convex. Consider the (xt )t∈N sequence generated by the (SGD)
algorithm with a constant stepsize satisfying 0 < γ < 2L1max . It follows that for t ≥ 0,
2γ ∗
Ekxt − x∗ k2 ≤ (1 − γµ)t kx0 − x∗ k2 + σ .
µ f
Proof. Let x∗ ∈ argmin f , so that σf∗ = V[∇fi (x∗ )] (see Lemma 4.17). We will note Ek [·] instead
of E · | xk , for simplicity. Expanding the squares we have
(39)
kxt+1 − x∗ k2 = kxk − x∗ − γ∇fi (xk )k2
= kxk − x∗ k2 − 2γhxk − x∗ , ∇fi (xk )i + γ 2 k∇fi (xk )k2 .
29
Corollary 5.8 (Õ(1/) Complexity). Consider the setting of Theorem 5.7. Let > 0. If we set
( )
µ 1
γ = min , (48)
2 2σf∗ 2Lmax
then ∗
1 4σf 2Lmax 2kx0 − x∗ k2
t ≥ max , log =⇒ kxt − x∗ k2 ≤ (49)
µ2 µ
2σf∗
Proof. Applying Lemma A.2 with A = µ , C = 2Lmax and α0 = kx0 − x∗ k2 gives the result (49).
Theorem 5.9. Let Assumption (Sum of Lmax –Smooth) hold, and assume that f is µ-Polyak-
Lojasiewicz for some µ > 0. Consider (xt )t∈N a sequence generated by the (SGD) algorithm,
with a constant stepsize satisfying 0 < γ ≤ Lf Lµmax . It follows that
γLf Lmax ∗
E[f (xt ) − inf f ] ≤ (1 − γµ)t (f (x0 ) − inf f ) + ∆f .
µ
Proof. Remember from Assumption (Sum of Lmax –Smooth) that f is Lf -smooth, so we can use
Lemma 2.25, together with the update rule of SGD, to obtain:
Lf t+1
f (xt+1 ) ≤ f (xt ) + h∇f (xt ), xt+1 − xt i + kx − xt k2
2
Lf γ 2
= f (xt ) − γh∇f (xt ), ∇fi (xt )i + k∇fi (xt )k2 .
2
After taking expectation conditioned on xt , we can use a variance transfer lemma together with
the Polyak-Lojasiewicz property to write
Lf γ 2
E f (xt+1 ) | xt f (xt ) − γk∇f (xt )k2 + E k∇fi (xt )k2 | xt
≤
2
Lemma 4.19
≤ f (xt ) − γk∇f (xt )k2 + γ 2 Lf Lmax (f (xt ) − inf f ) + γ 2 Lf Lmax ∆∗f
Definition 2.17
≤ f (xt ) + γ(γLf Lmax − 2µ)(f (xt ) − inf f ) + γ 2 Lf Lmax ∆∗f
≤ f (xt ) − γµ(f (xt ) − inf f ) + γ 2 Lf Lmax ∆∗f ,
where in the last inequality we used our assumption on the stepsize to write γLf Lmax − 2µ ≤ −µ.
Note that µγ ≤ 1 because of our assumption on the stepsize, and the fact that µ ≤ Lf ≤ Lmax (see
Remark 2.27). Subtracting inf f from both sides in the last inequality, and taking expectation, we
obtain
30
Recursively applying the above and summing up the resulting geometric series gives:
t−1
X
t 0t 2 ∗
(1 − γµ)j .
E f (x ) − inf f ≤ (1 − µγ) E f (x ) − inf f + γ Lf Lmax ∆f
j=0
Pt−1 1−(1−µγ)t 1
Using i=0 (1 − µγ)i = 1−1+µγ ≤ µγ , in the above gives (5.9).
Corollary 5.10 (Õ(1/) Complexity). Consider the setting of Theorem 5.9. Let ≥ 0 be given.
If we set
( )
µ
γ = min , 1 (50)
Lf Lmax 2∆∗f
then
2∆∗f 2(f (x0 ) − inf f )
Lf Lmax
t≥ max , 1 log =⇒ f (x0 ) − inf f ≤ (51)
µ2
Lf Lmax ∗
Proof. Applying Lemma A.2 with A = µ ∆f and α0 = f (x0 ) − inf f gives the result (51).
Remark 5.11 (From finite sum to expectation). The theorems we prove here also holds when
31
the objective is a true expectation where
Further we have defined the Lmax smoothness as the largest smoothness constant of every f (x, ξ)
for every ξ. The gradient noise σf∗ is would now be given by
def
σf∗ = Eξ k∇f (x∗ , ξ)k2 .
inf
x∗ ∈argmin f
With these extended definitions we have that Theorems 5.5, 5.7 and 5.9 hold verbatim.
We also give some results for minimizing expectation for stochastic subgradient in Section 9.
6 Minibatch SGD
6.1 Definitions
When solving (Sum of Functions) in practice, an estimator of the gradient is often computed using
a small batch of functions, instead of a single one as in (SGD). More precisely, given a subset
B ⊂ {1, . . . , n}, we want to make use of
def 1 X
∇fB (xt ) = ∇fi (xt ).
|B|
i∈B
Algorithm 6.1 (MiniSGD). Let x0 ∈ Rd , let a batch size b ∈ {1, . . . , n}, and let γt > 0 be
a sequence of step sizes. The Minibatching Stochastic Gradient Descent (MiniSGD)
algorithm is given by the iterates (xt )t∈N where
Definition 6.2 (Mini-batch distribution). We impose in this section that the batches B are
sampled uniformly among all subsets of size b in {1, . . . , n}. This means that each batch is
sampled with probability
1 (n − b)!b!
n = ,
b
n!
and that we will compute expectation and variance with respect to this uniform law. For instance
32
the expectation of the minibatched gradient writes as
1 X
Eb [∇fB (x)] = n
∇fB (x),
b B⊂{1,...,n}
|B|=b
Mini-batching makes better use of parallel computational resources and it also speeds-up the
convergence of (SGD), as we show next. To do so, we will need the same central tools than for
(SGD), that is the notions of gradient noise, of expected smoothness, and a variance transfer
lemma.
Definition 6.3. Let Assumption (Sum of Lmax –Smooth) hold, and let b ∈ {1, . . . , n}. We define
the minbatch gradient noise as
def
σb∗ = inf Vb [∇fB (x∗ )] , (53)
x∗ ∈argmin f
Definition 6.4. Let Assumption (Sum of Lmax –Smooth) hold, and let b ∈ {1, . . . , n}. We say
that f is Lb -smooth in expectation if
1
for all x, y ∈ Rd , Eb k∇fB (y) − ∇fB (x)k2 ≤ f (y) − f (x) − h∇f (x), y − xi ,
2Lb
where B is sampled according to Definition 6.2.
Lemma 6.5 (From single batch to minibatch). Let Assumptions (Sum of Lmax –Smooth) and
(Sum of Convex) hold. Then f is Lb -smooth in expectation with
n(b − 1) n−b
Lb = L+ Lmax , (54)
b(n − 1) b(n − 1)
Remark 6.6 (Minibatch interpolates between single and full batches). It is intersting to look
at variations of the expected smoothness constant Lb and minibatch gradient noise σb∗ when b
varies from 1 to n. For b = 1, where (MiniSGD) reduces to (SGD), we have that Lb = Lmax and
σb∗ = σf∗ , which are the constants governing the complexity of (SGD) as can be seen in Section 5.
On the other extreme, when b = n (MiniSGD) reduces to (GD), we see that Lb = L and σb∗ = 0.
33
We recover the fact that the behavior of (GD) is controlled by the Lipschitz constant L, and has
no variance.
We end this presentation with a variance transfer lemma, analog to Lemma 4.20 (resp. Lemma 2.29)
in the single batch (resp. full batch).
Lemma 6.7. Let Assumptions (Sum of Lmax –Smooth) and (Sum of Convex) hold. It follows
that
Eb k∇fB (x)k2 ≤ 4Lb (f (x) − inf f ) + 2σb∗ .
Proof of Lemmas 6.5 and 6.7. See Proposition 3.8 and 3.10 in [17].
Theorem 6.8. Let Assumptions (Sum of Lmax –Smooth) and (Sum of Convex) hold. Consider
(xt )t∈N a sequence generated by the (MiniSGD) algorithm, with a sequence of stepsizes satisfying
0 < γt < 2L1 b . It follows that
Pt−1 2
kx0 − x∗ k2 γ
E f (x̄t ) − inf f ≤ + Pt−1 k=0 k σb∗ ,
Pt−1
2 k=0 γk (1 − 2γk Lb ) k=0 γ k (1 − 2γ k Lb )
def Pt−1 def γk (1−2γk Lb )
where x̄t = k=0 pt,k x
k, with pt,k = Pt−1 .
i=0 γi (1−2γi Lb )
Proof. Let x∗ ∈ argmin f , so we have σb∗ = V[∇fB (x∗ )]. Let us start by analyzing the behaviour
of kxk − x∗ k2 . By developing the squares, we obtain
Hence, after taking the expectation conditioned on xk , we can use the convexity of f and the
variance transfer lemma to obtain:
h i h i
Eb kxk+1 − x∗ k2 | xk = kxk − x∗ k2 + 2γk h∇f (xk ), x∗ − xk i + γk2 Eb k∇fBt (xk )k2 | xk
Lem. 2.8 & 6.7
≤ kxk − x∗ k2 + 2γk (2γk Lb − 1)(f (xk ) − inf f )) + 2γk2 σb∗ .
34
Finally, define for k = 0, . . . , t − 1
def γk (1 − 2γk Lb )
pt,k = Pt−1
i=0 γi (1 − 2γi Lb )
Pt−1
and observe that pt,k ≥ 0 and k=0 pt,k = 1. This allows us to treat the (pt,k )t−1 k=0 as if it was a
probability vector. Indeed, using that f is convex together with Jensen’s inequality gives
t−1
" #
X γ k (1 − 2γ k Lb )
Eb f (x̄t ) − inf f ≤ (f (xk ) − inf f )
Eb Pt−1
k=0 i=0 γ i (1 − 2γ i L b )
kx − x∗ k2
0 σb∗ t−1 2
P
k=0 γk
≤ Pt−1 + Pt−1 .
2 i=0 γi (1 − 2γi Lb ) i=0 γi (1 − 2γi Lb )
Theorem 6.9. Let Assumptions (Sum of Lmax –Smooth) and (Sum of Convex) hold. Consider
(xt )t∈N a sequence generated by the (MiniSGD) algorithm with a sequence of stepsizes γt > 0.
1
1. If γt = γ < 2Lb , then for every t ≥ 1 we have that
kx0 − x∗ k2 1 γ
E f (x̄t ) − inf f ≤ σ∗,
+
2γ(1 − 2γLb ) t 1 − 2γLb b
def 1 Pt−1
where x̄t = t k=0 x
k.
2. If γt = √γ with γ ≤ 1
t+1 2Lb , then for every t we have that
kx0 − x∗ k2 γ 2 log(t) ∗
log(t)
E f (x̄t ) − inf f ≤
√ + √ σb = O √ .
2γ t γ t t
def Pt−1 def γk (1−2γk Lb )
where x̄t = k
k=0 pt,k x , with pt,k = Pt−1 .
i=0 γi (1−2γi Lb )
Corollary 6.10 (O(1/2 ) Complexity). Consider the setting of Theorem 6.8. Let ≥ 0 and let
γ = 2L1 b √1t . It follows that for t ≥ 4 that
σb∗ 2
1 0 ∗ 2
=⇒ E f (x̄t ) − inf f ≤ .
t≥ 2 2Lb kx − x k +
Lb
Proof. The proof is a consequence of Theorem 6.8 and follows exactly the lines of the proofs of
Theorem 5.5 and Corollary 5.6.
35
Theorem 6.11. Let Assumptions (Sum of Lmax –Smooth) and (Sum of Convex) hold, and as-
sume further that f is µ-strongly convex. Consider (xt )t∈N a sequence generated by the (Min-
iSGD) algorithm, with a constant sequence of stepsizes γt ≡ γ ∈]0, 2L1 b ]. Then
2γσb∗
Eb kxt − x∗ k2 ≤ (1 − γµ)t kx0 − x∗ k2 +
.
µ
Proof. Let x∗ ∈ argmin f , so that σb∗ = Vb [∇fB (x∗ )]. Expanding the squares we have
(M iniSGD)
kxt+1 − x∗ k2 = kxt − x∗ − γ∇fBt (xt )k2
= kxt − x∗ k2 − 2γhxt − x∗ , ∇fBt (xt )i + γ 2 k∇fBt (xt )k2 .
Taking expectation conditioned on xt and using Eb [∇fB (x)] = ∇f (x) (see Remark 6.2), we obtain
1
where we used in the last inequality that 2γLb ≤ 1 since γ ≤ 2Lb . Recursively applying the above
and summing up the resulting geometric series gives
t−1
X
Eb kxt − x∗ k2 ≤ (1 − γµ)t kx0 − x∗ k2 + 2 (1 − γµ)k γ 2 σb∗
k=0
2γσb∗
≤ (1 − γµ)t kx0 − x∗ k2 + .
µ
Corollary 6.12 (Õ(1/) Complexity). Consider the setting of Theorem 6.11. Let > 0 be
given. Hence, given any > 0, choosing stepsize
1 µ
γ = min , , (56)
2Lb 4σb∗
and
2Lb 4σb∗ 2kx0 − x∗ k2
t ≥ max , log =⇒ Ekxt − x∗ k2 ≤ . (57)
µ µ2
2σb∗
Proof. Applying Lemma A.2 with A = µ , C = 2Lb and α0 = kx0 −x∗ k2 gives the result (57).
36
6.4 Bibliographic Notes
The SGD analysis in [26] was later extended to a mini-batch analysis [27], but restricted to mini-
batches that are disjoint partitions of the data. Our results on mini-batching in Section 6 are
instead taken from [17]. We choose to adapt the proofs from [17] since these proofs allow for
sampling with replacement. The smoothness constant in (54) was introduced in [15] and this
particular formula was conjectured in [11].
7 Stochastic Momentum
For most, if not all, machine learning applications SGD is used with momentum. In the machine
learning community, the momentum method is often written as follows
Algorithm 7.1 (Momentum). Let Assumption (Sum of Lmax –Smooth) hold. Let x0 ∈ Rd and
m−1 = 0, let (γt )t∈N ⊂]0, +∞[ be a sequence of stepsizes, and let (βt )t∈N ⊂ [0, 1] be a sequence
of momentum parameters. The Momentum algorithm defines a sequence (xt )t∈N satisfying for
every t ∈ N
At the end of this section we will see in Corollary 7.4 that in the convex setting, the sequence xt
generated by the (Momentum) algorithm has a complexity rate of O(1/ε2 ). This is an improvement
with respect to (SGD), for which we only know complexity results about the average of the iterates,
see Corollary 5.6.
Lemma 7.2. The algorithms (Momentum) and Heavy Ball (given by (59)) are the equivalent.
More precisely, if (xt )t∈N is generated by (Momentum) from parameters γt , βt , then it verifies
(59) by taking γ̂t = γt and β̂t = γγt−1
t βt
, where γ−1 = 0.
xt+1 = xt − γt mt
= xt − γt βt mt−1 − γt ∇fi (xt ).
37
xt−1 −xt
Now using (58) at time t − 1 we have that mt−1 = γt−1 which when inserted in the above gives
γt βt t−1
xt+1 = xt − (x − xt ) − γt ∇fi (xt ).
γt−1
γt βt
The conclusion follows by taking γ̂t = γt and β̂t = γt−1 .
There is yet a third equivalent way of writing down the momentum method that will be useful
in establishing convergence.
More precisely, if (xt )t∈N is generated by (Momentum) from parameters γt , βt , then it verifies
(IMA) by chosing any parameters (ηt , λt ) satisfying
γt−1 λt
βt λt+1 = − βt , ηt = (1 + λt+1 )γt , and z t−1 = xt + λt (xt − xt−1 ).
γt
which after dividing by (1 + λt+1 ) directly gives us (61). Now use Lemma 7.2 to write that
γt βt
where β̂t = γt−1 . Going back to the definition of z t , we can write
where in the last but one equality we used the fact that
γt βt
(1 + λt+1 )γt = ηt and (1 + λt+1 )β̂t = (1 + λt+1 ) = λt .
γt−1
38
7.2 Convergence for convex and smooth functions
Theorem 7.4. Let Assumptions (Sum of Lmax –Smooth) and (Sum of Convex) hold. Consider
(xt )t∈N the iterates generated by the (Momentum) algorithm with stepsize and momentum pa-
rameters taken according to
2η t 1
γt = , βt = , with η≤ .
t+3 t+2 4Lmax
Then the iterates converge according to
kx0 − x∗ k2
E f (xt ) − inf f ≤ + 2ησf∗ .
(63)
η (t + 1)
Proof. For the proof, we rely on the iterate-moving-average (IMA) viewpoint of momentum given
in Lemma 7.3. It is easy to verify that the parameters
t
ηt = η, λt = and z t−1 = xt + λt (xt − xt−1 )
2
verify the conditions of Lemma 7.3. Let us then consider the iterates (xt , z t ) of (IMA), and we
start by studing the variations of kz t − x∗ k2 . Expanding squares we have for t ∈ N that
In the last equality we made appear xt−1 which , for t = 0, can be taken equal to zero. Then
taking conditional expectation, using the convexity of f (via Lemma 2.8) and a variance transfer
lemma (Lemma 4.20), we have
E kz t − x∗ k2 | xt = kz t−1 − x∗ k2 + 2ηh∇f (xt ), x∗ − xt i + 2ηλt h∇f (xt ), xt−1 − xt i + η 2 Et k∇fit (xt )k2 | xt ,
≤ kz t−1 − x∗ k2 + (4η 2 Lmax − 2η) f (xt ) − inf f + 2ηλt f (xt−1 ) − f (xt ) + 2η 2 σf∗
where we used the facts that η ≤ 4L1max and λt + 12 = λt+1 in the last inequality. Taking now
expectation and summing over t = 0, . . . , T , we have after telescoping and cancelling terms
Now, the fact that λ0 = 0 cancels one term, and also implies that z −1 = x0 + λ0 (x0 − x−1 ) = x0 .
After dropping the positive term E kz T − x∗ k2 , we obtain
39
Dividing through by 2ηλT +1 , where our assumption on the parameters gives 2λT +1 = T + 1, we
finally conclude that for all T ∈ N
kx0 − x∗ k2
E f (xT ) − inf f ≤ + 2σf∗ η.
η(T + 1)
Corollary 7.5 (O(1/2 ) Complexity). Consider the setting of Theorem 7.4. We can guarantee
that E f (xT ) − inf f ≤ provided that we take
2
1 1 1 1 1
η=√ and T ≥ 2 2 kx0 − x∗ k2 + σf∗ (64)
T + 1 4Lmax 4Lmax 2
Proof. By plugging η = √1 1
into (63), we obtain
T +1 4Lmax
1 1 1 ∗ 2 ∗
E [f (xT ) − inf f ] ≤ √ kx0 − x k + σf . (65)
2Lmax T + 1 2
The final complexity (64) follows by demanding that (65) be less than .
2. Differentiable functions not being defined on the entire space. An other way to say it is that
they take the value +∞ outside of their domain. This can be seen as nonsmoothness, as the
behaviour of the function at the boundary of the domain can be degenerate.
40
• The most typical example is the indicator function of some constraint C ⊂ Rd , and
which is defined as δC (x) = 0 if x ∈ C, +∞ if x ∈
/ C. Such function is useful because
it allows to say that minimizing a function f over the constraint C is the same as
minimizing the sum f + δC .
We say that f is lower semi-continuous (l.s.c. for short) if f is lower semi-continuous at every
x̄ ∈ Rd .
• If C ⊂ Rd is closed and nonempty, then its indicator function δC is proper and l.s.c.
As hinted by the above example, it is safe to say that most functions used in practice are proper
and l.s.c.. It is a minimal technical assumption which is nevertheless needed for what follows (see
e.g. Lemmas 8.9 and 8.11).
41
def
We also call ∂f (x) the subdifferential of f . Finally, define dom ∂f = {x ∈ Rd | ∂f (x) 6= ∅}.
If f is differentiable, then ∇f (x) is the unique subgradient at x, as we show next. This means
that the subdifferential is a faithful generalization of the gradient.
Proof. (Proof adapted from [33, Proposition 3.20]). From Lemma 2.8 we have that ∇f (x) ∈
∂f (x). Suppose now that η ∈ ∂f (x), and let us show that η = ∇f (x). For this, take any v ∈ Rd
and t > 0, and Definition 8.4 to write
f (x + tv) − f (x)
f (x + tv) − f (x) − hη, (x + tv) − xi ≥ 0 ⇔ ≥ hη, vi.
t
Taking the limit when t ↓ 0, we obtain that
By choosing v = η − ∇f (x), we obtain that k∇f (x) − ηk2 ≤ 0 which in turn allows us to conclude
that ∇f (x) = η.
Remark 8.7. As hinted by the previous results and comments , this definition of subdifferential
is tailored for nonsmooth convex functions. There exists other notions of subdifferential which are
better suited for nonsmooth nonconvex functions. But we will not discuss it in this monograph,
for the sake of simplicity. The reader interested in this topic can consult [6, 36].
x̄ is a minimizer of f
⇔ for all y ∈ Rd , f (y) − f (x̄) ≥ 0
⇔ for all y ∈ Rd , f (y) − f (x̄) − h0, y − xi ≥ 0
(66)
⇔ 0 ∈ ∂f (x̄).
42
Lemma 8.9 (Sum rule). Let f : Rd → R be convex and differentiable. Let g : Rd → R ∪ {+∞}
be proper l.s.c. convex. Then, for all x ∈ Rd , ∂(f + g)(x) = {∇f (x)} + ∂g(x).
Lemma 8.10 (Positive homogeneity). Let f : Rd → R be proper l.s.c. convex. Let x ∈ Rd , and
γ ≥ 0. Then ∂(γf )(x) = γ∂f (x).
Lemma 8.11. If f : Rd → R∪{+∞} is a proper l.s.c. µ-strongly convex function, then f admits
a unique minimizer.
Lemma 8.12. If f : Rd → R ∪ {+∞} is a proper l.s.c and µ-strongly convex function, then for
every x, y ∈ Rd , and for every η ∈ ∂f (x) we have that
µ
f (y) − f (x) − hη, y − xi ≥ ky − xk2 . (68)
2
Proof. Define g(x) := f (x) − µ2 kxk2 . According to Lemma 2.12, g is convex. It is also clearly l.s.c.
and proper, by definition. According to the sum rule in Lemma 8.9, we have ∂f (x) = ∂g(x) + µx.
Therefore we can use the convexity of g with Definition 8.4 to write
µ µ µ
f (y) − f (x) − hη, y − xi ≥ kyk2 − kxk2 − hµx, y − xi = ky − xk2 .
2 2 2
Definition 8.13. Let g : Rd → R ∪ {+∞} be a proper l.s.c convex function. We define the
proximal operator of g as the function proxg : Rd → Rd defined by
1
proxg (x) := argmin g(x0 ) + kx0 − xk2 (69)
x0 ∈Rd 2
The proximal operator is well defined because, since g(x0 ) is convex the sum g(x0 ) + 21 kx0 − xk2 is
strongly convex in x0 . Thus there exists only one minimizer (recall Lemma 8.11).
43
Example 8.14 (Projection is a proximal operator). Let C ⊂ Rd be a nonempty closed convex
set, and let δC be its indicator function. Then the proximal operator of δC is exactly the projection
operator onto C:
def
proxδC (x) = projC (x) = argmin kc − xk2 .
c∈C
Lemma 8.15. Let g : Rd → R ∪ {+∞} be a proper l.s.c convex function, and let x, p ∈ Rd .
Then p = proxg (x) if and only if
x − p ∈ ∂g(p). (70)
Proof. From Definition 8.13 we know that p = proxf (x) if and onlny if p is the minimizer of
φ(u) := g(u) + 21 ku − xk2 . From our hypotheses on g, it is clear that φ is proper l.s.c convex. So
we can use Proposition 8.8 to say that it is equivalent to 0 ∈ ∂φ(p). Moreover, we can use the sum
rule from Lemma 8.9 to write that ∂φ(p) = ∂g(p)+{p−x}. So we have proved that p = proxg (x) if
and only if 0 ∈ ∂g(p) + {p − x}, which is what we wanted to prove, after rearranging the terms.
We show that, like the projection, the proximal operator is 1-Lipschitz (we also say that it is
non-expansive). This property will be very interesting for some proofs since it will allow us to “get
rid” of the proximal terms.
def def
Proof. Let py = proxg (y) and px = proxg (x). From px = proxg (x) we have x − px ∈ ∂g(px ) (see
Lemma 8.15), so from the definition of the subdifferential (Definition 8.4), we obtain
g(py ) − g(px ) − hx − px , py − px i ≥ 0.
g(px ) − g(py ) − hy − py , px − py i ≥ 0.
hy − x − py + px , px − py i ≤ 0.
Expanding the left argument of the inner product, and using the Cauchy-Schwartz inequality gives
kpx − py k2 ≤ hx − y, px − py i ≤ kx − ykkpx − py k.
Dividing through by kpx − py k (assuming this is non-zero otherwise (71) holds trivially) we
have (71).
44
We end this section with an important property of the proximal operator : it can help to
characterize the minimizers of composite functions as fixed points.
Proof. Since x∗ ∈ argmin(f + g) we have that 0 ∈ ∂(f + g)(x∗ ) = ∇f (x∗ ) + ∂g(x∗ ) (Proposition
8.8 and Lemma 8.9). By multiplying both sides by γ then by adding x∗ to both sides gives
According to Lemma 8.15, this means that proxγg (x∗ − γ∇f (x∗ )) = x∗ .
Note that the divergence Df (y; x) is always nonegative when f is convex due to Lemma 2.8.
Moreover, the divergence is also upper bounded by suboptimality.
Proof. Since x∗ ∈ argmin F , we can use the Fermat Theorem (Proposition 8.8) and the sum rule
(Lemma 8.9) to obtain the existence of some η ∗ ∈ ∂g(x∗ ) such that ∇f (x∗ ) + η ∗ = 0. Use now the
definition of the Bregman divergence, and the convexity of g (via Lemma 2.8) to write
Next we provide a variance transfer lemma, generalizing Lemma 4.20, which will prove to be
useful when dealing with nonsmooth sum of functions in Section 11.
45
Lemma 8.20 (Variance transfer - General convex case). Let f verify Assumptions (Sum of
Lmax –Smooth) and (Sum of Convex). For every x, y ∈ Rd , we have
Proof. Simply use successively Lemma 4.6, the inequality ka + bk2 ≤ 2kak2 + 2kbk2 , and the
expected smoothness (via Lemma 4.7) to write:
Note the difference between σf∗ introduced in Definition 4.16 and σF∗ introduced here is that
the variance of gradients taken at the minimizers of the composite sum F , as opposed to f .
Lemma 8.22. Let f : Rd → R verify Assumptions (Sum of Lmax –Smooth) and (Sum of Convex).
Let g : Rd → R ∪ {+∞} be proper l.s.c convex. Let F = g + f be such that argmin F 6= ∅.
1. σF∗ ≥ 0.
3. If σF∗ = 0 then there exists x∗ ∈ argmin F such that x∗ ∈ argmin (g +fi ) for all i = 1, . . . , n.
The converse implication is also true if g is differentiable at x∗ .
Proof. Item 1 is trivial. For item 2, consider two minimizers x∗ , x0 ∈ argmin F , and use the
expected smoothness of f (via Lemma 4.7) together with Lemma 8.19 to write
1
E k∇fi (x∗ ) − ∇fi (x0 )k2 ≤ Df (x∗ ; x0 ) ≤ F (x∗ ) − inf F = 0.
2Lmax
In other words, we have ∇fi (x∗ ) = ∇fi (x0 ) for all i = 1, . . . , n, which means that indeed V [∇fi (x∗ )] =
V [∇fi (x0 )]. Now we turn to item 3, and start by assuming that σF∗ = 0. Let x∗ ∈ argmin F , and
we know from the previous item that V [∇fi (x∗ )] = 0. This is equivalent to say that, for every i,
46
∇fi (x∗ ) = ∇f (x∗ ). But x∗ being a minimizer implies that −∇f (x∗ ) ∈ ∂g(x∗ ) (use Proposition 8.8 and Lemma 8.9). So we have that −∇fi (x∗ ) ∈ ∂g(x∗ ), from which we conclude by the same arguments that x∗ ∈ argmin (g + fi ). Now let us prove the converse implication, by assuming further that g is differentiable at x∗ . From the assumption x∗ ∈ argmin (g + fi ), we deduce that −∇fi (x∗ ) ∈ ∂g(x∗ ) = {∇g(x∗ )} (see Lemma 8.6). Taking the expectation of this equality also gives us that −∇f (x∗ ) = ∇g(x∗ ). In other words, ∇fi (x∗ ) = ∇f (x∗ ) for every i. We can then conclude that V[∇fi (x∗ )] = 0. We finally turn to item 4, which is a direct consequence of Lemma 8.20 (with x = x∗ ∈ argmin F and y = x∗f ∈ argmin f ):

    σF∗ = V[∇fi (x∗ )] ≤ 4Lmax Df (x∗ ; x∗f ) + 2V[∇fi (x∗f )] = 4Lmax (f (x∗ ) − inf f ) + 2σf∗ ,

where we used that Df (x∗ ; x∗f ) = f (x∗ ) − inf f , since ∇f (x∗f ) = 0.
In this section we assume that the functions fξ are convex and have bounded subgradients.

Assumption 9.2 (Stochastic convex and G–bounded subgradients). We consider the problem (Stochastic Function) and assume that

• (convexity) for every ξ, the function fξ is convex, and we note g(x, ξ) ∈ ∂fξ (x) a choice of subgradient of fξ at x;

• (bounded subgradients) there exists G > 0 such that, for all x ∈ Rd , E_D ‖g(x, ξ)‖² ≤ G².
We now define the Stochastic Subgradient Descent algorithm, which is an extension of (SGD).
Instead of considering the gradient of a function fi , we consider here some subgradient of fξ .
Algorithm 9.3 (SSD). Consider Problem (Stochastic Function) and let Assumption (Stochastic convex and G–bounded subgradients) hold. Let x0 ∈ Rd , and let γt > 0 be a sequence of stepsizes. The Stochastic Subgradient Descent (SSD) algorithm is given by the iterates (xt )t∈N where

    x^{t+1} = x^t − γt g(x^t , ξt ),

with ξt ∼ D sampled independently at each iteration and g(x^t , ξt ) ∈ ∂fξt (x^t ).
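Below is a minimal Python sketch of (SSD) on a toy problem, f (x) = (1/n) Σᵢ |⟨aᵢ , x⟩ − bᵢ |, with the weighted averaging used in the theorems of this section; the problem instance, constants and names are illustrative assumptions, not part of the algorithm's definition.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d)

def subgrad(x, i):
    # A subgradient of f_i(x) = |<a_i, x> - b_i| (sign(0) = 0 is a valid choice)
    return np.sign(A[i] @ x - b[i]) * A[i]

T, gamma = 10_000, 1.0
x = np.zeros(d)
xbar, weight = np.zeros(d), 0.0
for t in range(T):
    step = gamma / np.sqrt(t + 1)                        # gamma_t = gamma / sqrt(t+1)
    xbar = (weight * xbar + step * x) / (weight + step)  # weighted average of the x^t
    weight += step
    x = x - step * subgrad(x, rng.integers(n))           # (SSD) update
print("f(xbar) =", np.mean(np.abs(A @ xbar - b)))
```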
Lemma 9.4. If Assumption (Stochastic convex and G–bounded subgradients) holds then for all
x ∈ Rd , we have ED [g(x, ξ)] ∈ ∂f (x).
Proof. With Assumption (Stochastic convex and G–bounded subgradients) we can use the fact that g(x, ξ) ∈ ∂fξ (x) to write, for all y ∈ Rd ,

    fξ (y) − fξ (x) − ⟨g(x, ξ), y − x⟩ ≥ 0.

Taking expectation with respect to ξ ∼ D, and using the fact that E_D [g(x, ξ)] is well-defined, we obtain

    f (y) − f (x) − ⟨E_D [g(x, ξ)] , y − x⟩ ≥ 0,

which exactly says that E_D [g(x, ξ)] ∈ ∂f (x).
Theorem 9.5. Let Assumption (Stochastic convex and G–bounded subgradients) hold, and consider (xt )t∈N a sequence generated by the (SSD) algorithm, with a sequence of stepsizes γt > 0. Then for x̄^T := (1/Σ_{k=0}^{T−1} γk ) Σ_{t=0}^{T−1} γt x^t we have

    E_D [f (x̄^T )] − inf f ≤ ‖x^0 − x∗ ‖² / (2 Σ_{t=0}^{T−1} γt ) + G² Σ_{t=0}^{T−1} γt² / (2 Σ_{t=0}^{T−1} γt ).   (76)
Proof. Expanding the squares in the (SSD) update gives

    ‖x^{t+1} − x∗ ‖² = ‖x^t − x∗ ‖² − 2γt ⟨g(x^t , ξt ), x^t − x∗ ⟩ + γt² ‖g(x^t , ξt )‖².

We will use the fact that our subgradients are bounded from Assumption (Stochastic convex and G–bounded subgradients), and that E_D [g(x^t , ξt ) | x^t ] ∈ ∂f (x^t ) (see Lemma 9.4). Taking expectation conditioned on x^t , the subgradient inequality gives

    E_D [‖x^{t+1} − x∗ ‖² | x^t ] ≤ ‖x^t − x∗ ‖² − 2γt (f (x^t ) − inf f ) + γt² G².

Re-arranging, taking expectation and summing up from t = 0, . . . , T − 1 gives

    2 Σ_{t=0}^{T−1} γt E_D [f (x^t ) − inf f ] ≤ Σ_{t=0}^{T−1} ( E_D ‖x^t − x∗ ‖² − E_D ‖x^{t+1} − x∗ ‖² ) + Σ_{t=0}^{T−1} γt² G²
        = E_D ‖x^0 − x∗ ‖² − E_D ‖x^T − x∗ ‖² + Σ_{t=0}^{T−1} γt² G²
        ≤ ‖x^0 − x∗ ‖² + Σ_{t=0}^{T−1} γt² G².   (78)
Recalling that x̄^T = (1/Σ_{k=0}^{T−1} γk ) Σ_{t=0}^{T−1} γt x^t , dividing through by 2 Σ_{k=0}^{T−1} γk and using Jensen's inequality we have

    E_D [f (x̄^T )] − inf f ≤ (1/Σ_{k=0}^{T−1} γk ) Σ_{t=0}^{T−1} γt E_D [f (x^t ) − inf f ]
        ≤ ‖x^0 − x∗ ‖² / (2 Σ_{k=0}^{T−1} γk ) + G² Σ_{t=0}^{T−1} γt² / (2 Σ_{k=0}^{T−1} γk ).   (79)
Theorem 9.6. Let Assumption (Stochastic convex and G–bounded subgradients) hold, and consider (xt )t∈N a sequence generated by the (SSD) algorithm, with the sequence of stepsizes γt := γ/√(t+1) for some γ > 0. We have for x̄^T := (1/Σ_{k=0}^{T−1} γk ) Σ_{t=0}^{T−1} γt x^t that

    E_D [f (x̄^T )] − inf f ≤ ( ‖x^0 − x∗ ‖² + γ² G² (1 + log(T )) ) / ( 4γ(√T − 1) ) = O( log(T )/√T ).

Proof. Apply Theorem 9.5, and control the sums of stepsizes with the integral bounds

    Σ_{t=0}^{T−1} 1/√(t+1) ≥ ∫₀^{T−1} (s + 1)^{−1/2} ds = 2(√T − 1),
    Σ_{t=0}^{T−1} 1/(t+1) ≤ 1 + ∫₀^{T−1} (s + 1)^{−1} ds = 1 + log(T ),

which give Σ_{t=0}^{T−1} γt ≥ 2γ(√T − 1) and Σ_{t=0}^{T−1} γt² ≤ γ² (1 + log(T )).
Now consider the constant stepsize γt ≡ γ/√T . Since Σ_{k=0}^{T−1} γk = γ√T and Σ_{t=0}^{T−1} γt² = γ², we deduce from (76) that

    E_D [f (x̄^T )] − inf f ≤ ‖x^0 − x∗ ‖² / (2γ√T ) + γG² / (2√T ).   (81)
Corollary 9.7. Consider the setting of Theorem 9.5. We can guarantee that E_D [f (x̄^T )] − inf f ≤ ε provided that

    T ≥ ‖x^0 − x∗ ‖² G² / ε² and γt ≡ ‖x^0 − x∗ ‖ / (G√T ).
Proof. Consider the result of Theorem 9.5 with a constant stepsize γt ≡ γ > 0:

    E_D [f (x̄^T )] − inf f ≤ ‖x^0 − x∗ ‖² / (2γT ) + γ² T G² / (2γT ) = ‖x^0 − x∗ ‖² / (2γT ) + γG² / 2.

It is a simple exercise to see that the above right-hand side is minimal when γ = ‖x^0 − x∗ ‖ / (G√T ). In this case, we obtain

    E_D [f (x̄^T )] − inf f ≤ ‖x^0 − x∗ ‖ G / √T .

The result follows by writing ‖x^0 − x∗ ‖ G / √T ≤ ε ⟺ T ≥ ‖x^0 − x∗ ‖² G² / ε².
9.2 Better convergence rates for convex functions with bounded solution
In the previous section, we saw that (SSD) has a O( ln(T )/√T ) convergence rate, but enjoys a O( 1/ε² ) complexity rate. The latter suggests that it is possible to get rid of the logarithmic term and achieve a O( 1/√T ) convergence rate. In this section, we see that this can be done, by making a localization assumption on the solution of the problem, and by making a slight modification to the (SSD) algorithm.
Assumption 9.8 (B–Bounded Solution). There exists B > 0 and a solution x∗ ∈ argmin f such that ‖x∗ ‖ ≤ B.
We will exploit this assumption by modifying the (SSD) algorithm, adding a projection step onto the closed ball B(0, B) to which we know the solution belongs. In this case the projection onto the ball is given by

    proj_{B(0,B)} (x) := x               if ‖x‖ ≤ B,
    proj_{B(0,B)} (x) := (B/‖x‖) x       if ‖x‖ > B.
See Example 8.14 for the definition of the projection onto a closed convex set.
Algorithm 9.9 (PSSD). Consider Problem (Stochastic Function) and let Assumptions (Stochas-
tic convex and G–bounded subgradients) and (B–Bounded Solution) hold. Let x0 ∈ B(0, B),
and let γt > 0 be a sequence of stepsizes. The Projected Stochastic Subgradient Descent
(PSSD) algorithm is given by the iterates (xt )t∈N where

    x^{t+1} = proj_{B(0,B)} ( x^t − γt g(x^t , ξt ) ),

with ξt ∼ D sampled independently at each iteration and g(x^t , ξt ) ∈ ∂fξt (x^t ).
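A minimal Python sketch of (PSSD) follows, reusing the same toy absolute-loss problem as for (SSD) above; the instance, the radius B and all names are illustrative assumptions.

```python
import numpy as np

def proj_ball(x, B):
    # Projection onto the closed ball B(0, B)
    nx = np.linalg.norm(x)
    return x if nx <= B else (B / nx) * x

rng = np.random.default_rng(0)
n, d, B = 50, 5, 10.0
A = rng.normal(size=(n, d))
b = A @ proj_ball(rng.normal(size=d), B)   # plant a solution inside B(0, B)

T, gamma = 10_000, 1.0
x, xbar = np.zeros(d), np.zeros(d)
for t in range(T):
    xbar += (x - xbar) / (t + 1)           # plain average (1/T) sum_t x^t
    i = rng.integers(n)
    g_t = np.sign(A[i] @ x - b[i]) * A[i]  # subgradient of f_i(x) = |<a_i, x> - b_i|
    x = proj_ball(x - gamma / np.sqrt(t + 1) * g_t, B)   # (PSSD) update
print("f(xbar) =", np.mean(np.abs(A @ xbar - b)))
```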
We now prove the following theorem, which is a simplified version of Theorem 19 in [7].
Theorem 9.10. Let Assumptions (Stochastic convex and G–bounded subgradients) and (B–Bounded Solution) hold. Let (xt )t∈N be the iterates of (PSSD), with the decreasing sequence of stepsizes γt := γ/√(t+1), with γ > 0. Then we have for T ≥ 2 and x̄^T := (1/T ) Σ_{t=0}^{T−1} x^t that

    E_D [f (x̄^T ) − inf f ] ≤ ( 3B²/γ + γG² ) · 1/√T .
Proof. We start by using Assumption (B–Bounded Solution) to write proj_{B(0,B)} (x∗ ) = x∗ . This together with the fact that the projection is nonexpansive (see Lemma 8.16 and Example 8.14) allows us to write, after expanding the squares,

    ‖x^{t+1} − x∗ ‖² = ‖proj_{B(0,B)} (x^t − γt g(x^t , ξt )) − proj_{B(0,B)} (x∗ )‖²
        ≤ ‖x^t − x∗ − γt g(x^t , ξt )‖²
        = ‖x^t − x∗ ‖² − 2γt ⟨g(x^t , ξt ), x^t − x∗ ⟩ + γt² ‖g(x^t , ξt )‖².
We now take expectation conditioned on x^t . We will use the fact that our subgradients are bounded from Assumption (Stochastic convex and G–bounded subgradients), and that E_D [g(x^t , ξt ) | x^t ] ∈ ∂f (x^t ) (see Lemma 9.4), which gives

    E_D [‖x^{t+1} − x∗ ‖² | x^t ] ≤ ‖x^t − x∗ ‖² − 2γt (f (x^t ) − inf f ) + γt² G².

Taking expectation, dividing through by 2γt and re-arranging gives

    E_D [f (x^t )] − inf f ≤ (1/(2γt )) E_D ‖x^t − x∗ ‖² − (1/(2γt )) E_D ‖x^{t+1} − x∗ ‖² + γt G²/2.
Summing up from t = 0, . . . , T − 1 and using telescopic cancellation gives

    Σ_{t=0}^{T−1} E_D [f (x^t )] − inf f ≤ (1/(2γ₀ )) ‖x^0 − x∗ ‖² + (1/2) Σ_{t=0}^{T−2} ( 1/γ_{t+1} − 1/γt ) E_D ‖x^{t+1} − x∗ ‖² + (G²/2) Σ_{t=0}^{T−1} γt .
In the above inequality, we are going to bound the term

    1/γ_{t+1} − 1/γt = ( √(t+2) − √(t+1) ) / γ ≤ 1/( 2γ√(t+1) ),

by using the fact that the square root function is concave. We are also going to bound the term E_D ‖x^{t+1} − x∗ ‖² ≤ 4B², using the fact that x∗ and the sequence x^t belong to B(0, B), due to the projection step. Together with Σ_{t=1}^{T} 1/√t ≤ 2√T − 1, this gives

    Σ_{t=0}^{T−1} E_D [f (x^t )] − inf f ≤ 2B²/γ + (B²/γ) Σ_{t=1}^{T−1} 1/√t + (γG²/2) Σ_{t=1}^{T} 1/√t
        ≤ 2B²/γ + (B²/γ)(2√T − 1) + γG² √T
        = B²/γ + 2B² √T /γ + γG² √T
        ≤ 3B² √T /γ + γG² √T ,

where in the last step we used the fact that √T ≥ 1.
Finally, recalling that x̄^T = (1/T ) Σ_{t=0}^{T−1} x^t , dividing through by T , and using Jensen's inequality, we have that

    E_D [f (x̄^T ) − inf f ] ≤ (1/T ) Σ_{t=0}^{T−1} E_D [f (x^t ) − inf f ] ≤ ( 3B²/γ + γG² ) · 1/√T .
Lemma 9.11. Consider Problem (Stochastic Function). There exist no functions fξ (x) such
that f (x) = E [fξ (x)] is µ–strongly convex and Assumption (Stochastic convex and G–bounded
subgradients) holds.
Proof. For ease of notation we will write g(x) := E_D [g(x, ξ)], for which we know that g(x) ∈ ∂f (x) according to Lemma 9.4. Re-arranging the definition of strong convexity for non-smooth functions in (68) with y = x∗ we have

    ⟨g(x), x − x∗ ⟩ ≥ f (x) − f (x∗ ) + (µ/2) ‖x∗ − x‖² ≥ (µ/2) ‖x∗ − x‖²,

where we used that f (x) − f (x∗ ) ≥ 0 and g(x) ∈ ∂f (x). Using the above and the Cauchy-Schwarz inequality we have that

    µ² ‖x − x∗ ‖² ≤ 4‖g(x)‖² = 4‖E_D [g(x, ξ)]‖² ≤ 4 E_D ‖g(x, ξ)‖² ≤ 4G², for all x ∈ Rd .   (85)

Since the above holds for all x ∈ Rd , we need only take x ∉ B(x∗ , 2G/µ) to arrive at a contradiction.
The problem in Lemma 9.11 is that we make two global assumptions which are incompatible. But we can consider a problem where those assumptions are only local. In the next result, we assume that the solution x∗ lies in a certain ball, and that the subgradients are bounded on this ball, and we consider the projected stochastic subgradient method (PSSD).
Theorem 9.12. Let f be µ-strongly convex, and let Assumption (B–Bounded Solution) hold. Consider Assumption (Stochastic convex and G–bounded subgradients), but assume that the bounded subgradients assumption holds only for every x ∈ B(0, B). Consider (xt )t∈N a sequence generated by the (PSSD) algorithm, with a constant stepsize γt ≡ γ ∈ ]0, 1/µ[. The iterates satisfy

    E_D ‖x^t − x∗ ‖² ≤ (1 − γµ)^t ‖x^0 − x∗ ‖² + γG²/µ.
Proof. Our Assumption (B–Bounded Solution) guarantees that proj_{B(0,B)} (x∗ ) = x∗ , so the definition of (PSSD) together with the nonexpansiveness of the projection gives

    ‖x^{t+1} − x∗ ‖² ≤ ‖x^t − γ g(x^t , ξt ) − x∗ ‖² = ‖x^t − x∗ ‖² − 2γ ⟨g(x^t , ξt ), x^t − x∗ ⟩ + γ² ‖g(x^t , ξt )‖².
We will now use our bound on the subgradients g(x^t , ξt ) from Assumption (Stochastic convex and G–bounded subgradients), using that the sequence x^t belongs to B(0, B) because of the projection step in (PSSD). Next we use that E_D [g(x^t , ξt ) | x^t ] ∈ ∂f (x^t ) (see Lemma 9.4) and that f is µ-strongly convex, to write

    E_D [‖x^{t+1} − x∗ ‖² | x^t ] ≤ ‖x^t − x∗ ‖² − 2γ ⟨E_D [g(x^t , ξt ) | x^t ], x^t − x∗ ⟩ + γ² G²
        ≤ ‖x^t − x∗ ‖² − 2γ(f (x^t ) − inf f ) − γµ ‖x^t − x∗ ‖² + γ² G²   (by (68))
        ≤ (1 − γµ) ‖x^t − x∗ ‖² + γ² G².
Taking expectation on the above, and using a recurrence argument, we can deduce that

    E_D ‖x^t − x∗ ‖² ≤ (1 − γµ)^t ‖x^0 − x∗ ‖² + Σ_{k=0}^{t−1} (1 − γµ)^k γ² G².

Since

    Σ_{k=0}^{t−1} (1 − γµ)^k γ² G² = γ² G² (1 − (1 − γµ)^t )/(γµ) ≤ γ² G²/(γµ) = γG²/µ,

we conclude that

    E_D ‖x^t − x∗ ‖² ≤ (1 − γµ)^t ‖x^0 − x∗ ‖² + γG²/µ.
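The recursion used in this proof, α_{t+1} ≤ (1 − γµ)α_t + γ²G², contracts up to a noise floor of γG²/µ; this can be checked numerically with a few lines (the constants below are made up).

```python
mu, G, gamma, alpha0 = 0.5, 2.0, 0.1, 10.0   # made-up constants, gamma < 1/mu
alpha, T = alpha0, 200
for _ in range(T):
    alpha = (1 - gamma * mu) * alpha + gamma**2 * G**2   # worst-case recursion
bound = (1 - gamma * mu) ** T * alpha0 + gamma * G**2 / mu
print(alpha, "<=", bound)
assert alpha <= bound + 1e-12
```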
Corollary 9.13 (O(1/ε) complexity). Consider the setting of Theorem 9.12. For every ε > 0, we can guarantee that E_D ‖x^t − x∗ ‖² ≤ ε provided that

    γ = min{ εµ/(2G²) ; 1/µ } and t ≥ max{ 2G²/(εµ²) ; 1 } ln( 8B²/ε ).

Proof. Use Lemma A.2 with A = G²/µ, C = µ, and use the fact that ‖x^0 − x∗ ‖ ≤ 2B.
10 Proximal Gradient Descent

In this section we consider composite minimization problems of the form

    min_{x∈Rd} F (x) := f (x) + g(x),   (Composite)

where f : Rd → R is differentiable, and g : Rd → R ∪ {+∞} is proper l.s.c. We require that the problem is well-posed, in the sense that argmin F ≠ ∅.
To exploit the structure of this composite sum, we will use the proximal gradient descent
algorithm, which alternates gradient steps with respect to the differentiable term f , and proximal
steps with respect to the nonsmooth term g.
Algorithm 10.2 (PGD). Let x0 ∈ Rd , and let γ > 0 be a stepsize. The Proximal Gradient Descent (PGD) algorithm defines a sequence (xt )t∈N which satisfies

    x^{t+1} = prox_{γg} ( x^t − γ∇f (x^t ) ).   (87)
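When g = λ‖·‖₁ , the update (87) becomes the classical iterative soft-thresholding scheme. Here is a minimal Python sketch of (PGD) on a small lasso-type instance; the data, λ and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 20, 0.1
A = rng.normal(size=(n, d))
b = A @ (rng.normal(size=d) * (rng.random(d) < 0.3))   # sparse ground truth

def grad_f(x):
    # gradient of f(x) = (1/2n) ||Ax - b||^2
    return A.T @ (A @ x - b) / n

def prox_g(x, gamma):
    # prox of gamma * g with g = lam * ||.||_1, i.e. soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - gamma * lam, 0.0)

L = np.linalg.norm(A, 2) ** 2 / n   # smoothness constant of f
gamma = 1.0 / L                     # stepsize in ]0, 1/L]
x = np.zeros(d)
for _ in range(500):
    x = prox_g(x - gamma * grad_f(x), gamma)   # (PGD) update (87)
F = 0.5 / n * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum()
print("F(x) =", F)
```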
Theorem 10.3. Consider the Problem (Composite), and suppose that g is convex, and that f is convex and L-smooth, for some L > 0. Let (xt )t∈N be the sequence of iterates generated by the algorithm (PGD), with a stepsize γ ∈ ]0, 1/L]. Then, for all x∗ ∈ argmin F , for all t ∈ N we have that

    F (x^t ) − inf F ≤ ‖x^0 − x∗ ‖² / (2γt).
Proof. Let x∗ ∈ argmin F be any minimizer of F . We start by studying two (decreasing and nonnegative) quantities of interest: F (x^t ) − inf F and ‖x^t − x∗ ‖².
First, we show that F (x^{t+1} ) − inf F decreases. For this, using the definition of prox_{γg} in (69) together with the update rule (87), we have that

    x^{t+1} = argmin_{x′ ∈Rd} { (1/2) ‖x′ − (x^t − γ∇f (x^t ))‖² + γ g(x′ ) }.
Consequently

    g(x^{t+1} ) + (1/(2γ)) ‖x^{t+1} − (x^t − γ∇f (x^t ))‖² ≤ g(x^t ) + (1/(2γ)) ‖x^t − (x^t − γ∇f (x^t ))‖².
After expanding the squares and rearranging the terms, we see that the above inequality is equivalent to

    g(x^{t+1} ) − g(x^t ) ≤ −(1/(2γ)) ‖x^{t+1} − x^t ‖² − ⟨∇f (x^t ), x^{t+1} − x^t ⟩.   (88)
Now, we can use the fact that f is L-smooth and (9) to write

    f (x^{t+1} ) − f (x^t ) ≤ ⟨∇f (x^t ), x^{t+1} − x^t ⟩ + (L/2) ‖x^{t+1} − x^t ‖².   (89)
Summing (88) and (89), and using the fact that γL ≤ 1, we obtain that

    F (x^{t+1} ) − F (x^t ) ≤ ( −1/(2γ) + L/2 ) ‖x^{t+1} − x^t ‖² ≤ 0.   (90)
Now we show that ‖x^{t+1} − x∗ ‖² is decreasing. For this we first expand the squares as follows:

    (1/(2γ)) ‖x^{t+1} − x∗ ‖² − (1/(2γ)) ‖x^t − x∗ ‖² = −(1/(2γ)) ‖x^{t+1} − x^t ‖² − ⟨ (x^t − x^{t+1} )/γ , x^{t+1} − x∗ ⟩.   (91)
Since x^{t+1} = prox_{γg} (x^t − γ∇f (x^t )), we know from (70) that

    (x^t − x^{t+1} )/γ ∈ ∇f (x^t ) + ∂g(x^{t+1} ).
Using the above in (91) we have that there exists some η^{t+1} ∈ ∂g(x^{t+1} ) such that

    (1/(2γ)) ‖x^{t+1} − x∗ ‖² − (1/(2γ)) ‖x^t − x∗ ‖²
        = −(1/(2γ)) ‖x^{t+1} − x^t ‖² − ⟨∇f (x^t ) + η^{t+1} , x^{t+1} − x∗ ⟩
        = −(1/(2γ)) ‖x^{t+1} − x^t ‖² − ⟨η^{t+1} , x^{t+1} − x∗ ⟩ − ⟨∇f (x^t ), x^{t+1} − x^t ⟩ + ⟨∇f (x^t ), x∗ − x^t ⟩.
On the first inner product term we can use that η^{t+1} ∈ ∂g(x^{t+1} ) and the definition of the subgradient (66) to write

    −⟨η^{t+1} , x^{t+1} − x∗ ⟩ = ⟨η^{t+1} , x∗ − x^{t+1} ⟩ ≤ g(x∗ ) − g(x^{t+1} ).   (92)
On the second inner product term we can use the L-smoothness of f and (9) to write

    −⟨∇f (x^t ), x^{t+1} − x^t ⟩ ≤ (L/2) ‖x^{t+1} − x^t ‖² + f (x^t ) − f (x^{t+1} ).   (93)
On the last term we can use the convexity of f and (2) to write

    ⟨∇f (x^t ), x∗ − x^t ⟩ ≤ f (x∗ ) − f (x^t ).   (94)

By combining (92), (93), (94), and using the fact that γL ≤ 1, we obtain
    (1/(2γ)) ‖x^{t+1} − x∗ ‖² − (1/(2γ)) ‖x^t − x∗ ‖²
        ≤ ( −1/(2γ) + L/2 ) ‖x^{t+1} − x^t ‖² + g(x∗ ) − g(x^{t+1} ) + f (x^t ) − f (x^{t+1} ) + f (x∗ ) − f (x^t )
        = ( −1/(2γ) + L/2 ) ‖x^{t+1} − x^t ‖² − ( F (x^{t+1} ) − F (x∗ ) )
        ≤ −( F (x^{t+1} ) − inf F ).   (95)
Now that we have established that the iterate gap and function values are decreasing, we want to show that the Lyapunov energy

    E_t := (1/(2γ)) ‖x^t − x∗ ‖² + t( F (x^t ) − inf F ),
is decreasing. Indeed, re-arranging the terms and using (90) and (95) we have that

    E_{t+1} − E_t = (t + 1)( F (x^{t+1} ) − inf F ) − t( F (x^t ) − inf F ) + (1/(2γ)) ‖x^{t+1} − x∗ ‖² − (1/(2γ)) ‖x^t − x∗ ‖²
        = F (x^{t+1} ) − inf F + t( F (x^{t+1} ) − F (x^t ) ) + (1/(2γ)) ‖x^{t+1} − x∗ ‖² − (1/(2γ)) ‖x^t − x∗ ‖²
        ≤ F (x^{t+1} ) − inf F + (1/(2γ)) ‖x^{t+1} − x∗ ‖² − (1/(2γ)) ‖x^t − x∗ ‖²   (by (90))
        ≤ 0.   (by (95))   (96)
We have shown that E_t is decreasing, therefore we can write that

    t( F (x^t ) − inf F ) ≤ E_t ≤ E_0 = (1/(2γ)) ‖x^0 − x∗ ‖²,

and the conclusion follows after dividing by t.
Corollary 10.4 (O(1/ε) Complexity). Consider the setting of Theorem 10.3. For a given ε > 0 and γ = 1/L, we have that

    t ≥ L ‖x^0 − x∗ ‖² / (2ε) ⟹ F (x^t ) − inf F ≤ ε.   (97)
Theorem 10.5. Consider the Problem (Composite), and suppose that g is convex, and that f is µ-strongly convex and L-smooth, for some L ≥ µ > 0. Let (xt )t∈N be the sequence of iterates generated by the algorithm (PGD), with a stepsize 0 < γ ≤ 1/L. Then, for x∗ = argmin F and t ∈ N we have that

    ‖x^{t+1} − x∗ ‖² ≤ (1 − γµ) ‖x^t − x∗ ‖².   (98)
As for (GD) (see Theorem 3.6) we provide two different proofs here.
Proof of Theorem 10.5 with first-order properties. Use the definition of (PGD) together with Lemma 8.17, and the nonexpansiveness of the proximal operator (Lemma 8.16), to write

    ‖x^{t+1} − x∗ ‖² ≤ ‖prox_{γg} (x^t − γ∇f (x^t )) − prox_{γg} (x∗ − γ∇f (x∗ ))‖²
        ≤ ‖(x^t − x∗ ) − γ(∇f (x^t ) − ∇f (x∗ ))‖²
        = ‖x^t − x∗ ‖² + γ² ‖∇f (x^t ) − ∇f (x∗ )‖² − 2γ ⟨∇f (x^t ) − ∇f (x∗ ), x^t − x∗ ⟩.

Bounding the second term through smoothness and the third through strong convexity, we conclude after observing that f (x^t ) − f (x∗ ) − ⟨∇f (x∗ ), x^t − x∗ ⟩ ≥ 0 (because f is convex, see Lemma 2.8), and that 2γ² L − 2γ ≤ 0 (because of our assumption on the stepsize).
Proof of Theorem 10.5 with the Hessian. Let T (x) := x − γ∇f (x) so that the iterates of
(PGD) verify xt+1 = proxγg (T (xt )). From Lemma 8.17 we know that proxγg (T (x∗ )) = x∗ , so we
can write
    ‖x^{t+1} − x∗ ‖ = ‖prox_{γg} (T (x^t )) − prox_{γg} (T (x∗ ))‖.

Further, we already proved in the proof of Theorem 3.6 that T is (1 − γµ)-Lipschitz (assuming further that f is twice differentiable). Consequently,

    ‖x^{t+1} − x∗ ‖ ≤ (1 − γµ) ‖x^t − x∗ ‖.

To conclude the proof, take squares in the above inequality, and use the fact that (1 − γµ)² ≤ (1 − γµ).
Corollary 10.6 (O(log(1/ε)) Complexity). Consider the setting of Theorem 10.5. For a given ε > 0, we have that if γ = 1/L then

    t ≥ (L/µ) log(1/ε) ⟹ ‖x^t − x∗ ‖² ≤ ε ‖x^0 − x∗ ‖².   (99)

The proof of this corollary follows by applying Lemma A.1 in the appendix.
11 Stochastic Proximal Gradient Descent

In this section we consider composite minimization problems with a finite-sum structure,

    min_{x∈Rd} F (x) := f (x) + g(x), with f (x) = (1/n) Σ_{i=1}^{n} f_i (x),   (Composite Sum of Functions)

where each f_i : Rd → R is differentiable, g : Rd → R ∪ {+∞} is proper l.s.c., and argmin F ≠ ∅.

Assumption 11.2 (Composite Sum of Convex). We consider the Problem (Composite Sum of Functions) and we suppose that g and each f_i are convex.

Assumption 11.3 (Composite Sum of Lmax –Smooth). We consider the Problem (Composite Sum of Functions) and suppose that each f_i is L_i -smooth. We note Lmax = max_{i=1,...,n} L_i .
Algorithm 11.4 (SPGD). Consider the Problem (Composite Sum of Functions). Let x0 ∈ Rd , and let γt > 0 be a sequence of stepsizes. The Stochastic Proximal Gradient Descent (SPGD) algorithm defines a sequence (xt )t∈N satisfying

    i_t ∈ {1, . . . , n} sampled with probability 1/n,
    x^{t+1} = prox_{γt g} ( x^t − γt ∇f_{i_t} (x^t ) ).   (100)
Theorem 11.5. Let Assumptions (Composite Sum of Lmax –Smooth) and (Composite Sum of Convex) hold. Let (xt )t∈N be a sequence generated by the (SPGD) algorithm with a nonincreasing sequence of stepsizes verifying 0 < γ₀ < 1/(4Lmax ). Then, for all x∗ ∈ argmin F , for all t ∈ N,

    E [F (x̄^t )] − inf F ≤ ( ‖x^0 − x∗ ‖² + 2γ₀ (F (x^0 ) − inf F ) ) / ( 2(1 − 4γ₀ Lmax ) Σ_{s=0}^{t−1} γs ) + 2σF∗ Σ_{s=0}^{t−1} γs² / ( (1 − 4γ₀ Lmax ) Σ_{s=0}^{t−1} γs ),

where x̄^t := (1/Σ_{s=0}^{t−1} γs ) Σ_{s=0}^{t−1} γs x^s .
Proof. Let us start by looking at ‖x^{t+1} − x∗ ‖² − ‖x^t − x∗ ‖². Since we just compare x^t to x^{t+1} , to lighten the notation we fix j := i_t . Expanding the squares as in (91), and using (70), we obtain some η^{t+1} ∈ ∂g(x^{t+1} ) such that

    (1/(2γt )) ‖x^{t+1} − x∗ ‖² − (1/(2γt )) ‖x^t − x∗ ‖² = −(1/(2γt )) ‖x^{t+1} − x^t ‖² − ⟨∇f_j (x^t ) + η^{t+1} , x^{t+1} − x∗ ⟩.

Writing ∇f_j (x^t ) = ∇f (x^t ) + (∇f_j (x^t ) − ∇f (x^t )), we decompose

    −⟨∇f (x^t ) + η^{t+1} , x^{t+1} − x∗ ⟩ = −⟨η^{t+1} , x^{t+1} − x∗ ⟩ − ⟨∇f (x^t ), x^{t+1} − x^t ⟩ + ⟨∇f (x^t ), x∗ − x^t ⟩.
For the first term in the above we can use the fact that η^{t+1} ∈ ∂g(x^{t+1} ) to write

    −⟨η^{t+1} , x^{t+1} − x∗ ⟩ ≤ g(x∗ ) − g(x^{t+1} ).   (102)

On the second term we can use the fact that f is L-smooth and (9) to write

    −⟨∇f (x^t ), x^{t+1} − x^t ⟩ ≤ (L/2) ‖x^{t+1} − x^t ‖² + f (x^t ) − f (x^{t+1} ).   (103)

On the last term we can use the convexity of f and (2) to write

    ⟨∇f (x^t ), x∗ − x^t ⟩ ≤ f (x∗ ) − f (x^t ).   (104)

By combining (102), (103), (104), and using the fact that γt L ≤ γ₀ Lmax ≤ 1, we obtain
    (1/(2γt )) ‖x^{t+1} − x∗ ‖² − (1/(2γt )) ‖x^t − x∗ ‖²
        ≤ ( −1/(2γt ) + L/2 ) ‖x^{t+1} − x^t ‖² − ( F (x^{t+1} ) − inf F ) − ⟨∇f_j (x^t ) − ∇f (x^t ), x^{t+1} − x∗ ⟩
        ≤ −( F (x^{t+1} ) − inf F ) − ⟨∇f_j (x^t ) − ∇f (x^t ), x^{t+1} − x∗ ⟩.   (105)
We now have to control the last term of (105), in expectation. To shorten the computation we temporarily introduce the operators

    T := Id − γt ∇f,   T̂ := Id − γt ∇f_j .

Notice in particular that x^{t+1} = prox_{γt g} (T̂ (x^t )). We have that

    −⟨∇f_j (x^t ) − ∇f (x^t ), x^{t+1} − x∗ ⟩ = −⟨∇f_j (x^t ) − ∇f (x^t ), prox_{γt g} (T̂ (x^t )) − prox_{γt g} (T (x^t ))⟩
        − ⟨∇f_j (x^t ) − ∇f (x^t ), prox_{γt g} (T (x^t )) − x∗ ⟩,   (106)

and observe that the last term is, in expectation, equal to zero. This is due to the fact that prox_{γt g} (T (x^t )) − x∗ is deterministic when conditioned on x^t . Since we will later on take expectations, we drop this term and keep on going. As for the first term, using the Cauchy-Schwarz inequality and the nonexpansiveness of the proximal operator (Lemma 8.16), we have that

    −⟨∇f_j (x^t ) − ∇f (x^t ), prox_{γt g} (T̂ (x^t )) − prox_{γt g} (T (x^t ))⟩ ≤ ‖∇f_j (x^t ) − ∇f (x^t )‖ ‖T̂ (x^t ) − T (x^t )‖
        = γt ‖∇f_j (x^t ) − ∇f (x^t )‖².
Using the above two bounds in (106), we have proved that (after taking expectation)

    E [ −⟨∇f_j (x^t ) − ∇f (x^t ), x^{t+1} − x∗ ⟩ ] ≤ γt E ‖∇f_j (x^t ) − ∇f (x^t )‖² = γt V[∇f_j (x^t )].
To control the variance term V[∇f_j (x^t )] we use the variance transfer Lemma 8.20 with x = x^t and y = x∗ , which together with Definition 8.21 and Lemma 8.19 gives

    V[∇f_j (x^t )] ≤ 4Lmax D_f (x^t ; x∗ ) + 2σF∗ ≤ 4Lmax ( F (x^t ) − inf F ) + 2σF∗ .

Combining this with (105) (in expectation) and multiplying by γt , we obtain

    (1/2) E ‖x^{t+1} − x∗ ‖² − (1/2) E ‖x^t − x∗ ‖² ≤ −γt E [F (x^{t+1} ) − inf F ] + 4γt² Lmax E [F (x^t ) − inf F ] + 2σF∗ γt².

We now introduce the energy

    E_T := (1/2) E ‖x^T − x∗ ‖² + γ_{T−1} E [F (x^T ) − inf F ] + Σ_{t=0}^{T−1} ( γt (1 − 4γt Lmax ) E [F (x^t ) − inf F ] − 2σF∗ γt² ),

and we will show that it is decreasing (we note γ_{−1} := γ₀ ). Using the above definition we have that

    E_{T+1} − E_T = (1/2) E ‖x^{T+1} − x∗ ‖² − (1/2) E ‖x^T − x∗ ‖²
        + γ_T E [F (x^{T+1} ) − inf F ] − γ_{T−1} E [F (x^T ) − inf F ]
        + γ_T (1 − 4γ_T Lmax ) E [F (x^T ) − inf F ] − 2σF∗ γ_T²
        ≤ (γ_T − γ_{T−1} ) E [F (x^T ) − inf F ].

Using the fact that the stepsizes are nonincreasing (γ_T ≤ γ_{T−1} ) we consequently conclude that E_{T+1} − E_T ≤ 0. Therefore, from the definition of E_T (and using again γt ≤ γ₀ ), we have that

    E_0 = ( ‖x^0 − x∗ ‖² + 2γ₀ (F (x^0 ) − inf F ) ) / 2
        ≥ E_T ≥ Σ_{t=0}^{T−1} ( γt (1 − 4γt Lmax ) E [F (x^t ) − inf F ] − 2σF∗ γt² )
        ≥ (1 − 4γ₀ Lmax ) Σ_{t=0}^{T−1} γt E [F (x^t ) − inf F ] − 2σF∗ Σ_{t=0}^{T−1} γt².
Now passing the term in σF∗ to the other side, dividing this inequality by (1 − 4γ₀ Lmax ) Σ_{t=0}^{T−1} γt , which is strictly positive since 4γ₀ Lmax < 1, and using Jensen's inequality, we finally conclude that

    E [F (x̄^T )] − inf F ≤ (1/Σ_{t=0}^{T−1} γt ) Σ_{t=0}^{T−1} γt E [F (x^t ) − inf F ]
        ≤ ( ‖x^0 − x∗ ‖² + 2γ₀ (F (x^0 ) − inf F ) ) / ( 2(1 − 4γ₀ Lmax ) Σ_{t=0}^{T−1} γt ) + 2σF∗ Σ_{t=0}^{T−1} γt² / ( (1 − 4γ₀ Lmax ) Σ_{t=0}^{T−1} γt ).
Analogously to Remark 5.4, different choices for the stepsizes γt allow us to trade off the convergence speed against the constant variance term. In the next two corollaries we choose a constant and a 1/√t stepsize, respectively, followed by a O(ε^{−2} ) complexity result.
Corollary 11.6. Let Assumptions (Composite Sum of Lmax –Smooth) and (Composite Sum of Convex) hold. Let (xt )t∈N be a sequence generated by the (SPGD) algorithm with a constant stepsize verifying 0 < γ < 1/(4Lmax ). Then, for all x∗ ∈ argmin F and for all t ∈ N we have that

    E [F (x̄^t )] − inf F ≤ ( ‖x^0 − x∗ ‖² + 2γ(F (x^0 ) − inf F ) ) / ( 2(1 − 4γLmax ) γt ) + 2σF∗ γ / (1 − 4γLmax ),

where x̄^t := (1/t) Σ_{s=0}^{t−1} x^s .
Corollary 11.7. Let Assumptions (Composite Sum of Lmax –Smooth) and (Composite Sum of Convex) hold. Let (xt )t∈N be a sequence generated by the (SPGD) algorithm with the sequence of stepsizes γt = γ₀ /√(t+1) verifying 0 < γ₀ < 1/(4Lmax ). Then, for all t ≥ 3 we have that

    E [F (x̄^t )] − inf F = O( ln(t)/√t ),

where x̄^t := (1/Σ_{s=0}^{t−1} γs ) Σ_{s=0}^{t−1} γs x^s .
Proof. Apply Theorem 11.5 with γt = γ₀ /√(t+1), and use the following integral bounds:

    Σ_{s=0}^{t−1} γs² = γ₀² Σ_{s=0}^{t−1} 1/(s+1) ≤ γ₀² (1 + ln(t)),

and

    Σ_{s=0}^{t−1} γs ≥ ∫₁^{t−1} γ₀ /√(s+1) ds = 2γ₀ (√t − √2).

For t ≥ 3 we have √t − √2 > (1/6)√t, and ln(t) > 1, so we can conclude that

    1/Σ_{s=0}^{t−1} γs ≤ 1/( 2γ₀ (√t − √2) ) = O( 1/√t ) = O( ln(t)/√t ) and Σ_{s=0}^{t−1} γs² / Σ_{s=0}^{t−1} γs = O( ln(t)/√t ).
Even though constant stepsizes do not lead to convergence of the algorithm, we can guarantee
arbitrary precision provided we take small stepsizes and a sufficient number of iterations.
Corollary 11.8 (O(ε^{−2} ) Complexity). Let Assumptions (Composite Sum of Lmax –Smooth) and (Composite Sum of Convex) hold. Let (xt )t∈N be a sequence generated by the (SPGD) algorithm with a constant stepsize γ. For every 0 < ε ≤ σF∗ /Lmax , we can guarantee that E [F (x̄^t )] − inf F ≤ ε, provided that

    γ = ε/(8σF∗ ) and t ≥ C₀ σF∗ /ε²,

where x̄^t = (1/t) Σ_{s=0}^{t−1} x^s , C₀ := 16( ‖x^0 − x∗ ‖² + (1/(4Lmax ))(F (x^0 ) − inf F ) ) and x∗ ∈ argmin F .
Proof. First, note that our assumptions ε ≤ σF∗ /Lmax and γ = ε/(8σF∗ ) imply that γ ≤ 1/(8Lmax ), so the conditions of Corollary 11.6 hold true, and we obtain

    E [F (x̄^t )] − inf F ≤ ( h₀ + 2γr₀ ) / ( 2(1 − 4γLmax ) γt ) + 2σF∗ γ / (1 − 4γLmax ) =: A + B,

where we note h₀ := ‖x^0 − x∗ ‖² and r₀ := F (x^0 ) − inf F . Since γ ≤ 1/(8Lmax ) we have 1 − 4γLmax ≥ 1/2, so that B ≤ 4σF∗ γ = ε/2. It remains to require that

    A ≤ ε/2 ⟺ t ≥ ( h₀ + 2γr₀ ) / ( (1 − 4γLmax ) γε ).   (109)

Our choice of t satisfies this since, using 2γr₀ ≤ r₀ /(4Lmax ) and

    (1 − 4γLmax ) γε ≥ (1/2) · ( ε/(8σF∗ ) ) · ε = ε²/(16σF∗ ),

we get ( h₀ + 2γr₀ ) / ( (1 − 4γLmax ) γε ) ≤ 16σF∗ ( h₀ + r₀ /(4Lmax ) ) / ε² = C₀ σF∗ /ε² ≤ t, so indeed the inequality in (109) is true, which concludes the proof.
Theorem 11.9. Let Assumptions (Composite Sum of Lmax –Smooth) and (Composite Sum of Convex) hold, and assume further that f is µ-strongly convex, for µ > 0. Let (xt )t∈N be a sequence generated by the (SPGD) algorithm with a constant stepsize verifying 0 < γ ≤ 1/(2Lmax ). Then, for x∗ = argmin F , for all t ∈ N,

    E ‖x^t − x∗ ‖² ≤ (1 − γµ)^t ‖x^0 − x∗ ‖² + 2γσF∗ /µ.
Proof. In this proof, we fix j := i_t to lighten the notation. Let us start by using the fixed-point property of the (PGD) algorithm (Lemma 8.17), together with the nonexpansiveness of the proximal operator (Lemma 8.16), to write

    ‖x^{t+1} − x∗ ‖² = ‖prox_{γg} (x^t − γ∇f_j (x^t )) − prox_{γg} (x∗ − γ∇f (x∗ ))‖²
        ≤ ‖(x^t − γ∇f_j (x^t )) − (x∗ − γ∇f (x∗ ))‖²
        = ‖x^t − x∗ ‖² + γ² ‖∇f_j (x^t ) − ∇f (x∗ )‖² − 2γ ⟨∇f_j (x^t ) − ∇f (x∗ ), x^t − x∗ ⟩.   (110)
Let us analyse the last two terms of the right-hand side of (110). For the first term, we use Young's inequality ‖a + b‖² ≤ 2‖a‖² + 2‖b‖² and then, as in Lemma 8.20, the expected smoothness of f (Lemma 4.7) together with Definition 8.21, to obtain

    γ² E_{x^t} ‖∇f_j (x^t ) − ∇f (x∗ )‖² ≤ 2γ² E_{x^t} ‖∇f_j (x^t ) − ∇f_j (x∗ )‖² + 2γ² E ‖∇f_j (x∗ ) − ∇f (x∗ )‖²
        ≤ 4γ² Lmax D_f (x^t ; x∗ ) + 2γ² σF∗ ,

where D_f is the divergence of f (see Definition 8.18). For the second term in (110) we use the strong convexity of f (Lemma 2.14) to write

    −2γ E_{x^t} ⟨∇f_j (x^t ) − ∇f (x∗ ), x^t − x∗ ⟩ = −2γ ⟨∇f (x^t ) − ∇f (x∗ ), x^t − x∗ ⟩ ≤ −2γ D_f (x^t ; x∗ ) − γµ ‖x^t − x∗ ‖².

Combining the last two bounds with (110), we obtain

    E_{x^t} ‖x^{t+1} − x∗ ‖² ≤ (1 − γµ) ‖x^t − x∗ ‖² + 2γ(2γLmax − 1) D_f (x^t ; x∗ ) + 2γ² σF∗ ≤ (1 − γµ) ‖x^t − x∗ ‖² + 2γ² σF∗ ,

where in the last inequality we used our assumption that 2γLmax ≤ 1, together with D_f (x^t ; x∗ ) ≥ 0. Now, recursively apply the above to write

    E ‖x^t − x∗ ‖² ≤ (1 − γµ)^t ‖x^0 − x∗ ‖² + 2γ² σF∗ Σ_{s=0}^{t−1} (1 − γµ)^s ≤ (1 − γµ)^t ‖x^0 − x∗ ‖² + 2γσF∗ /µ,

where the geometric sum is bounded by 1/(γµ) as in the proof of Theorem 9.12.
Corollary 11.10 (Õ(1/ε) Complexity). Consider the setting of Theorem 11.9. Let ε > 0. If we set

    γ = min{ εµ/(4σF∗ ) , 1/(2Lmax ) }   (113)

then

    t ≥ max{ 4σF∗ /(εµ²) , 2Lmax /µ } log( 2‖x^0 − x∗ ‖²/ε ) ⟹ E ‖x^t − x∗ ‖² ≤ ε.   (114)

Proof. Applying Lemma A.2 with A = 2σF∗ /µ, C = 2Lmax and α₀ = ‖x^0 − x∗ ‖² gives the result (114).
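As a small usage example, the stepsize (113) and the iteration count (114) can be evaluated numerically; all the constants below are made-up inputs.

```python
import math

mu, L_max, sigma_F, alpha0, eps = 0.1, 10.0, 4.0, 25.0, 1e-3   # made-up inputs
gamma = min(eps * mu / (4 * sigma_F), 1 / (2 * L_max))          # stepsize (113)
t = math.ceil(max(4 * sigma_F / (eps * mu**2), 2 * L_max / mu)
              * math.log(2 * alpha0 / eps))                     # iteration count (114)
print(f"gamma = {gamma:.2e}, t >= {t}")
```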
References
[1] H. Asi and J. C. Duchi. “Stochastic (Approximate) Proximal Point Methods: Convergence,
Optimality, and Adaptivity”. In: SIAM J. Optim. 29.3 (2019), pp. 2257–2290.
[2] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in
Hilbert Spaces. 2nd edition. Springer, 2017.
[3] A. Beck and M. Teboulle. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear
Inverse Problems”. In: SIAM Journal on Imaging Sciences 2.1 (2009), pp. 183–202.
[4] J. Bolte et al. “Clarke Subgradients of Stratifiable Functions”. In: SIAM Journal on Opti-
mization 18.2 (2007), pp. 556–572.
[5] S. Bubeck. “Convex Optimization: Algorithms and Complexity”. In: (2014).
[6] F. Clarke. Optimization and Nonsmooth Analysis. Classics in Applied Mathematics. Society
for Industrial and Applied Mathematics, 1990.
[7] A. Defazio and R. M. Gower. “Factorial Powers for Stochastic Optimization”. In: Asian
Conference on Machine Learning (2021).
[8] D. J. H. Garling. A Course in Mathematical Analysis: Volume 2, Metric and Topological
Spaces, Functions of a Vector Variable. 1 edition. Cambridge: Cambridge University Press,
2014.
[9] G. Garrigos. “Square Distance Functions Are Polyak-Lojasiewicz and Vice-Versa”. In: preprint
arXiv:2301.10332 (2023).
[10] G. Garrigos, L. Rosasco, and S. Villa. “Convergence of the Forward-Backward Algorithm:
Beyond the Worst-Case with the Help of Geometry”. In: Mathematical Programming (2022).
[11] N. Gazagnadou, R. M. Gower, and J. Salmon. “Optimal mini-batch and step sizes for SAGA”.
In: ICML (2019).
[12] E. Ghadimi, H. Feyzmahdavian, and M. Johansson. “Global convergence of the Heavy-
ball method for convex optimization”. In: 2015 European Control Conference (ECC). 2015,
pp. 310–315.
[13] E. Gorbunov, F. Hanzely, and P. Richtárik. “A unified theory of sgd: Variance reduction,
sampling, quantization and coordinate descent”. In: arXiv preprint arXiv:1905.11261 (2019).
[14] R. M. Gower. Sketch and Project: Randomized Iterative Methods for Linear Systems and
Inverting Matrices. 2016.
[15] R. M. Gower, P. Richtárik, and F. Bach. “Stochastic Quasi-Gradient Methods: Variance
Reduction via Jacobian Sketching”. In: Mathematical Programming 188.1 (2021), pp. 135–
192.
[16] R. M. Gower, O. Sebbouh, and N. Loizou. “SGD for Structured Nonconvex Functions: Learn-
ing Rates, Minibatching and Interpolation”. In: Proceedings of The 24th International Con-
ference on Artificial Intelligence and Statistics. Vol. 130. PMLR, 2021, pp. 1315–1323.
[17] R. M. Gower et al. “SGD: General Analysis and Improved Rates”. In: International Confer-
ence on Machine Learning. 2019, pp. 5200–5209.
[18] M. Hardt, B. Recht, and Y. Singer. “Train faster, generalize better: stability of stochastic
gradient descent”. In: 33rd International Conference on Machine Learning. 2016.
[19] H. Karimi, J. Nutini, and M. W. Schmidt. “Linear Convergence of Gradient and Proximal-
Gradient Methods Under the Polyak-Lojasiewicz Condition”. In: European Conference on
Machine Learning (ECML) abs/1608.04636 (2016).
[20] A. Khaled and P. Richtarik. “Better Theory for SGD in the Nonconvex World”. In: arXiv:2002.03329
(2020).
[21] A. Khaled et al. Unified Analysis of Stochastic Gradient Methods for Composite Convex and
Smooth Optimization. 2020.
[22] S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t)
convergence rate for the projected stochastic subgradient method. 2012.
[23] C. Liu, L. Zhu, and M. Belkin. “Toward a theory of optimization for over-parameterized
systems of non-linear equations: the lessons of deep learning”. In: CoRR (2020).
[24] S. Ma, R. Bassily, and M. Belkin. “The Power of Interpolation: Understanding the Effective-
ness of SGD in Modern Over-parametrized Learning”. In: ICML. 2018.
[25] E. Moulines and F. R. Bach. “Non-asymptotic analysis of stochastic approximation algo-
rithms for machine learning”. In: Advances in Neural Information Processing Systems. 2011,
pp. 451–459.
[26] D. Needell, N. Srebro, and R. Ward. “Stochastic gradient descent, weighted sampling, and the
randomized Kaczmarz algorithm”. In: Mathematical Programming, Series A 155.1 (2016),
pp. 549–573.
[27] D. Needell and R. Ward. “Batched Stochastic Gradient Descent with Weighted Sampling”.
In: Approximation Theory XV, Springer. Vol. 204. Springer Proceedings in Mathematics &
Statistics, 2017, pp. 279–306.
[28] A. Nemirovski and D. B. Yudin. “On Cezari’s convergence of the steepest descent method for
approximating saddle point of convex-concave functions”. In: Soviet Mathematics Doklady
19 (1978).
[29] A. Nemirovski and D. B. Yudin. Problem complexity and method efficiency in optimization.
Wiley Interscience, 1983.
[30] A. Nemirovski et al. “Robust stochastic approximation approach to stochastic program-
ming”. In: SIAM Journal on Optimization 19.4 (2009), pp. 1574–1609.
[31] Y. Nesterov. Introductory Lectures on Convex Optimization. Vol. 87. Springer Science &
Business Media, 2004.
[32] F. Orabona. A Modern Introduction to Online Learning. 2019.
[33] J. Peypouquet. Convex Optimization in Normed Spaces. SpringerBriefs in Optimization.
Cham: Springer International Publishing, 2015.
[34] B. T. Polyak. “Some Methods of Speeding up the Convergence of Iteration Methods”. In:
USSR Computational Mathematics and Mathematical Physics 4 (1964), pp. 1–17.
[35] H. Robbins and S. Monro. “A stochastic approximation method”. In: The Annals of Math-
ematical Statistics (1951), pp. 400–407.
[36] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis. New York: Springer, 2009.
[37] M. Schmidt, N. Roux, and F. Bach. “Convergence Rates of Inexact Proximal-Gradient Meth-
ods for Convex Optimization”. In: Advances in neural information processing systems 24
(2011).
[38] M. Schmidt and N. L. Roux. Fast Convergence of Stochastic Gradient Descent under a Strong
Growth Condition. 2013.
[39] O. Sebbouh, R. M. Gower, and A. Defazio. “Almost sure convergence rates for Stochastic
Gradient Descent and Stochastic Heavy Ball”. In: COLT (2021).
[40] S. Shalev-Shwartz, Y. Singer, and N. Srebro. “Pegasos: primal estimated subgradient solver
for SVM”. In: 24th International Conference on Machine Learning. 2007, pp. 807–814.
[41] O. Shamir and T. Zhang. “Stochastic Gradient Descent for Non-smooth Optimization: Con-
vergence Results and Optimal Averaging Schemes”. In: Proceedings of the 30th International
Conference on Machine Learning. 2013, pp. 71–79.
[42] M. Zinkevich. “Online Convex Programming and Generalized Infinitesimal Gradient Ascent”.
In: Proceedings of the Twentieth International Conference on International Conference on
Machine Learning. ICML’03. 2003, 928–935.
A Appendix
A.1 Lemmas for Complexity
The following lemma was copied from Lemma 11 in [14].
Lemma A.1. Consider a sequence (α_k )_k ⊂ R₊ of positive scalars that converges to zero according to

    α_k ≤ ρ^k α₀ ,   (115)

for some ρ ∈ [0, 1). Then, for every 1 > ε > 0,

    k ≥ (1/(1 − ρ)) log(1/ε) ⟹ α_k ≤ ε α₀ .   (116)
Proof. First note that if ρ = 0 the result follows trivially. Assuming ρ ∈ (0, 1), rearranging (115) and applying the logarithm to both sides gives

    log( α₀ /α_k ) ≥ k log(1/ρ).   (117)

Now using that

    (1/(1 − ρ)) log(1/ρ) ≥ 1,   (118)

for all ρ ∈ (0, 1), and assuming that

    k ≥ (1/(1 − ρ)) log(1/ε),   (119)

we have that

    log( α₀ /α_k ) ≥ k log(1/ρ)   (by (117))
        ≥ (1/(1 − ρ)) log(1/ε) log(1/ρ)   (by (119))
        ≥ log(1/ε).   (by (118))

Applying exponentials to the above inequality gives (116).
As an example of the use of this lemma, consider a sequence of random vectors (Y^k )_k for which the expected norm converges to zero according to

    E [ ‖Y^k ‖² ] ≤ ρ^k ‖Y^0 ‖².   (120)

Then applying Lemma A.1 with α_k = E [‖Y^k ‖²] for a given 1 > ε > 0 states that

    k ≥ (1/(1 − ρ)) log(1/ε) ⟹ E [ ‖Y^k ‖² ] ≤ ε ‖Y^0 ‖².
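As a quick sanity check of Lemma A.1, the following snippet computes the prescribed iteration count for made-up values of ρ and ε and verifies that ρ^k ≤ ε.

```python
import math

rho, eps = 0.9, 1e-6                              # made-up inputs
k = math.ceil(1 / (1 - rho) * math.log(1 / eps))  # k >= (1/(1-rho)) log(1/eps)
print(k, "iterations suffice:", rho**k, "<=", eps)
assert rho**k <= eps
```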
Lemma A.2. Consider a sequence of positive scalars (α_t )_t verifying

    α_t ≤ (1 − γµ)^t α₀ + γA,   (121)

where µ > 0 and A, C ≥ 0 are given constants and γ ∈ ]0, 1/C] is a free parameter. For every ε > 0, if

    γ = min{ ε/(2A) , 1/C }   (122)

then

    t ≥ (1/µ) max{ 2A/ε , C } log( 2α₀ /ε ) ⟹ α_t ≤ ε.
Proof. First we restrict γ so that the second term in (121) is less than ε/2, that is

    Aγ ≤ ε/2 ⟸ γ ≤ ε/(2A).

Thus we set γ according to (122), which also satisfies the constraint that γ ≤ 1/C. Furthermore we want

    (1 − µγ)^t α₀ ≤ ε/2.

Taking logarithms and re-arranging, the above means that we want

    log( 2α₀ /ε ) ≤ t log( 1/(1 − γµ) ).   (123)

Now using that log(1/ρ) ≥ 1 − ρ, with ρ = 1 − γµ ∈ ]0, 1], we see that for (123) to be true, it is enough to have

    t ≥ (1/(µγ)) log( 2α₀ /ε ).

Substituting in γ from (122) gives

    t ≥ (1/µ) max{ 2A/ε , C } log( 2α₀ /ε ).
A.2 A nonconvex PL function

Lemma A.3. The function f (t) = t² + 3 sin(t)² is not convex, but it is µ-PL with µ = 1/40.

Proof. The fact that f is not convex follows directly from the fact that f ″(t) = 2 + 6 cos(2t) can be nonpositive, for instance f ″(π/2) = −4. To prove that f is PL, start by computing f ′(t) = 2t + 3 sin(2t), and observe that inf f = 0. Therefore we are looking for a constant α = 2µ > 0 such that

    f ′(t)² ≥ α f (t), for all t ∈ R.

Using the fact that sin(t)² ≤ t², so that f (t) ≤ 4t², we see that it is sufficient to find α > 0 such that

    (2t + 3 sin(2t))² ≥ 4αt², for all t ∈ R.

Now let us introduce X = 2t, Y = 3 sin(2t), so that the above property is equivalent to (X + Y )² ≥ αX² along the curve Y = 3 sin(X). For X ≥ 0, this inequality fails exactly when

    −(1 + √α)X < Y < −(1 − √α)X.   (124)

Now we just need to make sure that the curve Y = 3 sin(X) violates those conditions for α small enough. We will consider different cases depending on the value of X:
• If X ∈ [0, π], we have Y = 3 sin(X) ≥ 0 > −(1 − √α)X, provided that α < 1.

• On [π, (5/4)π] we can use the inequality sin(t) ≥ π − t. One way to prove this inequality is to use the fact that sin(t) is convex on [π, 2π] (its second derivative is − sin(t) ≥ 0), which implies that sin(t) is greater than its tangent at t₀ = π, whose equation is π − t. This being said, we can write (remember that X ∈ [π, (5/4)π] here):

    Y = 3 sin(X) ≥ 3(π − X) ≥ 3(π − (5/4)π) = −(3/4)π > −(1 − √α)π ≥ −(1 − √α)X,

where the strict inequality is true whenever 3/4 < 1 − √α ⟺ α < 1/16 ≈ 0.06.

• If X ≥ (5/4)π, then Y = 3 sin(X) ≥ −3 > −(1 − √α)X as soon as (1 − √α)(5/4)π > 3, which holds for instance with α = 0.05, since (1 − √0.05)(5/4)π ≈ 3.05.

• If X ∈ ]−∞, 0], we can use the exact same arguments (use the fact that sine is an odd function) to obtain that Y < −(1 − √α)X.

In every case, we see that (124) is violated when Y = 3 sin(X) and α = 0.05, which allows us to conclude that f is µ-PL with µ = α/2 = 0.025 = 1/40.