
Handbook of Convergence Theorems
for (Stochastic) Gradient Methods

Guillaume Garrigos
Université Paris Cité and Sorbonne Université, CNRS
Laboratoire de Probabilités, Statistique et Modélisation
F-75013 Paris, France
[email protected]

Robert M. Gower
Center for Computational Mathematics, Flatiron Institute, New York
[email protected]

arXiv:2301.11235v2 [math.OC] — February 20, 2023

Abstract

This is a handbook of simple proofs of the convergence of gradient and stochastic gradient
descent type methods. We consider functions that are Lipschitz, smooth, convex, strongly
convex, and/or Polyak-Lojasiewicz. Our focus is on “good proofs” that are also simple. Each
section can be consulted separately. We start with proofs for gradient descent, then move on to
stochastic variants, including minibatching and momentum, and finally to nonsmooth problems
with the subgradient method, proximal gradient descent, and their stochastic variants. Our
focus is on global convergence rates and complexity rates. Some slightly less common proofs
found here include that of SGD (stochastic gradient descent) with a proximal step in Section 11,
with momentum in Section 7, and with mini-batching in Section 6.

1 Introduction
Here we collect our favourite convergence proofs for gradient and stochastic gradient based meth-
ods. Our focus has been on simple proofs, that are easy to copy and understand, and yet achieve
the best convergence rate for the setting.
Disclaimer: These notes are not a proper review of the literature. Our aim is to have an easy-to-reference
handbook. Most of these proofs are not our work, but rather a collection of known
proofs. If you find these notes useful, feel free to cite them, but we kindly ask that you also cite
the original sources, which are given either before most theorems or in the bibliographic notes at
the end of each section.

How to use these notes
We recommend searching for the theorem you want in the table of contents, or in Table 1a
just below, then going directly to the relevant section to see the proof. You can then follow the
hyperlinks for the assumptions and properties backwards as needed. For example, if you want to
know about the proof of Gradient Descent in the convex and smooth case, you can jump ahead to
Section 3.1. There you will find that you need a property of convex functions given in Lemma 2.8.
These notes were not made to be read linearly: that would be impossibly boring.

Acknowledgements
The authors would like to thank Shuvomoy Das Gupta and Benjamin Grimmer, for pointing out
typos and errors in an earlier version of this document.

Contents

1 Introduction
2 Theory: Smooth functions and convexity
  2.1 Differentiability
    2.1.1 Notations
    2.1.2 Lipschitz functions
  2.2 Convexity
  2.3 Strong convexity
  2.4 Polyak-Lojasiewicz
  2.5 Smoothness
    2.5.1 Smoothness and nonconvexity
    2.5.2 Smoothness and convexity
3 Gradient Descent
  3.1 Convergence for convex and smooth functions
  3.2 Convergence for strongly convex and smooth functions
  3.3 Convergence for Polyak-Lojasiewicz and smooth functions
  3.4 Bibliographic notes
4 Theory: Sum of functions
  4.1 Definitions
  4.2 Expected smoothness
  4.3 Controlling the variance
    4.3.1 Interpolation
    4.3.2 Interpolation constants
    4.3.3 Variance transfer
5 Stochastic Gradient Descent
  5.1 Convergence for convex and smooth functions
  5.2 Convergence for strongly convex and smooth functions
  5.3 Convergence for Polyak-Lojasiewicz and smooth functions
  5.4 Bibliographic notes
6 Minibatch SGD
  6.1 Definitions
  6.2 Convergence for convex and smooth functions
  6.3 Rates for strongly convex and smooth functions
  6.4 Bibliographic notes
7 Stochastic Momentum
  7.1 The many ways of writing momentum
  7.2 Convergence for convex and smooth functions
  7.3 Bibliographic notes
8 Theory: Nonsmooth functions
  8.1 Real-extended valued functions
  8.2 Subdifferential of nonsmooth convex functions
  8.3 Nonsmooth strongly convex functions
  8.4 Proximal operator
  8.5 Controlling the variance
9 Stochastic Subgradient Descent
  9.1 Convergence for convex functions and bounded gradients
  9.2 Better convergence rates for convex functions with bounded solution
  9.3 Convergence for strongly convex functions with bounded gradients on bounded sets
  9.4 Bibliographic notes
10 Proximal Gradient Descent
  10.1 Convergence for convex functions
  10.2 Convergence for strongly convex functions
  10.3 Bibliographic notes
11 Stochastic Proximal Gradient Descent
  11.1 Complexity for convex functions
  11.2 Complexity for strongly convex functions
  11.3 Bibliographic notes
A Appendix
  A.1 Lemmas for Complexity
  A.2 A nonconvex PL function
Table 1: Where to find the corresponding theorem and complexity for all the algorithms and
assumptions. GD = Gradient Descent, SGD = Stochastic Gradient Descent, mini-SGD = SGD
with mini-batching, momentum = SGD with momentum (also known as stochastic heavy ball),
prox-GD = proximal GD, and prox-SGD = proximal SGD. The X's mark settings which are
currently not covered in the handbook.

Methods        | convex, L-smooth | convex, G-Lipschitz | µ-strongly convex, L-smooth | µ-PL, L-smooth
GD (16)        | Theorem 3.4      | Theorem 9.5         | Theorem 3.6                 | Theorem 3.9
SGD (39)       | Theorem 5.3      | Theorem 9.5         | Theorem 5.7                 | Theorem 5.9
mini-SGD (52)  | Theorem 6.8      | X                   | Theorem 6.11                | X
momentum (58)  | Theorem 7.4      | X                   | X                           | X
prox-GD (87)   | Theorem 10.3     | X                   | Theorem 10.5                | X
prox-SGD (100) | Theorem 11.5     | X                   | Theorem 11.9                | X

(a) Main results for each method

Methods        | convex, L-smooth                  | convex, G-Lipschitz | µ-strongly convex, L-smooth | µ-PL, L-smooth
GD (16)        | L D²/(2ε)                         | G²D²/ε²             | (L/µ) log(1/ε)              | (L/µ) log(1/ε)
SGD (39)       | (1/ε²)(2Lmax D² + σ*f/Lmax)²      | G²D²/ε²             | max{σ*f/(εµ²), Lmax/µ}      | (Lf Lmax/µ²) max{2∆*f/ε, 1}
mini-SGD (52)  | (1/ε²)(2Lb D² + σ*b/Lb)²          | X                   | max{σ*b/(εµ²), Lb/µ}        | X
momentum (58)  | O(1/ε²)                           | X                   | X                           | X
prox-GD (87)   | L D²/(2ε)                         | X                   | (L/µ) log(1/ε)              | X
prox-SGD (100) | O(1/ε²)                           | X                   | max{σ*F/(εµ²), Lmax/µ}      | X

(b) Iteration complexity of each algorithm: each cell gives the number of iterations required to
guarantee E[‖xt+1 − x*‖²] ≤ ε in the strongly convex setting, or E[f(xt) − inf f] ≤ ε in the convex
and PL settings, up to absolute constants and logarithmic factors (for momentum and prox-SGD
in the convex setting we only report the order in ε; the precise constants are given in Theorems 7.4
and 11.5). Here σ*f is defined in (36), ∆*f in (35), D := ‖x0 − x*‖, and δf := f(x0) − inf f. For
composite functions F = f + g we write δF := F(x0) − inf F, and σ*F is defined in (73). For
mini-SGD with a fixed batch size b ∈ N, σ*b is defined in (55) and Lb in (54).
2 Theory: Smooth functions and convexity
2.1 Differentiability
2.1.1 Notations

Definition 2.1 (Jacobian). Let F : Rd → Rp be differentiable, and x ∈ Rd. Then we denote by
DF(x) the Jacobian of F at x, which is the matrix defined by its first-order partial derivatives:

    [DF(x)]ij = ∂fi/∂xj (x),   for i = 1, . . . , p, j = 1, . . . , d,

where we write F(x) = (f1(x), . . . , fp(x)). Consequently DF(x) is a matrix with DF(x) ∈ Rp×d.

Remark 2.2 (Gradient). If f : Rd → R is differentiable, then Df(x) ∈ R1×d is a row vector,
whose transpose is called the gradient of f at x: ∇f(x) = Df(x)ᵀ ∈ Rd×1.

Definition 2.3 (Hessian). Let f : Rd → R be twice differentiable, and x ∈ Rd. Then we denote by
∇²f(x) the Hessian of f at x, which is the matrix defined by its second-order partial derivatives:

    [∇²f(x)]i,j = ∂²f/(∂xi ∂xj) (x),   for i, j = 1, . . . , d.

Consequently ∇²f(x) is a d × d matrix.

Remark 2.4 (Hessian and eigenvalues). If f is twice differentiable, then its Hessian is always
a symmetric matrix (Schwarz’s Theorem). Therefore, the Hessian matrix ∇2 f (x) admits d
eigenvalues (Spectral Theorem).

2.1.2 Lipschitz functions

Definition 2.5. Let F : Rd → Rp, and L > 0. We say that F is L-Lipschitz if

    for all x, y ∈ Rd,   ‖F(y) − F(x)‖ ≤ L‖y − x‖.

A differentiable function is L-Lipschitz if and only if its differential is uniformly bounded by L.

Lemma 2.6. Let F : Rd → Rp be differentiable, and L > 0. Then F is L-Lipschitz if and only if

    for all x ∈ Rd,   ‖DF(x)‖ ≤ L.

Proof. ⇒ Assume that F is L-Lipschitz. Let x ∈ Rd, and let us show that ‖DF(x)‖ ≤ L. This
is equivalent to showing that ‖DF(x)v‖ ≤ L for any v ∈ Rd such that ‖v‖ = 1. For a given v, the
directional derivative is given by

    DF(x)v = lim_{t↓0} (F(x + tv) − F(x))/t.

Taking the norm in this equality, and using our assumption that F is L-Lipschitz, we indeed obtain

    ‖DF(x)v‖ = lim_{t↓0} ‖F(x + tv) − F(x)‖/t ≤ lim_{t↓0} L‖(x + tv) − x‖/t = lim_{t↓0} Lt‖v‖/t = L.

⇐ Assume now that ‖DF(z)‖ ≤ L for every vector z ∈ Rd, and let us show that F is L-Lipschitz.
For this, fix x, y ∈ Rd, and use the Mean-Value Inequality (see e.g. [8, Theorem 17.2.2]) to write

    ‖F(y) − F(x)‖ ≤ ( sup_{z∈[x,y]} ‖DF(z)‖ ) ‖y − x‖ ≤ L‖y − x‖.

2.2 Convexity

Definition 2.7. We say that f : Rd → R ∪ {+∞} is convex if

for all x, y ∈ Rd , for all t ∈ [0, 1], f (tx + (1 − t)y) ≤ tf (x) + (1 − t)f (y). (1)

The next two lemmas characterize the convexity of a function with the help of first and second-
order derivatives. These properties will be heavily used in the proofs.

Lemma 2.8. If f : Rd → R is convex and differentiable then,

    for all x, y ∈ Rd,   f(x) ≥ f(y) + ⟨∇f(y), x − y⟩. (2)

Proof. We can deduce (2) from (1) by dividing by t and re-arranging:

    (f(y + t(x − y)) − f(y))/t ≤ f(x) − f(y).

Now taking the limit when t → 0 gives

    ⟨∇f(y), x − y⟩ ≤ f(x) − f(y).

Lemma 2.9. Let f : Rd → R be convex and twice differentiable. Then, for all x ∈ Rd, every
eigenvalue λ of ∇²f(x) satisfies λ ≥ 0.

Proof. Since f is convex we can use (2) twice (permuting the roles of x and y) and sum the
resulting two inequalities, to obtain that

    for all x, y ∈ Rd,   ⟨∇f(y) − ∇f(x), y − x⟩ ≥ 0. (3)

Now, fix x, v ∈ Rd, and write

    ⟨∇²f(x)v, v⟩ = ⟨lim_{t→0} (∇f(x + tv) − ∇f(x))/t, v⟩ = lim_{t→0} (1/t²)⟨∇f(x + tv) − ∇f(x), (x + tv) − x⟩ ≥ 0,

where the first equality follows because the gradient is a continuous function, and the last inequality
follows from (3). Now we can conclude: if λ is an eigenvalue of ∇²f(x), take any nonzero
eigenvector v ∈ Rd and write

    λ‖v‖² = ⟨λv, v⟩ = ⟨∇²f(x)v, v⟩ ≥ 0.

Example 2.10 (Least-squares is convex). Let Φ ∈ Mn,d(R) and y ∈ Rn, and let f(x) =
½‖Φx − y‖² be the corresponding least-squares function. Then f is convex, since ∇²f(x) ≡ ΦᵀΦ
is positive semi-definite.
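As a quick numerical sanity check of this example, here is a minimal Python sketch (the random matrix Φ below is an illustrative choice of ours, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((20, 5))         # an arbitrary n x d matrix
hessian = Phi.T @ Phi                      # Hessian of f(x) = (1/2)||Phi x - y||^2 (constant in x)
eigenvalues = np.linalg.eigvalsh(hessian)  # eigenvalues of the symmetric matrix Phi^T Phi
print(eigenvalues.min() >= -1e-12)         # True: positive semi-definite (up to round-off)
```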

2.3 Strong convexity

Definition 2.11. Let f : Rd → R ∪ {+∞}, and µ > 0. We say that f is µ-strongly convex if,
for every x, y ∈ Rd and every t ∈ [0, 1], we have that

    f(tx + (1 − t)y) + (µ/2) t(1 − t)‖x − y‖² ≤ tf(x) + (1 − t)f(y).

We say that µ is the strong convexity constant of f.

The lemma below shows that it is easy to craft a strongly convex function: just add a multiple
of ‖·‖² to a convex function. This happens for instance when using Tikhonov regularization (a.k.a.
ridge regularization) in machine learning or inverse problems.

Lemma 2.12. Let f : Rd → R, and µ > 0. The function f is µ-strongly convex if and only if
there exists a convex function g : Rd → R such that f(x) = g(x) + (µ/2)‖x‖².

Proof. Given f and µ, define g(x) := f(x) − (µ/2)‖x‖². We need to prove that f is µ-strongly convex
if and only if g is convex. We start from Definition 2.11 and write (noting zt = (1 − t)x + ty):

f is µ-strongly convex
⇔ ∀t, ∀x, y,  f(zt) + (µ/2) t(1 − t)‖x − y‖² ≤ (1 − t)f(x) + tf(y)
⇔ ∀t, ∀x, y,  g(zt) + (µ/2)‖zt‖² + (µ/2) t(1 − t)‖x − y‖² ≤ (1 − t)g(x) + tg(y) + (1 − t)(µ/2)‖x‖² + t(µ/2)‖y‖².

Let us now gather all the terms multiplied by µ, and verify that they cancel:

    (1/2)‖zt‖² + (1/2) t(1 − t)‖x − y‖² − (1 − t)(1/2)‖x‖² − t(1/2)‖y‖²
    = (1/2)[ (1 − t)²‖x‖² + t²‖y‖² + 2t(1 − t)⟨x, y⟩ + t(1 − t)‖x‖² + t(1 − t)‖y‖² − 2t(1 − t)⟨x, y⟩
             − (1 − t)‖x‖² − t‖y‖² ]
    = (1/2)[ ‖x‖²((1 − t)² + t(1 − t) − (1 − t)) + ‖y‖²(t² + t(1 − t) − t) ]
    = 0.

So all the terms in µ disappear, and what remains is exactly the definition for g to be convex.

Lemma 2.13. If f : Rd → R is a continuous strongly convex function, then f admits a unique
minimizer.

Proof. See [33, Corollary 2.20].

Now we present some useful variational inequalities satisfied by strongly convex functions.

Lemma 2.14. If f : Rd → R is a µ-strongly convex and differentiable function then

    for all x, y ∈ Rd,   f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2)‖y − x‖². (4)

Proof. Define g(x) := f(x) − (µ/2)‖x‖². According to Lemma 2.12, g is convex. It is also clearly
differentiable by definition. According to the sum rule, we have ∇f(x) = ∇g(x) + µx. Therefore
we can use the convexity of g together with Lemma 2.8 to write

    f(y) − f(x) − ⟨∇f(x), y − x⟩ ≥ (µ/2)‖y‖² − (µ/2)‖x‖² − ⟨µx, y − x⟩ = (µ/2)‖y − x‖².

Lemma 2.15. Let f : Rd → R be a twice differentiable µ-strongly convex function. Then, for
all x ∈ Rd, every eigenvalue λ of ∇²f(x) satisfies λ ≥ µ.

Proof. Define g(x) := f(x) − (µ/2)‖x‖², which is convex according to Lemma 2.12. It is also twice
differentiable by definition, and we have ∇²f(x) = ∇²g(x) + µ Id. So the eigenvalues of ∇²f(x)
are equal to those of ∇²g(x) plus µ. We conclude by using Lemma 2.9.

Example 2.16 (Least-squares and strong convexity). Let f be a least-squares function as in
Example 2.10. Then f is strongly convex if and only if Φ is injective. In this case, the strong
convexity constant is µ = λmin(ΦᵀΦ), the smallest eigenvalue of ΦᵀΦ.

2.4 Polyak-Lojasiewicz

Definition 2.17 (Polyak-Lojasiewicz). Let f : Rd → R be differentiable, and µ > 0. We say
that f is µ-Polyak-Lojasiewicz if it is bounded from below, and if for all x ∈ Rd

    f(x) − inf f ≤ (1/2µ) ‖∇f(x)‖². (5)

We simply say that f is Polyak-Lojasiewicz (PL for short) if there exists µ > 0 such that f is
µ-Polyak-Lojasiewicz.

The Polyak-Lojasiewicz property is weaker than strong convexity, as we see next.

Lemma 2.18. Let f : Rd → R be differentiable, and µ > 0. If f is µ-strongly convex, then f is
µ-Polyak-Lojasiewicz.

Proof. Let x* be a minimizer of f (see Lemma 2.13), so that f(x*) = inf f. Multiplying (4) by
minus one and substituting y = x*, we have that

    f(x) − f(x*) ≤ ⟨∇f(x), x − x*⟩ − (µ/2)‖x* − x‖²
                 = −(1/2)‖√µ (x − x*) − (1/√µ)∇f(x)‖² + (1/2µ)‖∇f(x)‖²
                 ≤ (1/2µ)‖∇f(x)‖².

It is important to note that the Polyak-Lojasiewicz property can hold without strong convexity
or even convexity, as illustrated in the next examples.

Example 2.19 (Least-squares is PL). Let f be a least-squares function as in Example 2.10.
Then it is a simple exercise to show that f is PL, and that the PL constant is µ = λ*min(ΦᵀΦ),
the smallest nonzero eigenvalue of ΦᵀΦ (see e.g. [10, Example 3.7]).

Example 2.20 (Nonconvex PL functions).

• Let f(t) = t² + 3 sin(t)². It is an exercise to verify that f is PL, while not being convex
(see Lemma A.3 for more details, and the numerical check sketched below).

• If Ω ⊂ Rd is a closed set and f(x) = dist(x; Ω)² is the squared distance function to this set,
then it can be shown that f is PL. See Figure 1 for an example, and [9] for more details.

Figure 1: Graph of a PL function f : R2 → R. Note that the function is not convex, but that the
only critical points are the global minimizers (displayed as a white curve).
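Here is a minimal numerical check, on a finite grid, for the first example above; the candidate constant µ = 1/32 is an illustrative assumption of ours (compare with Lemma A.3):

```python
import numpy as np

f = lambda t: t**2 + 3 * np.sin(t)**2
df = lambda t: 2 * t + 3 * np.sin(2 * t)   # f'(t), using 2 sin(t)cos(t) = sin(2t)
d2f = lambda t: 2 + 6 * np.cos(2 * t)      # f''(t)

ts = np.linspace(-10.0, 10.0, 100001)
mu = 1 / 32                                 # candidate PL constant (an assumption for this check)
print(np.all(f(ts) <= df(ts)**2 / (2 * mu)))  # inf f = 0, so this is inequality (5): True
print(np.any(d2f(ts) < 0))                    # f'' takes negative values, so f is not convex: True
```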

Example 2.21 (PL for nonlinear models). Let f(x) = ½‖Φ(x) − y‖², where Φ : Rd → Rn is
differentiable. Then f is PL if DΦ(x)ᵀ is uniformly injective:

    there exists µ > 0 such that for all x ∈ Rd,   λmin(DΦ(x)DΦ(x)ᵀ) ≥ µ. (6)

Indeed it suffices to write

    ‖∇f(x)‖² = ‖DΦ(x)ᵀ(Φ(x) − y)‖² ≥ µ‖Φ(x) − y‖² = 2µf(x) ≥ 2µ(f(x) − inf f).

Note that assumption (6) requires d ≥ n, which holds if Φ represents an overparametrized neural
network. For more refined arguments, including less naive assumptions and exploiting the neural
network structure of Φ, see [23].

One must keep in mind that the PL property is rather strong: it is a global property, and it
requires the following fact, typical of convex functions, to be true.

Lemma 2.22. Let f : Rd → R be a differentiable PL function. Then x∗ ∈ argmin f if and only


if ∇f (x∗ ) = 0.

Proof. Immediate from plugging in x = x∗ in (5).

Remark 2.23 (Local Lojasiewicz inequalities). In this document we focus only on the Polyak-
Lojasiewicz inequality, for simplicity. There exists, though, a much larger family of Lojasiewicz
inequalities, which by and large cover most functions used in practice.

• The inequality can be made more local, for instance by requiring that (5) holds only on some
subset Ω ⊂ Rd instead of the whole Rd. For instance, logistic functions typically verify
(5) on every bounded set, but not on the whole space. The same can be said about the
empirical risk associated to wide enough neural networks [23].

• While PL describes functions that grow like x ↦ (µ/2)‖x‖², there are p-Lojasiewicz inequalities
describing functions that grow like x ↦ (µ^(p−1)/p)‖x‖^p and satisfy f(x) − inf f ≤ (1/qµ)‖∇f(x)‖^q
on some set Ω, with 1/p + 1/q = 1.

• The inequality can be even more local, by dropping the property that every critical point is
a global minimum. For this we do not look at the growth of f(x) − inf f, but of f(x) − f(x*)
instead, where x* ∈ Rd is a critical point of interest. This can be written as

    for all x ∈ Ω,   f(x) − f(x*) ≤ (1/qµ)‖∇f(x)‖^q,   where 1/p + 1/q = 1. (7)

A famous result [4, Corollary 16] shows that any semi-algebraic function (e.g. sums and products
of piecewise polynomial functions) verifies (7) at every x* ∈ Rd for some p ≥ 1, µ > 0, and Ω
being an appropriate neighbourhood of x*. This framework includes for instance quadratic losses
evaluating a neural network with ReLU activations.

2.5 Smoothness
Definition 2.24. Let f : Rd → R, and L > 0. We say that f is L-smooth if it is differentiable
and if ∇f : Rd → Rd is L-Lipschitz:

    for all x, y ∈ Rd,   ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖. (8)

2.5.1 Smoothness and nonconvexity

As for the convexity (and strong convexity), we give two characterizations of the smoothness by
means of first and second order derivatives.

Lemma 2.25. If f : Rd → R is L-smooth then

    for all x, y ∈ Rd,   f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖². (9)

Proof. Let x, y ∈ Rd be fixed, and let φ(t) := f(x + t(y − x)). Using the Fundamental Theorem of
Calculus on φ, we can write that

    f(y) = f(x) + ∫₀¹ ⟨∇f(x + t(y − x)), y − x⟩ dt
         = f(x) + ⟨∇f(x), y − x⟩ + ∫₀¹ ⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩ dt
         ≤ f(x) + ⟨∇f(x), y − x⟩ + ∫₀¹ ‖∇f(x + t(y − x)) − ∇f(x)‖ ‖y − x‖ dt
         ≤ f(x) + ⟨∇f(x), y − x⟩ + ∫₀¹ Lt ‖y − x‖² dt      (using (8))
         = f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖².

Lemma 2.26. Let f : Rd → R be a twice differentiable L-smooth function. Then, for all x ∈ Rd,
every eigenvalue λ of ∇²f(x) satisfies |λ| ≤ L.

Proof. Use Lemma 2.6 with F = ∇f, together with the fact that D(∇f)(x) = ∇²f(x). We obtain
that, for all x ∈ Rd, ‖∇²f(x)‖ ≤ L. Therefore, for every eigenvalue λ of ∇²f(x), we can write,
for a nonzero eigenvector v ∈ Rd,

    |λ|‖v‖ = ‖λv‖ = ‖∇²f(x)v‖ ≤ ‖∇²f(x)‖‖v‖ ≤ L‖v‖. (10)

The conclusion follows after dividing by ‖v‖ ≠ 0.

Remark 2.27. From Lemmas 2.25 and 2.14 we see that if a function is L-smooth and µ-strongly
convex then µ ≤ L.

Some direct consequences of the smoothness are given in the following lemma. You can compare
(12) with Lemma 2.18.

Lemma 2.28. If f is L-smooth and λ > 0, then

    for all x ∈ Rd,   f(x − λ∇f(x)) − f(x) ≤ −λ(1 − λL/2)‖∇f(x)‖². (11)

If moreover inf f > −∞, then

    for all x ∈ Rd,   (1/2L)‖∇f(x)‖² ≤ f(x) − inf f. (12)

Proof. The first inequality (11) follows by inserting y = x − λ∇f(x) in (9), since

    f(x − λ∇f(x)) ≤ f(x) − λ⟨∇f(x), ∇f(x)⟩ + (L/2)‖λ∇f(x)‖²
                  = f(x) − λ(1 − λL/2)‖∇f(x)‖².

Assume now inf f > −∞. By using (11) with λ = 1/L, we get (12) up to a multiplication by −1:

    inf f − f(x) ≤ f(x − (1/L)∇f(x)) − f(x) ≤ −(1/2L)‖∇f(x)‖². (13)

2.5.2 Smoothness and Convexity

There are many problems in optimization where the function is both smooth and convex. Such
functions enjoy properties which are strictly better than a simple combination of their convex and
smooth properties.

Lemma 2.29. If f : Rd → R is convex and L-smooth, then for all x, y ∈ Rd we have that

    (1/2L)‖∇f(y) − ∇f(x)‖² ≤ f(y) − f(x) − ⟨∇f(x), y − x⟩, (14)
    (1/L)‖∇f(x) − ∇f(y)‖² ≤ ⟨∇f(y) − ∇f(x), y − x⟩. (Co-coercivity) (15)

Proof. To prove (14), fix x, y ∈ Rd and start by using the convexity and the smoothness of f to
write, for every z ∈ Rd,

    f(x) − f(y) = f(x) − f(z) + f(z) − f(y)
                ≤ ⟨∇f(x), x − z⟩ + ⟨∇f(y), z − y⟩ + (L/2)‖z − y‖²,   (using (2) and (9))

To get the tightest upper bound, we can minimize the right-hand side with respect to z, which gives

    z = y − (1/L)(∇f(y) − ∇f(x)).

Substituting this z in gives, after reorganizing the terms:

    f(x) − f(y) ≤ ⟨∇f(x), x − z⟩ + ⟨∇f(y), z − y⟩ + (L/2)‖z − y‖²
                = ⟨∇f(x), x − y⟩ − (1/L)‖∇f(y) − ∇f(x)‖² + (1/2L)‖∇f(y) − ∇f(x)‖²
                = ⟨∇f(x), x − y⟩ − (1/2L)‖∇f(y) − ∇f(x)‖².

This proves (14). To obtain (15), apply (14) twice, interchanging the roles of x and y:

    (1/2L)‖∇f(y) − ∇f(x)‖² ≤ f(y) − f(x) − ⟨∇f(x), y − x⟩,
    (1/2L)‖∇f(x) − ∇f(y)‖² ≤ f(x) − f(y) − ⟨∇f(y), x − y⟩,

and sum these two inequalities.

3 Gradient Descent
Problem 3.1 (Differentiable Function). We want to minimize a differentiable function f : Rd →
R. We require that the problem is well-posed, in the sense that argmin f ≠ ∅.

Algorithm 3.2 (GD). Let x0 ∈ Rd , and let γ > 0 be a step size. The Gradient Descent
(GD) algorithm defines a sequence (xt )t∈N satisfying

xt+1 = xt − γ∇f (xt ). (16)

Remark 3.3 (Vocabulary). Stepsizes are often called learning rates in the machine learning
community.
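To make the update rule (16) concrete, here is a minimal Python sketch of (GD); the quadratic test problem, the names gradient_descent and grad_f, and the stepsize choice γ = 1/L are illustrative assumptions of ours:

```python
import numpy as np

def gradient_descent(grad_f, x0, gamma, n_iters):
    """Run (GD): x^{t+1} = x^t - gamma * grad_f(x^t), for n_iters iterations."""
    x = x0.copy()
    for _ in range(n_iters):
        x = x - gamma * grad_f(x)
    return x

# Illustrative problem: f(x) = 0.5 * ||Phi x - y||^2, whose gradient is Phi^T (Phi x - y).
rng = np.random.default_rng(0)
Phi, y = rng.standard_normal((50, 10)), rng.standard_normal(50)
grad_f = lambda x: Phi.T @ (Phi @ x - y)
L = np.linalg.eigvalsh(Phi.T @ Phi).max()   # smoothness constant of f (largest Hessian eigenvalue)
x_out = gradient_descent(grad_f, np.zeros(10), gamma=1.0 / L, n_iters=500)
```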

We will now prove that the iterates of (GD) converge. In Theorem 3.4 we will prove sub-
linear convergence under the assumption that f is convex. In Theorem 3.6 we will prove linear
convergence (a faster form of convergence) under the stronger assumption that f is µ–strongly
convex.

3.1 Convergence for convex and smooth functions

Theorem 3.4. Consider the Problem (Differentiable Function) and assume that f is convex
and L-smooth, for some L > 0. Let (xt)t∈N be the sequence of iterates generated by the (GD)
algorithm, with a stepsize satisfying 0 < γ ≤ 1/L. Then, for all x* ∈ argmin f and for all t ∈ N,
we have that

    f(xt) − inf f ≤ ‖x0 − x*‖²/(2γt). (17)

For this theorem we give two proofs. The first proof uses an energy function that we will also use
later on. The second proof is a direct proof taken from [5].

Proof of Theorem 3.4 with Lyapunov arguments. Let x* ∈ argmin f be any minimizer of f.
First, we show that f(xt) is decreasing. Indeed we know from (11), and from our assumption
γL ≤ 1, that

    f(xt+1) − f(xt) ≤ −γ(1 − γL/2)‖∇f(xt)‖² ≤ 0. (18)

Second, we show that ‖xt − x*‖² is also decreasing. For this we expand the squares to write

    (1/2γ)‖xt+1 − x*‖² − (1/2γ)‖xt − x*‖²
    = −(1/2γ)‖xt+1 − xt‖² − ⟨∇f(xt), xt+1 − x*⟩
    = −(1/2γ)‖xt+1 − xt‖² − ⟨∇f(xt), xt+1 − xt⟩ + ⟨∇f(xt), x* − xt⟩. (19)

Now, to bound the right-hand side we use the convexity of f and (2) to write

    ⟨∇f(xt), x* − xt⟩ ≤ f(x*) − f(xt) = inf f − f(xt).

To bound the other inner product we use the L-smoothness of f and (9), which gives

    −⟨∇f(xt), xt+1 − xt⟩ ≤ (L/2)‖xt+1 − xt‖² + f(xt) − f(xt+1).

Using the two above inequalities in (19) we obtain

    (1/2γ)‖xt+1 − x*‖² − (1/2γ)‖xt − x*‖² ≤ −(1/2γ)(1 − γL)‖xt+1 − xt‖² − (f(xt+1) − inf f)
                                          ≤ −(f(xt+1) − inf f). (20)

Let us now combine the two positive decreasing quantities f(xt) − inf f and (1/2γ)‖xt − x*‖², and
introduce the following Lyapunov energy, for all t ∈ N:

    Et := (1/2γ)‖xt − x*‖² + t(f(xt) − inf f).

We want to show that it is decreasing with time. For this we start by writing

    Et+1 − Et = (t + 1)(f(xt+1) − inf f) − t(f(xt) − inf f) + (1/2γ)‖xt+1 − x*‖² − (1/2γ)‖xt − x*‖²
              = f(xt+1) − inf f + t(f(xt+1) − f(xt)) + (1/2γ)‖xt+1 − x*‖² − (1/2γ)‖xt − x*‖². (21)

Combining now (21), (18) and (20), we finally obtain (after cancelling terms) that

    Et+1 − Et ≤ f(xt+1) − inf f + (1/2γ)‖xt+1 − x*‖² − (1/2γ)‖xt − x*‖²   (using (18))
              ≤ f(xt+1) − inf f − (f(xt+1) − inf f)                        (using (20))
              = 0.

Thus Et is decreasing. Therefore we can write that

    t(f(xt) − inf f) ≤ Et ≤ E0 = (1/2γ)‖x0 − x*‖²,

and the conclusion follows after dividing by t.

Proof of Theorem 3.4 with direct arguments. Let f be convex and L-smooth (for simplicity
we take γ = 1/L, as in the original argument). It follows that

    ‖xt+1 − x*‖² = ‖xt − x* − (1/L)∇f(xt)‖²
                 = ‖xt − x*‖² − (2/L)⟨xt − x*, ∇f(xt)⟩ + (1/L²)‖∇f(xt)‖²
                 ≤ ‖xt − x*‖² − (1/L²)‖∇f(xt)‖².      (by (15)) (22)

Thus ‖xt − x*‖² is a decreasing sequence in t, and consequently

    ‖xt − x*‖ ≤ ‖x0 − x*‖. (23)

Calling upon (11) with λ = 1/L and subtracting f(x*) from both sides gives

    f(xt+1) − f(x*) ≤ f(xt) − f(x*) − (1/2L)‖∇f(xt)‖². (24)

Applying convexity we have that

    f(xt) − f(x*) ≤ ⟨∇f(xt), xt − x*⟩ ≤ ‖∇f(xt)‖‖xt − x*‖ ≤ ‖∇f(xt)‖‖x0 − x*‖,   (by (23)) (25)

Isolating ‖∇f(xt)‖ in the above and inserting it into (24) gives

    f(xt+1) − f(x*) ≤ f(xt) − f(x*) − β (f(xt) − f(x*))²,   where β := 1/(2L‖x0 − x*‖²). (26)

Let δt = f(xt) − f(x*). Since δt+1 ≤ δt, manipulating (26) gives

    δt+1 ≤ δt − βδt²  ⇔  β δt/δt+1 ≤ 1/δt+1 − 1/δt   (multiplying by 1/(δt δt+1))
                       ⇒  β ≤ 1/δt+1 − 1/δt           (since δt+1 ≤ δt).

Summing both sides over t = 0, . . . , T − 1 and using telescopic cancellation, we have that

    Tβ ≤ 1/δT − 1/δ0 ≤ 1/δT.

Re-arranging the above, we have that

    f(xT) − f(x*) = δT ≤ 1/(βT) = 2L‖x0 − x*‖²/T.

Corollary 3.5 (O(1/ε) Complexity). Under the assumptions of Theorem 3.4, for a given ε > 0
and γ = 1/L, we have that

    t ≥ (L/ε) ‖x0 − x*‖²/2  =⇒  f(xt) − inf f ≤ ε. (27)

3.2 Convergence for strongly convex and smooth functions


Now we prove the convergence of gradient descent for strongly convex and smooth functions.

Theorem 3.6. Consider the Problem (Differentiable Function) and assume that f is µ-strongly
convex and L-smooth, for some L ≥ µ > 0. Let (xt)t∈N be the sequence of iterates generated by
the (GD) algorithm, with a stepsize satisfying 0 < γ ≤ 1/L. Then, for x* = argmin f and for all
t ∈ N:

    ‖xt+1 − x*‖² ≤ (1 − γµ)^(t+1) ‖x0 − x*‖². (28)

Remark 3.7. Note that with the choice γ = 1/L, the iterates enjoy a linear convergence with a
rate of (1 − µ/L).

Below we provide two different proofs for this Theorem 3.6. The first one makes use of first-
order variational inequalities induced by the strong convexity and smoothness of f . The second
one (assuming further that f is twice differentiable) exploits the fact that the eigenvalues of the
Hessian of f are in between µ and L.

Proof of Theorem 3.6 with first-order properties. From (GD) we have that

    ‖xt+1 − x*‖² = ‖xt − x* − γ∇f(xt)‖²
                 = ‖xt − x*‖² − 2γ⟨∇f(xt), xt − x*⟩ + γ²‖∇f(xt)‖²
                 ≤ (1 − γµ)‖xt − x*‖² − 2γ(f(xt) − inf f) + γ²‖∇f(xt)‖²          (by (4))
                 ≤ (1 − γµ)‖xt − x*‖² − 2γ(f(xt) − inf f) + 2γ²L(f(xt) − inf f)   (by (12))
                 = (1 − γµ)‖xt − x*‖² − 2γ(1 − γL)(f(xt) − inf f). (29)

Since γ ≤ 1/L we have that −2γ(1 − γL) is nonpositive, and can thus be safely dropped to give

    ‖xt+1 − x*‖² ≤ (1 − γµ)‖xt − x*‖².

It now remains to unroll the recurrence.

Proof of Theorem 3.6 with the Hessian. Let T : Rd → Rd be defined by T(x) = x − γ∇f(x),
so that we can write an iteration of Gradient Descent as xt+1 = T(xt). Note that the minimizer
x* verifies ∇f(x*) = 0, so it is a fixed point of T in the sense that T(x*) = x*. This means that
‖xt+1 − x*‖ = ‖T(xt) − T(x*)‖. Now we want to prove that

    ‖T(xt) − T(x*)‖ ≤ (1 − γµ)‖xt − x*‖. (30)

Indeed, unrolling the recurrence from (30) provides the desired bound (28).
We see that (30) is true as long as T is θ-Lipschitz, with θ = 1 − γµ. From Lemma 2.6, we
know that this is equivalent to proving that the norm of the differential of T is bounded by θ. It is
easy to compute this differential: DT(x) = Id − γ∇²f(x). If we denote by v1(x) ≤ · · · ≤ vd(x) the
eigenvalues of ∇²f(x), we know by Lemmas 2.15 and 2.26 that µ ≤ vi(x) ≤ L. Since we assume
γL ≤ 1, we see that 0 ≤ 1 − γvi(x) ≤ 1 − γµ. So we can write

    for all x ∈ Rd,   ‖DT(x)‖ = max_{i=1,...,d} |1 − γvi(x)| ≤ 1 − γµ = θ,

which allows us to conclude that (30) is true. To conclude the proof of Theorem 3.6, take squares
in (30) and use the fact that θ ∈ ]0, 1[ implies θ² ≤ θ.

The linear convergence rate in Theorem 3.6 can be transformed into a complexity result as we
show next.

Corollary 3.8 (O(log(1/ε)) Complexity). Under the same assumptions as Theorem 3.6, for a
given ε > 0, we have that if γ = 1/L then

    t ≥ (L/µ) log(1/ε)  =⇒  ‖xt+1 − x*‖² ≤ ε ‖x0 − x*‖². (31)

The proof of this lemma follows by applying Lemma A.1 in the appendix.
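The contraction factor of Remark 3.7 can be observed numerically; the strongly convex quadratic below is an illustrative choice of ours, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
H = A.T @ A + np.eye(10)              # Hessian of f(x) = 0.5 * x^T H x, minimized at x* = 0
eigs = np.linalg.eigvalsh(H)
mu, L = eigs[0], eigs[-1]
gamma = 1.0 / L
x0 = rng.standard_normal(10)
x = x0.copy()
for _ in range(200):
    x = x - gamma * (H @ x)           # (GD) step, since grad f(x) = H x
# Theorem 3.6 guarantees ||x^t - x*||^2 <= (1 - gamma*mu)^t ||x0 - x*||^2:
print(np.sum(x**2) <= (1 - gamma * mu)**200 * np.sum(x0**2))   # True
```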

3.3 Convergence for Polyak-Lojasiewicz and smooth functions


Here we present a convergence result for nonconvex functions satisfying the Polyak-Lojasiewicz
condition (see Definition 2.17). This is a favorable setting, since all local minima and critical
points are also global minima (Lemma 2.22), which will guarantee convergence. Moreover the PL
property imposes a quadratic growth on the function, so we will recover bounds which are similar
to the strongly convex case.

Theorem 3.9. Consider the Problem (Differentiable Function) and assume that f is µ-Polyak-
Lojasiewicz and L-smooth, for some L ≥ µ > 0. Consider (xt)t∈N a sequence generated by the
(GD) algorithm, with a stepsize satisfying 0 < γ ≤ 1/L. Then:

    f(xt) − inf f ≤ (1 − γµ)^t (f(x0) − inf f).

Proof. We can use Lemma 2.25, together with the update rule of (GD), to write

    f(xt+1) ≤ f(xt) + ⟨∇f(xt), xt+1 − xt⟩ + (L/2)‖xt+1 − xt‖²
            = f(xt) − γ‖∇f(xt)‖² + (Lγ²/2)‖∇f(xt)‖²
            = f(xt) − (γ/2)(2 − Lγ)‖∇f(xt)‖²
            ≤ f(xt) − (γ/2)‖∇f(xt)‖²,

where in the last inequality we used our hypothesis on the stepsize, γL ≤ 1. We can now use
the Polyak-Lojasiewicz property (recall Definition 2.17) to write:

    f(xt+1) ≤ f(xt) − γµ(f(xt) − inf f).

The conclusion follows after subtracting inf f on both sides of this inequality, and using recursion.

Corollary 3.10 (O(log(1/ε)) Complexity). Under the same assumptions as Theorem 3.9, for a
given ε > 0, we have that if γ = 1/L then

    t ≥ (L/µ) log(1/ε)  =⇒  f(xt) − inf f ≤ ε (f(x0) − inf f). (32)

The proof of this corollary follows by applying Lemma A.1 in the appendix.

3.4 Bibliographic notes


Our second proof for the convex and smooth case in Theorem 3.4 is taken from [5]. Proofs in the
convex and strongly convex cases can be found in [31]. Our proof under the Polyak-Lojasiewicz
condition was taken from [19].

4 Theory: Sum of functions


4.1 Definitions
In the next sections we will assume that our objective function is a sum of functions.

Problem 4.1 (Sum of Functions). We want to minimize a function f : Rd → R which writes as

    f(x) := (1/n) Σ_{i=1}^n fi(x),

where fi : Rd → R. We require that the problem is well-posed, in the sense that argmin f ≠ ∅,
and that the fi's are bounded from below.

Depending on the applications, we will consider two different sets of assumptions.

Assumption 4.2 (Sum of Convex). We consider the Problem (Sum of Functions) where each
fi : Rd → R is assumed to be convex.

Assumption 4.3 (Sum of Lmax-Smooth). We consider the Problem (Sum of Functions) where
each fi : Rd → R is assumed to be Li-smooth. We write Lmax := max_{i=1,...,n} Li and
Lavg := (1/n) Σ_{i=1}^n Li. We also write Lf for the Lipschitz constant of ∇f.

Note that, in the above Assumption (Sum of Lmax –Smooth), the existence of Lf is not an
assumption but the consequence of the smoothness of the fi ’s. Indeed:

Lemma 4.4. Consider the Problem (Sum of Functions). If the fi's are Li-smooth, then f is
Lavg-smooth.

Proof. Using the triangle inequality we have that

    ‖∇f(y) − ∇f(x)‖ = ‖(1/n) Σ_{i=1}^n (∇fi(y) − ∇fi(x))‖
                    ≤ (1/n) Σ_{i=1}^n ‖∇fi(y) − ∇fi(x)‖
                    ≤ (1/n) Σ_{i=1}^n Li ‖y − x‖ = Lavg ‖y − x‖.

This proves that f is Lavg-smooth.
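For instance, for a least-squares sum with fi(x) = ½(⟨φi, x⟩ − yi)² (an illustrative setting of ours), each Li equals ‖φi‖², and one can compare Lf, Lavg and Lmax numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
Phi = rng.standard_normal((n, d))                # rows phi_i of a least-squares problem
L_i = np.sum(Phi**2, axis=1)                     # f_i(x) = 0.5*(<phi_i,x> - y_i)^2 is ||phi_i||^2-smooth
L_avg, L_max = L_i.mean(), L_i.max()
L_f = np.linalg.eigvalsh(Phi.T @ Phi / n).max()  # exact smoothness constant of f = (1/n) sum f_i
print(L_f <= L_avg <= L_max)                     # True: L_avg upper bounds L_f, as Lemma 4.4 states
```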

Definition 4.5 (Notation). Given two random variables X, Y in Rd, we write:

• the expectation of X as E[X],

• the expectation of X conditioned on Y as E[X | Y],

• the variance of X as V[X] := E[‖X − E[X]‖²].

Lemma 4.6 (Variance and expectation). Let X be a random variable in Rd.

1. For all y ∈ Rd, V[X] ≤ E[‖X − y‖²].

2. V[X] ≤ E[‖X‖²].

Proof. Item 2 is a direct consequence of item 1 with y = 0. To prove item 1, we use that

    ‖X − E[X]‖² = ‖X − y‖² + ‖y − E[X]‖² + 2⟨X − y, y − E[X]⟩,

and then take the expectation, using E[⟨X − y, y − E[X]⟩] = −‖y − E[X]‖², to conclude

    V[X] = E[‖X − y‖²] − ‖y − E[X]‖² ≤ E[‖X − y‖²].
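A quick Monte Carlo illustration of item 1 (the random vector and the point y below are arbitrary illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 3))                   # samples of a random vector in R^3
y = np.array([1.0, -2.0, 0.5])                          # any fixed y
variance = np.mean(np.sum((X - X.mean(axis=0))**2, axis=1))
print(variance <= np.mean(np.sum((X - y)**2, axis=1)))  # V[X] <= E||X - y||^2: True
```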

4.2 Expected smoothness


Here we focus on the smoothness properties that the functions fi verify in expectation. The so-
called expected smoothness property below can be seen as “cocoercivity in expectation” (remember
Lemma 2.29).

Lemma 4.7. If Assumptions (Sum of Lmax-Smooth) and (Sum of Convex) hold, then f is
Lmax-smooth in expectation, in the sense that

    for all x, y ∈ Rd,   (1/2Lmax) E[‖∇fi(y) − ∇fi(x)‖²] ≤ f(y) − f(x) − ⟨∇f(x), y − x⟩. (33)

Proof. Using (14) in Lemma 2.29 applied to fi, together with the fact that Li ≤ Lmax, allows us
to write

    (1/2Lmax) ‖∇fi(y) − ∇fi(x)‖² ≤ fi(y) − fi(x) − ⟨∇fi(x), y − x⟩.

To conclude, multiply the above inequality by 1/n and sum over i, using the facts that
(1/n) Σi fi = f and (1/n) Σi ∇fi = ∇f.

As a direct consequence we also have the analog of Lemma 2.29 in expectation.

Lemma 4.8. If Assumptions (Sum of Lmax-Smooth) and (Sum of Convex) hold, then, for every
x ∈ Rd and every x* ∈ argmin f, we have that

    (1/2Lmax) E[‖∇fi(x) − ∇fi(x*)‖²] ≤ f(x) − inf f. (34)

Proof. Apply Lemma 4.7 with x = x* and y = x, since f(x*) = inf f and ∇f(x*) = 0.

4.3 Controlling the variance


Some stochastic problems are easier than others. For instance, a problem where all the fi's are
the same is easy to solve, as it suffices to minimize one fi to obtain a minimizer of f. We can also
imagine that if the fi's are not exactly the same but look similar, the problem will also be easy.
And of course, we expect that the easier the problem, the faster our algorithms will be. In this
section we present one way to quantify this phenomenon.

4.3.1 Interpolation

Definition 4.9. Consider the Problem (Sum of Functions). We say that interpolation holds
if there exists a common x∗ ∈ Rd such that fi (x∗ ) = inf fi for all i = 1, . . . , n. In this case, we
say that interpolation holds at x∗ .

Even though unspecified, the x∗ appearing in Definition 4.9 must be a minimizer of f .

Lemma 4.10. Consider the Problem (Sum of Functions). If interpolation holds at x* ∈ Rd,
then x* ∈ argmin f.

Proof. Let interpolation hold at x* ∈ Rd. By Definition 4.9, this means that x* ∈ argmin fi for
every i. Therefore, for every x ∈ Rd,

    f(x*) = (1/n) Σ_{i=1}^n fi(x*) = (1/n) Σ_{i=1}^n inf fi ≤ (1/n) Σ_{i=1}^n fi(x) = f(x).

This proves that x* ∈ argmin f.

Interpolation means that there exists some x∗ that simultaneously achieves the minimum of
all loss functions fi . In terms of learning problems, this means that the model perfectly fits every
data point. This is illustrated below with a couple of examples.

Example 4.11 (Least-squares and interpolation). Consider a regression problem with data
(φi, yi)_{i=1}^n ⊂ Rd × R, and let f(x) = (1/2n)‖Φx − y‖² be the corresponding least-squares
function, with Φ the matrix whose rows are the φiᵀ and y = (yi)_{i=1}^n. This is a particular case
of Problem (Sum of Functions), with fi(x) = ½(⟨φi, x⟩ − yi)². We see here that interpolation holds
if and only if there exists x* ∈ Rd such that ⟨φi, x*⟩ = yi for every i. In other words, we can find
a hyperplane in Rd × R passing through each data point (φi, yi). This is why we talk about
interpolation.
For this linear model, note that interpolation holds if and only if y is in the range of Φ, which
is always true if Φ is surjective. This generically holds when d > n, which is usually called the
overparametrized regime.
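Numerically, interpolation can be tested by checking whether the least-squares residual vanishes; in the illustrative overparametrized instance below (d > n, random data, our own choices), it does:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                                    # overparametrized regime: d > n
Phi, y = rng.standard_normal((n, d)), rng.standard_normal(n)
x_star = np.linalg.lstsq(Phi, y, rcond=None)[0]  # a least-squares solution
print(np.allclose(Phi @ x_star, y))              # True: residual is zero, interpolation holds
```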

Example 4.12 (Neural Networks and interpolation). Let Φ : Rd → Rn, y ∈ Rn, and consider
the nonlinear least-squares f(x) = ½‖Φ(x) − y‖². As in the linear case, interpolation holds if and
only if there exists x* ∈ Rd such that Φ(x*) = y, or equivalently, if inf f = 0. The interpolation
condition has drawn much attention recently because it was empirically observed that many
overparametrized deep neural networks (with d ≫ n) achieve inf f = 0 [24, 23].

4.3.2 Interpolation constants

Here we introduce different measures of how far from interpolation we are. We start with a first
quantity measuring how the infimum of f and the fi ’s are related.

Definition 4.13. Consider the Problem (Sum of Functions). We define the function noise as

    ∆*f := inf f − (1/n) Σ_{i=1}^n inf fi. (35)

Example 4.14 (Function noise for least-squares). Let f be a least-squares as in Example 4.11.
It is easy to see that inf fi = 0, implying that the function noise is exactly ∆∗f = inf f . We see in
this case that ∆∗f = 0 if and only if interpolation holds (see also the next Lemma). If the function
noise ∆∗f = inf f is nonzero, it can be seen as a measure of how far we are from interpolation.

Lemma 4.15. Consider the Problem (Sum of Functions). We have that

1. ∆*f ≥ 0.

2. Interpolation holds if and only if ∆*f = 0.

Proof.

1. Let x* ∈ argmin f, so that we can write

    ∆*f = f(x*) − (1/n) Σ_{i=1}^n inf fi ≥ f(x*) − (1/n) Σ_{i=1}^n fi(x*) = f(x*) − f(x*) = 0.

2. Let interpolation hold at x* ∈ Rd. According to Definition 4.9 we have x* ∈ argmin fi for
every i, and according to Lemma 4.10 we have x* ∈ argmin f. So we indeed have

    ∆*f = inf f − (1/n) Σ_{i=1}^n inf fi = f(x*) − (1/n) Σ_{i=1}^n fi(x*) = f(x*) − f(x*) = 0.

If instead we have ∆*f = 0, then we can write, for some x* ∈ argmin f,

    0 = ∆*f = f(x*) − (1/n) Σ_{i=1}^n inf fi = (1/n) Σ_{i=1}^n (fi(x*) − inf fi).

Clearly we have fi(x*) − inf fi ≥ 0, so this sum being 0 implies that fi(x*) − inf fi = 0 for
all i = 1, . . . , n. In other words, interpolation holds.

We can also measure how close we are to interpolation using gradients instead of function
values.

Definition 4.16. Let Assumption (Sum of Lmax-Smooth) hold. We define the gradient noise as

    σ*f := inf_{x*∈argmin f} V[∇fi(x*)], (36)

where, for a random vector X ∈ Rd, we use V[X] := E[‖X − E[X]‖²].

Lemma 4.17. Let Assumption (Sum of Lmax-Smooth) hold. It follows that

1. σ*f ≥ 0.

2. If Assumption (Sum of Convex) holds, then σ*f = V[∇fi(x*)] for every x* ∈ argmin f.

3. If interpolation holds then σ*f = 0. This becomes an equivalence if Assumption (Sum of
Convex) holds.

Proof.

1. From Definition 4.5 the variance V[∇fi(x*)] is nonnegative, which implies σ*f ≥ 0.

2. Let x*, x' ∈ argmin f, and let us show that V[∇fi(x*)] = V[∇fi(x')]. Since Assumptions
(Sum of Lmax-Smooth) and (Sum of Convex) hold, we can use the expected smoothness via
Lemma 4.8 to obtain

    (1/2Lmax) E[‖∇fi(x') − ∇fi(x*)‖²] ≤ f(x') − inf f = inf f − inf f = 0.

This means that E[‖∇fi(x') − ∇fi(x*)‖²] = 0, which in turn implies that, for every i =
1, . . . , n, we have ‖∇fi(x') − ∇fi(x*)‖ = 0. In other words, ∇fi(x') = ∇fi(x*), and thus
V[∇fi(x*)] = V[∇fi(x')].

3. If interpolation holds, then there exists x* ∈ Rd such that x* ∈ argmin fi for every
i = 1, . . . , n, and x* ∈ argmin f (see Lemma 4.10). From Fermat's theorem, this implies that
∇fi(x*) = 0 and ∇f(x*) = 0. Consequently V[∇fi(x*)] = E[‖∇fi(x*) − ∇f(x*)‖²] = 0. This proves
that σ*f = 0. Now, if Assumption (Sum of Convex) holds and σ*f = 0, then we can use the
previous item to say that for any x* ∈ argmin f we have V[∇fi(x*)] = 0. By definition of the
variance and the fact that ∇f(x*) = 0, this implies that for every i = 1, . . . , n, ∇fi(x*) = 0.
Using again the convexity of the fi's, we deduce that x* ∈ argmin fi, which means that
interpolation holds.

Both σ*f and ∆*f measure how far we are from interpolation. Furthermore, these two constants
are related through the following bounds.

Lemma 4.18. Let Assumption (Sum of Lmax-Smooth) hold.

1. We have σ*f ≤ 2Lmax ∆*f.

2. If moreover each fi is µ-strongly convex, then 2µ∆*f ≤ σ*f.

Proof.

1. Let x* ∈ argmin f. Using Lemma 2.28, we can write ‖∇fi(x*)‖² ≤ 2Lmax(fi(x*) − inf fi) for
each i. The conclusion follows directly after taking the expectation of this inequality, and
using the fact that E[‖∇fi(x*)‖²] = V[∇fi(x*)] ≥ σ*f.

2. This is exactly the same proof, except that we use Lemma 2.18 instead of Lemma 2.28.

4.3.3 Variance transfer

Here we provide two lemmas which allow us to exchange variance-like terms such as E[‖∇fi(x)‖²]
for interpolation constants and function values. This is important since E[‖∇fi(x)‖²] controls
the variance of the gradients (see Lemma 4.6).

Lemma 4.19 (Variance transfer: function noise). If Assumption (Sum of Lmax-Smooth) holds,
then for all x ∈ Rd we have

    E[‖∇fi(x)‖²] ≤ 2Lmax(f(x) − inf f) + 2Lmax ∆*f.

Proof. Let x ∈ Rd and x* ∈ argmin f. Using Lemma 2.28, we can write, for each i,

    ‖∇fi(x)‖² ≤ 2Lmax(fi(x) − inf fi) = 2Lmax(fi(x) − fi(x*)) + 2Lmax(fi(x*) − inf fi). (37)

The conclusion follows directly after taking the expectation of the above inequality.

Lemma 4.20 (Variance transfer: gradient noise). If Assumptions (Sum of Lmax-Smooth) and
(Sum of Convex) hold, then for all x ∈ Rd we have that

    E[‖∇fi(x)‖²] ≤ 4Lmax(f(x) − inf f) + 2σ*f. (38)

Proof. Let x* ∈ argmin f, so that σ*f = V[∇fi(x*)] according to Lemma 4.17. Start by writing

    ‖∇fi(x)‖² ≤ 2‖∇fi(x) − ∇fi(x*)‖² + 2‖∇fi(x*)‖².

Taking the expectation of the above inequality, then applying Lemma 4.8, gives the result.

5 Stochastic Gradient Descent


Algorithm 5.1 (SGD). Consider Problem (Sum of Functions). Let x0 ∈ Rd, and let γt > 0 be
a sequence of stepsizes. The Stochastic Gradient Descent (SGD) algorithm is given by the
iterates (xt)t∈N where

    it ∈ {1, . . . , n} sampled with probability 1/n, (39)
    xt+1 = xt − γt ∇fit(xt). (40)

Remark 5.2 (Unbiased estimator of the gradient). An important feature of the (SGD) algorithm
is that at each iteration we follow the direction −∇fit(xt), which is an unbiased estimator
of −∇f(xt). Indeed,

    E[∇fi(xt) | xt] = Σ_{i=1}^n (1/n) ∇fi(xt) = ∇f(xt). (41)
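Here is a minimal Python sketch of (SGD); the finite-sum least-squares problem, the name sgd, and the constant stepsize 1/(4Lmax) are illustrative assumptions of ours:

```python
import numpy as np

def sgd(grad_fi, n, x0, stepsize, n_iters, rng):
    """Run (SGD): sample i_t uniformly in {0,...,n-1}, then x^{t+1} = x^t - gamma_t * grad f_{i_t}(x^t)."""
    x = x0.copy()
    for t in range(n_iters):
        i = rng.integers(n)                 # sampled with probability 1/n, as in (39)
        x = x - stepsize(t) * grad_fi(x, i)
    return x

# Illustrative finite sum: f_i(x) = 0.5*(<phi_i, x> - y_i)^2, so grad f_i(x) = (<phi_i,x> - y_i) phi_i.
rng = np.random.default_rng(0)
n, d = 100, 10
Phi, y = rng.standard_normal((n, d)), rng.standard_normal(n)
grad_fi = lambda x, i: (Phi[i] @ x - y[i]) * Phi[i]
L_max = np.sum(Phi**2, axis=1).max()        # each f_i is ||phi_i||^2-smooth
x_out = sgd(grad_fi, n, np.zeros(d), lambda t: 1 / (4 * L_max), 5000, rng)
```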

5.1 Convergence for convex and smooth functions

The behaviour of the (SGD) algorithm is very dependent on the choice of the sequence of stepsizes
γt. In our next Theorem 5.3 we prove the convergence of SGD with a sequence of stepsizes where
each γt is less than 1/(2Lmax). The particular cases of constant stepsizes and of decreasing stepsizes
are dealt with in Theorem 5.5.

Theorem 5.3. Let Assumptions (Sum of Lmax-Smooth) and (Sum of Convex) hold. Consider
(xt)t∈N a sequence generated by the (SGD) algorithm, with stepsizes satisfying 0 < γt < 1/(2Lmax).
It follows that

    E[f(x̄t)] − inf f ≤ ‖x0 − x*‖² / (2 Σ_{k=0}^{t−1} γk(1 − 2γk Lmax))
                      + (Σ_{k=0}^{t−1} γk² / Σ_{k=0}^{t−1} γk(1 − 2γk Lmax)) σ*f, (42)

where x̄t := Σ_{k=0}^{t−1} pt,k xk, with pt,k := γk(1 − 2γk Lmax) / Σ_{i=0}^{t−1} γi(1 − 2γi Lmax).

Proof. Let x* ∈ argmin f, so that σ*f = V[∇fi(x*)] (see Lemma 4.17). We write Ek[·] instead
of E[· | xk], for simplicity. Let us start by analyzing the behaviour of ‖xk − x*‖². Developing
the squares, we obtain

    ‖xk+1 − x*‖² = ‖xk − x*‖² − 2γk⟨∇fik(xk), xk − x*⟩ + γk²‖∇fik(xk)‖².

Hence, after taking the expectation conditioned on xk, we can use the convexity of f and the
variance transfer lemma to obtain:

    Ek[‖xk+1 − x*‖²] = ‖xk − x*‖² + 2γk⟨∇f(xk), x* − xk⟩ + γk² Ek[‖∇fi(xk)‖²]
                     ≤ ‖xk − x*‖² + 2γk(2γk Lmax − 1)(f(xk) − inf f) + 2γk²σ*f,   (by (2) and (38))

Rearranging and taking the expectation, we have

    2γk(1 − 2γk Lmax) E[f(xk) − inf f] ≤ E[‖xk − x*‖²] − E[‖xk+1 − x*‖²] + 2γk²σ*f.

Summing over k = 0, . . . , t − 1 and using telescopic cancellation gives

    2 Σ_{k=0}^{t−1} γk(1 − 2γk Lmax) E[f(xk) − inf f] ≤ ‖x0 − x*‖² − E[‖xt − x*‖²] + 2σ*f Σ_{k=0}^{t−1} γk².

Since E[‖xt − x*‖²] ≥ 0, dividing both sides by 2 Σ_{i=0}^{t−1} γi(1 − 2γi Lmax) gives:

    Σ_{k=0}^{t−1} pt,k E[f(xk) − inf f] ≤ ‖x0 − x*‖² / (2 Σ_{i=0}^{t−1} γi(1 − 2γi Lmax))
                                        + (Σ_{k=0}^{t−1} γk² / Σ_{i=0}^{t−1} γi(1 − 2γi Lmax)) σ*f,

where we define, for k = 0, . . . , t − 1,

    pt,k := γk(1 − 2γk Lmax) / Σ_{i=0}^{t−1} γi(1 − 2γi Lmax),

and observe that pt,k ≥ 0 and Σ_{k=0}^{t−1} pt,k = 1. This allows us to treat (pt,k)_{k=0}^{t−1} as
a probability vector. Indeed, using that f is convex together with Jensen's inequality gives

    E[f(x̄t)] − f(x*) ≤ Σ_{k=0}^{t−1} pt,k E[f(xk) − inf f]
                      ≤ ‖x0 − x*‖² / (2 Σ_{i=0}^{t−1} γi(1 − 2γi Lmax))
                        + (Σ_{k=0}^{t−1} γk² / Σ_{i=0}^{t−1} γi(1 − 2γi Lmax)) σ*f.
Remark 5.4 (On the choice of stepsizes for (SGD)). Looking at the bound obtained in Theorem
5.3, we see that the first thing we want is Σ_{s=0}^∞ γs = +∞, so that the first term (a.k.a. the
bias term) vanishes. This can be achieved with constant stepsizes, or with stepsizes of the form
1/t^α with α < 1 (see Theorem 5.5 below). The second term (a.k.a. the variance term) is less
trivial to analyse.

• If interpolation holds (see Definition 4.9), then the variance term σ*f is zero. This means
that the expected values converge to zero at a rate of the order 1/Σ_{s=0}^{t−1} γs. For constant
stepsizes this gives a O(1/t) rate. For decreasing stepsizes γt = 1/t^α this gives a O(1/t^(1−α))
rate. We see that the best among those rates is obtained when α = 0 and the decay of the
stepsize is slowest, in other words when the stepsize is constant. Thus when interpolation
holds the problem is so easy that the stochastic algorithm behaves like the deterministic
one and enjoys a 1/t rate with constant stepsize, as in Theorem 3.4.

• If interpolation does not hold, the expected values will be asymptotically controlled by

    Σ_{s=0}^{t−1} γs² / Σ_{s=0}^{t−1} γs.

We see that we want γs to decrease as slowly as possible (so that the denominator is
big) but at the same time to vanish as fast as possible (so that the numerator is
small), so a trade-off must be found. For constant stepsizes, this term becomes a constant
O(1), and thus (SGD) does not converge with constant stepsizes. For decreasing stepsizes
γt = 1/t^α, this term becomes (omitting logarithmic terms) O(1/t^α) if 0 < α ≤ 1/2, and
O(1/t^(1−α)) if 1/2 ≤ α < 1. So the best compromise for this bound is to take α = 1/2.
This case is detailed in the next theorem.

Theorem 5.5. Let Assumptions (Sum of Lmax-Smooth) and (Sum of Convex) hold. Consider
(xt)t∈N a sequence generated by the (SGD) algorithm with stepsizes γt > 0.

1. If γt = γ < 1/(2Lmax), then for every t ≥ 1 we have that

    E[f(x̄t)] − f(x*) ≤ (‖x0 − x*‖² / (2γ(1 − 2γLmax))) (1/t) + (γ/(1 − 2γLmax)) σ*f, (43)

where x̄t := (1/t) Σ_{k=0}^{t−1} xk.

2. If γt = γ/√(t+1) with γ ≤ 1/(2Lmax), then for every t ≥ 64 we have that

    E[f(x̄t)] − f(x*) ≤ ‖x0 − x*‖²/(2γ√t) + (γ(1 + log(t))/√t) σ*f = O(log(t)/√t), (44)

where x̄t := Σ_{k=0}^{t−1} pt,k xk, with pt,k := γk(1 − 2γk Lmax)/Σ_{i=0}^{t−1} γi(1 − 2γi Lmax) as
in Theorem 5.3.

Proof. For the different choices of stepsizes:

1. For γt = γ, it suffices to replace γt by γ in (42).

2. For γt = γ/√(t+1), we first need some estimates on the sums of the stepsizes and of their
squares. Using an integral bound, we have that

    Σ_{s=0}^{t−1} γs² = γ² Σ_{s=0}^{t−1} 1/(s+1) ≤ γ²(1 + log(t)). (45)

Furthermore, using an integral bound again, we have that

    Σ_{s=0}^{t−1} γs = γ Σ_{s=0}^{t−1} 1/√(s+1) ≥ 2γ(√t − √2). (46)

Now using (45) and (46), together with the fact that γLmax ≤ 1/2, we have that

    Σ_{s=0}^{t−1} γs(1 − 2γs Lmax) ≥ 2γ(√t − √2) − 2γ²Lmax(1 + log(t))
                                   ≥ 2γ(√t − √2 − 1/2 − log(√t)).

Because X ↦ X/2 − log(X) is increasing for X ≥ 2, and since 8/2 − log(8) ≥ √2 + 1/2, we have
X/2 − log(X) ≥ √2 + 1/2 for all X ≥ 8. Taking X = √t, we deduce from the above inequality
that for t ≥ 64 we have

    Σ_{s=0}^{t−1} γs(1 − 2γs Lmax) ≥ γ√t.

It remains to use the above inequality together with (45) in (42) to arrive at

    E[f(x̄t)] − f(x*) ≤ ‖x0 − x*‖² / (2 Σ_{s=0}^{t−1} γs(1 − 2γs Lmax))
                        + (Σ_{s=0}^{t−1} γs² / Σ_{s=0}^{t−1} γs(1 − 2γs Lmax)) σ*f
                     ≤ ‖x0 − x*‖²/(2γ√t) + (γ(1 + log(t))/√t) σ*f
                     = O(log(t)/√t).
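Continuing the sketch given after Remark 5.2 (same imports, illustrative data, and sgd function; this is only a usage example), the two stepsize regimes of Theorem 5.5 would be run as follows:

```python
# Item 1: constant stepsize gamma < 1/(2 L_max).
x_const = sgd(grad_fi, n, np.zeros(d), lambda t: 1 / (4 * L_max), 5000, rng)

# Item 2: decreasing stepsize gamma_t = gamma / sqrt(t + 1), with gamma = 1/(2 L_max).
x_decr = sgd(grad_fi, n, np.zeros(d),
             lambda t: 1 / (2 * L_max * np.sqrt(t + 1)), 5000, rng)
```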
Corollary 5.6 (O(1/ε²) Complexity). Consider the setting of Theorem 5.5. Let ε > 0 and let
γ = 1/(2Lmax√t). It follows that, for t ≥ 4,

    t ≥ (1/ε²)(2Lmax‖x0 − x*‖² + σ*f/Lmax)²  =⇒  E[f(x̄t)] − f(x*) ≤ ε. (47)

Proof. Plugging γ = 1/(2Lmax√t) into (43), we have

    E[f(x̄t)] − f(x*) ≤ Lmax‖x0 − x*‖² / (√t (1 − 1/√t)) + σ*f / (2Lmax√t (1 − 1/√t)).

Now note that for t ≥ 4 we have that

    1/(1 − 1/√t) ≤ 2.

Using this in the above, we have that

    E[f(x̄t)] − f(x*) ≤ (1/√t)(2Lmax‖x0 − x*‖² + σ*f/Lmax).

Consequently (47) follows by demanding that the right-hand side be less than or equal to ε.

5.2 Convergence for strongly convex and smooth functions

Theorem 5.7. Let Assumptions (Sum of Lmax-Smooth) and (Sum of Convex) hold, and assume
further that f is µ-strongly convex. Consider the sequence (xt)t∈N generated by the (SGD)
algorithm with a constant stepsize satisfying 0 < γ < 1/(2Lmax). It follows that, for t ≥ 0,

    E[‖xt − x*‖²] ≤ (1 − γµ)^t ‖x0 − x*‖² + (2γ/µ) σ*f.

Proof. Let x* ∈ argmin f, so that σ*f = V[∇fi(x*)] (see Lemma 4.17). We write Ek[·] instead
of E[· | xk], for simplicity. Expanding the squares, we have

    ‖xk+1 − x*‖² = ‖xk − x* − γ∇fi(xk)‖²
                 = ‖xk − x*‖² − 2γ⟨xk − x*, ∇fi(xk)⟩ + γ²‖∇fi(xk)‖².

Taking the expectation conditioned on xk, we obtain

    Ek[‖xk+1 − x*‖²] = ‖xk − x*‖² − 2γ⟨xk − x*, ∇f(xk)⟩ + γ² Ek[‖∇fi(xk)‖²]            (by (41))
                     ≤ (1 − γµ)‖xk − x*‖² − 2γ(f(xk) − f(x*)) + γ² Ek[‖∇fi(xk)‖²].     (by Lemma 2.14)

Taking expectations again and using Lemma 4.20 gives

    E[‖xk+1 − x*‖²] ≤ (1 − γµ) E[‖xk − x*‖²] + 2γ(2γLmax − 1) E[f(xk) − f(x*)] + 2γ²σ*f   (by (38))
                    ≤ (1 − γµ) E[‖xk − x*‖²] + 2γ²σ*f,

where in the last inequality we used that 2γLmax ≤ 1, since γ ≤ 1/(2Lmax). Recursively applying
the above and summing up the resulting geometric series gives

    E[‖xk − x*‖²] ≤ (1 − γµ)^k ‖x0 − x*‖² + 2γ²σ*f Σ_{j=0}^{k−1} (1 − γµ)^j
                  ≤ (1 − γµ)^k ‖x0 − x*‖² + 2γσ*f/µ.

Corollary 5.8 (Õ(1/ε) Complexity). Consider the setting of Theorem 5.7. Let ε > 0. If we set

    γ = min{ εµ/(4σ*f), 1/(2Lmax) }, (48)

then

    t ≥ max{ (1/ε)(4σ*f/µ²), 2Lmax/µ } log(2‖x0 − x*‖²/ε)  =⇒  E[‖xt − x*‖²] ≤ ε. (49)

Proof. Applying Lemma A.2 with A = 2σ*f/µ, C = 2Lmax and α0 = ‖x0 − x*‖² gives the result (49).
5.3 Convergence for Polyak-Lojasiewicz and smooth functions

Theorem 5.9. Let Assumption (Sum of Lmax-Smooth) hold, and assume that f is µ-Polyak-
Lojasiewicz for some µ > 0. Consider (xt)t∈N a sequence generated by the (SGD) algorithm,
with a constant stepsize satisfying 0 < γ ≤ µ/(Lf Lmax). It follows that

    E[f(xt) − inf f] ≤ (1 − γµ)^t (f(x0) − inf f) + (γ Lf Lmax/µ) ∆*f.

Proof. Recall from Assumption (Sum of Lmax-Smooth) that f is Lf-smooth, so we can use
Lemma 2.25, together with the update rule of (SGD), to obtain:

    f(xt+1) ≤ f(xt) + ⟨∇f(xt), xt+1 − xt⟩ + (Lf/2)‖xt+1 − xt‖²
            = f(xt) − γ⟨∇f(xt), ∇fi(xt)⟩ + (Lf γ²/2)‖∇fi(xt)‖².

After taking the expectation conditioned on xt, we can use a variance transfer lemma together
with the Polyak-Lojasiewicz property to write

    E[f(xt+1) | xt] = f(xt) − γ‖∇f(xt)‖² + (Lf γ²/2) E[‖∇fi(xt)‖² | xt]
                    ≤ f(xt) − γ‖∇f(xt)‖² + γ²Lf Lmax(f(xt) − inf f) + γ²Lf Lmax ∆*f      (by Lemma 4.19)
                    ≤ f(xt) + γ(γLf Lmax − 2µ)(f(xt) − inf f) + γ²Lf Lmax ∆*f            (by Definition 2.17)
                    ≤ f(xt) − γµ(f(xt) − inf f) + γ²Lf Lmax ∆*f,

where in the last inequality we used our assumption on the stepsize to write γLf Lmax − 2µ ≤ −µ.
Note that µγ ≤ 1 because of our assumption on the stepsize and the fact that µ ≤ Lf ≤ Lmax (see
Remark 2.27). Subtracting inf f from both sides of the last inequality, and taking the expectation,
we obtain

    E[f(xt+1) − inf f] ≤ (1 − µγ) E[f(xt) − inf f] + γ²Lf Lmax ∆*f.

Recursively applying the above and summing up the resulting geometric series gives

    E[f(xt) − inf f] ≤ (1 − µγ)^t (f(x0) − inf f) + γ²Lf Lmax ∆*f Σ_{j=0}^{t−1} (1 − µγ)^j.

Using Σ_{j=0}^{t−1} (1 − µγ)^j = (1 − (1 − µγ)^t)/(µγ) ≤ 1/(µγ) in the above gives the claimed bound.

Corollary 5.10 (Õ(1/ε) Complexity). Consider the setting of Theorem 5.9. Let ε > 0 be given.
If we set

    γ = (µ/(Lf Lmax)) min{ ε/(2∆*f), 1 }, (50)

then

    t ≥ (Lf Lmax/µ²) max{ 2∆*f/ε, 1 } log(2(f(x0) − inf f)/ε)  =⇒  E[f(xt) − inf f] ≤ ε. (51)

Proof. Applying Lemma A.2 with A = (Lf Lmax/µ) ∆*f and α0 = f(x0) − inf f gives the result (51).

5.4 Bibliographic notes

The early and foundational works on SGD include [35, 28, 29, 40, 30, 18], though these references
are either for the non-smooth setting with Lipschitz losses, or are asymptotic. The first
non-asymptotic analysis of SGD in the smooth and convex setting that we are aware of is in [25],
closely followed by [38] under a different growth assumption. These results were later improved
in [26], where the authors removed the quadratic dependency on the smoothness constant and
considered importance sampling. The proof of Theorem 5.3 is a simplified version of [16, Theorem
D.6]. The proof of Theorem 5.7 is a simplified version of [17, Theorem 3.1]. The proof of Theorem
5.9 has been adapted from the proof of [16, Theorem 4.6].
For a general convergence theory for SGD in the smooth and non-convex setting we recommend
[20]. The definition of function noise that we use here was also taken from [20], which is also
where we first saw Lemma 4.19. Theorem 5.9, which relies on the Polyak-Lojasiewicz condition,
is based on the proof in [16], with the only difference being that we use function noise as
opposed to gradient noise. This Theorem 5.9 is also very similar to Theorem 3 in [20], with the
difference being that Theorem 3 in [20] is more general (it uses weaker assumptions), but also has
a more involved proof and a different stepsize.
An excellent reference for proof techniques for SGD focused on the online setting is the recent
book [32], which contains proofs for adaptive stepsizes such as AdaGrad and for coin-tossing-based
stepsizes.

Remark 5.11 (From finite sum to expectation). The theorems we prove here also holds when

31
the objective is a true expectation where

f (x) = Eξ∼D [f (x, ξ)] .

Further we have defined the Lmax smoothness as the largest smoothness constant of every f (x, ξ)
for every ξ. The gradient noise σf∗ is would now be given by

def
σf∗ = Eξ k∇f (x∗ , ξ)k2 .
 
inf
x∗ ∈argmin f

The function noise would now be given by


 
def
∆∗f = inf f − Eξ inf f (x, ξ)
x∈Rd x∈Rd

With these extended definitions we have that Theorems 5.5, 5.7 and 5.9 hold verbatim.
We also give some results for minimizing expectation for stochastic subgradient in Section 9.

6 Minibatch SGD
6.1 Definitions
When solving (Sum of Functions) in practice, an estimator of the gradient is often computed using
a small batch of functions, instead of a single one as in (SGD). More precisely, given a subset
B ⊂ {1, . . . , n}, we want to make use of

def 1 X
∇fB (xt ) = ∇fi (xt ).
|B|
i∈B

This leads to the minibatching SGD algorithm:

Algorithm 6.1 (MiniSGD). Let x0 ∈ Rd , let a batch size b ∈ {1, . . . , n}, and let γt > 0 be
a sequence of step sizes. The Minibatching Stochastic Gradient Descent (MiniSGD)
algorithm is given by the iterates (xt )t∈N where

Bt ⊂ {1, . . . n} Sampled uniformly among sets of size b


xt+1 = xt − γt ∇fBt (xt ). (52)

Definition 6.2 (Mini-batch distribution). We impose in this section that the batches B are
sampled uniformly among all subsets of size b in {1, . . . , n}. This means that each batch is
sampled with probability

1 (n − b)!b!
n = ,

b
n!

and that we will compute expectation and variance with respect to this uniform law. For instance

32
the expectation of the minibatched gradient writes as
1 X
Eb [∇fB (x)] = n
 ∇fB (x),
b B⊂{1,...,n}
|B|=b

and it is an exercise to verify that this is exactly equal to ∇f (x).

Mini-batching makes better use of parallel computational resources and it also speeds-up the
convergence of (SGD), as we show next. To do so, we will need the same central tools than for
(SGD), that is the notions of gradient noise, of expected smoothness, and a variance transfer
lemma.

Definition 6.3. Let Assumption (Sum of Lmax –Smooth) hold, and let b ∈ {1, . . . , n}. We define
the minbatch gradient noise as
def
σb∗ = inf Vb [∇fB (x∗ )] , (53)
x∗ ∈argmin f

where B is sampled according to Definition 6.2.

Definition 6.4. Let Assumption (Sum of Lmax –Smooth) hold, and let b ∈ {1, . . . , n}. We say
that f is Lb -smooth in expectation if
1
for all x, y ∈ Rd , Eb k∇fB (y) − ∇fB (x)k2 ≤ f (y) − f (x) − h∇f (x), y − xi ,
 
2Lb
where B is sampled according to Definition 6.2.

Lemma 6.5 (From single batch to minibatch). Let Assumptions (Sum of Lmax –Smooth) and
(Sum of Convex) hold. Then f is Lb -smooth in expectation with

n(b − 1) n−b
Lb = L+ Lmax , (54)
b(n − 1) b(n − 1)

and the minibatch gradient noise can be computed via


n−b ∗
σb∗ = σ . (55)
b(n − 1) f

Remark 6.6 (Minibatch interpolates between single and full batches). It is intersting to look
at variations of the expected smoothness constant Lb and minibatch gradient noise σb∗ when b
varies from 1 to n. For b = 1, where (MiniSGD) reduces to (SGD), we have that Lb = Lmax and
σb∗ = σf∗ , which are the constants governing the complexity of (SGD) as can be seen in Section 5.
On the other extreme, when b = n (MiniSGD) reduces to (GD), we see that Lb = L and σb∗ = 0.

33
We recover the fact that the behavior of (GD) is controlled by the Lipschitz constant L, and has
no variance.
We end this presentation with a variance transfer lemma, analog to Lemma 4.20 (resp. Lemma 2.29)
in the single batch (resp. full batch).

Lemma 6.7. Let Assumptions (Sum of Lmax –Smooth) and (Sum of Convex) hold. It follows
that
Eb k∇fB (x)k2 ≤ 4Lb (f (x) − inf f ) + 2σb∗ .
 

Proof of Lemmas 6.5 and 6.7. See Proposition 3.8 and 3.10 in [17].

6.2 Convergence for convex and smooth functions

Theorem 6.8. Let Assumptions (Sum of Lmax –Smooth) and (Sum of Convex) hold. Consider
(xt )t∈N a sequence generated by the (MiniSGD) algorithm, with a sequence of stepsizes satisfying
0 < γt < 2L1 b . It follows that
Pt−1 2
kx0 − x∗ k2 γ
E f (x̄t ) − inf f ≤ + Pt−1 k=0 k σb∗ ,
 
Pt−1
2 k=0 γk (1 − 2γk Lb ) k=0 γ k (1 − 2γ k Lb )
def Pt−1 def γk (1−2γk Lb )
where x̄t = k=0 pt,k x
k, with pt,k = Pt−1 .
i=0 γi (1−2γi Lb )

Proof. Let x∗ ∈ argmin f , so we have σb∗ = V[∇fB (x∗ )]. Let us start by analyzing the behaviour
of kxk − x∗ k2 . By developing the squares, we obtain

kxk+1 − x∗ k2 = kxk − x∗ k2 − 2γk h∇fBt (xk ), xk − x∗ i + γk2 k∇fBt (xk )k2

Hence, after taking the expectation conditioned on xk , we can use the convexity of f and the
variance transfer lemma to obtain:
h i h i
Eb kxk+1 − x∗ k2 | xk = kxk − x∗ k2 + 2γk h∇f (xk ), x∗ − xk i + γk2 Eb k∇fBt (xk )k2 | xk
Lem. 2.8 & 6.7
≤ kxk − x∗ k2 + 2γk (2γk Lb − 1)(f (xk ) − inf f )) + 2γk2 σb∗ .

Rearranging and taking expectation, we have


h i h i
2γk (1 − 2γk Lb )Eb f (xk ) − inf f ≤ Eb kxk − x∗ k2 − Eb kxt+1 − x∗ k2 + 2γk2 σb∗ .
 

Summing over k = 0, . . . , t − 1 and using telescopic cancellation gives


t−1
X h i t−1
X
γk (1 − 2γk Lb )Eb f (xk ) − inf f ≤ kx0 − x∗ k2 − Eb kxt − x∗ k2 + 2σb∗ γk2 .
 
2
k=0 k=0

Since Eb kxt − x∗ k2 ≥ 0, dividing both sides by 2 t−1


  P
i=0 γi (1 − 2γi Lb ) gives:
t−1
" #
kx0 − x∗ k2 σb∗ t−1 2
P
X γk (1 − 2γk Lb ) k k=0 γk
Eb Pt−1 (f (x ) − inf f ) ≤ Pt−1 + Pt−1 .
k=0 i=0 γi (1 − 2γi Lb ) 2 i=0 γi (1 − 2γi Lb ) i=0 γi (1 − 2γi Lb )

34
Finally, define for k = 0, . . . , t − 1

def γk (1 − 2γk Lb )
pt,k = Pt−1
i=0 γi (1 − 2γi Lb )
Pt−1
and observe that pt,k ≥ 0 and k=0 pt,k = 1. This allows us to treat the (pt,k )t−1 k=0 as if it was a
probability vector. Indeed, using that f is convex together with Jensen’s inequality gives
t−1
" #
X γ k (1 − 2γ k Lb )
Eb f (x̄t ) − inf f ≤ (f (xk ) − inf f )
 
Eb Pt−1
k=0 i=0 γ i (1 − 2γ i L b )
kx − x∗ k2
0 σb∗ t−1 2
P
k=0 γk
≤ Pt−1 + Pt−1 .
2 i=0 γi (1 − 2γi Lb ) i=0 γi (1 − 2γi Lb )

Theorem 6.9. Let Assumptions (Sum of Lmax –Smooth) and (Sum of Convex) hold. Consider
(xt )t∈N a sequence generated by the (MiniSGD) algorithm with a sequence of stepsizes γt > 0.

1
1. If γt = γ < 2Lb , then for every t ≥ 1 we have that

kx0 − x∗ k2 1 γ
E f (x̄t ) − inf f ≤ σ∗,
 
+
2γ(1 − 2γLb ) t 1 − 2γLb b
def 1 Pt−1
where x̄t = t k=0 x
k.

2. If γt = √γ with γ ≤ 1
t+1 2Lb , then for every t we have that

 kx0 − x∗ k2 γ 2 log(t) ∗
 
log(t)
E f (x̄t ) − inf f ≤

√ + √ σb = O √ .
2γ t γ t t
def Pt−1 def γk (1−2γk Lb )
where x̄t = k
k=0 pt,k x , with pt,k = Pt−1 .
i=0 γi (1−2γi Lb )

Corollary 6.10 (O(1/2 ) Complexity). Consider the setting of Theorem 6.8. Let  ≥ 0 and let
γ = 2L1 b √1t . It follows that for t ≥ 4 that

σb∗ 2
 
1 0 ∗ 2
=⇒ E f (x̄t ) − inf f ≤ .
 
t≥ 2 2Lb kx − x k +
 Lb

Proof. The proof is a consequence of Theorem 6.8 and follows exactly the lines of the proofs of
Theorem 5.5 and Corollary 5.6.

6.3 Rates for strongly convex and smooth functions

35
Theorem 6.11. Let Assumptions (Sum of Lmax –Smooth) and (Sum of Convex) hold, and as-
sume further that f is µ-strongly convex. Consider (xt )t∈N a sequence generated by the (Min-
iSGD) algorithm, with a constant sequence of stepsizes γt ≡ γ ∈]0, 2L1 b ]. Then

2γσb∗
Eb kxt − x∗ k2 ≤ (1 − γµ)t kx0 − x∗ k2 +
 
.
µ

Proof. Let x∗ ∈ argmin f , so that σb∗ = Vb [∇fB (x∗ )]. Expanding the squares we have

(M iniSGD)
kxt+1 − x∗ k2 = kxt − x∗ − γ∇fBt (xt )k2
= kxt − x∗ k2 − 2γhxt − x∗ , ∇fBt (xt )i + γ 2 k∇fBt (xt )k2 .

Taking expectation conditioned on xt and using Eb [∇fB (x)] = ∇f (x) (see Remark 6.2), we obtain

Eb kxt+1 − x∗ k2 | xt kxt − x∗ k2 − 2γhxt − x∗ , ∇f (xt )i + γ 2 Eb k∇fBt (xt )k2 | xt


   
=
Lem.2.14
(1 − γµ)kxt − x∗ k2 − 2γ[f (xt ) − inf f ] + γ 2 Eb k∇fBt (xt )k2 | xt .
 

Taking expectations again and using Lemma 6.7 gives

Eb kxt+1 − x∗ k2 ≤ (1 − γµ)Eb kxt − x∗ k2 + 2γ 2 σb∗ + 2γ(2γLb − 1)E[f (xt ) − inf f ]


   

≤ (1 − γµ)Eb kxt − x∗ k2 + 2γ 2 σb∗ ,


 

1
where we used in the last inequality that 2γLb ≤ 1 since γ ≤ 2Lb . Recursively applying the above
and summing up the resulting geometric series gives
t−1
X
Eb kxt − x∗ k2 ≤ (1 − γµ)t kx0 − x∗ k2 + 2 (1 − γµ)k γ 2 σb∗
 

k=0
2γσb∗
≤ (1 − γµ)t kx0 − x∗ k2 + .
µ

Corollary 6.12 (Õ(1/) Complexity). Consider the setting of Theorem 6.11. Let  > 0 be
given. Hence, given any  > 0, choosing stepsize
 
1 µ
γ = min , , (56)
2Lb 4σb∗

and
2Lb 4σb∗ 2kx0 − x∗ k2
   
t ≥ max , log =⇒ Ekxt − x∗ k2 ≤ . (57)
µ µ2 

2σb∗
Proof. Applying Lemma A.2 with A = µ , C = 2Lb and α0 = kx0 −x∗ k2 gives the result (57).

36
6.4 Bibliographic Notes
The SGD analysis in [26] was later extended to a mini-batch analysis [27], but restricted to mini-
batches that are disjoint partitions of the data. Our results on mini-batching in Section 6 are
instead taken from [17]. We choose to adapt the proofs from [17] since these proofs allow for
sampling with replacement. The smoothness constant in (54) was introduced in [15] and this
particular formula was conjectured in [11].

7 Stochastic Momentum
For most, if not all, machine learning applications SGD is used with momentum. In the machine
learning community, the momentum method is often written as follows

Algorithm 7.1 (Momentum). Let Assumption (Sum of Lmax –Smooth) hold. Let x0 ∈ Rd and
m−1 = 0, let (γt )t∈N ⊂]0, +∞[ be a sequence of stepsizes, and let (βt )t∈N ⊂ [0, 1] be a sequence
of momentum parameters. The Momentum algorithm defines a sequence (xt )t∈N satisfying for
every t ∈ N

mt = βt mt−1 + ∇fit (xt ),


xt+1 = xt − γt mt . (58)

At the end of this section we will see in Corollary 7.4 that in the convex setting, the sequence xt
generated by the (Momentum) algorithm has a complexity rate of O(1/ε2 ). This is an improvement
with respect to (SGD), for which we only know complexity results about the average of the iterates,
see Corollary 5.6.

7.1 The many ways of writing momentum


In the optimization community the momentum method is often written in the heavy ball format
which is
xt+1 = xt − γ̂t ∇fit (xt ) + β̂t (xt − xt−1 ), (59)
where β̂t ∈ [0, 1] is another momentum parameter, it ∈ {1, . . . , n} is sampled uniformly and
i.i.d at each iteration. These two ways of writing down momentum in (Momentum) and (59) are
equivalent, as we show next.

Lemma 7.2. The algorithms (Momentum) and Heavy Ball (given by (59)) are the equivalent.
More precisely, if (xt )t∈N is generated by (Momentum) from parameters γt , βt , then it verifies
(59) by taking γ̂t = γt and β̂t = γγt−1
t βt
, where γ−1 = 0.

Proof. Starting from (Momentum) we have that

xt+1 = xt − γt mt
= xt − γt βt mt−1 − γt ∇fi (xt ).

37
xt−1 −xt
Now using (58) at time t − 1 we have that mt−1 = γt−1 which when inserted in the above gives

γt βt t−1
xt+1 = xt − (x − xt ) − γt ∇fi (xt ).
γt−1
γt βt
The conclusion follows by taking γ̂t = γt and β̂t = γt−1 .

There is yet a third equivalent way of writing down the momentum method that will be useful
in establishing convergence.

Lemma 7.3. The algorithm (Momentum) is equivalent to the following iterate-moving-average


(IMA) algorithm : start from z −1 = 0 and iterate for t ∈ N

z t = z t−1 − ηt ∇fit (xt ), (60)


λt+1 1
xt+1 = xt + zt. (61)
λt+1 + 1 λt+1 + 1

More precisely, if (xt )t∈N is generated by (Momentum) from parameters γt , βt , then it verifies
(IMA) by chosing any parameters (ηt , λt ) satisfying

γt−1 λt
βt λt+1 = − βt , ηt = (1 + λt+1 )γt , and z t−1 = xt + λt (xt − xt−1 ).
γt

Proof. Let (xt )t∈N be generated by (Momentum) from parameters γt , βt . By definition of z t , we


have
z t = (1 + λt+1 )xt+1 − λt+1 xt , (62)

which after dividing by (1 + λt+1 ) directly gives us (61). Now use Lemma 7.2 to write that

xt+1 = xt − γt ∇fit (xt ) + β̂t (xt − xt−1 )

γt βt
where β̂t = γt−1 . Going back to the definition of z t , we can write

zt = (1 + λt+1 )xt+1 − λt+1 xt


= (1 + λt+1 )(xt − γt ∇fit (xt ) + β̂t (xt − xt−1 )) − λt+1 xt
= xt − (1 + λt+1 )γt ∇fit (xt ) + (1 + λt+1 )β̂t (xt − xt−1 )
= (1 + λt )xt − λt xt−1 − ηt ∇fit (xt )
(62)
= z t−1 − ηt ∇fit (xt ),

where in the last but one equality we used the fact that
γt βt
(1 + λt+1 )γt = ηt and (1 + λt+1 )β̂t = (1 + λt+1 ) = λt .
γt−1

38
7.2 Convergence for convex and smooth functions

Theorem 7.4. Let Assumptions (Sum of Lmax –Smooth) and (Sum of Convex) hold. Consider
(xt )t∈N the iterates generated by the (Momentum) algorithm with stepsize and momentum pa-
rameters taken according to
2η t 1
γt = , βt = , with η≤ .
t+3 t+2 4Lmax
Then the iterates converge according to
 kx0 − x∗ k2
E f (xt ) − inf f ≤ + 2ησf∗ .

(63)
η (t + 1)

Proof. For the proof, we rely on the iterate-moving-average (IMA) viewpoint of momentum given
in Lemma 7.3. It is easy to verify that the parameters
t
ηt = η, λt = and z t−1 = xt + λt (xt − xt−1 )
2
verify the conditions of Lemma 7.3. Let us then consider the iterates (xt , z t ) of (IMA), and we
start by studing the variations of kz t − x∗ k2 . Expanding squares we have for t ∈ N that

kz t − x∗ k2 = kz t−1 − x∗ − η∇fit (xt )k2


= kz t−1 − x∗ k2 + 2ηh∇fit (xt ), x∗ − z t−1 i + η 2 k∇fit (xt )k2
(61)
= kz t−1 − x∗ k2 + 2ηh∇fit (xt ), x∗ − xt i + 2ηλt h∇fit (xt ), xt−1 − xt i + η 2 k∇fit (xt )k2 .

In the last equality we made appear xt−1 which , for t = 0, can be taken equal to zero. Then
taking conditional expectation, using the convexity of f (via Lemma 2.8) and a variance transfer
lemma (Lemma 4.20), we have

E kz t − x∗ k2 | xt = kz t−1 − x∗ k2 + 2ηh∇f (xt ), x∗ − xt i + 2ηλt h∇f (xt ), xt−1 − xt i + η 2 Et k∇fit (xt )k2 | xt ,
   

≤ kz t−1 − x∗ k2 + (4η 2 Lmax − 2η) f (xt ) − inf f + 2ηλt f (xt−1 ) − f (xt ) + 2η 2 σf∗
 

= kz t−1 − x∗ k2 − 2η (1 + λt − 2ηLmax ) f (xt ) − inf f + 2ηλt f (xt−1 ) − inf f + 2η 2 σf∗


 

≤ kz t−1 − x∗ k2 − 2ηλt+1 f (xt ) − inf f + 2ηλt f (xt−1 ) − inf f + 2η 2 σf∗ .


 

where we used the facts that η ≤ 4L1max and λt + 12 = λt+1 in the last inequality. Taking now
expectation and summing over t = 0, . . . , T , we have after telescoping and cancelling terms

E kz T − x∗ k2 ≤ kz −1 − x∗ k2 − 2ηλT +1 E f (xT ) − inf f + 2ηλ0 f (x−1 ) − inf f + 2η 2 σf∗ (T + 1).


    

Now, the fact that λ0 = 0 cancels one term, and also implies that z −1 = x0 + λ0 (x0 − x−1 ) = x0 .
After dropping the positive term E kz T − x∗ k2 , we obtain
 

2ηλT +1 E f (xT ) − inf f ≤ kx0 − x∗ k2 + 2(T + 1)σf∗ η 2 .


 

39
Dividing through by 2ηλT +1 , where our assumption on the parameters gives 2λT +1 = T + 1, we
finally conclude that for all T ∈ N
 kx0 − x∗ k2
E f (xT ) − inf f ≤ + 2σf∗ η.

η(T + 1)

Corollary 7.5 (O(1/2 ) Complexity). Consider the setting of Theorem 7.4. We can guarantee
that E f (xT ) − inf f ≤  provided that we take
 

 2
1 1 1 1 1
η=√ and T ≥ 2 2 kx0 − x∗ k2 + σf∗ (64)
T + 1 4Lmax  4Lmax 2

Proof. By plugging η = √1 1
into (63), we obtain
T +1 4Lmax
 
1 1 1 ∗ 2 ∗
E [f (xT ) − inf f ] ≤ √ kx0 − x k + σf . (65)
2Lmax T + 1 2

The final complexity (64) follows by demanding that (65) be less than .

7.3 Bibliographic notes


This section is based on [39]. The deterministic momentum method was designed for strongly
convex functions [34]. The authors in [12] showed that the deterministic momentum method
converged globally and sublinearly for smooth and convex functions. Theorem 7.4 is from [39],
which in turn is an extension of the results in [12]. For convergence proofs for momentum in the
non-smooth setting see [7].

8 Theory : Nonsmooth functions


In this section we present the tools needed to handle nonsmooth functions. “Nonsmoothness” arise
typically in two ways.

1. Continuous functions having points of nondifferentiability. For instance:


Pd
• the L1 norm kxk1 = i=1 |xi |. It is often used as a regularizer that promotes sparse
minimizers.
• the ReLU σ(t) = 0 if t ≤ 0, t if t ≥ 0. It is often used as the activation function for
neural networks, making the associated loss nondifferentiable.

2. Differentiable functions not being defined on the entire space. An other way to say it is that
they take the value +∞ outside of their domain. This can be seen as nonsmoothness, as the
behaviour of the function at the boundary of the domain can be degenerate.

40
• The most typical example is the indicator function of some constraint C ⊂ Rd , and
which is defined as δC (x) = 0 if x ∈ C, +∞ if x ∈
/ C. Such function is useful because
it allows to say that minimizing a function f over the constraint C is the same as
minimizing the sum f + δC .

8.1 Real-extended valued functions


Definition 8.1. Let f : Rd → R ∪ {+∞}.
def
1. The domain of f is defined by dom f = {x ∈ Rd | f (x) < +∞}.

2. We say that f is proper if dom f 6= ∅.

Definition 8.2. Let f : Rd → R∪{+∞}, and x̄ ∈ Rd . We say that f is lower semi-continuous


at x̄ if
f (x̄) ≤ lim inf f (x).
x→x̄

We say that f is lower semi-continuous (l.s.c. for short) if f is lower semi-continuous at every
x̄ ∈ Rd .

Example 8.3 (Most functions are proper l.s.c.).

• If f : Rd → R is continuous, then it is proper and l.s.c.

• If C ⊂ Rd is closed and nonempty, then its indicator function δC is proper and l.s.c.

• A finite sum of proper l.s.c functions is proper l.s.c.

As hinted by the above example, it is safe to say that most functions used in practice are proper
and l.s.c.. It is a minimal technical assumption which is nevertheless needed for what follows (see
e.g. Lemmas 8.9 and 8.11).

8.2 Subdifferential of nonsmooth convex functions


We have seen in Lemma 2.8 that for differentiable convex functions, ∇f (x) verifies inequality (2).
For non-differentiable (convex) functions f , we will use this fact as the basis to define a more
general notion : subgradients.

Definition 8.4. Let f : Rd → R ∪ {+∞}, and x ∈ Rd . We say that η ∈ Rd is a subgradient of


f at x ∈ Rd if
for every y ∈ Rd , f (y) − f (x) − hη, y − xi ≥ 0. (66)

We denote by ∂f (x) the set of all subgradients at x, that is :


def
∂f (x) = {η ∈ Rd | for all y ∈ Rd , f (y) − f (x) − hη, y − xi ≥ 0} ⊂ Rd . (67)

41
def
We also call ∂f (x) the subdifferential of f . Finally, define dom ∂f = {x ∈ Rd | ∂f (x) 6= ∅}.

Subgradients are guaranteed to exist whenever f is convex and continuous.

Lemma 8.5. If f : Rd → R ∪ {+∞} is a convex function and continuous at x ∈ Rd , then


∂f (x) 6= ∅. This is always true if dom f = Rd .

Proof. See [33, Proposition 3.25] and [2, Corollary 8.40].

If f is differentiable, then ∇f (x) is the unique subgradient at x, as we show next. This means
that the subdifferential is a faithful generalization of the gradient.

Lemma 8.6. If f : Rd → R ∪ {+∞} is a convex function that is is differentiable at x ∈ Rd , then


∂f (x) = {∇f (x)}.

Proof. (Proof adapted from [33, Proposition 3.20]). From Lemma 2.8 we have that ∇f (x) ∈
∂f (x). Suppose now that η ∈ ∂f (x), and let us show that η = ∇f (x). For this, take any v ∈ Rd
and t > 0, and Definition 8.4 to write
f (x + tv) − f (x)
f (x + tv) − f (x) − hη, (x + tv) − xi ≥ 0 ⇔ ≥ hη, vi.
t
Taking the limit when t ↓ 0, we obtain that

for all v ∈ Rd , h∇f (x), vi ≥ hη, vi.

By choosing v = η − ∇f (x), we obtain that k∇f (x) − ηk2 ≤ 0 which in turn allows us to conclude
that ∇f (x) = η.

Remark 8.7. As hinted by the previous results and comments , this definition of subdifferential
is tailored for nonsmooth convex functions. There exists other notions of subdifferential which are
better suited for nonsmooth nonconvex functions. But we will not discuss it in this monograph,
for the sake of simplicity. The reader interested in this topic can consult [6, 36].

Proposition 8.8 (Fermat’s Theorem). Let f : Rd → R ∪ {+∞}, and x̄ ∈ Rd . Then x̄ is a


minimizer of f if and only if 0 ∈ ∂f (x̄).

Proof. From the Definition 8.4, we see that

x̄ is a minimizer of f
⇔ for all y ∈ Rd , f (y) − f (x̄) ≥ 0
⇔ for all y ∈ Rd , f (y) − f (x̄) − h0, y − xi ≥ 0
(66)
⇔ 0 ∈ ∂f (x̄).

42
Lemma 8.9 (Sum rule). Let f : Rd → R be convex and differentiable. Let g : Rd → R ∪ {+∞}
be proper l.s.c. convex. Then, for all x ∈ Rd , ∂(f + g)(x) = {∇f (x)} + ∂g(x).

Proof. See [33, Theorem 3.30].

Lemma 8.10 (Positive homogeneity). Let f : Rd → R be proper l.s.c. convex. Let x ∈ Rd , and
γ ≥ 0. Then ∂(γf )(x) = γ∂f (x).

Proof. It is an immediate consequence of Definition 8.4.

8.3 Nonsmooth strongly convex functions


In this context Lemma 2.13 remains true: Strongly convex functions do not need to be continuous
to have a unique minimizer:

Lemma 8.11. If f : Rd → R∪{+∞} is a proper l.s.c. µ-strongly convex function, then f admits
a unique minimizer.

Proof. See [33, Corollary 2.20].

We also have an obvious analogue to Lemma 2.14:

Lemma 8.12. If f : Rd → R ∪ {+∞} is a proper l.s.c and µ-strongly convex function, then for
every x, y ∈ Rd , and for every η ∈ ∂f (x) we have that
µ
f (y) − f (x) − hη, y − xi ≥ ky − xk2 . (68)
2

Proof. Define g(x) := f (x) − µ2 kxk2 . According to Lemma 2.12, g is convex. It is also clearly l.s.c.
and proper, by definition. According to the sum rule in Lemma 8.9, we have ∂f (x) = ∂g(x) + µx.
Therefore we can use the convexity of g with Definition 8.4 to write
µ µ µ
f (y) − f (x) − hη, y − xi ≥ kyk2 − kxk2 − hµx, y − xi = ky − xk2 .
2 2 2

8.4 Proximal operator


In this section we study a key tool used in some algorithms for minimizing nonsmooth functions.

Definition 8.13. Let g : Rd → R ∪ {+∞} be a proper l.s.c convex function. We define the
proximal operator of g as the function proxg : Rd → Rd defined by

1
proxg (x) := argmin g(x0 ) + kx0 − xk2 (69)
x0 ∈Rd 2

The proximal operator is well defined because, since g(x0 ) is convex the sum g(x0 ) + 21 kx0 − xk2 is
strongly convex in x0 . Thus there exists only one minimizer (recall Lemma 8.11).

43
Example 8.14 (Projection is a proximal operator). Let C ⊂ Rd be a nonempty closed convex
set, and let δC be its indicator function. Then the proximal operator of δC is exactly the projection
operator onto C:
def
proxδC (x) = projC (x) = argmin kc − xk2 .
c∈C

The proximal operator can be characterized with the subdifferential :

Lemma 8.15. Let g : Rd → R ∪ {+∞} be a proper l.s.c convex function, and let x, p ∈ Rd .
Then p = proxg (x) if and only if
x − p ∈ ∂g(p). (70)

Proof. From Definition 8.13 we know that p = proxf (x) if and onlny if p is the minimizer of
φ(u) := g(u) + 21 ku − xk2 . From our hypotheses on g, it is clear that φ is proper l.s.c convex. So
we can use Proposition 8.8 to say that it is equivalent to 0 ∈ ∂φ(p). Moreover, we can use the sum
rule from Lemma 8.9 to write that ∂φ(p) = ∂g(p)+{p−x}. So we have proved that p = proxg (x) if
and only if 0 ∈ ∂g(p) + {p − x}, which is what we wanted to prove, after rearranging the terms.

We show that, like the projection, the proximal operator is 1-Lipschitz (we also say that it is
non-expansive). This property will be very interesting for some proofs since it will allow us to “get
rid” of the proximal terms.

Lemma 8.16 (Non-expansiveness). Let g : Rd → R ∪ {+∞} be a proper l.s.c convex function.


Then proxg : Rd → Rd is 1-Lipschitz :

for all x, y ∈ Rd , kproxg (y) − proxg (x)k ≤ ky − xk. (71)

def def
Proof. Let py = proxg (y) and px = proxg (x). From px = proxg (x) we have x − px ∈ ∂g(px ) (see
Lemma 8.15), so from the definition of the subdifferential (Definition 8.4), we obtain

g(py ) − g(px ) − hx − px , py − px i ≥ 0.

Similarly, from py = proxg (y) we also obtain

g(px ) − g(py ) − hy − py , px − py i ≥ 0.

Adding together the above two inequalities gives

hy − x − py + px , px − py i ≤ 0.

Expanding the left argument of the inner product, and using the Cauchy-Schwartz inequality gives

kpx − py k2 ≤ hx − y, px − py i ≤ kx − ykkpx − py k.

Dividing through by kpx − py k (assuming this is non-zero otherwise (71) holds trivially) we
have (71).

44
We end this section with an important property of the proximal operator : it can help to
characterize the minimizers of composite functions as fixed points.

Lemma 8.17. Let f : Rd → R be convex differentiable, let g : Rd → R ∪ {+∞} be proper l.s.c.


convex. If x∗ ∈ argmin(f + g), then

for all γ > 0, proxγg (x∗ − γ∇f (x∗ )) = x∗ .

Proof. Since x∗ ∈ argmin(f + g) we have that 0 ∈ ∂(f + g)(x∗ ) = ∇f (x∗ ) + ∂g(x∗ ) (Proposition
8.8 and Lemma 8.9). By multiplying both sides by γ then by adding x∗ to both sides gives

(x∗ − γ∇f (x∗ )) − x∗ ∈ ∂(γg)(x∗ ).

According to Lemma 8.15, this means that proxγg (x∗ − γ∇f (x∗ )) = x∗ .

8.5 Controlling the variance

Definition 8.18. Let f : Rd → R be a differentiable function. We define the (Bregman)


divergence of f between y and x as
def
Df (y; x) = f (y) − f (x) − h∇f (x), y − xi . (72)

Note that the divergence Df (y; x) is always nonegative when f is convex due to Lemma 2.8.
Moreover, the divergence is also upper bounded by suboptimality.

Lemma 8.19. Let f : Rd → R be convex differentiable, and g : Rd → R ∪ {+∞} be proper l.s.c.


convex, and F = g + f . Then, for all x∗ ∈ argmin F , for all x ∈ Rd ,

0 ≤ Df (x; x∗ ) ≤ F (x) − inf F.

Proof. Since x∗ ∈ argmin F , we can use the Fermat Theorem (Proposition 8.8) and the sum rule
(Lemma 8.9) to obtain the existence of some η ∗ ∈ ∂g(x∗ ) such that ∇f (x∗ ) + η ∗ = 0. Use now the
definition of the Bregman divergence, and the convexity of g (via Lemma 2.8) to write

Df (x; x∗ ) = f (x) − f (x∗ ) − h∇f (x∗ ), x − x∗ i = f (x) − f (x∗ ) + hη ∗ , x − x∗ i


≤ f (x) − f (x∗ ) + g(x) − g(x∗ )
= F (x) − F (x∗ ).

Next we provide a variance transfer lemma, generalizing Lemma 4.20, which will prove to be
useful when dealing with nonsmooth sum of functions in Section 11.

45
Lemma 8.20 (Variance transfer - General convex case). Let f verify Assumptions (Sum of
Lmax –Smooth) and (Sum of Convex). For every x, y ∈ Rd , we have

V [∇fi (x)] ≤ 4Lmax Df (x; y) + 2V [∇fi (y)] .

Proof. Simply use successively Lemma 4.6, the inequality ka + bk2 ≤ 2kak2 + 2kbk2 , and the
expected smoothness (via Lemma 4.7) to write:

V [∇fi (x)] ≤ E k∇fi (x) − ∇f (y)k2


 

≤ 2E k∇fi (x) − ∇fi (y)k2 + 2E k∇fi (y) − ∇f (y)k2


   

= 2E k∇fi (x) − ∇fi (y)k2 + 2V [∇fi (y)]


 

≤ 4Lmax Df (x; y) + 2V [∇fi (y)] .

Definition 8.21. Let f : Rd → R verify Assumption (Sum of Lmax –Smooth). Let g : Rd →


R ∪ {+∞} be proper l.s.c convex. Let F = g + f be such that argmin F 6= ∅. We define the
composite gradient noise as follows
def
σF∗ = inf V [∇fi (x∗ )] . (73)
x∗ ∈argmin F

Note the difference between σf∗ introduced in Definition 4.16 and σF∗ introduced here is that
the variance of gradients taken at the minimizers of the composite sum F , as opposed to f .

Lemma 8.22. Let f : Rd → R verify Assumptions (Sum of Lmax –Smooth) and (Sum of Convex).
Let g : Rd → R ∪ {+∞} be proper l.s.c convex. Let F = g + f be such that argmin F 6= ∅.

1. σF∗ ≥ 0.

2. σF∗ = V [∇fi (x∗ )] for every x∗ ∈ argmin F .

3. If σF∗ = 0 then there exists x∗ ∈ argmin F such that x∗ ∈ argmin (g +fi ) for all i = 1, . . . , n.
The converse implication is also true if g is differentiable at x∗ .

4. σF∗ ≤ 4Lmax (f (x∗ ) − inf f )) + 2σf∗ , for every x∗ ∈ argmin F .

Proof. Item 1 is trivial. For item 2, consider two minimizers x∗ , x0 ∈ argmin F , and use the
expected smoothness of f (via Lemma 4.7) together with Lemma 8.19 to write
1
E k∇fi (x∗ ) − ∇fi (x0 )k2 ≤ Df (x∗ ; x0 ) ≤ F (x∗ ) − inf F = 0.
 
2Lmax
In other words, we have ∇fi (x∗ ) = ∇fi (x0 ) for all i = 1, . . . , n, which means that indeed V [∇fi (x∗ )] =
V [∇fi (x0 )]. Now we turn to item 3, and start by assuming that σF∗ = 0. Let x∗ ∈ argmin F , and
we know from the previous item that V [∇fi (x∗ )] = 0. This is equivalent to say that, for every i,

46
∇fi (x∗ ) = ∇f (x∗ ). But x∗ being a minimizer implies that −∇f (x∗ ) ∈ ∂g(x∗ ) (use Proposition
8.8 and Lemma 8.9). So we have that −∇fi (x∗ ) ∈ ∂g(x∗ ), from which we conclude by the same
arguments that x∗ ∈ argmin (g + fi ). Now let us prove the converse implication, by assuming
further that g is differentiable at x∗ . From the assumption x∗ ∈ argmin (g + fi ), we deduce that
−∇fi (x∗ ) ∈ ∂g(x∗ ) = ∇g(x∗ ) (see Lemma 8.6). Taking the expectation on this inequality also
gives us that −∇f (x∗ ) = ∇g(x∗ ). In other words, ∇fi (x∗ ) = ∇f (x∗ ) for every i. We can then
conclude that V [∇fi (x∗ )] = 0. We finally turn to item 4, which is a direct consequence of Lemma
8.20 (with x = x∗ ∈ argmin F and y = x∗f ∈ argmin f ) :

σF∗ = V [∇fi (x∗ )]


≤ 4Lmax Df (x∗ ; x∗f ) + 2V ∇fi (x∗f )
 

= 4Lmax (f (x∗ ) − inf f )) + 2σf∗ .

9 Stochastic Subgradient Descent

Problem 9.1 (Stochastic Function). We want to minimize a function f : Rd → R which writes


as
def
f (x) = ED [fξ (x)] ,

where D is some distribution over Rq , ξ ∈ Rq is sampled from D, and fξ : Rd → R. We require


that the problem is well-posed, in the sense that argmin f 6= ∅.

In this section we assume that the functions fξ are convex and have bounded subgradients.

Assumption 9.2 (Stochastic convex and G–bounded subgradients). We consider the problem
(Stochastic Function) and assume that

• (convexity) for every ξ ∈ Rq , the function fξ is convex ;

• (subgradient selection) there exists a function g : Rd × Rq → Rd such that, for every


(x, ξ) ∈ Rd × Rq , g(x, ξ) ∈ ∂fξ (x) and ED [g(x, ξ)] is well-defined ;

• (bounded subgradients) there exists G > 0 such that, for all x ∈ Rd , ED kg(x, ξ)k2 ≤ G2 .
 

We now define the Stochastic Subgradient Descent algorithm, which is an extension of (SGD).
Instead of considering the gradient of a function fi , we consider here some subgradient of fξ .

Algorithm 9.3 (SSD). Consider Problem (Stochastic Function) and let Assumption (Stochastic
convex and G–bounded subgradients) hold. Let x0 ∈ Rd , and let γt > 0 be a sequence of stepsizes.

47
The Stochastic Subgradient Descent (SSD) algorithm is given by the iterates (xt )t∈N where

ξt ∈ Rq Sampled i.i.d. ξt ∼ D (74)


t+1 t t t t
x = x − γt g(x , ξt ), with g(x , ξt ) ∈ ∂fξt (x ). (75)

In (SGD), the sampled subgradient g(xt , ξt ) is an unbiaised estimator of a subgradient of f at


xt .

Lemma 9.4. If Assumption (Stochastic convex and G–bounded subgradients) holds then for all
x ∈ Rd , we have ED [g(x, ξ)] ∈ ∂f (x).

Proof. With Assumption (Stochastic convex and G–bounded subgradients) we can use the fact
that g(x, ξ) ∈ ∂fξ (x) to write

fξ (y) − fξ (x) − hg(x, ξ), y − xi ≥ 0.

Taking expectation with respect to ξ ∼ D, and using the fact that ED [g(x, ξ)] is well-defined, we
obtain
f (y) − f (x) − hED [g(x, ξ)] , y − xi ≥ 0,

which proves that ED [g(x, ξ)] ∈ ∂f (x).

9.1 Convergence for convex functions and bounded gradients

Theorem 9.5. Let Assumption (Stochastic convex and G–bounded subgradients) hold, and
consider (xt )t∈N a sequence generated by the (SSD) algorithm, with a sequence of stepsizes
def 1 PT −1
γt > 0. Then for x̄T = PT −1 t=0 γt xt we have
k=0 γk

PT −1 2 2
 T
 kx0 − x∗ k2 t=0 γt G
ED f (x̄ ) − inf f ≤ PT −1 + P −1
. (76)
2 t=0 γk 2 Tt=0 γk

Proof. Expanding the squares we have that


(SSD)
kxt+1 − x∗ k2 = kxt − x∗ − γt g(xt , ξt )k2
= kxt − x∗ k2 − 2γt g(xt , ξt ), xt − x∗ + γt2 kg(xt , ξt )k2 .

We will use the fact that our subgradients are bounded from Assumption (Stochastic convex and G–
bounded subgradients), and that ED g(xt , ξt ) | xt ∈ ∂f (xt ) (see Lemma 9.4). Taking expectation
 

conditioned on xt we have that

ED kxt+1 − x∗ k2 | xt kxt − x∗ k2 − 2γt ED g(xt , ξt ) | xt , xt − x∗ + γt2 ED kg(xt , ξt )k2 | xt


     
=
kxt − x∗ k2 − 2γt ED g(xt , ξt ) | xt , xt − x∗ + γt2 G2
 

(66)
≤ kxt − x∗ k2 − 2γt (f (xt ) − inf f ) + γt2 G2 . (77)

48
Re-arranging, taking expectation and summing up from t = 0, . . . , T − 1 gives
T −1 −1 −1
X
t
 TX  t ∗ 2
 t+1 ∗ 2
 TX
γt2 G2
 
2 γt ED f (x ) − inf f ≤ ED kx − x k − ED kx −x k +
t=0 t=0 t=0
T
X −1
= ED kx0 − x∗ k2 − ED kxT − x∗ k2 + γt2 G2
   
t=0
T
X −1
≤ kx0 − x∗ k2 + γt2 G2 . (78)
t=0

1 PT −1 PT −1
Let x̄T = PT −1 t=0 γt xt . Dividing through by 2 k=0 γk and using Jensen’s inequality we
k=0 γk
have
T −1
Jensen 1 X
ED f (x̄T ) − inf f γt ED f (xt ) − inf f )
   
≤ PT −1
k=0 γk t=0

PT −1 2 2
kx0 − x k2 t=0 γt G
≤ PT −1 + −1
. (79)
2 Tk=0
P
2 k=0 γk γk

In the next Theorem we specialize our previous estimate by considering


  a suitably decreasing
ln(T )
sequence of stepsize from which we obtain a convergence rate O √T . In Corollary 9.7 we will
 
consider a suitable constant stepsize leading to a finite-horizon rate of O √1T , which will traduce
in a O ε12 complexity.


Theorem 9.6. Let Assumption (Stochastic convex and G–bounded subgradients) hold, and
consider (xt )t∈N a sequence generated by the (SSD) algorithm, with a sequence of stepsizes
def γ def 1 PT −1
γt = √t+1 for some γ > 0. We have for x̄T = PT −1 t=0 γt xt that
k=0 γk

kx0 − x∗ k2 1 γG2 log(T )


ED [f (x̄T )] − inf f ≤ √ + √ .
4γ T −1 4 T −1

Proof. Start considering γt = √γ , and use integral bounds to write


t+1

T −1 Z T −1 √
X 1
√ ≥ (s + 1)−1/2 ds = 2( T − 1).
t=0
t+1 s=1
T −1 Z T −1
X 1
≤ (s + 1)−1 ds = log(T ).
t+1 s=0
t=0

Plugging this into (76) gives the result

1 kx0 − x∗ k2 γ log(T )G2


ED [f (x̄T )] − inf f ≤ √ + √ . (80)
4γ T −1 4 T −1

49
PT −1 √ P −1 2
Now consider the constant stepsize γt = √γ .
T
Since k=0 γk = γ T and Tt=0 γt = γ 2 , we deduce
from (76) that

kx0 − x∗ k2 γG2
ED [f (x̄T )] − inf f ≤ √ + √ . (81)
4γ T 4 T

Corollary 9.7. Consider the setting of Theorem 9.5. We can guarantee that ED f (x̄T ) − inf f ≤
 

ε provided that
kx0 − x∗ kG kx0 − x∗ k2
T ≥ and γ t ≡ √ .
2 G T

Proof. Consider the result of Theorem 9.5 with a constant stepsize γt ≡ γ > 0 :
 kx0 − x∗ k2 γ 2 T G2 kx0 − x∗ k2 γG2
ED f (x̄T ) − inf f ≤

+ = + .
2γT 2γT 2γT 2
kx0 −x∗ k2
It is a simple exercise to see that the above right-hand term is minimal when γ = √
G T
. In this
case, we obtain
 kx0 − x∗ k2 G
ED f (x̄T ) − inf f ≤

√ .
T
kx0 −x∗ k2 G kx0 −x∗ kG
The result follows by writing √
T
≤ε⇔T ≥ 2
.

9.2 Better convergence rates for convex functions with bounded solution
 
In the previous section, we saw that (SSD) has a O ln(T )
convergence rate, but enjoys a O ε12


T
complexity rate.
 The latter suggests that it is possible to get rid of the logarithmic term and
achieve a O √1T convergence rate. In this section, we see that this can be done, by making a
localization assumption on the solution of the problem, and by making a slight modification to the
(SSD) algorithm.

Assumption 9.8 (B–Bounded Solution). There exists B > 0 and a solution x∗ ∈ argmin f
such that kx∗ k ≤ B.

We will exploit this assumption by modifying the (SSD) algorithm, adding a projection step
onto the closed ball B(0, B) where we know that the solution belongs. In this case the projection
onto the ball is given by

 x
 if kxk ≤ B,
projB(0,B) (x) := B
 x
 if kxk > B.
kxk

See Example 8.14 for the definition of the projection onto a closed convex set.

50
Algorithm 9.9 (PSSD). Consider Problem (Stochastic Function) and let Assumptions (Stochas-
tic convex and G–bounded subgradients) and (B–Bounded Solution) hold. Let x0 ∈ B(0, B),
and let γt > 0 be a sequence of stepsizes. The Projected Stochastic Subgradient Descent
(PSSD) algorithm is given by the iterates (xt )t∈N where

ξt ∈ R q Sampled i.i.d. ξt ∼ D (82)


xt+1 = projB(0,B) (xt − γt g(xt , ξt )), with g(xt , ξt ) ∈ ∂fξt (xt ). (83)

We now prove the following theorem, which is a simplified version of Theorem 19 in [7].

Theorem 9.10. Let Assumptions (Stochastic convex and G–bounded subgradients) and (B–
Bounded Solution) hold. Let (xt )t∈N be the iterates of (PSSD), with a decreasing sequence of
def def P −1
γ
stepsizes γt = √t+1 , with γ > 0. Then we have for T ≥ 2 and x̄T = T1 Tt=0 xt that

3B 2
 
1 2
ED [f (x̄T ) − inf f ] ≤ √ + γG .
T γ

Proof. We start by using Assumption (B–Bounded Solution) to write projB(0,B) (x∗ ) = x∗ . This
together with the fact that the projection is nonexpansive (see Lemma 8.16 and Example 8.14)
allows us to write, after expanding the squares
(P SSD)
kxt+1 − x∗ k2 = kprojB(0,B) (xt − γt g(xt , ξt )) − projB(0,B) (x∗ )k2
≤ kxt − x∗ − γt g(xt , ξt )k2
= kxt − x∗ k2 − 2γt g(xt , ξt ), xt − x∗ + γt2 kg(xt , ξt )k2 ,

We now want to take expectation conditioned on xt . We will use the fact that our subgradi-
ents are bounded from Assumption (Stochastic convex and G–bounded subgradients), and that
ED g(xt , ξt ) | xt ∈ ∂f (xt ) (see Lemma 9.4).
 

ED kxt+1 − x∗ k2 | xt kxt − x∗ k2 − 2γt ED g(xt , ξt ) | xt , xt − x∗ + γt2 ED kg(xt , ξt )k2 | xt


     
=
kxt − x∗ k2 − 2γt ED g(xt , ξt ) | xt , xt − x∗ + γt2 G2
 

(66)
≤ kxt − x∗ k2 − 2γt (f (xt ) − inf f ) + γt2 G2 .

1
Taking expectation, dividing through by 2γt and re-arranging gives

1 1  γ t G2
ED f (xt ) − inf f ≤ ED kxt − x∗ k2 − ED kxt+1 − x∗ k2 +
    
.
2γt 2γt 2
Summing up from t = 0, . . . , T − 1 and using telescopic cancellation gives
T −1 T −2 −1
 G2 TX
 
X 1 1X 1 1
ED f (xt ) − inf f ≤ kx0 − x∗ k2 + ED kxt+1 − x∗ k2 +
  
− γt .
2γ0 2 γt+1 γt 2
t=0 t=0 t=0

51
In the above inequality, we are going to bound the term
  √ √
1 1 t+1− t 1
− = ≤ √ ,
γt+1 γt γ 2γ t
by using the fact that the the square root function is concave. We are also going to bound the
term ED kxt+1 − x∗ k2 by using the fact that x∗ and the sequence xt belong to B(0, B), due to
 

the projection step in the algorithm:

kxt − x∗ k2 ≤ kxt k2 + 2 xt , x∗ + kx∗ k2 ≤ 4B 2 .



So we can now write (we use 2 − 1 ≤ 12 ) :

T −1 √ ! T −2 T −1
X  4B 2 1 2 1 1 X 4B 2 γG2 X 1
ED f (xt ) − inf f ≤ 4B 2 +

+ − √ + √
2γ 2 γ γ 2 2γ t 2 t+1
t=0 t=1 t=0
T T
3B 2 B 2 X 1 γG2 X 1
≤ + √ + √
γ γ t 2 t
t=1 t=1
2 √
 2 2
 
3B B γG 
≤ + + 2 T −1
γ γ 2

where in the last inequality we used that


T Z T √
X 1
√ ≤ s−1/2 ds = 2( T − 1).
t=1
t s=1


Cancelling some negative terms, and using the fact that T ≥ 1, we end up with
T −1 √ √
2B 2 B2
   2 
X 3B
ED f (xt ) − inf f ≤ 2 2
 
+ γG T+ ≤ + γG T.
γ γ γ
t=0

1 PT −1
Finally let x̄T = T t=0 xt , dividing through by 1/T , and using Jensen’s inequality we have that

T −1
1 X
ED f (xt ) − inf f
 
ED [f (x̄T ) − inf f ] ≤
T
t=0
 2 
3B 2 1
≤ + γG √ .
γ T

9.3 Convergence for strongly convex functions with bounded gradients on


bounded sets
Here we have to be careful, because there are no functions f that are strongly convex and for
which the stochastic gradients are bounded, as in Assumption (Stochastic convex and G–bounded
subgradients).

52
Lemma 9.11. Consider Problem (Stochastic Function). There exist no functions fξ (x) such
that f (x) = E [fξ (x)] is µ–strongly convex and Assumption (Stochastic convex and G–bounded
subgradients) holds.

def
Proof. For ease of notation we will note g(x) = ED [g(x, ξ)], for which we know that g(x) ∈ ∂f (x)
according to Lemma 9.4. Re-arranging the definition of strong convexity for non-smooth functions
in (68) with y = x∗ we have

µ ∗ µ
hg(x), x − x∗ i ≥ f (x) − f (x∗ ) +kx − xk2 ≥ kx∗ − xk2
2 2

where we used that f (x) − f (x ) ≥ 0 and g(x) ∈ ∂f (x). Using the above and the Cauchy-Schwarz
inequality we have that

µkx − x∗ k2 ≤ 2 hg(x), x − x∗ i ≤ 2kg(x)kkx − x∗ k.

Dividing through by kx − x∗ k gives

µkx − x∗ k ≤ 2kg(x)k. (84)

Finally, using Jensen’s inequality with the function X 7→ kXk2 , we have

µ2 kx − x∗ k2 ≤ 4kg(x)k2 = 4kED [g(x, ξ)] k2 ≤ 4ED kg(x, ξ)k2 ≤ 4G2 , for all x ∈ Rd . (85)
 

 
Since the above holds for all x ∈ Rd , we need only take x ∈/ B x∗ , 2G
µ to arrive at a contradiction.

The problem in Lemma 9.11 is that we make two global assumptions which are incompatible.
But we can consider a problem where those assumptions are only local. In the next result, we will
assume to know that the solution x∗ lives in a certain ball, that the subgradients are bounded on
this ball, and we will consider the projected stochastic subgradient method (PSSD).

Theorem 9.12. Let f be µ-strongly convex, and let Assumption (B–Bounded Solution) hold.
Consider Assumption (Stochastic convex and G–bounded subgradients), but assume that the
bounded subgradients assumption holds only for every x ∈ B(0, B). Consider (xt )t∈N a sequence
generated by the (PSSD) algorithm, with a constant stepsize γt ≡ γ ∈]0, µ1 [. The iterates satisfy

γG2
ED kxt − x∗ k2 ≤ (1 − γµ)t kx0 − x∗ k2 +
 
.
µ

Proof. Our assumption Assumption (B–Bounded Solution) guarantees that projB(0,B) (x∗ ) = x∗ ,
so the definition of (PSSD) together with the nonexpasiveness of the projection gives

kxt+1 − x∗ k2 = kprojB(0,B) (xt − γg(xt , ξt )) − projB(0,B) (x∗ )k2


≤ k(xt − γg(xt , ξt )) − x∗ k2
= kxt − x∗ k2 − 2γ g(xt , ξt ), xt − x∗ + γ 2 kg(xt , ξt )k2 .

53
We will now use our bound on the subgradients g(xt , ξt ) from Assumption (Stochastic convex and
G–bounded subgradients), using that the sequence xt belongs in B(0, B) because of the projection
step in (PSSD). Next we use that ED g(xt , ξt ) | xt ∈ ∂f (xt ) (see Lemma 9.4) and that f is
 

strongly convex. Taking expectation conditioned on xt , we have that

ED kxt+1 − x∗ k2 | xt ≤ kxt − x∗ k2 − 2γ ED g(xt , ξt ) | xt , xt − x∗ + γ 2 ED kg(xt , ξt )k2 | xt


     

≤ kxt − x∗ k2 − 2γ ED g(xt , ξt ) | xt , xt − x∗ + γ 2 G2
 

(68)
≤ kxt − x∗ k2 − 2γ(f (xt ) − inf f ) − γµkxt − x∗ k2 + γ 2 G2
≤ (1 − γµ)kxt − x∗ k2 + γ 2 G2

Taking expectation on the above, and using a recurrence argument, we can deduce that
t−1
X
ED kxt − x∗ k2 ≤ (1 − γµ)t kx0 − x∗ k2 + (1 − γµ)t γ 2 G2 .
 

k=0

Since
t−1
X 1 − (1 − γµ)t γ 2 G2 γG2
(1 − γµ)t γ 2 G2 = γ 2 G2 ≤ = ,
γµ γµ µ
k=0
we conclude that
γG2
ED kxt − x∗ k2 ≤ (1 − γµ)t kx0 − x∗ k2 +
 
.
µ

Corollary 9.13 (O 1ε complexity). Consider the setting of Theorem 9.12. For every ε > 0, we


can guarantee that ED kxt − x∗ k2 ≤ ε provided that


 

2G2
     2
εµ 1 8B
γ = min ; and t ≥ max ; 1 ln .
2G2 µ εµ2 ε

G2
Proof. Use Lemma A.2 with A = µ , C = µ, and use the fact that kx0 − x∗ k ≤ 2B.

9.4 Bibliographic notes


The earlier non-asymptotic proof for the non-smooth case first appeared in the online learning
literature, see for example [42]. Outside of the online setting, convergence proofs for SGD in the
non-smooth setting with Lipschitz functions was given in [41]. For the non-smooth strongly convex
setting see [22] where the authors prove a simple 1/t convergence rate.

10 Proximal Gradient Descent


Problem 10.1 (Composite). We want to minimize a function F : Rd → R ∪ {+∞} which is a
composite sum given by
F (x) = f (x) + g(x), (86)

54
where f : Rd → R is differentiable, and g : Rd → R ∪ {+∞} is proper l.s.c. We require that the
problem is well-posed, in the sense that argmin F 6= ∅.

To exploit the structure of this composite sum, we will use the proximal gradient descent
algorithm, which alternates gradient steps with respect to the differentiable term f , and proximal
steps with respect to the nonsmooth term g.

Algorithm 10.2 (PGD). Let x0 ∈ Rd , and let γ > 0 be a stepsize. The Proximal Gradient
Descent (PGD) algorithm defines a sequence (xt )t∈N which satisfies

xt+1 = proxγg (xt − γ∇f (xt )). (87)

10.1 Convergence for convex functions

Theorem 10.3. Consider the Problem (Composite), and suppose that g is convex, and that f
is convex and L-smooth, for some L > 0. Let (xt )t∈N be the sequence of iterates generated by
the algorithm (PGD), with a stepsize γ ∈]0, L1 ]. Then, for all x∗ ∈ argmin F , for all t ∈ N we
have that
kx0 − x∗ k2
F (xt ) − inf F ≤ .
2γt

Proof. Let x∗ ∈ argmin F be any minmizer of F . We start by studying two (decreasing and
nonnegative) quantities of interest : F (xt ) − inf F and kxt − x∗ k2 .
First, we show that F (xt+1 ) − inf F decreases. For this, using the definition of proxγg in (69)
together with the update rule (87), we have that
1 0
xt+1 = argmin kx − (xt − γ∇f (xt ))k2 + γg(x0 ).
x0 ∈Rd 2

Consequently
1 t+1 1 t
g(xt+1 ) + kx − (xt − γ∇f (xt ))k2 ≤ g(xt ) + kx − (xt − γ∇f (xt ))k2 .
2γ 2γ
After expanding the squares and rearranging the terms, we see that the above inequality is equiv-
alent to
−1 t+1
g(xt+1 ) − g(xt ) ≤ kx − xt k2 − h∇f (xt ), xt+1 − xt i. (88)

Now, we can use the fact that f is L-smooth and (9) to write

L t+1
f (xt+1 ) − f (xt ) ≤ h∇f (xt ), xt+1 − xt i + kx − xt k2 . (89)
2
Summing (88) and (89), and using the fact that γL ≤ 1, we obtain that

−1 t+1 L
F (xt+1 ) − F (xt ) ≤ kx − xt k2 + kxt+1 − xt k2 ≤ 0. (90)
2γ 2

Consequently F (xt+1 ) − F (xt ) is decreasing.

55
Now we show that kxt+1 − x∗ k2 is decreasing. For this we first expand the squares as follows
1 t+1 1 t −1 t+1 xt − xt+1 t+1
kx − x∗ k2 − kx − x∗ k2 = kx − xt k2 − h ,x − x∗ i. (91)
2γ 2γ 2γ γ
Since xt+1 = proxγg (xt − γ∇f (xt )), we know from (70) that
xt − xt+1
∈ ∇f (xt ) + ∂g(xt+1 ).
γ
Using the above in (91) we have that there exists some η t+1 ∈ ∂g(xt+1 ) such that
1 t+1 1 t
kx − x∗ k2 − kx − x∗ k2
2γ 2γ
−1 t+1
= kx − xt k2 − h∇f (xt ) + η t+1 , xt+1 − x∗ i

−1 t+1
= kx − xt k2 − hη t+1 , xt+1 − x∗ i − h∇f (xt ), xt+1 − xt i + h∇f (xt ), x∗ − xt i.

On the first inner product term we can use that η t+1 ∈ ∂g(xt+1 ) and definition of subgradient (66)
to write
− hη t+1 , xt+1 − x∗ i = hη t+1 , x∗ − xt+1 i ≤ g(x∗ ) − g(xt+1 ). (92)
On the second inner product term we can use the smoothness of L and (9) to write
L t+1
− h∇f (xt ), xt+1 − xt i ≤
kx − xt k2 + f (xt ) − f (xt+1 ). (93)
2
On the last term we can use the convexity of f and (2) to write

h∇f (xt ), x∗ − xt i ≤ f (x∗ ) − f (xt ). (94)

By combining (92), (93), (94), and using the fact that γL ≤ 1, we obtain
1 t+1 1 t −1 t+1 L
kx − x∗ k2 − kx − x∗ k2 ≤ kx − xt k2 + kxt+1 − xt k2
2γ 2γ 2γ 2
+g(x∗ ) − g(xt+1 ) + f (xt ) − f (xt+1 ) + f (x∗ ) − f (xt )
−1 t+1 L
= kx − xt k2 + kxt+1 − xt k2 − (F (xt+1 ) − F (x∗ ))
2γ 2
t+1
≤ −(F (x ) − inf F ). (95)

Now that we have established that the iterate gap and functions values are decreasing, we want to
show that the Lyapunov energy
1 t
Et := kx − x∗ k2 + t(F (xt ) − inf F ),

is decreasing. Indeed, re-arranging the terms and using (90) and (95) we have that
1 t+1 1 t
Et+1 − Et = (t + 1)(F (xt+1 ) − inf F ) − t(F (xt ) − inf F ) + kx − x∗ k2 − kx − x∗ k2
2γ 2γ
1 t+1 1 t
= F (xt+1 ) − inf F + t(F (xt+1 ) − F (xt )) + kx − x∗ k2 − kx − x∗ k2
2γ 2γ
(90) 1 t+1 1 t (95)
≤ F (xt+1 ) − inf F + kx − x∗ k2 − kx − x∗ k2 ≤ 0. (96)
2γ 2γ

56
We have shown that Et is decreasing, therefore we can write that
1 0
t(F (xt ) − inf F ) ≤ Et ≤ E0 = kx − x∗ k2 ,

and the conclusion follows after dividing by t.

Corollary 10.4 (O(1/t) Complexity). Consider the setting of Theorem 10.3, for a given  > 0
and γ = L we have that

L kx0 − x∗ k2
t≥ =⇒ F (xt ) − inf F ≤  (97)
 2

10.2 Convergence for strongly convex functions

Theorem 10.5. Consider the Problem (Composite), and suppose that h is convex, and that f
is µ-strongly convex and L-smooth, for some L ≥ µ > 0. Let (xt )t∈N be the sequence of iterates
generated by the algorithm (PGD), with a stepsize 0 < γ ≤ L1 . Then, for x∗ = argmin F and
t ∈ N we have that
kxt+1 − x∗ k2 ≤ (1 − γµ) kxt − x∗ k2 . (98)

As for (GD) (see Theorem 3.6) we provide two different proofs here.

Proof of Theorem 10.5 with first-order properties. Use the definition of (PGD) together
with Lemma 8.17, and the nonexpansiveness of the proximal operator (Lemma 8.16), to write

kxt+1 − x∗ k2 ≤ kproxγg (xt − γ∇f (xt )) − proxγg (x∗ − γ∇f (x∗ ))k2
≤ k(xt − x∗ ) − γ(∇f (xt ) − ∇f (x∗ ))k2
= kxt − x∗ k2 + γ 2 k∇f (xt ) − ∇f (x∗ )k2 − 2γh∇f (xt ) − ∇f (x∗ ), xt − x∗ i.

The cocoercivity of f (Lemma 2.29) gives us

γ 2 k∇f (xt ) − ∇f (x∗ )k2 ≤ 2γ 2 L f (xt ) − f (x∗ ) − h∇f (x∗ ), xt − x∗ i ,




while the strong convexity of f gives us (Lemma 2.14)

−2γh∇f (xt ) − ∇f (x∗ ), xt − x∗ i = 2γh∇f (xt ), x∗ − xt i + 2γh∇f (x∗ ), xt − x∗ i


 µ 
≤ 2γ f (x∗ ) − f (xt ) − kxt − x∗ k2 + 2γh∇f (x∗ ), xt − x∗ i
2
= −γµkxt − x∗ k2 − 2γ f (xt ) − f (x∗ ) − h∇f (x∗ ), xt − x∗ i


Combining those three inequalities and rearranging the terms, we obtain

kxt+1 − x∗ k2 ≤ (1 − γµ)kxt − x∗ k2 + (2γ 2 L − 2γ) f (xt ) − f (x∗ ) − h∇f (x∗ ), xt − x∗ i .




We conclude after observing that f (xt ) − f (x∗ ) − h∇f (x∗ ), xt − x∗ i ≥ 0 (because f is convex, see
Lemma 2.8), and that 2γ 2 L − 2γ ≤ 0 (because of our assumption on the stepsize).

57
Proof of Theorem 10.5 with the Hessian. Let T (x) := x − γ∇f (x) so that the iterates of
(PGD) verify xt+1 = proxγg (T (xt )). From Lemma 8.17 we know that proxγg (T (x∗ )) = x∗ , so we
can write
kxt+1 − x∗ k = kproxγh (T (xt )) − proxγg (T (x∗ ))k.

Moreover, we know from Lemma 8.16 that proxγg is 1-Lipschitz, so

kxt+1 − x∗ k ≤ kT (xt ) − T (x∗ )k.

Further, we already proved in the proof of Theorem 3.6 that T is (1 − γµ)-Lipschitz (assuming
further that f is twice differentiable). Consequently,

kxt+1 − x∗ k ≤ (1 − γµ)kxt − x∗ k.

To conclude the proof, take the squares in the above inequality, and use the fact that (1 − γµ)2 ≤
(1 − γµ).

Corollary 10.6 (log(1/) Complexity). Consider the setting of Theorem 10.5, for a given  > 0,
we have that if γ = 1/L then
 
L 1
k ≥ log ⇒ kxt+1 − x∗ k2 ≤  kx0 − x∗ k2 . (99)
µ 

The proof of this lemma follows by applying Lemma A.1 in the appendix.

10.3 Bibliographic notes


A proof of a O( T1 ) convergence rate for the (PGD) algorithm in the convex case can be found in
[3, Theorem 3.1]. The linear convergence rate in the strongly convex case can be found in [37,
Proposition 4].

11 Stochastic Proximal Gradient Descent


Problem 11.1 (Composite Sum of Functions). We want to minimize a function F : Rd → R
which writes as a composite sum
n
def def 1X
F (x) = g(x) + f (x), f (x) = fi (x),
n
i=1

where each fi : Rd → R is differentiable, and g : Rd → R ∪ {+∞} is proper l.s.c. We require that


the problem is well-posed, in the sense that argmin F 6= ∅, and each fi is bounded from below.

58
Assumption 11.2 (Composite Sum of Convex). We consider the Problem (Composite Sum of
Functions) and we suppose that h and each fi are convex.

Assumption 11.3 (Composite Sum of Lmax –Smooth). We consider the Problem (Composite
Sum of Functions) and suppose that each fi is Li -smooth. We note Lmax = max Li .
i=1,...,n

Algorithm 11.4 (SPGD). Consider the Problem (Composite Sum of Functions). Let x0 ∈ Rd ,
and let γt > 0 be a sequence of step sizes. The Stochastic Proximal Gradient Descent
(SPGD) algorithm defines a sequence (xt )t∈N satisfying
1
it ∈ {1, . . . n} Sampled with probability ,
n
xt+1 = proxγg xt − γt ∇fit (xt ) .

(100)

11.1 Complexity for convex functions

Theorem 11.5. Let Assumptions (Composite Sum of Lmax –Smooth) and (Composite Sum of
Convex) hold. Let (xt )t∈N be a sequence generated by the (SPGD) algorithm with a nonincreasing
sequence of stepsizes verifying 0 < γ0 < 4L1max . Then, for all x∗ ∈ argmin F , for all t ∈ N,
Pt−1 2
t kx0 − x∗ k2 + 2γ0 (F (x0 ) − inf F ) 2σF∗ γs
Ps=0
 
E F (x̄ ) − inf F ≤ Pt−1 + t−1 ,
2(1 − 4γ0 Lmax ) s=0 γs (1 − 4γ0 Lmax ) s=0 γs

def 1 Pt−1
where x̄t = Pt−1
γs s=0 γs x
s.
s=0

Proof. Let us start by looking at kxt+1 − x∗ k2 − kxt − x∗ k2 . Since we just compare xt to xt+1 , to
lighten the notations we fix j := it . Expanding the squares, we have that

1 1 −1 t+1 xt − xt+1 t+1


kxt+1 − x∗ k2 − kxt − x∗ k2 = kx − xt k2 − h ,x − x∗ i.
2γt 2γt 2γt γt
xt −xt+1
Since xt+1 = proxγt g (xt − γt ∇fj (xt )), we know from (70) that γt ∈ ∇fj (xt ) + ∂g(xt+1 ). So
there exists some η t+1 ∈ ∂g(xt+1 ) such that
1 1
kxt+1 − x∗ k2 − kxt − x∗ k2 (101)
2γt 2γt
−1 t+1
= kx − xt k2 − h∇fj (xt ) + η t+1 , xt+1 − x∗ i
2γt
−1 t+1
= kx − xt k2 − h∇fj (xt ) − ∇f (xt ), xt+1 − x∗ i − h∇f (xt ) + η t+1 , xt+1 − x∗ i
2γt

We decompose the last term of (101) as

−h∇f (xt ) + η t+1 , xt+1 − x∗ i = −hη t+1 , xt+1 − x∗ i − h∇f (xt ), xt+1 − xt i + h∇f (xt ), x∗ − xt i

59
For the first term in the above we can use the fact that η t+1 ∈ ∂g(xt+1 ) to write

− hη t+1 , xt+1 − x∗ i = hη t+1 , x∗ − xt+1 i ≤ g(x∗ ) − g(xt+1 ). (102)

On the second term we can use the fact that f is L-smooth and (9) to write
L t+1
− h∇f (xt ), xt+1 − xt i ≤ kx − xt k2 + f (xt ) − f (xt+1 ). (103)
2
On the last term we can use the convexity of f and (2) to write

h∇f (xt ), x∗ − xt i ≤ f (x∗ ) − f (xt ). (104)

By combining (102), (103), (104), and using the fact that γt L ≤ γ0 Lmax ≤ 1, we obtain
1 1
kxt+1 − x∗ k2 − kxt − x∗ k2
2γt 2γt
−1 t+1 L
≤ kx − xt k2 + kxt+1 − xt k2 − (F (xt+1 ) − inf F ) − h∇fj (xt ) − ∇f (xt ), xt+1 − x∗ i
2γt 2
≤ −(F (x ) − inf F ) − h∇fj (xt ) − ∇f (xt ), xt+1 − x∗ i.
t+1
(105)

We now have to control the last term of (105), in expectation. To shorten the computation we
temporarily introduce the operators
def
T = Id − γt ∇f,
def
T̂ = Id − γt ∇fj .

Notice in particular that xt+1 = proxγt g (T̂ (xt )). We have that

−h∇fj (xt ) − ∇f (xt ), xt+1 − x∗ i = −h∇fj (xt ) − ∇f (xt ), proxγt g (T̂ (xt )) − proxγt g (T (xt ))i
−h∇fj (xt ) − ∇f (xt ), proxγt g (T (xt )) − x∗ i, (106)

and observe that the last term is, in expectation, is equal to zero. This is due to the fact that
proxγt g (T (xt )) − x∗ is deterministic when conditioned on xt . Since we will later on take expecta-
tions, we drop this term and keep on going. As for the first term, using the nonexpansiveness of
the proximal operator (Lemma 8.16), we have that

−h∇fj (xt ) − ∇f (xt ), proxγt g (T̂ (xt )) − proxγt g (T (xt ))i ≤ k∇fj (xt ) − ∇f (xt )kkT̂ (xt ) − T (xt )k
= γt k∇fj (xt ) − ∇f (xt )k2

Using the above two bounds in (106) we have proved that (after taking expectation)

E −h∇fj (xt ) − ∇f (xt ), xt+1 − x∗ i ≤ γt E k∇fj (xt ) − ∇f (xt )k2 = γt V ∇fj (xt ) .
     

Injecting the above inequality into (105), we finally obtain


1  t+1  1 
− x∗ k2 − E kxt − x∗ k2 ≤ −γt E F (xt+1 ) − inf F + γt2 V ∇fi (xt ) .
    
E kx (107)
2 2

60
To control the variance term V ∇fi (xt ) we use the variance transfer Lemma 8.20 with x = xt
 

and y = x∗ , which together with Definition 4.16 and Lemma 8.19 gives

V ∇fi (xt ) | xt ≤ 4Lmax Df (xt ; x∗ ) + 2σF∗


 

≤ 4Lmax F (xt ) − inf F + 2σF∗ .




Taking expectation in the above inequality and inserting it in (107) gives


1  t+1  1 
− x∗ k2 − E kxt − x∗ k2

E kx (108)
2 2
≤ −γt E F (xt+1 ) − inf F + 4γt2 Lmax E F (xt ) − inf F + 2γt2 σF∗ .
   

Now we define the following Lyapunov energy for all T ∈ N


−1 −1
1  T ∗ 2 T
 TX t ∗
T
X
γt2 ,
   
ET := E kx − x k +γT −1 E F (x ) − inf F + γt (1−4γt Lmax )E F (x ) − inf F −2σF
2
t=0 t=0

and we will show that it is decreasing (we note γ−1 := γ0 ). Using the above definition we have
that
1   1 
ET +1 − ET = E kxT +1 − x∗ k2 − E kxT − x∗ k2

2 2
T +1
) − inf F − γT −1 E F (xT ) − inf F
   
+ γT E F (x
+ γT (1 − 4γT Lmax )E F (xT ) − inf F − 2σF∗ γT2 .
 

Using (108) and cancelling the matching terms, it remains

ET +1 − ET ≤ −γT −1 E F (xT ) − inf F + γT E F (xT ) − inf F .


   

Using the fact that the stepsizes are nonincreasing (γT ≤ γT −1 ) we consequently conclude that
ET +1 − ET ≤ 0. Therefore, from the definition of ET (and using again γt ≤ γ0 ), we have that
kx0 − x∗ k2 + 2γ0 (F (x0 ) − inf F )
E0 =
2
TX−1 T
X −1
γt (1 − 4γt Lmax )E F (xt ) − inf F − 2σF∗ γt2
 
≥ ET ≥
t=0 t=0
T
X −1 T
X −1
γt E F (xt ) − inf F − 2σF∗ γt2 .
 
≥ (1 − 4γ0 Lmax )
t=0 t=0
P −1
Now passing the term in σf∗ to the other side, dividing this inequality by (1 − 4γLmax ) Tt=0 γt ,
which is strictly positive since 4γ0 Lmax < 1, and using Jensen inequality, we finally conclude that
T −1
1 X
E F (x̄T ) − inf F ≤ γt E F (xt ) − inf F
   
PT −1
t=0 γt t=0
P −1 2
kx0 − ∗
x k2 +
2(F (x0 ) − inf F ) 2σF∗ Tt=0 γt
≤ PT −1 + PT −1 .
2(1 − 4γ0 Lmax ) t=0 γt (1 − 4γ0 Lmax ) t=0 γt

61
Analogously to Remark 5.4, different choices for the step sizes γt allow us to trade off the
convergence speed for the constant variance term. In the next two corollaries we choose a constant

and a 1/ t step size, respectively, followed by a O(2 ) complexity result.

Corollary 11.6. Let Assumptions (Composite Sum of Lmax –Smooth) and (Composite Sum of
Convex) hold. Let (xt )t∈N be a sequence generated by the (SPGD) algorithm with a constant
stepsize verifying 0 < γ < 4L1max . Then, for all x∗ ∈ argmin F and for all t ∈ N we have that

 kx0 − x∗ k2 + 2γ(F (x0 ) − inf F ) 2σF∗ γ


E F (x̄t ) − inf F ≤

+ ,
2(1 − 4γLmax )γt (1 − 4γLmax )
def 1 Pt−1
where x̄t = T s=0 x
s.

Proof. It is a direct consequence of Theorem 11.5 with γt = γ and


t−1
X t−1
X
γs = γt and γs2 = γ 2 t.
s=0 s=0

Corollary 11.7. Let Assumptions (Composite Sum of Lmax –Smooth) and (Composite Sum of
Convex) hold. Let (xt )t∈N be a sequence generated by the (SPGD) algorithm with a sequence of
stepsizes γt = √γt+1
0
verifying 0 < γ0 < 4L1max . Then, for all t ≥ 3 we have that
 
ln(t)
E F (x̄t ) − inf F = O
 
√ ,
t
def 1 Pt−1
where x̄t = Pt−1 s=0 γs x
s.
s=0 γs

Proof. Apply Theorem 11.5 with γt = √γ0 , and use the following integral bounds
t+1

t−1 t−1 Z t−1


X X 1 1
γs2 = γ02 ≤ γ02 ds = γ02 ln(t),
s+1 s=0 s+1
s=0 s=0

and
t−1
X Z t−1
γ0 √ √ 
γs ≥ √ = 2γ0 t− 2 .
s=0 s=1 s+1
√√  1√
For t ≥ 3 we have t−
2 > 6 t, and ln(t) > 1, so we can conclude that
    Pt−1 2  
1 1
√ √
1 ln(t)
√ s=0 γs ln(t)

Pt−1 ≤ √ = O = O and Pt−1 = O .
s=0 γs 2γ0 ( t − 2) t t s=0 γs t

Even though constant stepsizes do not lead to convergence of the algorithm, we can guarantee
arbitrary precision provided we take small stepsizes and a sufficient number of iterations.

62
Corollary 11.8 ($O(\varepsilon^{-2})$–Complexity). Let Assumptions (Composite Sum of Lmax–Smooth) and (Composite Sum of Convex) hold. Let (xt)t∈ℕ be a sequence generated by the (SPGD) algorithm with a constant stepsize γ. For every $0 < \varepsilon \le \frac{\sigma_F^*}{L_{\max}}$, we can guarantee that $\mathbb{E}\left[F(\bar{x}^t) - \inf F\right] \le \varepsilon$, provided that

$$\gamma = \frac{\varepsilon}{8\sigma_F^*} \quad\text{and}\quad t \ge C_0\frac{\sigma_F^*}{\varepsilon^2},$$

where $\bar{x}^t = \frac{1}{t}\sum_{s=0}^{t-1}x^s$, $C_0 = 16\left(\|x^0 - x^*\|^2 + \frac{1}{4L_{\max}}\left(F(x^0) - \inf F\right)\right)$, and x∗ ∈ argmin F.

Proof. First, note that our assumptions ε ≤ σF∗/Lmax and γ = ε/(8σF∗) imply that γ ≤ 1/(8Lmax), so the conditions of Corollary 11.6 hold true, and we obtain

$$\mathbb{E}\left[F(\bar{x}^t) - \inf F\right] \le A + B, \quad\text{where}\quad A := \frac{\|x^0 - x^*\|^2 + 2\gamma\left(F(x^0) - \inf F\right)}{2(1 - 4\gamma L_{\max})\gamma t}, \qquad B := \frac{2\sigma_F^*\gamma}{1 - 4\gamma L_{\max}}.$$

Passing the terms around, we see that

$$B \le \frac{\varepsilon}{2} \iff \gamma \le \frac{\varepsilon}{4\left(\sigma_F^* + \varepsilon L_{\max}\right)}.$$

Since we assumed that ε ≤ σF∗/Lmax and γ = ε/(8σF∗), the above inequality is true. Now we note $h_0 := \|x^0 - x^*\|^2$, $r_0 := F(x^0) - \inf F$, and write

$$A \le \frac{\varepsilon}{2} \iff t \ge \frac{h_0 + 2\gamma r_0}{2(1 - 4\gamma L_{\max})\gamma}\cdot\frac{2}{\varepsilon}. \tag{109}$$

With our assumptions, we see that

$$h_0 + 2\gamma r_0 \le h_0 + \frac{1}{4L_{\max}}r_0 = \frac{C_0}{16}, \qquad (1 - 4\gamma L_{\max})\gamma\varepsilon \ge \frac{1}{2}\cdot\frac{\varepsilon}{8\sigma_F^*}\cdot\varepsilon = \frac{\varepsilon^2}{16\sigma_F^*},$$

so indeed the inequality in (109) is true, which concludes the proof.
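Corollary 11.8 translates directly into a small recipe for choosing the stepsize and iteration count from a target accuracy ε; a sketch, with hypothetical placeholder constants:

```python
import math

def spgd_complexity(eps, h0, r0, sigma, L_max):
    # stepsize and iteration count prescribed by Corollary 11.8
    assert 0 < eps <= sigma / L_max           # the condition of the corollary
    gamma = eps / (8 * sigma)
    C0 = 16 * (h0 + r0 / (4 * L_max))
    return gamma, math.ceil(C0 * sigma / eps ** 2)

print(spgd_complexity(eps=0.01, h0=1.0, r0=1.0, sigma=1.0, L_max=10.0))
```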

11.2 Complexity for strongly convex functions

Theorem 11.9. Let Assumptions (Composite Sum of Lmax–Smooth) and (Composite Sum of Convex) hold, and assume further that f is µ-strongly convex, for µ > 0. Let (xt)t∈ℕ be a sequence generated by the (SPGD) algorithm with a constant stepsize verifying $0 < \gamma \le \frac{1}{2L_{\max}}$. Then, for x∗ = argmin F and for all t ∈ ℕ,

$$\mathbb{E}\left[\|x^t - x^*\|^2\right] \le (1 - \gamma\mu)^t\|x^0 - x^*\|^2 + \frac{2\gamma\sigma_F^*}{\mu}.$$

Proof. In this proof, we fix j := it to lighten the notation. Let us start by using the fixed-point property of the (PGD) algorithm (Lemma 8.17), together with the nonexpansiveness of the proximal operator (Lemma 8.16), to write

$$\|x^{t+1} - x^*\|^2 = \left\|\operatorname{prox}_{\gamma g}\left(x^t - \gamma\nabla f_j(x^t)\right) - \operatorname{prox}_{\gamma g}\left(x^* - \gamma\nabla f(x^*)\right)\right\|^2 \le \left\|\left(x^t - \gamma\nabla f_j(x^t)\right) - \left(x^* - \gamma\nabla f(x^*)\right)\right\|^2 = \|x^t - x^*\|^2 + \gamma^2\|\nabla f_j(x^t) - \nabla f(x^*)\|^2 - 2\gamma\langle\nabla f_j(x^t) - \nabla f(x^*), x^t - x^*\rangle. \tag{110}$$

Let us analyse the last two terms on the right-hand side of (110). For the first term, we use Young's inequality and the triangle inequality, together with Lemma 8.20, to obtain

$$\gamma^2\,\mathbb{E}_{x^t}\left[\|\nabla f_j(x^t) - \nabla f(x^*)\|^2\right] \le 2\gamma^2\,\mathbb{E}_{x^t}\left[\|\nabla f_j(x^t) - \nabla f_j(x^*)\|^2\right] + 2\gamma^2\,\mathbb{E}\left[\|\nabla f_j(x^*) - \nabla f(x^*)\|^2\right] = 2\gamma^2\,\mathbb{E}_{x^t}\left[\|\nabla f_j(x^t) - \nabla f_j(x^*)\|^2\right] + 2\gamma^2\sigma_F^* \le 4\gamma^2 L_{\max}\,D_f(x^t; x^*) + 2\gamma^2\sigma_F^*, \tag{111}$$

where Df is the divergence of f (see Definition 8.18). For the second term in (110) we use the strong convexity of f (Lemma 2.14) to write

$$-2\gamma\,\mathbb{E}_{x^t}\left[\langle\nabla f_j(x^t) - \nabla f(x^*), x^t - x^*\rangle\right] = -2\gamma\langle\nabla f(x^t) - \nabla f(x^*), x^t - x^*\rangle = 2\gamma\langle\nabla f(x^t), x^* - x^t\rangle + 2\gamma\langle\nabla f(x^*), x^t - x^*\rangle \le -\gamma\mu\|x^t - x^*\|^2 - 2\gamma D_f(x^t; x^*). \tag{112}$$

Combining (110), (111) and (112) and taking expectations gives

$$\mathbb{E}\left[\|x^{t+1} - x^*\|^2\right] \le (1 - \gamma\mu)\,\mathbb{E}\left[\|x^t - x^*\|^2\right] + 2\gamma(2\gamma L_{\max} - 1)\,\mathbb{E}\left[D_f(x^t; x^*)\right] + 2\gamma^2\sigma_F^* \le (1 - \gamma\mu)\,\mathbb{E}\left[\|x^t - x^*\|^2\right] + 2\gamma^2\sigma_F^*,$$

where in the last inequality we used our assumption that 2γLmax ≤ 1. Now, recursively applying the above gives

$$\mathbb{E}\left[\|x^t - x^*\|^2\right] \le (1 - \gamma\mu)^t\|x^0 - x^*\|^2 + 2\gamma^2\sigma_F^*\sum_{s=0}^{t-1}(1 - \gamma\mu)^s,$$

and we conclude by upper bounding this geometric sum using

$$\sum_{s=0}^{t-1}(1 - \gamma\mu)^s = \frac{1 - (1 - \gamma\mu)^t}{\gamma\mu} \le \frac{1}{\gamma\mu}.$$
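The scalar recursion at the heart of this proof can be iterated numerically to see the behaviour it predicts: a linear decrease down to a plateau of size 2γσF∗/µ. A sketch with hypothetical constants:

```python
mu, L_max, sigma = 0.1, 1.0, 0.01   # hypothetical constants
gamma = 1.0 / (2 * L_max)           # satisfies gamma <= 1/(2 L_max)
alpha = 1.0                         # alpha_0 = ||x^0 - x^*||^2

for _ in range(200):
    # one step of E||x^{t+1}-x^*||^2 <= (1 - gamma*mu) E||x^t-x^*||^2 + 2 gamma^2 sigma
    alpha = (1 - gamma * mu) * alpha + 2 * gamma ** 2 * sigma

print(alpha, 2 * gamma * sigma / mu)  # the iterate settles near the plateau 2*gamma*sigma/mu
```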

Corollary 11.10 ($\tilde{O}(1/\varepsilon)$ Complexity). Consider the setting of Theorem 11.9. Let ε > 0. If we set

$$\gamma = \min\left\{\frac{\varepsilon}{2}\frac{\mu}{2\sigma_F^*},\ \frac{1}{2L_{\max}}\right\}, \tag{113}$$

then

$$t \ge \max\left\{\frac{1}{\varepsilon}\frac{4\sigma_F^*}{\mu^2},\ \frac{2L_{\max}}{\mu}\right\}\log\left(\frac{2\|x^0 - x^*\|^2}{\varepsilon}\right) \implies \mathbb{E}\left[\|x^t - x^*\|^2\right] \le \varepsilon. \tag{114}$$
 µ2 µ 

Proof. Applying Lemma A.2 with $A = \frac{2\sigma_F^*}{\mu}$, $C = 2L_{\max}$ and $\alpha_0 = \|x^0 - x^*\|^2$ gives the result (114).
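A direct transcription of the stepsize (113) and the iteration threshold (114) into a helper function (a sketch; all inputs are hypothetical):

```python
import math

def spgd_strongly_convex_complexity(eps, mu, L_max, sigma, h0):
    # stepsize (113) and iteration threshold (114)
    gamma = min(eps * mu / (4 * sigma), 1.0 / (2 * L_max))
    t = max(4 * sigma / (eps * mu ** 2), 2 * L_max / mu) * math.log(2 * h0 / eps)
    return gamma, math.ceil(t)

print(spgd_strongly_convex_complexity(eps=0.01, mu=0.1, L_max=1.0, sigma=0.5, h0=1.0))
```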

11.3 Bibliographic notes


In proving Theorem 11.5 we simplified the more general Theorem 3.3 in [21], which gives a convergence rate for several stochastic proximal algorithms, of which proximal SGD is a special case. In the smooth and strongly convex case, the paper [13] gives a general theorem for the convergence of stochastic proximal algorithms which includes proximal SGD. For adaptive stochastic proximal methods, see [1].

References
[1] H. Asi and J. C. Duchi. “Stochastic (Approximate) Proximal Point Methods: Convergence,
Optimality, and Adaptivity”. In: SIAM J. Optim. 29.3 (2019), pp. 2257–2290.
[2] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in
Hilbert Spaces. 2nd edition. Springer, 2017.
[3] A. Beck and M. Teboulle. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear
Inverse Problems”. In: SIAM Journal on Imaging Sciences 2.1 (2009), pp. 183–202.
[4] J. Bolte et al. “Clarke Subgradients of Stratifiable Functions”. In: SIAM Journal on Opti-
mization 18.2 (2007), pp. 556–572.
[5] S. Bubeck. “Convex Optimization: Algorithms and Complexity”. In: Foundations and Trends in Machine Learning 8.3-4 (2015), pp. 231–357.
[6] F. Clarke. Optimization and Nonsmooth Analysis. Classics in Applied Mathematics. Society
for Industrial and Applied Mathematics, 1990.
[7] A. Defazio and R. M. Gower. “Factorial Powers for Stochastic Optimization”. In: Asian
Conference on Machine Learning (2021).
[8] D. J. H. Garling. A Course in Mathematical Analysis: Volume 2, Metric and Topological
Spaces, Functions of a Vector Variable. 1 edition. Cambridge: Cambridge University Press,
2014.
[9] G. Garrigos. “Square Distance Functions Are Polyak-Lojasiewicz and Vice-Versa”. In: preprint
arXiv:2301.10332 (2023).
[10] G. Garrigos, L. Rosasco, and S. Villa. “Convergence of the Forward-Backward Algorithm:
Beyond the Worst-Case with the Help of Geometry”. In: Mathematical Programming (2022).
[11] N. Gazagnadou, R. M. Gower, and J. Salmon. “Optimal mini-batch and step sizes for SAGA”.
In: ICML (2019).

[12] E. Ghadimi, H. Feyzmahdavian, and M. Johansson. “Global convergence of the Heavy-
ball method for convex optimization”. In: 2015 European Control Conference (ECC). 2015,
pp. 310–315.
[13] E. Gorbunov, F. Hanzely, and P. Richtárik. “A Unified Theory of SGD: Variance Reduction,
Sampling, Quantization and Coordinate Descent”. In: arXiv preprint arXiv:1905.11261 (2019).
[14] R. M. Gower. Sketch and Project: Randomized Iterative Methods for Linear Systems and
Inverting Matrices. 2016.
[15] R. M. Gower, P. Richtárik, and F. Bach. “Stochastic Quasi-Gradient Methods: Variance
Reduction via Jacobian Sketching”. In: Mathematical Programming 188.1 (2021), pp. 135–
192.
[16] R. M. Gower, O. Sebbouh, and N. Loizou. “SGD for Structured Nonconvex Functions: Learn-
ing Rates, Minibatching and Interpolation”. In: Proceedings of The 24th International Con-
ference on Artificial Intelligence and Statistics. Vol. 130. PMLR, 2021, pp. 1315–1323.
[17] R. M. Gower et al. “SGD: General Analysis and Improved Rates”. In: International Confer-
ence on Machine Learning. 2019, pp. 5200–5209.
[18] M. Hardt, B. Recht, and Y. Singer. “Train faster, generalize better: stability of stochastic
gradient descent”. In: 33rd International Conference on Machine Learning. 2016.
[19] H. Karimi, J. Nutini, and M. W. Schmidt. “Linear Convergence of Gradient and Proximal-
Gradient Methods Under the Polyak-Lojasiewicz Condition”. In: European Conference on
Machine Learning (ECML) abs/1608.04636 (2016).
[20] A. Khaled and P. Richtarik. “Better Theory for SGD in the Nonconvex World”. In: arXiv:2002.03329
(2020).
[21] A. Khaled et al. Unified Analysis of Stochastic Gradient Methods for Composite Convex and
Smooth Optimization. 2020.
[22] S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t)
convergence rate for the projected stochastic subgradient method. 2012.
[23] C. Liu, L. Zhu, and M. Belkin. “Toward a theory of optimization for over-parameterized
systems of non-linear equations: the lessons of deep learning”. In: CoRR (2020).
[24] S. Ma, R. Bassily, and M. Belkin. “The Power of Interpolation: Understanding the Effective-
ness of SGD in Modern Over-parametrized Learning”. In: ICML. 2018.
[25] E. Moulines and F. R. Bach. “Non-asymptotic analysis of stochastic approximation algo-
rithms for machine learning”. In: Advances in Neural Information Processing Systems. 2011,
pp. 451–459.
[26] D. Needell, N. Srebro, and R. Ward. “Stochastic gradient descent, weighted sampling, and the
randomized Kaczmarz algorithm”. In: Mathematical Programming, Series A 155.1 (2016),
pp. 549–573.

[27] D. Needell and R. Ward. “Batched Stochastic Gradient Descent with Weighted Sampling”.
In: Approximation Theory XV, Springer. Vol. 204. Springer Proceedings in Mathematics &
Statistics, 2017, pp. 279–306.
[28] A. Nemirovski and D. B. Yudin. “On Cezari’s convergence of the steepest descent method for
approximating saddle point of convex-concave functions”. In: Soviet Mathematics Doklady
19 (1978).
[29] A. Nemirovski and D. B. Yudin. Problem complexity and method efficiency in optimization.
Wiley Interscience, 1983.
[30] A. Nemirovski et al. “Robust stochastic approximation approach to stochastic program-
ming”. In: SIAM Journal on Optimization 19.4 (2009), pp. 1574–1609.
[31] Y. Nesterov. Introductory Lectures on Convex Optimization. Vol. 87. Springer Science &
Business Media, 2004.
[32] F. Orabona. A Modern Introduction to Online Learning. 2019.
[33] J. Peypouquet. Convex Optimization in Normed Spaces. SpringerBriefs in Optimization.
Cham: Springer International Publishing, 2015.
[34] B. T. Polyak. “Some Methods of Speeding up the Convergence of Iteration Methods”. In:
USSR Computational Mathematics and Mathematical Physics 4 (1964), pp. 1–17.
[35] H. Robbins and S. Monro. “A stochastic approximation method”. In: The Annals of Math-
ematical Statistics (1951), pp. 400–407.
[36] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis. New York: Springer, 2009.
[37] M. Schmidt, N. Roux, and F. Bach. “Convergence Rates of Inexact Proximal-Gradient Meth-
ods for Convex Optimization”. In: Advances in neural information processing systems 24
(2011).
[38] M. Schmidt and N. L. Roux. Fast Convergence of Stochastic Gradient Descent under a Strong
Growth Condition. 2013.
[39] O. Sebbouh, R. M. Gower, and A. Defazio. “Almost sure convergence rates for Stochastic
Gradient Descent and Stochastic Heavy Ball”. In: COLT (2021).
[40] S. Shalev-Shwartz, Y. Singer, and N. Srebro. “Pegasos: primal estimated subgradient solver
for SVM”. In: 24th International Conference on Machine Learning. 2007, pp. 807–814.
[41] O. Shamir and T. Zhang. “Stochastic Gradient Descent for Non-smooth Optimization: Con-
vergence Results and Optimal Averaging Schemes”. In: Proceedings of the 30th International
Conference on Machine Learning. 2013, pp. 71–79.
[42] M. Zinkevich. “Online Convex Programming and Generalized Infinitesimal Gradient Ascent”.
In: Proceedings of the Twentieth International Conference on International Conference on
Machine Learning. ICML’03. 2003, pp. 928–935.

A Appendix
A.1 Lemmas for Complexity
The following lemma was copied from Lemma 11 in [14].

Lemma A.1. Consider a sequence (αk)k ⊂ ℝ+ of positive scalars that converges to zero according to

$$\alpha_k \le \rho^k\alpha_0, \tag{115}$$

where ρ ∈ [0, 1). For a given 1 > ε > 0 we have that

$$k \ge \frac{1}{1-\rho}\log\left(\frac{1}{\varepsilon}\right) \implies \alpha_k \le \varepsilon\,\alpha_0. \tag{116}$$

Proof. First note that if ρ = 0 the result follows trivially. Assuming ρ ∈ (0, 1), rearranging (115) and applying the logarithm to both sides gives

$$\log\left(\frac{\alpha_0}{\alpha_k}\right) \ge k\log\left(\frac{1}{\rho}\right). \tag{117}$$

Now using that

$$\frac{1}{1-\rho}\log\left(\frac{1}{\rho}\right) \ge 1 \tag{118}$$

for all ρ ∈ (0, 1), and assuming that

$$k \ge \frac{1}{1-\rho}\log\left(\frac{1}{\varepsilon}\right), \tag{119}$$

we have that

$$\log\left(\frac{\alpha_0}{\alpha_k}\right) \overset{(117)}{\ge} k\log\left(\frac{1}{\rho}\right) \overset{(119)}{\ge} \frac{1}{1-\rho}\log\left(\frac{1}{\rho}\right)\log\left(\frac{1}{\varepsilon}\right) \overset{(118)}{\ge} \log\left(\frac{1}{\varepsilon}\right).$$

Applying exponentials to the above inequality gives (116).

As an example of the use of this lemma, consider a sequence of random vectors (Y^k)k for which the expected squared norm converges to zero according to

$$\mathbb{E}\left[\|Y^k\|^2\right] \le \rho^k\|Y^0\|^2. \tag{120}$$

Then applying Lemma A.1 with $\alpha_k = \mathbb{E}\left[\|Y^k\|^2\right]$, for a given 1 > ε > 0, states that

$$k \ge \frac{1}{1-\rho}\log\left(\frac{1}{\varepsilon}\right) \implies \mathbb{E}\left[\|Y^k\|^2\right] \le \varepsilon\,\|Y^0\|^2.$$
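A quick numerical check of Lemma A.1 (the values of ρ and ε are arbitrary):

```python
import math

rho, eps = 0.9, 1e-3
k = math.ceil(1 / (1 - rho) * math.log(1 / eps))
print(k, rho ** k <= eps)  # k = 70, and indeed rho^k <= eps, so alpha_k <= eps * alpha_0
```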

Lemma A.2. Let ε > 0, and consider a sequence (αt)t ⊂ ℝ+ satisfying the recurrence

$$\alpha_t \le (1 - \gamma\mu)^t\alpha_0 + A\gamma, \tag{121}$$

where µ > 0 and A, C ≥ 0 are given constants, and γ ∈ ]0, 1/C[ is a free parameter. If

$$\gamma = \min\left\{\frac{\varepsilon}{2A},\ \frac{1}{C}\right\}, \tag{122}$$

then

$$t \ge \max\left\{\frac{1}{\varepsilon}\frac{2A}{\mu},\ \frac{C}{\mu}\right\}\log\left(\frac{2\alpha_0}{\varepsilon}\right) \implies \alpha_t \le \varepsilon.$$

Proof. First we restrict γ so that the second term in (121) is less than ε/2, that is,

$$A\gamma \le \frac{\varepsilon}{2} \iff \gamma \le \frac{\varepsilon}{2A}.$$

Thus we set γ according to (122), which also satisfies the constraint γ ≤ 1/C. Furthermore we want

$$(1 - \mu\gamma)^t\alpha_0 \le \frac{\varepsilon}{2}.$$

Taking logarithms and re-arranging, the above means that we want

$$\log\left(\frac{2\alpha_0}{\varepsilon}\right) \le t\log\left(\frac{1}{1 - \gamma\mu}\right). \tag{123}$$

Now using that $\log\left(\frac{1}{\rho}\right) \ge 1 - \rho$, with ρ = 1 − γµ ∈ ]0, 1], we see that for (123) to be true, it is enough to have

$$t \ge \frac{1}{\mu\gamma}\log\left(\frac{2\alpha_0}{\varepsilon}\right).$$

Substituting in γ from (122) gives

$$t \ge \max\left\{\frac{1}{\varepsilon}\frac{2A}{\mu},\ \frac{C}{\mu}\right\}\log\left(\frac{2\alpha_0}{\varepsilon}\right).$$
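And a similar check for Lemma A.2, verifying that the prescribed γ and t indeed bring the right-hand side of (121) below ε (hypothetical constants):

```python
import math

mu, A, C, alpha0, eps = 0.1, 1.0, 5.0, 1.0, 1e-2
gamma = min(eps / (2 * A), 1 / C)
t = math.ceil(max(2 * A / (eps * mu), C / mu) * math.log(2 * alpha0 / eps))
print((1 - gamma * mu) ** t * alpha0 + A * gamma <= eps)  # True
```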

A.2 A nonconvex PL function


Lemma A.3 (A nonconvex PL function). Let f(t) = t² + 3 sin(t)². Then f is µ-Polyak-Lojasiewicz with µ = 1/40, while not being convex.

Proof. The fact that f is not convex follows directly from the fact that f″(t) = 2 + 6 cos(2t) can be nonpositive; for instance f″(π/2) = −4. To prove that f is PL, start by computing f′(t) = 2t + 3 sin(2t), and note that inf f = 0. Therefore we are looking for a constant α = 2µ > 0 such that

$$\text{for all } t \in \mathbb{R}, \quad \left(2t + 3\sin(2t)\right)^2 \ge \alpha\left(t^2 + 3\sin(t)^2\right).$$

Using the fact that sin(t)² ≤ t², we see that it is sufficient to find α > 0 such that

$$\text{for all } t \in \mathbb{R}, \quad \left(2t + 3\sin(2t)\right)^2 \ge 4\alpha t^2.$$

Now let us introduce X = 2t, Y = 3 sin(2t), so that the above property is equivalent to

$$\text{for all } (X, Y) \in \mathbb{R}^2 \text{ such that } Y = 3\sin(X), \quad (X + Y)^2 \ge \alpha X^2.$$

It is easy to check when the inequality (X + Y)² ≥ αX² is verified or not:

$$(X + Y)^2 < \alpha X^2 \iff \begin{cases} X > 0 \ \text{ and } \ -(1 + \sqrt{\alpha})X < Y < -(1 - \sqrt{\alpha})X, \\ \text{or} \\ X < 0 \ \text{ and } \ -(1 - \sqrt{\alpha})X < Y < -(1 + \sqrt{\alpha})X. \end{cases} \tag{124}$$
enough. We will consider different cases depending on the value of X:

• If X ∈ [0, π], we have Y = 3 sin(X) ≥ 0 > −(1 − α)X, provided that α < 1.

• On [π, 45 π] we can use the inequality sin(t) ≥ π − t. One way to prove this inequality is to use
the fact that sin(t) is convex on [π, 2π] (its second derivative is − sin(t) ≥ 0), which implies
that sin(t) is greater than its tangent at t0 = π, whose equation is π − t. This being said, we
can write (remember that X ∈ [π, 45 π] here):

5 3 √ √
Y = 3 sin(X) ≥ 3(π − X) ≥ 3(π − π) = − π > −(1 − α)π ≥ −(1 − α)X,
4 4

where the strict inequality is true whenever 34 < 1 − α ⇔ α < 16
1
' 0.06.

• If X ∈ [ 45 π, +∞[, we simply use the fact that


√ 5 √
Y = 3 sin(X) ≥ −3 > −(1 − α) π ≥ −(1 − α)X,
4
√ 12 2
where the strict inequality is true whenever 3 < (1 − α) 54 π ⇔ α < 1 −

5π ' 0.055.

• If X ∈]−∞, 0], we can use the exact same arguments (use the fact that sine is a odd function)

to obtain that Y < −(1 − α)X.

In every cases, we see that (124) is violated when Y = 3 sin(X) and α = 0.05, which allows us to
conclude that f is µ-PL with µ = α/2 = 0.025 = 1/40.
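A grid check of the PL inequality ½ f′(t)² ≥ µ(f(t) − inf f) with µ = 1/40 (a numerical sanity check, not a substitute for the proof):

```python
import numpy as np

t = np.linspace(-100, 100, 200_001)
f = t ** 2 + 3 * np.sin(t) ** 2          # inf f = 0
df = 2 * t + 3 * np.sin(2 * t)           # f'(t)
print(np.all(0.5 * df ** 2 >= f / 40))   # True on this grid
```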
