
Non-Convex Learning via Stochastic Gradient Langevin Dynamics: A Nonasymptotic Analysis

Maxim Raginsky*, Alexander Rakhlin†, Matus Telgarsky‡

arXiv:1702.03849v2 [cs.LG] 14 Feb 2017

Abstract
Stochastic Gradient Langevin Dynamics (SGLD) is a popular variant of Stochastic Gradient
Descent, where properly scaled isotropic Gaussian noise is added to an unbiased estimate of
the gradient at each iteration. This modest change allows SGLD to escape local minima and
suffices to guarantee asymptotic convergence to global minimizers for sufficiently regular non-
convex objectives (Gelfand and Mitter, 1991).
The present work provides a nonasymptotic analysis in the context of non-convex learning
problems: SGLD requires Õ(ε^{-4}) iterations to sample Õ(ε)-approximate minimizers of both
empirical and population risk, where Õ(·) hides polynomial dependence on a temperature
parameter, the model dimension, and a certain spectral gap parameter.
As in the asymptotic setting, our analysis relates the discrete-time SGLD Markov chain to
a continuous-time diffusion process. A new tool that drives the results is the use of weighted
transportation cost inequalities to quantify the rate of convergence of SGLD to a stationary
distribution in the Euclidean 2-Wasserstein distance.

1 Introduction and informal summary of results


Consider a stochastic optimization problem
    minimize F(w) := E_P[f(w, Z)] = ∫_Z f(w, z) P(dz),

where w takes values in R^d and Z is a random element of some space Z with an unknown probability law P. We have access to an n-tuple Z = (Z_1, ..., Z_n) of i.i.d. samples drawn from P, and our goal is to generate a (possibly random) hypothesis Ŵ ∈ R^d with small expected excess risk

    E F(Ŵ) − F*,    (1.1)

where F* := inf_{w∈R^d} F(w), and the expectation is with respect to the training data Z and any additional randomness used by the algorithm for generating Ŵ.
* University of Illinois; [email protected]. Research supported in part by the NSF under CAREER award CCF-1254041, and in part by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370.
† University of Pennsylvania; [email protected]. Research supported in part by the NSF under grant no. CDS&E-MSS 1521529.
‡ University of Illinois, Simons Institute; [email protected].
When the functions w 7→ f (w, z) are not convex, theoretical analysis of global convergence
becomes largely intractable. On the other hand, non-convex optimization is currently witness-
ing an impressive string of empirical successes, most notably in the realm of deep neural net-
works. Towards the aim of bridging this gap between theory and practice, this paper provides
a theoretical justification for Stochastic Gradient Langevin Dynamics (SGLD), a popular variant of
stochastic gradient descent, in which properly scaled isotropic Gaussian noise is added to an unbi-
ased estimate of the gradient at each iteration (Gelfand and Mitter, 1991; Borkar and Mitter, 1999;
Welling and Teh, 2011).
Since the population distribution P is unknown, we attempt to (approximately) minimize

    F_z(w) := (1/n) ∑_{i=1}^n f(w, z_i),    (1.2)

the empirical risk of a hypothesis w ∈ R^d on a dataset z = (z_1, ..., z_n) ∈ Z^n. The SGLD algorithm studied in this work is given by the recursion

    W_{k+1} = W_k − η g_k + √(2η β^{-1}) ξ_k,    (1.3)

where gk is a conditionally unbiased estimate of the gradient ∇Fz (Wk ), ξk is a standard Gaussian
random vector in Rd , η > 0 is the step size, and β > 0 is the inverse temperature parameter. Our
analysis begins with the standard observation (see, e.g., Borkar and Mitter (1999) for a rigorous
treatment or Welling and Teh (2011) for a heuristic discussion) that the discrete-time Markov pro-
cess (1.3) can be viewed as a discretization of the continuous-time Langevin diffusion described
by the Itô stochastic differential equation

    dW(t) = −∇F_z(W(t)) dt + √(2β^{-1}) dB(t),   t ≥ 0,    (1.4)

where {B(t)}t≥0 is the standard Brownian motion in Rd . Under suitable assumptions on f , it can be
shown that the Gibbs measure πz (dw) ∝ exp(−βFz(w)) is the unique invariant distribution of (1.4),
and that the distributions of W (t) converge rapidly to πz as t → ∞ (Chiang et al., 1987). Moreover,
for all sufficiently large values of β, the Gibbs distribution concentrates around the minimizers of
Fz (Hwang, 1980). Consequently, a draw from the Gibbs distribution is, with high probability, an
almost-minimizer of the empirical risk (1.2), and, if one can show that the SGLD recursion tracks
the Langevin diffusion in a suitable sense, then it follows that the distributions of Wk will be close
to the Gibbs measure for all sufficiently large k. Hence, one can argue that, for large enough k, the
output of SGLD is also an almost-minimizer of the empirical risk.
It is well-recognized, however, that minimization of the empirical risk F_z does not immediately translate into minimization of the population risk F. A standard approach for addressing the issue is to decompose the excess risk into a sum of two terms, F(Ŵ) − F_z(Ŵ) (the generalization error of Ŵ) and F_z(Ŵ) − F* (the gap between the empirical risk of Ŵ and the minimum of the population risk), and then show that both of these terms are small (either in expectation or with high probability). Taking Ŵ = W_k and letting Ŵ* be the output of the Gibbs algorithm, under which the conditional distribution of Ŵ* given Z = z is equal to π_z, we decompose the excess risk (1.1) as follows:

    E F(Ŵ) − F* = ( E F(Ŵ) − E F(Ŵ*) ) + ( E F(Ŵ*) − E F_Z(Ŵ*) ) + ( E F_Z(Ŵ*) − F* ),    (1.5)
where the first term is the difference of expected population risks of SGLD and the Gibbs algorithm, the second term is the generalization error of the Gibbs algorithm, and the third term is easily upper-bounded in terms of the expected suboptimality E[F_Z(Ŵ*) − min_w F_Z(w)] of the Gibbs algorithm for the empirical risk. Observe that only the first term pertains to SGLD, whereas the other two involve solely the Gibbs distribution. The main contribution of this work is in showing finite-time convergence of SGLD for a non-convex objective function. Informally, we can state our main result as follows:
1. For any ε > 0, the first term in (1.5) scales as

       ε · Poly(β, d, 1/λ*)   for   k ≳ Poly(β, d, 1/λ*, 1/ε)   and   η ≤ (ε / log(1/ε))^4,    (1.6)

   where λ* is a certain spectral gap parameter that governs the exponential rate of convergence of the Langevin diffusion to its stationary distribution. This spectral gap parameter itself might depend on β and d, but is independent of n.

2. The second and third terms in (1.5) scale, respectively, as

       (β + d)^2 / (λ* n)   and   (d/β) log(β + 1).    (1.7)

1.1 Method of analysis: an overview


Our analysis draws heavily on the theory of optimal transportation (Villani, 2003) and on the
analysis of Markov diffusion operators (Bakry et al., 2014) (the necessary background on Markov
semigroups and functional inequalities is given in Appendix A). In particular, we control the
convergence of SGLD to the Gibbs distribution in terms of the 2-Wasserstein distance

    W_2(μ, ν) := inf { (E‖V − W‖^2)^{1/2} : μ = L(V), ν = L(W) },

where ‖·‖ is the Euclidean (ℓ_2) norm on R^d, μ and ν are Borel probability measures on R^d with finite second moments, and the infimum is taken over all random couples (V, W) taking values in R^d × R^d with marginals V ∼ μ and W ∼ ν.
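For intuition, the infimum over couplings can be computed exactly in one dimension, where the optimal coupling matches quantiles; a small numerical sketch, checked against the classical closed-form W_2 between one-dimensional Gaussians (both facts are standard, not specific to this paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def w2_empirical_1d(x, y):
    # On the real line the optimal coupling pairs order statistics
    # (the quantile coupling), so empirical W_2 reduces to sorting.
    x, y = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x - y) ** 2))

# Sanity check against the closed form for one-dimensional Gaussians:
# W_2(N(m1, s1^2), N(m2, s2^2))^2 = (m1 - m2)^2 + (s1 - s2)^2.
m1, s1, m2, s2 = 0.0, 1.0, 2.0, 0.5
x = rng.normal(m1, s1, size=200_000)
y = rng.normal(m2, s2, size=200_000)
est = w2_empirical_1d(x, y)
exact = np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)
print(est, exact)   # the two values agree to roughly two decimal places
```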
To control the first term on the right-hand side of (1.5), we first upper-bound the 2-Wasserstein distance between the distributions of W_k (the kth iterate of SGLD) and W(kη) (the point reached by the Langevin diffusion at time t = kη). This requires some heavy lifting: existing bounds on the 2-Wasserstein distance between a diffusion process and its time-discretized version due to Alfonsi et al. (2015) scale like η e^{kη}, which is far too crude for our purposes. By contrast, we take an indirect route via a Girsanov-type change of measure and a weighted transportation-cost inequality of Bolley and Villani (2005) to obtain a bound that scales like kη · η^{1/4}. This step relies crucially on a certain exponential integrability property of the Langevin diffusion. Next, we show that the Gibbs distribution satisfies a logarithmic Sobolev inequality, which allows us to conclude that the 2-Wasserstein distance between the distribution of W(kη) and the Gibbs distribution decays exponentially as e^{-kη}. Since W_2 satisfies the triangle inequality, we can produce an upper bound on the first term in (1.5) that scales as kη · η^{1/4} + e^{-kη}. This immediately suggests that we can make this term as small as we wish by first choosing a large enough horizon t = kη and then a small enough step size η. Overall, this leads to the bounds stated in (1.6).

To control the second term in (1.5), we show that the Gibbs algorithm is stable in 2-Wasserstein
distance with respect to local perturbations of the training dataset. This step, again, relies on the
logarithmic Sobolev inequality for the Gibbs distribution. To control the third term, we use a
nonasymptotic Laplace integral approximation to show that a single draw from the Gibbs distri-
bution is an approximate minimizer of the empirical risk. We use a Wasserstein continuity result
due to Polyanskiy and Wu (2016) and a well-known equivalence between stability of empirical
minimization and generalization (Mukherjee et al., 2006; Rakhlin et al., 2005) to show that, in
fact, the Gibbs algorithm samples from near-minimizers of the population risk.
We remark that our result readily extends to the case when the stochastic gradients gk in (1.3)
are formed with respect to independent draws from the data-generating distribution P – e.g.,
when taking a single pass through the dataset. In this case, the target of optimization is F itself
rather than F_z, and we simply omit the second term in (1.5). If the main concern is not consistency (as in (1.1)) but rather the generalization performance of SGLD itself, then the same analysis applied to the decomposition

    E F_Z(Ŵ) − E F(Ŵ) = ( E F_Z(Ŵ) − E F_Z(Ŵ*) ) + ( E F_Z(Ŵ*) − E F(Ŵ*) )    (1.8)

gives an upper bound of (1.6) plus the first term of (1.7). In other words, while the rate of (1.1) may be hampered by the slow convergence of (d log β)/β, the rate of generalization is not. Finally, if each data point is used only once, the generalization performance is controlled by (1.6) alone.

1.2 Related work


The asymptotic study of convergence of discretized Langevin diffusions for non-convex objectives
has a long history, starting with the work of Gelfand and Mitter (1991). Most of the work has
focused on annealing-type schemes, where both the step size η and the temperature 1/β are de-
creased with time. Márquez (1997) and Pelletier (1998) studied the rates of weak convergence
for both the Langevin diffusion and the discrete-time updates. However, when η and β are kept
fixed, the updates do not converge to a global minimizer, but one can still aim for convergence to a
stationary distribution. An asymptotic study of this convergence, in the sense of relative entropy,
was initiated by Borkar and Mitter (1999).
Dalalyan and Tsybakov (2012) and Dalalyan (2016) analyzed rates of convergence of discrete-
time Langevin updates (with exact gradients) in the case of convex functions, and provided
nonasymptotic rates of convergence in the total variation distance for sampling from log-concave
densities. Durmus and Moulines (2015) refined these results by establishing geometric conver-
gence in total variation distance for convex and strongly convex objective functions. Bubeck et al.
(2015) studied projected Langevin updates in the convex case.
Our work is motivated in part by recent papers on non-convex optimization and, in particular,
on optimization problems related to neural networks. A heuristic analysis of SGLD was given by
Welling and Teh (2011), and a modification of SGLD to improve generalization performance was
recently proposed by Chaudhari et al. (2016). Deliberate addition of noise was also proposed by Ge et al. (2015) as a strategy for escaping from saddle points, and Belloni et al. (2015) analyzed a simulated annealing method based on Hit-and-Run for sampling from nearly log-concave distributions. While these methods aim at avoiding local minima through random perturbations, the line of work on continuation methods and graduated optimization (Hazan et al., 2016) attempts to create sequences of smoothed approximations that can successively localize the optimum.

Hardt et al. (2015) studied uniform stability and generalization properties of stochastic gradi-
ent descent with both convex and non-convex objectives. For the non-convex case, their upper
bound on stability degrades with the number of steps of the optimization procedure, which was
taken by the authors as a prescription for early stopping. In contrast, we show that, under our
assumptions, non-convexity does not imply loss of stability when the latter is measured in terms
of 2-Wasserstein distance to the stationary distribution. In addition, we use the fact that Gibbs
distribution concentrates on approximate empirical minimizers, implying convergence for the
population risk via stability (Rakhlin et al., 2005; Mukherjee et al., 2006).

2 The main result


We begin by giving a precise description of the SGLD recursion. A stochastic gradient oracle, i.e., the mechanism for accessing the gradient of F_z at each iteration, consists of a collection (Q_z)_{z∈Z^n} of probability measures on some space U and a mapping g : R^d × U → R^d, such that, for every z ∈ Z^n,

    E g(w, U_z) = ∇F_z(w),   ∀w ∈ R^d,    (2.1)

where U_z is a random element of U with probability law Q_z. Conditionally on Z = z, the SGLD update takes the form

    W_{k+1} = W_k − η g(W_k, U_{z,k}) + √(2η β^{-1}) ξ_k,   k = 0, 1, 2, ...,    (2.2)

where {U_{z,k}}_{k=0}^∞ is a sequence of i.i.d. random elements of U with probability law Q_z and {ξ_k}_{k=0}^∞ is a sequence of i.i.d. standard Gaussian random vectors in R^d. We assume that W_0, (Z, {U_{Z,k}}_{k=0}^∞), and {ξ_k}_{k=0}^∞ are mutually independent. We impose the following assumptions (see the discussion in Section 4 for additional details):

(A.1) The function f takes nonnegative real values, and there exist constants A, B ≥ 0 such that

    |f(0, z)| ≤ A   and   ‖∇f(0, z)‖ ≤ B,   ∀z ∈ Z.

(A.2) For each z ∈ Z, the function f(·, z) is M-smooth: for some M > 0,

    ‖∇f(w, z) − ∇f(v, z)‖ ≤ M‖w − v‖,   ∀w, v ∈ R^d.

(A.3) For each z ∈ Z, the function f(·, z) is (m, b)-dissipative (Hale, 1988): for some m > 0 and b ≥ 0,

    ⟨w, ∇f(w, z)⟩ ≥ m‖w‖^2 − b,   ∀w ∈ R^d.    (2.3)

(A.4) There exists a constant δ ∈ [0, 1) such that, for each z ∈ Z^n,¹

    E[‖g(w, U_z) − ∇F_z(w)‖^2] ≤ 2δ (M^2 ‖w‖^2 + B^2),   ∀w ∈ R^d.    (2.4)

¹ We are reusing the constants M and B from (A.1) and (A.2) in (2.4) mainly out of considerations of technical convenience; any other constants M′, B′ > 0 can be substituted in their place without affecting the results.

(A.5) The probability law μ_0 of the initial hypothesis W_0 has a bounded and strictly positive density p_0 with respect to the Lebesgue measure on R^d, and

    κ_0 := log ∫_{R^d} e^{‖w‖^2} p_0(w) dw < ∞.

We are now ready to state our main result. A crucial role will be played by the uniform spectral gap

    λ* := inf_{z∈Z^n} inf { ∫_{R^d} ‖∇g‖^2 dπ_z / ∫_{R^d} g^2 dπ_z : g ∈ C^1(R^d) ∩ L^2(π_z), g ≠ 0, ∫_{R^d} g dπ_z = 0 },    (2.5)

where π_z(dw) ∝ e^{−βF_z(w)} dw is the Gibbs distribution. As detailed in Section 4, Assumptions (A.1)–(A.3) suffice to ensure that λ* > 0. In the statement of the theorem, the notation Õ(·) and Ω̃(·) gives explicit dependence on the parameters β, λ*, and d, but hides factors that depend (at worst) polynomially on the parameters A, B, 1/m, b, M, κ_0. Explicit expressions for these constants are given in the proof.

Theorem 2.1. Suppose that the regularity conditions (A.1)–(A.5) hold. Then, for any β ≥ 1 ∨ 2/m and any ε ∈ (0, m/(2M^2) ∧ e^{−Ω̃(λ*/β(d+β))}), the expected excess risk of W_k is bounded by

    E F(W_k) − F* ≤ Õ( (β(β + d)^2 / λ*) ( δ^{1/4} log(1/ε) + ε ) + (β + d)^2 / (λ* n) + (d log(β + 1)) / β ),    (2.6)

provided

    k = Ω̃( (β(d + β) / (λ* ε^4)) log^5(1/ε) )   and   η ≤ (ε / log(1/ε))^4.    (2.7)

3 Proof of Theorem 2.1


3.1 A high-level overview
Let µz,k := L(Wk |Z = z), νz,t := L(W (t)|Z = z), and Ez [·] := E[·|Z = z]. In a nutshell, our proof of
Theorem 2.1 consists of the following steps:

1. We first show that, for all sufficiently small η > 0, the SGLD recursion (2.2) tracks the continuous-time Langevin diffusion process (1.4) in 2-Wasserstein distance:

       W_2(μ_{z,k}, ν_{z,kη}) = Õ( (β + d)(δ^{1/4} + η^{1/4}) kη )    (3.1)

   (the precise statement with explicit constants is given in Proposition 3.1).

2. Next, we show that the Langevin diffusion (1.4) converges exponentially fast to the Gibbs distribution π_z:

       W_2(ν_{z,kη}, π_z) = Õ( ((β + d)/√λ*) e^{−Ω̃(λ* kη / β(d+β))} ).

   This, together with (3.1) and the triangle inequality, yields the estimate

       W_2(μ_{z,k}, π_z) = Õ( (β + d)(δ^{1/4} + η^{1/4}) kη ) + Õ( ((β + d)/√λ*) e^{−Ω̃(λ* kη / β(d+β))} )    (3.2)

   (see Proposition 3.3 for explicit constants). Observe that there are two terms on the right-hand side of (3.2), one of which grows linearly with t = kη, while the other one decays exponentially with t. Thus, we can first choose t large enough and then η small enough, so that

       W_2(μ_{z,k}, π_z) = Õ( (β(d + β)^2 / λ*) ( δ^{1/4} log(1/ε) + ε ) ).    (3.3)

   The resulting choices of t = kη and η translate into the expressions for k and η given in (A.14).

3. The upshot of Eq. (3.3) is that, for large enough k, the conditional probability law of W_k given Z = z is close, in 2-Wasserstein distance, to the Gibbs distribution π_z. Thus, we are led to consider the Gibbs algorithm that generates a random draw from π_z. We show that the resulting hypothesis is an almost-minimizer of the empirical risk, i.e.,

       ∫_{R^d} F_z(w) π_z(dw) − min_{w∈R^d} F_z(w) = Õ( (d/β) log((β + 1)/d) )    (3.4)

   (see Proposition 3.4 for the exact statement), and also that the Gibbs algorithm is stable in the 2-Wasserstein distance: for any two datasets z, z̄ that differ in a single coordinate,

       W_2(π_z, π_z̄) = Õ( β(d + β) √(1 + d/β) / (λ* n) ).

   This estimate, together with Lemma 3.5 below, implies that the Gibbs algorithm is uniformly stable (Bousquet and Elisseeff, 2002):

       sup_{z∈Z} | ∫_{R^d} f(w, z) π_z(dw) − ∫_{R^d} f(w, z) π_z̄(dw) | = Õ( (β + d)^2 / (λ* n) )    (3.5)

   (see Proposition 3.5). The almost-ERM property (3.4) and the uniform stability property (3.5), together with (3.3), yield the statement of the theorem.

3.2 Technical lemmas


We first collect a few lemmas that will be used in the sequel; see Appendix C for the proofs.
Lemma 3.1 (quadratic bounds on f). For all w ∈ R^d and z ∈ Z,

    ‖∇f(w, z)‖ ≤ M‖w‖ + B    (3.6)

and

    (m/3)‖w‖^2 − (b/2) log 3 ≤ f(w, z) ≤ (M/2)‖w‖^2 + B‖w‖ + A.    (3.7)

Lemma 3.2 (uniform L^2 bounds on SGLD and Langevin diffusion). For all 0 < η < 1 ∧ m/(2M^2) and all z ∈ Z^n,

    sup_{k≥0} E_z ‖W_k‖^2 ≤ κ_0 + 2 (1 ∨ 1/m) (b + 2B^2 + d/β)    (3.8)

and

    E_z ‖W(t)‖^2 ≤ κ_0 e^{−2mt} + ((b + d/β)/m) (1 − e^{−2mt})    (3.9)

for every t ≥ 0, whence

    sup_{t≥0} E_z ‖W(t)‖^2 ≤ κ_0 + (b + d/β)/m.    (3.10)
Lemma 3.3 (exponential integrability of Langevin diffusion). For all β ≥ 2/m, we have

    log E_z [ e^{‖W(t)‖^2} ] ≤ κ_0 + 2 (b + d/β) t.    (3.11)

Lemma 3.4 (relative entropy bound). For any z ∈ Z^n,

    D(μ_0 ‖ π_z) ≤ log ‖p_0‖_∞ + (d/2) log(3π/(mβ)) + β ( Mκ_0/3 + B√κ_0 + A + (1/2) log 3 ).    (3.12)

Lemma 3.5 (2-Wasserstein continuity for functions of quadratic growth, Polyanskiy and Wu (2016)). Let μ, ν be two probability measures on R^d with finite second moments, and let g : R^d → R be a C^1 function obeying

    ‖∇g(w)‖ ≤ c_1 ‖w‖ + c_2,   ∀w ∈ R^d,    (3.13)

for some constants c_1 > 0 and c_2 ≥ 0. Then

    | ∫_{R^d} g dμ − ∫_{R^d} g dν | ≤ (c_1 v + c_2) W_2(μ, ν),    (3.14)

where v^2 := ∫_{R^d} ‖w‖^2 μ(dw) ∧ ∫_{R^d} ‖w‖^2 ν(dw).

3.3 The diffusion approximation


Recall that µz,k = L(Wk |Z = z) and νz,t = L(W (t)|Z = z), and we take µz,0 = νz,0 = µ0 . In this
section, we derive an upper bound on the 2-Wasserstein distance W2 (µz,k , νz,kη ). The analysis
consists of two steps. We first upper-bound the relative entropy D(µz,k kνz,kη ) via a change-of-
measure argument following Dalalyan and Tsybakov (2012) (see also Dalalyan (2016)), except
that we also have to deal with the error introduced by the stochastic gradient oracle. We then use
a weighted transportation-cost inequality of Bolley and Villani (2005) to control the Wasserstein
distance W2 (µz,k , νz,kη ) in terms of D(µz,k kνz,kη ).
The proof of the following lemma is somewhat lengthy, and is given in Appendix D:

Lemma 3.6. For any k ∈ N and any η ∈ (0, 1 ∧ m/(2M^2)), we have

    D(μ_{z,k} ‖ ν_{z,kη}) ≤ ( C_0 βδ + C_1 η ) kη,

with

    C_0 = M^2 ( κ_0 + 2 (1 ∨ 1/m) (b + 2B^2 + d/β) ) + B^2,   C_1 = 6M^2 ( βC_0 + d ).

We now use the following result of Bolley and Villani (2005, Cor. 2.3): for any two μ, ν ∈ P_2(R^d),

    W_2(μ, ν) ≤ C_ν [ √(D(μ‖ν)) + ( D(μ‖ν)/2 )^{1/4} ],

where

    C_ν = 2 inf_{λ>0} ( (1/λ) ( 3/2 + log ∫_{R^d} e^{λ‖w‖^2} ν(dw) ) )^{1/2}.

Let μ = μ_{z,k}, ν = ν_{z,kη}, and take λ = 1. Since β ≥ 2/m, we can use Lemma 3.3 to write

    W_2^2(μ_{z,k}, ν_{z,kη}) ≤ ( 12 + 8 (κ_0 + 2b + 2d/β) kη ) · ( D(μ_{z,k}‖ν_{z,kη}) + √(D(μ_{z,k}‖ν_{z,kη})) ).

Moreover, for all k and η satisfying the conditions of Lemma 3.6, plus the additional requirement kη ≥ 1, we can write

    √(D(μ_{z,k}‖ν_{z,kη})) + D(μ_{z,k}‖ν_{z,kη}) ≤ ( √C_1 + C_1 ) k η^{3/2} + ( √(βC_0) + βC_0 ) · kη √δ.

Putting everything together, we obtain the following result:

Proposition 3.1. For any k ∈ N and any η ∈ (0, 1 ∧ m/(2M^2)) obeying kη ≥ 1, we have

    W_2^2(μ_{z,k}, ν_{z,kη}) ≤ ( C̃_0^2 √δ + C̃_1^2 √η ) · (kη)^2,    (3.15)

with

    C̃_0^2 := ( 12 + 8 (κ_0 + 2b + 2d/β) ) ( √(βC_0) + βC_0 )   and   C̃_1^2 := ( 12 + 8 (κ_0 + 2b + 2d/β) ) ( √C_1 + C_1 ).

3.4 Wasserstein distance to the Gibbs distribution


We now fix a time t ≥ 0 and examine the 2-Wasserstein distance W2 (νz,t , πz ). At this point, we need
to use a number of concepts from the analysis of Markov diffusion operators; see Appendix A for
the requisite background. We start by showing the following:

Proposition 3.2. For β ≥ 2/m, all of the Gibbs measures π_z satisfy a logarithmic Sobolev inequality with constant

    c_LS ≤ (2m^2 + 8M^2)/(m^2 Mβ) + (1/λ*) ( 6M(d + β)/m + 2 ).

Therefore,

    W_2(μ, π_z) ≤ √( 2 c_LS D(μ‖π_z) )    (3.16)

by the Otto–Villani theorem, and, since D(ν_{z,0}‖π_z) = D(μ_0‖π_z) < ∞ by Lemma 3.4, we also have

    D(ν_{z,t}‖π_z) ≤ D(μ_0‖π_z) e^{−2t/βc_LS}    (3.17)

by the theorem on exponential decay of entropy. Combining Eqs. (3.16) (with μ = ν_{z,t}) and (3.17) and using Lemma 3.4, we get

    W_2(ν_{z,t}, π_z) ≤ √( 2 c_LS [ log ‖p_0‖_∞ + (d/2) log(3π/(mβ)) + β ( Mκ_0/3 + B√κ_0 + A + (1/2) log 3 ) ] ) e^{−t/βc_LS}
                     =: C̃_2 e^{−t/βc_LS}.

Letting t = kη and invoking Proposition 3.1, we obtain the following:

Proposition 3.3. For all k and η satisfying the conditions of Proposition 3.1, we have

    W_2(μ_{z,k}, π_z) ≤ ( C̃_0 δ^{1/4} + C̃_1 η^{1/4} ) kη + C̃_2 e^{−kη/βc_LS}.

3.5 Almost-ERM property of the Gibbs algorithm


In this section and the next one, we focus on the properties of the Gibbs algorithm that generates a random hypothesis Ŵ* with L(Ŵ*|Z = z) = π_z. Let p_z(w) = e^{−βF_z(w)}/Λ_z denote the density of the Gibbs measure π_z with respect to the Lebesgue measure on R^d, where Λ_z := ∫_{R^d} e^{−βF_z(w)} dw is the normalization constant known as the partition function. We start by writing

    ∫_{R^d} F_z(w) π_z(dw) = (1/β) ( h(p_z) − log Λ_z ),    (3.18)

where

    h(p_z) = − ∫_{R^d} p_z(w) log p_z(w) dw = − ∫_{R^d} ( e^{−βF_z(w)}/Λ_z ) log ( e^{−βF_z(w)}/Λ_z ) dw

is the differential entropy of p_z (Cover and Thomas, 2006). To upper-bound h(p_z), we estimate the second moment of π_z. From (3.17), it follows that W_2(ν_{z,t}, π_z) → 0 as t → ∞. Since convergence of probability measures in 2-Wasserstein distance is equivalent to weak convergence plus convergence of second moments (Villani, 2003, Theorem 7.12), we have by Lemma 3.2

    ∫_{R^d} ‖w‖^2 π_z(dw) = lim_{t→∞} ∫_{R^d} ‖w‖^2 ν_{z,t}(dw) ≤ (b + d/β)/m.    (3.19)

The differential entropy of a probability density with a finite second moment is upper-bounded by that of a Gaussian density with the same second moment, so we immediately get

    h(p_z) ≤ (d/2) log( 2πe(b + d/β) / (md) ).    (3.20)
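The Gaussian maximum-entropy fact invoked here is easy to check numerically; a sketch in one dimension, comparing the differential entropy of a Laplace density (an illustrative choice, not from the paper) against the bound from a Gaussian of equal second moment:

```python
import numpy as np

# Differential entropy of a zero-mean Laplace density, by numerical
# integration, versus the maximum-entropy bound of a Gaussian with the
# same second moment: in d = 1, h(p) <= (1/2) log(2*pi*e*s2).
b = 0.7                                     # Laplace scale (illustrative)
w = np.linspace(-40.0, 40.0, 800_001)
dw = w[1] - w[0]
p = np.exp(-np.abs(w) / b) / (2.0 * b)      # Laplace(0, b) density
h = -np.sum(p * np.log(p)) * dw             # exact value is 1 + log(2b)
s2 = np.sum(w * w * p) * dw                 # second moment; exact value is 2 b^2
gauss_bound = 0.5 * np.log(2 * np.pi * np.e * s2)
print(h, gauss_bound)                       # h stays strictly below the bound
```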

Moreover, let w_z* be any point that minimizes F_z(w), i.e., F_z* := min_{w∈R^d} F_z(w) = F_z(w_z*). Then ∇F_z(w_z*) = 0, and, since F_z is M-smooth, we have F_z(w) − F_z* ≤ (M/2)‖w − w_z*‖^2 by Lemma 1.2.3 in Nesterov (2004). As a consequence, we can lower-bound log Λ_z using a Laplace integral approximation:

    log Λ_z = log ∫_{R^d} e^{−βF_z(w)} dw
            = −βF_z* + log ∫_{R^d} e^{β(F_z* − F_z(w))} dw
            ≥ −βF_z* + log ∫_{R^d} e^{−βM‖w − w_z*‖^2/2} dw
            = −βF_z* + (d/2) log( 2π/(Mβ) ).    (3.21)

Using Eqs. (3.20) and (3.21) in (3.18) and simplifying, we obtain the following result:

Proposition 3.4. For any β ≥ 2/m,

    ∫_{R^d} F_z(w) π_z(dw) − min_{w∈R^d} F_z(w) ≤ (d/2β) log( (eM/m) ( bβ/d + 1 ) ).
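The Laplace lower bound (3.21) can also be checked numerically in one dimension; a sketch with an illustrative M-smooth objective (the particular F, β, and constants below are made up for the test):

```python
import numpy as np

# Check the Laplace lower bound in d = 1:
#   log Λ = log ∫ exp(-β F(w)) dw  >=  -β F* + (1/2) log(2π / (M β)).
beta = 5.0
F = lambda w: 0.5 * w * w + 0.1 * np.cos(3.0 * w)  # F'' in [0.1, 1.9], so M = 1.9
M = 1.9

w = np.linspace(-30.0, 30.0, 600_001)
dw = w[1] - w[0]
log_Z = np.log(np.sum(np.exp(-beta * F(w))) * dw)  # log Λ by quadrature
F_star = F(w).min()                                 # minimum attained at w = 0, F* = 0.1
lower = -beta * F_star + 0.5 * np.log(2.0 * np.pi / (M * beta))
print(log_Z, lower)                                 # log_Z dominates the lower bound
```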

3.6 Stability of the Gibbs algorithm


Our last step before the final analysis is to show that the Gibbs algorithm is uniformly stable. Fix two n-tuples z = (z_1, ..., z_n), z̄ = (z̄_1, ..., z̄_n) ∈ Z^n with card{i : z_i ≠ z̄_i} = 1. Then the Radon–Nikodym derivative p_{z,z̄} = dπ_z/dπ_z̄ can be expressed as

    p_{z,z̄}(w) = exp( −(β/n) ( f(w, z_{i_0}) − f(w, z̄_{i_0}) ) ) / ( Λ_z / Λ_z̄ ),

where i_0 ∈ [n] is the index of the coordinate where z and z̄ differ. In particular,

    ∇ √(p_{z,z̄}(w)) = (β/2n) ( ∇_w f(w, z̄_{i_0}) − ∇_w f(w, z_{i_0}) ) √(p_{z,z̄}(w)).

Therefore, since π_z̄ satisfies a logarithmic Sobolev inequality with constant c_LS given in Proposition 3.2, we can write

    D(π_z ‖ π_z̄) ≤ 2 c_LS ∫ ‖∇√p_{z,z̄}‖^2 dπ_z̄
                = (c_LS β^2 / 2n^2) ∫_{R^d} ‖∇_w f(w, z̄_{i_0}) − ∇_w f(w, z_{i_0})‖^2 p_{z,z̄}(w) π_z̄(dw)
                = (c_LS β^2 / 2n^2) ∫_{R^d} ‖∇_w f(w, z̄_{i_0}) − ∇_w f(w, z_{i_0})‖^2 π_z(dw)
                ≤ (2 c_LS β^2 / n^2) ( M^2 ∫_{R^d} ‖w‖^2 π_z(dw) + B^2 ),

where the last line follows from the quadratic growth estimate (3.6). Taking μ = π_z in (3.16) and using the above bound and the second-moment estimate (3.19), we obtain

    W_2(π_z, π_z̄) ≤ (2 c_LS β / n) √( B^2 + M^2 (b + d/β)/m ).
Finally, observe that, for each z ∈ Z, the function w ↦ f(w, z) satisfies the conditions of Lemma 3.5 with c_1 = M and c_2 = B, while π_z and π_z̄ satisfy the conditions of Lemma 3.5 with v^2 = (b + d/β)/m. Thus, we obtain the following uniform stability estimate for the Gibbs algorithm:

Proposition 3.5. For any two z, z̄ ∈ Z^n that differ only in a single coordinate,

    sup_{z∈Z} | ∫_{R^d} f(w, z) π_z(dw) − ∫_{R^d} f(w, z) π_z̄(dw) | ≤ C̃_3 / n,

with

    C̃_3 := 4 ( M^2 (b + d/β)/m + B^2 ) β c_LS.

3.7 Completing the proof


Now that we have all the ingredients in place, we can complete the proof of Theorem 2.1. Choose k ∈ N and η ∈ (0, 1 ∧ m/(2M^2)) to satisfy

    kη = β c_LS log(1/ε)   and   η ≤ ( ε / log(1/ε) )^4.

Then, by Proposition 3.3,

    W_2(μ_{z,k}, π_z) ≤ C̃_0 β c_LS δ^{1/4} log(1/ε) + ( C̃_1 β c_LS + C̃_2 ) ε.

Now consider the random hypotheses Ŵ and Ŵ* with L(Ŵ|Z = z) = μ_{z,k} and L(Ŵ*|Z = z) = π_z. Then

    E F(Ŵ) − F* = E F(Ŵ) − E F(Ŵ*) + E F(Ŵ*) − F*
                = ∫_{Z^n} P^{⊗n}(dz) ( ∫_{R^d} F(w) μ_{z,k}(dw) − ∫_{R^d} F(w) π_z(dw) ) + E F(Ŵ*) − F*.

The function F satisfies the conditions of Lemma 3.5 with c_1 = M and c_2 = B, while the probability measures μ_{z,k}, π_z ∈ P_2(R^d) satisfy the conditions of Lemma 3.5 with

    v^2 = κ_0 + 2 (1 ∨ 1/m) ( b + B^2(1 + δ) + d/β ),

by Lemma 3.2. Therefore, for all z ∈ Z^n,

    | ∫_{R^d} F(w) μ_{z,k}(dw) − ∫_{R^d} F(w) π_z(dw) | ≤ K_0 δ^{1/4} log(1/ε) + K_1 ε    (3.22)

with

    K_0 := ( M √( κ_0 + 2 (1 ∨ 1/m) (b + 2B^2 + d/β) ) + B ) C̃_0 β c_LS

and

    K_1 := ( M √( κ_0 + 2 (1 ∨ 1/m) (b + 2B^2 + d/β) ) + B ) ( C̃_1 β c_LS + C̃_2 ).

On the other hand, using Propositions 3.4 and 3.5 together with the well-known equivalence between stability of ERM and generalization (Rakhlin et al., 2005), we have

    E F(Ŵ*) − F* = ( E F(Ŵ*) − E F_Z(Ŵ*) ) + ( E F_Z(Ŵ*) − F* ) ≤ C̃_3/n + (d/2β) log( (eM/m)(bβ/d + 1) ).    (3.23)

Combining Eqs. (3.22) and (3.23), we obtain the claimed excess risk bound (2.6).

4 Discussion and directions for future research


Regularity assumptions. The first two assumptions are fairly standard in the literature on non-
convex optimization. The dissipativity assumption (A.3) merits some discussion. The term “dis-
sipative” comes from the theory of dynamical systems (Hale, 1988; Stuart and Humphries, 1996),
where it has the following interpretation: consider the gradient flow described by the ordinary differential equation

    dw/dt = −∇f(w, z),   w(0) = w_0.    (4.1)

If f is (m, b)-dissipative, then a simple argument based on the Gronwall lemma shows that, for any ε > 0 and any initial condition w_0, the trajectory of (4.1) satisfies ‖w(t)‖ ≤ √(b/m) + ε for all t ≥ (1/2m) log(‖w_0‖^2/ε^2). In other words, for any ε > 0, the Euclidean ball of radius √(b/m) + ε centered at the origin is an absorbing set for the flow (4.1). If we think of w(t) as the position of a particle moving in R^d in the presence of the potential f(w, z), then the above property means that the particle rapidly loses (or dissipates) energy and stays confined in the absorbing set. However, the behavior of the flow inside this absorbing set may be arbitrarily complicated; in particular, even though (2.3) implies that all of the critical points of w ↦ f(w, z) are contained in the ball of radius √(b/m) centered at the origin, there can be arbitrarily many such points. The dissipativity assumption seems restrictive, but, in fact, it can be enforced using weight decay regularization (Krogh and Hertz, 1992). Indeed, consider the regularized objective

    f(w, z) = f_0(w, z) + (γ/2) ‖w‖^2.

Then it is not hard to show that, if the function w ↦ f_0(w, z) is L-Lipschitz, then f satisfies (A.3) with m = γ/2 and b = L^2/2γ. Thus, a byproduct of our analysis is a fine-grained characterization of the impact of weight decay on learning.
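This weight-decay argument is easy to verify numerically; a sketch checking the dissipativity inequality (2.3) with m = γ/2 and b = L^2/2γ for an illustrative L-Lipschitz f_0 (the specific f_0 below is a made-up example):

```python
import numpy as np

rng = np.random.default_rng(2)

# Verify dissipativity (2.3) for f = f0 + (gamma/2)||w||^2 with an
# L-Lipschitz f0; here f0(w) = L * sin(||w||), whose gradient norm is
# at most L and which can point against w (illustrative choice).
L, gamma, d = 3.0, 0.8, 5

def grad_f(w):
    r = np.linalg.norm(w)
    grad_f0 = L * np.cos(r) * w / r        # ||grad_f0|| = L |cos r| <= L
    return grad_f0 + gamma * w

m, b = gamma / 2.0, L * L / (2.0 * gamma)
for _ in range(10_000):
    w = rng.normal(scale=5.0, size=d)
    assert w @ grad_f(w) >= m * (w @ w) - b - 1e-9
print("(2.3) holds with m = gamma/2 and b = L^2/(2*gamma)")
```

The inequality holds deterministically here, since ⟨w, ∇f(w)⟩ ≥ γ‖w‖² − L‖w‖ ≥ (γ/2)‖w‖² − L²/2γ by the AM–GM inequality; the random probes merely illustrate it.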
Assumption (A.4) provides control of the relative mean-square error of the stochastic gradient, viz., E‖g(w, U_z)‖^2 ≲ (1 + δ)‖∇F_z(w)‖^2, and is also easy to satisfy in practice. For example, consider the case where, at each iteration of SGLD, we sample (uniformly with replacement) a random minibatch of size ℓ. Then we can take U_z = (z_{I_1}, ..., z_{I_ℓ}), where I_1, ..., I_ℓ are i.i.d. Uniform({1, ..., n}), and

    g(w, U_z) = (1/ℓ) ∑_{j=1}^ℓ ∇f(w, z_{I_j}).    (4.2)

This gradient oracle is clearly unbiased, and a simple calculation shows that (A.4) holds with δ = 1/ℓ. On the other hand, using the full empirical gradient clearly gives δ = 0.
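The 1/ℓ scaling of the oracle's mean-square error is straightforward to see empirically; a sketch with synthetic per-example gradients standing in for ∇f(w, z_i) at a fixed w (all data here is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Empirical check that the with-replacement minibatch oracle (4.2) has
# mean-square error E||g - grad F_z||^2 proportional to 1/l, so that
# (A.4) holds with delta = 1/l. Rows of G are synthetic stand-ins for
# the per-example gradients grad f(w, z_i).
n, d = 500, 4
G = rng.normal(size=(n, d))
full = G.mean(axis=0)                              # grad F_z(w)

def mse(l, trials=20_000):
    idx = rng.integers(0, n, size=(trials, l))     # minibatch indices
    g = G[idx].mean(axis=1)                        # oracle outputs g(w, U_z)
    return float(np.mean(np.sum((g - full) ** 2, axis=1)))

print(mse(1), mse(4), mse(16))   # each fourfold increase in l cuts the MSE by ~4x
```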
Finally, the exponential integrability assumption (A.5) is satisfied, for example, by the Gaussian initialization W_0 ∼ N(0, σ^2 I_d) with σ^2 < 1/2.

Effect of gradient noise and minibatch size selection. Observe that the excess risk bound (2.6) contains a term that goes to zero as ε → 0, as well as a term that grows as log ε^{-1} but goes to zero as the gradient noise level δ → 0. This suggests selecting the minibatch size

    ℓ ≥ 1/η ≥ ( log(1/ε) / ε )^4

to offset the log ε^{-1} term.

Uniform spectral gap. As shown in Appendix B, Assumptions (A.1)–(A.3) are enough to guarantee that the spectral gap λ* is strictly positive. In particular, we give a very conservative estimate

    1/λ* = Õ( 1/(β(d + β)) ) + Õ( (1 + d/β) e^{Õ(β+d)} ).    (4.3)

Using this estimate in Eq. (2.6), we end up with a bound on the excess risk that has a dependence on exp(Õ(β + d)). This in turn suggests choosing ε = 1/n and β = Õ(log n); as a consequence, the excess risk will decay as 1/log n, and the number of iterations k will scale as n^{Õ(1)} exp(Õ(d)). The alternative regime of conditionally independent stochastic gradients (e.g., using a fresh minibatch at each iteration) amounts to direct optimization of F rather than F_z and suggests the choice of β ≈ 1/ε. The number of iterations k will then scale like exp(d + 1/ε).
Therefore, in order to apply Theorem 2.1, one needs to fully exploit the structural properties of the problem at hand and produce an upper bound on 1/λ∗ which is polynomial in d or

even dimension-free. (By contrast, exponential dependence of 1/λ∗ on β is unavoidable in the
presence of multiple local minima and saddle points; this is a consequence of sharp upper and
lower bounds on the spectral gap due to Bovier et al. (2005).) For example, consider replacing the
empirical risk (1.2) with a smoothed objective
\[
\begin{aligned}
\tilde{F}_z(w) &= -\frac{1}{\beta}\log \int_{\{\|v\|\le R\}} e^{-\beta\gamma\|v-w\|^2/2}\, e^{-\beta F_z(v)}\, dv \\
&= \frac{\gamma}{2}\|w\|^2 - \frac{1}{\beta}\log \int_{\{\|v\|\le R\}} e^{\beta\gamma\langle v,w\rangle - \beta\gamma\|v\|^2/2}\, e^{-\beta F_z(v)}\, dv,
\end{aligned}
\]

and running SGLD with ∇F̃z instead of ∇Fz . Here, γ > 0 and R > 0 are tunable parameters. This
modification is closely related to the Entropy-SGD method, recently proposed by Chaudhari et al.
(2016). Observe that the modified Gibbs measures π̃_z(dw) ∝ e^{−βF̃_z(w)} dw are convolutions of a Gaussian measure and a compactly supported probability measure. In this case, it follows from the results of Bardet et al. (2015) that
\[
\frac{1}{\lambda_*} \le \frac{1}{\beta\gamma}\, e^{4\beta\gamma R^2}.
\]

Note that here, in contrast with (4.3), this bound is completely dimension-free. A tantalizing line
of future work is, therefore, to find other settings where 1/λ∗ is indeed small.

Acknowledgments
The authors would like to thank Arnak Dalalyan and Ramon van Handel for enlightening discus-
sions.

References
A. Alfonsi, B. Jourdain, and A. Kohatsu-Higa. Optimal transport bounds between the time-
marginals of a multidimensional diffusion and its Euler scheme. Electron. J. Probab., 20, 2015.
paper no. 70.

D. Bakry, F. Barthe, P. Cattiaux, and A. Guillin. A simple proof of the Poincaré inequality for a
large class of probability measures including the log-concave case. Electron. Comm. Probab., 13:
60–66, 2008.

D. Bakry, I. Gentil, and M. Ledoux. Analysis and Geometry of Markov Diffusion Operators. Springer,
2014.

J.-B. Bardet, N. Gozlan, F. Malrieu, and P.-A. Zitt. Functional inequalities for Gaussian convo-
lutions of compactly supported measures: explicit bounds and dimension dependence, 2015.
URL http://arxiv.org/abs/1507.02389. To appear in Bernoulli.

A. Belloni, T. Liang, H. Narayanan, and A. Rakhlin. Escaping the local minima via simulated
annealing: Optimization of approximately convex functions. In COLT, pages 240–265, 2015.

F. Bolley and C. Villani. Weighted Csiszár–Kullback–Pinsker inequalities and applications to transportation inequalities. Annales de la Faculté des Sciences de Toulouse, XIV(3):331–352, 2005.

V. S. Borkar and S. K. Mitter. A strong approximation theorem for stochastic recursive algorithms.
Journal of Optimization Theory and Applications, 100(3):499–513, 1999.

O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2
(Mar):499–526, 2002.

A. Bovier, V. Gayrard, and M. Klein. Metastability in reversible diffusion processes II. Precise
asymptotics for small eigenvalues. J. Eur. Math. Soc., 7:69–99, 2005.

S. Bubeck, R. Eldan, and J. Lehec. Sampling from a log-concave distribution with Projected Langevin Monte Carlo. arXiv preprint 1507.02564, 2015. URL http://arxiv.org/abs/1507.02564.

P. Cattiaux, A. Guillin, and L. Wu. A note on Talagrand’s transportation inequality and logarithmic
Sobolev inequality. Prob. Theory Rel. Fields, 148:285–334, 2010.

P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint 1611.01838, 2016. URL http://arxiv.org/abs/1611.01838.

T.-S. Chiang, C.-R. Hwang, and S.-J. Sheu. Diffusion for global optimization in R^n. SIAM Journal on Control and Optimization, 25(3):737–753, 1987.

T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, New York, 2nd edition, 2006.

A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave
densities. J. Roy. Stat. Soc. Ser. B, 2016. To appear.

A. S. Dalalyan and A. B. Tsybakov. Sparse regression learning by aggregation and Langevin Monte Carlo. J. Comp. Sys. Sci., 78:1423–1443, 2012.

A. Durmus and E. Moulines. Non-asymptotic convergence analysis for the unadjusted Langevin algorithm. arXiv preprint 1507.05021, 2015. URL http://arxiv.org/abs/1507.05021.

R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points-online stochastic gradient for
tensor decomposition. In COLT, pages 797–842, 2015.

S. B. Gelfand and S. K. Mitter. Recursive stochastic algorithms for global optimization in R^d. SIAM Journal on Control and Optimization, 29(5):999–1018, 1991.

I. Gyöngy. Mimicking the one-dimensional marginal distributions of processes having an Itô differential. Prob. Theory Rel. Fields, 71:501–516, 1986.

J. K. Hale. Asymptotic Behavior of Dissipative Systems. American Mathematical Society, 1988.

M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint 1509.01240, 2015. URL http://arxiv.org/abs/1509.01240.

E. Hazan, K. Levi, and S. Shalev-Shwartz. On graduated optimization for stochastic non-convex
problems. In ICML, 2016.

C.-R. Hwang. Laplace’s method revisited: weak convergence of probability measures. Annals of Probability, 8(6):1177–1182, 1980.

A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In J. E. Moody, S. J.
Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages
950–957. 1992.

R. S. Liptser and A. N. Shiryaev. Statistics of Random Processes I: General Theory. Springer, 2nd
edition, 2001.

D. Márquez. Convergence rates for annealing diffusion processes. The Annals of Applied Probability,
pages 1118–1139, 1997.

S. Mukherjee, P. Niyogi, T. Poggio, and R. Rifkin. Learning theory: stability is sufficient for gener-
alization and necessary and sufficient for consistency of empirical risk minimization. Advances
in Computational Mathematics, 25(1):161–193, 2006.

Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer, 2004.

M. Pelletier. Weak convergence rates for stochastic approximation with application to multiple
targets and simulated annealing. Annals of Applied Probability, pages 10–44, 1998.

Y. Polyanskiy and Y. Wu. Wasserstein continuity of entropy and outer bounds for interference
channels. IEEE Trans. Inf. Theory, 62(7):3992–4002, July 2016.

A. Rakhlin, S. Mukherjee, and T. Poggio. Stability results in learning theory. Analysis and Applica-
tions, 3(04):397–417, 2005.

A. M. Stuart and A. R. Humphries. Dynamical Systems and Numerical Analysis. Cambridge Univer-
sity Press, 1996.

C. Villani. Topics in Optimal Transportation, volume 58 of Graduate Studies in Mathematics. Amer. Math. Soc., Providence, RI, 2003.

M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, pages 681–688, 2011.

A Background on Markov semigroups and functional inequalities
Our analysis relies on the theory of Markov diffusion operators and associated functional inequal-
ities. In this Appendix, we only summarize the key ideas and results; the book by Bakry et al.
(2014) provides an in-depth exposition.
Let {W(t)}_{t≥0} be a continuous-time homogeneous Markov process with values in R^d, and let P = {P_t}_{t≥0} be the corresponding Markov semigroup, i.e.,
\[
P_s g(W(t)) = \mathbb{E}\big[g(W(s+t)) \,\big|\, W(t)\big]
\]
for all s, t ≥ 0 and all bounded measurable functions g : R^d → R. (The semigroup law P_s ∘ P_t = P_{s+t} is just another way to express the Markov property.) A Borel probability measure π is called stationary or invariant if ∫_{R^d} P_t g dπ = ∫_{R^d} g dπ for all g and t. Each P_t can be extended to a bounded linear operator on L²(π), such that P_t g ≥ 0 whenever g ≥ 0 and P_t 1 = 1 for all t. The generator of the semigroup is a linear operator L defined on a dense subspace D(L) of L²(π) (the domain of L), such that, for any g ∈ D(L),
\[
\partial_t P_t g = L P_t g = P_t L g.
\]
In particular, L1 = 0, and π is an invariant probability measure of the semigroup if and only if ∫_{R^d} Lg dπ = 0 for all g ∈ D(L). The generator L defines the Dirichlet form
\[
\mathcal{E}(g) := -\int_{\mathbb{R}^d} g\, Lg \,d\pi. \tag{A.1}
\]

It can be shown that E(g) ≥ 0, i.e., −L is a positive operator (since L1 = 0, zero is an eigenvalue).
Let P be a Markov semigroup with the unique invariant distribution π and the Dirichlet form E. We say that π satisfies a Poincaré (or spectral gap) inequality with constant c if, for all probability measures µ ≪ π,
\[
\chi^2(\mu \| \pi) \le c\, \mathcal{E}\!\left(\sqrt{\frac{d\mu}{d\pi}}\right), \tag{A.2}
\]
where χ²(µ‖π) := ‖dµ/dπ − 1‖²_{L²(π)} is the χ² divergence between µ and π. The name “spectral gap” comes from the fact that, if (A.2) holds with some constant c, then 1/c ≥ λ, where
\[
\lambda := \inf\left\{ \frac{\mathcal{E}(g)}{\int_{\mathbb{R}^d} g^2 \,d\pi} \,:\, g \in C^2,\ g \ne 0,\ \int_{\mathbb{R}^d} g \,d\pi = 0 \right\}
= \inf\left\{ \frac{-\langle g, Lg\rangle_{L^2(\pi)}}{\|g\|^2_{L^2(\pi)}} \,:\, g \in C^2,\ g \ne 0,\ \int_{\mathbb{R}^d} g \,d\pi = 0 \right\}.
\]
Hence, if λ > 0, then the spectrum of −L is contained in the set {0} ∪ [λ, ∞), so λ is the gap between the zero eigenvalue and the rest of the spectrum. We say that π satisfies a logarithmic Sobolev inequality with constant c if, for all µ ≪ π,
\[
D(\mu \| \pi) \le 2c\, \mathcal{E}\!\left(\sqrt{\frac{d\mu}{d\pi}}\right), \tag{A.3}
\]

where D(µ‖π) = ∫ dµ log(dµ/dπ) is the relative entropy (Kullback–Leibler divergence). We record a couple of key consequences of the logarithmic Sobolev inequality. Consider a Markov process {W(t)}_{t≥0} with a unique invariant distribution π and a Dirichlet form E, such that π satisfies a logarithmic Sobolev inequality with constant c. Then we have the following:

1. Exponential decay of entropy (Bakry et al., 2014, Th. 5.2.1): Let µ_t := L(W(t)). Then
\[
D(\mu_t \| \pi) \le D(\mu_0 \| \pi)\, e^{-2t/c}. \tag{A.4}
\]

2. Otto–Villani theorem (Bakry et al., 2014, Th. 9.6.1): If E(g) = α∫‖∇g‖² dπ for some α > 0, then, for any µ ≪ π,
\[
\mathcal{W}_2(\mu, \pi) \le \sqrt{2c\alpha\, D(\mu \| \pi)}. \tag{A.5}
\]
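The entropy decay can be sanity-checked on the Ornstein–Uhlenbeck semigroup, where everything is explicit: for H(w) = w²/2 in d = 1, π = N(0,1) satisfies (A.3) with c = 1, and starting from µ₀ = N(m₀, 1) the marginal at time t is N(m₀e^{−t}, 1), so (A.4) holds with equality. A short numerical check (our toy example, not part of the exposition):

```python
import math

def kl_gauss(m1, v1, m2, v2):
    """KL divergence D(N(m1, v1) || N(m2, v2)) in nats."""
    return 0.5 * (v1 / v2 + (m1 - m2) ** 2 / v2 - 1.0 + math.log(v2 / v1))

m0 = 3.0                                  # mu_0 = N(m0, 1); pi = N(0, 1)
d0 = kl_gauss(m0, 1.0, 0.0, 1.0)          # D(mu_0 || pi) = m0^2 / 2
for t in [0.0, 0.5, 1.0, 2.0]:
    # OU marginal: W(t) = e^{-t} W(0) + sqrt(1 - e^{-2t}) Z keeps variance 1
    dt = kl_gauss(m0 * math.exp(-t), 1.0, 0.0, 1.0)
    # (A.4) with c = 1 holds with equality for this Gaussian start
    assert abs(dt - d0 * math.exp(-2.0 * t)) < 1e-12
```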

Our analysis of SGLD revolves around Markov diffusion processes, so we particularize the above abstract framework to this concrete setting. Let {W(t)}_{t≥0} be a Markov process evolving in R^d according to an Itô SDE
\[
dW(t) = -\nabla H(W(t))\, dt + \sqrt{2}\, dB(t), \qquad t \ge 0 \tag{A.6}
\]
where H is a C¹ function and {B(t)} is the standard d-dimensional Brownian motion. (Replacing the factor √2 by √(2β⁻¹) is equivalent to the time rescaling t ↦ βt.) The generator of this semigroup is the second-order differential operator
\[
Lg := \Delta g - \langle \nabla H, \nabla g\rangle \tag{A.7}
\]
for all C² functions g, where Δ := ∇·∇ is the Laplace operator. If the map w ↦ ∇H(w) is Lipschitz, then the Gibbs measure π(dw) ∝ e^{−H(w)} dw is the unique invariant measure of the underlying Markov semigroup, and a simple argument using integration by parts shows that the Dirichlet form is given by
\[
\mathcal{E}(g) = \int_{\mathbb{R}^d} \|\nabla g\|^2\, d\pi. \tag{A.8}
\]

Thus, the Gibbs measure π satisfies a Poincaré inequality with constant c if, for any µ ≪ π,
\[
\chi^2(\mu \| \pi) \le c \int_{\mathbb{R}^d} \left\|\nabla\sqrt{\frac{d\mu}{d\pi}}\right\|^2 d\pi \tag{A.9}
\]
and a logarithmic Sobolev inequality with constant c if
\[
D(\mu \| \pi) \le 2c \int_{\mathbb{R}^d} \left\|\nabla\sqrt{\frac{d\mu}{d\pi}}\right\|^2 d\pi. \tag{A.10}
\]
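Convergence of the diffusion (A.6) to π ∝ e^{−H} can be observed numerically by running its Euler–Maruyama discretization (which is exactly the SGLD recursion (2.2) at β = 1 with exact gradients) and comparing empirical averages against quadrature. The double-well potential below is our toy choice, not an object from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy non-convex potential (our choice): H(w) = w^4/4 - w^2/2, a double well.
H = lambda w: w**4 / 4 - w**2 / 2
grad_H = lambda w: w**3 - w

eta, n_steps, n_chains = 1e-2, 20_000, 2_000
w = rng.standard_normal(n_chains)                  # W(0) ~ N(0, 1)
for _ in range(n_steps):
    # Euler-Maruyama step for dW = -grad H(W) dt + sqrt(2) dB
    w = w - eta * grad_H(w) + np.sqrt(2 * eta) * rng.standard_normal(n_chains)

# Compare the empirical second moment against quadrature under pi ~ e^{-H}.
x = np.linspace(-4.0, 4.0, 4001)
dx = x[1] - x[0]
p = np.exp(-H(x))
p /= p.sum() * dx
exact = (x**2 * p).sum() * dx
assert abs(np.mean(w**2) - exact) < 0.1
```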

If H is C² and strongly convex, i.e., ∇²H ⪰ KI_d for some K > 0, then π satisfies a logarithmic Sobolev inequality with constant c = 1/K. In the absence of convexity, it is in general difficult to obtain upper bounds on Poincaré or log-Sobolev constants. The following two propositions give sufficient conditions based on so-called Lyapunov function criteria:

Proposition A.1 (Bakry et al. (2008)). Suppose that there exist constants κ₀, λ₀ > 0, R ≥ 0 and a C² function V : R^d → [1, ∞) such that
\[
\frac{LV(w)}{V(w)} \le -\lambda_0 + \kappa_0 \mathbf{1}\{\|w\| \le R\}. \tag{A.11}
\]
Then π satisfies a Poincaré inequality with constant
\[
c_P \le \frac{1}{\lambda_0}\left(1 + C\kappa_0 R^2 e^{\mathrm{Osc}_R(H)}\right), \tag{A.12}
\]
where C > 0 is a universal constant and Osc_R(H) := max_{‖w‖≤R} H(w) − min_{‖w‖≤R} H(w).
Proposition A.2 (Cattiaux et al. (2010)). Suppose the following conditions hold:
1. There exist constants κ, γ > 0 and a C² function V : R^d → [1, ∞) such that
\[
\frac{LV(w)}{V(w)} \le \kappa - \gamma\|w\|^2 \tag{A.13}
\]
for all w ∈ R^d.
2. π satisfies a Poincaré inequality with constant c_P.
3. There exists some constant K ≥ 0, such that ∇²H ⪰ −KI_d.
Let C₁ and C₂ be defined, for some ε > 0, by
\[
C_1 = \frac{2}{\gamma}\left(\frac{1}{\varepsilon} + \frac{K}{2}\right) + \varepsilon
\quad \text{and} \quad
C_2 = \frac{2}{\gamma}\left(\frac{1}{\varepsilon} + \frac{K}{2}\right)\left(\kappa + \gamma \int_{\mathbb{R}^d} \|w\|^2\, \pi(dw)\right).
\]
Then π satisfies a logarithmic Sobolev inequality with constant c_LS = C₁ + (C₂ + 2)c_P.
Remark A.1. In particular, if K ≠ 0, we can take ε = 2/K, in which case
\[
C_1 = \frac{2K}{\gamma} + \frac{2}{K}
\quad \text{and} \quad
C_2 = \frac{2K}{\gamma}\left(\kappa + \gamma \int_{\mathbb{R}^d} \|w\|^2 \,d\pi\right). \tag{A.14}
\]

B A lower bound on the uniform spectral gap


Our goal here is to prove the crude lower bound on λ∗ given in Section 4. To that end, we will use the Lyapunov function criterion due to Bakry et al. (2008), which is reproduced as Proposition A.1 in Appendix A.
We will apply this criterion to the Gibbs distribution π_z for some z ∈ Z^n. Thus, we have H = βF_z and
\[
Lg = \Delta g - \beta\langle \nabla F_z, \nabla g\rangle.
\]
Consider the candidate Lyapunov function V(w) = e^{mβ‖w‖²/4}. From the fact that V ≥ 1 and from the dissipativity assumption (A.3), it follows that
\[
LV(w) = \left( \frac{m\beta d}{2} + \frac{(m\beta)^2}{4}\|w\|^2 - \frac{m\beta^2}{2}\langle w, \nabla F_z(w)\rangle \right) V(w)
\le \left( \frac{m\beta(d + b\beta)}{2} - \frac{(m\beta)^2}{4}\|w\|^2 \right) V(w). \tag{B.1}
\]

Thus, with κ := mβ(d + bβ)/2 and γ := (mβ)²/4, V evidently satisfies (A.11) with R² = 2κ/γ, κ₀ = κ, and λ₀ = 2κ. Moreover, from Lemma 3.1 and from the fact that F_z ≥ 0, it follows that
\[
\mathrm{Osc}_R(\beta F_z) \le \beta\left(\frac{MR^2}{2} + BR + A\right) \le \beta\left(\frac{(M+B)R^2}{2} + A + B\right).
\]
Thus, by Proposition A.1, π_z satisfies a Poincaré inequality with constant
\[
c_P \le \frac{1}{m\beta(d+b\beta)} + \frac{2C(d+b\beta)}{m\beta}\exp\left(\frac{2}{m}(M+B)(b\beta+d) + \beta(A+B)\right).
\]
Observe that this bound holds for all z ∈ Z^n. Using this fact and the relation 1/λ ≤ c_P between the spectral gap and the Poincaré constant, we see that
\[
\frac{1}{\lambda_*} \le \frac{1}{m\beta(d+b\beta)} + \frac{2C(d+b\beta)}{m\beta}\exp\left(\frac{2}{m}(M+B)(b\beta+d) + \beta(A+B)\right),
\]
which proves the claimed bound.
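For concreteness, the Poincaré-constant bound proved in this appendix can be tabulated numerically. The helper below is a sketch under assumptions: the universal constant C of Proposition A.1 is not known in closed form, so it is passed in as an input, and the parameter values are arbitrary illustrations:

```python
import math

def poincare_bound(m, b, M, B, A, C, beta, d):
    """Sketch of the Poincare-constant bound from this appendix; the
    universal constant C of Proposition A.1 is unknown, so it is an input."""
    first = 1.0 / (m * beta * (d + b * beta))
    second = (2.0 * C * (d + b * beta) / (m * beta)) * math.exp(
        (2.0 / m) * (M + B) * (b * beta + d) + beta * (A + B))
    return first + second

# The exponential blow-up in beta and d is immediate:
base = poincare_bound(m=1, b=1, M=1, B=0, A=0, C=1, beta=1, d=2)
assert poincare_bound(m=1, b=1, M=1, B=0, A=0, C=1, beta=2, d=2) > base
assert poincare_bound(m=1, b=1, M=1, B=0, A=0, C=1, beta=1, d=5) > base
```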

C Proofs for Section 3.2


Proof of Lemma 3.1. The estimate (3.6) is an easy consequence of conditions (A.1) and (A.2). Next,
observe that, for any two v, w ∈ Rd ,
Z1
f (w, z) − f (v, z) = hw − v, ∇f (tw + (1 − t)v, z)idt. (C.1)
0
In particular, taking v = 0, we obtain
Z 1
f (w, z) = f (0, z) + hw, ∇f (tw)idt
0
Z 1
(i)
≤ A+ kwk k∇f (tw, z)k dt
0
Z 1
(ii)
≤ A + kwk (Mtkwk + B) dt
0
M
=A+ kwk2 + Bkwk,
2
where (i) follows from (A.1) and from Cauchy–Schwarz, while (ii) follows from (3.6). This proves
the upper bound on f (w, z). Now take v = cw for some c ∈ (0, 1] to be chosen later. With this
choice, we proceed from Eq. (C.1) as follows:
Z1
f (w, z) = f (cw, z) + hw, ∇f (tw, z)idt
c
Z 1
(i) 1
≥ htw, ∇f (tw, z)idt
c t
Z1  
(ii) 1
≥ mt 2 kwk2 − b dt
c t
m(1 − c 2 )
= kwk2 + b log c,
2

21
where (i) uses the fact that f ≥ 0, while (ii) uses the dissipativity property (2.3). Taking c = √1 , we
3
get the lower bound in (3.7).

Proof of Lemma 3.2. From (2.2), it follows that
\[
\begin{aligned}
\mathbb{E}_z\|W_{k+1}\|^2 &= \mathbb{E}_z\|W_k - \eta g(W_k, U_{z,k})\|^2 + \sqrt{\frac{8\eta}{\beta}}\, \mathbb{E}_z\langle W_k - \eta g(W_k, U_{z,k}), \xi_k\rangle + \frac{2\eta}{\beta}\mathbb{E}_z\|\xi_k\|^2 \\
&= \mathbb{E}_z\|W_k - \eta g(W_k, U_{z,k})\|^2 + \frac{2\eta d}{\beta},
\end{aligned} \tag{C.2}
\]
where the second step uses independence of W_k − g(W_k, U_{z,k}) and ξ_k and the unbiasedness property (2.1) of the gradient oracle. We can further expand the first term in (C.2):
\[
\begin{aligned}
\mathbb{E}_z\|W_k - \eta g(W_k, U_{z,k})\|^2
&= \mathbb{E}_z\|W_k - \eta \nabla F_z(W_k)\|^2 + 2\eta\, \mathbb{E}_z\langle W_k - \eta \nabla F_z(W_k), \nabla F_z(W_k) - g(W_k, U_{z,k})\rangle \\
&\qquad + \eta^2\, \mathbb{E}_z\|\nabla F_z(W_k) - g(W_k, U_{z,k})\|^2 \\
&= \mathbb{E}_z\|W_k - \eta \nabla F_z(W_k)\|^2 + \eta^2\, \mathbb{E}_z\|\nabla F_z(W_k) - g(W_k, U_{z,k})\|^2,
\end{aligned} \tag{C.3}
\]
where we have used (2.1) once again. By (2.4), the second term in (C.3) can be upper-bounded by
\[
\mathbb{E}_z\|\nabla F_z(W_k) - g(W_k, U_{z,k})\|^2 \le 2\delta\big(M^2\, \mathbb{E}_z\|W_k\|^2 + B^2\big),
\]
whereas the first term can be estimated as
\[
\begin{aligned}
\mathbb{E}_z\|W_k - \eta\nabla F_z(W_k)\|^2 &= \mathbb{E}_z\|W_k\|^2 - 2\eta\,\mathbb{E}_z\langle W_k, \nabla F_z(W_k)\rangle + \eta^2\,\mathbb{E}_z\|\nabla F_z(W_k)\|^2 \\
&\le \mathbb{E}_z\|W_k\|^2 + 2\eta\big(b - m\,\mathbb{E}_z\|W_k\|^2\big) + 2\eta^2\big(M^2\,\mathbb{E}_z\|W_k\|^2 + B^2\big) \\
&= \big(1 - 2\eta m + 2\eta^2 M^2\big)\,\mathbb{E}_z\|W_k\|^2 + 2\eta b + 2\eta^2 B^2,
\end{aligned}
\]
where the inequality follows from the dissipativity condition (2.3) and the bound (3.6) in Lemma 3.1. Combining all of the above, we arrive at the recursion
\[
\mathbb{E}_z\|W_{k+1}\|^2 \le \big(1 - 2\eta m + 4\eta^2 M^2\big)\,\mathbb{E}_z\|W_k\|^2 + 2\eta b + 4\eta^2 B^2 + \frac{2\eta d}{\beta}. \tag{C.4}
\]
Fix some η ∈ (0, 1 ∧ m/(2M²)). There are two cases to consider:

• If 1 − 2ηm + 4η²M² ≤ 0, then from (C.4) it follows that
\[
\mathbb{E}_z\|W_{k+1}\|^2 \le 2\eta b + 4\eta^2 B^2 + \frac{2\eta d}{\beta}
\le \mathbb{E}_z\|W_0\|^2 + 2\left(b + 2B^2 + \frac{d}{\beta}\right). \tag{C.5}
\]

• If 0 < 1 − 2ηm + 4η²M² < 1, then iterating (C.4) gives
\[
\begin{aligned}
\mathbb{E}_z\|W_k\|^2 &\le \big(1 - 2\eta m + 4\eta^2 M^2\big)^k\, \mathbb{E}_z\|W_0\|^2 + \frac{\eta b + 2\eta^2 B^2 + \eta d/\beta}{\eta m - 2\eta^2 M^2} \\
&\le \mathbb{E}_z\|W_0\|^2 + \frac{b + 2B^2 + d/\beta}{m}.
\end{aligned} \tag{C.6}
\]

The bound (3.8) follows from Eqs. (C.5) and (C.6) and from the estimate
\[
\mathbb{E}_z\|W_0\|^2 = \mathbb{E}\|W_0\|^2 \le \log \mathbb{E}\, e^{\|W_0\|^2} = \kappa_0, \tag{C.7}
\]
which easily follows from the independence of Z and W_0 and from Jensen’s inequality.
We now analyze the diffusion (1.4). Let Y(t) := ‖W(t)‖². Then Itô’s lemma gives
\[
dY(t) = -2\langle W(t), \nabla F_z(W(t))\rangle\, dt + \frac{2d}{\beta}\, dt + \sqrt{\frac{8}{\beta}}\, W(t)^* dB(t),
\]
where W(t)* dB(t) := Σ_{i=1}^d W_i(t) dB_i(t). This can be rewritten as
\[
\begin{aligned}
2m e^{2mt} Y(t)\,dt + e^{2mt}\,dY(t)
&= -2e^{2mt}\langle W(t), \nabla F_z(W(t))\rangle\, dt + 2m e^{2mt} Y(t)\, dt \\
&\qquad + \frac{2d}{\beta} e^{2mt}\, dt + \sqrt{\frac{8}{\beta}}\, e^{2mt} W(t)^* dB(t).
\end{aligned} \tag{C.8}
\]
Recognizing the left-hand side of (C.8) as the total Itô derivative of e^{2mt}Y(t), we arrive at
\[
\begin{aligned}
d\big(e^{2mt} Y(t)\big) &= -2e^{2mt}\langle W(t), \nabla F_z(W(t))\rangle\, dt + 2m e^{2mt} Y(t)\, dt \\
&\qquad + \frac{2d}{\beta} e^{2mt}\, dt + \sqrt{\frac{8}{\beta}}\, e^{2mt} W(t)^* dB(t),
\end{aligned} \tag{C.9}
\]
which, upon integrating and rearranging, becomes
\[
\begin{aligned}
Y(t) &= e^{-2mt} Y(0) - 2\int_0^t e^{2m(s-t)}\langle W(s), \nabla F_z(W(s))\rangle\, ds + 2m\int_0^t e^{2m(s-t)} Y(s)\, ds \\
&\qquad + \frac{d}{m\beta}\big(1 - e^{-2mt}\big) + \sqrt{\frac{8}{\beta}}\int_0^t e^{2m(s-t)}\, W(s)^* dB(s).
\end{aligned} \tag{C.10}
\]
Now, using the dissipativity condition (2.3), we can write
\[
\begin{aligned}
-2\int_0^t e^{2m(s-t)}\langle W(s), \nabla F_z(W(s))\rangle\, ds
&\le 2\int_0^t e^{2m(s-t)}\big(b - mY(s)\big)\, ds \\
&= 2b\int_0^t e^{2m(s-t)}\, ds - 2m\int_0^t e^{2m(s-t)} Y(s)\, ds \\
&= \frac{b}{m}\big(1 - e^{-2mt}\big) - 2m\int_0^t e^{2m(s-t)} Y(s)\, ds.
\end{aligned}
\]
Substituting this into (C.10), we end up with
\[
\|W(t)\|^2 \le e^{-2mt}\|W(0)\|^2 + \frac{b + d/\beta}{m}\big(1 - e^{-2mt}\big) + \sqrt{\frac{8}{\beta}}\int_0^t e^{2m(s-t)}\, W(s)^* dB(s).
\]
Taking expectations and using the martingale property of the Itô integral together with (C.7), we get (3.9). Eq. (3.10) follows from maximizing the right-hand side of (3.9) over all t ≥ 0.

Proof of Lemma 3.3. For L(t) = e^{‖W(t)‖²}, Itô’s lemma gives
\[
dL(t) = -2\langle W(t), \nabla F_z(W(t))\rangle L(t)\, dt + \frac{4}{\beta} L(t)\|W(t)\|^2\, dt + \frac{2d}{\beta} L(t)\, dt + \sqrt{\frac{8}{\beta}}\, L(t)\, W(t)^* dB(t).
\]
Integrating, we obtain
\[
L(t) = L(0) + \int_0^t \left(\frac{4}{\beta}\|W(s)\|^2 - 2\langle W(s), \nabla F_z(W(s))\rangle\right) L(s)\, ds + \frac{2d}{\beta}\int_0^t L(s)\, ds + \sqrt{\frac{8}{\beta}}\int_0^t L(s)\, W(s)^* dB(s).
\]
From the dissipativity condition (2.3) and from the assumption that β ≥ 2/m, it follows that
\[
\frac{4}{\beta}\|W(s)\|^2 - 2\langle W(s), \nabla F_z(W(s))\rangle \le 2b + \left(\frac{4}{\beta} - 2m\right)\|W(s)\|^2 \le 2b,
\]
hence
\[
L(t) \le L(0) + 2\left(b + \frac{d}{\beta}\right)\int_0^t L(s)\, ds + \sqrt{\frac{8}{\beta}}\int_0^t L(s)\, W(s)^* dB(s).
\]
Taking expectations and using the martingale property of the Itô integral, we get
\[
\mathbb{E}[L(t)] \le \mathbb{E}[L(0)] + 2\left(b + \frac{d}{\beta}\right)\int_0^t \mathbb{E}[L(s)]\, ds
= e^{\kappa_0} + 2\left(b + \frac{d}{\beta}\right)\int_0^t \mathbb{E}[L(s)]\, ds.
\]
Eq. (3.11) then follows by an application of the Gronwall lemma.

Proof of Lemma 3.4. Let p_z denote the density of π_z with respect to the Lebesgue measure on R^d:
\[
p_z(w) = \frac{e^{-\beta F_z(w)}}{\Lambda_z}, \qquad \text{where } \Lambda_z = \int_{\mathbb{R}^d} e^{-\beta F_z(w)}\, dw.
\]
Since p_z > 0 everywhere, we can write
\[
\begin{aligned}
D(\mu_0 \| \pi_z) &= \int_{\mathbb{R}^d} p_0(w) \log\frac{p_0(w)}{p_z(w)}\, dw \\
&= \int_{\mathbb{R}^d} p_0(w)\log p_0(w)\, dw + \log\Lambda_z + \beta\int_{\mathbb{R}^d} p_0(w) F_z(w)\, dw \\
&\le \log\|p_0\|_\infty + \log\Lambda_z + \beta\int_{\mathbb{R}^d} p_0(w) F_z(w)\, dw.
\end{aligned} \tag{C.11}
\]
We first upper-bound the partition function:
\[
\Lambda_z = \int_{\mathbb{R}^d} e^{-\beta F_z(w)}\, dw
= \int_{\mathbb{R}^d} \exp\left(-\frac{\beta}{n}\sum_{i=1}^n f(w, z_i)\right) dw
\le e^{\frac{b\beta}{2}\log 3}\int_{\mathbb{R}^d} e^{-\frac{m\beta\|w\|^2}{3}}\, dw
= 3^{b\beta/2}\left(\frac{3\pi}{m\beta}\right)^{d/2},
\]
where the inequality follows from Lemma 3.1. Thus,
\[
\log\Lambda_z \le \frac{d}{2}\log\frac{3\pi}{m\beta} + \frac{b\beta}{2}\log 3. \tag{C.12}
\]
Moreover, invoking Lemma 3.1 once again, we have
\[
F_z(w) = \frac{1}{n}\sum_{i=1}^n f(w, z_i) \le \frac{M}{3}\|w\|^2 + B\|w\| + A. \tag{C.13}
\]
Therefore,
\[
\int_{\mathbb{R}^d} F_z(w)\, p_0(w)\, dw \le \int_{\mathbb{R}^d} \mu_0(dw)\left(\frac{M}{3}\|w\|^2 + B\|w\| + A\right) \le \frac{M}{3}\kappa_0 + B\sqrt{\kappa_0} + A. \tag{C.14}
\]
Substituting (C.12), (C.13), and (C.14) into (C.11), we get (3.12).

Proof of Lemma 3.5. The proof is a minor tweak of the proof of Proposition 1 in Polyanskiy and Wu (2016); we reproduce it here to keep the presentation self-contained. Without loss of generality, we assume that v² < ∞, otherwise the bound holds trivially. For any two v, w ∈ R^d, we have
\[
\begin{aligned}
g(w) - g(v) &= \int_0^1 \langle w - v, \nabla g((1-t)v + tw)\rangle\, dt \\
&\le \int_0^1 \|\nabla g((1-t)v + tw)\|\, \|w - v\|\, dt \\
&\le \int_0^1 \big(c_1(1-t)\|v\| + c_1 t\|w\| + c_2\big)\|w - v\|\, dt \\
&= \left(\frac{c_1}{2}\|v\| + \frac{c_1}{2}\|w\| + c_2\right)\|w - v\|,
\end{aligned} \tag{C.15}
\]
where we have used Cauchy–Schwarz and the growth condition (3.13). Now let P be the coupling of µ and ν that achieves W₂(µ, ν). That is, P = L((W, V)) with µ = L(W), ν = L(V), and
\[
\mathcal{W}_2^2(\mu, \nu) = \mathbb{E}_P\|W - V\|^2.
\]
Taking expectations in (C.15), we have
\[
\begin{aligned}
\int_{\mathbb{R}^d} g\, d\mu - \int_{\mathbb{R}^d} g\, d\nu &= \mathbb{E}_P[g(W) - g(V)] \\
&\le \sqrt{\mathbb{E}_P\left(\frac{c_1}{2}\|W\| + \frac{c_1}{2}\|V\| + c_2\right)^2} \cdot \sqrt{\mathbb{E}_P[\|W - V\|^2]} \\
&\le \left(\frac{c_1}{2}\sqrt{\mathbb{E}\|W\|^2} + \frac{c_1}{2}\sqrt{\mathbb{E}\|V\|^2} + c_2\right) \mathcal{W}_2(\mu, \nu) \\
&= (c_1 v + c_2)\, \mathcal{W}_2(\mu, \nu).
\end{aligned}
\]
Interchanging the roles of µ and ν, we complete the proof.

D Proof of Lemma 3.6

Conditioned on Z = z, {W_k}_{k=0}^∞ is a time-homogeneous Markov process. Consider the following continuous-time interpolation of this process:
\[
W(t) = W_0 - \int_0^t g\big(W(\lfloor s/\eta\rfloor \eta), U_z(s)\big)\, ds + \sqrt{\frac{2}{\beta}}\int_0^t dB(s), \qquad t \ge 0 \tag{D.1}
\]
where U_z(t) ≡ U_{z,k} for t ∈ [kη, (k+1)η). Note that, for each k, W(kη) and W_k have the same probability law µ_{z,k}. Moreover, by a result of Gyöngy (1986), the process W(t) has the same one-time marginals as the Itô process
\[
V(t) = W_0 - \int_0^t g_{z,s}(V(s))\, ds + \sqrt{\frac{2}{\beta}}\int_0^t dB(s)
\]
with
\[
g_{z,t}(v) := \mathbb{E}_z\Big[g\big(W(\lfloor t/\eta\rfloor \eta), U_z(t)\big) \,\Big|\, W(t) = v\Big]. \tag{D.2}
\]
Crucially, V(t) is a Markov process, while W(t) is not. Let P_V^t := L(V(s): 0 ≤ s ≤ t | Z = z) and P_W^t := L(W(s): 0 ≤ s ≤ t | Z = z). The Radon–Nikodym derivative of P_W^t w.r.t. P_V^t is given by the Girsanov formula
\[
\frac{dP_W^t}{dP_V^t}(V) = \exp\left\{\sqrt{\frac{\beta}{2}}\int_0^t \big(\nabla F_z(V(s)) - g_{z,s}(V(s))\big)^* dB(s) - \frac{\beta}{4}\int_0^t \|\nabla F_z(V(s)) - g_{z,s}(V(s))\|^2\, ds\right\} \tag{D.3}
\]
(see, e.g., Sec. 7.6.4 in Liptser and Shiryaev (2001)). Using (D.3) and the martingale property of the Itô integral, we have
\[
\begin{aligned}
D(P_V^t \| P_W^t) &= -\int dP_V^t \log\frac{dP_W^t}{dP_V^t} \\
&= \frac{\beta}{4}\,\mathbb{E}_z\int_0^t \|\nabla F_z(V(s)) - g_{z,s}(V(s))\|^2\, ds \\
&= \frac{\beta}{4}\,\mathbb{E}_z\int_0^t \|\nabla F_z(W(s)) - g_{z,s}(W(s))\|^2\, ds,
\end{aligned}
\]

where the last line follows from the fact that L(W (s)) = L(V (s)) for each s.
Now let t = kη for some k ∈ N. Then, using the definition (D.2) of g_{z,s}, Jensen’s inequality, and the M-smoothness of F_z, we can write
\[
\begin{aligned}
D(P_V^{k\eta} \| P_W^{k\eta}) &= \frac{\beta}{4}\sum_{j=0}^{k-1}\int_{j\eta}^{(j+1)\eta} \mathbb{E}_z\|\nabla F_z(W(s)) - g_{z,s}(W(s))\|^2\, ds \\
&\le \frac{\beta}{2}\sum_{j=0}^{k-1}\int_{j\eta}^{(j+1)\eta} \mathbb{E}_z\|\nabla F_z(W(s)) - \nabla F_z(W(\lfloor s/\eta\rfloor \eta))\|^2\, ds \\
&\qquad + \frac{\beta}{2}\sum_{j=0}^{k-1}\int_{j\eta}^{(j+1)\eta} \mathbb{E}_z\|\nabla F_z(W(\lfloor s/\eta\rfloor \eta)) - g(W(\lfloor s/\eta\rfloor \eta), U_z(s))\|^2\, ds \\
&\le \frac{\beta M^2}{2}\sum_{j=0}^{k-1}\int_{j\eta}^{(j+1)\eta} \mathbb{E}_z\|W(s) - W(\lfloor s/\eta\rfloor \eta)\|^2\, ds \\
&\qquad + \frac{\beta}{2}\sum_{j=0}^{k-1}\int_{j\eta}^{(j+1)\eta} \mathbb{E}_z\|\nabla F_z(W(\lfloor s/\eta\rfloor \eta)) - g(W(\lfloor s/\eta\rfloor \eta), U_z(s))\|^2\, ds.
\end{aligned} \tag{D.4}
\]
We first estimate the first summation in (D.4). Consider some s ∈ [jη, (j+1)η). From (D.1), we have
\[
\begin{aligned}
W(s) - W(j\eta) &= -(s - j\eta)\, g(W_j, U_{z,j}) + \sqrt{\frac{2}{\beta}}\big(B(s) - B(j\eta)\big) \\
&= -(s - j\eta)\nabla F_z(W_j) + (s - j\eta)\big(\nabla F_z(W_j) - g(W_j, U_{z,j})\big) + \sqrt{\frac{2}{\beta}}\big(B(s) - B(j\eta)\big).
\end{aligned}
\]
Therefore, using Lemmas 3.1 and 3.2 and the gradient noise assumption (A.4), we arrive at
\[
\begin{aligned}
\mathbb{E}_z\|W(s) - W(j\eta)\|^2
&\le 3\eta^2\, \mathbb{E}_z\|\nabla F_z(W_j)\|^2 + 3\eta^2\, \mathbb{E}_z\|\nabla F_z(W_j) - g(W_j, U_{z,j})\|^2 + \frac{6\eta d}{\beta} \\
&\le 12\eta^2\big(M^2\, \mathbb{E}_z\|W_j\|^2 + B^2\big) + \frac{6\eta d}{\beta} \\
&\le 12\eta^2\left(M^2\left(\kappa_0 + 2\Big(1 \vee \frac{1}{m}\Big)\Big(b + 2B^2 + \frac{d}{\beta}\Big)\right) + B^2\right) + \frac{6\eta d}{\beta}.
\end{aligned}
\]

Consequently,
\[
\begin{aligned}
\sum_{j=0}^{k-1}\int_{j\eta}^{(j+1)\eta} \mathbb{E}_z\|W(s) - W(\lfloor s/\eta\rfloor \eta)\|^2\, ds
&\le 12\left(M^2\left(\kappa_0 + 2\Big(1 \vee \frac{1}{m}\Big)\Big(b + 2B^2 + \frac{d}{\beta}\Big)\right) + B^2\right) k\eta^3 + \frac{6d}{\beta}\, k\eta^2 \\
&\le \left[12\left(M^2\left(\kappa_0 + 2\Big(1 \vee \frac{1}{m}\Big)\Big(b + 2B^2 + \frac{d}{\beta}\Big)\right) + B^2\right) + \frac{6d}{\beta}\right] k\eta^2 \\
&=: 6\left(2C_0 + \frac{d}{\beta}\right) k\eta^2.
\end{aligned} \tag{D.5}
\]
Similarly, the second summation on the right-hand side of (D.4) can be estimated as follows:
\[
\begin{aligned}
\sum_{j=0}^{k-1}\int_{j\eta}^{(j+1)\eta} \mathbb{E}_z\|\nabla F_z(W(\lfloor s/\eta\rfloor \eta)) - g(W(\lfloor s/\eta\rfloor \eta), U_z(s))\|^2\, ds
&= \eta\sum_{j=0}^{k-1} \mathbb{E}_z\|\nabla F_z(W_j) - g(W_j, U_{z,j})\|^2 \\
&\le 2\eta\delta\sum_{j=0}^{k-1}\big(M^2\, \mathbb{E}_z\|W_j\|^2 + B^2\big) \\
&\le 2M^2\left(\kappa_0 + 2\Big(1 \vee \frac{1}{m}\Big)\Big(b + 2B^2 + \frac{d}{\beta}\Big)\right) k\eta\delta + 2\delta B^2 k\eta \\
&= 2\left(M^2\left(\kappa_0 + 2\Big(1 \vee \frac{1}{m}\Big)\Big(b + 2B^2 + \frac{d}{\beta}\Big)\right) + B^2\right) k\eta\delta \\
&= 2C_0 \cdot k\eta\delta.
\end{aligned} \tag{D.6}
\]
Substituting Eqs. (D.5) and (D.6) into (D.4), we obtain
\[
D(P_V^{k\eta} \| P_W^{k\eta}) \le 6\big(\beta M^2 C_0 + M^2 d\big)\, k\eta^2 + \beta C_0 \cdot k\eta\delta.
\]
Now, since µ_{z,k} = L(V(kη) | Z = z) and ν_{z,kη} = L(W(kη) | Z = z), the data-processing inequality for the KL divergence gives
\[
D(\mu_{z,k} \| \nu_{z,k\eta}) \le D(P_V^{k\eta} \| P_W^{k\eta})
\le 6\big(\beta M^2 C_0 + M^2 d\big)\, k\eta^2 + \beta C_0 \cdot k\eta\delta
=: C_1 k\eta^2 + \beta C_0 k\eta\delta.
\]

E Proof of Proposition 3.2

To establish the log-Sobolev inequality, we will use the Lyapunov function criterion of Cattiaux et al. (2010), reproduced as Proposition A.2 in Appendix A.
We will apply this proposition to the Gibbs distribution π_z for some z ∈ Z^n, so that H = βF_z and
\[
Lg = \Delta g - \beta\langle \nabla F_z, \nabla g\rangle.
\]
We consider the same Lyapunov function V(w) = e^{mβ‖w‖²/4} as in Appendix B. From Eq. (B.1), V evidently satisfies (A.13) with κ = mβ(d + bβ)/2 and γ = (mβ)²/4, i.e., the first condition of Proposition A.2 is satisfied. Moreover, π_z satisfies a Poincaré inequality with constant c_P ≤ 1/λ∗. Thus, the second condition is also satisfied. Finally, by the M-smoothness assumption (A.2), ∇²F_z ⪰ −MI_d, so the third condition of Proposition A.2 is satisfied with K = βM. Consequently, the constants C₁ and C₂ in (A.14) are given by
\[
C_1 = \frac{2m^2 + 8M^2}{m^2 M\beta} \quad \text{and} \quad C_2 \le \frac{6M(d+\beta)}{m}, \tag{E.1}
\]
where we have also used the estimate (3.19) to upper-bound C₂. Therefore, from Proposition A.2 and from (E.1) it follows that π_z satisfies a logarithmic Sobolev inequality with
\[
c_{\mathrm{LS}} \le \frac{2m^2 + 8M^2}{m^2 M\beta} + \frac{1}{\lambda_*}\left(\frac{6M(d+\beta)}{m} + 2\right).
\]
